This repo compares the performance of dense quantized MPT-7B against 70% sparse-quantized MPT-7B on OpenVINO. Both models are quantized to 8-bit weights and activations (W8A8). The benchmark metric is decoding (next-token) latency at a context length of 512.
Target HW: Intel 4th gen Xeon (Sapphire Rapids)
SW:
- git clone https://huggingface.co/vuiseng9/ov-mpt-7b-gsm8k-sparse70
- pip install openvino==2024.2.0
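As a quick sanity check that the downloaded model is usable, the IR can be read directly in Python. A minimal sketch, assuming the IR inside the cloned repo is named openvino_model.xml (the exact file name is an assumption):

```python
# Sanity check: load the sparse MPT-7B IR with the OpenVINO runtime.
# The model path below is an assumption about the repo layout.
import openvino as ov

core = ov.Core()
model = core.read_model("ov-mpt-7b-gsm8k-sparse70/openvino_model.xml")
print([inp.get_any_name() for inp in model.inputs])  # inspect input names
```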
Benchmarking with OpenVINO
- ./benchmarkapp_w8a8.bash
- ./benchmarkapp_w8a8_sparse70.bash
Note: remove the numactl prefix in the scripts if your node does not support it.
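The sparse run relies on the CPU plugin's sparse weight decompression feature, controlled by the CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE property (ov::intel_cpu::sparse_weights_decompression_rate); the scripts above presumably pass an equivalent config to benchmark_app. A minimal sketch of setting it when compiling the model in Python, assuming a 0.7 threshold to match the 70% sparse checkpoint:

```python
# Compile with sparse weight decompression enabled on CPU. Layers whose
# weight sparsity exceeds the given rate take the sparse decompression path.
# The model path is an assumption about the repo layout.
import openvino as ov

core = ov.Core()
model = core.read_model("ov-mpt-7b-gsm8k-sparse70/openvino_model.xml")
compiled = core.compile_model(
    model,
    device_name="CPU",
    config={"CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": 0.7},
)
```

Note that this feature is only effective on CPUs with Intel AMX, such as the Sapphire Rapids target above.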
Implementation of Sparse Weight Decompression in OpenVINO
Sparse weight decompression was first introduced to OpenVINO's fork of oneDNN in this PR: https://github.com/openvinotoolkit/oneDNN/pull/158/files. You can browse the changed files via the left pane; two entry points are listed below.
- initialization: src/cpu/reorder/simple_sparse_reorder.hpp (line 113)
- decompression: src/cpu/x64/jit_brgemm_decompress_kernel.cpp (line 41) — see the sketch after this list
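To make the shape of the scheme concrete, here is a conceptual NumPy sketch of bitmask-based sparse weight packing and decompression. This is illustrative only, not the actual oneDNN format: the real code packs weights into hardware-friendly blocks and decompresses them with a JIT kernel right before the BRGEMM executes.

```python
# Conceptual sketch: store only the nonzero int8 weights plus a
# 1-bit-per-element mask, then expand back to a dense buffer for the matmul.
import numpy as np

def compress(weights):
    flat = weights.ravel()
    mask = flat != 0                          # bitmask of nonzero positions
    return flat[mask], np.packbits(mask), weights.shape

def decompress(values, packed_mask, shape):
    n = int(np.prod(shape))
    mask = np.unpackbits(packed_mask)[:n].astype(bool)
    dense = np.zeros(n, dtype=values.dtype)
    dense[mask] = values                      # scatter nonzeros back
    return dense.reshape(shape)

w = np.random.randint(-128, 128, size=(4, 8), dtype=np.int8)
w[np.random.rand(*w.shape) < 0.7] = 0         # ~70% sparsity, as in this repo
vals, pmask, shape = compress(w)
assert np.array_equal(decompress(vals, pmask, shape), w)
```

At 70% sparsity this scheme stores roughly 0.3 bytes of values plus 1/8 byte of mask per weight, versus 1 byte per weight dense, which is where the memory-bandwidth saving during decoding comes from.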
If you'd like to build the OpenVINO runtime from source for debugging, see the wiki page; benchmark_app is compiled as part of the build.
Related materials:
- OpenVINO blog on Sparse-Quantized BERT (corresponding notebook)