JeremyHibiki committed: Update README.md

pipeline_tag: feature-extraction
tags:
- bge-m3
- onnx
---

Based on `aapot/bge-m3-onnx` and `philipchung/bge-m3-onnx`

## Deploy with tritonserver

- Folder structure

```
.
└── model_repository
    └── bge-m3
        ├── 1
        │   ├── model.onnx
        │   └── model.onnx.data
        └── config.pbtxt
```
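
The layout above can be scaffolded with a short script; this is just a convenience sketch, and the actual `model.onnx` / `model.onnx.data` files still have to be downloaded from this repo into the version directory:

```python
from pathlib import Path

# Mirror the Triton model repository layout shown above:
# model_repository/bge-m3/1/ holds the model files,
# config.pbtxt sits next to the version directory.
version_dir = Path("model_repository") / "bge-m3" / "1"
version_dir.mkdir(parents=True, exist_ok=True)

# Placeholder config file; fill it with the config.pbtxt content below.
(version_dir.parent / "config.pbtxt").touch()

print(sorted(p.as_posix() for p in Path("model_repository").rglob("*")))
```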

- `config.pbtxt` file

```
name: "bge-m3"
backend: "onnxruntime"
max_batch_size: 4

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]

output [
  {
    name: "dense_vecs"
    data_type: TYPE_FP32
    dims: [ 1024 ]
  },
  {
    name: "sparse_vecs"
    data_type: TYPE_FP32
    dims: [ -1, 1 ]
  },
  {
    name: "colbert_vecs"
    data_type: TYPE_FP32
    dims: [ -1, 1024 ]
  }
]
```
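
If throughput matters, Triton's dynamic batcher can additionally be enabled by appending a block like the following to `config.pbtxt`; the queue delay value here is illustrative, not something this repo prescribes:

```
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```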

- Run with the tritonserver Docker image

```bash
docker run --gpus all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v ./model_repository:/models \
  nvcr.io/nvidia/tritonserver:24.12-py3 \
  tritonserver --model-repository=/models
```

- Infer with `tritonclient`

```python
from typing import List

from datasets import load_dataset
from transformers import AutoTokenizer
from tritonclient.http import InferenceServerClient, InferInput

BS = 4
TOKENIZER_NAME = "BAAI/bge-m3"
TRITON_MODEL_NAME = "bge-m3"

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
data: List[str] = [x["text"] for x in load_dataset("BeIR/scidocs", "corpus")["corpus"]]
batch = data[:BS]

client = InferenceServerClient("localhost:8000")

# Tokenize to numpy arrays so they can be handed to tritonclient directly.
tokenized = tokenizer(batch, padding=True, truncation=True, return_tensors="np")
input_ids, attention_mask = tokenized.input_ids, tokenized.attention_mask

inputs = [
    InferInput("input_ids", list(input_ids.shape), "INT64"),
    InferInput("attention_mask", list(attention_mask.shape), "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

results = client.infer(TRITON_MODEL_NAME, inputs)

dense_vecs = results.as_numpy("dense_vecs")                # (BS, 1024)
sparse_vecs = results.as_numpy("sparse_vecs").squeeze(-1)  # (BS, seq_len)
colbert_vecs = results.as_numpy("colbert_vecs")            # (BS, seq_len, 1024)

output = {
    "dense_vecs": dense_vecs.tolist(),
    "sparse_vecs": sparse_vecs.tolist(),
    "colbert_vecs": colbert_vecs.tolist(),
}
print(output)
```
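
The raw `sparse_vecs` output is one weight per input position. To turn it into the token-id → weight mapping that FlagEmbedding calls lexical weights, the weights can be aggregated per token id; the sketch below assumes max-aggregation for repeated tokens (as FlagEmbedding does) and uses synthetic arrays — with real outputs, `input_ids` and `sparse_vecs` come from the script above, and special tokens (CLS/SEP/PAD) would normally be filtered out via the tokenizer:

```python
import numpy as np


def lexical_weights(input_ids: np.ndarray, sparse_vecs: np.ndarray) -> list:
    """Collapse per-position sparse weights into one weight per token id,
    keeping the max when a token occurs more than once."""
    out = []
    for ids, weights in zip(input_ids, sparse_vecs):
        d = {}
        for tok, w in zip(ids.tolist(), weights.tolist()):
            if w > 0:
                d[tok] = max(d.get(tok, 0.0), w)
        out.append(d)
    return out


# Synthetic example: one sequence where token id 7 appears twice.
ids = np.array([[1, 7, 7, 2]])
ws = np.array([[0.0, 0.5, 0.9, 0.0]])
print(lexical_weights(ids, ws))  # [{7: 0.9}]
```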