---
pipeline_tag: sentence-similarity
tags:
- feature-extraction
- sentence-similarity
language:
- de
- en
- es
- fr
---

# Model Card for `vectorizer-v1-S-multilingual`

This model is a vectorizer developed by Sinequa. It produces an embedding vector given a passage or a query. The passage vectors are stored in our vector index and the query vector is used at query time to look up relevant passages in the index.

Model name: `vectorizer-v1-S-multilingual`

## Supported Languages

The model was trained and tested in the following languages:

- English
- French
- German
- Spanish

## Scores

| Metric                 | Value |
|:-----------------------|------:|
| Relevance (Recall@100) | 0.448 |

Note that the relevance score is computed as an average over 14 retrieval datasets (see [details below](#evaluation-metrics)).

## Inference Times

| GPU        | Batch size 1 (at query time) | Batch size 32 (at indexing) |
|:-----------|-----------------------------:|----------------------------:|
| NVIDIA A10 | 2 ms                         | 14 ms                       |
| NVIDIA T4  | 4 ms                         | 51 ms                       |

The inference times only measure the time the model takes to process a single batch; they do not include pre- or post-processing steps such as tokenization.

## Requirements

- Minimal Sinequa version: 11.10.0
- GPU memory usage: 580 MiB

Note that GPU memory usage only includes how much GPU memory the actual model consumes on an NVIDIA T4 GPU with a batch size of 32. It does not include the fixed amount of memory that is consumed by the ONNX Runtime upon initialization, which can be around 0.5 to 1 GiB depending on the GPU used.

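As a rough sizing aid, the two memory figures quoted above can be added to estimate the total GPU memory to reserve. The sketch below is illustrative arithmetic only: the function name and constants are hypothetical, and actual usage may differ with the GPU, driver, and ONNX Runtime version.

```python
# Rough GPU memory budget for serving the model (illustrative arithmetic only).
MIB_PER_GIB = 1024

# Figure from this card: model memory on an NVIDIA T4 with batch size 32.
MODEL_MIB = 580
# ONNX Runtime initialization overhead quoted in this card: about 0.5 to 1 GiB.
ORT_OVERHEAD_MIB = (0.5 * MIB_PER_GIB, 1.0 * MIB_PER_GIB)


def total_memory_range_mib(model_mib, overhead_range_mib):
    """Return (low, high) total GPU memory estimates in MiB."""
    low, high = overhead_range_mib
    return model_mib + low, model_mib + high


low, high = total_memory_range_mib(MODEL_MIB, ORT_OVERHEAD_MIB)
print(f"Estimated total: {low:.0f} to {high:.0f} MiB")
```

Under these assumptions the estimate lands between roughly 1.1 and 1.6 GiB, so the runtime overhead, not the model itself, may account for most of the footprint.
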
## Model Details

### Overview

- Number of parameters: 39 million
- Base language model: Homegrown Sinequa BERT-Small ([Paper](https://arxiv.org/abs/1908.08962)) pretrained in the four supported languages
- Insensitive to casing and accents
- Training procedure: Query-passage pairs using in-batch negatives

### Training Data

- Natural Questions ([Paper](https://research.google/pubs/pub47761/), [Official Page](https://github.com/google-research-datasets/natural-questions))
  - Original English dataset
  - Translated datasets for the other three supported languages

### Evaluation Metrics

To determine the relevance score, we averaged the results that we obtained when evaluating on the datasets of the [BEIR benchmark](https://github.com/beir-cellar/beir). Note that all these datasets are in English.

| Dataset           | Recall@100 |
|:------------------|-----------:|
| Average           |      0.448 |
|                   |            |
| Arguana           |      0.835 |
| CLIMATE-FEVER     |      0.350 |
| DBPedia Entity    |      0.287 |
| FEVER             |      0.645 |
| FiQA-2018         |      0.305 |
| HotpotQA          |      0.396 |
| MS MARCO          |      0.533 |
| NFCorpus          |      0.162 |
| NQ                |      0.701 |
| Quora             |      0.947 |
| SCIDOCS           |      0.194 |
| SciFact           |      0.580 |
| TREC-COVID        |      0.051 |
| Webis-Touche-2020 |      0.289 |

We evaluated the model on the datasets of the [MIRACL benchmark](https://github.com/project-miracl/miracl) to test its multilingual capabilities. Note that not all training languages are part of the benchmark, so we only report the metrics for the existing languages.

| Language | Recall@100 |
|:---------|-----------:|
| French   |      0.583 |
| German   |      0.524 |
| Spanish  |      0.483 |
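
For readers who want to check how the headline relevance score relates to the per-dataset BEIR figures, here is a minimal sketch. The `recall_at_k` helper is a hypothetical, textbook-style definition (share of a query's relevant passages found in the top k results); the exact evaluation code used for this card is not stated and may differ in detail. The per-dataset values are copied from the BEIR table in this card.

```python
def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of the relevant passages that appear in the top-k results.

    Assumed definition for illustration; not necessarily the exact
    variant used to produce the numbers in this card.
    """
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


# Per-dataset Recall@100 values as reported in the BEIR table of this card.
BEIR_RECALL_AT_100 = {
    "Arguana": 0.835,
    "CLIMATE-FEVER": 0.350,
    "DBPedia Entity": 0.287,
    "FEVER": 0.645,
    "FiQA-2018": 0.305,
    "HotpotQA": 0.396,
    "MS MARCO": 0.533,
    "NFCorpus": 0.162,
    "NQ": 0.701,
    "Quora": 0.947,
    "SCIDOCS": 0.194,
    "SciFact": 0.580,
    "TREC-COVID": 0.051,
    "Webis-Touche-2020": 0.289,
}

# Unweighted mean over datasets, matching the "Average" row.
average = sum(BEIR_RECALL_AT_100.values()) / len(BEIR_RECALL_AT_100)
print(f"{len(BEIR_RECALL_AT_100)} datasets, average Recall@100 = {average:.3f}")
```

The unweighted mean treats every dataset equally regardless of its number of queries, which is why a low score on a small dataset such as TREC-COVID pulls the average down as much as a low score on a large one would.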