---
language: en
library_name: bm25s
tags:
- bm25
- bm25s
- retrieval
- search
- lexical
---
# BM25S Index
This is a BM25S index created with the [`bm25s` library](https://github.com/xhluca/bm25s) (version `0.2.0`), an ultra-fast implementation of BM25. It can be used for lexical retrieval tasks.
BM25S Related Links:
* 🏠[Homepage](https://bm25s.github.io)
* 💻[GitHub Repository](https://github.com/xhluca/bm25s)
* 🤗[Blog Post](https://huggingface.co/blog/xhluca/bm25s)
* 📝[Technical Report](https://arxiv.org/abs/2407.03618)
## Installation
You can install the `bm25s` library with `pip`:
```bash
pip install "bm25s==0.2.0"
# For huggingface hub usage
pip install huggingface_hub
```
## Loading a `bm25s` index
You can use this index for information retrieval tasks. Here is an example:
```python
import bm25s
from bm25s.hf import BM25HF
# Load the index
retriever = BM25HF.load_from_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency")
# Tokenize a query and retrieve the top 3 matches
query = "a cat is a feline"
results = retriever.retrieve(bm25s.tokenize(query), k=3)
```
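To illustrate how the retrieved results might be consumed, here is a minimal sketch. It assumes that `retrieve` returns a pair of array-like objects (documents and scores), each with one row per query and `k` columns, and that the document entries are either corpus items or integer indices depending on whether the corpus was loaded. The helper `pair_hits` is hypothetical and not part of `bm25s`; the values below are stand-ins, not real output from this index.

```python
def pair_hits(documents, scores):
    """Pair each retrieved document with its score, one list per query.

    Hypothetical helper: works on any row-iterable inputs (nested lists
    or 2D arrays) shaped (n_queries, k).
    """
    return [
        list(zip(doc_row, score_row))
        for doc_row, score_row in zip(documents, scores)
    ]


# Stand-in values shaped like a (n_queries=1, k=3) result; NOT real output.
documents = [[12, 7, 3]]
scores = [[2.5, 1.75, 0.5]]

# Print the hits of the first (and only) query in rank order.
for rank, (doc, score) in enumerate(pair_hits(documents, scores)[0], start=1):
    print(f"rank {rank}: doc={doc} score={score:.2f}")
```

With a real index, the two arguments would be the documents and scores unpacked from `retriever.retrieve(...)`.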
## Saving a `bm25s` index
You can save a `bm25s` index to the Hugging Face Hub. Here is an example:
```python
import bm25s
from bm25s.hf import BM25HF
corpus = [
    "northwest bank",
    "misfits market",
    "merrick bank login",
    "marketing",
    "market place",
    "jetblue customer service",
    "internal revenue service",
    "how to make money online",
    "gordon food service",
    "futures market",
    "frontier airlines customer service",
    "food banks near me",
    "first convenience bank",
    "eastern bank",
    "dollar bank",
]
retriever = BM25HF(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))
token = None  # Set this to a Hugging Face access token (created in your account settings)
retriever.save_to_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency", token=token)
```
## Advanced usage
You can leverage more advanced features of the BM25S library during `load_from_hub`:
```python
from bm25s.hf import BM25HF
token = None  # Hugging Face access token; only needed for private repositories
# Load the corpus and memory-map the index (mmap=True) to reduce memory usage
retriever = BM25HF.load_from_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency", load_corpus=True, mmap=True)
# Load a different branch/revision
retriever = BM25HF.load_from_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency", revision="main")
# Change the directory where the local files are downloaded
retriever = BM25HF.load_from_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency", local_dir="/path/to/dir")
# Load a private repository with a token
retriever = BM25HF.load_from_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency", token=token)
```
## Stats
This index was created from the following data: 497 Cryptocurrency keywords (Semrush).
| Statistic | Value |
| --- | --- |
| Number of documents | 602959 |
| Number of tokens | 2414020 |
| Average tokens per document | 4.0 |
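As a quick sanity check, the average in the table can be recomputed from the two counts. The variable names below are illustrative only.

```python
# Counts copied from the Stats table.
n_documents = 602959
n_tokens = 2414020

# Average tokens per document, rounded to one decimal as in the table.
avg_tokens_per_document = n_tokens / n_documents
print(round(avg_tokens_per_document, 1))  # 4.0
```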
## Parameters
The index was created with the following parameters:
| Parameter | Value |
| --- | --- |
| k1 | `1.5` |
| b | `0.75` |
| delta | `0.5` |
| method | `lucene` |
| idf method | `lucene` |
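To show how `k1` and `b` enter the score, here is a minimal sketch of the Lucene-style BM25 formula for a single query term, written from the commonly cited definition rather than taken from the `bm25s` source. The function names and the document frequency used in the demo are hypothetical. As far as I know, `delta` only affects the BM25L and BM25+ variants, so it is left out here.

```python
import math


def lucene_idf(n_docs, doc_freq):
    """Lucene-style IDF: ln(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def lucene_tf(tf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Saturating term-frequency component with length normalization."""
    return tf / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))


def term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.5, b=0.75):
    """Score contribution of one query term in one document."""
    return lucene_idf(n_docs, doc_freq) * lucene_tf(tf, doc_len, avg_doc_len, k1, b)


# Demo: N and the average length come from the Stats table;
# doc_freq=1000 is a made-up value for illustration.
score = term_score(tf=1, doc_len=4, avg_doc_len=4.0, n_docs=602959, doc_freq=1000)
print(f"{score:.3f}")
```

A document's score for a query is the sum of `term_score` over the query terms. Larger `k1` lets repeated terms keep adding to the score for longer, and `b` controls how strongly longer-than-average documents are penalized.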
## Citation
To cite `bm25s`, please use the following bibtex:
```bibtex
@misc{lu_2024_bm25s,
title={BM25S: Orders of magnitude faster lexical search via eager sparse scoring},
author={Xing Han Lù},
year={2024},
eprint={2407.03618},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2407.03618},
}
```