dadashzadeh committed on
Commit
ed74736
1 Parent(s): 3075f05

Create README.md

Files changed (1)
  1. README.md +127 -0
README.md ADDED
@@ -0,0 +1,127 @@
+ ---
+ language: en
+ library_name: bm25s
+ tags:
+ - bm25
+ - bm25s
+ - retrieval
+ - search
+ - lexical
+ ---
+
+ # BM25S Index
+
+ This is a BM25S index created with the [`bm25s` library](https://github.com/xhluca/bm25s) (version `{version}`), an ultra-fast implementation of BM25. It can be used for lexical retrieval tasks.
+
+ BM25S Related Links:
+
+ * 🏠[Homepage](https://bm25s.github.io)
+ * 💻[GitHub Repository](https://github.com/xhluca/bm25s)
+ * 🤗[Blog Post](https://huggingface.co/blog/xhluca/bm25s)
+ * 📝[Technical Report](https://arxiv.org/abs/2407.03618)
+
+
+ ## Installation
+
+ You can install the `bm25s` library with `pip`:
+
+ ```bash
+ pip install "bm25s==0.2.0"
+
+ # Required for Hugging Face Hub usage
+ pip install huggingface_hub
+ ```
+
+ ## Loading a `bm25s` index
+
+ You can use this index for information retrieval tasks. Here is an example:
+
+ ```python
+ import bm25s
+ from bm25s.hf import BM25HF
+
+ # Load the index from the Hugging Face Hub
+ retriever = BM25HF.load_from_hub("{username}/{repo_name}")
+
+ # Retrieve the top 3 results for a query
+ query = "a cat is a feline"
+ results = retriever.retrieve(bm25s.tokenize(query), k=3)
+ ```
+
+ ## Saving a `bm25s` index
+
+ You can save a `bm25s` index to the Hugging Face Hub. Here is an example:
+
+ ```python
+ import bm25s
+ from bm25s.hf import BM25HF
+
+ corpus = [
+ "a cat is a feline and likes to purr",
+ "a dog is the human's best friend and loves to play",
+ "a bird is a beautiful animal that can fly",
+ "a fish is a creature that lives in water and swims",
+ ]
+
+ retriever = BM25HF(corpus=corpus)
+ retriever.index(bm25s.tokenize(corpus))
+
+ token = None  # Set to an access token from the Hugging Face website
+ retriever.save_to_hub("{username}/{repo_name}", token=token)
+ ```
+
+ ## Advanced usage
+
+ You can pass additional options to `load_from_hub`:
+
+ ```python
+ # Load the corpus too, and memory-map the index (mmap=True) to reduce memory usage
+ retriever = BM25HF.load_from_hub("{username}/{repo_name}", load_corpus=True, mmap=True)
+
+ # Load a different branch/revision
+ retriever = BM25HF.load_from_hub("{username}/{repo_name}", revision="main")
+
+ # Change the directory where the files are downloaded locally
+ retriever = BM25HF.load_from_hub("{username}/{repo_name}", local_dir="/path/to/dir")
+
+ # Load a private repository with an access token
+ retriever = BM25HF.load_from_hub("{username}/{repo_name}", token=token)
+ ```
+
+ ## Stats
+
+ This index was built from a corpus with the following statistics:
+
+ | Statistic | Value |
+ | --- | --- |
+ | Number of documents | {num_docs} |
+ | Number of tokens | {num_tokens} |
+ | Average tokens per document | {avg_tokens_per_doc} |
+
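As a rough illustration of how the three statistics relate, the sketch below computes them for the toy corpus used earlier. It rests on two assumptions that are not confirmed by this README: that "Number of tokens" means the total token count across all documents (the library may count differently, for example unique term-document pairs), and that lowercase whitespace splitting is an acceptable stand-in for `bm25s.tokenize`, which has its own options. The helper name `corpus_stats` is hypothetical.

```python
def corpus_stats(corpus):
    """Return (num_docs, num_tokens, avg_tokens_per_doc) for a list of strings."""
    # Assumption: lowercase whitespace splitting stands in for the real tokenizer.
    tokenized = [doc.lower().split() for doc in corpus]
    num_docs = len(tokenized)
    # Assumption: "number of tokens" is the total token count over all documents.
    num_tokens = sum(len(tokens) for tokens in tokenized)
    # Guard against an empty corpus to avoid dividing by zero.
    avg_tokens_per_doc = num_tokens / num_docs if num_docs else 0.0
    return num_docs, num_tokens, avg_tokens_per_doc


corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

num_docs, num_tokens, avg_tokens_per_doc = corpus_stats(corpus)
print(num_docs, num_tokens, avg_tokens_per_doc)
```

Numbers produced this way will generally differ from the values in the table, since the real tokenizer may drop stopwords or apply stemming.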
+ ## Parameters
+
+ The index was created with the following parameters:
+
+ | Parameter | Value |
+ | --- | --- |
+ | k1 | `{k1}` |
+ | b | `{b}` |
+ | delta | `{delta}` |
+ | method | `{method}` |
+ | idf method | `{idf_method}` |
+
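To give a feel for what `k1` and `b` control, here is a minimal pure-Python sketch of one common textbook form of BM25 with a smoothed IDF. It is not the `bm25s` implementation and does not necessarily match the `method` or `idf method` listed in the table. The function name `bm25_scores` and the default values `k1=1.5` and `b=0.75` are illustrative assumptions. `delta` is left out because it only appears in variants such as BM25L and BM25+.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula.

    k1 controls term-frequency saturation; b controls length normalization.
    """
    num_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / num_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        # Longer-than-average documents are penalized when b > 0.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        # A term repeated in the query contributes once per occurrence.
        for term in query_tokens:
            if tf[term] == 0:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    text.split()
    for text in [
        "a cat is a feline and likes to purr",
        "a dog is the human's best friend and loves to play",
        "a bird is a beautiful animal that can fly",
        "a fish is a creature that lives in water and swims",
    ]
]
scores = bm25_scores("a cat is a feline".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

In this form, raising `k1` lets repeated terms keep adding to the score for longer, and setting `b=0` disables length normalization entirely.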
+ ## Citation
+
+ To cite `bm25s`, please use the following BibTeX entry:
+
+ ```
+ @misc{lu_2024_bm25s,
+ title={BM25S: Orders of magnitude faster lexical search via eager sparse scoring},
+ author={Xing Han Lù},
+ year={2024},
+ eprint={2407.03618},
+ archivePrefix={arXiv},
+ primaryClass={cs.IR},
+ url={https://arxiv.org/abs/2407.03618},
+ }
+ ```