Update README.md

README.md CHANGED
@@ -7,4 +7,65 @@ language:
pipeline_tag: sentence-similarity
---
Under Construction, please come back in a few days!

# Usage

## Installation

Using this model is slightly different from using typical dense embedding models. It relies on `faiss` for efficient indexing and on `torch` for neural-network operations. JaColBERT is built upon bert-base-japanese-v3, so you also need to install the dictionary and tokenizer packages it requires.

To use JaColBERT, install the main ColBERT library along with these dependencies:

```
pip install colbert-ir[faiss-gpu] faiss torch fugashi unidic-lite
```

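If you want to confirm that the Japanese tokenizer dependencies are working before indexing anything, here is a minimal sanity check. It assumes the upstream `cl-tohoku/bert-base-japanese-v3` checkpoint (which JaColBERT is trained from) and `transformers`, which is installed as a ColBERT dependency:

```python
# Sanity check: bert-base-japanese-v3's tokenizer needs fugashi and unidic-lite,
# so this raises an error if either of them is missing.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v3")
print(tokenizer.tokenize("マクドナルドのフライドポテト"))
```
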
ColBERT's API looks slightly less friendly than a usual `transformers` model's, but much of that is just the config being made explicit so you can modify it easily. Running with all the defaults works very well, so don't be anxious about trying it.

## Indexing

> ⚠️ ColBERT indexing requires a GPU! You can, however, very easily index thousands and thousands of documents using Google Colab's free GPUs.

For ColBERT's late-interaction retrieval approach to work, you must first build your index.
Think of it as using an embedding model, such as e5, to embed all your documents and store them in a vector database.
Indexing is the slowest step; retrieval is extremely quick. There are some tricks to speed it up, but the default settings work fairly well:

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig

n_gpu: int = 1  # Set your number of available GPUs
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # The name of your index, i.e. the name of your vector database

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="bclavie/JaColBERT")
    documents = [
        "マクドナルドのフライドポテトの少量のカロリーはいくつですか?マクドナルドの小さなフライドポテトのカロリーマクドナルドのウェブサイトには、次のように記載されています。フライドポテトの小さな注文で230カロリーケチャップで25カロリー、ケチャップパケットで15カロリー。",
        ...
    ]
    indexer.index(name=index_name, collection=documents)
```

And that's it! Let it run, and your index and all its representations (compressed to 2 bits by default) will be generated.

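If you want to trade index size against retrieval quality, the compression level and other indexing settings can be overridden. Here is a minimal sketch based on upstream ColBERT's `ColBERTConfig`; the exact option set may change between versions, and the values shown are illustrative assumptions, not tuned recommendations:

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

# Illustrative settings; adjust to your corpus and hardware.
config = ColBERTConfig(
    nbits=4,         # store compressed residuals with 4 bits instead of the 2-bit default
    doc_maxlen=300,  # maximum document length, in tokens
)

with Run().context(RunConfig(nranks=1, experiment="my_experiment")):
    indexer = Indexer(checkpoint="bclavie/JaColBERT", config=config)
    indexer.index(name="my_index_4bit", collection=documents)
```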

## Searching

Once you have created an index, searching through it is just as simple, again using the `Run()` syntactic sugar to manage GPUs and storage:

```python
from colbert import Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 0
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # Name of the previously created index where the documents you want to search are stored
k: int = 10  # How many results you want to retrieve

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name)  # You don't need to specify the checkpoint again; the model name is stored in the index
    query = "マクドナルドの小さなフライドポテトのカロリーはいくつですか"
    results = searcher.search(query, k=k)
    # search() returns three parallel lists: passage ids, ranks, and scores
    passage_ids, ranks, scores = results
```
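
The returned ids are positions in the collection you indexed, so you can map them back to text. A small usage sketch, assuming the `documents` list from the indexing step is still in scope:

```python
# Map each returned passage id back to the indexed document text.
# Assumes `documents` is the same list that was passed to indexer.index().
for pid, rank, score in zip(passage_ids, ranks, scores):
    print(f"#{rank} (score: {score:.2f}) {documents[pid]}")
```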