Sentence Similarity
Safetensors
Japanese
RAGatouille
bert
ColBERT
bclavie committed · Commit 7ee11db · 1 Parent(s): 4381874

Update README.md

Files changed (1): README.md +62 -1
README.md CHANGED
@@ -7,4 +7,65 @@ language:
pipeline_tag: sentence-similarity
---
Under Construction, please come back in a few days!
工事中です。数日後にまたお越しください。

# Usage

## Installation

Using this model is slightly different from using typical dense embedding models. The model relies on `faiss` for efficient indexing and `torch` for neural-network operations. JaColBERT is built upon bert-base-japanese-v3, so you also need to install the dictionary and tokenizer packages it requires.

To use JaColBERT, install the main ColBERT library along with these dependencies:

```
pip install colbert-ir[faiss-gpu] faiss torch fugashi unidic-lite
```
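
As a quick, optional sanity check (a minimal sketch, not part of the original instructions), you can confirm that the dependencies import and that a GPU is visible:

```python
# Optional sanity check (illustrative): confirm that the dependencies import
# and that a GPU is visible, since indexing requires one.
import faiss
import fugashi
import torch

print("faiss and fugashi imported successfully")
print("GPU available:", torch.cuda.is_available())
```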

ColBERT looks slightly less friendly than a usual `transformers` model, but a lot of that is just the config being made apparent so you can easily modify it! Running with all the defaults works very well, so don't be anxious about trying it.

## Indexing

> ⚠️ ColBERT indexing requires a GPU! You can, however, very easily index thousands and thousands of documents using Google Colab's free GPUs.

For the late-interaction retrieval approach used by ColBERT to work, you must first build your index.
Think of it as using an embedding model, such as e5, to embed all your documents and store them in a vector database.
Indexing is the slowest step; retrieval is extremely quick. There are some tricks to speed it up, but the default settings work fairly well:

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig

n_gpu: int = 1  # Set your number of available GPUs
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # The name of your index, i.e. the name of your vector database

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="bclavie/JaColBERT")
    documents = [
        "マクドナルドのフライドポテトの少量のカロリーはいくつですか?マクドナルドの小さなフライドポテトのカロリーマクドナルドのウェブサイトには、次のように記載されています。フライドポテトの小さな注文で230カロリーケチャップで25カロリー、ケチャップパケットで15カロリー。",
        ...
    ]
    indexer.index(name=index_name, collection=documents)
```

And that's it! Let it run, and your index and all its representations (compressed to 2 bits by default) will be generated.
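
If you want to change the defaults, ColBERT's settings are exposed through `ColBERTConfig`. Here is a minimal sketch (the experiment name, index name, documents, and parameter values are illustrative, not tuned recommendations) of passing a config to the `Indexer`, for example to use 4-bit instead of the default 2-bit compression:

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

documents = ["ドキュメントの例です。", "もう一つの例です。"]  # illustrative placeholder documents

# Illustrative values, not tuned recommendations:
config = ColBERTConfig(
    nbits=4,  # store 4 bits per dimension instead of the 2-bit default
)

with Run().context(RunConfig(nranks=1, experiment="my_experiment")):
    indexer = Indexer(checkpoint="bclavie/JaColBERT", config=config)
    indexer.index(name="my_index_4bit", collection=documents)
```

Higher `nbits` keeps more of each token vector, trading a larger index for slightly more faithful representations.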

## Searching

Once you have created an index, searching through it is just as simple, again with the `Run()` syntactic sugar to manage GPUs and storage:

```python
from colbert import Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 0
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # Name of your previously created index where the documents you want to search are stored.
k: int = 10  # How many results you want to retrieve

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name)  # You don't need to specify the checkpoint again: the model name is stored in the index.
    query = "マクドナルドの小さなフライドポテトのカロリーはいくつですか"
    results = searcher.search(query, k=k)
    # search() returns three parallel lists: passage ids, ranks, and scores.
    for passage_id, rank, score in zip(*results):
        print(f"[{rank}] ({score:.2f}) {searcher.collection[passage_id]}")
```
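
If you have many queries, `search_all` runs them as a batch and returns a ranking object that can be converted to a plain dictionary. A minimal sketch, assuming the same `experiment` and `index_name` values as above (the query IDs and texts are illustrative):

```python
from colbert import Searcher
from colbert.infra import Run, RunConfig

queries = {  # {query_id: query_text}; illustrative queries
    1: "マクドナルドの小さなフライドポテトのカロリーはいくつですか",
    2: "フライドポテトの小さな注文のカロリーはいくつですか",
}

with Run().context(RunConfig(nranks=0, experiment="my_experiment")):
    searcher = Searcher(index="my_index")
    rankings = searcher.search_all(queries, k=10)
    rankings_dict = rankings.todict()  # {query_id: [(passage_id, rank, score), ...]}
```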