Sentence Similarity
Safetensors
Japanese
RAGatouille
bert
ColBERT
bclavie committed · Commit 7ee11db · 1 Parent(s): 4381874

Update README.md

Files changed (1): README.md +62 -1
README.md CHANGED
@@ -7,4 +7,65 @@ language:
pipeline_tag: sentence-similarity
---
Under Construction, please come back in a few days!
工事中です。数日後にまたお越しください。

# Usage

## Installation

Using this model is slightly different from using typical dense embedding models. The model relies on `faiss` for efficient indexing and `torch` for neural-network operations. JaColBERT is built upon bert-base-japanese-v3, so you also need to install the dictionary and tokenizer packages it requires.

To use JaColBERT, install the main ColBERT library along with these dependencies:

```
pip install colbert-ir[faiss-gpu] faiss torch fugashi unidic-lite
```
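
As a quick, optional sanity check (a minimal sketch, not part of the original instructions), you can confirm that the dependencies import and that a GPU is visible:

```python
# Optional sanity check (illustrative): confirm that the dependencies import
# and that a GPU is visible, since indexing requires one.
import faiss
import fugashi
import torch

print("faiss and fugashi imported successfully")
print("GPU available:", torch.cuda.is_available())
```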

ColBERT looks slightly less friendly than a usual `transformers` model, but a lot of that is just the config being made apparent so you can easily modify it! Running with all the defaults works very well, so don't be anxious about trying it.

## Indexing

> ⚠️ ColBERT indexing requires a GPU! You can, however, very easily index thousands and thousands of documents using Google Colab's free GPUs.

For the late-interaction retrieval approach used by ColBERT to work, you must first build your index.
Think of it as using an embedding model, such as e5, to embed all your documents and store them in a vector database.
Indexing is the slowest step; retrieval is extremely quick. There are some tricks to speed it up, but the default settings work fairly well:

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig

n_gpu: int = 1  # Set your number of available GPUs
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # The name of your index, i.e. the name of your vector database

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="bclavie/JaColBERT")
    documents = [
        "マクドナルドのフライドポテトの少量のカロリーはいくつですか?マクドナルドの小さなフライドポテトのカロリーマクドナルドのウェブサイトには、次のように記載されています。フライドポテトの小さな注文で230カロリーケチャップで25カロリー、ケチャップパケットで15カロリー。",
        ...
    ]
    indexer.index(name=index_name, collection=documents)
```

And that's it! Let it run, and your index and all its representations (compressed to 2 bits by default) will be generated.
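
If you want to change the defaults, ColBERT's settings are exposed through `ColBERTConfig`. Here is a minimal sketch (the experiment name, index name, documents, and parameter values are illustrative, not tuned recommendations) of passing a config to the `Indexer`, for example to use 4-bit instead of the default 2-bit compression:

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

documents = ["ドキュメントの例です。", "もう一つの例です。"]  # illustrative placeholder documents

# Illustrative values, not tuned recommendations:
config = ColBERTConfig(
    nbits=4,  # store 4 bits per dimension instead of the 2-bit default
)

with Run().context(RunConfig(nranks=1, experiment="my_experiment")):
    indexer = Indexer(checkpoint="bclavie/JaColBERT", config=config)
    indexer.index(name="my_index_4bit", collection=documents)
```

Higher `nbits` keeps more of each token vector, trading a larger index for slightly more faithful representations.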

## Searching

Once you have created an index, searching through it is just as simple, again with the `Run()` syntactic sugar to manage GPUs and storage:

```python
from colbert import Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 0
experiment: str = ""  # Name of the folder where the logs and created indices will be stored
index_name: str = ""  # Name of your previously created index where the documents you want to search are stored.
k: int = 10  # How many results you want to retrieve

with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name)  # You don't need to specify the checkpoint again: the model name is stored in the index.
    query = "マクドナルドの小さなフライドポテトのカロリーはいくつですか"
    results = searcher.search(query, k=k)
    # search() returns three parallel lists: passage ids, ranks, and scores.
    for passage_id, rank, score in zip(*results):
        print(f"[{rank}] ({score:.2f}) {searcher.collection[passage_id]}")
```
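
If you have many queries, `search_all` runs them as a batch and returns a ranking object that can be converted to a plain dictionary. A minimal sketch, assuming the same `experiment` and `index_name` values as above (the query IDs and texts are illustrative):

```python
from colbert import Searcher
from colbert.infra import Run, RunConfig

queries = {  # {query_id: query_text}; illustrative queries
    1: "マクドナルドの小さなフライドポテトのカロリーはいくつですか",
    2: "フライドポテトの小さな注文のカロリーはいくつですか",
}

with Run().context(RunConfig(nranks=0, experiment="my_experiment")):
    searcher = Searcher(index="my_index")
    rankings = searcher.search_all(queries, k=10)
    rankings_dict = rankings.todict()  # {query_id: [(passage_id, rank, score), ...]}
```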