---
base_model: 99eren99/ModernBERT-base-Turkish-uncased-mlm
language:
- tr
library_name: PyLate
pipeline_tag: sentence-similarity
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- generated_from_trainer
- reranker
- bert
license: apache-2.0
---

# Turkish Long Context ColBERT Based Reranker

This is a [PyLate](https://github.com/lightonai/pylate) model finetuned from [99eren99/ModernBERT-base-Turkish-uncased-mlm](https://huggingface.co/99eren99/ModernBERT-base-Turkish-uncased-mlm). It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.

# Model Sources

- **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
- **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
- **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)

# Evaluation Results

nDCG and Recall scores for long-context late-interaction retrieval models. Test code and detailed metrics are in ["./assets"](https://huggingface.co/99eren99/ColBERT-ModernBERT-base-Turkish-uncased/tree/main/assets).

# Usage

First install the PyLate library:

```bash
pip install -U einops flash_attn
pip install -U pylate
```

Then normalize your text with `lambda x: x.replace("İ", "i").replace("I", "ı").lower()`.

# Retrieval

PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.
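Before encoding anything, the normalization step from the Usage section can be wrapped in a small helper. The sketch below is only an illustration: the function name `normalize_turkish` and the sample strings are made up here, and applying it to both documents and queries is an assumption based on the model being uncased.

```python
def normalize_turkish(text: str) -> str:
    """Lowercase text with explicit handling of dotted and dotless I."""
    # Map both capital I variants by hand before lowercasing. Python's
    # default str.lower() turns "I" into "i" and "İ" into "i" followed by
    # a combining dot (U+0307), neither of which matches Turkish casing.
    return text.replace("İ", "i").replace("I", "ı").lower()


# Hypothetical sample inputs, for illustration only.
samples = ["İstanbul", "ISPARTA", "Türkiye'nin başkenti Ankara"]
normalized = [normalize_turkish(s) for s in samples]
print(normalized)
```

The same helper would then be mapped over the document and query lists before they are passed to `model.encode`.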
# Indexing documents

First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:

```python
from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
# document_length is the truncation length for documents, an integer in [0, 8192].
# For longer inputs you can maybe try rope scaling.
document_length = 180

model = models.ColBERT(
    model_name_or_path="99eren99/ColBERT-ModernBERT-base-Turkish-uncased",
    document_length=document_length,
)

# Drop "token_type_ids" from the tokenizer inputs if it is listed
try:
    model.tokenizer.model_input_names.remove("token_type_ids")
except ValueError:
    pass  # Not present, nothing to remove

# model.to("cuda")  # Uncomment to run on a GPU

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:

```python
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)
```

# Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the top matches:

```python
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
```

# Reranking

If you only want to use the ColBERT model to rerank the output of your first-stage retrieval pipeline without building an index, you can use the `rank.rerank` function and pass it the queries and documents to rerank:

```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

# One list of candidate documents per query
documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

# Ids aligned with the candidate documents above
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="99eren99/ColBERT-ModernBERT-base-Turkish-uncased",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```
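
For intuition about how late-interaction relevance scores come about, the MaxSim operator named in the model description can be sketched in plain Python. This is a toy illustration, not PyLate's implementation: the helper names `dot` and `maxsim` are hypothetical, and the 2-dimensional token vectors are made up, whereas the model itself produces 128-dimensional vectors.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """For each query token take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (assumed values): one vector per token.
query: List[Vector] = [[1.0, 0.0], [0.0, 1.0]]
docs: Dict[str, List[Vector]] = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "doc_b": [[0.5, 0.5]],
}

score_a = maxsim(query, docs["doc_a"])
score_b = maxsim(query, docs["doc_b"])

# Rank documents by descending MaxSim score.
ranking: List[Tuple[str, float]] = sorted(
    ((doc_id, maxsim(query, tokens)) for doc_id, tokens in docs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

Here `doc_a` ranks first because each query token finds an exact match among its tokens, while `doc_b` offers only a partial match for both.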