metadata
base_model: meta-llama/Llama-2-7b-hf
library_name: peft
license: apache-2.0
language:
- en
tags:
- retrieval
- instructions
datasets:
- samaya-ai/msmarco-w-instructions
Model Summary
Promptriever is a dense retriever that can be prompted with natural-language instructions, much like a language model. This version, promptriever-llama2-7b-v1,
was instruction-trained on a corpus of 490k MS MARCO samples with instructions and 490k without instructions. See the paper for more details.
- Repository: orionw/Promptriever
- Paper: Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models
- Instruction-Training Dataset: samaya-ai/msmarco-w-instructions
Use
Below is an example that computes the similarity scores between a query (with an instruction) and a pair of documents.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel, PeftConfig

class Promptriever:
    def __init__(self, model_name_or_path):
        self.model, self.tokenizer = self.get_model(model_name_or_path)
        self.model.eval()

    def get_model(self, peft_model_name):
        # Load the PEFT configuration to get the base model name
        peft_config = PeftConfig.from_pretrained(peft_model_name)
        base_model_name = peft_config.base_model_name_or_path

        # Load the base model and tokenizer
        base_model = AutoModel.from_pretrained(base_model_name)
        tokenizer = AutoTokenizer.from_pretrained(base_model_name)
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.padding_side = "right"  # pad on the right so the last real token is easy to locate

        # Load the PEFT adapter and merge it into the base model
        model = PeftModel.from_pretrained(base_model, peft_model_name)
        model = model.merge_and_unload()
        return model, tokenizer

    def encode(self, texts):
        # The model was trained with EOS pooling (--pooling eos, --append_eos_token),
        # so append the EOS token and pool its hidden state.
        texts = [f"{text}{self.tokenizer.eos_token}" for text in texts]
        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Index of the last non-padding token (the appended EOS) for each sequence
        last_token = inputs["attention_mask"].sum(dim=1) - 1
        embeddings = outputs.last_hidden_state[torch.arange(last_token.size(0)), last_token]
        return F.normalize(embeddings, p=2, dim=1)
# Initialize the model
model = Promptriever("samaya-ai/promptriever-llama2-7b-v1")
# Example query and instruction
query = "What universities are in Baltimore, Maryland?"
instruction = "A relevant document would describe any university in Baltimore. I am only interested in the United States, so ignore any document with a campus in Italy."
# Combine query and instruction with two spaces after "query: "
input_text = f"query: {query} {instruction}"
# Example documents
doc1 = "Johns Hopkins University (often abbreviated as Johns Hopkins, Hopkins, or JHU) is a private research university in Baltimore, Maryland. Founded in 1876, Johns Hopkins was the first American university based on the European research institution model. The university also has graduate campuses in Italy, China, and Washington, D.C."
doc2 = "Johns Hopkins University (often abbreviated as Johns Hopkins, Hopkins, or JHU) is a private research university in Baltimore, Maryland. Founded in 1876, Johns Hopkins was the first American university based on the European research institution model. The university also has graduate campuses in China, and Washington, D.C."
# Encode query and documents
query_embedding = model.encode([input_text])
doc_embeddings = model.encode([doc1, doc2])
# Calculate cosine similarities (the embeddings are already L2-normalized)
similarities = (query_embedding @ doc_embeddings.T).squeeze(0).tolist()
# Print results
print("Similarities:")
print(f"Document 1: {similarities[0]:.4f}")
print(f"Document 2: {similarities[1]:.4f}")
Training
We used a fork of Tevatron to fine-tune Promptriever on the samaya-ai/msmarco-w-instructions dataset.
You can reproduce the training run with the following script (reproduced here for convenience).
#!/bin/bash
deepspeed --include localhost:0,1,2,3 --master_port "60002" --module tevatron.retriever.driver.train \
--deepspeed deepspeed/ds_zero3_config.json \
--output_dir retriever-instructions-llama2 \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--lora \
--lora_r 32 \
--lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
--save_steps 500 \
--dataset_name samaya-ai/msmarco-w-instructions \
--query_prefix "query: " \
--passage_prefix "passage: " \
--bf16 \
--pooling eos \
--append_eos_token \
--normalize \
--temperature 0.01 \
--per_device_train_batch_size 8 \
--gradient_checkpointing \
--train_group_size 16 \
--learning_rate 1e-4 \
--query_max_len 304 \
--passage_max_len 196 \
--num_train_epochs 1 \
--logging_steps 10 \
--overwrite_output_dir \
--warmup_steps 100 \
--gradient_accumulation_steps 4 \
--negatives_first_n 3
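For a rough sense of scale, the flags above imply the following effective batch size. This is a back-of-the-envelope sketch; it assumes the usual Tevatron convention that train_group_size counts one positive plus the sampled negatives per query.
# Effective batch size implied by the training flags above.
# Assumes the standard Tevatron convention: train_group_size = 1 positive + 15 negatives.
num_gpus = 4            # --include localhost:0,1,2,3
per_device_batch = 8    # --per_device_train_batch_size 8
grad_accum = 4          # --gradient_accumulation_steps 4
train_group_size = 16   # --train_group_size 16

queries_per_step = num_gpus * per_device_batch * grad_accum   # 128 queries per optimizer step
passages_per_step = queries_per_step * train_group_size       # 2048 passages per optimizer step
print(queries_per_step, passages_per_step)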
License
This model is released under the Apache 2.0 license, in accordance with the terms of the Llama 2 license. It was built for research purposes and is not used in any production systems at Samaya AI.
Citation
@article{weller2024promptriever,
  title={Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models},
  author={Weller, Orion and Van Durme, Benjamin and Lawrie, Dawn and Paranjape, Ashwin and Zhang, Yuhao and Hessel, Jack},
  journal={arXiv preprint TODO},
  year={2024}
}