---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- information retrieval
- llama2
- document expansion
- LoRA
---

This repository contains the LoRA weights obtained by fine-tuning the pre-trained Llama 2 7B model for document expansion, for use with [DeeperImpact](https://arxiv.org/abs/2405.17093).

We fine-tune the pre-trained Llama 2 model on the same dataset as DocT5Query: 532k document-query pairs from the MSMARCO Passage Qrels Train Dataset.

To learn how to use the model for document expansion, see the following notebook: [inference_deeper_impact.ipynb](https://github.com/basnetsoyuj/improving-learned-index/blob/master/inference_deeper_impact.ipynb)

You can also clone the [DeeperImpact repo](https://github.com/basnetsoyuj/improving-learned-index/blob/master) and run expansions on a collection of documents using the following command:

```
python -m src.llama2.generate \
--llama_path <path | HuggingFaceHub link> \
--collection_path <path> \
--collection_type [msmarco | beir] \
--output_path <path> \
--batch_size <batch_size> \
--max_tokens 512 \
--num_return_sequences 80 \
--max_new_tokens 50 \
--top_k 50 \
--top_p 0.95 \
--peft_path soyuj/llama2-doc2query
```
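
The `--top_k 50` and `--top_p 0.95` flags above presumably carry their usual sampling meaning: at each step, only the 50 most likely tokens are considered, and of those only the smallest set whose cumulative probability reaches 0.95. The sketch below illustrates that filtering in plain Python. It is not the repository's code; the function name `top_k_top_p_filter` and the toy logits are hypothetical, and the real script most likely delegates this step to the generation library.

```python
import math


def top_k_top_p_filter(logits, top_k=50, top_p=0.95):
    """Return {token_index: probability} for tokens kept by top-k, then top-p.

    Illustrative only: assumes a non-empty list of logits.
    """
    # Rank token indices by logit, highest first, and keep at most top_k.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    ranked = ranked[:top_k]

    # Softmax over the surviving logits (shifted by the max for stability).
    max_logit = logits[ranked[0]]
    weights = [math.exp(logits[i] - max_logit) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept = []
    cumulative = 0.0
    for idx, p in zip(ranked, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    kept_total = sum(p for _, p in kept)
    return {idx: p / kept_total for idx, p in kept}


# Toy example with a four-token vocabulary (hypothetical values).
candidates = top_k_top_p_filter([2.0, 1.0, 0.0, -1.0], top_k=3, top_p=0.8)
```

Sampling from the filtered distribution rather than decoding greedily is what lets `--num_return_sequences 80` yield varied queries for the same document.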

This generates a JSONL file containing the expansions for each document in the collection. To append the unique expansion terms to the original collection, use the following command:

```
python -m src.llama2.merge \
--collection_path <path> \
--collection_type [msmarco | beir] \
--queries_path <jsonl file generated above> \
--output_path <path>
```
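
As a rough illustration of what appending unique expansion terms can look like, the sketch below adds to a document every term from its generated queries that the document does not already contain, keeping each new term once. This is a minimal sketch under stated assumptions, not the `src.llama2.merge` implementation: the function name `append_unique_terms`, the lowercase word-level tokenization, and the interpretation of "unique" as "not already in the document" are all assumptions.

```python
import re


def tokenize(text):
    """Lowercase word tokenizer (an assumption; the real script may differ)."""
    return re.findall(r"\w+", text.lower())


def append_unique_terms(document, queries):
    """Append query terms that do not already occur in the document."""
    seen = set(tokenize(document))
    new_terms = []
    for query in queries:
        for term in tokenize(query):
            if term not in seen:
                # Record the term so it is appended only once.
                seen.add(term)
                new_terms.append(term)
    if not new_terms:
        return document
    return document + " " + " ".join(new_terms)


# Hypothetical document and generated queries.
expanded = append_unique_terms(
    "The cat sat.",
    ["where did the cat sit", "cat sitting"],
)
print(expanded)  # The cat sat. where did sit sitting
```

Appending only unseen terms keeps the expanded document short while still adding vocabulary that the original passage lacks.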