---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- information retrieval
- llama2
- document expansion
- LoRA
---

This repository contains the LoRA adapter weights for Llama 2 7B fine-tuned for document expansion, for use with [DeeperImpact](https://arxiv.org/abs/2405.17093).

We fine-tune the pre-trained Llama 2 model on the same dataset as docT5query: 532k document-query pairs from the MS MARCO Passage Qrels train dataset.

Please refer to the following notebook to learn how to use these weights for document expansion: [inference_deeper_impact.ipynb](https://github.com/basnetsoyuj/improving-learned-index/blob/master/inference_deeper_impact.ipynb)
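As a rough sketch, the adapter can also be loaded directly with `transformers` and `peft`. The base-model id and loading options below are assumptions for illustration, not necessarily the authors' exact setup; see the notebook above for their usage:

```python
# Hedged sketch: attach the soyuj/llama2-doc2query LoRA adapter to a
# Llama 2 7B base model. Assumes `transformers` and `peft` are installed
# and that you have access to the (gated) base checkpoint.

def load_expander(base_model="meta-llama/Llama-2-7b-hf",
                  adapter="soyuj/llama2-doc2query"):
    """Return (tokenizer, model) with the doc2query LoRA weights applied."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")
    model = PeftModel.from_pretrained(model, adapter)  # merge in LoRA weights
    model.eval()
    return tokenizer, model
```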

You can also clone the [DeeperImpact repo](https://github.com/basnetsoyuj/improving-learned-index) and run expansions on a collection of documents with the following command:

```shell
python -m src.llama2.generate \
    --llama_path <path | HuggingFaceHub link> \
    --collection_path <path> \
    --collection_type [msmarco | beir] \
    --output_path <path> \
    --batch_size <batch_size> \
    --max_tokens 512 \
    --num_return_sequences 80 \
    --max_new_tokens 50 \
    --top_k 50 \
    --top_p 0.95 \
    --peft_path soyuj/llama2-doc2query
```

This will generate a jsonl file with expansions for each document in the collection. To append the unique expansion terms to the original collection, use the following command:

```shell
python -m src.llama2.merge \
    --collection_path <path> \
    --collection_type [msmarco | beir] \
    --queries_path <jsonl file generated above> \
    --output_path <path>
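Conceptually, the merge step appends to each document only those expansion terms that do not already appear in it. A minimal sketch of that idea (the data structures and whitespace tokenization here are illustrative assumptions, not the exact logic of `src.llama2.merge`):

```python
def merge_expansions(documents, expansions):
    """Append each document's unique expansion terms to its text.

    `documents` maps doc_id -> text; `expansions` maps doc_id -> list of
    generated queries for that document. Terms already present in the
    document (or already appended) are skipped.
    """
    merged = {}
    for doc_id, text in documents.items():
        seen = set(text.lower().split())
        extra = []
        for query in expansions.get(doc_id, []):
            for term in query.lower().split():
                if term not in seen:
                    seen.add(term)
                    extra.append(term)
        merged[doc_id] = text + " " + " ".join(extra) if extra else text
    return merged
```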