---
base_model: tokyotech-llm/Swallow-7b-hf
library_name: peft
---

# Model Info

This model applies LLM2Vec to Swallow. Only the PEFT adapter is distributed. LLM2Vec fine-tunes the base model in two stages, MNTP and SimCSE, and this repository contains the adapter obtained by applying SimCSE after MNTP. For the MNTP adapter, please refer to [this link](https://huggingface.co/uzabase/LLM2Vec-Llama-2-7b-hf-wikipedia-jp-mntp).

## Model Details

### Model Description

- **Model type:** PEFT
- **Language(s) (NLP):** Japanese
- **License:** Apache 2.0
- **Finetuned from model:** [Swallow-7b-hf](https://huggingface.co/tokyotech-llm/Swallow-7b-hf)

### Model Sources

- **Repository:** https://github.com/McGill-NLP/llm2vec
- **Paper:** https://arxiv.org/abs/2404.05961

## Usage

- Please see the [original LLM2Vec repo](https://huggingface.co/McGill-NLP/LLM2Vec-Llama-2-7b-chat-hf-mntp-unsup-simcse#usage). A minimal example is also sketched at the end of this card.

## Training Details

### Training Data

- The corpus for SimCSE was built from [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)
- Script used to build the SimCSE corpus:

```python
import argparse
import random
import re
from pathlib import Path

from datasets import load_dataset
from tqdm import tqdm


def main(args):
    random.seed(args.seed)

    # Sample N articles from the Japanese Wikipedia dump
    wiki_ds = load_dataset("wikimedia/wikipedia", "20231101.ja")
    sampled_index = random.sample(range(len(wiki_ds["train"])), args.N)
    sample_wiki = wiki_ds["train"][sampled_index]

    # Keep each article title plus every sentence longer than min_sentence_len
    output_texts = []
    for title, text in tqdm(zip(sample_wiki["title"], sample_wiki["text"])):
        output_texts.append(title)
        sentences = re.split("[\n。]", text)
        for sentence in sentences:
            if len(sentence) > args.min_sentence_len:
                output_texts.append(sentence.strip() + "。")

    # Write one sentence per line
    with args.output_path.open(mode="w") as f:
        for line in output_texts:
            f.write(line)
            f.write("\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--N", default=200000, type=int)
    parser.add_argument("--seed", default=42, type=int)
    parser.add_argument("-o", "--output_path", type=Path)
    parser.add_argument("--min_sentence_len", default=50, type=int)
    args = parser.parse_args()
    main(args)
```

#### Training Hyperparameters

- simcse_dropout: 0.3
- bidirectional: true
- pooling_mode: "mean"
- remove_unused_columns: false
- learning_rate: 3e-5
- loss_scale: 20
- batch_size: 256
- gradient_accumulation_steps: 1
- max_seq_length: 128
- lora_r: 16
- torch_dtype: "bfloat16"
- attn_implementation: "flash_attention_2"
- seed: 42
- bf16: true
- gradient_checkpointing: true

#### Accelerator Settings

- deepspeed_config:
  - gradient_accumulation_steps: 1
  - gradient_clipping: 1.0
  - offload_optimizer_device: nvme
  - offload_optimizer_nvme_path: /nvme
  - zero3_save_16bit_model: true
  - zero_stage: 2
- distributed_type: DEEPSPEED
- downcast_bf16: 'no'
- dynamo_config:
  - dynamo_backend: INDUCTOR
  - dynamo_mode: default
  - dynamo_use_dynamic: true
  - dynamo_use_fullgraph: true
- enable_cpu_affinity: false
- machine_rank: 0
- main_training_function: main
- mixed_precision: bf16
- num_machines: 1
- num_processes: 2
- rdzv_backend: static
- same_network: true
- use_cpu: false

### Framework versions

- Python: 3.12.3
- PEFT: 0.11.1
- Sentence Transformers: 3.0.1
- Transformers: 4.41.0
- PyTorch: 2.3.0
- Accelerate: 0.30.1
- Datasets: 2.20.0
- Tokenizers: 0.19.1
- MTEB: 1.13.0
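## Usage Example (Sketch)

As a supplement to the Usage link above, here is a minimal, unofficial sketch of how the adapters could be loaded for sentence embedding with the `llm2vec` package from the repository listed under Model Sources. The placeholder `<this-simcse-adapter-repo>` stands for the identifier of this repository, and the exact `LLM2Vec.from_pretrained` arguments may vary between `llm2vec` versions; treat this as a starting point rather than the official recipe.

```python
# Minimal sketch (not from the original card): encode Japanese sentences with the
# MNTP + SimCSE adapters via the llm2vec package.
import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "uzabase/LLM2Vec-Llama-2-7b-hf-wikipedia-jp-mntp",       # MNTP adapter linked above
    peft_model_name_or_path="<this-simcse-adapter-repo>",    # placeholder: this SimCSE adapter
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)

sentences = [
    "LLM2Vecは大規模言語モデルをテキストエンコーダに変換する手法です。",  # "LLM2Vec turns LLMs into text encoders."
    "今日は天気が良いので散歩に出かけた。",                              # "The weather was nice, so I went for a walk."
]
embeddings = l2v.encode(sentences)  # tensor of shape (num_sentences, hidden_size)
print(embeddings.shape)
```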