|
--- |
|
license: mit |
|
datasets: |
|
- unicamp-dl/mmarco |
|
language: |
|
- de |
|
--- |
|
|
|
|
|
# ColBERTv2-mmarco-de-0.1 |
|
|
|
This is a German ColBERT model, based on the approach of [colbert-ir/colbertv2.0](https://huggingface.co/colbert-ir/colbertv2.0).
|
|
|
- Base Model: [dbmdz/bert-base-german-cased](https://huggingface.co/dbmdz/bert-base-german-cased) |
|
- Training Data: [unicamp-dl/mmarco](https://huggingface.co/datasets/unicamp-dl/mmarco), a random sample of 10 million German triples
|
- Training Framework: [RAGatouille](https://github.com/bclavie/RAGatouille). Thanks a ton, [@bclavie](https://huggingface.co/bclavie)!
|
|
|
|
|
As I'm limited on GPU resources, training did not run all the way through; "only" 10 checkpoints were trained.
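
For a quick sanity check, here is a minimal usage sketch with RAGatouille's `RAGPretrainedModel`. The model id is only a placeholder for this repository (or a local checkpoint path), and the German passages and query are made-up examples.

```python
from ragatouille import RAGPretrainedModel

# Placeholder: replace with the full Hub id of this model or a local checkpoint path.
RAG = RAGPretrainedModel.from_pretrained("<this-model-on-the-hub-or-a-local-path>")

# Build a small index over a few made-up German passages.
RAG.index(
    collection=[
        "Der Eiffelturm befindet sich in Paris und ist rund 330 Meter hoch.",
        "Berlin ist die Hauptstadt von Deutschland.",
        "Die Zugspitze ist der höchste Berg Deutschlands.",
    ],
    index_name="demo-de",
)

# Late-interaction retrieval over the index.
results = RAG.search(query="Wie hoch ist der Eiffelturm?", k=2)
print(results)
```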
|
|
|
# Code |
|
My code is probably a mess, but YOLO! |
|
|
|
|
|
## Data prep
|
```python |
|
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

from datasets import load_dataset
from ragatouille import RAGTrainer
from tqdm import tqdm

SAMPLE_SIZE = -1  # -1 means "use the full dataset"


def int_to_string(number):
    """Human-readable sample-size suffix for the model name."""
    if number < 0:
        return "full"
    elif number < 1000:
        return str(number)
    elif number < 1000000:
        return f"{number // 1000}K"
    else:
        return f"{number // 1000000}M"


def process_chunk(chunk):
    # Turn a batch of rows into [query, positive, negative] triples
    return [list(item) for item in zip(chunk["query"], chunk["positive"], chunk["negative"])]


def chunked_iterable(iterable, chunk_size):
    """Yield successive chunks from iterable."""
    for i in range(0, len(iterable), chunk_size):
        yield iterable[i:i + chunk_size]


def process_dataset_concurrently(dataset, chunksize=1000):
    with ThreadPoolExecutor() as executor:
        # Wrap the chunk iterator with tqdm for real-time progress updates
        wrapped_dataset = tqdm(chunked_iterable(dataset, chunksize), total=(len(dataset) + chunksize - 1) // chunksize)
        # Submit each chunk to the executor
        futures = [executor.submit(process_chunk, chunk) for chunk in wrapped_dataset]
        results = []
        for future in concurrent.futures.as_completed(futures):
            results.extend(future.result())
        return results


# Load the German triples (query / positive / negative) from mMARCO
dataset = load_dataset('unicamp-dl/mmarco', 'german', trust_remote_code=True)

# Shuffle the dataset (seeded for reproducibility)
shuffled_dataset = dataset['train'].shuffle(seed=42)

if SAMPLE_SIZE > 0:
    sampled_dataset = shuffled_dataset.select(range(SAMPLE_SIZE))
else:
    sampled_dataset = shuffled_dataset

triplets = process_dataset_concurrently(sampled_dataset, chunksize=10000)

trainer = RAGTrainer(
    model_name=f"ColBERT-mmacro-de-{int_to_string(SAMPLE_SIZE)}",
    pretrained_model_name="dbmdz/bert-base-german-cased",
    language_code="de",
)
trainer.prepare_training_data(raw_data=triplets, mine_hard_negatives=False)
|
|
|
``` |
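
`prepare_training_data` writes the processed triples to RAGatouille's local data folder (`./data/` by default, if I remember the default correctly); the training step below simply points `trainer.data_dir` at a copy of that output (in my case mounted as a Kaggle dataset under `/kaggle/input/`).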
|
|
|
|
|
## Training |
|
|
|
```python |
|
from pathlib import Path

from ragatouille import RAGTrainer


def int_to_string(number):
    """Human-readable sample-size suffix for the model name."""
    if number < 1000:
        return str(number)
    elif number < 1000000:
        return f"{number // 1000}K"
    else:
        return f"{number // 1000000}M"


SAMPLE_SIZE = 1000000

trainer = RAGTrainer(
    model_name=f"ColBERT-mmacro-de-{int_to_string(SAMPLE_SIZE)}",
    pretrained_model_name="dbmdz/bert-base-german-cased",
    language_code="de",
)

# Point the trainer at the triples prepared in the data prep step
trainer.data_dir = Path("/kaggle/input/mmarco-de-10m")

trainer.train(
    batch_size=32,
    nbits=4,                # How many bits the trained model will use when compressing indexes
    maxsteps=500000,        # Hard stop after this many training steps
    use_ib_negatives=True,  # Use in-batch negatives to calculate the loss
    dim=128,                # Dimensions per embedding; 128 is the default and works well
    learning_rate=5e-6,     # Small values (3e-6 to 3e-5) work best for BERT-like base models; 5e-6 is often the sweet spot
    doc_maxlen=256,         # Maximum document length; because of how ColBERT works, smaller chunks (128-256) work very well
    use_relu=False,         # Disable ReLU -- doesn't improve performance
    warmup_steps="auto",    # Defaults to 10%
)
|
``` |
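
Once a run finishes (or is cut short, as happened here), the latest checkpoint directory can be loaded just like the published model. The path below is only a placeholder for wherever RAGatouille wrote your checkpoints.

```python
from ragatouille import RAGPretrainedModel

# Placeholder path: point this at the checkpoint directory produced by the training run above.
model = RAGPretrainedModel.from_pretrained("path/to/your/checkpoints/colbert")
```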