|
--- |
|
license: mit |
|
datasets: |
|
- unicamp-dl/mmarco |
|
language: |
|
- de |
|
--- |
|
|
|
|
|
# ColBERTv2-mmarco-de-0.1 |
|
|
|
This is a German ColBERT model, based on the approach of [colbert-ir/colbertv2.0](https://huggingface.co/colbert-ir/colbertv2.0).
|
|
|
- Base Model: [dbmdz/bert-base-german-cased](https://huggingface.co/dbmdz/bert-base-german-cased) |
|
- Training Data: [unicamp-dl/mmarco](https://huggingface.co/datasets/unicamp-dl/mmarco), a random sample of 10 million German triples
|
- Training Framework: [RAGatouille](https://github.com/bclavie/RAGatouille). Thanks a ton, [@bclavie](https://huggingface.co/bclavie)!
|
|
|
|
|
As I'm limited on GPU resources, training did not run all the way through; "only" 10 checkpoints were trained.
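
For a quick sanity check, here is a minimal usage sketch with RAGatouille's `RAGPretrainedModel`. The model id is only a placeholder for this repository (or a local checkpoint path), and the German passages and query are made-up examples.

```python
from ragatouille import RAGPretrainedModel

# Placeholder: replace with the full Hub id of this model or a local checkpoint path.
RAG = RAGPretrainedModel.from_pretrained("<this-model-on-the-hub-or-a-local-path>")

# Build a small index over a few made-up German passages.
RAG.index(
    collection=[
        "Der Eiffelturm befindet sich in Paris und ist rund 330 Meter hoch.",
        "Berlin ist die Hauptstadt von Deutschland.",
        "Die Zugspitze ist der höchste Berg Deutschlands.",
    ],
    index_name="demo-de",
)

# Late-interaction retrieval over the index.
results = RAG.search(query="Wie hoch ist der Eiffelturm?", k=2)
print(results)
```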
|
|
|
# Code |
|
My code is probably a mess, but YOLO! |
|
|
|
|
|
## Data prep
|
```python |
|
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

from datasets import load_dataset
from ragatouille import RAGTrainer
from tqdm import tqdm

SAMPLE_SIZE = -1  # -1 means "use the full dataset"


def int_to_string(number):
    """Human-readable sample-size suffix for the model name."""
    if number < 0:
        return "full"
    elif number < 1000:
        return str(number)
    elif number < 1000000:
        return f"{number // 1000}K"
    else:
        return f"{number // 1000000}M"


def process_chunk(chunk):
    # Turn a batch of rows into [query, positive, negative] triples
    return [list(item) for item in zip(chunk["query"], chunk["positive"], chunk["negative"])]


def chunked_iterable(iterable, chunk_size):
    """Yield successive chunks from iterable."""
    for i in range(0, len(iterable), chunk_size):
        yield iterable[i:i + chunk_size]


def process_dataset_concurrently(dataset, chunksize=1000):
    with ThreadPoolExecutor() as executor:
        # Wrap the chunk iterator with tqdm for real-time progress updates
        wrapped_dataset = tqdm(chunked_iterable(dataset, chunksize), total=(len(dataset) + chunksize - 1) // chunksize)
        # Submit each chunk to the executor
        futures = [executor.submit(process_chunk, chunk) for chunk in wrapped_dataset]
        results = []
        for future in concurrent.futures.as_completed(futures):
            results.extend(future.result())
        return results


# Load the German triples (query / positive / negative) from mMARCO
dataset = load_dataset('unicamp-dl/mmarco', 'german', trust_remote_code=True)

# Shuffle the dataset (seeded for reproducibility)
shuffled_dataset = dataset['train'].shuffle(seed=42)

if SAMPLE_SIZE > 0:
    sampled_dataset = shuffled_dataset.select(range(SAMPLE_SIZE))
else:
    sampled_dataset = shuffled_dataset

triplets = process_dataset_concurrently(sampled_dataset, chunksize=10000)

trainer = RAGTrainer(
    model_name=f"ColBERT-mmacro-de-{int_to_string(SAMPLE_SIZE)}",
    pretrained_model_name="dbmdz/bert-base-german-cased",
    language_code="de",
)
trainer.prepare_training_data(raw_data=triplets, mine_hard_negatives=False)
|
|
|
``` |
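
`prepare_training_data` writes the processed triples to RAGatouille's local data folder (`./data/` by default, if I remember the default correctly); the training step below simply points `trainer.data_dir` at a copy of that output (in my case mounted as a Kaggle dataset under `/kaggle/input/`).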
|
|
|
|
|
## Training |
|
|
|
```python |
|
from pathlib import Path

from ragatouille import RAGTrainer


def int_to_string(number):
    """Human-readable sample-size suffix for the model name."""
    if number < 1000:
        return str(number)
    elif number < 1000000:
        return f"{number // 1000}K"
    else:
        return f"{number // 1000000}M"


SAMPLE_SIZE = 1000000

trainer = RAGTrainer(
    model_name=f"ColBERT-mmacro-de-{int_to_string(SAMPLE_SIZE)}",
    pretrained_model_name="dbmdz/bert-base-german-cased",
    language_code="de",
)

# Point the trainer at the triples prepared in the data prep step
trainer.data_dir = Path("/kaggle/input/mmarco-de-10m")

trainer.train(
    batch_size=32,
    nbits=4,                # How many bits the trained model will use when compressing indexes
    maxsteps=500000,        # Hard stop after this many training steps
    use_ib_negatives=True,  # Use in-batch negatives to calculate the loss
    dim=128,                # Dimensions per embedding; 128 is the default and works well
    learning_rate=5e-6,     # Small values (3e-6 to 3e-5) work best for BERT-like base models; 5e-6 is often the sweet spot
    doc_maxlen=256,         # Maximum document length; because of how ColBERT works, smaller chunks (128-256) work very well
    use_relu=False,         # Disable ReLU -- doesn't improve performance
    warmup_steps="auto",    # Defaults to 10%
)
|
``` |
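
Once a run finishes (or is cut short, as happened here), the latest checkpoint directory can be loaded just like the published model. The path below is only a placeholder for wherever RAGatouille wrote your checkpoints.

```python
from ragatouille import RAGPretrainedModel

# Placeholder path: point this at the checkpoint directory produced by the training run above.
model = RAGPretrainedModel.from_pretrained("path/to/your/checkpoints/colbert")
```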