---
license: cc-by-nc-4.0
language:
- ar
- bn
- zh
- en
- fi
- fr
- de
- hi
- id
- it
- ja
- ko
- fa
- pt
- ru
- es
- sw
- te
- th
- yo
pipeline_tag: sentence-similarity
library_name: transformers
---

# DRAMA-base (0.1B): Diverse Augmentation from Large Language Models to Smaller Dense Retrievers

DRAMA-base (0.1B) is a dense retrieval model built upon a pruned large language model backbone and fine-tuned for efficient and generalizable multilingual text retrieval. By leveraging large language models for high-quality data augmentation, DRAMA-base achieves strong performance across both English and multilingual retrieval tasks, despite its compact size of 0.1B non-embedding parameters.

The default embedding size of `drama-base` is 768. Since we adopt Matryoshka Representation Learning, the dimensionality can be flexibly truncated to smaller values such as 512 or 256.

Please check our [paper](https://arxiv.org/abs/2502.18460) for details.

## Usage

Below is an example using `drama-base` to encode query and document examples from the MIRACL dataset:

```python
import torch
from transformers import AutoTokenizer, AutoModel

queries = [
    'What percentage of the Earth\'s atmosphere is oxygen?',
    '意大利首都是哪里?',  # "Where is the capital of Italy?"
]
documents = [
    "The amount of oxygen in the atmosphere has fluctuated over the last 600 million years, reaching a peak of 35% during the Carboniferous period, significantly higher than today's 21%.",
    # Chinese-language document about Rome, the capital of Italy.
    "羅馬是欧洲国家意大利首都和罗马首都广域市的首府及意大利全国的政治、经济、文化和交通中心,位于意大利半島中部的台伯河下游平原地,建城初期在七座小山丘上,故又名“七丘之城”。按城市范围内的人口计算,罗马是意大利人口最多的城市,也是欧盟人口第三多的城市。",
]

model_name = "facebook/drama-base"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)

query_embs = model.encode_queries(tokenizer, queries)
doc_embs = model.encode_documents(tokenizer, documents)

scores = query_embs @ doc_embs.T
print(scores.tolist())
# Expected output: [[0.5310, 0.0821], [0.1298, 0.6181]]
```

> Setting `trust_remote_code=True` loads our customized `drama_modeling.py`, which differs in two details:
>- We use bi-directional attention instead of uni-directional attention.
>- We prepend `"Query: "` to query text (no prefix is added to document text).

DRAMA models are trained using Matryoshka Representation Learning ([MRL](https://github.com/RAIVNLab/MRL)) to support flexible dimensionality. Both queries and documents can be encoded into smaller dimensions, such as 256, using the following:

```python
query_embs = model.encode_queries(tokenizer, queries, dim=256)
doc_embs = model.encode_documents(tokenizer, documents, dim=256)

scores = query_embs @ doc_embs.T
print(scores.tolist())
# Expected output: [[0.6031, 0.1750], [0.2005, 0.7251]]
```

## Evaluation

The model has been evaluated on multiple retrieval benchmarks, including [BEIR](https://github.com/beir-cellar/beir), [MIRACL](https://github.com/project-miracl/miracl), [MLDR](https://huggingface.co/datasets/Shitao/MLDR), and several multilingual retrieval tasks in [MTEB](https://github.com/embeddings-benchmark/mteb). It demonstrates strong performance in both English and multilingual retrieval tasks.
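
For intuition, the `dim` argument in the MRL example above follows the usual MRL inference recipe: keep the first `dim` components of the full embedding and re-normalize. Below is a minimal sketch of that recipe; the helper name `truncate_mrl` is ours, and the assumption that it matches the internals of `encode_queries(..., dim=...)` is not confirmed here.

```python
import torch

def truncate_mrl(embs: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the first `dim` components and L2-normalize the result.

    Standard MRL truncation; assumed (not confirmed) to approximate what
    the `dim=` argument of encode_queries/encode_documents does internally.
    """
    return torch.nn.functional.normalize(embs[:, :dim], p=2, dim=-1)

# Example: scores from truncated embeddings approximate full-dimension scores.
# scores_256 = truncate_mrl(query_embs, 256) @ truncate_mrl(doc_embs, 256).T
```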
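
Once the score matrix from the usage examples is computed, turning it into per-query rankings is plain PyTorch; nothing here is DRAMA-specific:

```python
import torch

# scores has shape [num_queries, num_docs]; higher means more similar.
top = torch.topk(scores, k=1, dim=-1)
for qi, (val, idx) in enumerate(zip(top.values, top.indices)):
    print(f"query {qi}: best document {idx.item()} (score {val.item():.4f})")
```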