---
license: cc-by-4.0
language:
- en
tags:
- retrieval
- retriever
- rag
inference: false
---

# Model Description

RoBERTa ReRanker for Retrieved Results, or **R*** (pronounced R-star), is a reranking model designed to improve the relevance and accuracy of search results. Combining the retrieval capabilities of **R*** with generative models yields a hybrid approach intended to improve the relevance and contextual depth of search results. Based on the [RoBERTa tiny](https://huggingface.co/haisongzhang/roberta-tiny-cased) architecture, **R*** specializes in distinguishing relevant from irrelevant query-passage pairs, thereby refining the output of LLMs in retrieval and generative tasks.

This model is an experiment presented at [PACLIC 38 (2024)](https://sites.google.com/view/paclic38), whose proceedings are to be published in the ACL Anthology.

## Training Data

R* was trained on a dataset derived from the MS MARCO passage ranking dataset, consisting of 2.5 million query-positive passage pairs and an equal number of query-negative passage pairs, for a total of 5 million query-passage pairs. This balance exposes R* to relevant and irrelevant examples equally.

## Training Procedure

Training was framed as binary classification, with the aim of assigning each query-passage pair a continuous relevance score ranging from 0 (irrelevant) to 1 (relevant). The model was trained for 7 epochs with a batch size of 2048 on a Colab Pro instance equipped with a V100 GPU (16 GB VRAM) and 51 GB of RAM. Training took approximately 16 hours.

## Evaluation and Performance

Coming soon.

## Use Cases

R* is particularly suitable for applications that demand high precision in information retrieval, such as RAG reranking, search engine results, document search in legal or academic databases, and recommendation systems.

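
As a rough illustration of the RAG reranking use case, the sketch below orders candidate passages by relevance score and keeps only the best few. The `rerank` helper and its `top_k` parameter are hypothetical names introduced here, not part of R*. The scores are placeholder values, and the sketch assumes that a higher score means a more relevant passage.

```python
from typing import List, Sequence, Tuple


def rerank(
    passages: Sequence[str],
    scores: Sequence[float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (passage, score) pairs, highest score first.

    Hypothetical helper for illustration; it is not shipped with R*.
    """
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    # Sort pairs by score in descending order, then keep the first top_k.
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Placeholder scores for illustration only; in practice they would come from the model.
passages = ["passage A", "passage B", "passage C"]
scores = [0.12, 0.87, 0.45]

top = rerank(passages, scores, top_k=2)
print(top)
```

In a RAG pipeline, only the passages returned by such a step would be passed on to the generative model as context.
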
## How to Use

### With Transformers

For usage with the Transformers library, you can follow this generic example:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('jaspercatapang/R-star')
tokenizer = AutoTokenizer.from_pretrained('jaspercatapang/R-star')

# The first list holds the queries and the second the passages; items are paired by position.
features = tokenizer(
    ['Your query here', 'Your query here'],
    ['First candidate passage', 'Second candidate passage'],
    padding=True, truncation=True, return_tensors="pt",
)

model.eval()
with torch.no_grad():
    scores = model(**features).logits  # raw logits, one row per query-passage pair
print(scores)
```

### With SentenceTransformers

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('jaspercatapang/R-star', max_length=512)

# Each tuple is one (query, passage) pair to score.
scores = model.predict([
    ('Your query here', 'First candidate passage'),
    ('Your query here', 'Second candidate passage'),
])
```

### Training and Evaluation

1. For training, the Colab notebook can be found [here](https://colab.research.google.com/drive/1F105XTCchub-flcGB1XqqoaYlJr16YR3).
2. For evaluation, the Colab notebook can be found [here](https://colab.research.google.com/drive/1H5RppJX9cfRXd8Hls2_Vis5sb6SHB1zf).

## Limitations

Based on our evaluation, R* tends to favor longer passages when scoring, which could introduce a bias. This is true of most cross-encoder models. It is advisable to preprocess text so that passage lengths are normalized and scores can be compared fairly. Note that R* is optimized for passage-level comparisons and may not perform well on word- or phrase-level similarity tasks.

## Ethical Considerations

The use of R* raises several ethical considerations, including potential biases in the training data, privacy concerns, and the implications of automating decision-making processes. Users are encouraged to critically evaluate the model's fairness and transparency, and to ensure it is used equitably across diverse demographics.

## Contact Details

For additional information or inquiries about R*, please contact the developer at jasperkylecatapang@gmail.com.

## Disclaimer

R* is an AI language model developed by Jasper Kyle Catapang. It is provided "as is" without warranty of any kind, expressed or implied. The model developer shall not be liable for any direct or indirect damages arising from the use of this model.

## Acknowledgments

Thank you to Microsoft for the MS MARCO dataset. We would also like to extend our gratitude to [Haisong Zhang](https://huggingface.co/haisongzhang) for the base model.