---
license: apache-2.0
language:
- en
base_model:
- sfairXC/FsfairX-LLaMA3-RM-v0.1
tags:
- reward model
- fine-grained
---

# MDCureRM

[📄 Paper](https://arxiv.org/pdf/2410.23463) | [🤗 HF Collection](https://huggingface.co/collections/yale-nlp/mdcure-6724914875e87f41e5445395) | [⚙️ GitHub Repo](https://github.com/yale-nlp/MDCure)

## Introduction

**MDCure** is an effective and scalable procedure for generating high-quality multi-document (MD) instruction tuning data to improve the MD capabilities of LLMs. Using MDCure, we construct a suite of MD instruction datasets complementary to collections such as [FLAN](https://github.com/google-research/FLAN) and fine-tune a variety of already instruction-tuned LLMs from the FlanT5, Qwen2, and LLAMA3.1 model families, up to 70B parameters in size. We additionally introduce **MDCureRM**, an evaluator model specifically designed for the MD setting to filter and select high-quality MD instruction data in a cost-effective, RM-as-a-judge fashion. Extensive evaluations on a wide range of MD and long-context benchmarks spanning various tasks show MDCure consistently improves performance over pre-trained baselines and over corresponding base models by up to 75.5%.

We release MDCure datasets of size 12k, 36k, and 72k. We also release MDCureRM and the best MDCure'd model for each architecture/size combination. To access all our models and datasets, please visit our [HF Collection](https://huggingface.co/collections/yale-nlp/mdcure-6724914875e87f41e5445395). For further details regarding dataset construction, please see our [paper](https://arxiv.org/pdf/2410.23463) and [GitHub repo](https://github.com/yale-nlp/MDCure). For additional details regarding how to use **yale-nlp/MDCureRM**, please see below.

The MDCure pipeline generates diverse multi-document instructions, filters them via fine-grained scoring by MDCureRM, and tunes a base LLM to enhance its multi-document capabilities.
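
As a rough illustration of the filtering stage in the pipeline described above, the following is a minimal sketch that keeps the highest-scoring candidate instructions. It assumes each candidate already carries a final MDCureRM score; the top-k rule, the function name `select_top_k`, and the example candidates and scores are all hypothetical and are not taken from the MDCure codebase. The paper and GitHub repo define the actual selection procedure.

```python
from typing import List, Tuple


def select_top_k(scored: List[Tuple[str, float]], k: int) -> List[str]:
    """Return the k instructions with the highest scores.

    Hypothetical selection rule for illustration only; ties keep input order
    because Python's sort is stable.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up candidates and scores on the 1-5 scale (not real MDCureRM outputs).
candidates = [
    ("Compare the findings of the two reports.", 4.2),
    ("Summarize document 1.", 2.1),
    ("Which claims appear in all three articles?", 4.7),
]

kept = select_top_k(candidates, k=2)
print(kept)
```

A score threshold would be an equally plausible rule; top-k is used here only because it yields a dataset of fixed size.
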
## Model Details

**yale-nlp/MDCureRM** is initialized from [sfairXC/FsfairX-LLaMA3-RM-v0.1](https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1) and trained via a multi-objective reward modeling framework to obtain an MD-specific reward model that evaluates candidate instructions on six criteria. These criteria capture both the overall quality of the instruction-response pairs and their effectiveness in handling multi-document content. Using MDCureRM in a fine-grained RM-as-a-Judge mechanism lets us obtain high-quality training data.

## Requirements

We recommend using the latest version of HF Transformers, or any `transformers>=4.45.0`, to avoid potential errors when using this model.

## Quickstart

Below we provide a code snippet demonstrating how to load the tokenizer and model and score a candidate instruction. We strongly recommend formatting the instruction input as shown, to stay consistent with the format of the data used to train MDCureRM. Because the model outputs values normalized to the 0-1 range, we rescale the scores to the 1-5 range for more interpretable results. The relative weighting of the fine-grained rewards may be configured as desired to obtain the final score; we reproduce the weights used in our implementation in `reward_weights` below.

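
To make the rescaling and weighting concrete before the full snippet, here is a minimal, dependency-free sketch of the arithmetic alone. The values in `raw_scores` are made up for illustration (they are not model outputs), and the helper name `aggregate` is hypothetical; only the weights and the `4 * s + 1` rescaling mirror the Quickstart code.

```python
# Hypothetical raw outputs for the six criteria, each in [0, 1].
raw_scores = [0.5, 0.75, 1.0, 0.25, 0.5, 0.75]

# Weights reproduced from the Quickstart snippet; they sum to 1.
reward_weights = [1 / 9, 1 / 9, 1 / 9, 2 / 9, 2 / 9, 2 / 9]


def aggregate(raw, weights):
    """Rescale each score from [0, 1] to [1, 5], then take the weighted sum."""
    scaled = [4.0 * s + 1.0 for s in raw]
    return sum(w * s for w, s in zip(weights, scaled))


final_score = aggregate(raw_scores, reward_weights)
print(round(final_score, 3))
```

Because the weights sum to 1, the final score stays within the same 1-5 range as the individual rescaled scores.
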
```python
from transformers import AutoTokenizer, AutoModel, AutoConfig, LlamaConfig, PreTrainedModel, LlamaForSequenceClassification
import torch.nn as nn
import torch

# Login to HF to access LLAMA model
from huggingface_hub import login
login("")  # HF token

class RewardModelConfig(LlamaConfig):
    model_type = "RewardModel"

    def __init__(self, reward_dim=None, base_model_name=None, **kwargs):
        super().__init__(**kwargs)
        self.reward_dim = reward_dim
        self.base_model_name = base_model_name

class RewardModel(PreTrainedModel):
    config_class = RewardModelConfig

    def create_base_model(self):
        # use sequence classification model for consistency with https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1
        BACKBONE_MODEL = LlamaForSequenceClassification.from_pretrained(
            self.config.base_model_name,
            config=LlamaConfig.from_pretrained(self.config.base_model_name),
        )
        BACKBONE_MODEL.config.pad_token_id = BACKBONE_MODEL.config.eos_token_id
        BACKBONE_MODEL.config.output_hidden_states = True

        # freeze the backbone; only the regression head carries trainable weights
        for param in BACKBONE_MODEL.parameters():
            param.requires_grad = False

        return BACKBONE_MODEL

    def __init__(self, config):
        super(RewardModel, self).__init__(config)

        # use .base_model to drop the classification head and keep only the transformer backbone
        self.BASE_MODEL = self.create_base_model().base_model

        # regression head for reward prediction
        self.regression_head = nn.Linear(self.BASE_MODEL.config.hidden_size, config.reward_dim)

    def forward(self, input_ids, attention_mask=None, rewards=None, **kwargs):
        # forward pass through the base model
        outputs = self.BASE_MODEL(input_ids, attention_mask=attention_mask, **kwargs)

        hidden_states = outputs.hidden_states[-1]

        # access hidden state corresponding to the last token in each sequence across the batch
        last_token_hidden_state = hidden_states[:, -1, :]
        reward_predictions = self.regression_head(last_token_hidden_state)

        return reward_predictions

    def prepare_inputs_for_generation(self, *args, **kwargs):
        return self.BASE_MODEL.prepare_inputs_for_generation(*args, **kwargs)

# register the custom classes so that AutoModel can resolve the "RewardModel" model type
AutoConfig.register("RewardModel", RewardModelConfig)
AutoModel.register(RewardModelConfig, RewardModel)

model = AutoModel.from_pretrained("yale-nlp/MDCureRM").to(torch.device("cuda"))
tokenizer = AutoTokenizer.from_pretrained("yale-nlp/MDCureRM", use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

reward_weights = torch.tensor([1/9, 1/9, 1/9, 2/9, 2/9, 2/9], device="cuda")

source_text_1 = ...
source_text_2 = ...
source_text_3 = ...
context = f"{source_text_1}\n\n{source_text_2}\n\n{source_text_3}"
instruction = "What happened in CHAMPAIGN regarding Lovie Smith and the 2019 defense improvements? Respond with 1-2 sentences."

input_text = f"Instruction: {instruction}\n\n{context}"
tokenized_input = tokenizer(
    input_text,
    return_tensors='pt',
    truncation=True,
    padding=True,
).to(torch.device("cuda"))

all_six_scores = model(tokenized_input["input_ids"]).squeeze(0)  # flatten for dot product
all_six_scores = all_six_scores*4. + 1.  # scale up to 1->5 range
final_score = torch.dot(all_six_scores, reward_weights).cpu().item()
print(final_score)
```

## All MDCure Models

Beyond MDCureRM, we open-source our best MDCure'd models at the following links:

| Model | Hugging Face Repo | Description |
|---------------------------|---------------------|------------------------------|
| **MDCure-FlanT5-Base** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-FlanT5-Base) | **FlanT5-Base** fine-tuned with MDCure-72k |
| **MDCure-FlanT5-Large** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-FlanT5-Large) | **FlanT5-Large** fine-tuned with MDCure-72k |
| **MDCure-Qwen2-1.5B-Instruct** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-Qwen2-1.5B-Instruct) | **Qwen2-1.5B-Instruct** fine-tuned with MDCure-72k |
| **MDCure-Qwen2-7B-Instruct** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-Qwen2-7B-Instruct) | **Qwen2-7B-Instruct** fine-tuned with MDCure-72k |
| **MDCure-LLAMA3.1-8B-Instruct** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-LLAMA3.1-8B-Instruct) | **LLAMA3.1-8B-Instruct** fine-tuned with MDCure-72k |
| **MDCure-LLAMA3.1-70B-Instruct** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-LLAMA3.1-70B-Instruct) | **LLAMA3.1-70B-Instruct** fine-tuned with MDCure-72k |

## Citation

If you find our work useful, please cite our paper as:

```bibtex
@article{liu2024mdcure,
    title={MDCure: A Scalable Pipeline for Multi-Document Instruction-Following},
    author={Gabrielle Kaili-May Liu and Bowen Shi and Avi Caciularu and Idan Szpektor and Arman Cohan},
    journal={arXiv preprint arXiv:2410.23463},
    year={2024},
    url={https://arxiv.org/abs/2410.23463}
}
```