Model Overview
This is a multimodal large language model fine-tuned from Qwen2.5-VL on the R1-Onevision dataset. The model enhances vision-language understanding and reasoning, making it suitable for tasks such as visual reasoning and image understanding. With its robust multimodal reasoning ability, R1-Onevision serves as a capable AI assistant for a wide range of problem-solving challenges across different domains.
Training Configuration and Curve
- Framework: Training uses the open-source LLaMA-Factory library with Qwen2.5-VL-Instruct as the base model, which is available in three sizes: 3B, 7B, and 32B.
- Parameters: To save GPU memory, image inputs are capped at a resolution of 518. Training follows a full-model SFT (Supervised Fine-Tuning) approach with a learning rate of 1e-5 for one epoch.
The training configuration is as follows:
image_resolution: 518
cutoff_len: 8192
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 1.0e-5
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.05
bf16: true
flash_attn: fa2
Training loss curve (figure).
Usage
You can load the model using the Hugging Face transformers library:
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
import torch
from qwen_vl_utils import process_vision_info
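# Hugging Face repository id of the released 7B checkpoint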
MODEL_ID = "Fancy-MLLM/R1-Onevision-7B"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
MODEL_ID,
trust_remote_code=True,
torch_dtype=torch.bfloat16
).to("cuda").eval()
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "<your image path>"},
{"type": "text", "text": "Hint: Please answer the question and provide the final answer at the end. Question: Which number do you have to write in the last daisy?"},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
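# Collect the image (and any video) inputs referenced in the messages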
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(model.device)
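# Generate the model's reasoning and final answer (up to 4096 new tokens)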
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
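If a single GPU cannot comfortably hold the 7B weights in bfloat16 (roughly 15 GB plus activation memory), a common alternative, shown here as a sketch rather than as part of the original card, is to let transformers place the weights automatically with device_map="auto" (this requires the accelerate package):

# Alternative loading (sketch, not from the original card): shard the weights
# across available GPUs, or offload to CPU, using device_map="auto".
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()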
Ongoing Work
Rule-Based Reinforcement Learning (RL)
We are actively exploring the integration of rule-based systems into reinforcement learning to enhance the model's decision-making. This approach combines domain-specific rules with the learning process, aiming to improve the efficiency and safety of learning in complex environments.
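As a rough illustration of the idea (an assumption for illustration, not the authors' implementation), a rule-based reward for reasoning tasks can combine a format check with an exact-match accuracy check:

import re

# Illustrative sketch only (assumed names and rules): a rule-based reward that
# checks the response format and compares the stated final answer to a reference.
def rule_based_reward(response: str, ground_truth: str) -> float:
    reward = 0.0
    # Format rule: the response must state an explicit final answer.
    match = re.search(r"final answer[:\s]*(.+)", response, flags=re.IGNORECASE)
    if match:
        reward += 0.1  # small bonus for following the answer format
        # Accuracy rule: the stated answer must match the reference exactly.
        if match.group(1).strip().rstrip(".") == ground_truth.strip():
            reward += 1.0
    return reward

Because rewards of this kind are verifiable with simple rules rather than a learned judge, they are cheap to compute and easy to audit.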
Training with General Data and Multimodal Reasoning CoT
Our ongoing work includes expanding the training datasets by incorporating more general data alongside multimodal reasoning Chain-of-Thought (CoT) data. This will enable the model to benefit from a broader range of information, enhancing its ability to handle diverse reasoning tasks across various domains.
Incorporating Chinese Multimodal Reasoning CoT Data
We are also focused on integrating Chinese multimodal reasoning CoT data into the training process. By adding this language-specific dataset, we aim to improve the model’s capability to perform reasoning tasks in Chinese, expanding its multilingual and multimodal reasoning proficiency.
Release of the 3B Model
We are working on releasing a smaller, more efficient 3B model that balances performance with resource efficiency. It aims to deliver strong multimodal reasoning capabilities while remaining accessible in environments with limited computational resources, offering a compact alternative to the current 7B model.
R1-Onevision Authors
- Yi Yang*, Xiaoxuan He*, Hongkun Pan*, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Minfeng Zhu†, Bo Zhang†, Wei Chen†
- *Equal contribution. †Corresponding authors.
Model Contact
- Xiaoxuan He: [email protected]
- Hongkun Pan: [email protected]
- Yi Yang: [email protected]
Base Model
- Qwen/Qwen2.5-VL-7B-Instruct