---
license: llama3.2
datasets:
- open-thoughts/OpenThoughts-114k
- FreedomIntelligence/medical-o1-verifiable-problem
- open-r1/OpenR1-Math-220k
base_model:
- meta-llama/Llama-3.2-3B-Instruct
---

# mkurman/Llama-3.2-MedIT-3B-R1

**Important Notice:**  
This model is provided strictly for research purposes and is not intended for production use. It should not be considered a validated source of medical or professional advice. Use only in controlled experimental settings.

---

## Model Overview

mkurman/Llama-3.2-MedIT-3B-R1 is a fine-tuned variant of meta-llama/Llama-3.2-3B-Instruct, adapted for research on natural language understanding and reasoning. It was trained in multiple stages, combining Blurred Thoughts Supervised Fine-Tuning (BT-SFT) with two rounds of Group Relative Policy Optimization (GRPO), each guided by an LLM evaluator, to improve performance on medical and mathematical reasoning tasks.
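
For orientation, here is a minimal inference sketch using the standard `transformers` text-generation pipeline. The prompt and sampling settings are illustrative only, not a recommendation:

```python
import torch
from transformers import pipeline

# Load the fine-tuned checkpoint; the 3B model fits on a single GPU in bfloat16.
generator = pipeline(
    "text-generation",
    model="mkurman/Llama-3.2-MedIT-3B-R1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Chat-formatted input; the tokenizer applies the Llama 3.2 chat template.
messages = [
    {
        "role": "user",
        "content": "For research purposes only: outline the reasoning steps "
                   "used to differentiate type 1 from type 2 diabetes.",
    },
]

result = generator(messages, max_new_tokens=512, do_sample=True, temperature=0.7)
print(result[0]["generated_text"][-1]["content"])
```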

---

## Training Procedure

The model was developed through the following sequential steps:

1. **Initial Blurred Thoughts Supervised Fine-Tuning (BT-SFT):**  
   - **Base Model:** meta-llama/Llama-3.2-3B-Instruct  
   - **Parameters:** 2,000 steps, batch size 2, gradient accumulation of 16, learning rate 1e-6  
   - **Dataset:** open-thoughts/OpenThoughts-114k  
   - **Details:** For further information on BT-SFT, see the [detailed post](https://huggingface.co/posts/mkurman/496852395740108) and the [GitHub repository](https://github.com/mkurman/blurred-thoughts-SFT).

2. **Group Relative Policy Optimization (GRPO), Stage 1:**  
   - **Dataset:** FreedomIntelligence/medical-o1-verifiable-problem
   - **Training:** 200 steps
   - **LLM Evaluator:** mkurman/Qwen2.5-14B-DeepSeek-R1-1M
   - **Details:** For further information on GRPO with LLM evaluators, see the [GitHub repository](https://github.com/mkurman/grpo-llm-evaluator) and the sketch after this list.

3. **Group Relative Policy Optimization (GRPO), Stage 2:**  
   - **Dataset:** open-r1/OpenR1-Math-220k
   - **Training:** 200 steps
   - **LLM Evaluator:** deepseek/deepseek-r1-distill-qwen-14b (via OpenRouter)
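
The exact GRPO-with-evaluator procedure lives in the [grpo-llm-evaluator](https://github.com/mkurman/grpo-llm-evaluator) repository. As a rough illustration only, the sketch below shows the general shape of such a stage using TRL's `GRPOTrainer`; the reward function, column names, and checkpoint path are placeholders and assumptions, not the author's code:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def score_with_evaluator(prompt: str, completion: str) -> float:
    """Placeholder for the LLM evaluator call. In the actual setup an
    evaluator model (mkurman/Qwen2.5-14B-DeepSeek-R1-1M in Stage 1) grades
    each completion; a trivial rule stands in for that call here."""
    return 1.0 if completion.strip() else 0.0

def evaluator_reward(prompts, completions, **kwargs):
    # GRPO samples a group of completions per prompt and pushes the policy
    # toward the relatively higher-scoring ones.
    return [score_with_evaluator(p, c) for p, c in zip(prompts, completions)]

# Stage 1 data. GRPOTrainer expects a "prompt" column; the source column
# name below is an assumption -- check the dataset card for the real field.
dataset = load_dataset("FreedomIntelligence/medical-o1-verifiable-problem", split="train")
dataset = dataset.rename_column("Open-ended Verifiable Question", "prompt")

trainer = GRPOTrainer(
    model="path/to/bt-sft-checkpoint",  # hypothetical path to the BT-SFT output
    reward_funcs=evaluator_reward,
    args=GRPOConfig(output_dir="grpo-stage1", max_steps=200),  # 200 steps, as above
    train_dataset=dataset,
)
trainer.train()
```

Stage 2 follows the same pattern with open-r1/OpenR1-Math-220k as the dataset and deepseek/deepseek-r1-distill-qwen-14b as the evaluator.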

---

## Datasets Utilized

- **open-thoughts/OpenThoughts-114k:**  
  A synthetic reasoning dataset with detailed chain-of-thought responses, used during the initial supervised fine-tuning stage.

- **FreedomIntelligence/medical-o1-verifiable-problem:**  
  A curated set of medical problems with verifiable answers, used in GRPO Stage 1 to strengthen the model's handling of verifiable medical questions.

- **open-r1/OpenR1-Math-220k:**  
  A large mathematics dataset, used in GRPO Stage 2 to improve the model's mathematical reasoning and problem-solving skills.
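
All three datasets are public on the Hugging Face Hub and can be inspected with the `datasets` library (the default configs and the `train` split are assumed below; check each dataset card if loading fails):

```python
from datasets import load_dataset

for repo_id in (
    "open-thoughts/OpenThoughts-114k",
    "FreedomIntelligence/medical-o1-verifiable-problem",
    "open-r1/OpenR1-Math-220k",
):
    ds = load_dataset(repo_id, split="train")
    print(f"{repo_id}: {len(ds):,} rows, columns={ds.column_names}")
```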

---

## Intended Use

- **Research and Experimental Applications:**  
  This model is optimized for academic research and exploratory projects. It is ideal for investigating advanced fine-tuning methods and evaluating performance on task-oriented conversational scenarios.

- **Controlled Environments:**  
  Users should deploy this model only within controlled experimental frameworks where rigorous evaluation and proper safety guardrails are in place.

---

## Limitations and Ethical Considerations

- **Not for Clinical or Production Use:**  
  The model’s outputs have not been validated for clinical accuracy or professional decision-making. It must not be used as a primary source for medical, legal, or safety-critical information.

- **Safety and Guardrails:**  
  All users must implement appropriate safety measures and validation protocols. The model may produce biased or inaccurate results and should be used with caution.

- **Experimental Nature:**  
  Given its research-oriented design, the model’s performance can vary widely based on input and context. It is essential to perform thorough testing and validation before drawing any conclusions from its outputs.

---

## License

This model is released under the Llama 3.2 Community License, which it inherits from its base model. Users must adhere to the terms of that license when using the model.

---

## Final Notice

All outputs from **mkurman/Llama-3.2-MedIT-3B-R1** are intended solely for research purposes. This model is not a comprehensive knowledge source and should not be used as a substitute for professional advice or decision-making. Ensure that all necessary guardrails and safety protocols are in place when conducting any experiments with this model.