OpenMeditron/Meditron3-Gemma2-9B

Alexandre Sallinen committed 20 days ago

Commit f2c32b7 · verified · 1 Parent(s): d96b862

Update README.md

Files changed (1)

README.md +6 -1

@@ -60,7 +60,12 @@ Additional information about the datasets will be included in the Meditron-3 pub
 #### Evaluation
-Evaluation results for the Gemma[2]-Meditron-3[9B] are coming soon!
+| Model Name                  | MedmcQA | MedQA  | PubmedQA | Average |
+|-----------------------------|---------|--------|----------|---------|
+| google/gemma-2-9b           | 56.60   | 63.32  | 76.80    | 65.57   |
+| gemMeditron-2-9b-4818       | 57.21   | 63.79  | 77.00    | 66.00   |
+| Difference (gemMeditron vs.)| 0.61    | 0.47   | 0.20     | 0.43    |
 We evaluated Meditron on medical multiple-choice questions using [lm-harness](https://github.com/EleutherAI/lm-evaluation-harness) for reproducibility.
 While MCQs are valuable for assessing exam-like performance, they fall short of capturing the model's real-world utility, especially in terms of contextual adaptation in under-represented settings. Medicine is not multiple choice and we need to go beyond accuracy to assess finer-grained issues like empathy, alignment to local guidelines, structure, completeness and safety. To address this, we have developed a platform to collect feedback directly from experts to continuously adapt to the changing contexts of clinical practice.
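
Reviewer note: the "Average" column and the "Difference" row in the added table can be rechecked from the three per-benchmark scores. The snippet below is a minimal sketch, not part of the commit. It assumes the average is an unweighted mean of the three benchmarks rounded to two decimals, and that the difference row is the fine-tuned score minus the base score. The helper names `average` and `differences` are hypothetical.

```python
# Scores copied from the table added in this commit (accuracy, in percent).
scores = {
    "google/gemma-2-9b": {"MedmcQA": 56.60, "MedQA": 63.32, "PubmedQA": 76.80},
    "gemMeditron-2-9b-4818": {"MedmcQA": 57.21, "MedQA": 63.79, "PubmedQA": 77.00},
}


def average(row):
    """Unweighted mean over benchmarks, rounded to two decimals (assumption)."""
    return round(sum(row.values()) / len(row), 2)


def differences(base, tuned):
    """Per-benchmark gain of the fine-tuned model over the base model."""
    return {task: round(tuned[task] - base[task], 2) for task in base}


base = scores["google/gemma-2-9b"]
tuned = scores["gemMeditron-2-9b-4818"]

averages = {name: average(row) for name, row in scores.items()}
diffs = differences(base, tuned)
# The table's last cell is taken here as the gap between the two rounded averages.
diffs["Average"] = round(
    averages["gemMeditron-2-9b-4818"] - averages["google/gemma-2-9b"], 2
)

print(averages)
print(diffs)
```

Under these assumptions the recomputed values agree with the table as committed, so the reported gain over the base model is under one point on every benchmark.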