OpenMeditron/Meditron3-Gemma2-9B

Text Generation

Update README.md

#1

by ETraKoZ - opened 7 days ago

base: refs/heads/main

←

from: refs/pr/1

README.md +7 -1

@@ -63,9 +63,15 @@ Additional information about the datasets will be included in the Meditron-3 pub
 | Model Name                  | MedmcQA | MedQA  | PubmedQA | Average |
 |-----------------------------|---------|--------|----------|---------|
 | google/gemma-2-9b           | 56.60   | 63.32  | 76.80    | 65.57   |
-| gemMeditron-2-9b-4818       | 57.21   | 63.79  | 77.00    | 66.00   |
 | Difference (gemMeditron vs.)| 0.61    | 0.47   | 0.20     | 0.43    |
 We evaluated Meditron on medical multiple-choice questions using [lm-harness](https://github.com/EleutherAI/lm-evaluation-harness) for reproducibility.
 While MCQs are valuable for assessing exam-like performance, they fall short of capturing the model's real-world utility, especially in terms of contextual adaptation in under-represented settings. Medicine is not multiple choice and we need to go beyond accuracy to assess finer-grained issues like empathy, alignment to local guidelines, structure, completeness and safety. To address this, we have developed a platform to collect feedback directly from experts to continuously adapt to the changing contexts of clinical practice.

 | Model Name                  | MedmcQA | MedQA  | PubmedQA | Average |
 |-----------------------------|---------|--------|----------|---------|
 | google/gemma-2-9b           | 56.60   | 63.32  | 76.80    | 65.57   |
+| gemMeditron-2-9b            | 57.21   | 63.79  | 77.00    | 66.00   |
 | Difference (gemMeditron vs.)| 0.61    | 0.47   | 0.20     | 0.43    |
+
+| Model Name                  | AfrimedQA |
+|-----------------------------|-----------|
+| google/gemma-2-9b           | 51.25     |
+| gemMeditron-2-9b            | 58.23     |
+| Difference (gemMeditron vs.)| 6.98      |
 We evaluated Meditron on medical multiple-choice questions using [lm-harness](https://github.com/EleutherAI/lm-evaluation-harness) for reproducibility.
 While MCQs are valuable for assessing exam-like performance, they fall short of capturing the model's real-world utility, especially in terms of contextual adaptation in under-represented settings. Medicine is not multiple choice and we need to go beyond accuracy to assess finer-grained issues like empathy, alignment to local guidelines, structure, completeness and safety. To address this, we have developed a platform to collect feedback directly from experts to continuously adapt to the changing contexts of clinical practice.
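
The derived rows in the tables above (Average and Difference) can be checked against the reported per-benchmark scores. The following is a minimal sketch, not part of the PR: it assumes the Average column is the unweighted mean of MedmcQA, MedQA and PubmedQA, and that Difference is the fine-tuned score minus the base score. The names `scores`, `AVERAGED`, `average` and `difference` are hypothetical helpers introduced here for illustration.

```python
# Scores copied from the tables in this diff.
scores = {
    "google/gemma-2-9b": {
        "MedmcQA": 56.60, "MedQA": 63.32, "PubmedQA": 76.80, "AfrimedQA": 51.25,
    },
    "gemMeditron-2-9b": {
        "MedmcQA": 57.21, "MedQA": 63.79, "PubmedQA": 77.00, "AfrimedQA": 58.23,
    },
}

# Assumption: only these three benchmarks feed the "Average" column;
# AfrimedQA is reported in its own table.
AVERAGED = ("MedmcQA", "MedQA", "PubmedQA")


def average(model: str) -> float:
    """Unweighted mean over the averaged benchmarks, rounded to 2 decimals."""
    values = [scores[model][name] for name in AVERAGED]
    return round(sum(values) / len(values), 2)


def difference(benchmark: str) -> float:
    """Fine-tuned score minus base score for one benchmark, rounded to 2 decimals."""
    tuned = scores["gemMeditron-2-9b"][benchmark]
    base = scores["google/gemma-2-9b"][benchmark]
    return round(tuned - base, 2)


if __name__ == "__main__":
    for model in scores:
        print(model, "average:", average(model))
    for benchmark in (*AVERAGED, "AfrimedQA"):
        print(benchmark, "difference:", difference(benchmark))
```

Under these assumptions the recomputed values agree with the table rows as reported, including the 6.98-point gap on AfrimedQA.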