Alexandre Sallinen committed on
Commit f2c32b7 · verified · 1 Parent(s): d96b862

Update README.md

Files changed (1)
  1. README.md +6 -1
README.md CHANGED
@@ -60,7 +60,12 @@ Additional information about the datasets will be included in the Meditron-3 pub
 
 #### Evaluation
 
-Evaluation results for the Gemma[2]-Meditron-3[9B] are coming soon!
+| Model Name | MedmcQA | MedQA | PubmedQA | Average |
+|-----------------------------|---------|--------|----------|---------|
+| google/gemma-2-9b | 56.60 | 63.32 | 76.80 | 65.57 |
+| gemMeditron-2-9b-4818 | 57.21 | 63.79 | 77.00 | 66.00 |
+| Difference (gemMeditron vs.)| 0.61 | 0.47 | 0.20 | 0.43 |
+
 
 We evaluated Meditron on medical multiple-choice questions using [lm-harness](https://github.com/EleutherAI/lm-evaluation-harness) for reproducibility.
 While MCQs are valuable for assessing exam-like performance, they fall short of capturing the model's real-world utility, especially in terms of contextual adaptation in under-represented settings. Medicine is not multiple choice and we need to go beyond accuracy to assess finer-grained issues like empathy, alignment to local guidelines, structure, completeness and safety. To address this, we have developed a platform to collect feedback directly from experts to continuously adapt to the changing contexts of clinical practice.
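
Review note: the Average column and the Difference row in the added table can be recomputed from the per-benchmark scores. The snippet below is a minimal sketch of that check. The variable and function names (`scores`, `average`, `difference`) are illustrative and not part of the repository, and it assumes the Average column is an unweighted mean of the three benchmarks, which the commit does not state explicitly.

```python
# Per-benchmark scores copied from the table added in this commit.
scores = {
    "google/gemma-2-9b": {"MedmcQA": 56.60, "MedQA": 63.32, "PubmedQA": 76.80},
    "gemMeditron-2-9b-4818": {"MedmcQA": 57.21, "MedQA": 63.79, "PubmedQA": 77.00},
}


def average(row):
    """Unweighted mean over benchmarks, rounded to two decimals (assumed)."""
    return round(sum(row.values()) / len(row), 2)


def difference(tuned, base):
    """Per-benchmark score difference (tuned minus base), rounded to two decimals."""
    return {name: round(tuned[name] - base[name], 2) for name in tuned}


averages = {model: average(row) for model, row in scores.items()}
per_benchmark_diff = difference(
    scores["gemMeditron-2-9b-4818"], scores["google/gemma-2-9b"]
)
average_diff = round(
    averages["gemMeditron-2-9b-4818"] - averages["google/gemma-2-9b"], 2
)

print(averages)
print(per_benchmark_diff, average_diff)
```

Under that assumption the recomputed values agree with the table as committed, so the reported gains are internally consistent.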