Update src/display/about.py
src/display/about.py (+13 −4)

@@ -25,6 +25,8 @@ TITLE = """<h1 align="center" id="space-title">OpenLLM Turkish leaderboard v0.2<

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
Welcome to the Turkish LLM Leaderboard, a pioneering platform dedicated to evaluating Turkish Large Language Models (LLMs). As multilingual LLMs advance, my mission is to specifically highlight models excelling in Turkish, providing benchmarks that drive progress in Turkish LLMs and generative AI for the Turkish language.

The leaderboard uses [this](https://huggingface.co/collections/malhajar/openllmturkishleadboard-v02-datasets-662a8593043e73938e2f6b1e) carefully curated collection of benchmarks for evaluation.
The benchmarks are generated and checked with both GPT-4 and human annotation, making the leaderboard one of the most accurate and reliable evaluations available for Turkish LLMs.

Submit Your Model

@@ -32,7 +34,6 @@ Got a Turkish LLM? Submit it for evaluation (Currently Manually, due to the lack

Join the forefront of Turkish language technology. Submit your model, and let's advance Turkish LLMs together!
"""

# Which evaluations are you running? how can people reproduce what you have?

@@ -48,14 +49,22 @@ I use LM-Evaluation-Harness-Turkish, a version of the LM Evaluation Harness adap

1) Set Up the repo: Clone the "lm-evaluation-harness_turkish" from https://github.com/malhajar17/lm-evaluation-harness_turkish and follow the installation instructions.
2) Run Evaluations: To get the results as on the leaderboard (some tests might show small variations), use the following command, adjusting it for your model. For example, with the Orbita model:
```python
lm_eval --model vllm --model_args pretrained=Orbina/Orbita-v0.1 --tasks mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr_v0.2 --output /workspace/Orbina/Orbita-v0.1
```
3) Report Results: The generated results file is then uploaded to the OpenLLM Turkish Leaderboard.

## Notes:

- I currently use "vllm", which might produce slightly different results than the standard LM Evaluation Harness backend.
- All the tests use precisely the same configuration as the original OpenLLM Leaderboard.

The tasks and few-shot parameters are:
- ARC: 25-shot, *arc-challenge* (`acc_norm`)
- HellaSwag: 10-shot, *hellaswag* (`acc_norm`)
- TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`)
- MMLU: 5-shot, *hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions* (average of all the results `acc`)
- Winogrande: 5-shot, *winogrande* (`acc`)
- GSM8k: 5-shot, *gsm8k* (`acc`)

"""
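
The task list in the notes can also be written down as plain data, which may help when scripting runs or sanity-checking scores. The mapping below pairs the Turkish task names from the `--tasks` flag with the shot counts and metrics listed above; the unweighted average at the end is only an assumption about how the six scores might be combined, not the leaderboard's published aggregation code.

```python
# Hedged summary of the task configuration described above. Task names mirror
# the --tasks flag in the reproduction command; shot counts and metrics come
# from the bullet list. The averaging helper is an illustrative assumption,
# not the leaderboard's own aggregation code.
TASK_CONFIG = {
    "arc_tr-v0.2":        {"num_fewshot": 25, "metric": "acc_norm"},
    "hellaswag_tr-v0.2":  {"num_fewshot": 10, "metric": "acc_norm"},
    "truthfulqa_v0.2":    {"num_fewshot": 0,  "metric": "mc2"},
    "mmlu_tr_v0.2":       {"num_fewshot": 5,  "metric": "acc"},  # acc averaged over the MMLU subtasks
    "winogrande_tr_v0.2": {"num_fewshot": 5,  "metric": "acc"},
    "gsm8k_tr-v0.2":      {"num_fewshot": 5,  "metric": "acc"},
}

def average_score(per_task_scores: dict) -> float:
    """Unweighted mean over the six benchmark scores (illustrative assumption)."""
    return sum(per_task_scores.values()) / len(per_task_scores)

# Example with placeholder numbers, just to show the shape of the call.
example_scores = {task: 0.5 for task in TASK_CONFIG}
print(f"Average score: {average_score(example_scores):.3f}")
```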