malhajar committed on
Commit fc37e32 • 1 Parent(s): b5a38d3

Update src/display/about.py

Files changed (1)
  1. src/display/about.py +13 -4
src/display/about.py CHANGED
@@ -25,6 +25,8 @@ TITLE = """<h1 align="center" id="space-title">OpenLLM Turkish leaderboard v0.2<
  # What does your leaderboard evaluate?
  INTRODUCTION_TEXT = """
  Welcome to the Turkish LLM Leaderboard, a pioneering platform dedicated to evaluating Turkish Large Language Models (LLMs). As multilingual LLMs advance, my mission is to specifically highlight models excelling in Turkish, providing benchmarks that drive progress in Turkish LLM and Generative AI for the Turkish language.
+ The leaderboard uses [this](https://huggingface.co/collections/malhajar/openllmturkishleadboard-v02-datasets-662a8593043e73938e2f6b1e) carefully curated set of benchmarks for evaluation.
+ The benchmarks are generated and checked using both GPT-4 and human annotation, making the leaderboard one of the most accurate and valuable tests for Turkish LLM evaluation.

  🚀 Submit Your Model 🚀
@@ -32,7 +34,6 @@ Got a Turkish LLM? Submit it for evaluation (Currently Manually, due to the lack

  Join the forefront of Turkish language technology. Submit your model, and let's advance Turkish LLMs together!

-
  """

  # Which evaluations are you running? how can people reproduce what you have?
@@ -48,14 +49,22 @@ I use LM-Evaluation-Harness-Turkish, a version of the LM Evaluation Harness adap
  1) Set Up the repo: Clone the "lm-evaluation-harness_turkish" from https://github.com/malhajar17/lm-evaluation-harness_turkish and follow the installation instructions.
  2) Run Evaluations: To get the results as on the leaderboard (some tests might show small variations), use the following command, adjusting for your model. For example, with the Trendyol model:
  ```bash
- lm_eval --model vllm --model_args pretrained=Trendyol/Trendyol-LLM-7b-chat-v1.0 --tasks truthfulqa_mc2_tr,truthfulqa_mc1_tr,mmlu_tr,winogrande_tr,gsm8k_tr,arc_challenge_tr,hellaswag_tr --output /workspace/Trendyol-LLM-7b-chat-v1.0
+ lm_eval --model vllm --model_args pretrained=Orbina/Orbita-v0.1 --tasks mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr_v0.2 --output /workspace/Orbina/Orbita-v0.1
  ```
- 3) Report Results: I take the average of the truthfulqa_mc1_tr and truthfulqa_mc2_tr scores and report it as truthfulqa. The results file generated is then uploaded to the OpenLLM Turkish Leaderboard.
+ 3) Report Results: The results file generated is then uploaded to the OpenLLM Turkish Leaderboard.

  ## Notes:

  - I currently use "vllm", which might differ slightly from the standard LM Evaluation Harness.
- - All the tests use "acc" as the metric, with a plan to migrate to "acc_norm" for "ARC" and "HellaSwag" soon.
+ - All the tests use precisely the same configuration as the original Open LLM Leaderboard.
+
+ The tasks and few-shot parameters are:
+ - ARC: 25-shot, *arc-challenge* (`acc_norm`)
+ - HellaSwag: 10-shot, *hellaswag* (`acc_norm`)
+ - TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`)
+ - MMLU: 5-shot, *hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions* (average of all the results `acc`)
+ - Winogrande: 5-shot, *winogrande* (`acc`)
+ - GSM8k: 5-shot, *gsm8k* (`acc`)

  """
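A note for readers reproducing step 2 above: the `lm_eval` run writes a results JSON that step 3 refers to. Below is a minimal sketch of how one might inspect it, assuming the standard LM Evaluation Harness output layout (a top-level `results` mapping of task name to metric values); the file path is only an example mirroring the `--output` argument and is not taken from this commit.

```python
# Hedged sketch: inspect the results file produced by the lm_eval command in step 2.
# Assumes the usual lm-evaluation-harness JSON layout: {"results": {task: {metric: value}}}.
# The path below is hypothetical; it simply mirrors the --output argument used above.
import json

RESULTS_FILE = "/workspace/Orbina/Orbita-v0.1/results.json"  # assumed file name

with open(RESULTS_FILE) as f:
    report = json.load(f)

# Print every numeric metric for every evaluated task.
for task, metrics in sorted(report["results"].items()):
    for metric, value in metrics.items():
        if isinstance(value, (int, float)):
            print(f"{task:25s} {metric:12s} {value:.4f}")
```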
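Similarly, the few-shot/metric list in the added note implies how a leaderboard-style aggregate could be computed from per-task scores. The sketch below is an assumption, not part of this commit: it uses the task names exactly as listed in the note (the Turkish task keys such as `arc_tr-v0.2` would need to be substituted for a real results file), averages `acc` over the MMLU sub-tasks, and takes a plain arithmetic mean over the six benchmarks, mirroring the original Open LLM Leaderboard.

```python
# Hedged sketch: aggregate per-task scores as described in the note above.
# "results" maps task name -> metric name -> value; the task/metric names below follow
# the note's list and are assumptions, not keys guaranteed by this repository.

def aggregate(results: dict) -> float:
    # MMLU is reported as the mean "acc" over all hendrycksTest-* sub-tasks.
    mmlu_tasks = [t for t in results if t.startswith("hendrycksTest-")]
    mmlu = sum(results[t]["acc"] for t in mmlu_tasks) / len(mmlu_tasks)

    per_benchmark = [
        results["arc-challenge"]["acc_norm"],  # ARC, 25-shot
        results["hellaswag"]["acc_norm"],      # HellaSwag, 10-shot
        results["truthfulqa-mc"]["mc2"],       # TruthfulQA, 0-shot
        mmlu,                                  # MMLU, 5-shot
        results["winogrande"]["acc"],          # Winogrande, 5-shot
        results["gsm8k"]["acc"],               # GSM8k, 5-shot
    ]
    # Plain arithmetic mean over the six benchmarks (assumed aggregation).
    return sum(per_benchmark) / len(per_benchmark)
```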
 
 
70