malhajar committed on
Commit fc37e32 • 1 Parent(s): b5a38d3

Update src/display/about.py

Files changed (1)
  1. src/display/about.py +13 -4
src/display/about.py CHANGED
@@ -25,6 +25,8 @@ TITLE = """<h1 align="center" id="space-title">OpenLLM Turkish leaderboard v0.2<
  # What does your leaderboard evaluate?
  INTRODUCTION_TEXT = """
  Welcome to the Turkish LLM Leaderboard, a pioneering platform dedicated to evaluating Turkish Large Language Models (LLMs). As multilingual LLMs advance, my mission is to specifically highlight models excelling in Turkish, providing benchmarks that drive progress in Turkish LLM and Generative AI for the Turkish language.
+ The leaderboard uses [this](https://huggingface.co/collections/malhajar/openllmturkishleadboard-v02-datasets-662a8593043e73938e2f6b1e) carefully curated set of benchmarks for evaluation.
+ The benchmarks are generated and checked using both GPT-4 and human annotation, making the leaderboard one of the most accurate and valuable tests for Turkish LLM evaluation.

  🚀 Submit Your Model 🚀
@@ -32,7 +34,6 @@ Got a Turkish LLM? Submit it for evaluation (Currently Manually, due to the lack

  Join the forefront of Turkish language technology. Submit your model, and let's advance Turkish LLMs together!

-
  """

  # Which evaluations are you running? how can people reproduce what you have?
@@ -48,14 +49,22 @@ I use LM-Evaluation-Harness-Turkish, a version of the LM Evaluation Harness adap
  1) Set Up the repo: Clone the "lm-evaluation-harness_turkish" from https://github.com/malhajar17/lm-evaluation-harness_turkish and follow the installation instructions.
  2) Run Evaluations: To get the results as on the leaderboard (some tests might show small variations), use the following command, adjusting for your model. For example, with the Trendyol model:
  ```bash
- lm_eval --model vllm --model_args pretrained=Trendyol/Trendyol-LLM-7b-chat-v1.0 --tasks truthfulqa_mc2_tr,truthfulqa_mc1_tr,mmlu_tr,winogrande_tr,gsm8k_tr,arc_challenge_tr,hellaswag_tr --output /workspace/Trendyol-LLM-7b-chat-v1.0
+ lm_eval --model vllm --model_args pretrained=Orbina/Orbita-v0.1 --tasks mmlu_tr_v0.2,arc_tr-v0.2,gsm8k_tr-v0.2,hellaswag_tr-v0.2,truthfulqa_v0.2,winogrande_tr_v0.2 --output /workspace/Orbina/Orbita-v0.1
  ```
- 3) Report Results: I take the average of the truthfulqa_mc1_tr and truthfulqa_mc2_tr scores and report it as truthfulqa. The results file generated is then uploaded to the OpenLLM Turkish Leaderboard.
+ 3) Report Results: The results file generated is then uploaded to the OpenLLM Turkish Leaderboard.

  ## Notes:

  - I currently use "vllm", which might differ slightly from the standard LM Evaluation Harness.
- - All the tests use "acc" as the metric, with a plan to migrate to "acc_norm" for "ARC" and "HellaSwag" soon.
+ - All the tests use precisely the same configuration as the original Open LLM Leaderboard.
+
+ The tasks and few-shot parameters are:
+ - ARC: 25-shot, *arc-challenge* (`acc_norm`)
+ - HellaSwag: 10-shot, *hellaswag* (`acc_norm`)
+ - TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`)
+ - MMLU: 5-shot, *hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions* (average of all the results `acc`)
+ - Winogrande: 5-shot, *winogrande* (`acc`)
+ - GSM8k: 5-shot, *gsm8k* (`acc`)

  """
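A note for readers reproducing step 2 above: the `lm_eval` run writes a results JSON that step 3 refers to. Below is a minimal sketch of how one might inspect it, assuming the standard LM Evaluation Harness output layout (a top-level `results` mapping of task name to metric values); the file path is only an example mirroring the `--output` argument and is not taken from this commit.

```python
# Hedged sketch: inspect the results file produced by the lm_eval command in step 2.
# Assumes the usual lm-evaluation-harness JSON layout: {"results": {task: {metric: value}}}.
# The path below is hypothetical; it simply mirrors the --output argument used above.
import json

RESULTS_FILE = "/workspace/Orbina/Orbita-v0.1/results.json"  # assumed file name

with open(RESULTS_FILE) as f:
    report = json.load(f)

# Print every numeric metric for every evaluated task.
for task, metrics in sorted(report["results"].items()):
    for metric, value in metrics.items():
        if isinstance(value, (int, float)):
            print(f"{task:25s} {metric:12s} {value:.4f}")
```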
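Similarly, the few-shot/metric list in the added note implies how a leaderboard-style aggregate could be computed from per-task scores. The sketch below is an assumption, not part of this commit: it uses the task names exactly as listed in the note (the Turkish task keys such as `arc_tr-v0.2` would need to be substituted for a real results file), averages `acc` over the MMLU sub-tasks, and takes a plain arithmetic mean over the six benchmarks, mirroring the original Open LLM Leaderboard.

```python
# Hedged sketch: aggregate per-task scores as described in the note above.
# "results" maps task name -> metric name -> value; the task/metric names below follow
# the note's list and are assumptions, not keys guaranteed by this repository.

def aggregate(results: dict) -> float:
    # MMLU is reported as the mean "acc" over all hendrycksTest-* sub-tasks.
    mmlu_tasks = [t for t in results if t.startswith("hendrycksTest-")]
    mmlu = sum(results[t]["acc"] for t in mmlu_tasks) / len(mmlu_tasks)

    per_benchmark = [
        results["arc-challenge"]["acc_norm"],  # ARC, 25-shot
        results["hellaswag"]["acc_norm"],      # HellaSwag, 10-shot
        results["truthfulqa-mc"]["mc2"],       # TruthfulQA, 0-shot
        mmlu,                                  # MMLU, 5-shot
        results["winogrande"]["acc"],          # Winogrande, 5-shot
        results["gsm8k"]["acc"],               # GSM8k, 5-shot
    ]
    # Plain arithmetic mean over the six benchmarks (assumed aggregation).
    return sum(per_benchmark) / len(per_benchmark)
```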
 
 
70