CultriX committed on
Commit 8560ee2 · verified · 1 Parent(s): b005e3f

Update app.py

Files changed (1)
  1. app.py +4 -24
app.py CHANGED
@@ -81,8 +81,6 @@ def get_model_info(df):
     return df
 
 
-
-
 def calculate_highest_combined_score(data, column):
     # Ensure the column exists and has numeric data
     if column not in data.columns or not pd.api.types.is_numeric_dtype(data[column]):
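Only the guard at the top of `calculate_highest_combined_score` is visible in this hunk. As a minimal sketch, assuming the helper is meant to return the best entry for a given score column (the aggregation below and the `Model` column name are illustrative, not taken from this commit):

```python
import pandas as pd

def calculate_highest_combined_score(data: pd.DataFrame, column: str):
    # Guard shown in the diff: the column must exist and hold numeric data
    if column not in data.columns or not pd.api.types.is_numeric_dtype(data[column]):
        return None  # assumption: skip columns that cannot be ranked
    # Hypothetical aggregation: pick the row with the highest value in `column`
    best = data.loc[data[column].idxmax()]
    return column, best.get("Model"), best[column]
```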
@@ -142,22 +140,10 @@ def main():
     st.title("🏆 YALL - Yet Another LLM Leaderboard")
     st.markdown("Leaderboard made with 🧐 [LLM AutoEval](https://github.com/mlabonne/llm-autoeval) using [Nous](https://huggingface.co/NousResearch) benchmark suite.")
 
-    # Placeholder or logic to set 'content' with actual markdown or data
+    # Create tabs for leaderboard and about section
     content = create_yall()
-
-    # Ensure 'content' has a value before proceeding
-    if content:
-        df = convert_markdown_table_to_dataframe(content)
-        df = get_and_update_model_info(df)
-        score_columns = ['Average', 'AGIEval', 'GPT4All', 'TruthfulQA', 'Bigbench']
-        for col in score_columns:
-            if col in df.columns:
-                df[col] = pd.to_numeric(df[col], errors='coerce')
-        display_highest_combined_scores(df, score_columns)
-
     tab1, tab2 = st.tabs(["🏆 Leaderboard", "📝 About"])
 
-
     # Leaderboard tab
     with tab1:
         if content:
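The block removed above parsed the markdown leaderboard into a DataFrame and coerced its score columns to numeric before calling `display_highest_combined_scores`. A minimal sketch of that coercion step, kept here only to document what the deleted code did (the helper name is illustrative; assumes pandas and the same column names):

```python
import pandas as pd

def coerce_score_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Mirrors the deleted logic: unparsable cells become NaN instead of raising
    score_columns = ['Average', 'AGIEval', 'GPT4All', 'TruthfulQA', 'Bigbench']
    for col in score_columns:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')
    return df
```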
@@ -257,32 +243,26 @@ def main():
             st.error(str(e))
         else:
             st.error("Failed to download the content from the URL provided.")
-
     # About tab
     with tab2:
         st.markdown('''
 ### Nous benchmark suite
-
 Popularized by [Teknium](https://huggingface.co/teknium) and [NousResearch](https://huggingface.co/NousResearch), this benchmark suite aggregates four benchmarks:
-
 * [**AGIEval**](https://arxiv.org/abs/2304.06364) (0-shot): `agieval_aqua_rat,agieval_logiqa_en,agieval_lsat_ar,agieval_lsat_lr,agieval_lsat_rc,agieval_sat_en,agieval_sat_en_without_passage,agieval_sat_math`
 * **GPT4ALL** (0-shot): `hellaswag,openbookqa,winogrande,arc_easy,arc_challenge,boolq,piqa`
 * [**TruthfulQA**](https://arxiv.org/abs/2109.07958) (0-shot): `truthfulqa_mc`
 * [**Bigbench**](https://arxiv.org/abs/2206.04615) (0-shot): `bigbench_causal_judgement,bigbench_date_understanding,bigbench_disambiguation_qa,bigbench_geometric_shapes,bigbench_logical_deduction_five_objects,bigbench_logical_deduction_seven_objects,bigbench_logical_deduction_three_objects,bigbench_movie_recommendation,bigbench_navigate,bigbench_reasoning_about_colored_objects,bigbench_ruin_names,bigbench_salient_translation_error_detection,bigbench_snarks,bigbench_sports_understanding,bigbench_temporal_sequences,bigbench_tracking_shuffled_objects_five_objects,bigbench_tracking_shuffled_objects_seven_objects,bigbench_tracking_shuffled_objects_three_objects`
-
 ### Reproducibility
-
 You can easily reproduce these results using 🧐 [LLM AutoEval](https://github.com/mlabonne/llm-autoeval/tree/master), a colab notebook that automates the evaluation process (benchmark: `nous`). This will upload the results to GitHub as gists. You can find the entire table with the links to the detailed results [here](https://gist.github.com/mlabonne/90294929a2dbcb8877f9696f28105fdf).
-
 ### Clone this space
-
 You can create your own leaderboard with your LLM AutoEval results on GitHub Gist. You just need to clone this space and specify two variables:
-
 * Change the `gist_id` in [yall.py](https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard/blob/main/yall.py#L126).
 * Create "New Secret" in Settings > Variables and secrets (name: "github", value: [your GitHub token](https://github.com/settings/tokens))
-
 A special thanks to [gblazex](https://huggingface.co/gblazex) for providing many evaluations.
     ''')
+
+
+
 
     # Run the main function if this script is run directly
     if __name__ == "__main__":
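The About tab says the Nous suite aggregates four benchmarks into the leaderboard's `Average` column. A minimal sketch of that aggregation, assuming an unweighted mean (the weighting is not spelled out in this commit):

```python
import pandas as pd

def add_average_column(df: pd.DataFrame) -> pd.DataFrame:
    # Assumption: 'Average' is the plain mean of the four Nous benchmark scores
    benchmarks = ['AGIEval', 'GPT4All', 'TruthfulQA', 'Bigbench']
    df['Average'] = df[benchmarks].mean(axis=1)
    return df
```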
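For the "Clone this space" steps, Hugging Face Spaces exposes secrets as environment variables, so a clone would typically read the "github" secret at runtime. A hedged sketch (how yall.py actually accesses the token may differ):

```python
import os

# Assumption: the secret created in Settings > Variables and secrets
# (name: "github") is available to the Space as an environment variable.
GITHUB_TOKEN = os.environ.get("github")
if not GITHUB_TOKEN:
    raise RuntimeError('Missing "github" secret: add it under Settings > Variables and secrets.')
```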