Update app.py

app.py CHANGED
@@ -81,8 +81,6 @@ def get_model_info(df):
     return df
 
 
-
-
 def calculate_highest_combined_score(data, column):
     # Ensure the column exists and has numeric data
     if column not in data.columns or not pd.api.types.is_numeric_dtype(data[column]):
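For readers skimming this hunk: `calculate_highest_combined_score` guards against missing or non-numeric columns before ranking. Below is a minimal, hypothetical sketch of that guard pattern on its own; the `safe_top_scores` helper and the sample data are illustrative and are not part of app.py.

```python
# Hypothetical illustration of the guard used in calculate_highest_combined_score.
import pandas as pd

def safe_top_scores(data: pd.DataFrame, column: str, top_n: int = 3) -> pd.DataFrame:
    # Skip columns that are absent or non-numeric, as the check above does.
    if column not in data.columns or not pd.api.types.is_numeric_dtype(data[column]):
        return pd.DataFrame()
    return data.nlargest(top_n, column)[["Model", column]]

df = pd.DataFrame({"Model": ["a", "b", "c"], "Average": [51.2, 48.7, 55.0]})
print(safe_top_scores(df, "Average"))   # top rows by Average
print(safe_top_scores(df, "AGIEval"))   # missing column -> empty frame
```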
@@ -142,22 +140,10 @@ def main():
     st.title("π YALL - Yet Another LLM Leaderboard")
     st.markdown("Leaderboard made with π§ [LLM AutoEval](https://github.com/mlabonne/llm-autoeval) using [Nous](https://huggingface.co/NousResearch) benchmark suite.")
 
-    #
+    # Create tabs for leaderboard and about section
     content = create_yall()
-
-    # Ensure 'content' has a value before proceeding
-    if content:
-        df = convert_markdown_table_to_dataframe(content)
-        df = get_and_update_model_info(df)
-        score_columns = ['Average', 'AGIEval', 'GPT4All', 'TruthfulQA', 'Bigbench']
-        for col in score_columns:
-            if col in df.columns:
-                df[col] = pd.to_numeric(df[col], errors='coerce')
-        display_highest_combined_scores(df, score_columns)
-
     tab1, tab2 = st.tabs(["π Leaderboard", "π About"])
 
-
     # Leaderboard tab
     with tab1:
         if content:
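The block removed here parsed the markdown leaderboard into a DataFrame and coerced the score columns to numeric before calling `display_highest_combined_scores`. A self-contained sketch of that coercion step follows; `convert_markdown_table_to_dataframe`, `get_and_update_model_info`, and `display_highest_combined_scores` are helpers defined elsewhere in app.py and are intentionally left out.

```python
# Illustrative sketch of the numeric coercion done by the removed block.
import pandas as pd

def coerce_score_columns(df: pd.DataFrame, score_columns: list) -> pd.DataFrame:
    # errors='coerce' turns non-parsable cells into NaN instead of raising.
    for col in score_columns:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')
    return df

raw = pd.DataFrame({"Model": ["x", "y"], "Average": ["52.3", "n/a"]})
scored = coerce_score_columns(raw, ['Average', 'AGIEval', 'GPT4All', 'TruthfulQA', 'Bigbench'])
print(scored.dtypes)  # Average is now float64, with NaN where parsing failed
```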
@@ -257,32 +243,26 @@ def main():
                 st.error(str(e))
         else:
             st.error("Failed to download the content from the URL provided.")
-
     # About tab
     with tab2:
         st.markdown('''
 ### Nous benchmark suite
-
 Popularized by [Teknium](https://huggingface.co/teknium) and [NousResearch](https://huggingface.co/NousResearch), this benchmark suite aggregates four benchmarks:
-
 * [**AGIEval**](https://arxiv.org/abs/2304.06364) (0-shot): `agieval_aqua_rat,agieval_logiqa_en,agieval_lsat_ar,agieval_lsat_lr,agieval_lsat_rc,agieval_sat_en,agieval_sat_en_without_passage,agieval_sat_math`
 * **GPT4ALL** (0-shot): `hellaswag,openbookqa,winogrande,arc_easy,arc_challenge,boolq,piqa`
 * [**TruthfulQA**](https://arxiv.org/abs/2109.07958) (0-shot): `truthfulqa_mc`
 * [**Bigbench**](https://arxiv.org/abs/2206.04615) (0-shot): `bigbench_causal_judgement,bigbench_date_understanding,bigbench_disambiguation_qa,bigbench_geometric_shapes,bigbench_logical_deduction_five_objects,bigbench_logical_deduction_seven_objects,bigbench_logical_deduction_three_objects,bigbench_movie_recommendation,bigbench_navigate,bigbench_reasoning_about_colored_objects,bigbench_ruin_names,bigbench_salient_translation_error_detection,bigbench_snarks,bigbench_sports_understanding,bigbench_temporal_sequences,bigbench_tracking_shuffled_objects_five_objects,bigbench_tracking_shuffled_objects_seven_objects,bigbench_tracking_shuffled_objects_three_objects`
-
 ### Reproducibility
-
 You can easily reproduce these results using π§ [LLM AutoEval](https://github.com/mlabonne/llm-autoeval/tree/master), a colab notebook that automates the evaluation process (benchmark: `nous`). This will upload the results to GitHub as gists. You can find the entire table with the links to the detailed results [here](https://gist.github.com/mlabonne/90294929a2dbcb8877f9696f28105fdf).
-
 ### Clone this space
-
 You can create your own leaderboard with your LLM AutoEval results on GitHub Gist. You just need to clone this space and specify two variables:
-
 * Change the `gist_id` in [yall.py](https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard/blob/main/yall.py#L126).
 * Create "New Secret" in Settings > Variables and secrets (name: "github", value: [your GitHub token](https://github.com/settings/tokens))
-
 A special thanks to [gblazex](https://huggingface.co/gblazex) for providing many evaluations.
 ''')
+
+
+
 
 # Run the main function if this script is run directly
 if __name__ == "__main__":
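The "Clone this space" notes in the About tab mention two settings: the `gist_id` in yall.py and a space secret named `github`. As a rough, hypothetical illustration of how such a gist could be fetched with that secret, the sketch below uses the standard GitHub gists REST endpoint; it is not the actual yall.py code, which is not shown in this diff.

```python
# Hedged sketch only: fetch a results gist using the "github" space secret.
# The real retrieval logic lives in yall.py and is not part of this diff.
import os
import requests

GIST_ID = "your_gist_id_here"  # placeholder: replace with your own gist id

def fetch_gist_markdown(gist_id: str) -> str:
    headers = {}
    token = os.environ.get("github")  # Hugging Face exposes space secrets as env vars
    if token:
        headers["Authorization"] = f"token {token}"
    resp = requests.get(f"https://api.github.com/gists/{gist_id}", headers=headers, timeout=30)
    resp.raise_for_status()
    files = resp.json()["files"]
    # Assume the first file in the gist holds the markdown leaderboard table.
    return next(iter(files.values()))["content"]
```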