Spaces:

open-llm-leaderboard
/

open_llm_leaderboard

parquet files loading error for details dataset

#782

by RaccoonOnion - opened Jun 6

Discussion

RaccoonOnion

Jun 6

Hi,

I found that some parquet files in the details datasets cannot be loaded. I tried loading them with both pandas and a Parquet viewer in VS Code, but both failed with an error message indicating file corruption.

The details datasets I tried are open-llm-leaderboard/details_SF-Foundation__TextBase-7B-v0.1 and open-llm-leaderboard/details_BarraHome__Mistroll-7B-v2.2. According to the commit history, both were committed 12 days ago.

For troubleshooting, my code for loading the parquet file is:

import pandas as pd

# Load the parquet file
file_path = 'results_2024-05-26T03-37-47.370997.parquet'
df = pd.read_parquet(file_path)

# Convert to JSON
json_data = df.to_json(orient='records', lines=True)

# Save the JSON data to a file
output_path = 'output.json'
with open(output_path, 'w') as json_file:
    json_file.write(json_data)

print(f"JSON file saved to {output_path}")

The error message is:

clefourrier

Open LLM Leaderboard org Jun 7

Hi!
Thanks for the message, that's super interesting!
We know that users who had problems loading specific files succeeded by downloading them individually and then calling load_dataset on them one by one locally, instead of loading from the website (we've had configuration mismatches in the past).
Could you try that and tell us what you get?
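For reference, the suggestion above could look roughly like the sketch below. It assumes the huggingface_hub and datasets packages are installed. The helper name load_single_parquet is made up for illustration, and the file's path inside the repo is something you would need to look up yourself.

```python
def load_single_parquet(repo_id, filename):
    """Download one parquet file from a dataset repo and load it locally.

    Hypothetical helper: `filename` must be the file's path inside the repo,
    which is not spelled out in this thread.
    """
    # Imported lazily so the helper can be defined even where the
    # third-party packages are not installed yet.
    from huggingface_hub import hf_hub_download
    from datasets import load_dataset

    # Fetch only this file into the local Hugging Face cache.
    local_path = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type="dataset",
    )
    # Load the local copy with the generic parquet builder.
    return load_dataset("parquet", data_files=local_path, split="train")


# Example call (the file path is a placeholder, not a real one):
# ds = load_single_parquet(
#     "open-llm-leaderboard/details_SF-Foundation__TextBase-7B-v0.1",
#     "<path/inside/repo>.parquet",
# )
```

Loading one file at a time this way bypasses the repo-level configuration, which is where the mismatches mentioned above would come into play.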

RaccoonOnion

Jun 11 (edited Jun 11)

Hi Clémentine,

Thank you for your suggestion! With your information I was able to identify the problem: when I use git clone to download the entire repo, the parquet files in the clone are corrupted ("Parquet magic bytes not found in footer"). As a result, neither pandas nor datasets can load those files.

I then tried downloading the files individually, and all loading methods worked fine.

Thank you for your help. I hope the git clone pipeline can be fixed soon, because it is much easier to use when processing files in batches.
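In case it helps anyone hitting the same error, here is a minimal sketch for checking a cloned file before trying to load it. A Parquet file starts and ends with the 4-byte marker PAR1, which is what the "magic bytes" message refers to. The sketch also checks for a Git LFS pointer file (a small text stub that begins with a git-lfs spec line). That a clone without Git LFS caused the corruption here is only my guess and is not confirmed in this thread. The function name diagnose_parquet and its return labels are made up for illustration.

```python
import os
from pathlib import Path

PARQUET_MAGIC = b"PAR1"
# First line of a Git LFS pointer file (assumed cause, not confirmed here).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def diagnose_parquet(path):
    """Return 'parquet', 'lfs_pointer', or 'corrupt' for the file at `path`.

    Only the leading and trailing magic bytes are checked; this does not
    validate the footer metadata or the data pages.
    """
    file_path = Path(path)
    size = file_path.stat().st_size
    with file_path.open("rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
        if head.startswith(LFS_POINTER_PREFIX):
            return "lfs_pointer"
        # Smallest possible layout: magic + footer length + magic.
        if size < 12:
            return "corrupt"
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    if head[:4] == PARQUET_MAGIC and tail == PARQUET_MAGIC:
        return "parquet"
    return "corrupt"
```

Running this over every .parquet file in a clone would show quickly whether the files are stubs or genuinely damaged, before handing anything to pandas or datasets.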

RaccoonOnion changed discussion status to closed Jun 11
