parquet files loading error for details dataset

#782
by RaccoonOnion - opened

Hi,

I found that some parquet files in the details datasets cannot be loaded. I tried both pandas and the Parquet viewer in VSCode, and both failed with an error message indicating file corruption.

The details datasets I tried are open-llm-leaderboard/details_SF-Foundation__TextBase-7B-v0.1 and open-llm-leaderboard/details_BarraHome__Mistroll-7B-v2.2. According to the commit history, both were committed 12 days ago.

For troubleshooting, my code for loading the parquet file is:

import pandas as pd

# Load the parquet file
file_path = 'results_2024-05-26T03-37-47.370997.parquet'
df = pd.read_parquet(file_path)

# Convert to JSON
json_data = df.to_json(orient='records', lines=True)

# Save the JSON data to a file
output_path = 'output.json'
with open(output_path, 'w') as json_file:
    json_file.write(json_data)

print(f"JSON file saved to {output_path}")

The error message is:

[Attached image: Screenshot 2024-06-06 at 4.30.32 PM.png]

Open LLM Leaderboard org

Hi!
Thanks for the message, that's super interesting!
We know that users who had problems loading specific files succeeded by downloading them individually and then calling load_dataset on them one by one locally, instead of loading from the website (we've had configuration mismatches in the past).
Could you try that and tell us what you get?
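For anyone following along, the suggestion above could look roughly like the sketch below. It assumes the huggingface_hub and datasets packages are installed. The helper names fetch_details_file and load_local_parquet are made up for illustration, and the exact path of the parquet file inside the repo is not given in this thread, so it has to be looked up in the repo's file listing.

```python
def fetch_details_file(repo_id, filename):
    """Download a single file from a dataset repo and return its local path.

    `filename` is the path of the file inside the repo (an assumption:
    look it up in the repo's file listing, it is not given in this thread).
    """
    # Imported lazily so the sketch can be read without the package installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")


def load_local_parquet(path):
    """Load one local parquet file with the generic "parquet" builder."""
    from datasets import load_dataset

    # With a single data file and no explicit split mapping, the data is
    # exposed under the "train" split.
    return load_dataset("parquet", data_files=str(path), split="train")


# Example usage (needs network access; the filename is a placeholder):
# local_path = fetch_details_file(
#     "open-llm-leaderboard/details_SF-Foundation__TextBase-7B-v0.1",
#     "<path/inside/repo>.parquet",
# )
# ds = load_local_parquet(local_path)
# print(ds)
```

Loading each file through the generic parquet builder sidesteps the repo's own dataset configuration, which is where the mismatches mentioned above would come from.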

Hi Clémentine,

Thank you for your suggestion! With your information I was able to identify the problem: when I use git clone to download the entire repo, the parquet files in the clone are corrupted ("Parquet magic bytes not found in footer"). As a result, neither pandas nor datasets can load those files.
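A quick way to check a cloned file before loading it is to look at its first and last bytes. A valid, unencrypted Parquet file starts and ends with the 4-byte magic PAR1. One possible explanation for the error, not confirmed in this thread, is that the repo was cloned without Git LFS, in which case each large file is replaced by a small text pointer beginning with "version https://git-lfs.github.com/spec/v1". The sketch below distinguishes the two cases; diagnose_parquet is a hypothetical helper name.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"
# First line of a Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def diagnose_parquet(path):
    """Return "ok", "lfs-pointer", or "corrupt" for the file at `path`."""
    file_path = Path(path)
    size = file_path.stat().st_size
    with file_path.open("rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
        if head.startswith(LFS_POINTER_PREFIX):
            # The clone holds a pointer stub instead of the real data.
            return "lfs-pointer"
        if size < 2 * len(PARQUET_MAGIC):
            # Too small to hold both the leading and trailing magic bytes.
            return "corrupt"
        f.seek(-len(PARQUET_MAGIC), 2)  # 2 = seek relative to end of file
        tail = f.read(len(PARQUET_MAGIC))
    if head[: len(PARQUET_MAGIC)] == PARQUET_MAGIC and tail == PARQUET_MAGIC:
        return "ok"
    return "corrupt"


# Example usage on a cloned repo (the directory name is a placeholder):
# for p in Path("details_repo").rglob("*.parquet"):
#     print(p, diagnose_parquet(p))
```

If the files turn out to be pointer stubs, installing Git LFS and running git lfs pull inside the clone should fetch the real contents. This check only inspects the magic bytes, so "ok" does not guarantee that the rest of the file is intact.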

I then tried downloading them individually, and all loading methods work fine.

Thank you for your help. I hope the git clone pipeline can be fixed soon, because it is much easier to use when working in batches.

RaccoonOnion changed discussion status to closed
