I ran a quick experiment on around 600 datasets from the HF Hub. The results are stored in lbourdois/LLE, and the methodology is described in
https://huggingface.co/blog/lbourdois/lle
I did not expect that many datasets to have such notable issues! Very interesting, thanks for sharing.
I would also be interested in the data quality bot that you describe at the end - I think that would be quite useful.
It's the exchanges I've had with you that have led me to question the quality of the data 🤗
On which desk in the Paris office should I leave a post-it note asking for the creation of the bot?
Pretty cool stuff! Maybe you should do a leaderboard of major datasets and their leakage score
For NER (Named Entity Recognition), you can consult https://huggingface.co/tasks/token-classification.
A leak occurs when data from the train split also appears in the test split, which biases the results and benchmarks.
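To make the definition concrete, here is a minimal sketch of how such a leak could be measured. It is not the methodology from the blog post: the `leak_rate` helper, the exact-match criterion, and the lowercase/whitespace normalization are all assumptions chosen for illustration.

```python
def normalize(text):
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def leak_rate(train_texts, test_texts):
    """Fraction of test examples whose normalized text also appears in train."""
    if not test_texts:
        return 0.0
    train_set = {normalize(t) for t in train_texts}
    leaked = sum(1 for t in test_texts if normalize(t) in train_set)
    return leaked / len(test_texts)


# Toy data (hypothetical): one of the two test examples duplicates a train example.
train = ["The cat sat.", "Hello world", "Paris is in France"]
test = ["hello   world", "Berlin is in Germany"]
rate = leak_rate(train, test)
print(f"{rate:.0%} of the test split is also in the train split")
```

Exact matching only catches verbatim duplicates after normalization; near-duplicates would need a fuzzier comparison.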