Data files Configuration
There are no constraints on how to structure dataset repositories.
However, if you want the Dataset Viewer to show certain data files, or to separate your dataset into train/validation/test splits, you need to structure your dataset accordingly.
Often it is as simple as naming your data files according to their split names, e.g. train.csv and test.csv.
What are splits and subsets?
Machine learning datasets typically have splits and may also have subsets. A dataset is generally made of splits (e.g. train and test) that are used during different stages of training and evaluating a model. A subset (also called configuration) is a sub-dataset contained within a larger dataset. Subsets are especially common in multilingual speech datasets where there may be a different subset for each language. If you’re interested in learning more about splits and subsets, check out the Splits and subsets guide!
Automatic splits detection
Splits are automatically detected based on file and directory names. For example, this is a dataset with train, test, and validation splits:
my_dataset_repository/
├── README.md
├── train.csv
├── test.csv
└── validation.csv
To structure your dataset by naming your data files or directories according to their split names, see the File names and splits documentation and the companion collection of example datasets.
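For comparison, the same three splits could also be declared explicitly instead of being inferred from file names. The fragment below is only a sketch: it assumes the YAML configuration format described in the manual configuration section, and the subset name default is an assumption used for illustration.

```yaml
# Sketch: explicit declaration of the layout shown above.
# "default" is an assumed subset name; the paths match the example repository.
configs:
- config_name: default
  data_files:
  - split: train
    path: train.csv
  - split: test
    path: test.csv
  - split: validation
    path: validation.csv
```

With file names like these, automatic detection should make such a declaration unnecessary; it is shown only to make the file-to-split mapping visible.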
Manual splits and subsets configuration
You can choose the data files to show in the Dataset Viewer for your dataset using YAML. This is useful if you want to manually specify which file goes into which split.
You can also define multiple subsets for your dataset, and pass dataset building parameters (e.g. the separator to use for CSV files).
Here is an example of a configuration defining a subset called “benchmark” with a test split.
configs:
- config_name: benchmark
  data_files:
  - split: test
    path: benchmark.csv
See the documentation on Manual configuration for more information. See also the example datasets.
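To illustrate multiple subsets and a dataset building parameter together, here is a minimal sketch. The subset names, directory names, and file paths are hypothetical. It assumes the YAML is placed in the metadata block at the top of the repository's README.md, and that the sep parameter is given at the subset level, as the Manual configuration documentation describes.

```yaml
# Sketch with hypothetical subset names and file paths.
configs:
- config_name: english          # first subset, comma-separated CSV files
  data_files:
  - split: train
    path: en/train.csv
  - split: test
    path: en/test.csv
- config_name: french           # second subset, semicolon-separated CSV files
  data_files:
  - split: train
    path: fr/train.csv
  - split: test
    path: fr/test.csv
  sep: ";"                      # building parameter: CSV separator for this subset
```

Each entry under configs defines one subset, so a parameter such as sep applies only to the subset in which it appears.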
Supported file formats
See the File formats doc page to find the list of supported formats and recommendations for your dataset. If your dataset uses CSV or TSV files, you can find more information in the example datasets.
Image and Audio datasets
For image and audio classification datasets, you can also use directories to name the image and audio classes. And if your images/audio files have metadata (e.g. captions, bounding boxes, transcriptions, etc.), you can have metadata files next to them.
We provide two guides that you can check out, one for image datasets and one for audio datasets.