---
title: README
emoji: 📚
colorFrom: pink
colorTo: gray
sdk: static
pinned: false
---

BigScience 🌸 is an open scientific collaboration of nearly 600 researchers from 50 countries and 250 institutions. They work together on projects in natural language processing (NLP) to broaden the accessibility of language datasets while tackling challenging scientific questions around training language models.

BigLAM started as a [datasets hackathon](https://github.com/bigscience-workshop/lam) focused on making data from Libraries, Archives, and Museums (LAMs) with potential machine-learning applications accessible via the Hugging Face Hub.

We are continuing to add datasets to the Hugging Face Hub to make them more discoverable, open them up to new audiences, and help ensure that machine-learning datasets more closely reflect the richness of human culture.

## Dataset Overview

An overview of the datasets currently available via BigLAM, organised by task type.
### image-classification

- [19th Century United States Newspaper Advert images with 'illustrated' or 'non illustrated' labels](https://huggingface.co/datasets/biglam/illustrated_ads)
- [Brill Iconclass AI Test Set](https://huggingface.co/datasets/biglam/brill_iconclass)
- [National Library of Scotland Chapbook Illustrations](https://huggingface.co/datasets/biglam/nls_chapbook_illustrations)
- [Encyclopaedia Britannica Illustrated](https://huggingface.co/datasets/biglam/encyclopaedia_britannica_illustrated)
- [V4Design Europeana style dataset](https://huggingface.co/datasets/biglam/v4design_europeana_style_dataset)
- [Early Printed Books Font Detection Dataset](https://huggingface.co/datasets/biglam/early_printed_books_font_detection)
- [DEArt: Dataset of European Art](https://huggingface.co/datasets/biglam/european_art)
### text-classification

- [Annotated dataset to assess the accuracy of the textual description of cultural heritage records](https://huggingface.co/datasets/biglam/cultural_heritage_metadata_accuracy)
- [Atypical Animacy](https://huggingface.co/datasets/biglam/atypical_animacy)
- [Old Bailey Proceedings](https://huggingface.co/datasets/biglam/old_bailey_proceedings)
- [Lampeter Corpus](https://huggingface.co/datasets/biglam/lampeter_corpus)
- [Hansard Speeches](https://huggingface.co/datasets/biglam/hansard_speech)
- [Contentious Contexts Corpus](https://huggingface.co/datasets/biglam/contentious_contexts)
### image-to-text

- [Brill Iconclass AI Test Set](https://huggingface.co/datasets/biglam/brill_iconclass)
### text-generation

- [Old Bailey Proceedings](https://huggingface.co/datasets/biglam/old_bailey_proceedings)
- [Hansard Speeches](https://huggingface.co/datasets/biglam/hansard_speech)
- [Berlin State Library OCR](https://huggingface.co/datasets/biglam/berlin_state_library_ocr)
- [Literary fictions of Gallica](https://huggingface.co/datasets/biglam/gallica_literary_fictions)
- [Europeana Newspapers](https://huggingface.co/datasets/biglam/europeana_newspapers)
- [Gutenberg Poetry Corpus](https://huggingface.co/datasets/biglam/gutenberg-poetry-corpus)
- [BnL Newspapers 1841-1879](https://huggingface.co/datasets/biglam/bnl_newspapers1841-1879)
### object-detection

- [National Library of Scotland Chapbook Illustrations](https://huggingface.co/datasets/biglam/nls_chapbook_illustrations)
- [YALTAi Tabular Dataset](https://huggingface.co/datasets/biglam/yalta_ai_tabular_dataset)
- [YALTAi Segmonto Manuscript Dataset](https://huggingface.co/datasets/biglam/yalta_ai_segmonto_manuscript_dataset)
- [Beyond Words](https://huggingface.co/datasets/biglam/loc_beyond_words)
- [DEArt: Dataset of European Art](https://huggingface.co/datasets/biglam/european_art)
### fill-mask

- [Berlin State Library OCR](https://huggingface.co/datasets/biglam/berlin_state_library_ocr)
- [BnL Newspapers 1841-1879](https://huggingface.co/datasets/biglam/bnl_newspapers1841-1879)
### token-classification

- [Unsilencing Colonial Archives via Automated Entity Recognition](https://huggingface.co/datasets/biglam/unsilence_voc)
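
Since every entry above is hosted on the Hugging Face Hub, one plausible way to try a dataset is through the `datasets` library. The sketch below is illustrative only: the helper names `repo_id_from_url` and `load_biglam_dataset` are hypothetical, and it assumes `datasets` is installed (`pip install datasets`) and that the Hub is reachable. Split names, configurations, and columns differ between datasets, so check each dataset card before relying on them.

```python
from urllib.parse import urlparse


def repo_id_from_url(url: str) -> str:
    """Turn a Hub dataset URL into its 'namespace/name' repo ID (hypothetical helper)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected path shape: /datasets/<namespace>/<name>
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hub dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def load_biglam_dataset(url: str, **kwargs):
    """Load a dataset from its Hub URL (hypothetical helper; needs network access)."""
    # Imported lazily so repo_id_from_url works even without `datasets` installed.
    from datasets import load_dataset

    # Extra keyword arguments (for example split="train") are passed straight through.
    return load_dataset(repo_id_from_url(url), **kwargs)
```

For example, `load_biglam_dataset("https://huggingface.co/datasets/biglam/illustrated_ads")` would request the `biglam/illustrated_ads` repository. Some datasets may need extra arguments, such as a configuration name, which can be supplied as keyword arguments.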