Searching the Hub Efficiently with Python
In this tutorial, we will explore how to interact with the Hugging Face Hub through the huggingface_hub library to find available models and datasets quickly.
The Basics
huggingface_hub is a Python library that lets anyone freely extract useful information from the Hub, as well as download and publish models. You can install it with:
pip install huggingface_hub
It comes packaged with an interface that can interact with the Hub in the HfApi class:
>>> from huggingface_hub import HfApi
>>> api = HfApi()
This class lets you perform a variety of operations that interact with the raw Hub API. We’ll be focusing on two specific functions:
- list_models
- list_datasets
If you look at what can be passed into each function, you will find the parameter list looks something like:
- filter
- author
- search
- …
Two of these parameters are intuitive (author and search), but what about that filter? 🤔 Let’s dive into a few helpers quickly and revisit that question.
Search Parameters
The huggingface_hub library provides a user-friendly way to discover exactly what can be passed into this filter parameter, through the ModelSearchArguments and DatasetSearchArguments classes:
>>> from huggingface_hub import ModelSearchArguments, DatasetSearchArguments
>>> model_args = ModelSearchArguments()
>>> dataset_args = DatasetSearchArguments()
These are nested namespace objects that contain every option available on the Hub and return the value that should be passed to filter. Best of all, they support tab completion 🎊.
Searching for a Model
Let’s pose a problem that would be complicated to solve without access to this information:
I want to search the Hub for all PyTorch models trained on the glue dataset that can do Text Classification.
If you check what is available in model_args by displaying it, you will find:
>>> model_args
Available Attributes or Keys:
* author
* dataset
* language
* library
* license
* model_name
* pipeline_tag
It has a variety of attributes or keys available to you. This is because it is both an object and a dictionary, so you can use either model_args["author"] or model_args.author. For this tutorial, let’s follow the latter format.
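To make the "both an object and a dictionary" idea concrete, here is a minimal sketch of how such a container could be built. It is not the library's actual implementation: the class name AttrDict is hypothetical, and the toy data only mirrors the shape of the output shown in this tutorial.

```python
class AttrDict(dict):
    """Hypothetical stand-in: a dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # Exposing the keys through dir() is what interactive shells
        # typically rely on to offer tab completion.
        return list(super().__dir__()) + [k for k in self if k.isidentifier()]


# Toy data shaped like the tutorial's output; not fetched from the Hub.
args = AttrDict(library=AttrDict(PyTorch="pytorch"))

print(args["library"]["PyTorch"])  # key access
print(args.library.PyTorch)        # attribute access
print("PyTorch" in args.library)   # membership test works like any dict
```

Both access styles return the same value, so choosing between them is purely a matter of taste; attribute access is simply more convenient when exploring interactively.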
The first criterion is getting all PyTorch models. This would be found under the library attribute, so let’s see if it is there:
>>> model_args.library
Available Attributes or Keys:
* AdapterTransformers
* Asteroid
* ESPnet
* Fairseq
* Flair
* JAX
* Joblib
* Keras
* ONNX
* PyTorch
* Rust
* Scikit_learn
* SentenceTransformers
* Stable_Baselines3 (Key only)
* Stanza
* TFLite
* TensorBoard
* TensorFlow
* TensorFlowTTS
* Timm
* Transformers
* allenNLP
* fastText
* fastai
* pyannote_audio
* spaCy
* speechbrain
It is! The PyTorch name is there, so you’ll need to use model_args.library.PyTorch:
>>> model_args.library.PyTorch
'pytorch'
Repeating the process for the Text Classification and glue requirements yields model_args.pipeline_tag.TextClassification and dataset_args.dataset_name.glue.
Now that all the pieces are there, the last step is to combine them through the ModelFilter and DatasetFilter classes, which transform the outputs of the previous step into something the API can use conveniently:
>>> from huggingface_hub import ModelFilter, DatasetFilter
>>> filt = ModelFilter(
... task=model_args.pipeline_tag.TextClassification,
... trained_dataset=dataset_args.dataset_name.glue,
... library=model_args.library.PyTorch
... )
>>> api.list_models(filter=filt)[0]
ModelInfo: {
modelId: Jiva/xlm-roberta-large-it-mnli
sha: c6e64469ec4aa17fedbd1b2522256f90a90b5b86
lastModified: 2021-12-10T14:56:38.000Z
tags: ['pytorch', 'xlm-roberta', 'text-classification', 'it', 'dataset:multi_nli', 'dataset:glue', 'arxiv:1911.02116', 'transformers', 'tensorflow', 'license:mit', 'zero-shot-classification']
pipeline_tag: zero-shot-classification
siblings: [ModelFile(rfilename='.gitattributes'), ModelFile(rfilename='README.md'), ModelFile(rfilename='config.json'), ModelFile(rfilename='pytorch_model.bin'), ModelFile(rfilename='sentencepiece.bpe.model'), ModelFile(rfilename='special_tokens_map.json'), ModelFile(rfilename='tokenizer.json'), ModelFile(rfilename='tokenizer_config.json')]
config: None
id: Jiva/xlm-roberta-large-it-mnli
private: False
downloads: 11061
library_name: transformers
likes: 1
}
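To check this result by hand, the sketch below compares the tags printed above against the three criteria. It assumes that matching boils down to "every requested tag is present" and that the training dataset shows up as a dataset:-prefixed tag, as the output suggests. The helper has_all_tags is hypothetical and not part of huggingface_hub, and dataset:squad is used only as an illustrative tag this model does not carry.

```python
def has_all_tags(model_tags, required_tags):
    """Return True when every required tag appears in the model's tag list."""
    return set(required_tags) <= set(model_tags)


# Tags copied from the ModelInfo output above.
tags = ['pytorch', 'xlm-roberta', 'text-classification', 'it',
        'dataset:multi_nli', 'dataset:glue', 'arxiv:1911.02116',
        'transformers', 'tensorflow', 'license:mit',
        'zero-shot-classification']

# Assumed tag form of the three criteria: library, task, and training dataset.
required = ["pytorch", "text-classification", "dataset:glue"]

print(has_all_tags(tags, required))                      # True
print(has_all_tags(tags, required + ["dataset:squad"]))  # False
```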
As you can see, it found the models that fit all the criteria. You can even take it further by passing in a list for each of the parameters from before. For example, let’s look at the same configuration, but also include TensorFlow in the filter:
>>> filt = ModelFilter(
... task=model_args.pipeline_tag.TextClassification,
... library=[model_args.library.PyTorch, model_args.library.TensorFlow]
... )
>>> api.list_models(filter=filt)[0]
ModelInfo: {
modelId: distilbert-base-uncased-finetuned-sst-2-english
sha: ada5cc01a40ea664f0a490d0b5f88c97ab460470
lastModified: 2022-03-22T19:47:08.000Z
tags: ['pytorch', 'tf', 'rust', 'distilbert', 'text-classification', 'en', 'dataset:sst-2', 'transformers', 'license:apache-2.0', 'infinity_compatible']
pipeline_tag: text-classification
siblings: [ModelFile(rfilename='.gitattributes'), ModelFile(rfilename='README.md'), ModelFile(rfilename='config.json'), ModelFile(rfilename='map.jpeg'), ModelFile(rfilename='pytorch_model.bin'), ModelFile(rfilename='rust_model.ot'), ModelFile(rfilename='tf_model.h5'), ModelFile(rfilename='tokenizer_config.json'), ModelFile(rfilename='vocab.txt')]
config: None
id: distilbert-base-uncased-finetuned-sst-2-english
private: False
downloads: 3917525
library_name: transformers
likes: 49
}
Searching for a Dataset
Just as with models, you can find a dataset easily by following the same steps.
The new scenario will be:
I want to search the Hub for all datasets that can be used for text_classification and are in English.
First, you should look at what is available in the DatasetSearchArguments, similar to the ModelSearchArguments:
>>> dataset_args = DatasetSearchArguments()
>>> dataset_args
Available Attributes or Keys:
* author
* benchmark
* dataset_name
* language_creators
* languages
* licenses
* multilinguality
* size_categories
* task_categories
* task_ids
text_classification is a task, so first you should check task_categories:
>>> dataset_args.task_categories
Available Attributes or Keys:
* CodeGeneration
* Evaluationoflanguagemodels
* InclusiveLanguage
* InformationRetrieval
* SemanticSearch
* Summarization
* Text2Textgeneration (Key only)
* TextNeutralization
* TokenClassification
* Translation
* audio_classification
* automatic_speech_recognition
* caption_retrieval
* code_generation
* computer_vision
* conditional_text_generation
* conversational
* cross_language_transcription
* crowdsourced
* dialogue_system
* entity_extraction
* feature_extraction
* fill_mask
* generative_modelling
* gpt_3 (Key only)
* grammaticalerrorcorrection
* image
* image_captioning
* image_classification
* image_retrieval
* image_segmentation
* image_to_text
* information_retrieval
* language_modeling
* machine_translation
* multiple_choice
* named_entity_disambiguation
* named_entity_recognition
* natural_language_inference
* news_classification
* object_detection
* other
* other_test
* other_text_search
* paraphrase
* paraphrasedetection
* query_paraphrasing
* question_answering
* question_generation
* question_pairing
* sentiment_analysis
* sequence2sequence (Key only)
* sequence_modeling
* speech_processing
* structure_prediction
* summarization
* table_to_text
* tabular_to_text
* text2text_generation (Key only)
* text_classification
* text_generation
* text_generation_other_code_modeling
* text_generation_other_common_sense_inference
* text_generation_other_discourse_analysis
* text_regression
* text_retrieval
* text_scoring
* text_to_structured
* text_to_tabular
* textual_entailment
* time_series_forecasting
* token_classification
* transkation
* translation
* tts
* unpaired_image_to_image_translation
* zero_shot_information_retrieval
* zero_shot_retrieval
There you will find text_classification, so you should use dataset_args.task_categories.text_classification.
Next, we need to find the proper language. There is a languages property we can check. These are two-letter language codes, so you should check whether it has en:
>>> "en" in dataset_args.languages
True
Now that the pieces are found, you can write a filter:
>>> filt = DatasetFilter(
... languages=dataset_args.languages.en,
... task_categories=dataset_args.task_categories.text_classification
... )
And search the API!
>>> api.list_datasets(filter=filt)[0]
DatasetInfo: {
id: Abirate/english_quotes
lastModified: None
tags: ['annotations_creators:expert-generated', 'language_creators:expert-generated', 'language_creators:crowdsourced', 'languages:en', 'multilinguality:monolingual', 'source_datasets:original', 'task_categories:text-classification', 'task_ids:multi-label-classification']
private: False
author: Abirate
description: None
citation: None
cardData: None
siblings: None
gated: False
}
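The dataset tags above follow a key:value pattern, which makes it easy to confirm that the result matches both criteria. The sketch below groups the tags from that output by key; the helper group_tags is hypothetical and not part of huggingface_hub.

```python
from collections import defaultdict


def group_tags(tags):
    """Group 'key:value' tags into a dict mapping each key to its values."""
    grouped = defaultdict(list)
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep:  # skip tags that have no 'key:' prefix
            grouped[key].append(value)
    return dict(grouped)


# Tags copied from the DatasetInfo output above.
tags = ['annotations_creators:expert-generated',
        'language_creators:expert-generated',
        'language_creators:crowdsourced',
        'languages:en',
        'multilinguality:monolingual',
        'source_datasets:original',
        'task_categories:text-classification',
        'task_ids:multi-label-classification']

grouped = group_tags(tags)
print(grouped["languages"])        # ['en']
print(grouped["task_categories"])  # ['text-classification']
```

Note that the tag value is written with a hyphen (text-classification), while the attribute name used to look it up has an underscore (text_classification).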
With these two functionalities combined, you can look up every available parameter and tag on the Hub and search for both Datasets and Models with ease!