Loading models
Transformers provides many pretrained models that are ready to use with a single line of code. All you need is a model class and the from_pretrained() method.
Call from_pretrained() to download and load a model's weights and configuration stored on the Hugging Face Hub.
The from_pretrained() method loads weights stored in the safetensors file format if they're available. Traditionally, PyTorch model weights are serialized with the pickle utility, which is known to be insecure. Safetensors files are more secure and faster to load.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype="auto", device_map="auto")
This guide explains how models are loaded, the different ways you can load a model, how to overcome memory issues for really big models, and how to load custom models.
Models and configurations
All models have a configuration.py file with specific attributes like the number of hidden layers, vocabulary size, activation function, and more. You'll also find a modeling.py file that defines the layers and mathematical operations taking place inside each layer. The modeling.py file takes the model attributes in configuration.py and builds the model accordingly. At this point, you have a model with random weights that needs to be trained to output meaningful results.
An architecture refers to the model’s skeleton and a checkpoint refers to the model’s weights for a given architecture. For example, BERT is an architecture while google-bert/bert-base-uncased is a checkpoint. You’ll see the term model used interchangeably with architecture and checkpoint.
There are two general types of models you can load:
- A barebones model, like AutoModel or LlamaModel, that outputs hidden states.
- A model with a specific head attached, like AutoModelForCausalLM or LlamaForCausalLM, for performing specific tasks (a short sketch comparing the two follows this list).
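Here is a minimal sketch of the difference, assuming a small checkpoint such as openai-community/gpt2 for speed: the barebones model returns hidden states, while the model with a causal language modeling head returns logits over the vocabulary.
# Minimal sketch (assumes the openai-community/gpt2 checkpoint).
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
inputs = tokenizer("Hello", return_tensors="pt")

backbone = AutoModel.from_pretrained("openai-community/gpt2")
print(backbone(**inputs).last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)

lm = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
print(lm(**inputs).logits.shape)  # (batch_size, sequence_length, vocab_size)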
For each model type, there is a separate class for each machine learning framework (PyTorch, TensorFlow, Flax). Pick the class with the prefix that matches the framework you're using: PyTorch classes have no prefix, while TensorFlow classes are prefixed with TF and Flax classes with Flax.
from transformers import AutoModelForCausalLM, MistralForCausalLM
# load with AutoClass or model-specific class
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", , torch_dtype="auto", device_map="auto")
model = MistralForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", , torch_dtype="auto", device_map="auto")
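For example, the TensorFlow and Flax equivalents use the TF and Flax prefixes. This is a hedged sketch that assumes a checkpoint shipping TensorFlow and Flax weights, such as openai-community/gpt2.
# Sketch of the framework prefixes (assumes openai-community/gpt2, which provides
# TensorFlow and Flax weights alongside the PyTorch weights).
from transformers import TFAutoModelForCausalLM, FlaxAutoModelForCausalLM

tf_model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2")
flax_model = FlaxAutoModelForCausalLM.from_pretrained("openai-community/gpt2")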
Model classes
To get a pretrained model, you need to load the weights into the model. This is done by calling from_pretrained() which accepts weights from the Hugging Face Hub or a local directory.
There are two model classes, the AutoModel class and a model-specific class.
The AutoModel class is a convenient way to load an architecture without needing to know the exact model class name because there are many models available. It automatically selects the correct model class based on the configuration file. You only need to know the task and checkpoint you want to use.
This makes it easy to switch between models or tasks, as long as the architecture is supported for a given task.
For example, the same checkpoint can be used for several different tasks.
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoModelForQuestionAnswering
# use the same API for 3 different tasks
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForSequenceClassification.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForQuestionAnswering.from_pretrained("meta-llama/Llama-2-7b-hf")
In other cases, you may want to quickly try out several different models for a task.
from transformers import AutoModelForCausalLM
# use the same API to load 3 different models
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
Large models
Large pretrained models require a lot of memory to load. The loading process involves:
- creating a model with random weights
- loading the pretrained weights
- placing the pretrained weights on the model
You need enough memory to hold two copies of the model weights (random and pretrained) which may not be possible depending on your hardware. In distributed training environments, this is even more challenging because each process loads a pretrained model.
Transformers reduces some of these memory-related challenges with fast initialization, sharded checkpoints, Accelerate’s Big Model Inference feature, and supporting lower bit data types.
Fast initialization
A PyTorch model is instantiated with random weights, or "empty" tensors, that take up space in memory without filling it.
Transformers boosts loading speed by skipping random weight initialization with the _fast_init parameter if the pretrained weights are correctly initialized. This parameter is set to True by default.
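As a rough sketch, you can compare loading with and without fast initialization. The parameter name comes from the paragraph above; it is a private argument that may emit a deprecation warning on newer releases, and the timings are illustrative only.
# Rough sketch: time a load with fast initialization (the default) and with it disabled.
import time
from transformers import AutoModel

start = time.perf_counter()
model = AutoModel.from_pretrained("google-bert/bert-base-uncased")  # _fast_init=True by default
print(f"fast init: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
model = AutoModel.from_pretrained("google-bert/bert-base-uncased", _fast_init=False)
print(f"without fast init: {time.perf_counter() - start:.2f}s")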
Sharded checkpoints
The save_pretrained() method automatically shards checkpoints larger than 10GB.
Each shard is loaded sequentially after the previous shard is loaded, limiting memory usage to only the model size and the largest shard size.
The max_shard_size parameter defaults to 5GB for each shard because it is easier to run on free-tier GPU instances without running out of memory.
For example, create sharded checkpoints for BioMistral/BioMistral-7B with save_pretrained().
from transformers import AutoModel
import tempfile
import os
model = AutoModel.from_pretrained("biomistral/biomistral-7b")
with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir, max_shard_size="5GB")
    print(sorted(os.listdir(tmp_dir)))
Reload the sharded checkpoint with from_pretrained().
with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir)
    new_model = AutoModel.from_pretrained(tmp_dir)
Sharded checkpoints can also be directly loaded with load_sharded_checkpoint().
from transformers.modeling_utils import load_sharded_checkpoint
with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir, max_shard_size="5GB")
    load_sharded_checkpoint(model, tmp_dir)
The save_pretrained() method creates an index file that maps parameter names to the files they're stored in. The index file has two keys, metadata and weight_map.
import json
with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir, max_shard_size="5GB")
    with open(os.path.join(tmp_dir, "model.safetensors.index.json"), "r") as f:
        index = json.load(f)
print(index.keys())
The metadata key provides the total model size.
index["metadata"]
{'total_size': 28966928384}
The weight_map key maps each parameter to the shard it's stored in.
index["weight_map"]
{'lm_head.weight': 'model-00006-of-00006.safetensors',
'model.embed_tokens.weight': 'model-00001-of-00006.safetensors',
'model.layers.0.input_layernorm.weight': 'model-00001-of-00006.safetensors',
'model.layers.0.mlp.down_proj.weight': 'model-00001-of-00006.safetensors',
...
}
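Because the index is plain JSON, you can inspect it with standard tooling. As a small sketch that reuses the index loaded above, the snippet below counts how many parameter tensors each shard file holds.
# Small sketch: count the number of parameter tensors stored in each shard file,
# reusing the index dictionary loaded in the previous example.
from collections import Counter

shard_counts = Counter(index["weight_map"].values())
for shard, num_tensors in sorted(shard_counts.items()):
    print(f"{shard}: {num_tensors} tensors")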
Big Model Inference
Make sure you have Accelerate v0.9.0 and PyTorch v1.9.0 or later installed to use this feature!
from_pretrained() is supercharged with Accelerate’s Big Model Inference feature.
Big Model Inference creates a model skeleton on the PyTorch meta device. The meta device doesn’t store any real data, only the metadata.
Randomly initialized weights are only created when the pretrained weights are loaded to avoid maintaining two copies of the model in memory at the same time. The maximum memory usage is only the size of the model.
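A rough sketch of the meta device idea, using Accelerate's init_empty_weights directly; from_pretrained() with device_map="auto" handles this for you, so this is only an illustration.
# Rough sketch: build a model skeleton on the meta device without allocating real weight memory.
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("google/gemma-7b")
with init_empty_weights():
    skeleton = AutoModelForCausalLM.from_config(config)

print(next(skeleton.parameters()).device)  # meta: no real data is stored yet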
Learn more about device placement in Designing a device map.
Big Model Inference’s second feature relates to how weights are loaded and dispatched in the model skeleton. Model weights are dispatched across all available devices, starting with the fastest device (usually the GPU) and then offloading any remaining weights to slower devices (CPU and hard drive).
Both features combined reduce memory usage and loading times for big pretrained models.
Set device_map to "auto" to enable Big Model Inference. This also sets the low_cpu_mem_usage parameter to True, such that not more than 1x the model size is used in CPU memory.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", device_map="auto")
You can also manually assign layers to a device in device_map. It should map all model parameters to a device, but you don't have to detail where all the submodules of a layer go if the entire layer is on the same device.
Access the hf_device_map attribute to see how a model is distributed across devices.
device_map = {"model.layers.1": 0, "model.layers.14": 1, "model.layers.31": "cpu", "lm_head": "disk"}
model.hf_device_map
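As a hedged sketch of a manual placement, the map below keeps the transformer body on GPU 0 and offloads the lm_head to CPU. It is a hypothetical split chosen only for illustration, and offloading to "disk" would additionally require the offload_folder parameter.
# Sketch: a coarse manual device map (hypothetical split chosen only for illustration).
from transformers import AutoModelForCausalLM

manual_map = {"model": 0, "lm_head": "cpu"}
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map=manual_map)
print(model.hf_device_map)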
Model data type
PyTorch model weights are initialized in torch.float32 by default. Loading a model in a different data type, like torch.float16, requires additional memory because the model is loaded again in the desired data type.
Explicitly set the torch_dtype parameter to directly initialize the model in the desired data type instead of loading the weights twice (torch.float32 then torch.float16). You could also set torch_dtype="auto" to automatically load the weights in the data type they are stored in.
import torch
from transformers import AutoModelForCausalLM
gemma = AutoModelForCausalLM.from_pretrained("google/gemma-7b", torch_dtype=torch.float16)
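A variant of the same call uses torch_dtype="auto", which loads the weights in whatever data type they were saved in (often a half-precision type for recent checkpoints).
# Sketch: let from_pretrained() pick the data type stored in the checkpoint.
from transformers import AutoModelForCausalLM

gemma = AutoModelForCausalLM.from_pretrained("google/gemma-7b", torch_dtype="auto")
print(gemma.dtype)  # the data type the checkpoint weights were saved in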
The torch_dtype parameter can also be configured in AutoConfig for models instantiated from scratch.
import torch
from transformers import AutoConfig, AutoModel
my_config = AutoConfig.from_pretrained("google/gemma-2b", torch_dtype=torch.float16)
model = AutoModel.from_config(my_config)
Custom models
Custom models build on Transformers' configuration and modeling classes, support the AutoClass API, and are loaded with from_pretrained(). The difference is that the modeling code is not from Transformers.
Take extra precautions when loading a custom model. While the Hub includes malware scanning for every repository, you should still be careful to avoid inadvertently executing malicious code.
Set trust_remote_code=True in from_pretrained() to load a custom model.
from transformers import AutoModelForImageClassification
model = AutoModelForImageClassification.from_pretrained("sgugger/custom-resnet50d", trust_remote_code=True)
As an extra layer of security, load a custom model from a specific revision to avoid loading model code that may have changed. The commit hash can be copied from the model's commit history.
commit_hash = "ed94a7c6247d8aedce4647f00f20de6875b5b292"
model = AutoModelForImageClassification.from_pretrained(
"sgugger/custom-resnet50d", trust_remote_code=True, revision=commit_hash
)
Refer to the Customize models guide for more information.