Optimum Inference with ONNX Runtime
Optimum is a utility package for building and running inference with accelerated runtimes like ONNX Runtime. Optimum can be used to load optimized models from the Hugging Face Hub and to create pipelines for running accelerated inference without rewriting your APIs.
Loading
Transformers models
Once your model has been exported to the ONNX format, you can load it by replacing AutoModelForXxx with the corresponding ORTModelForXxx class.
from transformers import AutoTokenizer, pipeline
- from transformers import AutoModelForCausalLM
+ from optimum.onnxruntime import ORTModelForCausalLM
- model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B") # PyTorch checkpoint
+ model = ORTModelForCausalLM.from_pretrained("onnx-community/Llama-3.2-1B", subfolder="onnx") # ONNX checkpoint
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("He never went out without a book under his arm")
More information on all the supported ORTModelForXxx classes can be found in our documentation.
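Since the loaded model exposes the same generation API as its PyTorch counterpart, you can also skip the pipeline and call generate() directly. The snippet below is a minimal sketch reusing the model and tokenizer from the example above; the max_new_tokens value is an arbitrary choice for illustration.
inputs = tokenizer("He never went out without a book under his arm", return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=20)  # max_new_tokens is an example value
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])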
Diffusers models
Once your model has been exported to the ONNX format, you can load it by replacing DiffusionPipeline with the corresponding ORTDiffusionPipeline class.
- from diffusers import DiffusionPipeline
+ from optimum.onnxruntime import ORTDiffusionPipeline
model_id = "runwayml/stable-diffusion-v1-5"
- pipeline = DiffusionPipeline.from_pretrained(model_id)
+ pipeline = ORTDiffusionPipeline.from_pretrained(model_id, revision="onnx")
prompt = "sailing ship in storm by Leonardo da Vinci"
image = pipeline(prompt).images[0]
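The pipeline returns standard PIL images, so the result can be saved or post-processed as usual; the file name below is just an example.
image.save("sailing_ship.png")  # example file name, any PIL-supported format works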
Sentence Transformers models
Once your model has been exported to the ONNX format, you can load it by replacing AutoModel with the corresponding ORTModelForFeatureExtraction class.
from transformers import AutoTokenizer
- from transformers import AutoModel
+ from optimum.onnxruntime import ORTModelForFeatureExtraction
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
- model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
+ model = ORTModelForFeatureExtraction.from_pretrained("optimum/all-MiniLM-L6-v2")
inputs = tokenizer("This is an example sentence", return_tensors="pt")
outputs = model(**inputs)
You can also load your ONNX model directly with the sentence_transformers.SentenceTransformer class; just make sure to have sentence-transformers>=3.2 installed. If the model hasn't already been converted to ONNX, it will be converted automatically on-the-fly.
from sentence_transformers import SentenceTransformer
model_id = "sentence-transformers/all-MiniLM-L6-v2"
- model = SentenceTransformer(model_id)
+ model = SentenceTransformer(model_id, backend="onnx")
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
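With a recent sentence-transformers release you can also score the embeddings directly; the sketch below uses the built-in similarity helper (cosine similarity by default) on the embeddings computed above.
# Pairwise cosine similarities between the two example sentences
similarities = model.similarity(embeddings, embeddings)
print(similarities)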
Timm models
Once your model has been exported to the ONNX format, you can load it by replacing the create_model function with the corresponding ORTModelForImageClassification class.
import requests
from PIL import Image
- from timm import create_model
from timm.data import resolve_data_config, create_transform
+ from optimum.onnxruntime import ORTModelForImageClassification
- model = create_model("timm/mobilenetv3_large_100.ra_in1k", pretrained=True)
+ model = ORTModelForImageClassification.from_pretrained("optimum/mobilenetv3_large_100.ra_in1k")
transform = create_transform(**resolve_data_config(model.config.pretrained_cfg, model=model))
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png"
image = Image.open(requests.get(url, stream=True).raw)
inputs = transform(image).unsqueeze(0)
outputs = model(inputs)
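The ONNX model returns logits just like the PyTorch version, so the usual post-processing applies. The following sketch, assuming torch is installed, extracts the five most likely classes.
import torch
# Convert logits to probabilities and take the five most likely classes
probabilities = torch.nn.functional.softmax(outputs.logits[0], dim=-1)
top5_probs, top5_ids = torch.topk(probabilities, 5)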
Converting your model to ONNX on-the-fly
If your model hasn't already been converted to ONNX, ORTModel includes a method to convert it on-the-fly. Simply pass export=True to the from_pretrained() method, and your model will be loaded and converted to ONNX automatically:
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> # Load the model from the hub and export it to the ONNX format
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
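The converted model behaves like any other ORTModelForXxx, so you can, for example, drop it straight into a pipeline. This quick sketch loads the tokenizer from the same checkpoint; the input sentence is only an illustration.
>>> from transformers import AutoTokenizer, pipeline
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> classifier("Optimum makes ONNX Runtime inference easy")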
Pushing your model to the Hub
You can also call push_to_hub
directly on your model to upload it to the Hub.
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> # Load the model from the hub and export it to the ONNX format
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
>>> # Save the converted model locally
>>> output_dir = "a_local_path_for_convert_onnx_model"
>>> model.save_pretrained(output_dir)
>>> # Push the ONNX model to the Hub
>>> model.push_to_hub(output_dir, repository_id="my-onnx-repo")
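Once pushed, the model can be loaded back from the Hub by its repository id. The sketch below refers to the hypothetical my-onnx-repo created above; in practice the id is usually prefixed with your username or organization.
>>> # Load the ONNX model back from the Hub (hypothetical repository id)
>>> loaded_model = ORTModelForSequenceClassification.from_pretrained("my-onnx-repo")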