How to host 11B on server with CPU only (No GPU)

#54
by AndrewChanU - opened

I tried several methods to host the Llama 3.2 11B-Vision-Instruct model on a CPU-only server, but ran into errors when hosting it.
Any advice on hosting the model?
Are the 11B and 90B models limited to servers with GPU support?
Below is the yml file I used for hosting the 11B model.

services:
  text-generation-inference:
    image: ghcr.io/huggingface/text-generation-inference:2.3.1
    container_name: text-generation-inference
    shm_size: "10gb"
    ports:
      - "80:80"
    volumes:
      - ./text-generation-inference/data:/data
    environment:
      - MODEL_ID=meta-llama/Llama-3.2-11B-Vision-Instruct
      - HF_TOKEN=xxxxxxxxxxx
      - HF_HUB_ENABLE_HF_TRANSFER=0
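
When debugging a setup like this, it can help to check from the host whether the container came up at all before sending real prompts. The following is a minimal sketch, not a verified fix: it assumes the server is reachable at http://localhost:80 (taken from the "80:80" port mapping above) and uses text-generation-inference's /health and /generate routes. The helper names `build_generate_request` and `is_healthy` are hypothetical, introduced only for illustration.

```python
import http.client
import json
import urllib.error
import urllib.request

# Assumption: host port 80 from the "80:80" mapping in the compose file.
BASE_URL = "http://localhost:80"


def build_generate_request(prompt, max_new_tokens=30, base_url=BASE_URL):
    """Build (without sending) a POST request for the /generate route."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_healthy(base_url=BASE_URL, timeout=5.0):
    """Return True if GET /health answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-200 reply all count as "not ready".
        return False


if __name__ == "__main__":
    if is_healthy():
        request = build_generate_request("Write a haiku about rabbits.")
        # CPU inference can be slow, so the timeout here is deliberately generous.
        with urllib.request.urlopen(request, timeout=600) as resp:
            print(json.load(resp)["generated_text"])
    else:
        print("Server not reachable; check `docker compose logs` for the startup error.")
```

If the health check fails, the container logs are usually the quickest way to see whether the launcher stopped while looking for a GPU or while loading the weights.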

Device="cpu"

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the weights in bfloat16 and place the whole model on the CPU.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
processor = AutoProcessor.from_pretrained(model_id)

# Download the example image.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# One user turn containing an image placeholder followed by a text prompt.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)

# Generate up to 30 new tokens and print the decoded sequence.
output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))
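
As a rough sanity check on how much RAM a CPU-only host needs just to hold the weights, the arithmetic can be sketched as below. This is an estimate under stated assumptions, not a measurement: the parameter counts are the nominal sizes from the model names (11e9 and 90e9), and activations, the KV cache, and framework overhead are ignored. The function name `weight_memory_gib` is hypothetical.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}

# Assumption: nominal parameter counts implied by the model names.
NOMINAL_PARAMS = {"11B": 11e9, "90B": 90e9}


def weight_memory_gib(num_params, dtype):
    """Approximate memory (in GiB) needed to hold the weights alone."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for name, num_params in NOMINAL_PARAMS.items():
    for dtype in ("bfloat16", "float32"):
        print(f"{name} in {dtype}: ~{weight_memory_gib(num_params, dtype):.0f} GiB")
```

Whatever the exact figures, the host (and the container's memory limit, if one is set) needs at least this much free RAM before generation can start.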

Thank you for the reply. Yes, the script can run the model.
Sorry, I forgot to mention that I'm trying to run it in Docker (with Docker Compose).
I'm still not able to use the model directly.
