Inference Speed
I ran the demo case the same way it is run in the Colab notebook. My GPU is a single 40GB A100, and answering the demo question takes about 10 minutes. Is that normal?
Hi @ThisIsSoMe, could this be because you're downloading the model for the first time when running the script, causing it to take 10 minutes? How long do subsequent requests take to complete? If you could share the sample script you ran to reproduce the latency issue, we can look into it further.
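For instance, timing only the generation step across two consecutive requests would tell us whether the cost is a one-off (download/load) or per-request. Here is a rough sketch; the model path and prompt below are placeholders, not your exact setup:

import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "defog/sqlcoder2"  # placeholder; use a local path if the model is already downloaded
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "### Task\nGenerate a SQL query to answer ..."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

for i in range(2):  # the second run excludes any one-off warm-up cost
    start = time.time()
    model.generate(**inputs, max_new_tokens=400, do_sample=False)
    print(f"request {i}: {time.time() - start:.1f}s")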
Code:
import torch
import sqlparse
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
res = torch.cuda.is_available()
print(res)
model_name = "/root/paddlejob/workspace/env_run/sqlcoder2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    # load_in_8bit=True,
    # load_in_4bit=True,
    device_map="auto",
    use_cache=True,
)
# model = model.to("cuda")
eos_token_id = tokenizer.eos_token_id
print("loaded model")

for i in tqdm(range(1)):
    question = "What is our total revenue by product in the last week?"
    prompt = """### Task
Generate a SQL query to answer the following question:
`{question}`

### Database Schema
This query will run on a database whose schema is represented in this string:
CREATE TABLE products (
  product_id INTEGER PRIMARY KEY, -- Unique ID for each product
  name VARCHAR(50), -- Name of the product
  price DECIMAL(10,2), -- Price of each unit of the product
  quantity INTEGER -- Current quantity in stock
);

CREATE TABLE customers (
  customer_id INTEGER PRIMARY KEY, -- Unique ID for each customer
  name VARCHAR(50), -- Name of the customer
  address VARCHAR(100) -- Mailing address of the customer
);

CREATE TABLE salespeople (
  salesperson_id INTEGER PRIMARY KEY, -- Unique ID for each salesperson
  name VARCHAR(50), -- Name of the salesperson
  region VARCHAR(50) -- Geographic sales region
);

CREATE TABLE sales (
  sale_id INTEGER PRIMARY KEY, -- Unique ID for each sale
  product_id INTEGER, -- ID of product sold
  customer_id INTEGER, -- ID of customer who made purchase
  salesperson_id INTEGER, -- ID of salesperson who made the sale
  sale_date DATE, -- Date the sale occurred
  quantity INTEGER -- Quantity of product sold
);

CREATE TABLE product_suppliers (
  supplier_id INTEGER PRIMARY KEY, -- Unique ID for each supplier
  product_id INTEGER, -- Product ID supplied
  supply_price DECIMAL(10,2) -- Unit price charged by supplier
);

-- sales.product_id can be joined with products.product_id
-- sales.customer_id can be joined with customers.customer_id
-- sales.salesperson_id can be joined with salespeople.salesperson_id
-- product_suppliers.product_id can be joined with products.product_id

### SQL
Given the database schema, here is the SQL query that answers `{question}`:
""".format(question=question)

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # print(inputs)
    generated_ids = model.generate(
        **inputs,
        num_return_sequences=1,
        eos_token_id=eos_token_id,
        pad_token_id=eos_token_id,
        max_new_tokens=400,
        do_sample=False,  # greedy decoding
        num_beams=1,
    )
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    torch.cuda.empty_cache()
    # torch.cuda.synchronize()
    # Empty the cache so that you can generate more results without memory crashing.
    # This is particularly important on Colab; memory management is much more
    # straightforward when running on an inference service.
    print(outputs[0])
    print(sqlparse.format(outputs[0].split("```sql")[-1], reindent=True))
It takes 10+ minutes to complete.
Hi @ThisIsSoMe, thanks for the code and screenshot. From those, it seems that the part taking very long is the inference code inside the tqdm loop. I'm guessing this might be because your environment is not actually using the GPU. I saw the nvidia-smi output with 0MB of memory used and 0% utilization, but wasn't sure whether that was captured before or after the model was loaded. To rule that out, could you print out the model's device after loading it (via model.device) and check nvidia-smi while running inference? 10 minutes sounds like roughly the time this would take on a CPU.
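For example, a minimal check along these lines (assuming the loading code from your script, with the same variable names) would confirm where the weights actually live:

import torch

print(torch.cuda.is_available())        # should be True
print(model.device)                     # should be a cuda device (e.g. cuda:0), not cpu
print(next(model.parameters()).device)  # same check via the parameters themselves

And while generate() is running, `watch -n 1 nvidia-smi` in a separate terminal should show several GB of memory allocated and non-zero GPU utilization; if both stay at zero, the model is running on the CPU.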