
Llama 3 8B ChatQA - AWQ

Description

This repo contains AWQ model files for Nvidia's Llama 3 8B ChatQA.

About AWQ

AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference.

It is also supported by the continuous-batching server vLLM, which allows AWQ models to be used for high-throughput concurrent inference in multi-user server scenarios. Note that, at the time of writing, overall throughput is still lower than running vLLM with unquantised models. However, AWQ makes it possible to use much smaller GPUs, which can lead to easier deployment and overall cost savings. For example, a 70B model can be run on 1 x 48GB GPU instead of 2 x 80GB.
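The 70B example can be checked with a back-of-the-envelope estimate. The sketch below counts weight storage only and assumes a group size of 128, with one FP16 scale and one 4-bit zero point per group. These layout details are assumptions for illustration, not figures stated in this card. Activations, KV cache, and runtime overhead are ignored.

```python
# Rough weight-memory estimate; ignores activations, KV cache, and runtime overhead.

def weight_memory_gb(num_params, bits_per_weight):
    """Return the approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 70e9   # the 70B example from the text
GROUP_SIZE = 128    # assumed quantization group size (not stated in this card)

# Assumed layout: each group of weights shares one FP16 scale and one 4-bit zero point.
awq_bits = 4 + (16 + 4) / GROUP_SIZE

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
awq_gb = weight_memory_gb(NUM_PARAMS, awq_bits)

print(f"FP16 weights:  {fp16_gb:.1f} GB")   # 140.0 GB, more than one 80GB GPU
print(f"4-bit weights: {awq_gb:.1f} GB")    # 36.4 GB, below 48GB
```

Under these assumptions the FP16 weights alone exceed a single 80GB GPU, while the 4-bit weights leave some headroom on a 48GB card.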

Model Details By Nvidia

Llama3-ChatQA-1.5 excels at conversational question answering (QA) and retrieval-augmented generation (RAG). It is developed using an improved training recipe from ChatQA (1.0) and is built on top of the Llama-3 base model. Specifically, we incorporate more conversational QA data to enhance its tabular and arithmetic calculation capability. Llama3-ChatQA-1.5 has two variants: Llama3-ChatQA-1.5-8B and Llama3-ChatQA-1.5-70B. Both models were originally trained using Megatron-LM; we converted the checkpoints to Hugging Face format.

Other Resources

Llama3-ChatQA-1.5-70B   Evaluation Data   Training Data   Retriever

Benchmark Results

Results in ConvRAG Bench are as follows:

| Benchmark | ChatQA-1.0-7B | Command-R-Plus | Llama-3-instruct-70b | GPT-4-0613 | ChatQA-1.0-70B | ChatQA-1.5-8B | ChatQA-1.5-70B |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Doc2Dial | 37.88 | 33.51 | 37.88 | 34.16 | 38.9 | 39.33 | 41.26 |
| QuAC | 29.69 | 34.16 | 36.96 | 40.29 | 41.82 | 39.73 | 38.82 |
| QReCC | 46.97 | 49.77 | 51.34 | 52.01 | 48.05 | 49.03 | 51.40 |
| CoQA | 76.61 | 69.71 | 76.98 | 77.42 | 78.57 | 76.46 | 78.44 |
| DoQA | 41.57 | 40.67 | 41.24 | 43.39 | 51.94 | 49.6 | 50.67 |
| ConvFinQA | 51.61 | 71.21 | 76.6 | 81.28 | 73.69 | 78.46 | 81.88 |
| SQA | 61.87 | 74.07 | 69.61 | 79.21 | 69.14 | 73.28 | 83.82 |
| TopiOCQA | 45.45 | 53.77 | 49.72 | 45.09 | 50.98 | 49.96 | 55.63 |
| HybriDial* | 54.51 | 46.7 | 48.59 | 49.81 | 56.44 | 65.76 | 68.27 |
| INSCIT | 30.96 | 35.76 | 36.23 | 36.34 | 31.9 | 30.1 | 32.31 |
| Average (all) | 47.71 | 50.93 | 52.52 | 53.90 | 54.14 | 55.17 | 58.25 |
| Average (exclude HybriDial) | 46.96 | 51.40 | 52.95 | 54.35 | 53.89 | 53.99 | 57.14 |

Note that ChatQA-1.5 is built on the Llama-3 base model, while ChatQA-1.0 is built on the Llama-2 base model. ChatQA-1.5 used some samples from the HybriDial training dataset. To ensure a fair comparison, we also report average scores excluding HybriDial. The data and evaluation scripts for ConvRAG can be found here.
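As a quick sanity check, the two average rows can be recomputed from the per-dataset scores. The sketch below does this for the two ChatQA-1.5 columns only. It is not part of the ConvRAG evaluation scripts, and the helper name `average` is illustrative.

```python
# Scores copied from the benchmark table, in row order (Doc2Dial ... INSCIT).
DATASETS = ["Doc2Dial", "QuAC", "QReCC", "CoQA", "DoQA",
            "ConvFinQA", "SQA", "TopiOCQA", "HybriDial", "INSCIT"]
SCORES = {
    "ChatQA-1.5-8B": [39.33, 39.73, 49.03, 76.46, 49.6, 78.46, 73.28, 49.96, 65.76, 30.1],
    "ChatQA-1.5-70B": [41.26, 38.82, 51.40, 78.44, 50.67, 81.88, 83.82, 55.63, 68.27, 32.31],
}

def average(scores, exclude=()):
    """Mean over all datasets whose name is not in `exclude`, rounded to 2 decimals."""
    kept = [s for name, s in zip(DATASETS, scores) if name not in exclude]
    return round(sum(kept) / len(kept), 2)

for model, scores in SCORES.items():
    print(model, average(scores), average(scores, exclude=("HybriDial",)))
# ChatQA-1.5-8B 55.17 53.99
# ChatQA-1.5-70B 58.25 57.14
```

Both results match the reported averages, with and without HybriDial.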

Prompt Format

System: {System}

{Context}

User: {Question}

Assistant: {Response}

User: {Question}

Assistant:
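The template above can be assembled programmatically for multi-turn conversations. The following is a minimal sketch: the helper name `build_chatqa_prompt` and the sample system message, context, and turns are illustrative and not part of the original model card.

```python
def build_chatqa_prompt(system, context, turns):
    """Assemble a prompt in the layout shown above.

    `turns` is a list of (role, text) pairs, where role is "user" or "assistant";
    the last turn is expected to be a user question.
    """
    labels = {"user": "User", "assistant": "Assistant"}
    parts = [f"System: {system}", context]
    parts += [f"{labels[role]}: {text}" for role, text in turns]
    parts.append("Assistant:")  # left open for the model to complete
    return "\n\n".join(parts)

prompt = build_chatqa_prompt(
    system="This is a chat between a user and an artificial intelligence assistant.",
    context="AWQ currently supports 4-bit quantization.",
    turns=[
        ("user", "What bit width does AWQ support?"),
        ("assistant", "It currently supports 4-bit quantization."),
        ("user", "Is that the only supported width?"),
    ],
)
print(prompt)
```

The resulting string ends with an open `Assistant:` line, so the model's completion is the next assistant turn.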

How to use

using vLLM

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, how are you?"
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=128)

# Create an LLM.
llm = LLM(model="Sreenington/Llama-3-8B-ChatQA-AWQ", quantization="awq")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

take the whole document as context

This applies when the whole document fits within the model's context window, so there is no need to run retrieval over the document.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nvidia/Llama3-ChatQA-1.5-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

messages = [
    {"role": "user", "content": "what is the percentage change of the net income from Q4 FY23 to Q4 FY24?"}
]

document = """NVIDIA (NASDAQ: NVDA) today reported revenue for the fourth quarter ended January 28, 2024, of $22.1 billion, up 22% from the previous quarter and up 265% from a year ago.\nFor the quarter, GAAP earnings per diluted share was $4.93, up 33% from the previous quarter and up 765% from a year ago. Non-GAAP earnings per diluted share was $5.16, up 28% from the previous quarter and up 486% from a year ago.\nQ4 Fiscal 2024 Summary\nGAAP\n| $ in millions, except earnings per share | Q4 FY24 | Q3 FY24 | Q4 FY23 | Q/Q | Y/Y |\n| Revenue | $22,103 | $18,120 | $6,051 | Up 22% | Up 265% |\n| Gross margin | 76.0% | 74.0% | 63.3% | Up 2.0 pts | Up 12.7 pts |\n| Operating expenses | $3,176 | $2,983 | $2,576 | Up 6% | Up 23% |\n| Operating income | $13,615 | $10,417 | $1,257 | Up 31% | Up 983% |\n| Net income | $12,285 | $9,243 | $1,414 | Up 33% | Up 769% |\n| Diluted earnings per share | $4.93 | $3.71 | $0.57 | Up 33% | Up 765% |"""

def get_formatted_input(messages, context):
    system = "System: This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context."
    instruction = "Please give a full and complete answer for the question."

    for item in messages:
        if item['role'] == "user":
            ## only apply this instruction for the first user turn
            item['content'] = instruction + " " + item['content']
            break

    conversation = '\n\n'.join(["User: " + item["content"] if item["role"] == "user" else "Assistant: " + item["content"] for item in messages]) + "\n\nAssistant:"
    formatted_input = system + "\n\n" + context + "\n\n" + conversation
    
    return formatted_input

formatted_input = get_formatted_input(messages, document)
tokenized_prompt = tokenizer(tokenizer.bos_token + formatted_input, return_tensors="pt").to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(input_ids=tokenized_prompt.input_ids, attention_mask=tokenized_prompt.attention_mask, max_new_tokens=128, eos_token_id=terminators)

response = outputs[0][tokenized_prompt.input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
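Whether a document "fits" can be checked before calling `generate`. The sketch below is a hypothetical helper, not part of the original example. It assumes a context window of 8192 tokens for Llama 3. In practice that value could be read from `model.config.max_position_embeddings`, and the prompt length from `tokenized_prompt.input_ids.shape[-1]`.

```python
def fits_in_context(num_prompt_tokens, max_new_tokens=128, context_window=8192):
    """Return True if the prompt plus the generation budget fits in the window.

    context_window=8192 is an assumed default for Llama 3; adjust for other models.
    """
    return num_prompt_tokens + max_new_tokens <= context_window

print(fits_in_context(1_000))   # True
print(fits_in_context(8_100))   # False: 8100 + 128 > 8192
```

If the check fails, the document would need to be shortened or passed through a retriever first.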

Quantized Model by

Sree Narayanan https://github.com/eersnington https://www.linkedin.com/in/sreenington/

License

The use of this model is governed by the META LLAMA 3 COMMUNITY LICENSE AGREEMENT.
