Quantization made by Richard Erkhov.
Smaug-Llama-3-70B-Instruct-32K - GGUF
- Model creator: https://huggingface.co/abacusai/
- Original model: https://huggingface.co/abacusai/Smaug-Llama-3-70B-Instruct-32K/
Original model description:
license: llama3
library_name: transformers
datasets:
- aqua_rat
- microsoft/orca-math-word-problems-200k
- m-a-p/CodeFeedback-Filtered-Instruction
model-index:
- name: Smaug-Llama-3-70B-Instruct-32K
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: HuggingFaceH4/ifeval
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 77.61
      name: strict accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct-32K
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: BBH
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 49.07
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct-32K
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: hendrycks/competition_math
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 21.22
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct-32K
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 6.15
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct-32K
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 12.43
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct-32K
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 41.83
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct-32K
      name: Open LLM Leaderboard
Smaug-Llama-3-70B-Instruct-32K
Built with Meta Llama 3
This is a 32K version of Smaug-Llama-3-70B-Instruct. It uses PoSE (https://arxiv.org/abs/2309.10400) and LoRA (https://arxiv.org/abs/2106.09685) adapter transfer. More details are coming soon.
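Since the card does not yet give training details, the following is only a rough sketch of the position-skipping idea from the PoSE paper, not Abacus.AI's actual recipe. It assumes a short training window whose position ids are spread over a longer target range by shifting contiguous chunks with random, non-decreasing offsets. The function name `pose_position_ids` and all sizes are hypothetical.

```python
import random


def pose_position_ids(train_len, target_len, num_chunks=2, seed=None):
    """Return position ids for `train_len` tokens spread over `target_len` positions.

    Illustrative only: the window is cut into `num_chunks` contiguous chunks and
    each chunk is shifted by a non-decreasing random offset, so large relative
    positions are seen without training on long inputs.
    """
    if not 1 <= num_chunks <= train_len <= target_len:
        raise ValueError("need 1 <= num_chunks <= train_len <= target_len")
    rng = random.Random(seed)
    # Random chunk boundaries inside the training window.
    cuts = sorted(rng.sample(range(1, train_len), num_chunks - 1))
    bounds = [0] + cuts + [train_len]
    # The largest allowed offset keeps the final id below target_len.
    max_skip = target_len - train_len
    position_ids = []
    skip = 0
    for chunk in range(num_chunks):
        # Each offset is drawn at or above the previous one, so ids keep increasing.
        skip = rng.randint(skip, max_skip)
        start, end = bounds[chunk], bounds[chunk + 1]
        position_ids.extend(range(start + skip, end + skip))
    return position_ids


ids = pose_position_ids(train_len=8, target_len=32, seed=0)
print(ids)
```

Only 8 tokens are processed here, yet the ids can reach as high as 31, which is how a short training window can expose the model to long-range positions.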
Needle-In-A-Haystack (https://github.com/jzhang38/EasyContext) heatmap:
Model Description
- Developed by: Abacus.AI
- License: https://llama.meta.com/llama3/license/
- Finetuned from model: meta-llama/Meta-Llama-3-70B-Instruct.
How to use
The prompt format is unchanged from Llama 3 70B Instruct.
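For runtimes that take a raw prompt string rather than a message list, the Llama 3 Instruct layout can be assembled by hand. This is a minimal sketch, assuming the standard Llama 3 special tokens. The helper name `format_llama3_prompt` is made up for illustration, and the tokenizer's own chat template remains the authoritative source.

```python
def format_llama3_prompt(messages, add_generation_prompt=True):
    """Render chat messages in the Llama 3 Instruct prompt layout."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header cues the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = format_llama3_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
])
print(prompt)
```

Some runtimes may prepend `<|begin_of_text|>` on their own; if so, drop it from the string to avoid a duplicate.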
Use with transformers
See the snippet below for usage with Transformers:
import transformers
import torch
model_id = "abacusai/Smaug-Llama-3-70B-Instruct-32K"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])
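The snippet above loads the original weights through Transformers. The GGUF files in this repository need a llama.cpp-based runtime instead. The following is a rough sketch using the `llama-cpp-python` bindings, assuming they are installed. The helper names are hypothetical, the model path is supplied by the caller, and the context size and sampling values are illustrative rather than recommended settings.

```python
def build_messages(user_text, system_text="You are a helpful assistant."):
    """Build a chat message list in the shape create_chat_completion expects."""
    return [
        {"role": "system", "content": system_text},
        {"role": "user", "content": user_text},
    ]


def chat_with_gguf(model_path, user_text, n_ctx=8192, max_tokens=256):
    """Load a GGUF file and return the assistant's reply to `user_text`."""
    # Imported lazily so build_messages works without the bindings installed.
    from llama_cpp import Llama

    llm = Llama(
        model_path=model_path,  # path to the GGUF file you downloaded
        n_ctx=n_ctx,            # assumption: raise toward 32768 if memory allows
        n_gpu_layers=-1,        # offload all layers when built with GPU support
    )
    result = llm.create_chat_completion(
        messages=build_messages(user_text),
        max_tokens=max_tokens,
        temperature=0.6,
        top_p=0.9,
    )
    return result["choices"][0]["message"]["content"]
```

Calling `chat_with_gguf("path/to/model.gguf", "Who are you?")` with the path of a downloaded quantization would return the reply as a string. Larger `n_ctx` values increase memory use for the KV cache.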
Evaluation
Arena-Hard
Score vs selected others (sourced from https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge). GPT-4o and Gemini-1.5-pro-latest were missing from the original blog post, so we produced those numbers from a local run using the same methodology.
Model | Score | 95% Confidence Interval | Average Tokens |
---|---|---|---|
GPT-4-Turbo-2024-04-09 | 82.6 | (-1.8, 1.6) | 662 |
GPT-4o | 78.3 | (-2.4, 2.1) | 685 |
Gemini-1.5-pro-latest | 72.1 | (-2.3, 2.2) | 630 |
Claude-3-Opus-20240229 | 60.4 | (-3.3, 2.4) | 541 |
Smaug-Llama-3-70B-Instruct-32K | 60.0 | (-2.6, 2.1) | 844 |
Smaug-Llama-3-70B-Instruct | 56.7 | (-2.2, 2.6) | 661 |
GPT-4-0314 | 50.0 | (-0.0, 0.0) | 423 |
Claude-3-Sonnet-20240229 | 46.8 | (-2.1, 2.2) | 552 |
Llama-3-70B-Instruct | 41.1 | (-2.5, 2.4) | 583 |
GPT-4-0613 | 37.9 | (-2.2, 2.0) | 354 |
Mistral-Large-2402 | 37.7 | (-1.9, 2.6) | 400 |
Mixtral-8x22B-Instruct-v0.1 | 36.4 | (-2.7, 2.9) | 430 |
Qwen1.5-72B-Chat | 36.1 | (-2.5, 2.2) | 474 |
Command-R-Plus | 33.1 | (-2.1, 2.2) | 541 |
Mistral-Medium | 31.9 | (-2.3, 2.4) | 485 |
GPT-3.5-Turbo-0613 | 24.8 | (-1.6, 2.0) | 401 |
Note that we believe the number of tokens/verbosity of the model strongly influences the GPT-4 judge in this case, and at least partially explains the improvement in Arena-Hard score for the 32K model.
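To gauge how much weight the ranking deserves, the confidence intervals in the table can be turned into absolute ranges and compared. This is a small sketch using three rows copied from the table above; the helper names are illustrative.

```python
# (score, lower offset, upper offset) copied from the Arena-Hard table.
rows = {
    "Claude-3-Opus-20240229": (60.4, -3.3, 2.4),
    "Smaug-Llama-3-70B-Instruct-32K": (60.0, -2.6, 2.1),
    "Smaug-Llama-3-70B-Instruct": (56.7, -2.2, 2.6),
}


def interval(score, lower, upper):
    """Convert a score and its signed offsets into an absolute (low, high) range."""
    return round(score + lower, 1), round(score + upper, 1)


def overlaps(a, b):
    """Return True when two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


ranges = {name: interval(*values) for name, values in rows.items()}
for name, (low, high) in ranges.items():
    print(f"{name}: {low} to {high}")

smaug_32k = ranges["Smaug-Llama-3-70B-Instruct-32K"]
print(overlaps(smaug_32k, ranges["Claude-3-Opus-20240229"]))
print(overlaps(smaug_32k, ranges["Smaug-Llama-3-70B-Instruct"]))
```

Read this way, the 32K model's range overlaps both neighbouring rows, so small gaps in the score column should be interpreted with care.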
OpenLLM Leaderboard Manual Evaluation
Model | ARC | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K* | Average |
---|---|---|---|---|---|---|---|
Smaug-Llama-3-70B-Instruct-32K | 70.1 | TBA | TBA | 61.9 | 82.2 | TBA | TBA |
Llama-3-70B-Instruct | 71.4 | 85.7 | 80.0 | 61.8 | 82.9 | 91.1 | 78.8 |
*GSM8K: The GSM8K numbers quoted here are computed using a recent release of the LM Evaluation Harness. The commit used by the leaderboard has a significant issue that impacts models that tend to use `:` in their responses, due to a bug in the stop word configuration for GSM8K. The issue is covered in more detail in the GSM8K evaluation discussion. The scores for both Llama-3 and this model are significantly different when evaluated with the updated harness, as the issue with stop words has been addressed.
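To see why a colon in the stop list matters, consider a toy routine that cuts the output at the first stop sequence. This is a schematic illustration, not the harness's actual code; the stop lists and the sample response are assumptions made for the example.

```python
import re


def truncate_at_stop(text, stop_sequences):
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


def last_number(text):
    """Return the last integer in `text`, or None if there is none."""
    numbers = re.findall(r"-?\d+", text)
    return int(numbers[-1]) if numbers else None


# Hypothetical model response that introduces its reasoning with a colon.
response = (
    "Let's work through it step by step: 3 boxes of 4 apples is 12 apples. "
    "The answer is 12."
)

buggy = truncate_at_stop(response, [":", "Question:"])  # colon treated as a stop word
fixed = truncate_at_stop(response, ["Question:"])       # colon no longer stops output

print(last_number(buggy))  # None
print(last_number(fixed))  # 12
```

With the colon in the stop list, the response is cut before any number appears, so a correct answer is scored as wrong.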
Open LLM Leaderboard Evaluation Results
Detailed results can be found here
Metric | Value |
---|---|
Avg. | 34.72 |
IFEval (0-Shot) | 77.61 |
BBH (3-Shot) | 49.07 |
MATH Lvl 5 (4-Shot) | 21.22 |
GPQA (0-shot) | 6.15 |
MuSR (0-shot) | 12.43 |
MMLU-PRO (5-shot) | 41.83 |
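The `Avg.` row appears to be the unweighted mean of the six benchmark scores, which can be checked directly from the table values:

```python
# Scores copied from the table above.
scores = {
    "IFEval (0-Shot)": 77.61,
    "BBH (3-Shot)": 49.07,
    "MATH Lvl 5 (4-Shot)": 21.22,
    "GPQA (0-shot)": 6.15,
    "MuSR (0-shot)": 12.43,
    "MMLU-PRO (5-shot)": 41.83,
}

# Unweighted mean across the six benchmarks.
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 34.72
```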