|
--- |
|
base_model: |
|
- Pinkstack/SuperThoughts-CoT-14B-16k-o1-QwQ |
|
tags: |
|
- text-generation-inference |
|
- transformers |
|
- unsloth |
|
- llama |
|
- gguf |
|
- code |
|
- phi3 |
|
- cot |
|
- o1 |
|
- reasoning |
|
license: mit |
|
license_link: https://huggingface.co/microsoft/phi-4/resolve/main/LICENSE |
|
language: |
|
- en |
|
- multilingual |
|
pipeline_tag: text-generation |
|
inference: |
|
parameters: |
|
temperature: 0.3 |
|
widget: |
|
- messages: |
|
- role: user |
|
content: How many R's in strawberry? Think step by step. |
|
model-index: |
|
- name: SuperThoughts-CoT-14B-16k-o1-QwQ |
|
results: |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: IFEval (0-Shot) |
|
type: wis-k/instruction-following-eval |
|
split: train |
|
args: |
|
num_few_shot: 0 |
|
metrics: |
|
- type: inst_level_strict_acc and prompt_level_strict_acc |
|
value: 5.15 |
|
name: averaged accuracy |
|
source: |
|
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Pinkstack%2FSuperThoughts-CoT-14B-16k-o1-QwQ |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: BBH (3-Shot) |
|
type: SaylorTwift/bbh |
|
split: test |
|
args: |
|
num_few_shot: 3 |
|
metrics: |
|
- type: acc_norm |
|
value: 52.85 |
|
name: normalized accuracy |
|
source: |
|
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Pinkstack%2FSuperThoughts-CoT-14B-16k-o1-QwQ |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: MATH Lvl 5 (4-Shot) |
|
type: lighteval/MATH-Hard |
|
split: test |
|
args: |
|
num_few_shot: 4 |
|
metrics: |
|
- type: exact_match |
|
value: 40.79 |
|
name: exact match |
|
source: |
|
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Pinkstack%2FSuperThoughts-CoT-14B-16k-o1-QwQ |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: GPQA (0-shot) |
|
type: Idavidrein/gpqa |
|
split: train |
|
args: |
|
num_few_shot: 0 |
|
metrics: |
|
- type: acc_norm |
|
value: 19.02 |
|
name: acc_norm |
|
source: |
|
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Pinkstack%2FSuperThoughts-CoT-14B-16k-o1-QwQ |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: MuSR (0-shot) |
|
type: TAUR-Lab/MuSR |
|
args: |
|
num_few_shot: 0 |
|
metrics: |
|
- type: acc_norm |
|
value: 21.79 |
|
name: acc_norm |
|
source: |
|
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Pinkstack%2FSuperThoughts-CoT-14B-16k-o1-QwQ |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: MMLU-PRO (5-shot) |
|
type: TIGER-Lab/MMLU-Pro |
|
config: main |
|
split: test |
|
args: |
|
num_few_shot: 5 |
|
metrics: |
|
- type: acc |
|
value: 47.43 |
|
name: accuracy |
|
source: |
|
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=Pinkstack%2FSuperThoughts-CoT-14B-16k-o1-QwQ |
|
name: Open LLM Leaderboard |
|
--- |
|
- Safetensors version: [Pinkstack/SuperThoughts-CoT-14B-16k-o1-QwQ](https://huggingface.co/Pinkstack/SuperThoughts-CoT-14B-16k-o1-QwQ)
|
[Phi-4 Technical Report](https://arxiv.org/pdf/2412.08905) (SuperThoughts 14B is based on phi-4) |
|
|
|
You must use this prompt format: https://huggingface.co/Pinkstack/SuperThoughts-CoT-14B-16k-o1-QwQ-GGUF#format |
|
|
|
# We are very proud to announce SuperThoughts, but you can just call it o1 mini 😉
|
A reasoning AI model based on Phi-4 that beats QwQ at everything except IFEval, at a smaller size. It is very good at math and answers step by step in multiple languages with any prompt, since reasoning is built into the prompt format.
|
|
|
Please check the examples we provided: https://huggingface.co/Pinkstack/SuperThoughts-CoT-14B-16k-o1-QwQ-GGUF#%F0%9F%A7%80-examples |
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6710ba6af1279fe0dfe33afe/QDHJhI0EVT_L9AHY_g3Br.png) |
|
Beats Qwen/QwQ at MATH, MuSR, and GPQA (MuSR being a reasoning benchmark).
|
Evaluation: |
|
|
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6710ba6af1279fe0dfe33afe/csbdGKzGcDVMPRqMCoH8D.png) |
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6710ba6af1279fe0dfe33afe/HR9WtjBhE4h6wrq88FLAf.png) |
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6710ba6af1279fe0dfe33afe/GLt4ct4yAVMvYEpoYO5o6.png) |
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6710ba6af1279fe0dfe33afe/CP9UF9kdBT_SW8Q79PSui.png) |
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6710ba6af1279fe0dfe33afe/doEIqDrM639hRPSg_J6AF.png) |
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6710ba6af1279fe0dfe33afe/yl5Et2TkCoYuIrNpDhZu9.png) |
|
|
|
Unlike previous models we've uploaded, this is the best one we've published! It answers in two steps: reasoning -> final answer, like o1 mini and other similar reasoning AI models.
|
# 🧀 Which quant is right for you? (all tested!) |
|
- ***Q3:*** Use this quant on most high-end devices like an RTX 2080 Ti. Responses are very high quality, but it is slightly slower than Q4. (Runs at ~1 token per second or less on a Samsung Z Fold 5 smartphone.)
|
- ***Q4:*** Use this quant on high-end modern devices like an RTX 3080, or on any GPU, TPU, or CPU that is powerful enough and has at least 15 GB of available memory. (We personally use it on servers and high-end computers.) Recommended.
|
- ***Q8:*** Use this quant on very high-end devices that can handle it. It is very capable, but Q4 is more well-rounded; not recommended for most users. (A minimal loading sketch follows this list.)
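
As a rough guide, here is a minimal sketch of loading one of these quants with llama-cpp-python. The filename below is illustrative (pick the actual Q3/Q4/Q8 file you downloaded), and this is just one possible setup, not the only way to run the model:

```python
# Minimal sketch: loading a GGUF quant with llama-cpp-python.
# The filename is hypothetical; use the Q3/Q4/Q8 file that matches your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="SuperThoughts-CoT-14B-16k-o1-QwQ.Q4_K_M.gguf",  # hypothetical filename
    n_ctx=16384,      # the model supports a 16k context window
    n_gpu_layers=-1,  # offload every layer to the GPU if memory allows (Q4 wants ~15 GB)
)

out = llm(
    "<|user|>\nHow many R's in strawberry? Think step by step.<|im_end|>\n<|assistant|>",
    max_tokens=1024,
    temperature=0.3,       # low temperature, as recommended below
    stop=["<|im_end|>"],
)
print(out["choices"][0]["text"])
```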
|
|
|
# [Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) |
|
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/Pinkstack__SuperThoughts-CoT-14B-16k-o1-QwQ-details)! |
|
Summarized results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/contents/viewer/default/train?q=Pinkstack%2FSuperThoughts-CoT-14B-16k-o1-QwQ&sort[column]=Average%20%E2%AC%86%EF%B8%8F&sort[direction]=desc)! |
|
Please note: the low IFEval score is probably due to the model always reasoning; it does have issues with instruction following.
|
|
|
| Metric              | Value (%) |
|---------------------|----------:|
| **Average**         |     31.17 |
| IFEval (0-Shot)     |      5.15 |
| BBH (3-Shot)        |     52.85 |
| MATH Lvl 5 (4-Shot) |     40.79 |
| GPQA (0-shot)       |     19.02 |
| MuSR (0-shot)       |     21.79 |
| MMLU-PRO (5-shot)   |     47.43 |
|
|
|
# Format |
|
The model uses this prompt format (a modified Phi-4 prompt):
|
```
{{ if .System }}<|system|>
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|im_end|>
{{ end }}<|assistant|>{{ .CoT }}<|CoT|>
{{ .Response }}<|FinalAnswer|><|im_end|>
```
|
It is recommended to use a system prompt like this one: |
|
```
You are a helpful ai assistant. Make sure to put your finalanswer at the end.
```
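
For reference, here is a minimal sketch of assembling this format by hand and generating with transformers. It assumes the special tokens above (`<|system|>`, `<|user|>`, `<|assistant|>`, `<|CoT|>`, `<|FinalAnswer|>`, `<|im_end|>`) are in the tokenizer's vocabulary; adjust if your copy differs:

```python
# Minimal sketch: building the prompt format above by hand and generating
# with transformers. Not the only valid setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Pinkstack/SuperThoughts-CoT-14B-16k-o1-QwQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

system = "You are a helpful ai assistant. Make sure to put your finalanswer at the end."
user = "How many R's in strawberry? Think step by step."

prompt = (
    f"<|system|>\n{system}<|im_end|>\n"
    f"<|user|>\n{user}<|im_end|>\n"
    f"<|assistant|>"  # the model continues with reasoning, <|CoT|>, then the final answer
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.3)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False))
```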
|
|
|
# 🧀 Examples: |
|
(q4_k_m, 10 GB RTX 3080, 64 GB memory, running inside MSTY; all examples use "You are a friendly ai assistant." as the system prompt.)
|
**example 1:** |
|
![example1](https://cdn-uploads.huggingface.co/production/uploads/6710ba6af1279fe0dfe33afe/NoLJREYFU8LdMwynyLLMG.png) |
|
**example 2:** |
|
![2](https://cdn-uploads.huggingface.co/production/uploads/6710ba6af1279fe0dfe33afe/uboFipmS1ulfxeDgMBsBH.png) |
|
**example 3:** |
|
![example2](https://cdn-uploads.huggingface.co/production/uploads/6710ba6af1279fe0dfe33afe/c4h-nw0DPTrQgX-_tvBoT.png) |
|
**example 4:** |
|
![example1part1.png](https://cdn-uploads.huggingface.co/production/uploads/6710ba6af1279fe0dfe33afe/Dcd6-wbpDQuXoulHaqATo.png) |
|
![example1part2.png](https://cdn-uploads.huggingface.co/production/uploads/6710ba6af1279fe0dfe33afe/CoBYmYiRt9Z4IDFoOwHxc.png) |
|
|
|
All generated locally and pretty quickly too! |
|
|
|
# 🧀 Information |
|
- ⚠️ A low temperature must be used to ensure it won't fail at reasoning; we use 0.3 to 0.8!
|
- ⚠️ Due to the current prompt format, it may sometimes emit `<|FinalAnswer|>` without actually providing a final answer at the end; you can ignore this, strip it in post-processing (see the sketch after this list), or modify the prompt format.
|
- This is our flagship model, with top-tier reasoning rivaling gemini-flash-exp-2.0-thinking and o1 mini. Results are overall similar to both of them, and it even beats QwQ at certain benchmarks.
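
Because of the dangling-marker issue above, downstream code should parse the output defensively. The helper below is a hypothetical sketch, not part of the model's tooling; it follows the format section, where the model emits `reasoning<|CoT|>\nanswer<|FinalAnswer|>`:

```python
# Hypothetical helper: split a raw completion into reasoning and final answer,
# falling back gracefully when the <|CoT|> / <|FinalAnswer|> markers are
# missing or dangling.
def split_cot(completion: str) -> tuple[str, str]:
    reasoning, sep, rest = completion.partition("<|CoT|>")
    if not sep:  # no CoT marker: treat the whole completion as the answer
        return "", completion.strip()
    answer = rest.split("<|FinalAnswer|>", 1)[0]
    return reasoning.strip(), answer.strip()

reasoning, answer = split_cot(
    "Count the R's: s-t-r-a-w-b-e-r-r-y has three.<|CoT|>\n"
    "There are 3 R's in strawberry.<|FinalAnswer|>"
)
print(answer)  # -> There are 3 R's in strawberry.
```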
|
|
|
**Supported languages**: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian |
|
|
|
# 🧀 Uploaded model |
|
|
|
- **Developed by:** Pinkstack |
|
- **License:** MIT |
|
- **Finetuned from model:** Pinkstack/PARM-V1-phi-4-4k-CoT-pytorch
|
|
|
This Phi-4 model was trained with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.