shimmyshimmer committed on
Commit 8b55e9c · verified · 1 Parent(s): bfa32d5

Update README.md

Files changed (1)
  1. README.md +42 -44
README.md CHANGED
@@ -1,52 +1,53 @@
1
- <!-- markdownlint-disable first-line-h1 -->
2
- <!-- markdownlint-disable html -->
3
- <!-- markdownlint-disable no-duplicate-header -->
4
 
5
- <div align="center">
6
- <img src="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/logo.svg?raw=true" width="60%" alt="DeepSeek-V3" />
7
- </div>
8
- <hr>
9
- <div align="center" style="line-height: 1;">
10
- <a href="https://www.deepseek.com/" target="_blank" style="margin: 2px;">
11
- <img alt="Homepage" src="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/badge.svg?raw=true" style="display: inline-block; vertical-align: middle;"/>
12
- </a>
13
- <a href="https://chat.deepseek.com/" target="_blank" style="margin: 2px;">
14
- <img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-DeepSeek%20V3-536af5?color=536af5&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
15
- </a>
16
- <a href="https://huggingface.co/deepseek-ai" target="_blank" style="margin: 2px;">
17
- <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
18
- </a>
19
- </div>
20
 
21
- <div align="center" style="line-height: 1;">
22
- <a href="https://discord.gg/Tc7c45Zzu5" target="_blank" style="margin: 2px;">
23
- <img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" style="display: inline-block; vertical-align: middle;"/>
24
- </a>
25
- <a href="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/qr.jpeg?raw=true" target="_blank" style="margin: 2px;">
26
- <img alt="Wechat" src="https://img.shields.io/badge/WeChat-DeepSeek%20AI-brightgreen?logo=wechat&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
27
- </a>
28
- <a href="https://twitter.com/deepseek_ai" target="_blank" style="margin: 2px;">
29
- <img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-deepseek_ai-white?logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
30
- </a>
31
- </div>
32
 
33
- <div align="center" style="line-height: 1;">
34
- <a href="https://github.com/deepseek-ai/DeepSeek-V3/blob/main/LICENSE-CODE" style="margin: 2px;">
35
- <img alt="Code License" src="https://img.shields.io/badge/Code_License-MIT-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/>
36
- </a>
37
- <a href="https://github.com/deepseek-ai/DeepSeek-V3/blob/main/LICENSE-MODEL" style="margin: 2px;">
38
- <img alt="Model License" src="https://img.shields.io/badge/Model_License-Model_Agreement-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/>
39
- </a>
40
- </div>
41
 
 
 
42
 
43
- <p align="center">
44
- <a href="https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf"><b>Paper Link</b>👁️</a>
45
- </p>
46
 
 
47
 
48
- ## 1. Introduction
 
 
49
 
 
 
 
 
50
  We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token.
51
  To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2.
52
  Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance.
@@ -55,9 +56,6 @@ Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source
55
  Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training.
56
  In addition, its training process is remarkably stable.
57
  Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks.
58
- <p align="center">
59
- <img width="80%" src="figures/benchmark.png">
60
- </p>
61
 
62
  ## 2. Model Summary
63
 
 
1
+ ---
2
+ base_model: deepseek-ai/DeepSeek-V3
3
+ language:
4
+ - en
5
+ library_name: transformers
6
+ license: mit
7
+ tags:
8
+ - deepseek_v3
9
+ - deepseek
10
+ - unsloth
11
+ - transformers
12
+ ---
13
 
14
+ ## ***See [our collection](https://huggingface.co/collections/unsloth/deepseek-v3-all-versions-677cf5cfd7df8b7815fc723c) for versions of DeepSeek V3, including GGUF, bf16, and original formats.***
15
 
16
 
17
+ # Finetune Llama 3.3, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth!
18
+ We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb
19
 
20
+ [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/Discord%20button.png" width="200"/>](https://discord.gg/unsloth)
21
+ [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
22
 
23
+ # unsloth/DeepSeek-V3-GGUF
24
+ For more details on the model, please go to DeepSeek's original [model card](https://huggingface.co/deepseek-ai/DeepSeek-V3).
25
+
26
+ ## ✨ Finetune for Free
27
+
28
+ All notebooks are **beginner friendly**! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.
29
+
30
+ | Unsloth supports | Free Notebooks | Performance | Memory use |
31
+ |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------|
32
+ | **Llama-3.2 (3B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) | 2.4x faster | 58% less |
33
+ | **Llama-3.2 (11B vision)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb) | 2x faster | 60% less |
34
+ | **Qwen2 VL (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_VL_(7B)-Vision.ipynb) | 1.8x faster | 60% less |
35
+ | **Qwen2.5 (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(7B)-Alpaca.ipynb) | 2x faster | 60% less |
36
+ | **Llama-3.1 (8B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb) | 2.4x faster | 58% less |
37
+ | **Phi-3.5 (mini)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_3.5_Mini-Conversational.ipynb) | 2x faster | 50% less |
38
+ | **Gemma 2 (9B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma2_(9B)-Alpaca.ipynb) | 2.4x faster | 58% less |
39
+ | **Mistral (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-Conversational.ipynb) | 2.2x faster | 62% less |
40
 
41
+ [<img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="200"/>](https://docs.unsloth.ai)
42
 
43
+ - This [Llama 3.2 conversational notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) is useful for ShareGPT ChatML / Vicuna templates.
44
+ - This [text completion notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_(7B)-Text_Completion.ipynb) is for raw text. This [DPO notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) replicates Zephyr.
45
+ - \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.
46
 
47
+ ## Special Thanks
48
+ A huge thank you to the DeepSeek team for creating and releasing these models.
49
+
50
+ ## Model Information
51
  We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token.
52
  To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2.
53
  Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance.
 
56
  Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training.
57
  In addition, its training process is remarkably stable.
58
  Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks.
59
 
60
  ## 2. Model Summary
61