qnguyen3 committed on
Commit c5c5381 · verified · 1 Parent(s): 40787b0

Update README.md

Files changed (1)
  1. README.md +50 -56
README.md CHANGED
@@ -1,58 +1,52 @@
  ---
- license: other
- base_model: qnguyen3/Mixtral-4x400M
- tags:
- - llama-factory
- - generated_from_trainer
- model-index:
- - name: mixtral-4x-400M-pt
-   results: []
  ---
-
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # mixtral-4x-400M-pt
-
- This model is a fine-tuned version of [qnguyen3/Mixtral-4x400M](https://huggingface.co/qnguyen3/Mixtral-4x400M) on the thevault_function_xsmall, the redpajama_v2_small, the tiny_strange_textbooks, the tiny_textbooks, the code_textbook, the the_stack_smol_xl_cleaned, the refinedweb_1m_medium, the minipile, the goodwiki, the wikipedia_vi, the mathpile_arxiv_medium, the mathpile_stackexchange, the mathpile_proofpile, the mathpile_wikipedia, the thevault_class_xsmall, the tiny_stories_envi, the pretrain_instruct_1, the pretrain_instruct_2 and the pretrain_instruct_code datasets.
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 0.0003
- - train_batch_size: 64
- - eval_batch_size: 8
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 4
- - gradient_accumulation_steps: 4
- - total_train_batch_size: 1024
- - total_eval_batch_size: 32
- - optimizer: Adam with betas=(0.9,0.95) and epsilon=1e-05
- - lr_scheduler_type: cosine
- - num_epochs: 3.0
-
- ### Training results
-
-
-
- ### Framework versions
-
- - Transformers 4.37.1
- - Pytorch 2.1.2+cu121
- - Datasets 2.16.1
- - Tokenizers 0.15.1
 
  ---
+ license: apache-2.0
+ widget:
+ - text: My name is El Microondas the Wise, and
+   example_title: El Microondas
+ - text: Kennesaw State University is a public
+   example_title: Kennesaw State University
+ - text: Bungie Studios is an American video game developer. They are most famous for
+     developing the award winning Halo series of video games. They also made Destiny.
+     The studio was founded
+   example_title: Bungie
+ - text: The Mona Lisa is a world-renowned painting created by
+   example_title: Mona Lisa
+ - text: The Harry Potter series, written by J.K. Rowling, begins with the book titled
+   example_title: Harry Potter Series
+ - text: 'Question: I have cities, but no houses. I have mountains, but no trees. I
+     have water, but no fish. What am I?
+
+     Answer:'
+   example_title: Riddle
+ - text: The process of photosynthesis involves the conversion of
+   example_title: Photosynthesis
+ - text: Jane went to the store to buy some groceries. She picked up apples, oranges,
+     and a loaf of bread. When she got home, she realized she forgot
+   example_title: Story Continuation
+ - text: 'Problem 2: If a train leaves Station A at 9:00 AM and travels at 60 mph,
+     and another train leaves Station B at 10:00 AM and travels at 80 mph, when will
+     they meet if the distance between the stations is 300 miles?
+
+     To determine'
+   example_title: Math Problem
+ - text: In the context of computer programming, an algorithm is
+   example_title: Algorithm Definition
  ---
+ # Mixsmol-4x400M-v0.1 by Ontocord
+ This is the third checkpoint (Epoch 3) of Mixsmol-4x400M-v0.1.
+ Note that this is an experiment in data mixing. Therefore, we only trained the model on 50B tokens (95% English and 5% Vietnamese) to test the following:
+ - Reasoning capabilities through pretraining on high-quality synthetic textbook data
+ - Cross-lingual understanding through machine translation and multilingual, multi-task pretraining
+
+ After verifying our hypothesis with this run, we will schedule a second run with more data and compute so the model can reach its full capability.
+
+ ## Data
+ - Synthetic Textbooks: 8M samples
+ - RefinedWeb: 1M samples
+ - RedPajama-v2: 500K samples
+ - MathPile: Everything
+ - ThePile: MiniPile Subset
+ - GoodWiki
+ - The Stack Smol XL
+ - The Vault: train_small split
+ - Instruction Pretraining: 250K samples
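
As a rough check on the language mix stated in the new README (50B tokens, 95% English and 5% Vietnamese), the sketch below converts the percentages into approximate token counts. The helper name `tokens_per_language` is hypothetical and not part of the repository; the figures come from the README text only, and the sketch assumes the percentages are measured in tokens rather than in samples.

```python
# Token budget and language shares as stated in the README text.
# Assumption: the 95% / 5% split is measured in tokens, not in samples.
TOTAL_TOKENS = 50_000_000_000
LANGUAGE_SHARE_PERCENT = {"English": 95, "Vietnamese": 5}


def tokens_per_language(total_tokens, share_percent):
    """Split a token budget by integer percentages (hypothetical helper)."""
    if sum(share_percent.values()) != 100:
        raise ValueError("shares must sum to 100 percent")
    # Integer arithmetic avoids floating-point rounding on large counts.
    return {lang: total_tokens * pct // 100 for lang, pct in share_percent.items()}


split = tokens_per_language(TOTAL_TOKENS, LANGUAGE_SHARE_PERCENT)
for lang, count in split.items():
    print(f"{lang}: {count / 1e9:.1f}B tokens")
```

Under these assumptions the run would have seen roughly 47.5B English tokens and 2.5B Vietnamese tokens.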