|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- cerebras/SlimPajama-627B |
|
language: |
|
- en |
|
--- |
|
|
|
# Overview |
|
|
|
This is the repo for intermediate checkpoints of my upcoming **MicroLlama V2** model, a 500-million-parameter model based on **Llama3.2**.
|
They are pretrained from scratch on **SlimPajama-627B**.
|
This project is still a work in progress: I have only trained on 5B tokens so far, and I will keep training until I run out of funds.
|
|
|
Some reasons for using these checkpoints: |
|
|
|
- You can use them as a starting point to train your own small language model.
|
- More interestingly, you can probe the learning process of these models to understand how an LLM learns to mimic humans.
|
|
|
# How to use these checkpoints |
|
|
|
These checkpoints are compatible with [litgpt](https://github.com/Lightning-AI/litgpt) with slight modifications (see section below). |
|
|
|
To load them with Hugging Face transformers, you first need to convert a litgpt pretraining checkpoint into a litgpt inference-only checkpoint (no code modification is required):
|
|
|
``` |
|
# Install litgpt |
|
pip install 'litgpt[all]' |
|
|
|
# litgpt pretrain checkpoint to inference checkpoint |
|
litgpt convert_pretrained_checkpoint <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> \ |
|
--output_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> |
|
|
|
# litgpt inference checkpoint to HF checkpoints |
|
litgpt convert_from_litgpt <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> <LOCAL_OUTPUT_PATH_TO_CONVERTED_HF_CHECKPOINT> |
|
``` |
|
|
|
References:
|
|
|
1. litgpt pretrain checkpoint to inference checkpoint https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain_tinyllama.md#export-checkpoints |
|
2. litgpt inference checkpoint to HF checkpoints https://github.com/Lightning-AI/litgpt/blob/main/tutorials/convert_lit_models.md |
|
|
|
**Caveat**: for some reason the auto-generated config.json in the converted checkpoint is incorrect. Replace it with https://huggingface.co/keeeeenw/MicroLlama2-checkpoints/blob/main/config.json to resolve inference or evaluation errors.
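
Once converted, the HF-style checkpoint can be loaded with Hugging Face transformers. Below is a minimal loading sketch based on the litgpt conversion tutorial linked above; the directory path is a placeholder, and it assumes the converted folder contains the `model.pth` state dict written by `convert_from_litgpt` together with the corrected `config.json` from the caveat above.

```python
# Minimal loading sketch (paths are placeholders, not files shipped in this repo).
# Assumes the converted directory holds model.pth (from `litgpt convert_from_litgpt`)
# and the corrected config.json mentioned in the caveat above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt_dir = "<LOCAL_OUTPUT_PATH_TO_CONVERTED_HF_CHECKPOINT>"
state_dict = torch.load(f"{ckpt_dir}/model.pth", map_location="cpu")
model = AutoModelForCausalLM.from_pretrained(ckpt_dir, state_dict=state_dict)

# The model was pretrained with the Llama-3.2-1B tokenizer (gated repo, HF token required).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
inputs = tokenizer("The quick brown fox", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```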
|
|
|
# Advanced usage - pretraining with litgpt |
|
|
|
For folks who are familiar with [litgpt](https://github.com/Lightning-AI/litgpt), you can add the following entry to litgpt's `config.py` and use these checkpoints to continue training the model.
|
|
|
```python |
|
# based on Llama-3.2-1B |
|
dict( |
|
name="micro-llama-300M-v2", |
|
hf_config=dict(org="keeeeenw", name="MicroLlamaV2"), |
|
block_size=131072, # Stable choice for Llama model training |
|
# This contributes to 300M to 500M parameter increase |
|
# Note that we cannot change this number because the llama3 |
|
# tokenizer is hardcoded to support this vocab size. |
|
vocab_size=128000, |
|
padded_vocab_size=128256, |
|
n_layer=12, |
|
n_embd=1024, |
|
n_head=16, |
|
n_query_groups=4, |
|
rotary_percentage=1.0, |
|
parallel_residual=False, |
|
bias=False, |
|
norm_class_name="RMSNorm", |
|
mlp_class_name="LLaMAMLP", |
|
intermediate_size=5632, |
|
rope_base=500000, # Scaling for long sequence support |
|
# RoPE adjustments for block size of 131072 |
|
rope_adjustments=dict( |
|
factor=16.0, # Matches block_size=131072 |
|
low_freq_factor=1.0, |
|
high_freq_factor=4.0, |
|
original_max_seq_len=8192 # Max seq length for 128K token block |
|
) |
|
), |
|
``` |
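
After adding the entry, a quick way to confirm litgpt picks it up is to resolve the config by name. This is a small sketch (assuming `Config.from_name` is exposed at the top-level `litgpt` package, which may vary slightly across litgpt versions):

```python
# Sanity check that the new config entry resolves (assumes a recent litgpt version).
from litgpt import Config

cfg = Config.from_name("micro-llama-300M-v2")
print(cfg.n_layer, cfg.n_embd, cfg.n_head, cfg.padded_vocab_size)
# Expected: 12 1024 16 128256
```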
|
|
|
You will need to preprocess your data with the **meta-llama/Llama-3.2-1B** tokenizer, similar to [prepare-the-tinyllama-1t-token-dataset](https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain_tinyllama.md#download-datasets), which uses the Llama2 tokenizer.
|
|
|
Assuming you already have litgpt installed:
|
``` |
|
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B data |
|
|
|
litgpt download meta-llama/Llama-3.2-1B \ |
|
--access_token your_hf_token \ |
|
--tokenizer_only true |
|
|
|
python litgpt/data/prepare_slimpajama.py \ |
|
--input_dir data/slimpajama-raw/train \ |
|
--output_dir data/slimpajama/train \ |
|
--tokenizer_path checkpoints/meta-llama/Llama-3.2-1B |
|
|
|
python litgpt/data/prepare_slimpajama.py \ |
|
--input_dir data/slimpajama-raw/validation \ |
|
--output_dir data/slimpajama/val \ |
|
--tokenizer_path checkpoints/meta-llama/Llama-3.2-1B |
|
``` |
|
|
|
Please note that this data preparation step runs on CPU only and will take a long time unless you have a CPU with 96+ cores.
I have tried to share the converted data as an HF dataset, but HF does not support having too many files in the same directory; I will figure out how to distribute the converted dataset later.
|
|
|
Finally, you can use my config to start training: https://huggingface.co/keeeeenw/MicroLlama2-checkpoints/blob/main/microllama_v2.yaml
|
|
|
Note: the config has 300M in the model name, but the model is actually ~500M parameters due to the vocabulary size increase from Llama2 to Llama3 (see the back-of-envelope calculation after the command below):
|
``` |
|
litgpt pretrain \ |
|
--config microllama_v2.yaml \ |
|
--resume <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> |
|
``` |
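
As a rough sanity check on the 300M-to-500M jump (my own back-of-envelope estimate, not an exact parameter count): with untied input and output embeddings, the vocabulary contributes about `2 * padded_vocab_size * n_embd` parameters, so growing the vocab from Llama2's 32,000 to Llama3's padded 128,256 adds roughly 197M parameters at `n_embd=1024`.

```python
# Back-of-envelope estimate of the extra parameters from the larger vocabulary.
# Assumes untied input/output embeddings; an approximation, not an exact count.
n_embd = 1024
llama2_vocab = 32_000
llama3_padded_vocab = 128_256

extra = 2 * (llama3_padded_vocab - llama2_vocab) * n_embd
print(f"~{extra / 1e6:.0f}M extra parameters")  # ~197M
```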
|
|
|
**IMPORTANT NOTE** |
|
I have had various issues resuming training from checkpoints when moving from server to server, specifically when I switched from Lightning AI Studio to a private server. For example, Lightning AI Studio may look for your preprocessed data under `/root/.lightning/chunks/` if you store the preprocessed data on S3 and let Lightning AI Studio stream it during training. When I moved to a private server, litgpt looked for the same data under `/cache/chunks/`.
|
|
|
If you run into any issues with resuming training, just convert the checkpoint to an inference checkpoint and start from it with `--initial_checkpoint_dir`:
|
``` |
|
litgpt convert_pretrained_checkpoint <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> \ |
|
--output_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> |
|
|
|
litgpt pretrain \ |
|
--config microllama_v2.yaml \ |
|
--initial_checkpoint_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> |
|
``` |
|
|
|
You will lose the index into the training dataset as well as other hyperparameters such as the learning rate, but this lets you restart your pretraining quickly.
|
|
|
# Evaluation results |
|
|
|
**Note**: this does not represent the final performance of the model and should only serve as a reference for my training progress.
|
``` |
|
checkpoint: step-00088000 |
|
|
|
| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr| |
|
|-------------|------:|------|-----:|--------|-----:|---|-----:| |
|
|piqa | 1|none | 0|acc |0.6202|± |0.0113| |
|
| | |none | 0|acc_norm|0.6213|± |0.0113| |
|
|boolq | 2|none | 0|acc |0.5875|± |0.0086| |
|
|arc_challenge| 1|none | 0|acc |0.1980|± |0.0116| |
|
| | |none | 0|acc_norm|0.2201|± |0.0121| |
|
|arc_easy | 1|none | 0|acc |0.4373|± |0.0102| |
|
| | |none | 0|acc_norm|0.3935|± |0.0100| |
|
|winogrande | 1|none | 0|acc |0.5004|± |0.0141| |
|
|openbookqa | 1|none | 0|acc |0.1760|± |0.0170| |
|
| | |none | 0|acc_norm|0.2680|± |0.0198| |
|
|hellaswag | 1|none | 0|acc |0.2893|± |0.0045| |
|
| | |none | 0|acc_norm|0.3125|± |0.0046| |
|
``` |
|
|
|
You can use the following script to reproduce the results (assuming you have already installed litgpt):
|
``` |
|
MODEL_NAME="step-00088000" |
|
MODEL_OUTPUT_ROOT="MicroLlamaV2-VastAI-Checkpoints/out/pretrain/micro-llama-v2" |
|
MODEL_OUTPUT_REL="${MODEL_OUTPUT_ROOT}/${MODEL_NAME}" |
|
|
|
# HuggingFace |
|
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/lit_model.pth --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/ |
|
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/generation_config.json --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/ |
|
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/hyperparameters.yaml --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/ |
|
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/model_config.yaml --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/ |
|
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/tokenizer.json --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/ |
|
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/tokenizer_config.json --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/ |
|
|
|
# Copy config, see "caveat" below |
|
cp -r <local_path>/config.json checkpoints/${MODEL_OUTPUT_REL}/ |
|
|
|
# AWS |
|
# aws s3 cp s3://microllama-v2/checkpoints/out/pretrain/micro-llama-v2/${MODEL_NAME} checkpoints/${MODEL_OUTPUT_REL} --recursive |
|
|
|
litgpt evaluate \ |
|
${MODEL_OUTPUT_REL} \ |
|
--tasks "hellaswag,openbookqa,winogrande,arc_easy,arc_challenge,boolq,piqa" \ |
|
--device cuda:0 \ |
|
--batch_size 16 |
|
``` |
|
**Caveat**: for some reason the auto-generated config.json in the checkpoint is incorrect; replace it with https://huggingface.co/keeeeenw/MicroLlama2-checkpoints/blob/main/config.json to resolve the evaluation error.
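
If you prefer to fetch the corrected config.json programmatically rather than copying a local file, here is a hedged sketch using the `huggingface_hub` API (it assumes config.json sits at the root of the repo, as the URL above suggests):

```python
# Download the corrected config.json into the evaluation checkpoint directory.
# Assumes config.json lives at the root of keeeeenw/MicroLlama2-checkpoints.
from huggingface_hub import hf_hub_download

ckpt_dir = "checkpoints/MicroLlamaV2-VastAI-Checkpoints/out/pretrain/micro-llama-v2/step-00088000"
hf_hub_download(
    repo_id="keeeeenw/MicroLlama2-checkpoints",
    filename="config.json",
    local_dir=ckpt_dir,
)
```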
|
|
|
|