|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- cerebras/SlimPajama-627B |
|
language: |
|
- en |
|
--- |
|
|
|
# Overview |
|
|
|
This is the repo for intermediate checkpoints of my upcoming **MicroLlama V2** model, a 500-million-parameter model based on **Llama3.2**.
|
They are pretrained from scratch on **SlimPajama-627B**.
|
This project is still a work in progress: I have only trained on 5B tokens so far, and I will keep training until I run out of funds.
|
|
|
Some reasons for using these checkpoints: |
|
|
|
- You can use them as a starting point to train your own small language model.
|
- More interestingly, you can probe the learning process of these models to understand how an LLM learns to mimic humans.
|
|
|
# How to use these checkpoints |
|
|
|
These checkpoints are compatible with [litgpt](https://github.com/Lightning-AI/litgpt) with slight modifications (see section below). |
|
|
|
To load them with Hugging Face transformers, you first need to convert a litgpt pretraining checkpoint into a litgpt inference-only checkpoint (no code modification is required):
|
|
|
``` |
|
# Install litgpt |
|
pip install 'litgpt[all]' |
|
|
|
# litgpt pretrain checkpoint to inference checkpoint |
|
litgpt convert_pretrained_checkpoint <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> \ |
|
--output_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> |
|
|
|
# litgpt inference checkpoint to HF checkpoints |
|
litgpt convert_from_litgpt <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> <LOCAL_OUTPUT_PATH_TO_CONVERTED_HF_CHECKPOINT> |
|
``` |
|
|
|
References:
|
|
|
1. litgpt pretrain checkpoint to inference checkpoint https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain_tinyllama.md#export-checkpoints |
|
2. litgpt inference checkpoint to HF checkpoints https://github.com/Lightning-AI/litgpt/blob/main/tutorials/convert_lit_models.md |
|
|
|
**Caveat**: for some reason the auto-generated config.json in the converted checkpoint is incorrect. Replace it with https://huggingface.co/keeeeenw/MicroLlama2-checkpoints/blob/main/config.json to resolve inference or evaluation errors.
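
Once converted, the HF-style checkpoint can be loaded with Hugging Face transformers. Below is a minimal loading sketch based on the litgpt conversion tutorial linked above; the directory path is a placeholder, and it assumes the converted folder contains the `model.pth` state dict written by `convert_from_litgpt` together with the corrected `config.json` from the caveat above.

```python
# Minimal loading sketch (paths are placeholders, not files shipped in this repo).
# Assumes the converted directory holds model.pth (from `litgpt convert_from_litgpt`)
# and the corrected config.json mentioned in the caveat above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt_dir = "<LOCAL_OUTPUT_PATH_TO_CONVERTED_HF_CHECKPOINT>"
state_dict = torch.load(f"{ckpt_dir}/model.pth", map_location="cpu")
model = AutoModelForCausalLM.from_pretrained(ckpt_dir, state_dict=state_dict)

# The model was pretrained with the Llama-3.2-1B tokenizer (gated repo, HF token required).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
inputs = tokenizer("The quick brown fox", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```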
|
|
|
# Advanced usage - pretraining with litgpt |
|
|
|
For folks who are familiar with [litgpt](https://github.com/Lightning-AI/litgpt), you can add the following entry to litgpt's `config.py` and use these checkpoints to continue training the model.
|
|
|
```python |
|
# based on Llama-3.2-1B |
|
dict( |
|
name="micro-llama-300M-v2", |
|
hf_config=dict(org="keeeeenw", name="MicroLlamaV2"), |
|
block_size=131072, # Stable choice for Llama model training |
|
# This contributes to 300M to 500M parameter increase |
|
# Note that we cannot change this number because the llama3 |
|
# tokenizer is hardcoded to support this vocab size. |
|
vocab_size=128000, |
|
padded_vocab_size=128256, |
|
n_layer=12, |
|
n_embd=1024, |
|
n_head=16, |
|
n_query_groups=4, |
|
rotary_percentage=1.0, |
|
parallel_residual=False, |
|
bias=False, |
|
norm_class_name="RMSNorm", |
|
mlp_class_name="LLaMAMLP", |
|
intermediate_size=5632, |
|
rope_base=500000, # Scaling for long sequence support |
|
# RoPE adjustments for block size of 131072 |
|
rope_adjustments=dict( |
|
factor=16.0, # Matches block_size=131072 |
|
low_freq_factor=1.0, |
|
high_freq_factor=4.0, |
|
original_max_seq_len=8192 # Max seq length for 128K token block |
|
) |
|
), |
|
``` |
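
After adding the entry, a quick way to confirm litgpt picks it up is to resolve the config by name. This is a small sketch (assuming `Config.from_name` is exposed at the top-level `litgpt` package, which may vary slightly across litgpt versions):

```python
# Sanity check that the new config entry resolves (assumes a recent litgpt version).
from litgpt import Config

cfg = Config.from_name("micro-llama-300M-v2")
print(cfg.n_layer, cfg.n_embd, cfg.n_head, cfg.padded_vocab_size)
# Expected: 12 1024 16 128256
```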
|
|
|
You will need to preprocess your data with the **meta-llama/Llama-3.2-1B** tokenizer, similar to [prepare-the-tinyllama-1t-token-dataset](https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain_tinyllama.md#download-datasets), which uses the Llama2 tokenizer.
|
|
|
Assuming you already have litgpt installed:
|
``` |
|
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B data |
|
|
|
litgpt download meta-llama/Llama-3.2-1B \ |
|
--access_token your_hf_token \ |
|
--tokenizer_only true |
|
|
|
python litgpt/data/prepare_slimpajama.py \ |
|
--input_dir data/slimpajama-raw/train \ |
|
--output_dir data/slimpajama/train \ |
|
--tokenizer_path checkpoints/meta-llama/Llama-3.2-1B |
|
|
|
python litgpt/data/prepare_slimpajama.py \ |
|
--input_dir data/slimpajama-raw/validation \ |
|
--output_dir data/slimpajama/val \ |
|
--tokenizer_path checkpoints/meta-llama/Llama-3.2-1B |
|
``` |
|
|
|
Please note that this data preparation step runs on CPU only and will take a long time unless you have a CPU with 96+ cores.
I have tried to share the converted data as an HF dataset, but HF does not support having too many files in the same directory; I will figure out how to distribute the converted dataset later.
|
|
|
Finally, you can use my config to start training: https://huggingface.co/keeeeenw/MicroLlama2-checkpoints/blob/main/microllama_v2.yaml
|
|
|
Note: the config has 300M in the model name, but the model is actually ~500M parameters due to the vocabulary size increase from Llama2 to Llama3 (see the back-of-envelope calculation after the command below):
|
``` |
|
litgpt pretrain \ |
|
--config microllama_v2.yaml \ |
|
--resume <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> |
|
``` |
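
As a rough sanity check on the 300M-to-500M jump (my own back-of-envelope estimate, not an exact parameter count): with untied input and output embeddings, the vocabulary contributes about `2 * padded_vocab_size * n_embd` parameters, so growing the vocab from Llama2's 32,000 to Llama3's padded 128,256 adds roughly 197M parameters at `n_embd=1024`.

```python
# Back-of-envelope estimate of the extra parameters from the larger vocabulary.
# Assumes untied input/output embeddings; an approximation, not an exact count.
n_embd = 1024
llama2_vocab = 32_000
llama3_padded_vocab = 128_256

extra = 2 * (llama3_padded_vocab - llama2_vocab) * n_embd
print(f"~{extra / 1e6:.0f}M extra parameters")  # ~197M
```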
|
|
|
**IMPORTANT NOTE** |
|
I have had various issues resuming training from checkpoints when moving from server to server, specifically when I switched from Lightning AI Studio to a private server. For example, Lightning AI Studio may look for your preprocessed data under `/root/.lightning/chunks/` if you store the preprocessed data on S3 and let Lightning AI Studio stream it during training. When I moved to a private server, litgpt looked for the same data under `/cache/chunks/`.
|
|
|
If you run into any issues with resuming training, just convert the checkpoint to an inference checkpoint and start from it with `--initial_checkpoint_dir`:
|
``` |
|
litgpt convert_pretrained_checkpoint <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> \ |
|
--output_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> |
|
|
|
litgpt pretrain \ |
|
--config microllama_v2.yaml \ |
|
--initial_checkpoint_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> |
|
``` |
|
|
|
You will lose the index into the training dataset as well as other hyperparameters such as the learning rate, but this lets you restart your pretraining quickly.
|
|
|
# Evaluation results |
|
|
|
**Note**: this does not represent the final performance of the model and should only serve as a reference for my training progress.
|
``` |
|
checkpoint: step-00088000 |
|
|
|
| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr| |
|
|-------------|------:|------|-----:|--------|-----:|---|-----:| |
|
|piqa | 1|none | 0|acc |0.6202|± |0.0113| |
|
| | |none | 0|acc_norm|0.6213|± |0.0113| |
|
|boolq | 2|none | 0|acc |0.5875|± |0.0086| |
|
|arc_challenge| 1|none | 0|acc |0.1980|± |0.0116| |
|
| | |none | 0|acc_norm|0.2201|± |0.0121| |
|
|arc_easy | 1|none | 0|acc |0.4373|± |0.0102| |
|
| | |none | 0|acc_norm|0.3935|± |0.0100| |
|
|winogrande | 1|none | 0|acc |0.5004|± |0.0141| |
|
|openbookqa | 1|none | 0|acc |0.1760|± |0.0170| |
|
| | |none | 0|acc_norm|0.2680|± |0.0198| |
|
|hellaswag | 1|none | 0|acc |0.2893|± |0.0045| |
|
| | |none | 0|acc_norm|0.3125|± |0.0046| |
|
``` |
|
|
|
You can use the following script to reproduce the results (assuming you have already installed litgpt):
|
``` |
|
MODEL_NAME="step-00088000" |
|
MODEL_OUTPUT_ROOT="MicroLlamaV2-VastAI-Checkpoints/out/pretrain/micro-llama-v2" |
|
MODEL_OUTPUT_REL="${MODEL_OUTPUT_ROOT}/${MODEL_NAME}" |
|
|
|
# HuggingFace |
|
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/lit_model.pth --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/ |
|
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/generation_config.json --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/ |
|
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/hyperparameters.yaml --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/ |
|
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/model_config.yaml --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/ |
|
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/tokenizer.json --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/ |
|
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/tokenizer_config.json --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/ |
|
|
|
# Copy config, see "caveat" below |
|
cp -r <local_path>/config.json checkpoints/${MODEL_OUTPUT_REL}/ |
|
|
|
# AWS |
|
# aws s3 cp s3://microllama-v2/checkpoints/out/pretrain/micro-llama-v2/${MODEL_NAME} checkpoints/${MODEL_OUTPUT_REL} --recursive |
|
|
|
litgpt evaluate \ |
|
${MODEL_OUTPUT_REL} \ |
|
--tasks "hellaswag,openbookqa,winogrande,arc_easy,arc_challenge,boolq,piqa" \ |
|
--device cuda:0 \ |
|
--batch_size 16 |
|
``` |
|
**Caveat**: for some reason the auto-generated config.json in the checkpoint is incorrect; replace it with https://huggingface.co/keeeeenw/MicroLlama2-checkpoints/blob/main/config.json to resolve the evaluation error.
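
If you prefer to fetch the corrected config.json programmatically rather than copying a local file, here is a hedged sketch using the `huggingface_hub` API (it assumes config.json sits at the root of the repo, as the URL above suggests):

```python
# Download the corrected config.json into the evaluation checkpoint directory.
# Assumes config.json lives at the root of keeeeenw/MicroLlama2-checkpoints.
from huggingface_hub import hf_hub_download

ckpt_dir = "checkpoints/MicroLlamaV2-VastAI-Checkpoints/out/pretrain/micro-llama-v2/step-00088000"
hf_hub_download(
    repo_id="keeeeenw/MicroLlama2-checkpoints",
    filename="config.json",
    local_dir=ckpt_dir,
)
```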
|
|
|
|