---
license: other
---
## airoboros-gpt-3.5-turbo-100k-7b
This is a 7b parameter LLaMA model, fine-tuned on 100k synthetic instruction/response pairs generated by gpt-3.5-turbo using my version of self-instruct, [airoboros](https://github.com/jondurbin/airoboros).
Links:
* [airoboros](https://github.com/jondurbin/airoboros)
* [instructions.jsonl](https://storage.googleapis.com/airoboros-dump/gpt-3.5-turbo-100k/instructions.jsonl)
* [topics.txt](https://storage.googleapis.com/airoboros-dump/gpt-3.5-turbo-100k/topics-d732f92dd90a1a5337a4a02ddeaec72b.txt)
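Each line of instructions.jsonl is a standalone JSON object with `instruction` and `response` keys (the fields consumed by the conversion script below); for example, to peek at the first pair:
```
import json

# Each line of instructions.jsonl is a JSON object holding one
# instruction/response pair.
with open("instructions.jsonl") as f:
    first = json.loads(f.readline())
print(first["instruction"])
print(first["response"])
```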
### Prompt generation
```
airoboros generate-instructions --instruction-count 100000 --concurrency 100 --temperature 1.0
```
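This generates 100,000 instruction/response pairs against gpt-3.5-turbo, issuing up to 100 concurrent API requests with sampling temperature 1.0.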
### Fine-tuning
The instructions.jsonl file was converted to the conversation style expected by the FastChat training scripts, and the model was then trained with:
```
torchrun --nproc_per_node=8 --master_port=20001 train_mem.py \
--model_name_or_path /workspace/llama-7b-hf \
--data_path ./as_conversations.json \
--bf16 True \
--output_dir /workspace/airoboros-gpt-3.5-100k-7b \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 32 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "steps" \
--eval_steps 1500 \
--save_strategy "steps" \
--save_steps 1500 \
--save_total_limit 8 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.04 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap offload" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True
```
Training took roughly 22 hours on 8x NVIDIA A100 80GB GPUs.
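With a per-device train batch size of 4, gradient accumulation of 4, and 8 GPUs, the effective global batch size works out to 4 × 4 × 8 = 128 sequences per optimizer step.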
Conversion to conversation style:
```
import json
import uuid
# Each line of instructions.jsonl is one instruction/response pair.
inputs = [json.loads(line) for line in open("instructions.jsonl").readlines()]

conversations = []
for row in inputs:
    instruction = row["instruction"]
    conversations.append({
        "id": str(uuid.uuid4()),
        "conversations": [
            {
                "from": "human",
                "value": instruction,
            },
            {
                "from": "gpt",
                "value": row["response"],
            },
        ],
    })

with open("as_conversations.json", "w") as outfile:
    outfile.write(json.dumps(conversations, indent=2))
```
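A quick sanity check that the output has the shape the FastChat scripts expect (a list of records, each with a two-turn human/gpt conversation):
```
import json

# Spot-check the converted file: ~100k records, each a two-turn conversation.
with open("as_conversations.json") as f:
    data = json.load(f)
print(len(data))
print(data[0]["conversations"][0]["from"], "->", data[0]["conversations"][1]["from"])  # human -> gpt
```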
## Evaluation
I used the same questions as [WizardVicunaLM](https://github.com/melodysdreamj/WizardVicunaLM):
| instruction | gpt3.5 | wizard-vicuna-13b | vicuna-13b | wizard-7b | airoboros-gpt-3.5-turbo-100k-7b |
| --- | --- | --- | --- | --- | --- |
| "Write a compelling product launch announcement email to inform our customers of our new software solution." | 95 | 92 | 89 | 90 | 91 |
| "Draft an apology email to a customer who experienced a delay in their order, and provide reassurance that the issue has been resolved." | 94 | 96 | 90 | 89 | 91 |
| "As a pirate captain, what would you say to your crew to motivate them to search for hidden treasure?" | 95 | 90 | 80 | 70 | 85 |
| "Imagine you are a time traveler from the year 3000. What technological advancements would you tell people about?" | 95 | 92 | 90 | 88 | 85 |
| "As a space colonist on Mars, describe your daily life and the challenges you face living on another planet." | 95 | 90 | 87 | 85 | 88 |
| "How can you assess the credibility of a source of information, such as a news article or blog post, without relying solely on the reputation of the author or publisher?" | 93 | 85 | 89 | 87 | 90 |
| "How can observing the behavior of other people in a social situation provide clues about cultural norms and expectations?" | 95 | 90 | 85 | 92 | 80 |
| "How many text messages are sent globally in a minute? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step." | 90 | 70 | 65 | 80 | 85 |
| "What are the main differences between Python and JavaScript programming languages?"| 90 | 85 | 80 | 88 | 82 |
| "What are the differences between plant-based and animal-based protein sources?"| 85 | 92 | 90 | 80 | 94 |
| "Describe a scenario where artificial intelligence could be used to improve the quality and efficiency of healthcare delivery." | 95 | 90 | 92 | 89 | 91 |
| "How do cultural, social, and economic factors influence people's food choices, and how can this knowledge be used to promote healthier diets?" | 90 | 85 | 87 | 83 | 84 |
| "How many words are spoken daily on Earth? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step." | 90 | 70 | 80 | 75 | 65 |
| "How many lightning strikes occur on Earth each day? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step." | 90 | 80 | 60 | 70 | 85 |
If we use gpt-3.5 as the baseline (as WizardVicunaLM/Vicuna did), dividing each model's score by gpt-3.5's score on the same question, we get the following relative scores (rounded here to four decimal places):
| gpt3.5 | wizard-vicuna-13b | vicuna-13b | wizard-7b | airoboros-gpt-3.5-turbo-100k-7b |
| --- | --- | --- | --- | --- |
| 1.0000 | __0.9684__ | 0.9368 | 0.9474 | 0.9579 |
| 1.0000 | __1.0213__ | 0.9574 | 0.9468 | 0.9681 |
| 1.0000 | __0.9474__ | 0.8421 | 0.7368 | 0.8947 |
| 1.0000 | __0.9684__ | 0.9474 | 0.9263 | 0.8947 |
| 1.0000 | __0.9474__ | 0.9158 | 0.8947 | 0.9263 |
| 1.0000 | 0.9140 | 0.9570 | 0.9355 | __0.9677__ |
| 1.0000 | 0.9474 | 0.8947 | __0.9684__ | 0.8421 |
| 1.0000 | 0.7778 | 0.7222 | 0.8889 | __0.9444__ |
| 1.0000 | 0.9444 | 0.8889 | __0.9778__ | 0.9111 |
| 1.0000 | 1.0824 | 1.0588 | 0.9412 | __1.1059__ |
| 1.0000 | 0.9474 | __0.9684__ | 0.9368 | 0.9579 |
| 1.0000 | 0.9444 | __0.9667__ | 0.9222 | 0.9333 |
| 1.0000 | 0.7778 | __0.8889__ | 0.8333 | 0.7222 |
| 1.0000 | 0.8889 | 0.6667 | 0.7778 | __0.9444__ |
Average scores:
```
gpt3.5 1.000000
wizard-vicuna-13b 0.934090
vicuna-13b 0.900847
wizard-7b 0.902428
airoboros-gpt-3.5-turbo-100k-7b 0.926496
```
As you can see, the __7b__ airoboros model is competitive with the 13b models, outscoring vicuna-13b and coming in just behind wizard-vicuna-13b.
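For reference, a minimal Python sketch that reproduces the relative-score averages from the raw ratings in the evaluation table above:
```
# Divide each model's score by gpt-3.5's score on the same question,
# then average the 14 ratios per model.
scores = {
    "gpt3.5":                          [95, 94, 95, 95, 95, 93, 95, 90, 90, 85, 95, 90, 90, 90],
    "wizard-vicuna-13b":               [92, 96, 90, 92, 90, 85, 90, 70, 85, 92, 90, 85, 70, 80],
    "vicuna-13b":                      [89, 90, 80, 90, 87, 89, 85, 65, 80, 90, 92, 87, 80, 60],
    "wizard-7b":                       [90, 89, 70, 88, 85, 87, 92, 80, 88, 80, 89, 83, 75, 70],
    "airoboros-gpt-3.5-turbo-100k-7b": [91, 91, 85, 85, 88, 90, 80, 85, 82, 94, 91, 84, 65, 85],
}

baseline = scores["gpt3.5"]
for model, vals in scores.items():
    ratios = [v / b for v, b in zip(vals, baseline)]
    print(f"{model:35} {sum(ratios) / len(ratios):.6f}")
```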
## License
The model weights are subject to the LLaMA license, and the dataset is subject to OpenAI's terms of use because it was generated with gpt-3.5-turbo (ChatGPT). Everything else is free.