EntiGraph CPT Model (based on Llama 3 8B)
Model Description
The EntiGraph CPT model is continually pretrained from the Llama 3 8B base model using the Synthetic Continued Pretraining approach of Yang et al. (2024) with the EntiGraph synthetic data augmentation algorithm. The model was trained on a synthetic corpus generated from the QuALITY dataset so that it efficiently acquires the domain-specific knowledge contained in that corpus. The code used to train the model is available in the Synthetic Continued Pretraining GitHub repo.
Model Details
- Developed by: Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, and Tatsunori Hashimoto
- Model type: Causal Language Model
- Language(s): English
- License: Apache 2.0
- Finetuned from model: Llama 3 8B
Uses
Intended Use
This model is intended for research purposes and applications requiring domain-specific knowledge related to the QuALITY dataset. It can be used for tasks such as closed-book question answering, summarization, and other NLP tasks within the domain of the training data.
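Below is a minimal usage sketch showing how the model could be loaded with the transformers library and prompted for closed-book question answering. The repo id `zitongyang/entigraph-cpt` and the example question are placeholders, not taken from this card; substitute the actual Hub repo id for this model.

```python
# Minimal usage sketch (placeholder repo id -- substitute the actual one for this model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zitongyang/entigraph-cpt"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Closed-book QA: the question is asked without providing the source article,
# relying on knowledge acquired during continued pretraining. The question below
# is an invented example, not from QuALITY.
prompt = "Question: In the story, why does the narrator return to the lighthouse?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Note that this is a base (non-instruct) causal language model, so completion-style prompts like the one above tend to work better than chat-style instructions.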
Out-of-Scope Use
This model should not be used for generating factual information outside the scope of its training data or for any malicious purposes.
Training Details
Training Data
The model was trained on a 455M token synthetic corpus generated by the EntiGraph algorithm from the QuALITY dataset.
Training Procedure
- Pretraining: Continued pretraining on the EntiGraph synthetic corpus
- Hyperparameters (a configuration sketch follows this list):
  - Learning rate: 5e-06
  - Batch size: 16
  - Weight decay: 0.01
  - Warmup ratio: 0.05
  - Epochs: 2
  - RedPajama replay rate: 0.1
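The following is a simplified sketch of a continued-pretraining configuration mirroring the hyperparameters above; the authors' actual training code lives in the Synthetic Continued Pretraining GitHub repo, and details not stated in this card (precision, learning-rate schedule, how replay is mixed in) are marked as assumptions.

```python
# Hedged configuration sketch -- not the authors' training script.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="entigraph-cpt",
    learning_rate=5e-6,
    per_device_train_batch_size=16,  # assumption: effective batch size 16 on one device
    weight_decay=0.01,
    warmup_ratio=0.05,
    num_train_epochs=2,
    bf16=True,                       # assumption: mixed-precision training
    lr_scheduler_type="cosine",      # assumption: schedule not stated in this card
)

# The 0.1 RedPajama replay rate means roughly 10% of training examples come from
# RedPajama to mitigate forgetting of general knowledge. One simple way to realize
# this with the datasets library (variables here are illustrative):
# from datasets import interleave_datasets
# mixed = interleave_datasets([entigraph_corpus, redpajama], probabilities=[0.9, 0.1])
```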
Evaluation
The model has been evaluated on the QuALITY question answering dataset, demonstrating improved performance in closed-book QA tasks compared to the base model.
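As an illustration of closed-book multiple-choice evaluation (this is not the authors' evaluation harness), one simple approach is to score each QuALITY answer option by its log-likelihood under the model and pick the highest-scoring one. The sketch below reuses the `model` and `tokenizer` objects loaded in the usage example above and assumes the tokenization of the prefix is a prefix of the tokenization of the full string.

```python
# Illustrative option-scoring sketch for closed-book multiple-choice QA.
import torch

def option_logprob(model, tokenizer, question, option):
    prefix = f"Question: {question}\nAnswer:"
    full = prefix + " " + option
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    full_ids = tokenizer(full, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probability assigned to each next token, summed over the answer tokens only.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(2, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    n_prefix = prefix_ids.shape[1]
    return token_lp[0, n_prefix - 1:].sum().item()

# prediction = max(options, key=lambda o: option_logprob(model, tokenizer, question, o))
```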
Limitations and Biases
While the EntiGraph CPT model shows improved performance on domain-specific tasks, it may inherit biases present in the original Llama 3 8B model and the QuALITY dataset. Users should be aware of potential limitations in generating content outside its training domain.
Citation
If you use this model, please cite the original paper:
@misc{yang2024syntheticcontinuedpretraining,
title={Synthetic continued pretraining},
author={Zitong Yang and Neil Band and Shuangping Li and Emmanuel Candès and Tatsunori Hashimoto},
year={2024},
eprint={2409.07431},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2409.07431},
}
Ethical Considerations
Users of this model should be aware of the ethical implications of using large language models and ensure responsible use in applications.