--- |
|
license: mit |
|
datasets: |
|
- tiiuae/falcon-refinedweb |
|
language: |
|
- en |
|
library_name: transformers |
|
--- |
|
|
|
# NeoBERT |
|
|
|
[Model on Hugging Face](https://huggingface.co/chandar-lab/NeoBERT)
|
|
|
NeoBERT is a **next-generation encoder** model for English text representation, pre-trained from scratch on the RefinedWeb dataset. It integrates state-of-the-art advances in architecture, modern data, and optimized pre-training methodology. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an **optimal depth-to-width ratio**, and leverages an extended context length of **4,096 tokens**. Despite a compact footprint of 250M parameters, it is the most efficient model of its kind and achieves **state-of-the-art results** on the MTEB benchmark, outperforming BERT-large, RoBERTa-large, NomicBERT, and ModernBERT under identical fine-tuning conditions.
|
|
|
- Paper: [arXiv:2502.19587](https://arxiv.org/abs/2502.19587)

- Repository: [GitHub](https://github.com/chandar-lab/NeoBERT)
|
|
|
## Get started |
|
|
|
Ensure you have the following dependencies installed: |
|
|
|
```bash
pip install transformers torch xformers==0.0.28.post3
```
|
|
|
If you would like to use sequence packing (un-padding), you will also need to install flash-attention:
|
|
|
```bash
pip install transformers torch xformers==0.0.28.post3 flash_attn
```
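
If you want to confirm your environment before loading the model, a quick import check like the following (a minimal sketch, not part of the official instructions) reports the installed versions:

```python
# Minimal environment check: confirm the core dependencies are importable
# and print their versions. flash_attn is optional and only needed for
# sequence packing (un-padding).
import torch
import transformers
import xformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("xformers:", xformers.__version__)

try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
except ImportError:
    print("flash_attn not installed (optional)")
```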
|
|
|
## How to use |
|
|
|
Load the model using Hugging Face Transformers: |
|
|
|
```python
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Tokenize input text
text = "NeoBERT is the most efficient model of its kind!"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings and take the [CLS] (first-token) hidden state
# as the sentence representation
outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]
print(embedding.shape)
```
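
Beyond a single sentence, the same [CLS] embedding can be used to compare texts. The following is a minimal sketch rather than an official recipe (the reported MTEB results are obtained after fine-tuning, so raw pre-trained embeddings only give a rough similarity signal); each sentence is encoded separately to keep the example identical to the snippet above:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

sentences = [
    "NeoBERT is a next-generation encoder model.",
    "NeoBERT is an efficient text encoder.",
    "It is snowing heavily in Montreal today.",
]

# Encode each sentence separately (same path as the single-sentence example
# above) and keep the [CLS] (first-token) hidden state as its embedding.
embeddings = []
with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        embeddings.append(outputs.last_hidden_state[:, 0, :])

# L2-normalize so dot products are cosine similarities.
embeddings = F.normalize(torch.cat(embeddings, dim=0), p=2, dim=1)

# Cosine similarity of the first sentence against the other two.
print(embeddings[0] @ embeddings[1:].T)
```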
|
|
|
## Features |
|
| **Feature**               | **NeoBERT**                 |
|---------------------------|-----------------------------|
| `Depth-to-width`          | 28 × 768                    |
| `Parameter count`         | 250M                        |
| `Activation`              | SwiGLU                      |
| `Positional embeddings`   | RoPE                        |
| `Normalization`           | Pre-RMSNorm                 |
| `Data Source`             | RefinedWeb                  |
| `Data Size`               | 2.8 TB                      |
| `Tokenizer`               | google/bert                 |
| `Context length`          | 4,096                       |
| `MLM Masking Rate`        | 20%                         |
| `Optimizer`               | AdamW                       |
| `Scheduler`               | CosineDecay                 |
| `Training Tokens`         | 2.1 T                       |
| `Efficiency`              | FlashAttention              |
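
To cross-check the numbers in this table against the released checkpoint, you can print the configuration that ships with it. This is a small illustrative sketch; the exact field names come from NeoBERT's custom configuration class, so they may differ from standard BERT configs:

```python
from transformers import AutoConfig

# The checkpoint ships a custom configuration class, hence trust_remote_code.
config = AutoConfig.from_pretrained("chandar-lab/NeoBERT", trust_remote_code=True)

# Look for the hidden size (768), number of layers (28), and maximum
# context length (4,096) reported in the table above.
print(config)
```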
|
|
|
## License |
|
|
|
The model weights and the code repository are released under the permissive MIT license.
|
|
|
## Citation |
|
|
|
If you use this model in your research, please cite: |
|
|
|
```bibtex |
|
@misc{breton2025neobertnextgenerationbert,
      title={NeoBERT: A Next-Generation BERT},
      author={Lola Le Breton and Quentin Fournier and Mariam El Mezouar and Sarath Chandar},
      year={2025},
      eprint={2502.19587},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.19587},
}
|
``` |
|
|
|
## Contact |
|
|
|
For questions, do not hesitate to reach out by opening an issue here on the Hub or on our **[GitHub](https://github.com/chandar-lab/NeoBERT)**.
|
|
|
--- |
|
|