---
license: mit
datasets:
- tiiuae/falcon-refinedweb
language:
- en
library_name: transformers
---

# NeoBERT

[![Hugging Face Model Card](https://img.shields.io/badge/Hugging%20Face-Model%20Card-blue)](https://huggingface.co/chandar-lab/NeoBERT)

NeoBERT is a **next-generation encoder** model for English text representation, pre-trained from scratch on the RefinedWeb dataset. NeoBERT integrates state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. It is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an **optimal depth-to-width ratio**, and leverages an extended context length of **4,096 tokens**. Despite its compact 250M parameter footprint, it is the most efficient model of its kind and achieves **state-of-the-art results** on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions.

- Paper: [paper](https://arxiv.org/abs/2502.19587)
- Repository: [github](https://github.com/chandar-lab/NeoBERT)

## Get started

Ensure you have the following dependencies installed:

```bash
pip install transformers torch xformers==0.0.28.post3
```

If you would like to use sequence packing (un-padding), you will also need to install flash-attention:

```bash
pip install transformers torch xformers==0.0.28.post3 flash_attn
```

## How to use

Load the model using Hugging Face Transformers:

```python
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Tokenize input text
text = "NeoBERT is the most efficient model of its kind!"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings
outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]
print(embedding.shape)
```

Two further usage sketches, mean pooling and long-context inputs, are given at the end of this card.

## Features

| **Feature**              | **NeoBERT**     |
|--------------------------|-----------------|
| `Depth-to-width`         | 28 × 768        |
| `Parameter count`        | 250M            |
| `Activation`             | SwiGLU          |
| `Positional embeddings`  | RoPE            |
| `Normalization`          | Pre-RMSNorm     |
| `Data Source`            | RefinedWeb      |
| `Data Size`              | 2.8 TB          |
| `Tokenizer`              | google/bert     |
| `Context length`         | 4,096           |
| `MLM Masking Rate`       | 20%             |
| `Optimizer`              | AdamW           |
| `Scheduler`              | CosineDecay     |
| `Training Tokens`        | 2.1 T           |
| `Efficiency`             | FlashAttention  |

## License

Model weights and code repository are licensed under the permissive MIT license.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{breton2025neobertnextgenerationbert,
      title={NeoBERT: A Next-Generation BERT},
      author={Lola Le Breton and Quentin Fournier and Mariam El Mezouar and Sarath Chandar},
      year={2025},
      eprint={2502.19587},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.19587},
}
```

## Contact

For questions, do not hesitate to reach out and open an issue here or on our **[GitHub](https://github.com/chandar-lab/NeoBERT)**.
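## Additional examples

The "How to use" example takes the CLS-token embedding. A common alternative for sentence-level representations is attention-mask-aware mean pooling over all token embeddings. The sketch below is not part of the official examples; it only builds on the `AutoModel` outputs shown above, and the example sentences are illustrative.

```python
# A minimal sketch (not from the official card): mean pooling over token
# embeddings, masking out padding positions.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

sentences = [
    "NeoBERT is a next-generation encoder.",
    "It was pre-trained on RefinedWeb.",
]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Zero out padding positions, then average the remaining token embeddings.
mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)
counts = mask.sum(dim=1).clamp(min=1)
embeddings = summed / counts
print(embeddings.shape)  # torch.Size([2, 768])
```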
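Because NeoBERT supports inputs of up to 4,096 tokens, longer documents can be embedded in a single forward pass. The following sketch assumes the same `AutoModel` interface as above; the repeated sentence is a stand-in for real long text.

```python
# A minimal sketch (not from the official card): embedding a long document
# within NeoBERT's 4,096-token context window.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

long_text = " ".join(["NeoBERT handles long inputs."] * 600)
inputs = tokenizer(
    long_text,
    truncation=True,
    max_length=4096,  # NeoBERT's extended context length
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(**inputs)

# CLS-token embedding, as in the "How to use" example above.
embedding = outputs.last_hidden_state[:, 0, :]
print(inputs["input_ids"].shape, embedding.shape)
```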