mosaicml/mosaic-bert-base-seqlen-512

@@ -33,42 +33,61 @@ The primary use case of these models is for research on efficient pretraining an
 April 2023
 ## Documentation
-* [Blog post](https://www.mosaicml.com/blog/mosaicbert)
-* [Github (mosaicml/examples/bert repo)](https://github.com/mosaicml/examples/tree/main/examples/bert)
 ## How to use
 ```python
-from transformers import AutoModelForMaskedLM
-mlm = AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base-seqlen-512', trust_remote_code=True)
-```
-The tokenizer for this model is simply the Hugging Face `bert-base-uncased` tokenizer.
-```python
-from transformers import BertTokenizer
-tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
 ```
-To use this model directly for masked language modeling, use `pipeline`:
 ```python
-from transformers import AutoModelForMaskedLM, BertTokenizer, pipeline
-tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-mlm = AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base-seqlen-512', trust_remote_code=True)
-classifier = pipeline('fill-mask', model=mlm, tokenizer=tokenizer)
-classifier("I [MASK] to the store yesterday.")
-```
 **To continue MLM pretraining**, follow the [MLM pre-training section of the mosaicml/examples/bert repo](https://github.com/mosaicml/examples/tree/main/examples/bert#mlm-pre-training).
 **To fine-tune this model for classification**, follow the [Single-task fine-tuning section of the mosaicml/examples/bert repo](https://github.com/mosaicml/examples/tree/main/examples/bert#single-task-fine-tuning).
 ### Remote Code
 This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method. This is because we train using [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), which is not part of the `transformers` library and depends on [Triton](https://github.com/openai/triton) and some custom PyTorch code. Since this involves executing arbitrary code, you should consider passing a git `revision` argument that specifies the exact commit of the code, for example:

 April 2023
+## Model Date
+April 2023
 ## Documentation
+* [Project Page (mosaicbert.github.io)](https://mosaicbert.github.io)
+* [Github (mosaicml/examples/tree/main/examples/benchmarks/bert)](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert)
+* [Paper (NeurIPS 2023)](https://openreview.net/forum?id=5zipcfLC2Z)
+* Colab Tutorials:
+  * [MosaicBERT Tutorial Part 1: Load Pretrained Weights and Experiment with Sequence Length Extrapolation Using ALiBi](https://colab.research.google.com/drive/1r0A3QEbu4Nzs2Jl6LaiNoW5EumIVqrGc?usp=sharing)
+* [Blog Post (March 2023)](https://www.mosaicml.com/blog/mosaicbert)
 ## How to use
 ```python
+import transformers
+from transformers import AutoModelForMaskedLM, BertTokenizer, pipeline
+
+# MosaicBERT uses the standard BERT tokenizer
+tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+# The config needs to be loaded and passed in explicitly
+config = transformers.BertConfig.from_pretrained('mosaicml/mosaic-bert-base-seqlen-512')
+mosaicbert = AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base-seqlen-512', config=config, trust_remote_code=True)
+
+# To use this model directly for masked language modeling
+mosaicbert_classifier = pipeline('fill-mask', model=mosaicbert, tokenizer=tokenizer, device="cpu")
+mosaicbert_classifier("I [MASK] to the store yesterday.")
 ```
+Note that the tokenizer for this model is simply the Hugging Face `bert-base-uncased` tokenizer.
+To take advantage of ALiBi and extrapolate to longer sequence lengths, change the `alibi_starting_size` value in the
+config and reload the model.
 ```python
+config = transformers.BertConfig.from_pretrained('mosaicml/mosaic-bert-base-seqlen-512')
+config.alibi_starting_size = 1024 # maximum sequence length updated to 1024 from config default of 512
+mosaicbert = AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base-seqlen-512',config=config,trust_remote_code=True)
+```
+This presets the non-learned linear bias matrix in every attention block to cover 1024 tokens (note that this particular model was trained with a sequence length of 512 tokens).
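To make the "non-learned linear bias matrix" more concrete, here is a minimal pure-Python sketch of an ALiBi-style bias. It is an illustration only, not the MosaicBERT implementation: the function names `alibi_slopes` and `alibi_bias` are hypothetical, the slopes follow the geometric schedule from the ALiBi paper for a power-of-two head count, and the symmetric `|i - j|` distance is an assumption that suits a bidirectional encoder.

```python
def alibi_slopes(n_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8).

    Assumes n_heads is a power of two; other head counts need the
    interpolation scheme from the ALiBi paper, which is omitted here.
    """
    if n_heads <= 0 or n_heads & (n_heads - 1) != 0:
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(n_heads, seq_len):
    """Build a [n_heads][seq_len][seq_len] bias added to attention scores.

    Each entry is -slope * |i - j|, so distant tokens are penalized
    linearly. Nothing here is learned, so the matrix can be rebuilt
    for any seq_len without retraining.
    """
    return [
        [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for slope in alibi_slopes(n_heads)
    ]


# Rebuilding the bias for a longer sequence only changes its size.
short_bias = alibi_bias(n_heads=8, seq_len=4)
long_bias = alibi_bias(n_heads=8, seq_len=8)
```

Because the bias depends only on token distance and a fixed per-head slope, enlarging it is a matter of recomputing the matrix, which is why a config change and a reload are enough to run on longer inputs.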
 **To continue MLM pretraining**, follow the [MLM pre-training section of the mosaicml/examples/bert repo](https://github.com/mosaicml/examples/tree/main/examples/bert#mlm-pre-training).
 **To fine-tune this model for classification**, follow the [Single-task fine-tuning section of the mosaicml/examples/bert repo](https://github.com/mosaicml/examples/tree/main/examples/bert#single-task-fine-tuning).
+### [Update 1/2/2024] Triton Flash Attention with ALiBi
+Note that by default, Triton Flash Attention is **not** enabled or required. To enable our custom implementation of Triton Flash Attention with ALiBi from March 2023,
+set `attention_probs_dropout_prob: 0.0` in the config. We are currently working on supporting Flash Attention 2 (see [PR here](https://github.com/mosaicml/examples/pull/440)).
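One way this setting might be applied from Python is the same load-config-then-reload pattern shown above for `alibi_starting_size`. This is a sketch under assumptions: it presumes Triton is installed and a compatible GPU is available, neither of which is spelled out here, and it has not been checked against the remote model code.

```python
import transformers
from transformers import AutoModelForMaskedLM

# Load the published config, then override the attention dropout.
config = transformers.BertConfig.from_pretrained('mosaicml/mosaic-bert-base-seqlen-512')
config.attention_probs_dropout_prob = 0.0  # per the note above, 0.0 enables the Triton Flash Attention path

# Reload the model with the modified config.
mosaicbert = AutoModelForMaskedLM.from_pretrained(
    'mosaicml/mosaic-bert-base-seqlen-512',
    config=config,
    trust_remote_code=True,
)
```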
 ### Remote Code
 This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method. This is because we train using [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), which is not part of the `transformers` library and depends on [Triton](https://github.com/openai/triton) and some custom PyTorch code. Since this involves executing arbitrary code, you should consider passing a git `revision` argument that specifies the exact commit of the code, for example: