---
library_name: transformers
tags:
- bert
- cramming
- NLU
license: apache-2.0
datasets:
- TucanoBR/GigaVerbo
language:
- pt
pipeline_tag: fill-mask
---

# crammed BERT Portuguese

<!-- Provide a quick summary of what the model is/does. -->

This is a Portuguese BERT model trained from scratch for 24 hours on a single NVIDIA A6000 GPU, following the architecture and recipe described in "Cramming: Training a Language Model on a Single GPU in One Day".

To use this model, clone and install the code from my fork https://github.com/wilsonjr/cramming, then `import cramming` before loading the checkpoint with the 🤗 Transformers `AutoModel` classes (see below).

## How to use

```python
import cramming  # registers the crammed architecture so the 🤗 Auto classes can load it
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")
model = AutoModelForMaskedLM.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")

# use the tokenizer's own mask token so the placeholder matches the vocabulary
text = f"Oi, eu sou um modelo {tokenizer.mask_token}."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)  # masked-LM forward pass
```
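
To turn the raw output into predictions for the masked position, something along these lines should work. This is a sketch that assumes the fork's masked-LM output exposes `logits` with the usual `(batch, seq_len, vocab_size)` shape; adapt the indexing if the returned object differs.

```python
import torch

with torch.no_grad():
    logits = model(**encoded_input).logits  # assumed: standard masked-LM output with .logits

# position of the mask token in the input
mask_positions = (encoded_input["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# five most likely replacements for the masked token
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```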

## Training Details

### Training Data & Config

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- 30M entries from `TucanoBR/GigaVerbo`
- 107M sequences of length 128
- tokenizer: WordPiece
- vocab_size: 32768
- seq_length: 128
- include_cls_token_in_corpus: false
- include_sep_token_in_corpus: true
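
For context, the packing above (fixed 128-token sequences, with a separator token between documents but no [CLS]) can be reproduced roughly as follows. This is an illustrative sketch, not the exact cramming preprocessing pipeline, and it assumes GigaVerbo exposes its raw text in a `text` column.

```python
from itertools import islice

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")
corpus = load_dataset("TucanoBR/GigaVerbo", split="train", streaming=True)

seq_length = 128
buffer, sequences = [], []
for example in islice(corpus, 1000):  # small slice, for illustration only
    # tokenize without special tokens, then append the separator after each document
    buffer.extend(tokenizer(example["text"], add_special_tokens=False)["input_ids"])
    buffer.append(tokenizer.sep_token_id)
    # slice the concatenated token stream into fixed-length blocks
    while len(buffer) >= seq_length:
        sequences.append(buffer[:seq_length])
        buffer = buffer[seq_length:]

print(f"packed {len(sequences)} sequences of length {seq_length}")
```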

### Training Procedure

- **optim**:
  - type: AdamW
  - lr: 0.001
  - betas:
    - 0.9
    - 0.98
  - eps: 1.0e-12
  - weight_decay: 0.01
  - amsgrad: false
  - fused: null
  - warmup_steps: 0
  - cooldown_steps: 0
  - steps: 900000
  - batch_size: 8192
  - gradient_clipping: 0.5

- **objective**:
  - name: masked-lm
  - mlm_probability: 0.25
  - token_drop: 0.0
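
To map these settings onto standard building blocks, the sketch below shows roughly equivalent 🤗/PyTorch components, reusing the `model` and `tokenizer` from the usage snippet above. It is not the cramming training loop itself, just an illustration of the listed values.

```python
import torch
from transformers import DataCollatorForLanguageModeling

# AdamW configured with the hyperparameters listed above
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.98),
    eps=1e-12,
    weight_decay=0.01,
    amsgrad=False,
)

# masked-LM objective: 25% of tokens selected for masking, no token dropping
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.25
)

# inside the training loop, gradients would be clipped before each step, e.g.:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```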

#### Training Hyperparameters

- num_transformer_layers: 16
- hidden_size: 768
- intermed_size: 3072
- hidden_dropout_prob: 0.1
- norm: LayerNorm
- norm_eps: 1.0e-12
- norm_scheme: pre
- nonlin: GELUglu
- tie_weights: true
- decoder_bias: false
- sparse_prediction: 0.25
- loss: cross-entropy

- **embedding**:
  - vocab_size: null
  - pos_embedding: scaled-sinusoidal
  - dropout_prob: 0.1
  - pad_token_id: 0
  - max_seq_length: 128
  - embedding_dim: 768
  - normalization: true
  - stable_low_precision: false

- **attention**:
  - type: self-attention
  - causal_attention: false
  - num_attention_heads: 12
  - dropout_prob: 0.1
  - skip_output_projection: false
  - qkv_bias: false
  - rotary_embedding: false
  - seq_op_in_fp32: false
  - sequence_op: torch-softmax

- **init**:
  - type: normal
  - std: 0.02

- ffn_layer_frequency: 1
- skip_head_transform: true
- use_bias: false

- **classification_head**:
  - pooler: avg
  - include_ff_layer: true
  - head_dim: 1024
  - nonlin: Tanh
  - classifier_dropout: 0.1
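
These architecture settings are stored in the checkpoint's config, so they can be double-checked at load time. Assuming the fork registers the crammed config class when `cramming` is imported (as upstream cramming does), inspecting it is a one-liner:

```python
import cramming  # registers the crammedBERT config class
from transformers import AutoConfig

config = AutoConfig.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")
print(config)  # the architecture block listed above should be reflected here
```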

#### Speeds, Sizes, Times

- ~0.1674 s per step (97,886 tokens/s)
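
Assuming this throughput was sustained over the full 24-hour budget, it corresponds to roughly 86,400 / 0.1674 ≈ 516k steps and 97,886 × 86,400 ≈ 8.5B tokens processed during pretraining.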

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

TBD