---
library_name: transformers
tags:
- bert
- cramming
- NLU
license: apache-2.0
datasets:
- TucanoBR/GigaVerbo
language:
- pt
pipeline_tag: fill-mask
---
# crammed BERT Portuguese
<!-- Provide a quick summary of what the model is/does. -->
This is a Portuguese BERT model trained for 24 hours on a single A6000 GPU. It follows the architecture described in "Cramming: Training a Language Model on a Single GPU in One Day".
To use this model, clone the code from my fork at https://github.com/wilsonjr/cramming and `import cramming` before loading the model with the 🤗 Transformers `AutoModel` classes (see below).
## How to use
```python
import cramming  # registers the crammed-BERT architecture with 🤗 transformers
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")
model = AutoModelForMaskedLM.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")

# Use the tokenizer's own mask token rather than hard-coding it.
text = f"Oi, eu sou um modelo {tokenizer.mask_token}."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
```
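To inspect the model's guesses for the masked position, the logits can be decoded as sketched below. This is a minimal example and assumes the forward pass returns standard masked-LM `logits` of shape `(batch, seq_len, vocab_size)`; adapt it if the fork returns a different output structure.

```python
import torch

# Find the masked position(s) and take the corresponding logits.
mask_positions = (encoded_input["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
mask_logits = output.logits[mask_positions]  # shape: (num_masks, vocab_size)

# Five most likely tokens for the first masked position.
top5 = torch.topk(mask_logits, k=5, dim=-1).indices
print(tokenizer.convert_ids_to_tokens(top5[0].tolist()))
```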
## Training Details
### Training Data & Config
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
- 30M entries from `TucanoBR/GigaVerbo`
- 107M sequences of length 128
- tokenizer: WordPiece
- vocab_size: 32768
- seq_length: 128
- include_cls_token_in_corpus: false
- include_sep_token_in_corpus: true
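As a rough illustration of how the corpus statistics above come about, the sketch below tokenizes a small sample of GigaVerbo and packs the token stream into length-128 chunks. This is not the actual preprocessing in the cramming codebase (which also handles SEP-token insertion), and it assumes the dataset exposes a `text` column.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")
dataset = load_dataset("TucanoBR/GigaVerbo", split="train", streaming=True)

SEQ_LENGTH = 128
token_stream = []
for example in dataset.take(1000):  # small sample for demonstration only
    token_stream.extend(tokenizer(example["text"], add_special_tokens=False)["input_ids"])

# Pack the continuous token stream into fixed-length sequences.
chunks = [
    token_stream[i : i + SEQ_LENGTH]
    for i in range(0, len(token_stream) - SEQ_LENGTH + 1, SEQ_LENGTH)
]
print(f"{len(chunks)} sequences of length {SEQ_LENGTH}")
```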
### Training Procedure
- **optim**:
- type: AdamW
- lr: 0.001
- betas:
- 0.9
- 0.98
- eps: 1.0e-12
- weight_decay: 0.01
- amsgrad: false
- fused: null
- warmup_steps: 0
- cooldown_steps: 0
- steps: 900000
- batch_size: 8192
- gradient_clipping: 0.5
- **objective**:
- name: masked-lm
- mlm_probability: 0.25
- token_drop: 0.0
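In plain PyTorch / 🤗 Transformers terms, the settings above correspond roughly to the sketch below (the actual cramming training loop additionally handles the step budget, micro-batching, and the learning-rate schedule); `model` and `tokenizer` are those loaded in the usage example.

```python
import torch
from transformers import DataCollatorForLanguageModeling

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.98),
    eps=1e-12,
    weight_decay=0.01,
    amsgrad=False,
)

# 25% of input tokens are masked for the masked-LM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.25)

# Inside the training loop, gradients are clipped to a norm of 0.5 before each optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```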
#### Training Hyperparameters
- num_transformer_layers: 16
- hidden_size: 768
- intermed_size: 3072
- hidden_dropout_prob: 0.1
- norm: LayerNorm
- norm_eps: 1.0e-12
- norm_scheme: pre
- nonlin: GELUglu
- tie_weights: true
- decoder_bias: false
- sparse_prediction: 0.25
- loss: cross-entropy
- **embedding**:
- vocab_size: null
- pos_embedding: scaled-sinusoidal
- dropout_prob: 0.1
- pad_token_id: 0
- max_seq_length: 128
- embedding_dim: 768
- normalization: true
- stable_low_precision: false
- **attention**:
- type: self-attention
- causal_attention: false
- num_attention_heads: 12
- dropout_prob: 0.1
- skip_output_projection: false
- qkv_bias: false
- rotary_embedding: false
- seq_op_in_fp32: false
- sequence_op: torch-softmax
- **init**:
- type: normal
- std: 0.02
- ffn_layer_frequency: 1
- skip_head_transform: true
- use_bias: false
- **classification_head**:
- pooler: avg
- include_ff_layer: true
- head_dim: 1024
- nonlin: Tanh
- classifier_dropout: 0.1
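The `GELUglu` nonlinearity is a gated-GELU feed-forward variant. The sketch below shows one common way to implement such a block with the `hidden_size`, `intermed_size`, and `use_bias` values listed above; the exact projection sizing in the cramming code may differ.

```python
import torch
import torch.nn.functional as F
from torch import nn


class GatedGELUFeedForward(nn.Module):
    """Illustrative gated-GELU (GELUglu) feed-forward block."""

    def __init__(self, hidden_size: int = 768, intermed_size: int = 3072, use_bias: bool = False):
        super().__init__()
        self.dense_in = nn.Linear(hidden_size, intermed_size, bias=use_bias)
        # Half of the intermediate activation gates the other half.
        self.dense_out = nn.Linear(intermed_size // 2, hidden_size, bias=use_bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.dense_in(x).chunk(2, dim=-1)
        return self.dense_out(F.gelu(gate) * value)
```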
#### Speeds, Sizes, Times
- ~0.1674 s per step (~97,886 tokens/s)
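As a rough back-of-the-envelope check, assuming the full 24-hour budget is spent at this throughput (ignoring any setup or checkpointing overhead):

```python
tokens_per_second = 97_886
seconds_in_24h = 24 * 60 * 60
total_tokens = tokens_per_second * seconds_in_24h  # ≈ 8.5 billion tokens in the 24-hour budget
sequences_seen = total_tokens // 128               # ≈ 66 million length-128 sequences
```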
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
TBD