---
library_name: transformers
tags:
- bert
- cramming
- NLU
license: apache-2.0
datasets:
- TucanoBR/GigaVerbo
language:
- pt
pipeline_tag: fill-mask
---

# crammed BERT Portuguese

<!-- Provide a quick summary of what the model is/does. -->

This is a Portuguese BERT model trained from scratch for 24 hours on a single NVIDIA A6000 GPU, following the architecture and recipe described in "Cramming: Training a Language Model on a Single GPU in One Day".

To use this model, clone and install the code from my fork https://github.com/wilsonjr/cramming, then `import cramming` before loading the checkpoint with the 🤗 Transformers `AutoModel` classes (see below).

## How to use

```python
import cramming  # registers the crammed architecture so the 🤗 Auto classes can load it
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")
model = AutoModelForMaskedLM.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")

# use the tokenizer's own mask token so the placeholder matches the vocabulary
text = f"Oi, eu sou um modelo {tokenizer.mask_token}."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)  # masked-LM forward pass
```
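
To turn the raw output into predictions for the masked position, something along these lines should work. This is a sketch that assumes the fork's masked-LM output exposes `logits` with the usual `(batch, seq_len, vocab_size)` shape; adapt the indexing if the returned object differs.

```python
import torch

with torch.no_grad():
    logits = model(**encoded_input).logits  # assumed: standard masked-LM output with .logits

# position of the mask token in the input
mask_positions = (encoded_input["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# five most likely replacements for the masked token
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```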

## Training Details

### Training Data & Config

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- 30M entries from `TucanoBR/GigaVerbo`
- 107M sequences of length 128
- tokenizer: WordPiece
- vocab_size: 32768
- seq_length: 128
- include_cls_token_in_corpus: false
- include_sep_token_in_corpus: true
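
For context, the packing above (fixed 128-token sequences, with a separator token between documents but no [CLS]) can be reproduced roughly as follows. This is an illustrative sketch, not the exact cramming preprocessing pipeline, and it assumes GigaVerbo exposes its raw text in a `text` column.

```python
from itertools import islice

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")
corpus = load_dataset("TucanoBR/GigaVerbo", split="train", streaming=True)

seq_length = 128
buffer, sequences = [], []
for example in islice(corpus, 1000):  # small slice, for illustration only
    # tokenize without special tokens, then append the separator after each document
    buffer.extend(tokenizer(example["text"], add_special_tokens=False)["input_ids"])
    buffer.append(tokenizer.sep_token_id)
    # slice the concatenated token stream into fixed-length blocks
    while len(buffer) >= seq_length:
        sequences.append(buffer[:seq_length])
        buffer = buffer[seq_length:]

print(f"packed {len(sequences)} sequences of length {seq_length}")
```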

### Training Procedure

- **optim**:
  - type: AdamW
  - lr: 0.001
  - betas:
    - 0.9
    - 0.98
  - eps: 1.0e-12
  - weight_decay: 0.01
  - amsgrad: false
  - fused: null
  - warmup_steps: 0
  - cooldown_steps: 0
  - steps: 900000
  - batch_size: 8192
  - gradient_clipping: 0.5

- **objective**:
  - name: masked-lm
  - mlm_probability: 0.25
  - token_drop: 0.0
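
To map these settings onto standard building blocks, the sketch below shows roughly equivalent 🤗/PyTorch components, reusing the `model` and `tokenizer` from the usage snippet above. It is not the cramming training loop itself, just an illustration of the listed values.

```python
import torch
from transformers import DataCollatorForLanguageModeling

# AdamW configured with the hyperparameters listed above
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.98),
    eps=1e-12,
    weight_decay=0.01,
    amsgrad=False,
)

# masked-LM objective: 25% of tokens selected for masking, no token dropping
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.25
)

# inside the training loop, gradients would be clipped before each step, e.g.:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```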

#### Training Hyperparameters

- num_transformer_layers: 16
- hidden_size: 768
- intermed_size: 3072
- hidden_dropout_prob: 0.1
- norm: LayerNorm
- norm_eps: 1.0e-12
- norm_scheme: pre
- nonlin: GELUglu
- tie_weights: true
- decoder_bias: false
- sparse_prediction: 0.25
- loss: cross-entropy

- **embedding**:
  - vocab_size: null
  - pos_embedding: scaled-sinusoidal
  - dropout_prob: 0.1
  - pad_token_id: 0
  - max_seq_length: 128
  - embedding_dim: 768
  - normalization: true
  - stable_low_precision: false

- **attention**:
  - type: self-attention
  - causal_attention: false
  - num_attention_heads: 12
  - dropout_prob: 0.1
  - skip_output_projection: false
  - qkv_bias: false
  - rotary_embedding: false
  - seq_op_in_fp32: false
  - sequence_op: torch-softmax

- **init**:
  - type: normal
  - std: 0.02

- ffn_layer_frequency: 1
- skip_head_transform: true
- use_bias: false

- **classification_head**:
  - pooler: avg
  - include_ff_layer: true
  - head_dim: 1024
  - nonlin: Tanh
  - classifier_dropout: 0.1
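
These architecture settings are stored in the checkpoint's config, so they can be double-checked at load time. Assuming the fork registers the crammed config class when `cramming` is imported (as upstream cramming does), inspecting it is a one-liner:

```python
import cramming  # registers the crammedBERT config class
from transformers import AutoConfig

config = AutoConfig.from_pretrained("wilsonmarciliojr/crammed-bert-portuguese")
print(config)  # the architecture block listed above should be reflected here
```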

#### Speeds, Sizes, Times

- ~0.1674 s per step (97,886 tokens/s)
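
Assuming this throughput was sustained over the full 24-hour budget, it corresponds to roughly 86,400 / 0.1674 ≈ 516k steps and 97,886 × 86,400 ≈ 8.5B tokens processed during pretraining.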

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

TBD