JackBAI/roberta-medium · Hugging Face

This is our reproduction of RoBERTa at medium size, built on the official Hugging Face RoBERTa architecture. Architecturally, RoBERTa is the same as BERT except for its larger vocabulary size.

According to Google's BERT releases and BERT-Medium, a medium-sized model should have a config of Layer=8, Hidden=512, #AttnHeads=8, and IntermediateSize=2048. We follow this config to pre-train a RoBERTa-medium model for reproduction.
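As a rough sanity check on this config, the sketch below derives the per-head dimension and an approximate encoder parameter count. The dictionary keys follow the field names used in Hugging Face config files. The counting formula assumes standard BERT-style layers (four attention projections, a two-layer feed-forward block, and two LayerNorms, all with biases). Embeddings and the LM head are left out because they depend on the vocabulary size, which is not stated here.

```python
# Medium-size config from the text above, using Hugging Face config field names.
config = {
    "num_hidden_layers": 8,
    "hidden_size": 512,
    "num_attention_heads": 8,
    "intermediate_size": 2048,
}

H = config["hidden_size"]
I = config["intermediate_size"]

# Each attention head works on an equal slice of the hidden dimension.
head_dim = H // config["num_attention_heads"]

# Assumed BERT-style layer: Q, K, V, and output projections (weights + biases).
attention_params = 4 * (H * H + H)
# Feed-forward block: H -> I and I -> H (weights + biases).
ffn_params = (H * I + I) + (I * H + H)
# Two LayerNorms per layer, each with a scale and a bias vector.
layernorm_params = 2 * (2 * H)

params_per_layer = attention_params + ffn_params + layernorm_params
encoder_params = params_per_layer * config["num_hidden_layers"]

print(f"head_dim = {head_dim}")
print(f"params per layer = {params_per_layer:,}")
print(f"encoder params (no embeddings) = {encoder_params:,}")
```

Under these assumptions the encoder stack alone comes to roughly 25M parameters; the total model size is larger once the embedding matrix is included.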

We pre-train on the same datasets as BERT (English Wikipedia and BookCorpus) for 30k steps with a batch size of 8,192. We have also released our reproduction of this dataset on Hugging Face.

We utilized DeepSpeed ZeRO-2 for performance optimization.
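For readers who want to try a similar setup, here is a minimal sketch of a DeepSpeed config that enables ZeRO stage 2 with the global batch size mentioned above. It is not the config used for this run, which is not published here. Precision settings, micro-batch size, and gradient accumulation are deliberately omitted because they depend on the hardware.

```python
import json

# Minimal DeepSpeed config sketch (an assumption, not the authors' actual file).
ds_config = {
    # Global batch size; DeepSpeed expects this to equal
    # micro-batch size * gradient accumulation steps * number of GPUs.
    "train_batch_size": 8192,
    # ZeRO stage 2 partitions optimizer states and gradients across workers.
    "zero_optimization": {"stage": 2},
}

config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

The printed JSON could be saved to a file and passed to a DeepSpeed launcher or to a trainer that accepts a DeepSpeed config path.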

Other training configuration:

Parameter	Value
WARMUP_STEPS	1800
LR_DECAY	linear
ADAM_EPS	1e-6
ADAM_BETA1	0.9
ADAM_BETA2	0.98
ADAM_WEIGHT_DECAY	0.01
PEAK_LR	1e-3
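The schedule implied by these values can be sketched in a few lines. The table only says that decay is linear. This sketch additionally assumes that warmup is linear from zero and that the rate decays to zero at the final step (30k, taken from the text above). The function name `lr_at` is made up for illustration.

```python
WARMUP_STEPS = 1800
TOTAL_STEPS = 30_000  # from the pre-training description above
PEAK_LR = 1e-3


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step.

    Assumes linear warmup from 0 to PEAK_LR, then linear decay to 0
    at TOTAL_STEPS. Only the linear decay is stated in the table.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 900, 1800, 15_900, 30_000):
    print(step, lr_at(step))
```

With these numbers, warmup covers 6% of training (1800 of 30,000 steps).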

We achieve performance broadly comparable to the official BERT-Medium release on GLUE: within about one point on SST-2, QQP, MNLI, and QNLI, but lower on MRPC, STS-B, and RTE.

Model	MRPC-F1	STS-B-Pearson	SST-2-Acc	QQP-F1	MNLI-m	MNLI-mm	QNLI-Acc	WNLI-Acc	RTE-Acc
RoBERTa-medium (ours)	83.6	82.7	89.7	89.0	79.7	80.1	89.3	31.0	57.4
BERT-medium	86.3	87.7	88.9	89.4	80.6	81.0	89.2	29.6	63.9
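To make the comparison easier to read, the short script below computes the unweighted mean over the nine reported tasks and the per-task gap between the two models. The numbers are copied from the table. The unweighted mean is an assumption for illustration and may not match how the averaged curve was computed.

```python
tasks = ["MRPC-F1", "STS-B-Pearson", "SST-2-Acc", "QQP-F1", "MNLI-m",
         "MNLI-mm", "QNLI-Acc", "WNLI-Acc", "RTE-Acc"]
roberta_medium = [83.6, 82.7, 89.7, 89.0, 79.7, 80.1, 89.3, 31.0, 57.4]
bert_medium = [86.3, 87.7, 88.9, 89.4, 80.6, 81.0, 89.2, 29.6, 63.9]

# Unweighted mean over the nine tasks (CoLA is not in the table).
avg_roberta = sum(roberta_medium) / len(roberta_medium)
avg_bert = sum(bert_medium) / len(bert_medium)

# Positive gap means RoBERTa-medium scores higher than BERT-medium.
gaps = {t: round(r - b, 1) for t, r, b in zip(tasks, roberta_medium, bert_medium)}
largest_gap_task = max(gaps, key=lambda t: abs(gaps[t]))

print(f"RoBERTa-medium mean: {avg_roberta:.1f}")
print(f"BERT-medium mean:    {avg_bert:.1f}")
for task, gap in gaps.items():
    print(f"{task:>14}: {gap:+.1f}")
print("Largest gap:", largest_gap_task)
```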

[Figure: evaluation score curve (average of scores) during pre-training]

Neither the GLUE table nor the curve above includes CoLA, because its scores are unstable across checkpoints. The raw CoLA scores are:

Step	1500	3000	6000	9000	13500	18000	24000	30000
CoLA	1.7	13.5	29.2	31.4	31.1	24.1	29.0	20.0
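One way to quantify this instability is to look at where the score peaks and how far it falls afterwards. The sketch below uses only the values from the table; the summary statistics are our own choice for illustration.

```python
import statistics

steps = [1500, 3000, 6000, 9000, 13500, 18000, 24000, 30000]
cola = [1.7, 13.5, 29.2, 31.4, 31.1, 24.1, 29.0, 20.0]

# Checkpoint with the best CoLA score.
peak_score = max(cola)
peak_step = steps[cola.index(peak_score)]

# How far the final checkpoint is below the best one.
drop_from_peak = peak_score - cola[-1]

# Spread over the second half of training (from step 13500 on),
# where a converged metric would normally be close to flat.
late_scores = cola[4:]
late_stdev = statistics.stdev(late_scores)

print(f"Peak: {peak_score} at step {peak_step}")
print(f"Final score is {drop_from_peak:.1f} points below the peak")
print(f"Std. dev. over the last {len(late_scores)} checkpoints: {late_stdev:.1f}")
```

The score peaks well before the end of training and the final checkpoint is more than ten points below it, which is why CoLA is excluded from the averages.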
