---
tags:
- generated_from_keras_callback
model-index:
- name: distilBERT-Nepali
  results:
  - task:
      type: Nepali-Language-Modelling
      name: Masked Language Modelling
    dataset:
      type: raygx/Nepali-Extended-Text-Corpus
      name: Nepali Language Corpus
    metrics:
    - type: PPL
      value: 17.31
      name: Perplexity
datasets:
- raygx/Nepali-Extended-Text-Corpus
- cc100
metrics:
- perplexity
language:
- ne
---
# distilBERT-Nepali
This model is a fine-tuned version of [raygx/distilBERT-Nepali](https://huggingface.co/raygx/distilBERT-Nepali), revision b35360e0cffb71ae18aaf4ea00ff8369964243a2.
It achieves the following results on the evaluation set:
Perplexity:
> - lowest: 17.31
> - average: 19.12

(Both a lowest and an average value are reported because training was done in batches of data, owing to the limited resources available.)
Loss:
> - loss: 3.2503
> - val_loss: 3.0674
## Model description
This model was trained on the [raygx/Nepali-Extended-Text-Corpus](https://huggingface.co/datasets/raygx/Nepali-Extended-Text-Corpus) dataset, a mixture of cc100 and [raygx/Nepali-Text-Corpus](https://huggingface.co/datasets/raygx/Nepali-Text-Corpus).
As a result, this model was trained on roughly 10 times more data than its previous version.
Another change is that the tokenizer is different, so this is effectively a completely new model.
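A minimal usage sketch (not part of the original card) for masked-token prediction with the Transformers `fill-mask` pipeline; the Nepali example sentence is purely illustrative:

```python
from transformers import pipeline

# Load the published checkpoint; the pipeline downloads the model
# together with its (new) tokenizer from the Hugging Face Hub.
fill_mask = pipeline("fill-mask", model="raygx/distilBERT-Nepali")

# Use the tokenizer's own mask token so the example does not depend on
# the tokenizer's special-token conventions.
# Example sentence: "Nepal is a ___ country."
text = f"नेपाल एक {fill_mask.tokenizer.mask_token} देश हो ।"
for prediction in fill_mask(text):
    print(prediction["token_str"], round(prediction["score"], 4))
```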
## Training procedure
Training was done by running one epoch at a time on one batch of the data, as sketched below.
The corpus was split into 3 batches and each batch was trained for 2 epochs, so training ran for a total of 6 rounds.
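A rough sketch of that schedule, assuming `model` is the compiled Keras model and `train_dataset` / `val_dataset` are `tf.data.Dataset` objects (these names are placeholders, not taken from the original training code):

```python
NUM_BATCHES = 3  # the corpus is split into 3 data batches (shards)
NUM_EPOCHS = 2   # each batch is visited in 2 separate passes

round_no = 0
for epoch in range(NUM_EPOCHS):
    for shard_idx in range(NUM_BATCHES):
        round_no += 1
        # One "round" = one epoch over one shard of the corpus.
        shard = train_dataset.shard(num_shards=NUM_BATCHES, index=shard_idx)
        model.fit(shard, validation_data=val_dataset, epochs=1)
        print(f"finished round {round_no} of {NUM_BATCHES * NUM_EPOCHS}")
```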
### Training hyperparameters
The following hyperparameters were used during training (a sketch of how to rebuild this optimizer in code follows the list):
- optimizer: {'name': 'AdamWeightDecay', 'learning_rate': {'class_name': 'WarmUp', 'config': {'initial_learning_rate': 5e-05, 'decay_schedule_fn': {'class_name': 'PolynomialDecay', 'config': {'initial_learning_rate': 5e-05, 'decay_steps': 16760, 'end_learning_rate': 0.0, 'power': 1.0, 'cycle': False, 'name': None}, '__passive_serialization__': True}, 'warmup_steps': 1000, 'power': 1.0, 'name': None}}, 'decay': 0.0, 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-08, 'amsgrad': False, 'weight_decay_rate': 0.01}
- training_precision: mixed_float16
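For reference, the optimizer configuration above can be rebuilt with the `create_optimizer` helper from Transformers. This is a sketch based on the step counts listed above, not the original training script:

```python
import tensorflow as tf
from transformers import create_optimizer

# Match training_precision: mixed_float16
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# AdamWeightDecay with a 1,000-step warmup followed by a linear
# (PolynomialDecay, power=1.0) decay to 0 over 16,760 total steps,
# and a weight-decay rate of 0.01.
optimizer, lr_schedule = create_optimizer(
    init_lr=5e-5,
    num_train_steps=16760,
    num_warmup_steps=1000,
    weight_decay_rate=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)

# The optimizer can then be passed to model.compile(optimizer=optimizer).
```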
### Training results
Perplexity:
- lowest: 17.31
- average: 19.12

Loss and perplexity per training round:

| Round | Loss   | Validation Loss | Perplexity |
|:-----:|:------:|:---------------:|:----------:|
| 1     | 4.8605 | 4.0510          | 56.96      |
| 2     | 3.8504 | 3.5142          | 33.65      |
| 3     | 3.4918 | 3.2408          | 25.64      |
| 4     | 3.2503 | 3.0674          | 21.56      |
| 5     | 3.1324 | 2.9243          | 18.49      |
| 6     | 3.2503 | 3.0674          | 17.31      |
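For a masked language model, perplexity is typically the exponential of the mean cross-entropy loss, which is roughly how the values above relate to the validation losses (small differences can arise from how the loss is averaged over batches). A minimal sketch:

```python
import math

def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity = exp(mean cross-entropy loss)."""
    return math.exp(cross_entropy_loss)

print(round(perplexity(2.9243), 2))  # ~18.62, close to the reported 18.49
```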
### Framework versions
- Transformers 4.30.2
- TensorFlow 2.12.0
- Datasets 2.1.0
- Tokenizers 0.13.3