---
tags:
- generated_from_keras_callback
model-index:
- name: distilBERT-Nepali
  results:
  - task:
      type: Nepali-Language-Modelling
      name: Masked Language Modelling
    dataset:
      type: raygx/Nepali-Extended-Text-Corpus
      name: Nepali Language Corpus
    metrics:
    - type: PPL
      value: 17.31
      name: Perplexity
datasets:
- raygx/Nepali-Extended-Text-Corpus
- cc100
metrics:
- perplexity
language:
- ne
---

# distilBERT-Nepali

This model is a fine-tuned version of raygx/distilBERT-Nepali, revision b35360e0cffb71ae18aaf4ea00ff8369964243a2.

It achieves the following results on the evaluation set:

Perplexity:
>  - lowest: 17.31
>  - average: 19.12

(A lowest and an average value are reported because training was done in separate batches of data, owing to the limited resources available.)

Loss:
> - loss: 3.2503
> - val_loss: 3.0674
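
Perplexity for a masked language model is conventionally reported as the exponential of the cross-entropy loss. The snippet below is a minimal sketch assuming that convention (it is not taken from the original evaluation script); note that exp(3.0674) ≈ 21.5, which lines up with the 21.56 listed for that round in the training results.

```python
import math

def perplexity(loss: float) -> float:
    # Perplexity = exp(cross-entropy loss), the usual convention for (M)LM evaluation.
    return math.exp(loss)

print(perplexity(3.0674))  # ≈ 21.48
```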

## Model description

This model is trained on the [raygx/Nepali-Extended-Text-Corpus](https://huggingface.co/datasets/raygx/Nepali-Extended-Text-Corpus) dataset,
which is a mixture of cc100 and [raygx/Nepali-Text-Corpus](https://huggingface.co/datasets/raygx/Nepali-Text-Corpus).
This model is therefore trained on 10 times more data than its previous version.
The tokenizer has also been changed, so this is effectively a completely different model.
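
A minimal usage sketch for loading this checkpoint with Hugging Face Transformers and querying it as a fill-mask model is shown below. The example sentence ("Nepal is a [MASK] country.") is purely illustrative, and the mask token is taken from the tokenizer rather than hard-coded.

```python
from transformers import AutoTokenizer, TFAutoModelForMaskedLM, pipeline

model_id = "raygx/distilBERT-Nepali"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = TFAutoModelForMaskedLM.from_pretrained(model_id)

# Masked-token prediction; the mask token string depends on this model's tokenizer.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"नेपाल एक {tokenizer.mask_token} देश हो।"))
```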

## Training procedure

Training was done by running one epoch at a time on one batch of data.
The corpus was split into 3 data batches and each batch was trained for 2 epochs,
giving a total of 6 training rounds.
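
The sketch below is a hedged reconstruction of that schedule, not the author's actual training script: the split name, the dataset column name (`text`), the sequence length, the batch size, and the masking probability are all assumptions.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, TFAutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

model_id = "raygx/distilBERT-Nepali"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = TFAutoModelForMaskedLM.from_pretrained(model_id)
model.compile(optimizer="adam")  # see the optimizer sketch under "Training hyperparameters"

raw = load_dataset("raygx/Nepali-Extended-Text-Corpus", split="train")
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=raw.column_names,
)
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15,
                                           return_tensors="np")

num_shards, epochs_per_shard = 3, 2  # 3 data batches x 2 epochs = 6 rounds
for epoch in range(epochs_per_shard):
    for i in range(num_shards):
        shard = tokenized.shard(num_shards=num_shards, index=i)
        tf_train = model.prepare_tf_dataset(shard, collate_fn=collator,
                                            batch_size=16, shuffle=True)
        model.fit(tf_train, epochs=1)  # one epoch per round on this data batch
```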

### Training hyperparameters

The following hyperparameters were used during training:
- optimizer: {'name': 'AdamWeightDecay', 'learning_rate': {'class_name': 'WarmUp', 'config': {'initial_learning_rate': 5e-05, 'decay_schedule_fn': {'class_name': 'PolynomialDecay', 'config': {'initial_learning_rate': 5e-05, 'decay_steps': 16760, 'end_learning_rate': 0.0, 'power': 1.0, 'cycle': False, 'name': None}, '__passive_serialization__': True}, 'warmup_steps': 1000, 'power': 1.0, 'name': None}}, 'decay': 0.0, 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-08, 'amsgrad': False, 'weight_decay_rate': 0.01}
- training_precision: mixed_float16
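
Assuming these values came from `transformers.create_optimizer`, an equivalent optimizer and learning-rate schedule could be rebuilt as sketched below. The total step count of 17,760 is an inference (1,000 warm-up steps plus the 16,760 polynomial-decay steps listed above), and the mixed-precision policy mirrors `training_precision: mixed_float16`.

```python
import tensorflow as tf
from transformers import create_optimizer

tf.keras.mixed_precision.set_global_policy("mixed_float16")  # training_precision above

optimizer, lr_schedule = create_optimizer(
    init_lr=5e-5,            # initial_learning_rate
    num_train_steps=17_760,  # 1,000 warm-up + 16,760 decay steps (inferred)
    num_warmup_steps=1_000,  # warmup_steps
    weight_decay_rate=0.01,  # weight_decay_rate
    power=1.0,               # linear decay (PolynomialDecay with power=1.0)
)
# model.compile(optimizer=optimizer)  # attach to the TF model before model.fit()
```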

### Training results

Perplexity:
  - lowest: 17.31
  - average: 19.12

Loss and perplexity per training round:

| Round | Loss   | Validation Loss | Perplexity |
|:-----:|:------:|:---------------:|:----------:|
| 1     | 4.8605 | 4.0510          | 56.96      |
| 2     | 3.8504 | 3.5142          | 33.65      |
| 3     | 3.4918 | 3.2408          | 25.64      |
| 4     | 3.2503 | 3.0674          | 21.56      |
| 5     | 3.1324 | 2.9243          | 18.49      |
| 6     | 3.2503 | 3.0674          | 17.31      |

### Framework versions

- Transformers 4.30.2
- TensorFlow 2.12.0
- Datasets 2.1.0
- Tokenizers 0.13.3