---
language: id
license: mit
datasets:
  - indonli
  - MoritzLaurer/multilingual-NLI-26lang-2mil7
pipeline_tag: zero-shot-classification
widget:
  - text: Saya suka makan kentang goreng.
    candidate_labels: positif, netral, negatif
    hypothesis_template: Kalimat ini mengandung tema {}.
    multi_class: false
    example_title: Sentiment
  - text: Apple umumkan harga iPhone 14.
    candidate_labels: teknologi, olahraga, kuliner, bisnis
    hypothesis_template: Kalimat ini mengandung tema {}.
    multi_class: true
    example_title: News
model-index:
  - name: ilos-vigil/bigbird-small-indonesian-nli
    results:
      - task:
          type: natural-language-inference
          name: Natural Language Inference
        dataset:
          name: indonli
          type: indonli
          config: indonli
          split: test_expert
        metrics:
          - type: accuracy
            value: 0.5385388739946381
            name: Accuracy
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNWRhZDkxNmI2NzE3MzRlYmNlMWFjZDVmNWUwYmMwN2IxYzNjMWE4YzY4NWI3NDZkYTMzY2NjN2MyZGQ5YzEwZSIsInZlcnNpb24iOjF9.AgizskHeXOzs0v93DNojNoqR_-1bQsYBokL8jcfelFm-zt-r5YXt89WXBDLLg4oKv-Roj8sLhUwe7ei0Mf1-Ag
          - type: f1
            value: 0.530444188199697
            name: F1 Macro
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiMjk2YTFhY2E3NGIzNzgxY2M5YzUzNGUzYTAwOWZkNGU3Y2I5MDA1MTc0YzM4Yjg0MmIzY2Y5M2EzOGYxNjY4NiIsInZlcnNpb24iOjF9.YZ_fTuVftTCM6SFfkFCLPbJWYmYNMYL9PNHUwNFHQXZeknf6OCBgQtr1gF6VM9mX6WuU4OKEl12tsAytlkm7Ag
          - type: f1
            value: 0.5385388739946381
            name: F1 Micro
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiM2MxMGUyZmJhZTYzN2M4NDlkMTZmMzllOGVhMjRiODhkMGVkMGMxMjY2NDBkZWM3ZWY2ZjhmZTNmYWU5ZjEzMyIsInZlcnNpb24iOjF9.f0HQlPRx4VFnOOHsrvMKFni8g1B1OJfheOyADsf47GnrvCcW_dakDgBy5c_yy4TehQYRa6ToYGHnuQnemvhnBg
          - type: f1
            value: 0.5299257731385174
            name: F1 Weighted
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNTgzZjJkZWU0NDgyMGU5MDFmNzk2OWY1OWY4MzA2NTE3MDAxN2Y2MWExODJkYjdlN2I1YzgzYjljNjdkMTc1YiIsInZlcnNpb24iOjF9.lWB7MZlAiDjskKM-lx-XtLxTQYuWLz3QjyseDuZe_AxtyOKt2GZkP2NDOZxEWketHjRiTCQfBUvSfzFId-FCAg
          - type: precision
            value: 0.5592571894118881
            name: Precision Macro
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiZDQxYTFlNTNjNDAwMWIxYmJlMzRkN2U5OWY1NWNjN2YyYTE2NzRjNjM3ZWNhMzM4NjFhYWM4MzJkYjY3MzU0YSIsInZlcnNpb24iOjF9.6OI4_M1wLX1Z1BztKUfZ-382F3coCeJjarsWc-J04TKpsFCddLjuF5ZDuBFmokpz4goRgx-FlH-5jCAsFkzkBg
          - type: precision
            value: 0.5385388739946381
            name: Precision Micro
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNzRmY2I4YTAzMTRkMjFjNTE1NTEwZDlmZGQ4NDUyYTAxY2JhOTliMDRhNWY3OGY4OWRlNTlkNzcxODc0MDMwYyIsInZlcnNpb24iOjF9.X7ekS-JYOXH5eNmSfKQ_no1rNAbuQ3C0pNYvorPVfcna6RU8n6O6FNQor0AWvatAWdefJG6H3J7_GoC6M5zECw
          - type: precision
            value: 0.5586108016541553
            name: Precision Weighted
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiMjUwNjMxYjEwMTEzNzAwNzQwZDQwMTRmZDM2ZDk0ZDc3YTUxOTQzNDE5ZWI2NWI4MmJmODAxYTlmN2E0Nzk2MCIsInZlcnNpb24iOjF9.nAO1wRFHMtm5kem9VhuuRg54fpvA2uzwEutjzsnZoyemUHbI2U_1TK_dDmR4bmpPjVnCZt5sF-jEq4oZIaIbDQ
          - type: recall
            value: 0.5385813032215204
            name: Recall Macro
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNzVkNjliYTM0Njc3MTUzMDBmYTE5NDRkNzFjNzg2NzA0NzEyMTg4YTlkNGFlZWMxZWUwOGQzYzY1ZGU0ZmIwNyIsInZlcnNpb24iOjF9.cnEbDBJR8m3UqiuzCq_g4RUFLE8BVzXDebKguVrwPgY-Biu4sBFXVQvFyZScsLGEnaHYsE-R8ctTEGDdQONVBw
          - type: recall
            value: 0.5385388739946381
            name: Recall Micro
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiODZkMmNjZWY4ZDYyYjU3NjQ2ZGNhZjkyNTQyOTg2ZjNmNDgwNDYxYmU2ZDA5M2EwOWRlMjMyYmI4MGU3MGMxNCIsInZlcnNpb24iOjF9.BfMB4_MZ-SYj1YbTES8pqgKNQkNnevSOjAwUqdoL6wsNpsKKWxPHmq0Kt9XufxHoQoyTkGvPfxh-0jEe3B1nBg
          - type: recall
            value: 0.5385388739946381
            name: Recall Weighted
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiYmE3Yjg3OTVhMjdlMDk1YWFjMWIwNjMyZTA2Yzc3MjBlNjI1YWY5MzE0MjNkMDNiMmU5ZmIxYWExNmViYWE1NSIsInZlcnNpb24iOjF9.S9Bo-wq3wikFS-FqMQerxahu87PJyYx141G5PCWDtOs2wH1nf4texnJYWfHeVCJKZcKmS2RWn5XOjjJ9RoNJAA
          - type: loss
            value: 1.062397837638855
            name: loss
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiOTFmNDI0ZmQ2YmNlZjJlZTdmZTYwOGVkMjdjMjJkMDIzNzhlOWFiNWQzNjFiMmU5NTdiM2Y1YjYxMjU4ZjQ2ZSIsInZlcnNpb24iOjF9.15RsFRkFpbarlU1L8UyV0o0_5WCveO_mT9CdO0UYwvQsOVjScheJ8fOqHBAC-C-CMTlfFNsmMhNrU_np8c_ZCQ

---

# Indonesian small BigBird model NLI

## Source Code

The source code used to create this model and run the benchmarks is available at https://github.com/ilos-vigil/bigbird-small-indonesian.

## Model Description

This model is based on bigbird-small-indonesian and was finetuned on 2 datasets. It is intended to be used for zero-shot text classification.
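
The snippet below is a minimal sketch of loading this model directly from the Hugging Face Hub; it assumes the Hub id `ilos-vigil/bigbird-small-indonesian-nli` taken from the metadata above, while the examples in the next section load a local checkpoint instead.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hub id taken from the model-index metadata above.
model_name = 'ilos-vigil/bigbird-small-indonesian-nli'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Show how the output indices map to entailment/neutral/contradiction.
print(model.config.id2label)
```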

## How to use

### Inference for ZSC (Zero Shot Classification) task

```python
>>> from transformers import pipeline
>>> pipe = pipeline(
...     task='zero-shot-classification',
...     model='./tmp/checkpoint-28832'
... )
>>> pipe(
...     sequences='Fakta nomor 7 akan membuat ada terkejut',
...     candidate_labels=['clickbait', 'bukan clickbait'],
...     hypothesis_template='Judul video ini {}.',
...     multi_label=False
... )
{
 'sequence': 'Fakta nomor 7 akan membuat ada terkejut',
 'labels': ['clickbait', 'bukan clickbait'],
 'scores': [0.6102734804153442, 0.38972654938697815]
}
>>> pipe(
...     sequences='Samsung tuntut balik Apple dengan alasan hak paten teknologi.',
...     candidate_labels=['teknologi', 'olahraga', 'bisnis', 'politik', 'kesehatan', 'kuliner'],
...     hypothesis_template='Kategori berita ini adalah {}.',
...     multi_label=True
... )
{
 'sequence': 'Samsung tuntut balik Apple dengan alasan hak paten teknologi.',
 'labels': ['politik', 'teknologi', 'kesehatan', 'bisnis', 'olahraga', 'kuliner'],
 'scores': [0.7390161752700806, 0.6657379269599915, 0.4459509551525116, 0.38407933712005615, 0.3679264783859253, 0.14181996881961823]
}
```

### Inference for NLI (Natural Language Inference) task

```python
>>> from transformers import pipeline
>>> pipe = pipeline(
...     task='text-classification',
...     model='./tmp/checkpoint-28832',
...     return_all_scores=True
... )
>>> pipe({
...     'text': 'Nasi adalah makanan pokok.',  # Premise
...     'text_pair': 'Saya mau makan nasi goreng.'  # Hypothesis
... })
[
 {'label': 'entailment', 'score': 0.25495028495788574},
 {'label': 'neutral', 'score': 0.40920916199684143},
 {'label': 'contradiction', 'score': 0.33584052324295044}
]
>>> pipe({
...     'text': 'Python sering digunakan untuk web development dan AI research.',
...     'text_pair': 'AI research biasanya tidak menggunakan bahasa pemrograman Python.'
... })
[
 {'label': 'entailment', 'score': 0.12508109211921692},
 {'label': 'neutral', 'score': 0.22146646678447723},
 {'label': 'contradiction', 'score': 0.653452455997467}
]
```

## Limitation and bias

This model inherits limitations and biases from its parent model and the 2 datasets used for finetuning. Like most language models, it is also sensitive to changes in the input; the example below shows how changing only the hypothesis template shifts the scores.

```python
>>> from transformers import pipeline
>>> pipe = pipeline(
...     task='zero-shot-classification',
...     model='./tmp/checkpoint-28832'
... )
>>> text = 'Resep sate ayam enak dan mudah.'
>>> candidate_labels = ['kuliner', 'olahraga']
>>> pipe(
...     sequences=text,
...     candidate_labels=candidate_labels,
...     hypothesis_template='Kategori judul artikel ini adalah {}.',
...     multi_label=False
... )
{
 'sequence': 'Resep sate ayam enak dan mudah.',
 'labels': ['kuliner', 'olahraga'],
 'scores': [0.7711364030838013, 0.22886358201503754]
}
>>> pipe(
...     sequences=text,
...     candidate_labels=candidate_labels,
...     hypothesis_template='Kelas kalimat ini {}.',
...     multi_label=False
... )
{
 'sequence': 'Resep sate ayam enak dan mudah.',
 'labels': ['kuliner', 'olahraga'],
 'scores': [0.7043636441230774, 0.295636385679245]
}
>>> pipe(
...     sequences=text,
...     candidate_labels=candidate_labels,
...     hypothesis_template='{}.',
...     multi_label=False
... )
{
 'sequence': 'Resep sate ayam enak dan mudah.',
 'labels': ['kuliner', 'olahraga'],
 'scores': [0.5986711382865906, 0.4013288915157318]
}
```

## Training, evaluation and testing data

This model was finetuned on IndoNLI and multilingual-NLI-26lang-2mil7. Although the multilingual-NLI-26lang-2mil7 dataset is machine-translated, adding it slightly improves the NLI benchmark results and substantially improves the ZSC benchmark results. Both the evaluation and test data come exclusively from the IndoNLI dataset.
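
As a minimal sketch, the finetuning data could be loaded with the `datasets` library as shown below; the `id_mnli` configuration name for the Indonesian portion of multilingual-NLI-26lang-2mil7 is an assumption, not necessarily the exact subset used for training.

```python
from datasets import load_dataset

# IndoNLI provides train/validation splits plus the test_lay and test_expert
# splits used for evaluation and testing.
indonli = load_dataset('indonli')
print(indonli)

# Assumption: only an Indonesian subset of the machine-translated dataset is
# relevant here, e.g. its MNLI-derived portion.
nli_26lang_id = load_dataset('MoritzLaurer/multilingual-NLI-26lang-2mil7', 'id_mnli')
print(nli_26lang_id)
```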

## Training Procedure

The model was finetuned on a single RTX 3060 for 16 epochs (28,832 steps) with an accumulated batch size of 64. The AdamW optimizer was used with a learning rate of 1e-4, weight decay of 0.05, learning rate warmup for the first 6% of steps (1,730 steps) and linear decay of the learning rate afterwards. Note that although the epoch 9 checkpoint has the lowest loss and highest accuracy, it performs slightly worse on the ZSC benchmark. Additional information is available in the Tensorboard training logs.
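
For reference, a minimal sketch of these hyperparameters expressed as `transformers.TrainingArguments` is shown below; the split of the accumulated batch size of 64 into per-device batch size and gradient accumulation steps is an assumption.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./tmp',
    num_train_epochs=16,
    per_device_train_batch_size=8,   # assumption: 8 * 8 = accumulated batch size of 64
    gradient_accumulation_steps=8,
    learning_rate=1e-4,              # Trainer uses AdamW by default
    weight_decay=0.05,
    warmup_ratio=0.06,               # warmup for the first 6% of steps
    lr_scheduler_type='linear',      # linear decay after warmup
)
```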

## Benchmark as NLI model

Both benchmarks include results for 2 other models as an additional comparison. Additional benchmarks on the IndoNLI dataset are available in its paper, IndoNLI: A Natural Language Inference Dataset for Indonesian.

| Model | bigbird-small-indonesian-nli | xlm-roberta-large-xnli | mDeBERTa-v3-base-xnli-multilingual-nli-2mil7 |
| --- | --- | --- | --- |
| Parameter | 30.6M | 559.9M | 278.8M |
| Multilingual | | V | V |
| Finetuned on IndoNLI | V | | |
| Finetuned on multilingual-NLI-26lang-2mil7 | V | | V |
| Test (Lay) | 0.6888 | 0.2226 | 0.8151 |
| Test (Expert) | 0.5734 | 0.3505 | 0.7775 |
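
The IndoNLI test scores above could be reproduced along the lines of the sketch below; this is not the exact benchmark script (see the repository linked above), and it assumes the Hub id from the metadata plus the IndoNLI column names `premise`, `hypothesis` and `label`.

```python
from datasets import load_dataset
from transformers import pipeline

pipe = pipeline(task='text-classification', model='ilos-vigil/bigbird-small-indonesian-nli')
test_expert = load_dataset('indonli', split='test_expert')
int2str = test_expert.features['label'].int2str  # avoids hard-coding the label order

correct = 0
for row in test_expert:
    prediction = pipe({'text': row['premise'], 'text_pair': row['hypothesis']})[0]
    if prediction['label'] == int2str(row['label']):
        correct += 1
print('Accuracy:', correct / len(test_expert))
```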

## Benchmark as ZSC model

The Indonesian-Twitter-Emotion-Dataset is used for the ZSC benchmark. The benchmark covers 4 different parameter combinations (multi-label on or off, with or without a full hypothesis template), which affect each model's performance differently. The hypothesis templates used are `Kalimat ini mengekspresikan perasaan {}.` and the bare `{}.`. Note that the F1 score only takes the label with the highest probability into account; a sketch of this scoring procedure follows the table below.

| Model | Multi-label | Use template | F1 Score |
| --- | --- | --- | --- |
| bigbird-small-indonesian-nli | V | V | 0.3574 |
| | V | | 0.3654 |
| | | V | 0.3985 |
| | | | 0.4160 |
| xlm-roberta-large-xnli | V | V | 0.6292 |
| | V | | 0.5596 |
| | | V | 0.5737 |
| | | | 0.5433 |
| mDeBERTa-v3-base-xnli-multilingual-nli-2mil7 | V | V | 0.5324 |
| | V | | 0.5499 |
| | | V | 0.5269 |
| | | | 0.5228 |
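
As a rough sketch of the scoring procedure described above, assuming macro-averaged F1 and using placeholder texts and labels rather than the actual Indonesian-Twitter-Emotion-Dataset:

```python
from sklearn.metrics import f1_score
from transformers import pipeline

pipe = pipeline(task='zero-shot-classification', model='ilos-vigil/bigbird-small-indonesian-nli')

# Placeholder examples; the real benchmark uses the Indonesian-Twitter-Emotion-Dataset.
texts = ['Aku senang sekali hari ini!', 'Kenapa semuanya selalu salah...']
gold_labels = ['senang', 'sedih']
candidate_labels = ['senang', 'sedih', 'marah', 'takut', 'cinta']

predictions = []
for text in texts:
    result = pipe(
        sequences=text,
        candidate_labels=candidate_labels,
        hypothesis_template='Kalimat ini mengekspresikan perasaan {}.',
        multi_label=False,
    )
    # Only the label with the highest probability is used for the F1 score.
    predictions.append(result['labels'][0])

print(f1_score(gold_labels, predictions, average='macro'))
```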