language: fa
license: apache-2.0
This repository aims to provide better models for natural language inference (NLI) in Persian, together with transparent training code. I hope you find it inspiring and build better models in the future. For more details about the task and the training methods, see the Medium post and the notebooks.
Dataset
The dataset used for training is the Wiki D/Similar dataset (wiki-d-similar.zip), obtained from the Sentence Transformers repository.
Model
The proposed model is published on the HuggingFace Hub as demoversion/bert-fa-base-uncased-haddad-wikinli. You can download it from the HuggingFace website or load it directly with the transformers library like this:
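from transformers import pipeline
# Load the NLI model as a zero-shot classifier.
model = pipeline("zero-shot-classification", model="demoversion/bert-fa-base-uncased-haddad-wikinli")
# Candidate labels: sports, political, scientific, cultural.
labels = ["ورزشی",
"سیاسی",
"علمی",
"فرهنگی"]
# Hypothesis template, roughly "This is a {} text."
template_str = "این یک متن {} است."
# Input sentence, roughly "The World Cup qualifying stage had many controversies."
str_sentence = "مرحله مقدماتی جام جهانی حاشیه‌های زیادی داشت."
# Returns a dict with the input sequence, the labels sorted by score, and the scores.
model(str_sentence, labels, hypothesis_template=template_str)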
The result of this code snippet is:
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
{'labels': ['فرهنگی', 'علمی', 'سیاسی', 'ورزشی'],
'scores': [0.25921085476875305,
0.25713297724723816,
0.24884170293807983,
0.23481446504592896],
'sequence': 'مرحله مقدماتی جام جهانی حاشیه\u200cهای زیادی داشت.'}
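The model produces this ranking zero-shot, without any training on these labels. The labels are sorted by score, highest first. Note that the four scores are close (roughly 0.23 to 0.26), and in this output the top-ranked label is 'فرهنگی' (cultural), while 'ورزشی' (sports) is ranked last.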
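To make the role of the hypothesis template clearer, here is a minimal sketch of the idea behind NLI-based zero-shot classification: each label is inserted into the template to form a hypothesis, the NLI model scores how strongly the input entails each hypothesis, and the entailment scores are normalized across labels with a softmax. The helper names (`build_hypotheses`, `softmax`, `rank_labels`) and the logit values are hypothetical and chosen only for illustration; they are not taken from the model or from the transformers source.

```python
import math

def build_hypotheses(labels, template):
    # Fill each candidate label into the hypothesis template.
    return [template.format(label) for label in labels]

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(labels, entailment_logits):
    # Normalize entailment logits across labels and sort by score, highest first.
    scores = softmax(entailment_logits)
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return {
        "labels": [label for label, _ in ranked],
        "scores": [score for _, score in ranked],
    }

labels = ["ورزشی", "سیاسی", "علمی", "فرهنگی"]
template = "این یک متن {} است."
hypotheses = build_hypotheses(labels, template)

# Hypothetical entailment logits, one per hypothesis (NOT real model output).
hypothetical_logits = [0.10, 0.16, 0.19, 0.20]
result = rank_labels(labels, hypothetical_logits)

print(hypotheses)
print(result)
```

In a real run, the logits would come from the NLI model applied to each (input sentence, hypothesis) pair; only the template filling and the normalization are shown here.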
Results
The table below compares the results of this model with those of the original model published for this dataset.
Model | dev_accuracy | dev_f1 | test_accuracy | test_f1 |
---|---|---|---|---|
m3hrdadfi/bert-fa-base-uncased-wikinli | 77.88 | 77.57 | 76.64 | 75.99 |
demoversion/bert-fa-base-uncased-haddad-wikinli | 78.62 | 79.74 | 77.04 | 78.56 |
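As a quick way to read the table, the following sketch computes the per-metric difference between the two models. The numbers are copied from the table above; the variable names are only illustrative.

```python
# Scores copied from the results table.
results = {
    "m3hrdadfi/bert-fa-base-uncased-wikinli": {
        "dev_accuracy": 77.88, "dev_f1": 77.57,
        "test_accuracy": 76.64, "test_f1": 75.99,
    },
    "demoversion/bert-fa-base-uncased-haddad-wikinli": {
        "dev_accuracy": 78.62, "dev_f1": 79.74,
        "test_accuracy": 77.04, "test_f1": 78.56,
    },
}

baseline = results["m3hrdadfi/bert-fa-base-uncased-wikinli"]
proposed = results["demoversion/bert-fa-base-uncased-haddad-wikinli"]

# Difference in points (proposed minus baseline) for each metric.
deltas = {metric: round(proposed[metric] - baseline[metric], 2) for metric in baseline}

for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.2f}")
```

The differences are larger for F1 than for accuracy on both the dev and test splits.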
Notebooks
Notebooks used for training and evaluation are available below.