SENTAGRAM Model

Model Summary

SENTAGRAM is a BERT-based model fine-tuned on a custom Turkish grammar dataset. It analyzes Turkish sentences and classifies their grammatical elements, such as subjects, predicates, objects, and adjuncts. The model is built on BERTürk, a BERT model pretrained on Turkish text, and is adapted to the intricacies of Turkish grammar. For more information, visit the project's GitHub repository.

Model Description

  • Architecture: BERT (BERTürk)
  • Language: Turkish (tr)
  • Task: Token classification, focusing on part-of-speech tagging and grammatical role identification.
  • Training Dataset: The model was fine-tuned using the turkish-sentence-elements dataset, which contains annotated sentences from a variety of Turkish sources.
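Token classification assigns one label per token, but BERT-style tokenizers split words into subword pieces, so word-level annotations have to be expanded to token level before fine-tuning. The sketch below shows one common way to do this: label only the first subword of each word and mask the rest. It is an illustration, not the confirmed SENTAGRAM preprocessing; the function name `align_labels`, the example ids, and the masking choice are assumptions. The `word_ids` list has the shape returned by the `word_ids()` method of a fast tokenizer's encoding.

```python
from typing import List, Optional

# Label id that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level labels to subword tokens.

    word_ids holds one entry per token: None for special tokens such as
    [CLS] and [SEP], otherwise the index of the word the token came from.
    Only the first subword of each word keeps the word's label.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword
        previous = word_id
    return aligned


# Hypothetical example: three words, the second split into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 1, 2]  # made-up label ids
print(align_labels(word_ids, word_labels))  # [-100, 0, 1, -100, 2, -100]
```

Masking continuation subwords keeps the loss and the metrics at word level, so a long word does not count more than a short one.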

Intended Use

Applications

  • Educational Tools: Can be used to develop applications that help learners of Turkish understand and correct their grammar.
  • NLP Research: Useful for research in Turkish natural language processing, especially in areas related to syntax and grammar.
  • Grammatical Analysis: Can be integrated into text editors, language learning platforms, or automated proofreading tools to provide grammar suggestions and corrections.

Limitations

  • Complex Sentences: While the model performs well on standard sentences, its performance may degrade on more complex or ambiguous sentence structures.
  • Contextual Understanding: The model's ability to understand context is limited to the token classification task, and it might not perform as well in tasks requiring deep semantic understanding.

Performance

The model was evaluated on the SYNTÜRK SENTAGRAM dataset with the following results:

Precision   Recall      F1 Score    Accuracy
0.911349    0.911826    0.911588    0.935395

On this evaluation set, the model scores about 0.91 on precision, recall, and F1, and about 0.94 on accuracy, when identifying and classifying grammatical elements in Turkish sentences.
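As a quick sanity check on the reported figures, F1 is by definition the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall values above. The helper name `f1_from` is illustrative, and the averaging scheme used in the original evaluation (micro, macro, or weighted) is not stated in this card, so this only checks internal consistency.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the results reported above.
precision, recall, reported_f1 = 0.911349, 0.911826, 0.911588

computed_f1 = f1_from(precision, recall)
print(f"computed F1: {computed_f1:.6f}, reported F1: {reported_f1:.6f}")

# The two values should agree up to rounding of the published inputs.
print(abs(computed_f1 - reported_f1) < 1e-5)
```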

How to Use

You can load and use the model with Hugging Face's transformers library:

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("synturk/sentagram-berturk")
model = AutoModelForTokenClassification.from_pretrained("synturk/sentagram-berturk")

# Example sentence
sentence = "SYNTÜRK yarışmayı kazandı."

# Tokenize and predict (gradients are not needed for inference)
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Pick the highest-scoring label id for each token
predictions = torch.argmax(outputs.logits, dim=-1)

# Decode the predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predicted_labels = [model.config.id2label[p.item()] for p in predictions[0]]

print(list(zip(tokens, predicted_labels)))
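The printed pairs are at subword level and include special tokens such as [CLS] and [SEP]. A minimal sketch of merging them back into word-level results follows, assuming WordPiece-style "##" continuation markers and keeping the label of each word's first subword. The tokenization and the label names in the example are made up for illustration; the model's real label set is in `model.config.id2label`.

```python
from typing import List, Tuple

# Special tokens that carry no grammatical label.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_subwords(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Join '##' continuation pieces onto the preceding word.

    Each merged word keeps the label predicted for its first subword.
    """
    words: List[Tuple[str, str]] = []
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##") and words:
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


# Hypothetical tokenization and label names, not actual model output.
tokens = ["[CLS]", "SYN", "##TÜRK", "yarışma", "##yı", "kazandı", ".", "[SEP]"]
labels = ["O", "SUBJECT", "SUBJECT", "OBJECT", "OBJECT", "PREDICATE", "O", "O"]

print(merge_subwords(tokens, labels))
# [('SYNTÜRK', 'SUBJECT'), ('yarışmayı', 'OBJECT'), ('kazandı', 'PREDICATE'), ('.', 'O')]
```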

Training Details

  • Base Model: BERTürk
  • Fine-tuning Data: SYNTÜRK SENTAGRAM Dataset
  • Hyperparameter Optimization: Optuna
  • Hugging Face Trainer API: Used for training and evaluation.
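Optuna and the Trainer API can be combined through `Trainer.hyperparameter_search`, which takes a function describing the search space. The sketch below shows what such a function could look like. Every hyperparameter name choice and range here is an illustrative assumption; the card does not state which values were actually searched or selected.

```python
def optuna_hp_space(trial):
    """Search space for an Optuna trial.

    All ranges below are illustrative assumptions, not the values
    used to train SENTAGRAM.
    """
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
    }


# Usage with a Trainer constructed with model_init=... (not run here;
# n_trials=20 is an arbitrary example value):
# best_run = trainer.hyperparameter_search(
#     direction="maximize",
#     backend="optuna",
#     hp_space=optuna_hp_space,
#     n_trials=20,
# )
```

The keys of the returned dictionary correspond to `TrainingArguments` fields, which the Trainer overrides for each trial.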

Limitations and Future Work

Known Limitations

  • Out-of-Distribution Data: The model's performance may not be reliable on sentences or text types significantly different from the training data.
  • Ambiguity: The model might struggle with ambiguous grammatical structures, where multiple interpretations are possible.
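One practical way to work around these limitations is to flag tokens whose top predicted probability is low, so that ambiguous or out-of-distribution cases can be reviewed rather than trusted. The following is a minimal sketch in plain Python. The function names, the 0.6 threshold, and the example logits are assumptions for illustration; in practice the logits would come from `outputs.logits`.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def flag_uncertain(token_logits: List[List[float]], threshold: float = 0.6) -> List[int]:
    """Return indices of tokens whose top class probability is below threshold."""
    flagged = []
    for index, logits in enumerate(token_logits):
        if max(softmax(logits)) < threshold:
            flagged.append(index)
    return flagged


# Made-up logits: the first token is confident, the second is nearly uniform.
example_logits = [[4.0, 0.5, 0.1], [1.0, 0.9, 0.8]]
print(flag_uncertain(example_logits))  # [1]
```

Softmax probabilities are not calibrated confidence scores, so a suitable threshold would need to be chosen on held-out data.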

Future Improvements

We plan to enhance the model by integrating additional grammatical features, such as semantic roles and more complex sentence structures. This will further improve its ability to process and understand the nuances of the Turkish language.

Ethical Considerations

  • Bias: The model was trained on a dataset that reflects specific sources and styles of Turkish. It may not generalize well to all varieties of the language.
  • Fairness: Care was taken to ensure that the dataset is balanced in terms of sentence structures and grammatical elements, but there may still be biases present.

License

This model is licensed under the Apache 2.0 License.

Citation

If you use this model in your research or applications, please cite it as follows:

@misc{synturk-sentagram,
  author    = {SYNTÜRK Team},
  title     = {SENTAGRAM Model},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/synturk/sentagram},
}

Contact

For more information or questions, please contact the SYNTÜRK Team through our GitHub repository.
