Multilingual Language Detection Model

Model Description

This repository contains a multilingual language detection model based on the XLM-RoBERTa base architecture. The model distinguishes between 21 languages: Arabic, Azerbaijani, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, and Chinese.

How to Use

You can run this model through the transformers text-classification pipeline, or load the tokenizer and model directly for more control, as shown in the example below.

Quick Start

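First, install the transformers library if you haven't already, along with PyTorch and SentencePiece, which the example below relies on:

pip install transformers torch sentencepiece

Then load the tokenizer and model and classify a sample text:
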
from transformers import AutoModelForSequenceClassification, XLMRobertaTokenizer
import torch

# Load tokenizer and model
tokenizer = XLMRobertaTokenizer.from_pretrained("LocalDoc/language_detection")
model = AutoModelForSequenceClassification.from_pretrained("LocalDoc/language_detection")

# Prepare text
text = "Əlqasım oğulları vorzakondu"
encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)

# Prediction
model.eval()
with torch.no_grad():
    outputs = model(**encoded_input)

# Process the outputs
logits = outputs.logits
probabilities = torch.nn.functional.softmax(logits, dim=-1)
predicted_class_index = probabilities.argmax().item()
# Language codes in label order: index 0 is LABEL_0 (az), index 20 is LABEL_20 (zh)
labels = ["az", "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja", "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]
predicted_label = labels[predicted_class_index]
print(f"Predicted Language: {predicted_label}")
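The pipeline route mentioned under How to Use can be sketched as follows. This is a minimal sketch, not the authors' reference code: the helper name detect_languages and its classifier parameter are hypothetical, and it assumes the pipeline returns the raw LABEL_n names listed under Language Label Information.

```python
def detect_languages(texts, classifier=None):
    """Classify a list of strings and return (raw_label, score) pairs.

    `classifier` is any callable that behaves like a transformers
    text-classification pipeline. When omitted, the real pipeline is loaded.
    """
    if classifier is None:
        # Imported lazily so the helper can be exercised with a stand-in classifier.
        from transformers import pipeline
        classifier = pipeline("text-classification", model="LocalDoc/language_detection")

    # Tokenizer arguments are forwarded so long inputs are cut to 512 tokens,
    # matching the manual example above.
    results = classifier(list(texts), truncation=True, max_length=512)

    # For a list input the pipeline yields one {"label", "score"} dict per text.
    return [(r["label"], float(r["score"])) for r in results]
```

Calling detect_languages(["Əlqasım oğulları vorzakondu"]) downloads the model on first use and returns one pair per input text. The label is the raw LABEL_n string, which still needs to be mapped to a language code.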

Language Label Information

Each prediction is one of 21 labels, LABEL_0 through LABEL_20. The following table maps each label to its language code and language name:

Label	Language Code	Language Name
LABEL_0	az	Azerbaijani
LABEL_1	ar	Arabic
LABEL_2	bg	Bulgarian
LABEL_3	de	German
LABEL_4	el	Greek
LABEL_5	en	English
LABEL_6	es	Spanish
LABEL_7	fr	French
LABEL_8	hi	Hindi
LABEL_9	it	Italian
LABEL_10	ja	Japanese
LABEL_11	nl	Dutch
LABEL_12	pl	Polish
LABEL_13	pt	Portuguese
LABEL_14	ru	Russian
LABEL_15	sw	Swahili
LABEL_16	th	Thai
LABEL_17	tr	Turkish
LABEL_18	ur	Urdu
LABEL_19	vi	Vietnamese
LABEL_20	zh	Chinese

Use this mapping to convert the model's raw labels into language codes or names for further processing or analysis.
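As an illustration, the table can be turned into a small lookup helper. The names LANGUAGES and decode_label are hypothetical and not part of the model or the transformers library; the entries are copied from the table above in label order.

```python
# (code, name) pairs in label order: position 0 is LABEL_0, position 20 is LABEL_20.
LANGUAGES = [
    ("az", "Azerbaijani"), ("ar", "Arabic"), ("bg", "Bulgarian"),
    ("de", "German"), ("el", "Greek"), ("en", "English"),
    ("es", "Spanish"), ("fr", "French"), ("hi", "Hindi"),
    ("it", "Italian"), ("ja", "Japanese"), ("nl", "Dutch"),
    ("pl", "Polish"), ("pt", "Portuguese"), ("ru", "Russian"),
    ("sw", "Swahili"), ("th", "Thai"), ("tr", "Turkish"),
    ("ur", "Urdu"), ("vi", "Vietnamese"), ("zh", "Chinese"),
]


def decode_label(label):
    """Convert a raw label such as 'LABEL_7' into a (code, name) pair."""
    prefix, _, index = label.partition("_")
    if prefix != "LABEL" or not index.isdigit() or int(index) >= len(LANGUAGES):
        raise ValueError(f"Unknown label: {label!r}")
    return LANGUAGES[int(index)]


code, name = decode_label("LABEL_7")
print(code, name)
```

Keeping codes and names in one ordered list means the position doubles as the class index, so the same list can also decode an argmax index directly with LANGUAGES[index].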

Training Performance

The model was trained for three epochs. Validation loss fell and accuracy and F1 score rose at every epoch; training loss rose slightly in epoch 2 before dropping sharply in epoch 3:

Epoch 1: Training Loss: 0.0127, Validation Loss: 0.0174, Accuracy: 0.9966, F1 Score: 0.9966
Epoch 2: Training Loss: 0.0149, Validation Loss: 0.0141, Accuracy: 0.9973, F1 Score: 0.9973
Epoch 3: Training Loss: 0.0001, Validation Loss: 0.0109, Accuracy: 0.9984, F1 Score: 0.9984

Test Results

The model achieved the following results on the test set:

Loss: 0.0133
Accuracy: 0.9975
F1 Score: 0.9975
Precision: 0.9975
Recall: 0.9975
Evaluation Time: 17.5 seconds
Samples per Second: 599.685
Steps per Second: 9.424
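The throughput figures also allow a rough cross-check of the evaluation setup. The sketch below only multiplies the reported numbers; the test set size and batch size are not stated in this document, so the results are estimates (the evaluation time is rounded), not reported facts.

```python
# Figures reported in the test results above.
eval_seconds = 17.5
samples_per_second = 599.685
steps_per_second = 9.424

# Approximate number of test samples and evaluation steps.
approx_samples = eval_seconds * samples_per_second
approx_steps = eval_seconds * steps_per_second

# Samples processed per step, i.e. an estimate of the evaluation batch size.
approx_batch_size = samples_per_second / steps_per_second

print(f"samples ~ {approx_samples:.0f}")
print(f"steps ~ {approx_steps:.0f}")
print(f"batch size ~ {approx_batch_size:.1f}")
```

Under these assumptions the test set holds roughly 10,500 samples evaluated in about 165 steps, which is consistent with an evaluation batch size of around 64.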

License

The dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. This license allows you to share and adapt the dataset, provided you give attribution to the source, and prohibits commercial use.

Contact Information

If you have any questions or suggestions, please contact us at [[email protected]].