Model Card: NLLB-200 French-Wolof (🇫🇷↔️🇸🇳) Translation Model

Model Details

Model Description

A fine-tuned version of Meta's NLLB-200 (600M, distilled) model specialized for translation between French and Wolof. The model was trained to make content more accessible across the two languages.

  • Developed by: Lahad
  • Model type: Sequence-to-Sequence Translation Model
  • Language(s): French (fr_Latn) ↔️ Wolof (wol_Latn)
  • License: CC-BY-NC-4.0
  • Finetuned from model: facebook/nllb-200-distilled-600M

Model Sources

Uses

Direct Use

  • Text translation between French and Wolof
  • Content localization
  • Language learning assistance
  • Cross-cultural communication

Out-of-Scope Use

  • Commercial use without proper licensing
  • Translation of highly technical or specialized content
  • Legal or medical document translation where professional human translation is required
  • Real-time speech translation

Bias, Risks, and Limitations

  1. Language Variety Limitations:

    • Limited coverage of regional Wolof dialects
    • May not handle cultural nuances effectively
  2. Technical Limitations:

    • Maximum context window of 128 tokens
    • Reduced performance on technical/specialized content
    • May struggle with informal language and slang
  3. Potential Biases:

    • Training data may reflect cultural biases
    • May perform better on standard/formal language

Recommendations

  • Use for general communication and content translation
  • Verify translations for critical communications
  • Consider regional language variations
  • Implement human review for sensitive content
  • Test translations in intended context before deployment

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Lahad/nllb200-francais-wolof")
model = AutoModelForSeq2SeqLM.from_pretrained("Lahad/nllb200-francais-wolof")

# Translate French text into Wolof (inputs truncated to max_length tokens)
def translate(text, max_length=128):
    inputs = tokenizer(
        text,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    )
    
    # Generate the translation, forcing Wolof (wol_Latn) as the output language
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("wol_Latn"),
        max_length=max_length
    )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
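
A quick check of the helper might look like the following; the input sentence is illustrative, and the exact output depends on the released weights. If the saved tokenizer does not already default to the French source tag, it may need to be set explicitly (e.g. tokenizer.src_lang) before calling the function.

# Example: translate a French sentence into Wolof
print(translate("Bonjour, comment allez-vous ?"))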

Training Details

Training Data

  • Dataset: galsenai/centralized_wolof_french_translation_data
  • Split: 80% training, 20% testing
  • Format: JSON pairs of French and Wolof translations
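
The split above can be reproduced with the Hugging Face datasets library. This is a minimal sketch, assuming the dataset exposes a single default train split and using an arbitrary seed for reproducibility (neither is stated in this card):

from datasets import load_dataset

# Load the French-Wolof pairs and reproduce an 80/20 train/test split
# (the seed value is an illustrative choice, not documented in this card)
dataset = load_dataset("galsenai/centralized_wolof_french_translation_data")
split = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_data, test_data = split["train"], split["test"]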

Training Procedure

Preprocessing

  • Dynamic tokenization with padding
  • Maximum sequence length: 128 tokens
  • Source/target language tags: fr_Latn/wol_Latn
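
A preprocessing function along these lines would implement the steps above. It is only a sketch: the column names ("fr" and "wo") are assumptions, since the card does not document the dataset schema, and padding is left to a data collator at batch time rather than applied here.

# Source/target tags as listed in this card; note that the base NLLB-200
# tokenizer ships "fra_Latn" as its French code, so adjust if needed
tokenizer.src_lang = "fr_Latn"
tokenizer.tgt_lang = "wol_Latn"

def preprocess(batch, max_length=128):
    # Tokenize French sources and Wolof targets, truncating to 128 tokens;
    # dynamic padding is applied later (e.g. by DataCollatorForSeq2Seq)
    # NOTE: column names "fr"/"wo" are placeholders; check the dataset schema
    return tokenizer(
        batch["fr"],
        text_target=batch["wo"],
        max_length=max_length,
        truncation=True,
    )

tokenized_train = train_data.map(preprocess, batched=True)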

Training Hyperparameters

  • Learning rate: 2e-5
  • Batch size: 8 per device
  • Training epochs: 3
  • FP16 training: Enabled
  • Evaluation strategy: Per epoch
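
These hyperparameters map onto Seq2SeqTrainingArguments roughly as follows. This is a sketch, not the author's exact configuration: the output directory is a placeholder, and any argument not listed above keeps its library default.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb200-francais-wolof",  # placeholder output path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    fp16=True,
    eval_strategy="epoch",  # "evaluation_strategy" in older Transformers releases
)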

Evaluation

Testing Data, Factors & Metrics

  • Testing Data: 20% of the dataset
  • Metrics: [Not Specified]
  • Evaluation Factors:
    • Translation accuracy
    • Semantic preservation
    • Grammar correctness

Environmental Impact

  • Hardware Type: NVIDIA T4 GPU
  • Hours used: 5
  • Cloud Provider: [Not Specified]
  • Compute Region: [Not Specified]
  • Carbon Emitted: [Not Calculated]

Technical Specifications

Model Architecture and Objective

  • Architecture: NLLB-200 (Distilled 600M version)
  • Objective: Neural Machine Translation
  • Parameters: 600M
  • Context Window: 128 tokens

Compute Infrastructure

  • Training Hardware: NVIDIA T4 GPU
  • Training Time: 5 hours
  • Software Framework: Hugging Face Transformers

Model Card Contact

For questions about this model, please create an issue on the model's Hugging Face repository.
