Model Card: NLLB-200 French-Wolof (🇫🇷↔️🇸🇳) Translation Model

Model Details

Model Description

A fine-tuned version of Meta's NLLB-200 (600M, distilled) model specialized for translation between French and Wolof. The model was trained to make content more accessible across the two languages.

  • Developed by: Lahad
  • Model type: Sequence-to-Sequence Translation Model
  • Language(s): French (fr_Latn) ↔️ Wolof (wol_Latn)
  • License: CC-BY-NC-4.0
  • Finetuned from model: facebook/nllb-200-distilled-600M

Model Sources

Uses

Direct Use

  • Text translation between French and Wolof
  • Content localization
  • Language learning assistance
  • Cross-cultural communication

Out-of-Scope Use

  • Commercial use without proper licensing
  • Translation of highly technical or specialized content
  • Legal or medical document translation where professional human translation is required
  • Real-time speech translation

Bias, Risks, and Limitations

  1. Language Variety Limitations:

    • Limited coverage of regional Wolof dialects
    • May not handle cultural nuances effectively
  2. Technical Limitations:

    • Maximum context window of 128 tokens
    • Reduced performance on technical/specialized content
    • May struggle with informal language and slang
  3. Potential Biases:

    • Training data may reflect cultural biases
    • May perform better on standard/formal language

Recommendations

  • Use for general communication and content translation
  • Verify translations for critical communications
  • Consider regional language variations
  • Implement human review for sensitive content
  • Test translations in intended context before deployment

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Lahad/nllb200-francais-wolof")
model = AutoModelForSeq2SeqLM.from_pretrained("Lahad/nllb200-francais-wolof")

# Translate French text into Wolof (inputs truncated to max_length tokens)
def translate(text, max_length=128):
    inputs = tokenizer(
        text,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    )
    
    # Generate the translation, forcing Wolof (wol_Latn) as the output language
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("wol_Latn"),
        max_length=max_length
    )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
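
A quick check of the helper might look like the following; the input sentence is illustrative, and the exact output depends on the released weights. If the saved tokenizer does not already default to the French source tag, it may need to be set explicitly (e.g. tokenizer.src_lang) before calling the function.

# Example: translate a French sentence into Wolof
print(translate("Bonjour, comment allez-vous ?"))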

Training Details

Training Data

  • Dataset: galsenai/centralized_wolof_french_translation_data
  • Split: 80% training, 20% testing
  • Format: JSON pairs of French and Wolof translations
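
The split above can be reproduced with the Hugging Face datasets library. This is a minimal sketch, assuming the dataset exposes a single default train split and using an arbitrary seed for reproducibility (neither is stated in this card):

from datasets import load_dataset

# Load the French-Wolof pairs and reproduce an 80/20 train/test split
# (the seed value is an illustrative choice, not documented in this card)
dataset = load_dataset("galsenai/centralized_wolof_french_translation_data")
split = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_data, test_data = split["train"], split["test"]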

Training Procedure

Preprocessing

  • Dynamic tokenization with padding
  • Maximum sequence length: 128 tokens
  • Source/target language tags: fr_Latn/wol_Latn
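
A preprocessing function along these lines would implement the steps above. It is only a sketch: the column names ("fr" and "wo") are assumptions, since the card does not document the dataset schema, and padding is left to a data collator at batch time rather than applied here.

# Source/target tags as listed in this card; note that the base NLLB-200
# tokenizer ships "fra_Latn" as its French code, so adjust if needed
tokenizer.src_lang = "fr_Latn"
tokenizer.tgt_lang = "wol_Latn"

def preprocess(batch, max_length=128):
    # Tokenize French sources and Wolof targets, truncating to 128 tokens;
    # dynamic padding is applied later (e.g. by DataCollatorForSeq2Seq)
    # NOTE: column names "fr"/"wo" are placeholders; check the dataset schema
    return tokenizer(
        batch["fr"],
        text_target=batch["wo"],
        max_length=max_length,
        truncation=True,
    )

tokenized_train = train_data.map(preprocess, batched=True)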

Training Hyperparameters

  • Learning rate: 2e-5
  • Batch size: 8 per device
  • Training epochs: 3
  • FP16 training: Enabled
  • Evaluation strategy: Per epoch
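
These hyperparameters map onto Seq2SeqTrainingArguments roughly as follows. This is a sketch, not the author's exact configuration: the output directory is a placeholder, and any argument not listed above keeps its library default.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb200-francais-wolof",  # placeholder output path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    fp16=True,
    eval_strategy="epoch",  # "evaluation_strategy" in older Transformers releases
)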

Evaluation

Testing Data, Factors & Metrics

  • Testing Data: 20% of the dataset
  • Metrics: [Not Specified]
  • Evaluation Factors:
    • Translation accuracy
    • Semantic preservation
    • Grammar correctness

Environmental Impact

  • Hardware Type: NVIDIA T4 GPU
  • Hours used: 5
  • Cloud Provider: [Not Specified]
  • Compute Region: [Not Specified]
  • Carbon Emitted: [Not Calculated]

Technical Specifications

Model Architecture and Objective

  • Architecture: NLLB-200 (Distilled 600M version)
  • Objective: Neural Machine Translation
  • Parameters: 600M
  • Context Window: 128 tokens

Compute Infrastructure

  • Training Hardware: NVIDIA T4 GPU
  • Training Time: 5 hours
  • Software Framework: Hugging Face Transformers

Model Card Contact

For questions about this model, please create an issue on the model's Hugging Face repository.
