# Model Card: NLLB-200 French-Wolof (🇫🇷↔️🇸🇳) Translation Model
## Model Details

### Model Description
A fine-tuned version of Meta's NLLB-200 (600M distilled) model specialized for French-to-Wolof translation. It was trained to improve the accessibility of content between French and Wolof.
- Developed by: Lahad
- Model type: Sequence-to-Sequence Translation Model
- Language(s): French (fr_Latn) ↔️ Wolof (wol_Latn)
- License: CC-BY-NC-4.0
- Finetuned from model: facebook/nllb-200-distilled-600M
### Model Sources
- Repository: Hugging Face - Lahad/nllb200-francais-wolof
- GitHub: Fine-tuning NLLB-200 for French-Wolof
## Uses

### Direct Use
- Text translation between French and Wolof
- Content localization
- Language learning assistance
- Cross-cultural communication
### Out-of-Scope Use
- Commercial use (not permitted under the CC-BY-NC-4.0 license)
- Translation of highly technical or specialized content
- Legal or medical document translation where professional human translation is required
- Real-time speech translation
## Bias, Risks, and Limitations
Language Variety Limitations:
- Limited coverage of regional Wolof dialects
- May not handle cultural nuances effectively
Technical Limitations:
- Maximum context window of 128 tokens
- Reduced performance on technical/specialized content
- May struggle with informal language and slang
Potential Biases:
- Training data may reflect cultural biases
- May perform better on standard/formal language
### Recommendations
- Use for general communication and content translation
- Verify translations for critical communications
- Consider regional language variations
- Implement human review for sensitive content
- Test translations in intended context before deployment
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Lahad/nllb200-francais-wolof")
model = AutoModelForSeq2SeqLM.from_pretrained("Lahad/nllb200-francais-wolof")

# Translation function
def translate(text, max_length=128):
    # Tokenize the French input, truncating to the model's 128-token context window
    inputs = tokenizer(
        text,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    # Generate the translation, forcing Wolof as the target language
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("wol_Latn"),
        max_length=max_length,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
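A quick check of the helper above might look like this (the sentence is arbitrary, and the exact Wolof output depends on the fine-tuned weights):

```python
# Translate a sample French sentence into Wolof
print(translate("Bonjour, comment allez-vous ?"))
```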
## Training Details

### Training Data
- Dataset: galsenai/centralized_wolof_french_translation_data
- Split: 80% training, 20% testing
- Format: JSON pairs of French and Wolof translations
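As a rough sketch, the split described above could be reproduced with the `datasets` library; the split name and random seed below are assumptions, not documented in this card:

```python
from datasets import load_dataset

# Load the French-Wolof parallel corpus from the Hugging Face Hub
dataset = load_dataset("galsenai/centralized_wolof_french_translation_data", split="train")

# Reproduce an 80/20 train/test split (the seed is an assumption)
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_data, test_data = splits["train"], splits["test"]
```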
### Training Procedure

#### Preprocessing
- Dynamic tokenization with padding
- Maximum sequence length: 128 tokens
- Source/target language tags: fr_Latn/wol_Latn
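A minimal sketch of this preprocessing step, reusing the `train_data`/`test_data` splits from the snippet above. The column names `fr` and `wo` are assumptions, and note that while this card lists `fr_Latn` as the French tag, the stock NLLB-200 tokenizer uses `fra_Latn`, so the tag may need adjusting:

```python
from transformers import AutoTokenizer

# For fine-tuning, start from the base checkpoint's tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

# Language tags as listed in this card (the stock NLLB tokenizer expects fra_Latn for French)
tokenizer.src_lang = "fr_Latn"
tokenizer.tgt_lang = "wol_Latn"

def preprocess(batch):
    # Tokenize source and target pairs, padding/truncating to 128 tokens
    return tokenizer(
        batch["fr"],              # assumed source column name
        text_target=batch["wo"],  # assumed target column name
        max_length=128,
        padding="max_length",
        truncation=True,
    )

tokenized_train = train_data.map(preprocess, batched=True)
tokenized_test = test_data.map(preprocess, batched=True)
```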
#### Training Hyperparameters
- Learning rate: 2e-5
- Batch size: 8 per device
- Training epochs: 3
- FP16 training: Enabled
- Evaluation strategy: Per epoch
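These settings map roughly onto Hugging Face `Seq2SeqTrainingArguments` as in the sketch below, which reuses the tokenized splits from the preprocessing snippet; the output directory and eval batch size are placeholders not stated in this card:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Start from the base checkpoint named in this card
base_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb200-francais-wolof",  # placeholder output path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,         # assumed to match the train batch size
    num_train_epochs=3,
    fp16=True,
    eval_strategy="epoch",                # named evaluation_strategy in older transformers releases
)

trainer = Seq2SeqTrainer(
    model=base_model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=base_model),
)
trainer.train()
```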
## Evaluation

### Testing Data, Factors & Metrics

- Testing Data: 20% held-out split of the dataset
- Metrics: [Not Specified]
- Evaluation Factors:
  - Translation accuracy
  - Semantic preservation
  - Grammar correctness
## Environmental Impact
- Hardware Type: NVIDIA T4 GPU
- Hours used: 5
- Cloud Provider: [Not Specified]
- Compute Region: [Not Specified]
- Carbon Emitted: [Not Calculated]
## Technical Specifications

### Model Architecture and Objective
- Architecture: NLLB-200 (Distilled 600M version)
- Objective: Neural Machine Translation
- Parameters: 600M
- Context Window: 128 tokens
### Compute Infrastructure
- Training Hardware: NVIDIA T4 GPU
- Training Time: 5 hours
- Software Framework: Hugging Face Transformers
## Model Card Contact
For questions about this model, please create an issue on the model's Hugging Face repository.