---
base_model:
- meta-llama/Llama-3.1-8B-Instruct
license: llama3.1
language:
- gl
metrics:
- bleu
- rouge
model-index:
- name: Llama-3.1-8B-Instruct-Galician
  results:
  - task:
      type: text-generation
    dataset:
      name: alpaca_data_galician
      type: alpaca_data_galician
    metrics:
    - name: bleu
      type: bleu-4
      value: 23.13
    - name: rouge
      type: rouge-l
      value: 21.84
pipeline_tag: text-generation
library_name: transformers
widget:
- text: "Onde está o concello de Frades?"
  output:
    text: Frades é un concello da provincia da Coruña, pertencente á comarca de Ordes. Está situado a 15 quilómetros de Santiago de Compostela.
---

# Llama-3.1-8B-Instruct-Galician

This model is the result of continued pretraining of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) on the [CorpusNós](https://zenodo.org/records/11655219) dataset.

## Model Description

- **Developed by:** [UDC Information Retrieval Lab (IRLab)](https://huggingface.co/irlab-udc)
- **Language(s) (NLP):** Multilingual, adapted to Galician
- **License:** llama3.1
- **Finetuned from model:** [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **Repository:** [Adapting Large Language Models for Underrepresented Languages](https://gitlab.irlab.org/eliseo.bao/xovetic-llms-underrepresented-languages)
- **Paper:** _Coming soon_

## How to Get Started with the Model

```python
import transformers
import torch

model_id = "irlab-udc/Llama-3.1-8B-Instruct-Galician"

# Build a text-generation pipeline; bfloat16 weights reduce memory use,
# and device_map="auto" places the model on the available device(s).
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Chat-style input: a system prompt plus one user turn written in Galician.
messages = [
    {"role": "system", "content": "You are a conversational AI that always responds in Galician."},
    {"role": "user", "content": "Cal é a principal vantaxe de usar Scrum?"},
]

outputs = pipeline(messages, max_new_tokens=512)

# The last message in the returned conversation is the assistant's reply.
print(outputs[0]["generated_text"][-1]["content"])
```

#### Training Hyperparameters

| Parameter                   | Value                                       |
|-----------------------------|---------------------------------------------|
| learning_rate               | 0.0001                                      |
| train_batch_size            | 32                                          |
| eval_batch_size             | 1                                           |
| seed                        | 42                                          |
| distributed_type            | multi-GPU                                   |
| num_devices                 | 4                                           |
| gradient_accumulation_steps | 2                                           |
| total_train_batch_size      | 256                                         |
| total_eval_batch_size       | 4                                           |
| optimizer                   | Adam with betas=(0.9, 0.999), epsilon=1e-08 |
| lr_scheduler_type           | cosine                                      |
| lr_scheduler_warmup_ratio   | 0.1                                         |
| num_epochs                  | 1.0                                         |

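The total training batch size in the hyperparameter table follows from the per-device batch size, the number of devices, and the gradient accumulation steps: 32 × 4 × 2 = 256. The sketch below reproduces that arithmetic and illustrates a cosine schedule with a 10% warmup. It assumes a linear warmup from zero followed by a half-cosine decay to zero, which is the default behaviour of the `get_cosine_schedule_with_warmup` helper in `transformers`; this card does not state the exact scheduler implementation, and `total_steps` is a hypothetical value chosen only for illustration.

```python
import math

# Values copied from the hyperparameter table.
PEAK_LR = 1e-4
PER_DEVICE_BATCH = 32
NUM_DEVICES = 4
GRAD_ACCUM_STEPS = 2
WARMUP_RATIO = 0.1


def effective_batch_size(per_device: int, devices: int, accum_steps: int) -> int:
    """Number of samples contributing to a single optimizer update."""
    return per_device * devices * accum_steps


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = PEAK_LR,
              warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at `step`, assuming linear warmup then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(effective_batch_size(PER_DEVICE_BATCH, NUM_DEVICES, GRAD_ACCUM_STEPS))

    total_steps = 5000  # hypothetical; the card does not report the total step count
    for step in (0, 250, 500, 2750, 5000):
        print(step, cosine_lr(step, total_steps))
```

Under these assumptions the learning rate peaks at 0.0001 once warmup ends and returns to zero at the final step.
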
#### Training Results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 2.0606        | 0.1682 | 900  | 2.0613          |
| 1.9898        | 0.3363 | 1800 | 1.9929          |
| 1.9847        | 0.5045 | 2700 | 1.9613          |
| 1.9577        | 0.6726 | 3600 | 1.9445          |
| 1.9287        | 0.8408 | 4500 | 1.9368          |

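The validation losses in the training results table can be read as perplexities. The sketch below assumes the reported values are mean per-token cross-entropy in nats, the usual convention for causal language modelling in `transformers`; the card does not state this explicitly. Under that assumption, perplexity is simply `exp(loss)`.

```python
import math

# (step, validation loss) pairs copied from the training results table.
VALIDATION_LOSS = [
    (900, 2.0613),
    (1800, 1.9929),
    (2700, 1.9613),
    (3600, 1.9445),
    (4500, 1.9368),
]


def perplexity(loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


if __name__ == "__main__":
    for step, loss in VALIDATION_LOSS:
        print(f"step {step:>4}: loss {loss:.4f} -> perplexity {perplexity(loss):.2f}")
```

Because the exponential is monotonic, the steadily falling validation loss corresponds to a steadily falling perplexity across the five checkpoints.
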
## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 4x NVIDIA A100 SXM4 80 GB (TDP of 400 W)
- **Hours used:** 60
- **Cloud Provider:** Private infrastructure
- **Carbon Emitted:** 10.37 kg CO₂ eq.

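An estimate of this kind is the product of power draw, runtime, and the carbon intensity of the electricity used. The card does not state the carbon-intensity factor or how the four GPUs enter the calculation, so the sketch below is only a plausibility check. It assumes a factor of 0.432 kg CO₂ eq. per kWh and a single 400 W device running for 60 hours; both are assumptions, chosen because they reproduce a figure of about 10.37 kg.

```python
def estimate_emissions_kg(power_watts: float, hours: float,
                          carbon_intensity_kg_per_kwh: float) -> float:
    """Estimated emissions in kg CO2 eq.: energy (kWh) times carbon intensity."""
    energy_kwh = power_watts * hours / 1000.0
    return energy_kwh * carbon_intensity_kg_per_kwh


TDP_WATTS = 400      # TDP reported in the card
HOURS_USED = 60      # hours reported in the card
ASSUMED_CARBON_INTENSITY = 0.432  # kg CO2 eq. per kWh; assumption, not stated in the card

if __name__ == "__main__":
    estimate = estimate_emissions_kg(TDP_WATTS, HOURS_USED, ASSUMED_CARBON_INTENSITY)
    print(f"{estimate:.2f} kg CO2 eq.")
```

If the 60 hours were wall-clock time with all four GPUs drawing their full TDP, the same formula would give a figure four times larger, so the reported value is best treated as an estimate whose exact inputs are not documented here.
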
## Citation

_Coming soon_