metadata

base_model:
  - meta-llama/Llama-3.1-8B-Instruct
license: llama3.1
language:
  - gl
metrics:
  - bleu
  - rouge
model-index:
  - name: Llama-3.1-8B-Instruct-Galician
    results:
      - task:
          type: text-generation
        dataset:
          name: alpaca_data_galician
          type: alpaca_data_galician
        metrics:
          - name: bleu
            type: bleu-4
            value: 23.13
          - name: rouge
            type: rouge-l
            value: 21.84
pipeline_tag: text-generation
library_name: transformers
widget:
  - text: Onde está o concello de Frades?
    output:
      text: >-
        Frades é un concello da provincia da Coruña, pertencente á comarca de
        Ordes. Está situado a 15 quilómetros de Santiago de Compostela.

Llama-3.1-8B-Instruct-Galician

This model is a version of meta-llama/Llama-3.1-8B-Instruct that has undergone continued pretraining on the CorpusNós dataset.

Model Description

Developed by: UDC Information Retrieval Lab (IRLab)
Language(s) (NLP): Multilingual, adapted to Galician
License: llama3.1
Finetuned from model: meta-llama/Llama-3.1-8B-Instruct
Repository: Adapting Large Language Models for Underrepresented Languages
Paper: Coming soon

How to Get Started with the Model

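import transformers
import torch

# Hugging Face Hub ID of the Galician-adapted model
model_id = "irlab-udc/Llama-3.1-8B-Instruct-Galician"

# Build a text-generation pipeline; bfloat16 weights reduce memory use,
# and device_map="auto" places the model on the available device(s)
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Chat-style input: a system prompt plus a user question in Galician
messages = [
    {"role": "system", "content": "You are a conversational AI that always responds in Galician."},
    {"role": "user", "content": "Cal é a principal vantaxe de usar Scrum?"},
]

# Generate up to 512 new tokens for the reply
outputs = pipeline(messages, max_new_tokens=512)

# The last message in the returned conversation is the assistant's answer
print(outputs[0]["generated_text"][-1]["content"])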

Training Hyperparameters

Parameter	Value
learning_rate	0.0001
train_batch_size	32
eval_batch_size	1
seed	42
distributed_type	multi-GPU
num_devices	4
gradient_accumulation_steps	2
total_train_batch_size	256
total_eval_batch_size	4
optimizer	Adam with betas=(0.9, 0.999), epsilon=1e-08
lr_scheduler_type	cosine
lr_scheduler_warmup_ratio	0.1
num_epochs	1.0
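
The total batch sizes in the table follow from the other rows. A minimal check, assuming the usual convention that the effective batch size is the per-device batch size multiplied by the number of devices and the gradient accumulation steps (the helper name `effective_batch_size` is hypothetical and not part of the training code):

```python
def effective_batch_size(per_device: int, num_devices: int, grad_accum_steps: int = 1) -> int:
    """Number of samples contributing to one optimizer step (or one eval pass)."""
    return per_device * num_devices * grad_accum_steps


# Values copied from the hyperparameter table
total_train = effective_batch_size(per_device=32, num_devices=4, grad_accum_steps=2)
total_eval = effective_batch_size(per_device=1, num_devices=4)  # no accumulation at eval time

print(total_train, total_eval)
```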

Training results

Training Loss	Epoch	Step	Validation Loss
2.0606	0.1682	900	2.0613
1.9898	0.3363	1800	1.9929
1.9847	0.5045	2700	1.9613
1.9577	0.6726	3600	1.9445
1.9287	0.8408	4500	1.9368
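
The validation losses can also be read as perplexities. The sketch below assumes the reported loss is the mean token-level cross-entropy in nats, which is the common default for causal language model training but is not stated in this card; under that assumption, perplexity is simply the exponential of the loss.

```python
import math

# (step -> validation loss) pairs copied from the table above
validation_loss = {900: 2.0613, 1800: 1.9929, 2700: 1.9613, 3600: 1.9445, 4500: 1.9368}


def perplexity(loss_nats: float) -> float:
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    return math.exp(loss_nats)


perplexities = {step: perplexity(loss) for step, loss in validation_loss.items()}

for step, ppl in perplexities.items():
    print(f"step {step}: validation perplexity ~ {ppl:.2f}")
```

Because the validation loss decreases at every checkpoint, the derived perplexity decreases monotonically as well.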

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: 4x NVIDIA A100 SXM4 80 GB (TDP of 400W)
Hours used: 60
Cloud Provider: Private infrastructure
Carbon Emitted: 10.37 kg CO₂ eq.
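
As a back-of-the-envelope sanity check on these figures, the snippet below computes an upper bound on GPU energy use and the carbon intensity implied by the reported emissions. It assumes every GPU draws its full TDP for the whole run and ignores CPU, cooling, and other datacenter overhead; the carbon intensity actually used by the calculator is not stated in this card, so the derived value is only an inference.

```python
# Figures copied from the list above
num_gpus = 4
tdp_kw = 0.400           # 400 W per GPU, expressed in kW
hours_used = 60
reported_kg_co2eq = 10.37

# Upper-bound energy estimate: assumes constant full-TDP draw on all GPUs
energy_kwh = num_gpus * tdp_kw * hours_used

# Carbon intensity implied by the reported emissions (assumption: GPU energy only)
implied_intensity = reported_kg_co2eq / energy_kwh

print(f"energy: {energy_kwh:.1f} kWh")
print(f"implied carbon intensity: {implied_intensity:.3f} kg CO2 eq. per kWh")
```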

Citation

Coming soon