Model Card for GaMS-1B

We proudly present the family of GaMS (Generative Model for Slovene) models. The 1B version is based on Facebook's OPT model and is adapted for Slovene. GaMS-1B uses a BPE tokenizer with a vocabulary size of 80,000. The tokenizer was trained on Slovene, English, and Croatian data.

Acknowledgment

The model was developed within the PoVeJMo research program (Adaptive Natural Language Processing with Large Language Models), particularly within the research project titled SloLLaMai -- Open-access computationally efficient models for Slovenian. The program is funded within the Recovery and Resilience Plan by the Slovenian Research and Innovation Agency (ARIS) and NextGenerationEU. The authors also acknowledge the financial support from the Slovenian Research and Innovation Agency (research core funding No. P6-0411 -- Language Resources and Technologies for Slovene).

We thank everyone who worked on data collection and preparation, enabling us to train our model. Special thanks go to Nikola Ljubešić, Tjaša Arčon, Jaka Čibej, Simon Krek, Tomaž Erjavec and Iztok Kosem.

Basic information

Developed by: a team of researchers at the University of Ljubljana, Faculty of Computer and Information Science, and XLAB d.o.o. Team members: Domen Vreš, Martin Božič, Aljaž Potočnik, Tomaž Martinčič, Iztok Lebar Bajec, Timotej Petrič and Marko Robnik-Šikonja.
Languages: Slovene (primary), English, Croatian, Bosnian and Serbian (secondary)
License: Apache 2.0
Repository: https://github.com/SloLama/NeMo
Paper: https://www.sdjt.si/wp/wp-content/uploads/2024/09/JT-DH-2024_Vres_Bozic_Potocnik_Martincic_Robnik.pdf

Intended usage

This version of the model is quite small and lacks instruction and safety tuning. Hence, using it as a general-purpose model is STRONGLY DISCOURAGED! The model might also contain certain biases. We do not recommend using this model for any language other than Slovene.

The model can be efficiently tuned for specific use cases, as suggested by the promising results of fine-tuned models on the SuperGLUE and SI-NLI benchmarks.

How to get started with the model

Inference can be run with the following snippet of code:

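from transformers import pipeline

# Hugging Face Hub identifier of the model
model_id = "cjvt/GaMS-1B"

# device_map="auto" places the model on the available device(s);
# it requires the accelerate package to be installed
pline = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto"
)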

prompts = [
    "The examples of antonyms are:\nhigh => low\nwide => narrow\nbig =>",
    "Pristanek je bil prvi nadzorovani spust ameriškega vesoljskega plovila na površje Lune po Apollu 17 leta 1972, ko je na Luni pristala zadnja Nasina misija s posadko.\nDoslej so na Luni pristala vesoljska plovila le iz štirih drugih držav –",
    "U četvrtak je bila prva polufinalna večer Dore, a komentari na društvenim mrežama ne prestaju. U nedjeljno finale prošli su:"
]

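# Greedy decoding (do_sample=False); max_length counts the prompt
# tokens plus the generated tokens
sequences = pline(
    prompts,
    max_length=1000,
    do_sample=False,
    num_return_sequences=1
)

# Each prompt yields a list of generated sequences; print the first one
for seq in sequences:
    print("--------------------------")
    print(f"Result: {seq[0]['generated_text']}")
    print("--------------------------\n")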
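
The first prompt in the snippet follows a few-shot pattern: a short instruction, several solved examples, and an unfinished last line for the model to complete. Because the model is not instruction-tuned, prompts of this shape are likely a better fit than direct questions. The following is a minimal sketch of a helper that assembles such prompts; the function name build_few_shot_prompt and its parameters are illustrative and not part of the GaMS release.

```python
def build_few_shot_prompt(instruction, examples, query, separator=" =>"):
    """Join an instruction, solved examples and an open query into one prompt.

    Hypothetical helper: the name and the default separator are assumptions
    modelled on the antonym prompt used in the snippet above.
    """
    lines = [instruction]
    # Each solved example becomes one "source => target" line
    for source, target in examples:
        lines.append(f"{source}{separator} {target}")
    # The last line is left open so the model continues after the separator
    lines.append(f"{query}{separator}")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "The examples of antonyms are:",
    [("high", "low"), ("wide", "narrow")],
    "big",
)
print(prompt)
```

The resulting string is identical to the first entry of the prompts list, so it can be passed to the pipeline in the same way.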

Training details

Training data

The model was additionally pretrained on the following Slovene, English, and Croatian-Bosnian-Serbian (CBS) corpora:

Corpus	Language	# Tokens	Percentage
MetaFida	Slovene	3.35 B	11.9 %
KAS	Slovene	1.66 B	5.89 %
Trendi	Slovene	0.68 B	2.4 %
mC4	Slovene	2.88 B	10.25 %
MaCoCu	Slovene	2.34 B	8.3 %
CC100	Slovene	0.29 B	1.02 %
Riznica	Croatian	0.11 B	0.39 %
Hr News	Croatian	2.14 B	7.59 %
MaCoCu HBS	CBS	8.63 B	30.69 %
Wikipedia	English	5.61 B	19.93 %
CC-News	English	0.46 B	1.64 %

The total size of additional training data is 28.13 B tokens.
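
The shares in the table can be recomputed from the token counts. The sketch below does that and also groups the counts by language. It uses only the rounded values printed in the table, so its total comes out at 28.15 B, slightly above the reported 28.13 B, presumably because the reported figures were derived from unrounded counts.

```python
# Token counts in billions, copied from the table above (rounded values)
corpora = {
    "MetaFida": ("Slovene", 3.35),
    "KAS": ("Slovene", 1.66),
    "Trendi": ("Slovene", 0.68),
    "mC4": ("Slovene", 2.88),
    "MaCoCu": ("Slovene", 2.34),
    "CC100": ("Slovene", 0.29),
    "Riznica": ("Croatian", 0.11),
    "Hr News": ("Croatian", 2.14),
    "MaCoCu HBS": ("CBS", 8.63),
    "Wikipedia": ("English", 5.61),
    "CC-News": ("English", 0.46),
}

total = sum(tokens for _, tokens in corpora.values())

# Share of each corpus in percent of the (rounded) total
shares = {name: 100 * tokens / total for name, (_, tokens) in corpora.items()}

# Token counts aggregated per language
by_language = {}
for language, tokens in corpora.values():
    by_language[language] = by_language.get(language, 0.0) + tokens

for language, tokens in by_language.items():
    print(f"{language}: {tokens:.2f} B ({100 * tokens / total:.1f} %)")
```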

Training Procedure

The model was trained using the NeMo framework on the Slovene HPC Vega, utilizing 64 A100 GPUs simultaneously. The model was trained for 4 epochs. The WECHSEL initialization method was used to initialize the embedding matrix of the new vocabulary. All layers apart from the embedding and the output layer were frozen during the first epoch to avoid forgetting. Training took approximately 60 hours. The model was trained with a batch size of 1024 (2 million tokens) using the Adam optimizer and a cosine learning rate scheduler with 10,000 warmup and 5,000 constant steps.
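
The batch and scheduler figures above can be made concrete with a little arithmetic. The sketch below is an illustration under stated assumptions, not the authors' training configuration: the sequence length of 2048 is inferred from "batch size 1024 (2 million tokens)", the step count is estimated from 4 epochs over 28.13 B tokens, the peak and minimum learning rates are placeholders, and the constant phase is assumed to hold the minimum learning rate at the end of training.

```python
import math

SEQUENCES_PER_BATCH = 1024
SEQUENCE_LENGTH = 2048  # assumption: inferred from "2 million tokens" per batch
TOKENS_PER_STEP = SEQUENCES_PER_BATCH * SEQUENCE_LENGTH

TOTAL_TOKENS = 28.13e9  # size of the additional training data
EPOCHS = 4
# Rough estimate only; the real number of optimizer steps is not stated
ESTIMATED_STEPS = round(EPOCHS * TOTAL_TOKENS / TOKENS_PER_STEP)


def lr_at_step(step, max_steps=ESTIMATED_STEPS, warmup_steps=10_000,
               constant_steps=5_000, peak_lr=1e-4, min_lr=1e-5):
    """Linear warmup, cosine decay, then a constant tail at min_lr.

    peak_lr and min_lr are placeholder values, not the ones used for GaMS-1B.
    """
    decay_end = max_steps - constant_steps
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr
        return peak_lr * step / warmup_steps
    if step >= decay_end:
        # Constant phase at the end of training (assumed placement)
        return min_lr
    # Cosine decay from peak_lr down to min_lr
    progress = (step - warmup_steps) / (decay_end - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(f"tokens per step: {TOKENS_PER_STEP:,}")
print(f"estimated steps: {ESTIMATED_STEPS:,}")
for step in (0, 5_000, 10_000, 30_000, ESTIMATED_STEPS - 1):
    print(step, lr_at_step(step))
```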

Evaluation

The models were evaluated on the Slovene SuperGLUE and SI-NLI tasks on SloBench. Additionally, the models were evaluated on an improved version of the Slovenian-LLM-eval benchmark introduced by Aleksa Gordić. All decoder-type models were evaluated using few-shot prompts and were not fine-tuned on the benchmarks (except for the versions with "finetuned" in the name).

SuperGLUE results

Model	SuperGLUE Average	BoolQ Accuracy	CB Accuracy	CB F1 Score	CB Average	COPA Accuracy	MultiRC EM	MultiRC F1a Score	MultiRC Average	RTE Accuracy	WSC Accuracy
OPT_GaMS-1B	0.4408	0.5667	0.5040	0.3885	0.4463	0.5020	0.0961	0.2543	0.1752	0.4138	0.5411
GaMS-1B	0.4604	0.5000	0.6200	0.4565	0.5382	0.4920	0.1351	0.2675	0.2013	0.4828	0.5479
OPT_GaMS-1B-Chat	0.4165	0.7000	0.3720	0.2961	0.3341	0.4600	0.1111	0.3448	0.2280	0.4138	0.3630
GaMS-1B-Chat	0.4570	0.8000	0.4880	0.3023	0.3951	0.4840	0.1081	0.2428	0.1755	0.5172	0.3699
OPT_GaMS-1B-Chat finetuned	0.5645	0.7000	0.8040	0.5884	0.6962	0.5860	0.1021	0.4808	0.2914	0.5862	0.5274
GaMS-1B-Chat finetuned	0.5806	0.7333	0.8120	0.5592	0.6856	0.5080	0.1381	0.4882	0.3132	0.5862	0.6575
SlovenianGPT-Chat*	0.5078	0.7333	0.3920	0.3829	0.3874	0.6840	0.2432	0.4944	0.3688	0.5172	0.3562
CroSloEngual BERT	0.6078	0.7333	0.7920	0.7437	0.7679	0.5720	0.0931	0.5241	0.3086	0.6552	0.6096

*SlovenianGPT-Chat was obtained by instruction-tuning Aleksa Gordić's SlovenianGPT on our instruction dataset.
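
The aggregate columns in the table appear to be plain averages: CB Average is the mean of CB Accuracy and CB F1 Score, MultiRC Average is the mean of MultiRC EM and MultiRC F1a Score, and SuperGLUE Average is the mean of the six per-task scores. This reading is inferred from the numbers rather than stated in the card; the sketch below checks it on the GaMS-1B row.

```python
from statistics import mean

# GaMS-1B row, copied from the SuperGLUE table above
gams_1b = {
    "BoolQ Accuracy": 0.5000,
    "CB Accuracy": 0.6200,
    "CB F1 Score": 0.4565,
    "COPA Accuracy": 0.4920,
    "MultiRC EM": 0.1351,
    "MultiRC F1a Score": 0.2675,
    "RTE Accuracy": 0.4828,
    "WSC Accuracy": 0.5479,
}

# Tasks with two metrics are first reduced to a single per-task score
cb_average = mean([gams_1b["CB Accuracy"], gams_1b["CB F1 Score"]])
multirc_average = mean([gams_1b["MultiRC EM"], gams_1b["MultiRC F1a Score"]])

# Overall score: unweighted mean over the six tasks
superglue_average = mean([
    gams_1b["BoolQ Accuracy"],
    cb_average,
    gams_1b["COPA Accuracy"],
    multirc_average,
    gams_1b["RTE Accuracy"],
    gams_1b["WSC Accuracy"],
])

print(f"CB Average:        {cb_average:.4f}")
print(f"MultiRC Average:   {multirc_average:.4f}")
print(f"SuperGLUE Average: {superglue_average:.4f}")
```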

SI-NLI results

Model	Accuracy	P(entailment)	R(entailment)	F1(entailment)	P(neutral)	R(neutral)	F1(neutral)	P(contradiction)	R(contradiction)	F1(contradiction)
OPT_GaMS-1B	0.3277	0.3407	0.6754	0.4529	0.3538	0.1402	0.2009	0.2632	0.1524	0.1931
GaMS-1B	0.3317	0.3418	0.4327	0.3819	0.3353	0.5122	0.4053	0.2344	0.0457	0.0765
OPT_GaMS-1B-Chat	0.3447	0.3515	0.6784	0.4631	0.3386	0.3293	0.3338	0.2105	0.0122	0.0231
GaMS-1B-Chat	0.3417	0.3405	0.9737	0.5045	0.2857	0.0061	0.0119	0.4615	0.0183	0.0352
OPT_GaMS-1B-Chat finetuned	0.7244	0.7065	0.8304	0.7634	0.7269	0.6006	0.6578	0.7446	0.7378	0.7412
GaMS-1B-Chat finetuned	0.7144	0.8037	0.6345	0.7092	0.7247	0.6341	0.6764	0.6531	0.8780	0.7490
SlovenianGPT-Chat*	0.4729	0.4399	0.7281	0.5485	0.3719	0.1372	0.2004	0.5723	0.5427	0.5571
GPT-3.5-Turbo finetuned	0.8567	0.8464	0.8538	0.8501	0.8041	0.8384	0.8209	0.9260	0.8780	0.9014
SloBERTa	0.7375	0.8127	0.7105	0.7582	0.6844	0.7470	0.7143	0.7273	0.7561	0.7414
CroSloEngual BERT	0.6623	0.7147	0.6667	0.6899	0.6072	0.6646	0.6346	0.6719	0.6555	0.6636

*SlovenianGPT-Chat was obtained by instruction-tuning Aleksa Gordić's SlovenianGPT on our instruction dataset.
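
The per-class F1 columns follow from the precision (P) and recall (R) columns as their harmonic mean, F1 = 2PR / (P + R). A minimal sketch that recomputes the F1 values of the GaMS-1B row from its P and R entries:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class for GaMS-1B, copied from the table above
gams_1b = {
    "entailment": (0.3418, 0.4327),
    "neutral": (0.3353, 0.5122),
    "contradiction": (0.2344, 0.0457),
}

f1_per_class = {label: f1_score(p, r) for label, (p, r) in gams_1b.items()}

for label, value in f1_per_class.items():
    print(f"F1({label}) = {value:.4f}")
```

Since SI-NLI has three classes, the accuracies of roughly 0.33 for the models that were not fine-tuned are close to what uniform guessing would give.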

Slovenian-LLM-eval results

Model	ARC-Challenge Accuracy	ARC-Easy Accuracy	BoolQ Accuracy	HellaSwag Accuracy	NQ-Open EM	OpenBookQA Accuracy	PIQA Accuracy	WinoGrande Accuracy
OPT_GaMS-1B	0.2227 ± 0.0122	0.436 ± 0.0102	0.378 ± 0.0085	0.3394 ± 0.0047	0.0003 ± 0.0003	0.214 ± 0.0184	0.6083 ± 0.0114	0.5533 ± 0.014
GaMS-1B	0.2329 ± 0.0124	0.4743 ± 0.0102	0.3813 ± 0.0085	0.3555 ± 0.0048	0.0036 ± 0.001	0.22 ± 0.0185	0.624 ± 0.0113	0.532 ± 0.014
OPT_GaMS-1B-Chat	0.2355 ± 0.0124	0.3960 ± 0.0100	0.4398 ± 0.0087	0.3459 ± 0.0047	0.0011 ± 0.0006	0.20 ± 0.0179	0.5778 ± 0.0115	0.5359 ± 0.014
GaMS-1B-Chat	0.2517 ± 0.0127	0.4394 ± 0.0102	0.4502 ± 0.0087	0.3634 ± 0.0048	0 ± 0	0.196 ± 0.0178	0.6115 ± 0.0114	0.5572 ± 0.014
YugoGPT	0.2961 ± 0.0133	0.4781 ± 0.0102	0.3783 ± 0.0085	0.3890 ± 0.0047	0.0385 ± 0.0032	0.226 ± 0.0187	0.5816 ± 0.0115	0.5588 ± 0.014
SlovenianGPT	0.3805 ± 0.0142	0.6498 ± 0.0098	0.4523 ± 0.0087	0.4935 ± 0.0050	0.0432 ± 0.0034	0.27 ± 0.0199	0.6937 ± 0.0108	0.644 ± 0.0135
SlovenianGPT-Chat*	0.3567 ± 0.014	0.5901 ± 0.0101	0.4706 ± 0.0087	0.4719 ± 0.0050	0.0003 ± 0.0003	0.27 ± 0.0199	0.6861 ± 0.0108	0.6425 ± 0.0135

*SlovenianGPT-Chat was obtained by instruction-tuning Aleksa Gordić's SlovenianGPT on our instruction dataset.

@inproceedings{GaMS,
 author = {Vre{\v s}, Domen and Bo{\v z}i{\v c}, Martin and Poto{\v c}nik, Alja{\v z} and Martin{\v c}i{\v c}, Toma{\v z} and Robnik-{\v S}ikonja, Marko},
 booktitle = {Language Technologies and Digital Humanities Conference},
 title = {{Generative Model for Less-Resourced Language with 1 billion parameters}},
 url = {https://www.sdjt.si/wp/wp-content/uploads/2024/09/JT-DH-2024_Vres_Bozic_Potocnik_Martincic_Robnik.pdf},
 year = {2024}
}