Dhivehi Nougat Base (image-to-text)

This model is a fine-tuned version of facebook/nougat-base on a Dhivehi text-image dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0142

Model description

Fine-tuned for Dhivehi image-to-text (OCR) on a Dhivehi text-image dataset, config all.

Usage

from PIL import Image
import torch
from transformers import NougatProcessor, VisionEncoderDecoderModel

# Load the model and processor
processor = NougatProcessor.from_pretrained("alakxender/dhivehi-nougat-base")
model = VisionEncoderDecoderModel.from_pretrained(
    "alakxender/dhivehi-nougat-base",  
    torch_dtype=torch.bfloat16,                 # Optional: Load the model with BF16 data type for faster inference and lower memory usage
    attn_implementation={                       # Optional: Specify the attention kernel implementations for different parts of the model
        "decoder": "flash_attention_2",         # Use FlashAttention-2 for the decoder for improved performance
        "encoder": "eager"                      # Use the default ("eager") attention implementation for the encoder
    }
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

context_length = 128  # Maximum number of new tokens to generate per image

def predict(img_path):
    # Ensure image is in RGB format
    image = Image.open(img_path).convert("RGB")  
    pixel_values = processor(image, return_tensors="pt").pixel_values.to(torch.bfloat16)  # Match the dtype the model was loaded with

    # Generate the prediction
    outputs = model.generate(
        pixel_values.to(device),
        min_length=1,
        max_new_tokens=context_length,
        repetition_penalty=1.5,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        eos_token_id=processor.tokenizer.eos_token_id,
    )

    page_sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return page_sequence

print(predict("DV01-04_31.jpg"))
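
To run the model over a folder of page images, the predict function can simply be mapped across the files. A minimal sketch; the pages directory and the *.jpg pattern are assumptions about your data layout:

from pathlib import Path

# Hypothetical directory of page scans; adjust the path and glob to your data
for img_path in sorted(Path("pages").glob("*.jpg")):
    print(img_path.name, predict(str(img_path)))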

Training procedure

Training hyperparameters

The following hyperparameters were used during training (see the training-arguments sketch after this list):

  • learning_rate: 0.0001
  • train_batch_size: 3
  • eval_batch_size: 3
  • seed: 42
  • gradient_accumulation_steps: 6
  • total_train_batch_size: 18
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
  • lr_scheduler_type: linear
  • num_epochs: 100
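
For reference, the list above maps onto Hugging Face Seq2SeqTrainingArguments roughly as sketched below. This is a minimal reconstruction, not the original training script; output_dir and the bf16 flag are assumptions:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="dhivehi-nougat-base",   # Hypothetical output path
    learning_rate=1e-4,
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,
    seed=42,
    gradient_accumulation_steps=6,      # 3 per device x 6 accumulation steps = total train batch size 18
    lr_scheduler_type="linear",
    num_train_epochs=100,
    optim="adamw_torch",                # betas=(0.9, 0.999) and epsilon=1e-08 are the defaults
    bf16=True,                          # Assumption, inferred from the BF16 tensors in the checkpoint
)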

Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 6.4404 | 0.0057 | 100 | 1.0417 |
| 5.7761 | 0.0114 | 200 | 0.9055 |
| 5.1723 | 0.0171 | 300 | 0.8193 |
| 4.8315 | 0.0228 | 400 | 0.7661 |
| 4.4217 | 0.0285 | 500 | 0.7232 |
| 3.9861 | 0.0342 | 600 | 0.6724 |
| 3.7268 | 0.0400 | 700 | 0.5966 |
| 3.5393 | 0.0457 | 800 | 0.5337 |
| 2.8666 | 0.0514 | 900 | 0.4108 |
| 2.0269 | 0.0571 | 1000 | 0.2803 |
| 1.4121 | 0.0628 | 1100 | 0.1904 |
| 1.0161 | 0.0685 | 1200 | 0.1351 |
| 0.867 | 0.0742 | 1300 | 0.1130 |
| 0.7506 | 0.0799 | 1400 | 0.0950 |
| 0.5764 | 0.0856 | 1500 | 0.0801 |
| 0.5123 | 0.0913 | 1600 | 0.0716 |
| 0.558 | 0.0970 | 1700 | 0.0650 |
| 0.5242 | 0.1027 | 1800 | 0.0616 |
| 0.4229 | 0.1084 | 1900 | 0.0556 |
| 0.3721 | 0.1142 | 2000 | 0.0545 |
| 0.3388 | 0.1199 | 2100 | 0.0519 |
| 0.4042 | 0.1256 | 2200 | 0.0499 |
| 0.3593 | 0.1313 | 2300 | 0.0449 |
| 0.3837 | 0.1370 | 2400 | 0.0421 |
| 0.3291 | 0.1427 | 2500 | 0.0407 |
| 0.3092 | 0.1484 | 2600 | 0.0388 |
| 0.2762 | 0.1541 | 2700 | 0.0380 |
| 0.3073 | 0.1598 | 2800 | 0.0422 |
| 0.2577 | 0.1655 | 2900 | 0.0340 |
| 0.2596 | 0.1712 | 3000 | 0.0331 |
| 0.3397 | 0.1769 | 3100 | 0.0328 |
| 0.3019 | 0.1826 | 3200 | 0.0307 |
| 0.2522 | 0.1884 | 3300 | 0.0314 |
| 0.2546 | 0.1941 | 3400 | 0.0289 |
| 0.1972 | 0.1998 | 3500 | 0.0282 |
| 0.2231 | 0.2055 | 3600 | 0.0300 |
| 0.2342 | 0.2112 | 3700 | 0.0278 |
| 0.2152 | 0.2169 | 3800 | 0.0276 |
| 0.2059 | 0.2226 | 3900 | 0.0260 |
| 0.2165 | 0.2283 | 4000 | 0.0257 |
| 0.1919 | 0.2340 | 4100 | 0.0253 |
| 0.1608 | 0.2397 | 4200 | 0.0244 |
| 0.1673 | 0.2454 | 4300 | 0.0242 |
| 0.2004 | 0.2511 | 4400 | 0.0248 |
| 0.2277 | 0.2568 | 4500 | 0.0230 |
| 0.1831 | 0.2625 | 4600 | 0.0228 |
| 0.1905 | 0.2683 | 4700 | 0.0221 |
| 0.0996 | 0.2740 | 4800 | 0.0215 |
| 0.1596 | 0.2797 | 4900 | 0.0213 |
| 0.168 | 0.2854 | 5000 | 0.0208 |
| 0.2119 | 0.2911 | 5100 | 0.0215 |
| 0.1436 | 0.2968 | 5200 | 0.0202 |
| 0.1656 | 0.3025 | 5300 | 0.0202 |
| 0.1183 | 0.3082 | 5400 | 0.0194 |
| 0.1397 | 0.3139 | 5500 | 0.0202 |
| 0.1248 | 0.3196 | 5600 | 0.0191 |
| 0.1202 | 0.3253 | 5700 | 0.0191 |
| 0.1175 | 0.3310 | 5800 | 0.0207 |
| 0.1427 | 0.3367 | 5900 | 0.0183 |
| 0.1487 | 0.3425 | 6000 | 0.0178 |
| 0.1597 | 0.3482 | 6100 | 0.0174 |
| 0.1363 | 0.3539 | 6200 | 0.0172 |
| 0.1266 | 0.3596 | 6300 | 0.0171 |
| 0.1288 | 0.3653 | 6400 | 0.0170 |
| 0.1202 | 0.3710 | 6500 | 0.0170 |
| 0.1174 | 0.3767 | 6600 | 0.0164 |
| 0.1334 | 0.3824 | 6700 | 0.0168 |
| 0.1627 | 0.3881 | 6800 | 0.0164 |
| 0.0982 | 0.3938 | 6900 | 0.0161 |
| 0.1038 | 0.3995 | 7000 | 0.0160 |
| 0.1523 | 0.4052 | 7100 | 0.0160 |
| 0.1337 | 0.4109 | 7200 | 0.0157 |
| 0.2063 | 0.4167 | 7300 | 0.0153 |
| 0.1476 | 0.4224 | 7400 | 0.0156 |
| 0.0838 | 0.4281 | 7500 | 0.0150 |
| 0.082 | 0.4338 | 7600 | 0.0158 |
| 0.1269 | 0.4395 | 7700 | 0.0159 |
| 0.1168 | 0.4452 | 7800 | 0.0147 |
| 0.1024 | 0.4509 | 7900 | 0.0147 |
| 0.1138 | 0.4566 | 8000 | 0.0145 |
| 0.1188 | 0.4623 | 8100 | 0.0146 |
| 0.0881 | 0.4680 | 8200 | 0.0142 |
| 0.0752 | 0.4737 | 8300 | 0.0138 |
| 0.1165 | 0.4794 | 8400 | 0.0141 |
| 0.1017 | 0.4851 | 8500 | 0.0137 |
| 0.0971 | 0.4909 | 8600 | 0.0135 |
| 0.135 | 0.4966 | 8700 | 0.0136 |
| 0.0732 | 0.5023 | 8800 | 0.0137 |
| 0.1217 | 0.5080 | 8900 | 0.0142 |

Framework versions

  • Transformers 4.47.0
  • Pytorch 2.6.0+cu124
  • Datasets 3.2.0
  • Tokenizers 0.21.0
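
To reproduce this environment, pinning those versions should suffice (standard PyPI package names; the +cu124 PyTorch build may require the matching CUDA wheel index):

pip install transformers==4.47.0 torch==2.6.0 datasets==3.2.0 tokenizers==0.21.0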