---
language:
- en
metrics:
- accuracy
library_name: transformers
tags:
- donut
- kyc
---

# Model description

Donut is an end-to-end (i.e., self-contained) visual document understanding (VDU) model for the general understanding of document images. The architecture of Donut is quite simple: it consists of a Transformer-based visual encoder and a textual decoder. Donut does not rely on any OCR-related modules; instead, the visual encoder extracts features from a given document image, and the textual decoder maps those features into a sequence of subword tokens that forms the desired structured output (e.g., JSON). Each component is Transformer-based, so the model is easily trained in an end-to-end manner.

![image.png](https://cdn-uploads.huggingface.co/production/uploads/637eccd46df7e8f7df76a3ae/OSQp25332524epV2PimZb.png)
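To make the encoder/decoder split concrete, the short sketch below loads this checkpoint and inspects its two sub-modules. It assumes the checkpoint follows the standard Donut layout in Transformers (a Swin-based visual encoder and a BART-style textual decoder); the class names in the comments are expectations, not guaranteed output.

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel

# The processor bundles image preprocessing and the subword tokenizer;
# the model itself is a generic vision encoder-decoder.
processor = DonutProcessor.from_pretrained("sourinkarmakar/kyc_v1-donut-demo")
model = VisionEncoderDecoderModel.from_pretrained("sourinkarmakar/kyc_v1-donut-demo")

# The visual encoder and textual decoder are separate Transformer sub-modules.
print(type(model.encoder).__name__)  # expected: a DonutSwin-style image encoder
print(type(model.decoder).__name__)  # expected: an MBart-style causal decoder
print(model.decoder.config.max_position_embeddings)  # upper bound on the generated sequence length
```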
# Intended uses and limitations

This model is trained to read the contents of Indian KYC documents. It can classify and extract the contents of Aadhar, PAN and Voter ID cards, and it can also detect the document's orientation and whether it is coloured or black and white. The input document can be oriented in any direction, but it should be a fair-quality image so that its contents are readable. The model has been trained on limited data, so performance may not be very good. Future versions will be trained on more images and may cover additional types of KYC documents.

# Training data

For v1, a custom dataset of around 283 images was used: 199 for training, 42 for validation and 42 for testing. The 199 training images consisted of 57 Aadhar samples, 57 PAN samples and 85 Voter ID samples.

# Performance

The current performance is as follows:

- Overall accuracy: 74%
- Aadhar: 49% (the reason for the lower accuracy still needs to be investigated)
- PAN: 94%
- Voter ID: 76%

# Inference

```python
import glob
import os
import re
import json

import cv2
import numpy as np
import torch
from tqdm.auto import tqdm

from transformers import DonutProcessor, VisionEncoderDecoderModel

# Requires donut-python (provides JSONParseEvaluator for scoring predictions):
# !pip install -q donut-python
from donut import JSONParseEvaluator

processor = DonutProcessor.from_pretrained("sourinkarmakar/kyc_v1-donut-demo")
model = VisionEncoderDecoderModel.from_pretrained("sourinkarmakar/kyc_v1-donut-demo")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Images stored inside a folder 'unseen_samples'
basepath = "."  # adjust to the directory that contains 'unseen_samples'
dataset = glob.glob(os.path.join(basepath, "unseen_samples/*"))

output_list = []

for idx, sample in tqdm(enumerate(dataset), total=len(dataset)):
    # prepare encoder inputs
    img = cv2.imread(sample)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    pixel_values = processor(img, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)

    # prepare decoder inputs (set this to the task start token used during fine-tuning, if any)
    task_prompt = ""
    decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
    decoder_input_ids = decoder_input_ids.to(device)

    # autoregressively generate the output token sequence
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )

    # turn the generated sequence into JSON
    seq = processor.batch_decode(outputs.sequences)[0]
    seq = seq.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    seq = re.sub(r"<.*?>", "", seq, count=1).strip()  # remove first task start token
    seq = processor.token2json(seq)
    output_list.append(seq)

print(output_list)
```
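The inference snippet imports `JSONParseEvaluator` from donut-python but does not use it; it is the utility typically used to score Donut predictions against ground-truth JSON. Below is a minimal, self-contained sketch of how a single prediction could be scored; the example field names and values are hypothetical and do not reflect the model's actual output schema.

```python
from donut import JSONParseEvaluator

evaluator = JSONParseEvaluator()

# Hypothetical prediction / ground-truth pair; the keys are illustrative only.
prediction = {"doc_type": "pan", "name": "JOHN DOE", "pan_number": "ABCDE1234F"}
ground_truth = {"doc_type": "pan", "name": "JOHN DOE", "pan_number": "ABCDE1234F", "dob": "01/01/1990"}

# Tree-edit-distance based accuracy between the two JSON trees (1.0 = exact match).
score = evaluator.cal_acc(prediction, ground_truth)
print(score)
```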