---
language:
- en
metrics:
- accuracy
library_name: transformers
tags:
- donut
- kyc
---

# Model description

Donut is an end-to-end (i.e., self-contained) visual document understanding (VDU) model for the general understanding of document images. The architecture of Donut is quite simple: it consists of a Transformer-based visual encoder and a textual decoder. Donut does not rely on any OCR-related modules; instead, the visual encoder extracts features from a given document image, and the textual decoder maps those features into a sequence of subword tokens that forms the desired structured output (e.g., JSON). Each component is Transformer-based, so the model is easily trained in an end-to-end manner.

![image.png](https://cdn-uploads.huggingface.co/production/uploads/637eccd46df7e8f7df76a3ae/OSQp25332524epV2PimZb.png)
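To make the encoder/decoder split concrete, the short sketch below loads this checkpoint and inspects its two sub-modules. It assumes the checkpoint follows the standard Donut layout in Transformers (a Swin-based visual encoder and a BART-style textual decoder); the class names in the comments are expectations, not guaranteed output.

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel

# The processor bundles image preprocessing and the subword tokenizer;
# the model itself is a generic vision encoder-decoder.
processor = DonutProcessor.from_pretrained("sourinkarmakar/kyc_v1-donut-demo")
model = VisionEncoderDecoderModel.from_pretrained("sourinkarmakar/kyc_v1-donut-demo")

# The visual encoder and textual decoder are separate Transformer sub-modules.
print(type(model.encoder).__name__)  # expected: a DonutSwin-style image encoder
print(type(model.decoder).__name__)  # expected: an MBart-style causal decoder
print(model.decoder.config.max_position_embeddings)  # upper bound on the generated sequence length
```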
# Intended uses and limitations

This model is trained to read the contents of Indian KYC documents. It can classify and extract the contents of Aadhar, PAN and Voter ID cards, and it can also detect the document's orientation and whether it is coloured or black and white. The input document can be oriented in any direction, but it should be a fair-quality image so that its contents are readable. The model has been trained on limited data, so performance may not be very good. Future versions will be trained on more images and may cover additional types of KYC documents.

# Training data

For v1, a custom dataset of around 283 images was used: 199 for training, 42 for validation and 42 for testing. The 199 training images consisted of 57 Aadhar samples, 57 PAN samples and 85 Voter ID samples.

# Performance

The current performance is as follows:

- Overall accuracy: 74%
- Aadhar: 49% (the reason for the lower accuracy still needs to be investigated)
- PAN: 94%
- Voter ID: 76%

# Inference

```python
import glob
import os
import re
import json

import cv2
import numpy as np
import torch
from tqdm.auto import tqdm

from transformers import DonutProcessor, VisionEncoderDecoderModel

# Requires donut-python (provides JSONParseEvaluator for scoring predictions):
# !pip install -q donut-python
from donut import JSONParseEvaluator

processor = DonutProcessor.from_pretrained("sourinkarmakar/kyc_v1-donut-demo")
model = VisionEncoderDecoderModel.from_pretrained("sourinkarmakar/kyc_v1-donut-demo")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Images stored inside a folder 'unseen_samples'
basepath = "."  # adjust to the directory that contains 'unseen_samples'
dataset = glob.glob(os.path.join(basepath, "unseen_samples/*"))

output_list = []

for idx, sample in tqdm(enumerate(dataset), total=len(dataset)):
    # prepare encoder inputs
    img = cv2.imread(sample)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    pixel_values = processor(img, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)

    # prepare decoder inputs (set this to the task start token used during fine-tuning, if any)
    task_prompt = ""
    decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
    decoder_input_ids = decoder_input_ids.to(device)

    # autoregressively generate the output token sequence
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )

    # turn the generated sequence into JSON
    seq = processor.batch_decode(outputs.sequences)[0]
    seq = seq.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    seq = re.sub(r"<.*?>", "", seq, count=1).strip()  # remove first task start token
    seq = processor.token2json(seq)
    output_list.append(seq)

print(output_list)
```
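The inference snippet imports `JSONParseEvaluator` from donut-python but does not use it; it is the utility typically used to score Donut predictions against ground-truth JSON. Below is a minimal, self-contained sketch of how a single prediction could be scored; the example field names and values are hypothetical and do not reflect the model's actual output schema.

```python
from donut import JSONParseEvaluator

evaluator = JSONParseEvaluator()

# Hypothetical prediction / ground-truth pair; the keys are illustrative only.
prediction = {"doc_type": "pan", "name": "JOHN DOE", "pan_number": "ABCDE1234F"}
ground_truth = {"doc_type": "pan", "name": "JOHN DOE", "pan_number": "ABCDE1234F", "dob": "01/01/1990"}

# Tree-edit-distance based accuracy between the two JSON trees (1.0 = exact match).
score = evaluator.cal_acc(prediction, ground_truth)
print(score)
```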