---
license: apache-2.0
language:
- en
pipeline_tag: image-to-text
---
|
# Model Card for the Fine-Tuned Donut Model
|
|
|
<!-- Provide a quick summary of what the model does. --> |
|
|
|
This model card provides details about a Donut model fine-tuned for document question answering (DocQA) on a synthetically generated dataset.
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
Donut (Document Understanding Transformer) is an OCR-free model for visual document understanding. This variant has been fine-tuned to answer questions about tax forms, specifically 1099-DIV, 1099-INT, W-2, and W-3, and was trained on a synthetically generated dataset to achieve high accuracy in identifying and extracting information from these forms.
|
|
|
- **Developed by:** CALM.ai
- **Model type:** Document Question Answering (vision encoder-decoder)
- **Language(s) (NLP):** English
- **License:** Apache-2.0
- **Finetuned from model:** [naver-clova-ix/donut-base](https://huggingface.co/naver-clova-ix/donut-base)
|
|
|
![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/donut_architecture.jpg) |
|
|
|
### Model Sources |
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
- **Repository:** [naver-clova-ix/donut-base](https://huggingface.co/naver-clova-ix/donut-base)
- **Paper:** [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664)
- **Demo:** [More Information Needed]
|
|
|
## Uses |
|
|
|
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
|
|
|
### Direct Use |
|
|
|
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. --> |
|
|
|
The model can be used directly to query tax forms and extract their fields. The extracted information can then be passed to an LLM such as Llama 3, which lets users interrogate the forms in natural language and perform simple arithmetic on numeric fields.
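
As a minimal sketch of that workflow (the card does not pin down the exact LLM integration; the model ID and field names below are assumptions), the extracted fields can simply be embedded in a prompt for an instruction-tuned chat model:

```python
from transformers import pipeline

# Hypothetical fields, in the shape process_document() below returns for a W-2.
fields = {"wages": "52000.00", "federal_income_tax_withheld": "6400.00"}

# Llama 3 is gated: accept its license and authenticate with Hugging Face first.
chat = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

prompt = (
    f"These fields were extracted from a W-2 tax form: {fields}\n"
    "What percentage of the wages was withheld as federal income tax?"
)
print(chat(prompt, max_new_tokens=64)[0]["generated_text"])
```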
|
|
|
### General Purpose Use |
|
|
|
The model can also be used as a general-purpose document question answering system. It can parse various types of documents, such as textbooks, magazines, articles, and technical papers, providing users with relevant information and insights. |
|
|
|
### Downstream Use |
|
|
|
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app --> |
|
|
|
The model can be further fine-tuned for specific use cases or integrated into larger document-processing pipelines. It can also classify uploaded documents as one of the supported form types (1099-DIV, 1099-INT, W-2, W-3) or as non-form documents, which enables the general-purpose parsing described above; a hypothetical routing step is sketched below.
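
Such routing might key off the form type reported in the parsed output. The `form_type` key below is an assumption for illustration, not the model's documented schema:

```python
FORM_CLASSES = {"1099-div", "1099-int", "w2", "w3"}

def classify_document(parsed: dict) -> str:
    """Map a parsed document to one of the supported form classes, or 'non-form'."""
    form_type = str(parsed.get("form_type", "")).lower()  # hypothetical key
    return form_type if form_type in FORM_CLASSES else "non-form"
```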
|
|
|
### Out-of-Scope Use |
|
|
|
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. --> |
|
|
|
The fine-tuned model is specialized for the tax forms listed above; it is not suitable for unrelated document types and may perform poorly on handwritten or poorly scanned forms.
|
|
|
## Bias, Risks, and Limitations |
|
|
|
<!-- This section is meant to convey both technical and sociotechnical limitations. --> |
|
|
|
Because the training data is entirely synthetic, the model may inherit artifacts of the generation process and may not generalize well to real-world forms. It may also struggle with handwritten or poorly scanned documents.
|
|
|
## How to Get Started with the Model |
|
|
|
To get started with the model, you can use the following code: |
|
|
|
### Installing Required Libraries
|
|
|
```bash
pip install -q transformers datasets torch sentencepiece
```
|
|
|
### Loading the Dataset |
|
```python
from datasets import load_dataset

# Loading this dataset requires Hugging Face authentication
# (run `huggingface-cli login` first).
dataset = load_dataset("calm-ai/Multiple_financial_forms", split="test", use_auth_token=True)
```
|
|
|
### Loading the Model |
|
```python
import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("calm-ai/donut-base-finetuned-forms-v1")
model = VisionEncoderDecoderModel.from_pretrained("calm-ai/donut-base-finetuned-forms-v1")

# Move the model to a GPU if one is available; `device` is reused for inference below.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```
|
### Running Inference
|
```python
import re

# `processor`, `model`, `device`, and `dataset` come from the snippets above.

def process_document(image):
    # Prepare encoder inputs: convert the document image to pixel values.
    pixel_values = processor(image, return_tensors="pt").pixel_values

    # Prepare decoder inputs: Donut is conditioned on a task start prompt.
    task_prompt = "<s>"
    decoder_input_ids = processor.tokenizer(
        task_prompt, add_special_tokens=False, return_tensors="pt"
    ).input_ids

    # Generate the output sequence autoregressively.
    outputs = model.generate(
        pixel_values.to(device),
        decoder_input_ids=decoder_input_ids.to(device),
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )

    # Postprocess: strip special tokens, then convert the sequence to JSON.
    sequence = processor.batch_decode(outputs.sequences)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
        processor.tokenizer.pad_token, ""
    )
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove the task start token

    return processor.token2json(sequence)

# You can change the index (0-99) and inspect the parsed information.
image = dataset[20]["image"]
print(process_document(image))
```
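
The value returned by `process_document` is a plain Python `dict` produced by `processor.token2json`, so it can be serialized with `json.dumps` or passed straight to downstream code such as the classification and LLM sketches above.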
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. --> |
|
|
|
The model was trained on a synthetically generated dataset of 4,000 tax forms (1099-DIV, 1099-INT, W-2, W-3) whose field values were populated using the Faker library.
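
As an illustration of how such data can be produced (the actual generation script and field schema are not published with this card), Faker can populate hypothetical form fields like so:

```python
from faker import Faker

fake = Faker("en_US")

# Hypothetical W-2 record; the real dataset's field names and value formats
# are assumptions here.
record = {
    "employee_name": fake.name(),
    "employee_ssn": fake.ssn(),
    "employer_name": fake.company(),
    "employer_address": fake.address().replace("\n", ", "),
    "wages": f"{fake.pyfloat(min_value=20000, max_value=150000, right_digits=2):.2f}",
}
print(record)
```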
|
|
|
### Training Procedure |
|
|
|
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. --> |
|
|
|
#### Preprocessing
|
|
|
The forms were preprocessed to extract the text and annotation information needed for training.
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
- **Training regime:** fine-tuning of the Donut base model
- **Optimizer:** Adam
- **Learning rate:** 5e-5
- **Batch size:** 8
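
The card does not state which training framework was used; as a sketch, the reported hyperparameters map onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="donut-base-finetuned-forms-v1",
    num_train_epochs=3,              # from "Speeds, Sizes, Time" below
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    optim="adamw_torch",             # the card reports "Adam"
    predict_with_generate=True,      # generate sequences during evaluation
)
```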
|
#### Speeds, Sizes, Time |
|
|
|
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. --> |
|
|
|
- **Training duration:** 3 epochs
- **Inference speed:** approximately 6 seconds per document
|
|
|
## Evaluation |
|
|
|
<!-- This section describes the evaluation protocols and provides the results. --> |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
<!-- This should link to a Dataset Card if possible. --> |
|
|
|
The model was evaluated on a separate set of tax forms not seen during training. |
|
|
|
#### Factors |
|
|
|
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. --> |
|
|
|
The evaluation was disaggregated by form type (1099-DIV, 1099-INT, W-2, W-3).
|
|
|
#### Metrics |
|
|
|
<!-- These are the evaluation metrics being used, ideally with a description of why. --> |
|
|
|
**Validation edit distance:** 0.0434
|
|
|
Edit distance measures the similarity between two strings as the minimum number of single-character operations (insertions, deletions, substitutions) needed to transform one into the other. In document parsing and generation, it quantifies how closely the generated output matches the ground truth.
|
|
|
Validation edit distance is a suitable metric here for several reasons (a worked example follows the list):

- **Quantifies accuracy:** it provides a quantitative measure of how similar the generated JSON output is to the ground truth; a lower edit distance indicates higher accuracy.
- **Handles variability:** it is robust to variations in the generated output that may still be correct; minor differences in formatting or word choice yield only a small edit distance.
- **Easy interpretation:** smaller values straightforwardly indicate higher similarity between the generated and ground-truth outputs.
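
For intuition, here is a minimal pure-Python Levenshtein distance, normalized by the ground-truth length (the exact normalization behind the reported 0.0434 is an assumption):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]

truth = '{"wages": "52000.00"}'
pred = '{"wages": "52000.0"}'   # one dropped character
print(edit_distance(pred, truth) / len(truth))  # ~0.048, i.e. ~5% of characters differ
```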
|
|
|
### Results |
|
|
|
**Accuracy:** 97%
|
|
|
#### Summary |
|
Our fine-tuned Donut model extracts information from tax forms such as 1099-DIV, 1099-INT, W-2, and W-3 with 97% accuracy; to our knowledge, it is the only open-source model targeting these forms.
|
|
|
|
|
## Technical Specifications
|
|
|
### Compute Infrastructure |
|
|
|
- **GPU memory:** 4 GB minimum
- **System RAM:** 8 GB minimum
|
|
|
|
|
## Model Card Authors |
|
|
|
- Abhishek A
- Chandan V K
- Likhith V
- Monish M
|
|
|
## Model Card Contact |