---
datasets:
- pierreguillou/DocLayNet-base
metrics:
- accuracy
base_model:
- google/vit-base-patch16-224-in21k
library_name: transformers
tags:
- vision
- document-layout-analysis
- document-classification
- vit
- doclaynet
---

# ViT Model for Document Layout Classification

This model is a Vision Transformer (ViT) fine-tuned for document layout classification on the DocLayNet dataset.

## Model description

The model is built on the `google/vit-base-patch16-224-in21k` Vision Transformer architecture and fine-tuned for document layout classification. The base ViT model splits each 224x224 input image into 16x16-pixel patches and was pre-trained on ImageNet-21k. After fine-tuning, the model classifies document page images into the layout categories of the DocLayNet dataset.

## Training data

The model was trained on the DocLayNet-base dataset, available on the Hugging Face Hub: [pierreguillou/DocLayNet-base](https://huggingface.co/datasets/pierreguillou/DocLayNet-base)

DocLayNet is a comprehensive dataset for document layout analysis, covering diverse document types (including financial reports, manuals, scientific articles, laws, patents, and tenders) with corresponding layout annotations.

## Training procedure

Training used the following hyperparameters (a sketch of how they might map onto `transformers.TrainingArguments` is given at the end of this card):

```python
{
    'batch_size': 64,
    'num_epochs': 20,
    'learning_rate': 1e-4,
    'weight_decay': 0.05,
    'warmup_ratio': 0.2,
    'gradient_clip': 0.1,
    'dropout_rate': 0.1,
    'label_smoothing': 0.1,
    'optimizer': 'AdamW'
}
```

## Evaluation results

The model achieved the following performance on the test set:

- Test loss: 0.8622
- Test accuracy: 81.36%
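
## How to use

A minimal inference sketch using the standard `transformers` image-classification API. The model id below is a placeholder (replace it with this repository's id), and the label names are read from the model config rather than assumed here:

```python
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# Placeholder id: substitute the actual repository id of this model.
model_id = "your-username/vit-doclaynet-base"

processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)
model.eval()

# A rendered document page; any RGB image of a page works.
image = Image.open("page.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")  # resizes/normalizes to 224x224

with torch.no_grad():
    logits = model(**inputs).logits

predicted = logits.argmax(-1).item()
print(model.config.id2label[predicted])  # predicted layout class name
```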
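
## Reproducing the training setup

The exact training script is not included in this card, so the following is an assumption: a sketch of how the hyperparameters above could be expressed with `transformers.TrainingArguments`. Note that `dropout_rate` is not a `TrainingArguments` field; it would be set on the model config instead (e.g. `hidden_dropout_prob` for ViT):

```python
from transformers import TrainingArguments

# Hypothetical mapping of the reported hyperparameters; the left-hand names are
# TrainingArguments fields, the comments show the card's hyperparameter names.
training_args = TrainingArguments(
    output_dir="vit-doclaynet-base",  # hypothetical output directory
    per_device_train_batch_size=64,   # 'batch_size'
    num_train_epochs=20,              # 'num_epochs'
    learning_rate=1e-4,               # 'learning_rate'
    weight_decay=0.05,                # 'weight_decay'
    warmup_ratio=0.2,                 # 'warmup_ratio'
    max_grad_norm=0.1,                # 'gradient_clip'
    label_smoothing_factor=0.1,       # 'label_smoothing'
    optim="adamw_torch",              # 'optimizer': AdamW
)
```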