---
license: mit
---

# [Unifying Vision, Text, and Layout for Universal Document Processing (CVPR 2023 Highlight)](https://arxiv.org/pdf/2212.02623)

[Zineng Tang](https://zinengtang.github.io/), [Ziyi Yang](https://ziyi-yang.github.io/), [Guoxin Wang](https://www.guoxwang.com/), [Yuwei Fang](https://www.microsoft.com/en-us/research/people/yuwfan/), [Yang Liu](https://nlp-yang.github.io/), [Chenguang Zhu](https://cs.stanford.edu/people/cgzhu/), [Michael Zeng](https://www.microsoft.com/en-us/research/people/nzeng/), [Cha Zhang](https://www.microsoft.com/en-us/research/people/chazhang/), [Mohit Bansal](https://www.cs.unc.edu/~mbansal/)

Open Source Checklist:

- [x] Release Model (Encoder + Text decoder)
- [x] Release Most Scripts
- [ ] Vision Decoder / Weights (Due to ethical considerations around fake document generation, we plan to release this functionality as an Azure API)
- [x] Demo

## Introduction

UDOP unifies vision, text, and layout through a Vision-Text-Layout Transformer and unified generative pretraining tasks, including vision, text, layout, and mixed tasks. We show the task prompts (left) and task targets (right) for all self-supervised objectives (joint text-layout reconstruction, visual text recognition, layout modeling, and masked autoencoding) and two example supervised objectives (question answering and layout analysis).

## Install

### Setup `python` environment

```
conda create -n UDOP python=3.8   # You can also use another environment.
```

### Install other dependencies

```
pip install -r requirements.txt
```

## Run Scripts

Switch the model type with one of the following flags:

- `--model_type "UdopDual"`
- `--model_type "UdopUnimodel"`

### Finetuning on RVLCDIP

Download RVL-CDIP first and update the dataset path accordingly.

For OCR, you may need to customize the code.

```
bash scripts/finetune_rvlcdip.sh   # Finetuning on RVLCDIP
```

### Finetuning on DUE Benchmark

Download [Duebenchmark](https://github.com/due-benchmark/baselines) and follow its procedure to preprocess the data.
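
As a minimal sketch of the download step, assuming `git` is installed, the repository linked above can be cloned as shown here. The local directory name `due-baselines` is a hypothetical choice, and the dataset download and preprocessing commands are documented in that repository rather than reproduced here.

```shell
# Clone the DUE benchmark baselines repository linked above.
# "due-baselines" is an arbitrary local directory name (an assumption, not from the UDOP docs).
git clone https://github.com/due-benchmark/baselines due-baselines

# Next, follow the README inside due-baselines to download and preprocess the datasets.
```
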

The training code adapted to our framework is hosted under `benchmarker` and can be run with:

```
bash scripts/finetune_duebenchmark.sh   # Finetuning on DUE Benchmark; switch tasks by changing the dataset path
```

The generated outputs can be evaluated with [Duebenchmark due_evaluator](https://github.com/due-benchmark/evaluator).

### Model Checkpoints

The model checkpoints are hosted on the [Hugging Face Hub](https://huggingface.co/ZinengTang/Udop).

## Citation

```
@article{tang2022unifying,
  title={Unifying Vision, Text, and Layout for Universal Document Processing},
  author={Tang, Zineng and Yang, Ziyi and Wang, Guoxin and Fang, Yuwei and Liu, Yang and Zhu, Chenguang and Zeng, Michael and Zhang, Cha and Bansal, Mohit},
  journal={arXiv preprint arXiv:2212.02623},
  year={2022}
}
```

## Contact

Zineng Tang (zn.tang.terran@gmail.com)