MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts
Abstract
Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors (imperfect extraction of the text, including character insertion, deletion, and permutation) can significantly impact downstream tasks such as question answering (QA). In this work, we introduce MultiOCR-QA, a multilingual QA dataset designed to analyze the effects of OCR noise on the performance of QA systems. MultiOCR-QA comprises 60K question-answer pairs covering three languages: English, French, and German. The dataset is curated from OCR'ed historical documents, allowing for the evaluation of OCR-induced challenges in question answering. We evaluate MultiOCR-QA across various levels and types of OCR errors to assess the robustness of LLMs in handling real-world digitization errors. Our findings show that QA systems are highly prone to OCR-induced errors and exhibit performance degradation on noisy OCR text.
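To make the three error types concrete, the sketch below injects character-level noise of each kind at a controllable rate. This is only an illustration of the error categories the abstract names, not the dataset's construction (MultiOCR-QA is curated from real OCR output); the function name, error rates, and the treatment of permutation as adjacent-character transposition are assumptions for this example.

```python
import random

# Illustrative sketch (not the paper's pipeline) of the three character-level
# OCR error types: insertion, deletion, and permutation (modeled here as an
# adjacent-character swap). `error_rate` controls the fraction of characters
# perturbed, mimicking different "levels" of OCR noise.
def inject_ocr_noise(text: str, error_rate: float = 0.05, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < error_rate:
            op = rng.choice(["insert", "delete", "permute"])
            if op == "insert":
                # Insert a spurious character before the current one.
                out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
                out.append(chars[i])
                i += 1
            elif op == "delete":
                # Drop the current character entirely.
                i += 1
            elif i + 1 < len(chars):
                # Permute: swap the current character with its neighbor.
                out.extend([chars[i + 1], chars[i]])
                i += 2
            else:
                out.append(chars[i])
                i += 1
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)

if __name__ == "__main__":
    clean = "The quick brown fox jumps over the lazy dog."
    for rate in (0.05, 0.15, 0.30):
        print(f"rate={rate:.2f}: {inject_ocr_noise(clean, error_rate=rate, seed=42)}")
```

Running a QA system on the same passage corrupted at increasing rates gives a simple way to observe the kind of performance degradation the paper reports.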