MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts
Abstract
Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors (imperfect extraction of the text, including character insertion, deletion, and permutation) can significantly impact downstream tasks such as question answering (QA). In this work, we introduce MultiOCR-QA, a multilingual QA dataset designed to analyze the effects of OCR noise on the performance of QA systems. MultiOCR-QA comprises 60K question-answer pairs covering three languages: English, French, and German. The dataset is curated from OCR'ed historical documents, allowing for the evaluation of OCR-induced challenges in question answering. We evaluate MultiOCR-QA across various levels and types of OCR errors to assess the robustness of LLMs in handling real-world digitization errors. Our findings show that QA systems are highly prone to OCR-induced errors and exhibit performance degradation on noisy OCR text.
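To make the three error types concrete, the sketch below injects character-level noise of each kind at a controllable rate. This is only an illustration of the error categories the abstract names, not the dataset's construction (MultiOCR-QA is curated from real OCR output); the function name, error rates, and the treatment of permutation as adjacent-character transposition are assumptions for this example.

```python
import random

# Illustrative sketch (not the paper's pipeline) of the three character-level
# OCR error types: insertion, deletion, and permutation (modeled here as an
# adjacent-character swap). `error_rate` controls the fraction of characters
# perturbed, mimicking different "levels" of OCR noise.
def inject_ocr_noise(text: str, error_rate: float = 0.05, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < error_rate:
            op = rng.choice(["insert", "delete", "permute"])
            if op == "insert":
                # Insert a spurious character before the current one.
                out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
                out.append(chars[i])
                i += 1
            elif op == "delete":
                # Drop the current character entirely.
                i += 1
            elif i + 1 < len(chars):
                # Permute: swap the current character with its neighbor.
                out.extend([chars[i + 1], chars[i]])
                i += 2
            else:
                out.append(chars[i])
                i += 1
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)

if __name__ == "__main__":
    clean = "The quick brown fox jumps over the lazy dog."
    for rate in (0.05, 0.15, 0.30):
        print(f"rate={rate:.2f}: {inject_ocr_noise(clean, error_rate=rate, seed=42)}")
```

Running a QA system on the same passage corrupted at increasing rates gives a simple way to observe the kind of performance degradation the paper reports.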