Bounding boxes in the pre-training data and pre-training tasks
I would like to ask two questions.
In the technical report, you wrote that
In order to obtain strong OCR and document understanding abilities, we train Idefics2 on different sources of PDF documents: 19 million industry documents from OCR-IDL (Biten et al., 2022) and 18 million pages from PDFA.
As far as I know, these two datasets contain not only the OCR-ed/extracted text but also the corresponding bounding boxes.
My first question is: How did you utilize the bounding boxes during pre-training and fine-tuning?
My second question is: What was the objective during pre-training?
Good question! It's true that our model does not support bounding boxes, while the OCR datasets we used, such as PDFA https://huggingface.co/datasets/pixparse/pdfa-eng-wds, pair the text spans with their associated bounding boxes.
We simply linearized everything: the order of the text spans in PDFA generally matches the reading order of the document, so we concatenated the text and dropped the boxes (see the sketch below). @Molbap can elaborate on this.
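
As a rough illustration, here is a minimal sketch of such a linearization step in Python. The record structure (a `spans` key holding `text`/`bbox` pairs) is a hypothetical simplification for the example, not the actual PDFA schema:

```python
def linearize_page(page: dict) -> str:
    """Drop the bounding boxes and join the text spans in stored order,
    which for PDFA generally follows the reading order of the document."""
    spans = page["spans"]  # hypothetical key: list of {"text": str, "bbox": [x0, y0, x1, y1]}
    return " ".join(span["text"] for span in spans)

# Toy two-span page; the bounding boxes are simply ignored.
page = {
    "spans": [
        {"text": "Quarterly report", "bbox": [72, 60, 320, 90]},
        {"text": "Revenue grew 12% year over year.", "bbox": [72, 110, 480, 135]},
    ]
}
print(linearize_page(page))  # -> "Quarterly report Revenue grew 12% year over year."
```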
The objective we used was then simply next-token prediction: the model tries to predict the text contained in the image. The linearized text is not always in perfect reading order, but for pre-training this still works well.
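
For concreteness, here is a minimal sketch of that objective in PyTorch, assuming a standard causal-LM setup rather than the actual Idefics2 training code: the loss is the cross-entropy between the logits at each position and the token that follows it.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab) from the model, conditioned on the image;
    token_ids: (batch, seq_len). Predict token t+1 from positions <= t,
    so shift logits and labels by one position."""
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = token_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )

# Toy example: batch of 2, sequence of 5 tokens, vocabulary of 100.
logits = torch.randn(2, 5, 100)
token_ids = torch.randint(0, 100, (2, 5))
print(next_token_loss(logits, token_ids))
```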