392,000 short texts (around 500 tokens each) generated by a language model
OCR-related training (error correction, training data generation, etc.)