First commit
This view is limited to 50 files because it contains too many changes. See the raw diff for the full changeset.
- .gitattributes +291 -0
- OCR_directory.sh +17 -0
- app.py +114 -67
- assets/txts/pg_0002.txt +1 -0
- assets/txts/pg_0003.txt +25 -0
- assets/txts/pg_0004.txt +10 -0
- assets/txts/pg_0005.txt +32 -0
- assets/txts/pg_0006.txt +45 -0
- assets/txts/pg_0007.txt +33 -0
- assets/txts/pg_0008.txt +35 -0
- assets/txts/pg_0009.txt +34 -0
- assets/txts/pg_0010.txt +44 -0
- assets/txts/pg_0013.txt +22 -0
- assets/txts/pg_0014.txt +30 -0
- assets/txts/pg_0015.txt +30 -0
- assets/txts/pg_0016.txt +12 -0
- assets/txts/pg_0017.txt +200 -0
- assets/txts/pg_0018.txt +204 -0
- assets/txts/pg_0019.txt +394 -0
- assets/txts/pg_0020.txt +320 -0
- assets/txts/pg_0021.txt +421 -0
- assets/txts/pg_0033.txt +30 -0
- assets/txts/pg_0034.txt +44 -0
- assets/txts/pg_0035.txt +46 -0
- assets/txts/pg_0036.txt +80 -0
- assets/txts/pg_0037.txt +42 -0
- assets/txts/pg_0038.txt +45 -0
- assets/txts/pg_0039.txt +41 -0
- assets/txts/pg_0040.txt +35 -0
- assets/txts/pg_0041.txt +27 -0
- assets/txts/pg_0042.txt +31 -0
- assets/txts/pg_0043.txt +32 -0
- assets/txts/pg_0044.txt +163 -0
- assets/txts/pg_0045.txt +53 -0
- assets/txts/pg_0046.txt +47 -0
- assets/txts/pg_0047.txt +53 -0
- assets/txts/pg_0048.txt +45 -0
- assets/txts/pg_0049.txt +41 -0
- assets/txts/pg_0050.txt +46 -0
- assets/txts/pg_0051.txt +58 -0
- assets/txts/pg_0052.txt +38 -0
- assets/txts/pg_0053.txt +46 -0
- assets/txts/pg_0054.txt +45 -0
- assets/txts/pg_0055.txt +45 -0
- assets/txts/pg_0056.txt +39 -0
- assets/txts/pg_0057.txt +98 -0
- assets/txts/pg_0058.txt +62 -0
- assets/txts/pg_0059.txt +64 -0
- assets/txts/pg_0060.txt +53 -0
- assets/txts/pg_0061.txt +42 -0
.gitattributes
CHANGED
@@ -33,3 +33,294 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0031.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0081.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0123.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0155.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0216.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0277.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0015.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0047.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0051.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0054.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0088.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0250.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0009.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0089.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0117.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0241.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0101.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0110.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0208.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0226.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0284.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0060.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0252.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0058.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0099.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0195.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0057.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0105.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0125.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0169.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0184.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0196.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0075.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0236.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0276.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0006.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0156.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0082.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0106.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0157.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0188.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0201.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0225.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0248.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0023.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0116.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0119.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0254.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0278.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0045.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0093.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0182.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0064.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0094.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0104.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0113.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0150.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0189.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0220.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0261.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0011.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0048.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0288.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0034.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0108.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0214.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0287.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0100.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0198.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0227.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0244.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0245.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0270.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0039.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0055.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0086.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0174.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0181.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0266.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0283.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0073.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0080.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0274.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0279.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0036.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0050.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0069.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0053.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0056.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0145.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0027.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0067.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0079.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0013.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0072.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0191.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0263.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0268.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0041.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0136.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0170.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0180.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0200.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0217.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0280.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0016.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0018.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0062.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0122.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0147.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0265.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0215.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0133.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0165.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0166.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0222.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0078.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0171.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0219.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0028.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0107.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0144.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0178.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0190.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0043.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0010.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0021.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0160.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0247.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0063.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0090.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0137.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0159.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0269.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0014.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0026.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0033.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0035.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0046.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0186.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0237.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0179.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0193.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0232.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0109.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0134.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0286.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0003.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0004.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0206.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0251.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0040.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0083.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0230.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0272.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0275.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0096.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0115.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0260.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0271.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0012.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0022.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0176.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0218.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0273.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0065.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0132.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0187.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0267.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0044.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0029.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0084.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0087.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0238.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0253.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0257.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0102.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0103.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0148.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0242.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0258.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0005.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0008.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0032.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0037.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0070.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0207.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0235.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0061.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0068.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0077.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0204.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0239.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0255.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0289.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0025.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0052.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0066.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0131.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0163.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0259.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0224.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0249.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0121.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0140.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0143.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0151.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0095.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0111.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0139.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0211.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0019.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0076.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0152.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0212.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0223.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0017.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0142.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0158.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0233.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0256.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0262.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0282.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0020.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0024.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0199.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0264.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0002.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0092.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0120.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0071.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0074.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0203.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0285.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0085.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0127.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0185.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0281.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0098.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0112.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0141.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0146.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0164.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0240.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0246.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0097.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0149.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0162.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0030.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0049.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0177.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0209.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0213.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0059.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0091.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0129.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0172.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0175.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0183.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0194.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0231.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0001.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0130.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0168.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0202.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0210.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0234.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0038.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0042.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0114.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0124.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0138.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0153.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0154.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0161.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0173.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0221.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0229.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0118.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0126.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0135.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0167.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0192.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0290.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0007.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0128.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0197.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0243.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0205.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0228.pdf filter=lfs diff=lfs merge=lfs -text
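Note that the per-file rules above are all covered by the `*.pdf` pattern on the first added line, since a pattern without a slash in .gitattributes matches in any directory; the individual entries most likely come from tracking each page PDF separately with Git LFS. As a hypothetical illustration (not part of this commit), the same per-file rules could be generated with a short Python loop over assets/pdfs:

# Hypothetical helper, not part of the commit: append one Git LFS filter rule
# per page PDF under assets/pdfs, mirroring the entries added above.
from pathlib import Path

rules = [
    f"{pdf.as_posix()} filter=lfs diff=lfs merge=lfs -text"
    for pdf in sorted(Path("assets/pdfs").glob("*.pdf"))
]
with open(".gitattributes", "a", encoding="utf-8") as fh:
    fh.write("\n".join(rules) + "\n")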
OCR_directory.sh
ADDED
@@ -0,0 +1,17 @@
#!/usr/bin/env bash
# Split the full manuscript into single-page PDFs (run once beforehand):
# pdftk thesis.pdf burst

# Ensure the output directories exist.
mkdir -p assets/txts assets/pngs

# Using pdftotext (or pdfminer's pdf2txt.py), extract the text of each page in
# assets/pdfs and store it in assets/txts with the same basename.
for pdf in assets/pdfs/*.pdf
do
    pdftotext "$pdf" "assets/txts/$(basename "$pdf" .pdf).txt"
    # Alternative extractor:
    # pdf2txt.py -o "assets/txts/$(basename "$pdf" .pdf).txt" "$pdf"
done

# Render each page to a PNG (ImageMagick) for display in the app.
for pdf in assets/pdfs/*.pdf
do
    convert -density 100 -quality 100 -colorspace RGB -alpha remove -alpha off "$pdf" "assets/pngs/$(basename "$pdf" .pdf).png"
done
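The same extraction step can also be scripted in Python. Below is a minimal sketch using pdfminer.six (the library behind the pdf2txt.py command referenced above); the directory layout is the one assumed by the shell script, and the snippet is an illustration rather than part of the commit:

# Minimal sketch, assuming pdfminer.six is installed; mirrors the shell loop:
# one text file per single-page PDF in assets/pdfs.
from pathlib import Path

from pdfminer.high_level import extract_text

pdf_dir = Path("assets/pdfs")
txt_dir = Path("assets/txts")
txt_dir.mkdir(parents=True, exist_ok=True)

for pdf_path in sorted(pdf_dir.glob("*.pdf")):
    text = extract_text(str(pdf_path))  # plain-text layer of the page
    (txt_dir / f"{pdf_path.stem}.txt").write_text(text, encoding="utf-8")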
app.py
CHANGED
@@ -1,67 +1,114 @@
-import
-from
-from llama_index import
-from llama_index.embeddings import HuggingFaceEmbedding
-from llama_index.
-from llama_index.
-from
+import torch
+from transformers import BitsAndBytesConfig
+from llama_index.llms.huggingface import HuggingFaceLLM
+from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+from llama_index.core import SimpleDirectoryReader
+from llama_index.core import VectorStoreIndex, SummaryIndex
+from llama_index.core.prompts import PromptTemplate
+from llama_index.core import Settings
+
+import gradio as gr
+
+
+def messages_to_prompt(messages):
+    prompt = ""
+    for message in messages:
+        if message.role == "system":
+            m = "You are an expert in the research field of document understanding, bayesian deep learning and neural networks."
+            prompt += f"<|system|>\n{m}</s>\n"
+        elif message.role == "user":
+            prompt += f"<|user|>\n{message.content}</s>\n"
+        elif message.role == "assistant":
+            prompt += f"<|assistant|>\n{message.content}</s>\n"
+
+    # ensure we start with a system prompt, insert blank if needed
+    if not prompt.startswith("<|system|>\n"):
+        prompt = "<|system|>\n</s>\n" + prompt
+
+    # add final assistant prompt
+    prompt = prompt + "<|assistant|>\n"
+
+    return prompt
+
+
+def load_RAG_pipeline():
+    # LLM
+    quantization_config = BitsAndBytesConfig(
+        load_in_4bit=True,
+        bnb_4bit_compute_dtype=torch.float16,
+        bnb_4bit_quant_type="nf4",
+        bnb_4bit_use_double_quant=True,
+    )
+
+    llm = HuggingFaceLLM(
+        model_name="HuggingFaceH4/zephyr-7b-alpha",
+        tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
+        query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
+        context_window=3900,
+        max_new_tokens=256,
+        model_kwargs={"quantization_config": quantization_config},
+        # tokenizer_kwargs={},
+        generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
+        messages_to_prompt=messages_to_prompt,
+        device_map="auto",
+    )
+
+    # Llama-index
+    Settings.llm = llm
+    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
+    # Settings.chunk_size = 512
+    # Settings.chunk_overlap = 50
+
+    # raw data
+    documents = SimpleDirectoryReader("assets/txts").load_data()
+    vector_index = VectorStoreIndex.from_documents(documents)
+    # summary_index = SummaryIndex.from_documents(documents)
+    query_engine = vector_index.as_query_engine(response_mode="compact", similarity_top_k=3)
+    return query_engine
+
+
+query_engine = load_RAG_pipeline()
+
+
+# These are placeholder functions to simulate the behavior of the RAG setup.
+# You would need to implement these with the actual logic to retrieve and generate answers based on the document.
+def get_answer(question, temperature, nucleus_sampling, max_tokens):
+    # Here you should implement the logic to generate an answer based on the question and the document.
+    # For example, you could use a machine learning model for RAG.
+    # answer = "This is a placeholder answer."
+    # https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/settings/#setting-local-configurations
+    return query_engine.query(question)
+
+
+def get_answer_page(question):
+    # Implement logic to retrieve the page number or an image of the page with the answer.
+    answer_page = "Page X - placeholder image."
+    return answer_page
+
+
+# Create the gr.Interface function
+def ask_my_thesis(question, temperature, nucleus_sampling, max_tokens):
+    answer = get_answer(question, temperature, nucleus_sampling, max_tokens)
+    answer_page = get_answer_page(question)
+    return answer, answer_page
+
+
+# Set up the interface options based on the design in the image.
+iface = gr.Interface(
+    fn=ask_my_thesis,
+    inputs=[
+        gr.Textbox(label="Question", placeholder="Type your question here..."),
+        gr.Slider(0, 1, value=0.7, label="Temperature"),
+        gr.Slider(0, 1, value=0.9, label="Nucleus Sampling"),
+        gr.Slider(1, 500, value=100, label="Max Generated Number of Tokens"),
+    ],
+    outputs=[gr.Textbox(label="Answer"), gr.Image(label="Answer Page")],
+    title="Ask my thesis",
+    description="Chat with the manuscript: ask questions and receive answers with references.",
+    allow_flagging="never",
+)
+
+# Start the application.
+if __name__ == "__main__":
+    iface.launch()
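get_answer_page above is still a placeholder. One way to fill it in is to read the retrieval metadata that llama_index attaches to a query response and map the top retrieved page text back to its rendered image in assets/pngs (produced by OCR_directory.sh). The sketch below assumes that the response object exposes source_nodes and that SimpleDirectoryReader records the originating file name under the "file_name" metadata key; both details vary across llama_index versions, so treat this as an illustration rather than the committed implementation:

import os

def get_answer_page_from_response(response):
    # Map the top retrieved chunk (e.g. assets/txts/pg_0057.txt) back to the
    # matching page image (assets/pngs/pg_0057.png). Assumes the metadata key
    # "file_name" is present; adjust for the llama_index version in use.
    if not getattr(response, "source_nodes", None):
        return None
    metadata = response.source_nodes[0].node.metadata
    file_name = metadata.get("file_name", "")
    page_stem = os.path.splitext(os.path.basename(file_name))[0]
    image_path = os.path.join("assets", "pngs", f"{page_stem}.png")
    return image_path if os.path.exists(image_path) else None

ask_my_thesis could then pass the response returned by get_answer into this helper and hand the resulting image path to the gr.Image output instead of the placeholder string.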
assets/txts/pg_0002.txt
ADDED
@@ -0,0 +1 @@
assets/txts/pg_0003.txt
ADDED
@@ -0,0 +1,25 @@
Intelligent Automation for AI-Driven Document Understanding

Jordy VAN LANDEGHEM

Examination committee:
em. Prof. Dr. ir. Jean-Pierre Celis, chair
Prof. Dr. Marie-Francine Moens, supervisor
Prof. Dr. Matthew B. Blaschko, supervisor
Prof. Dr. ir. Johan Suykens
Prof. Dr. ir. Tinne Tuytelaars
Prof. Dr. Marcus Rohrbach (TU Darmstadt)
Prof. Dr. Wenpeng Yin (Penn State University)
Dr. Bertrand Anckaert (Contract.fit)

March 2024

Dissertation presented in partial fulfillment of the requirements for the degree of Doctor of Engineering Science (PhD): Computer Science
assets/txts/pg_0004.txt
ADDED
@@ -0,0 +1,10 @@
© 2024 KU Leuven – Faculty of Engineering Science
Uitgegeven in eigen beheer, Jordy Van Landeghem, Celestijnenlaan 200A box 2402, B-3001 Leuven (Belgium)

Alle rechten voorbehouden. Niets uit deze uitgave mag worden vermenigvuldigd en/of openbaar gemaakt worden door middel van druk, fotokopie, microfilm, elektronisch of op welke andere wijze ook zonder voorafgaande schriftelijke toestemming van de uitgever.

All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm, electronic or any other means without written permission from the publisher.
assets/txts/pg_0005.txt
ADDED
@@ -0,0 +1,32 @@
Preface

This journey has been long and arduous, but I have finally reached an end. At this end, I have a thesis that I am proud of, and I have learned a lot. As I look back, I have been very fortunate to have had the support of many people, and I would like to take this opportunity to thank them.

First and foremost, I would like to thank my supervisors, Sien and Matthew, for their guidance and support throughout this journey. Sien has taught me the importance of being thorough and meticulous, striving for diligence and perfection from the get-go. I still remember how patiently she helped me with my first paper, holding a Sunday afternoon call from her attic/home-office, helping me hone the presentation and writing. Involving Matthew as the co-supervisor has been the best decision for my personal development, as he offered a different perspective on my work, always challenging me to look at problems from the lens of statistical theory and machine learning fundamentals. My knee-jerk reaction to start implementing things as soon as possible was often met with a “slow down, think about it first” from Matthew, which has been invaluable in my development as a researcher. I am grateful to both of them for their patience and understanding, and for giving me the freedom to explore my own ideas and interests.

Next, a sincere thanks to my jury members, for taking the time to read my thesis and for their valuable feedback. Furthermore, I would like to thank het Vlaams Agentschap Innoveren & Ondernemen (VLAIO) for awarding the Baekeland grant without which this PhD would not have been possible.

Pol & Bertrand, thanks for having me contribute to your dream to rid the world of boring administrative processes and paperwork. Technically my bosses, but in reality you are the embodiment of leadership by example, and I am grateful for the many lessons I have learned from you. I am grateful for the many opportunities you have given me to grow as a researcher and as a person. Many thanks to my past and present colleagues at Contract.fit, for always
assets/txts/pg_0006.txt
ADDED
@@ -0,0 +1,45 @@
preaching automation, inspiring me, and for having fun along the way. I am grateful to my LIIR colleagues at KU Leuven, particularly the folks from office 4.34 for the many interesting discussions and whiteboard sessions, whenever I occasionally popped into the office.

I was fortunate to travel to many places during my PhD (Lausanne, Lisbon, Barcelona, San Jose, Paris, Waikoloa), and I have met many people along the way. My DUDEs, you have been the trigger to complete my PhD, reinvigorating my passion for research and inspiring me for my future career. How crazy is it that we conceived the seeds of the DUDE project in a pirates bar, on a hotel rooftop, and from a hospital bed after my back surgery?

Finally, I would like to thank my family and friends for their support and encouragement throughout this journey. My parents, Peter en Nadine, you have showed me that hard work pays off, and merci for the many sacrifices you have made to give me the best possible education and life. Marijke, you are the love of my life, and although I am not religious, you are my goddess, de mammiej. Feliz, when you came into our lives, you added an extra dimension. I used to see in 2D, now I see in 3D. Forever your father, your pappiej. Wes en Jen, thanks for showing me to never give up, keep on pushing, even when you are at your lowest, there is a way out, and only hard work will get you there. Cornbois - Bryan, Emile, (even) Jan, for our friendship, I fail to make an exhaustive definition. I wish for many more years of friendship from my like-minded brothers. John, Teunen, Wannes, if there is ever a zombie apocalypse, I know that I can count on you to have my window. Kessel-city - Poohke, Vinny, Kweinch etc., thanks for keeping on pushing the bar higher, and inspiring me with your ambition and drive. Gustaf, thanks for the many laughs (#velleke) and the much-needed distraction. Elstipoes, you are my oldest friend, and I am grateful for the many years of friendship. Woutje, thanks for your contagious optimism and the mancave during university. Leuvenbende, you were the ones that made university fun and enjoyable. Individually and together you are beautiful people, and I cherish our yearly reunions. Lauren en Yannick, thanks for letting me win at Mario Kart. I might be forgetting some people, but I would like to thank all my friends for bringing joy, for keeping me grounded, and for reminding me that there is more to life than work.

Having studied literature in my Bachelor’s, it feels appropriate to finish with a quote wrongly attributed to Ernest Hemingway: “Write drunk; edit sober.”

Jordy Van Landeghem
Gurdo, Pogomeister, Jorre, De Van Laaandeghem
February, 2024
Kessel, Belgium
assets/txts/pg_0007.txt
ADDED
@@ -0,0 +1,33 @@
Abstract

Human communication is increasingly document-based, requiring machines to understand a wide variety of visually-rich documents to assist humans in their daily lives. Amid the digital evolution, documents continue to facilitate crucial human and organizational interactions but are tethered to manual processing, causing inefficiency. We examine why organizations lag in adopting automated document processing solutions and outline two primary challenges: the complexity of processing long, multimodal documents algorithmically and the necessity for reliability and control over associated risks. Automated decision-making is key to improving the efficiency of document processing, but the current state-of-the-art technology is not yet reliable and robust enough to be deployed in autonomous systems.

The practical objective set is to develop Intelligent Automation (IA) systems capable of estimating confidence in their actions, thereby increasing throughput without accruing additional costs due to errors. We analyze the key challenges and propose solutions to bridge the gap between research and practical applications, with a focus on realistic datasets and experimental methodologies. Building upon foundations of Document Understanding (DU), this dissertation introduces advanced methodologies combining Machine Learning, Natural Language Processing, and Computer Vision.

Addressing the evident gaps in research, this work presents novel methods for predictive uncertainty quantification (PUQ) alongside practical frameworks for evaluating the robustness and reliability of DU technologies. The contribution culminates in the introduction of two novel multipage document classification datasets and a multifaceted benchmark, DUDE, designed to rigorously challenge and assess the state-of-the-art in DU. Extensive experiments across these datasets reveal that while advancements have been made, significant room for improvement remains, particularly in long-context modeling for multipage document processing and calibrated, selective document visual question answering. Efficient DU is also explored, revealing the effectiveness of
assets/txts/pg_0008.txt
ADDED
@@ -0,0 +1,35 @@
knowledge distillation (KD) model compression in visually-rich document layout analysis (DLA) and classification.

Through empirical studies and methodological contributions, this dissertation has the following contributions and findings:

First, in a benchmarking study of established methods on real-world text classification, we find that our novel hybrid method ‘Concrete Dropout Ensemble’ performs best, enhancing in-domain calibration and novel class detection, even at a smaller ensemble size. Detailed ablation experiments reveal the impact of prior, neural architecture, and hyperparameter choices on estimation quality.

Second, on a prototypical DU task, we identify challenges in DU progress and propose a formalization of multipage document classification scenarios, constructed novel datasets, and conducted an experimental analysis showing the promise of multipage representation learning and inference.

Third, we introduce DUDE, incorporating multifaceted challenges and principles for a comprehensive evaluation of generic DU. Next to our own benchmarking, we organize a competition, revealing that while newer document foundation models show promise, they struggle with questions involving visual evidence or complex reasoning. Moreover, we find severe problems in the ability of Large Language Models (LLMs) to reason about documents in their entirety, highlighting issues with hallucination, long-context reasoning and control.

Fourth, we propose the first methodology for enriching documents with semantic layout structure using distilled DLA models. We apply KD to visual document tasks, unraveling the influence of various task and architecture components.

Finally, the dissertation concludes with a discussion of the findings and implications for future research, emphasizing the need for advancements in multipage document representation learning and the importance of realistic datasets and experimental methodologies to measurably move forward to reliable and robust IA-DU technology.
assets/txts/pg_0009.txt
ADDED
@@ -0,0 +1,34 @@
Beknopte samenvatting

Menselijke communicatie is in toenemende mate documentgebaseerd, waarbij machines een breed aanbod aan visueel-rijke documenten moeten begrijpen om mensen in hun dagelijks leven te assisteren. Te midden van de digitale evolutie blijven documenten cruciale menselijke en organisatorische interacties faciliteren, maar zijn ze gebonden aan handmatige verwerking, wat inefficiëntie veroorzaakt. We onderzoeken waarom organisaties achterblijven bij het adopteren van geautomatiseerde documentverwerkingsoplossingen en schetsen twee primaire uitdagingen: de complexiteit van het algoritmisch verwerken van lange, multimodale documenten en de noodzaak van betrouwbaarheid en controle over daarmee samenhangende risico’s. Geautomatiseerde besluitvorming is essentieel voor het verbeteren van de efficiëntie van documentverwerking, maar de huidige stand van de technologie is nog niet betrouwbaar en robuust genoeg om ingezet te worden in autonome toepassingen.

Het praktische doel dat gesteld wordt, is het ontwikkelen van systemen voor Intelligente Automatisering (IA) die in staat zijn om vertrouwen in hun acties te schatten, daarmee de doorvoer verhogend zonder extra kosten vanwege fouten. We analyseren de belangrijkste uitdagingen en stellen oplossingen voor om de kloof tussen onderzoek en praktische toepassingen te overbruggen, met een focus op realistische datasets en experimentele methodologieën. Voortbouwend op de fundamenten van Documentinterpretatie (DI), introduceert dit proefschrift geavanceerde methodologieën die Machinaal Leren, Natuurlijke Taalverwerking en Computer Visie combineren.

Door de duidelijke hiaten in onderzoek aan te pakken, presenteert dit werk nieuwe methoden voor predictieve onzekerheidskwantificering (POK) naast praktische kaders voor het evalueren van de robuustheid en betrouwbaarheid van DI-technologieën. De bijdrage culmineert in de introductie van twee nieuwe datasets voor classificatie van multipagina documenten en een veelzijdige benchmark, DUDE, ontworpen om de state-of-the-art in DI rigoureus uit te dagen en te beoordelen. Uitgebreide experimenten met deze datasets
assets/txts/pg_0010.txt
ADDED
@@ -0,0 +1,44 @@
onthullen dat er weliswaar vooruitgang is geboekt, maar dat er nog significant veel ruimte is voor verbetering, met name in de lange-contextmodellering voor de verwerking van multipagina documenten en gekalibreerd, selectief visueel vraagbeantwoording van documenten. Meer schaalbaar DI wordt ook verkend, waarbij de effectiviteit van kennisdistillatie (KD) voor modelcompressie in visueel-rijke layoutanalyse (DLA) en classificatie van documenten aan het licht komt.

Door middel van empirische studies en methodologische bijdragen, heeft dit proefschrift de volgende bijdragen en bevindingen:

Ten eerste vinden we in een benchmarkstudie van gevestigde POK-methoden op tekstclassificatie in de echte wereld dat onze nieuwe hybride POK-methode ’Concrete Dropout Ensemble’ het beste presteert, de kalibratie binnenshuis verbeterend en detectie van nieuwe klassen, zelfs met een kleiner ensemble. Gedetailleerde ablatie-experimenten onthullen de impact van voorafgaande kennis, neurale architectuur en keuzes van hyperparameters op de kwaliteit van POK-schatting.

Ten tweede identificeren we uitdagingen in de vooruitgang van DI en stellen een formalisatie voor van multipagina documentclassificatiescenario’s, bouwen novel datasets, en voeren een experimentele analyse uit die de belofte van multipagina representatie-leren en inferentie toont.

Ten derde introduceren we DUDE, waarin veelzijdige uitdagingen en principes worden voorgesteld voor een uitgebreide evaluatie. Naast onze eigen benchmarking organiseren we een competitie, waaruit blijkt dat hoewel nieuwere modellen veelbelovend zijn, ze het moeilijk hebben met vragen die visueel bewijs of complex redeneren vereisen. Bovendien vinden we ernstige problemen in het vermogen van Grote Taalmodellen (LLMs) om over documenten in hun geheel te redeneren, wat problemen benadrukt met hallucinatie, redeneren met lange context en controle.

Ten vierde stellen we de eerste experimentele methodologie voor om documenten te verrijken met semantische layoutstructuur met behulp van gedestilleerde DLA-modellen. We passen KD toe op visuele documenttaken, waarbij we de invloed van verschillende architectuurcomponenten van taken ontrafelen.

Ten slotte sluit het proefschrift af met een bespreking van de bevindingen en implicaties voor toekomstig onderzoek, waarbij de noodzaak wordt benadrukt voor vooruitgang in multipagina documentrepresentatie-leren en het belang van realistische datasets en experimentele methodologieën om meetbaar vooruitgang te boeken naar betrouwbare en robuuste IA-DI technologie.
assets/txts/pg_0013.txt
ADDED
@@ -0,0 +1,22 @@
List of Abbreviations

AAPD Arxiv Academic Paper Dataset
Acc_ID Accuracy in-domain
Acc_OOD Accuracy out of domain
AI Artificial Intelligence
ANLS Average Normalized Levenshtein Similarity
AUPR Area Under the Precision-Recall Curve
AURC Area-Under-Risk-Coverage-Curve
AUROC Area Under the Receiver Operating Characteristic curve
BDL Bayesian Deep Learning
BNN Bayesian Neural Network
BPM Business Process Management
CE Cross-Entropy
CER Character Error Rate
COCO Common Objects in Context
CSF Confidence Scoring Function
CV Computer Vision
DC Document Classification
DG Document Generation
assets/txts/pg_0014.txt
ADDED
@@ -0,0 +1,30 @@
DL Deep Learning
DLA Document Layout Analysis
DNN Deep Neural Network
DocAI Document AI
DocVQA Document Visual Question Answering
DOD Document Object Detection
DU Document Understanding
DUDE Document UnderstanDing of Everything
ECE Expected Calibration Error
ELBO Evidence Lower Bound
ERM Empirical Risk Minimization
FasterRCNN Faster Region-based Convolutional Neural Network
FP False Positives
IA Intelligent Automation
ICDAR International Conference on Document Analysis and Recognition
IDP Intelligent Document Processing
i.i.d. Independent and Identically Distributed
IOB/IOBES Inside, Outside, Beginning / End, Single
KD Knowledge Distillation
KIE Key Information Extraction
LLM Large Language Model
MAP Maximum-a-Posteriori
mAP Mean Average Precision
MCD Monte Carlo Dropout
assets/txts/pg_0015.txt
ADDED
@@ -0,0 +1,30 @@
MCMC Markov Chain Monte-Carlo
MDLT Multi-Domain Long-Tailed Recognition
MECE Mutually Exclusive and Collectively Exhaustive
MI Mutual Information
ML Machine Learning
MSE Mean Squared Error
MSP Maximum Softmax Probability
MU Model Uncertainty
NLG Natural Language Generation
NLL Negative Log Likelihood
NLP Natural Language Processing
NN Neural Network
OCR Optical Character Recognition
OOD Out-of-Distribution
PCC Pearson Correlation Coefficient
PUQ Predictive Uncertainty Quantification
RERM Regularized Empirical Risk Minimization
ResNet Residual Network
RPA Robotic Process Automation
SaaS Software-as-a-service
SNGP Spectral-normalized Neural Gaussian Process
SOTA State-of-the-art
STP Straight-Through-Processing
TSR Table Structure Recognition
assets/txts/pg_0016.txt
ADDED
@@ -0,0 +1,12 @@
VDU Visual Document Understanding
VI Variational Inference
VLM Vision Language Model
VQA Visual Question Answering
VRD Visually-Rich Document
WER Word Error Rate
assets/txts/pg_0017.txt
ADDED
@@ -0,0 +1,200 @@
Contents

Abstract . . . iii
Beknopte samenvatting . . . v
List of Abbreviations . . . xii
Contents . . . xiii
List of Figures . . . xix
List of Tables . . . xxv

1 Introduction . . . 1
  1.1 Research Context . . . 4
  1.2 Problem Statement and Questions . . . 6
      1.2.1 Reliable and Robust Deep Learning . . . 6
      1.2.2 Realistic and Efficient Document Understanding . . . 7
  1.3 Outline . . . 9

2 Fundamentals . . . 11
  2.1 Statistical Learning . . . 12
      2.1.1 Neural Networks . . . 14
      2.1.2 Probabilistic Evaluation . . . 15
      2.1.3 Architectures . . . 16
            2.1.3.1 Convolutional Neural Networks . . . 17
            2.1.3.2 Language Neural Networks . . . 18
            2.1.3.3 Transformer Network . . . 19
  2.2 Reliability and Robustness . . . 21
      2.2.1 Generalization and Adaptation . . . 22
      2.2.2 Confidence Estimation . . . 23
      2.2.3 Evaluation Metrics . . . 24
assets/txts/pg_0018.txt
ADDED
@@ -0,0 +1,204 @@
|
    2.2.4 Calibration
    2.2.5 Predictive Uncertainty Quantification
    2.2.6 Failure Prediction
  2.3 Document Understanding
    2.3.1 Task Definitions
    2.3.2 Datasets
    2.3.3 Models
    2.3.4 Challenges in Document Understanding
      2.3.4.1 Long-Context Modeling
      2.3.4.2 Document Structure Modeling
  2.4 Intelligent Automation

I Reliable and Robust Deep Learning

3 Benchmarking Scalable Predictive Uncertainty in Text Classification
  3.1 Introduction
  3.2 Related Work
  3.3 Uncertainty Methods
    3.3.1 Quantifying Uncertainty in Deep Learning
    3.3.2 Predictive Uncertainty Methods
      3.3.2.1 Monte Carlo Dropout
      3.3.2.2 Deep Ensemble
      3.3.2.3 Concrete Dropout
      3.3.2.4 Heteroscedastic Extensions
    3.3.3 Uncertainty Estimation
    3.3.4 Motivating Hybrid Approaches
    3.3.5 Uncertainty Calibration under Distribution Shift
  3.4 Experimental Methodology
    3.4.1 Proposed Hybrid Approaches
    3.4.2 Datasets
    3.4.3 Architecture
    3.4.4 Evaluation metrics
    3.4.5 Experimental design
      3.4.5.1 In-domain Setting
      3.4.5.2 Cross-domain Setting
      3.4.5.3 Novelty Detection Setting
  3.5 Results
    3.5.1 Experiment: In-domain
    3.5.2 Experiment: Cross-domain
    3.5.3 Experiment: Novelty Detection
    3.5.4 Experiment: Ablations
      3.5.4.1 Diversity
|
assets/txts/pg_0019.txt
ADDED
@@ -0,0 +1,394 @@
|
      3.5.4.2 NLP Architecture
      3.5.4.3 Ensemble size M
      3.5.4.4 Concrete Dropout p
  3.6 Discussion
  3.7 Additional Uncertainty Approaches
    3.7.1 Stochastic Gradient MCMC Methods
    3.7.2 Spectral-normalized Neural Gaussian Process
      3.7.2.1 SNGP Results
      3.7.2.2 SNGP Discussion
  3.8 Limitations
  3.9 Chapter Conclusion

II Realistic and Efficient Document Understanding

4 Beyond Document Page Classification: Design, Datasets, and Challenges
  4.1 Introduction
  4.2 Problem Formulation
  4.3 Balancing Research & Applications
  4.4 Experimental Study
  4.5 Challenges and Guidelines
    4.5.1 Divergence of Tasks: f
    4.5.2 Divergence of Label Space: Y
    4.5.3 Divergence of Input Data: X
    4.5.4 Maturity of Evaluation Methodology
  4.6 Chapter Conclusion

5 Document UnderstanDing of Everything (DUDE)
  5.1 Introduction
  5.2 Related Work
  5.3 DUDE Dataset
    5.3.1 Gathering Documents
    5.3.2 Annotation Process
    5.3.3 Dataset Statistics
    5.3.4 Diagnostic Subsets
    5.3.5 Evaluation
  5.4 DUDE Competition
    5.4.1 Challenge Objectives
    5.4.2 Challenge Contributions
    5.4.3 Motivation and Scope
      5.4.3.1 Desired Generalization
|
assets/txts/pg_0020.txt
ADDED
@@ -0,0 +1,320 @@
|
    5.4.4 DUDE Competition Protocol
      5.4.4.1 Task Formulation
      5.4.4.2 Evaluation Protocol
  5.5 DUDE Benchmark
    5.5.1 Baselines
    5.5.2 Analysis & Discussion
  5.6 Detailed Results Analysis
    5.6.1 Within Model Class Analysis
      5.6.1.1 Encoder vs. Decoder
      5.6.1.2 Incorporating Layout & Vision
      5.6.1.3 Toward Long Document Processing
      5.6.1.4 Diagnosis of LLM Results
    5.6.2 Assessing Confidence
  5.7 DUDE Competition Results
    5.7.1 Submitted Methods
    5.7.2 Performance Analysis
  5.8 Chapter Conclusion

6 DistilDoc: Knowledge Distillation for Visually-Rich Document Applications
  6.1 Introduction
  6.2 Related Work
  6.3 Experimental Setup
    6.3.1 Datasets
    6.3.2 Architectures and Backbones
    6.3.3 KD Methods
    6.3.4 Evaluation
    6.3.5 DLA-enriched LLM prompting
  6.4 Results & Discussion
  6.5 Chapter Conclusion

7 Conclusion
  7.1 Summary
  7.2 Perspectives For Future Research
    7.2.1 Open Problems In Reliability & Robustness
    7.2.2 A Future-Proof Design Of IA-DU
      7.2.2.1 The ‘Ultimate’ DU Dataset?
      7.2.2.2 A Feature-complete IA-DU Solution?

Bibliography

A Appendix - PUQ
  A Implementation Details
|
assets/txts/pg_0021.txt
ADDED
@@ -0,0 +1,421 @@
|
    A.1 Software and Data
    A.2 Hyperparameter Defaults
  B Practical Considerations
    B.1 Take-home Summary
    B.2 Compute vs. Performance Trade-off
  C Detailed Experiment Results
    C.1 Zoom-in Benchmark Evidence
    C.2 Absolute Benchmark Results

B Appendix - BDPC
  A Existing DC Datasets
  B Visualization of Proposed DC Datasets

C Appendix - DUDE
  A Baseline Experiments Setup
    A.1 Hyperparameter Defaults
    A.2 Generative LLM Prompt Fine-tuning
    A.3 Confidence Estimation
    A.4 Evaluation
  B Qualitative Examples
    B.1 Qualitative Examples - Competition

D Appendix - KDD
  A Code and Datasets
  B Implementation Details
  C Task Definitions
  D Additional Experiment Results
    D.1 Tobacco-3482 Results
    D.2 PRImA Results
    D.3 RVL-CDIP-N Results
    D.4 Downstream DocVQA Results
    D.5 Ablation Experiments

Curriculum

Publications
|
assets/txts/pg_0033.txt
ADDED
@@ -0,0 +1,30 @@
|
Chapter 1

Introduction

“Amid significant life events—like buying a house or expecting your firstborn child—lies a less cheerful reality that I experienced firsthand: the hassle of dealing with manual paperwork.

For the former case, this required a lot of back-and-forth with the bank, the notary, and the real estate agent, with each of them requiring a different set of documents (e.g., monthly pay stubs, bank statements, copies of national registry, etc.) to be filled in, signed, and sent back for processing. On the side of the document processors, each document needed to be classified, key information extracted, and the information validated against other documents to be able to prove my solvency in making an offer, applying for a loan, or being drafted as the future house owner. In between all parties and external organizations, even more documents were either created, adapted, or passed along, such as the offer, the loan agreement, the deed of sale, a soil certificate, etc.

This juxtaposition of valuable moments in life with cumbersome administrative procedures involving manual document processing forms the backdrop against which I aim to explore and propose potential solutions in this thesis.”
|
assets/txts/pg_0034.txt
ADDED
@@ -0,0 +1,44 @@
|
Documents are containers of information that are easily shareable. The concept of a document dates back to when humans started writing and has been a cornerstone of human communication ever since. In the age of digital technology, documents are still the primary means of communication between humans and organizations and form the backbone of many business processes. Human communication is increasingly happening through digital channels, and the COVID-19 pandemic has only accelerated this trend. We are increasingly living in a “document society” [53], dependent on documents in our daily lives or for recording second-hand knowledge. With instant gratification as the norm in the digital age, people expect similar seamless interactions with businesses and governments. While digitization has increased the speed and ease of document-based communication, document processing remains a largely human effort, with organizations drowning under the sheer volume of documents they receive.

So why have organizations not switched en masse to automated document processing?

The answer lies for some part in (I) the complexity of the task, and for the other part in (II) the need for reliability and risk control.

(I) While it might be straightforward for a human (white-collar) worker to read a long, structured document, understand its contents, categorize it, and extract crucial information accordingly, this is not so easy for a machine. This could be perceived as an instance of Moravec’s paradox [319], which states that tasks that are easy for humans are hard for machines, and vice versa. However, in recent times, significant strides forward have been made thanks to technological advances combining Natural Language Processing (NLP), Computer Vision (CV) and Machine Learning (ML). Document Understanding (DU) is the umbrella term for both the end-to-end solution and the research field studying how to make machines interpret and understand documents (elaborated on in Section 2.3). It has seen a surge in interest in the past few years, with the rise of large-scale pretrained Language and Vision models (LLM, VLM) [52, 94, 101, 187, 380, 383, 502] capable of modeling document inputs.

What makes DU challenging is that it encompasses multiple subtasks, each of which is a research field in its own right, such as Optical Character Recognition (OCR), Document Layout Analysis (DLA), Document Classification (DC), Key Information Extraction (KIE), Visual Question Answering (VQA), etc. The complexity of the task is further increased by the fact that documents are multimodal, containing both text and images, and that they are compositional, i.e., the meaning of the document is not just the sum of its parts. Information can appear in a wide range of forms including text, images, tables or graphs, and be spread across multiple pages. Moreover, the meaning of a document
|
assets/txts/pg_0035.txt
ADDED
@@ -0,0 +1,46 @@
|
can change depending on the context in which it is used. As an artifact of the communication channel, not all documents are born digitally, and the quality of the document can vary greatly, with some documents being handwritten, scanned with low resolution, or even a picture of a document. Furthermore, documents are often not standardized templates and can be highly variable in terms of layout, structure, and content. Finally, the longer the document, the more computationally demanding it becomes to process, and the more likely it is to induce errors, which can be harder to detect.

Addressing the inherent challenges of document processing, and achieving high levels of accuracy, processing speed, reliability, robustness, and scalability in DU forms the applied scope of this thesis.

(II) Consider the example given of the birth certificate. While I might not appreciate as much the manual handling of this document, if they had registered my baby girl’s name (Feliz, Spanish writing without an accent on the ‘e’) incorrectly, I would be pretty upset as this could have further repercussions. Whereas this error might be easily rectified, it is not so easy to do so in the case of a mortgage application, where the wrong information could lead to a rejection of the application, or even worse, a loan agreement with the wrong terms and conditions. This demonstrates that, even when full automation of document processing is in high demand, it is not always desirable if the risk of failure might be too large.

Nevertheless, a lot of the potential for automation remains untapped, and organizations are increasingly looking for solutions to fully automate their document processing workflows. However, full automation, implying perfect recognition of document categories and impeccable information extraction, is an unattainable goal with the current state of technology [79]. The more realistic objective set is Intelligent Automation (IA) (elaborated on in Section 2.4), where the goal is to have the machine estimate confidence in its predictions, deriving business value with as high as possible volumes of perfect predictions (Straight-Through-Processing, STP) without incurring extra costs (False Positives, FP).

The leitmotif of this thesis will be the fundamental enablers of IA: confidence estimation and failure prediction.

Calibrated uncertainty estimation with efficient and effective DU technology will allow organizations to confidently automate their document processing workflow, while keeping a human in the loop only for predictions with a higher likelihood of being wrong. To date, however, little research has addressed the question of how to make DU technology more reliable, as is illustrated in a toy analysis (Table 1.1) reporting the absence of many IA-related keywords in the Proceedings of the 2021 International Conference on Document Analysis and
|
assets/txts/pg_0036.txt
ADDED
@@ -0,0 +1,80 @@
|
Recognition (ICDAR) [289].

The thesis aims to fill this gap by proposing novel methods for uncertainty estimation and failure prediction (Part I), and by providing a framework for benchmarking and evaluating the reliability and robustness of DU technology, as close as possible to real-world requirements (Part II).
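To make the IA objective above concrete, a minimal sketch of confidence-based routing between straight-through processing and human review could look as follows. This is illustrative only: the helper `classify`, the field names and the 0.95 threshold are assumptions, not part of the thesis or any specific product.

# Sketch: confidence-thresholded routing for Intelligent Automation.
# `classify` is an assumed helper returning (label, confidence) for a document;
# the threshold is an arbitrary illustrative operating point.
def route(document, classify, threshold=0.95):
    label, confidence = classify(document)
    if confidence >= threshold:
        return {"decision": "straight-through", "label": label}  # STP: no human touches it
    return {"decision": "human-review", "label": label}          # low confidence -> review queue

Raising the threshold trades STP volume for fewer false positives; whether the confidence values can be trusted to make that trade-off is exactly what calibration is about.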
Table 1.1. Comparative analysis of keywords in the ICDAR 2021 proceedings. While many DU subtasks are represented, there is a lack of keywords related to IA. Do note that calibration is used in the context of camera calibration, and not in the context of confidence estimation.

  keyword             freq    keyword                       freq
  document            3388    calibration/calibrate           33
  classification       242    temperature scaling              0
  key information       56    failure prediction               0
  question answering    106    misclassification detection     0
  layout analysis       223    out-of-distribution (OOD)      25
                               predictive uncertainty          0

In the remainder of the Introduction, I will sketch the surrounding research context, followed by the problem statement and questions, and finally the outline of the thesis manuscript.

1.1 Research Context

All chapters of this dissertation have been executed as part of the Baekeland PhD mandate (HBC.2019.2604) with financial support of VLAIO (Flemish Innovation & Entrepreneurship) and Contract.fit. The latter is a Belgian-based software-as-a-service (SaaS) provider of Intelligent Document Processing (IDP) drawing on innovations in DU to power their product suite (email-routing, Parble), and my generous employer since 2017.

Some of the joint work (Chapter 5) has been partially funded by a PhD Scholarship from AGAUR (2023 FI-3-00223), and the Smart Growth Operational Programme under projects no. POIR.01.01.01-00-1624/20 (Hiper-OCR - an innovative solution for information extraction from scanned documents) and POIR.01.01.01-00-0605/19 (Disruptive adoption of Neural Language Modelling for automation of text-intensive work).

Moreover, given that the dissertation work has been performed over a large span of time, it warrants putting it in the larger context and dynamics of AI innovations, the state of DU as a field, how notions of ’reliability’ have evolved over time, and finally the business context.
|
assets/txts/pg_0037.txt
ADDED
@@ -0,0 +1,42 @@
|
This thesis started almost concurrently with the rise of the global COVID-19 pandemic, making it hard to foster collaborations in the early stages. At the start of the PhD, DU methodology was fairly established, with OCR and Transformer-based pipelines such as BERT [94] and LayoutLM [502], which is why we first prioritized the more fundamental challenge of decision-making under uncertainty (Part I); which was followed by a step back, closer to applied DU research (Part II).

The research community’s understanding of ‘reliability’ has also evolved over time. When starting the work of Chapter 3, the notion of reliability was mostly associated with uncertainty quantification and calibration. However, calibration is not a panacea, and only fairly recently, Jaeger et al. [193] proposed a more general framework encapsulating reliability and robustness. They promote the more concrete and useful notion of failure prediction, which still involves confidence/uncertainty estimation yet with an explicit definition of the failure source which one wants to detect or guard against, e.g., in-domain test errors, changing input feature distributions, novel class shifts, etc. Since I share a similar view of the problem, I have focused following works on the more general notion of failure prediction, which is also more in line with the business context of IA.
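As a hedged illustration of what failure prediction asks of a confidence score (a generic check, not a method proposed in this thesis; the arrays below are invented toy values), one common evaluation is whether confidence ranks correct predictions above failures, e.g., via AUROC:

# Sketch: evaluating a confidence score as a failure predictor.
# Assumes scikit-learn is available; `confidences` and `is_correct` are
# illustrative arrays for some classifier's test predictions.
import numpy as np
from sklearn.metrics import roc_auc_score

confidences = np.array([0.99, 0.95, 0.60, 0.85, 0.40])  # model confidence per prediction
is_correct  = np.array([1,    1,    0,    1,    0])      # 1 = correct, 0 = failure

# Higher AUROC means confidence separates correct predictions from failures better.
auroc = roc_auc_score(is_correct, confidences)
print(f"failure-prediction AUROC: {auroc:.2f}")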
Whereas we originally intended to work on multi-task learning of DU subtasks, the rise of general-purpose LLMs offering a natural language interface to documents rather than discriminative modeling (e.g., ChatGPT [52, 344]), prompted us toward evaluating this promising technology in the context of DU. More importantly, we observed the lack of sufficiently complex datasets and benchmarks in DU that would allow us to tackle larger, more fundamental questions such as ’Do text-only LLMs suffice for most low-level DU subtasks?’ (subsequently tackled in Chapter 5), which is why we shifted our focus to the more applied research questions of benchmarking and evaluation (Part II).

Finally, the business context has also evolved over time. Originally, IDP was practiced by legacy OCR companies; specialized vendors, offering a range of solutions for specific document types (e.g., invoices, contracts, tax forms, etc.); or cloud service providers, offering IDP as part of a larger suite of services (e.g., AWS Textract, Azure Form Recognizer, etc.). However, the rise of both open-source LLM development and powerful, though closed-source models has lowered the barrier to entry for any new entrants or incumbents. This has led to a commoditization of IDP, with the quality of the LLMs and the ease of integration with existing business processes becoming key differentiators.
|
assets/txts/pg_0038.txt
ADDED
@@ -0,0 +1,45 @@
|
1.2 Problem Statement and Questions

The general introduction sketches the context of the research, and motivates the research questions. In this Section, I will formulate the problem statement and research questions more formally and how they relate to the manuscript’s contents.

1.2.1 Reliable and Robust Deep Learning

The dissertation opens with the more fundamental challenge of targeting reliability and robustness in Deep Learning, which covers fairly abstract concepts that have been used interchangeably and inconsistently in the literature. They will be defined more extensively in Section 2.2, but for now, consider reliability as the ability to avoid failure, robustness as the ability to resist failure, and resilience as the ability to recover from failure [373, 438, 455]. In Chapter 3, we focus on the more concrete objective of predictive uncertainty quantification (PUQ), which shows promise for improving reliability and robustness in Deep Learning (DL) [123, 140, 173, 455]. Concretely, PUQ methods are expected to elucidate sources of uncertainty such as a model’s lack of in-domain knowledge due to either training data scarcity or model misspecification, or its ability to flag potentially noisy, shifted or unknown input data [136].
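For readers unfamiliar with how such sources are teased apart in practice, a minimal sketch of the standard entropy-based decomposition over multiple stochastic forward passes (e.g., MC Dropout or an ensemble) is given below; the array shapes, names and toy numbers are assumptions for illustration, not the implementation used in Chapter 3.

# Sketch: decomposing predictive uncertainty from M stochastic passes over one input.
import numpy as np

def uncertainty_decomposition(probs, eps=1e-12):
    """probs: array of shape (M, C) with class probabilities per member/pass."""
    mean_p = probs.mean(axis=0)
    total = -np.sum(mean_p * np.log(mean_p + eps))                      # predictive entropy
    aleatoric = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))   # expected entropy (data noise)
    epistemic = total - aleatoric                                       # mutual information (model uncertainty)
    return total, aleatoric, epistemic

# Example: three passes that slightly disagree -> non-zero epistemic term.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.8, 0.1, 0.1]])
print(uncertainty_decomposition(probs))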
We observed that the majority of prior PUQ research focused on regression and CV tasks, while the applicability of PUQ methods had not been thoroughly explored in the context of NLP. As mentioned earlier, most DU pipelines (in 2020) were text-centric with a high dependency on the quality of OCR. Since OCR is often considered a solved problem [262], we hypothesized that the main source of error and uncertainty in DU would reside in the text representations learned by deep neural networks (DNNs). This is why we focused on the more fundamental question of how well do PUQ methods scale in NLP? More specifically, we restricted the scope to the prototypical, well-studied task of text classification, for which we could leverage existing multi-domain datasets varying in complexity, size and label space (multi-class vs. multi-label).

This leads to the following research questions:

RQ 1. When tested in realistic language data distributions on various text classification tasks, how well do PUQ methods fare in NLP?
|
assets/txts/pg_0039.txt
ADDED
@@ -0,0 +1,41 @@
|
RQ 2. In which settings are PUQ methods most useful, i.e., which failure sources / distribution shifts are they most sensitive to?

RQ 3. How can we obtain better PUQ estimates without overrelying on computationally prohibitive methods, e.g., Deep Ensemble [238]?

RQ 4. How important are certain prior, neural architecture or hyperparameter influences on the quality of PUQ estimation?

In a later chapter (Chapter 5), we introduce a complex benchmark for generic DU that additionally tests for robustness to domain, visual and layout shifts, and explores the novel problem of hallucination and control in natural language generation (NLG) with LLMs from the perspective of calibrated and selective DocVQA. The general task formulation involves a natural language question (on content, aspect, form, visual/layout), an input document, and a set of reference answers. The model is expected to provide a natural language answer, an answer confidence and a (binary) abstention decision. Evaluation is done in terms of answer correctness, calibration and selective prediction. On the one hand, one expects a model to lower confidence when unsure about the correctness of a predicted answer. On the other hand, one expects a model to abstain from answering and refrain from hallucinations on unanswerable questions (which had been explicitly added in the dataset).
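For illustration only (the field names and the abstention rule below are my own shorthand, not the benchmark's submission schema), the expected per-question output can be thought of as a small record:

# Sketch of the per-question output expected in selective DocVQA.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DocVQAPrediction:
    question_id: str
    answer: Optional[str]   # None when the model abstains
    confidence: float       # in [0, 1]; scored for calibration and selective prediction
    abstain: bool           # True for questions the model deems unanswerable

def make_prediction(question_id: str, answer: str, confidence: float,
                    abstain_threshold: float = 0.5) -> DocVQAPrediction:
    # Abstain when confidence is too low, instead of risking a hallucinated answer.
    if confidence < abstain_threshold:
        return DocVQAPrediction(question_id, None, confidence, True)
    return DocVQAPrediction(question_id, answer, confidence, False)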
RQ 5. How severe is the problem of hallucination and control in LLMs when evaluated in a selective, free-form DocVQA task setting?

1.2.2 Realistic and Efficient Document Understanding

The second part of the dissertation focuses on the more applied research questions of realistic and efficient DU. The overall objective is to make DU technology more generically applicable (Chapter 5), evaluation more in sync with real-world requirements (Chapters 4 and 5), and more efficient at modeling the multimodal and compositional nature of documents (Chapters 5 and 6).

Due to the proximity to business applications and the risks of leaking personal information, DU research benchmarks have diverged substantially from the real-world distributions of document data. For instance, DU datasets are often limited to single-page document images, are from outdated sources (e.g., IIT-
|
assets/txts/pg_0040.txt
ADDED
@@ -0,0 +1,35 @@
|
CDIP [252]), or are restricted to a single domain or a small set of document types.

We posit that larger, fundamental questions in DU remain unanswered due to a lack of sufficiently complex datasets and benchmarks with a rich methodology covering evaluation beyond the independent and identically distributed (i.i.d.) test set setting. While there exist performant models for DU subtasks such as OCR, DC, KIE, etc., it is unclear how to move from these specific analysis and recognition tasks to models that can reason and understand documents. A truly end-to-end DU solution must handle the complexity and variety of real-world documents and subtasks, which could be expressed as natural language questions. Moreover, it should be able to generalize to any question on any document and reason over multiple pages and modalities.

The following research questions are addressed in Chapters 4 and 5:

RQ 6. How can we iteratively close the gap between research and practice in DU?

RQ 7. How can we design a resource that comprehensively challenges the state-of-the-art?

RQ 8. Which DU aspects are most challenging for current state-of-the-art LLMs? How can these be incorporated in a benchmark to allow proper measurements of future improvements?

However, moving the goalpost beyond a single-page context inevitably requires us to reconsider the research challenge of efficiency in DU. The rise of LLMs has enabled a new generation of DU pipelines, which are more flexible and easier to maintain than separate and specialized subtask modules, but also more computationally demanding. Importantly, most LLMs are not designed to handle the multimodality and long context windows of multipage documents, and are often unaware of the visual and layout semantics of documents.

The research questions for Chapter 6 address the efficiency challenge in DU:

RQ 9. How can we efficiently infuse LLMs with semantic layout awareness for more focused information extraction?

RQ 10. To what degree can model compression resolve the problem of efficiency in processing documents?
|
assets/txts/pg_0041.txt
ADDED
@@ -0,0 +1,27 @@
|
1.3 Outline

Figure 1.1. Overview of publications and how they relate to the chapters.

Figure 1.2. Visual overview of the research questions and how they relate to the chapters.

After the introductory Chapters 1 and 2, we continue with the publication-based chapters that form the core of the thesis, which are structured in two parts.

Part I consists of a single chapter, Chapter 3, which presents a benchmarking study of PUQ methods applied on real-world text classification datasets with 1-D convolutional neural networks and pretrained transformers. It motivates a novel PUQ method, Deep Ensemble with Concrete Dropout, combining the benefits of both methods, and showing promise for improving reliability and robustness in NLP at a lower computational cost. The chapter concludes with a discussion of the results, including targeted ablation studies, and provides recommendations for future research.

Part II consists of three chapters, Chapters 4 to 6, which all focus on the more applied research questions of realistic and efficient DU.
|
assets/txts/pg_0042.txt
ADDED
@@ -0,0 +1,31 @@
|
Chapter 4 reflects on the current state of DU research, and proposes guidelines to foster document dataset construction efforts. It introduces two novel document classification datasets, RVL-CDIP_MP and RVL-CDIP-N_MP, as extensions of the RVL-CDIP dataset [165] with multipage documents. The datasets are accompanied by a comprehensive experimental analysis, which shows promise from advancing multipage document representations and inference.

Chapter 5 introduces the multi-faceted DUDE benchmark for assessing generic DU, that was also hosted as a competition to challenge the DU community. It describes the complete methodology and design of the dataset, targeting model innovations that can handle the complexity and variety of real-world documents and subtasks, and generalize to any documents and any questions. Next to a discussion of the competition results, it also presents our own comprehensive benchmarking study of SOTA LLMs with varying the context length and what modalities are represented.

Chapter 6 investigates how to efficiently obtain more semantic document layout awareness. We explore what affects the teacher-student knowledge gap in KD-based model compression methods, and design a downstream task setup to evaluate the robustness of distilled DLA models on zero-shot layout-aware DocVQA.

Finally, Chapter 7 concludes the thesis with a summary of the main contributions (Section 7.1), and a discussion of future research directions. As a logical follow-up to Chapter 5, we propose in Section 7.2.2.1 how the DUDE dataset could be extended to become the ‘ultimate’ DU benchmark. The thesis ends with a hypothetical, informed design of how the research presented would form part of an end-to-end, fully-fledged IA-DU solution (Section 7.2.2.2).
|
assets/txts/pg_0043.txt
ADDED
@@ -0,0 +1,32 @@
|
Chapter 2

Fundamentals

This chapter provides all the background knowledge necessary to understand the contributions of this thesis. The key questions covered here are:

i. How to feed a document to an algorithm to perform arbitrary tasks on it?
ii. How to model language, vision, layout or structure?
iii. How does it learn and then operate at inference time?
iv. How does it estimate prediction uncertainty?
v. How to evaluate its performance?
vi. How to integrate it as a useful, end-to-end system in a document workflow?

Section 2.1 explains the basic setting from the perspective of statistical learning theory [472], which is a mathematical framework for analyzing how algorithms learn from data with minimal error. Section 2.2 gives a primer on reliability and robustness, particularly calibration, failure detection and relevant evaluation metrics. Section 2.3 surveys the DU field, and discusses the state of the art in DU technology. Finally, Section 2.4 covers Intelligent Automation to illustrate how solving the challenges posed in this thesis will make it possible to augment human intelligence, creativity and productivity in straight-through business processes.
|
assets/txts/pg_0044.txt
ADDED
@@ -0,0 +1,163 @@
1 |
+
12
|
2 |
+
|
3 |
+
FUNDAMENTALS
|
4 |
+
|
5 |
+
Contents
2.1 Statistical Learning - basics . . . . . . . . . . . . . . . . . . . . . . . 12
    2.1.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 14
    2.1.2 Probabilistic Evaluation . . . . . . . . . . . . . . . . . . . . . . 15
    2.1.3 Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Reliability and Robustness . . . . . . . . . . . . . . . . . . . . . . . . 18
    2.2.1 Generalization and Adaptation . . . . . . . . . . . . . . . . . . . 19
    2.2.2 Confidence Estimation . . . . . . . . . . . . . . . . . . . . . . . 20
    2.2.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 21
    2.2.4 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
    2.2.5 Predictive Uncertainty Quantification . . . . . . . . . . . . . . . 27
    2.2.6 Failure Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3 Document Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . 30
    2.3.1 Task Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . 31
    2.3.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
    2.3.3 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
    2.3.4 Challenges in Document Understanding . . . . . . . . . . . . . . . . 35
2.4 Intelligent Automation . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.1 Statistical Learning
|
143 |
+
|
144 |
+
Two popular definitions of Machine Learning (ML) are given below.
|
145 |
+
Machine Learning is the field of study that gives computers the ability
|
146 |
+
to learn without being explicitly programmed. [406]
|
147 |
+
A computer program is said to learn from experience E with respect to
|
148 |
+
some class of tasks T, and performance measure P, if its performance
|
149 |
+
at tasks in T, as measured by P, improves with experience E. [317]
|
150 |
+
Following these, different types of learning problems [472] can be discerned, of
|
151 |
+
which the most common (and the one used throughout our works) is supervised
|
152 |
+
learning. It defines experience E as a set of input-output pairs for which the
|
153 |
+
task T is to learn a mapping f from inputs X ∈ X to outputs Y ∈ Y, and the
|
154 |
+
performance measure P is the risk or expected loss (Equation (2.1)), given a
|
155 |
+
(0-1) loss function ℓ : Y × Y → R+.
|
156 |
+
R(f) = E_{(X,Y)∼P} [ℓ(Y, f(X))]
|
157 |
+
|
158 |
+
(2.1)
|
159 |
+
|
160 |
+
The mapping f (·; θ) : X → Y is typically parameterized by a set of parameters
|
161 |
+
θ (omitted whenever it is fixed) and a hypothesis class F, which is a set of
|
162 |
+
|
163 |
+
|
assets/txts/pg_0045.txt
ADDED
@@ -0,0 +1,53 @@
1 |
+
STATISTICAL LEARNING
|
2 |
+
|
3 |
+
13
|
4 |
+
|
5 |
+
possible functions. The objective is to find a function f ∈ F that minimizes the
|
6 |
+
risk, or even better, the Bayes risk
|
7 |
+
f* = inf_{f∈F} R(f),    (2.2)
|
11 |
+
|
12 |
+
which is the minimum achievable risk over all functions in F. The latter is only
|
13 |
+
realizable with infinite data or having access to the data-generating distribution
|
14 |
+
P(X , Y). In practice, Equation (2.2) is unknown, and the goal is to find a
|
15 |
+
function fˆ that minimizes the empirical risk
|
16 |
+
R̂(f) = (1/N) ∑_{i=1}^{N} ℓ(y_i, f(x_i))    (2.3)
|
23 |
+
|
24 |
+
where (xi , yi ) are N independently and identically distributed (i.i.d.) samples
|
25 |
+
drawn from an unknown distribution P on X × Y. This is known as empirical
|
26 |
+
risk minimization (ERM), which is a popular approach to supervised learning,
|
27 |
+
under which three important processes are defined.
|
28 |
+
Training or model fitting is the process of estimating the parameters θ of a
|
29 |
+
model, which is done by minimizing a suitable loss function ` over a training
|
30 |
+
set D = {(x_i, y_i)}_{i=1}^{N} of N i.i.d. samples.
|
32 |
+
Inference or prediction is the process of estimating the output of a model for
|
33 |
+
a given input, which is typically done by computing the posterior probability
|
34 |
+
P (y|x) over the output space Y. Classification output is a discrete label, while
|
35 |
+
regression output is a continuous value.
|
36 |
+
Evaluation involves measuring the quality of a model’s predictions, which is
|
37 |
+
typically done by computing a suitable evaluation metric over a test set Dtest
|
38 |
+
of i.i.d. samples, which were not used for training.
|
39 |
+
However, ERM has its caveats concerning generalization to unseen data,
|
40 |
+
requiring either additional assumptions on the hypothesis class F, which
|
41 |
+
are known as inductive biases, and/or regularization to penalize the
|
42 |
+
complexity of the function class F [445]. In neural networks (discussed in
|
43 |
+
detail Section 2.1.1), the former is controlled by the architecture of the network,
|
44 |
+
while the latter involves specifying constraints to parameters or adding a
|
45 |
+
regularization term to the loss function.
|
46 |
+
|
47 |
+
|
48 |
+
f̂ = argmin_{f∈F} R̂(f) + λΨ(θ)    (2.4)
|
52 |
+
|
53 |
+
|
assets/txts/pg_0046.txt
ADDED
@@ -0,0 +1,47 @@
1 |
+
14
|
2 |
+
|
3 |
+
FUNDAMENTALS
|
4 |
+
|
5 |
+
Equation (2.4) defines regularized empirical risk minimization (RERM),
|
6 |
+
where Ψ(θ) is a regularization term and λ is a hyperparameter that controls the
|
7 |
+
trade-off between the empirical risk (denoted with R̂) and the regularization
|
8 |
+
term.
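To make Equations (2.3) and (2.4) concrete, the short sketch below computes a regularized empirical risk for a toy probabilistic classifier. It is only an illustration of the definitions above: the cross-entropy surrogate for the 0-1 loss, the squared L2 penalty used for Ψ(θ), and all variable names are our own assumptions, not part of the thesis text.

```python
import numpy as np

def empirical_risk(y_true, y_prob, eps=1e-12):
    # R_hat(f): average loss over the training set; cross-entropy is used here
    # as a differentiable surrogate for the 0-1 loss of Equation (2.1).
    n = len(y_true)
    return -np.mean(np.log(y_prob[np.arange(n), y_true] + eps))

def regularized_empirical_risk(y_true, y_prob, theta, lam=1e-2):
    # RERM objective of Equation (2.4): empirical risk plus lambda * Psi(theta),
    # with Psi chosen as the squared L2 norm (weight decay).
    return empirical_risk(y_true, y_prob) + lam * np.sum(theta ** 2)

y_true = np.array([0, 1, 1])                                # labels of 3 toy samples
y_prob = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])     # predicted class probabilities
theta = np.array([0.5, -1.0, 2.0])                          # toy parameter vector
print(regularized_empirical_risk(y_true, y_prob, theta))
```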
|
9 |
+
All these concepts will be revisited in the context of neural networks in
|
10 |
+
Section 2.1.1, where we will also discuss the optimization process of the model
|
11 |
+
parameters θ, how inference differs in the case of probabilistic models to estimate
|
12 |
+
uncertainty (Section 2.2.5), and how regularization affects confidence estimation
|
13 |
+
and calibration (Section 2.2.4).
|
14 |
+
|
15 |
+
2.1.1
|
16 |
+
|
17 |
+
Neural Networks
|
18 |
+
|
19 |
+
An artificial neural network (NN) is a mathematical approximation inspired
|
20 |
+
by data processing in the human brain [396]. It can be represented by a
|
21 |
+
network topology of interconnected neurons that are organized in layers that
|
22 |
+
successively refine intermediately learned feature representations of the input
|
23 |
+
[448] that are useful for the task at hand, e.g., classifying an animal by means
|
24 |
+
of its size, shape and fur, or detecting the sentiment of a review by focusing on
|
25 |
+
adjectives.
|
26 |
+
A basic NN building block is a linear layer, which is a linear function of the
|
27 |
+
input parameters: f (x) = W x + b, where the bias term b is a constant vector
|
28 |
+
shifting the decision boundary away from the origin and the weight matrix
|
29 |
+
W holds most parameters that rotate the decision boundary in input space.
|
30 |
+
Activation functions (e.g., tanh, ReLU, sigmoid, softmax, GELU) are used to
|
31 |
+
introduce non-linearity in the model, which is required for learning complex
|
32 |
+
functions.
|
33 |
+
The first deep learning (DL) network (stacking multiple linear layers) dates
|
34 |
+
back to 1965 [191], yet the term ‘Deep Learning’ was coined in 1986 [398].
|
35 |
+
The first successful DL application was a demonstration of digit recognition
|
36 |
+
in 1998 [244], followed by DL for CV [90, 223] and NLP [76]. The recent
|
37 |
+
success of DL is attributed to the availability of large datasets, the increase in
|
38 |
+
computational power, the development of new algorithms and architectures,
|
39 |
+
and the commercial interest of large companies.
|
40 |
+
Consider a conventional DL architecture as a composition of parameterized
|
41 |
+
functions. Each consists of a configuration of layers (e.g., convolution, pooling,
|
42 |
+
activation function, normalization, embeddings) determining the type of input
|
43 |
+
transformation (e.g., convolutional, recurrent, attention) with (trainable)
|
44 |
+
parameters linear/non-linear w.r.t. the input x. Given the type of input,
|
45 |
+
e.g., language which is naturally discrete-sequential, or vision which presents a
|
46 |
+
|
47 |
+
|
assets/txts/pg_0047.txt
ADDED
@@ -0,0 +1,53 @@
1 |
+
STATISTICAL LEARNING
|
2 |
+
|
3 |
+
15
|
4 |
+
|
5 |
+
Sigmoid function:   σ(z) = 1 / (1 + exp(−z))
Softmax function:   softmax(z)_k = exp(z_k) / ∑_{j=1}^{K} exp(z_j)
Table 2.1. Sigmoid and softmax activation functions for binary and multi-class classification, respectively.
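As a small, hedged illustration of Table 2.1 (variable names and the numerical-stability trick are ours, not from the thesis), both activations can be written in a few lines of NumPy; the argmax at the end corresponds to the top-1 decision rule of Equation (2.5).

```python
import numpy as np

def sigmoid(z):
    # binary case: squashes a logit into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # multi-class case: maps K logits onto the probability simplex;
    # subtracting max(z) avoids overflow without changing the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)
print(probs, probs.sum())      # normalized probabilities, summing to 1
print(int(np.argmax(probs)))   # top-1 prediction
```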
|
17 |
+
|
18 |
+
ready continuous-spatial signal, different DL architectures have been established,
|
19 |
+
which will be discussed in Section 2.1.3.
|
20 |
+
A K-class classification function with an l-layer NN with d dimensional input x ∈
|
21 |
+
R^d is shorthand f_θ : R^d → R^K, with θ = {θ_j}_{j=1}^{l} assumed to be optimized, either
|
22 |
+
partially or fully, using backpropagation and a loss function. More specifically,
|
23 |
+
it presents a non-convex optimization problem, concerning multiple feasible
|
24 |
+
regions with multiple locally optimal points within each. With maximum-likelihood estimation, the goal is to find the optimal parameters
|
25 |
+
or weights that minimize the loss function, effectively interpolating the training
|
26 |
+
data. This process involves traversing the high-dimensional loss landscape.
|
27 |
+
Upon convergence of model training, the optimized parameters form a solution
|
28 |
+
in the weight-space, representing a unique mode (specific function fθ̂ ). However,
|
29 |
+
when regularization techniques such as weight decay, dropout, or early stopping
|
30 |
+
are applied, the objective shifts towards maximum-a-posteriori (MAP), to
|
31 |
+
take into account the prior probability of the parameters. The difference in
|
32 |
+
parameter estimation forms the basis for several uncertainty estimation methods,
|
33 |
+
covered in Section 2.2.5.
|
34 |
+
A prediction is a translation of a model’s output to which a standard decision
|
35 |
+
rule is applied, e.g., to obtain the top-1/k prediction (Equation (2.5)), or decode
|
36 |
+
structured output according to a function maximizing total likelihood with
|
37 |
+
optionally additional diversity criteria.
|
38 |
+
ŷ = argmax fθ̂ (x)
|
39 |
+
|
40 |
+
(2.5)
|
41 |
+
|
42 |
+
Considering standard NNs, the last layer outputs a vector of real-valued logits
|
43 |
+
z ∈ RK , which in turn are normalized to a probability distribution over K
|
44 |
+
classes using a sigmoid or softmax function (Table 2.1).
|
45 |
+
|
46 |
+
2.1.2
|
47 |
+
|
48 |
+
Probabilistic Evaluation
|
49 |
+
|
50 |
+
The majority of our works involves supervised learning with NNs, formulated
|
51 |
+
generically as a probabilistic predictor in Definition 1.
|
52 |
+
|
53 |
+
|
assets/txts/pg_0048.txt
ADDED
@@ -0,0 +1,45 @@
1 |
+
16
|
2 |
+
|
3 |
+
FUNDAMENTALS
|
4 |
+
|
5 |
+
Definition 1 (Probabilistic Predictor). A probabilistic predictor f : X → ∆^Y outputs a conditional probability distribution P(y′|x) over outputs y′ ∈ Y for an i.i.d. drawn sample (x, y).
Definition 2 (Probability Simplex). Let ∆^Y := {v ∈ R^{|Y|}_{≥0} : ‖v‖_1 = 1} be a probability simplex of size |Y| − 1 as a geometric representation of a probability space, where each vertex represents a mutually exclusive label and each point has an associated probability vector v [368].
|
14 |
+
Figure 2.1 illustrates a multi-class classifier, where Y = [K] for K=3 classes.
|
23 |
+
|
24 |
+
Figure 2.1. Scatter plot of a ternary problem (K = 3, N = 100) in the probability
|
25 |
+
simplex space. Example of overconfident misprediction (above is a Shiba Inu dog) and
|
26 |
+
correct sharp prediction (clear image of Beagle).
|
27 |
+
|
28 |
+
In practice, loss functions are proper scoring rules [330], S : ∆Y × Y → R, that
|
29 |
+
measure the quality of a probabilistic prediction P (ŷ|x) given the true label y.
|
30 |
+
The cross-entropy (CE) loss is a popular loss function for classification, while
|
31 |
+
the mean-squared error (MSE) loss is used for regression. In Section 2.2, we
|
32 |
+
will discuss the evaluation of probabilistic predictors in more detail, including
|
33 |
+
the calibration of confidence estimates and the detection of out-of-distribution
|
34 |
+
samples.
|
35 |
+
|
36 |
+
2.1.3
|
37 |
+
|
38 |
+
Architectures
|
39 |
+
|
40 |
+
Throughout the chapters of the thesis, we have primarily used the following
|
41 |
+
NN architectures: Convolutional Neural Networks (CNNs), Transformer
|
42 |
+
Networks. We will briefly introduce the building blocks of these architectures,
|
43 |
+
with a focus on how they are used in the context of document understanding.
|
44 |
+
|
45 |
+
|
assets/txts/pg_0049.txt
ADDED
@@ -0,0 +1,41 @@
1 |
+
STATISTICAL LEARNING
|
2 |
+
|
3 |
+
2.1.3.1
|
4 |
+
|
5 |
+
17
|
6 |
+
|
7 |
+
Convolutional Neural Networks
|
8 |
+
|
9 |
+
Convolutional Neural Networks (CNNs) [244] are a class of DNNs designed
|
10 |
+
primarily for visual and grid-spatial data such as images. They are inspired by
|
11 |
+
the visual cortex of animals, which contains neurons that are sensitive to small
|
12 |
+
subregions of the visual field, called a receptive field. The receptive fields of
|
13 |
+
different neurons partially overlap such that they cover the entire visual field,
|
14 |
+
growing larger in deeper layers of the visual cortex.
|
15 |
+
|
16 |
+
Figure 2.2. Sketch of a CNN architecture. The input is a 2D image, which is iteratively
|
17 |
+
convolved with a set of learned filters detecting specific input features, e.g., edges,
|
18 |
+
corners, blobs, to produce feature maps. Feature maps are then downsampled using
|
19 |
+
a pooling operation.
|
20 |
+
|
21 |
+
As illustrated in Figure 2.2, CNNs are composed of multiple convolutional layers,
|
22 |
+
which hierarchically extract features from the input, followed by pooling and
|
23 |
+
fully-connected layers to classify the input based on the downsampled features.
|
24 |
+
A filter K ∈ Rd×d is a rectangular matrix of trainable weights with width and
|
25 |
+
height d typically smaller than the input x. A convolutional layer applies filters
|
26 |
+
sliding over the input, with each filter producing a feature map:
|
27 |
+
F = K ∗ x,
|
28 |
+
|
29 |
+
(2.6)
|
30 |
+
|
31 |
+
where the convolution operation ∗ computes a dot product between filter entries
|
32 |
+
and the covered portions of the input.
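A minimal sketch of Equation (2.6) is given below; it is a naive 'valid' convolution with stride 1 and no padding, and the edge-detecting filter is purely illustrative (deep-learning libraries actually implement the cross-correlation variant, without flipping the filter).

```python
import numpy as np

def conv2d(x, k):
    # slide the d x d filter k over input x and take a dot product with each covered patch
    H, W = x.shape
    d = k.shape[0]
    out = np.zeros((H - d + 1, W - d + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + d, j:j + d] * k)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)   # crude vertical-edge detector
print(conv2d(image, edge_filter))                 # 3x3 feature map
```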
|
33 |
+
Thanks to the weight sharing property of the convolution operation, CNNs are
|
34 |
+
able to learn translation invariance, i.e., the ability to recognize an object
|
35 |
+
regardless of its position in the image. This is particularly useful for object
|
36 |
+
detection, where the position of the object in the image is unknown.
|
37 |
+
This architecture was used for document image classification and document
|
38 |
+
layout analysis (Section 6.3.2). A special version is 1-D CNNs, which we applied
|
39 |
+
to one-hot encoded text data in text classification benchmarking (Section 3.4.3).
|
40 |
+
|
41 |
+
|
assets/txts/pg_0050.txt
ADDED
@@ -0,0 +1,46 @@
1 |
+
18
|
2 |
+
|
3 |
+
2.1.3.2
|
4 |
+
|
5 |
+
FUNDAMENTALS
|
6 |
+
|
7 |
+
Language Neural Networks
|
8 |
+
|
9 |
+
The first step to represent language input into a format compatible with NNs is
|
10 |
+
to convert units of language (words, characters, or “tokens”, depending on
|
11 |
+
the tokenizer) into numerical vectors. This is done by means of embeddings,
|
12 |
+
which are typically learned as part of the training process, and are used to
|
13 |
+
represent the meaning of words in a continuous vector space. There have been
|
14 |
+
multiple generations of word embeddings, starting with one-hot vectors that
|
15 |
+
represent each word by a vector of zeros with a single one at its vocabulary index,
|
16 |
+
which depends highly on the tokenizer used and does not capture semantic
|
17 |
+
relationships between words. Alternatives are frequency-based embeddings,
|
18 |
+
such as TF-IDF vectors, which represent each word by its frequency in the
|
19 |
+
corpus, weighted by its inverse frequency in the corpus, capturing some lexical
|
20 |
+
semantics, but not the context in which the word appears. The next generation
|
21 |
+
are Word2Vec embeddings that are trained to predict the context of a word, i.e.,
|
22 |
+
the words that appear before and after it in a sentence. FastText embeddings
|
23 |
+
improve this by considering a character n-gram context, i.e., a sequence of n
|
24 |
+
characters. The current generation are contextual word embeddings, which
|
25 |
+
take the surrounding context into account and learn the sense of a word
|
26 |
+
from that context, e.g., ‘bank’ as
|
27 |
+
a river bank vs. a financial institution in ‘Feliz sits at the bank of the river
|
28 |
+
Nete’. Another important innovation is subword tokenization to deal with
|
29 |
+
the out-of-vocabulary (OOV) problem, which is particularly relevant for
|
30 |
+
morphologically rich languages, such as Dutch, where word meaning can be
|
31 |
+
inferred from its subwords. A clever extension is byte pair encoding (BPE)
|
32 |
+
[412], which is a data compression algorithm that iteratively replaces the most
|
33 |
+
frequent pair of bytes in a sequence with a single, unused byte, until a predefined
|
34 |
+
vocabulary size is reached. This is particularly useful for multilingual models,
|
35 |
+
where the vocabulary size would otherwise be too large to fit in memory.
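The sketch below illustrates the merge-learning loop behind BPE on a toy corpus; it is a simplified, assumed implementation (symbol handling and variable names are ours) rather than the reference algorithm of [412].

```python
from collections import Counter

def merge_word(symbols, pair):
    # replace every adjacent occurrence of `pair` by the merged symbol
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(corpus, num_merges=5):
    # corpus: {tuple_of_symbols: frequency}, initially one character per symbol
    vocab, merges = dict(corpus), []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        vocab = {tuple(merge_word(list(w), best)): f for w, f in vocab.items()}
    return merges

corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
print(learn_bpe(corpus, num_merges=5))            # learned merge pairs, most frequent first
```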
|
36 |
+
The first embedding layer is typically a lookup table, which maps each word
|
37 |
+
to a unique index in a vocabulary, and each index to a vector of real numbers.
|
38 |
+
The embedding layer is typically followed by a recurrent, convolutional or
|
39 |
+
attention layer, which is used to capture the sequential nature of language.
|
40 |
+
Recurrent Neural Networks (RNNs) and recurrent architectures extended
|
41 |
+
to model long-range dependencies such as Long Short-Term Memory (LSTM)
|
42 |
+
and Gated Recurrent Unit (GRU) networks were the dominant architectures
|
43 |
+
for sequence modeling in NLP, yet they have been superseded by Transformers
|
44 |
+
in recent years.
|
45 |
+
|
46 |
+
|
assets/txts/pg_0051.txt
ADDED
@@ -0,0 +1,58 @@
1 |
+
STATISTICAL LEARNING
|
2 |
+
|
3 |
+
2.1.3.3
|
4 |
+
|
5 |
+
19
|
6 |
+
|
7 |
+
Transformer Network
|
8 |
+
|
9 |
+
A Transformer [473] is a sequence-to-sequence model that uses an attention
|
10 |
+
mechanism to capture long-range dependencies in the input sequence, benefiting
|
11 |
+
from increased parallelization. Traditionally, it consists of an encoder and a
|
12 |
+
decoder, each composed of multiple layers of self-attention and feed-forward
|
13 |
+
layers.
|
14 |
+
Attention is a mechanism that allows for soft selection of relevant information
|
15 |
+
from a set of candidates, e.g., tokens in a document, based on a query, e.g.,
|
16 |
+
a token in the document. The scaled dot-product attention is defined for a sequence of length n as follows: Att(Q, K, V) = ∑_{i=1}^{n} α_i V_i. It utilizes three learnable weight matrices, each multiplied with all token embeddings in a sequence to build queries Q ∈ R^{n×d_q}, keys K ∈ R^{n×d_q}, and values V ∈ R^{n×d_v}. The output of the attention mechanism is a weighted sum of the values, where each attention weight of the i-th key is computed by normalizing the dot product between the query and key vectors: α_i = exp(Q_i^T K_i) / ∑_{j=1}^{n} exp(Q_j^T K_j). For
34 |
+
training stability, the dot product is typically scaled by the square root of the
|
35 |
+
dimensionality of the query and key vectors. This is followed by a feed-forward
|
36 |
+
layer to capture non-linear relationships between the tokens in the sequence.
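A compact NumPy sketch of the scaled dot-product attention described above is given below (an illustration under our own naming, with an optional boolean mask to mimic the masked attention used in decoder layers).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # scores: similarity of every query with every key, scaled by sqrt(d_q) for training stability
    d_q = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_q)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)                 # blocked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                        # weighted sum of the values

n, d = 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))                                   # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)                                              # (4, 8): one updated vector per token
```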
|
37 |
+
There exist different forms of attention, depending on the type of relationship
|
38 |
+
that is captured. Self-attention computes the attention of each token w.r.t.
|
39 |
+
all other tokens in the sequence, which changes the representation of each token
|
40 |
+
based on the other tokens in the sequence. Multi-head attention is a set
|
41 |
+
of h attention layers, which every Transformer uses to concurrently capture
|
42 |
+
different types of relationships, concatenated together after the parallelized
|
43 |
+
processing. Cross-attention computes the attention of each token in one
|
44 |
+
sequence w.r.t. all tokens in another sequence, which is used in encoder-decoder
|
45 |
+
Transformer architectures for e.g., summarization and machine translation.
|
46 |
+
Specific to decoder layers, masked attention is used to prevent the decoder
|
47 |
+
from attending to future tokens in the sequence by masking the upper triangle
|
48 |
+
of the attention matrix calculation.
|
49 |
+
A major downside to Transformers is the quadratic complexity of the attention
|
50 |
+
mechanism (Figure 2.3), which makes them computationally inefficient for long
|
51 |
+
sequences. This has been addressed by a wealth of techniques [120], such as
|
52 |
+
sparsifying attention, targeting recurrence, downsampling, random or low-rank
|
53 |
+
approximations.
|
54 |
+
Position Embeddings are indispensable for Transformers to be able to process
|
55 |
+
sequences, as they do not have any notion of order or position of tokens in
|
56 |
+
a sequence. The most common type of position embedding is a sinusoidal
|
57 |
+
|
58 |
+
|
assets/txts/pg_0052.txt
ADDED
@@ -0,0 +1,38 @@
1 |
+
20
|
2 |
+
|
3 |
+
FUNDAMENTALS
|
4 |
+
|
5 |
+
|
6 |
+
|
7 |
+
Figure 2.3. Illustration of the main attention mechanisms in a Transformer.
|
8 |
+
|
9 |
+
embedding with a fixed frequency and phase, f (x) = sin(ωx + φ), where ω is the
|
10 |
+
frequency and φ is the phase which are learned as part of the training process,
|
11 |
+
and they are typically shared across all tokens in the sequence. Integrating
|
12 |
+
position information into Transformers can be achieved in different ways, for
|
13 |
+
which [105, Table 1] gives an overview.
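For illustration, the fixed sinusoidal variant popularized by the original Transformer can be generated as below (a sketch under our own assumptions; dim is taken to be even and the 10000 base follows common practice).

```python
import numpy as np

def sinusoidal_position_embeddings(seq_len, dim):
    # even dimensions use sin, odd dimensions use cos, with geometrically spaced frequencies
    positions = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))     # (dim/2,)
    angles = positions * freqs                                # (seq_len, dim/2)
    emb = np.zeros((seq_len, dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

print(sinusoidal_position_embeddings(seq_len=6, dim=8).round(2))
```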
|
14 |
+
Transformers have gradually taken over as an end-to-end architecture for both
|
15 |
+
NLP and CV tasks, albeit adoption in CV has been slower, due to the lack
|
16 |
+
of spatial invariance in the original Transformer architecture. This has been
|
17 |
+
addressed by recent works, such as Vision Transformer (ViT) [101], which uses
|
18 |
+
a patch-based input representation with position embeddings.
|
19 |
+
A large language model (LLM) consists of a stack of Transformers that is
|
20 |
+
pretrained on a large corpus of text, typically using a self-supervised learning
|
21 |
+
objective, such as predicting the next token in a sequence. The goal of LLMs
|
22 |
+
is to learn a general-purpose language representation that can be fine-tuned
|
23 |
+
to perform well on a wide range of downstream tasks. LLMs have disrupted
|
24 |
+
NLP in recent years, as they have achieved SOTA performance on a wide
|
25 |
+
range of tasks thanks to pretraining on large amounts of data. The most
|
26 |
+
popular LLMs are BERT [95], RoBERTa [287], ELECTRA [73], T5 [383],
|
27 |
+
GPT-3 [52], Llama-2 [452], and Mistral [199]. Next to challenges specific to
|
28 |
+
modeling document inputs, explained in Section 2.3.4, open challenges for
|
29 |
+
LLMs include: (i) structured output generation, (ii) domain-specific knowledge
|
30 |
+
injection (e.g., does retrieval-augmented generation (RAG) suffice? [253, 347]),
|
31 |
+
(iii) multimodality.
|
32 |
+
Vision-language models (VLM) are a recent development in multimodal
|
33 |
+
learning, which combine the power of LLMs with vision encoders to perform
|
34 |
+
tasks that require understanding both visual and textual information. The most
|
35 |
+
popular VLMs are CLIP [381], UNITER [70], FLAVA [423] and GPT-4 [344].
|
36 |
+
In every chapter of this dissertation we have used Transformers, either as part
|
37 |
+
|
38 |
+
|
assets/txts/pg_0053.txt
ADDED
@@ -0,0 +1,46 @@
1 |
+
RELIABILITY AND ROBUSTNESS
|
2 |
+
|
3 |
+
21
|
4 |
+
|
5 |
+
of a foundation model for DU tasks (Chapters 4 to 6) or to contrast with 1-D
|
6 |
+
CNNs in text classification (Chapter 3). Note that [265] share our concerns that
|
7 |
+
NLP needs a new ‘playground’ with more realistic tasks and benchmarks, which
|
8 |
+
extend beyond sentence-level contexts to more complex document-level tasks.
|
9 |
+
Alternative sub-quadratic architectures have started addressing Transformer’s
|
10 |
+
computational inefficiency on long sequences, e.g., Mamba [152] and Longnet
|
11 |
+
[99]. Time will tell if these will be able to compete with the Transformer’s
|
12 |
+
dominance in foundation models.
|
13 |
+
|
14 |
+
2.2
|
15 |
+
|
16 |
+
Reliability and Robustness
|
17 |
+
|
18 |
+
Chapter 3 contains a lot of relevant content on the basic relation between
|
19 |
+
uncertainty quantification, calibration, and distributional generalization or
|
20 |
+
detection tasks. Here, we will focus on the more general concepts of reliability
|
21 |
+
and robustness, and how they relate to concepts used throughout the rest of
|
22 |
+
the thesis. Next, we discuss the need for confidence estimation and appropriate
|
23 |
+
evaluation metrics, followed by short summaries of the main research trends in
|
24 |
+
calibration and uncertainty quantification.
|
25 |
+
Emerging guidance and regulations [2, 3, 475] place increasing importance on
|
26 |
+
the reliability and robustness of ML systems, particularly once they are used
|
27 |
+
in the public sphere or in safety-critical applications. In ML, reliability and
|
28 |
+
robustness are often used interchangeably [78, 420, 455], yet they are distinct
|
29 |
+
concepts, and it is important to understand the difference between them. This
|
30 |
+
thesis uses the following definitions of reliability and robustness, adapted from
|
31 |
+
systems engineering literature [395]:
|
32 |
+
Definition 3 [Reliability]. Reliability is the ability of a system to consistently
|
33 |
+
perform its intended function in a specific, known environment for a specific
|
34 |
+
period of time, with a specific level of expected accuracy [395]. Closer to the ML
|
35 |
+
context, this entails all evaluation under the i.i.d. assumption, allowing for some
|
36 |
+
benign shifts of the distribution, including predictive performance evaluation
|
37 |
+
with task-dependent metrics (accuracy, F1, perplexity, etc.), calibration, selective
|
38 |
+
prediction, uncertainty estimation, etc.
|
39 |
+
Reliability requires to clearly specify the role an ML component plays in a
|
40 |
+
larger system, and to define the expected behavior of the system as a function
|
41 |
+
of alignment with the training data distribution. This is particularly important
|
42 |
+
in the context of black-box models, where the inner workings of the model are
|
43 |
+
not transparent to the user. In this case, the user needs to be aware of the
|
44 |
+
model’s limitations, e.g., model misspecification, lack of training data, and the
|
45 |
+
|
46 |
+
|
assets/txts/pg_0054.txt
ADDED
@@ -0,0 +1,45 @@
1 |
+
22
|
2 |
+
|
3 |
+
FUNDAMENTALS
|
4 |
+
|
5 |
+
model needs to be able to communicate its own uncertainty to the user. This is
|
6 |
+
the focus of Chapter 3.
|
7 |
+
Definition 4 [Robustness]. Robustness is the ability of a system to maintain
|
8 |
+
its intended function despite a wide range of disturbances, with a minimal
|
9 |
+
degradation of performance [395]. Such disturbances can take the form of
|
10 |
+
adversarial attacks, distributional shifts, or other types of noise. In the ML
|
11 |
+
context, this entails all evaluation violating the i.i.d. assumption, including
|
12 |
+
adversarial and label noise robustness, out-of-distribution detection, domain
|
13 |
+
generalization, extrapolation, etc.
|
14 |
+
Robustness is more involved with the application scope in which a model can
|
15 |
+
perform well, assuming that the model can maintain some degree of its prediction
|
16 |
+
capacity on non-i.i.d. data which might be unknown at training time. Detecting
|
17 |
+
when the model is operating outside of its intended scope is an important part
|
18 |
+
of robustness to prevent failure propagation to downstream systems.
|
19 |
+
Resilience is another component of the R3 : reliability, robustness, resilience
|
20 |
+
concept in systems engineering, yet it is not a focus of this thesis, nor is it
|
21 |
+
a relevant qualifier of the ML model in isolation, as it is more related to the
|
22 |
+
system as a whole. Resilient systems are able to recover from disturbances, even
|
23 |
+
those caused by model misspecification, e.g., by adapting to new environments
|
24 |
+
and unexpected inputs from unknown distributions or by self-healing.
|
25 |
+
|
26 |
+
2.2.1
|
27 |
+
|
28 |
+
Generalization and Adaptation
|
29 |
+
|
30 |
+
To complete the R3 picture, we cannot overlook the generalization-adaptation spectrum, which has been less explored in our works, yet it is an
|
31 |
+
important part of current practices in ML.
|
32 |
+
Definition 5 [Generalization-adaptation]. Generalization is the ability of
|
33 |
+
a system to perform its intended function in a wide range of environments,
|
34 |
+
including those not known at design time [395]. Each environment is defined by
|
35 |
+
a data distribution over a domain and a task, and generalization is the ability
|
36 |
+
of a model to perform well on new data drawn from the same distribution.
|
37 |
+
Adaptation is the ability of a system to perform its intended function in a specific,
|
38 |
+
known environment, despite changes in the system itself or its environment
|
39 |
+
[395]. This entails the ability of a model to perform well on new data drawn
|
40 |
+
from a different distribution, which is known at design time.
|
41 |
+
Different settings of generalization-adaptation are: in-distribution (same
|
42 |
+
domain and task), domain generalization (same task, different domain), task
|
43 |
+
generalization (same domain, different task), out-of-distribution (different
|
44 |
+
|
45 |
+
|
assets/txts/pg_0055.txt
ADDED
@@ -0,0 +1,45 @@
1 |
+
RELIABILITY AND ROBUSTNESS
|
2 |
+
|
3 |
+
23
|
4 |
+
|
5 |
+
domain or task). If the model has access to only a few samples for training
|
6 |
+
on the new distribution, this is referred to as few-shot learning (or zero-shot
|
7 |
+
learning when no samples are available at all); if it is able to adapt to new distributions over time, or
|
8 |
+
accumulate knowledge over different tasks without retraining from scratch [87],
|
9 |
+
it is referred to as continual learning or incremental learning.
|
10 |
+
Many of these settings are referred to in business as out-of-the-box, self-learning,
|
11 |
+
yet without any formal definitions given. Domain and task generalization are
|
12 |
+
major selling points of pretrained LLMs, which are able to perform well on a
|
13 |
+
wide range of tasks and domains. In the case of very different distributions, e.g.,
|
14 |
+
a different task/expected output or an additional domain/input modality, it is
|
15 |
+
often necessary to fine-tune the model on a small amount of data from the new
|
16 |
+
distribution, which is known as transfer learning. Specific to LLMs, instruction
|
17 |
+
tuning is a form of transfer learning, where samples from a new distribution are
|
18 |
+
appended with natural language instructions [69, 532]. This approach has been
|
19 |
+
used in Chapter 5 to adapt pretrained LLMs to the task of DocVQA, in an
|
20 |
+
effort to reduce the amount of annotated data required to generalize to unseen
|
21 |
+
domains and questions.
|
22 |
+
|
23 |
+
2.2.2
|
24 |
+
|
25 |
+
Confidence Estimation
|
26 |
+
|
27 |
+
A quintessential component of reliability and robustness requires a model to
|
28 |
+
estimate its own uncertainty, or inversely to translate model outputs into
|
29 |
+
probabilities or ‘confidence’ (Definition 6).
|
30 |
+
Definition 6 [Confidence Scoring Function]. Any function g : X → R
|
31 |
+
whose continuous output aims to separate a model’s failures from correct
|
32 |
+
predictions can be interpreted as a confidence scoring function (CSF) [193].
|
33 |
+
Note that while it is preferable to have the output domain of g ∈ [0, 1] for easier
|
34 |
+
thresholding, this is not a strict requirement.
|
35 |
+
Circling back on the question of why one needs a CSF, there are multiple reasons:
|
36 |
+
i) ML models are continually improving, yet zero test error is an illusion: even a
|
37 |
+
toy dataset (MNIST) is not perfectly separable; ii) once a model is deployed,
|
38 |
+
performance deterioration is expected due to i.i.d. assumptions breaking; iii)
|
39 |
+
generative models are prone to hallucinations [198], requiring some control
|
40 |
+
mechanisms and guardrails to guide them.
|
41 |
+
Below, we present some common CSFs used in practice [114, 172, 194, 539],
|
42 |
+
where for convenience the subscript is reused to denote the k-th element of the
|
43 |
+
output vector g(x) = gk (x).
|
44 |
+
|
45 |
+
|
assets/txts/pg_0056.txt
ADDED
@@ -0,0 +1,39 @@
1 |
+
24
|
2 |
+
|
3 |
+
FUNDAMENTALS
|
4 |
+
|
5 |
+
I. Maximum softmax probability (MSP): g(x) = max_{y′∈Y} f_{y′}(x)
II. Maximum logit: g(x) = max_{y′∈Y} z_{y′}(x), with logits z ∈ R^K
III. Negative entropy: g(x) = − ∑_{y′∈Y} f_{y′}(x) log f_{y′}(x)
IV. Margin: g(x) = max_{y′∈Y} f_{y′}(x) − max_{y″∈Y∖{y′}} f_{y″}(x) (CSFs I-IV are sketched in code below)
V. Distance-based measures
   • kNN distance: a 1-D outlier score derived from the average distance of the feature representation of x to its k nearest neighbors in the training distribution
   • Mahalanobis distance [390]: the minimum distance of the feature map (e.g., penultimate layer activations) of a test input to class-conditional Gaussian distributions of the training data
VI. Bayesian uncertainty estimation
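A hedged sketch of CSFs I-IV, computed from the logits of a single prediction (variable names and the sign convention for the entropy-based score are our own choices):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def csf_scores(logits):
    # confidence scoring functions I-IV; for every score, higher means more confident
    probs = softmax(logits)
    top2 = np.sort(probs)[::-1][:2]
    return {
        "msp": probs.max(),                                    # I. maximum softmax probability
        "max_logit": logits.max(),                             # II. maximum logit
        "neg_entropy": np.sum(probs * np.log(probs + 1e-12)),  # III. entropy-based score (negated entropy)
        "margin": top2[0] - top2[1],                           # IV. margin between the two best classes
    }

print(csf_scores(np.array([3.2, 0.1, -1.5])))   # confident prediction -> high scores
print(csf_scores(np.array([0.2, 0.1, 0.0])))    # ambiguous prediction -> low scores
```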
|
17 |
+
Chapter 3 used MSP and negative entropy as CSFs, next to various PUQ
|
18 |
+
methods for Bayesian uncertainty estimation. Other chapters used MSP as it
|
19 |
+
is the most common CSF in practice, requiring only logits as input. From the
|
20 |
+
use of CSFs also follows the need to evaluate their statistical quality next to
|
21 |
+
task-specific predictive performance metrics, which is discussed next.
|
22 |
+
|
23 |
+
2.2.3
|
24 |
+
|
25 |
+
Evaluation Metrics
|
26 |
+
|
27 |
+
In an ideal world, the evaluation metric of interest would be the same as the loss
|
28 |
+
function used for training, yet this is rarely the case in practice, as the gradientbased optimization process requires a continuously differentiable function, while
|
29 |
+
the metric of interest is often non-differentiable, e.g., accuracy vs. cross-entropy
|
30 |
+
in classification.
|
31 |
+
Throughout our works, we have used (or extended) multiple predictive
|
32 |
+
performance, calibration, and robustness metrics, of which the most interesting
|
33 |
+
are respectively outlined.
|
34 |
+
Average Normalized Levenshtein Similarity (ANLS) is a metric introduced in [39] for the evaluation of VQA, which was then extended [449] to
|
35 |
+
support lists and be invariant to the order of provided answers. We adapted the
|
36 |
+
underlying Levenshtein Distance (LD) metric [251] to support not-answerable
|
37 |
+
questions, NA(G) = I[type(G) = not-answerable ] (see Equation (2.7)).
|
38 |
+
|
39 |
+
|
assets/txts/pg_0057.txt
ADDED
@@ -0,0 +1,98 @@
1 |
+
RELIABILITY AND ROBUSTNESS
|
2 |
+
|
3 |
+
25
|
4 |
+
|
5 |
+
Consider for simplicity, the evaluation of a single non-list ground truth answer
|
6 |
+
G and prediction P̂ , each with string lengths |G| and |P̂ |, respectively.
|
7 |
+
|
8 |
+
LD(G, P̂) =
    1                                        if NA(G) ∧ |P̂| > 0,
    0                                        if NA(G) ∧ |P̂| = 0,
    |G|                                      if |P̂| = 0,
    LD(tail(G), tail(P̂))                     if G[0] = P̂[0],
    1 + min{ LD(tail(G), P̂)        (deletion),
             LD(G, tail(P̂))        (insertion),
             LD(tail(G), tail(P̂))  (substitution) }    if G[0] ≠ P̂[0]
                                                              (2.7)
|
41 |
+
Each of the conditions is tested in turn, and the first one that is true is executed.
|
42 |
+
The normalized similarity metric is then defined as
|
43 |
+
NLS(G, P̂) = 1 − LD(G, P̂) / max(1, |G|, |P̂|).
|
49 |
+
|
50 |
+
Given multiple ground truth answer variants G = {a1 , a2 , ...} and a predicted
|
51 |
+
answer P̂_{Q_i} for each question Q_i in the test set of size N, we define the
complete metric as follows:

ANLS = (1/N) ∑_{i=1}^{N} max_{a ∈ G_i} s(a, P̂_{Q_i})    (2.8)

s(a, P̂_{Q_i}) = NLS(a, P̂_{Q_i})  if NLS(a, P̂_{Q_i}) ≥ τ,  and  s(a, P̂_{Q_i}) = 0  otherwise,    (2.9)
|
85 |
+
|
86 |
+
where we follow prior literature [39, 449] in setting the threshold τ = 0.5.
|
87 |
+
In the case of a list-type question, Hungarian matching is performed following
|
88 |
+
[449] according to NLS between each ground truth answer part and each
|
89 |
+
prediction answer part.
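For reference, a compact sketch of NLS/ANLS as used above is shown below; it covers the standard edit distance and the τ = 0.5 threshold, but omits the not-answerable extension of Equation (2.7) and the Hungarian matching for list answers (all names are illustrative).

```python
def levenshtein(g, p):
    # classic dynamic-programming edit distance between ground truth g and prediction p
    dp = list(range(len(p) + 1))
    for i, gc in enumerate(g, 1):
        prev, dp[0] = dp[0], i
        for j, pc in enumerate(p, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (gc != pc))
    return dp[len(p)]

def nls(g, p):
    # normalized Levenshtein similarity
    return 1.0 - levenshtein(g, p) / max(1, len(g), len(p))

def anls(ground_truths, predictions, tau=0.5):
    # per question: best NLS over all accepted answer variants, zeroed below tau, then averaged
    scores = []
    for variants, pred in zip(ground_truths, predictions):
        best = max(nls(a, pred) for a in variants)
        scores.append(best if best >= tau else 0.0)
    return sum(scores) / len(scores)

gts = [["approximately 30", "30"], ["Brussels"]]
preds = ["30", "brussel"]
print(anls(gts, preds))
```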
|
90 |
+
Proper scoring rules [330] are used for generic evaluation of predictive
|
91 |
+
performance, which calculate scoring at the instance-level while measuring both
|
92 |
+
the quality of the predictive function and predicted probability distribution (as
|
93 |
+
they are not compatible with an arbitrary CSF):
|
94 |
+
• Negative Log Likelihood (NLL) [378] is both a popular loss function
|
95 |
+
(cross-entropy) and scoring rule which only penalizes (wrong) log
|
96 |
+
probabilities qi given to the true class, with I an indicator function defining
|
97 |
+
|
98 |
+
|
assets/txts/pg_0058.txt
ADDED
@@ -0,0 +1,62 @@
1 |
+
26
|
2 |
+
|
3 |
+
FUNDAMENTALS
|
4 |
+
|
5 |
+
the true class. This measure more heavily penalizes sharp probabilities,
|
6 |
+
which are close to the wrong edge or class by over/under-confidence.
|
7 |
+
ℓ_NLL(f) = − (1/N) ∑_{i=1}^{N} ∑_{k=1}^{K} I[y_i = k] · log(f_k(x_i))    (2.10)
|
17 |
+
|
18 |
+
• Brier Score [50] is a scoring rule that measures the accuracy of a
|
19 |
+
probabilistic classifier and is related to the mean-squared error (MSE) loss
|
20 |
+
function. Brier score is more commonly used in industrial practice since it
|
21 |
+
is an ℓ2 metric (score between 0 and 1), yet it penalizes tail probabilities
|
22 |
+
less severely than NLL.
|
23 |
+
ℓ_BS(f) = (1/N) ∑_{i=1}^{N} ∑_{k=1}^{K} ( I[y_i = k] − f_k(x_i) )²    (2.11)
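Both scoring rules are a few lines of NumPy; the sketch below (with our own variable names) evaluates Equations (2.10) and (2.11) on a toy batch.

```python
import numpy as np

def nll(probs, labels, eps=1e-12):
    # Equation (2.10): mean negative log probability assigned to the true class
    n = len(labels)
    return -np.mean(np.log(probs[np.arange(n), labels] + eps))

def brier(probs, labels):
    # Equation (2.11): mean squared error between one-hot labels and predicted probabilities
    n, k = probs.shape
    onehot = np.eye(k)[labels]
    return np.mean(np.sum((onehot - probs) ** 2, axis=1))

probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
labels = np.array([0, 1, 0])          # the last prediction is confidently wrong
print(nll(probs, labels), brier(probs, labels))
```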
|
34 |
+
|
35 |
+
All metrics following require a CSF g(x) to be defined, and can pertain to
|
36 |
+
specific evaluation settings [389] tested in Section 3.4.5.
|
37 |
+
Expected Calibration Error (ECE) [156, 332] is a default metric to evaluate
|
38 |
+
top-1 prediction miscalibration. A calibration estimator (Definition 7) measures
|
39 |
+
the Lp norm difference between a model’s posterior and the true likelihood of
|
40 |
+
being correct.
|
41 |
+
Definition 7 (Lp Calibration Error). [231, 463]
|
42 |
+
The Lp calibration error of f : X → ∆Y over the joint distribution (X × Y )
|
43 |
+
with the Lp norm p ∈ [1, ∞) is given by:
|
44 |
+
|
45 |
+
|
46 |
+
CE_p(f)^p = E_{(X,Y)} [ ‖ E[Y | f(X)] − f(X) ‖_p^p ]    (2.12)
|
48 |
+
The popular ECE metric [332] with condition I[Y = ŷ] is a special case of the
|
49 |
+
above with p = 1, where the expectation is approximated using a histogram.
|
50 |
+
MaxCE defines the worst-case risk version with p = ∞, effectively reporting on
|
51 |
+
the bin with the highest error. As part of Chapter 5, we contributed a novel
|
52 |
+
empirical estimator of top-1 calibration for the task of VQA, where the exact
|
53 |
+
accuracy condition I[Y = ŷ] in ECE is replaced by I[ANLS(y, ŷ) > τ]. Prior
|
54 |
+
work [329] used a similar strategy of thresholding continuous quality scores to
|
55 |
+
be able to estimate ECE.
|
56 |
+
In practice, ECE is implemented as a histogram binning estimator that
|
57 |
+
discretizes predicted probabilities into ranges of possible values for which
|
58 |
+
conditional expectation can be estimated. Concretely, the probability space
|
59 |
+
is partitioned into B bins bi with i ∈ {1, ..., B}, where for each bin bi the gap
|
60 |
+
between observed accuracy and bin confidence P¯b is measured, with a final
|
61 |
+
|
62 |
+
|
assets/txts/pg_0059.txt
ADDED
@@ -0,0 +1,64 @@
1 |
+
RELIABILITY AND ROBUSTNESS
|
2 |
+
|
3 |
+
27
|
4 |
+
|
5 |
+
average weighted by the number of samples per bin |bi |.
|
6 |
+
ECE = ∑_{i=1}^{B} (|b_i| / N) · | acc(b_i) − P̄_b(b_i) |    (2.13)
|
18 |
+
|
19 |
+
To minimize the drawbacks inherited from histogram binning, as suggested
|
20 |
+
by the literature [231, 342, 393, 463], we have applied an equal-mass binning
scheme with 100 bins (close to √N). While plenty of histogram-based ECE
|
23 |
+
estimator implementations exist, many design hyperparameters are not reported
|
24 |
+
or exposed:
|
25 |
+
I. The ℓ_p norm
II. The number of bins (beyond the unfounded default of |B| = 15)
III. Different binning schemes (equal-range, equal-mass)
IV. Binning range to define the operating zone
V. Proxy used as bin accuracy (lower-edge, center, upper-edge)
|
36 |
+
|
37 |
+
We upstreamed 1 a generic implementation of binning-based ECE as part of
|
38 |
+
the ICDAR 2023 DUDE competition (Chapter 5).
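A minimal equal-mass binning estimator (a simplified sketch, not the upstreamed implementation itself) looks as follows:

```python
import numpy as np

def ece_equal_mass(confidences, correct, num_bins=100):
    # Equation (2.13) with equal-mass bins: sort by confidence, split into bins of ~equal size,
    # and average the |accuracy - mean confidence| gap weighted by the bin size
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(confidences)
    n = len(confidences)
    ece = 0.0
    for b in np.array_split(order, num_bins):
        if len(b) == 0:
            continue
        gap = abs(correct[b].mean() - confidences[b].mean())
        ece += (len(b) / n) * gap
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
correct = rng.random(1000) < conf          # a roughly calibrated toy model
print(ece_equal_mass(conf, correct, num_bins=10))
```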
|
39 |
+
Alternative formulations have been developed for multi-class [342, 370, 492]
|
40 |
+
and multi-label calibration [493, 520]. Measurements of “strong” calibration,
|
41 |
+
over the full predicted vector instead of the winning class, are reported less in
|
42 |
+
practice. Possible reasons are that they render class-wise scorings, either based
|
43 |
+
on adaptive thresholds or require estimation of kernel-based calibration error
|
44 |
+
to derive hypothesis tests. While we are mindful of alternatives (revisited in
|
45 |
+
Section 2.2.4), we have found that the simpler “weak” calibration measured by
|
46 |
+
ECE meets the practical requirements for most of our benchmarking.
|
47 |
+
Area-Under-Risk-Coverage-Curve (AURC) [138, 193] measures the possible trade-offs between coverage (proportion of test set%) and risk (error %
|
48 |
+
under given coverage). The metric explicitly assesses i.i.d. failure detection
|
49 |
+
performance as desired for safe deployment. It has advantages as a primary
|
50 |
+
evaluation metric given that it is effective both when underlying prediction
|
51 |
+
models are the same or different (as opposed to AUROC or AUPR). Its most
|
52 |
+
general form (without any curve approximation), with a task-specific evaluation
|
53 |
+
metric ℓ and CSF g, is defined as:

AURC(f, g) = E_{x∼P_X} [ E_{(x̃,ỹ)∼P_XY}[ ℓ([f(x̃)], ỹ) · I[g(x̃) > g(x)] ] / E_{x̃∼P_X}[ I[g(x̃) > g(x)] ] ]    (2.14)
|
60 |
+
This captures the intuition that the CSF g should be able to rank instances by
|
61 |
+
their risk, and that the risk should be low for instances with high confidence.
|
62 |
+
1 https://huggingface.co/spaces/jordyvl/ece
|
63 |
+
|
64 |
+
|
assets/txts/pg_0060.txt
ADDED
@@ -0,0 +1,53 @@
1 |
+
28
|
2 |
+
|
3 |
+
FUNDAMENTALS
|
4 |
+
|
5 |
+
The standard curve metric can be obtained by sorting all CSF estimates and
evaluating risk ( FP / (TP + FP) ) and coverage ( (TP + FP) / (TP + FP + FN + TN) ) for each
threshold t (P if above threshold) from high to low, together with their respective correctness
(T if correct). This is normally based on exact match, yet for generative evaluation
|
12 |
+
in Section 5.3.5, we have applied ANLS thresholding instead. Formulated
|
13 |
+
this way, the best possible AURC is constrained by the model’s test error
|
14 |
+
(1-ANLS) and the number of test instances. AURC might be more sensible for
|
15 |
+
evaluating in a high-accuracy regime (e.g., 95% accuracy), where risk can be
|
16 |
+
better controlled and error tolerance is an apriori system-level decision [115].
|
17 |
+
This metric was used in every chapter of Part II.
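The empirical computation can be sketched in a few lines (our own simplified version, averaging selective risk over all coverage levels):

```python
import numpy as np

def aurc(confidences, errors):
    # sort by confidence (most confident first), then average the selective risk
    # observed at every coverage level 1/N, 2/N, ..., 1
    confidences = np.asarray(confidences)
    errors = np.asarray(errors, dtype=float)       # 1 = wrong (or ANLS below threshold), 0 = correct
    order = np.argsort(-confidences)
    cumulative_risk = np.cumsum(errors[order]) / np.arange(1, len(errors) + 1)
    return cumulative_risk.mean()

conf = np.array([0.95, 0.90, 0.80, 0.60, 0.55])
err = np.array([0, 0, 1, 0, 1])
print(aurc(conf, err))                              # lower is better; a perfect ranking minimizes AURC
```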
|
18 |
+
For the evaluation under distribution shift in Chapter 3, we have used binary
|
19 |
+
classification metrics following [172], Area Under the Receiver Operating
|
20 |
+
Characteristic Curve (AUROC) and Area Under the Precision-Recall
|
21 |
+
Curve (AUPR), which are threshold-independent measures that summarize
|
22 |
+
detection statistics of positive (out-of-distribution) versus negative (in-distribution) instances. In this setting, AUROC corresponds to the probability
|
23 |
+
that a randomly chosen out-of-distribution sample is assigned a higher confidence
|
24 |
+
score than a randomly chosen in-distribution sample. AUPR is more informative
|
25 |
+
under class imbalance.
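Both metrics are readily available in scikit-learn; the snippet below (toy data, our own OOD-score convention) treats out-of-distribution samples as the positive class.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

labels = np.array([0, 0, 0, 0, 1, 1, 1])                     # 1 = out-of-distribution (positive)
ood_score = np.array([0.1, 0.2, 0.15, 0.4, 0.8, 0.55, 0.9])  # e.g. negated MSP: higher = more OOD

print(roc_auc_score(labels, ood_score))             # AUROC
print(average_precision_score(labels, ood_score))   # AUPR (average precision)
```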
|
26 |
+
|
27 |
+
2.2.4
|
28 |
+
|
29 |
+
Calibration
|
30 |
+
|
31 |
+
The study of calibration originated in the meteorology and statistics literature,
|
32 |
+
primarily in the context of proper loss functions [330] for evaluating
|
33 |
+
probabilistic forecasts. Calibration promises i) interpretability, ii) system
|
34 |
+
integration, iii) active learning, and iv) improved accuracy. A calibrated model,
|
35 |
+
as defined in Definition 8, can be interpreted as a probabilistic model, which can
|
36 |
+
be integrated into a larger system, and can guide active learning with potentially
|
37 |
+
fewer samples. Research into calibration regained popularity after repeated
|
38 |
+
empirical observations of overconfidence in DNNs [156, 339].
|
39 |
+
Definition 8 (Perfect calibration). [86, 88, 520] Calibration is a property of
|
40 |
+
an empirical predictor f , which states that on finite-sample data it converges
|
41 |
+
to a solution where the confidence scoring function reflects the probability ρ of
|
42 |
+
being correct. Perfect calibration, CE(f ) = 0, is satisfied iff:
|
43 |
+
P(Y = Ŷ | f(X) = ρ) = ρ,    ∀ρ ∈ [0, 1]    (2.15)
|
48 |
+
|
49 |
+
Below, we characterize calibration research in two directions: (A) CSF evaluation
|
50 |
+
with both theoretical guarantees and practical estimation methodologies
|
51 |
+
• Estimators for calibration notions beyond top-1 [229, 231, 342, 463]
|
52 |
+
|
53 |
+
|
assets/txts/pg_0061.txt
ADDED
@@ -0,0 +1,42 @@
1 |
+
RELIABILITY AND ROBUSTNESS
|
2 |
+
|
3 |
+
29
|
4 |
+
|
5 |
+
• Theoretical frameworks to generalize over existing metrics and design
|
6 |
+
novel metrics [43, 231, 492, 493]
|
7 |
+
• Specialize towards a task such as multi-class classification [463], regression
|
8 |
+
[228, 428], or structured prediction [227]
|
9 |
+
• Alternative error estimation procedures, based on histogram regression
|
10 |
+
[156, 331, 332, 340, 343], kernels [230, 370, 492, 493] or splines [159]
|
11 |
+
(B) Calibration methods for improving the reliability of a model by adapting
|
12 |
+
the CSF or inducing calibration during training of f :
|
13 |
+
• Learn a post-hoc forecaster F : f (X) → [0, 1] on top of f (overview: [298])
|
14 |
+
• Modify the training procedure with regularization (overview: [277, 370])
|
15 |
+
Due to its importance in practice, we will provide more detail on train-time
|
16 |
+
calibration methods. It has been shown for a broad class of loss functions
|
17 |
+
that risk minimization leads to Fisher consistent, Bayes optimal classifiers in
|
18 |
+
the asymptotic limit [25, 495]. These can be shown to decompose into a sum
|
19 |
+
of multiple metrics including both accuracy and calibration error [144, 177].
|
20 |
+
However, there is no –finite data, nor asymptotic– guarantee that classifiers
|
21 |
+
trained with proper loss functions containing an explicit calibration term
|
22 |
+
will eventually be well-calibrated. In practice, being entangled with other
|
23 |
+
optimization terms often leads to sub-optimal calibration. For this reason,
|
24 |
+
recent studies [12, 230, 492] have derived trainable estimators of calibration
|
25 |
+
to have a better handle (γ > 0) on penalizing miscalibration, i.e., by jointly
|
26 |
+
optimizing risk (R(f ) = EX,Y [` (Y, f (X))]) and parameterized calibration error
|
27 |
+
(CE) as in Equation (2.16).
|
28 |
+
f̂ = argmin_{f∈F} ( R(f) + γ CE(f) )    (2.16)
|
32 |
+
|
33 |
+
Many of these methods are implicitly or explicitly maximizing entropy of
|
34 |
+
predictions or entropy relative to another probability distribution, e.g., Entropy
|
35 |
+
Regularization [361], Label Smoothing (LS) [327], Focal Loss [324], Margin-based LS [277], next to more direct (differentiable), kernel-based calibration
|
36 |
+
error estimation [211, 230, 370, 492, 493, 526]. We had expected community
|
37 |
+
contribution on the DUDE competition (Chapter 5) to take advantage of this
|
38 |
+
wealth of calibration methods, yet the majority of submissions used uncalibrated
|
39 |
+
models with MSP, requiring more education on the importance of calibration
|
40 |
+
in practice.
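As one concrete member of this entropy-maximizing family, label smoothing can be sketched as below (a hedged toy example with an illustrative smoothing factor, not the exact formulation of [327]):

```python
import numpy as np

def label_smoothing_ce(probs, labels, alpha=0.1):
    # cross-entropy against smoothed targets q = (1 - alpha) * onehot + alpha / K,
    # which penalizes overly sharp (low-entropy) predictions and tends to improve calibration
    n, k = probs.shape
    onehot = np.eye(k)[labels]
    targets = (1.0 - alpha) * onehot + alpha / k
    return -np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1))

probs = np.array([[0.98, 0.01, 0.01], [0.4, 0.5, 0.1]])
labels = np.array([0, 1])
print(label_smoothing_ce(probs, labels, alpha=0.0))   # plain cross-entropy
print(label_smoothing_ce(probs, labels, alpha=0.1))   # smoothed targets penalize the very sharp prediction
```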
|
41 |
+
|
42 |
+
|