First commit
This view is limited to 50 files because it contains too many changes. See the raw diff for the full changeset.
- .gitattributes +291 -0
- OCR_directory.sh +17 -0
- app.py +114 -67
- assets/txts/pg_0002.txt +1 -0
- assets/txts/pg_0003.txt +25 -0
- assets/txts/pg_0004.txt +10 -0
- assets/txts/pg_0005.txt +32 -0
- assets/txts/pg_0006.txt +45 -0
- assets/txts/pg_0007.txt +33 -0
- assets/txts/pg_0008.txt +35 -0
- assets/txts/pg_0009.txt +34 -0
- assets/txts/pg_0010.txt +44 -0
- assets/txts/pg_0013.txt +22 -0
- assets/txts/pg_0014.txt +30 -0
- assets/txts/pg_0015.txt +30 -0
- assets/txts/pg_0016.txt +12 -0
- assets/txts/pg_0017.txt +200 -0
- assets/txts/pg_0018.txt +204 -0
- assets/txts/pg_0019.txt +394 -0
- assets/txts/pg_0020.txt +320 -0
- assets/txts/pg_0021.txt +421 -0
- assets/txts/pg_0033.txt +30 -0
- assets/txts/pg_0034.txt +44 -0
- assets/txts/pg_0035.txt +46 -0
- assets/txts/pg_0036.txt +80 -0
- assets/txts/pg_0037.txt +42 -0
- assets/txts/pg_0038.txt +45 -0
- assets/txts/pg_0039.txt +41 -0
- assets/txts/pg_0040.txt +35 -0
- assets/txts/pg_0041.txt +27 -0
- assets/txts/pg_0042.txt +31 -0
- assets/txts/pg_0043.txt +32 -0
- assets/txts/pg_0044.txt +163 -0
- assets/txts/pg_0045.txt +53 -0
- assets/txts/pg_0046.txt +47 -0
- assets/txts/pg_0047.txt +53 -0
- assets/txts/pg_0048.txt +45 -0
- assets/txts/pg_0049.txt +41 -0
- assets/txts/pg_0050.txt +46 -0
- assets/txts/pg_0051.txt +58 -0
- assets/txts/pg_0052.txt +38 -0
- assets/txts/pg_0053.txt +46 -0
- assets/txts/pg_0054.txt +45 -0
- assets/txts/pg_0055.txt +45 -0
- assets/txts/pg_0056.txt +39 -0
- assets/txts/pg_0057.txt +98 -0
- assets/txts/pg_0058.txt +62 -0
- assets/txts/pg_0059.txt +64 -0
- assets/txts/pg_0060.txt +53 -0
- assets/txts/pg_0061.txt +42 -0
.gitattributes
CHANGED
@@ -33,3 +33,294 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0031.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0081.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0123.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0155.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0216.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0277.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0015.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0047.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0051.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0054.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0088.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0250.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0009.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0089.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0117.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0241.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0101.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0110.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0208.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0226.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0284.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0060.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0252.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0058.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0099.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0195.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0057.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0105.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0125.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0169.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0184.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0196.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0075.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0236.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0276.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0006.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0156.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0082.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0106.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0157.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0188.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0201.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0225.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0248.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0023.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0116.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0119.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0254.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0278.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0045.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0093.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0182.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0064.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0094.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0104.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0113.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0150.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0189.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0220.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0261.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0011.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0048.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0288.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0034.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0108.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0214.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0287.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0100.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0198.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0227.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0244.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0245.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0270.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0039.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0055.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0086.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0174.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0181.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0266.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0283.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0073.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0080.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0274.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0279.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0036.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0050.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0069.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0053.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0056.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0145.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0027.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0067.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0079.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0013.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0072.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0191.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0263.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0268.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0041.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0136.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0170.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0180.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0200.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0217.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0280.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0016.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0018.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0062.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0122.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0147.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0265.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0215.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0133.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0165.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0166.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0222.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0078.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0171.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0219.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0028.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0107.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0144.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0178.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0190.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0043.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0010.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0021.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0160.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0247.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0063.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0090.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0137.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0159.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0269.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0014.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0026.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0033.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0035.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0046.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0186.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0237.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0179.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0193.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0232.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0109.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0134.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0286.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0003.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0004.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0206.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0251.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0040.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0083.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0230.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0272.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0275.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0096.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0115.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0260.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0271.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0012.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0022.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0176.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0218.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0273.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0065.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0132.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0187.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0267.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0044.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0029.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0084.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0087.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0238.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0253.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0257.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0102.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0103.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0148.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0242.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0258.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0005.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0008.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0032.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0037.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0070.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0207.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0235.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0061.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0068.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0077.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0204.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0239.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0255.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0289.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0025.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0052.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0066.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0131.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0163.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0259.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0224.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0249.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0121.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0140.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0143.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0151.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0095.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0111.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0139.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0211.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0019.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0076.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0152.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0212.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0223.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0017.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0142.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0158.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0233.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0256.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0262.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0282.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0020.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0024.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0199.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0264.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0002.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0092.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0120.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0071.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0074.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0203.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0285.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0085.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0127.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0185.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0281.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0098.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0112.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0141.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0146.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0164.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0240.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0246.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0097.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0149.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0162.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0030.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0049.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0177.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0209.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0213.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0059.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0091.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0129.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0172.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0175.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0183.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0194.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0231.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0001.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0130.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0168.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0202.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0210.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0234.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0038.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0042.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0114.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0124.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0138.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0153.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0154.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0161.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0173.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0221.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0229.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0118.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0126.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0135.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0167.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0192.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0290.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0007.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0128.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0197.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0243.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0205.pdf filter=lfs diff=lfs merge=lfs -text
+assets/pdfs/pg_0228.pdf filter=lfs diff=lfs merge=lfs -text
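Note that the per-file rules above are all covered by the `*.pdf` pattern on the first added line, since a pattern without a slash in .gitattributes matches in any directory; the individual entries most likely come from tracking each page PDF separately with Git LFS. As a hypothetical illustration (not part of this commit), the same per-file rules could be generated with a short Python loop over assets/pdfs:

# Hypothetical helper, not part of the commit: append one Git LFS filter rule
# per page PDF under assets/pdfs, mirroring the entries added above.
from pathlib import Path

rules = [
    f"{pdf.as_posix()} filter=lfs diff=lfs merge=lfs -text"
    for pdf in sorted(Path("assets/pdfs").glob("*.pdf"))
]
with open(".gitattributes", "a", encoding="utf-8") as fh:
    fh.write("\n".join(rules) + "\n")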
OCR_directory.sh
ADDED
@@ -0,0 +1,17 @@
#!/usr/bin/env bash
# Split the full manuscript into single-page PDFs (run once beforehand):
# pdftk thesis.pdf burst

# Ensure the output directories exist.
mkdir -p assets/txts assets/pngs

# Using pdftotext (or pdfminer's pdf2txt.py), extract the text of each page in
# assets/pdfs and store it in assets/txts with the same basename.
for pdf in assets/pdfs/*.pdf
do
    pdftotext "$pdf" "assets/txts/$(basename "$pdf" .pdf).txt"
    # Alternative extractor:
    # pdf2txt.py -o "assets/txts/$(basename "$pdf" .pdf).txt" "$pdf"
done

# Render each page to a PNG (ImageMagick) for display in the app.
for pdf in assets/pdfs/*.pdf
do
    convert -density 100 -quality 100 -colorspace RGB -alpha remove -alpha off "$pdf" "assets/pngs/$(basename "$pdf" .pdf).png"
done
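The same extraction step can also be scripted in Python. Below is a minimal sketch using pdfminer.six (the library behind the pdf2txt.py command referenced above); the directory layout is the one assumed by the shell script, and the snippet is an illustration rather than part of the commit:

# Minimal sketch, assuming pdfminer.six is installed; mirrors the shell loop:
# one text file per single-page PDF in assets/pdfs.
from pathlib import Path

from pdfminer.high_level import extract_text

pdf_dir = Path("assets/pdfs")
txt_dir = Path("assets/txts")
txt_dir.mkdir(parents=True, exist_ok=True)

for pdf_path in sorted(pdf_dir.glob("*.pdf")):
    text = extract_text(str(pdf_path))  # plain-text layer of the page
    (txt_dir / f"{pdf_path.stem}.txt").write_text(text, encoding="utf-8")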
app.py
CHANGED
@@ -1,67 +1,114 @@
-import
-from
-from llama_index import
-from llama_index.embeddings import HuggingFaceEmbedding
-from llama_index.
-from llama_index.
-from
+import torch
+from transformers import BitsAndBytesConfig
+from llama_index.llms.huggingface import HuggingFaceLLM
+from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+from llama_index.core import SimpleDirectoryReader
+from llama_index.core import VectorStoreIndex, SummaryIndex
+from llama_index.core.prompts import PromptTemplate
+from llama_index.core import Settings
+
+import gradio as gr
+
+
+def messages_to_prompt(messages):
+    prompt = ""
+    for message in messages:
+        if message.role == "system":
+            m = "You are an expert in the research field of document understanding, bayesian deep learning and neural networks."
+            prompt += f"<|system|>\n{m}</s>\n"
+        elif message.role == "user":
+            prompt += f"<|user|>\n{message.content}</s>\n"
+        elif message.role == "assistant":
+            prompt += f"<|assistant|>\n{message.content}</s>\n"
+
+    # ensure we start with a system prompt, insert blank if needed
+    if not prompt.startswith("<|system|>\n"):
+        prompt = "<|system|>\n</s>\n" + prompt
+
+    # add final assistant prompt
+    prompt = prompt + "<|assistant|>\n"
+
+    return prompt
+
+
+def load_RAG_pipeline():
+    # LLM
+    quantization_config = BitsAndBytesConfig(
+        load_in_4bit=True,
+        bnb_4bit_compute_dtype=torch.float16,
+        bnb_4bit_quant_type="nf4",
+        bnb_4bit_use_double_quant=True,
+    )
+
+    llm = HuggingFaceLLM(
+        model_name="HuggingFaceH4/zephyr-7b-alpha",
+        tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
+        query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
+        context_window=3900,
+        max_new_tokens=256,
+        model_kwargs={"quantization_config": quantization_config},
+        # tokenizer_kwargs={},
+        generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
+        messages_to_prompt=messages_to_prompt,
+        device_map="auto",
+    )
+
+    # Llama-index
+    Settings.llm = llm
+    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
+    # Settings.chunk_size = 512
+    # Settings.chunk_overlap = 50
+
+    # raw data
+    documents = SimpleDirectoryReader("assets/txts").load_data()
+    vector_index = VectorStoreIndex.from_documents(documents)
+    # summary_index = SummaryIndex.from_documents(documents)
+    query_engine = vector_index.as_query_engine(response_mode="compact", similarity_top_k=3)
+    return query_engine
+
+
+query_engine = load_RAG_pipeline()
+
+
+# These are placeholder functions to simulate the behavior of the RAG setup.
+# You would need to implement these with the actual logic to retrieve and generate answers based on the document.
+def get_answer(question, temperature, nucleus_sampling, max_tokens):
+    # Here you should implement the logic to generate an answer based on the question and the document.
+    # For example, you could use a machine learning model for RAG.
+    # answer = "This is a placeholder answer."
+    # https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/settings/#setting-local-configurations
+    return query_engine.query(question)
+
+
+def get_answer_page(question):
+    # Implement logic to retrieve the page number or an image of the page with the answer.
+    answer_page = "Page X - placeholder image."
+    return answer_page
+
+
+# Create the gr.Interface function
+def ask_my_thesis(question, temperature, nucleus_sampling, max_tokens):
+    answer = get_answer(question, temperature, nucleus_sampling, max_tokens)
+    answer_page = get_answer_page(question)
+    return answer, answer_page
+
+
+# Set up the interface options based on the design in the image.
+iface = gr.Interface(
+    fn=ask_my_thesis,
+    inputs=[
+        gr.Textbox(label="Question", placeholder="Type your question here..."),
+        gr.Slider(0, 1, value=0.7, label="Temperature"),
+        gr.Slider(0, 1, value=0.9, label="Nucleus Sampling"),
+        gr.Slider(1, 500, value=100, label="Max Generated Number of Tokens"),
+    ],
+    outputs=[gr.Textbox(label="Answer"), gr.Image(label="Answer Page")],
+    title="Ask my thesis",
+    description="Chat with the manuscript: ask questions and receive answers with references.",
+    allow_flagging="never",
+)
+
+# Start the application.
+if __name__ == "__main__":
+    iface.launch()
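get_answer_page above is still a placeholder. One way to fill it in is to read the retrieval metadata that llama_index attaches to a query response and map the top retrieved page text back to its rendered image in assets/pngs (produced by OCR_directory.sh). The sketch below assumes that the response object exposes source_nodes and that SimpleDirectoryReader records the originating file name under the "file_name" metadata key; both details vary across llama_index versions, so treat this as an illustration rather than the committed implementation:

import os

def get_answer_page_from_response(response):
    # Map the top retrieved chunk (e.g. assets/txts/pg_0057.txt) back to the
    # matching page image (assets/pngs/pg_0057.png). Assumes the metadata key
    # "file_name" is present; adjust for the llama_index version in use.
    if not getattr(response, "source_nodes", None):
        return None
    metadata = response.source_nodes[0].node.metadata
    file_name = metadata.get("file_name", "")
    page_stem = os.path.splitext(os.path.basename(file_name))[0]
    image_path = os.path.join("assets", "pngs", f"{page_stem}.png")
    return image_path if os.path.exists(image_path) else None

ask_my_thesis could then pass the response returned by get_answer into this helper and hand the resulting image path to the gr.Image output instead of the placeholder string.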
assets/txts/pg_0002.txt
ADDED
@@ -0,0 +1 @@
assets/txts/pg_0003.txt
ADDED
@@ -0,0 +1,25 @@
Intelligent Automation for AI-Driven Document Understanding

Jordy VAN LANDEGHEM

Examination committee:
em. Prof. Dr. ir. Jean-Pierre Celis, chair
Prof. Dr. Marie-Francine Moens, supervisor
Prof. Dr. Matthew B. Blaschko, supervisor
Prof. Dr. ir. Johan Suykens
Prof. Dr. ir. Tinne Tuytelaars
Prof. Dr. Marcus Rohrbach (TU Darmstadt)
Prof. Dr. Wenpeng Yin (Penn State University)
Dr. Bertrand Anckaert (Contract.fit)

March 2024

Dissertation presented in partial fulfillment of the requirements for the degree of Doctor of Engineering Science (PhD): Computer Science
assets/txts/pg_0004.txt
ADDED
@@ -0,0 +1,10 @@
© 2024 KU Leuven – Faculty of Engineering Science
Uitgegeven in eigen beheer, Jordy Van Landeghem, Celestijnenlaan 200A box 2402, B-3001 Leuven (Belgium)

Alle rechten voorbehouden. Niets uit deze uitgave mag worden vermenigvuldigd en/of openbaar gemaakt worden door middel van druk, fotokopie, microfilm, elektronisch of op welke andere wijze ook zonder voorafgaande schriftelijke toestemming van de uitgever.

All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm, electronic or any other means without written permission from the publisher.
assets/txts/pg_0005.txt
ADDED
@@ -0,0 +1,32 @@
Preface

This journey has been long and arduous, but I have finally reached an end. At this end, I have a thesis that I am proud of, and I have learned a lot. As I look back, I have been very fortunate to have had the support of many people, and I would like to take this opportunity to thank them.

First and foremost, I would like to thank my supervisors, Sien and Matthew, for their guidance and support throughout this journey. Sien has taught me the importance of being thorough and meticulous, striving for diligence and perfection from the get-go. I still remember how patiently she helped me with my first paper, holding a Sunday afternoon call from her attic/home-office, helping me hone the presentation and writing. Involving Matthew as the co-supervisor has been the best decision for my personal development, as he offered a different perspective on my work, always challenging me to look at problems from the lens of statistical theory and machine learning fundamentals. My knee-jerk reaction to start implementing things as soon as possible was often met with a “slow down, think about it first” from Matthew, which has been invaluable in my development as a researcher. I am grateful to both of them for their patience and understanding, and for giving me the freedom to explore my own ideas and interests.

Next, a sincere thanks to my jury members, for taking the time to read my thesis and for their valuable feedback. Furthermore, I would like to thank het Vlaams Agentschap Innoveren & Ondernemen (VLAIO) for awarding the Baekeland grant without which this PhD would not have been possible.

Pol & Bertrand, thanks for having me contribute to your dream to rid the world of boring administrative processes and paperwork. Technically my bosses, but in reality you are the embodiment of leadership by example, and I am grateful for the many lessons I have learned from you. I am grateful for the many opportunities you have given me to grow as a researcher and as a person. Many thanks to my past and present colleagues at Contract.fit, for always
assets/txts/pg_0006.txt
ADDED
@@ -0,0 +1,45 @@
preaching automation, inspiring me, and for having fun along the way. I am grateful to my LIIR colleagues at KU Leuven, particularly the folks from office 4.34 for the many interesting discussions and whiteboard sessions, whenever I occasionally popped into the office.

I was fortunate to travel to many places during my PhD (Lausanne, Lisbon, Barcelona, San Jose, Paris, Waikoloa), and I have met many people along the way. My DUDEs, you have been the trigger to complete my PhD, reinvigorating my passion for research and inspiring me for my future career. How crazy is it that we conceived the seeds of the DUDE project in a pirates bar, on a hotel rooftop, and from a hospital bed after my back surgery?

Finally, I would like to thank my family and friends for their support and encouragement throughout this journey. My parents, Peter en Nadine, you have showed me that hard work pays off, and merci for the many sacrifices you have made to give me the best possible education and life. Marijke, you are the love of my life, and although I am not religious, you are my goddess, de mammiej. Feliz, when you came into our lives, you added an extra dimension. I used to see in 2D, now I see in 3D. Forever your father, your pappiej. Wes en Jen, thanks for showing me to never give up, keep on pushing, even when you are at your lowest, there is a way out, and only hard work will get you there. Cornbois - Bryan, Emile, (even) Jan, for our friendship, I fail to make an exhaustive definition. I wish for many more years of friendship from my like-minded brothers. John, Teunen, Wannes, if there is ever a zombie apocalypse, I know that I can count on you to have my window. Kessel-city - Poohke, Vinny, Kweinch etc., thanks for keeping on pushing the bar higher, and inspiring me with your ambition and drive. Gustaf, thanks for the many laughs (#velleke) and the much-needed distraction. Elstipoes, you are my oldest friend, and I am grateful for the many years of friendship. Woutje, thanks for your contagious optimism and the mancave during university. Leuvenbende, you were the ones that made university fun and enjoyable. Individually and together you are beautiful people, and I cherish our yearly reunions. Lauren en Yannick, thanks for letting me win at Mario Kart. I might be forgetting some people, but I would like to thank all my friends for bringing joy, for keeping me grounded, and for reminding me that there is more to life than work.

Having studied literature in my Bachelor’s, it feels appropriate to finish with a quote wrongly attributed to Ernest Hemingway: “Write drunk; edit sober.”

Jordy Van Landeghem
Gurdo, Pogomeister, Jorre, De Van Laaandeghem
February, 2024
Kessel, Belgium
assets/txts/pg_0007.txt
ADDED
@@ -0,0 +1,33 @@
Abstract

Human communication is increasingly document-based, requiring machines to understand a wide variety of visually-rich documents to assist humans in their daily lives. Amid the digital evolution, documents continue to facilitate crucial human and organizational interactions but are tethered to manual processing, causing inefficiency. We examine why organizations lag in adopting automated document processing solutions and outline two primary challenges: the complexity of processing long, multimodal documents algorithmically and the necessity for reliability and control over associated risks. Automated decision-making is key to improving the efficiency of document processing, but the current state-of-the-art technology is not yet reliable and robust enough to be deployed in autonomous systems.

The practical objective set is to develop Intelligent Automation (IA) systems capable of estimating confidence in their actions, thereby increasing throughput without accruing additional costs due to errors. We analyze the key challenges and propose solutions to bridge the gap between research and practical applications, with a focus on realistic datasets and experimental methodologies. Building upon foundations of Document Understanding (DU), this dissertation introduces advanced methodologies combining Machine Learning, Natural Language Processing, and Computer Vision.

Addressing the evident gaps in research, this work presents novel methods for predictive uncertainty quantification (PUQ) alongside practical frameworks for evaluating the robustness and reliability of DU technologies. The contribution culminates in the introduction of two novel multipage document classification datasets and a multifaceted benchmark, DUDE, designed to rigorously challenge and assess the state-of-the-art in DU. Extensive experiments across these datasets reveal that while advancements have been made, significant room for improvement remains, particularly in long-context modeling for multipage document processing and calibrated, selective document visual question answering. Efficient DU is also explored, revealing the effectiveness of
assets/txts/pg_0008.txt
ADDED
@@ -0,0 +1,35 @@
knowledge distillation (KD) model compression in visually-rich document layout analysis (DLA) and classification.

Through empirical studies and methodological contributions, this dissertation has the following contributions and findings:

First, in a benchmarking study of established methods on real-world text classification, we find that our novel hybrid method ‘Concrete Dropout Ensemble’ performs best, enhancing in-domain calibration and novel class detection, even at a smaller ensemble size. Detailed ablation experiments reveal the impact of prior, neural architecture, and hyperparameter choices on estimation quality.

Second, on a prototypical DU task, we identify challenges in DU progress and propose a formalization of multipage document classification scenarios, constructed novel datasets, and conducted an experimental analysis showing the promise of multipage representation learning and inference.

Third, we introduce DUDE, incorporating multifaceted challenges and principles for a comprehensive evaluation of generic DU. Next to our own benchmarking, we organize a competition, revealing that while newer document foundation models show promise, they struggle with questions involving visual evidence or complex reasoning. Moreover, we find severe problems in the ability of Large Language Models (LLMs) to reason about documents in their entirety, highlighting issues with hallucination, long-context reasoning and control.

Fourth, we propose the first methodology for enriching documents with semantic layout structure using distilled DLA models. We apply KD to visual document tasks, unraveling the influence of various task and architecture components.

Finally, the dissertation concludes with a discussion of the findings and implications for future research, emphasizing the need for advancements in multipage document representation learning and the importance of realistic datasets and experimental methodologies to measurably move forward to reliable and robust IA-DU technology.
assets/txts/pg_0009.txt
ADDED
@@ -0,0 +1,34 @@
Beknopte samenvatting

Menselijke communicatie is in toenemende mate documentgebaseerd, waarbij machines een breed aanbod aan visueel-rijke documenten moeten begrijpen om mensen in hun dagelijks leven te assisteren. Te midden van de digitale evolutie blijven documenten cruciale menselijke en organisatorische interacties faciliteren, maar zijn ze gebonden aan handmatige verwerking, wat inefficiëntie veroorzaakt. We onderzoeken waarom organisaties achterblijven bij het adopteren van geautomatiseerde documentverwerkingsoplossingen en schetsen twee primaire uitdagingen: de complexiteit van het algoritmisch verwerken van lange, multimodale documenten en de noodzaak van betrouwbaarheid en controle over daarmee samenhangende risico’s. Geautomatiseerde besluitvorming is essentieel voor het verbeteren van de efficiëntie van documentverwerking, maar de huidige stand van de technologie is nog niet betrouwbaar en robuust genoeg om ingezet te worden in autonome toepassingen.

Het praktische doel dat gesteld wordt, is het ontwikkelen van systemen voor Intelligente Automatisering (IA) die in staat zijn om vertrouwen in hun acties te schatten, daarmee de doorvoer verhogend zonder extra kosten vanwege fouten. We analyseren de belangrijkste uitdagingen en stellen oplossingen voor om de kloof tussen onderzoek en praktische toepassingen te overbruggen, met een focus op realistische datasets en experimentele methodologieën. Voortbouwend op de fundamenten van Documentinterpretatie (DI), introduceert dit proefschrift geavanceerde methodologieën die Machinaal Leren, Natuurlijke Taalverwerking en Computer Visie combineren.

Door de duidelijke hiaten in onderzoek aan te pakken, presenteert dit werk nieuwe methoden voor predictieve onzekerheidskwantificering (POK) naast praktische kaders voor het evalueren van de robuustheid en betrouwbaarheid van DI-technologieën. De bijdrage culmineert in de introductie van twee nieuwe datasets voor classificatie van multipagina documenten en een veelzijdige benchmark, DUDE, ontworpen om de state-of-the-art in DI rigoureus uit te dagen en te beoordelen. Uitgebreide experimenten met deze datasets
assets/txts/pg_0010.txt
ADDED
@@ -0,0 +1,44 @@
onthullen dat er weliswaar vooruitgang is geboekt, maar dat er nog significant veel ruimte is voor verbetering, met name in de lange-contextmodellering voor de verwerking van multipagina documenten en gekalibreerd, selectief visueel vraagbeantwoording van documenten. Meer schaalbaar DI wordt ook verkend, waarbij de effectiviteit van kennisdistillatie (KD) voor modelcompressie in visueel-rijke layoutanalyse (DLA) en classificatie van documenten aan het licht komt.

Door middel van empirische studies en methodologische bijdragen, heeft dit proefschrift de volgende bijdragen en bevindingen:

Ten eerste vinden we in een benchmarkstudie van gevestigde POK-methoden op tekstclassificatie in de echte wereld dat onze nieuwe hybride POK-methode ’Concrete Dropout Ensemble’ het beste presteert, de kalibratie binnenshuis verbeterend en detectie van nieuwe klassen, zelfs met een kleiner ensemble. Gedetailleerde ablatie-experimenten onthullen de impact van voorafgaande kennis, neurale architectuur en keuzes van hyperparameters op de kwaliteit van POK-schatting.

Ten tweede identificeren we uitdagingen in de vooruitgang van DI en stellen een formalisatie voor van multipagina documentclassificatiescenario’s, bouwen novel datasets, en voeren een experimentele analyse uit die de belofte van multipagina representatie-leren en inferentie toont.

Ten derde introduceren we DUDE, waarin veelzijdige uitdagingen en principes worden voorgesteld voor een uitgebreide evaluatie. Naast onze eigen benchmarking organiseren we een competitie, waaruit blijkt dat hoewel nieuwere modellen veelbelovend zijn, ze het moeilijk hebben met vragen die visueel bewijs of complex redeneren vereisen. Bovendien vinden we ernstige problemen in het vermogen van Grote Taalmodellen (LLMs) om over documenten in hun geheel te redeneren, wat problemen benadrukt met hallucinatie, redeneren met lange context en controle.

Ten vierde stellen we de eerste experimentele methodologie voor om documenten te verrijken met semantische layoutstructuur met behulp van gedestilleerde DLA-modellen. We passen KD toe op visuele documenttaken, waarbij we de invloed van verschillende architectuurcomponenten van taken ontrafelen.

Ten slotte sluit het proefschrift af met een bespreking van de bevindingen en implicaties voor toekomstig onderzoek, waarbij de noodzaak wordt benadrukt voor vooruitgang in multipagina documentrepresentatie-leren en het belang van realistische datasets en experimentele methodologieën om meetbaar vooruitgang te boeken naar betrouwbare en robuuste IA-DI technologie.
assets/txts/pg_0013.txt
ADDED
@@ -0,0 +1,22 @@
List of Abbreviations

AAPD Arxiv Academic Paper Dataset
Acc_ID Accuracy in-domain
Acc_OOD Accuracy out of domain
AI Artificial Intelligence
ANLS Average Normalized Levenshtein Similarity
AUPR Area Under the Precision-Recall Curve
AURC Area-Under-Risk-Coverage-Curve
AUROC Area Under the Receiver Operating Characteristic curve
BDL Bayesian Deep Learning
BNN Bayesian Neural Network
BPM Business Process Management
CE Cross-Entropy
CER Character Error Rate
COCO Common Objects in Context
CSF Confidence Scoring Function
CV Computer Vision
DC Document Classification
DG Document Generation
assets/txts/pg_0014.txt
ADDED
@@ -0,0 +1,30 @@
DL Deep Learning
DLA Document Layout Analysis
DNN Deep Neural Network
DocAI Document AI
DocVQA Document Visual Question Answering
DOD Document Object Detection
DU Document Understanding
DUDE Document UnderstanDing of Everything
ECE Expected Calibration Error
ELBO Evidence Lower Bound
ERM Empirical Risk Minimization
FasterRCNN Faster Region-based Convolutional Neural Network
FP False Positives
IA Intelligent Automation
ICDAR International Conference on Document Analysis and Recognition
IDP Intelligent Document Processing
i.i.d. Independent and Identically Distributed
IOB/IOBES Inside, Outside, Beginning / End, Single
KD Knowledge Distillation
KIE Key Information Extraction
LLM Large Language Model
MAP Maximum-a-Posteriori
mAP Mean Average Precision
MCD Monte Carlo Dropout
assets/txts/pg_0015.txt
ADDED
@@ -0,0 +1,30 @@
MCMC Markov Chain Monte-Carlo
MDLT Multi-Domain Long-Tailed Recognition
MECE Mutually Exclusive and Collectively Exhaustive
MI Mutual Information
ML Machine Learning
MSE Mean Squared Error
MSP Maximum Softmax Probability
MU Model Uncertainty
NLG Natural Language Generation
NLL Negative Log Likelihood
NLP Natural Language Processing
NN Neural Network
OCR Optical Character Recognition
OOD Out-of-Distribution
PCC Pearson Correlation Coefficient
PUQ Predictive Uncertainty Quantification
RERM Regularized Empirical Risk Minimization
ResNet Residual Network
RPA Robotic Process Automation
SaaS Software-as-a-service
SNGP Spectral-normalized Neural Gaussian Process
SOTA State-of-the-art
STP Straight-Through-Processing
TSR Table Structure Recognition
assets/txts/pg_0016.txt
ADDED
@@ -0,0 +1,12 @@
VDU Visual Document Understanding
VI Variational Inference
VLM Vision Language Model
VQA Visual Question Answering
VRD Visually-Rich Document
WER Word Error Rate
assets/txts/pg_0017.txt
ADDED
@@ -0,0 +1,200 @@
Contents

Abstract . . . iii
Beknopte samenvatting . . . v
List of Abbreviations . . . xii
Contents . . . xiii
List of Figures . . . xix
List of Tables . . . xxv

1 Introduction . . . 1
  1.1 Research Context . . . 4
  1.2 Problem Statement and Questions . . . 6
      1.2.1 Reliable and Robust Deep Learning . . . 6
      1.2.2 Realistic and Efficient Document Understanding . . . 7
  1.3 Outline . . . 9

2 Fundamentals . . . 11
  2.1 Statistical Learning . . . 12
      2.1.1 Neural Networks . . . 14
      2.1.2 Probabilistic Evaluation . . . 15
      2.1.3 Architectures . . . 16
            2.1.3.1 Convolutional Neural Networks . . . 17
            2.1.3.2 Language Neural Networks . . . 18
            2.1.3.3 Transformer Network . . . 19
  2.2 Reliability and Robustness . . . 21
      2.2.1 Generalization and Adaptation . . . 22
      2.2.2 Confidence Estimation . . . 23
      2.2.3 Evaluation Metrics . . . 24
assets/txts/pg_0018.txt
ADDED
@@ -0,0 +1,204 @@
|
    2.2.4 Calibration
    2.2.5 Predictive Uncertainty Quantification
    2.2.6 Failure Prediction
  2.3 Document Understanding
    2.3.1 Task Definitions
    2.3.2 Datasets
    2.3.3 Models
    2.3.4 Challenges in Document Understanding
      2.3.4.1 Long-Context Modeling
      2.3.4.2 Document Structure Modeling
  2.4 Intelligent Automation

I Reliable and Robust Deep Learning

3 Benchmarking Scalable Predictive Uncertainty in Text Classification
  3.1 Introduction
  3.2 Related Work
  3.3 Uncertainty Methods
    3.3.1 Quantifying Uncertainty in Deep Learning
    3.3.2 Predictive Uncertainty Methods
      3.3.2.1 Monte Carlo Dropout
      3.3.2.2 Deep Ensemble
      3.3.2.3 Concrete Dropout
      3.3.2.4 Heteroscedastic Extensions
    3.3.3 Uncertainty Estimation
    3.3.4 Motivating Hybrid Approaches
    3.3.5 Uncertainty Calibration under Distribution Shift
  3.4 Experimental Methodology
    3.4.1 Proposed Hybrid Approaches
    3.4.2 Datasets
    3.4.3 Architecture
    3.4.4 Evaluation metrics
    3.4.5 Experimental design
      3.4.5.1 In-domain Setting
      3.4.5.2 Cross-domain Setting
      3.4.5.3 Novelty Detection Setting
  3.5 Results
    3.5.1 Experiment: In-domain
    3.5.2 Experiment: Cross-domain
    3.5.3 Experiment: Novelty Detection
    3.5.4 Experiment: Ablations
      3.5.4.1 Diversity
|
assets/txts/pg_0019.txt
ADDED
@@ -0,0 +1,394 @@
|
      3.5.4.2 NLP Architecture
      3.5.4.3 Ensemble size M
      3.5.4.4 Concrete Dropout p
  3.6 Discussion
  3.7 Additional Uncertainty Approaches
    3.7.1 Stochastic Gradient MCMC Methods
    3.7.2 Spectral-normalized Neural Gaussian Process
      3.7.2.1 SNGP Results
      3.7.2.2 SNGP Discussion
  3.8 Limitations
  3.9 Chapter Conclusion

II Realistic and Efficient Document Understanding

4 Beyond Document Page Classification: Design, Datasets, and Challenges
  4.1 Introduction
  4.2 Problem Formulation
  4.3 Balancing Research & Applications
  4.4 Experimental Study
  4.5 Challenges and Guidelines
    4.5.1 Divergence of Tasks: f
    4.5.2 Divergence of Label Space: Y
    4.5.3 Divergence of Input Data: X
    4.5.4 Maturity of Evaluation Methodology
  4.6 Chapter Conclusion

5 Document UnderstanDing of Everything (DUDE)
  5.1 Introduction
  5.2 Related Work
  5.3 DUDE Dataset
    5.3.1 Gathering Documents
    5.3.2 Annotation Process
    5.3.3 Dataset Statistics
    5.3.4 Diagnostic Subsets
    5.3.5 Evaluation
  5.4 DUDE Competition
    5.4.1 Challenge Objectives
    5.4.2 Challenge Contributions
    5.4.3 Motivation and Scope
      5.4.3.1 Desired Generalization
|
assets/txts/pg_0020.txt
ADDED
@@ -0,0 +1,320 @@
|
    5.4.4 DUDE Competition Protocol
      5.4.4.1 Task Formulation
      5.4.4.2 Evaluation Protocol
  5.5 DUDE Benchmark
    5.5.1 Baselines
    5.5.2 Analysis & Discussion
  5.6 Detailed Results Analysis
    5.6.1 Within Model Class Analysis
      5.6.1.1 Encoder vs. Decoder
      5.6.1.2 Incorporating Layout & Vision
      5.6.1.3 Toward Long Document Processing
      5.6.1.4 Diagnosis of LLM Results
    5.6.2 Assessing Confidence
  5.7 DUDE Competition Results
    5.7.1 Submitted Methods
    5.7.2 Performance Analysis
  5.8 Chapter Conclusion

6 DistilDoc: Knowledge Distillation for Visually-Rich Document Applications
  6.1 Introduction
  6.2 Related Work
  6.3 Experimental Setup
    6.3.1 Datasets
    6.3.2 Architectures and Backbones
    6.3.3 KD Methods
    6.3.4 Evaluation
    6.3.5 DLA-enriched LLM prompting
  6.4 Results & Discussion
  6.5 Chapter Conclusion

7 Conclusion
  7.1 Summary
  7.2 Perspectives For Future Research
    7.2.1 Open Problems In Reliability & Robustness
    7.2.2 A Future-Proof Design Of IA-DU
      7.2.2.1 The ‘Ultimate’ DU Dataset?
      7.2.2.2 A Feature-complete IA-DU Solution?

Bibliography

A Appendix - PUQ
  A Implementation Details
|
assets/txts/pg_0021.txt
ADDED
@@ -0,0 +1,421 @@
|
    A.1 Software and Data
    A.2 Hyperparameter Defaults
  B Practical Considerations
    B.1 Take-home Summary
    B.2 Compute vs. Performance Trade-off
  C Detailed Experiment Results
    C.1 Zoom-in Benchmark Evidence
    C.2 Absolute Benchmark Results

B Appendix - BDPC
  A Existing DC Datasets
  B Visualization of Proposed DC Datasets

C Appendix - DUDE
  A Baseline Experiments Setup
    A.1 Hyperparameter Defaults
    A.2 Generative LLM Prompt Fine-tuning
    A.3 Confidence Estimation
    A.4 Evaluation
  B Qualitative Examples
    B.1 Qualitative Examples - Competition

D Appendix - KDD
  A Code and Datasets
  B Implementation Details
  C Task Definitions
  D Additional Experiment Results
    D.1 Tobacco-3482 Results
    D.2 PRImA Results
    D.3 RVL-CDIP-N Results
    D.4 Downstream DocVQA Results
    D.5 Ablation Experiments

Curriculum

Publications
|
assets/txts/pg_0033.txt
ADDED
@@ -0,0 +1,30 @@
|
Chapter 1

Introduction

“Amid significant life events—like buying a house or expecting your firstborn child—lies a less cheerful reality that I experienced firsthand: the hassle of dealing with manual paperwork.

For the former case, this required a lot of back-and-forth with the bank, the notary, and the real estate agent, with each of them requiring a different set of documents (e.g., monthly pay stubs, bank statements, copies of national registry, etc.) to be filled in, signed, and sent back for processing. On the side of the document processors, each document needed to be classified, key information extracted, and the information validated against other documents to be able to prove my solvency in making an offer, applying for a loan, or being drafted as the future house owner. In between all parties and external organizations, even more documents were either created, adapted, or passed along, such as the offer, the loan agreement, the deed of sale, a soil certificate, etc.

This juxtaposition of valuable moments in life with cumbersome administrative procedures involving manual document processing forms the backdrop against which I aim to explore and propose potential solutions in this thesis.”
|
assets/txts/pg_0034.txt
ADDED
@@ -0,0 +1,44 @@
|
Documents are containers of information that are easily shareable. The concept of a document dates back to when humans started writing and has been a cornerstone of human communication ever since. In the age of digital technology, documents are still the primary means of communication between humans and organizations and form the backbone of many business processes. Human communication is increasingly happening through digital channels, and the COVID-19 pandemic has only accelerated this trend. We are increasingly living in a “document society” [53], dependent on documents in our daily lives or for recording second-hand knowledge. With instant gratification as the norm in the digital age, people expect similar seamless interactions with businesses and governments. While digitization has increased the speed and ease of document-based communication, document processing remains a largely human effort, with organizations drowning under the sheer volume of documents they receive.

So why have organizations not switched en masse to automated document processing?

The answer lies for some part in (I) the complexity of the task, and for the other part in (II) the need for reliability and risk control.

(I) While it might be straightforward for a human (white-collar) worker to read a long, structured document, understand its contents, categorize it, and extract crucial information accordingly, this is not so easy for a machine. This could be perceived as an instance of Moravec’s paradox [319], which states that tasks that are easy for humans are hard for machines, and vice versa. However, in recent times, significant strides forward have been made thanks to technological advances combining Natural Language Processing (NLP), Computer Vision (CV) and Machine Learning (ML). Document Understanding (DU) is the umbrella term for both the end-to-end solution and the research field studying how to make machines interpret and understand documents (elaborated on in Section 2.3). It has seen a surge in interest in the past few years, with the rise of large-scale pretrained Language and Vision models (LLM, VLM) [52, 94, 101, 187, 380, 383, 502] capable of modeling document inputs.

What makes DU challenging is that it encompasses multiple subtasks, each of which is a research field in its own right, such as Optical Character Recognition (OCR), Document Layout Analysis (DLA), Document Classification (DC), Key Information Extraction (KIE), Visual Question Answering (VQA), etc. The complexity of the task is further increased by the fact that documents are multimodal, containing both text and images, and that they are compositional, i.e., the meaning of the document is not just the sum of its parts. Information can appear in a wide range of forms including text, images, tables or graphs, and be spread across multiple pages. Moreover, the meaning of a document
|
assets/txts/pg_0035.txt
ADDED
@@ -0,0 +1,46 @@
|
can change depending on the context in which it is used. As an artifact of the communication channel, not all documents are born digitally, and the quality of the document can vary greatly, with some documents being handwritten, scanned with low resolution, or even a picture of a document. Furthermore, documents are often not standardized templates and can be highly variable in terms of layout, structure, and content. Finally, the longer the document, the more computationally demanding it becomes to process, and the more likely it is to induce errors, which can be harder to detect.

Addressing the inherent challenges of document processing, and achieving high levels of accuracy, processing speed, reliability, robustness, and scalability in DU forms the applied scope of this thesis.

(II) Consider the example given of the birth certificate. While I might not appreciate as much the manual handling of this document, if they had registered my baby girl’s name (Feliz, Spanish writing without an accent on the ‘e’) incorrectly, I would be pretty upset as this could have further repercussions. Whereas this error might be easily rectified, it is not so easy to do so in the case of a mortgage application, where the wrong information could lead to a rejection of the application, or even worse, a loan agreement with the wrong terms and conditions. This demonstrates that, even when full automation of document processing is in high demand, it is not always desirable if the risk of failure might be too large.

Nevertheless, a lot of the potential for automation remains untapped, and organizations are increasingly looking for solutions to fully automate their document processing workflows. However, full automation, implying perfect recognition of document categories and impeccable information extraction, is an unattainable goal with the current state of technology [79]. The more realistic objective set is Intelligent Automation (IA) (elaborated on in Section 2.4), where the goal is to have the machine estimate confidence in its predictions, deriving business value with as high as possible volumes of perfect predictions (Straight-Through-Processing, STP) without incurring extra costs (False Positives, FP).

The leitmotif of this thesis will be the fundamental enablers of IA: confidence estimation and failure prediction.

Calibrated uncertainty estimation with efficient and effective DU technology will allow organizations to confidently automate their document processing workflow, while keeping a human in the loop only for predictions with a higher likelihood of being wrong. To date, however, little research has addressed the question of how to make DU technology more reliable, as is illustrated in a toy analysis (Table 1.1) reporting the absence of many IA-related keywords in the Proceedings of the 2021 International Conference on Document Analysis and
|
assets/txts/pg_0036.txt
ADDED
@@ -0,0 +1,80 @@
|
Recognition (ICDAR) [289].

The thesis aims to fill this gap by proposing novel methods for uncertainty estimation and failure prediction (Part I), and by providing a framework for benchmarking and evaluating the reliability and robustness of DU technology, as close as possible to real-world requirements (Part II).
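To make the IA objective above concrete, a minimal sketch of confidence-based routing between straight-through processing and human review could look as follows. This is illustrative only: the helper `classify`, the field names and the 0.95 threshold are assumptions, not part of the thesis or any specific product.

# Sketch: confidence-thresholded routing for Intelligent Automation.
# `classify` is an assumed helper returning (label, confidence) for a document;
# the threshold is an arbitrary illustrative operating point.
def route(document, classify, threshold=0.95):
    label, confidence = classify(document)
    if confidence >= threshold:
        return {"decision": "straight-through", "label": label}  # STP: no human touches it
    return {"decision": "human-review", "label": label}          # low confidence -> review queue

Raising the threshold trades STP volume for fewer false positives; whether the confidence values can be trusted to make that trade-off is exactly what calibration is about.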
Table 1.1. Comparative analysis of keywords in the ICDAR 2021 proceedings. While many DU subtasks are represented, there is a lack of keywords related to IA. Do note that calibration is used in the context of camera calibration, and not in the context of confidence estimation.

  keyword             freq    keyword                       freq
  document            3388    calibration/calibrate           33
  classification       242    temperature scaling              0
  key information       56    failure prediction               0
  question answering    106    misclassification detection     0
  layout analysis       223    out-of-distribution (OOD)      25
                               predictive uncertainty          0

In the remainder of the Introduction, I will sketch the surrounding research context, followed by the problem statement and questions, and finally the outline of the thesis manuscript.

1.1 Research Context

All chapters of this dissertation have been executed as part of the Baekeland PhD mandate (HBC.2019.2604) with financial support of VLAIO (Flemish Innovation & Entrepreneurship) and Contract.fit. The latter is a Belgian-based software-as-a-service (SaaS) provider of Intelligent Document Processing (IDP) drawing on innovations in DU to power their product suite (email-routing, Parble), and my generous employer since 2017.

Some of the joint work (Chapter 5) has been partially funded by a PhD Scholarship from AGAUR (2023 FI-3-00223), and the Smart Growth Operational Programme under projects no. POIR.01.01.01-00-1624/20 (Hiper-OCR - an innovative solution for information extraction from scanned documents) and POIR.01.01.01-00-0605/19 (Disruptive adoption of Neural Language Modelling for automation of text-intensive work).

Moreover, given that the dissertation work has been performed over a large span of time, it warrants putting it in the larger context and dynamics of AI innovations, the state of DU as a field, how notions of ’reliability’ have evolved over time, and finally the business context.
|
assets/txts/pg_0037.txt
ADDED
@@ -0,0 +1,42 @@
|
This thesis started almost concurrently with the rise of the global COVID-19 pandemic, making it hard to foster collaborations in the early stages. At the start of the PhD, DU methodology was fairly established, with OCR and Transformer-based pipelines such as BERT [94] and LayoutLM [502], which is why we first prioritized the more fundamental challenge of decision-making under uncertainty (Part I); which was followed by a step back, closer to applied DU research (Part II).

The research community’s understanding of ‘reliability’ has also evolved over time. When starting the work of Chapter 3, the notion of reliability was mostly associated with uncertainty quantification and calibration. However, calibration is not a panacea, and only fairly recently, Jaeger et al. [193] proposed a more general framework encapsulating reliability and robustness. They promote the more concrete and useful notion of failure prediction, which still involves confidence/uncertainty estimation yet with an explicit definition of the failure source which one wants to detect or guard against, e.g., in-domain test errors, changing input feature distributions, novel class shifts, etc. Since I share a similar view of the problem, I have focused following works on the more general notion of failure prediction, which is also more in line with the business context of IA.
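As a hedged illustration of what failure prediction asks of a confidence score (a generic check, not a method proposed in this thesis; the arrays below are invented toy values), one common evaluation is whether confidence ranks correct predictions above failures, e.g., via AUROC:

# Sketch: evaluating a confidence score as a failure predictor.
# Assumes scikit-learn is available; `confidences` and `is_correct` are
# illustrative arrays for some classifier's test predictions.
import numpy as np
from sklearn.metrics import roc_auc_score

confidences = np.array([0.99, 0.95, 0.60, 0.85, 0.40])  # model confidence per prediction
is_correct  = np.array([1,    1,    0,    1,    0])      # 1 = correct, 0 = failure

# Higher AUROC means confidence separates correct predictions from failures better.
auroc = roc_auc_score(is_correct, confidences)
print(f"failure-prediction AUROC: {auroc:.2f}")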
Whereas we originally intended to work on multi-task learning of DU subtasks, the rise of general-purpose LLMs offering a natural language interface to documents rather than discriminative modeling (e.g., ChatGPT [52, 344]), prompted us toward evaluating this promising technology in the context of DU. More importantly, we observed the lack of sufficiently complex datasets and benchmarks in DU that would allow us to tackle larger, more fundamental questions such as ’Do text-only LLMs suffice for most low-level DU subtasks?’ (subsequently tackled in Chapter 5), which is why we shifted our focus to the more applied research questions of benchmarking and evaluation (Part II).

Finally, the business context has also evolved over time. Originally, IDP was practiced by legacy OCR companies; specialized vendors, offering a range of solutions for specific document types (e.g., invoices, contracts, tax forms, etc.); or cloud service providers, offering IDP as part of a larger suite of services (e.g., AWS Textract, Azure Form Recognizer, etc.). However, the rise of both open-source LLM development and powerful, though closed-source models has lowered the barrier to entry for any new entrants or incumbents. This has led to a commoditization of IDP, with the quality of the LLMs and the ease of integration with existing business processes becoming key differentiators.
|
assets/txts/pg_0038.txt
ADDED
@@ -0,0 +1,45 @@
|
1.2 Problem Statement and Questions

The general introduction sketches the context of the research, and motivates the research questions. In this Section, I will formulate the problem statement and research questions more formally and how they relate to the manuscript’s contents.

1.2.1 Reliable and Robust Deep Learning

The dissertation opens with the more fundamental challenge of targeting reliability and robustness in Deep Learning, which covers fairly abstract concepts that have been used interchangeably and inconsistently in the literature. They will be defined more extensively in Section 2.2, but for now, consider reliability as the ability to avoid failure, robustness as the ability to resist failure, and resilience as the ability to recover from failure [373, 438, 455]. In Chapter 3, we focus on the more concrete objective of predictive uncertainty quantification (PUQ), which shows promise for improving reliability and robustness in Deep Learning (DL) [123, 140, 173, 455]. Concretely, PUQ methods are expected to elucidate sources of uncertainty such as a model’s lack of in-domain knowledge due to either training data scarcity or model misspecification, or its ability to flag potentially noisy, shifted or unknown input data [136].
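For readers unfamiliar with how such sources are teased apart in practice, a minimal sketch of the standard entropy-based decomposition over multiple stochastic forward passes (e.g., MC Dropout or an ensemble) is given below; the array shapes, names and toy numbers are assumptions for illustration, not the implementation used in Chapter 3.

# Sketch: decomposing predictive uncertainty from M stochastic passes over one input.
import numpy as np

def uncertainty_decomposition(probs, eps=1e-12):
    """probs: array of shape (M, C) with class probabilities per member/pass."""
    mean_p = probs.mean(axis=0)
    total = -np.sum(mean_p * np.log(mean_p + eps))                      # predictive entropy
    aleatoric = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))   # expected entropy (data noise)
    epistemic = total - aleatoric                                       # mutual information (model uncertainty)
    return total, aleatoric, epistemic

# Example: three passes that slightly disagree -> non-zero epistemic term.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.8, 0.1, 0.1]])
print(uncertainty_decomposition(probs))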
We observed that the majority of prior PUQ research focused on regression and CV tasks, while the applicability of PUQ methods had not been thoroughly explored in the context of NLP. As mentioned earlier, most DU pipelines (in 2020) were text-centric with a high dependency on the quality of OCR. Since OCR is often considered a solved problem [262], we hypothesized that the main source of error and uncertainty in DU would reside in the text representations learned by deep neural networks (DNNs). This is why we focused on the more fundamental question of how well do PUQ methods scale in NLP? More specifically, we restricted the scope to the prototypical, well-studied task of text classification, for which we could leverage existing multi-domain datasets varying in complexity, size and label space (multi-class vs. multi-label).

This leads to the following research questions:

RQ 1. When tested in realistic language data distributions on various text classification tasks, how well do PUQ methods fare in NLP?
|
assets/txts/pg_0039.txt
ADDED
@@ -0,0 +1,41 @@
|
RQ 2. In which settings are PUQ methods most useful, i.e., which failure sources / distribution shifts are they most sensitive to?

RQ 3. How can we obtain better PUQ estimates without overrelying on computationally prohibitive methods, e.g., Deep Ensemble [238]?

RQ 4. How important are certain prior, neural architecture or hyperparameter influences on the quality of PUQ estimation?

In a later chapter (Chapter 5), we introduce a complex benchmark for generic DU that additionally tests for robustness to domain, visual and layout shifts, and explores the novel problem of hallucination and control in natural language generation (NLG) with LLMs from the perspective of calibrated and selective DocVQA. The general task formulation involves a natural language question (on content, aspect, form, visual/layout), an input document, and a set of reference answers. The model is expected to provide a natural language answer, an answer confidence and a (binary) abstention decision. Evaluation is done in terms of answer correctness, calibration and selective prediction. On the one hand, one expects a model to lower confidence when unsure about the correctness of a predicted answer. On the other hand, one expects a model to abstain from answering and refrain from hallucinations on unanswerable questions (which had been explicitly added in the dataset).
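For illustration only (the field names and the abstention rule below are my own shorthand, not the benchmark's submission schema), the expected per-question output can be thought of as a small record:

# Sketch of the per-question output expected in selective DocVQA.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DocVQAPrediction:
    question_id: str
    answer: Optional[str]   # None when the model abstains
    confidence: float       # in [0, 1]; scored for calibration and selective prediction
    abstain: bool           # True for questions the model deems unanswerable

def make_prediction(question_id: str, answer: str, confidence: float,
                    abstain_threshold: float = 0.5) -> DocVQAPrediction:
    # Abstain when confidence is too low, instead of risking a hallucinated answer.
    if confidence < abstain_threshold:
        return DocVQAPrediction(question_id, None, confidence, True)
    return DocVQAPrediction(question_id, answer, confidence, False)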
RQ 5. How severe is the problem of hallucination and control in LLMs when evaluated in a selective, free-form DocVQA task setting?

1.2.2 Realistic and Efficient Document Understanding

The second part of the dissertation focuses on the more applied research questions of realistic and efficient DU. The overall objective is to make DU technology more generically applicable (Chapter 5), evaluation more in sync with real-world requirements (Chapters 4 and 5), and more efficient at modeling the multimodal and compositional nature of documents (Chapters 5 and 6).

Due to the proximity to business applications and the risks of leaking personal information, DU research benchmarks have diverged substantially from the real-world distributions of document data. For instance, DU datasets are often limited to single-page document images, are from outdated sources (e.g., IIT-
|
assets/txts/pg_0040.txt
ADDED
@@ -0,0 +1,35 @@
|
CDIP [252]), or are restricted to a single domain or a small set of document types.

We posit that larger, fundamental questions in DU remain unanswered due to a lack of sufficiently complex datasets and benchmarks with a rich methodology covering evaluation beyond the independent and identically distributed (i.i.d.) test set setting. While there exist performant models for DU subtasks such as OCR, DC, KIE, etc., it is unclear how to move from these specific analysis and recognition tasks to models that can reason and understand documents. A truly end-to-end DU solution must handle the complexity and variety of real-world documents and subtasks, which could be expressed as natural language questions. Moreover, it should be able to generalize to any question on any document and reason over multiple pages and modalities.

The following research questions are addressed in Chapters 4 and 5:

RQ 6. How can we iteratively close the gap between research and practice in DU?

RQ 7. How can we design a resource that comprehensively challenges the state-of-the-art?

RQ 8. Which DU aspects are most challenging for current state-of-the-art LLMs? How can these be incorporated in a benchmark to allow proper measurements of future improvements?

However, moving the goalpost beyond a single-page context inevitably requires us to reconsider the research challenge of efficiency in DU. The rise of LLMs has enabled a new generation of DU pipelines, which are more flexible and easier to maintain than separate and specialized subtask modules, but also more computationally demanding. Importantly, most LLMs are not designed to handle the multimodality and long context windows of multipage documents, and are often unaware of the visual and layout semantics of documents.

The research questions for Chapter 6 address the efficiency challenge in DU:

RQ 9. How can we efficiently infuse LLMs with semantic layout awareness for more focused information extraction?

RQ 10. To what degree can model compression resolve the problem of efficiency in processing documents?
|
assets/txts/pg_0041.txt
ADDED
@@ -0,0 +1,27 @@
|
1.3 Outline

Figure 1.1. Overview of publications and how they relate to the chapters.

Figure 1.2. Visual overview of the research questions and how they relate to the chapters.

After the introductory Chapters 1 and 2, we continue with the publication-based chapters that form the core of the thesis, which are structured in two parts.

Part I consists of a single chapter, Chapter 3, which presents a benchmarking study of PUQ methods applied on real-world text classification datasets with 1-D convolutional neural networks and pretrained transformers. It motivates a novel PUQ method, Deep Ensemble with Concrete Dropout, combining the benefits of both methods, and showing promise for improving reliability and robustness in NLP at a lower computational cost. The chapter concludes with a discussion of the results, including targeted ablation studies, and provides recommendations for future research.

Part II consists of three chapters, Chapters 4 to 6, which all focus on the more applied research questions of realistic and efficient DU.
|
assets/txts/pg_0042.txt
ADDED
@@ -0,0 +1,31 @@
|
Chapter 4 reflects on the current state of DU research, and proposes guidelines to foster document dataset construction efforts. It introduces two novel document classification datasets, RVL-CDIP_MP and RVL-CDIP-N_MP, as extensions of the RVL-CDIP dataset [165] with multipage documents. The datasets are accompanied by a comprehensive experimental analysis, which shows promise from advancing multipage document representations and inference.

Chapter 5 introduces the multi-faceted DUDE benchmark for assessing generic DU, that was also hosted as a competition to challenge the DU community. It describes the complete methodology and design of the dataset, targeting model innovations that can handle the complexity and variety of real-world documents and subtasks, and generalize to any documents and any questions. Next to a discussion of the competition results, it also presents our own comprehensive benchmarking study of SOTA LLMs with varying the context length and what modalities are represented.

Chapter 6 investigates how to efficiently obtain more semantic document layout awareness. We explore what affects the teacher-student knowledge gap in KD-based model compression methods, and design a downstream task setup to evaluate the robustness of distilled DLA models on zero-shot layout-aware DocVQA.

Finally, Chapter 7 concludes the thesis with a summary of the main contributions (Section 7.1), and a discussion of future research directions. As a logical follow-up to Chapter 5, we propose in Section 7.2.2.1 how the DUDE dataset could be extended to become the ‘ultimate’ DU benchmark. The thesis ends with a hypothetical, informed design of how the research presented would form part of an end-to-end, fully-fledged IA-DU solution (Section 7.2.2.2).
|
assets/txts/pg_0043.txt
ADDED
@@ -0,0 +1,32 @@
|
Chapter 2

Fundamentals

This chapter provides all the background knowledge necessary to understand the contributions of this thesis. The key questions covered here are:

i. How to feed a document to an algorithm to perform arbitrary tasks on it?
ii. How to model language, vision, layout or structure?
iii. How does it learn and then operate at inference time?
iv. How does it estimate prediction uncertainty?
v. How to evaluate its performance?
vi. How to integrate it as a useful, end-to-end system in a document workflow?

Section 2.1 explains the basic setting from the perspective of statistical learning theory [472], which is a mathematical framework for analyzing how algorithms learn from data with minimal error. Section 2.2 gives a primer on reliability and robustness, particularly calibration, failure detection and relevant evaluation metrics. Section 2.3 surveys the DU field, and discusses the state of the art in DU technology. Finally, Section 2.4 covers Intelligent Automation to illustrate how solving the challenges posed in this thesis will make it possible to augment human intelligence, creativity and productivity in straight-through business processes.
|
assets/txts/pg_0044.txt
ADDED
@@ -0,0 +1,163 @@
1 |
+
12
|
2 |
+
|
3 |
+
FUNDAMENTALS
|
4 |
+
|
5 |
+
Contents
2.1 Statistical Learning - basics . . . . . . . . . . . . . . . . . . . . . . . 12
    2.1.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 14
    2.1.2 Probabilistic Evaluation . . . . . . . . . . . . . . . . . . . . . . 15
    2.1.3 Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Reliability and Robustness . . . . . . . . . . . . . . . . . . . . . . . . 18
    2.2.1 Generalization and Adaptation . . . . . . . . . . . . . . . . . . . 19
    2.2.2 Confidence Estimation . . . . . . . . . . . . . . . . . . . . . . . 20
    2.2.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 21
    2.2.4 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
    2.2.5 Predictive Uncertainty Quantification . . . . . . . . . . . . . . . 27
    2.2.6 Failure Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3 Document Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . 30
    2.3.1 Task Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . 31
    2.3.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
    2.3.3 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
    2.3.4 Challenges in Document Understanding . . . . . . . . . . . . . . . . 35
2.4 Intelligent Automation . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.1 Statistical Learning
|
143 |
+
|
144 |
+
Two popular definitions of Machine Learning (ML) are given below.
|
145 |
+
Machine Learning is the field of study that gives computers the ability
|
146 |
+
to learn without being explicitly programmed. [406]
|
147 |
+
A computer program is said to learn from experience E with respect to
|
148 |
+
some class of tasks T, and performance measure P, if its performance
|
149 |
+
at tasks in T, as measured by P, improves with experience E. [317]
|
150 |
+
Following these, different types of learning problems [472] can be discerned, of
|
151 |
+
which the most common (and the one used throughout our works) is supervised
|
152 |
+
learning. It defines experience E as a set of input-output pairs for which the
|
153 |
+
task T is to learn a mapping f from inputs X ∈ X to outputs Y ∈ Y, and the
|
154 |
+
performance measure P is the risk or expected loss (Equation (2.1)), given a
|
155 |
+
(0-1) loss function ℓ : Y × Y → R+.
|
156 |
+
R(f) = E_{(X,Y)∼P} [ℓ(Y, f(X))]
|
157 |
+
|
158 |
+
(2.1)
|
159 |
+
|
160 |
+
The mapping f (·; θ) : X → Y is typically parameterized by a set of parameters
|
161 |
+
θ (omitted whenever it is fixed) and a hypothesis class F, which is a set of
|
162 |
+
|
163 |
+
|
assets/txts/pg_0045.txt
ADDED
@@ -0,0 +1,53 @@
1 |
+
STATISTICAL LEARNING
|
2 |
+
|
3 |
+
13
|
4 |
+
|
5 |
+
possible functions. The objective is to find a function f ∈ F that minimizes the
|
6 |
+
risk, or even better, the Bayes risk
|
7 |
+
f* = inf_{f∈F} R(f),    (2.2)
|
11 |
+
|
12 |
+
which is the minimum achievable risk over all functions in F. The latter is only
|
13 |
+
realizable with infinite data or having access to the data-generating distribution
|
14 |
+
P(X , Y). In practice, Equation (2.2) is unknown, and the goal is to find a
|
15 |
+
function fˆ that minimizes the empirical risk
|
16 |
+
R̂(f) = (1/N) ∑_{i=1}^{N} ℓ(y_i, f(x_i))    (2.3)
|
23 |
+
|
24 |
+
where (xi , yi ) are N independently and identically distributed (i.i.d.) samples
|
25 |
+
drawn from an unknown distribution P on X × Y. This is known as empirical
|
26 |
+
risk minimization (ERM), which is a popular approach to supervised learning,
|
27 |
+
under which three important processes are defined.
|
28 |
+
Training or model fitting is the process of estimating the parameters θ of a
|
29 |
+
model, which is done by minimizing a suitable loss function ` over a training
|
30 |
+
set D = {(x_i, y_i)}_{i=1}^{N} of N i.i.d. samples.
|
32 |
+
Inference or prediction is the process of estimating the output of a model for
|
33 |
+
a given input, which is typically done by computing the posterior probability
|
34 |
+
P (y|x) over the output space Y. Classification output is a discrete label, while
|
35 |
+
regression output is a continuous value.
|
36 |
+
Evaluation involves measuring the quality of a model’s predictions, which is
|
37 |
+
typically done by computing a suitable evaluation metric over a test set Dtest
|
38 |
+
of i.i.d. samples, which were not used for training.
|
39 |
+
However, ERM has its caveats concerning generalization to unseen data,
|
40 |
+
requiring either additional assumptions on the hypothesis class F, which
|
41 |
+
are known as inductive biases, and/or regularization to penalize the
|
42 |
+
complexity of the function class F [445]. In neural networks (discussed in
|
43 |
+
detail Section 2.1.1), the former is controlled by the architecture of the network,
|
44 |
+
while the latter involves specifying constraints to parameters or adding a
|
45 |
+
regularization term to the loss function.
|
46 |
+
|
47 |
+
|
48 |
+
f̂ = argmin_{f∈F} R̂(f) + λΨ(θ)    (2.4)
|
52 |
+
|
53 |
+
|
assets/txts/pg_0046.txt
ADDED
@@ -0,0 +1,47 @@
1 |
+
14
|
2 |
+
|
3 |
+
FUNDAMENTALS
|
4 |
+
|
5 |
+
Equation (2.4) defines regularized empirical risk minimization (RERM),
|
6 |
+
where Ψ(θ) is a regularization term and λ is a hyperparameter that controls the
|
7 |
+
trade-off between the empirical risk (denoted with R̂) and the regularization
|
8 |
+
term.
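To make Equations (2.3) and (2.4) concrete, the short sketch below computes a regularized empirical risk for a toy probabilistic classifier. It is only an illustration of the definitions above: the cross-entropy surrogate for the 0-1 loss, the squared L2 penalty used for Ψ(θ), and all variable names are our own assumptions, not part of the thesis text.

```python
import numpy as np

def empirical_risk(y_true, y_prob, eps=1e-12):
    # R_hat(f): average loss over the training set; cross-entropy is used here
    # as a differentiable surrogate for the 0-1 loss of Equation (2.1).
    n = len(y_true)
    return -np.mean(np.log(y_prob[np.arange(n), y_true] + eps))

def regularized_empirical_risk(y_true, y_prob, theta, lam=1e-2):
    # RERM objective of Equation (2.4): empirical risk plus lambda * Psi(theta),
    # with Psi chosen as the squared L2 norm (weight decay).
    return empirical_risk(y_true, y_prob) + lam * np.sum(theta ** 2)

y_true = np.array([0, 1, 1])                                # labels of 3 toy samples
y_prob = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])     # predicted class probabilities
theta = np.array([0.5, -1.0, 2.0])                          # toy parameter vector
print(regularized_empirical_risk(y_true, y_prob, theta))
```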
|
9 |
+
All these concepts will be revisited in the context of neural networks in
|
10 |
+
Section 2.1.1, where we will also discuss the optimization process of the model
|
11 |
+
parameters θ, how inference differs in the case of probabilistic models to estimate
|
12 |
+
uncertainty (Section 2.2.5), and how regularization affects confidence estimation
|
13 |
+
and calibration (Section 2.2.4).
|
14 |
+
|
15 |
+
2.1.1
|
16 |
+
|
17 |
+
Neural Networks
|
18 |
+
|
19 |
+
An artificial neural network (NN) is a mathematical approximation inspired
|
20 |
+
by data processing in the human brain [396]. It can be represented by a
|
21 |
+
network topology of interconnected neurons that are organized in layers that
|
22 |
+
successively refine intermediately learned feature representations of the input
|
23 |
+
[448] that are useful for the task at hand, e.g., classifying an animal by means
|
24 |
+
of its size, shape and fur, or detecting the sentiment of a review by focusing on
|
25 |
+
adjectives.
|
26 |
+
A basic NN building block is a linear layer, which is a linear function of the
|
27 |
+
input parameters: f (x) = W x + b, where the bias term b is a constant vector
|
28 |
+
shifting the decision boundary away from the origin and the weight matrix
|
29 |
+
W holds most parameters that rotate the decision boundary in input space.
|
30 |
+
Activation functions (e.g., tanh, ReLU, sigmoid, softmax, GELU) are used to
|
31 |
+
introduce non-linearity in the model, which is required for learning complex
|
32 |
+
functions.
|
33 |
+
The first deep learning (DL) network (stacking multiple linear layers) dates
|
34 |
+
back to 1965 [191], yet the term ‘Deep Learning’ was coined in 1986 [398].
|
35 |
+
The first successful DL application was a demonstration of digit recognition
|
36 |
+
in 1998 [244], followed by DL for CV [90, 223] and NLP [76]. The recent
|
37 |
+
success of DL is attributed to the availability of large datasets, the increase in
|
38 |
+
computational power, the development of new algorithms and architectures,
|
39 |
+
and the commercial interest of large companies.
|
40 |
+
Consider a conventional DL architecture as a composition of parameterized
|
41 |
+
functions. Each consists of a configuration of layers (e.g., convolution, pooling,
|
42 |
+
activation function, normalization, embeddings) determining the type of input
|
43 |
+
transformation (e.g., convolutional, recurrent, attention) with (trainable)
|
44 |
+
parameters linear/non-linear w.r.t. the input x. Given the type of input,
|
45 |
+
e.g., language which is naturally discrete-sequential, or vision which presents a
|
46 |
+
|
47 |
+
|
assets/txts/pg_0047.txt
ADDED
@@ -0,0 +1,53 @@
1 |
+
STATISTICAL LEARNING
|
2 |
+
|
3 |
+
15
|
4 |
+
|
5 |
+
Sigmoid function:   σ(z) = 1 / (1 + exp(−z))
Softmax function:   softmax(z)_k = exp(z_k) / ∑_{j=1}^{K} exp(z_j)
Table 2.1. Sigmoid and softmax activation functions for binary and multi-class classification, respectively.
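As a small, hedged illustration of Table 2.1 (variable names and the numerical-stability trick are ours, not from the thesis), both activations can be written in a few lines of NumPy; the argmax at the end corresponds to the top-1 decision rule of Equation (2.5).

```python
import numpy as np

def sigmoid(z):
    # binary case: squashes a logit into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # multi-class case: maps K logits onto the probability simplex;
    # subtracting max(z) avoids overflow without changing the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)
print(probs, probs.sum())      # normalized probabilities, summing to 1
print(int(np.argmax(probs)))   # top-1 prediction
```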
|
17 |
+
|
18 |
+
ready continuous-spatial signal, different DL architectures have been established,
|
19 |
+
which will be discussed in Section 2.1.3.
|
20 |
+
A K-class classification function with an l-layer NN with d dimensional input x ∈
|
21 |
+
R^d is shorthand f_θ : R^d → R^K, with θ = {θ_j}_{j=1}^{l} assumed to be optimized, either
|
22 |
+
partially or fully, using backpropagation and a loss function. More specifically,
|
23 |
+
it presents a non-convex optimization problem, concerning multiple feasible
|
24 |
+
regions with multiple locally optimal points within each. With maximum-likelihood estimation, the goal is to find the optimal parameters
|
25 |
+
or weights that minimize the loss function, effectively interpolating the training
|
26 |
+
data. This process involves traversing the high-dimensional loss landscape.
|
27 |
+
Upon convergence of model training, the optimized parameters form a solution
|
28 |
+
in the weight-space, representing a unique mode (specific function fθ̂ ). However,
|
29 |
+
when regularization techniques such as weight decay, dropout, or early stopping
|
30 |
+
are applied, the objective shifts towards maximum-a-posteriori (MAP), to
|
31 |
+
take into account the prior probability of the parameters. The difference in
|
32 |
+
parameter estimation forms the basis for several uncertainty estimation methods,
|
33 |
+
covered in Section 2.2.5.
|
34 |
+
A prediction is a translation of a model’s output to which a standard decision
|
35 |
+
rule is applied, e.g., to obtain the top-1/k prediction (Equation (2.5)), or decode
|
36 |
+
structured output according to a function maximizing total likelihood with
|
37 |
+
optionally additional diversity criteria.
|
38 |
+
ŷ = argmax fθ̂ (x)
|
39 |
+
|
40 |
+
(2.5)
|
41 |
+
|
42 |
+
Considering standard NNs, the last layer outputs a vector of real-valued logits
|
43 |
+
z ∈ RK , which in turn are normalized to a probability distribution over K
|
44 |
+
classes using a sigmoid or softmax function (Table 2.1).
|
45 |
+
|
46 |
+
2.1.2
|
47 |
+
|
48 |
+
Probabilistic Evaluation
|
49 |
+
|
50 |
+
The majority of our works involves supervised learning with NNs, formulated
|
51 |
+
generically as a probabilistic predictor in Definition 1.
|
52 |
+
|
53 |
+
|
assets/txts/pg_0048.txt
ADDED
@@ -0,0 +1,45 @@
1 |
+
16
|
2 |
+
|
3 |
+
FUNDAMENTALS
|
4 |
+
|
5 |
+
Definition 1 (Probabilistic Predictor). A probabilistic predictor f : X → ∆^Y outputs a conditional probability distribution P(y′|x) over outputs y′ ∈ Y for an i.i.d. drawn sample (x, y).
Definition 2 (Probability Simplex). Let ∆^Y := {v ∈ R^{|Y|}_{≥0} : ‖v‖_1 = 1} be a probability simplex of size |Y| − 1 as a geometric representation of a probability space, where each vertex represents a mutually exclusive label and each point has an associated probability vector v [368].
|
14 |
+
Figure 2.1 illustrates a multi-class classifier, where Y = [K] for K=3 classes.
|
23 |
+
|
24 |
+
Figure 2.1. Scatter plot of a ternary problem (K = 3, N = 100) in the probability
|
25 |
+
simplex space. Example of overconfident misprediction (above is a Shiba Inu dog) and
|
26 |
+
correct sharp prediction (clear image of Beagle).
|
27 |
+
|
28 |
+
In practice, loss functions are proper scoring rules [330], S : ∆Y × Y → R, that
|
29 |
+
measure the quality of a probabilistic prediction P (ŷ|x) given the true label y.
|
30 |
+
The cross-entropy (CE) loss is a popular loss function for classification, while
|
31 |
+
the mean-squared error (MSE) loss is used for regression. In Section 2.2, we
|
32 |
+
will discuss the evaluation of probabilistic predictors in more detail, including
|
33 |
+
the calibration of confidence estimates and the detection of out-of-distribution
|
34 |
+
samples.
|
35 |
+
|
36 |
+
2.1.3
|
37 |
+
|
38 |
+
Architectures
|
39 |
+
|
40 |
+
Throughout the chapters of the thesis, we have primarily used the following
|
41 |
+
NN architectures: Convolutional Neural Networks (CNNs), Transformer
|
42 |
+
Networks. We will briefly introduce the building blocks of these architectures,
|
43 |
+
with a focus on how they are used in the context of document understanding.
|
44 |
+
|
45 |
+
|
assets/txts/pg_0049.txt
ADDED
@@ -0,0 +1,41 @@
1 |
+
STATISTICAL LEARNING
|
2 |
+
|
3 |
+
2.1.3.1
|
4 |
+
|
5 |
+
17
|
6 |
+
|
7 |
+
Convolutional Neural Networks
|
8 |
+
|
9 |
+
Convolutional Neural Networks (CNNs) [244] are a class of DNNs designed
|
10 |
+
primarily for visual and grid-spatial data such as images. They are inspired by
|
11 |
+
the visual cortex of animals, which contains neurons that are sensitive to small
|
12 |
+
subregions of the visual field, called a receptive field. The receptive fields of
|
13 |
+
different neurons partially overlap such that they cover the entire visual field,
|
14 |
+
growing larger in deeper layers of the visual cortex.
|
15 |
+
|
16 |
+
Figure 2.2. Sketch of a CNN architecture. The input is a 2D image, which is iteratively
|
17 |
+
convolved with a set of learned filters detecting specific input features, e.g., edges,
|
18 |
+
corners, blobs, to produce feature maps. Feature maps are then downsampled using
|
19 |
+
a pooling operation.
|
20 |
+
|
21 |
+
As illustrated in Figure 2.2, CNNs are composed of multiple convolutional layers,
|
22 |
+
which hierarchically extract features from the input, followed by pooling and
|
23 |
+
fully-connected layers to classify the input based on the downsampled features.
|
24 |
+
A filter K ∈ Rd×d is a rectangular matrix of trainable weights with width and
|
25 |
+
height d typically smaller than the input x. A convolutional layer applies filters
|
26 |
+
sliding over the input, with each filter producing a feature map:
|
27 |
+
F = K ∗ x,
|
28 |
+
|
29 |
+
(2.6)
|
30 |
+
|
31 |
+
where the convolution operation ∗ computes a dot product between filter entries
|
32 |
+
and the covered portions of the input.
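A minimal sketch of Equation (2.6) is given below; it is a naive 'valid' convolution with stride 1 and no padding, and the edge-detecting filter is purely illustrative (deep-learning libraries actually implement the cross-correlation variant, without flipping the filter).

```python
import numpy as np

def conv2d(x, k):
    # slide the d x d filter k over input x and take a dot product with each covered patch
    H, W = x.shape
    d = k.shape[0]
    out = np.zeros((H - d + 1, W - d + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + d, j:j + d] * k)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)   # crude vertical-edge detector
print(conv2d(image, edge_filter))                 # 3x3 feature map
```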
|
33 |
+
Thanks to the weight sharing property of the convolution operation, CNNs are
|
34 |
+
able to learn translation invariance, i.e., the ability to recognize an object
|
35 |
+
regardless of its position in the image. This is particularly useful for object
|
36 |
+
detection, where the position of the object in the image is unknown.
|
37 |
+
This architecture was used for document image classification and document
|
38 |
+
layout analysis (Section 6.3.2). A special version is 1-D CNNs, which we applied
|
39 |
+
to one-hot encoded text data in text classification benchmarking (Section 3.4.3).
|
40 |
+
|
41 |
+
|
assets/txts/pg_0050.txt
ADDED
@@ -0,0 +1,46 @@
1 |
+
18
|
2 |
+
|
3 |
+
2.1.3.2
|
4 |
+
|
5 |
+
FUNDAMENTALS
|
6 |
+
|
7 |
+
Language Neural Networks
|
8 |
+
|
9 |
+
The first step to represent language input into a format compatible with NNs is
|
10 |
+
to convert units of language (words, characters, or “tokens”, depending on
|
11 |
+
the tokenizer) into numerical vectors. This is done by means of embeddings,
|
12 |
+
which are typically learned as part of the training process, and are used to
|
13 |
+
represent the meaning of words in a continuous vector space. There have been
|
14 |
+
multiple generations of word embeddings, starting with one-hot vectors that
|
15 |
+
represent each word by a vector of zeros with a single one at its vocabulary index,
|
16 |
+
which depends highly on the tokenizer used and does not capture semantic
|
17 |
+
relationships between words. Alternatives are frequency-based embeddings,
|
18 |
+
such as TF-IDF vectors, which represent each word by its frequency in the
|
19 |
+
corpus, weighted by its inverse frequency in the corpus, capturing some lexical
|
20 |
+
semantics, but not the context in which the word appears. The next generation
|
21 |
+
are Word2Vec embeddings that are trained to predict the context of a word, i.e.,
|
22 |
+
the words that appear before and after it in a sentence. FastText embeddings
|
23 |
+
improve this by considering a character n-gram context, i.e., a sequence of n
|
24 |
+
characters. The current generation are contextual word embeddings, which
|
25 |
+
take the surrounding context into account and learn the sense of a word
|
26 |
+
from that context, e.g., ‘bank’ as
|
27 |
+
a river bank vs. a financial institution in ‘Feliz sits at the bank of the river
|
28 |
+
Nete’. Another important innovation is subword tokenization to deal with
|
29 |
+
the out-of-vocabulary (OOV) problem, which is particularly relevant for
|
30 |
+
morphologically rich languages, such as Dutch, where word meaning can be
|
31 |
+
inferred from its subwords. A clever extension is byte pair encoding (BPE)
|
32 |
+
[412], which is a data compression algorithm that iteratively replaces the most
|
33 |
+
frequent pair of bytes in a sequence with a single, unused byte, until a predefined
|
34 |
+
vocabulary size is reached. This is particularly useful for multilingual models,
|
35 |
+
where the vocabulary size would otherwise be too large to fit in memory.
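The sketch below illustrates the merge-learning loop behind BPE on a toy corpus; it is a simplified, assumed implementation (symbol handling and variable names are ours) rather than the reference algorithm of [412].

```python
from collections import Counter

def merge_word(symbols, pair):
    # replace every adjacent occurrence of `pair` by the merged symbol
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(corpus, num_merges=5):
    # corpus: {tuple_of_symbols: frequency}, initially one character per symbol
    vocab, merges = dict(corpus), []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        vocab = {tuple(merge_word(list(w), best)): f for w, f in vocab.items()}
    return merges

corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
print(learn_bpe(corpus, num_merges=5))            # learned merge pairs, most frequent first
```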
|
36 |
+
The first embedding layer is typically a lookup table, which maps each word
|
37 |
+
to a unique index in a vocabulary, and each index to a vector of real numbers.
|
38 |
+
The embedding layer is typically followed by a recurrent, convolutional or
|
39 |
+
attention layer, which is used to capture the sequential nature of language.
|
40 |
+
Recurrent Neural Networks (RNNs) and recurrent architectures extended
|
41 |
+
to model long-range dependencies such as Long Short-Term Memory (LSTM)
|
42 |
+
and Gated Recurrent Unit (GRU) networks were the dominant architectures
|
43 |
+
for sequence modeling in NLP, yet they have been superseded by Transformers
|
44 |
+
in recent years.
|
45 |
+
|
46 |
+
|
assets/txts/pg_0051.txt
ADDED
@@ -0,0 +1,58 @@
1 |
+
STATISTICAL LEARNING
|
2 |
+
|
3 |
+
2.1.3.3
|
4 |
+
|
5 |
+
19
|
6 |
+
|
7 |
+
Transformer Network
|
8 |
+
|
9 |
+
A Transformer [473] is a sequence-to-sequence model that uses an attention
|
10 |
+
mechanism to capture long-range dependencies in the input sequence, benefiting
|
11 |
+
from increased parallelization. Traditionally, it consists of an encoder and a
|
12 |
+
decoder, each composed of multiple layers of self-attention and feed-forward
|
13 |
+
layers.
|
14 |
+
Attention is a mechanism that allows for soft selection of relevant information
|
15 |
+
from a set of candidates, e.g., tokens in a document, based on a query, e.g.,
|
16 |
+
a token in the document. The scaled dot-product attention is defined for a sequence of length n as follows: Att(Q, K, V) = ∑_{i=1}^{n} α_i V_i. It utilizes three learnable weight matrices, each multiplied with all token embeddings in a sequence to build queries Q ∈ R^{n×d_q}, keys K ∈ R^{n×d_q}, and values V ∈ R^{n×d_v}. The output of the attention mechanism is a weighted sum of the values, where each attention weight of the i-th key is computed by normalizing the dot product between the query and key vectors: α_i = exp(Q_i^T K_i) / ∑_{j=1}^{n} exp(Q_j^T K_j). For
34 |
+
training stability, the dot product is typically scaled by the square root of the
|
35 |
+
dimensionality of the query and key vectors. This is followed by a feed-forward
|
36 |
+
layer to capture non-linear relationships between the tokens in the sequence.
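A compact NumPy sketch of the scaled dot-product attention described above is given below (an illustration under our own naming, with an optional boolean mask to mimic the masked attention used in decoder layers).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # scores: similarity of every query with every key, scaled by sqrt(d_q) for training stability
    d_q = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_q)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)                 # blocked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                        # weighted sum of the values

n, d = 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))                                   # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)                                              # (4, 8): one updated vector per token
```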
|
37 |
+
There exist different forms of attention, depending on the type of relationship
|
38 |
+
that is captured. Self-attention computes the attention of each token w.r.t.
|
39 |
+
all other tokens in the sequence, which changes the representation of each token
|
40 |
+
based on the other tokens in the sequence. Multi-head attention is a set
|
41 |
+
of h attention layers, which every Transformer uses to concurrently capture
|
42 |
+
different types of relationships, concatenated together after the parallelized
|
43 |
+
processing. Cross-attention computes the attention of each token in one
|
44 |
+
sequence w.r.t. all tokens in another sequence, which is used in encoder-decoder
|
45 |
+
Transformer architectures for e.g., summarization and machine translation.
|
46 |
+
Specific to decoder layers, masked attention is used to prevent the decoder
|
47 |
+
from attending to future tokens in the sequence by masking the upper triangle
|
48 |
+
of the attention matrix calculation.
|
49 |
+
A major downside to Transformers is the quadratic complexity of the attention
|
50 |
+
mechanism (Figure 2.3), which makes them computationally inefficient for long
|
51 |
+
sequences. This has been addressed by a wealth of techniques [120], such as
|
52 |
+
sparsifying attention, targeting recurrence, downsampling, random or low-rank
|
53 |
+
approximations.
|
54 |
+
Position Embeddings are indispensable for Transformers to be able to process
|
55 |
+
sequences, as they do not have any notion of order or position of tokens in
|
56 |
+
a sequence. The most common type of position embedding is a sinusoidal
|
57 |
+
|
58 |
+
|
assets/txts/pg_0052.txt
ADDED
@@ -0,0 +1,38 @@
1 |
+
20
|
2 |
+
|
3 |
+
FUNDAMENTALS
|
4 |
+
|
5 |
+
|
6 |
+
|
7 |
+
Figure 2.3. Illustration of the main attention mechanisms in a Transformer.
|
8 |
+
|
9 |
+
embedding with a fixed frequency and phase, f (x) = sin(ωx + φ), where ω is the
|
10 |
+
frequency and φ is the phase which are learned as part of the training process,
|
11 |
+
and they are typically shared across all tokens in the sequence. Integrating
|
12 |
+
position information into Transformers can be achieved in different ways, for
|
13 |
+
which [105, Table 1] gives an overview.
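For illustration, the fixed sinusoidal variant popularized by the original Transformer can be generated as below (a sketch under our own assumptions; dim is taken to be even and the 10000 base follows common practice).

```python
import numpy as np

def sinusoidal_position_embeddings(seq_len, dim):
    # even dimensions use sin, odd dimensions use cos, with geometrically spaced frequencies
    positions = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))     # (dim/2,)
    angles = positions * freqs                                # (seq_len, dim/2)
    emb = np.zeros((seq_len, dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

print(sinusoidal_position_embeddings(seq_len=6, dim=8).round(2))
```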
|
14 |
+
Transformers have gradually taken over as an end-to-end architecture for both
|
15 |
+
NLP and CV tasks, albeit adoption in CV has been slower, due to the lack
|
16 |
+
of spatial invariance in the original Transformer architecture. This has been
|
17 |
+
addressed by recent works, such as Vision Transformer (ViT) [101], which uses
|
18 |
+
a patch-based input representation with position embeddings.
|
19 |
+
A large language model (LLM) consists of a stack of Transformers that is
|
20 |
+
pretrained on a large corpus of text, typically using a self-supervised learning
|
21 |
+
objective, such as predicting the next token in a sequence. The goal of LLMs
|
22 |
+
is to learn a general-purpose language representation that can be fine-tuned
|
23 |
+
to perform well on a wide range of downstream tasks. LLMs have disrupted
|
24 |
+
NLP in recent years, as they have achieved SOTA performance on a wide
|
25 |
+
range of tasks thanks to pretraining on large amounts of data. The most
|
26 |
+
popular LLMs are BERT [95], RoBERTa [287], ELECTRA [73], T5 [383],
|
27 |
+
GPT-3 [52], Llama-2 [452], and Mistral [199]. Next to challenges specific to
|
28 |
+
modeling document inputs, explained in Section 2.3.4, open challenges for
|
29 |
+
LLMs include: (i) structured output generation, (ii) domain-specific knowledge
|
30 |
+
injection (e.g., does retrieval-augmented generation (RAG) suffice? [253, 347]),
|
31 |
+
(iii) multimodality.
|
32 |
+
Vision-language models (VLM) are a recent development in multimodal
|
33 |
+
learning, which combine the power of LLMs with vision encoders to perform
|
34 |
+
tasks that require understanding both visual and textual information. The most
|
35 |
+
popular VLMs are CLIP [381], UNITER [70], FLAVA [423] and GPT-4 [344].
|
36 |
+
In every chapter of this dissertation we have used Transformers, either as part
|
37 |
+
|
38 |
+
|
assets/txts/pg_0053.txt
ADDED
@@ -0,0 +1,46 @@
1 |
+
RELIABILITY AND ROBUSTNESS
|
2 |
+
|
3 |
+
21
|
4 |
+
|
5 |
+
of a foundation model for DU tasks (Chapters 4 to 6) or to contrast with 1-D
|
6 |
+
CNNs in text classification (Chapter 3). Note that [265] share our concerns that
|
7 |
+
NLP needs a new ‘playground’ with more realistic tasks and benchmarks, which
|
8 |
+
extend beyond sentence-level contexts to more complex document-level tasks.
|
9 |
+
Alternative sub-quadratic architectures have started addressing Transformer’s
|
10 |
+
computational inefficiency on long sequences, e.g., Mamba [152] and Longnet
|
11 |
+
[99]. Time will tell if these will be able to compete with the Transformer’s
|
12 |
+
dominance in foundation models.
|
13 |
+
|
14 |
+
2.2
|
15 |
+
|
16 |
+
Reliability and Robustness
|
17 |
+
|
18 |
+
Chapter 3 contains a lot of relevant content on the basic relation between
|
19 |
+
uncertainty quantification, calibration, and distributional generalization or
|
20 |
+
detection tasks. Here, we will focus on the more general concepts of reliability
|
21 |
+
and robustness, and how they relate to concepts used throughout the rest of
|
22 |
+
the thesis. Next, we discuss the need for confidence estimation and appropriate
|
23 |
+
evaluation metrics, followed by short summaries of the main research trends in
|
24 |
+
calibration and uncertainty quantification.
|
25 |
+
Emerging guidance and regulations [2, 3, 475] place increasing importance on
|
26 |
+
the reliability and robustness of ML systems, particularly once they are used
|
27 |
+
in the public sphere or in safety-critical applications. In ML, reliability and
|
28 |
+
robustness are often used interchangeably [78, 420, 455], yet they are distinct
|
29 |
+
concepts, and it is important to understand the difference between them. This
|
30 |
+
thesis uses the following definitions of reliability and robustness, adapted from
|
31 |
+
systems engineering literature [395]:
|
32 |
+
Definition 3 [Reliability]. Reliability is the ability of a system to consistently
|
33 |
+
perform its intended function in a specific, known environment for a specific
|
34 |
+
period of time, with a specific level of expected accuracy [395]. Closer to the ML
|
35 |
+
context, this entails all evaluation under the i.i.d. assumption, allowing for some
|
36 |
+
benign shifts of the distribution, including predictive performance evaluation
|
37 |
+
with task-dependent metrics (accuracy, F1, perplexity, etc.), calibration, selective
|
38 |
+
prediction, uncertainty estimation, etc.
|
39 |
+
Reliability requires to clearly specify the role an ML component plays in a
|
40 |
+
larger system, and to define the expected behavior of the system as a function
|
41 |
+
of alignment with the training data distribution. This is particularly important
|
42 |
+
in the context of black-box models, where the inner workings of the model are
|
43 |
+
not transparent to the user. In this case, the user needs to be aware of the
|
44 |
+
model’s limitations, e.g., model misspecification, lack of training data, and the
|
45 |
+
|
46 |
+
|
assets/txts/pg_0054.txt
ADDED
@@ -0,0 +1,45 @@
1 |
+
22
|
2 |
+
|
3 |
+
FUNDAMENTALS
|
4 |
+
|
5 |
+
model needs to be able to communicate its own uncertainty to the user. This is
|
6 |
+
the focus of Chapter 3.
|
7 |
+
Definition 4 [Robustness]. Robustness is the ability of a system to maintain
|
8 |
+
its intended function despite a wide range of disturbances, with a minimal
|
9 |
+
degradation of performance [395]. Such disturbances can take the form of
|
10 |
+
adversarial attacks, distributional shifts, or other types of noise. In the ML
|
11 |
+
context, this entails all evaluation violating the i.i.d. assumption, including
|
12 |
+
adversarial and label noise robustness, out-of-distribution detection, domain
|
13 |
+
generalization, extrapolation, etc.
|
14 |
+
Robustness is more involved with the application scope in which a model can
|
15 |
+
perform well, assuming that the model can maintain some degree of its prediction
|
16 |
+
capacity on non-i.i.d. data which might be unknown at training time. Detecting
|
17 |
+
when the model is operating outside of its intended scope is an important part
|
18 |
+
of robustness to prevent failure propagation to downstream systems.
|
19 |
+
Resilience is another component of the R3 : reliability, robustness, resilience
|
20 |
+
concept in systems engineering, yet it is not a focus of this thesis, nor is it
|
21 |
+
a relevant qualifier of the ML model in isolation, as it is more related to the
|
22 |
+
system as a whole. Resilient systems are able to recover from disturbances, even
|
23 |
+
those caused by model misspecification, e.g., by adapting to new environments
|
24 |
+
and unexpected inputs from unknown distributions or by self-healing.
|
25 |
+
|
26 |
+
2.2.1
|
27 |
+
|
28 |
+
Generalization and Adaptation
|
29 |
+
|
30 |
+
To complete the R3 picture, we cannot overlook the generalization-adaptation spectrum, which has been less explored in our works, yet it is an
|
31 |
+
important part of current practices in ML.
|
32 |
+
Definition 5 [Generalization-adaptation]. Generalization is the ability of
|
33 |
+
a system to perform its intended function in a wide range of environments,
|
34 |
+
including those not known at design time [395]. Each environment is defined by
|
35 |
+
a data distribution over a domain and a task, and generalization is the ability
|
36 |
+
of a model to perform well on new data drawn from the same distribution.
|
37 |
+
Adaptation is the ability of a system to perform its intended function in a specific,
|
38 |
+
known environment, despite changes in the system itself or its environment
|
39 |
+
[395]. This entails the ability of a model to perform well on new data drawn
|
40 |
+
from a different distribution, which is known at design time.
|
41 |
+
Different settings of generalization-adaptation are: in-distribution (same
|
42 |
+
domain and task), domain generalization (same task, different domain), task
|
43 |
+
generalization (same domain, different task), out-of-distribution (different
|
44 |
+
|
45 |
+
|
assets/txts/pg_0055.txt
ADDED
@@ -0,0 +1,45 @@
1 |
+
RELIABILITY AND ROBUSTNESS
|
2 |
+
|
3 |
+
23
|
4 |
+
|
5 |
+
domain or task). If the model has access to only a few samples for training
|
6 |
+
on the new distribution, this is referred to as few-shot learning (or zero-shot
|
7 |
+
learning when no samples are available at all); if it is able to adapt to new distributions over time, or
|
8 |
+
accumulate knowledge over different tasks without retraining from scratch [87],
|
9 |
+
it is referred to as continual learning or incremental learning.
|
10 |
+
Many of these settings are referred to in business as out-of-the-box, self-learning,
|
11 |
+
yet without any formal definitions given. Domain and task generalization are
|
12 |
+
major selling points of pretrained LLMs, which are able to perform well on a
|
13 |
+
wide range of tasks and domains. In the case of very different distributions, e.g.,
|
14 |
+
a different task/expected output or an additional domain/input modality, it is
|
15 |
+
often necessary to fine-tune the model on a small amount of data from the new
|
16 |
+
distribution, which is known as transfer learning. Specific to LLMs, instruction
|
17 |
+
tuning is a form of transfer learning, where samples from a new distribution are
|
18 |
+
appended with natural language instructions [69, 532]. This approach has been
|
19 |
+
used in Chapter 5 to adapt pretrained LLMs to the task of DocVQA, in an
|
20 |
+
effort to reduce the amount of annotated data required to generalize to unseen
|
21 |
+
domains and questions.
|
22 |
+
|
23 |
+
2.2.2
|
24 |
+
|
25 |
+
Confidence Estimation
|
26 |
+
|
27 |
+
A quintessential component of reliability and robustness requires a model to
|
28 |
+
estimate its own uncertainty, or inversely to translate model outputs into
|
29 |
+
probabilities or ‘confidence’ (Definition 6).
|
30 |
+
Definition 6 [Confidence Scoring Function]. Any function g : X → R
|
31 |
+
whose continuous output aims to separate a model’s failures from correct
|
32 |
+
predictions can be interpreted as a confidence scoring function (CSF) [193].
|
33 |
+
Note that while it is preferable to have the output domain of g ∈ [0, 1] for easier
|
34 |
+
thresholding, this is not a strict requirement.
|
35 |
+
Circling back on the question of why one needs a CSF, there are multiple reasons:
|
36 |
+
i) ML models are continually improving, yet zero test error is an illusion: even a
|
37 |
+
toy dataset (MNIST) is not perfectly separable; ii) once a model is deployed,
|
38 |
+
performance deterioration is expected due to i.i.d. assumptions breaking; iii)
|
39 |
+
generative models are prone to hallucinations [198], requiring some control
|
40 |
+
mechanisms and guardrails to guide them.
|
41 |
+
Below, we present some common CSFs used in practice [114, 172, 194, 539],
|
42 |
+
where for convenience the subscript is reused to denote the k-th element of the
|
43 |
+
output vector g(x) = gk (x).
|
44 |
+
|
45 |
+
|
assets/txts/pg_0056.txt
ADDED
@@ -0,0 +1,39 @@
1 |
+
24
|
2 |
+
|
3 |
+
FUNDAMENTALS
|
4 |
+
|
5 |
+
I. Maximum softmax probability (MSP): g(x) = max_{y′∈Y} f_{y′}(x)
II. Maximum logit: g(x) = max_{y′∈Y} z_{y′}(x), with logits z ∈ R^K
III. Negative entropy: g(x) = − ∑_{y′∈Y} f_{y′}(x) log f_{y′}(x)
IV. Margin: g(x) = max_{y′∈Y} f_{y′}(x) − max_{y″∈Y∖{y′}} f_{y″}(x) (CSFs I-IV are sketched in code below)
V. Distance-based measures
   • kNN distance: a 1-D outlier score derived from the average distance of the feature representation of x to its k nearest neighbors in the training distribution
   • Mahalanobis distance [390]: the minimum distance of the feature map (e.g., penultimate layer activations) of a test input to class-conditional Gaussian distributions of the training data
VI. Bayesian uncertainty estimation
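A hedged sketch of CSFs I-IV, computed from the logits of a single prediction (variable names and the sign convention for the entropy-based score are our own choices):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def csf_scores(logits):
    # confidence scoring functions I-IV; for every score, higher means more confident
    probs = softmax(logits)
    top2 = np.sort(probs)[::-1][:2]
    return {
        "msp": probs.max(),                                    # I. maximum softmax probability
        "max_logit": logits.max(),                             # II. maximum logit
        "neg_entropy": np.sum(probs * np.log(probs + 1e-12)),  # III. entropy-based score (negated entropy)
        "margin": top2[0] - top2[1],                           # IV. margin between the two best classes
    }

print(csf_scores(np.array([3.2, 0.1, -1.5])))   # confident prediction -> high scores
print(csf_scores(np.array([0.2, 0.1, 0.0])))    # ambiguous prediction -> low scores
```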
|
17 |
+
Chapter 3 used MSP and negative entropy as CSFs, next to various PUQ
|
18 |
+
methods for Bayesian uncertainty estimation. Other chapters used MSP as it
|
19 |
+
is the most common CSF in practice, requiring only logits as input. From the
|
20 |
+
use of CSFs also follows the need to evaluate their statistical quality next to
|
21 |
+
task-specific predictive performance metrics, which is discussed next.
|
22 |
+
|
23 |
+
2.2.3
|
24 |
+
|
25 |
+
Evaluation Metrics
|
26 |
+
|
27 |
+
In an ideal world, the evaluation metric of interest would be the same as the loss
|
28 |
+
function used for training, yet this is rarely the case in practice, as the gradientbased optimization process requires a continuously differentiable function, while
|
29 |
+
the metric of interest is often non-differentiable, e.g., accuracy vs. cross-entropy
|
30 |
+
in classification.
|
31 |
+
Throughout our works, we have used (or extended) multiple predictive
|
32 |
+
performance, calibration, and robustness metrics, of which the most interesting
|
33 |
+
are respectively outlined.
|
34 |
+
Average Normalized Levenshtein Similarity (ANLS) is a metric introduced in [39] for the evaluation of VQA, which was then extended [449] to
|
35 |
+
support lists and be invariant to the order of provided answers. We adapted the
|
36 |
+
underlying Levenshtein Distance (LD) metric [251] to support not-answerable
|
37 |
+
questions, NA(G) = I[type(G) = not-answerable ] (see Equation (2.7)).
|
38 |
+
|
39 |
+
|
assets/txts/pg_0057.txt
ADDED
@@ -0,0 +1,98 @@
1 |
+
RELIABILITY AND ROBUSTNESS
|
2 |
+
|
3 |
+
25
|
4 |
+
|
5 |
+
Consider for simplicity, the evaluation of a single non-list ground truth answer
|
6 |
+
G and prediction P̂ , each with string lengths |G| and |P̂ |, respectively.
|
7 |
+
|
8 |
+
LD(G, P̂) =
    1                                        if NA(G) ∧ |P̂| > 0,
    0                                        if NA(G) ∧ |P̂| = 0,
    |G|                                      if |P̂| = 0,
    LD(tail(G), tail(P̂))                     if G[0] = P̂[0],
    1 + min{ LD(tail(G), P̂)        (deletion),
             LD(G, tail(P̂))        (insertion),
             LD(tail(G), tail(P̂))  (substitution) }    if G[0] ≠ P̂[0]
                                                              (2.7)
|
41 |
+
Each of the conditions is tested in turn, and the first one that is true is executed.
|
42 |
+
The normalized similarity metric is then defined as
|
43 |
+
NLS(G, P̂) = 1 − LD(G, P̂) / max(1, |G|, |P̂|).
|
49 |
+
|
50 |
+
Given multiple ground truth answer variants G = {a1 , a2 , ...} and a predicted
|
51 |
+
answer P̂_{Q_i} for each question Q_i in the test set of size N, we define the
complete metric as follows:

ANLS = (1/N) ∑_{i=1}^{N} max_{a ∈ G_i} s(a, P̂_{Q_i})    (2.8)

s(a, P̂_{Q_i}) = NLS(a, P̂_{Q_i})  if NLS(a, P̂_{Q_i}) ≥ τ,  and  s(a, P̂_{Q_i}) = 0  otherwise,    (2.9)
|
85 |
+
|
86 |
+
where we follow prior literature [39, 449] in setting the threshold τ = 0.5.
|
87 |
+
In the case of a list-type question, Hungarian matching is performed following
|
88 |
+
[449] according to NLS between each ground truth answer part and each
|
89 |
+
prediction answer part.
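For reference, a compact sketch of NLS/ANLS as used above is shown below; it covers the standard edit distance and the τ = 0.5 threshold, but omits the not-answerable extension of Equation (2.7) and the Hungarian matching for list answers (all names are illustrative).

```python
def levenshtein(g, p):
    # classic dynamic-programming edit distance between ground truth g and prediction p
    dp = list(range(len(p) + 1))
    for i, gc in enumerate(g, 1):
        prev, dp[0] = dp[0], i
        for j, pc in enumerate(p, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (gc != pc))
    return dp[len(p)]

def nls(g, p):
    # normalized Levenshtein similarity
    return 1.0 - levenshtein(g, p) / max(1, len(g), len(p))

def anls(ground_truths, predictions, tau=0.5):
    # per question: best NLS over all accepted answer variants, zeroed below tau, then averaged
    scores = []
    for variants, pred in zip(ground_truths, predictions):
        best = max(nls(a, pred) for a in variants)
        scores.append(best if best >= tau else 0.0)
    return sum(scores) / len(scores)

gts = [["approximately 30", "30"], ["Brussels"]]
preds = ["30", "brussel"]
print(anls(gts, preds))
```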
|
90 |
+
Proper scoring rules [330] are used for generic evaluation of predictive
|
91 |
+
performance, which calculate scoring at the instance-level while measuring both
|
92 |
+
the quality of the predictive function and predicted probability distribution (as
|
93 |
+
they are not compatible with an arbitrary CSF):
|
94 |
+
• Negative Log Likelihood (NLL) [378] is both a popular loss function
|
95 |
+
(cross-entropy) and scoring rule which only penalizes (wrong) log
|
96 |
+
probabilities qi given to the true class, with I an indicator function defining
|
97 |
+
|
98 |
+
|
assets/txts/pg_0058.txt
ADDED
@@ -0,0 +1,62 @@
1 |
+
26
|
2 |
+
|
3 |
+
FUNDAMENTALS
|
4 |
+
|
5 |
+
the true class. This measure more heavily penalizes sharp probabilities,
|
6 |
+
which are close to the wrong edge or class by over/under-confidence.
|
7 |
+
ℓ_NLL(f) = − (1/N) ∑_{i=1}^{N} ∑_{k=1}^{K} I[y_i = k] · log(f_k(x_i))    (2.10)
|
17 |
+
|
18 |
+
• Brier Score [50] is a scoring rule that measures the accuracy of a
|
19 |
+
probabilistic classifier and is related to the mean-squared error (MSE) loss
|
20 |
+
function. Brier score is more commonly used in industrial practice since it
|
21 |
+
is an ℓ2 metric (score between 0 and 1), yet it penalizes tail probabilities
|
22 |
+
less severely than NLL.
|
23 |
+
ℓ_BS(f) = (1/N) ∑_{i=1}^{N} ∑_{k=1}^{K} ( I[y_i = k] − f_k(x_i) )²    (2.11)
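Both scoring rules are a few lines of NumPy; the sketch below (with our own variable names) evaluates Equations (2.10) and (2.11) on a toy batch.

```python
import numpy as np

def nll(probs, labels, eps=1e-12):
    # Equation (2.10): mean negative log probability assigned to the true class
    n = len(labels)
    return -np.mean(np.log(probs[np.arange(n), labels] + eps))

def brier(probs, labels):
    # Equation (2.11): mean squared error between one-hot labels and predicted probabilities
    n, k = probs.shape
    onehot = np.eye(k)[labels]
    return np.mean(np.sum((onehot - probs) ** 2, axis=1))

probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
labels = np.array([0, 1, 0])          # the last prediction is confidently wrong
print(nll(probs, labels), brier(probs, labels))
```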
|
34 |
+
|
35 |
+
All metrics following require a CSF g(x) to be defined, and can pertain to
|
36 |
+
specific evaluation settings [389] tested in Section 3.4.5.
|
37 |
+
Expected Calibration Error (ECE) [156, 332] is a default metric to evaluate
|
38 |
+
top-1 prediction miscalibration. A calibration estimator (Definition 7) measures
|
39 |
+
the Lp norm difference between a model’s posterior and the true likelihood of
|
40 |
+
being correct.
|
41 |
+
Definition 7 (Lp Calibration Error). [231, 463]
|
42 |
+
The Lp calibration error of f : X → ∆Y over the joint distribution (X × Y )
|
43 |
+
with the Lp norm p ∈ [1, ∞) is given by:
|
44 |
+
|
45 |
+
|
46 |
+
CE_p(f)^p = E_{(X,Y)} [ ‖ E[Y | f(X)] − f(X) ‖_p^p ]    (2.12)
|
48 |
+
The popular ECE metric [332] with condition I[Y = ŷ] is a special case of the
|
49 |
+
above with p = 1, where the expectation is approximated using a histogram.
|
50 |
+
MaxCE defines the worst-case risk version with p = ∞, effectively reporting on
|
51 |
+
the bin with the highest error. As part of Chapter 5, we contributed a novel
|
52 |
+
empirical estimator of top-1 calibration for the task of VQA, where the exact
|
53 |
+
accuracy condition I[Y = ŷ] in ECE is replaced by I[ANLS(y, ŷ) > τ]. Prior
|
54 |
+
work [329] used a similar strategy of thresholding continuous quality scores to
|
55 |
+
be able to estimate ECE.
|
56 |
+
In practice, ECE is implemented as a histogram binning estimator that
|
57 |
+
discretizes predicted probabilities into ranges of possible values for which
|
58 |
+
conditional expectation can be estimated. Concretely, the probability space
|
59 |
+
is partitioned into B bins bi with i ∈ {1, ..., B}, where for each bin bi the gap
|
60 |
+
between observed accuracy and bin confidence P¯b is measured, with a final
|
61 |
+
|
62 |
+
|
assets/txts/pg_0059.txt
ADDED
@@ -0,0 +1,64 @@
1 |
+
RELIABILITY AND ROBUSTNESS
|
2 |
+
|
3 |
+
27
|
4 |
+
|
5 |
+
average weighted by the number of samples per bin |bi |.
|
6 |
+
ECE = ∑_{i=1}^{B} (|b_i| / N) · | acc(b_i) − P̄_b(b_i) |    (2.13)
|
18 |
+
|
19 |
+
To minimize the drawbacks inherited from histogram binning, as suggested
|
20 |
+
by the literature [231, 342, 393, 463], we have applied an equal-mass binning
scheme with 100 bins (close to √N). While plenty of histogram-based ECE
|
23 |
+
estimator implementations exist, many design hyperparameters are not reported
|
24 |
+
or exposed:
|
25 |
+
I. The ℓ_p norm
II. The number of bins (beyond the unfounded default of |B| = 15)
III. Different binning schemes (equal-range, equal-mass)
IV. Binning range to define the operating zone
V. Proxy used as bin accuracy (lower-edge, center, upper-edge)
|
36 |
+
|
37 |
+
We upstreamed 1 a generic implementation of binning-based ECE as part of
|
38 |
+
the ICDAR 2023 DUDE competition (Chapter 5).
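A minimal equal-mass binning estimator (a simplified sketch, not the upstreamed implementation itself) looks as follows:

```python
import numpy as np

def ece_equal_mass(confidences, correct, num_bins=100):
    # Equation (2.13) with equal-mass bins: sort by confidence, split into bins of ~equal size,
    # and average the |accuracy - mean confidence| gap weighted by the bin size
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(confidences)
    n = len(confidences)
    ece = 0.0
    for b in np.array_split(order, num_bins):
        if len(b) == 0:
            continue
        gap = abs(correct[b].mean() - confidences[b].mean())
        ece += (len(b) / n) * gap
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
correct = rng.random(1000) < conf          # a roughly calibrated toy model
print(ece_equal_mass(conf, correct, num_bins=10))
```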
|
39 |
+
Alternative formulations have been developed for multi-class [342, 370, 492]
|
40 |
+
and multi-label calibration [493, 520]. Measurements of “strong” calibration,
|
41 |
+
over the full predicted vector instead of the winning class, are reported less in
|
42 |
+
practice. Possible reasons are that they render class-wise scorings, either based
|
43 |
+
on adaptive thresholds or require estimation of kernel-based calibration error
|
44 |
+
to derive hypothesis tests. While we are mindful of alternatives (revisited in
|
45 |
+
Section 2.2.4), we have found that the simpler “weak” calibration measured by
|
46 |
+
ECE meets the practical requirements for most of our benchmarking.
|
47 |
+
Area-Under-Risk-Coverage-Curve (AURC) [138, 193] measures the possible trade-offs between coverage (proportion of test set%) and risk (error %
|
48 |
+
under given coverage). The metric explicitly assesses i.i.d. failure detection
|
49 |
+
performance as desired for safe deployment. It has advantages as a primary
|
50 |
+
evaluation metric given that it is effective both when underlying prediction
|
51 |
+
models are the same or different (as opposed to AUROC or AUPR). Its most
|
52 |
+
general form (without any curve approximation), with a task-specific evaluation
|
53 |
+
metric ℓ and CSF g, is defined as:

AURC(f, g) = E_{x∼P_X} [ E_{(x̃,ỹ)∼P_XY}[ ℓ([f(x̃)], ỹ) · I[g(x̃) > g(x)] ] / E_{x̃∼P_X}[ I[g(x̃) > g(x)] ] ]    (2.14)
|
60 |
+
This captures the intuition that the CSF g should be able to rank instances by
|
61 |
+
their risk, and that the risk should be low for instances with high confidence.
|
62 |
+
1 https://huggingface.co/spaces/jordyvl/ece
|
63 |
+
|
64 |
+
|
assets/txts/pg_0060.txt
ADDED
@@ -0,0 +1,53 @@
1 |
+
28
|
2 |
+
|
3 |
+
FUNDAMENTALS
|
4 |
+
|
5 |
+
The standard curve metric can be obtained by sorting all CSF estimates and
evaluating risk ( FP / (TP + FP) ) and coverage ( (TP + FP) / (TP + FP + FN + TN) ) for each
threshold t (P if above threshold) from high to low, together with their respective correctness
(T if correct). This is normally based on exact match, yet for generative evaluation
|
12 |
+
in Section 5.3.5, we have applied ANLS thresholding instead. Formulated
|
13 |
+
this way, the best possible AURC is constrained by the model’s test error
|
14 |
+
(1-ANLS) and the number of test instances. AURC might be more sensible for
|
15 |
+
evaluating in a high-accuracy regime (e.g., 95% accuracy), where risk can be
|
16 |
+
better controlled and error tolerance is an apriori system-level decision [115].
|
17 |
+
This metric was used in every chapter of Part II.
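The empirical computation can be sketched in a few lines (our own simplified version, averaging selective risk over all coverage levels):

```python
import numpy as np

def aurc(confidences, errors):
    # sort by confidence (most confident first), then average the selective risk
    # observed at every coverage level 1/N, 2/N, ..., 1
    confidences = np.asarray(confidences)
    errors = np.asarray(errors, dtype=float)       # 1 = wrong (or ANLS below threshold), 0 = correct
    order = np.argsort(-confidences)
    cumulative_risk = np.cumsum(errors[order]) / np.arange(1, len(errors) + 1)
    return cumulative_risk.mean()

conf = np.array([0.95, 0.90, 0.80, 0.60, 0.55])
err = np.array([0, 0, 1, 0, 1])
print(aurc(conf, err))                              # lower is better; a perfect ranking minimizes AURC
```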
|
18 |
+
For the evaluation under distribution shift in Chapter 3, we have used binary
|
19 |
+
classification metrics following [172], Area Under the Receiver Operating
|
20 |
+
Characteristic Curve (AUROC) and Area Under the Precision-Recall
|
21 |
+
Curve (AUPR), which are threshold-independent measures that summarize
|
22 |
+
detection statistics of positive (out-of-distribution) versus negative (in-distribution) instances. In this setting, AUROC corresponds to the probability
|
23 |
+
that a randomly chosen out-of-distribution sample is assigned a higher confidence
|
24 |
+
score than a randomly chosen in-distribution sample. AUPR is more informative
|
25 |
+
under class imbalance.
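Both metrics are readily available in scikit-learn; the snippet below (toy data, our own OOD-score convention) treats out-of-distribution samples as the positive class.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

labels = np.array([0, 0, 0, 0, 1, 1, 1])                     # 1 = out-of-distribution (positive)
ood_score = np.array([0.1, 0.2, 0.15, 0.4, 0.8, 0.55, 0.9])  # e.g. negated MSP: higher = more OOD

print(roc_auc_score(labels, ood_score))             # AUROC
print(average_precision_score(labels, ood_score))   # AUPR (average precision)
```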
|
26 |
+
|
27 |
+
2.2.4
|
28 |
+
|
29 |
+
Calibration
|
30 |
+
|
31 |
+
The study of calibration originated in the meteorology and statistics literature,
|
32 |
+
primarily in the context of proper loss functions [330] for evaluating
|
33 |
+
probabilistic forecasts. Calibration promises i) interpretability, ii) system
|
34 |
+
integration, iii) active learning, and iv) improved accuracy. A calibrated model,
|
35 |
+
as defined in Definition 8, can be interpreted as a probabilistic model, which can
|
36 |
+
be integrated into a larger system, and can guide active learning with potentially
|
37 |
+
fewer samples. Research into calibration regained popularity after repeated
|
38 |
+
empirical observations of overconfidence in DNNs [156, 339].
|
39 |
+
Definition 8 (Perfect calibration). [86, 88, 520] Calibration is a property of
|
40 |
+
an empirical predictor f , which states that on finite-sample data it converges
|
41 |
+
to a solution where the confidence scoring function reflects the probability ρ of
|
42 |
+
being correct. Perfect calibration, CE(f ) = 0, is satisfied iff:
|
43 |
+
P(Y = Ŷ | f(X) = ρ) = ρ,    ∀ρ ∈ [0, 1]    (2.15)
|
48 |
+
|
49 |
+
Below, we characterize calibration research in two directions: (A) CSF evaluation
|
50 |
+
with both theoretical guarantees and practical estimation methodologies
|
51 |
+
• Estimators for calibration notions beyond top-1 [229, 231, 342, 463]
|
52 |
+
|
53 |
+
|
assets/txts/pg_0061.txt
ADDED
@@ -0,0 +1,42 @@
1 |
+
RELIABILITY AND ROBUSTNESS
|
2 |
+
|
3 |
+
29
|
4 |
+
|
5 |
+
• Theoretical frameworks to generalize over existing metrics and design
|
6 |
+
novel metrics [43, 231, 492, 493]
|
7 |
+
• Specialize towards a task such as multi-class classification [463], regression
|
8 |
+
[228, 428], or structured prediction [227]
|
9 |
+
• Alternative error estimation procedures, based on histogram regression
|
10 |
+
[156, 331, 332, 340, 343], kernels [230, 370, 492, 493] or splines [159]
|
11 |
+
(B) Calibration methods for improving the reliability of a model by adapting
|
12 |
+
the CSF or inducing calibration during training of f :
|
13 |
+
• Learn a post-hoc forecaster F : f (X) → [0, 1] on top of f (overview: [298])
|
14 |
+
• Modify the training procedure with regularization (overview: [277, 370])
|
15 |
+
Due to its importance in practice, we will provide more detail on train-time
|
16 |
+
calibration methods. It has been shown for a broad class of loss functions
|
17 |
+
that risk minimization leads to Fisher consistent, Bayes optimal classifiers in
|
18 |
+
the asymptotic limit [25, 495]. These can be shown to decompose into a sum
|
19 |
+
of multiple metrics including both accuracy and calibration error [144, 177].
|
20 |
+
However, there is no –finite data, nor asymptotic– guarantee that classifiers
|
21 |
+
trained with proper loss functions containing an explicit calibration term
|
22 |
+
will eventually be well-calibrated. In practice, being entangled with other
|
23 |
+
optimization terms often leads to sub-optimal calibration. For this reason,
|
24 |
+
recent studies [12, 230, 492] have derived trainable estimators of calibration
|
25 |
+
to have a better handle (γ > 0) on penalizing miscalibration, i.e., by jointly
|
26 |
+
optimizing risk (R(f ) = EX,Y [` (Y, f (X))]) and parameterized calibration error
|
27 |
+
(CE) as in Equation (2.16).
|
28 |
+
f̂ = argmin_{f∈F} ( R(f) + γ CE(f) )    (2.16)
|
32 |
+
|
33 |
+
Many of these methods are implicitly or explicitly maximizing entropy of
|
34 |
+
predictions or entropy relative to another probability distribution, e.g., Entropy
|
35 |
+
Regularization [361], Label Smoothing (LS) [327], Focal Loss [324], Margin-based LS [277], next to more direct (differentiable), kernel-based calibration
|
36 |
+
error estimation [211, 230, 370, 492, 493, 526]. We had expected community
|
37 |
+
contribution on the DUDE competition (Chapter 5) to take advantage of this
|
38 |
+
wealth of calibration methods, yet the majority of submissions used uncalibrated
|
39 |
+
models with MSP, requiring more education on the importance of calibration
|
40 |
+
in practice.
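As one concrete member of this entropy-maximizing family, label smoothing can be sketched as below (a hedged toy example with an illustrative smoothing factor, not the exact formulation of [327]):

```python
import numpy as np

def label_smoothing_ce(probs, labels, alpha=0.1):
    # cross-entropy against smoothed targets q = (1 - alpha) * onehot + alpha / K,
    # which penalizes overly sharp (low-entropy) predictions and tends to improve calibration
    n, k = probs.shape
    onehot = np.eye(k)[labels]
    targets = (1.0 - alpha) * onehot + alpha / k
    return -np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1))

probs = np.array([[0.98, 0.01, 0.01], [0.4, 0.5, 0.1]])
labels = np.array([0, 1])
print(label_smoothing_ce(probs, labels, alpha=0.0))   # plain cross-entropy
print(label_smoothing_ce(probs, labels, alpha=0.1))   # smoothed targets penalize the very sharp prediction
```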
|
41 |
+
|
42 |
+
|