yujunhuinlp committed · Commit 45e69b4 · verified · 1 Parent(s): ae5237a

Update README_EN.md

Files changed (1)
  1. README_EN.md +9 -3
README_EN.md CHANGED
@@ -4,11 +4,15 @@
 
  ## I. Background
 
- In today's digital era, **Document Layout Analysis** is one of the key steps in information extraction and document understanding. In today's digital era, **Document Layout Analysis** is one of the key steps in information extraction and document understanding. Also known as document image analysis or document layout analysis, it involves the process of identifying and extracting text, images, tables, and other elements from scanned document images. This technology has a broad range of applications in automated document processing, electronic data exchange, historical document digitization, and other fields. Traditional document layout analysis models often struggle to accurately distinguish between paragraphs and other layout elements within documents, which limits further processing and utilization of document information. The advancement of deep learning and pattern recognition technologies has brought new opportunities for document layout analysis. By training datasets, the model's understanding of document structure can be enhanced. High-quality annotated datasets are fundamental to training effective models. In document layout analysis, detailed annotation is essential, particularly the annotation of **paragraphs**, as it directly affects semantic understanding and information extraction of the text.
+ In the current digital era, **Document Layout Analysis** is one of the key steps for information extraction and document comprehension. Also known as document image analysis or document layout analysis, it is the process of identifying and extracting text, images, tables, and other elements from scanned document images. This technology has a wide range of applications in areas such as automated document processing, electronic data interchange, and the digitization of historical documents.
 
- Our team has constructed multiple Chinese document datasets with paragraph annotations for various scenarios to ensure the model's generalization capability. For example, in the **academic paper** scenario, previous open-source datasets such as CDLA (A Chinese document layout analysis) lacked annotations for paragraph information; in the **research report** scenario, we have filled the gap for this particular area. Using these annotated datasets, we have trained several new Chinese document layout analysis models. These models are designed to identify paragraph boundaries in documents and accurately distinguish between text, images, tables, formulas, and other elements.
+ Traditional document layout analysis models often struggle to accurately distinguish between paragraphs and other layout elements within documents, which restricts further processing and utilization of document information. However, the development of deep learning and pattern recognition technologies has brought new opportunities for document layout analysis. By training on datasets, the models' understanding of document structure can be enhanced. Yet, high-quality annotated datasets are fundamental to training effective models.
 
- This time, we have open-sourced the layout analysis model weights and corresponding label systems for both the academic paper and research report scenarios.
+ In document layout analysis, fine-grained annotation is essential, especially the annotation of **paragraphs**, as it directly affects semantic understanding and information extraction of the text. Currently, in the field of layout analysis, to our knowledge, open-source datasets such as CDLA (A Chinese document layout analysis) lack annotations for paragraph information; layout analysis models for the research report scenario are relatively scarce.
+
+ Therefore, to address this issue, we have manually annotated and fine-tuned the CDLA with granular tags and data optimization, and have built a fine-grained layout analysis dataset for the research report scenario. Utilizing these annotated datasets, we have trained several new Chinese document layout analysis models, which have shown **excellent performance on the closed test set**.
+
+ In this open-source release, we have prioritized the release of lightweight model weights and corresponding label systems for **academic papers** and **research reports**, aiming to identify paragraph boundaries and accurately distinguish between text, images, tables, formulas, and other elements, ultimately promoting industry development.
 
  ## II. Usage
 
@@ -55,6 +59,7 @@ This time, we have open-sourced the layout analysis model weights and correspond
  <img src="./case/paper/2.jpg" width="50%" height="50%">
  </div>
 
+
  ### 3.2 Research Report Scenario
 
  - Label Categories
@@ -78,6 +83,7 @@ This time, we have open-sourced the layout analysis model weights and correspond
  <img src="./case/report/2.jpg" width="50%" height="50%">
  </div>
 
+
  ## License
 
  This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the [Apache license 2.0](./LICENSE.txt).