yujunhuinlp committed on
Commit 6ce33d7 · verified · 1 Parent(s): 2f001dd

Update README_EN.md

Files changed (1)
  1. README_EN.md +6 -10
README_EN.md CHANGED
@@ -4,19 +4,15 @@
 
  ## I. Background
 
- In the current digital era, **Document Layout Analysis** is one of the key steps for information extraction and document comprehension. Also known as document image analysis or document layout analysis, it is the process of identifying and extracting text, images, tables, and other elements from scanned document images. This technology has a wide range of applications in areas such as automated document processing, electronic data interchange, and the digitization of historical documents.
-
- Traditional document layout analysis models often struggle to accurately distinguish between paragraphs and other layout elements within documents, which restricts further processing and utilization of document information. However, the development of deep learning and pattern recognition technologies has brought new opportunities for document layout analysis. By training on datasets, the models' understanding of document structure can be enhanced. Yet, high-quality annotated datasets are fundamental to training effective models.
-
- In document layout analysis, fine-grained annotation is essential, especially the annotation of **paragraphs**, as it directly affects semantic understanding and information extraction of the text. Currently, in the field of layout analysis, to our knowledge, open-source datasets such as CDLA (A Chinese document layout analysis) lack annotations for paragraph information; layout analysis models for the research report scenario are relatively scarce.
-
- Therefore, to address this issue, we have manually annotated and fine-tuned the CDLA with granular tags and data optimization, and have built a fine-grained layout analysis dataset for the research report scenario. Utilizing these annotated datasets, we have trained several new Chinese document layout analysis models, which have shown **excellent performance on the closed test set**.
-
- In this open-source release, we have prioritized the release of lightweight model weights and corresponding label systems for **academic papers** and **research reports**, aiming to identify paragraph boundaries and accurately distinguish between text, images, tables, formulas, and other elements, ultimately promoting industry development.
 
  ## II. Usage
 
- - Weights download link: [🤗LINK](https://huggingface.co/qihoo360)
 
  - Usage:
 
 
 
  ## I. Background
 
+ In today's digital age, document layout analysis is one of the key steps in information extraction and document understanding. Document layout analysis, also known as document image analysis, is the process of identifying and extracting text, images, tables, and other elements from scanned document images. This technology is widely used in automated document processing, electronic data interchange, and the digitization of historical documents.
+ Traditional document layout analysis models often struggle to accurately distinguish paragraphs from other layout elements in documents, which limits further processing and use of document information. Advances in deep learning and pattern recognition have created new opportunities for document layout analysis: training on annotated datasets can improve a model's understanding of document structure. High-quality annotated datasets, however, are the foundation for training effective models.
+ In document layout analysis, fine-grained annotation is essential, and paragraph annotation is particularly important because it directly affects semantic understanding and information extraction of the text. To our knowledge, existing open-source datasets for the paper scenario, such as CDLA (A Chinese document layout analysis), lack paragraph annotations, and layout analysis models for the research report scenario remain scarce.
+ Therefore, to address this problem, we manually annotated the paper documents, converting them to fine-grained labels and optimizing the data, and constructed a fine-grained layout analysis dataset for the research report scenario. Using these annotated datasets, we trained several new Chinese document layout analysis models, which performed well on the **closed test set**.
+ In this open-source project, we have prioritized lightweight model weights and the corresponding label systems for layout analysis in two scenarios: **paper** and **research report**. The aim is to identify paragraph boundaries and other information in documents, accurately distinguish text, images, tables, formulas, and other elements, and ultimately promote industry development.
 
  ## II. Usage
 
+ - Weights download link: [🤗LINK](https://huggingface.co/qihoo360/360LayoutAnalysis)
 
  - Usage:
 