MohamedRashad committed
Commit 3e40121 · verified · 1 parent: 8bfa647

Update README.md

  1. README.md +46 -80
README.md CHANGED
@@ -12,33 +12,25 @@ base_model:
12
  - facebook/nougat-small
13
  ---
14
 
15
- # Arabic Small Nougat: End-to-End Structured OCR for Arabic Books
16
-
17
- **Arabic Small Nougat** is an end-to-end Optical Character Recognition (OCR) model designed specifically for the Arabic language. It converts images of Arabic book pages into structured text, with output formatted in Markdown. The model is based on the [facebook/nougat-small](https://huggingface.co/facebook/nougat-small) architecture and fine-tuned on the [Khatt dataset](https://huggingface.co/datasets/Fakhraddin/khatt), with additional custom data to enhance performance for Arabic text recognition.
18
 
 
19
  <center>
20
  <img src="https://huggingface.co/MohamedRashad/arabic-small-nougat/resolve/main/thumbnail_image.jpg">
21
  </center>
22
 
23
  ## Description
24
 
25
- The **Arabic Small Nougat** OCR model is designed to perform high-quality, end-to-end text extraction from images of Arabic book pages. Fine-tuned from the Nougat model architecture, it is tailored to handle the nuances of Arabic scripts, including their rich structure and complex typography.
26
-
27
- - **Key Features:**
28
- - Support for both **Arabic** and **English** text
29
- - **Markdown-formatted output** ideal for digital book projects, academic work, and content conversion
30
- - Capable of handling full-page book scans, producing well-structured text
31
- - Built on a robust OCR pipeline, ideal for Arabic literature and document digitization
32
 
33
- ## Getting Started
34
 
35
- **Demo:** [Live Demo](https://huggingface.co/spaces/MohamedRashad/Arabic-Small-Nougat)
36
 
37
- **Quickstart Code:**
38
 
39
- Here’s a code snippet to get started with the model:
40
-
41
- ```python
42
  from PIL import Image
43
  import torch
44
  from transformers import NougatProcessor, VisionEncoderDecoderModel
@@ -49,104 +41,78 @@ model = VisionEncoderDecoderModel.from_pretrained("MohamedRashad/arabic-small-no
49
  device = "cuda" if torch.cuda.is_available() else "cpu"
50
  model.to(device)
51
 
52
- # Define maximum context length for generated text
53
  context_length = 2048
54
 
55
  def predict(img_path):
56
- """
57
- Predicts OCR text from an image of a book page.
58
- """
59
- # Open and preprocess the image
60
  image = Image.open(img_path)
61
  pixel_values = processor(image, return_tensors="pt").pixel_values
62
-
63
- # Generate transcription
64
  outputs = model.generate(
65
  pixel_values.to(device),
66
  min_length=1,
67
- max_new_tokens=context_length
 
68
  )
69
-
70
- # Decode the generated text and post-process
71
  page_sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
72
  page_sequence = processor.post_process_generation(page_sequence, fix_markdown=False)
73
-
74
  return page_sequence
75
 
76
- # Example usage
77
  print(predict("path/to/page_image.jpg"))
78
- ```
79
-
80
- ### Installation Requirements
81
-
82
- Make sure to install the following dependencies:
83
- ```bash
84
- pip install -U transformers Pillow torch
85
- ```
86
-
87
  ## Bias, Risks, and Limitations
88
 
89
- - **Text Hallucination**: The model may occasionally generate repetitive or incorrect text due to the complexities of OCR tasks, especially with noisy or difficult-to-read images.
90
- - **Image Quality Sensitivity**: Low-resolution or blurred images may significantly affect the accuracy of the OCR.
91
- - **Context Length Limitation**: The model has a maximum context length of 2048 tokens, which may result in incomplete transcriptions for long book pages or documents with extensive content.
92
 
93
  ## Intended Use
94
 
95
- This model is ideal for converting scanned or photographed images of Arabic book pages into digital text, making it useful for:
96
- - **Digitizing Arabic literature**
97
- - **Extracting structured text from scanned books**
98
- - **Academic research** and archival purposes, particularly in the humanities and social sciences
99
 
100
  ## Ethical Considerations
101
 
102
- - **Accuracy Verification**: Users should always verify the output for accuracy, especially for critical applications or when processing historical texts.
103
- - **Copyright and Fair Use**: This tool can be used for digitizing public domain works, but care should be taken when dealing with copyrighted material. Ensure proper permissions and comply with intellectual property laws when using this model in such contexts.
104
 
105
  ## Model Details
106
 
107
- - **Developed by**: Mohamed Rashad
108
- - **Model Type**: VisionEncoderDecoderModel
109
- - **Languages**: Arabic, English
110
- - **License**: GPL 3.0
111
- - **Base Model**: [facebook/nougat-small](https://huggingface.co/facebook/nougat-small)
112
- - **Fine-Tuned Dataset**: [Khatt dataset](https://huggingface.co/datasets/Fakhraddin/khatt)
113
 
114
- ### Acknowledgments
115
 
116
- If you use or build upon the **Arabic Small Nougat** OCR model, please acknowledge the contributions of:
117
- - The original authors of the [facebook/nougat-small](https://huggingface.co/facebook/nougat-small) model
118
- - The creators of the [Khatt dataset](https://huggingface.co/datasets/Fakhraddin/khatt)
119
- - The open-source community and the model’s developer, Mohamed Rashad
120
 
121
- ### Citation
122
 
123
- If you find this model useful, please cite the following references:
124
 
125
- ```bibtex
 
126
  @misc{blecher2023nougat,
127
- title={Nougat: Neural Optical Understanding for Academic Documents},
128
- author={Lukas Blecher and Guillem Cucurull and Thomas Scialom and Robert Stojnic},
129
- year={2023},
130
- eprint={2308.13418},
131
- archivePrefix={arXiv},
132
- primaryClass={cs.LG}
133
  }
134
-
135
  @misc{fakhraddin2023khatt,
136
- title={Khatt Arabic Handwriting Dataset},
137
- author={Fakhraddin},
138
- year={2023},
139
- howpublished={\url{https://huggingface.co/datasets/Fakhraddin/khatt}}
140
  }
141
-
142
  @misc{rashad2023arabicsmallnougat,
143
- title={Arabic Small Nougat Model},
144
- author={Mohamed Rashad},
145
- year={2023},
146
- howpublished={\url{https://huggingface.co/MohamedRashad/arabic-small-nougat}}
147
  }
148
- ```
149
-
150
- ## Disclaimer
151
 
152
- The **Arabic Small Nougat OCR** model is provided "as is." The developers make no guarantees regarding its suitability for specific tasks. Users are encouraged to evaluate the model's performance and carefully review the output for their specific use cases and requirements.
 
12
  - facebook/nougat-small
13
  ---
14
 
15
+ # Arabic Small Nougat
 
 
16
 
17
+ End-to-End Structured OCR for Arabic books.
18
  <center>
19
  <img src="https://huggingface.co/MohamedRashad/arabic-small-nougat/resolve/main/thumbnail_image.jpg">
20
  </center>
21
 
22
  ## Description
23
 
24
+ The arabic-small-nougat OCR is an end-to-end structured Optical Character Recognition (OCR) system designed specifically for the Arabic language.
 
 
 
 
 
 
25
 
26
+ The model is based on the [facebook/nougat-small](https://huggingface.co/facebook/nougat-small) architecture and has been fine-tuned using the [Khatt dataset](https://huggingface.co/datasets/Fakhraddin/khatt) along with a custom dataset created for this purpose.
27
 
28
+ ## How to Get Started with the Model
29
 
30
+ **Demo:** https://huggingface.co/spaces/MohamedRashad/Arabic-Small-Nougat
31
 
32
+ Or, use the code below to get started with the model locally.
33
+ python
 
34
  from PIL import Image
35
  import torch
36
  from transformers import NougatProcessor, VisionEncoderDecoderModel
 
41
  device = "cuda" if torch.cuda.is_available() else "cpu"
42
  model.to(device)
43
 
 
44
  context_length = 2048
45
 
46
  def predict(img_path):
47
+ # prepare PDF image for the model
 
 
 
48
  image = Image.open(img_path)
49
  pixel_values = processor(image, return_tensors="pt").pixel_values
50
+
51
+ # generate transcription
52
  outputs = model.generate(
53
  pixel_values.to(device),
54
  min_length=1,
55
+ max_new_tokens=context_length,
56
+ bad_words_ids=[[processor.tokenizer.unk_token_id]],
57
  )
58
+
 
59
  page_sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
60
  page_sequence = processor.post_process_generation(page_sequence, fix_markdown=False)
 
61
  return page_sequence
62
 
 
63
  print(predict("path/to/page_image.jpg"))
 
 
 
 
 
 
 
 
 
64
  ## Bias, Risks, and Limitations
65
 
66
+ 1. **Text Hallucination:** The model may occasionally generate repeated or incorrect text due to the inherent complexities of OCR tasks.
67
+ 1. **Erroneous Image Paths:** There are instances where the model outputs image paths that are not relevant to the input, indicating occasional confusion.
68
+ 1. **Context Length Constraint:** The model has a maximum context length of 2048 tokens, which may result in incomplete transcriptions for longer book pages.
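
The three limitations listed above can be partly mitigated after generation. The following is a minimal sketch, not part of the model card: the helper names (`strip_image_refs`, `collapse_repeated_lines`, `looks_truncated`) are hypothetical, and the heuristics are assumptions about what the spurious output looks like (Markdown image references, consecutive duplicate lines, and a generation that used its whole token budget without ending on the end-of-sequence token).

```python
import re

# Hypothetical helpers for illustration; none of these names come from the model card.

# Matches Markdown image references such as ![alt](path/to/img.png).
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def strip_image_refs(text: str) -> str:
    """Remove Markdown image references, which the model may emit spuriously."""
    return IMAGE_REF.sub("", text)


def collapse_repeated_lines(text: str, max_repeats: int = 2) -> str:
    """Keep at most `max_repeats` consecutive copies of any non-empty line."""
    kept = []
    previous = None
    run = 0
    for line in text.splitlines():
        key = line.strip()
        if key and key == previous:
            run += 1
        else:
            previous = key
            run = 1
        # Blank lines are always kept; repeated lines only up to the limit.
        if not key or run <= max_repeats:
            kept.append(line)
    return "\n".join(kept)


def looks_truncated(
    num_generated_tokens: int,
    last_token_id: int,
    eos_token_id: int,
    max_new_tokens: int = 2048,
) -> bool:
    """Heuristic: the full token budget was used and the output did not end on EOS."""
    return num_generated_tokens >= max_new_tokens and last_token_id != eos_token_id
```

These helpers would run on the string returned by `predict` (or, for `looks_truncated`, on the token ids returned by `model.generate`). How the generated tokens are counted depends on the generation call, so the caller supplies that number here. A page flagged as truncated still needs manual review; the check only reports the problem.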
69
 
70
  ## Intended Use
71
 
72
+ The arabic-small-nougat OCR is designed for tasks that involve converting images of Arabic book pages into structured text, especially when Markdown format is desired. It is suitable for applications in the field of digitizing Arabic literature and facilitating text extraction from printed materials.
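
For the book-digitization use described above, pages are typically processed one at a time and joined into a single Markdown document. The sketch below is an illustration under assumptions: `transcribe_pages` is a hypothetical name, the page-transcription function is passed in as `predict_fn` (for example the `predict` function from the quickstart), and the `---` page separator is an arbitrary choice.

```python
from typing import Callable, Iterable


def transcribe_pages(
    page_paths: Iterable[str],
    predict_fn: Callable[[str], str],
    separator: str = "\n\n---\n\n",
) -> str:
    """Run `predict_fn` on each page image in order and join the Markdown output."""
    pages = [predict_fn(str(path)).strip() for path in page_paths]
    return separator.join(pages)
```

Passing the prediction function as an argument keeps the helper independent of the model, so it can be exercised with a stub before running real inference. The result can then be written to disk with `pathlib.Path("book.md").write_text(markdown, encoding="utf-8")`.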
 
 
 
73
 
74
  ## Ethical Considerations
75
 
76
+ It is crucial to be aware of the model's limitations, particularly in instances where accurate OCR results are critical. Users are advised to verify and review the output, especially in scenarios where precision is paramount.
 
77
 
78
  ## Model Details
79
 
80
+ - **Developed by:** Mohamed Rashad
81
+ - **Model type:** VisionEncoderDecoderModel
82
+ - **Language(s) (NLP):** Arabic & English
83
+ - **License:** GPL 3.0
84
+ - **Finetuned from model:** [nougat-small](https://huggingface.co/facebook/nougat-small)
 
85
 
86
+ ## Acknowledgment
87
 
88
+ If you use or build upon the Arabic Small Nougat OCR, please acknowledge the model developer and the open-source community for their contributions. Additionally, be sure to include a copy of the GPL 3.0 license with any redistributed or modified versions of the model.
 
 
 
89
 
90
+ By selecting the GPL 3.0 license, you promote the principles of open source and ensure that the benefits of the model are shared with the broader community.
91
 
92
+ ### Citation
93
 
94
+ If you find this model useful, please consider citing the original [facebook/nougat-small](https://huggingface.co/facebook/nougat-small) model and the datasets used for fine-tuning, including the [Khatt dataset](https://huggingface.co/datasets/Fakhraddin/khatt) and any details regarding the custom dataset.
95
+ bibtex
96
  @misc{blecher2023nougat,
97
+ title={Nougat: Neural Optical Understanding for Academic Documents},
98
+ author={Lukas Blecher and Guillem Cucurull and Thomas Scialom and Robert Stojnic},
99
+ year={2023},
100
+ eprint={2308.13418},
101
+ archivePrefix={arXiv},
102
+ primaryClass={cs.LG}
103
  }
 
104
  @misc{fakhraddin2023khatt,
105
+ title={Khatt Arabic Handwriting Dataset},
106
+ author={Fakhraddin},
107
+ year={2023},
108
+ howpublished={\url{https://huggingface.co/datasets/Fakhraddin/khatt}}
109
  }
 
110
  @misc{rashad2023arabicsmallnougat,
111
+ title={Arabic Small Nougat Model},
112
+ author={Mohamed Rashad},
113
+ year={2023},
114
+ howpublished={\url{https://huggingface.co/MohamedRashad/arabic-small-nougat}}
115
  }
116
+ ### Disclaimer
 
 
117
 
118
+ The arabic-small-nougat OCR is a tool provided "as is," and the developers make no guarantees regarding its suitability for specific tasks. Users are encouraged to thoroughly evaluate the model's output for their particular use cases and requirements.