Update README.md
README.md
CHANGED
@@ -12,33 +12,25 @@ base_model:
|
|
12 |
- facebook/nougat-small
|
13 |
---
|
14 |
|
15 |
-
# Arabic Small Nougat
|
16 |
-
|
17 |
-
**Arabic Small Nougat** is an end-to-end Optical Character Recognition (OCR) model designed specifically for the Arabic language. It converts images of Arabic book pages into structured text, with output formatted in Markdown. The model is based on the [facebook/nougat-small](https://huggingface.co/facebook/nougat-small) architecture and fine-tuned on the [Khatt dataset](https://huggingface.co/datasets/Fakhraddin/khatt), with additional custom data to enhance performance for Arabic text recognition.
|
18 |
|
|
|
19 |
<center>
|
20 |
<img src="https://huggingface.co/MohamedRashad/arabic-small-nougat/resolve/main/thumbnail_image.jpg">
|
21 |
</center>
|
22 |
|
23 |
## Description
|
24 |
|
25 |
-
The
|
26 |
-
|
27 |
-
- **Key Features:**
|
28 |
-
- Support for both **Arabic** and **English** text
|
29 |
-
- **Markdown-formatted output** ideal for digital book projects, academic work, and content conversion
|
30 |
-
- Capable of handling full-page book scans, producing well-structured text
|
31 |
-
- Built on a robust OCR pipeline, ideal for Arabic literature and document digitization
|
32 |
|
33 |
-
|
34 |
|
35 |
-
|
36 |
|
37 |
-
**
|
38 |
|
39 |
-
|
40 |
-
|
41 |
-
```python
|
42 |
from PIL import Image
|
43 |
import torch
|
44 |
from transformers import NougatProcessor, VisionEncoderDecoderModel
|
@@ -49,104 +41,78 @@ model = VisionEncoderDecoderModel.from_pretrained("MohamedRashad/arabic-small-no
|
|
49 |
device = "cuda" if torch.cuda.is_available() else "cpu"
|
50 |
model.to(device)
|
51 |
|
52 |
-
# Define maximum context length for generated text
|
53 |
context_length = 2048
|
54 |
|
55 |
def predict(img_path):
|
56 |
-
    """
|
57 |
-
    Predicts OCR text from an image of a book page.
|
58 |
-
    """
|
59 |
-
    # Open and preprocess the image
|
60 |
    image = Image.open(img_path)
|
61 |
    pixel_values = processor(image, return_tensors="pt").pixel_values
|
62 |
-
|
63 |
-
    #
|
64 |
    outputs = model.generate(
|
65 |
        pixel_values.to(device),
|
66 |
        min_length=1,
|
67 |
-
        max_new_tokens=context_length
|
|
|
68 |
    )
|
69 |
-
|
70 |
-
    # Decode the generated text and post-process
|
71 |
    page_sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
|
72 |
    page_sequence = processor.post_process_generation(page_sequence, fix_markdown=False)
|
73 |
-
|
74 |
    return page_sequence
|
75 |
|
76 |
-
# Example usage
|
77 |
print(predict("path/to/page_image.jpg"))
|
78 |
-
```
|
79 |
-
|
80 |
-
### Installation Requirements
|
81 |
-
|
82 |
-
Make sure to install the following dependencies:
|
83 |
-
```bash
|
84 |
-
pip install -U transformers Pillow torch
|
85 |
-
```
|
86 |
-
|
87 |
## Bias, Risks, and Limitations
|
88 |
|
89 |
-
|
90 |
-
|
91 |
-
|
92 |
|
93 |
## Intended Use
|
94 |
|
95 |
-
|
96 |
-
- **Digitizing Arabic literature**
|
97 |
-
- **Extracting structured text from scanned books**
|
98 |
-
- **Academic research** and archival purposes, particularly in the humanities and social sciences
|
99 |
|
100 |
## Ethical Considerations
|
101 |
|
102 |
-
|
103 |
-
- **Copyright and Fair Use**: This tool can be used for digitizing public domain works, but care should be taken when dealing with copyrighted material. Ensure proper permissions and comply with intellectual property laws when using this model in such contexts.
|
104 |
|
105 |
## Model Details
|
106 |
|
107 |
-
- **Developed by
|
108 |
-
- **Model
|
109 |
-
- **
|
110 |
-
- **License
|
111 |
-
- **
|
112 |
-
- **Fine-Tuned Dataset**: [Khatt dataset](https://huggingface.co/datasets/Fakhraddin/khatt)
|
113 |
|
114 |
-
|
115 |
|
116 |
-
If you use or build upon the
|
117 |
-
- The original authors of the [facebook/nougat-small](https://huggingface.co/facebook/nougat-small) model
|
118 |
-
- The creators of the [Khatt dataset](https://huggingface.co/datasets/Fakhraddin/khatt)
|
119 |
-
- The open-source community and the model’s developer, Mohamed Rashad
|
120 |
|
121 |
-
|
122 |
|
123 |
-
|
124 |
|
125 |
-
|
|
|
126 |
@misc{blecher2023nougat,
|
127 |
-
|
128 |
-
|
129 |
-
|
130 |
-
|
131 |
-
|
132 |
-
|
133 |
}
|
134 |
-
|
135 |
@misc{fakhraddin2023khatt,
|
136 |
-
|
137 |
-
|
138 |
-
|
139 |
-
|
140 |
}
|
141 |
-
|
142 |
@misc{rashad2023arabicsmallnougat,
|
143 |
-
|
144 |
-
|
145 |
-
|
146 |
-
|
147 |
}
|
148 |
-
|
149 |
-
|
150 |
-
## Disclaimer
|
151 |
|
152 |
-
The
|
|
|
12 |
- facebook/nougat-small
|
13 |
---
|
14 |
|
15 |
+
# Arabic Small Nougat
|
|
|
|
|
16 |
|
17 |
+
End-to-End Structured OCR for Arabic books.
|
18 |
<center>
|
19 |
<img src="https://huggingface.co/MohamedRashad/arabic-small-nougat/resolve/main/thumbnail_image.jpg">
|
20 |
</center>
|
21 |
|
22 |
## Description
|
23 |
|
24 |
+
The arabic-small-nougat OCR is an end-to-end structured Optical Character Recognition (OCR) system designed specifically for the Arabic language.
|
|
|
|
|
|
|
|
|
|
|
|
|
25 |
|
26 |
+
The model is based on the [facebook/nougat-small](https://huggingface.co/facebook/nougat-small) architecture and has been fine-tuned using the [Khatt dataset](https://huggingface.co/datasets/Fakhraddin/khatt) along with a custom dataset created for this purpose.
|
27 |
|
28 |
+
## How to Get Started with the Model
|
29 |
|
30 |
+
**Demo:** https://huggingface.co/spaces/MohamedRashad/Arabic-Small-Nougat
|
31 |
|
32 |
+
Or, use the code below to get started with the model locally.
|
33 |
+
python
|
|
|
34 |
from PIL import Image
|
35 |
import torch
|
36 |
from transformers import NougatProcessor, VisionEncoderDecoderModel
|
|
|
41 |
device = "cuda" if torch.cuda.is_available() else "cpu"
|
42 |
model.to(device)
|
43 |
|
|
|
44 |
context_length = 2048
|
45 |
|
46 |
def predict(img_path):
|
47 |
+
    # prepare PDF image for the model
|
|
|
|
|
|
|
48 |
    image = Image.open(img_path)
|
49 |
    pixel_values = processor(image, return_tensors="pt").pixel_values
|
50 |
+
|
51 |
+
    # generate transcription
|
52 |
    outputs = model.generate(
|
53 |
        pixel_values.to(device),
|
54 |
        min_length=1,
|
55 |
+
        max_new_tokens=context_length,
|
56 |
+
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
|
57 |
    )
|
58 |
+
|
|
|
59 |
    page_sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
|
60 |
    page_sequence = processor.post_process_generation(page_sequence, fix_markdown=False)
|
|
|
61 |
    return page_sequence
|
62 |
|
|
|
63 |
print(predict("path/to/page_image.jpg"))
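
To transcribe a whole book rather than a single page, `predict` can be applied to each page image in turn. The following is a minimal sketch and not part of the original README: `transcribe_folder` and its parameters are hypothetical names, and it assumes one image file per page whose filenames sort in reading order. `predict_fn` stands for any callable that maps an image path to Markdown text.

```python
from pathlib import Path
from typing import Callable, Iterable


def transcribe_folder(
    folder: str,
    predict_fn: Callable[[str], str],
    extensions: Iterable[str] = (".jpg", ".jpeg", ".png"),
) -> str:
    """Run predict_fn on every page image in folder and join the results."""
    allowed = {ext.lower() for ext in extensions}
    # Sort by filename so pages come out in a stable order
    # (assumption: names such as page_001.jpg sort in reading order).
    pages = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in allowed
    )
    # Separate pages with a blank line so Markdown blocks do not merge.
    return "\n\n".join(predict_fn(str(p)) for p in pages)
```

With the function defined above, a call might look like `markdown = transcribe_folder("path/to/pages", predict)`.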
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
64 |
## Bias, Risks, and Limitations
|
65 |
|
66 |
+
1. **Text Hallucination:** The model may occasionally generate repeated or incorrect text due to the inherent complexities of OCR tasks.
|
67 |
+
2. **Erroneous Image Paths:** The model sometimes outputs image paths that have no relation to the input page.
|
68 |
+
3. **Context Length Constraint:** The model has a maximum context length of 2048 tokens, so transcriptions of longer book pages may be cut off.
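
One way to notice a transcription that hit the context length limit is to check whether generation ever produced the end-of-sequence token. This is a sketch under assumptions, not the author's method: `looks_truncated` is a hypothetical helper, and it assumes the model emits an EOS token when it finishes a page on its own.

```python
from typing import Sequence


def looks_truncated(token_ids: Sequence[int], eos_token_id: int) -> bool:
    """Return True if the generated sequence never reached EOS."""
    # Skip position 0: it holds the decoder start token, which in some
    # configurations shares an id with EOS.
    return eos_token_id not in list(token_ids)[1:]
```

Inside `predict`, the check could be applied as `looks_truncated(outputs[0].tolist(), processor.tokenizer.eos_token_id)`; a `True` result suggests the page was cut off at the token limit and should be reviewed.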
|
69 |
|
70 |
## Intended Use
|
71 |
|
72 |
+
The arabic-small-nougat OCR is designed for tasks that involve converting images of Arabic book pages into structured text, especially when Markdown format is desired. It is suitable for applications in the field of digitizing Arabic literature and facilitating text extraction from printed materials.
|
|
|
|
|
|
|
73 |
|
74 |
## Ethical Considerations
|
75 |
|
76 |
+
It is crucial to be aware of the model's limitations, particularly in instances where accurate OCR results are critical. Users are advised to verify and review the output, especially in scenarios where precision is paramount.
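
Manual review can be prioritized with a few automatic checks for the failure modes listed under limitations. The sketch below is illustrative only: `review_flags`, `IMAGE_LINK`, and the `max_repeats` threshold are hypothetical, and the heuristics (Markdown image syntax, identical lines repeated several times) are assumptions about how hallucinated output tends to look.

```python
import re
from collections import Counter

# Matches Markdown image syntax such as ![alt](path/to/file.png).
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def review_flags(markdown: str, max_repeats: int = 3) -> dict:
    """Collect simple warning signs from a transcribed page."""
    # Ignore blank lines and surrounding whitespace when counting repeats.
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip()]
    counts = Counter(lines)
    return {
        # Image references the model may have invented.
        "image_links": IMAGE_LINK.findall(markdown),
        # Lines that occur at least max_repeats times, a sign of looping.
        "repeated_lines": sorted(
            ln for ln, n in counts.items() if n >= max_repeats
        ),
    }
```

A page whose flags are both empty is not guaranteed to be correct; the checks only highlight pages that most likely need a closer look.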
|
|
|
77 |
|
78 |
## Model Details
|
79 |
|
80 |
+
- **Developed by:** Mohamed Rashad
|
81 |
+
- **Model type:** VisionEncoderDecoderModel
|
82 |
+
- **Language(s) (NLP):** Arabic & English
|
83 |
+
- **License:** GPL 3.0
|
84 |
+
- **Finetuned from model:** [nougat-small](https://huggingface.co/facebook/nougat-small)
|
|
|
85 |
|
86 |
+
## Acknowledgment
|
87 |
|
88 |
+
If you use or build upon the Arabic Small Nougat OCR, please acknowledge the model developer and the open-source community for their contributions. Additionally, be sure to include a copy of the GPL 3.0 license with any redistributed or modified versions of the model.
|
|
|
|
|
|
|
89 |
|
90 |
+
The GPL 3.0 license was chosen to promote the principles of open source and to ensure that the benefits of the model are shared with the broader community.
|
91 |
|
92 |
+
### Citation
|
93 |
|
94 |
+
If you find this model useful, please consider citing the original [facebook/nougat-small](https://huggingface.co/facebook/nougat-small) model and the datasets used for fine-tuning, including the [Khatt dataset](https://huggingface.co/datasets/Fakhraddin/khatt) and any details regarding the custom dataset.
|
95 |
+
bibtex
|
96 |
@misc{blecher2023nougat,
|
97 |
+
title={Nougat: Neural Optical Understanding for Academic Documents},
|
98 |
+
author={Lukas Blecher and Guillem Cucurull and Thomas Scialom and Robert Stojnic},
|
99 |
+
year={2023},
|
100 |
+
eprint={2308.13418},
|
101 |
+
archivePrefix={arXiv},
|
102 |
+
primaryClass={cs.LG}
|
103 |
}
|
|
|
104 |
@misc{fakhraddin2023khatt,
|
105 |
+
title={Khatt Arabic Handwriting Dataset},
|
106 |
+
author={Fakhraddin},
|
107 |
+
year={2023},
|
108 |
+
howpublished={\url{https://huggingface.co/datasets/Fakhraddin/khatt}}
|
109 |
}
|
|
|
110 |
@misc{rashad2023arabicsmallnougat,
|
111 |
+
title={Arabic Small Nougat Model},
|
112 |
+
author={Mohamed Rashad},
|
113 |
+
year={2023},
|
114 |
+
howpublished={\url{https://huggingface.co/MohamedRashad/arabic-small-nougat}}
|
115 |
}
|
116 |
+
### Disclaimer
|
|
|
|
|
117 |
|
118 |
+
The arabic-small-nougat OCR is a tool provided "as is," and the developers make no guarantees regarding its suitability for specific tasks. Users are encouraged to thoroughly evaluate the model's output for their particular use cases and requirements.
|