UniquePratham committed on
Commit
8c35d87
1 Parent(s): f94fc9b

Upload 5 files


DualTextOCRFusion

Files changed (5)
  1. .gitignore +55 -0
  2. README.md +150 -13
  3. app.py +72 -0
  4. ocr_cpu.py +97 -0
  5. requirements.txt +14 -0
.gitignore ADDED
@@ -0,0 +1,55 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # Virtual environment
+ venv/
+ env/
+ .venv/
+ .env/
+ ENV/
+ .env.bak/
+ *.env
+ __pycache__
+
+ # VS Code
+ .vscode/
+ .history/
+
+ # PyCharm
+ .idea/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints/
+
+ # Logs
+ *.log
+
+ # Mac OS files
+ .DS_Store
+
+ # Streamlit Cache (Optional)
+ streamlit_cache/
README.md CHANGED
@@ -1,13 +1,150 @@
- ---
- title: DualTextOCRFusion
- emoji:
- colorFrom: gray
- colorTo: purple
- sdk: streamlit
- sdk_version: 1.38.0
- app_file: app.py
- pinned: false
- license: mit
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 🔍 DualTextOCRFusion
+
+ **DualTextOCRFusion** is a web-based Optical Character Recognition (OCR) application that lets users upload images containing both Hindi and English text, extract the text, and search for keywords within it. The app uses models such as **ColPali's Byaldi + Qwen2-VL** or **General OCR Theory (GOT)** for multilingual text extraction.
+
+ ## Features
+
+ - **Multilingual OCR**: Extract text from images containing both **Hindi** and **English**.
+ - **Keyword Search**: Search for specific keywords in the extracted text.
+ - **User-Friendly Interface**: Simple, intuitive interface for uploading images and searching.
+ - **Deployed Online**: Accessible through a live URL.
+
+ ## Technologies Used
+
+ - **Python**: Backend logic.
+ - **Streamlit**: Web interface.
+ - **Hugging Face Transformers**: OCR model integration (Qwen2-VL or GOT).
+ - **PyTorch**: Deep learning inference.
+ - **Pytesseract**: Optional OCR engine.
+ - **OpenCV**: Image preprocessing.
+
+ ## Project Structure
+
+ ```
+ DualTextOCRFusion/
+ ├── app.py              # Main Streamlit application
+ ├── ocr_cpu.py          # Handles OCR extraction using the selected model
+ ├── .gitignore          # Files and directories to ignore in Git
+ ├── .streamlit/
+ │   └── config.toml     # Streamlit theme configuration
+ ├── requirements.txt    # Dependencies for the project
+ └── README.md           # This file
+ ```
+
+ ## How to Run Locally
+
+ ### Prerequisites
+
+ - Python 3.8 or above.
+ - Tesseract installed if you use `pytesseract` (optional when using Hugging Face models). You can download Tesseract from [here](https://github.com/tesseract-ocr/tesseract).
+
+ ### Steps
+
+ 1. **Clone the Repository**:
+
+    ```bash
+    git clone https://github.com/yourusername/dual-text-ocr-fusion.git
+    cd dual-text-ocr-fusion
+    ```
+
+ 2. **Install Dependencies**:
+
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ 3. **Run the Application**:
+
+    ```bash
+    streamlit run app.py
+    ```
+
+ 4. **Open the App**:
+
+    Once the server starts, the app is available in your browser at:
+
+    ```
+    http://localhost:8501
+    ```
+
+ ### Usage
+
+ 1. **Upload an Image**: Upload an image containing Hindi and English text (JPG, JPEG, or PNG).
+ 2. **View Extracted Text**: The app extracts and displays the text from the image.
+ 3. **Search for Keywords**: Enter any keyword to search within the extracted text.
+
+ ## Deployment
+
+ The app is deployed on **Streamlit Sharing** and can be accessed via the live URL:
+
+ **[Live Application](https://your-app-link.streamlit.app)**
+
+ ## Customization
+
+ ### Changing the OCR Model
+
+ By default, the app uses the **Qwen2-VL** model, but you can switch to the **General OCR Theory (GOT)** model by editing the `ocr_cpu.py` file.
+
+ - **For Qwen2-VL**:
+
+    ```python
+    from ocr_cpu import extract_text_byaldi
+    ```
+
+ - **For General OCR Theory (GOT)**:
+
+    ```python
+    from ocr_cpu import extract_text_got
+    ```
+
+ ### Custom UI Theme
+
+ You can customize the look and feel of the application by modifying the `.streamlit/config.toml` file: adjust colors, fonts, and layout options to suit your preferences.
+
+ ## Example Images
+
+ Here are some sample images you can use to test the OCR functionality:
+
+ 1. **Sample 1**: A document with mixed Hindi and English text.
+ 2. **Sample 2**: An image with only Hindi text for multilingual OCR testing.
+
+ ## Contributing
+
+ To contribute, fork the repository and submit a pull request:
+
+ 1. Fork the project.
+ 2. Create a feature branch:
+
+    ```bash
+    git checkout -b feature-branch
+    ```
+
+ 3. Commit your changes:
+
+    ```bash
+    git commit -am 'Add new feature'
+    ```
+
+ 4. Push to the branch:
+
+    ```bash
+    git push origin feature-branch
+    ```
+
+ 5. Open a pull request.
+
+ ## License
+
+ This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.
+
+ ## Credits
+
+ - **Streamlit**: For the easy-to-use web interface.
+ - **Hugging Face Transformers**: For the powerful OCR models.
+ - **Tesseract**: For optional OCR functionality.
+ - **ColPali & GOT Models**: For the multilingual OCR support.
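The keyword search the README describes amounts to a case-insensitive containment check over the extracted text. A minimal sketch (the helper name `keyword_found` is illustrative, not from the repository):

```python
def keyword_found(text: str, keyword: str) -> bool:
    """Case-insensitive containment check over the extracted OCR text."""
    return keyword.lower() in text.lower()

# Mixed-script input: lowercasing only affects the Latin portion,
# so Hindi text passes through the comparison unchanged.
print(keyword_found("नमस्ते Hello World", "HELLO"))  # True
```

For whole-word matching rather than substring matching, `re.search(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE)` would be a natural next step.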
app.py ADDED
@@ -0,0 +1,72 @@
+ import streamlit as st
+ from ocr_cpu import extract_text_got  # The updated OCR function
+ import json
+
+ # --- UI Styling ---
+ st.set_page_config(page_title="DualTextOCRFusion",
+                    layout="centered", page_icon="🔍")
+
+ st.markdown(
+     """
+     <style>
+     .reportview-container {
+         background: #f4f4f4;
+     }
+     .sidebar .sidebar-content {
+         background: #e0e0e0;
+     }
+     h1 {
+         color: #007BFF;
+     }
+     .upload-btn {
+         background-color: #007BFF;
+         color: white;
+         padding: 10px;
+         border-radius: 5px;
+         text-align: center;
+     }
+     </style>
+     """, unsafe_allow_html=True
+ )
+
+ # --- Title ---
+ st.title("🔍 DualTextOCRFusion")
+ st.write("Upload an image with **Hindi** and **English** text to extract and search for keywords.")
+
+ # --- Image Upload Section ---
+ uploaded_file = st.file_uploader(
+     "Choose an image file", type=["jpg", "jpeg", "png"])
+
+ if uploaded_file is not None:
+     st.image(uploaded_file, caption='Uploaded Image', use_column_width=True)
+
+     # Extract text from the image using the selected OCR function (GOT)
+     with st.spinner("Extracting text using the model..."):
+         try:
+             extracted_text = extract_text_got(
+                 uploaded_file)  # Pass uploaded_file directly
+             if not extracted_text.strip():
+                 st.warning("No text extracted from the image.")
+         except Exception as e:
+             st.error(f"Error during text extraction: {str(e)}")
+             extracted_text = ""
+
+     # Display extracted text
+     st.subheader("Extracted Text")
+     st.text_area("Text", extracted_text, height=250)
+
+     # Save extracted text for search
+     if extracted_text:
+         with open("extracted_text.json", "w") as json_file:
+             json.dump({"text": extracted_text}, json_file)
+
+     # --- Keyword Search ---
+     st.subheader("Search for Keywords")
+     keyword = st.text_input(
+         "Enter a keyword to search in the extracted text")
+
+     if keyword:
+         if keyword.lower() in extracted_text.lower():
+             st.success(f"Keyword **'{keyword}'** found in the text!")
+         else:
+             st.error(f"Keyword **'{keyword}'** not found.")
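`app.py` persists the extracted text to `extracted_text.json` via `json.dump`. Since the payload can contain Devanagari, it is worth noting that the default `ensure_ascii=True` escapes those characters to `\uXXXX` sequences in the file, but they round-trip back intact; a small sketch of that round trip:

```python
import json

# Round-trip of the {"text": ...} payload that app.py writes.
payload = json.dumps({"text": "नमस्ते Hello"})
assert "\\u0928" in payload  # Devanagari is escaped in the serialized JSON
restored = json.loads(payload)["text"]
print(restored == "नमस्ते Hello")  # True
```

Passing `ensure_ascii=False` would keep the file human-readable for Hindi text without changing what `json.load` returns.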
ocr_cpu.py ADDED
@@ -0,0 +1,97 @@
+ import os
+ from transformers import AutoModel, AutoTokenizer
+ import torch
+
+ # Load the tokenizer
+ model_name = "ucaslcl/GOT-OCR2_0"
+ tokenizer = AutoTokenizer.from_pretrained(
+     model_name, trust_remote_code=True, return_tensors='pt'
+ )
+
+ # Load the model
+ model = AutoModel.from_pretrained(
+     model_name,
+     trust_remote_code=True,
+     low_cpu_mem_usage=True,
+     use_safetensors=True,
+     pad_token_id=tokenizer.eos_token_id,
+ )
+
+ # Ensure the model is in evaluation mode and loaded on CPU
+ device = torch.device("cpu")
+ dtype = torch.float32  # Use float32 on CPU
+ model = model.eval().to(device)
+
+
+ def _first_text(outputs):
+     """Normalize model.chat output (a string or a list of strings) to stripped text."""
+     if isinstance(outputs, list):
+         outputs = outputs[0] if outputs else ""
+     return outputs.strip() if isinstance(outputs, str) else ""
+
+
+ # OCR function
+ def extract_text_got(uploaded_file):
+     """Use the GOT-OCR2.0 model to extract text from the uploaded image."""
+     temp_file_path = 'temp_image.jpg'
+     try:
+         with open(temp_file_path, 'wb') as temp_file:
+             temp_file.write(uploaded_file.read())  # Save the upload to disk
+
+         # OCR modes, tried in order until one yields text
+         ocr_types = ['ocr', 'format']
+         fine_grained_options = ['ocr', 'format']
+         color_options = ['red', 'green', 'blue']
+         box = [10, 10, 100, 100]  # Example box for demonstration
+         multi_crop_types = ['ocr', 'format']
+
+         results = []
+
+         # Plain OCR (no autocast needed on CPU)
+         for ocr_type in ocr_types:
+             with torch.no_grad():
+                 outputs = model.chat(
+                     tokenizer, temp_file_path, ocr_type=ocr_type
+                 )
+             text = _first_text(outputs)
+             if text:
+                 return text  # Return if successful
+             results.append(text or "No result")
+
+         # Fine-grained OCR with a bounding box
+         for ocr_type in fine_grained_options:
+             with torch.no_grad():
+                 outputs = model.chat(
+                     tokenizer, temp_file_path, ocr_type=ocr_type, ocr_box=box
+                 )
+             text = _first_text(outputs)
+             if text:
+                 return text  # Return if successful
+             results.append(text or "No result")
+
+         # Fine-grained OCR with color hints
+         for ocr_type in fine_grained_options:
+             for color in color_options:
+                 with torch.no_grad():
+                     outputs = model.chat(
+                         tokenizer, temp_file_path, ocr_type=ocr_type, ocr_color=color
+                     )
+                 text = _first_text(outputs)
+                 if text:
+                     return text  # Return if successful
+                 results.append(text or "No result")
+
+         # Multi-crop OCR
+         for ocr_type in multi_crop_types:
+             with torch.no_grad():
+                 outputs = model.chat_crop(
+                     tokenizer, temp_file_path, ocr_type=ocr_type
+                 )
+             text = _first_text(outputs)
+             if text:
+                 return text  # Return if successful
+             results.append(text or "No result")
+
+         # No attempt produced any text
+         return "No text extracted."
+
+     except Exception as e:
+         return f"Error during text extraction: {str(e)}"
+
+     finally:
+         if os.path.exists(temp_file_path):
+             os.remove(temp_file_path)
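The control flow in `extract_text_got` is a fallback chain: run each OCR mode in order and return the first non-empty result, with a fixed message when every attempt comes up empty. A model-free sketch of that pattern (the helper name `first_successful` is illustrative):

```python
def first_successful(attempts):
    """Run zero-argument callables in order and return the first non-empty,
    stripped string result, mirroring the mode-by-mode fallback in
    extract_text_got."""
    for attempt in attempts:
        out = attempt()
        text = out.strip() if isinstance(out, str) else ""
        if text:
            return text
    return "No text extracted."

# Attempts stand in for the plain / fine-grained / multi-crop OCR calls.
print(first_successful([lambda: "", lambda: "  hello  ", lambda: "later"]))  # hello
```

Because later attempts are only invoked when earlier ones fail, the expensive multi-crop pass runs only as a last resort.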
requirements.txt ADDED
@@ -0,0 +1,14 @@
+ torch==2.0.1
+ torchvision==0.15.2
+ transformers==4.37.2
+ megfile==3.1.2
+ tiktoken
+ verovio
+ opencv-python
+ cairosvg
+ accelerate
+ numpy==1.26.4
+ loadimg
+ pillow
+ markdown
+ shutils