prithivMLmods committed 4f6c188 (verified · parent f3ad0c8)

Update README.md

Files changed: README.md (+112 −1)
tags:
  - text-generation-inference
  - Qwen
  - Hoags
---
> [!WARNING]
> **Note:** This model contains artifacts and may perform poorly in some cases.

# **Hoags-2B-Exp**

**Hoags-2B-Exp** is a fine-tuned version of Qwen2-VL-2B-Instruct, designed for contextual reasoning and multimodal understanding. When asked about an image, it automatically describes the scene and answers follow-up questions in a conversational manner.

# **Key Enhancements**

* **Advanced Contextual Reasoning**: Enhanced logical inference and decision-making for strong performance on context-heavy reasoning tasks.

* **Image understanding across resolutions and aspect ratios**: Performs well on visual understanding benchmarks such as MathVista, DocVQA, RealWorldQA, and MTVQA.

* **Long-Context Video Understanding**: Can process and reason over videos of 20 minutes or more for video-based question answering, content creation, and dialogue.

* **Device Integration**: With strong reasoning and decision-making abilities, the model can be integrated into mobile devices, robots, and automation systems for real-time operation on visual and textual input.

* **Multilingual Support**: Understands text within images in many languages, including English, Chinese, Japanese, Korean, Arabic, Vietnamese, and most European languages.

# **How to Use**

```python
instruction = "Analyze the image and generate a clear, concise description of the scene, objects, and actions. Respond to user queries with accurate, relevant details derived from the visual content. Maintain a natural conversational flow and ensure logical consistency. Summarize or clarify as needed for understanding."
```
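
The instruction string above is not wired into the example below by default. One way to apply it is as a system turn in the chat messages, a sketch assuming the standard Qwen2-VL chat format (the URL and abbreviated instruction are placeholders):

```python
# Abbreviated placeholder for the full instruction string above
instruction = "Analyze the image and respond conversationally."

# Hypothetical wiring: prepend the instruction as a system turn
chat_with_system = [
    {"role": "system", "content": [{"type": "text", "text": instruction}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},  # placeholder URL
            {"type": "text", "text": "Analyze the context of this image."},
        ],
    },
]
```

This list can then be passed to `processor.apply_chat_template` exactly like the `messages` list in the example below.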

```python
import torch  # needed if you enable the bfloat16 / flash-attention variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model with automatic dtype selection and device placement
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Hoags-2B-Exp", torch_dtype="auto", device_map="auto"
)

# Recommended: enable flash_attention_2 for better performance in multi-image and video tasks
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "prithivMLmods/Hoags-2B-Exp",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Load the processor (handles both tokenization and image preprocessing)
processor = AutoProcessor.from_pretrained("prithivMLmods/Hoags-2B-Exp")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Analyze the context of this image."},
        ],
    }
]

# Prepare inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate, then strip the prompt tokens from the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
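
The same pipeline handles video input; only the message content type changes. A minimal sketch (the file path and `fps` value are placeholders, not real assets or tuned settings):

```python
# Video input uses the same message schema; only the content type changes.
# The file path and fps value below are placeholders.
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 1.0},
            {"type": "text", "text": "Summarize what happens in this clip."},
        ],
    }
]

# The rest of the pipeline is unchanged (requires the model/processor from above):
# text = processor.apply_chat_template(video_messages, tokenize=False, add_generation_prompt=True)
# image_inputs, video_inputs = process_vision_info(video_messages)
```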

# **Buffer Handling**

During streamed generation (e.g. with transformers' `TextIteratorStreamer`), accumulate the decoded chunks and strip the `<|im_end|>` end-of-turn token before yielding each partial result:

```python
buffer = ""
for new_text in streamer:
    buffer += new_text
    # Remove the Qwen end-of-turn marker from the partial output
    buffer = buffer.replace("<|im_end|>", "")
    yield buffer
```
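
To see what that loop does end to end, here is a self-contained sketch with a plain list standing in for the real streamer's decoded chunks:

```python
def stream_clean(streamer):
    """Accumulate streamed chunks, stripping the <|im_end|> end-of-turn token."""
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        buffer = buffer.replace("<|im_end|>", "")
        yield buffer

# Simulated chunks as a text streamer might yield them
chunks = ["The image shows ", "a dog on a beach.", "<|im_end|>"]
final = list(stream_clean(chunks))[-1]
print(final)  # "The image shows a dog on a beach."
```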

# **Key Features**

1. **Advanced Contextual Reasoning:**
   - Optimized for **context-aware problem-solving** and **logical inference**.

2. **Optical Character Recognition (OCR):**
   - Extracts and processes text from images with high accuracy.

3. **Mathematical and Logical Problem Solving:**
   - Supports complex reasoning and outputs equations in **LaTeX format**.

4. **Conversational and Multi-Turn Interaction:**
   - Handles **multi-turn dialogue** with enhanced memory retention and response coherence.

5. **Multi-Modal Inputs & Outputs:**
   - Processes images, text, and combined inputs to generate insightful analyses.

6. **Secure and Efficient Model Loading:**
   - Uses **Safetensors** for faster and more secure model weight handling.
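
Multi-turn interaction works through ordinary chat history: the previous assistant reply is appended before the follow-up question so the model sees the full conversation. A sketch with hypothetical placeholder turns and URL:

```python
# Hypothetical multi-turn history; replies and URL are placeholders.
chat = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},  # placeholder
            {"type": "text", "text": "Analyze the context of this image."},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "The image shows a beach scene."}],  # earlier reply
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Which objects stand out most, and why?"}],
    },
]
```

The `chat` list is passed through `processor.apply_chat_template` and the rest of the pipeline exactly like a single-turn request.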