Visual Question Answering
Transformers
Safetensors
llava
image-text-to-text
AIGC
LLaVA
Inference Endpoints
ponytail committed on
Commit
9759053
1 Parent(s): 32e58f4

Update README.md

Files changed (1)
  1. README.md +40 -12
README.md CHANGED
@@ -15,31 +15,65 @@ Human-related vision and language tasks are widely applied across various social
15
  Specifically, (1) we first construct a large-scale, high-quality human-related image-text (caption) dataset extracted from the Internet for domain-specific alignment in the first stage (coming soon); (2) we also propose to construct multi-granularity captions for human-related images (coming soon), covering the human face, the human body, and the whole image, which are used to fine-tune a large language model. Lastly, we evaluate our model on a series of downstream tasks, where Human-LLaVA achieves the best overall performance among multimodal models of similar scale. In particular, it performs best on a series of human-related tasks, significantly surpassing similar models and ChatGPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.
16
 
17
 
18
- ## Architecture
 
 
 
 
 
19
 
20
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/SkFB0x3JunWE_Wae808Nq.png)
21
 
22
 
23
 
24
 
25
 
 
 
 
 
26
 
 
 
27
 
28
- #### Data Cleaning Process
29
 
30
- ![img](file:///C:\Users\hp-pc\AppData\Local\Temp\ksohtml4716\wps1.png)
 
 
 
 
 
 
 
31
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
32
  ## Get the Dataset
33
 
34
  #### Domain Alignment Stage
35
 
36
- HumanCaption-10M(ours): coming soon
37
 
38
  #### Instruction Tuning Stage
39
 
 
 
 
 
40
  **Caption**
41
 
42
- HumanCaption-300K: [FreedomIntelligence/PubMedVision · Datasets at Hugging Face](https://huggingface.co/datasets/FreedomIntelligence/PubMedVision)
43
 
44
  ShareGPT4V:
45
 
@@ -63,12 +97,6 @@ celeba_attribute:
63
 
64
  Face_hq:
65
 
66
- ## Result
67
-
68
-
69
-
70
-
71
-
72
 
73
  ## Citation
74
 
 
15
  Specifically, (1) we first construct a large-scale, high-quality human-related image-text (caption) dataset extracted from the Internet for domain-specific alignment in the first stage (coming soon); (2) we also propose to construct multi-granularity captions for human-related images (coming soon), covering the human face, the human body, and the whole image, which are used to fine-tune a large language model. Lastly, we evaluate our model on a series of downstream tasks, where Human-LLaVA achieves the best overall performance among multimodal models of similar scale. In particular, it performs best on a series of human-related tasks, significantly surpassing similar models and ChatGPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.
16
 
17
 
18
+ ## DEMO
19
+
20
+ <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/tyT9FvycyyVWISd1-_A-m.mp4"></video>
21
+
22
+
23
+ ## Result
24
 
 
25
 
26
 
27
 
28
 
29
 
30
+ ## How to Use
31
+ ``` python
32
+ import requests
33
+ from PIL import Image
34
 
35
+ import torch
36
+ from transformers import AutoProcessor, LlavaForConditionalGeneration
37
 
 
38
 
39
+ model_id = "huangfx1020/human_llama3_8b"
40
+ cuda = 0
41
+ model = LlavaForConditionalGeneration.from_pretrained(
42
+ model_id,
43
+ torch_dtype=torch.float16,
44
+ low_cpu_mem_usage=True,
45
+ trust_remote_code=True
46
+ ).to(cuda)
47
 
48
+ processor = AutoProcessor.from_pretrained(model_id,trust_remote_code=True)
49
+
50
+
51
+ text = "Please describe this picture"
52
+ prompt = "<|start_header_id|>user<|end_header_id|><image>"+text+"<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
53
+ image_file = "/home/ubuntu/san/LYT/UniDetRet-exp/HumanLlama3/test1.jpg"  # local path; replace with your own image
54
+ raw_image = Image.open(image_file)
55
+ # For a remote image instead (image_url is a placeholder): raw_image = Image.open(requests.get(image_url, stream=True).raw)
56
+ inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(cuda, torch.float16)
57
+
58
+ output = model.generate(**inputs, max_new_tokens=400, do_sample=False)
59
+ predict = processor.decode(output[0], skip_special_tokens=True)
60
+ print(predict)
61
+ ```
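
The prompt string in the snippet above is assembled by hand, so a small helper can make the template easier to reuse. The following is a minimal sketch, not part of the committed README: `build_prompt` and `strip_prompt` are hypothetical names, and `strip_prompt` assumes that `generate` returns the prompt tokens followed by the newly generated tokens, which is typical for decoder-only models but should be checked against the installed `transformers` version.

```python
def build_prompt(text: str) -> str:
    """Wrap a user question in the chat template used in the README snippet."""
    return (
        "<|start_header_id|>user<|end_header_id|><image>"
        + text
        + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    )


def strip_prompt(output_ids, prompt_len):
    """Hypothetical helper: keep only the tokens that follow the prompt.

    Assumes the output sequence starts with the prompt_len prompt tokens.
    """
    return output_ids[prompt_len:]


prompt = build_prompt("Please describe this picture")
print(prompt)
```

With the model loaded as in the snippet, the same idea would apply to `output[0]` and the length of `inputs["input_ids"][0]`, under the assumption stated above.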
62
  ## Get the Dataset
63
 
64
  #### Domain Alignment Stage
65
 
66
+ HumanCaption-10M (self-constructed): Coming Soon!
67
 
68
  #### Instruction Tuning Stage
69
 
70
+ #### Instruct Data Example
71
+
72
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/1pE0bLxhlr5HHPME3D1k8.png)
73
+
74
  **Caption**
75
 
76
+ HumanCaption-300K: Coming Soon!
77
 
78
  ShareGPT4V:
79
 
 
97
 
98
  Face_hq:
99
 
 
 
 
 
 
 
100
 
101
  ## Citation
102