Update README.md
README.md
CHANGED
@@ -15,31 +15,65 @@ Human-related vision and language tasks are widely applied across various social
Specifically, (1) we first construct a large-scale, high-quality human-related image-text (caption) dataset extracted from the Internet for domain-specific alignment in the first stage (Coming soon); (2) we also propose to construct multi-granularity captions for human-related images (Coming soon), covering the human face, the human body, and the whole image, which are then used to fine-tune a large language model. Lastly, we evaluate our model on a series of downstream tasks, where Human-LLaVA achieves the best overall performance among multimodal models of similar scale. In particular, it exhibits the best performance on a series of human-related tasks, significantly surpassing similar models and ChatGPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.
-
##
-
![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/SkFB0x3JunWE_Wae808Nq.png)
-
#### Data Cleaning Process
## Get the Dataset
#### Domain Alignment Stage
-
HumanCaption-10M(
#### Instruction Tuning Stage
**Caption**
-
HumanCaption-300K:
ShareGPT4V:
@@ -63,12 +97,6 @@ celeba_attribute:
Face_hq:
-
## Result
## Citation
Specifically, (1) we first construct a large-scale, high-quality human-related image-text (caption) dataset extracted from the Internet for domain-specific alignment in the first stage (Coming soon); (2) we also propose to construct multi-granularity captions for human-related images (Coming soon), covering the human face, the human body, and the whole image, which are then used to fine-tune a large language model. Lastly, we evaluate our model on a series of downstream tasks, where Human-LLaVA achieves the best overall performance among multimodal models of similar scale. In particular, it exhibits the best performance on a series of human-related tasks, significantly surpassing similar models and ChatGPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.
+
## DEMO
+
<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/tyT9FvycyyVWISd1-_A-m.mp4"></video>
+
## Result
+
## How to Use
``` python
import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "huangfx1020/human_llama3_8b"
cuda = 0  # index of the CUDA device to run on
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(cuda)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

text = "Please describe this picture"
prompt = "<|start_header_id|>user<|end_header_id|><image>" + text + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
image_file = "/home/ubuntu/san/LYT/UniDetRet-exp/HumanLlama3/test1.jpg"
# Download the image when image_file is a URL; otherwise open it as a local path.
if image_file.startswith(("http://", "https://")):
    raw_image = Image.open(requests.get(image_file, stream=True).raw)
else:
    raw_image = Image.open(image_file)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(cuda, torch.float16)

# Greedy decoding, limited to 400 newly generated tokens.
output = model.generate(**inputs, max_new_tokens=400, do_sample=False)
predict = processor.decode(output[0][:], skip_special_tokens=True)
print(predict)
```
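The prompt string in the example above is assembled by hand, and the decoded output contains the prompt text as well as the answer. The sketch below shows one way to factor both steps into small helpers. The names `build_prompt` and `generated_only` are hypothetical and not part of the model repository, and the slicing assumes that `generate` returns the prompt ids followed by the new ids, which is the usual behaviour for decoder-only models in `transformers`.

```python
# Hypothetical helpers for the usage example; these names are illustrative only.
USER_HEADER = "<|start_header_id|>user<|end_header_id|>"
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"
IMAGE_TOKEN = "<image>"


def build_prompt(text: str) -> str:
    """Wrap a user instruction in the single-turn template used in the example."""
    return f"{USER_HEADER}{IMAGE_TOKEN}{text}{END_OF_TURN}{ASSISTANT_HEADER}"


def generated_only(output_ids, prompt_length: int):
    """Drop the first `prompt_length` ids so that only the newly generated ids remain.

    Works on any sliceable sequence (a list here, a 1-D tensor in practice).
    """
    return output_ids[prompt_length:]


if __name__ == "__main__":
    print(build_prompt("Please describe this picture"))
```

With the variables from the example, this could be used as `answer_ids = generated_only(output[0], inputs["input_ids"].shape[1])` followed by `processor.decode(answer_ids, skip_special_tokens=True)`, which should print the answer without the echoed prompt.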
## Get the Dataset
#### Domain Alignment Stage
+
HumanCaption-10M (self-constructed): Coming Soon!
#### Instruction Tuning Stage
+
#### Instruct Data Example
+
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/1pE0bLxhlr5HHPME3D1k8.png)
**Caption**
+
HumanCaption-300K: Coming Soon!
ShareGPT4V:
Face_hq:
## Citation