---
license: llama3
base_model: meta-llama/Meta-Llama-3-8B-Instruct
library_name: transformers
tags:
- AIGC
- LLaVA
datasets:
- OpenFace-CQUPT/FaceCaption-15M
metrics:
- accuracy
pipeline_tag: visual-question-answering
---
# Human-LLaVA-8B

## DEMO


<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/TpN2t19Poe5YbHHP8uN7_.mp4"></video>


![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/1xS27bvECvGTKntvOa1SQ.png)

### Introduction

Human-related vision and language tasks are widely applied across various social scenarios. Recent studies demonstrate that large vision-language models can enhance the performance of various downstream tasks in visual-language understanding. However, models trained on general-domain data often do not perform well in specialized fields. In this study, we train a domain-specific large vision-language model, Human-LLaVA, which aims to serve as a unified multimodal vision-language model for human-related tasks.

Specifically, (1) we first construct **a large-scale, high-quality human-related image-text (caption) dataset** extracted from the Internet for domain-specific alignment in the first stage (coming soon); (2) we then construct **multi-granularity captions for human-related images** (coming soon), covering the human face, the human body, and the whole image, and use them to fine-tune the large language model. Finally, we evaluate our model on a series of downstream tasks; our **Human-LLaVA** achieves the best overall performance among multimodal models of similar scale. In particular, it delivers the best results on a range of human-related tasks, significantly surpassing similar models as well as GPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.


## Result
Human-LLaVA performs well in both general and specialized domains.


![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/X-712oVUBPXbfLcAz83fb.png)

## News and Updates 🔥🔥🔥
* Oct. 23, 2024: **🤗 [HumanCaption-HQ-311K](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-HQ-311K) is released! 👍👍👍**
* Sep. 12, 2024: **🤗 [HumanCaption-10M](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-10M) is released! 👍👍👍**
* Sep. 8, 2024: **🤗 [HumanVLM](https://huggingface.co/OpenFace-CQUPT/Human_LLaVA) is released! 👍👍👍**



## 🤗 Transformers
To use Human-LLaVA for inference, you only need a few lines of code, as demonstrated below. Please make sure that you are using an up-to-date version of `transformers`.
```python
import requests
from PIL import Image

import torch
from transformers import AutoProcessor, AutoModelForPreTraining

model_id = "OpenFace-CQUPT/Human_LLaVA"
device = "cuda:0"

# Load the model in half precision and move it to the GPU.
model = AutoModelForPreTraining.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Build the prompt in the "USER: <image>\n...\nASSISTANT:" format expected by the model.
text = "Please describe this picture"
prompt = "USER: <image>\n" + text + "\nASSISTANT:"

# Load the image from a local file (or uncomment the line below to load it from a URL).
image_file = "./test1.jpg"
raw_image = Image.open(image_file)
# raw_image = Image.open(requests.get(image_file, stream=True).raw)

inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to(device, torch.float16)

output = model.generate(**inputs, max_new_tokens=400, do_sample=False)
predict = processor.decode(output[0], skip_special_tokens=True)
print(predict)
```
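
The decoded string above includes the prompt as well as the model's answer. If you only want the generated answer, a minimal sketch (reusing `inputs`, `model`, and `processor` from the snippet above; slicing by prompt length is our suggestion, not part of the original example) is:

```python
# Keep only the tokens generated after the prompt.
prompt_length = inputs["input_ids"].shape[1]             # number of prompt tokens passed to generate()
output = model.generate(**inputs, max_new_tokens=400, do_sample=False)
answer_ids = output[0][prompt_length:]                    # generate() returns the prompt tokens followed by the new tokens
answer = processor.decode(answer_ids, skip_special_tokens=True)
print(answer)
```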

Our training code has been released publicly on GitHub: [ddw2AIGROUP2CQUPT/Human-LLaVA-8B](https://github.com/ddw2AIGROUP2CQUPT/Human-LLaVA-8B).
## Get the Dataset
#### Dataset Example

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/-gTV7ym_gmNmJqNRDzlCx.png)

#### Domain Alignment Stage
[HumanCaption-10M](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-10M) (self-constructed): released! (A loading sketch is given at the end of this section.)

#### Instruction Tuning Stage
**All public datasets have been filtered; we will consider releasing all of the processed text in the future.**

[HumanCaption-HQ](https://huggingface.co/datasets/OpenFace-CQUPT/HumanCaption-HQ-311K) (self-constructed): released!

[FaceCaptionA](https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M) (self-constructed): released!

CelebA: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

ShareGPT4V: https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md

LLaVA-Instruct_zh: https://huggingface.co/datasets/openbmb/llava_zh

verified_ref3rec: https://huggingface.co/datasets/lucasjin/refcoco/blob/main/ref3rec.json

verified_ref3reg: https://huggingface.co/datasets/lucasjin/refcoco/blob/main/ref3rec.json

verified_shikra: https://github.com/shikras/shikra
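
The self-constructed datasets above are hosted on the Hugging Face Hub. A minimal sketch of pulling one of them with the `datasets` library follows (the `train` split and the field layout are assumptions; check each dataset card for the actual schema):

```python
from datasets import load_dataset

# Minimal sketch: the "train" split and the column names are assumptions; see the dataset card.
# streaming=True iterates over records without downloading the full dataset up front.
ds = load_dataset("OpenFace-CQUPT/HumanCaption-10M", split="train", streaming=True)

sample = next(iter(ds))
print(sample.keys())   # inspect the available fields (e.g., image reference and caption)
```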



## Citation

```
@misc{dai2024humanvlmfoundationhumanscenevisionlanguage,
      title={HumanVLM: Foundation for Human-Scene Vision-Language Model}, 
      author={Dawei Dai and Xu Long and Li Yutang and Zhang Yuanhui and Shuyin Xia},
      year={2024},
      eprint={2411.03034},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2411.03034}, 
}
```

## Contact

mailto: [[email protected]](mailto:[email protected]) or [[email protected]](mailto:[email protected])