OpenFace-CQUPT/Human_LLaVA

@@ -15,31 +15,65 @@ Human-related vision and language tasks are widely applied across various social
 Specifically, (1) we first construct a large-scale, high-quality human-related image-text (caption) dataset extracted from the Internet for domain-specific alignment in the first stage (Coming soon); (2) we also propose to construct multi-granularity captions for human-related images (Coming soon), covering the human face, the human body, and the whole image, which are then used to fine-tune a large language model. Lastly, we evaluate our model on a series of downstream tasks, where Human-LLaVA achieves the best overall performance among multimodal models of similar scale. In particular, it exhibits the best performance in a series of human-related tasks, significantly surpassing similar models and ChatGPT-4o. We believe that the Human-LLaVA model and the series of datasets presented in this work can promote research in related fields.
-## Architecture
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/SkFB0x3JunWE_Wae808Nq.png)
-#### Data Cleaning Process
-![img](file:///C:\Users\hp-pc\AppData\Local\Temp\ksohtml4716\wps1.png)
 ## Get the Dataset
 #### Domain Alignment Stage
-HumanCaption-10M(ours): coming soon
 #### Instruction Tuning Stage
 **Caption**
-HumanCaption-300K: [FreedomIntelligence/PubMedVision · Datasets at Hugging Face](https://huggingface.co/datasets/FreedomIntelligence/PubMedVision)
 ShareGPT4V:
@@ -63,12 +97,6 @@ celeba_attribute:
 Face_hq:
-## Result
 ## Citation

 Specifically, (1) we first construct a large-scale, high-quality human-related image-text (caption) dataset extracted from the Internet for domain-specific alignment in the first stage (Coming soon); (2) we also propose to construct multi-granularity captions for human-related images (Coming soon), covering the human face, the human body, and the whole image, which are then used to fine-tune a large language model. Lastly, we evaluate our model on a series of downstream tasks, where Human-LLaVA achieves the best overall performance among multimodal models of similar scale. In particular, it exhibits the best performance in a series of human-related tasks, significantly surpassing similar models and ChatGPT-4o. We believe that the Human-LLaVA model and the series of datasets presented in this work can promote research in related fields.
+## Demo
+<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/tyT9FvycyyVWISd1-_A-m.mp4"></video>
+## Result
+## How to Use
+``` python
+import requests
+from PIL import Image
+import torch
+from transformers import AutoProcessor, LlavaForConditionalGeneration
+
+model_id = "huangfx1020/human_llama3_8b"
+device = 0  # index of the CUDA device to run on
+
+# Load the model in half precision to reduce GPU memory use.
+model = LlavaForConditionalGeneration.from_pretrained(
+    model_id,
+    torch_dtype=torch.float16,
+    low_cpu_mem_usage=True,
+    trust_remote_code=True
+).to(device)
+processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+
+# Llama-3 style chat prompt; <image> marks where the image is inserted.
+text = "Please describe this picture"
+prompt = "<|start_header_id|>user<|end_header_id|><image>"+text+"<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
+
+image_file = "/home/ubuntu/san/LYT/UniDetRet-exp/HumanLlama3/test1.jpg"  # replace with your own image
+# Local path: open the file directly.
+raw_image = Image.open(image_file)
+# For an http(s) URL, download the image instead:
+# raw_image = Image.open(requests.get(image_file, stream=True).raw)
+
+inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(device, torch.float16)
+output = model.generate(**inputs, max_new_tokens=400, do_sample=False)
+predict = processor.decode(output[0], skip_special_tokens=True)
+print(predict)
+```
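The prompt string in the snippet above is assembled by hand, and the decoded output contains the prompt as well as the answer. A minimal sketch of two small helpers that could make this reusable is given below. `build_prompt` and `new_tokens` are hypothetical names introduced here for illustration and are not part of this repository; the template is copied from the snippet, and the slicing assumes that `generate` returns the prompt token ids followed by the newly generated ids.

```python
from typing import Sequence


def build_prompt(text: str) -> str:
    """Wrap a user request in the chat template used in the snippet above."""
    return (
        "<|start_header_id|>user<|end_header_id|>"
        "<image>" + text +
        "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    )


def new_tokens(output_ids: Sequence[int], prompt_len: int) -> Sequence[int]:
    """Drop the first prompt_len ids so that only generated ids remain."""
    return output_ids[prompt_len:]


if __name__ == "__main__":
    print(build_prompt("Please describe this picture"))
    # Toy ids standing in for real token ids: three prompt ids, two generated.
    print(new_tokens([11, 12, 13, 21, 22], prompt_len=3))
```

With the variables from the snippet, this would be used as `processor.decode(new_tokens(output[0], inputs["input_ids"].shape[1]), skip_special_tokens=True)`, which should return only the model's answer under the assumption stated above.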
 ## Get the Dataset
 #### Domain Alignment Stage
+HumanCaption-10M (self-constructed): Coming Soon!
 #### Instruction Tuning Stage
+#### Instruction Data Example
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/64259db7d3e6fdf87e4792d0/1pE0bLxhlr5HHPME3D1k8.png)
 **Caption**
+HumanCaption-300K: Coming Soon!
 ShareGPT4V:
 Face_hq:
 ## Citation