
FLIP (Facial Language Image Pretraining)

This repository is the official implementation of FaceCaption-15M.

Updates:

[24/07/20] Usage examples for FLIP have been released: OpenFace-CQUPT/FLIP-demo

[24/07/17] The FLIP model has been released: OpenFace-CQUPT/FLIP

Overview of FLIP architecture.


Fig. 1: (a) The same color represents shared parameters; “12x” stands for 12 stacked transformer modules. (b), (c), and (d): the FLIP-based model applied to the tasks of text-image retrieval, facial attribute prediction, and sketch less facial image retrieval, respectively.

Training

Coming soon. (The training code is only meaningful once the datasets have been published.)

```
python pretrain.py > log.log
```
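Until the training code is released, here is a minimal sketch of the symmetric image-text contrastive (InfoNCE) objective that CLIP-style models such as FLIP are typically pretrained with; all names are illustrative and not the actual pretrain.py internals.

```python
# Illustrative sketch of a CLIP-style image-text contrastive (ITC) loss;
# not the actual pretrain.py implementation.
import torch
import torch.nn.functional as F

def itc_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (image, text) embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs lie on the diagonal: each image is contrasted against all
    # texts in the batch, and each text against all images.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```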

Pre-trained Models

We provide pretrained model weights:
FLIP Base: click here
FLIP Large: coming soon
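If you want to poke at the released weights before reading the demo code, a quick inspection sketch follows, assuming a standard PyTorch checkpoint; the filename and the "model" key are assumptions, not a documented format.

```python
# Inspect the released checkpoint's parameter names and shapes.
import torch

ckpt = torch.load("flip_base.pth", map_location="cpu")  # hypothetical filename
state_dict = ckpt.get("model", ckpt)  # some checkpoints nest weights under "model"
for name, tensor in list(state_dict.items())[:10]:
    print(name, tuple(tensor.shape))  # match names/shapes against Fig. 1
```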

Datasets

Download the FaceCaption-15M dataset from here.

Results

Task1: Text-Image Retrieval

Table 1: Comparison with other classical pretrained models. All pretrained model backbones are frozen, with only the linear layer being fine-tuned. † represents the model pretrained on the LAION-Face [86] dataset; * represents the model pretrained on the FaceCaption dataset constructed without using LLM text generation.
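A minimal sketch of the linear-probe protocol described in the Table 1 caption (frozen backbone, only a linear head trained); `backbone`, the dimensions, and the optimizer settings are illustrative placeholders, not the repo's actual evaluation code.

```python
# Linear-probe setup: freeze all pretrained weights, train one linear layer.
import torch
import torch.nn as nn

def build_linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int):
    for p in backbone.parameters():
        p.requires_grad = False              # backbone stays frozen
    backbone.eval()
    head = nn.Linear(feat_dim, num_classes)  # the only fine-tuned layer
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
    return head, optimizer
```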

Task2: Facial Attributes Prediction

Table 2: Comparison with other classical models. † represents the model pre-trained on the original LAION-Face dataset.

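Facial attribute prediction is a multi-label task (a face can carry many attributes at once), so a natural head is one independent sigmoid/BCE output per attribute. A minimal sketch under that assumption, with illustrative dimensions (40 matches CelebA-style attribute sets):

```python
# Multi-label attribute head over frozen features; names are illustrative.
import torch
import torch.nn as nn

feat_dim, num_attrs = 512, 40
head = nn.Linear(feat_dim, num_attrs)
criterion = nn.BCEWithLogitsLoss()

features = torch.randn(8, feat_dim)                   # stand-in for frozen FLIP features
labels = torch.randint(0, 2, (8, num_attrs)).float()  # 0/1 per attribute
loss = criterion(head(features), labels)
```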

Task3: Sketch Less Facial Image Retrieval

Table 3: Comparative results with different baseline methods. † represents the model pre-trained on the LAION-Face dataset.


Fig. 2: Demonstration of our FLIP-based model on the SLFIR task. Both methods can retrieve the target face photo within the top-5 list from a partial sketch, but our FLIP-based model achieves this with fewer strokes than the baseline. The number at the bottom denotes the rank of the paired (true-match) photo at each stage.
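The per-stage rank shown in Fig. 2 can be computed by embedding the partial sketch, scoring it against every gallery photo embedding, and locating the true match; a minimal sketch with illustrative names:

```python
# Rank of the paired photo for one sketch stage; names are illustrative.
import torch
import torch.nn.functional as F

def rank_of_true_match(sketch_emb: torch.Tensor,
                       gallery_embs: torch.Tensor,
                       true_index: int) -> int:
    """1-based rank of the true-match photo given a (D,) sketch embedding."""
    sims = F.cosine_similarity(sketch_emb.unsqueeze(0), gallery_embs)  # (N,)
    order = sims.argsort(descending=True)        # gallery indices, best first
    return int((order == true_index).nonzero(as_tuple=True)[0]) + 1
```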

Contacts

Email: [email protected] or [email protected]

Citation

@misc{dai202415mmultimodalfacialimagetext,
      title={15M Multimodal Facial Image-Text Dataset}, 
      author={Dawei Dai and YuTang Li and YingGe Liu and Mingming Jia and Zhang YuanHui and Guoyin Wang},
      year={2024},
      eprint={2407.08515},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.08515}, 
}