---
license: apache-2.0
# inference: false
# pipeline_tag: zero-shot-image-classification
pipeline_tag: feature-extraction
# inference:
#   parameters:
tags:
- clip
- zh
- image-text
- feature-extraction
---

# Model Details

This model is a Chinese CLIP model trained on the [Noah-Wukong Dataset (100M)](https://wukong-dataset.github.io/wukong-dataset/) and [Zero (23M)](https://zero.so.com/). We use ViT-L/14 from [OpenAI](https://github.com/openai/CLIP) as the image encoder and the Chinese pre-trained language model [chinese-roberta-wwm-large](https://huggingface.co/hfl/chinese-roberta-wwm-ext-large) as the text encoder. We freeze the image encoder and finetune only the text encoder. The model was first trained for 10 epochs on Wukong and then for another 12 epochs on Wukong and Zero.

# Taiyi (太乙)

Taiyi models are a branch of the Fengshenbang (封神榜) series of models. The models in Taiyi are pre-trained with multimodal pre-training strategies. We will release more image-text models trained on Chinese datasets to benefit the Chinese community.

# Usage

```python3
from PIL import Image
import requests
import torch
from transformers import BertForSequenceClassification, BertTokenizer
from transformers import CLIPProcessor, CLIPModel
import numpy as np

query_texts = ["一只猫", "一只狗", '两只猫', '两只老虎', '一只老虎']  # Input texts; replace with any labels you like.

# Load the Taiyi Chinese text encoder
text_tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-large-326M-Chinese")
text_encoder = BertForSequenceClassification.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-large-326M-Chinese").eval()
text = text_tokenizer(query_texts, return_tensors='pt', padding=True)['input_ids']

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # Replace with any image URL
# Load the CLIP image encoder
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
image = processor(images=Image.open(requests.get(url, stream=True).raw), return_tensors="pt")

with torch.no_grad():
    image_features = clip_model.get_image_features(**image)
    text_features = text_encoder(text).logits
    # Normalize features
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    # Compute cosine similarity; logit_scale is the learned scale factor
    logit_scale = clip_model.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
    print(np.around(probs, 3))
```

# Evaluation

### Zero-Shot Classification

| model | dataset | Top1 | Top5 |
| ---- | ---- | ---- | ---- |
| Taiyi-CLIP-Roberta-326M-Chinese | ImageNet1k-CN | 53.05% | 79.55% |

### Zero-Shot Text-to-Image Retrieval

A minimal retrieval sketch is given at the end of this card.

| model | dataset | Top1 | Top5 | Top10 |
| ---- | ---- | ---- | ---- | ---- |
| Taiyi-CLIP-Roberta-326M-Chinese | Flickr30k-CNA-test | 54.36% | 80.56% | 87.90% |
| Taiyi-CLIP-Roberta-326M-Chinese | COCO-CN-test | 51.47% | 81.00% | 90.40% |
| Taiyi-CLIP-Roberta-326M-Chinese | wukong50k | 61.18% | 90.46% | 95.74% |

# Citation

If you find this resource useful, please cite the following website in your paper:

```
@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2022},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
```
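
# Retrieval Sketch

The Usage example above scores one image against several texts (zero-shot classification). As a complement, below is a minimal sketch of zero-shot text-to-image retrieval with the same checkpoints: encode one text query, encode a gallery of candidate images, and rank the images by cosine similarity. The candidate image URLs are placeholders for illustration; this is not the exact evaluation script behind the benchmark numbers above.

```python3
from PIL import Image
import requests
import torch
from transformers import BertForSequenceClassification, BertTokenizer
from transformers import CLIPProcessor, CLIPModel

query = ["一只猫"]  # a single Chinese text query ("a cat")
candidate_urls = [  # placeholder image URLs; replace with your own gallery
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    "http://images.cocodataset.org/val2017/000000397133.jpg",
]

# Same text and image encoders as in the Usage example
text_tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-large-326M-Chinese")
text_encoder = BertForSequenceClassification.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-large-326M-Chinese").eval()
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

images = [Image.open(requests.get(u, stream=True).raw) for u in candidate_urls]
image_inputs = processor(images=images, return_tensors="pt")
text_ids = text_tokenizer(query, return_tensors="pt", padding=True)["input_ids"]

with torch.no_grad():
    image_features = clip_model.get_image_features(**image_inputs)
    text_features = text_encoder(text_ids).logits
    # Normalize both sides, then rank candidate images by similarity to the query
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    similarity = (text_features @ image_features.t()).squeeze(0)
    ranking = similarity.argsort(descending=True)
    print([candidate_urls[i] for i in ranking])
```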