Model Card for ConceptCLIP
Model Details
Model Description
ConceptCLIP is a large-scale vision-language pre-training model enhanced with medical concepts for diverse medical image modalities. It enables robust performance across multiple medical imaging tasks through concept-enhanced language-image alignment.
- Developed by: Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Hao Chen
- Model type: Vision-Language Pre-trained Model (Medical Specialized)
- Language(s): English (text), Multi-modal (medical imaging)
- License: MIT
- Finetuned from model: OpenCLIP (SigLIP-ViT-400M-16 vision encoder + PubMedBERT text encoder)
Model Sources
- Repository: GitHub Project
- Paper: ConceptCLIP: Towards Trustworthy Medical AI via Concept-Enhanced Language-Image Pre-training
- Demo: Hugging Face Model Hub
Uses
Direct Use
- Zero-shot medical image classification
- Cross-modal (image-text) retrieval (see the sketch after this list)
- Zero-shot concept annotation
- Feature extraction for whole-slide image analysis
- Feature extraction for medical report generation
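A minimal sketch of cross-modal retrieval with ConceptCLIP features is shown below. It reuses the loading code and output attribute names (image_features, text_features) from the How to Get Started example further down; the image paths, the text query, and the similarity ranking are illustrative assumptions, not part of the released code.

# Hedged sketch: rank a small image gallery against one text query.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

# Placeholder gallery paths and an illustrative text query.
image_paths = ['example_data/chest_X-ray.jpg', 'example_data/brain_MRI.jpg']
images = [Image.open(p).convert('RGB') for p in image_paths]
query = 'a chest X-ray of a patient'

inputs = processor(images=images, text=[query], return_tensors='pt',
                   padding=True, truncation=True).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# Cosine similarity between the query embedding and each image embedding.
image_feats = outputs.image_features / outputs.image_features.norm(dim=-1, keepdim=True)
text_feats = outputs.text_features / outputs.text_features.norm(dim=-1, keepdim=True)
scores = (text_feats @ image_feats.t()).squeeze(0)

# Print gallery images from most to least similar to the query.
for idx in scores.argsort(descending=True).tolist():
    print(f'{scores[idx]:.4f}  {image_paths[idx]}')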
Downstream Use
- Fine-tuning on specific medical imaging modalities (e.g., CT, MRI, X-ray) for classification and visual question answering
- Building concept bottleneck models for explainable predictions (see the sketch after this list)
- Integration into clinical decision support systems
- Medical education and training tools
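The sketch below illustrates one way a concept bottleneck head could sit on top of frozen ConceptCLIP image features: the model first predicts interpretable concept scores, then a diagnosis from those scores. The concept count, feature dimension, and layer sizes are assumptions for illustration only; they are not part of the released model.

# Hedged sketch: a simple concept-bottleneck head on frozen image features.
import torch
import torch.nn as nn

class ConceptBottleneckHead(nn.Module):
    """Predict interpretable concept scores first, then a label from them."""
    def __init__(self, feat_dim: int, num_concepts: int, num_classes: int):
        super().__init__()
        self.concept_layer = nn.Linear(feat_dim, num_concepts)   # features -> concepts
        self.label_layer = nn.Linear(num_concepts, num_classes)  # concepts -> labels

    def forward(self, image_features: torch.Tensor):
        concept_logits = self.concept_layer(image_features)
        class_logits = self.label_layer(torch.sigmoid(concept_logits))
        return concept_logits, class_logits

# Example with assumed sizes: 768-d features, 10 concepts, 3 diagnostic classes.
head = ConceptBottleneckHead(feat_dim=768, num_concepts=10, num_classes=3)
dummy_features = torch.randn(4, 768)  # stand-in for outputs.image_features
concept_logits, class_logits = head(dummy_features)
print(concept_logits.shape, class_logits.shape)  # torch.Size([4, 10]) torch.Size([4, 3])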
Out-of-Scope Use
- Direct clinical diagnosis without clinical validation
- Non-medical image analysis
- General purpose vision tasks outside medical domain
Bias, Risks, and Limitations
- Trained primarily on medical imaging data, which may contain demographic biases
- Performance may vary across different medical imaging modalities
- Should not be used as a sole diagnostic tool without human oversight
Recommendations
- Validate outputs with clinical experts before medical decision making
- Fine-tune on domain-specific data for specialized applications
- Conduct bias analysis when deploying in new clinical environments
How to Get Started with the Model
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image

# Load the model and processor (the repo ships custom code, so trust_remote_code is required).
model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

# Prepare an example image and candidate label prompts for zero-shot classification.
image = Image.open('example_data/chest_X-ray.jpg').convert('RGB')
labels = ['chest X-ray', 'brain MRI', 'skin lesion']
texts = [f'a medical image of {label}' for label in labels]

inputs = processor(
    images=image,
    text=texts,
    return_tensors='pt',
    padding=True,
    truncation=True
).to(model.device)

# Compute image-text similarities and convert them into per-label probabilities.
with torch.no_grad():
    outputs = model(**inputs)
logits = (outputs.logit_scale * outputs.image_features @ outputs.text_features.t()).softmax(dim=-1)[0]

print({label: f'{prob:.2%}' for label, prob in zip(labels, logits)})
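The printed dictionary maps each candidate label to a softmax probability over the three prompts; for the example chest X-ray image, the 'chest X-ray' prompt should receive the highest score.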
Training Details
Training Data
- Large-scale medical image-text pairs with concept information
Training Procedure
- Built on OpenCLIP architecture with medical concept integration
- Pre-training with image-text alignment (IT-Align) and patch-concept alignment (PC-Align) objectives (an illustrative sketch of an image-text contrastive objective follows this list)
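The sketch below shows a generic CLIP-style bidirectional contrastive loss, included only to illustrate what image-text alignment optimizes. The actual ConceptCLIP objectives may differ (e.g., a SigLIP-style sigmoid loss, plus the PC-Align term that aligns image patches with medical concepts); the function name and toy tensors are assumptions for illustration.

# Hedged sketch: a generic bidirectional image-text contrastive (InfoNCE) loss.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_features, text_features, logit_scale):
    # Normalize embeddings so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = logit_scale * image_features @ text_features.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched image-text pairs sit on the diagonal; penalize both directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy example: a batch of 8 random 512-d embeddings and a fixed temperature.
loss = image_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512), logit_scale=100.0)
print(loss.item())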
Training Hyperparameters
- Base architecture: SigLIP-ViT-400M-16 + PubMedBERT
- Training regime: Mixed precision training
- Batch size: 12,288 (without PC-Align), 6,144 (with PC-Align)
- Learning rate: 5e-4 (without PC-Align), 3e-4 (with PC-Align)
Evaluation
Testing Data & Metrics
Testing Data
- Evaluated on multiple open-source medical imaging benchmarks covering medical image diagnosis, cross-modal retrieval, medical visual question answering, medical report generation, whole-slide image analysis, and explainable AI
Citation
BibTeX:
@article{nie2025conceptclip,
  title={ConceptCLIP: Towards Trustworthy Medical AI via Concept-Enhanced Language-Image Pre-training},
  author={Nie, Yuxiang and He, Sunan and Bie, Yequan and Wang, Yihui and Chen, Zhixuan and Yang, Shu and Chen, Hao},
  journal={arXiv preprint arXiv:2501.xxxxx},
  year={2025}
}
APA:
[More Information Needed]
Model Card Contact
Yuxiang Nie: [email protected]