	---
	library_name: transformers
	tags: []
	pipeline_tag: image-text-to-text
	license: mit
	---

	# Fine-Grained Visual Classification on HAM10000

	Project Page: [SelfSynthX](https://github.com/sycny/SelfSynthX).

	Paper on arXiv: [Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data](https://arxiv.org/abs/2502.14044)

	This model is a multimodal foundation model fine-tuned from [LLaVA-1.5-7B-hf](https://huggingface.co/llava-hf/llava-1.5-7B-hf) for fine-grained skin lesion classification and explainability on the HAM10000 dataset.

	## Key Details

	- Base Model: LLaVA-1.5-7B
	- Dataset: HAM10000
	- Innovation:
	  - Self-Synthesized Data: Generates interpretable explanations by extracting lesion-specific visual concepts using the Information Bottleneck principle.
	  - Iterative Fine-Tuning: Uses reward model-free rejection sampling to progressively improve classification accuracy and explanation quality.
	- Intended Use: Skin lesion classification with human-verifiable explanations for dermatological analysis.
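
The reward model-free rejection sampling listed above can be pictured as a simple filter: sample several candidate responses per image, reject those whose predicted label is wrong, and keep the one that mentions the most lesion-specific concepts. The sketch below illustrates that idea under these assumptions only; it is not the authors' implementation, and the `Candidate` type, `concept_coverage` score, and `select_candidate` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Candidate:
    """One sampled model response (hypothetical container)."""
    predicted_label: str
    explanation: str


def concept_coverage(explanation: str, concepts: Sequence[str]) -> int:
    """Count how many of the image's selected concepts the explanation mentions."""
    text = explanation.lower()
    return sum(1 for concept in concepts if concept.lower() in text)


def select_candidate(
    candidates: Sequence[Candidate],
    true_label: str,
    concepts: Sequence[str],
) -> Optional[Candidate]:
    """Reject wrong-label candidates, then keep the one covering the most concepts.

    Returns None when every candidate is rejected, so the image contributes
    no training example in this round. No learned reward model is involved:
    the only signals are the ground-truth label and the concept list.
    """
    accepted = [c for c in candidates if c.predicted_label == true_label]
    if not accepted:
        return None
    return max(accepted, key=lambda c: concept_coverage(c.explanation, concepts))
```

In an iterative setup, the accepted responses from one round would serve as fine-tuning data for the next round. See the paper for the selection criteria actually used.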

	## How to Use

	```python
	from PIL import Image
	import torch
	from transformers import AutoProcessor, LlavaForConditionalGeneration

	# Load the fine-tuned checkpoint in half precision on the GPU.
	model_id = "YuchengShi/LLaVA-v1.5-7B-HAM10000"
	model = LlavaForConditionalGeneration.from_pretrained(
	    model_id,
	    torch_dtype=torch.float16,
	    low_cpu_mem_usage=True,
	).to("cuda")
	processor = AutoProcessor.from_pretrained(model_id)

	# Build a single-turn prompt; the {"type": "image"} entry marks where the image goes.
	conversation = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "text", "text": "What type of skin lesion is this?"},
	            {"type": "image"},
	        ],
	    },
	]
	prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

	# image_file is a local path, so open it directly rather than fetching it over HTTP.
	image_file = "ham10000/test1.png"
	raw_image = Image.open(image_file)
	inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

	# Greedy decoding; [2:] skips the first two tokens of the returned sequence.
	output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
	print(processor.decode(output[0][2:], skip_special_tokens=True))
	```
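
The generated text is free-form, so downstream code usually needs to map it to one of HAM10000's seven diagnostic categories. A minimal sketch follows, assuming the response names the diagnosis in plain words. The keyword table and the `extract_label` helper are illustrative assumptions, not part of the model's API, and the keywords may need adjusting to the outputs you actually observe.

```python
from typing import Optional

# HAM10000's seven class codes, each with keywords that might appear in a response.
# The keyword choices are an assumption made for this sketch.
HAM10000_KEYWORDS = {
    "akiec": ("actinic keratos", "intraepithelial carcinoma", "bowen"),
    "bcc": ("basal cell carcinoma",),
    "bkl": ("benign keratosis", "seborrheic keratos", "solar lentig"),
    "df": ("dermatofibroma",),
    "mel": ("melanoma",),
    "nv": ("melanocytic nev", "nevus", "nevi"),
    "vasc": ("vascular lesion", "angioma", "hemorrhage"),
}


def extract_label(response: str) -> Optional[str]:
    """Return the class code whose keyword occurs earliest in the response.

    Returns None when no keyword matches, so callers can treat the
    response as unparseable instead of guessing a class.
    """
    text = response.lower()
    best_code, best_pos = None, len(text)
    for code, keywords in HAM10000_KEYWORDS.items():
        for keyword in keywords:
            pos = text.find(keyword)
            if pos != -1 and pos < best_pos:
                best_code, best_pos = code, pos
    return best_code
```

Picking the earliest match is a deliberate simplification: explanations often mention other diagnoses later in the text when ruling them out.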

	## Training & Evaluation

	- Training: Fine-tuned using LoRA on HAM10000 with iterative rejection sampling.
	- Evaluation: The paper reports higher classification accuracy and more robust, interpretable explanations than baseline models; see the paper for the quantitative results.
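
For readers who want to check classification accuracy on their own labeled split, the computation can be sketched as follows. This is a generic example, not the paper's evaluation script; the `accuracy_report` helper is a hypothetical name, and predictions are assumed to be HAM10000 class codes (or `None` when no label could be parsed from the response).

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Tuple


def accuracy_report(
    predictions: Sequence[Optional[str]],
    labels: Sequence[str],
) -> Tuple[float, Dict[str, float]]:
    """Return overall accuracy and per-class accuracy (recall).

    A prediction of None (no label could be parsed) counts as incorrect.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")

    total = Counter(labels)
    correct = Counter(
        label for pred, label in zip(predictions, labels) if pred == label
    )
    overall = sum(correct.values()) / len(labels)
    per_class = {label: correct[label] / count for label, count in total.items()}
    return overall, per_class
```

Reporting per-class accuracy alongside the overall figure is useful here because HAM10000 is class-imbalanced, with melanocytic nevi forming the majority of images.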

	## Citation

	If you use this model, please cite:

	```bibtex
	@inproceedings{
	shi2025enhancing,
	title={Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data},
	author={Yucheng Shi and Quanzheng Li and Jin Sun and Xiang Li and Ninghao Liu},
	booktitle={The Thirteenth International Conference on Learning Representations},
	year={2025},
	url={https://openreview.net/forum?id=lHbLpwbEyt}
	}
	```

	## Contact

	For any questions, suggestions, or issues, please open an issue on GitHub or contact us at [[email protected]](mailto:[email protected]).

	GitHub repository: https://github.com/sycny/SelfSynthX