# Model Card for GiT
GiT: Towards Generalist Vision Transformer through Universal Language Interface
This repository includes GiT checkpoints, training logs, and the pre-trained files used.
## Model Details
### Model Description
In this project, we introduce GiT (Generalist Vision Transformer). GiT has the following characteristics:
- Minimalist, LLM-like architecture: GiT consists solely of a single transformer, without any additional vision encoder or adapter.
- Covers all types of visual understanding tasks: GiT addresses a spectrum of visual tasks, including object-level tasks (e.g., object detection), pixel-level tasks (e.g., semantic segmentation), and vision-language tasks (e.g., image captioning).
- Achieves task synergy through a unified language interface: like an LLM, GiT exhibits a task-synergy effect in multi-task training.
- Strong performance on zero-shot and few-shot benchmarks: GiT scales well with model size and data, demonstrating remarkable generalizability across diverse scenarios after training on 27 datasets.
- Developed by: Haiyang Wang ([email protected]), Hao Tang ([email protected])
- License: Apache License 2.0
### Model Sources
- Repository: https://github.com/Haiyang-W/GiT
- Paper: https://arxiv.org/abs/2403.09394
## Uses
Please refer to the repository linked above for more details about usage.
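Since the card carries no library tag, checkpoints and logs in this repository are fetched as raw Hub files. The sketch below shows how the direct download URL for such a file is formed, following the Hub's standard `resolve` URL pattern; the repo id and filename are placeholders, not the actual artifact names, so check the repository for the real ones.

```python
def hub_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct download URL for a file hosted in a Hugging Face repo.

    The Hub serves raw repo files at
    https://huggingface.co/<repo_id>/resolve/<revision>/<filename>.
    """
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


# Placeholder repo id and checkpoint name -- substitute the real ones
# from the GiT repository.
print(hub_resolve_url("username/GiT", "git_base.pth"))
```

The same path also works with `huggingface_hub.hf_hub_download(repo_id=..., filename=...)`, which additionally caches the file locally.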
### Inference Providers
This model is not currently available via any of the supported Inference Providers, and it cannot be deployed to the HF Inference API because the model card has no library tag.