---
license: apache-2.0
---
|
# Model Card for GiT
|
|
|
|
[GiT: Towards Generalist Vision Transformer through Universal Language Interface](https://arxiv.org/abs/2403.09394)
|
|
|
This repository includes GiT checkpoints, training logs, and the pre-trained files used for training.
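As a minimal sketch of fetching these files programmatically, the snippet below uses the `huggingface_hub` library; the `repo_id` and `filename` values are illustrative placeholders, so substitute the actual names from this repository's file listing.

```python
# Minimal sketch: download a GiT checkpoint from the Hugging Face Hub.
# NOTE: repo_id and filename are hypothetical placeholders; check the
# "Files and versions" tab of this repository for the real names.
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="username/GiT",   # placeholder repo id
    filename="git_base.pth",  # placeholder checkpoint file
)
print(checkpoint_path)  # local path to the cached file
```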
|
|
|
## Model Details
|
|
|
### Model Description
|
|
|
|
We introduce GiT (Generalist Vision Transformer), which has the following characteristics:
|
|
|
- 😮 **Minimalist architecture design similar to LLMs**: GiT consists solely of a single transformer, without an additional vision encoder or adapter.

- 🚀 **Covering all types of visual understanding tasks**: GiT addresses a spectrum of visual tasks, including object-level tasks (e.g., object detection), pixel-level tasks (e.g., semantic segmentation), and vision-language tasks (e.g., image captioning).

- 🤗 **Achieving task synergy via a unified language interface**: Similar to LLMs, GiT exhibits a task-synergy effect in multi-task training; the interface idea is sketched below.

- 🔥 **Strong performance on zero-shot and few-shot benchmarks**: GiT scales well with model size and data, demonstrating remarkable generalizability across diverse scenarios after joint training on 27 datasets.
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6585493b53c37507639fe3ba/glLj40VWCFaa0BVi4-_9d.png)
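To make the unified language interface concrete, the toy sketch below shows how outputs of different tasks can be serialized into one token stream for a single transformer. This is an illustrative simplification, not GiT's actual tokenization scheme; the `serialize_*` helpers are hypothetical, and the paper describes the real vocabulary and output templates.

```python
# Toy illustration only: serialize heterogeneous task outputs into plain token
# sequences so one autoregressive transformer can produce all of them.
# GiT's real discretization and vocabulary differ; see the paper for details.

def serialize_detection(boxes):
    """Render boxes as text, e.g. '<12> <30> <96> <120> dog'."""
    return " ".join(
        f"<{x1}> <{y1}> <{x2}> <{y2}> {label}" for x1, y1, x2, y2, label in boxes
    )

def serialize_caption(caption):
    """Vision-language outputs are already natural language."""
    return caption

# Both tasks reduce to token sequences for the same language decoder:
print(serialize_detection([(12, 30, 96, 120, "dog")]))
print(serialize_caption("a dog running on grass"))
```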
|
|
|
|
|
|
|
- **Developed by:** Haiyang Wang ([email protected]), Hao Tang ([email protected])
|
- **License:** Apache License 2.0
|
|
|
### Model Sources
|
|
|
|
|
|
- **Repository:** https://github.com/Haiyang-W/GiT

- **Paper:** https://arxiv.org/abs/2403.09394
|
|
|
## Uses

Please refer to the [GitHub repository](https://github.com/Haiyang-W/GiT) for more details about usage.
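Before wiring a checkpoint into that codebase, a quick local sanity check can be done with plain PyTorch. This is a sketch under two assumptions: `checkpoint_path` comes from the download snippet above, and the checkpoint stores its weights under a `state_dict` key (common for MMDetection-style training pipelines, but verify against the actual file).

```python
# Sketch: inspect a downloaded GiT checkpoint with plain PyTorch.
# Assumes `checkpoint_path` from the earlier download snippet; falls back to
# treating the file as a bare state dict if there is no "state_dict" key.
import torch

ckpt = torch.load(checkpoint_path, map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state_dict)} parameter tensors")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```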