Model Card for GiT

GiT: Towards Generalist Vision Transformer through Universal Language Interface

This repository includes GiT checkpoints, logs, and the pre-trained files used.

Model Details

Model Description

In this project, we introduce GiT (Generalist Vision Transformer). GiT has the following characteristics:

  • 😮 Minimalist architecture design similar to LLMs: GiT consists solely of a single transformer, without an additional vision encoder or adapter.
  • 🚀 Covering all types of visual understanding tasks: GiT addresses a spectrum of visual tasks, including object-level tasks (e.g., object detection), pixel-level tasks (e.g., semantic segmentation), and vision-language tasks (e.g., image captioning).
  • 🤗 Achieving task synergy through a unified language interface: As with LLMs, we observe a task synergy effect in GiT under multi-task training.
  • 🔥 Strong performance on zero-shot and few-shot benchmarks: GiT scales well with model size and data, demonstrating remarkable generalizability across diverse scenarios after being trained on 27 datasets.
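
To make the "unified language interface" idea more concrete, here is a toy sketch of how targets from different task types could be written as one flat token sequence. It is an illustration under stated assumptions, not the GiT implementation: the helper names (`quantize`, `detection_to_tokens`, `caption_to_tokens`), the 100-bin coordinate quantization, and the `<bin_k>` token spelling are all hypothetical and are not taken from the GiT codebase.

```python
# Toy illustration only: NOT the GiT implementation.
# Assumption: each task's target is written as a flat list of string tokens,
# so that a single sequence model could in principle be trained on all of them.

NUM_BINS = 100  # hypothetical number of coordinate bins


def quantize(value, size, num_bins=NUM_BINS):
    """Map a pixel coordinate in [0, size] to a discrete location token."""
    ratio = min(max(value / size, 0.0), 1.0)          # clamp to [0, 1]
    index = min(int(ratio * num_bins), num_bins - 1)  # keep the last bin in range
    return f"<bin_{index}>"


def detection_to_tokens(box, label, width, height):
    """Object-level target: four location tokens followed by the class name."""
    x0, y0, x1, y1 = box
    return [
        quantize(x0, width), quantize(y0, height),
        quantize(x1, width), quantize(y1, height),
    ] + label.split()


def caption_to_tokens(caption):
    """Vision-language target: the lower-cased caption split on whitespace."""
    return caption.lower().split()


det = detection_to_tokens((80, 120, 160, 360), "traffic light", width=320, height=480)
cap = caption_to_tokens("A traffic light at dusk")
print(det)  # ['<bin_25>', '<bin_25>', '<bin_50>', '<bin_75>', 'traffic', 'light']
print(cap)  # ['a', 'traffic', 'light', 'at', 'dusk']
```

Both targets end up in the same output space (a list of tokens), which is the property that would let one model handle detection and captioning without task-specific heads. The actual tokenization used by GiT may differ.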

Model Sources

Uses

Please refer here for more detail about usage.
