---
license: apache-2.0
language:
- en
---

# Model Card for ViM

[Beyond Language: Multi-layer Transformer is a General Visual Learner](https://arxiv.org/abs/2222.33333)

This repository includes ViM checkpoints, training logs, and the pre-trained files used.

## Model Details

### Model Description

In this project, we introduce ViM (Large Visual Modeling). ViM has the following characteristics:

- 😮 **Minimalist architecture design similar to LLMs**: ViM consists solely of a single transformer, without additional vision encoders or adapters.
- 🚀 **Covers all types of visual understanding tasks**: ViM addresses a spectrum of visual tasks, including object-level tasks (e.g., object detection), pixel-level tasks (e.g., semantic segmentation), and vision-language tasks (e.g., image captioning).
- 🤗 **Achieves task synergy through a unified language interface**: Similar to LLMs, ViM exhibits a task-synergy effect in multi-task training.
- 🔥 **SOTA performance on zero-shot and few-shot benchmarks**: ViM scales well with model size and data, demonstrating remarkable generalizability across diverse scenarios after training on 27 datasets.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6585493b53c37507639fe3ba/FEhRT9ZscNwG7xIYYIYmh.png)

- **Developed by:** Haiyang Wang (wanghaiyang6@stu.pku.edu.cn), Hao Tang (tanghao@stu.pku.edu.cn)
- **License:** Apache License 2.0

### Model Sources

- **Repository:** https://github.com/Haiyang-W/ViM
- **Paper:** https://arxiv.org/abs/2222.33333

## Uses

Please refer to [the GitHub repository](https://github.com/Haiyang-W/ViM) for details on usage.
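Since this repository hosts the checkpoints directly, a quick way to fetch one is through the `huggingface_hub` client. The sketch below is a minimal example and assumes placeholder names: the `repo_id` and the checkpoint `filename` are hypothetical, so check the repository's file listing (or the GitHub README) for the actual values before running it.

```python
# Minimal sketch: download a ViM checkpoint from the Hugging Face Hub.
# NOTE: repo_id and filename below are placeholders, not confirmed names;
# consult this repo's "Files and versions" tab for the real ones.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="Haiyang-W/ViM",   # hypothetical repo id
    filename="vim_base.pth",   # hypothetical checkpoint filename
)
print(ckpt_path)  # local path to the cached checkpoint file
```

The downloaded file can then be loaded and used with the training and evaluation scripts provided in the GitHub repository linked above.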