---
inference: false
pipeline_tag: image-text-to-text
license: apache-2.0
datasets:
  - VIMA/VIMA-Data
tags:
  - llara
  - llava
  - robotics
  - vlm
---

# LLaRA Model Card

This model is released with the paper *LLaRA: Supercharging Robot Learning Data for Vision-Language Policy*.

Xiang Li¹, Cristina Mata¹, Jongwoo Park¹, Kumara Kahatapitiya¹, Yoo Sung Jang¹, Jinghuan Shang¹, Kanchana Ranasinghe¹, Ryan Burgert¹, Mu Cai², Yong Jae Lee², and Michael S. Ryoo¹

¹Stony Brook University ²University of Wisconsin-Madison

## Model details

**Model type:** D-RT2-Style is one of the baselines in our LLaRA paper, following the style of RT-2. It is an open-source visuomotor policy trained by fine-tuning LLaVA-7b-v1.5 on D-RT2-Style, an instruction-following dataset converted from VIMA-Data. For the conversion code, please refer to `convert_vima.ipynb`.

**Model date:** `llava-1.5-7b-llara-D-RT2-Style-VIMA-80k` was trained in June 2024.

**Paper or resources for more information:** https://github.com/LostXine/LLaRA

**Where to send questions or comments about the model:** https://github.com/LostXine/LLaRA/issues
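To fetch the checkpoint locally, one option is the Hugging Face Hub client. This is a minimal sketch, assuming the `huggingface_hub` package is installed and that the repository id below (inferred from the model name in this card) is correct; verify it against the actual Hub page before use.

```python
# Hedged sketch: download this checkpoint from the Hugging Face Hub.
from huggingface_hub import snapshot_download

# Assumed repo id, inferred from the model name in this card.
REPO_ID = "variante/llava-1.5-7b-llara-D-RT2-Style-VIMA-80k"


def download_checkpoint(local_dir: str = "./llara-checkpoint") -> str:
    """Download all files of the model repo and return the local path."""
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    print(download_checkpoint())
```

For inference and evaluation, the downloaded weights are intended to be used with the LLaRA codebase linked above, which builds on the LLaVA training and serving code.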

## Intended use

**Primary intended uses:** The primary use of LLaRA is research on large multimodal models for robotics.

**Primary intended users:** The primary intended users of the model are researchers and hobbyists in robotics, computer vision, natural language processing, machine learning, and artificial intelligence.