metadata

license: other
datasets:
  - remyxai/vqasynth_spacellava
tags:
  - remyx

Model Card for SpaceMinitron-4B

SpaceMinitron-4B uses Minitron-4B-Base as the LLM backbone, along with the fused DINOv2+SigLIP visual features from prismatic-vlms.
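
As a rough illustration of what "fused" features means here: as we understand the Prismatic-style backbone, patch features from the two vision encoders are concatenated per patch along the channel dimension before being projected into the LLM's embedding space. The sketch below uses plain Python lists and toy dimensions; `fuse_patch_features` and the feature sizes are hypothetical and not part of prismatic-vlms.

```python
def fuse_patch_features(dino_patches, siglip_patches):
    """Concatenate per-patch feature vectors from two encoders channel-wise.

    Hypothetical helper for illustration only; real backbones do this on tensors.
    """
    if len(dino_patches) != len(siglip_patches):
        raise ValueError("both encoders must produce the same number of patches")
    return [list(d) + list(s) for d, s in zip(dino_patches, siglip_patches)]


# Toy example: 2 patches, 3-dim "DINOv2" features and 2-dim "SigLIP" features.
dino = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
siglip = [[1.0, 2.0], [3.0, 4.0]]
fused = fuse_patch_features(dino, siglip)
print(len(fused), len(fused[0]))  # 2 patches, each with 3 + 2 = 5 channels
```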

Model Details

The model is a full fine-tune on data that includes the SpaceLLaVA dataset, which was built with VQASynth to enhance spatial reasoning as in SpatialVLM.

Model Description

This model uses data synthesis techniques and publicly available models to reproduce the work described in SpatialVLM to enhance the spatial reasoning of multimodal models. With a pipeline of expert models, we can infer spatial relationships between objects in a scene to create a VQA dataset for spatial reasoning.
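
To make the idea concrete, here is a minimal sketch of the final step of such a pipeline: turning two object positions into a distance question and answer. It assumes the expert models have already produced a 3D centroid in meters for each object; `distance_qa`, the centroid values, and the answer template are hypothetical and are not VQASynth's actual API.

```python
import math


def distance_qa(name_a, centroid_a, name_b, centroid_b):
    """Build one distance question/answer pair from two 3D centroids in meters."""
    dist = math.dist(centroid_a, centroid_b)  # Euclidean distance
    question = f"What is the distance between {name_a} and {name_b}?"
    if dist < 1.0:
        answer = f"About {dist * 100:.0f} centimeters."
    else:
        answer = f"About {dist:.2f} meters."
    return {"question": question, "answer": answer}


# Hypothetical centroids, as if each object had been segmented and lifted
# into 3D using an estimated depth map.
qa = distance_qa("the man in the red hat", (0.0, 0.0, 3.0),
                 "the pallet of boxes", (1.5, 0.0, 5.0))
print(qa["question"])
print(qa["answer"])
```

Repeating this over many images and object pairs, with varied question templates, is one way to arrive at a VQA dataset for spatial reasoning.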

Developed by: remyx.ai
Model type: MultiModal Model, Vision Language Model, Prismatic-vlms, Minitron-4B-Base
Finetuned from model: Minitron-4B-Base (NVIDIA Open Model License Agreement)

Model Sources

Dataset: SpaceLLaVA
Repository: VQASynth
Paper: SpatialVLM

Usage

Try the run_inference.py script to run a quick test:

python run_inference.py --model_location remyxai/SpaceMinitron-4B \
                        --image_source "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg" \
                        --user_prompt "What is the distance between the man in the red hat and the pallet of boxes?"
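
If you want to try several prompts against the same image, a small wrapper can build the command for you. This is a sketch that relies only on the three flags shown above; `build_inference_command` and the second prompt are hypothetical, and the script must be in the working directory for the commands to run.

```python
import shlex


def build_inference_command(image_source, user_prompt,
                            model_location="remyxai/SpaceMinitron-4B"):
    """Return the argv list for run_inference.py using the flags shown above."""
    return [
        "python", "run_inference.py",
        "--model_location", model_location,
        "--image_source", image_source,
        "--user_prompt", user_prompt,
    ]


image = "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg"
prompts = [
    "What is the distance between the man in the red hat and the pallet of boxes?",
    "Is the man in the red hat to the left of the pallet of boxes?",  # hypothetical prompt
]
for prompt in prompts:
    cmd = build_inference_command(image, prompt)
    # Print a copy-pasteable command; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```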

Deploy

Under the docker directory, you'll find a dockerized Triton Server for this model. Run the following:

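# Build the image from the docker directory (the trailing "." is the build context)
docker build -f Dockerfile -t spaceminitron-4b-server:latest .
# Start the server, publishing Triton's default HTTP (8000), gRPC (8001), and metrics (8002) ports
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 24G spaceminitron-4b-server:latest
# From another terminal, query the running server
python3 client.py --image_path "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg" \
                  --prompt "What is the distance between the man in the red hat and the pallet of boxes?"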
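
Loading the model can take a while, so it may help to wait until the server reports ready before running the client. The sketch below probes Triton's standard HTTP readiness endpoint; it assumes the container serves Triton's HTTP API on the published port 8000, and the helper names are hypothetical.

```python
import urllib.request


def ready_url(host="localhost", port=8000):
    """URL of Triton's HTTP readiness endpoint (KServe v2 inference protocol)."""
    return f"http://{host}:{port}/v2/health/ready"


def is_server_ready(url, timeout=2.0):
    """Return True if the server answers the readiness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # connection refused, timeout, or an HTTP error status
        return False


url = ready_url()
print("probe:", url)
# Once the container is up, poll is_server_ready(url); it returns False
# while the server is unreachable or still loading the model.
```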

Citation

@article{chen2024spatialvlm,
  title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year = {2024},
  url = {https://arxiv.org/abs/2401.12168},
}

@inproceedings{karamcheti2024prismatic,
  title = {Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models},
  author = {Siddharth Karamcheti and Suraj Nair and Ashwin Balakrishna and Percy Liang and Thomas Kollar and Dorsa Sadigh},
  booktitle = {International Conference on Machine Learning (ICML)},
  year = {2024},
}