This model was trained on a larger version of the commabody dataset (194 minutes in total).
It includes a VQGAN encoder/decoder fine-tuned from an ImageNet-pretrained model. It compresses 160x256 images into a 10x16 grid of tokens.
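
To make the compression figures concrete, here is a small arithmetic sketch. It uses only the sizes quoted above; the variable names are illustrative and do not come from the model's code.

```python
# Sizes quoted in this README: a 160x256 image becomes a 10x16 grid of tokens.
IMAGE_H, IMAGE_W = 160, 256
TOKENS_H, TOKENS_W = 10, 16

# Spatial downsampling factor along each axis.
downsample_h = IMAGE_H // TOKENS_H
downsample_w = IMAGE_W // TOKENS_W

# Number of tokens used to represent one frame.
tokens_per_frame = TOKENS_H * TOKENS_W

# Number of pixel positions summarized by a single token.
pixels_per_token = downsample_h * downsample_w

print(f"downsampling: {downsample_h}x{downsample_w}")
print(f"tokens per frame: {tokens_per_frame}")
print(f"pixel positions per token: {pixels_per_token}")
```

Each token therefore covers a 16x16 patch of the image, and one frame is represented by 160 tokens.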
It also includes a GPT-2 model trained to predict the next frame, wheel speeds, and actions. It can be used either as a simulator or as a policy. More details are in our blog post.
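
The difference between the two modes can be sketched as follows. This is a toy illustration, not the model's actual interface: `Step`, `dummy_predictor`, `simulate_step`, and `policy_step` are hypothetical names, the two-float action and wheel-speed shapes are assumptions, and the stand-in predictor simply repeats the last step. The point is only which parts of a prediction each mode keeps.

```python
from typing import Callable, List, Tuple

# Hypothetical step layout (an assumption, not the real format):
# (frame tokens, wheel speeds, action)
Step = Tuple[List[int], Tuple[float, float], Tuple[float, float]]
Predictor = Callable[[List[Step]], Step]


def dummy_predictor(history: List[Step]) -> Step:
    """Stand-in for the trained model: repeats the most recent step."""
    frame, wheels, action = history[-1]
    return list(frame), wheels, action


def simulate_step(predict: Predictor, history: List[Step],
                  chosen_action: Tuple[float, float]) -> Step:
    """Simulator mode: keep the predicted frame and wheel speeds,
    but take the action from the caller instead of the model."""
    frame, wheels, _ = predict(history)
    return frame, wheels, chosen_action


def policy_step(predict: Predictor, history: List[Step]) -> Tuple[float, float]:
    """Policy mode: keep only the predicted action; the next frame and
    wheel speeds would come from the real robot, not the model."""
    _, _, action = predict(history)
    return action


# One past step with 160 frame tokens (10x16), zero wheel speeds, and an action.
history: List[Step] = [([0] * 160, (0.0, 0.0), (1.0, 0.0))]

sim_result = simulate_step(dummy_predictor, history, (0.5, -0.5))
policy_action = policy_step(dummy_predictor, history)
```

In simulator mode the caller supplies actions and reads back imagined observations; in policy mode the robot supplies observations and the model's action output drives it.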
You can run it on a comma body using our example script in body-jim.