microsoft/Magma-8B

@@ -102,7 +102,7 @@ Magma is a multimodal agentic AI model that can generate text based on the input
 ### Highlights
 * **Digital and Physical Worlds:** Magma is the first-ever foundation model for multimodal AI agents, designed to handle complex interactions across both virtual and real environments!
-* **Versatile Capabilities:** Magma as a single model not only posseesses generic image and videos understanding ability, but alse generate goal-driven visual plans and actions, making it versatile for different agentic tasks!
 * **State-of-the-art Performance:** Magma achieves state-of-the-art performance on various multimodal tasks, including UI navigation, robotics manipulation, as well as generic image and video understanding, in particular the spatial understanding and reasoning!
 * **Scalable Pretraining Strategy:** Magma is designed to be **learned scalably from unlabeled videos** in the wild in addition to the existing agentic data, making it strong generalization ability and suitable for real-world applications!
@@ -125,15 +125,20 @@ The model is developed by Microsoft and is funded by Microsoft Research. The mod
 <!-- {{ get_started_code | default("[More Information Needed]", true)}} -->
-Use the code below to get started with the model.
 ```python
 import torch
 from PIL import Image
 import requests
-from transformers import AutoModelForCausalLM
-from transformers import AutoProcessor
 # Load the model and processor
 model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
@@ -159,7 +164,6 @@ with torch.inference_mode():
 generate_ids = generate_ids[:, inputs["input_ids"].shape[-1] :]
 response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
 print(response)
 ```

 ### Highlights
 * **Digital and Physical Worlds:** Magma is the first-ever foundation model for multimodal AI agents, designed to handle complex interactions across both virtual and real environments!
+* **Versatile Capabilities:** Magma as a single model not only possesses generic image and videos understanding ability, but also generate goal-driven visual plans and actions, making it versatile for different agentic tasks!
 * **State-of-the-art Performance:** Magma achieves state-of-the-art performance on various multimodal tasks, including UI navigation, robotics manipulation, as well as generic image and video understanding, in particular the spatial understanding and reasoning!
 * **Scalable Pretraining Strategy:** Magma is designed to be **learned scalably from unlabeled videos** in the wild in addition to the existing agentic data, making it strong generalization ability and suitable for real-world applications!
 <!-- {{ get_started_code | default("[More Information Needed]", true)}} -->
+To get started with the model, you first need to make sure that `transformers` and `torch` are installed, as well as installing the following dependencies:
+```bash
+pip install torchvision Pillow open_clip_torch
+```
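Before running the model code, it can help to confirm that the packages named above are importable. The following is a minimal sketch, not part of the model card: the helper name `missing_packages` and the mapping `REQUIRED_MODULES` are hypothetical, and it relies on Pillow being imported as `PIL` and `open_clip_torch` as `open_clip`.

```python
import importlib.util

# Import name -> pip package name for the dependencies listed above.
# Pillow is imported as `PIL`; open_clip_torch is imported as `open_clip`.
REQUIRED_MODULES = {
    "torch": "torch",
    "transformers": "transformers",
    "torchvision": "torchvision",
    "PIL": "Pillow",
    "open_clip": "open_clip_torch",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return [
        pip_name
        for module, pip_name in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only looks the modules up without importing them, so it runs quickly and does not load `torch` itself.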
+Then you can run the following code:
 ```python
 import torch
 from PIL import Image
 import requests
+from transformers import AutoModelForCausalLM, AutoProcessor
 # Load the model and processor
 model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
 generate_ids = generate_ids[:, inputs["input_ids"].shape[-1] :]
 response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
 print(response)
 ```
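The slicing step near the end of the snippet, `generate_ids[:, inputs["input_ids"].shape[-1] :]`, keeps only the newly generated tokens, because for decoder-only models `generate` returns the prompt tokens followed by the continuation. A minimal sketch of the same idea, using plain Python lists as stand-ins for tensors; the token ids and the helper name `strip_prompt` are hypothetical:

```python
def strip_prompt(generated_rows, prompt_len):
    """Drop the first `prompt_len` token ids from each generated sequence."""
    return [row[prompt_len:] for row in generated_rows]


# Hypothetical token ids: four prompt tokens followed by three new ones.
prompt_ids = [101, 7, 8, 9]
generated = [prompt_ids + [42, 43, 2]]

new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)  # [[42, 43, 2]]
```

Without this step, the decoded response would begin with the prompt text before the model's answer.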