Update code snippet #11
by qubvel-hf (HF staff) - opened

README.md CHANGED
@@ -48,24 +48,37 @@ The SAM model is made up of 3 modules:

## Prompted-Mask-Generation

```python
import torch
import requests
from PIL import Image
from transformers import SamModel, SamProcessor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load model and processor
model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")

# prepare model inputs
img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
input_points = [[[450, 600]]]  # 2D localization of a window

inputs = processor(raw_image, input_points=input_points, return_tensors="pt")
inputs = inputs.to(device)

with torch.no_grad():
    outputs = model(**inputs)

# post-process model results
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
scores = outputs.iou_scores
print(scores)
# tensor([[[0.9057, 0.9563, 0.9669]]], device='cuda:0')
```
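
As a follow-up, here is a minimal sketch (not part of the original snippet) of how one might keep only the highest-scoring of the three candidate masks for the single point prompt, assuming the output shapes produced by the code above:

```python
# Sketch: pick the best candidate mask for the first (and only) image and point
# prompt, ranking by the model's predicted IoU scores.
best_idx = scores[0, 0].argmax().item()
best_mask = masks[0][0, best_idx]  # boolean tensor of shape (original_height, original_width)
print(best_mask.shape, best_mask.sum().item())  # mask size and number of foreground pixels
```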
Among other arguments to generate masks, you can pass 2D locations on the approximate position of your object of interest, a bounding box wrapping the object of interest (the format should be the x, y coordinates of the top-left and bottom-right points of the bounding box), or a segmentation mask. At the time of writing, passing text as input is not supported by the official model according to [the official repository](https://github.com/facebookresearch/segment-anything/issues/4#issuecomment-1497626844).
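For instance, a bounding-box prompt can be passed through the processor's `input_boxes` argument. The sketch below reuses `model`, `processor`, `raw_image`, and `device` from the snippet above; the box coordinates are illustrative placeholders, not values taken from the original README:

```python
# Hypothetical example: prompt SAM with a bounding box instead of a point.
# Boxes are given as [x_min, y_min, x_max, y_max] (top-left / bottom-right corners);
# the values below are placeholders chosen only for illustration.
input_boxes = [[[75, 275, 1725, 850]]]

inputs = processor(raw_image, input_boxes=input_boxes, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# post-process exactly as with point prompts
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
```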
For more details, refer to this notebook, which shows a walkthrough of how to use the model, with a visual example!