@merve on Hugging Face: "Florence-2 is a new vision foundation model capable of a wide variety of tasks…"

merve

posted an update Jun 20

Florence-2 is a new vision foundation model capable of a wide variety of tasks 🤯
Demo 👉🏻 gokaygokay/Florence-2
Collection 👉🏻 microsoft/florence-6669f44df0d87d9c3bfb76de

This model can handle tasks that vary from OCR to semantic segmentation.

What sets it apart from previous models is the training data: the authors compiled a dataset of 126M images with 5.4B annotations, labelled by their own data engine, which pseudo-labels images using smaller specialized models and APIs.

The architecture is similar to previous models: an image encoder, followed by a multimodal encoder and a text decoder. The multitask dataset pairs each task with its own prompt.
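Since every task is selected by a prompt, switching tasks at inference time mostly means switching the prompt string. Here is a minimal sketch of that idea, not an official recipe. It assumes the `microsoft/Florence-2-base` checkpoint from the collection, the task tokens (`<OCR>`, `<DETAILED_CAPTION>`, ...) as I recall them from the model card, and the `transformers` remote-code loading path. The helper names `build_prompt` and `run_task` are made up for illustration.

```python
# Task tokens as I recall them from the Florence-2 model card (assumption:
# check the card of the checkpoint you use, the list there is longer).
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "ocr": "<OCR>",
    "object_detection": "<OD>",
}


def build_prompt(task, text_input=None):
    """Return the prompt string for a task (hypothetical helper)."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    token = TASK_PROMPTS[task]
    # Some tasks take extra text after the token; most use the token alone.
    return token if text_input is None else token + text_input


def run_task(image, task, model_id="microsoft/Florence-2-base"):
    """Run one task on a PIL image (hypothetical helper, downloads weights)."""
    # Imported lazily so the prompt helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoProcessor

    # Florence-2 ships its modeling code with the checkpoint,
    # hence trust_remote_code=True.
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    prompt = build_prompt(task)
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    # Keep special tokens: the post-processing step parses them.
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        text, task=prompt, image_size=(image.width, image.height)
    )
```

With a PIL image loaded, `run_task(image, "ocr")` and `run_task(image, "detailed_caption")` would run the same weights on the same picture, with only the prompt changing.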

You can also fine-tune this model on any task of your choice. The authors report results on downstream tasks, both with the vision encoder frozen and unfrozen 🤓📉
They have also released fine-tuned models, which you can find in the collection above 🤗

polles

Jun 20

Nice post!

ZeroWw

Jun 21

Interesting. I gave it a photo of a barely readable, handwritten old piece of paper. With OCR it made a mess, but with "Detailed caption" it made only 2 errors.

lucasjin

Jun 22

It was trained on 126M images, yet it doesn't support Chinese or other languages well. A bit of a pity.
