You're welcome to try DeepSeek-VL2!

#2 opened by CharlesCXK

Hello, great work! We would also like to invite you to evaluate our DeepSeek-VL2 model (4.5B activated parameters), which was released in December 2024. DeepSeek-VL2 exhibits strong grounding capabilities and can handle various scenarios, including natural scenes, UI elements, and more.

The input format for our model is <|ref|>xxx<|/ref|>, where <|ref|> and <|/ref|> are special tokens and xxx is the object you want to query. The output format is <|ref|>xxx<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>, where xxx is the queried object from the prompt and [x1, y1, x2, y2] are the coordinates of the detected object. Here, (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner of the bounding box, with the top-left corner of the image at (0, 0). The coordinates are normalized to the range [0, 999]; for example, if the original width of the image is W, the absolute coordinate of x1 is x1 / 999 * W. If multiple objects are detected, the output contains more than one box, separated by commas, such as <|det|>[[x1, y1, x2, y2], [m1, n1, m2, n2]]<|/det|>.
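Since the output is plain text, it only takes a few lines to parse the boxes and map them back to pixel coordinates. Here is a minimal sketch; the helper name parse_grounding_output and the regular expression are illustrative, assuming the output string follows exactly the format described above:

```python
import json
import re

def parse_grounding_output(text: str, width: int, height: int):
    """Parse DeepSeek-VL2 grounding output of the form
    <|ref|>object<|/ref|><|det|>[[x1, y1, x2, y2], ...]<|/det|>
    and convert the [0, 999]-normalized boxes to absolute pixel coordinates."""
    results = []
    pattern = r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(\[\[.*?\]\])<\|/det\|>"
    for label, boxes_str in re.findall(pattern, text):
        for x1, y1, x2, y2 in json.loads(boxes_str):
            results.append({
                "label": label,
                "box": (
                    x1 / 999 * width,   # absolute x of the top-left corner
                    y1 / 999 * height,  # absolute y of the top-left corner
                    x2 / 999 * width,   # absolute x of the bottom-right corner
                    y2 / 999 * height,  # absolute y of the bottom-right corner
                ),
            })
    return results

# Example: one detected object in a 1280x720 image.
print(parse_grounding_output(
    "<|ref|>cat<|/ref|><|det|>[[100, 200, 500, 800]]<|/det|>",
    width=1280, height=720))
```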

Our model is available for download on HuggingFace (https://huggingface.co/deepseek-ai/deepseek-vl2) and can also be accessed via API at https://cloud.siliconflow.cn.
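For reference, below is a minimal sketch of querying the model through an OpenAI-compatible client. The base URL, model identifier, and vision message format are assumptions based on typical OpenAI-compatible services, so please check the SiliconFlow documentation for the exact values:

```python
from openai import OpenAI

# Assumption: SiliconFlow exposes an OpenAI-compatible endpoint and an API key
# obtained from its console; both values below are placeholders.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.siliconflow.cn/v1",
)

response = client.chat.completions.create(
    model="deepseek-ai/deepseek-vl2",  # assumption: model ID on the platform
    messages=[{
        "role": "user",
        "content": [
            # Hypothetical image URL; replace with your own screenshot or photo.
            {"type": "image_url", "image_url": {"url": "https://example.com/ui_screenshot.png"}},
            # Grounding query using the special-token format described above.
            {"type": "text", "text": "<|ref|>the search button<|/ref|>"},
        ],
    }],
)
print(response.choices[0].message.content)
```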

Hello @CharlesCXK,

Thanks for asking. We will definitely try your model out and add it to our vision agent library.

Is the training code available? If so, we can fine-tune your model on UI datasets and report the results back to you.

Hi, we do not provide training code. Our model has been trained on some UI data, so it can be tested directly. 😄
