Difference between `<image>`, `<img>`, and `<im_start>`

#2
by mbrhd - opened

What exactly is the difference between these tokens: <image>, <img>, and <im_start>? According to the model card, they all seem to be related to images. However, the code snippet uses <image> in the prompt, while the <img> token was used for training.

OpenGVLab org

The differences are as follows:

  1. <image>: This is used as a placeholder in the prompt. In the code, it will eventually be replaced by a sequence that starts with <img>, followed by several <IMG_CONTEXT> tokens (which act as placeholders for the actual visual tokens produced by a Vision Transformer), and ends with </img>.
  2. <img> and </img>: These tokens mark the start and end of an image, respectively. They encapsulate the visual context (i.e., the <IMG_CONTEXT> tokens) that represents the processed image.
  3. <|im_start|>: This token is part of the ChatML template and is not directly related to image processing. It marks the start of each message turn in the dialogue and does not represent any image data.

In summary, <image> is a higher-level placeholder that gets expanded into a specific image token structure (<img>... </img> with visual tokens), while <|im_start|> is a formatting symbol for the chat interface unrelated to image tokens.
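
As a rough illustration of the expansion described above, here is a minimal Python sketch. It is not the repository's actual code: the function names `expand_image_placeholder` and `wrap_chatml` are hypothetical, and the count of 4 context tokens is an arbitrary value chosen for readability (the real number depends on the model and the image). The ChatML wrapper assumes the usual pairing of <|im_start|> with a closing <|im_end|>.

```python
IMG_START = "<img>"
IMG_END = "</img>"
IMG_CONTEXT = "<IMG_CONTEXT>"


def expand_image_placeholder(prompt: str, num_context_tokens: int) -> str:
    """Replace the first <image> placeholder with <img> + N x <IMG_CONTEXT> + </img>.

    Hypothetical helper: num_context_tokens stands in for however many
    visual tokens the vision encoder produces for one image.
    """
    image_tokens = IMG_START + IMG_CONTEXT * num_context_tokens + IMG_END
    return prompt.replace("<image>", image_tokens, 1)


def wrap_chatml(role: str, content: str) -> str:
    """Wrap one message in ChatML markers (assumed layout, for illustration only)."""
    return f"<|im_start|>{role}\n{content}<|im_end|>\n"


# <image> is expanded first; the ChatML markers are only added around the message.
expanded = expand_image_placeholder("<image>\nDescribe the picture.", num_context_tokens=4)
message = wrap_chatml("user", expanded)
print(message)
```

In this sketch the <IMG_CONTEXT> tokens are still plain text placeholders; in the model they are the positions where the visual features are later substituted, while <|im_start|> only frames the message and plays no part in that substitution.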

czczup changed discussion status to closed
