MIL-UT/Asagi-4B

Model Details

Model Description

This repository provides Asagi-4B, a large-scale Japanese Vision & Language Model (VLM). Asagi-4B has been trained on an extensive Japanese dataset, incorporating a diverse range of data sources.

A significant portion of the training data is synthesized using models such as the Japanese large language model (CALM3-22B-Chat) and the English Vision & Language Model (Phi3.5-vision-instruct).

Importantly, we do not synthesize any training data with LLMs whose license terms restrict the use of their outputs (e.g., GPT-4).

Model components	Model / Architecture	Parameters
Vision encoder	siglip-so400m-patch14-384	428M
Projector	2-layer MLP	64M
LLM	llm-jp-3-3.7b-instruct	3.7B
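As a quick sanity check on the table above, the component sizes can be summed to estimate the total parameter count. The figures below are the rounded values from the table, so the result is only approximate; the dictionary keys and the helper name are illustrative and not part of the model's code.

```python
# Rounded parameter counts copied from the table above (in millions).
COMPONENT_PARAMS_M = {
    "vision_encoder": 428,   # siglip-so400m-patch14-384
    "projector": 64,         # 2-layer MLP
    "llm": 3700,             # llm-jp-3-3.7b-instruct
}


def total_params_billions(components):
    """Sum component sizes given in millions and return the total in billions."""
    return sum(components.values()) / 1000


total_b = total_params_billions(COMPONENT_PARAMS_M)
print(f"Approximate total: {total_b:.2f}B parameters")
```

The estimate comes out at a little over 4B parameters, which is consistent with the "4B" in the model name.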

Usage

Requirements

transformers==4.45.1
accelerate==0.34.2
torch==2.4.0
torchvision==0.19.0
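Before running the example, it may help to confirm that the installed packages match the pinned versions. The following is a minimal sketch using only the standard library; `PINNED` and `check_versions` are hypothetical names introduced here, and treating a local build suffix such as `+cu121` as equivalent to the pinned version is an assumption.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the Requirements list above.
PINNED = {
    "transformers": "4.45.1",
    "accelerate": "0.34.2",
    "torch": "2.4.0",
    "torchvision": "0.19.0",
}


def check_versions(pinned):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Local builds may carry a suffix such as "+cu121"; compare the public part only.
        public = installed.split("+")[0] if installed else None
        if public != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in check_versions(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

If nothing is printed, every pinned package is installed at the expected version.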

How to use

import requests
import torch
import transformers
from PIL import Image
from transformers import AutoModel, AutoProcessor, GenerationConfig

transformers.set_seed(42)
model_path = "MIL-UT/Asagi-4B"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

generation_config = GenerationConfig(
    do_sample=True,
    num_beams=5,
    max_new_tokens=256,
    temperature=0.7,
    repetition_penalty=1.5
)

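# English gloss of the Japanese prompt below: "The following is an instruction that
# describes a task. Write a response that appropriately satisfies the request."
# Instruction: "Look at this image and answer the following question in detail and
# specifically. Where was this photo taken? Also describe the image content in detail."
prompt = ("以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n"
            "### 指示:\n<image>\nこの画像を見て、次の質問に詳細かつ具体的に答えてください。この写真はどこで撮影されたものか教えてください。また、画像の内容についても詳しく説明してください。\n\n### 応答:\n")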

# sample image
sample_image_url = "https://raw.githubusercontent.com/uehara-mech/uehara-mech.github.io/refs/heads/master/images/shibuya.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw)

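# Run the processor to obtain the image tensors (plus its own text encoding).
inputs = processor(
    text=prompt, images=image, return_tensors="pt"
)
# Overwrite the processor's text encoding with a plain tokenization of the prompt.
inputs_text = processor.tokenizer(prompt, return_tensors="pt")
inputs['input_ids'] = inputs_text['input_ids']
inputs['attention_mask'] = inputs_text['attention_mask']
# Cast float32 tensors to the model's dtype (bfloat16 here).
for k, v in inputs.items():
    if v.dtype == torch.float32:
        inputs[k] = v.to(model.dtype)
# Move every tensor to the model's device and drop token_type_ids.
inputs = {k: inputs[k].to(model.device) for k in inputs if k != "token_type_ids"}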

generate_ids = model.generate(
    **inputs,
    generation_config=generation_config
)
generated_text = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

# do not print the prompt
if "<image>" in prompt:
    prompt = prompt.replace("<image>", " ")
generated_text = generated_text.replace(prompt, "")

print(f"Generated text: {generated_text}")

# >>> Generated text:  この写真は東京の渋谷で撮影されたものです。夜の渋谷の街並みが写っており、高層ビルが立ち並び、街灯やネオンサインが輝いています。
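To ask a different question, only the instruction part of the prompt needs to change. The sketch below factors the template from the example above into two small helpers; `SYSTEM_LINE`, `build_prompt`, and `strip_prompt` are hypothetical names introduced for illustration and are not part of the model's API.

```python
# Template text copied from the example prompt above.
SYSTEM_LINE = "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"


def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the prompt template used in the example above."""
    return (
        f"{SYSTEM_LINE}\n\n"
        f"### 指示:\n<image>\n{instruction}\n\n### 応答:\n"
    )


def strip_prompt(generated_text: str, prompt: str) -> str:
    """Remove the echoed prompt, mirroring the replace logic in the example above."""
    return generated_text.replace(prompt.replace("<image>", " "), "")


# Example: the second prompt from the "More Examples" section.
prompt = build_prompt("この画像を見て、次の指示に詳細かつ具体的に答えてください。この写真の内容について詳しく教えてください。")
print(prompt)
```

The resulting string can be passed as `prompt` in the example above without changing any other step.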

Example

Prompt	Output
この画像を見て、次の質問に詳細かつ具体的に答えてください。この写真はどこで撮影されたものか教えてください。また、画像の内容についても詳しく説明してください。	この写真は東京の渋谷で撮影されたものです。夜の渋谷の街並みが写っており、高層ビルが立ち並び、街灯やネオンサインが輝いています。

More Examples

Prompt	Output
この画像を見て、次の指示に詳細かつ具体的に答えてください。この写真の内容について詳しく教えてください。	この写真は、東京のレインボーブリッジを撮影したものです。橋は白く塗られており、静かな水の上に優雅にアーチを描いています。背景には高層ビルが立ち並ぶ都市景観が広がっており、空は澄み切った青色で、白い雲が点在しています。

Prompt	Output
この画像を見て、次の質問に詳細かつ具体的に答えてください。この写真について評価してください。	この写真は、東京タワーを背景にした秋の風景を撮影したものです。東京タワーは赤と白のストライプ模様が特徴的で、周囲には紅葉した木々が広がっています。空は部分的に曇っており、太陽の光が雲の間から差し込んで美しいコントラストを作り出しています。前景には街灯や建物のシルエットが見え、都会的な雰囲気も感じられます。

Prompt	Output
この画像を見て、次の質問に詳細かつ具体的に答えてください。この写真の神社について、細かいところまで詳しく説明してください。	この写真に写っている神社は、日本の伝統的な建築様式を反映しています。白い壁と鮮やかなオレンジ色の屋根が特徴的で、神聖な雰囲気を醸し出しています。神社の入り口には朱色の鳥居が立っており、これは神道における神聖な空間への入り口を示す象徴的な門です。鳥居の前には小さな社殿があり、参拝者が祈りを捧げる場所として機能しています。社殿の横には縄が張られており、これは神聖な場所であることを示唆しています。神社は現代的な建物の間に位置しており、伝統と現代が共存する日本の文化を象徴しています。

Prompt	Output
この画像を見て、次の指示に詳細かつ具体的に答えてください。これは日本庭園の中でも、どのような形式に分類される庭園ですか？また、その理由は何ですか？	これは日本庭園の中でも「石庭」と呼ばれる形式の庭園です。石庭は、自然の石を巧みに配置し、静けさと美しさを追求する形式で、日本の伝統的な庭園設計の一つです。

Prompt	Output
この画像を見て、次の質問に詳細に答えてください。この画像の場所を舞台とした小説のあらすじを書いてください。	昔々、ある小さな村に、石造りの階段が村の中心へと続いていました。その階段は、村人たちにとって神聖な場所への入り口であり、村人たちはその階段を登り、神聖な場所で祈りを捧げました。階段の周りには、緑豊かな植物が生い茂り、自然との調和を感じさせる風景が広がっていました。村人たちは、階段を登るたびに、自然の恵みと神聖な場所への敬意を表していました。

Training Details

Training Data

Dataset	Preprocess	Stage1	Stage2	Size
ROIS (Ours)	Synthesized	✓	✓	8.4M
Japanese image text pairs	Synthesized	✓	✓	4.4M
Wikipedia	Synthesized	✓	✓	2.5M
Open Images	Translated	✓	✓	680K
DCI	Translated	✓	✓	7K
CommonCatalog CC-BY	Translated	✓	✓	3.5M
LLaVA-Pretrain-JA		✓	✓	550K
STAIR Captions		✓	✓	410K
Flickr-JP		✓	✓	160K
YJ Captions		✓	✓	130K
Japanese Pascal		✓	✓	5K
ArtBench	Synthesized		✓	100K
GQA	Translated		✓	1.9M
VQA v2	Translated		✓	880K
A-OKVQA	Translated		✓	34K
OK-VQA	Translated		✓	18K
Japanese Visual Genome	Translated		✓	1.6M
PangeaInstruct			✓	93K

Note: ROIS (Ours) is a newly collected dataset crawled from the web specifically for this project. The dataset consists of image and raw text pairs, which are used to synthesize the training data.
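The per-stage data volume implied by the table can be tallied with a short script. This is a rough sketch: the rows are transcribed from the table above, the Size values are rounded, and `parse_size` and `stage_totals` are hypothetical helper names.

```python
# (dataset, used_in_stage1, used_in_stage2, size) transcribed from the table above.
ROWS = [
    ("ROIS (Ours)", True, True, "8.4M"),
    ("Japanese image text pairs", True, True, "4.4M"),
    ("Wikipedia", True, True, "2.5M"),
    ("Open Images", True, True, "680K"),
    ("DCI", True, True, "7K"),
    ("CommonCatalog CC-BY", True, True, "3.5M"),
    ("LLaVA-Pretrain-JA", True, True, "550K"),
    ("STAIR Captions", True, True, "410K"),
    ("Flickr-JP", True, True, "160K"),
    ("YJ Captions", True, True, "130K"),
    ("Japanese Pascal", True, True, "5K"),
    ("ArtBench", False, True, "100K"),
    ("GQA", False, True, "1.9M"),
    ("VQA v2", False, True, "880K"),
    ("A-OKVQA", False, True, "34K"),
    ("OK-VQA", False, True, "18K"),
    ("Japanese Visual Genome", False, True, "1.6M"),
    ("PangeaInstruct", False, True, "93K"),
]

UNITS = {"K": 1_000, "M": 1_000_000}


def parse_size(text):
    """Convert a size such as '8.4M' or '680K' into an integer count."""
    return int(round(float(text[:-1]) * UNITS[text[-1]]))


def stage_totals(rows):
    """Return (stage1_total, stage2_total) summed over the rows used in each stage."""
    stage1 = sum(parse_size(size) for _, s1, _, size in rows if s1)
    stage2 = sum(parse_size(size) for _, _, s2, size in rows if s2)
    return stage1, stage2


stage1_total, stage2_total = stage_totals(ROWS)
print(f"Stage1: {stage1_total / 1e6:.1f}M, Stage2: {stage2_total / 1e6:.1f}M")
```

Because every Stage1 dataset is also used in Stage2, the Stage2 total is the Stage1 total plus the seven Stage2-only datasets.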

Evaluation

We evaluated our model using Heron-Bench, JA-VLM-Bench-in-the-Wild, and JA-VG-VQA-500. We used the eval-mm library for this evaluation.

Here, models with "†" are not trained with GPT-generated data. Bold numbers indicate the best performance among all models, and underlined numbers indicate the best performance among models not trained with GPT-generated data.

Model	LM Size	Heron-Bench (LLM (%))	JA-VLM-Bench-In-the-Wild (ROUGE-L)	JA-VLM-Bench-In-the-Wild (LLM (/5.0))	JA-VG-VQA-500 (ROUGE-L)	JA-VG-VQA-500 (LLM (/5.0))
Japanese InstructBLIP Alpha†	7B	14.0	20.8	2.42	-	-
Japanese Stable VLM†	7B	24.2	23.3	2.47	-	-
LLaVA-CALM2-SigLIP†	7B	43.3	47.2	3.15	17.4	3.21
Llama-3-EvoVLM-JP-v2	8B	39.3	41.4	2.92	23.5	2.96
VILA-jp	13B	57.2	52.3	3.69	16.2	3.62
Asagi-2B†	1.8B	44.7	48.8	3.26	53.7	3.69
Asagi-4B†	3.7B	49.3	49.6	3.38	55.6	3.78
Asagi-8B†	7.2B	54.7	49.4	3.45	56.43	3.84
Asagi-14B†	13B	55.8	50.8	3.44	56.8	3.84
GPT-4o	-	87.6	37.6	3.85	12.1	3.58
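For readers unfamiliar with the ROUGE-L columns above, the metric is based on the longest common subsequence (LCS) between a generated answer and a reference. The following is a minimal character-level sketch of the ROUGE-L F-measure. It is not the eval-mm implementation, which may tokenize Japanese text differently and appears to report scores on a 0 to 100 scale; the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Character-level ROUGE-L F-measure in the range [0, 1]."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("渋谷の夜景です", "夜の渋谷です")
print(f"ROUGE-L (character level): {score:.3f}")
```

Working on characters sidesteps word segmentation for Japanese, at the cost of rewarding partial word overlaps; a tokenizer-based variant would only need to pass token lists instead of strings.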

Risks and Limitations

The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.

Model Card Authors

Kohei Uehara
