mehmetkeremturkcan committed · Commit 894cab3 · verified · 1 Parent(s): 51f5831

Update README.md

Files changed (1): README.md (+49 -3)
README.md CHANGED
The previous README contained only the license frontmatter (`---` / `license: apache-2.0` / `---`); the updated file follows.

---
license: apache-2.0
datasets:
- HuggingFaceM4/the_cauldron
- AnyModal/flickr30k
- openbmb/RLAIF-V-Dataset
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
- facebook/dino-vitb16
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vqa
- vlm
---

<p align="center">
  <img src="https://github.com/mkturkcan/femtovlm/blob/main/assets/logo.png?raw=true" width="180" />
</p>
<h1 align="center">
  <p>mehmetkeremturkcan/FemtoVLM-DINO</p>
</h1>
<h3 align="center">
  <p>FemtoVLM: Tiniest Vision Language Models</p>
</h3>

FemtoVLM is the smallest visual question answering/captioning model in the world. It accepts image and text inputs and produces text outputs, and it is designed for efficiency. FemtoVLM can answer questions about images and describe visual content; its lightweight architecture makes it suitable for on-device applications while maintaining strong performance.

FemtoVLM comes in four sizes: 116M (femto), 143M (tiny), 160M (base), and 225M (dino). All models are trained for image captioning and question answering in real-world contexts. FemtoVLM cannot perform optical character recognition (OCR), multi-turn question answering, or scientific question answering.
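
For quick programmatic experimentation, here is a minimal sketch, assuming the checkpoint can be served through the generic `transformers` image-text-to-text pipeline with remote code enabled; the image URL and prompt are illustrative, and the supported path is the `femtovlm_inference.py` script described below.

```python
# Minimal sketch, not the official inference path: assumes this checkpoint
# loads via the generic transformers image-text-to-text pipeline.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="mehmetkeremturkcan/FemtoVLM-DINO",
    trust_remote_code=True,  # custom architecture shipped with the repo
)

# Illustrative inputs: any local path or URL works for the image.
result = pipe(
    images="http://images.cocodataset.org/val2017/000000039769.jpg",
    text="Describe this image.",
    max_new_tokens=64,
)
print(result[0]["generated_text"])
```

If the architecture is not registered for this pipeline, fall back to the Setup and Test steps below.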

## Setup
```bash
pip install git+https://github.com/facebookresearch/schedule_free.git
pip install peft
git clone https://github.com/mkturkcan/seers.git
cd seers/seers/
git clone https://huggingface.co/mehmetkeremturkcan/FemtoVLM-DINO
```
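These commands leave the FemtoVLM-DINO weights in `seers/seers/FemtoVLM-DINO`, next to the inference and training scripts used below (based on the clone locations above).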

## Test
Run the following in the `seers/seers` folder:
```bash
python femtovlm_inference.py
```

## Train

[seers](https://github.com/mkturkcan/seers) training code is public! Run:
```bash
python femtovlm_train.py
```
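
As background on the Setup dependencies: the training code builds on facebookresearch's schedule-free optimizer installed above. Below is a minimal sketch of the usual `schedulefree` training-loop pattern in PyTorch; the linear model, data, and loss are placeholders, not the seers API.

```python
# Background sketch of the schedulefree optimizer pattern installed in Setup;
# the model, data, and loss below are hypothetical stand-ins, not the seers API.
import schedulefree
import torch

model = torch.nn.Linear(16, 2)  # stand-in for the actual VLM
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)

# Schedule-free optimizers track train/eval mode alongside the model.
optimizer.train()
model.train()
for _ in range(10):
    x = torch.randn(8, 16)
    y = torch.randint(0, 2, (8,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Switch both to eval mode before evaluation or saving checkpoints.
optimizer.eval()
model.eval()
```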