0nejiawei committed
Commit 175d088 · 1 Parent(s): cfa34aa
Files changed (1): README.md (+51 -0)
README.md CHANGED
---
license: apache-2.0
tags:
- video LLM
---

# Tarsier Model Card

## Model details
**Model type:**
Tarsier-34b is an open-source, large-scale video-language model designed to generate high-quality video descriptions, along with strong general video understanding (SOTA results on 6 open benchmarks).

**Model date:**
Tarsier-34b was trained in June 2024.

**Paper or resources for more information:**
- GitHub repo: https://github.com/bytedance/tarsier
- Paper: https://arxiv.org/abs/2407.00634

## License
Tarsier-34b follows the NousResearch/Nous-Hermes-2-Yi-34B license.

**Where to send questions or comments about the model:**
https://github.com/bytedance/tarsier/issues

## Intended use
**Primary intended uses:**
The primary use of Tarsier is research on large multimodal models, especially video description.

**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

## Training dataset
Tarsier takes a two-stage training strategy.
1. Stage-1: Multi-task Pre-training

   In stage-1, we trained our model on:
   - 10M samples from diverse public datasets, covering video captioning, video question answering, action recognition, multi-image understanding, and text generation.
   - 3.5M in-house samples, including 2.4M high-quality video caption data similar to WebVid and 1.1M videos with object-tracking annotations (obtained by running the object-tracking tool [DEVA](https://github.com/hkchengrex/Tracking-Anything-with-DEVA) on videos from WebVid and HD-VILA).
2. Stage-2: Multi-grained Instruction Tuning

   In stage-2, we use 500K in-house instruction-tuning samples, including:
   - Movie clips featuring multiple shots, subjects, or events, with annotator-written descriptions varying in length and detail, from brief motion summaries to comprehensive narratives of visual details.
   - A dataset rich in camera motions, including zooming, translating, panning, and rotating.
   - Video-aware creative writing, such as poems, dialogues, and speeches.

## Evaluation dataset
- A challenging video description dataset: [DREAM-1K](https://huggingface.co/datasets/omni-research/DREAM-1K)
- Multi-choice VQA: [MVBench](https://huggingface.co/datasets/OpenGVLab/MVBench), [NExT-QA](https://github.com/doc-doc/NExT-QA) and [EgoSchema](https://drive.google.com/drive/folders/1SS0VVz8rML1e5gWq7D7VtP1oxE2UtmhQ)
- Open-ended VQA: [MSVD-QA](https://opendatalab.com/OpenDataLab/MSVD), [MSR-VTT-QA](https://opendatalab.com/OpenDataLab/MSR-VTT), [ActivityNet-QA](https://github.com/MILVLG/activitynet-qa) and [TGIF-QA](https://opendatalab.com/OpenDataLab/TGIF-QA)
- Video captioning: [MSVD-Caption](https://opendatalab.com/OpenDataLab/MSVD), [MSR-VTT-Caption](https://opendatalab.com/OpenDataLab/MSR-VTT), [VATEX](https://eric-xw.github.io/vatex-website/about.html)

## How to Use
See the usage guide in the GitHub repo: https://github.com/bytedance/tarsier?tab=readme-ov-file#usage
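
For a quick impression of what video-description inference with this checkpoint can look like, below is a minimal sketch built on Hugging Face `transformers`. It assumes the checkpoint loads through the generic LLaVA classes (`AutoProcessor` / `LlavaForConditionalGeneration`) and that sampled video frames are passed as a list of images; the model id, prompt template, and frame handling shown here are illustrative assumptions, and the scripts in the official repo linked above remain the authoritative way to run Tarsier.

```python
# Minimal, hedged sketch -- NOT the official pipeline. Assumes the checkpoint is
# compatible with transformers' generic LLaVA classes; the prompt template, frame
# sampling, and model id below are illustrative placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "omni-research/Tarsier-34b"  # placeholder: use this repo's actual id

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Pre-extracted video frames; frame sampling itself is out of scope for this sketch.
frames = [Image.open(p) for p in ["frame_0.jpg", "frame_1.jpg", "frame_2.jpg"]]

# Hypothetical prompt layout: one <image> placeholder per sampled frame.
prompt = "USER: " + "<image>\n" * len(frames) + "Describe the video in detail. ASSISTANT:"

inputs = processor(text=prompt, images=frames, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```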