Update README.md
README.md
CHANGED
@@ -37,7 +37,7 @@ We release the checkpoints under the Apache 2.0.
 ](https://huggingface.co/papers/2306.16527)
 - Idefics2 paper: [What matters when building vision-language models?
 ](https://huggingface.co/papers/2405.02246)
-- Idefics3 paper:
+- Idefics3 paper: [Building and better understanding vision-language models: insights and future directions](https://huggingface.co/papers/2408.12637)
 
 # Uses
 
@@ -65,7 +65,7 @@ Idefics3 demonstrates a great improvement over Idefics2, especially in document
 - We use 169 visual tokens to encode an image of size 364x364. Each image is divided into several sub-images of size at most 364x364, which are then encoded separately.
 - For the fine-tuning datasets, we have extended [The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) and added several datasets, including [Docmatix](https://huggingface.co/datasets/HuggingFaceM4/Docmatix). We will soon push these datasets to the same repo as The Cauldron (TODO).
 
-More details about the training of the model
+More details about the training of the model are available in our [technical report](https://huggingface.co/papers/2408.12637).
 
 
 # How to Get Started
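The token accounting in the second hunk (169 visual tokens per sub-image of at most 364x364 pixels) can be sketched as follows. This is a minimal illustration assuming a simple ceiling-grid tiling; the function names are hypothetical and not taken from the Idefics3 codebase.

```python
import math

PATCH = 364             # maximum side length of each sub-image, per the README
TOKENS_PER_PATCH = 169  # visual tokens used to encode one 364x364 sub-image

def split_into_patches(height: int, width: int, patch: int = PATCH) -> tuple[int, int]:
    """Return the (rows, cols) grid of sub-images covering the image.

    Hypothetical sketch: the image is tiled into a grid whose cells
    are at most `patch` pixels on a side, each encoded separately.
    """
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    return rows, cols

def visual_token_count(height: int, width: int,
                       patch: int = PATCH,
                       tokens: int = TOKENS_PER_PATCH) -> int:
    """Total visual tokens under the simple tiling assumption above."""
    rows, cols = split_into_patches(height, width, patch)
    return rows * cols * tokens

# A 364x364 image fits in a single sub-image: 169 tokens.
print(visual_token_count(364, 364))  # 169
# A 728x728 image splits into a 2x2 grid: 4 * 169 = 676 tokens.
print(visual_token_count(728, 728))  # 676
```

Under this assumption, token count grows with the tile grid rather than with raw pixel count, which is why resizing to multiples of 364 on each side is the most token-efficient choice.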