Speech Seq2Seq Experiments

non-profit

AI & ML interests

None defined yet.

Recent Activity

speech-seq2seq's activity

HugoLaurencon 
posted an update 8 months ago
view post
Post
2897
We release Idefics2-chatty, the chatbot-optimized version of Idefics2: HuggingFaceM4/idefics2-8b-chatty

Idefics2-chatty is better at following instructions and following Chain-of-Thoughts reasoning.

Moreover, we also release a paper, containing a lot of findings on how to build an efficient and performant Vision-Language Model: What matters when building vision-language models? (2405.02246)

How are you going to use the model, or what data are you going to fine-tune it on?
·
HugoLaurencon 
posted an update 8 months ago
view post
Post
2513
Idefics2 is trained mostly on OBELICS, our open interleaved image-text document dataset.

Training on interleaved data is crucial to reaching high performance on VQA tasks, taking an arbitrary number of images as input, and doing in-context learning.

Dataset: HuggingFaceM4/OBELICS
Nomic visualization: https://atlas.nomic.ai/map/f2fba2aa-3647-4f49-a0f3-9347daeee499/ee4a84bd-f125-4bcc-a683-1b4e231cb10f
Link to OBELICS thread: https://twitter.com/HugoLaurencon/status/1694005892839006301
HugoLaurencon 
posted an update 8 months ago
view post
Post
2835
The Cauldron is a massive collection of 50 high-quality datasets, all converted to the user/assistant format, and ready to use to fine-tune any Vision Language Model.

The Cauldron covers a wide range of tasks, including general visual question answering, counting, captioning, text transcription, document understanding, chart/figure understanding, table understanding, visual reasoning, geometry, spotting differences between 2 images or converting a screenshot to a code.

HuggingFaceM4/the_cauldron
HugoLaurencon 
posted an update 8 months ago
view post
Post
3085
We release Idefics2-8B, a foundation vision language model with SOTA results for its size on many benchmarks.

For Idefics2, we adopted a simple architecture:
-Images are fed to a vision encoder, then to a modality projection to match the input dimension of the LLM, and finally to a perceiver resampler for efficient pooling.
-Interleaved image-text data are then passed to the LLM.

During the pre-training:
-The modality projection and perceiver resampler weights are newly initialized.
-We start with pre-trained models for the vision encoder and the LLM, and continue the training with LoRA.
-In total, we see 1.5T images!

We pre-train on 3 types of data, all publicly available:
-Interleaved image-text documents: our dataset OBELICS HuggingFaceM4/OBELICS
-Image caption pairs: only synthetic captions!
-PDF documents: IDL and PDFA

We kept the aspect ratio of the images with the Patch n' Pack strategy, with a resolution of up to 980x980.
At inference, it's also more efficient for lower-resolution images.

For the SFT, we build The Cauldron, a collection of 50 high-quality datasets in the user/assistant format.
It is a ready-to-use dataset for the fine-tuning of any VLM.
HuggingFaceM4/the_cauldron

Most current models, like LLaVA-NeXT, encode images with an excessive number of tokens, like 2880.
Instead, we put a focus on being efficient at inference by training on a mix of images encoded with 64 tokens, and 320 tokens.
The result is that we perform favorably compared to the best models in our size class, while being efficient at inference.
HugoLaurencon 
posted an update 9 months ago
view post
Post
With the new WebSight dataset, converting the screenshot of a web page to its corresponding HTML code is just one fine-tuning step away

We release a new version of our synthetic dataset:
-Real images within web pages 🖼️
-Tailwind CSS 🎨
-2M examples 📈

Our initial release, v0.1, featured web designs in HTML + CSS, using simple colored rectangles as image placeholders.
It was a good start to help models grasp the basics of web page structure and coding associations.
Yet, it was missing the look of a real website.

Improving visual appeal, we've now embedded actual images in our web designs, ensuring they match the site's content for a more authentic look.

Switching to Tailwind CSS offers a more compact representation of the code.

We've also expanded our dataset to 2 million examples!

After fine-tuning our forthcoming foundation vision-language model on this dataset, we've observed some encouraging capabilities, such as converting sketches directly into functional HTML code.

We're excited to hear your thoughts and suggestions for future versions. What would you like to see next? Feel free to open a discussion on the hub!

Dataset: HuggingFaceM4/WebSight
Technical report: Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset (2403.09029)
Blog post: https://huggingface.co/blog/websight
Google Colab: https://colab.research.google.com/drive/1LdamGKR2oacrDk-kYwz_Wfc1-RBUdzcO?usp=sharing

Work done with @VictorSanh @Leyo
·
sanchit-gandhi 
posted an update 10 months ago
view post
Post
Why does returning timestamps help Whisper reduce hallucinations? 🧐

Empirically, most practitioners have found that setting return_timestamps=True helps reduce hallucinations, particularly when doing long-form evaluation with Transformers’ “chunked” algorithm.

But why does this work?..

My interpretation is that forcing the model to predict timestamps is contradictory to hallucinations. Suppose you have the transcription:
The cat sat on the on the on the mat.

Where we have a repeated hallucination for “on the”. If we ask the model to predict timestamps, then the “on the” has to contribute to the overall segment-level timing, e.g.:
<|0.00|> The cat sat on the on the on the mat.<|5.02|>

However, it’s impossible to fit 3 copies of “on the” within the time allocation given to the segment, so the probability for this hallucinatory sequence becomes lower, and the model actually predicts the correct transcription with highest probability:
<|0.00|> The cat sat on the mat.<|5.02|>

In this sense, the end timestamp is of the opposite of the initial timestamp constraint they describe in Section 4.5 of the paper Robust Speech Recognition via Large-Scale Weak Supervision (2212.04356) → it helps the model remove extra words at the end of the sequence (rather than the initial timestamp which helps when the model ignores words at the start), but the overall principle is the same (using timestamps to improve the probability of more realistic sequences).

Leaving it open to you: why do you think timestamps reduces Whisper hallucinations?
·