The Fellowship is a network of exceptional people from different backgrounds who contribute to open-source machine learning π§ββοΈπ¦ΈββοΈπ¦Ήπ§ββοΈ
Multimodal π¬ - We have released SmolVLM -- tiniest VLMs that come in 256M and 500M, with it's retrieval models ColSmol for multimodal RAG π - UI-TARS are new models by ByteDance to unlock agentic GUI control π€― in 2B, 7B and 72B - Alibaba DAMO lab released VideoLlama3, new video LMs that come in 2B and 7B - MiniMaxAI released Minimax-VL-01, where decoder is based on MiniMax-Text-01 456B MoE model with long context - Dataset: Yale released a new benchmark called MMVU - Dataset: CAIS released Humanity's Last Exam (HLE) a new challenging MM benchmark
LLMs π - DeepSeek-R1 & DeepSeek-R1-Zero: gigantic 660B reasoning models by DeepSeek, and six distilled dense models, on par with o1 with MIT license! π€― - Qwen2.5-Math-PRM: new math models by Qwen in 7B and 72B - NVIDIA released AceMath and AceInstruct, new family of models and their datasets (SFT and reward ones too!)
Audio π£οΈ - Llasa is a new speech synthesis model based on Llama that comes in 1B,3B, and 8B - TangoFlux is a new audio generation model trained from scratch and aligned with CRPO
Image/Video/3D Generation β―οΈ - Flex.1-alpha is a new 8B pre-trained diffusion model by ostris similar to Flux - tencent released Hunyuan3D-2, new 3D asset generation from images
smolagents can see π₯ we just shipped vision support to smolagents π€ agentic computers FTW
you can now: π» let the agent get images dynamically (e.g. agentic web browser) π pass images at the init of the agent (e.g. chatting with documents, filling forms automatically etc) with few LoC change! π€― you can use transformers models locally (like Qwen2VL) OR plug-in your favorite multimodal inference provider (gpt-4o, antrophic & co) π€
New look for AI powered paper reviews from the list by Hugging Face Daily Papers ( managed by the @akhaliq )
Bookmark the webpage along, check comprehensive reviews by Google DeepMind Gemini 1.5, and listen to audio podcast made by the same tech used in NotebookLM. Link: https://deep-diver.github.io/ai-paper-reviewer/
This is not an official service by Hugging Face. It is just a service developed by an individual developer using his own money :)
I just released Sentence Transformers v3.4.0, featuring a memory leak fix, compatibility between the powerful Cached... losses and the Matryoshka loss modifier, and a bunch of fixes & small features.
πͺ Matryoshka & Cached loss compatibility It is now possible to combine the powerful Cached... losses (which use in-batch negatives & a caching mechanism to allow for endless batch size & negatives) with the Matryoshka loss modifier which modifies a base loss such that it is trained not only on the maximum dimensionality (e.g. 1024 dimensions), but also on many lower dimensions (e.g. 768, 512, 256, 128, 64, 32). After training, these models' embeddings can be truncated for faster retrieval, etc.
ποΈ Resolve memory leak when Model and Trainer are reinitialized Due to a circular dependency between Trainer -> Model -> ModelCardData -> Trainer, deleting both the trainer & model still didn't free up the memory. This led to a memory leak in scripts where you repeatedly do so.
β New Features Many new small features, e.g. multi-GPU support for 'mine_hard_negatives', a 'margin' parameter to TripletEvaluator, and Matthews Correlation Coefficient in the BinaryClassificationEvaluator.
π Bug Fixes Also a bunch of fixes, for example that subsequent batches were not sorted when using the "no_duplicates" batch sampler. See the release notes for more details.
Simple summarization of Evolving Deeper LLM Thinking (Google DeepMind)
The process starts by posing a question. 1) The LLM generates initial responses. 2) These generated responses are evaluated according to specific criteria (program-based checker). 3) The LLM critiques the evaluated results. 4) The LLM refines the responses based on the evaluation, critique, and original responses.
The refined response is then fed back into step 2). If it meets the criteria, the process ends. Otherwise, the algorithm generates more responses based on the refined ones (with some being discarded, some remaining, and some responses potentially being merged).
Through this process, it demonstrated excellent performance in complex scheduling problems (travel planning, meeting scheduling, etc.). It's a viable method for finding highly effective solutions in specific scenarios.
However, there are two major drawbacks: π€ An excessive number of API calls are required. (While the cost might not be very high, it leads to significant latency.) π€ The evaluator is program-based. (This limits its use as a general method. It could potentially be modified/implemented using LLM as Judge, but that would introduce additional API costs for evaluation.)
Simple Summarization on DeepSeek-R1 from DeepSeek AI
The RL stage is very important. β³ However, it is difficult to create a truly helpful AI for people solely through RL. β³ So, we applied a learning pipeline consisting of four stages: providing a good starting point, reasoning RL, SFT, and safety RL, and achieved performance comparable to o1. β³ Simply fine-tuning other open models with the data generated by R1-Zero (distillation) resulted in performance comparable to o1-mini.
Of course, this is just a brief overview and may not be of much help. All models are accessible on Hugging Face, and the paper can be read through the GitHub repository.
π Multimodal - MiniCPM-o 2.6 is a new sota any-to-any model by OpenBMB (vision, speech and text!) - VideoChat-Flash-Qwen2.5-2B is new video multimodal models by OpenGVLab that come in sizes 2B & 7B in resolutions 224 & 448 - ByteDance released larger SA2VA that comes in 26B parameters - Dataset: VRC-Bench is a new diverse benchmark for multimodal LLM reasoning performance
π¬ LLMs - MiniMax-Text-01 is a new huge language model (456B passive 45.9B active params) by MiniMaxAI with context length of 4M tokens π€― - Dataset: Sky-T1-data-17k is a diverse dataset used to train Sky-T1-32B - kyutai released Helium-1-Preview-2B is a new small multilingual LM - Wayfarer-12B is a new LLM able to write D&D π§π»ββοΈ - ReaderLM-v2 is a new HTML parsing model by Jina AI - Dria released, Dria-Agent-a-3B, new agentic coding model (Pythonic function calling) based on Qwen2.5 Coder - Unsloth released Phi-4, faster and memory efficient Llama 3.3
πΌοΈ Vision - MatchAnything is a new foundation model for matching - FitDit is a high-fidelity VTON model based on DiT architecture
π£οΈ Audio - OuteTTS-0.3-1B is a new multilingual text-to-speech model with voice cloning and emotion control capabilities
π Retrieval - lightblue released a new reranker based on Qwen2.5 LB-reranker-0.5B-v1.0 that can handle 95+ languages - cde-small-v2 is a new sota small retrieval model by @jxm
ποΈ Today I'm introducing a method to train static embedding models that run 100x to 400x faster on CPU than common embedding models, while retaining 85%+ of the quality! Including 2 fully open models: training scripts, datasets, metrics.
We apply our recipe to train 2 Static Embedding models that we release today! We release: 2οΈβ£ an English Retrieval model and a general-purpose Multilingual similarity model (e.g. classification, clustering, etc.), both Apache 2.0 π§ my modern training strategy: ideation -> dataset choice -> implementation -> evaluation π my training scripts, using the Sentence Transformers library π my Weights & Biases reports with losses & metrics π my list of 30 training and 13 evaluation datasets
The 2 Static Embedding models have the following properties: ποΈ Extremely fast, e.g. 107500 sentences per second on a consumer CPU, compared to 270 for 'all-mpnet-base-v2' and 56 for 'gte-large-en-v1.5' 0οΈβ£ Zero active parameters: No Transformer blocks, no attention, not even a matrix multiplication. Super speed! π No maximum sequence length! Embed texts at any length (note: longer texts may embed worse) π Linear instead of exponential complexity: 2x longer text takes 2x longer, instead of 2.5x or more. πͺ Matryoshka support: allow you to truncate embeddings with minimal performance loss (e.g. 4x smaller with a 0.56% perf. decrease for English Similarity tasks)
Check out the full blogpost if you'd like to 1) use these lightning-fast models or 2) learn how to train them with consumer-level hardware: https://huggingface.co/blog/static-embeddings
The blogpost contains a lengthy list of possible advancements; I'm very confident that our 2 models are only the tip of the iceberg, and we may be able to get even better performance.
Published a new blogpost π In this blogpost I have gone through the transformers' architecture emphasizing how shapes propagate throughout each layer. π https://huggingface.co/blog/not-lain/tensor-dims some interesting takeaways :
Multimodal πΌοΈ > ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts > moondream2 is out with new capabilities like outputting structured data and gaze detection! > Dataset: Alibaba DAMO lab released multimodal textbook β 22k hours worth of samples from instruction videos π€― > Dataset: SciCap captioning on scientific documents benchmark dataset is released along with the challenge!
Embeddings π > @MoritzLaurer released zero-shot version of ModernBERT large π > KaLM is a new family of performant multilingual embedding models with MIT license built using Qwen2-0.5B
Image/Video Generation β―οΈ > NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models generating worlds from images, videos and texts π₯ > Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!) > Dataset: fal released cosmos-openvid-1m Cosmos-tokenized OpenVid-1M with samples from OpenVid-1M
Others > Prior Labs released TabPFNv2, the best tabular transformer is out for classification and regression > Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding
> The models are capable of tasks involving vision-language understanding and visual referrals (referring segmentation) both for images and videos β―οΈ
> The models come in 1B, 4B and 8B and are based on InternVL2.5 for base architecture and Qwen2, Qwen2.5 and InternLM2 for language model part (depending on the checkpoint)
> The model is very interesting, it has different encoders for different modalities each (visual prompt, text prompt, image and video) then it concatenates these to feed into LLM π¬
the output segmentation tokens are passed to SAM2, to sort of match text (captions or semantic classes) to masks ‡οΈ
> Their annotation pipeline is also interesting, they seems to use two open large vision LMs to refine the annotations, and have different levels of descriptions to provide consistency.
That didn't take long! Nomic AI has finetuned the new ModernBERT-base encoder model into a strong embedding model for search, classification, clustering and more!
Details: π€ Based on ModernBERT-base with 149M parameters. π Outperforms both nomic-embed-text-v1 and nomic-embed-text-v1.5 on MTEB! ποΈ Immediate FA2 and unpacking support for super efficient inference. πͺ Trained with Matryoshka support, i.e. 2 valid output dimensionalities: 768 and 256. β‘οΈ Maximum sequence length of 8192 tokens! 2οΈβ£ Trained in 2 stages: unsupervised contrastive data -> high quality labeled datasets. β Integrated in Sentence Transformers, Transformers, LangChain, LlamaIndex, Haystack, etc. ποΈ Apache 2.0 licensed: fully commercially permissible