A team from Tsinghua University just released AndroidLab, the first systematic framework to evaluate and train Android mobile agents that works with both text-only and multimodal models.
They show that fine-tuning small open-source models can significantly boost performance, bringing them close to much bigger closed models like GPT-4o.
The team built:
- A reproducible benchmark with 138 tasks across 9 apps to evaluate mobile agents systematically
- A framework supporting both text-only (via XML) and visual (via marked screenshots) interfaces
- An instruction dataset of 10.5k operation traces for training mobile agents
Key insights:
- Fine-tuning improves performance BY A LOT: the open-source Llama-3.1-8B improves from a 2% to a 24% success rate after training, nearly reaching GPT-4o performance although it's much smaller.
- Text-only agents match multimodal ones: XML-based agents achieve similar performance to screenshot-based multimodal agents.
- Mixture of Experts (MoE) architecture: 389B parameters in total, but only 52B activated for any input
- Trained on 7T tokens, including 1.5T tokens of synthetic data
- Architecture: a novel "recycle routing" scheme prevents token dropping when experts are overloaded (see the sketch after this list)
- Great benchmark results: surpasses Llama-3.1-405B-Instruct on most benchmarks although it has 8x fewer active parameters
- Impressive performance on MATH: 77.4
- Large context length: up to 256K tokens
- License: commercial use allowed, except if your products have >100M monthly active users; no access in the EU
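The "recycle routing" idea is only briefly described here, so below is a toy sketch of my reading of it: a top-1 router with a fixed per-expert capacity, where a token that overflows its preferred expert is re-assigned to another expert that still has room, instead of being dropped. The function names and the random re-assignment policy are my assumptions, not the actual implementation.

```python
import numpy as np

def route_with_recycling(scores, capacity, seed=0):
    """Toy top-1 MoE routing: tokens overflowing an expert's capacity are
    recycled to a random expert with room left, instead of being dropped."""
    rng = np.random.default_rng(seed)
    num_tokens, num_experts = scores.shape
    preferred = scores.argmax(axis=1)            # preferred expert per token
    load = np.zeros(num_experts, dtype=int)
    assignment = np.empty(num_tokens, dtype=int)
    for t in range(num_tokens):
        e = preferred[t]
        if load[e] >= capacity:                  # expert is full
            free = [i for i in range(num_experts) if load[i] < capacity]
            e = rng.choice(free) if free else e  # recycle instead of dropping
        load[e] += 1
        assignment[t] = e
    return assignment

print(route_with_recycling(np.random.rand(16, 4), capacity=5))
```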
With privacy concerns rising, we sometimes need our models to "forget" specific information - like a person's data - while keeping everything else intact. Researchers just released CLEAR, the first benchmark to test how well this works with both text and images.
Bad news: current methods either fail to truly forget or end up forgetting way too much. It's like trying to remove a single ingredient from a baked cake!
But there's hope: adding simple mathematical constraints (L1 regularization) during the forgetting process significantly improves results.
Key insights:
- The benchmark tests forgetting on 200 fictional personas: 3,770 visual Q&A pairs, 4,000 textual Q&A pairs, plus additional real-world tests
- Most current forgetting methods don't work well with both text and images: they either remember what they should forget, or they forget too much unrelated information
- Simple mathematical constraints work surprisingly well: L1 regularization prevents excessive forgetting, and works especially well with the LLMU method (minimal sketch below)
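To make the L1 trick concrete, here is a minimal sketch of what adding an L1 constraint to an unlearning objective could look like: gradient ascent on the forget set, a retention term, and an L1 penalty keeping the weights close to the original model. The loss decomposition and hyperparameters are my assumptions, not the exact CLEAR/LLMU recipe.

```python
import torch

def unlearning_loss(model, forget_batch, retain_batch, ref_params, l1_lambda=1e-4):
    """Forget-set loss is maximized, retain-set loss minimized, and an L1 penalty
    towards the original weights prevents forgetting too much unrelated knowledge.
    Batches are assumed to include labels so that .loss is populated."""
    forget_loss = -model(**forget_batch).loss     # push loss UP on data to forget
    retain_loss = model(**retain_batch).loss      # keep behaviour on everything else
    l1_penalty = sum((p - p_ref).abs().sum()
                     for p, p_ref in zip(model.parameters(), ref_params))
    return forget_loss + retain_loss + l1_lambda * l1_penalty
```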
> Oasis: First Real-Time Video Game Without a Game Engine!
DecartAI & Etched just released Oasis - a fully AI-generated video game running at 20 FPS (frames per second). The model takes keyboard inputs and generates everything - physics, rules, graphics - on the fly, without any game engine.
What makes this special? Current text-to-video models (Mochi-1, Sora, Kling) generate about 1 frame every 10-20 seconds (that's the kind of frame rate I had when playing LoL back in the day, hence my low rankings). Oasis is 200 times faster, making it the first playable AI-generated game.
Under the hood, it uses a vision transformer to encode space and a diffusion model to generate frames. The secret sauce is "dynamic noising", a technique that keeps the video stable between frames.
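I haven't seen the implementation, but my reading of "dynamic noising" is roughly: re-noise the previously generated frames before feeding them back as conditioning, so small errors don't accumulate frame after frame. A toy sketch of that idea, where the schedule and magnitudes are pure assumptions:

```python
import torch

def noise_context(frames, step, max_steps, max_sigma=0.3):
    """Add a step-dependent amount of Gaussian noise to past frames before
    they are used as conditioning for generating the next frame."""
    sigma = max_sigma * (1 - step / max_steps)   # assumed schedule: less noise as denoising progresses
    return frames + sigma * torch.randn_like(frames)
```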
Key insights:
- Generates 20 FPS, vs 0.2 FPS for other DiT-based video models
- The specialized Sohu hardware developed by Etched can handle 10x more players than an H100
- Features real game mechanics: movement, jumping, item management; physics and lighting; procedurally generated worlds
- Current limitations: blurry graphics at a distance, objects sometimes changing appearance, memory issues in long sessions
I'm very proud to have supported @CGIAR and @Digigreen in making http://Farmer.chat, an app that supports 20k smallholder farmers on a daily basis.
There are ~500 million smallholder farmers globally, playing a critical role in global food security. Having access to accurate information is essential for them.
๐ฌ An โagricultural extension serviceโ offers technical advice on agriculture, and also supplies farmers with the necessary inputs and services to support their agricultural production.
But agriculture extension agents are not in large enough numbers to cope with all the requests, especially in countries like Kenya, India, Ethiopia, and Nigeria.
๐ So the team set out to build an app called http://Farmer.Chat, to provide an agricultural extension service, by building on the immense knowledge accumulated by CGIAR.
โจ The app is technically impressive: behind the Whatsapp-type UX, an agent interprets the user's intent, and identifies which tool to call to best answer their request: weather API, RAG on a CGIAR-provided knowledge base, market data, etc. The RAG on the knowledge base is in itself a work of art.
๐ฏ A key part of building such a complex system is to be able to evaluate it properly. During our bi-weekly sessions with the team, I could support them in implementing the method called "LLM-as-a-judge" to tackle this problem.
It worked really well : thanks to the amazing work of the team, the app now successfully answered over 300 thousand requests, in 6 different languages, and it keeps growing!
โก๏ธ @Vinsingh, @rajgreen and I just wrote a blog post to describe how the app works, especially the LLM-as-a-judge system!
Cohere releases Aya Expanse 8B & 32B: SOTA multilingual models for 23 languages!
How did they manage to beat top contenders while also adding 23 languages?
Train on synthetic data:
- Synthetic data has been said to cause model collapse after too much training.
- Cohere introduces "data arbitrage" to prevent this, by strategically sampling from a pool of several teacher models instead of one single teacher.
- They first train a model pool for each group of languages, and employ an internal reward model named "Arbiter" to evaluate and select the optimal generation. Only the best generation is kept as the final completion for each prompt.
- This process is particularly effective in multilingual settings, where no single teacher model performs well across all languages: here "Multilingual Arbitrage" single-handedly improves win rates of the 8B model vs Gemma-2-9B by 10 points! (A minimal sketch of the loop is below.)
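Here is a minimal sketch of what that sample-then-arbitrate loop could look like; `teacher_models` and `arbiter_score` are placeholders I made up, not Cohere's actual code.

```python
def multilingual_arbitrage(prompt, teacher_models, arbiter_score):
    """Sample one completion per teacher model, then keep the one the internal
    reward model ('Arbiter') scores highest as the final training completion."""
    candidates = [teacher.generate(prompt) for teacher in teacher_models]
    return max(candidates, key=lambda completion: arbiter_score(prompt, completion))
```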
Use model merging: rather than struggling to find the right data mix to train a single model for multilingual use, just train language-specific models then merge them!
- Maximize diversity between merged checkpoints by training each on a different language family.
- They experimented with fancy techniques (SLERP, TIES, DARE-TIES) but found weighted averaging to be the most consistent! (Sketch below.)
- Merging brought 3x more gains at the 35B scale than at the 8B scale, consistent with literature findings that merging is more effective at scale.
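Weighted averaging itself is simple enough to sketch over plain state dicts (the checkpoints and weights here are hypothetical):

```python
import torch

def merge_checkpoints(state_dicts, weights):
    """Weighted average of parameters across language-specific checkpoints."""
    total = sum(weights)
    return {name: sum(w * sd[name] for w, sd in zip(weights, state_dicts)) / total
            for name in state_dicts[0]}
```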
Great performance: automatic evaluations on the Arena-Hard-Auto dataset:
- Aya Expanse 8B beats models from its weight class such as Gemma 2 9B, Llama 3.1 8B, and the recent Ministral 8B, with win rates ranging from 60.4% to 70.6%
- Aya Expanse 32B outperforms Gemma 2 27B, Mixtral 8x22B, and Llama 3.1 70B (2x its size)
- But this performance eval comes from only one benchmark! Let's wait for Open LLM Leaderboard evals.
Let's say you're doing RAG, and in an effort to improve performance, you try to rerank a few possible source snippets by their relevancy to a query.
How can you score similarity between your query and any source document?
No interaction (bi-encoders): you encode each token from both the query and the doc as separate vectors, then average the tokens of each side separately to get 2 vectors in total, then you compute similarity via cosine or something. Notable examples: check the top of the MTEB leaderboard!
Late interaction (e.g. ColBERT): you encode each token from both query and doc as separate vectors as before, but compare them all together without averaging first, so no information is lost.
This is more accurate than no-interaction but also slower, because you have to compare n*m pairs of vectors instead of 2. At least you can pre-encode and store the documents. And ColBERT has some optimisations like pooling to be faster. (Toy comparison of the two below.)
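To make the two options concrete, here is a toy comparison with random "token embeddings"; the shapes and mean pooling are illustrative, not any specific model.

```python
import torch
import torch.nn.functional as F

query_tokens = torch.randn(8, 128)    # n query tokens x dim
doc_tokens = torch.randn(50, 128)     # m doc tokens x dim

# No interaction: pool each side into one vector, then a single cosine similarity
q_vec, d_vec = query_tokens.mean(dim=0), doc_tokens.mean(dim=0)
no_interaction_score = F.cosine_similarity(q_vec, d_vec, dim=0)

# Late interaction (ColBERT-style MaxSim): compare every query token to every doc token
q_norm, d_norm = F.normalize(query_tokens, dim=-1), F.normalize(doc_tokens, dim=-1)
late_interaction_score = (q_norm @ d_norm.T).max(dim=1).values.sum()

print(float(no_interaction_score), float(late_interaction_score))
```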
By far the coolest release of the day! The Open LLM Leaderboard, the most comprehensive suite for comparing open LLMs on many benchmarks, just released a comparator tool that lets you dig into the details of the differences between any two models.
Here's me checking how the new Llama-3.1-Nemotron-70B that we've heard so much about compares to the original Llama-3.1-70B.
Thought that self-attention could not be improved anymore?
Microsoft researchers have dropped a novel "differential attention" mechanism that amplifies focus on relevant context while canceling out noise. It sounds like a free lunch, but it really does seem to vastly improve LLM performance!
Key insights:
- Differential attention computes the difference between two separate softmax attention maps, canceling out noise and promoting sparse attention patterns (see the sketch after this list)
- DIFF Transformer outperforms standard Transformers while using 35-40% fewer parameters or training tokens
- Scales well to long contexts up to 64K tokens, leveraging increasing context length more effectively
- Dramatically improves key information retrieval, enhances in-context learning, and possibly reduces the risk of hallucinations
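Based on the paper's description, the core computation is just the difference of two softmax attention maps. Here is a minimal single-head sketch; the paper's λ re-parameterization, GroupNorm, and multi-head plumbing are left out.

```python
import torch
import torch.nn.functional as F

def diff_attention(x, wq1, wk1, wq2, wk2, wv, lam=0.5):
    """Single-head differential attention: the second ('noise') attention map
    is subtracted from the first, scaled by lambda."""
    d = wq1.shape[1]
    a1 = F.softmax((x @ wq1) @ (x @ wk1).T / d**0.5, dim=-1)
    a2 = F.softmax((x @ wq2) @ (x @ wk2).T / d**0.5, dim=-1)
    return (a1 - lam * a2) @ (x @ wv)

x = torch.randn(16, 64)                           # 16 tokens, model dim 64
weights = [torch.randn(64, 32) / 8 for _ in range(5)]
out = diff_attention(x, *weights)                 # -> (16, 32)
```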
Rhymes AI drops Aria: a small multimodal MoE that beats GPT-4o and Gemini-1.5-Flash
A new player has entered the game! Rhymes AI has just been announced, and unveiled Aria, a multimodal powerhouse that's punching above its weight.
Key insights:
- Mixture-of-Experts architecture: 25.3B total params, but only 3.9B active.
- Multimodal: text/image/video → text.
- Novel training approach: "multimodal-native", where multimodal training starts directly during pre-training, not just tacked on later.
- Long 64K-token context window.
- Apache 2.0 license, with weights, code, and demos all open.
On the benchmark side, Aria leaves some big names in the dust.
- It beats Pixtral 12B and Llama-3.2-11B on several vision benchmarks like MMMU and MathVista.
- It even overcomes the much bigger GPT-4o on long video tasks, and outshines Gemini 1.5 Flash when it comes to parsing lengthy documents.
But Rhymes AI isn't just showing off benchmarks. They've already got Aria powering a real-world augmented search app called "Beago". It handles even recent events with great accuracy!
And they partnered with AMD to make it much faster than competitors like Perplexity or Gemini search.
Microsoft researchers dropped a groundbreaking technique that could slash the energy use of transformer computations: their novel "linear-complexity multiplication" (L-Mul) algorithm approximates floating-point multiplication using energy-efficient integer addition instead of costly multiplications.
Quick reminder on how floats are coded on 8 bits (FP8). In the e4m3 FP8 standard, you encode a number as:
Sign (1 bit) | Exponent (4 bits) | Mantissa (3 bits)
Example: 0 (positive) | 1000 (8) | 101 (1/2 + 1/8 = 0.625)
Calculation: you add one to the mantissa, and multiply it by 2 to the power (exponent - bias), where the bias is 7 for e4m3:
You get (1 + 0.625) × 2^(8-7) = 3.25
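The same decoding as a quick script (normal numbers only; subnormals and special values are ignored):

```python
def decode_e4m3(sign_bit, exponent_bits, mantissa_bits):
    """Decode a 'normal' e4m3 FP8 value: (-1)^s * (1 + m/8) * 2^(e - 7)."""
    sign = -1 if sign_bit else 1
    mantissa = 1 + mantissa_bits / 8          # 3 mantissa bits -> eighths
    return sign * mantissa * 2 ** (exponent_bits - 7)

print(decode_e4m3(0, 0b1000, 0b101))          # (1 + 0.625) * 2^1 = 3.25, as above
```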
Now back to the paper. Key insights:
- Multiplication is extremely energy-intensive compared to addition. For 32-bit operations, multiplication (3.7 pJ) uses 37x more energy than addition (0.1 pJ)!
- Traditional floating-point multiplication goes like this (noting xm the mantissa and xe the exponent of x):
Mul(x,y) = (1 + xm) · 2^xe · (1 + ym) · 2^ye = (1 + xm + ym + xm·ym) · 2^(xe+ye)
- L-Mul cleverly approximates this as: L-Mul(x,y) = (1 + xm + ym + 2^(-l(m))) · 2^(xe+ye), eliminating the costly xm·ym term.
- The l(m) term is set adaptively based on the mantissa size, for optimal accuracy.
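Putting the two formulas side by side in toy code, working directly on (mantissa, exponent) pairs. The default l(m)=3 for a 3-bit mantissa is my guess; the paper derives l(m) from the mantissa size.

```python
def exact_mul(xm, xe, ym, ye):
    """Exact product of (1+xm)*2^xe and (1+ym)*2^ye."""
    return (1 + xm + ym + xm * ym) * 2 ** (xe + ye)

def l_mul(xm, xe, ym, ye, lm=3):
    """L-Mul: replace the xm*ym cross term with a constant 2^(-l(m)),
    so no mantissa multiplication is needed."""
    return (1 + xm + ym + 2 ** -lm) * 2 ** (xe + ye)

print(exact_mul(0.625, 1, 0.25, 0))   # 4.0625
print(l_mul(0.625, 1, 0.25, 0))       # 4.0 -> close, without multiplying mantissas
```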
- Benchmarks on the Llama-3.1-8B-Instruct model show L-Mul preserves precision across various NLP tasks, with performance nearly identical to full BFloat16 precision.
- Authors claim: "We can achieve the same model inference performance while reducing the energy cost of attention computations by 80%."
This breakthrough is still theoretical and would need implementation on dedicated hardware to confirm real-world gains, but it's a really exciting path for more sustainable AI!
Researchers from Mila and Borealis AI have just shown that simplified versions of good old Recurrent Neural Networks (RNNs) can match the performance of today's transformers.
They took a fresh look at LSTMs (from 1997!) and GRUs (from 2014). They stripped these models down to their bare essentials, creating "minLSTM" and "minGRU". The key changes:
1. Removed dependencies on previous hidden states in the gates
2. Dropped the tanh that had been added to restrict output range in order to avoid vanishing gradients
3. Ensured outputs are time-independent in scale (not sure I understood that well either, don't worry)
As a result, you can use a "parallel scan" algorithm to train these new, minimal RNNs in parallel, taking 88% more memory but making them 200x faster than their traditional counterparts for long sequences.
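For reference, here is the minGRU recurrence in its readable sequential form. The whole point of the paper is that, since the gate and candidate no longer depend on h_{t-1}, the same recurrence can also be computed with a parallel scan; this sketch only shows the slow, explicit loop.

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    """minGRU: z_t = sigmoid(W_z x_t), h~_t = W_h x_t, h_t = (1-z_t)*h_{t-1} + z_t*h~_t."""
    def __init__(self, dim_in, dim_hidden):
        super().__init__()
        self.to_z = nn.Linear(dim_in, dim_hidden)   # update gate (depends on x_t only)
        self.to_h = nn.Linear(dim_in, dim_hidden)   # candidate state (no tanh)

    def forward(self, x):                            # x: (batch, time, dim_in)
        z = torch.sigmoid(self.to_z(x))
        h_tilde = self.to_h(x)
        h = torch.zeros(x.shape[0], self.to_h.out_features, device=x.device)
        outs = []
        for t in range(x.shape[1]):                  # sequential form; a parallel scan gives the same result
            h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

y = MinGRU(32, 64)(torch.randn(2, 10, 32))           # -> (2, 10, 64)
```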
The results are mind-blowing! Performance-wise, they go toe-to-toe with Transformers or Mamba.
And for language modeling, they need 2.5x fewer training steps than Transformers to reach the same performance!
Why does this matter?
By showing there are simpler models with similar performance to transformers, this challenges the narrative that we need advanced architectures for better performance!
François Chollet wrote in a tweet about this paper:
"The fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)."
"Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape."
It's the Bitter Lesson by Rich Sutton striking again: you don't need fancy thinking architectures, just scale up your model and data!
#phdone - I defended my PhD yesterday! A key lesson: it is amazing how open science and open source can empower beginners with limited resources:
I first learned about instruction-based classifiers like BERT-NLI 3-4 years ago, through the @HuggingFace ZeroShotClassificationPipeline. Digging deeper into this, it was surprisingly easy to find new datasets, newer base models, and reusable fine-tuning scripts on the HF Hub to create my own zeroshot models - although I didn't know much about fine-tuning at the time.
Thanks to the community effect of the Hub, my models were downloaded hundreds of thousands of times after a few months. Seeing my research being useful for people motivated me to improve and upload newer models. Leaving my contact details in the model cards led to academic cooperation and consulting contracts (and eventually my job at HF).
That's the power of open science & open source: learning, sharing, improving, collaborating.
I mean every word in my thesis acknowledgments (screenshot). I'm very grateful to my supervisors @vanatteveldt, @CasAndreu, and @KasperWelbers for their guidance; to @profAndreaRenda and @CEPS_thinktank for enabling me to work part-time during the first year; to @huggingface for creating awesome tools and an awesome platform; and to many others who are not active on social media.
Links to the full thesis and the collection of my most recent models are below.
PS: If someone happens to speak Latin, let me know if my diploma contains some hidden Illuminati code or something :D
出海 ("sailing abroad"): Chinese AI is expanding globally
Fact: Chinese LLMs are heavily underrated; see for instance the recent excellent DeepSeek-V2.5 or the Qwen models.
Luckily for us, @AdinaY just wrote an excellent blog post explaining the Chinese AI ecosystem!
My key takeaways:
Since Google, OpenAI and Anthropic models are not available in China, local companies are fighting for the market. A really good market - AI has much higher penetration there than in the rest of the world, both with companies and individual users!
But since DeepSeek heavily cut prices in May 2024, this spiraled into a price war that created a cut-throat environment with unsustainably low prices.
On top of this, local regulation is stringent: models must be licensed by a local censor (the Cyberspace Administration of China), which for instance requires models to refuse to answer certain questions about the CCP. Although this is certainly simpler to implement than certain conditions of the European AI Act.
If this wasn't enough, VC investment in AI is drying up: by mid-2024, Chinese AI startups had raised approximately $4.4 billion, vs $55B for US startups in Q2 2024 alone.
To reach profitability, companies have shifted from foundation models to model + application, for instance PopAI from [01.AI](http://01.ai/), with millions of users and high profitability.
They also try to drill down into specific industries, but these niches are getting crowded too.
Since their home market is becoming both too crowded and inhospitable, Chinese companies are now going for international markets, "sailing abroad" following the expression consecrated by Zheng He's legendary voyages in the early 1400s.
There, they'll have to adapt to different infrastructures and regulations, but they have bright prospects for growth!
This is the most important research in months: we're now very close to having a single architecture to handle all modalities. The folks at the Beijing Academy of Artificial Intelligence (BAAI) just released Emu3, a single model that handles text, images, and videos all at once.
What's the big deal? Emu3 is the first model to truly unify all these different types of data (text, images, video) using just one simple trick: predicting the next token. And it's only 8B, but really strong:
- For image generation, it's matching the best specialized models out there, like SDXL.
- In vision tasks, it's outperforming top models like LLaVA-1.6-7B, which is a big deal for a model that wasn't specifically designed for this.
- It's the first to nail video generation without using complicated diffusion techniques.
How does it work?
- Emu3 uses a special tokenizer (SBER-MoVQGAN) to turn images and video clips into sequences of 4,096 tokens.
- Then, it treats everything - text, images, and videos - as one long series of tokens to predict.
- During training, it just tries to guess the next token, whether that's a word, part of an image, or a video frame.
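Conceptually, training then looks like ordinary language modeling over one mixed token stream. A toy sketch of that loss; the model interface and token ids are placeholders, not Emu3's actual code:

```python
import torch
import torch.nn.functional as F

def unified_lm_loss(model, text_ids, vision_ids):
    """Concatenate text tokens and discrete vision tokens into one sequence,
    then train with plain next-token prediction (model returns raw logits)."""
    seq = torch.cat([text_ids, vision_ids], dim=-1)     # (batch, seq_len), one long stream
    logits = model(seq[:, :-1])                         # predict token t+1 from tokens <= t
    return F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                           seq[:, 1:].reshape(-1))
```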
Caveats on the results:
- In image generation, Emu3 beats SDXL, but it's also much bigger (8B vs 3.5B). It would be more difficult to beat the real diffusion GOAT, FLUX-dev.
- In vision, the authors also don't show a comparison against all the current SOTA models like Qwen-VL or Pixtral.
This approach is exciting because it's simple (next-token prediction) and scalable (handles all sorts of data)!
RAG systems are supposed to make your LLM's answers more trustworthy, by inserting into the prompt some supporting documents from a knowledge base: we say that we're "adding some context".
But if you don't know which part of the answer has been generated based on which input tokens, it's hard to tell whether it was effectively grounded in the context knowledge or not!
I've been working on the question: is it possible to add notes to the answer linking to which part of the context they're generated from?
And I've found a great solution: a technique called Layer-wise Relevance Propagation (LRP), showcased in a paper at ICML '24 by Reduan Achtibat et al., allows you to precisely score how important each input token was in generating your output! They've made it into a library called LXT.
For each generated output token, LXT gives you attribution scores for each input token.
So I've worked a bit more on aggregating these scores into meaningful spans between successive input and output tokens, and I finally obtained my desired result: RAG with source highlighting!
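The aggregation step itself is easy to sketch: given an (output tokens × input tokens) attribution matrix like the one LXT produces, find the context window that best explains a span of the answer. The `relevance` matrix and the fixed-window heuristic below are made up for illustration, not my exact code.

```python
import numpy as np

def top_source_span(relevance, output_span, window=20):
    """For a span of output tokens, return the input-token window with the
    highest summed attribution - the 'source' to highlight in the context."""
    scores = relevance[output_span].sum(axis=0)               # aggregate over the output span
    window_scores = np.convolve(scores, np.ones(window), mode="valid")
    start = int(window_scores.argmax())
    return start, start + window

relevance = np.random.rand(30, 500)                            # 30 output tokens x 500 context tokens
print(top_source_span(relevance, slice(5, 12)))
```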
Caveats:
- It slows down generation (for now quite a lot; this could hopefully be reduced)
- For now it supports only specific models: Llama models and Mixtral
If there's enough interest in this solution, I can improve it further and spin it off into a specific library for RAG!
Transformers v4.45.0 released: includes a lightning-fast method to build tools!
During user research with colleagues @MoritzLaurer and @Jofthomas, we discovered that the class definition currently used to define a Tool in transformers.agents is a bit tedious to use, because it requires a lot of detail.
So I've made an easier way to build tools: just write a function with type hints + a docstring, and add a @tool decorator in front.
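It looks roughly like this (a minimal example; double-check the exact import path in the agents docs, and the body is obviously a placeholder):

```python
from transformers import tool  # transformers >= 4.45

@tool
def get_weather(city: str) -> str:
    """Returns a short weather report for a city.

    Args:
        city: Name of the city to get the weather for.
    """
    return f"The weather in {city} is sunny."  # placeholder implementation
```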
Hurricane Katrina killed hundreds of people as it made landfall on New Orleans in 2005 - many of these deaths could have been avoided if alerts had been given one day earlier. Accurate weather forecasts are really life-saving.
Now, NASA and IBM just dropped a game-changing new model: the first-ever foundation model for weather! This means it's the first time we have a generalist model not restricted to one task, but able to predict 160 weather variables!
Prithvi WxC (Prithvi, "पृथ्वी", is the Sanskrit name for Earth) is a 2.3 billion parameter model, with an architecture close to previous vision transformers like Hiera.
But it comes with some important tweaks: under the hood, Prithvi WxC uses a clever transformer-based architecture with 25 encoder and 5 decoder blocks. It alternates between "local" and "global" attention to capture both regional and global weather patterns.
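A toy sketch of that local/global alternation; the window size, head count, and exact windowing scheme are my assumptions, not Prithvi WxC's actual blocks.

```python
import torch
import torch.nn as nn

class AlternatingAttention(nn.Module):
    """Alternates windowed ('local') and full ('global') self-attention blocks."""
    def __init__(self, dim, depth, window):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=8, batch_first=True) for _ in range(depth))
        self.window = window

    def forward(self, x):                          # x: (batch, tokens, dim), tokens divisible by window
        for i, attn in enumerate(self.blocks):
            if i % 2 == 0:                         # local: attend only within fixed-size windows
                b, n, d = x.shape
                w = x.reshape(b * n // self.window, self.window, d)
                x = attn(w, w, w)[0].reshape(b, n, d)
            else:                                  # global: attend over the full sequence
                x = attn(x, x, x)[0]
        return x

out = AlternatingAttention(dim=64, depth=4, window=16)(torch.randn(2, 128, 64))
```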
Key insights:
- Nails short-term forecasts: Prithvi WxC crushed it on 6-12 hour predictions, even outperforming some traditional numerical weather models
- Tracks hurricanes like a champ: for Hurricane Ida, it predicted the landfall location within 5 km (vs 20+ km errors from other AI models), which is huge progress!
- 6x downscaling power: it can zoom in on weather data to 6x higher resolution with 4x lower error than basic methods
- Models elusive gravity waves: accurately simulates these crucial but hard-to-capture atmospheric oscillations
As climate change intensifies, tools like Prithvi WxC will become more and more crucial to avoid disasters!
This Stanford paper might be the key to OpenAI o1's performance: what's so effective about Chain of Thought? It unlocks radically different sequential tasks!
Reminder: a Chain of Thought (CoT) means that you instruct the model to "think step by step". Often it's literally just putting "let's think step by step" in the prompt.
This method has been shown to be unreasonably effective at increasing performance on benchmarks. However, why it works so well remains unclear.
Here's the scoop: Transformers are amazing at parallel processing, but they've always struggled with tasks that require sequential reasoning.
For instance, if you ask them the result of 3^2^2^2^…, with 20 iterations, they'll nearly always fail.
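A concrete picture of "inherently serial": iterated squaring, where step t cannot start before step t-1 has finished (the modulus is only there to keep the numbers readable).

```python
x = 3
for _ in range(20):            # 3^2^2^...^2, 20 iterations of squaring
    x = (x * x) % 1_000_003    # each step needs the previous result: no way to parallelize
print(x)
```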
Indeed, the researchers prove mathematically, by mapping transformer networks onto logical circuits, that they cannot solve sequential tasks requiring more than a certain number of serial steps.
But CoT enables sequential reasoning:
- Each step in the CoT corresponds to simulating one operation in a complex circuit.
- This allows the transformer to "reset" the depth of intermediate outputs, overcoming previous limitations.
- Thus, with CoT, constant-depth transformers can now solve ANY problem computable by polynomial-size circuits! (That's a huge class of problems in computer science.)
- Transformers can now handle tricky tasks like iterated squares (computing 3^2^2^2^2), composing permutations, and evaluating circuits - stuff that requires serial computation.
- The improvement is especially dramatic for transformers with limited depth. Empirical tests on four arithmetic problems showed massive accuracy gains with CoT on inherently serial tasks.
Main takeaway: Chain-of-thought isn't just a neat trick - it fundamentally expands what transformer models can do!