Will Brooks

TornButter

AI & ML interests

None yet

Recent Activity

liked a Space about 4 hours ago
tencent/Hunyuan3D-2
liked a model 2 days ago
deepseek-ai/Janus-Pro-7B

Organizations

None yet

TornButter's activity

reacted to MoritzLaurer's post with šŸ”„ 23 days ago
The TRL v0.13 release is šŸ”„! My highlights are the new process reward trainer to train models similar to o1, and tool call support:

šŸ§  Process reward trainer: Enables training of Process-supervised Reward Models (PRMs), which reward the quality of intermediate steps rather than only the final answer, promoting structured reasoning. Perfect for stepwise-reasoning tasks like math problem solving.
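
A minimal training sketch, closely following the TRL docs; the small Qwen model and the trl-lib/math_shepherd dataset are illustrative choices, not requirements:

```python
# Sketch: train a PRM with TRL v0.13. A PRM is a token-classification model
# that scores each intermediate solution step as correct/incorrect.
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import PRMConfig, PRMTrainer

model = AutoModelForTokenClassification.from_pretrained("Qwen/Qwen2-0.5B", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")

# Stepwise-supervised data: each example has a prompt, a list of completion
# steps, and one boolean label per step.
train_dataset = load_dataset("trl-lib/math_shepherd", split="train")

trainer = PRMTrainer(
    model=model,
    args=PRMConfig(output_dir="Qwen2-0.5B-PRM", logging_steps=10),
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```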

šŸ”€ Model merging: A new callback leverages mergekit to merge models during training, improving performance by blending reference and policy models - optionally pushing merged models to the Hugging Face Hub.
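
A hedged sketch of wiring the merge callback into a DPO run; MergeModelCallback and MergeConfig follow the release notes, while the model and dataset choices are assumptions:

```python
# Sketch: merge reference and policy models during training via mergekit.
# Requires `pip install trl mergekit`. Model/dataset names are illustrative.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer, MergeModelCallback
from trl.mergekit_utils import MergeConfig

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

merge_callback = MergeModelCallback(MergeConfig())  # default mergekit recipe
trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="Qwen2-0.5B-DPO"),
    train_dataset=dataset,
    processing_class=tokenizer,
    callbacks=[merge_callback],  # merged model produced alongside checkpoints
)
trainer.train()
```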

šŸ› ļø Tool call support: TRL preprocessing now supports tool integration, laying the groundwork for agent fine-tuning with examples like dynamic temperature fetching in prompts.

āš–ļø Mixture of judges: The new AllTrueJudge combines decisions from multiple binary judges for more nuanced evaluation.

Read the release notes and other resources here šŸ‘‡
Release: https://github.com/huggingface/trl/releases/tag/v0.13.0
Mergekit: https://github.com/arcee-ai/mergekit
Mixture of judges paper: The Perfect Blend: Redefining RLHF with Mixture of Judges (arXiv:2409.20370)
reacted to singhsidhukuldeep's post with šŸ”„ about 1 month ago
Exciting News in AI: Jina AI Releases JINA-CLIP-v2!

The team at Jina AI has just released a groundbreaking multilingual multimodal embedding model that's pushing the boundaries of text-image understanding. Here's why this is a big deal:

šŸš€ Technical Highlights:
- Dual encoder architecture combining a 561M parameter Jina XLM-RoBERTa text encoder and a 304M parameter EVA02-L14 vision encoder
- Supports 89 languages with 8,192 token context length
- Processes images up to 512Ɨ512 pixels with 14Ɨ14 patch size
- Implements FlashAttention2 for text and xFormers for vision processing
- Uses Matryoshka Representation Learning for efficient vector storage (see the usage sketch below)
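
A quick usage sketch combining these pieces; encode_text/encode_image and the truncate_dim argument follow the model card's custom code, but treat the exact signatures as assumptions:

```python
# Sketch: multilingual text + image embeddings with Matryoshka truncation.
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

texts = ["A photo of a cat", "Una foto de un gato"]  # multilingual queries
image_urls = ["https://example.com/cat.jpg"]         # placeholder image URL

# truncate_dim keeps only the leading Matryoshka dimensions (e.g. 256 of 1024).
text_emb = model.encode_text(texts, truncate_dim=256)
image_emb = model.encode_image(image_urls, truncate_dim=256)
print(text_emb.shape, image_emb.shape)  # (2, 256), (1, 256)
```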

āš”ļø Under The Hood:
- Multi-stage training process with progressive resolution scaling (224ā†’384ā†’512)
- Contrastive learning using InfoNCE loss in both directions (see the sketch after this list)
- Trained on a massive multilingual dataset including 400M English and 400M multilingual image-caption pairs
- Incorporates specialized datasets for document understanding, scientific graphs, and infographics
- Uses hard negative mining with 7 negatives per positive sample
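
A minimal PyTorch sketch of that bidirectional InfoNCE objective, with random tensors standing in for encoder outputs:

```python
# Sketch: symmetric InfoNCE over in-batch text/image pairs. Matched pairs
# sit on the diagonal of the similarity matrix; everything else is a negative.
import torch
import torch.nn.functional as F

def bidirectional_info_nce(text_emb, image_emb, temperature=0.07):
    text_emb = F.normalize(text_emb, dim=-1)    # cosine similarity via dot product
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)      # text -> image
    loss_i2t = F.cross_entropy(logits.t(), targets)  # image -> text
    return (loss_t2i + loss_i2t) / 2

loss = bidirectional_info_nce(torch.randn(8, 1024), torch.randn(8, 1024))
print(loss.item())
```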

šŸ“Š Performance:
- Outperforms previous models on visual document retrieval (52.65% nDCG@5)
- Achieves 89.73% image-to-text and 79.09% text-to-image retrieval on CLIP benchmark
- Strong multilingual performance across 30 languages
- Maintains performance even with 75% dimension reduction (256D vs 1024D; see the truncation sketch after this list)
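
To make that dimension-reduction claim concrete, here's a tiny sketch of Matryoshka-style truncation on stored vectors, with random data standing in for real embeddings:

```python
# Sketch: truncate embeddings to their leading dims and renormalize.
# Matryoshka training makes these prefixes usable embeddings on their own.
import numpy as np

def truncate_and_normalize(emb, dim=256):
    emb = emb[..., :dim]  # keep only the first `dim` coordinates
    return emb / np.linalg.norm(emb, axis=-1, keepdims=True)

full = np.random.randn(5, 1024)                # stand-ins for 1024-D vectors
small = truncate_and_normalize(full, dim=256)  # 256-D: 75% less storage
print(small.shape)  # (5, 256)
```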

šŸŽÆ Key Innovation:
The model solves the long-standing challenge of unifying text-only and multi-modal retrieval systems while adding robust multilingual support. Perfect for building cross-lingual visual search systems!

Kudos to the research team at Jina AI for this impressive advancement in multimodal AI!