Daniel Bram
keerekeerweere
AI & ML interests
None yet
Recent Activity
reacted to singhsidhukuldeep's post with 👀 · 19 days ago
Exciting breakthrough in multimodal search technology! @nvidia researchers have developed MM-Embed, a groundbreaking universal multimodal retrieval system that's changing how we think about search.
Key innovations:
• First-ever universal multimodal retriever that excels at both text and image searches across diverse tasks
• Leverages advanced multimodal LLMs to understand complex queries combining text and images
• Implements novel modality-aware hard negative mining to overcome modality bias issues
• Achieves state-of-the-art performance on M-BEIR benchmark while maintaining superior text retrieval capabilities
Under the hood:
The system uses a bi-encoder architecture with LLaVA-NeXT (based on Mistral 7B) as its backbone. It employs a two-stage training approach: first with random negatives, then with carefully mined hard negatives to improve cross-modal understanding.
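For intuition, here is a minimal PyTorch sketch of the contrastive objective behind that two-stage setup. The shapes and the InfoNCE formulation are my assumptions, not MM-Embed's released code:

```python
# A minimal sketch of the two-stage contrastive objective (assumed shapes
# and loss form; MM-Embed's actual training code is not in the post).
import torch
import torch.nn.functional as F

def info_nce_loss(q_emb, pos_emb, neg_emb, temperature=0.05):
    """Contrastive loss over one positive and K negatives per query.

    q_emb:   (B, D) query embeddings from the multimodal LLM backbone
    pos_emb: (B, D) embeddings of the matching candidates
    neg_emb: (B, K, D) negatives: random in stage one, mined hard
             negatives in stage two
    """
    q = F.normalize(q_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    neg = F.normalize(neg_emb, dim=-1)

    pos_scores = (q * pos).sum(-1, keepdim=True)     # (B, 1)
    neg_scores = torch.einsum("bd,bkd->bk", q, neg)  # (B, K)
    logits = torch.cat([pos_scores, neg_scores], dim=1) / temperature
    target = torch.zeros(q.size(0), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, target)
```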
The real magic happens in the modality-aware negative mining, where the system learns to distinguish between incorrect modality matches and unsatisfactory information matches, ensuring retrieved results match both content and format requirements.
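Here is one plausible reading of that mining step in code. The field names ("doc_id", "modality") and the even sampling split are hypothetical; the post describes the idea, not the implementation:

```python
# One plausible reading of modality-aware hard negative mining.
def mine_modality_aware_negatives(query, ranked, positives, k=8):
    """Split top-ranked non-positives by failure type, then draw from both."""
    wrong_modality, wrong_content = [], []
    for cand in ranked:
        if cand["doc_id"] in positives:
            continue  # skip true positives
        if cand["modality"] != query["target_modality"]:
            wrong_modality.append(cand)  # plausible score, wrong format
        else:
            wrong_content.append(cand)   # right format, unsatisfactory info
    # Mixing both pools penalizes each failure mode; the sampling ratio
    # MM-Embed actually uses is an open detail here.
    half = k // 2
    return wrong_modality[:half] + wrong_content[:k - half]
```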
What sets it apart is its ability to handle diverse search scenarios, from simple text queries to complex combinations of images and text, all while maintaining high accuracy across different domains.
reacted to singhsidhukuldeep's post with 🤗 · 19 days ago
Exciting breakthrough in Document AI! Researchers from UNC Chapel Hill and Bloomberg have developed M3DocRAG, a revolutionary framework for multi-modal document understanding.
The innovation lies in its ability to handle complex document scenarios that traditional systems struggle with:
- Process 40,000+ pages across 3,000+ documents
- Answer questions requiring information from multiple pages
- Understand visual elements like charts, tables, and figures
- Support both closed-domain (single document) and open-domain (multiple documents) queries
Under the hood, M3DocRAG operates through three sophisticated stages:
>> Document Embedding:
- Converts PDF pages to RGB images
- Uses ColPali to project both text queries and page images into a shared embedding space
- Creates dense visual embeddings for each page while preserving its layout and visual detail
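A minimal sketch of this stage, using the real pdf2image library for page rendering and a placeholder embed_fn standing in for a ColPali-style encoder (the exact colpali-engine API isn't shown in the post):

```python
# Stage-one sketch: render PDF pages to RGB images, then embed each page.
# embed_fn is a placeholder for a ColPali-style model that maps page images
# to multi-vector (per-patch) embeddings in the shared query/page space.
from pdf2image import convert_from_path

def render_pages(pdf_path, dpi=144):
    """Convert every page of a PDF into an RGB PIL image."""
    return [page.convert("RGB") for page in convert_from_path(pdf_path, dpi=dpi)]

def embed_corpus(pdf_paths, embed_fn):
    """Embed all pages of all documents."""
    corpus = []
    for path in pdf_paths:
        pages = render_pages(path)
        for i, emb in enumerate(embed_fn(pages)):
            corpus.append((path, i, emb))  # (document, page index, embedding)
    return corpus
```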
>> Page Retrieval:
- Employs MaxSim scoring to compute relevance between queries and pages
- Implements inverted file indexing (IVFFlat) for efficient search
- Reduces retrieval latency from 20s to under 2s when searching 40K+ pages
- Supports approximate nearest neighbor search via Faiss
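Here is a hedged sketch of that retrieval stage with Faiss: an IVFFlat index over one pooled vector per page for the approximate pass, then exact MaxSim re-scoring of the shortlist. Pooling-then-rescoring is one plausible way to combine IVFFlat with multi-vector embeddings; the post doesn't spell out the exact pipeline:

```python
import faiss

def maxsim(query_vecs, page_vecs):
    """Late-interaction MaxSim: each query token vector takes its best match
    among the page's patch vectors; scores are summed over query tokens."""
    sims = query_vecs @ page_vecs.T          # (n_query_tokens, n_patches)
    return float(sims.max(axis=1).sum())

def build_ivf_index(pooled_pages, nlist=256):
    """IVFFlat index over one pooled float32 vector per page."""
    d = pooled_pages.shape[1]
    quantizer = faiss.IndexFlatIP(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(pooled_pages)                # k-means clustering of pages
    index.add(pooled_pages)
    return index

def retrieve(query_vecs, index, page_vecs_by_id, top_k=4, n_candidates=100):
    pooled_q = query_vecs.mean(axis=0, keepdims=True).astype("float32")
    _, ids = index.search(pooled_q, n_candidates)            # approximate pass
    scored = [(i, maxsim(query_vecs, page_vecs_by_id[i]))    # exact re-scoring
              for i in ids[0] if i != -1]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]
```

The coarse IVF pass keeps latency low at 40K+ pages, while the exact MaxSim pass preserves ranking quality on the shortlist.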
>> Question Answering:
- Leverages Qwen2-VL 7B as the multi-modal language model
- Processes retrieved pages through a visual encoder
- Generates answers considering both textual and visual context
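And a minimal sketch of the answering stage via the Hugging Face transformers interface to Qwen2-VL; the checkpoint name and chat format below follow the public model card, not necessarily the paper's exact harness:

```python
# Answering-stage sketch: feed the retrieved page images plus the question
# to Qwen2-VL-7B-Instruct and decode only the newly generated tokens.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

def answer(question, page_images):
    """Generate an answer grounded in the retrieved page images."""
    content = [{"type": "image"} for _ in page_images]
    content.append({"type": "text", "text": question})
    messages = [{"role": "user", "content": content}]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=page_images,
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    new_tokens = out[:, inputs["input_ids"].shape[1]:]  # strip the prompt
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```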
The results are impressive:
- State-of-the-art performance on MP-DocVQA benchmark
- Superior handling of non-text evidence compared to text-only systems
- Significantly better performance on multi-hop reasoning tasks
This is a game-changer for industries dealing with large document volumes—finance, healthcare, and legal sectors can now process documents more efficiently while preserving crucial visual context.
Organizations
None yet
models
None public yet
datasets
None public yet