
Kuldeep Singh Sidhu

singhsidhukuldeep

AI & ML interests

😃 TOP 3 on HuggingFace for posts 🤗 Seeking contributors for a completely open-source 🚀 Data Science platform! singhsidhukuldeep.github.io

Recent Activity

posted an update about 1 hour ago
I just came across a groundbreaking paper titled "Hypencoder: Hypernetworks for Information Retrieval" by researchers from the University of Massachusetts Amherst that introduces a fundamentally new paradigm for search technology.

Most current retrieval models rely on simple inner product calculations between query and document vectors, which severely limits their expressiveness. The authors prove theoretically that inner product similarity functions fundamentally constrain what types of relevance relationships can be captured.

Hypencoder takes a radically different approach: instead of encoding a query as a vector, it generates a small neural network (called a "q-net") that acts as a learned relevance function. This neural network takes document representations as input and produces relevance scores.

Under the hood, Hypencoder uses:
- Attention-based hypernetwork layers (hyperhead layers) that transform contextualized query embeddings into weights and biases for the q-net
- A document encoder that produces vector representations similar to existing models
- A graph-based greedy search algorithm for efficient retrieval that can search 8.8M documents in under 60 ms

The results are impressive: Hypencoder significantly outperforms strong dense retrieval models on standard benchmarks like MS MARCO and the TREC Deep Learning Track, and the performance gap widens even further on complex retrieval tasks like tip-of-the-tongue queries and instruction-following retrieval.

What makes this approach particularly powerful is that neural networks are universal approximators, allowing Hypencoder to express far more complex relevance relationships than inner product similarity functions. The framework is also flexible enough to replicate any existing neural retrieval method while adding the ability to learn query-dependent weights.
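To make the hypernetwork idea concrete, here is a minimal, illustrative PyTorch sketch (not the authors' code). The real Hypencoder uses attention-based hyperhead layers over contextualized query embeddings; the single linear hypernetwork, layer sizes, and names below are assumptions for illustration only.

```python
# Minimal sketch of the Hypencoder idea (illustrative, not the paper's code):
# a hypernetwork turns a pooled query embedding into the weights of a tiny
# "q-net" MLP, which then scores document vectors.
import torch
import torch.nn as nn


class HypencoderSketch(nn.Module):
    def __init__(self, emb_dim=768, hidden_dim=128):
        super().__init__()
        self.emb_dim, self.hidden_dim = emb_dim, hidden_dim
        # Hypernetwork: maps the query embedding to all q-net parameters
        # (one hidden layer: emb_dim -> hidden_dim -> 1).
        n_params = emb_dim * hidden_dim + hidden_dim + hidden_dim + 1
        self.hyper = nn.Linear(emb_dim, n_params)

    def build_qnet(self, query_emb):
        """Generate per-query weights and biases for the q-net."""
        p = self.hyper(query_emb)
        d, h = self.emb_dim, self.hidden_dim
        w1 = p[: d * h].view(h, d)
        b1 = p[d * h : d * h + h]
        w2 = p[d * h + h : d * h + 2 * h].view(1, h)
        b2 = p[-1:]
        return w1, b1, w2, b2

    def score(self, query_emb, doc_embs):
        """Apply the generated q-net to document vectors -> relevance scores."""
        w1, b1, w2, b2 = self.build_qnet(query_emb)
        hidden = torch.relu(doc_embs @ w1.T + b1)
        return (hidden @ w2.T + b2).squeeze(-1)


model = HypencoderSketch()
query = torch.randn(768)           # pooled query embedding (stand-in)
docs = torch.randn(1000, 768)      # document vectors (stand-in)
scores = model.score(query, docs)  # one relevance score per document
```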
posted an update 16 days ago
Fascinating deep dive into Swiggy's Hermes, their in-house Text-to-SQL solution that's revolutionizing data accessibility! Hermes enables natural language querying within Slack, generating and executing SQL queries with an impressive <2 minute turnaround time. The system architecture is particularly intriguing.

Technical Implementation:
- Built on GPT-4 with a Knowledge Base + RAG approach for Swiggy-specific context
- AWS Lambda middleware handles communication between the Slack UI and the Gen AI model
- Databricks jobs orchestrate query generation and execution

Under the Hood, the pipeline employs a sophisticated multi-stage approach:
1. Metrics retrieval using embedding-based vector lookup
2. Table/column identification through metadata descriptions
3. Few-shot SQL retrieval with vector-based search
4. Structured prompt creation with data snapshots
5. Query validation with automated error correction

Architecture Highlights:
- Compartmentalized by business units (charters) for better context management
- Snowflake integration with seamless authentication
- Automated metadata onboarding with QA validation
- Real-time feedback collection via Slack

What's particularly impressive is how they've solved the data context challenge through charter-specific implementations, significantly improving query accuracy for well-defined metadata sets. Kudos to the Swiggy team for democratizing data access across their organization. This is a brilliant example of practical AI implementation solving real business challenges.
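For intuition, here is a rough, hypothetical sketch of how such a retrieval-augmented Text-to-SQL pass might be wired together. It is not Swiggy's implementation; `embed`, `vector_store`, `llm`, and `validate` are stand-ins for real services (an embedding model, a vector index over metadata and past queries, a GPT-4-style completion call, and a SQL dry-run check).

```python
# Rough, hypothetical sketch of a Hermes-style Text-to-SQL pass (not Swiggy's code).

def text_to_sql(question: str, embed, vector_store, llm, validate) -> str:
    q_vec = embed(question)

    # Steps 1-2: retrieve relevant metric definitions and candidate
    # tables/columns via their metadata descriptions.
    schema_context = vector_store.search(q_vec, collection="table_metadata", top_k=5)

    # Step 3: retrieve similar past question -> SQL pairs as few-shot examples.
    examples = vector_store.search(q_vec, collection="sql_examples", top_k=3)

    # Step 4: build a structured prompt with schema snippets and examples.
    prompt = (
        "You write SQL for analytics questions.\n\n"
        f"Relevant tables and columns:\n{schema_context}\n\n"
        f"Example queries:\n{examples}\n\n"
        f"Question: {question}\nSQL:"
    )
    sql = llm(prompt)

    # Step 5: validate the query (e.g. a dry run against the warehouse) and
    # attempt one automated correction pass if it fails.
    try:
        validate(sql)
    except Exception as err:
        sql = llm(prompt + f"\n\nThe previous attempt failed with: {err}\nCorrected SQL:")
    return sql
```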
posted an update 19 days ago
Exciting breakthrough in neural search technology! Researchers from ETH Zurich, UC Berkeley, and Stanford University have introduced WARP, a groundbreaking retrieval engine that achieves remarkable performance improvements in multi-vector search.

WARP brings three major innovations to the table:
- A novel WARP SELECT algorithm for dynamic similarity estimation
- Implicit decompression during retrieval operations
- An optimized two-stage reduction process for efficient scoring

The results are stunning: WARP delivers a 41x reduction in query latency compared to existing XTR implementations, bringing response times down from 6+ seconds to just 171 milliseconds on single-threaded execution. It also achieves a 3x speedup over the current state-of-the-art ColBERTv2 PLAID engine while maintaining retrieval quality.

Under the hood, WARP uses highly optimized C++ kernels and specialized inference runtimes. It employs an innovative compression strategy using k-means clustering and quantized residual vectors, reducing index sizes by 2-4x compared to baseline implementations. The engine shows excellent scalability, with latency scaling with the square root of dataset size and effective parallelization across multiple CPU threads, achieving a 3.1x speedup with 16 threads.

This work represents a significant step forward in making neural search more practical for production environments. The researchers have made the implementation publicly available for the community.
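As a rough illustration of the residual-compression idea that this line of work builds on, here is a toy NumPy sketch: each token embedding is stored as a nearest-centroid id plus a coarsely quantized residual, which shrinks the index while keeping reconstruction cheap. The real engine does this in optimized C++ kernels; the centroid count, bit width, and uniform scalar quantizer below are arbitrary assumptions, not WARP's actual scheme.

```python
# Toy sketch of centroid + quantized-residual compression (illustrative only).
import numpy as np


def compress(embeddings, centroids, n_bits=2):
    """Encode each vector as (nearest centroid id, quantized residual)."""
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
    ids = dists.argmin(axis=1)
    residuals = embeddings - centroids[ids]
    # Uniform scalar quantization of residuals to n_bits per dimension.
    levels = 2 ** n_bits
    scale = np.abs(residuals).max() + 1e-9
    codes = np.clip(((residuals / scale) + 1) / 2 * (levels - 1), 0, levels - 1)
    return ids, codes.round().astype(np.uint8), scale


def decompress(ids, codes, scale, centroids, n_bits=2):
    """Reconstruct approximate embeddings from ids and residual codes."""
    levels = 2 ** n_bits
    residuals = (codes.astype(np.float32) / (levels - 1) * 2 - 1) * scale
    return centroids[ids] + residuals


rng = np.random.default_rng(0)
embs = rng.normal(size=(500, 128)).astype(np.float32)   # token embeddings (stand-in)
cents = rng.normal(size=(128, 128)).astype(np.float32)  # k-means centroids (stand-in)
ids, codes, scale = compress(embs, cents)
approx = decompress(ids, codes, scale, cents)
```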

Organizations

MLX Community · Social Post Explorers · C4AI Community

singhsidhukuldeep's activity

upvoted an article 7 months ago
Making LLMs lighter with AutoGPTQ and transformers

• 43
upvoted an article 9 months ago
LLM Comparison/Test: Llama 3 Instruct 70B + 8B HF/GGUF/EXL2 (20 versions tested and compared!)

By wolfram • 61
upvoted an article 10 months ago
Train custom AI models with the trainer API and adapt them to 🤗

By not-lain • 33