
Daniel Leong

daniel-ltw

AI & ML interests

None yet

Recent Activity

reacted to singhsidhukuldeep's post with πŸ”₯ 5 days ago
reacted to singhsidhukuldeep's post with πŸ‘ 5 days ago

Organizations

None yet

daniel-ltw's activity

reacted to davidberenstein1957's post with πŸ”₯ 4 days ago
πŸ₯Š Epic Agent Framework Showdown! Available today!

πŸ”΅ In the blue corner, the versatile challenger with a proven track record of knowledge retrieval: LlamaIndex!

πŸ›‘ In the red corner, the defender, weighing in with lightweight efficiency: Hugging Face smolagents!

πŸ”— URL: https://huggingface.co/agents-course

We just published the LlamaIndex unit for the agents course, and it offers a great contrast with the smolagents unit by looking at:

- What makes llama-index stand out
- How the LlamaHub is used for integrations
- Creating QueryEngine components
- Using agents and tools
- Agentic and multi-agent workflows

The team has been working flat-out on this for a few weeks, supported by Logan Markewich and Laurie Voss over at LlamaIndex.

Who won? You decide!
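As a taste of what the unit covers, here is a minimal QueryEngine sketch (my own sketch, assuming the llama-index package with its default OpenAI-backed settings and a local data/ folder; the course unit may structure this differently):

# pip install llama-index
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load local documents and build an in-memory vector index
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# A QueryEngine bundles retrieval and response synthesis behind one call
query_engine = index.as_query_engine()
print(query_engine.query("What makes llama-index stand out?"))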
reacted to singhsidhukuldeep's post with πŸ”₯πŸ‘ 5 days ago
Exciting New Tool for Knowledge Graph Extraction from Plain Text!

I just came across a groundbreaking new tool called KGGen that's solving a major challenge in the AI world - the scarcity of high-quality knowledge graph data.

KGGen is an open-source Python package that leverages language models to extract knowledge graphs (KGs) from plain text. What makes it special is its innovative approach to clustering related entities, which significantly reduces sparsity in the extracted KGs.

The technical approach is fascinating:

1. KGGen uses a multi-stage process involving an LLM (GPT-4o in their implementation) to extract entities and relations from source text
2. It aggregates graphs across sources to reduce redundancy
3. Most importantly, it applies iterative LM-based clustering to refine the raw graph

The clustering stage is particularly innovative - it identifies which nodes and edges refer to the same underlying entities or concepts. This normalizes variations in tense, plurality, stemming, and capitalization (e.g., "labors" clustered with "labor").

The researchers from Stanford and the University of Toronto also introduced MINE (Measure of Information in Nodes and Edges), the first benchmark for evaluating KG extractors. When tested against existing methods like OpenIE and GraphRAG, KGGen outperformed them by up to 18%.

For anyone working with knowledge graphs, RAG systems, or KG embeddings, this tool addresses the fundamental challenge of data scarcity that's been holding back progress in graph-based foundation models.

The package is available via pip install kg-gen, making it accessible to everyone. This could be a game-changer for knowledge graph applications!
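For reference, here is a hypothetical usage sketch pieced together from the post's description (the class and method names below are assumptions, not verified kg-gen docs):

# pip install kg-gen
from kg_gen import KGGen  # assumed import path

# Stages 1-2: LLM-based extraction of entities and relations from source text
kg = KGGen(model="openai/gpt-4o")  # assumed constructor argument
graph = kg.generate(input_data="Linda is Joshua's mother. Joshua studies labor economics.")

# Stage 3: iterative LM-based clustering merges variants like "labors" / "labor"
clustered = kg.cluster(graph)  # assumed method name
print(clustered.entities, clustered.relations)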
reacted to burtenshaw's post with πŸš€ 18 days ago
AGENTS + FINETUNING! This week Hugging Face learn has a whole pathway on finetuning for agentic applications. You can follow these two courses to get knowledge on levelling up your agent game beyond prompts:

1️⃣ New Supervised Fine-tuning unit in the NLP Course https://huggingface.co/learn/nlp-course/en/chapter11/1
2️⃣ New Fine-tuning for agents bonus module in the Agents Course https://huggingface.co/learn/agents-course/bonus-unit1/introduction

Fine-tuning will squeeze more out of your model for your specific use case than any prompt can.
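To make that concrete, here is a minimal supervised fine-tuning sketch with TRL (a sketch under assumed defaults, not the course's exact code; the model and dataset choices are placeholders):

# pip install trl datasets
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# A small chat-formatted dataset; swap in your own agent traces
dataset = load_dataset("trl-lib/Capybara", split="train[:200]")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # any small causal LM works for a quick test
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-demo", max_steps=20),
)
trainer.train()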
reacted to fdaudens's post with β€οΈπŸ‘ 19 days ago
reacted to Xenova's post with πŸ”₯ 28 days ago
We did it. Kokoro TTS (v1.0) can now run 100% locally in your browser w/ WebGPU acceleration. Real-time text-to-speech without a server. ⚑️

Generate 10 seconds of speech in ~1 second for $0.

What will you build? πŸ”₯
webml-community/kokoro-webgpu

The most difficult part was getting the model running in the first place, but the next steps are simple:
βœ‚οΈ Implement sentence splitting, allowing for streamed responses
🌍 Multilingual support (only phonemization left)

Who wants to help?
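The post is about the in-browser build, but for a quick local test there is also a Python package (a sketch assuming the kokoro pip package and its KPipeline API, which I have not verified here):

# pip install kokoro soundfile
from kokoro import KPipeline  # assumed API
import soundfile as sf

pipeline = KPipeline(lang_code="a")  # "a" selects American English
# The pipeline yields audio chunks, which maps naturally onto streamed responses
for i, (graphemes, phonemes, audio) in enumerate(
    pipeline("Kokoro now runs in the browser too.", voice="af_heart")
):
    sf.write(f"kokoro_{i}.wav", audio, 24000)  # Kokoro outputs 24 kHz audio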
reacted to Kseniase's post with πŸ”₯ 28 days ago
8 New Types of RAG

RAG techniques continuously evolve to enhance LLM response accuracy by retrieving relevant external data during generation. To keep up with current AI trends, new RAG types incorporate deep step-by-step reasoning, tree search, citations, multimodality and other effective techniques.

Here's a list of the 8 latest RAG advancements:

1. DeepRAG -> DeepRAG: Thinking to Retrieval Step by Step for Large Language Models (2502.01142)
Models retrieval-augmented reasoning as a Markov Decision Process, enabling strategic retrieval. It dynamically decides when to retrieve external knowledge and when to rely on parametric reasoning (a toy sketch of this retrieve-or-answer loop follows the list).

2. RealRAG -> RealRAG: Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning (2502.00848)
Enhances novel object generation by retrieving real-world images and using self-reflective contrastive learning to fill knowledge gaps, improve realism, and reduce distortions.

3. Chain-of-Retrieval Augmented Generation (CoRAG) -> Chain-of-Retrieval Augmented Generation (2501.14342)
Retrieves information step by step and adjusts it, also deciding how much compute power to use at test time. If needed, it reformulates queries.

4. VideoRAG -> VideoRAG: Retrieval-Augmented Generation over Video Corpus (2501.05874)
Enables unlimited-length video processing, using a dual-channel architecture that integrates graph-based textual grounding and multi-modal context encoding.

5. CFT-RAG -> CFT-RAG: An Entity Tree Based Retrieval Augmented Generation Algorithm With Cuckoo Filter (2501.15098)
A tree-RAG acceleration method that uses an improved Cuckoo Filter to optimize entity localization, enabling faster retrieval.

6. Contextualized Graph RAG (CG-RAG) -> CG-RAG: Research Question Answering by Citation Graph Retrieval-Augmented LLMs (2501.15067)
Uses Lexical-Semantic Graph Retrieval (LeSeGR) to integrate sparse and dense signals within the graph structure and capture citation relationships.

7. GFM-RAG -> GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation (2502.01113)
A graph foundation model that uses a graph neural network to refine query-knowledge connections.

8. URAG -> URAG: Implementing a Unified Hybrid RAG for Precise Answers in University Admission Chatbots -- A Case Study at HCMUT (2501.16276)
A hybrid system combining rule-based and RAG methods to improve lightweight LLMs for educational chatbots.
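A common thread in several of these is letting the model decide when to retrieve. Here is a toy sketch of that retrieve-or-answer loop (the llm and retriever callables are placeholders, not any paper's actual code):

# Toy DeepRAG-style loop: retrieve only while the model says it needs evidence
def answer(question, llm, retriever, max_steps=4):
    evidence = []
    for _ in range(max_steps):
        decision = llm(
            f"Question: {question}\nEvidence so far: {evidence}\n"
            "Reply RETRIEVE if you need more external knowledge, else ANSWER."
        )
        if decision.strip() != "RETRIEVE":
            break  # rely on parametric knowledge from here
        evidence.append(retriever(question))  # fetch one more passage
    return llm(f"Using evidence {evidence}, answer: {question}")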
reacted to singhsidhukuldeep's post with πŸ”₯ about 1 month ago
Excited to share groundbreaking research in Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG)!

Researchers from the University of Science and Technology of China have developed FRAG - a novel flexible modular framework that revolutionizes how Large Language Models (LLMs) reason with knowledge graphs.

What makes FRAG special? It intelligently adapts retrieval strategies based on query complexity without requiring expensive KG fine-tuning. The framework uses a reasoning-aware module to classify queries as simple or complex, then applies tailored retrieval pipelines.

Under the hood:
- For simple queries: Uses breadth-first search and ranking to efficiently find relevant paths
- For complex queries: Employs shortest path algorithms to minimize computational overhead
- Features a preprocessing-retrieval-postprocessing pipeline with flexible components
- Leverages traditional algorithms like PersonalizedPageRank for subgraph extraction
- Implements edge and path ranking models for precise information filtering

The results are impressive - FRAG achieves state-of-the-art performance while maintaining high efficiency and low resource consumption. On benchmark datasets like WebQSP and CWQ, it outperforms existing approaches by significant margins.

Most importantly, FRAG maintains flexibility and modularity while improving retrieval quality - no expensive LLM fine-tuning required! This makes it highly practical for real-world applications.

This work represents a major step forward in making LLMs more reliable and capable of complex reasoning tasks. Looking forward to seeing how this technology evolves!
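Here is a toy illustration of that simple-vs-complex split, with networkx standing in for the paper's components (my own sketch, not the authors' code):

# Simple queries: BFS neighborhoods; complex queries: shortest paths
import networkx as nx

G = nx.Graph()
G.add_edges_from([("insulin", "pancreas"), ("insulin", "diabetes"),
                  ("diabetes", "metformin"), ("metformin", "kidney")])

def retrieve(source, target, complex_query):
    if complex_query:
        # Minimize hops between the query entities to cut computational overhead
        return nx.shortest_path(G, source, target)
    # Breadth-first expansion around the source entity
    return list(nx.bfs_tree(G, source, depth_limit=1))

# PersonalizedPageRank-style scoring, biased toward the query entity
scores = nx.pagerank(G, personalization={"insulin": 1.0})

print(retrieve("insulin", "kidney", complex_query=True))
print(sorted(scores, key=scores.get, reverse=True)[:3])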
reacted to m-ric's post with πŸ”₯ about 1 month ago
Now you can launch a code agent directly from your terminal!
✨ πšœπš–πš˜πš•πšŠπšπšŽπš—πš "πšˆπš˜πšžπš› πšπšŠπšœπš”" directly launches a CodeAgent
▢️ This also works with web agents (replace πšœπš–πš˜πš•πšŠπšπšŽπš—πš with πš πšŽπš‹πšŠπšπšŽπš—πš) thanks to @merve !

πŸ’Ύ Another treat from smolagents release 1.7.0:
Now agents have a memory mechanism, enabling many possibilities like replaying the last run with agent.replay(). Thank you @clefourrier!

Check the release notes here πŸ‘‰ https://github.com/huggingface/smolagents/releases/tag/v1.7.0
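In Python, that flow looks roughly like this (a sketch against the 1.7.0-era API; HfApiModel picks a default Hub inference model):

# pip install smolagents
from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(tools=[], model=HfApiModel())  # same agent the CLI launches
agent.run("How many seconds are there in a leap year?")
agent.replay()  # new in 1.7.0: replay the last run from the agent's memory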
replied to etemiz's post about 1 month ago

I guess when you say beneficial to humans, that could also be subjective?

Like we can now say this vitamin or medication has benefits that outweigh the negatives, but that could also be because full studies might not have been done to surface the other negatives that come with it. We are just weighing heavily on what we know, based on what we have seen or heard.

Also, with the above, as the human genome differs from person to person, benefits to one might not be the same for others.

I reckon that in such situations an AI not taking a side is the better approach, prompting the humans to do their own research. I'm also pretty sure this medication example applies to other paradigms/areas.

replied to etemiz's post about 1 month ago

Define human alignment. Is human alignment what the majority says? Is the majority always correct?

These are critical questions that need to also be considered.

reacted to sagar007's post with πŸ”₯ about 1 month ago
πŸš€ Just built a Perplexity-inspired AI search assistant using Gradio, DeepSeek, and DuckDuckGo!
Ask it anything, and it’ll:

Scour the web for answers πŸ“š

Cite sources like a pro πŸ”—

Even talk back with TTS (thanks, Kokoro!) πŸŽ™οΈ

Check it out β†’ sagar007/DeepSeekR1_Search
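The retrieval half of such an assistant is only a few lines. A minimal sketch with Gradio and duckduckgo-search, leaving out the DeepSeek summarizer and Kokoro TTS (my own sketch, not sagar007's code):

# pip install gradio duckduckgo-search
import gradio as gr
from duckduckgo_search import DDGS

def search_and_cite(query):
    # Fetch a handful of web results and format them as cited snippets
    results = DDGS().text(query, max_results=5)
    return "\n\n".join(
        f"{r['title']}\n{r['body']}\nSource: {r['href']}" for r in results
    )

gr.Interface(fn=search_and_cite, inputs="text", outputs="text").launch()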
reacted to singhsidhukuldeep's post with πŸ‘ about 1 month ago
Exciting breakthrough in Text Embeddings: Introducing LENS (Lexicon-based EmbeddiNgS)!

A team of researchers from the University of Amsterdam, the University of Technology Sydney, and Tencent has developed a groundbreaking approach that outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB).

>> Key Technical Innovations:
- LENS consolidates vocabulary space through token embedding clustering, addressing the inherent redundancy in LLM tokenizers
- Implements bidirectional attention and innovative pooling strategies to unlock the full potential of LLMs
- Each dimension corresponds to token clusters instead of individual tokens, creating more coherent and compact embeddings
- Achieves competitive performance with just 4,000-8,000 dimensional embeddings, matching the size of dense counterparts

>> Under the Hood:
The framework applies KMeans clustering to token embeddings from the language modeling head, replacing original embeddings with cluster centroids. This reduces dimensionality while preserving semantic relationships.

>> Results:
- Outperforms dense embeddings on MTEB benchmark
- Achieves state-of-the-art performance when combined with dense embeddings on BEIR retrieval tasks
- Demonstrates superior performance across clustering, classification, and retrieval tasks

This work opens new possibilities for more efficient and interpretable text embeddings. The code will be available soon.
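Here is a rough sketch of that clustering step at toy scale (my own sketch, not the authors' code):

# pip install transformers scikit-learn torch
import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
# Token embeddings from the language modeling head (tied to input embeddings in GPT-2)
vectors = model.get_output_embeddings().weight.detach().numpy()

# Cluster the ~50k-token vocabulary into far fewer centroids; the paper targets
# 4,000-8,000 dimensions, 1,000 keeps this toy example fast
kmeans = MiniBatchKMeans(n_clusters=1000, n_init=3).fit(vectors)

# Each token is now represented by its cluster centroid instead of its own vector
token_id = 1000
centroid = torch.tensor(kmeans.cluster_centers_[kmeans.labels_[token_id]])
print(centroid.shape)  # one 768-d centroid shared by every token in the cluster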
reacted to Jaward's post with πŸ‘ about 2 months ago
reacted to julien-c's post with πŸ”₯ 3 months ago
After some heated discussion πŸ”₯, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in Machine learning, to the benefit of the community πŸ”₯

cc: @reach-vb @pierric @victor and the HF team
reacted to m-ric's post with πŸ”₯ 3 months ago
Last week was crazy in OS AI, with important models and datasets releases every day.

Here are the most important ones I've pinned:

🌎 Cohere released Global-MMLU, a multilingual version of MMLU, to evaluate AI models' world knowledge in many languages!

πŸ¦™ Meta released Llama-3.3-70B-Instruct, a 70B model that's on par with Llama-3.1-405B-Instruct, GPT-4o and Claude. Probably my new go-to for agentic workflows.

πŸ”‰ FishAudio released fish-speech-1.5, a multilingual text-to-speech model

🎨 Microsoft Research released TRELLIS, an extremely impressive image-to-3D model, which you can try here: JeffreyXiang/TRELLIS

πŸ“š Yesterday, Hugging Face released FineWeb 2, a new version that extends the previous FineWeb to over 1,000 languages, including extended coverage of Russian, Mandarin, German, Japanese, Spanish, and French: a huge, high-quality dataset of over 3 trillion words! HuggingFaceFW/fineweb-2

Now let's go build and make this week as productive as the last one!
reacted to AdinaY's post with πŸ”₯ 3 months ago
🌊 The wave of reasoning models from the Chinese community has arrived!

πŸš€ Marco-o1 by AIDC, Alibaba
πŸ‘‰ AIDC-AI/Marco-o1

✨ QwQ by Qwen, Alibaba
πŸ‘‰ Qwen/qwq-674762b79b75eac01735070a

🌟 Skywork-o1 by Kunlun Tech
πŸ‘‰ Skywork/skywork-o1-open-67453df58e12f6c3934738d0

πŸ”₯ Xkev/Llama-3.2V-11B-cot by PKU Yuan group
πŸ‘‰ Xkev/Llama-3.2V-11B-cot

πŸ’‘ DeepSeek-R1-Lite-Preview by DeepSeek AI
πŸ‘‰ https://chat.deepseek.com/

πŸ” InternThinker Preview by Shanghai AI Lab
πŸ‘‰ https://sso.openxlab.org.cn/login?redirect=https://internlm-chat.intern-ai.org.cn/&clientId=ebmrvod6yo0nlzaek1yp

πŸ“˜ k0-math by Moonshot AI
πŸš€ https://kimi.moonshot.cn/ ( coming soon! )

Who's next? πŸ‘€
zh-ai-community/reasoning-models-67409fb3aa1ed78f10087cd7
reacted to luigi12345's post with πŸ‘ 3 months ago
MinimalScrap
Only free dependencies. Save it, it is quite useful.


!pip install googlesearch-python requests
from googlesearch import search
import requests

query = "Glaucoma"
# Google for PDFs hosted on nih.gov and download each one
for url in search(f"{query} site:nih.gov filetype:pdf", num_results=20):
    if url.endswith(".pdf"):
        filename = url.split("/")[-1]
        response = requests.get(url, timeout=30)  # avoid hanging on slow hosts
        with open(filename, "wb") as f:
            f.write(response.content)
        print("βœ… " + filename)
print("Done!")