
Morgan Funtowicz

mfuntowicz

AI & ML interests

Low-level model inference optimization, hardware affinity, and large-scale distributed training.

Organizations

Hugging Face, BigScience Workshop, Qualcomm, AWS Inferentia and Trainium, Hugging Face Infinity, Hugging Face Optimum, Need4Speed, Hugging Face Smol Cluster, Optimum Nvidia, Optimum AMD, gg-hf, Optimum-TPU, hsramall, Optimum-Intel, gg-tt, Hugging Face Machine Learning Optimization, Optimum Internal Testing, blhf, Huggingface HUGS, smol-explorers

mfuntowicz's activity

reacted to alex-abb's post with 👍🔥 6 months ago
Hi everyone!
I'm Alex, I'm 16, and I've been doing an internship at Hugging Face for a little over a week. I've already learned a lot about using and prompting LLMs. With @victor as my tutor, I've just finished a Space that analyzes your feelings by prompting an LLM chat model. The aim is to extend it so that it can categorize Hugging Face posts.

alex-abb/LLM_Feeling_Analyzer
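The core idea, classifying sentiment by prompting a chat model, can be sketched as below. This is an illustrative sketch, not the Space's actual code: the prompt wording, label set, and the fact that the model call is stubbed out are all assumptions.

```python
# Sketch of prompt-based sentiment classification. The labels and prompt
# wording are illustrative; a real version would send the prompt to a chat
# model endpoint instead of parsing a mock reply.
LABELS = ["positive", "negative", "neutral"]

def build_prompt(text: str) -> str:
    """Ask a chat model to answer with exactly one sentiment label."""
    return (
        "Classify the sentiment of the following text as one of "
        f"{', '.join(LABELS)}. Reply with the label only.\n\n"
        f"Text: {text}"
    )

def parse_label(reply: str) -> str:
    """Map a free-form model reply back to one of the known labels."""
    reply = reply.strip().lower()
    for label in LABELS:
        if label in reply:
            return label
    return "neutral"  # fall back when the reply doesn't match any label

# With a real endpoint, build_prompt(...) would go to the model; here we
# just demonstrate the parsing step on a mock reply.
print(parse_label("Sentiment: Positive."))  # positive
```

Constraining the model to a closed label set and parsing defensively keeps the pipeline robust to chatty replies.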
reacted to IlyasMoutawwakil's post with 🚀🧠 6 months ago
Last week, Intel's new Xeon CPUs, Sapphire Rapids (SPR), landed on Inference Endpoints, and I think they have the potential to reduce the cost of your RAG pipelines 💸

Why? Because they come with Intel® AMX support, a set of instructions that accelerates BF16 and INT8 matrix multiplications on CPU ⚡
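To see why INT8 helps, here is a minimal NumPy sketch of the arithmetic a quantized matmul performs (per-tensor symmetric quantization, int32 accumulation). It illustrates the numeric idea AMX accelerates in hardware; it is not AMX code, and the shapes and quantization scheme are illustrative.

```python
import numpy as np

# Toy FP32 matrices standing in for activations and weights.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 4)).astype(np.float32)

def quantize(x):
    """Per-tensor symmetric quantization to int8 with a float scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

qa, sa = quantize(a)
qb, sb = quantize(b)

# The INT8 path: integer matmul accumulated in int32, then rescaled.
y_int8 = (qa.astype(np.int32) @ qb.astype(np.int32)) * (sa * sb)
y_fp32 = a @ b

# The quantized result closely tracks the FP32 one.
print(np.max(np.abs(y_int8 - y_fp32)))
```

The heavy work happens on 8-bit integers, which is exactly the operation AMX tiles execute with much higher throughput than FP32.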

I went ahead and built a Space to showcase how to efficiently deploy embedding models on SPR for both retrieving and ranking documents, with Haystack-compatible components: https://huggingface.co/spaces/optimum-intel/haystack-e2e

Here's how it works:

- Document Store: A FAISS document store containing the seven-wonders dataset, embedded, indexed and stored on the Space's persistent storage to avoid unnecessary re-computation of embeddings.

- Retriever: It embeds the query at runtime and retrieves from the dataset N documents that are most semantically similar to the query's embedding.
We use the small variant of the BGE family here because we want a model that's fast to run on the entire dataset and has a small embedding space for fast similarity search. Specifically we use an INT8 quantized bge-small-en-v1.5, deployed on an Intel Sapphire Rapids CPU instance.

- Ranker: It re-embeds the retrieved documents at runtime and re-ranks them based on semantic similarity to the query's embedding. We use the large variant of the BGE family here because it's optimized for accuracy allowing us to filter the most relevant k documents that we'll use in the LLM prompt. Specifically we use an INT8 quantized bge-large-en-v1.5, deployed on an Intel Sapphire Rapids CPU instance.
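The retrieve-then-rerank flow above can be sketched with plain NumPy cosine similarity. Random vectors stand in for the bge-small (retriever) and bge-large (ranker) embeddings, and the dimensions, corpus size, and N/k values are illustrative assumptions, not the Space's configuration.

```python
import numpy as np

# Toy corpus with two embedding spaces: a small one for cheap retrieval
# (standing in for bge-small) and a large one for accurate re-ranking
# (standing in for bge-large). All vectors here are random placeholders.
rng = np.random.default_rng(0)
docs = [f"doc-{i}" for i in range(100)]
small_emb = rng.standard_normal((100, 384)).astype(np.float32)
large_emb = rng.standard_normal((100, 1024)).astype(np.float32)

def top_k(query_emb, doc_emb, k):
    """Cosine-similarity search: normalize, dot product, take k best."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:k]

# 1) Retrieve N=20 candidates cheaply in the small embedding space...
q_small = rng.standard_normal(384).astype(np.float32)
candidates = top_k(q_small, small_emb, 20)

# 2) ...then re-rank only those candidates in the larger, more accurate
#    space to pick the k=5 documents that go into the LLM prompt.
q_large = rng.standard_normal(1024).astype(np.float32)
reranked = top_k(q_large, large_emb[candidates], 5)
final = [docs[candidates[i]] for i in reranked]
print(final)
```

The design point is the same as in the post: the small model scans the whole corpus fast, and the expensive, accurate model only scores the handful of survivors.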

Space: https://huggingface.co/spaces/optimum-intel/haystack-e2e
Retriever IE: optimum-intel/fastrag-retriever
Ranker IE: optimum-intel/fastrag-ranker