---
title: trt-llm-rag-windows-main
app_file: app.py
sdk: gradio
sdk_version: 4.14.0
---

🚀 RAG on Windows using TensorRT-LLM and LlamaIndex 🦙

Chat with RTX is a demo app that lets you personalize a GPT large language model (LLM) connected to your own content: docs, notes, videos, or other data. Leveraging retrieval-augmented generation (RAG), TensorRT-LLM, and RTX acceleration, you can query a custom chatbot to quickly get contextually relevant answers. And because it all runs locally on your Windows RTX PC or workstation, you'll get fast and secure results. Chat with RTX supports various file formats, including text, PDF, doc/docx, and XML. Simply point the application at the folder containing your files and it will load them into the library in a matter of seconds. Additionally, you can provide the URL of a YouTube playlist and the app will load the transcriptions of the videos in the playlist, enabling you to query the content they cover.

The pipeline incorporates the LLaMa2-13B model (or Mistral-7B), TensorRT-LLM, and the FAISS vector search library. For demonstration, the dataset consists of recent articles sourced from NVIDIA GeForce News.
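For a rough sense of how these pieces fit together, the sketch below builds a FAISS-backed LlamaIndex vector index over a folder of articles and retrieves the passages most relevant to a query. The folder name, embedding model, and llama-index 0.9.x import paths are assumptions for illustration; in the actual app the retrieved passages are then passed to the TensorRT-LLM engine to generate the answer.

```python
# Illustrative sketch only (assumes llama-index 0.9.x and a faiss build are installed).
# The folder path and embedding model are placeholders, not the app's exact settings.
import faiss
from llama_index import SimpleDirectoryReader, VectorStoreIndex, StorageContext, ServiceContext
from llama_index.vector_stores import FaissVectorStore

documents = SimpleDirectoryReader("dataset").load_data()      # e.g. GeForce News articles
faiss_index = faiss.IndexFlatL2(384)                           # dim must match the embedding model
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(
    embed_model="local:BAAI/bge-small-en-v1.5",                # local embedding model (384-dim)
    llm=None,                                                  # retrieval only in this sketch
)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, service_context=service_context
)
nodes = index.as_retriever(similarity_top_k=4).retrieve("What did the latest Game Ready driver add?")
```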

What is RAG? 🔍

Retrieval-augmented generation (RAG) for large language models (LLMs) seeks to enhance prediction accuracy by leveraging an external datastore during inference. This approach constructs a comprehensive prompt enriched with context, historical data, and recent or relevant knowledge.
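In code terms, the "enriched prompt" is simply the retrieved passages concatenated ahead of the user's question before the LLM is called. The snippet below is a generic illustration of that assembly step; the template wording is an assumption, not the prompt this app actually uses.

```python
# Generic RAG prompt assembly; the template wording is illustrative only.
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)   # passages returned by the vector search
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```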

Getting Started

Hardware requirements

  • Chat with RTX is currently built for RTX 3xxx and RTX 4xxx series GPUs that have at least 8GB of GPU memory.
  • At least 100 GB of available hard disk space
  • Windows 10/11
  • Latest NVIDIA GPU drivers

Setup Steps

Ensure you have the prerequisites in place:
  1. Install TensorRT-LLM v0.7 for Windows using the instructions here

Command:

pip install tensorrt_llm==0.7 --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu121
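If the wheel installed correctly, a quick import check should report a 0.7.x version. This is only a sanity check, not part of the official instructions:

```python
# Sanity check: the TensorRT-LLM wheel should import and report a 0.7.x version.
import tensorrt_llm
print(tensorrt_llm.__version__)
```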

Prerequisites

  1. Install the dependencies from requirements.txt:
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/nightly/cu121

pip install nvidia-cudnn-cu11==8.9.4.25 --no-cache-dir

pip uninstall -y nvidia-cudnn-cu11
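After these installs it is worth confirming that the CUDA 12.1 build of PyTorch can see your RTX GPU. Again, this is an optional sanity check rather than an official step:

```python
# Optional sanity check: the cu121 PyTorch build should detect the RTX GPU.
import torch
print(torch.__version__)
print(torch.cuda.is_available())   # should print True on a working setup
```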
  2. In this project, the LLaMa2-13B AWQ 4-bit and Mistral-7B int4 quantized models are used for inference. Before using them, you'll need to compile a TensorRT engine specific to your GPU for each model. Below are the steps to build the engines:
  • Download tokenizer: Ensure you have access to the Llama 2 and Mistral repositories on Hugging Face. Download config.json, tokenizer.json, tokenizer.model, and tokenizer_config.json for both models, and place the tokenizer files for each model in its own directory (one way to fetch them is sketched below).
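One way to fetch those files is with huggingface_hub, sketched here for the Llama 2 tokenizer. The repo ID and target folder are assumptions, and the Llama 2 repository is gated, so an accepted license and a Hugging Face access token are required:

```python
# Illustrative download of the Llama 2 tokenizer files via huggingface_hub.
# Repo ID, target folder, and token handling are assumptions, not the app's exact setup.
from huggingface_hub import hf_hub_download

for filename in ["config.json", "tokenizer.json", "tokenizer.model", "tokenizer_config.json"]:
    hf_hub_download(
        repo_id="meta-llama/Llama-2-13b-chat-hf",  # gated repo; requires an accepted license
        filename=filename,
        local_dir="model/llama13_tokenizer",       # placeholder tokenizer directory
        token="hf_...",                            # your Hugging Face access token
    )
```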

  • Get quantized weights: Download the LLaMa-2 13B AWQ 4-bit and Mistral-7B int4 quantized model weights from NGC:

    Llama2-13b int4, Mistral-7B int4

  • Get the TensorRT-LLM example repo: Download the TensorRT-LLM v0.7.0 repo to build the engines

  • Build TensorRT engine: Commands to build the engines

Llama2-13B int4:

python TensorRT-LLM-0.7.0\examples\llama\build.py --model_dir <model_tokenizer_dir_path> --quant_ckpt_path <quantized_weights_file_path> --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --output_dir <engine_output_dir> --world_size 1 --tp_size 1 --parallel_build --max_input_len 3900 --max_batch_size 1 --max_output_len 1024

Mistral 7B int4:

python.exe TensorRT-LLM-0.7.0\examples\llama\build.py --model_dir <model_tokenizer_dir_path>  --quant_ckpt_path <quantized_weights_file_path> --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --output_dir <engine_output_dir> --world_size 1 --tp_size 1 --parallel_build --max_input_len 7168 --max_batch_size 1 --max_output_len 1024
  • Run app
python app.py --trt_engine_path <TRT Engine folder> --trt_engine_name <TRT Engine file>.engine --tokenizer_dir_path <tokenizer folder> --data_dir <Data folder>
  • Update config/config.json with the details below for both models, then run the app:

| Name | Details |
| --- | --- |
| --model_path | TRT engine directory path |
| --engine | TRT engine file name |
| --tokenizer_path | Hugging Face tokenizer directory |
| --trt_engine_path | Directory of the TensorRT engine |
| --installed | True/False, whether the model is installed |
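For orientation, the fields from the table map to a per-model entry roughly like the Python dict below. The exact nesting and file names in config/config.json are assumptions here, so mirror the layout of the existing file rather than this sketch:

```python
# Assumed shape of one model entry described by the table above; values are placeholders.
llama_entry = {
    "model_path": "model/llama13_int4_engine",       # TRT engine directory path
    "engine": "llama_float16_tp1_rank0.engine",      # TRT engine file name
    "tokenizer_path": "model/llama13_tokenizer",     # Hugging Face tokenizer directory
    "trt_engine_path": "model/llama13_int4_engine",  # directory of the TensorRT engine
    "installed": True,                               # whether the model is installed
}
```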

Command:

python app.py

Adding your own data

  • This app loads data from the dataset/ directory into the vector store. To add support for your own data, replace the files in the dataset/ directory with your own data. By default, the script uses LlamaIndex's SimpleDirectoryReader, which supports text files such as .txt, PDF, and so on; a minimal loading sketch follows below.
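As a rough sketch of what that loading step looks like, the snippet below points the directory reader at the data folder and restricts it to a few file types; the required_exts filter is optional and the extension list is only an example:

```python
# Sketch: load a custom data folder, optionally limiting it to specific file types.
from llama_index import SimpleDirectoryReader

reader = SimpleDirectoryReader("dataset", recursive=True, required_exts=[".txt", ".pdf", ".docx"])
documents = reader.load_data()
print(f"Loaded {len(documents)} document chunks from dataset/")
```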

This project requires additional third-party open source software projects as specified in the documentation. Review the license terms of these open source projects before use.