---
title: trt-llm-rag-windows-main
app_file: app.py
sdk: gradio
sdk_version: 4.14.0
---

# 🚀 RAG on Windows using TensorRT-LLM and LlamaIndex 🦙

Chat with RTX is a demo app that lets you personalize a GPT large language model (LLM) connected to your own content: docs, notes, videos, or other data. Leveraging retrieval-augmented generation (RAG), TensorRT-LLM, and RTX acceleration, you can query a custom chatbot to quickly get contextually relevant answers. And because it all runs locally on your Windows RTX PC or workstation, you'll get fast and secure results.

Chat with RTX supports various file formats, including text, PDF, doc/docx, and XML. Simply point the application at the folder containing your files and it'll load them into the library in a matter of seconds. Additionally, you can provide the URL of a YouTube playlist and the app will load the transcriptions of the videos in the playlist, enabling you to query the content they cover.

The pipeline incorporates the LLaMa2-13B model (or the Mistral-7B), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/), and the [FAISS](https://github.com/facebookresearch/faiss) vector search library. For demonstration, the dataset consists of recent articles sourced from [NVIDIA GeForce News](https://www.nvidia.com/en-us/geforce/news/).

### What is RAG? 🔍

Retrieval-augmented generation (RAG) for large language models (LLMs) seeks to enhance prediction accuracy by leveraging an external datastore during inference. This approach constructs a comprehensive prompt enriched with context, historical data, and recent or relevant knowledge.

## Getting Started

### Hardware requirements

- Chat with RTX is currently built for RTX 3xxx and RTX 4xxx series GPUs that have at least 8 GB of GPU memory.
- At least 100 GB of available hard disk space
- Windows 10/11
- Latest NVIDIA GPU drivers
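The GPU memory and disk space requirements above can be checked before starting the setup. The following is a minimal sketch and not part of the app: it assumes `nvidia-smi` (shipped with the NVIDIA driver) is on the `PATH`, and the helper names and the 8000 MiB threshold are illustrative choices. Cards sold as 8 GB may report slightly less than 8192 MiB, hence the rounded threshold. It does not verify the GPU series, the Windows version, or the driver version.

```python
import shutil
import subprocess

MIN_GPU_MIB = 8000  # assumption: approximate stand-in for "at least 8 GB"
MIN_DISK_GB = 100   # taken from the hardware requirements list


def parse_gpu_memory(csv_text):
    """Parse 'name, memory_in_MiB' lines from nvidia-smi CSV output."""
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def query_gpus():
    """Ask nvidia-smi for each GPU's name and total memory (MiB)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


def meets_gpu_requirement(gpus):
    """True if at least one GPU reaches the memory threshold."""
    return any(mem >= MIN_GPU_MIB for _, mem in gpus)


def free_disk_gb(path="."):
    """Free space on the drive containing `path`, in GB."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    try:
        gpus = query_gpus()
    except (FileNotFoundError, subprocess.CalledProcessError, ValueError):
        gpus = []
        print("Could not query nvidia-smi; is the NVIDIA driver installed?")
    for name, mem in gpus:
        print(f"{name}: {mem} MiB")
    print("GPU memory OK:", meets_gpu_requirement(gpus))
    print("Disk space OK:", free_disk_gb() >= MIN_DISK_GB)
```

Run it from the drive where you plan to install the app, since `free_disk_gb()` inspects the drive of the current directory by default.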

### Setup Steps

Ensure you have the prerequisites in place:

1. Install [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/) v0.7 for Windows using the instructions [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/windows).

   Command:
   ```
   pip install tensorrt_llm==0.7 --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu121
   ```

   Prerequisites:
   - [Python 3.10](https://www.python.org/downloads/windows/)
   - [CUDA 12.2 Toolkit](https://developer.nvidia.com/cuda-12-2-2-download-archive?target_os=Windows&target_arch=x86_64)
   - [Microsoft MPI](https://www.microsoft.com/en-us/download/details.aspx?id=57467)
   - [cuDNN](https://developer.nvidia.com/cudnn)

2. Install the packages listed in requirements.txt:
   ```
   pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/nightly/cu121
   pip install nvidia-cudnn-cu11==8.9.4.25 --no-cache-dir
   pip uninstall -y nvidia-cudnn-cu11
   ```

3. In this project, the LLaMa2-13B AWQ 4-bit and Mistral-7B int4 quantized models are used for inference. Before using them, you'll need to compile a TensorRT engine specific to your GPU for each model. Below are the steps to build the engines:

   - **Download tokenizer:** Ensure you have access to the [Llama 2](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) and [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1) repositories on Hugging Face. Download config.json, tokenizer.json, tokenizer.model, and tokenizer_config.json for both models. Place the tokenizer files in dir

   - **Get quantized weights:** Download the LLaMa-2 13B AWQ 4-bit and Mistral-7B int4 quantized model weights from NGC: [Llama2-13b int4](https://catalog.ngc.nvidia.com/orgs/nvidia/models/llama2-13b/files?version=1.3), [Mistral-7B int4](https://catalog.ngc.nvidia.com/orgs/nvidia/models/mistral-7b-int4-chat)

   - **Get TensorRT-LLM example repo:** Download the [TensorRT-LLM v0.7](https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.7.0) repo to build the engine.

   - **Build TensorRT engine:** Commands to build the engines

     Llama2-13B int4:
     ```
     python TensorRT-LLM-0.7.0\examples\llama\build.py --model_dir --quant_ckpt_path --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --output_dir --world_size 1 --tp_size 1 --parallel_build --max_input_len 3900 --max_batch_size 1 --max_output_len 1024
     ```

     Mistral 7B int4:
     ```
     python.exe TensorRT-LLM-0.7.0\examples\llama\build.py --model_dir --quant_ckpt_path --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --output_dir --world_size 1 --tp_size 1 --parallel_build --max_input_len 7168 --max_batch_size 1 --max_output_len 1024
     ```

   - **Run app**
     ```
     python app.py --trt_engine_path --trt_engine_name .engine --tokenizer_dir_path --data_dir
     ```

   - **Run app** Update **config/config.json** with the details below for both models:

     | Name | Details |
     | ------ | ------ |
     | --model_path | TRT engine directory path |
     | --engine | TRT engine file name |
     | --tokenizer_path | Hugging Face tokenizer directory |
     | --trt_engine_path | Directory of TensorRT engine |
     | --installed <> | True/False if model is installed or not |

     **Command:**
     ```
     python app.py
     ```

## Adding your own data

- This app loads data from the `dataset/` directory into the vector store. To add support for your own data, replace the files in the `dataset/` directory with your own data. By default, the script uses LlamaIndex's SimpleDirectoryReader, which supports text files such as .txt, PDF, and so on.

This project requires additional third-party open source software projects as specified in the documentation. Review the license terms of these open source projects before use.
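Before replacing the contents of the dataset directory as described under "Adding your own data", it can help to see which of your files are likely to be picked up. The sketch below is a hypothetical pre-check, not code from the app: the extension set is an assumption derived from the formats this README lists (text, PDF, doc/docx, XML), the function name is illustrative, and whether the app itself descends into subfolders is not confirmed here.

```python
from pathlib import Path

# Assumption: extensions inferred from the formats named in this README.
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".doc", ".docx", ".xml"}


def split_by_support(data_dir):
    """Return (supported, skipped) file lists found under data_dir."""
    root = Path(data_dir)
    supported, skipped = [], []
    if not root.is_dir():
        return supported, skipped
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Compare extensions case-insensitively (e.g. "REPORT.PDF").
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            skipped.append(path)
    return supported, skipped


if __name__ == "__main__":
    ok, other = split_by_support("dataset")
    print(f"{len(ok)} file(s) with a listed format")
    for path in other:
        print("Not in the listed formats:", path)
```

Files reported as not in the listed formats are not necessarily unreadable by the loader; the list is only a conservative reading of this README.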