# RAG Demo: AI-Powered Document Search with Generative Response

This project showcases a Retrieval-Augmented Generation (RAG) implementation that uses SentenceTransformer for semantic search and GPT-2 (or a similar generative model) for response generation. The system combines semantic search with AI-driven text generation to provide relevant answers based on a collection of text documents.

## Project Overview

The Chagu RAG Demo addresses efficient document retrieval and contextual response generation with generative AI. It supports secure document search and adds protection against malicious queries using semantic analysis. The project is built with the following goals:

1. Semantic Search: Retrieve the most relevant documents for a user query using embeddings.
2. Generative AI Response: Generate a coherent, context-aware answer using a pre-trained text generation model.
3. Anomaly Detection: Detect potentially harmful queries (e.g., SQL injections) and block them.

## Features

1. Embedding-based Document Ingestion: Processes text documents and stores their embeddings in a local SQLite database.
2. Semantic Search: Uses cosine similarity over SentenceTransformer embeddings to retrieve relevant documents.
3. Text Generation: Uses GPT-2 or distilgpt2 to generate responses based on the retrieved context.
4. Security: Includes basic query validation to reject malicious input (e.g., SQL injection detection).

## Technologies Used

- SentenceTransformer: Generates semantic embeddings of text documents.
- Transformers: Provides the generative model. A wide range of models is available at https://huggingface.co/models?sort=trending&search=distilgpt2.
- SQLite: A lightweight database for storing embeddings and document content.
- Scikit-learn: Used for calculating cosine similarity.
- NumPy: Efficient numerical operations.

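
The ingestion and semantic search features described above can be illustrated with a minimal, dependency-free sketch. It is not the project's actual code: the table layout, the JSON encoding of vectors, and the function names `create_store`, `add_document`, and `most_similar` are assumptions made for illustration, and the three-dimensional vectors are hand-written stand-ins for real SentenceTransformer embeddings. The project itself is described as using scikit-learn for cosine similarity; here the formula is written out in plain Python so the example runs without extra packages.

```python
import json
import math
import sqlite3


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def create_store(db_path=":memory:"):
    """Open a SQLite database with a hypothetical documents table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, content TEXT NOT NULL, embedding TEXT NOT NULL)"
    )
    return conn


def add_document(conn, content, embedding):
    """Store a document with its embedding serialized as JSON text."""
    conn.execute(
        "INSERT INTO documents (content, embedding) VALUES (?, ?)",
        (content, json.dumps(embedding)),
    )
    conn.commit()


def most_similar(conn, query_embedding, top_k=1):
    """Return the top_k (score, content) pairs, highest similarity first."""
    rows = conn.execute("SELECT content, embedding FROM documents").fetchall()
    scored = [
        (cosine_similarity(query_embedding, json.loads(emb)), content)
        for content, emb in rows
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    conn = create_store()
    # Toy vectors; a real pipeline would obtain these from an embedding model.
    add_document(conn, "Use parameterized queries to prevent SQL injection.", [0.9, 0.1, 0.0])
    add_document(conn, "The movie was a delightful surprise.", [0.0, 0.2, 0.9])
    for score, content in most_similar(conn, [0.8, 0.2, 0.1], top_k=2):
        print(f"{score:.3f}  {content}")
```

Scanning every row, as this sketch does, is simple and fine for a small document set. The cost grows linearly with the number of documents, which is the limitation the FAISS suggestion under Potential Improvements is meant to address.
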
## Installation

Clone the repository:

```bash
git clone https://github.com/yourusername/chagu-rag-demo.git
cd chagu-rag-demo
```

Create a virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Authenticate with Hugging Face (if needed):

```bash
huggingface-cli login
```

## Setup and Dataset

Download and prepare the dataset: You can use the IMDB Movie Reviews dataset or any other text files. Place your .txt files in the `documents/` directory or specify a custom path.

Ingest files: The script processes all .txt files in the specified directory and stores their embeddings in a local SQLite database.

```bash
python embededGeneratorRAG.py
```

## Usage

### Ingest Documents

Ingest .txt files from the `documents/` directory:

```python
# Assumes the class is importable from embededGeneratorRAG.py (see File Structure).
from embededGeneratorRAG import EmbeddingGenerator
embedding_generator = EmbeddingGenerator()
embedding_generator.ingest_files("documents")
```

### Perform a Search Query

Run a semantic search query and generate a response:

```python
# Reuses the embedding_generator created in the ingestion step.
query = "How can I secure my database against SQL injection?"
response = embedding_generator.find_most_similar_and_generate(query)
print("Generated Response:")
print(response)
```

### Example Output

```text
Generated Response:
To prevent SQL injection, you should use prepared statements and parameterized queries. Avoid constructing SQL queries directly using user input.
```

## File Structure

```text
chagu-rag-demo/
├── embeddings.db            # SQLite database for storing embeddings
├── documents/               # Directory containing .txt files for ingestion
├── rag_chagu_demo.py        # Main script with RAG implementation
├── embededGeneratorRAG.py   # Core Embedding Generator class
├── requirements.txt         # Python dependencies
└── README.md                # Project documentation
```

## Configuration

You can update the following settings in the `EmbeddingGenerator` class:

- Model Names: Change `model_name` or `gen_model` to use a different embedding or generative model.
- Database Path: Specify a custom path for the SQLite database.

```python
embedding_generator = EmbeddingGenerator(model_name="all-MiniLM-L6-v2", gen_model="distilgpt2", db_path="custom_embeddings.db")
```

## Potential Improvements

- FAISS Integration for Scalability: Replace the current SQLite-based retrieval with FAISS for efficient and scalable vector search.
- Enhanced Security: Implement more robust query validation using a fine-tuned BERT model to detect harmful or suspicious inputs.
- Deployment on Hugging Face Spaces: Create an interactive demo using Streamlit or Gradio to showcase the project on Hugging Face Spaces.

## Known Issues

- Input Truncation Warning: If the input text is too long, you may see a warning about truncation. This is handled with `truncation=True`, but text beyond the model's maximum input length is dropped, which may affect very long queries.
- Model Availability: Make sure you are using a publicly available model from Hugging Face. If you encounter a 404 Not Found error, check the model identifier.

## Contributing

Contributions are welcome! Please open an issue or submit a pull request if you would like to improve the project.

1. Fork the repository.
2. Create a new feature branch.
3. Submit your changes via a pull request.

## License

This project is licensed under the MIT License. See the LICENSE file for details.

## Acknowledgments

- Hugging Face for the amazing models and NLP tools.
- Scikit-learn for efficient similarity computation.
- SQLite for providing a lightweight database solution.