---
title: Chagu Demo
emoji: π
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.40.1
app_file: app.py
pinned: false
license: mit
short_description: 'this is demo for chain guard protocol, assistant, RAG '
---
# **AI-Powered Document Search with Malicious Query Detection**
This project implements a semantic search engine for documents with **AI-based malicious query detection**. Users can search through movie reviews (IMDB dataset) and additional `.txt` files, while a pre-trained NLP model identifies and blocks potentially malicious queries.
## **Features**
- **Semantic Search**: Uses fuzzy matching for normal queries, so relevant documents are returned even when the wording does not match exactly.
- **AI-Based Malicious Query Detection**: Uses a pre-trained `DistilBERT` sentiment model to flag queries with malicious intent and blocks them, covering potential SQL injection and other harmful queries.
- **Flexible Document Ingestion**: Supports loading documents from the IMDB dataset and additional `.txt` files.
- **Efficient Path Handling**: Automatically resolves dataset paths using the `HOME` environment variable.
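
As a rough illustration of the path handling described in the last bullet, the sketch below resolves the dataset folders from `HOME` with `pathlib`. The function name `dataset_paths` and the dictionary keys are assumptions made for this example and are not taken from `rag_chagu_demo.py`.

```python
import os
from pathlib import Path


def dataset_paths(home=None):
    """Return the expected dataset directories (hypothetical helper)."""
    # Prefer an explicit argument, then the HOME variable, then Path.home().
    base = Path(home or os.environ.get("HOME") or Path.home()) / "data-sets"
    return {
        "pos": base / "aclImdb" / "train" / "pos",  # positive reviews
        "neg": base / "aclImdb" / "train" / "neg",  # negative reviews
        "txt": base / "txt-files",                  # additional .txt files
    }
```

Calling `dataset_paths()` without arguments uses the current user's home directory; passing a path explicitly is convenient for tests.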
## **Technologies Used**
- **Python 3.8+**
- **Transformers**: For NLP-based malicious query detection.
- **Hugging Face Pipeline**: Uses the `distilbert-base-uncased-finetuned-sst-2-english` model for sentiment analysis.
- **Pathlib**: For robust file and path handling.
## **Project Structure**

    ├── rag_chagu_demo.py   # Main script containing the DocumentSearcher class
    ├── README.md           # This file
    └── data-sets/          # This part has moved to $HOME
        ├── aclImdb/
        │   └── train/
        │       ├── pos/    # Positive movie reviews
        │       └── neg/    # Negative movie reviews
        └── txt-files/      # Additional .txt files for document search

## **Installation**
Make sure you have Python installed (version 3.8 or higher). Then install the required dependencies:
```bash
pip install transformers
```
## Dataset Setup
Place the IMDB dataset in the following structure:

    $HOME/data-sets/aclImdb/train/pos/
    $HOME/data-sets/aclImdb/train/neg/

Optionally, place additional `.txt` files under:

    $HOME/data-sets/txt-files/
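
To make the layout above concrete, here is a minimal sketch of how the `.txt` files in one of these folders could be read into memory. The helper `load_txt_documents` and its `limit` parameter are hypothetical; the real loading code lives in the `DocumentSearcher` class and may differ.

```python
from pathlib import Path


def load_txt_documents(directory, limit=None):
    """Read every .txt file in `directory` into a list of strings (hypothetical helper)."""
    folder = Path(directory)
    if not folder.is_dir():
        # A missing folder is treated as "no documents" rather than an error.
        return []
    documents = []
    for path in sorted(folder.glob("*.txt")):
        if limit is not None and len(documents) >= limit:
            break
        # Ignore undecodable bytes so one bad file does not stop the load.
        documents.append(path.read_text(encoding="utf-8", errors="ignore"))
    return documents
```

The same function can be pointed at `aclImdb/train/pos`, `aclImdb/train/neg`, or `txt-files`, since the IMDB reviews are also stored as individual `.txt` files.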
## Usage
Run the script with the following command:
```bash
python rag_chagu_demo.py
```
## Example Output
```
Looking for positive reviews in: /home/user/data-sets/aclImdb/train/pos
Looking for negative reviews in: /home/user/data-sets/aclImdb/train/neg
Loaded 5000 movie reviews from IMDB dataset.
Normal Query Results:
Document: This movie had great acting and a compelling storyline. The characters were well-developed...
Malicious Query Detected - Confidence: 0.95
Malicious Query Results:
Document: ANOMALY: Query blocked due to detected malicious intent.
```
## How It Works
- The script initializes the `DocumentSearcher` class, which loads movie reviews and additional `.txt` documents.
- The `is_query_malicious()` method uses a pre-trained NLP model to detect queries with potential malicious intent based on sentiment analysis.
- If a query is flagged as malicious, it is blocked and an anomaly message is returned.
- For normal queries, it performs a fuzzy search through the documents and returns the most relevant matches.
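
The flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `rag_chagu_demo.py`: the fuzzy search uses the standard library's `difflib` as a stand-in (the README does not name the matching library), a query is treated as malicious when the classifier returns a `NEGATIVE` label above an assumed threshold of 0.9, and `stub_classifier` is a hypothetical offline replacement for the real model.

```python
import difflib

BLOCKED_MESSAGE = "ANOMALY: Query blocked due to detected malicious intent."


def is_query_malicious(query, classifier, threshold=0.9):
    """Flag a query when the classifier is confidently NEGATIVE (assumed rule)."""
    result = classifier(query)[0]  # pipeline-style output: [{"label": ..., "score": ...}]
    return result["label"] == "NEGATIVE" and result["score"] >= threshold


def fuzzy_search(query, documents, top_k=3):
    """Rank documents by character-level similarity to the query."""
    scored = [
        (difflib.SequenceMatcher(None, query.lower(), doc.lower()).ratio(), doc)
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def search(query, documents, classifier):
    """Block malicious queries; otherwise return the closest documents."""
    if is_query_malicious(query, classifier):
        return [BLOCKED_MESSAGE]
    return fuzzy_search(query, documents)


def stub_classifier(text):
    """Hypothetical stand-in for the real model so the sketch runs offline."""
    if "drop table" in text.lower():
        return [{"label": "NEGATIVE", "score": 0.95}]
    return [{"label": "POSITIVE", "score": 0.80}]
```

Passing the classifier in as an argument keeps the search logic testable without downloading a model; in the project itself the Hugging Face pipeline would take the place of `stub_classifier`.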
## AI Model Used
The project uses the DistilBERT model (`distilbert-base-uncased-finetuned-sst-2-english`) from Hugging Face for detecting malicious queries based on sentiment analysis.
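
For reference, a minimal sketch of loading this model through the `transformers` pipeline API is shown below. The helper names `load_classifier` and `describe` are assumptions for this example. The import sits inside the function so the module can be loaded without `transformers` installed. The pipeline also needs a backend such as PyTorch and downloads the model on first use.

```python
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"


def load_classifier():
    """Build the sentiment-analysis pipeline (hypothetical helper)."""
    # Imported lazily: requires `pip install transformers` plus a backend such as PyTorch.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_NAME)


def describe(prediction):
    """Format one pipeline prediction, a dict with "label" and "score" keys."""
    return f"{prediction['label']} ({prediction['score']:.2f})"
```

Calling `load_classifier()("some query")` returns a list with one dictionary per input, each holding a `label` (`POSITIVE` or `NEGATIVE` for this model) and a `score` between 0 and 1.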
## Why Use AI for Malicious Query Detection?
Traditional pattern matching for detecting malicious queries is limited and can miss more sophisticated or novel attack patterns. By using a pre-trained NLP model, we can leverage the semantic understanding of the text, allowing the system to detect a wider range of harmful queries.
#### Improvements and Future Work
- **Custom Fine-Tuning**: The current version relies on a pre-trained sentiment analysis model. In future versions, a custom model fine-tuned on a dataset of malicious queries could provide better results.
- **Integration with Vector Search (FAISS)**: For larger datasets, integrating a vector search engine like FAISS could speed up document retrieval.
- **Real-Time Query Monitoring**: Adding a real-time monitoring system to detect and log malicious queries for further analysis.
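
As one possible starting point for the monitoring item, blocked queries could be written to a log with the standard `logging` module. Nothing like this exists in the project yet; the logger name `chagu.monitor` and the function `record_blocked_query` are purely illustrative.

```python
import logging

logger = logging.getLogger("chagu.monitor")  # hypothetical logger name


def record_blocked_query(query, confidence, max_chars=200):
    """Log a blocked query with its confidence and return the logged message."""
    # Truncate and escape control characters so a hostile query cannot forge log lines.
    safe_query = query[:max_chars].encode("unicode_escape").decode("ascii")
    message = f"blocked query (confidence={confidence:.2f}): {safe_query}"
    logger.warning(message)
    return message
```

Escaping the query before logging matters here because the logged text is attacker-controlled by definition.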
#### Contributing
Feel free to fork this repository and submit pull requests. Contributions are welcome!
#### License
This project is licensed under the MIT License; see the LICENSE file for details.
#### Contact
For any questions or issues, please contact the project maintainer:
- **Name**: Talex Maxim
- **Email**: [email protected]
- **GitHub**: taimax13