---
title: Chagu Demo
emoji: 📊
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.40.1
app_file: app.py
pinned: false
license: mit
short_description: 'Demo for the chain guard protocol, assistant, and RAG'
---

# **AI-Powered Document Search with Malicious Query Detection**

This project implements a semantic search engine for documents using **AI-based malicious query detection**. It allows users to search through movie reviews (IMDB dataset) and additional `.txt` files, while also identifying and blocking potential malicious queries using a pre-trained NLP model.

## **Features**
- **Semantic Search**: Uses fuzzy matching for normal queries, allowing context-aware searches.
- **AI-Based Malicious Query Detection**: Utilizes a pre-trained NLP model (`DistilBERT`) to detect queries with malicious intent, blocking potential SQL injection and other harmful queries.
- **Flexible Document Ingestion**: Supports loading documents from the IMDB dataset and additional `.txt` files.
- **Automatic Path Handling**: Resolves dataset paths relative to the `HOME` environment variable, so no paths need to be hard-coded.

## **Technologies Used**
- **Python 3.8+**
- **Transformers**: For NLP-based malicious query detection.
- **Hugging Face Pipeline**: Uses the `distilbert-base-uncased-finetuned-sst-2-english` model for sentiment analysis.
- **Pathlib**: For robust file and path handling.

## **Project Structure**
├── rag_chagu_demo.py # Main script containing the DocumentSearcher class
├── README.md # This file
├── data-sets/ # This directory now lives under $HOME
│ ├── aclImdb/
│ │ ├── train/
│ │ │ ├── pos/ # Positive movie reviews
│ │ │ └── neg/ # Negative movie reviews
│ └── txt-files/ # Additional .txt files for document search


## **Installation**
Make sure you have Python installed (version 3.8 or higher). Then, install the required dependencies:

```bash
pip install transformers
```

## **Dataset Setup**
Place the IMDB dataset in the following structure:

- `$HOME/data-sets/aclImdb/train/pos/`
- `$HOME/data-sets/aclImdb/train/neg/`

Optionally, place additional `.txt` files under:

- `$HOME/data-sets/txt-files/`
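
The directory layout above can be resolved and read with `pathlib`. The following is a minimal sketch under stated assumptions, not the project's actual loader: the function names `dataset_root` and `load_documents` and the `limit` parameter are hypothetical and were chosen only for illustration.

```python
from pathlib import Path
from typing import List, Optional


def dataset_root(home: Optional[Path] = None) -> Path:
    """Return the data-sets directory under the user's home directory.

    Path.home() reads the HOME environment variable on POSIX systems.
    """
    return (home or Path.home()) / "data-sets"


def load_documents(root: Path, limit: Optional[int] = None) -> List[str]:
    """Read IMDB reviews and optional extra .txt files found below `root`."""
    folders = [
        root / "aclImdb" / "train" / "pos",
        root / "aclImdb" / "train" / "neg",
        root / "txt-files",
    ]
    documents: List[str] = []
    for folder in folders:
        if not folder.is_dir():
            continue  # skip folders that are absent (txt-files is optional)
        for path in sorted(folder.glob("*.txt")):
            documents.append(path.read_text(encoding="utf-8", errors="ignore"))
            if limit is not None and len(documents) >= limit:
                return documents
    return documents
```

Calling `load_documents(dataset_root())` would read from the real home directory; passing an explicit `home` makes the function easy to try against a temporary folder.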

## **Usage**
Run the script with the following command:

```bash
python rag_chagu_demo.py
```

## **Example Output**
```

Looking for positive reviews in: /home/user/data-sets/aclImdb/train/pos
Looking for negative reviews in: /home/user/data-sets/aclImdb/train/neg
Loaded 5000 movie reviews from IMDB dataset.

Normal Query Results:
Document: This movie had great acting and a compelling storyline. The characters were well-developed...

Malicious Query Detected - Confidence: 0.95
Malicious Query Results:

Document: ANOMALY: Query blocked due to detected malicious intent.

```
## How It Works
1. The script initializes the `DocumentSearcher` class, which loads movie reviews and additional `.txt` documents.
2. The `is_query_malicious()` method uses a pre-trained NLP model to detect queries with potential malicious intent based on sentiment analysis.
3. If a query is flagged as malicious, it is blocked and an anomaly message is returned.
4. For normal queries, it performs a fuzzy search through the documents and returns the most relevant matches.
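
The flow described above can be sketched as follows. This is an illustration only: the README does not say how the fuzzy search is implemented, so the sketch swaps in `difflib.SequenceMatcher` from the standard library, and the names `fuzzy_search`, `search`, and `ANOMALY_MESSAGE` are hypothetical. The malicious-query check is passed in as a callable so the example runs without downloading a model.

```python
import difflib
from typing import Callable, List, Tuple

ANOMALY_MESSAGE = "ANOMALY: Query blocked due to detected malicious intent."


def fuzzy_search(query: str, documents: List[str], top_k: int = 3) -> List[str]:
    """Rank documents by how closely their words match the query terms."""
    terms = query.lower().split()
    scored: List[Tuple[float, str]] = []
    for doc in documents:
        words = set(doc.lower().split())
        total = 0.0
        for term in terms:
            # Best similarity between this query term and any word in the document.
            total += max(
                (difflib.SequenceMatcher(None, term, w).ratio() for w in words),
                default=0.0,
            )
        scored.append((total / max(len(terms), 1), doc))
    # Highest average similarity first; the sort is stable for ties.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def search(
    query: str,
    documents: List[str],
    is_malicious: Callable[[str], bool],
    top_k: int = 3,
) -> List[str]:
    """Block flagged queries; otherwise return the closest documents."""
    if is_malicious(query):
        return [ANOMALY_MESSAGE]
    return fuzzy_search(query, documents, top_k)


if __name__ == "__main__":
    docs = [
        "This movie had great acting and a compelling storyline.",
        "The weather was cold and rainy.",
    ]
    # Stand-in for the NLP model: a trivial keyword check, for demonstration only.
    stub = lambda q: "drop table" in q.lower()
    print(search("grate acting", docs, stub))
    print(search("1; DROP TABLE users", docs, stub))
```

Injecting the classifier keeps the search logic testable on its own; in the real script the callable would wrap the model-based check.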

## AI Model Used
The project uses the DistilBERT model (`distilbert-base-uncased-finetuned-sst-2-english`) from Hugging Face for detecting malicious queries based on sentiment analysis.
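
As a rough sketch of how such a sentiment pipeline could feed a block/allow decision: the Hugging Face `pipeline("sentiment-analysis", ...)` call returns a list of dictionaries with `label` and `score` keys. How the project maps that output to "malicious" is not stated in this README, so the rule below (a `NEGATIVE` label at or above a threshold) and the names `build_classifier`, `interpret`, and `threshold` are assumptions.

```python
from typing import Any, Dict, List, Tuple


def build_classifier():
    """Create the sentiment pipeline (downloads the model on first use)."""
    # Imported lazily so the rest of this module works without transformers installed.
    from transformers import pipeline

    return pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )


def interpret(
    predictions: List[Dict[str, Any]], threshold: float = 0.9
) -> Tuple[bool, float]:
    """Map pipeline output to (is_malicious, confidence).

    Assumption: a NEGATIVE label scoring at or above `threshold` is treated
    as malicious. The real script may use a different rule.
    """
    top = predictions[0]
    score = float(top["score"])
    return (top["label"] == "NEGATIVE" and score >= threshold, score)


# Example usage (requires transformers and a backend such as PyTorch):
#   classifier = build_classifier()
#   flagged, confidence = interpret(classifier("DROP TABLE users; --"))
```

Keeping the threshold as a parameter makes it easy to trade false positives against missed detections when experimenting.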

## Why Use AI for Malicious Query Detection?
Traditional pattern matching for detecting malicious queries is limited and can miss more sophisticated or novel attack patterns. By using a pre-trained NLP model, we can leverage the semantic understanding of the text, allowing the system to detect a wider range of harmful queries.

## Improvements and Future Work
- **Custom Fine-Tuning**: The project currently relies on a pre-trained sentiment analysis model. In future versions, a custom model fine-tuned on a dataset of malicious queries could provide better results.
- **Integration with Vector Search (FAISS)**: For larger datasets, integrating a vector search engine like FAISS could speed up document retrieval.
- **Real-Time Query Monitoring**: Add a real-time monitoring system to detect and log malicious queries for further analysis.
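
For the monitoring idea, one possible starting point is to log each blocked query as a structured record using Python's standard `logging` module. This is a sketch of a feature that does not exist yet; the logger name `chagu.monitor`, the function `log_blocked_query`, and the record fields are all hypothetical.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Any, Dict

logger = logging.getLogger("chagu.monitor")  # hypothetical logger name


def log_blocked_query(query: str, confidence: float) -> Dict[str, Any]:
    """Log a blocked query as a JSON record and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "confidence": round(confidence, 4),
    }
    logger.warning("Blocked query: %s", json.dumps(record))
    return record
```

Because the record is plain JSON, it could later be shipped to a file handler or an external log collector without changing the calling code.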

## Contributing
Feel free to fork this repository and submit pull requests. Contributions are welcome!

## License
This project is licensed under the MIT License - see the LICENSE file for details.

## Contact
For any questions or issues, please contact the project maintainer:

- Name: Talex Maxim
- Email: [email protected]
- GitHub: taimax13