Updated application
- BuildingAChainlitApp.md +0 -312
- app.py +47 -63
- classes/app_state.py +4 -74
- classes/model_run_state.py +50 -0
- classes/ragas_state.py +17 -0
- images/docchain_img.png +0 -0
- old_app.py +0 -145
- utilities/constants.py +4 -0
- utilities/doc_utilities.py +54 -0
- utilities/get_documents.py +0 -33
- utilities/pipeline.py +0 -27
- utilities/rag_utilities.py +0 -125
- utilities/templates.py +27 -0
- utilities/text_utils.py +0 -103
- utilities/vector_database.py +0 -105
- utilities/vector_utilities.py +203 -0
- utilities_2/__init__.py +0 -0
- utilities_2/openai_utils/__init__.py +0 -0
- utilities_2/openai_utils/chatmodel.py +0 -45
- utilities_2/openai_utils/embedding.py +0 -60
- utilities_2/openai_utils/prompts.py +0 -78
- utilities_2/text_utils.py +0 -75
- utilities_2/vectordatabase.py +0 -82
BuildingAChainlitApp.md
DELETED
# Building a Chainlit App

What if we want to take our Week 1 Day 2 assignment - [Pythonic RAG](https://github.com/AI-Maker-Space/AIE4/tree/main/Week%201/Day%202) - and bring it out of the notebook?

Well - we'll cover exactly that here!

## Anatomy of a Chainlit Application

[Chainlit](https://docs.chainlit.io/get-started/overview) is a Python package, similar to Streamlit, that lets users write the backend and the frontend of an application in a single (or multiple) Python file(s). It is mainly used for prototyping LLM-based chat-style applications - though it is used in production in some settings with millions of MAUs (Monthly Active Users).

The primary method of customizing and interacting with the Chainlit UI is through a few critical [decorators](https://blog.hubspot.com/website/decorators-in-python).

> NOTE: Simply put, the decorators (in Chainlit) are just ways we can "plug in" to the functionality of Chainlit.

We'll be concerning ourselves with three main scopes:

1. On application start - when we start the Chainlit application with a command like `chainlit run app.py`
2. On chat start - when a chat session starts (a user opens the web browser to the address hosting the application)
3. On message - when the user sends a message through the input text box in the Chainlit UI

Let's dig into each scope and see what we're doing!
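Before we do, here is a minimal sketch of how those three scopes map onto Chainlit decorators (the decorator names come from the Chainlit docs; the bodies are placeholders rather than the app we build below):

```python
import chainlit as cl

# Scope 1 - "on application start": anything at module level in app.py runs
# once when we launch `chainlit run app.py`.

@cl.on_chat_start            # Scope 2 - runs once per chat session
async def start():
    await cl.Message(content="Session started - upload a file to begin!").send()

@cl.on_message               # Scope 3 - runs every time the user sends a message
async def respond(message: cl.Message):
    await cl.Message(content=f"You said: {message.content}").send()
```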
## On Application Start:

The first thing you'll notice is that we have the traditional "wall of imports" - this is to ensure we have everything we need to run our application.

```python
import os
from typing import List
from chainlit.types import AskFileResponse
from utilities_2.text_utils import CharacterTextSplitter, TextFileLoader
from utilities_2.openai_utils.prompts import (
    UserRolePrompt,
    SystemRolePrompt,
    AssistantRolePrompt,
)
from utilities_2.openai_utils.embedding import EmbeddingModel
from utilities_2.vectordatabase import VectorDatabase
from utilities_2.openai_utils.chatmodel import ChatOpenAI
import chainlit as cl
```

Next up, we have some prompt templates. Since every session uses the same prompt templates without modification, and we don't need them to be specific to any one session, we can set them up here - at the application scope.

```python
system_template = """\
Use the following context to answer a users question. If you cannot find the answer in the context, say you don't know the answer."""
system_role_prompt = SystemRolePrompt(system_template)

user_prompt_template = """\
Context:
{context}

Question:
{question}
"""
user_role_prompt = UserRolePrompt(user_prompt_template)
```

> NOTE: You'll notice that these are the exact same prompt templates we used in the Pythonic RAG Notebook in Week 1 Day 2!
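As a quick illustration of how these templates get used later in the pipeline (the exact message object returned by `create_message` is internal to the `utilities_2` helpers, so treat this as a sketch of the call pattern rather than of its output format):

```python
# The system prompt needs no variables; the user prompt is filled with the
# retrieved context and the user's question.
formatted_system_prompt = system_role_prompt.create_message()
formatted_user_prompt = user_role_prompt.create_message(
    question="What does the document say about data privacy?",
    context="...chunks retrieved from the vector database...",
)

# Both formatted messages are then handed to the chat model, e.g.:
# async for chunk in chat_openai.astream([formatted_system_prompt, formatted_user_prompt]):
#     ...
```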
Following that - we can create the Python class definition for our RAG pipeline - or *chain*, as we'll refer to it in the rest of this walkthrough.

Let's look at the definition first:

```python
class RetrievalAugmentedQAPipeline:
    def __init__(self, llm: ChatOpenAI(), vector_db_retriever: VectorDatabase) -> None:
        self.llm = llm
        self.vector_db_retriever = vector_db_retriever

    async def arun_pipeline(self, user_query: str):
        ### RETRIEVAL
        context_list = self.vector_db_retriever.search_by_text(user_query, k=4)

        context_prompt = ""
        for context in context_list:
            context_prompt += context[0] + "\n"

        ### AUGMENTED
        formatted_system_prompt = system_role_prompt.create_message()

        formatted_user_prompt = user_role_prompt.create_message(question=user_query, context=context_prompt)

        ### GENERATION
        async def generate_response():
            async for chunk in self.llm.astream([formatted_system_prompt, formatted_user_prompt]):
                yield chunk

        return {"response": generate_response(), "context": context_list}
```

Notice a few things:

1. We have modified this `RetrievalAugmentedQAPipeline` from the initial notebook to support streaming.
2. In essence, our pipeline is *chaining* a few events together:
    1. We take our user query, and chain it into our Vector Database to collect related chunks
    2. We take those contexts and our user's question and chain them into the prompt templates
    3. We take that prompt template and chain it into our LLM call
    4. We chain the response of the LLM call to the user
3. We are using a lot of `async` again!

Now, we're going to create a helper function for processing uploaded text files.

First, we'll instantiate a shared `CharacterTextSplitter`.

```python
text_splitter = CharacterTextSplitter()
```

Now we can define our helper.

```python
def process_text_file(file: AskFileResponse):
    import tempfile

    with tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".txt") as temp_file:
        temp_file_path = temp_file.name

    with open(temp_file_path, "wb") as f:
        f.write(file.content)

    text_loader = TextFileLoader(temp_file_path)
    documents = text_loader.load_documents()
    texts = text_splitter.split_text(documents)
    return texts
```

Simply put, this saves the upload into a temporary file, loads it with `TextFileLoader`, splits it with our `CharacterTextSplitter`, and returns the resulting list of strings!

<div style="border: 2px solid white; padding: 10px; border-radius: 5px; background-color: black;">

### QUESTION #1:

Why do we want to support streaming? What about streaming is important, or useful?

### ANSWER #1:

Streaming is the continuous transmission of data from the model to the UI. Instead of waiting and batching the response into a single large message, the response is sent in pieces (streams) as it is created.

The advantages of streaming:
- quicker initial response - the user sees the first part of the answer sooner
- it is easier to spot that a response is going wrong and terminate the request early
- it is a more natural mode of communication for humans
- better handling of large responses, without requiring complex caching
- it is essential for real-time processing
- humans can only read so fast, so it's an advantage to get some of the data earlier

</div>
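To make the first of those advantages concrete, here is a tiny self-contained sketch - a toy generator standing in for the LLM, not our actual pipeline - showing why the first token reaches the user so much sooner when we stream:

```python
import asyncio

async def fake_llm_stream():
    # Stand-in for llm.astream(...): emits one token every half second.
    for token in ["Retrieval", " Augmented", " Generation", "!"]:
        await asyncio.sleep(0.5)
        yield token

async def main():
    # Streaming: the first token is printed after ~0.5s.
    async for token in fake_llm_stream():
        print(token, end="", flush=True)
    print()
    # A non-streaming call would display nothing until ~2s had passed,
    # once every token had been generated and collected.

asyncio.run(main())
```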
## On Chat Start:

The next scope is where "the magic happens". On Chat Start is when a user begins a chat session. This will happen whenever a user opens a new chat window, or refreshes an existing chat window.

You'll see that our code is set up to immediately show the user a chat box requesting them to upload a file.

```python
while files == None:
    files = await cl.AskFileMessage(
        content="Please upload a Text File file to begin!",
        accept=["text/plain"],
        max_size_mb=2,
        timeout=180,
    ).send()
```

Once we've obtained the text file - we'll use our processing helper function to process our text!

After we have processed our text file - we'll need to create a `VectorDatabase` and populate it with our processed chunks and their related embeddings!

```python
vector_db = VectorDatabase()
vector_db = await vector_db.abuild_from_list(texts)
```

Once we have that piece completed - we can create the chain we'll be using to respond to user queries!

```python
retrieval_augmented_qa_pipeline = RetrievalAugmentedQAPipeline(
    vector_db_retriever=vector_db,
    llm=chat_openai
)
```

Now, we'll save that into our user session!

> NOTE: Chainlit has some great documentation about [User Session](https://docs.chainlit.io/concepts/user-session).

<div style="border: 2px solid white; padding: 10px; border-radius: 5px; background-color: black;">

### QUESTION #2:

Why are we using User Session here? What about Python makes us need to use this? Why not just store everything in a global variable?

### ANSWER #2:

The application will hopefully be run by many people at the same time. If the data were stored in a global variable, it would be shared by everyone using the application - every time someone started a new session, the information would be overwritten, so everyone would effectively get the same results (unless only one person used the system at a time).

The goal, then, is to keep each user's session information separate from every other user's. The Chainlit User Session provides exactly that capability: it stores each user's data separately.

</div>
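Saving and retrieving the chain is a single `set`/`get` pair keyed by a label we choose (here `"chain"`); the same pattern shows up again in the On Message section below:

```python
# At the end of on_chat_start: stash this session's pipeline.
cl.user_session.set("chain", retrieval_augmented_qa_pipeline)

# Later, inside on_message: pull the same pipeline back out for this user.
chain = cl.user_session.get("chain")
```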
## On Message

First, we load our chain from the user session:

```python
chain = cl.user_session.get("chain")
```

Then, we run the chain on the content of the message - and stream it to the front end - that's it!

```python
msg = cl.Message(content="")
result = await chain.arun_pipeline(message.content)

async for stream_resp in result["response"]:
    await msg.stream_token(stream_resp)
```
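Putting the whole handler together - this is essentially what the app's `@cl.on_message` function looks like, including the final `msg.send()` that flushes the streamed message once the generator is exhausted:

```python
@cl.on_message
async def main(message):
    chain = cl.user_session.get("chain")

    msg = cl.Message(content="")
    result = await chain.arun_pipeline(message.content)

    # Forward each streamed chunk to the UI as it arrives.
    async for stream_resp in result["response"]:
        await msg.stream_token(stream_resp)

    await msg.send()
```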
## 🎉

With that - you've created a Chainlit application that moves our Pythonic RAG notebook to a Chainlit application!

## 🚧 CHALLENGE MODE 🚧

For an extra challenge - modify the behaviour of your application by integrating changes you made to your Pythonic RAG notebook (using new retrieval methods, etc.).

If you're still looking for a challenge, or didn't make any modifications to your Pythonic RAG notebook:

1) Allow users to upload PDFs (this will require you to build a PDF parser as well)
2) Modify the VectorStore to leverage [Qdrant](https://python-client.qdrant.tech/) - a minimal sketch follows after the note below

> NOTE: The motivation for these challenges is simple - the beginning of the course is extremely information dense, and people come from all kinds of different technical backgrounds. In order to ensure that all learners are able to engage with the content confidently and comfortably, we want to focus on the basic units of technical competency required. That can leave learners who came in with more robust technical skills finding the introductory material too simple - and these open-ended challenges give them something meatier to work on!
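For the second challenge, here is a minimal sketch of an in-memory Qdrant collection built with the `qdrant-client` package; the collection name, vector size, and how you obtain `embeddings` are placeholders you would adapt to your own `EmbeddingModel`:

```python
import uuid
from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(location=":memory:")   # local, in-memory instance for experiments
client.create_collection(
    collection_name="my_collection",
    vectors_config=models.VectorParams(size=1536, distance="Cosine"),
)

def insert_texts(texts, embeddings):
    # One point per chunk; the raw text rides along in the payload.
    points = [
        models.PointStruct(id=str(uuid.uuid4()), vector=emb, payload={"text": text})
        for text, emb in zip(texts, embeddings)
    ]
    client.upsert(collection_name="my_collection", points=points)

def search(query_embedding, k=4):
    hits = client.search(collection_name="my_collection", query_vector=query_embedding, limit=k)
    return [(hit.payload["text"], hit.score) for hit in hits]
```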
## Support pdf documents

Code was modified to support pdf documents in the following areas:

1) Changed the request for documents in `on_chat_start`:

- changed the message to ask for a .txt or .pdf file
- changed the acceptable file formats so that pdf documents are included in the file-select pop-up

```python
while not files:
    files = await cl.AskFileMessage(
        content="Please upload a .txt or .pdf file to begin processing!",
        accept=["text/plain", "application/pdf"],
        max_size_mb=2,
        timeout=180,
    ).send()
```

2) Changed the `process_text_file()` function to handle .pdf files

- refactored the code to do all file handling in `utilities.text_utils`
- the app calls `process_file`, optionally passing in the text splitter function
- the default text splitter function is `CharacterTextSplitter`

```python
texts = process_file(file)
```

- the `load_file()` function does the following:
    - read the uploaded document into a temporary file
    - identify the file extension
    - process a .txt file as before, resulting in the `texts` list
    - if the file is a .pdf, use the PyMuPDF library to read each page, extract the text, and add it to the `texts` list
    - use the passed-in text splitter function to split the documents

```python
def load_file(self, file, text_splitter=CharacterTextSplitter()):
    file_extension = os.path.splitext(file.name)[1].lower()
    with tempfile.NamedTemporaryFile(mode="wb", delete=False, suffix=file_extension) as temp_file:
        self.temp_file_path = temp_file.name
        temp_file.write(file.content)

    if os.path.isfile(self.temp_file_path):
        if self.temp_file_path.endswith(".txt"):
            self.load_text_file()
        elif self.temp_file_path.endswith(".pdf"):
            self.load_pdf_file()
        else:
            raise ValueError(
                f"Unsupported file type: {self.temp_file_path}"
            )
        return text_splitter.split_text(self.documents)
    else:
        raise ValueError(
            "Not a file"
        )

def load_text_file(self):
    with open(self.temp_file_path, "r", encoding=self.encoding) as f:
        self.documents.append(f.read())

def load_pdf_file(self):
    pdf_document = fitz.open(self.temp_file_path)
    for page_num in range(len(pdf_document)):
        page = pdf_document.load_page(page_num)
        text = page.get_text()
        self.documents.append(text)
```

3) Tested the handling of .pdf and .txt files

Several different .pdf and .txt files were successfully uploaded and processed by the app.
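For reference, calling the refactored loader from the app looks roughly like this (matching the `load_file` signature shown above; `MyRecursiveCharacterTextSplitter` is the optional LangChain-backed splitter the utilities also provide):

```python
from utilities.text_utils import FileLoader, MyRecursiveCharacterTextSplitter

file_loader = FileLoader()

# Default: the simple fixed-size CharacterTextSplitter.
texts = file_loader.load_file(file)

# Or pass a different splitter explicitly.
texts = file_loader.load_file(file, text_splitter=MyRecursiveCharacterTextSplitter())
```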
app.py
CHANGED
```diff
@@ -1,14 +1,16 @@
+import chainlit as cl
 import os
+from classes.app_state import AppState
+from classes.model_run_state import ModelRunState
 from dotenv import load_dotenv
-
+from langchain.schema.runnable import RunnablePassthrough
 from langchain_openai import ChatOpenAI
-from
-from
-from langchain_core.prompts import ChatPromptTemplate
+from langchain_openai.embeddings import OpenAIEmbeddings
+from langchain.embeddings import HuggingFaceEmbeddings
 from operator import itemgetter
-from
-from
-from
+from utilities.doc_utilities import get_documents
+from utilities.templates import get_qa_prompt
+from utilities.vector_utilities import create_vector_store

 document_urls = [
     "https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf",
@@ -21,56 +23,39 @@ load_dotenv()
 # Get the OpenAI API key from environment variables
 openai_api_key = os.getenv("OPENAI_API_KEY")

-# Setup our state
-
-
-state.set_chunk_overlap(100)
-
-state.set_main_llm(llm)
-
-The two documents are 'Blueprint for an AI Bill of Rights' and 'Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile'
-"""
-human_template = """
-===
-question:
-{question}
-
-===
-context:
-{context}
-===
-"""
-chat_prompt = ChatPromptTemplate.from_messages([
-    ("system", system_template),
-    ("human", human_template)
-])
-# create the chain
-openai_chat_model = ChatOpenAI(model="gpt-4o")
+# Setup our state and read the documents
+app_state = AppState()
+app_state.set_debug(False)
+app_state.set_document_urls(document_urls)
+
+get_documents(app_state)
+
+# set up this model run
+chainlit_state = ModelRunState()
+chainlit_state.name = "Chainlit"
+
+chainlit_state.qa_model_name = "gpt-4o-mini"
+chainlit_state.qa_model = ChatOpenAI(model=chainlit_state.qa_model_name, openai_api_key=openai_api_key)
+
+hf_username = "rchrdgwr"
+hf_repo_name = "finetuned-arctic-model-2"
+finetuned_model_name = f"{hf_username}/{hf_repo_name}"
+
+chainlit_state.embedding_model_name = finetuned_model_name
+chainlit_state.embedding_model = HuggingFaceEmbeddings(model_name=chainlit_state.embedding_model_name)
+
+chainlit_state.chunk_size = 1000
+chainlit_state.chunk_overlap = 100
+create_vector_store(app_state, chainlit_state )
+
+chat_prompt = get_qa_prompt()
+
+# create the chain

 retrieval_augmented_qa_chain = (
-    {"context": itemgetter("question") |
+    {"context": itemgetter("question") | chainlit_state.retriever, "question": itemgetter("question")}
     | RunnablePassthrough.assign(context=itemgetter("context"))
-
-    | {"response": chat_prompt | openai_chat_model, "context": itemgetter("context")}
+    | {"response": chat_prompt | chainlit_state.qa_model, "context": itemgetter("context")}
 )

 opening_content = """
@@ -116,19 +101,18 @@ async def main(message):

     await cl.Message(content=context_msg).send()

-    for doc in context_documents:
-
-    ).send()
+    # for doc in context_documents:
+    #     document_title = doc.metadata.get("source", "Unknown Document")
+    #     chunk_number = doc.metadata.get("chunk_number", "Unknown Chunk")
+
+    #     document_context = doc.page_content.strip()
+    #     truncated_context = document_context[:MAX_PREVIEW_LENGTH] + ("..." if len(document_context) > MAX_PREVIEW_LENGTH else "")
+    #     print("----------------------------------------")
+    #     print(truncated_context)
+
+    #     await cl.Message(
+    #         content=f"**{document_title} ( Chunk: {chunk_number})**",
+    #         elements=[
+    #             cl.Text(content=truncated_context, display="inline")
+    #         ]
+    #     ).send()
```
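For context on how this chain is consumed at query time: the Chainlit message handler invokes it with the user's question and unpacks the response and retrieved context, roughly like this (the full handler body is outside this hunk, so this is an illustrative sketch of the LCEL pattern rather than the exact code in `main`):

```python
@cl.on_message
async def main(message):
    # Run the chain; its output dict carries both the model response and the
    # context documents used to answer.
    result = retrieval_augmented_qa_chain.invoke({"question": message.content})

    response_text = result["response"].content
    context_documents = result["context"]

    await cl.Message(content=response_text).send()
```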
classes/app_state.py
CHANGED
```diff
@@ -1,86 +1,16 @@
+import pprint
 class AppState:
     def __init__(self):
         self.debug = False
-        self.llm_model = "gpt-3.5-turbo"
-        self.embedding_model = "text-embedding-3-small"
-        self.chunk_size = 1000
-        self.chunk_overlap = 100
         self.document_urls = []
         self.download_folder = "data/"
-        self.loaded_documents = []
-        self.single_text_documents = []
-        self.metadata = []
-        self.titles = []
         self.documents = []
-        self.combined_document_objects = []
-        self.main_llm = None
-        self.retriever = None

-        self.user_input = None
-        self.retrieved_documents = []
-        self.chat_history = []
-        self.current_question = None
-
+    def display(self):
+        pprint.pprint(self.__dict__)
     def set_document_urls(self, document_urls):
         self.document_urls = document_urls
-
-    def set_llm_model(self, llm_model):
-        self.llm_model = llm_model
-
-    def set_embedding_model(self, embedding_model):
-        self.embedding_model = embedding_model
-
-    def set_chunk_size(self, chunk_size):
-        self.chunk_size = chunk_size
-
-    def set_chunk_overlap(self, chunk_overlap):
-        self.chunk_overlap = chunk_overlap
-
-    def set_system_template(self, system_template):
-        self.system_template = system_template
-
-    def add_loaded_document(self, loaded_document):
-        self.loaded_documents.append(loaded_document)
-
-    def add_single_text_documents(self, single_text_document):
-        self.single_text_documents.append(single_text_document)
-    def add_metadata(self, metadata):
-        self.metadata = metadata
-
-    def add_title(self, title):
-        self.titles.append(title)
     def add_document(self, document):
         self.documents.append(document)
-    def add_combined_document_objects(self, combined_document_objects):
-        self.combined_document_objects = combined_document_objects
-    def set_retriever(self, retriever):
-        self.retriever = retriever
-    def set_main_llm(self, main_llm):
-        self.main_llm = main_llm
     def set_debug(self, debug):
-        self.debug = debug
-    #
-    # Method to update the user input
-    def set_user_input(self, input_text):
-        self.user_input = input_text
-
-    # Method to add a retrieved document
-    # def add_document(self, document):
-    #     print("adding document")
-    #     print(self)
-    #     self.retrieved_documents.append(document)
-
-    # Method to update chat history
-    def update_chat_history(self, message):
-        self.chat_history.append(message)
-
-    # Method to get the current state
-    def get_state(self):
-        return {
-            "user_input": self.user_input,
-            "retrieved_documents": self.retrieved_documents,
-            "chat_history": self.chat_history,
-            "current_question": self.current_question
-        }
+        self.debug = debug
```
classes/model_run_state.py
ADDED
```python
import pprint

from utilities.constants import (
    CHUNKING_STRATEGY_RECURSIVE,
    CHUNKING_STRATEGY_TABLE_AWARE,
    CHUNKING_STRATEGY_SECTION_BASED
)

class ModelRunState:
    def __init__(self):
        self.name = ""

        self.qa_model_name = "gpt-4o"
        self.qa_model = None

        self.embedding_model_name = "text-embedding-3-small"
        self.embedding_model = None

        self.chunking_strategy = CHUNKING_STRATEGY_RECURSIVE
        self.chunk_size = 1000
        self.chunk_overlap = 100

        self.response_dataset = []

        self.combined_document_objects = []
        self.retriever = None

        self.ragas_results = None
        self.system_template = "You are a helpful assistant"

    def display(self):
        pprint.pprint(self.__dict__)

    def parameters(self):
        print(f"Base model: {self.qa_model_name}")
        print(f"Embedding model: {self.embedding_model_name}")
        print(f"Chunk size: {self.chunk_size}")
        print(f"Chunk overlap: {self.chunk_overlap}")

    def results_summary(self):
        print(self.ragas_results)

    def results(self):
        results_df = self.ragas_results.to_pandas()
        results_df

    @classmethod
    def compare_ragas_results(cls, model_run_1, model_run_2):
        if not isinstance(model_run_1, cls) or not isinstance(model_run_2, cls):
            raise ValueError("Both instances must be of the same class")
```
classes/ragas_state.py
ADDED
```python
import pprint
from ragas.testset.evolutions import simple, reasoning, multi_context
class RagasState:
    def __init__(self):
        self.chunk_size = 600
        self.chunk_overlap = 50
        self.chunks = []
        self.generator_llm = "gpt-4"
        self.critic_llm = "gpt-4o-mini"
        self.distributions = {
            simple: 0.5,
            multi_context: 0.4,
            reasoning: 0.1
        }
        self.num_questions = 3
        self.testset_df = None
```
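These defaults line up with the ragas 0.1-style test set generation workflow; a sketch of how they might be consumed follows (the `TestsetGenerator` class and method names are assumptions based on that API version, not code present in this commit, and `documents` is a placeholder for the chunked LangChain documents):

```python
# Hypothetical usage of RagasState with the ragas test set generator.
from langchain_openai import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from ragas.testset.generator import TestsetGenerator

ragas_state = RagasState()
generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model=ragas_state.generator_llm),
    critic_llm=ChatOpenAI(model=ragas_state.critic_llm),
    embeddings=OpenAIEmbeddings(),
)
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=ragas_state.num_questions,
    distributions=ragas_state.distributions,
)
ragas_state.testset_df = testset.to_pandas()
```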
images/docchain_img.png
DELETED
Binary file (100 kB)
old_app.py
DELETED
```python
import os
from chainlit.types import AskFileResponse

from utilities_2.openai_utils.prompts import (
    UserRolePrompt,
    SystemRolePrompt,
    AssistantRolePrompt,
)
from utilities_2.openai_utils.embedding import EmbeddingModel
from utilities_2.vectordatabase import VectorDatabase
from utilities_2.openai_utils.chatmodel import ChatOpenAI
import chainlit as cl
from utilities.text_utils import FileLoader
from utilities.pipeline import RetrievalAugmentedQAPipeline
# from utilities.vector_database import QdrantDatabase


def process_file(file, use_rct):
    fileLoader = FileLoader()
    return fileLoader.load_file(file, use_rct)

system_template = """\
Use the following context to answer a users question.
If you cannot find the answer in the context, say you don't know the answer.
The context contains the text from a document. Refer to it as the document not the context.
"""
system_role_prompt = SystemRolePrompt(system_template)

user_prompt_template = """\
Context:
{context}

Question:
{question}
"""
user_role_prompt = UserRolePrompt(user_prompt_template)

@cl.on_chat_start
async def on_chat_start():
    # get user inputs
    res = await cl.AskActionMessage(
        content="Do you want to use Qdrant?",
        actions=[
            cl.Action(name="yes", value="yes", label="✅ Yes"),
            cl.Action(name="no", value="no", label="❌ No"),
        ],
    ).send()
    use_qdrant = False
    use_qdrant_type = "Local"
    if res and res.get("value") == "yes":
        use_qdrant = True
        local_res = await cl.AskActionMessage(
            content="Do you want to use local or cloud?",
            actions=[
                cl.Action(name="Local", value="Local", label="✅ Local"),
                cl.Action(name="Cloud", value="Cloud", label="❌ Cloud"),
            ],
        ).send()
        if local_res and local_res.get("value") == "Cloud":
            use_qdrant_type = "Cloud"
    use_rct = False
    res = await cl.AskActionMessage(
        content="Do you want to use RecursiveCharacterTextSplitter?",
        actions=[
            cl.Action(name="yes", value="yes", label="✅ Yes"),
            cl.Action(name="no", value="no", label="❌ No"),
        ],
    ).send()
    if res and res.get("value") == "yes":
        use_rct = True

    files = None
    # Wait for the user to upload a file
    while not files:
        files = await cl.AskFileMessage(
            content="Please upload a .txt or .pdf file to begin processing!",
            accept=["text/plain", "application/pdf"],
            max_size_mb=2,
            timeout=180,
        ).send()

    file = files[0]

    msg = cl.Message(
        content=f"Processing `{file.name}`...", disable_human_feedback=True
    )
    await msg.send()

    texts = process_file(file, use_rct)

    msg = cl.Message(
        content=f"Resulted in {len(texts)} chunks", disable_human_feedback=True
    )
    await msg.send()

    # decide if to use the dict vector store of the Qdrant vector store

    # Create a dict vector store
    if use_qdrant == False:
        vector_db = VectorDatabase()
        vector_db = await vector_db.abuild_from_list(texts)
    else:
        embedding_model = EmbeddingModel(embeddings_model_name= "text-embedding-3-small", dimensions=1000)
        if use_qdrant_type == "Local":
            from utilities.vector_database import QdrantDatabase
            vector_db = QdrantDatabase(
                embedding_model=embedding_model
            )

        vector_db = await vector_db.abuild_from_list(texts)

    msg = cl.Message(
        content=f"The Vector store has been created", disable_human_feedback=True
    )
    await msg.send()

    chat_openai = ChatOpenAI()

    # Create a chain
    retrieval_augmented_qa_pipeline = RetrievalAugmentedQAPipeline(
        vector_db_retriever=vector_db,
        llm=chat_openai,
        system_role_prompt=system_role_prompt,
        user_role_prompt=user_role_prompt
    )

    # Let the user know that the system is ready
    msg.content = f"Processing `{file.name}` is complete."
    await msg.update()
    msg.content = f"You can now ask questions about `{file.name}`."
    await msg.update()
    cl.user_session.set("chain", retrieval_augmented_qa_pipeline)


@cl.on_message
async def main(message):
    chain = cl.user_session.get("chain")

    msg = cl.Message(content="")
    result = await chain.arun_pipeline(message.content)

    async for stream_resp in result["response"]:
        await msg.stream_token(stream_resp)

    await msg.send()
```
utilities/constants.py
ADDED
```python
CHUNKING_STRATEGY_RECURSIVE = "recursive"
CHUNKING_STRATEGY_TABLE_AWARE = "table_aware"
CHUNKING_STRATEGY_SECTION_BASED = "section_based"
CHUNKING_STRATEGY_SEMANTIC = "semantic_based"
```
utilities/doc_utilities.py
ADDED
```python
from langchain_community.document_loaders import PyMuPDFLoader
import fitz
import os
import requests

from utilities.debugger import dprint
import uuid


def download_document(app_state, url, file_name, download_folder):
    file_path = os.path.join(download_folder, file_name)
    if not os.path.exists(download_folder):
        os.makedirs(download_folder)

    if not os.path.exists(file_path):
        print(f"Downloading {file_name} from {url}...")
        response = requests.get(url)
        if response.status_code == 200:
            with open(file_path, 'wb') as f:
                f.write(response.content)
        else:
            dprint(app_state, f"Failed to download document from {url}. Status code: {response.status_code}")
    else:
        dprint(app_state, f"{file_name} already exists locally.")
    return file_path

def get_documents(app_state):
    for url in app_state.document_urls:
        dprint(app_state, f"Downloading and loading document from {url}...")
        file_name = url.split("/")[-1]
        file_path = download_document(app_state, url, file_name, app_state.download_folder)
        loader = PyMuPDFLoader(file_path)
        loaded_document = loader.load()
        single_text_document = "\n".join([doc.page_content for doc in loaded_document])
        dprint(app_state, f"Number of pages: {len(loaded_document)}")
        # lets get titles and metadata
        pdf = fitz.open(file_path)
        metadata = pdf.metadata
        title = metadata.get('title', 'Document 1')

        document = {
            "url": url,
            "title": title,
            "metadata": metadata,
            "loaded_document": loaded_document,
            "single_text_document": single_text_document,
            "document_id": str(uuid.uuid4())
        }
        app_state.add_document(document)
        dprint(app_state, f"Title of Document: {title}")
        dprint(app_state, f"Full metadata for Document 1: {metadata}")
        pdf.close()
    print(f"Total documents: {len(app_state.documents)}")
```
utilities/get_documents.py
DELETED
```python
import requests
import os
from langchain.document_loaders import PyMuPDFLoader

# Define the URLs for the documents
url_1 = "https://example.com/Blueprint-for-an-AI-Bill-of-Rights.pdf"
url_2 = "https://example.com/NIST.AI.600-1.pdf"

# Define local file paths for storing the downloaded PDFs
file_path_1 = "data/Blueprint-for-an-AI-Bill-of-Rights.pdf"
file_path_2 = "data/NIST.AI.600-1.pdf"

# Function to download a file from a URL
def download_pdf(url, file_path):
    # Check if the file already exists to avoid re-downloading
    if not os.path.exists(file_path):
        print(f"Downloading {file_path} from {url}...")
        response = requests.get(url)
        with open(file_path, 'wb') as f:
            f.write(response.content)
    else:
        print(f"{file_path} already exists, skipping download.")

# Download the PDFs from the URLs
download_pdf(url_1, file_path_1)
download_pdf(url_2, file_path_2)

# Load the PDFs using PyMuPDFLoader
loader_1 = PyMuPDFLoader(file_path_1)
documents_1 = loader_1.load()

loader_2 = PyMuPDFLoader(file_path_2)
documents_2 = loader_2.load()
```
utilities/pipeline.py
DELETED
```python
from utilities_2.vectordatabase import VectorDatabase

class RetrievalAugmentedQAPipeline:
    def __init__(self, llm, vector_db_retriever: VectorDatabase,
                 system_role_prompt, user_role_prompt
                 ) -> None:
        self.llm = llm
        self.vector_db_retriever = vector_db_retriever
        self.system_role_prompt = system_role_prompt
        self.user_role_prompt = user_role_prompt

    async def arun_pipeline(self, user_query: str):
        context_list = self.vector_db_retriever.search_by_text(user_query, k=4)

        context_prompt = ""
        for context in context_list:
            context_prompt += context[0] + "\n"

        formatted_system_prompt = self.system_role_prompt.create_message()

        formatted_user_prompt = self.user_role_prompt.create_message(question=user_query, context=context_prompt)

        async def generate_response():
            async for chunk in self.llm.astream([formatted_system_prompt, formatted_user_prompt]):
                yield chunk

        return {"response": generate_response(), "context": context_list}
```
utilities/rag_utilities.py
DELETED
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import Qdrant
from langchain_openai.embeddings import OpenAIEmbeddings
import fitz
import io
import os
import requests
import tiktoken
from utilities.debugger import dprint
import uuid

def tiktoken_len(text):
    tokens = tiktoken.encoding_for_model("gpt-4o").encode(
        text,
    )
    return len(tokens)

def download_document(state, url, file_name, download_folder):
    file_path = os.path.join(download_folder, file_name)
    if not os.path.exists(download_folder):
        os.makedirs(download_folder)

    if not os.path.exists(file_path):
        print(f"Downloading {file_name} from {url}...")
        response = requests.get(url)
        if response.status_code == 200:
            with open(file_path, 'wb') as f:
                f.write(response.content)
        else:
            dprint(state, f"Failed to download document from {url}. Status code: {response.status_code}")
    else:
        dprint(state, f"{file_name} already exists locally.")
    return file_path

def get_documents(state):
    for url in state.document_urls:
        dprint(state, f"Downloading and loading document from {url}...")
        file_name = url.split("/")[-1]
        file_path = download_document(state, url, file_name, state.download_folder)
        loader = PyMuPDFLoader(file_path)
        loaded_document = loader.load()
        single_text_document = "\n".join([doc.page_content for doc in loaded_document])
        #state.add_loaded_document(loaded_document) # Append the loaded documents to the list
        #state.add_single_text_document(single_text_document)
        dprint(state, f"Number of pages: {len(loaded_document)}")
        # lets get titles and metadata
        pdf = fitz.open(file_path)
        metadata = pdf.metadata
        title = metadata.get('title', 'Document 1')
        #state.add_metadata(metadata)
        #state.add_title(title)
        document = {
            "url": url,
            "title": title,
            "metadata": metadata,
            "single_text_document": single_text_document,
            "document_id": str(uuid.uuid4())
        }
        state.add_document(document)
        dprint(state, f"Title of Document: {title}")
        dprint(state, f"Full metadata for Document 1: {metadata}")
        pdf.close()
    dprint(state, f"documents: {state.documents}")

def create_chunked_documents(state):
    get_documents(state)

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=state.chunk_size,
        chunk_overlap=state.chunk_overlap,
        length_function = tiktoken_len,
    )
    combined_document_objects = []
    dprint(state, "Chunking documents and creating document objects")
    for document in state.documents:
        dprint(state, f"processing documend: {document['title']}")
        text = document["single_text_document"]
        dprint(state, text)
        title = document["title"]
        document_id = document["document_id"]
        chunks_document = text_splitter.split_text(text)
        dprint(state, len(chunks_document))

        for chunk_number, chunk in enumerate(chunks_document, start=1):
            document_objects = Document(
                page_content=chunk,
                metadata={
                    "source": title,
                    "document_id": document.get("document_id", "default_id"),
                    "chunk_number": chunk_number  # Add unique chunk number
                }
            )
            combined_document_objects.append(document_objects)
    state.add_combined_document_objects(combined_document_objects)


def create_vector_store(state, **kwargs):
    for key, value in kwargs.items():
        if hasattr(state, key):
            setattr(state, key, value)
        else:
            print(f"Warning: {key} is not an attribute of the state object")

    # Rest of your create_vector_store logic
    print(f"Chunk size after update: {state.chunk_size}")

    create_chunked_documents(state)
    embedding_model = OpenAIEmbeddings(model=state.embedding_model)

    qdrant_vectorstore = Qdrant.from_documents(
        documents=state.combined_document_objects,
        embedding=embedding_model,
        location=":memory:"
    )
    qdrant_retriever = qdrant_vectorstore.as_retriever()
    state.set_retriever(qdrant_retriever)
    print("Vector store created")
    return qdrant_retriever
```
utilities/templates.py
ADDED
```python
from langchain_core.prompts import ChatPromptTemplate

def get_qa_prompt():

    system_template = """
    You are an expert at explaining technical documents to people.
    You are provided context below to answer the question.
    Only use the information provided below.
    If they do not ask a question, have a conversation with them and ask them if they have any questions
    If you cannot answer the question with the content below say 'I don't have enough information, sorry'
    The two documents are 'Blueprint for an AI Bill of Rights' and 'Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile'
    """
    human_template = """
    ===
    question:
    {question}

    ===
    context:
    {context}
    ===
    """
    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", system_template),
        ("human", human_template)
    ])
    return chat_prompt
```
utilities/text_utils.py
DELETED
```python
import os
from typing import List
import fitz  # pymupdf
import tempfile
from utilities_2.text_utils import CharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

# load the file
# handle .txt and .pdf

class FileLoader:

    def __init__(self, encoding: str = "utf-8"):
        self.documents = []
        self.encoding = encoding
        self.temp_file_path = ""


    def load_file(self, file, use_rct):
        if use_rct:
            text_splitter = MyRecursiveCharacterTextSplitter()
        else:
            text_splitter = CharacterTextSplitter()
        file_extension = os.path.splitext(file.name)[1].lower()

        with tempfile.NamedTemporaryFile(mode="wb", delete=False, suffix=file_extension) as temp_file:
            self.temp_file_path = temp_file.name
            temp_file.write(file.content)

        if os.path.isfile(self.temp_file_path):
            if self.temp_file_path.endswith(".txt"):
                self.load_text_file()
            elif self.temp_file_path.endswith(".pdf"):
                self.load_pdf_file()
            else:
                raise ValueError(
                    f"Unsupported file type: {self.temp_file_path}"
                )
            return text_splitter.split_text(self.documents)
        else:
            raise ValueError(
                "Not a file"
            )

    def load_text_file(self):
        with open(self.temp_file_path, "r", encoding=self.encoding) as f:
            self.documents.append(f.read())

    def load_pdf_file(self):
        # pymupdf
        pdf_document = fitz.open(self.temp_file_path)
        for page_num in range(len(pdf_document)):
            page = pdf_document.load_page(page_num)
            text = page.get_text()
            self.documents.append(text)

class CharacterTextSplitter:
    def __init__(
        self,
        chunk_size: int = 1000,
        chunk_overlap: int = 200,
    ):
        assert (
            chunk_size > chunk_overlap
        ), "Chunk size must be greater than chunk overlap"

        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def split(self, text: str) -> List[str]:
        chunks = []
        for i in range(0, len(text), self.chunk_size - self.chunk_overlap):
            chunks.append(text[i : i + self.chunk_size])
        return chunks

    def split_text(self, texts: List[str]) -> List[str]:
        chunks = []
        for text in texts:
            chunks.extend(self.split(text))
        return chunks


class MyRecursiveCharacterTextSplitter:
    # uses langChain.RecursiveCharacterTextSplitter
    def __init__(
        self
    ):
        self.RCTS = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=20,
            length_function=len,
            separators=["\n\n", "\n", " ", ""]
        )

    def split_text(self, texts: List[str]) -> List[str]:
        all_chunks = []
        for doc in texts:
            chunks = self.RCTS.split_text(doc)
            all_chunks.extend(chunks)
        return all_chunks
```
utilities/vector_database.py
DELETED
```python
import numpy as np
from collections import defaultdict
from typing import List, Tuple, Callable
from utilities_2.openai_utils.embedding import EmbeddingModel
import hashlib
from qdrant_client import QdrantClient
from qdrant_client.http.models import PointStruct
from qdrant_client.models import VectorParams
import uuid

def cosine_similarity(vector_a: np.array, vector_b: np.array) -> float:
    """Computes the cosine similarity between two vectors."""
    dot_product = np.dot(vector_a, vector_b)
    norm_a = np.linalg.norm(vector_a)
    norm_b = np.linalg.norm(vector_b)
    return dot_product / (norm_a * norm_b)

class QdrantDatabase:
    def __init__(self, embedding_model=None):
        self.qdrant_client = QdrantClient(location=":memory:")
        self.collection_name = "my_collection"
        self.embedding_model = embedding_model or EmbeddingModel(embeddings_model_name= "text-embedding-3-small", dimensions=1000)
        vector_params = VectorParams(
            size=self.embedding_model.dimensions,  # vector size
            distance="Cosine"
        )  # distance metric
        self.qdrant_client.create_collection(
            collection_name=self.collection_name,
            vectors_config={"text": vector_params},
        )
        self.vectors = defaultdict(np.array)  # Still keeps a local copy if needed

    def string_to_int_id(self, s: str) -> int:
        return int(hashlib.sha256(s.encode('utf-8')).hexdigest(), 16) % (10**8)

    def get_test_vector(self):
        retrieved_vector = self.qdrant_client.retrieve(
            collection_name="my_collection",
            ids=[self.string_to_int_id("test_key")]
        )
        return retrieved_vector

    def insert(self, key: str, vector: np.array) -> None:
        point_id = str(uuid.uuid4())
        payload = {"text": key}

        point = PointStruct(
            id=point_id,
            vector={"default": vector.tolist()},
            payload=payload
        )
        print(f"Inserting vector for key: {key}, ID: {point_id}")
        # Insert the vector into Qdrant with the associated document
        self.qdrant_client.upsert(
            collection_name=self.collection_name,
            points=[point]  # Qdrant expects a list of PointStruct
        )


    def search(
        self,
        query_vector: np.array,
        k: int=5,
        distance_measure: Callable = cosine_similarity,
    ) -> List[Tuple[str, float]]:
        # Perform search in Qdrant
        if isinstance(query_vector, np.ndarray):
            query_vector = query_vector.tolist()
        print(type(query_vector))
        search_results = self.qdrant_client.search(
            collection_name=self.collection_name,
            query_vector=query_vector,  # Pass the vector as a list
            limit=k
        )
        return [(result.payload['text'], result.score) for result in search_results]

    def search_by_text(
        self,
        query_text: str,
        k: int,
        distance_measure: Callable = cosine_similarity,
        return_as_text: bool = False,
    ) -> List[Tuple[str, float]]:

        query_vector = self.embedding_model.get_embedding(query_text)
        results = self.search(query_vector, k, distance_measure)
        return [result[0] for result in results] if return_as_text else results

    async def abuild_from_list(self, list_of_text: List[str]) -> "QdrantDatabase":
        from qdrant_client.http import models
        embeddings = await self.embedding_model.async_get_embeddings(list_of_text)
        points = [
            models.PointStruct(
                id=str(uuid.uuid4()),
                vector={"text": embedding},  # Should be a named vector as per vector_config
                payload={
                    "text": text
                }
            )
            for text, embedding in zip(list_of_text, embeddings)
        ]
        self.qdrant_client.upsert(
            collection_name=self.collection_name,
            points=points
        )
        return self
```
utilities/vector_utilities.py
ADDED
@@ -0,0 +1,203 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
from utilities.constants import (
    CHUNKING_STRATEGY_TABLE_AWARE,
    CHUNKING_STRATEGY_SECTION_BASED,
    CHUNKING_STRATEGY_SEMANTIC
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from langchain_community.vectorstores import Qdrant
from langchain_openai.embeddings import OpenAIEmbeddings
import numpy as np
import pdfplumber
import re
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import tiktoken
from utilities.debugger import dprint

def create_vector_store(app_state, model_run_state, **kwargs):
    for key, value in kwargs.items():
        if hasattr(model_run_state, key):
            setattr(model_run_state, key, value)
        else:
            print(f"Warning: {key} is not an attribute of the state object")

    # Rest of your create_vector_store logic
    dprint(app_state, f"Chunk size after update: {model_run_state.chunk_size}")
    create_chunked_documents(app_state, model_run_state)

    qdrant_vectorstore = Qdrant.from_documents(
        documents=model_run_state.combined_document_objects,
        embedding=model_run_state.embedding_model,
        location=":memory:"
    )
    qdrant_retriever = qdrant_vectorstore.as_retriever()
    model_run_state.retriever = qdrant_retriever
    print("Vector store created")

def tiktoken_len(text):
    tokens = tiktoken.encoding_for_model("gpt-4o").encode(
        text,
    )
    return len(tokens)

def create_chunked_documents(app_state, model_run_state):
    dprint(app_state, model_run_state.chunking_strategy)
    if model_run_state.chunking_strategy == CHUNKING_STRATEGY_TABLE_AWARE:
        combined_document_objects = chunk_with_table_aware(app_state, model_run_state)
    elif model_run_state.chunking_strategy == CHUNKING_STRATEGY_SECTION_BASED:
        combined_document_objects = chunk_with_section_based(app_state, model_run_state)
    elif model_run_state.chunking_strategy == CHUNKING_STRATEGY_SEMANTIC:
        combined_document_objects = chunk_with_semantic_splitter(app_state, model_run_state)
    else:
        combined_document_objects = chunk_with_recursive_splitter(app_state, model_run_state)
    model_run_state.combined_document_objects = combined_document_objects
    dprint(app_state, "Chunking completed successfully")


def chunk_with_recursive_splitter(app_state, model_run_state):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=model_run_state.chunk_size,
        chunk_overlap=model_run_state.chunk_overlap,
        length_function=tiktoken_len,
    )
    combined_document_objects = []
    dprint(app_state, "Chunking documents and creating document objects")
    for document in app_state.documents:
        dprint(app_state, f"processing document: {document['title']}")
        text = document["single_text_document"]
        dprint(app_state, text)
        title = document["title"]
        # document_id = document["document_id"]
        chunks_document = text_splitter.split_text(text)
        dprint(app_state, len(chunks_document))

        for chunk_number, chunk in enumerate(chunks_document, start=1):
            document_objects = Document(
                page_content=chunk,
                metadata={
                    "source": title,
                    "document_id": document.get("document_id", "default_id"),
                    "chunk_number": chunk_number  # Add unique chunk number
                }
            )
            combined_document_objects.append(document_objects)
    return combined_document_objects

def chunk_with_table_aware(app_state, model_run_state):
    combined_document_objects = []
    dprint(app_state, "Using Table-Aware Chunking for documents.")

    for document in app_state.documents:
        title = document["title"]
        text = document["single_text_document"]

        # Check if document is a PDF and contains tables
        if document.get("is_pdf", False):
            with pdfplumber.open(document["file_path"]) as pdf:
                for page in pdf.pages:
                    tables = page.extract_tables()
                    for table in tables:
                        table_content = "\n".join([str(row) for row in table])
                        document_objects = Document(
                            page_content=table_content,
                            metadata={
                                "source": title,
                                "document_id": document.get("document_id", "default_id"),
                                "chunk_number": "table"
                            }
                        )
                        combined_document_objects.append(document_objects)

        # Chunk the rest of the text
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=model_run_state.chunk_size, chunk_overlap=model_run_state.chunk_overlap)
        chunks_document = text_splitter.split_text(text)

        for chunk_number, chunk in enumerate(chunks_document, start=1):
            document_objects = Document(
                page_content=chunk,
                metadata={
                    "source": title,
                    "document_id": document.get("document_id", "default_id"),
                    "chunk_number": chunk_number
                }
            )
            combined_document_objects.append(document_objects)

    return combined_document_objects


def chunk_with_section_based(app_state, model_run_state):
    combined_document_objects = []
    dprint(app_state, "Using Section-Based Chunking for documents.")

    for document in app_state.documents:
        text = document["single_text_document"]
        title = document["title"]

        # Split the text by headings
        sections = re.split(r"\n[A-Z].+?\n", text)

        # Chunk each section
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=model_run_state.chunk_size, chunk_overlap=model_run_state.chunk_overlap)
        for section_number, section in enumerate(sections, start=1):
            chunks_document = text_splitter.split_text(section)
            for chunk_number, chunk in enumerate(chunks_document, start=1):
                document_objects = Document(
                    page_content=chunk,
                    metadata={
                        "source": title,
                        "document_id": document.get("document_id", "default_id"),
                        "section_number": section_number,
                        "chunk_number": chunk_number
                    }
                )
                combined_document_objects.append(document_objects)

    return combined_document_objects


def chunk_with_semantic_splitter(app_state, model_run_state):
    # Load pre-trained model for embeddings
    model = SentenceTransformer('all-MiniLM-L6-v2')

    combined_document_objects = []
    dprint(app_state, "Using Semantic-Based Chunking for documents.")

    for document in app_state.documents:
        text = document["single_text_document"]
        title = document["title"]

        # Split text into sentences or paragraphs
        sentences = text.split(". ")  # Simple split by sentence (you can refine this)
        sentence_embeddings = model.encode(sentences)

        # Group sentences into chunks based on semantic similarity
        chunks = []
        current_chunk = []
        for i in range(len(sentences) - 1):
            current_chunk.append(sentences[i])
            # Calculate similarity between consecutive sentences
            sim = cosine_similarity([sentence_embeddings[i]], [sentence_embeddings[i + 1]])[0][0]
            if sim < 0.7 or len(current_chunk) >= model_run_state.chunk_size:
                # If similarity is below threshold or chunk size is reached, start a new chunk
                chunks.append(" ".join(current_chunk))
                current_chunk = []

        # Add the final chunk
        if current_chunk:
            chunks.append(" ".join(current_chunk))

        # Create document objects for the chunks
        for chunk_number, chunk in enumerate(chunks, start=1):
            document_objects = Document(
                page_content=chunk,
                metadata={
                    "source": title,
                    "document_id": document.get("document_id", "default_id"),
                    "chunk_number": chunk_number
                }
            )
            combined_document_objects.append(document_objects)

    return combined_document_objects
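The new module is driven entirely through `create_vector_store`; everything else is dispatched from `create_chunked_documents` based on `model_run_state.chunking_strategy`. A rough sketch of a call is below. The state objects are stand-ins with only the attributes these functions read (the real classes live in `classes/app_state.py` and `classes/model_run_state.py`), and the `debug` flag for `dprint` is assumed.

```python
# Hypothetical driver; attribute names are inferred from what the functions read.
from types import SimpleNamespace
from langchain_openai.embeddings import OpenAIEmbeddings
from utilities.vector_utilities import create_vector_store

app_state = SimpleNamespace(
    debug=True,  # assumed flag consumed by dprint
    documents=[{
        "title": "example.txt",
        "single_text_document": "First sentence. Second sentence. Third sentence.",
        "document_id": "doc-1",
    }],
)
model_run_state = SimpleNamespace(
    chunk_size=200,
    chunk_overlap=20,
    chunking_strategy="recursive",  # anything unrecognised falls back to the recursive splitter
    embedding_model=OpenAIEmbeddings(model="text-embedding-3-small"),
    combined_document_objects=None,
    retriever=None,
)

# Keyword overrides are copied onto model_run_state before chunking begins.
create_vector_store(app_state, model_run_state, chunk_size=100)
docs = model_run_state.retriever.get_relevant_documents("second sentence")
```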
utilities_2/__init__.py
DELETED
File without changes
utilities_2/openai_utils/__init__.py
DELETED
File without changes
utilities_2/openai_utils/chatmodel.py
DELETED
@@ -1,45 +0,0 @@
from openai import OpenAI, AsyncOpenAI
from dotenv import load_dotenv
import os

load_dotenv()


class ChatOpenAI:
    def __init__(self, model_name: str = "gpt-4o-mini"):
        self.model_name = model_name
        self.openai_api_key = os.getenv("OPENAI_API_KEY")
        if self.openai_api_key is None:
            raise ValueError("OPENAI_API_KEY is not set")

    def run(self, messages, text_only: bool = True, **kwargs):
        if not isinstance(messages, list):
            raise ValueError("messages must be a list")

        client = OpenAI()
        response = client.chat.completions.create(
            model=self.model_name, messages=messages, **kwargs
        )

        if text_only:
            return response.choices[0].message.content

        return response

    async def astream(self, messages, **kwargs):
        if not isinstance(messages, list):
            raise ValueError("messages must be a list")

        client = AsyncOpenAI()

        stream = await client.chat.completions.create(
            model=self.model_name,
            messages=messages,
            stream=True,
            **kwargs
        )

        async for chunk in stream:
            content = chunk.choices[0].delta.content
            if content is not None:
                yield content
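The removed wrapper accepts plain role/content message dictionaries, the same shape the prompt classes further down produce. A minimal, illustrative example of its two call styles (it requires OPENAI_API_KEY in the environment, as the constructor enforces):

```python
import asyncio

chat = ChatOpenAI(model_name="gpt-4o-mini")
messages = [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user", "content": "Say hello."},
]

print(chat.run(messages))  # blocking call; returns the text of the first choice

async def stream_demo():
    # astream yields content deltas as they arrive
    async for token in chat.astream(messages):
        print(token, end="", flush=True)

asyncio.run(stream_demo())
```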
utilities_2/openai_utils/embedding.py
DELETED
@@ -1,60 +0,0 @@
from dotenv import load_dotenv
from openai import AsyncOpenAI, OpenAI
import openai
from typing import List
import os
import asyncio


class EmbeddingModel:
    def __init__(self, embeddings_model_name: str = "text-embedding-3-small", dimensions: int = None):
        load_dotenv()
        self.openai_api_key = os.getenv("OPENAI_API_KEY")
        self.async_client = AsyncOpenAI()
        self.client = OpenAI()
        self.dimensions = dimensions

        if self.openai_api_key is None:
            raise ValueError(
                "OPENAI_API_KEY environment variable is not set. Please set it to your OpenAI API key."
            )
        openai.api_key = self.openai_api_key
        self.embeddings_model_name = embeddings_model_name

    async def async_get_embeddings(self, list_of_text: List[str]) -> List[List[float]]:
        embedding_response = await self.async_client.embeddings.create(
            input=list_of_text, model=self.embeddings_model_name, dimensions=self.dimensions
        )

        return [embeddings.embedding for embeddings in embedding_response.data]

    async def async_get_embedding(self, text: str) -> List[float]:
        embedding = await self.async_client.embeddings.create(
            input=text, model=self.embeddings_model_name, dimensions=self.dimensions
        )

        return embedding.data[0].embedding

    def get_embeddings(self, list_of_text: List[str]) -> List[List[float]]:
        embedding_response = self.client.embeddings.create(
            input=list_of_text, model=self.embeddings_model_name, dimensions=self.dimensions
        )

        return [embeddings.embedding for embeddings in embedding_response.data]

    def get_embedding(self, text: str) -> List[float]:
        embedding = self.client.embeddings.create(
            input=text, model=self.embeddings_model_name, dimensions=self.dimensions
        )

        return embedding.data[0].embedding


if __name__ == "__main__":
    embedding_model = EmbeddingModel()
    print(asyncio.run(embedding_model.async_get_embedding("Hello, world!")))
    print(
        asyncio.run(
            embedding_model.async_get_embeddings(["Hello, world!", "Goodbye, world!"])
        )
    )
utilities_2/openai_utils/prompts.py
DELETED
@@ -1,78 +0,0 @@
import re


class BasePrompt:
    def __init__(self, prompt):
        """
        Initializes the BasePrompt object with a prompt template.

        :param prompt: A string that can contain placeholders within curly braces
        """
        self.prompt = prompt
        self._pattern = re.compile(r"\{([^}]+)\}")

    def format_prompt(self, **kwargs):
        """
        Formats the prompt string using the keyword arguments provided.

        :param kwargs: The values to substitute into the prompt string
        :return: The formatted prompt string
        """
        matches = self._pattern.findall(self.prompt)
        return self.prompt.format(**{match: kwargs.get(match, "") for match in matches})

    def get_input_variables(self):
        """
        Gets the list of input variable names from the prompt string.

        :return: List of input variable names
        """
        return self._pattern.findall(self.prompt)


class RolePrompt(BasePrompt):
    def __init__(self, prompt, role: str):
        """
        Initializes the RolePrompt object with a prompt template and a role.

        :param prompt: A string that can contain placeholders within curly braces
        :param role: The role for the message ('system', 'user', or 'assistant')
        """
        super().__init__(prompt)
        self.role = role

    def create_message(self, format=True, **kwargs):
        """
        Creates a message dictionary with a role and a formatted message.

        :param kwargs: The values to substitute into the prompt string
        :return: Dictionary containing the role and the formatted message
        """
        if format:
            return {"role": self.role, "content": self.format_prompt(**kwargs)}

        return {"role": self.role, "content": self.prompt}


class SystemRolePrompt(RolePrompt):
    def __init__(self, prompt: str):
        super().__init__(prompt, "system")


class UserRolePrompt(RolePrompt):
    def __init__(self, prompt: str):
        super().__init__(prompt, "user")


class AssistantRolePrompt(RolePrompt):
    def __init__(self, prompt: str):
        super().__init__(prompt, "assistant")


if __name__ == "__main__":
    prompt = BasePrompt("Hello {name}, you are {age} years old")
    print(prompt.format_prompt(name="John", age=30))

    prompt = SystemRolePrompt("Hello {name}, you are {age} years old")
    print(prompt.create_message(name="John", age=30))
    print(prompt.get_input_variables())
utilities_2/text_utils.py
DELETED
@@ -1,75 +0,0 @@
import os
from typing import List

class TextFileLoader:
    def __init__(self, path: str, encoding: str = "utf-8"):
        self.documents = []
        self.path = path
        self.encoding = encoding

    def load(self):
        if os.path.isdir(self.path):
            self.load_directory()
        elif os.path.isfile(self.path) and self.path.endswith(".txt"):
            self.load_file()
        else:
            raise ValueError(
                "Provided path is neither a valid directory nor a .txt file."
            )

    def load_file(self):
        with open(self.path, "r", encoding=self.encoding) as f:
            self.documents.append(f.read())

    def load_directory(self):
        for root, _, files in os.walk(self.path):
            for file in files:
                if file.endswith(".txt"):
                    with open(
                        os.path.join(root, file), "r", encoding=self.encoding
                    ) as f:
                        self.documents.append(f.read())

    def load_documents(self):
        self.load()
        return self.documents


class CharacterTextSplitter:
    def __init__(
        self,
        chunk_size: int = 1000,
        chunk_overlap: int = 200,
    ):
        assert (
            chunk_size > chunk_overlap
        ), "Chunk size must be greater than chunk overlap"

        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def split(self, text: str) -> List[str]:
        chunks = []
        for i in range(0, len(text), self.chunk_size - self.chunk_overlap):
            chunks.append(text[i : i + self.chunk_size])
        return chunks

    def split_text(self, texts: List[str]) -> List[str]:
        chunks = []
        for text in texts:
            chunks.extend(self.split(text))
        return chunks

if __name__ == "__main__":
    loader = TextFileLoader("data/KingLear.txt")
    loader.load()
    splitter = CharacterTextSplitter()
    chunks = splitter.split_text(loader.documents)
    print(len(chunks))
    print(chunks[0])
    print("--------")
    print(chunks[1])
    print("--------")
    print(chunks[-2])
    print("--------")
    print(chunks[-1])
utilities_2/vectordatabase.py
DELETED
@@ -1,82 +0,0 @@
import numpy as np
from collections import defaultdict
from typing import List, Tuple, Callable
from utilities_2.openai_utils.embedding import EmbeddingModel
import asyncio


def cosine_similarity(vector_a: np.array, vector_b: np.array) -> float:
    """Computes the cosine similarity between two vectors."""
    dot_product = np.dot(vector_a, vector_b)
    norm_a = np.linalg.norm(vector_a)
    norm_b = np.linalg.norm(vector_b)
    return dot_product / (norm_a * norm_b)


class VectorDatabase:
    def __init__(self, embedding_model: EmbeddingModel = None):
        self.vectors = defaultdict(np.array)
        self.embedding_model = embedding_model or EmbeddingModel()

    def insert(self, key: str, vector: np.array) -> None:
        self.vectors[key] = vector

    def search(
        self,
        query_vector: np.array,
        k: int,
        distance_measure: Callable = cosine_similarity,
    ) -> List[Tuple[str, float]]:
        scores = [
            (key, distance_measure(query_vector, vector))
            for key, vector in self.vectors.items()
        ]
        return sorted(scores, key=lambda x: x[1], reverse=True)[:k]

    def search_by_text(
        self,
        query_text: str,
        k: int,
        distance_measure: Callable = cosine_similarity,
        return_as_text: bool = False,
    ) -> List[Tuple[str, float]]:
        query_vector = self.embedding_model.get_embedding(query_text)
        results = self.search(query_vector, k, distance_measure)
        return [result[0] for result in results] if return_as_text else results

    def retrieve_from_key(self, key: str) -> np.array:
        return self.vectors.get(key, None)

    async def abuild_from_list(self, list_of_text: List[str]) -> "VectorDatabase":
        embeddings = await self.embedding_model.async_get_embeddings(list_of_text)
        for text, embedding in zip(list_of_text, embeddings):
            self.insert(text, np.array(embedding))
        return self


if __name__ == "__main__":
    list_of_text = [
        "I like to eat broccoli and bananas.",
        "I ate a banana and spinach smoothie for breakfast.",
        "Chinchillas and kittens are cute.",
        "My sister adopted a kitten yesterday.",
        "Look at this cute hamster munching on a piece of broccoli.",
    ]

    vector_db = VectorDatabase()
    vector_db = asyncio.run(vector_db.abuild_from_list(list_of_text))
    k = 2

    searched_vector = vector_db.search_by_text("I think fruit is awesome!", k=k)
    print(f"Closest {k} vector(s):", searched_vector)

    retrieved_vector = vector_db.retrieve_from_key(
        "I like to eat broccoli and bananas."
    )
    print("Retrieved vector:", retrieved_vector)

    relevant_texts = vector_db.search_by_text(
        "I think fruit is awesome!", k=k, return_as_text=True
    )
    print(f"Closest {k} text(s):", relevant_texts)