rchrdgwr committed on
Commit
c6907ac
·
1 Parent(s): 8aeabcb

Updated application

BuildingAChainlitApp.md DELETED
@@ -1,312 +0,0 @@
1
- # Building a Chainlit App
2
-
3
- What if we want to take our Week 1 Day 2 assignment - [Pythonic RAG](https://github.com/AI-Maker-Space/AIE4/tree/main/Week%201/Day%202) - and bring it out of the notebook?
4
-
5
- Well - we'll cover exactly that here!
6
-
7
- ## Anatomy of a Chainlit Application
8
-
9
- [Chainlit](https://docs.chainlit.io/get-started/overview) is a Python package similar to Streamlit that lets users write a backend and a front end in a single (or multiple) Python file(s). It is mainly used for prototyping LLM-based Chat Style Applications - though it is used in production in some settings with 1,000,000s of MAUs (Monthly Active Users).
10
-
11
- The primary method of customizing and interacting with the Chainlit UI is through a few critical [decorators](https://blog.hubspot.com/website/decorators-in-python).
12
-
13
- > NOTE: Simply put, the decorators (in Chainlit) are just ways we can "plug in" to the functionality in Chainlit.
14
-
15
- We'll be concerning ourselves with three main scopes:
16
-
17
- 1. On application start - when we start the Chainlit application with a command like `chainlit run app.py`
18
- 2. On chat start - when a chat session starts (a user opens the web browser to the address hosting the application)
19
- 3. On message - when the user sends a message through the input text box in the Chainlit UI
20
-
21
- Let's dig into each scope and see what we're doing!
22
-
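To make these three scopes concrete, here is a minimal, illustrative Chainlit skeleton (this is not the repo's `app.py`; the handler names and the `message_count` key are invented for the example):

```python
import chainlit as cl

# Application scope: module-level code runs once, when `chainlit run app.py` starts.
print("Application started")

@cl.on_chat_start
async def on_chat_start():
    # Chat-session scope: runs each time a user opens (or refreshes) the chat page.
    cl.user_session.set("message_count", 0)

@cl.on_message
async def on_message(message: cl.Message):
    # Message scope: runs for every message typed into the input box.
    count = cl.user_session.get("message_count") + 1
    cl.user_session.set("message_count", count)
    await cl.Message(content=f"Message #{count}: {message.content}").send()
```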
23
- ## On Application Start:
24
-
25
- The first thing you'll notice is that we have the traditional "wall of imports" - this is to ensure we have everything we need to run our application.
26
-
27
- ```python
28
- import os
29
- from typing import List
30
- from chainlit.types import AskFileResponse
31
- from utilities_2.text_utils import CharacterTextSplitter, TextFileLoader
32
- from utilities_2.openai_utils.prompts import (
33
- UserRolePrompt,
34
- SystemRolePrompt,
35
- AssistantRolePrompt,
36
- )
37
- from utilities_2.openai_utils.embedding import EmbeddingModel
38
- from utilities_2.vectordatabase import VectorDatabase
39
- from utilities_2.openai_utils.chatmodel import ChatOpenAI
40
- import chainlit as cl
41
- ```
42
-
43
- Next up, we have some prompt templates. As all sessions will use the same prompt templates without modification, and we don't need these templates to be specific to each session, we can set them up here - at the application scope.
44
-
45
- ```python
46
- system_template = """\
47
- Use the following context to answer a users question. If you cannot find the answer in the context, say you don't know the answer."""
48
- system_role_prompt = SystemRolePrompt(system_template)
49
-
50
- user_prompt_template = """\
51
- Context:
52
- {context}
53
-
54
- Question:
55
- {question}
56
- """
57
- user_role_prompt = UserRolePrompt(user_prompt_template)
58
- ```
59
-
60
- > NOTE: You'll notice that these are the exact same prompt templates we used from the Pythonic RAG Notebook in Week 1 Day 2!
61
-
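As a quick, hedged sketch of how these templates are rendered later (it mirrors the `create_message` calls made by the pipeline below; the question and context strings are placeholders):

```python
# Assumes the system_role_prompt and user_role_prompt objects defined just above.
formatted_system_prompt = system_role_prompt.create_message()
formatted_user_prompt = user_role_prompt.create_message(
    question="What does the document say about evaluation?",    # placeholder question
    context="...chunks retrieved from the vector database...",  # placeholder context
)
```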
62
- Following that - we can create the Python Class definition for our RAG pipeline - or *chain*, as we'll refer to it in the rest of this walkthrough.
63
-
64
- Let's look at the definition first:
65
-
66
- ```python
67
- class RetrievalAugmentedQAPipeline:
68
- def __init__(self, llm: ChatOpenAI, vector_db_retriever: VectorDatabase) -> None:
69
- self.llm = llm
70
- self.vector_db_retriever = vector_db_retriever
71
-
72
- async def arun_pipeline(self, user_query: str):
73
- ### RETRIEVAL
74
- context_list = self.vector_db_retriever.search_by_text(user_query, k=4)
75
-
76
- context_prompt = ""
77
- for context in context_list:
78
- context_prompt += context[0] + "\n"
79
-
80
- ### AUGMENTED
81
- formatted_system_prompt = system_role_prompt.create_message()
82
-
83
- formatted_user_prompt = user_role_prompt.create_message(question=user_query, context=context_prompt)
84
-
85
-
86
- ### GENERATION
87
- async def generate_response():
88
- async for chunk in self.llm.astream([formatted_system_prompt, formatted_user_prompt]):
89
- yield chunk
90
-
91
- return {"response": generate_response(), "context": context_list}
92
- ```
93
-
94
- Notice a few things:
95
-
96
- 1. We have modified this `RetrievalAugmentedQAPipeline` from the initial notebook to support streaming.
97
- 2. In essence, our pipeline is *chaining* a few events together:
98
- 1. We take our user query, and chain it into our Vector Database to collect related chunks
99
- 2. We take those contexts and our user's questions and chain them into the prompt templates
100
- 3. We take that prompt template and chain it into our LLM call
101
- 4. We chain the response of the LLM call to the user
102
- 3. We are using a lot of `async` again! (See the small streaming sketch below if the async-generator pattern is unfamiliar.)
103
-
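If the `async` generator pattern is new to you, here is a tiny, self-contained illustration (the fake token list below is obviously a stand-in for a real LLM stream) of how a caller consumes chunks as they are produced:

```python
import asyncio

async def fake_llm_stream(prompt: str):
    # Stand-in for self.llm.astream(...): yields one chunk at a time.
    for token in ["The ", "answer ", "is ", "42."]:
        await asyncio.sleep(0.05)  # simulate per-chunk network latency
        yield token

async def main():
    async for chunk in fake_llm_stream("any prompt"):
        print(chunk, end="", flush=True)  # a UI could render each chunk immediately

asyncio.run(main())
```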
104
- Now, we're going to create a helper function for processing uploaded text files.
105
-
106
- First, we'll instantiate a shared `CharacterTextSplitter`.
107
-
108
- ```python
109
- text_splitter = CharacterTextSplitter()
110
- ```
111
-
112
- Now we can define our helper.
113
-
114
- ```python
115
- def process_text_file(file: AskFileResponse):
116
- import tempfile
117
-
118
- with tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".txt") as temp_file:
119
- temp_file_path = temp_file.name
120
-
121
- with open(temp_file_path, "wb") as f:
122
- f.write(file.content)
123
-
124
- text_loader = TextFileLoader(temp_file_path)
125
- documents = text_loader.load_documents()
126
- texts = text_splitter.split_text(documents)
127
- return texts
128
- ```
129
-
130
- Simply put, this saves the uploaded file to a temp file, loads it with `TextFileLoader`, splits it with our `TextSplitter`, and returns the resulting list of strings!
131
-
132
- <div style="border: 2px solid white; padding: 10px; border-radius: 5px; background-color: black; padding: 10px;">
133
- ### QUESTION #1:
134
-
135
- Why do we want to support streaming? What about streaming is important, or useful?
136
-
137
- ### ANSWER #1:
138
-
139
- Streaming is the continuous transmission of the data from the model to the UI. Instead of waiting and batching up the response into a single
140
- large message, the response is sent in pieces (streams) as it is created.
141
-
142
- The advantages of streaming:
143
- - quicker initial response - the user sees the first part of the answer sooner
144
- - it is easier to identify when the results are incorrect and terminate the request early
145
- - it is a more natural mode of communication for humans
146
- - better handling of large data, without requiring complex caching
147
- - essential for real-time processing
148
- - humans can only read so fast, so it's an advantage to get some of the data earlier
149
-
150
- </div>
151
-
152
- ## On Chat Start:
153
-
154
- The next scope is where "the magic happens". On Chat Start is when a user begins a chat session. This will happen whenever a user opens a new chat window, or refreshes an existing chat window.
155
-
156
- You'll see that our code is set up to immediately show the user a chat box requesting that they upload a file.
157
-
158
- ```python
159
- while files == None:
160
- files = await cl.AskFileMessage(
161
- content="Please upload a Text File file to begin!",
162
- accept=["text/plain"],
163
- max_size_mb=2,
164
- timeout=180,
165
- ).send()
166
- ```
167
-
168
- Once we've obtained the text file - we'll use our processing helper function to process our text!
169
-
170
- After we have processed our text file - we'll need to create a `VectorDatabase` and populate it with our processed chunks and their related embeddings!
171
-
172
- ```python
173
- vector_db = VectorDatabase()
174
- vector_db = await vector_db.abuild_from_list(texts)
175
- ```
176
-
177
- Once we have that piece completed - we can create the chain we'll be using to respond to user queries!
178
-
179
- ```python
180
- retrieval_augmented_qa_pipeline = RetrievalAugmentedQAPipeline(
181
- vector_db_retriever=vector_db,
182
- llm=chat_openai
183
- )
184
- ```
185
-
186
- Now, we'll save that into our user session!
187
-
188
- > NOTE: Chainlit has some great documentation about [User Session](https://docs.chainlit.io/concepts/user-session).
189
-
190
- <div style="border: 2px solid white; padding: 10px; border-radius: 5px; background-color: black; padding: 10px;">
191
-
192
- ### QUESTION #2:
193
-
194
- Why are we using User Session here? What about Python makes us need to use this? Why not just store everything in a global variable?
195
-
196
- ### ANSWER #2:
197
- The application will hopefully be used by many people at the same time. If the data were stored in a global variable,
198
- it would be accessed by everyone using the application, so every time someone started a new session the information
199
- would be overwritten - meaning everyone would essentially get the same results, unless only one person used the system
200
- at a time.
201
-
202
- So the goal is to keep each user's session information separate from every other user's. The Chainlit User Session
203
- provides the capability of storing each user's data separately.
204
- </div>
205
-
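A small, purely illustrative contrast (no Chainlit involved, names invented) of why a single global gets clobbered when several sessions run at once, while per-session storage stays isolated:

```python
global_store = {"vector_db": None}   # one slot shared by every session
per_session_store = {}               # one entry per session id, analogous to cl.user_session

def start_session(session_id: str, uploaded_file: str):
    global_store["vector_db"] = f"index built from {uploaded_file}"      # overwrites the previous user's index
    per_session_store[session_id] = f"index built from {uploaded_file}"  # isolated per session

start_session("alice", "alice_notes.txt")
start_session("bob", "bob_report.pdf")

print(global_store["vector_db"])    # 'index built from bob_report.pdf' - Alice's index is gone
print(per_session_store["alice"])   # 'index built from alice_notes.txt' - still intact
```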
206
- ## On Message
207
-
208
- First, we load our chain from the user session:
209
-
210
- ```python
211
- chain = cl.user_session.get("chain")
212
- ```
213
-
214
- Then, we run the chain on the content of the message - and stream it to the front end - that's it!
215
-
216
- ```python
217
- msg = cl.Message(content="")
218
- result = await chain.arun_pipeline(message.content)
219
-
220
- async for stream_resp in result["response"]:
221
- await msg.stream_token(stream_resp)
222
- ```
223
-
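One detail the snippet above leaves out: once the loop finishes, the handler in `old_app.py` (shown further down in this commit) finalizes the streamed message:

```python
async for stream_resp in result["response"]:
    await msg.stream_token(stream_resp)

await msg.send()  # finalize the streamed message once all tokens have arrived
```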
224
- ## 🎉
225
-
226
- With that - you've created a Chainlit application that brings our Pythonic RAG notebook out of the notebook and into a deployable app!
227
-
228
- ## 🚧 CHALLENGE MODE 🚧
229
-
230
- For an extra challenge - modify the behaviour of your application by integrating changes you made to your Pythonic RAG notebook (using new retrieval methods, etc.)
231
-
232
- If you're still looking for a challenge, or didn't make any modifications to your Pythonic RAG notebook:
233
-
234
- 1) Allow users to upload PDFs (this will require you to build a PDF parser as well)
235
- 2) Modify the VectorStore to leverage [Qdrant](https://python-client.qdrant.tech/)
236
-
237
- > NOTE: The motivation for these challenges is simple - the beginning of the course is extremely information dense, and people come from all kinds of different technical backgrounds. In order to ensure that all learners are able to engage with the content confidently and comfortably, we want to focus on the basic units of technical competency required. This can leave some learners, who came in with more robust technical skills, finding the introductory material too simple - and these open-ended challenges are there to keep them engaged!
238
-
239
- ## Support PDF documents
240
-
241
- Code was modified to support PDF documents in the following areas:
242
-
243
- 1) Change to the request for documents in on_chat_start:
244
-
245
- - changed the message to ask for .txt or .pdf file
246
- - changed the acceptable file formats so that the pdf documents are included in the select pop up
247
-
248
- ```python
249
- while not files:
250
- files = await cl.AskFileMessage(
251
- content="Please upload a .txt or .pdf file to begin processing!",
252
- accept=["text/plain", "application/pdf"],
253
- max_size_mb=2,
254
- timeout=180,
255
- ).send()
256
- ```
257
-
258
- 2) Change to the process_text_file() function to handle .pdf files
259
-
260
- - refactor the code to do all file handling in utilities.text_utils
261
- - app calls process_file, optionally passing in the text splitter function
262
- - default text splitter function is CharacterTextSplitter
263
- ```python
264
- texts = process_file(file)
265
- ```
266
- - the load_file() function does the following:
267
- - read the uploaded document into a temporary file
268
- - identify the file extension
269
- - process a .txt file as before, resulting in the texts list
270
- - if the file is a .pdf, use the PyMuPDF library to read each page, extract the text, and add it to the texts list
271
- - use the passed-in text splitter function to split the documents
272
-
273
- ```python
274
- def load_file(self, file, text_splitter=CharacterTextSplitter()):
275
- file_extension = os.path.splitext(file.name)[1].lower()
276
- with tempfile.NamedTemporaryFile(mode="wb", delete=False, suffix=file_extension) as temp_file:
277
- self.temp_file_path = temp_file.name
278
- temp_file.write(file.content)
279
-
280
- if os.path.isfile(self.temp_file_path):
281
- if self.temp_file_path.endswith(".txt"):
282
- self.load_text_file()
283
- elif self.temp_file_path.endswith(".pdf"):
284
- self.load_pdf_file()
285
- else:
286
- raise ValueError(
287
- f"Unsupported file type: {self.temp_file_path}"
288
- )
289
- return text_splitter.split_text(self.documents)
290
- else:
291
- raise ValueError(
292
- "Not a file"
293
- )
294
-
295
- def load_text_file(self):
296
- with open(self.temp_file_path, "r", encoding=self.encoding) as f:
297
- self.documents.append(f.read())
298
-
299
- def load_pdf_file(self):
300
-
301
- pdf_document = fitz.open(self.temp_file_path)
302
- for page_num in range(len(pdf_document)):
303
- page = pdf_document.load_page(page_num)
304
- text = page.get_text()
305
- self.documents.append(text)
306
- ```
307
-
308
- 3) Test the handling of .pdf and .txt files
309
-
310
- Several different .pdf and .txt files were successfully uploaded and processed by the app
311
-
312
-
 
 
app.py CHANGED
@@ -1,14 +1,16 @@
 
1
  import os
 
 
2
  from dotenv import load_dotenv
3
- import chainlit as cl
4
  from langchain_openai import ChatOpenAI
5
- from langchain.prompts import PromptTemplate
6
- from utilities.rag_utilities import create_vector_store
7
- from langchain_core.prompts import ChatPromptTemplate
8
  from operator import itemgetter
9
- from langchain.schema.output_parser import StrOutputParser
10
- from langchain.schema.runnable import RunnablePassthrough
11
- from classes.app_state import AppState
12
 
13
  document_urls = [
14
  "https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf",
@@ -21,56 +23,39 @@ load_dotenv()
21
  # Get the OpenAI API key from environment variables
22
  openai_api_key = os.getenv("OPENAI_API_KEY")
23
 
24
- # Setup our state
25
- state = AppState()
26
- state.set_debug(False)
 
27
 
28
- state.set_document_urls(document_urls)
29
 
30
- state.set_llm_model("gpt-3.5-turbo")
31
- state.set_embedding_model("text-embedding-3-small")
32
- state.set_chunk_size(1000)
33
- state.set_chunk_overlap(100)
34
 
35
- # Initialize the OpenAI LLM using LangChain
36
- llm = ChatOpenAI(model=state.llm_model, openai_api_key=openai_api_key)
37
- state.set_main_llm(llm)
38
 
39
- qdrant_retriever = create_vector_store(state)
 
 
40
 
41
- system_template = """
42
- You are an expert at explaining technical documents to people.
43
- You are provided context below to answer the question.
44
- Only use the information provided below.
45
- If they do not ask a question, have a conversation with them and ask them if they have any questions
46
- If you cannot answer the question with the content below say 'I don't have enough information, sorry'
47
- The two documents are 'Blueprint for an AI Bill of Rights' and 'Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile'
48
- """
49
- human_template = """
50
- ===
51
- question:
52
- {question}
53
-
54
- ===
55
- context:
56
- {context}
57
- ===
58
- """
59
- chat_prompt = ChatPromptTemplate.from_messages([
60
- ("system", system_template),
61
- ("human", human_template)
62
- ])
63
- # create the chain
64
- openai_chat_model = ChatOpenAI(model="gpt-4o")
65
 
 
66
 
 
67
 
68
  retrieval_augmented_qa_chain = (
69
- {"context": itemgetter("question") | qdrant_retriever, "question": itemgetter("question")}
70
  | RunnablePassthrough.assign(context=itemgetter("context"))
71
-
72
-
73
- | {"response": chat_prompt | openai_chat_model, "context": itemgetter("context")}
74
  )
75
 
76
  opening_content = """
@@ -116,19 +101,18 @@ async def main(message):
116
 
117
  await cl.Message(content=context_msg).send()
118
 
119
- for doc in context_documents:
120
- document_title = doc.metadata.get("source", "Unknown Document")
121
- document_id = doc.metadata.get("document_id", "Unknown ID")
122
- chunk_number = doc.metadata.get("chunk_number", "Unknown Chunk")
123
-
124
- document_context = doc.page_content.strip()
125
- truncated_context = document_context[:MAX_PREVIEW_LENGTH] + ("..." if len(document_context) > MAX_PREVIEW_LENGTH else "")
126
- print("----------------------------------------")
127
- print(truncated_context)
128
-
129
- await cl.Message(
130
- content=f"**{document_title} ( Chunk: {chunk_number})**",
131
- elements=[
132
- cl.Text(content=truncated_context, display="inline")
133
- ]
134
- ).send()
 
1
+ import chainlit as cl
2
  import os
3
+ from classes.app_state import AppState
4
+ from classes.model_run_state import ModelRunState
5
  from dotenv import load_dotenv
6
+ from langchain.schema.runnable import RunnablePassthrough
7
  from langchain_openai import ChatOpenAI
8
+ from langchain_openai.embeddings import OpenAIEmbeddings
9
+ from langchain.embeddings import HuggingFaceEmbeddings
 
10
  from operator import itemgetter
11
+ from utilities.doc_utilities import get_documents
12
+ from utilities.templates import get_qa_prompt
13
+ from utilities.vector_utilities import create_vector_store
14
 
15
  document_urls = [
16
  "https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf",
 
23
  # Get the OpenAI API key from environment variables
24
  openai_api_key = os.getenv("OPENAI_API_KEY")
25
 
26
+ # Setup our state and read the documents
27
+ app_state = AppState()
28
+ app_state.set_debug(False)
29
+ app_state.set_document_urls(document_urls)
30
 
31
+ get_documents(app_state)
32
 
33
+ # set up this model run
34
+ chainlit_state = ModelRunState()
35
+ chainlit_state.name = "Chainlit"
 
36
 
37
+ chainlit_state.qa_model_name = "gpt-4o-mini"
38
+ chainlit_state.qa_model = ChatOpenAI(model=chainlit_state.qa_model_name, openai_api_key=openai_api_key)
 
39
 
40
+ hf_username = "rchrdgwr"
41
+ hf_repo_name = "finetuned-arctic-model-2"
42
+ finetuned_model_name = f"{hf_username}/{hf_repo_name}"
43
 
44
+ chainlit_state.embedding_model_name = finetuned_model_name
45
+ chainlit_state.embedding_model = HuggingFaceEmbeddings(model_name=chainlit_state.embedding_model_name)
46
+
47
+ chainlit_state.chunk_size = 1000
48
+ chainlit_state.chunk_overlap = 100
49
+ create_vector_store(app_state, chainlit_state )
 
 
 
 
50
 
51
+ chat_prompt = get_qa_prompt()
52
 
53
+ # create the chain
54
 
55
  retrieval_augmented_qa_chain = (
56
+ {"context": itemgetter("question") | chainlit_state.retriever, "question": itemgetter("question")}
57
  | RunnablePassthrough.assign(context=itemgetter("context"))
58
+ | {"response": chat_prompt | chainlit_state.qa_model, "context": itemgetter("context")}
 
 
59
  )
60
 
61
  opening_content = """
 
101
 
102
  await cl.Message(content=context_msg).send()
103
 
104
+ # for doc in context_documents:
105
+ # document_title = doc.metadata.get("source", "Unknown Document")
106
+ # chunk_number = doc.metadata.get("chunk_number", "Unknown Chunk")
107
+
108
+ # document_context = doc.page_content.strip()
109
+ # truncated_context = document_context[:MAX_PREVIEW_LENGTH] + ("..." if len(document_context) > MAX_PREVIEW_LENGTH else "")
110
+ # print("----------------------------------------")
111
+ # print(truncated_context)
112
+
113
+ # await cl.Message(
114
+ # content=f"**{document_title} ( Chunk: {chunk_number})**",
115
+ # elements=[
116
+ # cl.Text(content=truncated_context, display="inline")
117
+ # ]
118
+ # ).send()
 
classes/app_state.py CHANGED
@@ -1,86 +1,16 @@
 
1
  class AppState:
2
  def __init__(self):
3
  self.debug = False
4
- self.llm_model = "gpt-3.5-turbo"
5
- self.embedding_model = "text-embedding-3-small"
6
- self.chunk_size = 1000
7
- self.chunk_overlap = 100
8
  self.document_urls = []
9
  self.download_folder = "data/"
10
- self.loaded_documents = []
11
- self.single_text_documents = []
12
- self.metadata = []
13
- self.titles = []
14
  self.documents = []
15
- self.combined_document_objects = []
16
- self.main_llm = None
17
- self.retriever = None
18
 
19
- self.system_template = "You are a helpful assistant"
20
- #
21
- self.user_input = None
22
- self.retrieved_documents = []
23
- self.chat_history = []
24
- self.current_question = None
25
-
26
  def set_document_urls(self, document_urls):
27
  self.document_urls = document_urls
28
-
29
- def set_llm_model(self, llm_model):
30
- self.llm_model = llm_model
31
-
32
- def set_embedding_model(self, embedding_model):
33
- self.embedding_model = embedding_model
34
-
35
- def set_chunk_size(self, chunk_size):
36
- self.chunk_size = chunk_size
37
-
38
- def set_chunk_overlap(self, chunk_overlap):
39
- self.chunk_overlap = chunk_overlap
40
-
41
- def set_system_template(self, system_template):
42
- self.system_template = system_template
43
-
44
- def add_loaded_document(self, loaded_document):
45
- self.loaded_documents.append(loaded_document)
46
-
47
- def add_single_text_documents(self, single_text_document):
48
- self.single_text_documents.append(single_text_document)
49
- def add_metadata(self, metadata):
50
- self.metadata = metadata
51
-
52
- def add_title(self, title):
53
- self.titles.append(title)
54
  def add_document(self, document):
55
  self.documents.append(document)
56
- def add_combined_document_objects(self, combined_document_objects):
57
- self.combined_document_objects = combined_document_objects
58
- def set_retriever(self, retriever):
59
- self.retriever = retriever
60
- def set_main_llm(self, main_llm):
61
- self.main_llm = main_llm
62
  def set_debug(self, debug):
63
- self.debug = debug
64
- #
65
- # Method to update the user input
66
- def set_user_input(self, input_text):
67
- self.user_input = input_text
68
-
69
- # Method to add a retrieved document
70
- # def add_document(self, document):
71
- # print("adding document")
72
- # print(self)
73
- # self.retrieved_documents.append(document)
74
-
75
- # Method to update chat history
76
- def update_chat_history(self, message):
77
- self.chat_history.append(message)
78
-
79
- # Method to get the current state
80
- def get_state(self):
81
- return {
82
- "user_input": self.user_input,
83
- "retrieved_documents": self.retrieved_documents,
84
- "chat_history": self.chat_history,
85
- "current_question": self.current_question
86
- }
 
1
+ import pprint
2
  class AppState:
3
  def __init__(self):
4
  self.debug = False
 
 
 
 
5
  self.document_urls = []
6
  self.download_folder = "data/"
 
 
 
 
7
  self.documents = []
 
 
 
8
 
9
+ def display(self):
10
+ pprint.pprint(self.__dict__)
 
 
 
 
 
11
  def set_document_urls(self, document_urls):
12
  self.document_urls = document_urls
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
13
  def add_document(self, document):
14
  self.documents.append(document)
 
 
 
 
 
 
15
  def set_debug(self, debug):
16
+ self.debug = debug
 
 
 
classes/model_run_state.py ADDED
@@ -0,0 +1,50 @@
 
 
1
+ import pprint
2
+
3
+ from utilities.constants import (
4
+ CHUNKING_STRATEGY_RECURSIVE,
5
+ CHUNKING_STRATEGY_TABLE_AWARE,
6
+ CHUNKING_STRATEGY_SECTION_BASED
7
+ )
8
+
9
+ class ModelRunState:
10
+ def __init__(self):
11
+ self.name = ""
12
+
13
+ self.qa_model_name = "gpt-4o"
14
+ self.qa_model = None
15
+
16
+ self.embedding_model_name = "text-embedding-3-small"
17
+ self.embedding_model = None
18
+
19
+ self.chunking_strategy = CHUNKING_STRATEGY_RECURSIVE
20
+ self.chunk_size = 1000
21
+ self.chunk_overlap = 100
22
+
23
+ self.response_dataset = []
24
+
25
+ self.combined_document_objects = []
26
+ self.retriever = None
27
+
28
+ self.ragas_results = None
29
+ self.system_template = "You are a helpful assistant"
30
+
31
+ def display(self):
32
+ pprint.pprint(self.__dict__)
33
+
34
+ def parameters(self):
35
+ print(f"Base model: {self.qa_model_name}")
36
+ print(f"Embedding model: {self.embedding_model_name}")
37
+ print(f"Chunk size: {self.chunk_size}")
38
+ print(f"Chunk overlap: {self.chunk_overlap}")
39
+
40
+ def results_summary(self):
41
+ print(self.ragas_results)
42
+
43
+ def results(self):
44
+ results_df = self.ragas_results.to_pandas()
45
+ results_df
46
+
47
+ @classmethod
48
+ def compare_ragas_results(cls, model_run_1, model_run_2):
49
+ if not isinstance(model_run_1, cls) or not isinstance(model_run_2, cls):
50
+ raise ValueError("Both instances must be of the same class")
classes/ragas_state.py ADDED
@@ -0,0 +1,17 @@
 
 
 
1
+ import pprint
2
+ from ragas.testset.evolutions import simple, reasoning, multi_context
3
+ class RagasState:
4
+ def __init__(self):
5
+ self.chunk_size = 600
6
+ self.chunk_overlap = 50
7
+ self.chunks = []
8
+ self.generator_llm = "gpt-4"
9
+ self.critic_llm = "gpt-4o-mini"
10
+ self.distributions = {
11
+ simple: 0.5,
12
+ multi_context: 0.4,
13
+ reasoning: 0.1
14
+ }
15
+ self.num_questions = 3
16
+ self.testset_df = None
17
+
images/docchain_img.png DELETED
Binary file (100 kB)
 
old_app.py DELETED
@@ -1,145 +0,0 @@
1
- import os
2
- from chainlit.types import AskFileResponse
3
-
4
- from utilities_2.openai_utils.prompts import (
5
- UserRolePrompt,
6
- SystemRolePrompt,
7
- AssistantRolePrompt,
8
- )
9
- from utilities_2.openai_utils.embedding import EmbeddingModel
10
- from utilities_2.vectordatabase import VectorDatabase
11
- from utilities_2.openai_utils.chatmodel import ChatOpenAI
12
- import chainlit as cl
13
- from utilities.text_utils import FileLoader
14
- from utilities.pipeline import RetrievalAugmentedQAPipeline
15
- # from utilities.vector_database import QdrantDatabase
16
-
17
-
18
- def process_file(file, use_rct):
19
- fileLoader = FileLoader()
20
- return fileLoader.load_file(file, use_rct)
21
-
22
- system_template = """\
23
- Use the following context to answer a users question.
24
- If you cannot find the answer in the context, say you don't know the answer.
25
- The context contains the text from a document. Refer to it as the document not the context.
26
- """
27
- system_role_prompt = SystemRolePrompt(system_template)
28
-
29
- user_prompt_template = """\
30
- Context:
31
- {context}
32
-
33
- Question:
34
- {question}
35
- """
36
- user_role_prompt = UserRolePrompt(user_prompt_template)
37
-
38
- @cl.on_chat_start
39
- async def on_chat_start():
40
- # get user inputs
41
- res = await cl.AskActionMessage(
42
- content="Do you want to use Qdrant?",
43
- actions=[
44
- cl.Action(name="yes", value="yes", label="✅ Yes"),
45
- cl.Action(name="no", value="no", label="❌ No"),
46
- ],
47
- ).send()
48
- use_qdrant = False
49
- use_qdrant_type = "Local"
50
- if res and res.get("value") == "yes":
51
- use_qdrant = True
52
- local_res = await cl.AskActionMessage(
53
- content="Do you want to use local or cloud?",
54
- actions=[
55
- cl.Action(name="Local", value="Local", label="✅ Local"),
56
- cl.Action(name="Cloud", value="Cloud", label="❌ Cloud"),
57
- ],
58
- ).send()
59
- if local_res and local_res.get("value") == "Cloud":
60
- use_qdrant_type = "Cloud"
61
- use_rct = False
62
- res = await cl.AskActionMessage(
63
- content="Do you want to use RecursiveCharacterTextSplitter?",
64
- actions=[
65
- cl.Action(name="yes", value="yes", label="✅ Yes"),
66
- cl.Action(name="no", value="no", label="❌ No"),
67
- ],
68
- ).send()
69
- if res and res.get("value") == "yes":
70
- use_rct = True
71
-
72
- files = None
73
- # Wait for the user to upload a file
74
- while not files:
75
- files = await cl.AskFileMessage(
76
- content="Please upload a .txt or .pdf file to begin processing!",
77
- accept=["text/plain", "application/pdf"],
78
- max_size_mb=2,
79
- timeout=180,
80
- ).send()
81
-
82
- file = files[0]
83
-
84
- msg = cl.Message(
85
- content=f"Processing `{file.name}`...", disable_human_feedback=True
86
- )
87
- await msg.send()
88
-
89
- texts = process_file(file, use_rct)
90
-
91
- msg = cl.Message(
92
- content=f"Resulted in {len(texts)} chunks", disable_human_feedback=True
93
- )
94
- await msg.send()
95
-
96
- # decide if to use the dict vector store of the Qdrant vector store
97
-
98
- # Create a dict vector store
99
- if use_qdrant == False:
100
- vector_db = VectorDatabase()
101
- vector_db = await vector_db.abuild_from_list(texts)
102
- else:
103
- embedding_model = EmbeddingModel(embeddings_model_name= "text-embedding-3-small", dimensions=1000)
104
- if use_qdrant_type == "Local":
105
- from utilities.vector_database import QdrantDatabase
106
- vector_db = QdrantDatabase(
107
- embedding_model=embedding_model
108
- )
109
-
110
- vector_db = await vector_db.abuild_from_list(texts)
111
-
112
- msg = cl.Message(
113
- content=f"The Vector store has been created", disable_human_feedback=True
114
- )
115
- await msg.send()
116
-
117
- chat_openai = ChatOpenAI()
118
-
119
- # Create a chain
120
- retrieval_augmented_qa_pipeline = RetrievalAugmentedQAPipeline(
121
- vector_db_retriever=vector_db,
122
- llm=chat_openai,
123
- system_role_prompt=system_role_prompt,
124
- user_role_prompt=user_role_prompt
125
- )
126
-
127
- # Let the user know that the system is ready
128
- msg.content = f"Processing `{file.name}` is complete."
129
- await msg.update()
130
- msg.content = f"You can now ask questions about `{file.name}`."
131
- await msg.update()
132
- cl.user_session.set("chain", retrieval_augmented_qa_pipeline)
133
-
134
-
135
- @cl.on_message
136
- async def main(message):
137
- chain = cl.user_session.get("chain")
138
-
139
- msg = cl.Message(content="")
140
- result = await chain.arun_pipeline(message.content)
141
-
142
- async for stream_resp in result["response"]:
143
- await msg.stream_token(stream_resp)
144
-
145
- await msg.send()
 
 
 
 
 
utilities/constants.py ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ CHUNKING_STRATEGY_RECURSIVE = "recursive"
2
+ CHUNKING_STRATEGY_TABLE_AWARE = "table_aware"
3
+ CHUNKING_STRATEGY_SECTION_BASED = "section_based"
4
+ CHUNKING_STRATEGY_SEMANTIC = "semantic_based"
utilities/doc_utilities.py ADDED
@@ -0,0 +1,54 @@
 
 
 
 
1
+ from langchain_community.document_loaders import PyMuPDFLoader
2
+ import fitz
3
+ import os
4
+ import requests
5
+
6
+ from utilities.debugger import dprint
7
+ import uuid
8
+
9
+
10
+
11
+ def download_document(app_state, url, file_name, download_folder):
12
+ file_path = os.path.join(download_folder, file_name)
13
+ if not os.path.exists(download_folder):
14
+ os.makedirs(download_folder)
15
+
16
+ if not os.path.exists(file_path):
17
+ print(f"Downloading {file_name} from {url}...")
18
+ response = requests.get(url)
19
+ if response.status_code == 200:
20
+ with open(file_path, 'wb') as f:
21
+ f.write(response.content)
22
+ else:
23
+ dprint(app_state, f"Failed to download document from {url}. Status code: {response.status_code}")
24
+ else:
25
+ dprint(app_state, f"{file_name} already exists locally.")
26
+ return file_path
27
+
28
+ def get_documents(app_state):
29
+ for url in app_state.document_urls:
30
+ dprint(app_state, f"Downloading and loading document from {url}...")
31
+ file_name = url.split("/")[-1]
32
+ file_path = download_document(app_state, url, file_name, app_state.download_folder)
33
+ loader = PyMuPDFLoader(file_path)
34
+ loaded_document = loader.load()
35
+ single_text_document = "\n".join([doc.page_content for doc in loaded_document])
36
+ dprint(app_state, f"Number of pages: {len(loaded_document)}")
37
+ # lets get titles and metadata
38
+ pdf = fitz.open(file_path)
39
+ metadata = pdf.metadata
40
+ title = metadata.get('title', 'Document 1')
41
+
42
+ document = {
43
+ "url": url,
44
+ "title": title,
45
+ "metadata": metadata,
46
+ "loaded_document": loaded_document,
47
+ "single_text_document": single_text_document,
48
+ "document_id": str(uuid.uuid4())
49
+ }
50
+ app_state.add_document(document)
51
+ dprint(app_state, f"Title of Document: {title}")
52
+ dprint(app_state, f"Full metadata for Document 1: {metadata}")
53
+ pdf.close()
54
+ print(f"Total documents: {len(app_state.documents)}")
utilities/get_documents.py DELETED
@@ -1,33 +0,0 @@
1
- import requests
2
- import os
3
- from langchain.document_loaders import PyMuPDFLoader
4
-
5
- # Define the URLs for the documents
6
- url_1 = "https://example.com/Blueprint-for-an-AI-Bill-of-Rights.pdf"
7
- url_2 = "https://example.com/NIST.AI.600-1.pdf"
8
-
9
- # Define local file paths for storing the downloaded PDFs
10
- file_path_1 = "data/Blueprint-for-an-AI-Bill-of-Rights.pdf"
11
- file_path_2 = "data/NIST.AI.600-1.pdf"
12
-
13
- # Function to download a file from a URL
14
- def download_pdf(url, file_path):
15
- # Check if the file already exists to avoid re-downloading
16
- if not os.path.exists(file_path):
17
- print(f"Downloading {file_path} from {url}...")
18
- response = requests.get(url)
19
- with open(file_path, 'wb') as f:
20
- f.write(response.content)
21
- else:
22
- print(f"{file_path} already exists, skipping download.")
23
-
24
- # Download the PDFs from the URLs
25
- download_pdf(url_1, file_path_1)
26
- download_pdf(url_2, file_path_2)
27
-
28
- # Load the PDFs using PyMuPDFLoader
29
- loader_1 = PyMuPDFLoader(file_path_1)
30
- documents_1 = loader_1.load()
31
-
32
- loader_2 = PyMuPDFLoader(file_path_2)
33
- documents_2 = loader_2.load()
 
 
 
 
 
utilities/pipeline.py DELETED
@@ -1,27 +0,0 @@
1
- from utilities_2.vectordatabase import VectorDatabase
2
-
3
- class RetrievalAugmentedQAPipeline:
4
- def __init__(self, llm, vector_db_retriever: VectorDatabase,
5
- system_role_prompt, user_role_prompt
6
- ) -> None:
7
- self.llm = llm
8
- self.vector_db_retriever = vector_db_retriever
9
- self.system_role_prompt = system_role_prompt
10
- self.user_role_prompt = user_role_prompt
11
-
12
- async def arun_pipeline(self, user_query: str):
13
- context_list = self.vector_db_retriever.search_by_text(user_query, k=4)
14
-
15
- context_prompt = ""
16
- for context in context_list:
17
- context_prompt += context[0] + "\n"
18
-
19
- formatted_system_prompt = self.system_role_prompt.create_message()
20
-
21
- formatted_user_prompt = self.user_role_prompt.create_message(question=user_query, context=context_prompt)
22
-
23
- async def generate_response():
24
- async for chunk in self.llm.astream([formatted_system_prompt, formatted_user_prompt]):
25
- yield chunk
26
-
27
- return {"response": generate_response(), "context": context_list}
 
 
 
 
 
utilities/rag_utilities.py DELETED
@@ -1,125 +0,0 @@
1
- from langchain.text_splitter import RecursiveCharacterTextSplitter
2
- from langchain.docstore.document import Document
3
- from langchain_community.document_loaders import PyMuPDFLoader
4
- from langchain_community.vectorstores import Qdrant
5
- from langchain_openai.embeddings import OpenAIEmbeddings
6
- import fitz
7
- import io
8
- import os
9
- import requests
10
- import tiktoken
11
- from utilities.debugger import dprint
12
- import uuid
13
-
14
- def tiktoken_len(text):
15
- tokens = tiktoken.encoding_for_model("gpt-4o").encode(
16
- text,
17
- )
18
- return len(tokens)
19
-
20
- def download_document(state, url, file_name, download_folder):
21
- file_path = os.path.join(download_folder, file_name)
22
- if not os.path.exists(download_folder):
23
- os.makedirs(download_folder)
24
-
25
- if not os.path.exists(file_path):
26
- print(f"Downloading {file_name} from {url}...")
27
- response = requests.get(url)
28
- if response.status_code == 200:
29
- with open(file_path, 'wb') as f:
30
- f.write(response.content)
31
- else:
32
- dprint(state, f"Failed to download document from {url}. Status code: {response.status_code}")
33
- else:
34
- dprint(state, f"{file_name} already exists locally.")
35
- return file_path
36
-
37
- def get_documents(state):
38
- for url in state.document_urls:
39
- dprint(state, f"Downloading and loading document from {url}...")
40
- file_name = url.split("/")[-1]
41
- file_path = download_document(state, url, file_name, state.download_folder)
42
- loader = PyMuPDFLoader(file_path)
43
- loaded_document = loader.load()
44
- single_text_document = "\n".join([doc.page_content for doc in loaded_document])
45
- #state.add_loaded_document(loaded_document) # Append the loaded documents to the list
46
- #state.add_single_text_document(single_text_document)
47
- dprint(state, f"Number of pages: {len(loaded_document)}")
48
- # lets get titles and metadata
49
- pdf = fitz.open(file_path)
50
- metadata = pdf.metadata
51
- title = metadata.get('title', 'Document 1')
52
- #state.add_metadata(metadata)
53
- #state.add_title(title)
54
- document = {
55
- "url": url,
56
- "title": title,
57
- "metadata": metadata,
58
- "single_text_document": single_text_document,
59
- "document_id": str(uuid.uuid4())
60
- }
61
- state.add_document(document)
62
- dprint(state, f"Title of Document: {title}")
63
- dprint(state, f"Full metadata for Document 1: {metadata}")
64
- pdf.close()
65
- dprint(state, f"documents: {state.documents}")
66
-
67
- def create_chunked_documents(state):
68
- get_documents(state)
69
-
70
-
71
-
72
- text_splitter = RecursiveCharacterTextSplitter(
73
- chunk_size=state.chunk_size,
74
- chunk_overlap=state.chunk_overlap,
75
- length_function = tiktoken_len,
76
- )
77
- combined_document_objects = []
78
- dprint(state, "Chunking documents and creating document objects")
79
- for document in state.documents:
80
- dprint(state, f"processing documend: {document['title']}")
81
- text = document["single_text_document"]
82
- dprint(state, text)
83
- title = document["title"]
84
- document_id = document["document_id"]
85
- chunks_document = text_splitter.split_text(text)
86
- dprint(state, len(chunks_document))
87
-
88
- for chunk_number, chunk in enumerate(chunks_document, start=1):
89
- document_objects = Document(
90
- page_content=chunk,
91
- metadata={
92
- "source": title,
93
- "document_id": document.get("document_id", "default_id"),
94
- "chunk_number": chunk_number # Add unique chunk number
95
- }
96
- )
97
- combined_document_objects.append(document_objects)
98
- state.add_combined_document_objects(combined_document_objects)
99
-
100
-
101
- def create_vector_store(state, **kwargs):
102
- for key, value in kwargs.items():
103
- if hasattr(state, key):
104
- setattr(state, key, value)
105
- else:
106
- print(f"Warning: {key} is not an attribute of the state object")
107
-
108
- # Rest of your create_vector_store logic
109
- print(f"Chunk size after update: {state.chunk_size}")
110
-
111
-
112
-
113
-
114
- create_chunked_documents(state)
115
- embedding_model = OpenAIEmbeddings(model=state.embedding_model)
116
-
117
- qdrant_vectorstore = Qdrant.from_documents(
118
- documents=state.combined_document_objects,
119
- embedding=embedding_model,
120
- location=":memory:"
121
- )
122
- qdrant_retriever = qdrant_vectorstore.as_retriever()
123
- state.set_retriever(qdrant_retriever)
124
- print("Vector store created")
125
- return qdrant_retriever
 
 
 
 
 
 
 
utilities/templates.py ADDED
@@ -0,0 +1,27 @@
 
 
 
 
 
1
+
2
+ from langchain_core.prompts import ChatPromptTemplate
3
+ def get_qa_prompt():
4
+
5
+ system_template = """
6
+ You are an expert at explaining technical documents to people.
7
+ You are provided context below to answer the question.
8
+ Only use the information provided below.
9
+ If they do not ask a question, have a conversation with them and ask them if they have any questions
10
+ If you cannot answer the question with the content below say 'I don't have enough information, sorry'
11
+ The two documents are 'Blueprint for an AI Bill of Rights' and 'Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile'
12
+ """
13
+ human_template = """
14
+ ===
15
+ question:
16
+ {question}
17
+
18
+ ===
19
+ context:
20
+ {context}
21
+ ===
22
+ """
23
+ chat_prompt = ChatPromptTemplate.from_messages([
24
+ ("system", system_template),
25
+ ("human", human_template)
26
+ ])
27
+ return chat_prompt
utilities/text_utils.py DELETED
@@ -1,103 +0,0 @@
1
- import os
2
- from typing import List
3
- import fitz # pymupdf
4
- import tempfile
5
- from utilities_2.text_utils import CharacterTextSplitter
6
- from langchain_text_splitters import RecursiveCharacterTextSplitter
7
-
8
- # load the file
9
- # handle .txt and .pdf
10
-
11
- class FileLoader:
12
-
13
- def __init__(self, encoding: str = "utf-8"):
14
- self.documents = []
15
- self.encoding = encoding
16
- self.temp_file_path = ""
17
-
18
-
19
- def load_file(self, file, use_rct):
20
- if use_rct:
21
- text_splitter=MyRecursiveCharacterTextSplitter()
22
- else:
23
- text_splitter=CharacterTextSplitter()
24
- file_extension = os.path.splitext(file.name)[1].lower()
25
-
26
- with tempfile.NamedTemporaryFile(mode="wb", delete=False, suffix=file_extension) as temp_file:
27
- self.temp_file_path = temp_file.name
28
- temp_file.write(file.content)
29
-
30
- if os.path.isfile(self.temp_file_path):
31
- if self.temp_file_path.endswith(".txt"):
32
- self.load_text_file()
33
- elif self.temp_file_path.endswith(".pdf"):
34
- self.load_pdf_file()
35
- else:
36
- raise ValueError(
37
- f"Unsupported file type: {self.temp_file_path}"
38
- )
39
- return text_splitter.split_text(self.documents)
40
- else:
41
- raise ValueError(
42
- "Not a file"
43
- )
44
-
45
- def load_text_file(self):
46
- with open(self.temp_file_path, "r", encoding=self.encoding) as f:
47
- self.documents.append(f.read())
48
-
49
- def load_pdf_file(self):
50
- # pymupdf
51
- pdf_document = fitz.open(self.temp_file_path)
52
- for page_num in range(len(pdf_document)):
53
- page = pdf_document.load_page(page_num)
54
- text = page.get_text()
55
- self.documents.append(text)
56
-
57
- class CharacterTextSplitter:
58
- def __init__(
59
- self,
60
- chunk_size: int = 1000,
61
- chunk_overlap: int = 200,
62
- ):
63
- assert (
64
- chunk_size > chunk_overlap
65
- ), "Chunk size must be greater than chunk overlap"
66
-
67
- self.chunk_size = chunk_size
68
- self.chunk_overlap = chunk_overlap
69
-
70
- def split(self, text: str) -> List[str]:
71
- chunks = []
72
- for i in range(0, len(text), self.chunk_size - self.chunk_overlap):
73
- chunks.append(text[i : i + self.chunk_size])
74
- return chunks
75
-
76
- def split_text(self, texts: List[str]) -> List[str]:
77
- chunks = []
78
- for text in texts:
79
- chunks.extend(self.split(text))
80
- return chunks
81
-
82
-
83
-
84
- class MyRecursiveCharacterTextSplitter:
85
- # uses langChain.RecursiveCharacterTextSplitter
86
- def __init__(
87
- self
88
- ):
89
- self.RCTS = RecursiveCharacterTextSplitter(
90
- chunk_size=1000,
91
- chunk_overlap=20,
92
- length_function=len,
93
- separators=["\n\n", "\n", " ", ""]
94
- )
95
-
96
- def split_text(self, texts: List[str]) -> List[str]:
97
- all_chunks = []
98
- for doc in texts:
99
- chunks = self.RCTS.split_text(doc)
100
- all_chunks.extend(chunks)
101
- return all_chunks
102
-
103
-
 
 
 
 
 
 
 
 
utilities/vector_database.py DELETED
@@ -1,105 +0,0 @@
1
- import numpy as np
2
- from collections import defaultdict
3
- from typing import List, Tuple, Callable
4
- from utilities_2.openai_utils.embedding import EmbeddingModel
5
- import hashlib
6
- from qdrant_client import QdrantClient
7
- from qdrant_client.http.models import PointStruct
8
- from qdrant_client.models import VectorParams
9
- import uuid
10
-
11
- def cosine_similarity(vector_a: np.array, vector_b: np.array) -> float:
12
- """Computes the cosine similarity between two vectors."""
13
- dot_product = np.dot(vector_a, vector_b)
14
- norm_a = np.linalg.norm(vector_a)
15
- norm_b = np.linalg.norm(vector_b)
16
- return dot_product / (norm_a * norm_b)
17
-
18
- class QdrantDatabase:
19
- def __init__(self, embedding_model=None):
20
- self.qdrant_client = QdrantClient(location=":memory:")
21
- self.collection_name = "my_collection"
22
- self.embedding_model = embedding_model or EmbeddingModel(embeddings_model_name= "text-embedding-3-small", dimensions=1000)
23
- vector_params = VectorParams(
24
- size=self.embedding_model.dimensions, # vector size
25
- distance="Cosine"
26
- ) # distance metric
27
- self.qdrant_client.create_collection(
28
- collection_name=self.collection_name,
29
- vectors_config={"text": vector_params},
30
- )
31
- self.vectors = defaultdict(np.array) # Still keeps a local copy if needed
32
-
33
- def string_to_int_id(self, s: str) -> int:
34
- return int(hashlib.sha256(s.encode('utf-8')).hexdigest(), 16) % (10**8)
35
- def get_test_vector(self):
36
- retrieved_vector = self.qdrant_client.retrieve(
37
- collection_name="my_collection",
38
- ids=[self.string_to_int_id("test_key")]
39
- )
40
- return retrieved_vector
41
- def insert(self, key: str, vector: np.array) -> None:
42
- point_id = str(uuid.uuid4())
43
- payload = {"text": key}
44
-
45
- point = PointStruct(
46
- id=point_id,
47
- vector={"default": vector.tolist()},
48
- payload=payload
49
- )
50
- print(f"Inserting vector for key: {key}, ID: {point_id}")
51
- # Insert the vector into Qdrant with the associated document
52
- self.qdrant_client.upsert(
53
- collection_name=self.collection_name,
54
- points=[point] # Qdrant expects a list of PointStruct
55
- )
56
-
57
-
58
- def search(
59
- self,
60
- query_vector: np.array,
61
- k: int=5,
62
- distance_measure: Callable = cosine_similarity,
63
- ) -> List[Tuple[str, float]]:
64
- # Perform search in Qdrant
65
- if isinstance(query_vector, np.ndarray):
66
- query_vector = query_vector.tolist()
67
- print(type(query_vector))
68
- search_results = self.qdrant_client.search(
69
- collection_name=self.collection_name,
70
- query_vector=query_vector, # Pass the vector as a list
71
- limit=k
72
- )
73
- return [(result.payload['text'], result.score) for result in search_results]
74
-
75
- def search_by_text(
76
- self,
77
- query_text: str,
78
- k: int,
79
- distance_measure: Callable = cosine_similarity,
80
- return_as_text: bool = False,
81
- ) -> List[Tuple[str, float]]:
82
-
83
- query_vector = self.embedding_model.get_embedding(query_text)
84
- results = self.search(query_vector, k, distance_measure)
85
- return [result[0] for result in results] if return_as_text else results
86
-
87
- async def abuild_from_list(self, list_of_text: List[str]) -> "QdrantDatabase":
88
- from qdrant_client.http import models
89
- embeddings = await self.embedding_model.async_get_embeddings(list_of_text)
90
- points = [
91
- models.PointStruct(
92
- id=str(uuid.uuid4()),
93
- vector={"text": embedding}, # Should be a named vector as per vector_config
94
- payload={
95
- "text": text
96
- }
97
- )
98
- for text, embedding in zip(list_of_text, embeddings)
99
- ]
100
- self.qdrant_client.upsert(
101
- collection_name=self.collection_name,
102
- points=points
103
- )
104
- return self
105
-
 
 
 
 
 
 
utilities/vector_utilities.py ADDED
@@ -0,0 +1,203 @@
 
 
 
 
 
1
+ from utilities.constants import (
2
+ CHUNKING_STRATEGY_TABLE_AWARE,
3
+ CHUNKING_STRATEGY_SECTION_BASED,
4
+ CHUNKING_STRATEGY_SEMANTIC
5
+ )
6
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
7
+ from langchain.docstore.document import Document
8
+ from langchain_community.vectorstores import Qdrant
9
+ from langchain_openai.embeddings import OpenAIEmbeddings
10
+ import numpy as np
11
+ import pdfplumber
12
+ import re
13
+ from sentence_transformers import SentenceTransformer
14
+ from sklearn.metrics.pairwise import cosine_similarity
15
+ import tiktoken
16
+ from utilities.debugger import dprint
17
+
18
+ def create_vector_store(app_state, model_run_state, **kwargs):
19
+ for key, value in kwargs.items():
20
+ if hasattr(model_run_state, key):
21
+ setattr(model_run_state, key, value)
22
+ else:
23
+ print(f"Warning: {key} is not an attribute of the state object")
24
+
25
+ # Rest of your create_vector_store logic
26
+ dprint(app_state, f"Chunk size after update: {model_run_state.chunk_size}")
27
+ create_chunked_documents(app_state, model_run_state)
28
+
29
+ qdrant_vectorstore = Qdrant.from_documents(
30
+ documents=model_run_state.combined_document_objects,
31
+ embedding=model_run_state.embedding_model,
32
+ location=":memory:"
33
+ )
34
+ qdrant_retriever = qdrant_vectorstore.as_retriever()
35
+ model_run_state.retriever = qdrant_retriever
36
+ print("Vector store created")
37
+
38
+ def tiktoken_len(text):
39
+ tokens = tiktoken.encoding_for_model("gpt-4o").encode(
40
+ text,
41
+ )
42
+ return len(tokens)
43
+
44
+ def create_chunked_documents(app_state, model_run_state):
45
+ dprint(app_state, model_run_state.chunking_strategy)
46
+ if model_run_state.chunking_strategy == CHUNKING_STRATEGY_TABLE_AWARE:
47
+ combined_document_objects = chunk_with_table_aware(app_state, model_run_state)
48
+ elif model_run_state.chunking_strategy == CHUNKING_STRATEGY_SECTION_BASED:
49
+ combined_document_objects = chunk_with_section_based(app_state, model_run_state)
50
+ elif model_run_state.chunking_strategy == CHUNKING_STRATEGY_SEMANTIC:
51
+ combined_document_objects = chunk_with_semantic_splitter(app_state, model_run_state)
52
+ else:
53
+ combined_document_objects = chunk_with_recursive_splitter(app_state, model_run_state)
54
+ model_run_state.combined_document_objects = combined_document_objects
55
+ dprint(app_state, "Chunking completed successfully")
56
+
57
+
58
+ def chunk_with_recursive_splitter(app_state, model_run_state):
59
+ text_splitter = RecursiveCharacterTextSplitter(
60
+ chunk_size=model_run_state.chunk_size,
61
+ chunk_overlap=model_run_state.chunk_overlap,
62
+ length_function = tiktoken_len,
63
+ )
64
+ combined_document_objects = []
65
+ dprint(app_state, "Chunking documents and creating document objects")
66
+ for document in app_state.documents:
67
+ dprint(app_state, f"processing documend: {document['title']}")
68
+ text = document["single_text_document"]
69
+ dprint(app_state, text)
70
+ title = document["title"]
71
+ # document_id = document["document_id"]
72
+ chunks_document = text_splitter.split_text(text)
73
+ dprint(app_state, len(chunks_document))
74
+
75
+ for chunk_number, chunk in enumerate(chunks_document, start=1):
76
+ document_objects = Document(
77
+ page_content=chunk,
78
+ metadata={
79
+ "source": title,
80
+ "document_id": document.get("document_id", "default_id"),
81
+ "chunk_number": chunk_number # Add unique chunk number
82
+ }
83
+ )
84
+ combined_document_objects.append(document_objects)
85
+ return combined_document_objects
86
+
87
+ def chunk_with_table_aware(app_state, model_run_state):
88
+ combined_document_objects = []
89
+ dprint(app_state, "Using Table-Aware Chunking for documents.")
90
+
91
+ for document in app_state.documents:
92
+ title = document["title"]
93
+ text = document["single_text_document"]
94
+
95
+ # Check if document is a PDF and contains tables
96
+ if document.get("is_pdf", False):
97
+ with pdfplumber.open(document["file_path"]) as pdf:
98
+ for page in pdf.pages:
99
+ tables = page.extract_tables()
100
+ for table in tables:
101
+ table_content = "\n".join([str(row) for row in table])
102
+ document_objects = Document(
103
+ page_content=table_content,
104
+ metadata={
105
+ "source": title,
106
+ "document_id": document.get("document_id", "default_id"),
107
+ "chunk_number": "table"
108
+ }
109
+ )
110
+ combined_document_objects.append(document_objects)
111
+
112
+ # Chunk the rest of the text
113
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=model_run_state.chunk_size, chunk_overlap=model_run_state.chunk_overlap)
114
+ chunks_document = text_splitter.split_text(text)
115
+
116
+ for chunk_number, chunk in enumerate(chunks_document, start=1):
117
+ document_objects = Document(
118
+ page_content=chunk,
119
+ metadata={
120
+ "source": title,
121
+ "document_id": document.get("document_id", "default_id"),
122
+ "chunk_number": chunk_number
123
+ }
124
+ )
125
+ combined_document_objects.append(document_objects)
126
+
127
+ return combined_document_objects
128
+
129
+
130
+ def chunk_with_section_based(app_state, model_run_state):
131
+ combined_document_objects = []
132
+ dprint(app_state, "Using Section-Based Chunking for documents.")
133
+
134
+ for document in app_state.documents:
135
+ text = document["single_text_document"]
136
+ title = document["title"]
137
+
138
+ # Split the text by headings
139
+ sections = re.split(r"\n[A-Z].+?\n", text)
140
+
141
+ # Chunk each section
142
+ text_splitter = RecursiveCharacterTextSplitter(chunk_size=model_run_state.chunk_size, chunk_overlap=model_run_state.chunk_overlap)
143
+ for section_number, section in enumerate(sections, start=1):
144
+ chunks_document = text_splitter.split_text(section)
145
+ for chunk_number, chunk in enumerate(chunks_document, start=1):
146
+ document_objects = Document(
147
+ page_content=chunk,
148
+ metadata={
149
+ "source": title,
150
+ "document_id": document.get("document_id", "default_id"),
151
+ "section_number": section_number,
152
+ "chunk_number": chunk_number
153
+ }
154
+ )
155
+ combined_document_objects.append(document_objects)
156
+
157
+ return combined_document_objects
158
+
159
+
160
+ def chunk_with_semantic_splitter(app_state, model_run_state):
+     # Load pre-trained model for embeddings
+     model = SentenceTransformer('all-MiniLM-L6-v2')
+
+     combined_document_objects = []
+     dprint(app_state, "Using Semantic-Based Chunking for documents.")
+
+     for document in app_state.documents:
+         text = document["single_text_document"]
+         title = document["title"]
+
+         # Split text into sentences or paragraphs
+         sentences = text.split(". ")  # Simple split by sentence (you can refine this)
+         sentence_embeddings = model.encode(sentences)
+
+         # Group sentences into chunks based on semantic similarity
+         chunks = []
+         current_chunk = []
+         for i in range(len(sentences) - 1):
+             current_chunk.append(sentences[i])
+             # Calculate similarity between consecutive sentences
+             sim = cosine_similarity([sentence_embeddings[i]], [sentence_embeddings[i + 1]])[0][0]
+             if sim < 0.7 or len(current_chunk) >= model_run_state.chunk_size:
+                 # If similarity is below threshold or chunk size is reached, start a new chunk
+                 chunks.append(" ".join(current_chunk))
+                 current_chunk = []
+
+         # Include the last sentence, which the loop above stops one short of
+         if sentences:
+             current_chunk.append(sentences[-1])
+
+         # Add the final chunk
+         if current_chunk:
+             chunks.append(" ".join(current_chunk))
+
+         # Create document objects for the chunks
+         for chunk_number, chunk in enumerate(chunks, start=1):
+             document_objects = Document(
+                 page_content=chunk,
+                 metadata={
+                     "source": title,
+                     "document_id": document.get("document_id", "default_id"),
+                     "chunk_number": chunk_number
+                 }
+             )
+             combined_document_objects.append(document_objects)
+
+     return combined_document_objects
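The grouping logic boils down to: embed every sentence, then start a new chunk whenever the similarity between neighbouring sentences drops below a threshold. A minimal standalone sketch of that idea, assuming the `cosine_similarity` used above is scikit-learn's pairwise implementation; the sentences and the 0.7 threshold are illustrative only:

```python
# Illustrative only: chunk boundaries fall wherever consecutive-sentence
# similarity dips below the threshold.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The cat sat on the mat.",
    "The kitten curled up beside it.",
    "Quarterly revenue grew by 12 percent.",
    "Profit margins also improved.",
]
embeddings = model.encode(sentences)

chunks, current_chunk = [], [sentences[0]]
for i in range(1, len(sentences)):
    sim = cosine_similarity([embeddings[i - 1]], [embeddings[i]])[0][0]
    if sim < 0.7:
        chunks.append(" ".join(current_chunk))
        current_chunk = []
    current_chunk.append(sentences[i])
chunks.append(" ".join(current_chunk))
print(chunks)  # where the boundaries land depends on the model and threshold
```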
utilities_2/__init__.py DELETED
File without changes
utilities_2/openai_utils/__init__.py DELETED
File without changes
utilities_2/openai_utils/chatmodel.py DELETED
@@ -1,45 +0,0 @@
- from openai import OpenAI, AsyncOpenAI
- from dotenv import load_dotenv
- import os
-
- load_dotenv()
-
-
- class ChatOpenAI:
-     def __init__(self, model_name: str = "gpt-4o-mini"):
-         self.model_name = model_name
-         self.openai_api_key = os.getenv("OPENAI_API_KEY")
-         if self.openai_api_key is None:
-             raise ValueError("OPENAI_API_KEY is not set")
-
-     def run(self, messages, text_only: bool = True, **kwargs):
-         if not isinstance(messages, list):
-             raise ValueError("messages must be a list")
-
-         client = OpenAI()
-         response = client.chat.completions.create(
-             model=self.model_name, messages=messages, **kwargs
-         )
-
-         if text_only:
-             return response.choices[0].message.content
-
-         return response
-
-     async def astream(self, messages, **kwargs):
-         if not isinstance(messages, list):
-             raise ValueError("messages must be a list")
-
-         client = AsyncOpenAI()
-
-         stream = await client.chat.completions.create(
-             model=self.model_name,
-             messages=messages,
-             stream=True,
-             **kwargs
-         )
-
-         async for chunk in stream:
-             content = chunk.choices[0].delta.content
-             if content is not None:
-                 yield content
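For context on what is removed here: the wrapper exposed a synchronous `run` and an async `astream` generator. A minimal sketch of driving the streaming path — illustrative only, assuming the class defined above and a valid `OPENAI_API_KEY` in the environment:

```python
# Illustrative only: consume ChatOpenAI.astream token-by-token.
import asyncio

async def main():
    chat = ChatOpenAI()  # the (now deleted) wrapper shown above
    messages = [{"role": "user", "content": "Say hello in one sentence."}]
    async for token in chat.astream(messages):
        print(token, end="", flush=True)

asyncio.run(main())
```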
utilities_2/openai_utils/embedding.py DELETED
@@ -1,60 +0,0 @@
- from dotenv import load_dotenv
- from openai import AsyncOpenAI, OpenAI
- import openai
- from typing import List
- import os
- import asyncio
-
-
- class EmbeddingModel:
-     def __init__(self, embeddings_model_name: str = "text-embedding-3-small", dimensions: int = None):
-         load_dotenv()
-         self.openai_api_key = os.getenv("OPENAI_API_KEY")
-         self.async_client = AsyncOpenAI()
-         self.client = OpenAI()
-         self.dimensions = dimensions
-
-         if self.openai_api_key is None:
-             raise ValueError(
-                 "OPENAI_API_KEY environment variable is not set. Please set it to your OpenAI API key."
-             )
-         openai.api_key = self.openai_api_key
-         self.embeddings_model_name = embeddings_model_name
-
-     async def async_get_embeddings(self, list_of_text: List[str]) -> List[List[float]]:
-         embedding_response = await self.async_client.embeddings.create(
-             input=list_of_text, model=self.embeddings_model_name, dimensions=self.dimensions
-         )
-
-         return [embeddings.embedding for embeddings in embedding_response.data]
-
-     async def async_get_embedding(self, text: str) -> List[float]:
-         embedding = await self.async_client.embeddings.create(
-             input=text, model=self.embeddings_model_name, dimensions=self.dimensions
-         )
-
-         return embedding.data[0].embedding
-
-     def get_embeddings(self, list_of_text: List[str]) -> List[List[float]]:
-         embedding_response = self.client.embeddings.create(
-             input=list_of_text, model=self.embeddings_model_name, dimensions=self.dimensions
-         )
-
-         return [embeddings.embedding for embeddings in embedding_response.data]
-
-     def get_embedding(self, text: str) -> List[float]:
-         embedding = self.client.embeddings.create(
-             input=text, model=self.embeddings_model_name, dimensions=self.dimensions
-         )
-
-         return embedding.data[0].embedding
-
-
- if __name__ == "__main__":
-     embedding_model = EmbeddingModel()
-     print(asyncio.run(embedding_model.async_get_embedding("Hello, world!")))
-     print(
-         asyncio.run(
-             embedding_model.async_get_embeddings(["Hello, world!", "Goodbye, world!"])
-         )
-     )
utilities_2/openai_utils/prompts.py DELETED
@@ -1,78 +0,0 @@
- import re
-
-
- class BasePrompt:
-     def __init__(self, prompt):
-         """
-         Initializes the BasePrompt object with a prompt template.
-
-         :param prompt: A string that can contain placeholders within curly braces
-         """
-         self.prompt = prompt
-         self._pattern = re.compile(r"\{([^}]+)\}")
-
-     def format_prompt(self, **kwargs):
-         """
-         Formats the prompt string using the keyword arguments provided.
-
-         :param kwargs: The values to substitute into the prompt string
-         :return: The formatted prompt string
-         """
-         matches = self._pattern.findall(self.prompt)
-         return self.prompt.format(**{match: kwargs.get(match, "") for match in matches})
-
-     def get_input_variables(self):
-         """
-         Gets the list of input variable names from the prompt string.
-
-         :return: List of input variable names
-         """
-         return self._pattern.findall(self.prompt)
-
-
- class RolePrompt(BasePrompt):
-     def __init__(self, prompt, role: str):
-         """
-         Initializes the RolePrompt object with a prompt template and a role.
-
-         :param prompt: A string that can contain placeholders within curly braces
-         :param role: The role for the message ('system', 'user', or 'assistant')
-         """
-         super().__init__(prompt)
-         self.role = role
-
-     def create_message(self, format=True, **kwargs):
-         """
-         Creates a message dictionary with a role and a formatted message.
-
-         :param kwargs: The values to substitute into the prompt string
-         :return: Dictionary containing the role and the formatted message
-         """
-         if format:
-             return {"role": self.role, "content": self.format_prompt(**kwargs)}
-
-         return {"role": self.role, "content": self.prompt}
-
-
- class SystemRolePrompt(RolePrompt):
-     def __init__(self, prompt: str):
-         super().__init__(prompt, "system")
-
-
- class UserRolePrompt(RolePrompt):
-     def __init__(self, prompt: str):
-         super().__init__(prompt, "user")
-
-
- class AssistantRolePrompt(RolePrompt):
-     def __init__(self, prompt: str):
-         super().__init__(prompt, "assistant")
-
-
- if __name__ == "__main__":
-     prompt = BasePrompt("Hello {name}, you are {age} years old")
-     print(prompt.format_prompt(name="John", age=30))
-
-     prompt = SystemRolePrompt("Hello {name}, you are {age} years old")
-     print(prompt.create_message(name="John", age=30))
-     print(prompt.get_input_variables())
utilities_2/text_utils.py DELETED
@@ -1,75 +0,0 @@
- import os
- from typing import List
-
- class TextFileLoader:
-     def __init__(self, path: str, encoding: str = "utf-8"):
-         self.documents = []
-         self.path = path
-         self.encoding = encoding
-
-     def load(self):
-         if os.path.isdir(self.path):
-             self.load_directory()
-         elif os.path.isfile(self.path) and self.path.endswith(".txt"):
-             self.load_file()
-         else:
-             raise ValueError(
-                 "Provided path is neither a valid directory nor a .txt file."
-             )
-
-     def load_file(self):
-         with open(self.path, "r", encoding=self.encoding) as f:
-             self.documents.append(f.read())
-
-     def load_directory(self):
-         for root, _, files in os.walk(self.path):
-             for file in files:
-                 if file.endswith(".txt"):
-                     with open(
-                         os.path.join(root, file), "r", encoding=self.encoding
-                     ) as f:
-                         self.documents.append(f.read())
-
-     def load_documents(self):
-         self.load()
-         return self.documents
-
-
- class CharacterTextSplitter:
-     def __init__(
-         self,
-         chunk_size: int = 1000,
-         chunk_overlap: int = 200,
-     ):
-         assert (
-             chunk_size > chunk_overlap
-         ), "Chunk size must be greater than chunk overlap"
-
-         self.chunk_size = chunk_size
-         self.chunk_overlap = chunk_overlap
-
-     def split(self, text: str) -> List[str]:
-         chunks = []
-         for i in range(0, len(text), self.chunk_size - self.chunk_overlap):
-             chunks.append(text[i : i + self.chunk_size])
-         return chunks
-
-     def split_text(self, texts: List[str]) -> List[str]:
-         chunks = []
-         for text in texts:
-             chunks.extend(self.split(text))
-         return chunks
-
- if __name__ == "__main__":
-     loader = TextFileLoader("data/KingLear.txt")
-     loader.load()
-     splitter = CharacterTextSplitter()
-     chunks = splitter.split_text(loader.documents)
-     print(len(chunks))
-     print(chunks[0])
-     print("--------")
-     print(chunks[1])
-     print("--------")
-     print(chunks[-2])
-     print("--------")
-     print(chunks[-1])
utilities_2/vectordatabase.py DELETED
@@ -1,82 +0,0 @@
- import numpy as np
- from collections import defaultdict
- from typing import List, Tuple, Callable
- from utilities_2.openai_utils.embedding import EmbeddingModel
- import asyncio
-
-
- def cosine_similarity(vector_a: np.array, vector_b: np.array) -> float:
-     """Computes the cosine similarity between two vectors."""
-     dot_product = np.dot(vector_a, vector_b)
-     norm_a = np.linalg.norm(vector_a)
-     norm_b = np.linalg.norm(vector_b)
-     return dot_product / (norm_a * norm_b)
-
-
- class VectorDatabase:
-     def __init__(self, embedding_model: EmbeddingModel = None):
-         self.vectors = defaultdict(np.array)
-         self.embedding_model = embedding_model or EmbeddingModel()
-
-     def insert(self, key: str, vector: np.array) -> None:
-         self.vectors[key] = vector
-
-     def search(
-         self,
-         query_vector: np.array,
-         k: int,
-         distance_measure: Callable = cosine_similarity,
-     ) -> List[Tuple[str, float]]:
-         scores = [
-             (key, distance_measure(query_vector, vector))
-             for key, vector in self.vectors.items()
-         ]
-         return sorted(scores, key=lambda x: x[1], reverse=True)[:k]
-
-     def search_by_text(
-         self,
-         query_text: str,
-         k: int,
-         distance_measure: Callable = cosine_similarity,
-         return_as_text: bool = False,
-     ) -> List[Tuple[str, float]]:
-         query_vector = self.embedding_model.get_embedding(query_text)
-         results = self.search(query_vector, k, distance_measure)
-         return [result[0] for result in results] if return_as_text else results
-
-     def retrieve_from_key(self, key: str) -> np.array:
-         return self.vectors.get(key, None)
-
-     async def abuild_from_list(self, list_of_text: List[str]) -> "VectorDatabase":
-         embeddings = await self.embedding_model.async_get_embeddings(list_of_text)
-         for text, embedding in zip(list_of_text, embeddings):
-             self.insert(text, np.array(embedding))
-         return self
-
-
-
- if __name__ == "__main__":
-     list_of_text = [
-         "I like to eat broccoli and bananas.",
-         "I ate a banana and spinach smoothie for breakfast.",
-         "Chinchillas and kittens are cute.",
-         "My sister adopted a kitten yesterday.",
-         "Look at this cute hamster munching on a piece of broccoli.",
-     ]
-
-     vector_db = VectorDatabase()
-     vector_db = asyncio.run(vector_db.abuild_from_list(list_of_text))
-     k = 2
-
-     searched_vector = vector_db.search_by_text("I think fruit is awesome!", k=k)
-     print(f"Closest {k} vector(s):", searched_vector)
-
-     retrieved_vector = vector_db.retrieve_from_key(
-         "I like to eat broccoli and bananas."
-     )
-     print("Retrieved vector:", retrieved_vector)
-
-     relevant_texts = vector_db.search_by_text(
-         "I think fruit is awesome!", k=k, return_as_text=True
-     )
-     print(f"Closest {k} text(s):", relevant_texts)