luanpoppe committed · Commit c625f4c · Parent: e725020
feat: trying to add ragas

Browse files:
- .vscode/launch.json +19 -0
- _utils/files/ragas_testset.csv +18 -0
- _utils/resumo_completo_cursor.py +570 -251
- resumos/serializer.py +102 -21
- setup/settings.py +3 -3
- setup/urls.py +24 -10
.vscode/launch.json  ADDED
@@ -0,0 +1,19 @@
+{
+    // Use IntelliSense to learn about possible attributes.
+    // Hover to view descriptions of existing attributes.
+    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
+    "version": "0.2.0",
+    "configurations": [
+        {
+            "name": "Python Debugger: Django",
+            "type": "debugpy",
+            "request": "launch",
+            "args": [
+                "runserver"
+            ],
+            "django": true,
+            "autoStartBrowser": false,
+            "program": "${workspaceFolder}\\manage.py"
+        }
+    ]
+}
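This launch configuration runs Django's development server under VS Code's debugpy adapter. For reference, a roughly equivalent command-line invocation (the port number is an assumption, not part of this commit) would be:

    python -m debugpy --listen 5678 manage.py runserver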
_utils/files/ragas_testset.csv  ADDED
@@ -0,0 +1,18 @@
+position,user_input,reference,response,retrieved_contexts
+0,What are the global implications of the USA Supreme Court ruling on abortion?,"The global implications of the USA Supreme Court ruling on abortion are significant. The ruling has led to limited or no access to abortion for one in three women and girls of reproductive age in states where abortion access is restricted. These states also have weaker maternal health support, higher maternal death rates, and higher child poverty rates. Additionally, the ruling has had an impact beyond national borders due to the USA's geopolitical and cultural influence globally. Organizations and activists worldwide are concerned that the ruling may inspire anti-abortion legislative and policy attacks in other countries. The ruling has also hindered progressive law reform and the implementation of abortion guidelines in certain African countries. Furthermore, the ruling has created a chilling effect in international policy spaces, empowering anti-abortion actors to undermine human rights protections.","The global implications of the USA Supreme Court ruling on abortion can be significant, as it sets a precedent for other countries and influences the global discourse on reproductive rights. Here are some potential implications:
+
+1. Influence on other countries: The Supreme Court's ruling can serve as a reference point for other countries grappling with their own abortion laws. It can provide legal arguments and reasoning that advocates for reproductive rights can use to challenge restrictive abortion laws in their respective jurisdictions.
+
+2. Strengthening of global reproductive rights movements: A favorable ruling by the Supreme Court can energize and empower reproductive rights movements worldwide. It can serve as a rallying point for activists and organizations advocating for women's rights, leading to increased mobilization and advocacy efforts globally.
+
+3. Counteracting anti-abortion movements: Conversely, a ruling that restricts abortion rights can embolden anti-abortion movements globally. It can provide legitimacy to their arguments and encourage similar restrictive measures in other countries, potentially leading to a rollback of existing reproductive rights.
+
+4. Impact on international aid and policies: The Supreme Court's ruling can influence international aid and policies related to reproductive health. It can shape the priorities and funding decisions of donor countries and organizations, potentially leading to increased support for reproductive rights initiatives or conversely, restrictions on funding for abortion-related services.
+
+5. Shaping international human rights standards: The ruling can contribute to the development of international human rights standards regarding reproductive rights. It can influence the interpretation and application of existing human rights treaties and conventions, potentially strengthening the recognition of reproductive rights as fundamental human rights globally.
+
+6. Global health implications: The Supreme Court's ruling can have implications for global health outcomes, particularly in countries with restrictive abortion laws. It can impact the availability and accessibility of safe and legal abortion services, potentially leading to an increase in unsafe abortions and related health complications.
+
+It is important to note that the specific implications will depend on the nature of the Supreme Court ruling and the subsequent actions taken by governments, activists, and organizations both within and outside the United States.","[""- In 2022, the USA Supreme Court handed down a decision ruling that overturned 50 years of jurisprudence recognizing a constitutional right to abortion.\n- This decision has had a massive impact: one in three women and girls of reproductive age now live in states where abortion access is either totally or near-totally inaccessible.\n- The states with the most restrictive abortion laws have the weakest maternal health support, higher maternal death rates, and higher child poverty rates.\n- The USA Supreme Court ruling has also had impacts beyond national borders due to the geopolitical and cultural influence wielded by the USA globally and the aid it funds.\n- SRR organizations and activists across the world have expressed fear about the ruling laying the groundwork for anti-abortion legislative and policy attacks in other countries.\n- Advocates have also observed the ruling's impact on progressive law reform and the stalling of the adoption and enforcement of abortion guidelines in certain African countries.\n- The ruling has created a chilling effect in international policy spaces, emboldening anti-abortion state and non-state actors to undermine human rights protections.""
+'The USA Supreme Court ruling on abortion has sparked intense debates and discussions not only within the country but also around the world. Many countries look to the United States as a leader in legal and social issues, so the decision could potentially influence the policies and attitudes towards abortion in other nations.'
+""The ruling may also impact international organizations and non-governmental groups that work on reproductive rights and women's health issues. Depending on the outcome, there could be shifts in funding, advocacy efforts, and collaborations with American counterparts, leading to ripple effects in the global fight for reproductive justice.""]"
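The test set above follows the ragas evaluation schema (user_input, reference, response, retrieved_contexts). A hypothetical sketch of reading it back for evaluation, assuming pandas and ragas 0.2.x are installed; retrieved_contexts is stored as a stringified Python list, so it is parsed with ast.literal_eval:

    import ast
    import pandas as pd
    from ragas import EvaluationDataset

    df = pd.read_csv("_utils/files/ragas_testset.csv")
    # The CSV stores the context list as a string; turn it back into a list.
    df["retrieved_contexts"] = df["retrieved_contexts"].apply(ast.literal_eval)
    records = df.drop(columns=["position"]).to_dict(orient="records")
    eval_dataset = EvaluationDataset.from_list(records)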
_utils/resumo_completo_cursor.py
CHANGED
@@ -1,4 +1,3 @@
-
 import os
 from typing import List, Dict, Tuple, Optional
 from langchain.text_splitter import RecursiveCharacterTextSplitter
@@ -19,6 +18,9 @@ from cohere import Client
 import requests
 from setup.environment import api_url
 from rest_framework.response import Response
+from langchain.schema import Document
+
+listaContador = []
 
 def reciprocal_rank_fusion(result_lists, weights=None):
     """Combine multiple ranked lists using reciprocal rank fusion"""
@@ -34,18 +36,16 @@ def reciprocal_rank_fusion(result_lists, weights=None):
             fused_scores[doc_id] += weights[i] * score
 
     # Sort by score in descending order
-    sorted_results = sorted(
-        fused_scores.items(),
-        key=lambda x: x[1],
-        reverse=True
-    )
-
+    sorted_results = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
+
     return sorted_results
 
-
-os.environ["
+
+os.environ["LANGCHAIN_TRACING_V2"] = "true"
+os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
 os.environ.get("LANGCHAIN_API_KEY")
-os.environ["LANGCHAIN_PROJECT"]="VELLA"
+os.environ["LANGCHAIN_PROJECT"] = "VELLA"
+
 
 @dataclass
 class DocumentChunk:
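For reference, reciprocal rank fusion combines several rankings by summing, per document, a weighted 1/(k + rank) contribution from each list, so documents that rank well in both the embedding and BM25 lists rise to the top. A self-contained sketch (the constant k = 60 is the value commonly used in the literature, an assumption here since the accumulation step of reciprocal_rank_fusion is elided in this hunk):

    # Standalone RRF sketch; mirrors the shape of reciprocal_rank_fusion above.
    def rrf(result_lists, weights=None, k=60):
        weights = weights or [1.0] * len(result_lists)
        fused_scores = {}
        for i, results in enumerate(result_lists):
            for rank, (doc_id, _score) in enumerate(results):
                fused_scores[doc_id] = fused_scores.get(doc_id, 0.0) + weights[i] * (
                    1.0 / (k + rank + 1)
                )
        return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)

    # Example: fuse an embedding ranking with a BM25 ranking.
    emb = [("c1", 0.9), ("c2", 0.7)]
    bm25 = [("c2", 11.2), ("c3", 8.4)]
    print(rrf([emb, bm25], weights=[0.5, 0.5]))  # c2 ranks first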
@@ -55,6 +55,7 @@ class DocumentChunk:
     start_char: int
     end_char: int
 
+
 @dataclass
 class RetrievalConfig:
     num_chunks: int = 5
@@ -64,158 +65,216 @@ class RetrievalConfig:
     chunk_overlap: int = 200
     chunk_size: int = 1000
 
+
 @dataclass
 class ContextualizedChunk(DocumentChunk):
     context: str = ""
     embedding: Optional[np.ndarray] = None
     bm25_score: Optional[float] = None
 
+
 class DocumentSummarizer:
-    def __init__(self, openai_api_key: str, cohere_api_key: str, embedding_model, chunk_size, chunk_overlap, num_k_rerank, model_cohere_rerank):
+
+    def __init__(
+        self,
+        openai_api_key: str,
+        cohere_api_key: str,
+        embedding_model,
+        chunk_size,
+        chunk_overlap,
+        num_k_rerank,
+        model_cohere_rerank,
+    ):
         self.openai_api_key = openai_api_key
         self.cohere_client = Client(cohere_api_key)
-        self.embeddings = HuggingFaceEmbeddings(
-            model_name=embedding_model
-        )
+        self.embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
         self.text_splitter = RecursiveCharacterTextSplitter(
-            chunk_size=chunk_size,
-            chunk_overlap=chunk_overlap
+            chunk_size=chunk_size, chunk_overlap=chunk_overlap
         )
         self.chunk_metadata = {}  # Store chunk metadata for tracing
         self.num_k_rerank = num_k_rerank
         self.model_cohere_rerank = model_cohere_rerank
-
+
     def load_and_split_document(self, pdf_path: str) -> List[DocumentChunk]:
         """Load PDF and split into chunks with metadata"""
         loader = PyPDFLoader(pdf_path)
-        pages = loader.load()
+        pages = (
+            loader.load()
+        )  # Produces a list of Document objects, each item corresponding to ONE full PAGE of the PDF.
         chunks = []
        char_count = 0
-
+
        for page in pages:
            text = page.page_content
-            page_chunks = self.text_splitter.split_text(text)
-
+            page_chunks = self.text_splitter.split_text(
+                text
+            )  # Breaks the single-page Document into a list where each item is a chunk, i.e. pieces smaller than a page.
+
            for chunk in page_chunks:
                chunk_id = str(uuid.uuid4())
-                start_char = text.find(chunk)
+                start_char = text.find(
+                    chunk
+                )  # Returns the position of the chunk within the full page
                end_char = start_char + len(chunk)
-
-                doc_chunk = DocumentChunk(
+
+                doc_chunk = DocumentChunk(  # Builds the chunk object with extra information, such as the chunk's position and id
                    content=chunk,
-                    page_number=page.metadata.get("page"),
+                    page_number=page.metadata.get("page") + 1,  # 1-based page numbering
                    chunk_id=chunk_id,
                    start_char=char_count + start_char,
-                    end_char=char_count + end_char
+                    end_char=char_count + end_char,
                )
                chunks.append(doc_chunk)
-
                # Store metadata for later retrieval
                self.chunk_metadata[chunk_id] = {
-                    "page": doc_chunk.page_number,
-                    "start_char": doc_chunk.start_char,
-                    "end_char": doc_chunk.end_char
+                    "page": doc_chunk.page_number,
+                    "start_char": doc_chunk.start_char,
+                    "end_char": doc_chunk.end_char,
                }
-
            char_count += len(text)
-
        return chunks
 
+    def load_and_split_text(self, text: str) -> List[DocumentChunk]:
+        """Load Text and split into chunks with metadata - I created this function only for ragas"""
+        page = Document(page_content=text, metadata={"page": 1})
+        chunks = []
+        char_count = 0
+
+        text = page.page_content
+        page_chunks = self.text_splitter.split_text(
+            text
+        )  # Breaks the single-page Document into a list where each item is a chunk, i.e. pieces smaller than a page.
+        print("\n\n\n")
+        print("page_chunks: ", page_chunks)
+
+        for chunk in page_chunks:
+            chunk_id = str(uuid.uuid4())
+            start_char = text.find(
+                chunk
+            )  # Returns the position of the chunk within the full page
+            end_char = start_char + len(chunk)
+
+            doc_chunk = DocumentChunk(  # Builds the chunk object with extra information, such as the chunk's position and id
+                content=chunk,
+                page_number=page.metadata.get("page") + 1,  # 1-based page numbering
+                chunk_id=chunk_id,
+                start_char=char_count + start_char,
+                end_char=char_count + end_char,
+            )
+            chunks.append(doc_chunk)
+
+            # Store metadata for later retrieval
+            self.chunk_metadata[chunk_id] = {
+                "page": doc_chunk.page_number,
+                "start_char": doc_chunk.start_char,
+                "end_char": doc_chunk.end_char,
+            }
+
+            char_count += len(text)
+
+        return chunks
+
-    def create_vector_store(self, chunks: List[DocumentChunk]) -> Chroma:
+    def create_vector_store(
+        self, chunks: List[DocumentChunk]
+    ) -> Chroma:  # This function is never used
         """Create vector store with metadata"""
         texts = [chunk.content for chunk in chunks]
-        metadatas = [{
-            'chunk_id': chunk.chunk_id,
-            'page': chunk.page_number,
-            'start_char': chunk.start_char,
-            'end_char': chunk.end_char
-        } for chunk in chunks]
-
+        metadatas = [
+            {
+                "chunk_id": chunk.chunk_id,
+                "page": chunk.page_number,
+                "start_char": chunk.start_char,
+                "end_char": chunk.end_char,
+            }
+            for chunk in chunks
+        ]
+
         vector_store = Chroma.from_texts(
-            texts=texts,
-            metadatas=metadatas,
-            embedding=self.embeddings
+            texts=texts, metadatas=metadatas, embedding=self.embeddings
         )
         return vector_store
 
-    def rerank_chunks(
-        self,
-        chunks: List[Dict],
-        query: str,
-        k: int = 5
+    def rerank_chunks(  # This function is never used
+        self, chunks: List[Dict], query: str, k: int = 5
     ) -> List[Dict]:
         """
         Rerank chunks using Cohere's reranking model.
-
+
         Args:
             chunks: List of dictionaries containing chunks and their metadata
             query: Original search query
             k: Number of top chunks to return
-
+
         Returns:
             List of reranked chunks with updated relevance scores
         """
         try:
             # Prepare documents for reranking
-            documents = [chunk['content'] for chunk in chunks]
-
+            documents = [chunk["content"] for chunk in chunks]
+
             # Get reranking scores from Cohere
             results = self.cohere_client.rerank(
                 query=query,
                 documents=documents,
                 top_n=k,
-                model=self.model_cohere_rerank
+                model=self.model_cohere_rerank,
             )
-
+
             # Create reranked results with original metadata
             reranked_chunks = []
             for hit in results:
                 original_chunk = chunks[hit.index]
-                reranked_chunks.append({
-                    **original_chunk,
-                    'relevance_score': hit.relevance_score
-                })
-
+                reranked_chunks.append(
+                    {**original_chunk, "relevance_score": hit.relevance_score}
+                )
+
             return reranked_chunks
-
+
         except Exception as e:
             logging.error(f"Reranking failed: {str(e)}")
             return chunks[:k]  # Fallback to original ordering
 
-    def generate_summary_with_sources(
-        self,
-        vector_store: Chroma,
-        query: str = "Summarize the main points of this document"
+    def generate_summary_with_sources(  # This function is never used
+        self,
+        vector_store: Chroma,
+        query: str = "Summarize the main points of this document",
     ) -> List[Dict]:
         """Generate summary with source citations using reranking"""
         # Retrieve more initial chunks for reranking
         relevant_docs = vector_store.similarity_search_with_score(query, k=20)
-
+
         # Prepare chunks for reranking
         chunks = []
         for doc, score in relevant_docs:
-            chunks.append({
-                'content': doc.page_content,
-                'page': doc.metadata['page'],
-                'chunk_id': doc.metadata['chunk_id'],
-                'relevance_score': score
-            })
-
+            chunks.append(
+                {
+                    "content": doc.page_content,
+                    "page": doc.metadata["page"],
+                    "chunk_id": doc.metadata["chunk_id"],
+                    "relevance_score": score,
+                }
+            )
+
         # Rerank chunks
         reranked_chunks = self.rerank_chunks(chunks, query, k=self.num_k_rerank)
-
+
         # Prepare context and sources from reranked chunks
         contexts = []
         sources = []
-
+
         for chunk in reranked_chunks:
-            contexts.append(chunk['content'])
-            sources.append({
-                'content': chunk['content'],
-                'page': chunk['page'],
-                'chunk_id': chunk['chunk_id'],
-                'relevance_score': chunk['relevance_score']
-            })
+            contexts.append(chunk["content"])
+            sources.append(
+                {
+                    "content": chunk["content"],
+                    "page": chunk["page"],
+                    "chunk_id": chunk["chunk_id"],
+                    "relevance_score": chunk["relevance_score"],
+                }
+            )
 
         prompt_template = """
         Based on the following context, provide multiple key points from the document.
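A quick, hypothetical usage sketch for the new load_and_split_text path (added for the ragas tests); the embedding and rerank model names below are placeholders, not values taken from this commit:

    import os

    from _utils.resumo_completo_cursor import DocumentSummarizer

    summarizer = DocumentSummarizer(
        openai_api_key=os.environ.get("OPENAI_API_KEY"),
        cohere_api_key=os.environ.get("COHERE_API_KEY"),
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # placeholder
        chunk_size=1000,
        chunk_overlap=200,
        num_k_rerank=5,
        model_cohere_rerank="rerank-english-v2.0",  # placeholder
    )
    chunks = summarizer.load_and_split_text("Some long plain-text document ...")
    # Every chunk reports page_number == 2, since the synthetic page metadata
    # {"page": 1} is shifted by the 1-based page numbering adjustment.
    print(len(chunks), chunks[0].page_number)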
@@ -226,101 +285,172 @@ class DocumentSummarizer:
 
         Key points:
         """
-
-        prompt = PromptTemplate(
-            template=prompt_template,
-            input_variables=["context"]
-        )
-
+
+        prompt = PromptTemplate(template=prompt_template, input_variables=["context"])
+
         llm = ChatOpenAI(
-            temperature=0,
-            model_name="gpt-4o-mini",
-            api_key=self.openai_api_key
+            temperature=0, model_name="gpt-4o-mini", api_key=self.openai_api_key
         )
-
+
         response = llm.predict(prompt.format(context="\n\n".join(contexts)))
-
+
         # Split the response into paragraphs
-        summaries = [p.strip() for p in response.split('\n\n') if p.strip()]
-
+        summaries = [p.strip() for p in response.split("\n\n") if p.strip()]
+
         # Create structured output
         structured_output = []
         for idx, summary in enumerate(summaries):
             # Associate each summary with the most relevant source
-            structured_output.append({
-                'content': summary,
-                'source': {
-                    'page': sources[min(idx, len(sources)-1)]['page'],
-                    'text': sources[min(idx, len(sources)-1)]['content'][:200] + "...",
-                    'relevance_score': sources[min(idx, len(sources)-1)]['relevance_score']
+            structured_output.append(
+                {
+                    "content": summary,
+                    "source": {
+                        "page": sources[min(idx, len(sources) - 1)]["page"],
+                        "text": sources[min(idx, len(sources) - 1)]["content"][:200]
+                        + "...",
+                        "relevance_score": sources[min(idx, len(sources) - 1)][
+                            "relevance_score"
+                        ],
+                    },
                 }
-            })
-
+            )
+
         return structured_output
 
-    def get_source_context(self, chunk_id: str, window: int = 100) -> Dict:
+    def get_source_context(
+        self, chunk_id: str, window: int = 100
+    ) -> Dict:  # This function is never used
         """Get extended context around a specific chunk"""
         metadata = self.chunk_metadata.get(chunk_id)
         if not metadata:
             return None
-
+
         return {
-            'page': metadata['page'],
-            'start_char': metadata['start_char'],
-            'end_char': metadata['end_char']
+            "page": metadata["page"],
+            "start_char": metadata["start_char"],
+            "end_char": metadata["end_char"],
         }
 
+
 class ContextualRetriever:
-    def __init__(self, config: RetrievalConfig, claude_api_key: str, claude_context_model):
-        self.config = config
+
+    def __init__(
+        self, config: RetrievalConfig, claude_api_key: str, claude_context_model
+    ):
+        self.config = config  # This self.config is currently not used for anything inside this class. Review whether it should be used.
         self.claude_client = Anthropic(api_key=claude_api_key)
         self.logger = logging.getLogger(__name__)
         self.bm25 = None
         self.claude_context_model = claude_context_model
-
+
     def generate_context(self, full_text: str, chunk: DocumentChunk) -> str:
         """Generate contextual description using Claude"""
         try:
-            prompt = f"""<document>
-            {full_text}
-            </document>
-            Here is the chunk we want to situate within the whole document
-            <chunk>
-            {chunk.content}
-            </chunk>
-            Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""
+            # prompt = f"""<document>
+            # {full_text}
+            # </document>
+            # Here is the chunk we want to situate within the whole document
+            # <chunk>
+            # {chunk.content}
+            # </chunk>
+            # Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""
+
+            prompt = f"""You are a language model tasked with providing context to improve the retrieval of information from a chunk extracted from a document. Follow these steps internally (do not display reasoning or reflection in the final output):
+            1. **Chain of Thought (internal)**:
+            - Identify the document ID, which is the value between "NUM." and "- Pág".
+            - Identify the document name from the header.
+            2. **Reflection (internal)**:
+            - Confirm the document ID and name are correctly identified.
+            - Ensure the final context is concise and helpful.
+            3. **Final Response**:
+            - Provide a short context situating the *chunk* within the document, including the document ID and document name.
+            - Do not include any reasoning or reflection in your response.
+            **Example Usage:**
+            ```
+            <document> {full_text} </document>
+            <chunk> {chunk.content} </chunk>
+            Please return only the succinct context (without displaying your internal reasoning), including the document ID and the document name.
+            ```
+            """
 
             response = self.claude_client.messages.create(
                 model=self.claude_context_model,
                 max_tokens=100,
-                messages=[{"role": "user", "content": prompt}]
+                messages=[{"role": "user", "content": prompt}],
             )
-            return response.content[0].text
+            return response.content[
+                0
+            ].text  # response.content is a list because a list of messages is passed in and a list of messages is returned, the first one being the most recent, which is the model's answer
         except Exception as e:
-            self.logger.error(f"Context generation failed for chunk {chunk.chunk_id}: {str(e)}")
+            self.logger.error(
+                f"Context generation failed for chunk {chunk.chunk_id}: {str(e)}"
+            )
             return ""
 
-    def contextualize_chunks(self, full_text: str, chunks: List[DocumentChunk]) -> List[ContextualizedChunk]:
+    def contextualize_chunks(
+        self, full_text: List[Document], chunks: List[DocumentChunk]
+    ) -> List[
+        ContextualizedChunk
+    ]:  # Takes a chunk and just adds a context property to it, this property being the answer from the function above, which calls a Claude model to describe the context of a chunk
         """Add context to all chunks"""
+
+        smaller_context = ""
         contextualized_chunks = []
+        print("\n\n")
+        print("len(chunks): ", len(chunks))
         for chunk in chunks:
-            context = self.generate_context(full_text, chunk)
+            contador_pagina = -1
+            while contador_pagina <= 1:
+                local_page = full_text[chunk.page_number + contador_pagina]
+                if local_page:
+                    smaller_context += local_page.page_content
+                contador_pagina += 1
+            print("chunk.page_number: ", chunk.page_number)
+            context = self.generate_context(smaller_context, chunk)
             contextualized_chunk = ContextualizedChunk(
                 content=chunk.content,
                 page_number=chunk.page_number,
                 chunk_id=chunk.chunk_id,
                 start_char=chunk.start_char,
                 end_char=chunk.end_char,
-                context=context
+                context=context,
             )
             contextualized_chunks.append(contextualized_chunk)
         return contextualized_chunks
 
+
 class EnhancedDocumentSummarizer(DocumentSummarizer):
-    def __init__(self, openai_api_key: str, claude_api_key: str, config: RetrievalConfig, embedding_model, chunk_size, chunk_overlap, num_k_rerank, model_cohere_rerank, claude_context_model, prompt_relatorio, gpt_model, gpt_temperature, id_modelo_do_usuario, prompt_modelo):
-        super().__init__(openai_api_key, os.environ.get("COHERE_API_KEY"), embedding_model, chunk_size, chunk_overlap, num_k_rerank, model_cohere_rerank)
+
+    def __init__(
+        self,
+        openai_api_key: str,
+        claude_api_key: str,
+        config: RetrievalConfig,
+        embedding_model,
+        chunk_size,
+        chunk_overlap,
+        num_k_rerank,
+        model_cohere_rerank,
+        claude_context_model,
+        prompt_relatorio,
+        gpt_model,
+        gpt_temperature,
+        id_modelo_do_usuario,
+        prompt_modelo,
+    ):
+        super().__init__(
+            openai_api_key,
+            os.environ.get("COHERE_API_KEY"),
+            embedding_model,
+            chunk_size,
+            chunk_overlap,
+            num_k_rerank,
+            model_cohere_rerank,
+        )
         self.config = config
-        self.contextual_retriever = ContextualRetriever(config, claude_api_key, claude_context_model)
+        self.contextual_retriever = ContextualRetriever(
+            config, claude_api_key, claude_context_model
+        )
         self.logger = logging.getLogger(__name__)
         self.prompt_relatorio = prompt_relatorio
         self.gpt_model = gpt_model
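The contextualization step follows the contextual-retrieval idea: before indexing, each chunk is given a short model-generated description of where it sits in the surrounding pages. A hedged sketch of driving it directly (import paths and the PDF name are assumptions); note that the page-window loop above reads full_text[chunk.page_number - 1] through full_text[chunk.page_number + 1], which can raise IndexError for chunks near the last page:

    import os

    from langchain_community.document_loaders import PyPDFLoader  # assumed import path

    from _utils.resumo_completo_cursor import (
        ContextualRetriever,
        DocumentChunk,
        RetrievalConfig,
    )

    retriever = ContextualRetriever(
        config=RetrievalConfig(),
        claude_api_key=os.environ.get("CLAUDE_API_KEY"),
        claude_context_model="claude-3-haiku-20240307",
    )
    pages = PyPDFLoader("example.pdf").load()  # hypothetical file with >= 3 pages
    chunks = [
        DocumentChunk(
            content=pages[0].page_content[:500],
            page_number=1,
            chunk_id="demo-1",
            start_char=0,
            end_char=500,
        )
    ]
    contextualized = retriever.contextualize_chunks(pages, chunks)
    print(contextualized[0].context)  # short context naming the document ID and name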
@@ -328,25 +458,28 @@ class EnhancedDocumentSummarizer(DocumentSummarizer):
         self.id_modelo_do_usuario = id_modelo_do_usuario
         self.prompt_modelo = prompt_modelo
 
-    def create_enhanced_vector_store(self, chunks: List[ContextualizedChunk]) -> Tuple[Chroma, BM25Okapi, List[str]]:
+    def create_enhanced_vector_store(
+        self, chunks: List[ContextualizedChunk]
+    ) -> Tuple[Chroma, BM25Okapi, List[str]]:
         """Create vector store and BM25 index with contextualized chunks"""
         try:
             # Prepare texts with context
             texts = [f"{chunk.context} {chunk.content}" for chunk in chunks]
-
+
             # Create vector store
-            metadatas = [{
-                'chunk_id': chunk.chunk_id,
-                'page': chunk.page_number,
-                'start_char': chunk.start_char,
-                'end_char': chunk.end_char,
-                'context': chunk.context
-            } for chunk in chunks]
-
+            metadatas = [
+                {
+                    "chunk_id": chunk.chunk_id,
+                    "page": chunk.page_number,
+                    "start_char": chunk.start_char,
+                    "end_char": chunk.end_char,
+                    "context": chunk.context,
+                }
+                for chunk in chunks
+            ]
+
             vector_store = Chroma.from_texts(
-                texts=texts,
-                metadatas=metadatas,
-                embedding=self.embeddings
+                texts=texts, metadatas=metadatas, embedding=self.embeddings
             )
 
             # Create BM25 index
@@ -363,37 +496,34 @@ class EnhancedDocumentSummarizer(DocumentSummarizer):
             raise
 
     def retrieve_with_rank_fusion(
-        self,
-        vector_store: Chroma,
-        bm25: BM25Okapi,
-        chunk_ids: List[str],
-        query: str
+        self, vector_store: Chroma, bm25: BM25Okapi, chunk_ids: List[str], query: str
     ) -> List[Dict]:
         """Combine embedding and BM25 retrieval results"""
         try:
             # Get embedding results
             embedding_results = vector_store.similarity_search_with_score(
-                query,
-                k=self.config.num_chunks
+                query, k=self.config.num_chunks
             )
-
+
             # Convert embedding results to list of (chunk_id, score)
             embedding_list = [
-                (doc.metadata['chunk_id'], 1 / (1 + score))
+                (doc.metadata["chunk_id"], 1 / (1 + score))
                 for doc, score in embedding_results
             ]
-
+
             # Get BM25 results
             tokenized_query = query.split()
             bm25_scores = bm25.get_scores(tokenized_query)
-
+
             # Convert BM25 scores to list of (chunk_id, score)
             bm25_list = [
                 (chunk_ids[i], float(score)) for i, score in enumerate(bm25_scores)
             ]
 
             # Sort bm25_list by score in descending order and limit to top N results
-            bm25_list = sorted(bm25_list, key=lambda x: x[1], reverse=True)[:self.config.num_chunks]
+            bm25_list = sorted(bm25_list, key=lambda x: x[1], reverse=True)[
+                : self.config.num_chunks
+            ]
 
             # Normalize BM25 scores
             max_bm25 = max([score for _, score in bm25_list]) if bm25_list else 1
@@ -402,12 +532,9 @@ class EnhancedDocumentSummarizer(DocumentSummarizer):
             # Pass the lists to rank fusion
             result_lists = [embedding_list, bm25_list]
             weights = [self.config.embedding_weight, self.config.bm25_weight]
-
-            combined_results = reciprocal_rank_fusion(
-                result_lists,
-                weights=weights
-            )
-
+
+            combined_results = reciprocal_rank_fusion(result_lists, weights=weights)
+
             return combined_results
 
         except Exception as e:
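Chroma's similarity_search_with_score returns distances (lower is better), so the embedding list above maps each score through 1 / (1 + score) to obtain a similarity in (0, 1] before fusing with the max-normalized BM25 scores. A small numeric check:

    # Distance-to-similarity mapping used for the embedding list above.
    distances = [0.0, 0.5, 2.0]
    similarities = [1 / (1 + d) for d in distances]
    print(similarities)  # [1.0, 0.666..., 0.333...]: smaller distance, higher score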
@@ -415,100 +542,112 @@ class EnhancedDocumentSummarizer(DocumentSummarizer):
             raise
 
     def generate_enhanced_summary(
         self,
         vector_store: Chroma,
         bm25: BM25Okapi,
         chunk_ids: List[str],
-        query: str = "Summarize the main points of this document"
+        query: str = "Summarize the main points of this document",
     ) -> List[Dict]:
         """Generate enhanced summary using both vector and BM25 retrieval"""
         try:
             # Get combined results using rank fusion
             ranked_results = self.retrieve_with_rank_fusion(
-                vector_store,
-                bm25,
-                chunk_ids,
-                query
+                vector_store, bm25, chunk_ids, query
             )
 
             # Prepare context and track sources
             contexts = []
             sources = []
-
+
             # Get full documents for top results
-            for chunk_id, score in ranked_results[:self.config.num_chunks]:
+            for chunk_id, score in ranked_results[: self.config.num_chunks]:
                 results = vector_store.get(
-                    where={"chunk_id": chunk_id},
-                    include=["documents", "metadatas"]
+                    where={"chunk_id": chunk_id}, include=["documents", "metadatas"]
                 )
-
+
                 if results["documents"]:
                     context = results["documents"][0]
                     metadata = results["metadatas"][0]
-
+
                     contexts.append(context)
-                    sources.append({
-                        'content': context,
-                        'page': metadata['page'],
-                        'chunk_id': chunk_id,
-                        'relevance_score': score,
-                        'context': metadata.get('context', '')
-                    })
-
+                    sources.append(
+                        {
+                            "content": context,
+                            "page": metadata["page"],
+                            "chunk_id": chunk_id,
+                            "relevance_score": score,
+                            "context": metadata.get("context", ""),
+                        }
+                    )
+
             url_request = f"{api_url}/modelo/{self.id_modelo_do_usuario}"
             resposta = requests.get(url_request)
-
-            if resposta.status_code != 200:
-                return Response({"error": "Ocorreu um problema. Pode ser que o modelo não tenha sido encontrado. Tente novamente e/ou entre em contato com a equipe técnica"})
+
+            if resposta.status_code != 200:
+                return Response(
+                    {
+                        "error": "Ocorreu um problema. Pode ser que o modelo não tenha sido encontrado. Tente novamente e/ou entre em contato com a equipe técnica"
+                    }
+                )
+
             modelo_buscado = resposta.json()["modelo"]
-
+
             llm = ChatOpenAI(
                 temperature=self.gpt_temperature,
                 model_name=self.gpt_model,
                 api_key=self.openai_api_key,
-
             )
-
+
             prompt_gerar_relatorio = PromptTemplate(
-                template=self.prompt_relatorio,
-                input_variables=["context"]
-            )
-
-            relatorio_gerado = llm.predict(prompt_gerar_relatorio.format(context="\n\n".join(contexts)))
+                template=self.prompt_relatorio, input_variables=["context"]
+            )
+
+            relatorio_gerado = llm.predict(
+                prompt_gerar_relatorio.format(context="\n\n".join(contexts))
+            )
 
             prompt_gerar_modelo = PromptTemplate(
                 template=self.prompt_modelo,
-                input_variables=["context", "modelo_usuario"]
+                input_variables=["context", "modelo_usuario"],
             )
-
-            modelo_gerado = llm.predict(prompt_gerar_modelo.format(context=relatorio_gerado, modelo_usuario=modelo_buscado))
-
+
+            modelo_gerado = llm.predict(
+                prompt_gerar_modelo.format(
+                    context=relatorio_gerado, modelo_usuario=modelo_buscado
+                )
+            )
+
             # Split the response into paragraphs
-            summaries = [p.strip() for p in modelo_gerado.split('\n\n') if p.strip()]
-
+            summaries = [p.strip() for p in modelo_gerado.split("\n\n") if p.strip()]
+
             # Create structured output
             structured_output = []
             for idx, summary in enumerate(summaries):
-                source_idx = min(idx, len(sources)-1)
-                structured_output.append({
-                    'content': summary,
-                    'source': {
-                        'page': sources[source_idx]['page'],
-                        'text': sources[source_idx]['content'][:200] + "...",
-                        'context': sources[source_idx]['context'],
-                        'relevance_score': sources[source_idx]['relevance_score'],
-                        'chunk_id': sources[source_idx]['chunk_id']
+                source_idx = min(idx, len(sources) - 1)
+                structured_output.append(
+                    {
+                        "content": summary,
+                        "source": {
+                            "page": sources[source_idx]["page"],
+                            "text": sources[source_idx]["content"][:200] + "...",
+                            "context": sources[source_idx]["context"],
+                            "relevance_score": sources[source_idx]["relevance_score"],
+                            "chunk_id": sources[source_idx]["chunk_id"],
+                        },
                     }
-                })
-
+                )
+
             return structured_output
 
         except Exception as e:
             self.logger.error(f"Error generating enhanced summary: {str(e)}")
             raise
 
-def get_llm_summary_answer_by_cursor_complete(serializer, listaPDFs):
+
+async def get_llm_summary_answer_by_cursor_complete(
+    serializer, listaPDFs=None, contexto=None
+):
+    """The "contexto" parameter should only be passed when you want to run the ragas test and therefore don't want to pass PDFs"""
     allPdfsChunks = []
 
     # Configuration
@@ -517,13 +656,13 @@ def get_llm_summary_answer_by_cursor_complete(serializer, listaPDFs):
         embedding_weight=serializer["embedding_weight"],
         bm25_weight=serializer["bm25_weight"],
         context_window=serializer["context_window"],
-        chunk_overlap=serializer["chunk_overlap"]
+        chunk_overlap=serializer["chunk_overlap"],
     )
 
     # Initialize enhanced summarizer
     summarizer = EnhancedDocumentSummarizer(
         openai_api_key=os.environ.get("OPENAI_API_KEY"),
-        claude_api_key=
+        claude_api_key=os.environ.get("CLAUDE_API_KEY"),
         config=config,
         embedding_model=serializer["hf_embedding"],
         chunk_overlap=serializer["chunk_overlap"],
@@ -535,60 +674,240 @@ def get_llm_summary_answer_by_cursor_complete(serializer, listaPDFs):
         gpt_model=serializer["model"],
         gpt_temperature=serializer["gpt_temperature"],
         id_modelo_do_usuario=serializer["id_modelo_do_usuario"],
-        prompt_modelo=serializer["prompt_modelo"]
+        prompt_modelo=serializer["prompt_modelo"],
     )
 
-    # # Load and process document
-    # pdf_path = "./Im_a_storyteller.pdf"
-    # chunks = summarizer.load_and_split_document(pdf_path)
-
-    # Load and process document
-    for pdf in listaPDFs:
-        pdf_path = pdf
-        chunks = summarizer.load_and_split_document(pdf_path)
-        allPdfsChunks = allPdfsChunks + chunks
-
-        # Get full text for contextualization
-        loader = PyPDFLoader(pdf_path)
-        pages = loader.load()
-        full_text = " ".join([page.page_content for page in pages])
+    full_text = ""
+    if contexto:
+        full_text = contexto
+        chunks = summarizer.load_and_split_text(full_text)
+        allPdfsChunks = chunks
+    else:
+        # # Load and process document
+        # pdf_path = "./Im_a_storyteller.pdf"
+        # chunks = summarizer.load_and_split_document(pdf_path)
+
+        # Load and process document
+        for pdf in listaPDFs:
+            pdf_path = pdf
+            chunks = summarizer.load_and_split_document(pdf_path)
+            allPdfsChunks = allPdfsChunks + chunks
+
+            # Get full text for contextualization
+            loader = PyPDFLoader(pdf_path)
+            pages = loader.load()
+            full_text = " ".join([page.page_content for page in pages])
 
     # Contextualize chunks
-    contextualized_chunks = summarizer.contextual_retriever.contextualize_chunks(full_text, allPdfsChunks)
+    contextualized_chunks = await summarizer.contextual_retriever.contextualize_chunks(
+        pages, allPdfsChunks
+    )
 
     # Create enhanced vector store and BM25 index
-    vector_store, bm25, chunk_ids = summarizer.create_enhanced_vector_store(contextualized_chunks)
+    vector_store, bm25, chunk_ids = summarizer.create_enhanced_vector_store(
+        contextualized_chunks
+    )
 
     # Generate enhanced summary
     structured_summaries = summarizer.generate_enhanced_summary(
-        vector_store,
-        bm25,
-        chunk_ids,
-        serializer["user_message"]
+        vector_store, bm25, chunk_ids, serializer["user_message"]
    )
 
     # Output results as JSON
     json_output = json.dumps(structured_summaries, indent=2)
     print("\nStructured Summaries:")
     print(json_output)
+    texto_completo = ""
+    for x in structured_summaries:
+        texto_completo = texto_completo + x["content"]
     return {
-        "resultado": structured_summaries,
-        "parametros-utilizados": {
-            "num_chunks_retrieval": serializer["num_chunks_retrieval"],
-            "embedding_weight": serializer["embedding_weight"],
-            "bm25_weight": serializer["bm25_weight"],
-            "context_window": serializer["context_window"],
-            "chunk_overlap": serializer["chunk_overlap"],
-            "num_k_rerank": serializer["num_k_rerank"],
-            "model_cohere_rerank": serializer["model_cohere_rerank"],
-            "more_initial_chunks_for_reranking": serializer["more_initial_chunks_for_reranking"],
-            "claude_context_model": serializer["claude_context_model"],
-            "gpt_temperature": serializer["gpt_temperature"],
-            "user_message": serializer["user_message"],
-            "model": serializer["model"],
-            "hf_embedding": serializer["hf_embedding"],
-            "chunk_size": serializer["chunk_size"],
-            "chunk_overlap": serializer["chunk_overlap"],
-            "prompt_relatorio": serializer["prompt_relatorio"],
-            "prompt_modelo": serializer["prompt_modelo"],
-        }
-    }
+        "resultado": structured_summaries,
+        "texto_completo": texto_completo,
+        "parametros-utilizados": {
+            "num_chunks_retrieval": serializer["num_chunks_retrieval"],
+            "embedding_weight": serializer["embedding_weight"],
+            "bm25_weight": serializer["bm25_weight"],
+            "context_window": serializer["context_window"],
+            "chunk_overlap": serializer["chunk_overlap"],
+            "num_k_rerank": serializer["num_k_rerank"],
+            "model_cohere_rerank": serializer["model_cohere_rerank"],
+            "more_initial_chunks_for_reranking": serializer[
+                "more_initial_chunks_for_reranking"
+            ],
+            "claude_context_model": serializer["claude_context_model"],
+            "gpt_temperature": serializer["gpt_temperature"],
+            "user_message": serializer["user_message"],
+            "model": serializer["model"],
+            "hf_embedding": serializer["hf_embedding"],
+            "chunk_size": serializer["chunk_size"],
+            "chunk_overlap": serializer["chunk_overlap"],
+            "prompt_relatorio": serializer["prompt_relatorio"],
+            "prompt_modelo": serializer["prompt_modelo"],
+        },
+    }
+
+
+from ragas import evaluate
+
+from langchain.chains import SequentialChain
+from langchain.prompts import PromptTemplate
+
+# from langchain.schema import ChainResult
+from langchain.memory import SimpleMemory
+
+
+def test_ragas(serializer, listaPDFs):
+
+    # Step 2: Setup RetrievalConfig and EnhancedDocumentSummarizer
+    config = RetrievalConfig(
+        num_chunks=serializer["num_chunks_retrieval"],
+        embedding_weight=serializer["embedding_weight"],
+        bm25_weight=serializer["bm25_weight"],
+        context_window=serializer["context_window"],
+        chunk_overlap=serializer["chunk_overlap"],
+    )
+
+    summarizer = EnhancedDocumentSummarizer(
+        openai_api_key=os.environ.get("OPENAI_API_KEY"),
+        claude_api_key=os.environ.get("CLAUDE_API_KEY"),
+        config=config,
+        embedding_model=serializer["hf_embedding"],
+        chunk_overlap=serializer["chunk_overlap"],
+        chunk_size=serializer["chunk_size"],
+        num_k_rerank=serializer["num_k_rerank"],
+        model_cohere_rerank=serializer["model_cohere_rerank"],
+        claude_context_model=serializer["claude_context_model"],
+        prompt_relatorio=serializer["prompt_relatorio"],
+        gpt_model=serializer["model"],
+        gpt_temperature=serializer["gpt_temperature"],
+        id_modelo_do_usuario=serializer["id_modelo_do_usuario"],
+        prompt_modelo=serializer["prompt_modelo"],
+    )
+
+    # Step 1: Define the components
+    def load_and_split_documents(pdf_list, summarizer):
+        """Loads and splits PDF documents into chunks."""
+        all_chunks = []
+        for pdf_path in pdf_list:
+            chunks = summarizer.load_and_split_document(pdf_path)
+            all_chunks.extend(chunks)
+        return {"chunks": all_chunks}
+
+    def get_full_text_from_pdfs(pdf_list):
+        """Gets the full text from PDFs for contextualization."""
+        full_text = []
+        for pdf_path in pdf_list:
+            loader = PyPDFLoader(pdf_path)
+            pages = loader.load()
+            text = " ".join([page.page_content for page in pages])
+            full_text.append(text)
+        return {"full_text": " ".join(full_text)}
+
+    def contextualize_chunks(full_text, chunks, contextual_retriever):
+        """Adds context to chunks using Claude."""
+        contextualized_chunks = contextual_retriever.contextualize_chunks(
+            full_text, chunks
+        )
+        return {"contextualized_chunks": contextualized_chunks}
+
+    def create_vector_store(contextualized_chunks, summarizer):
+        """Creates an enhanced vector store and BM25 index."""
+        vector_store, bm25, chunk_ids = summarizer.create_enhanced_vector_store(
+            contextualized_chunks
+        )
+        return {"vector_store": vector_store, "bm25": bm25, "chunk_ids": chunk_ids}
+
+    def generate_summary(vector_store, bm25, chunk_ids, query, summarizer):
+        """Generates an enhanced summary using the vector store and BM25 index."""
+        structured_summaries = summarizer.generate_enhanced_summary(
+            vector_store, bm25, chunk_ids, query
+        )
+        return {"structured_summaries": structured_summaries}
+
+    # Step 3: Define Sequential Chain
+    chain = SequentialChain(
+        chains=[
+            lambda inputs: load_and_split_documents(inputs["pdf_list"], summarizer),
+            lambda inputs: get_full_text_from_pdfs(inputs["pdf_list"]),
+            lambda inputs: contextualize_chunks(
+                inputs["full_text"], inputs["chunks"], summarizer.contextual_retriever
+            ),
+            lambda inputs: create_vector_store(
+                inputs["contextualized_chunks"], summarizer
+            ),
+            lambda inputs: generate_summary(
+                inputs["vector_store"],
+                inputs["bm25"],
+                inputs["chunk_ids"],
+                inputs["user_message"],
+                summarizer,
+            ),
+        ],
+        input_variables=["pdf_list", "user_message"],
+        output_variables=["structured_summaries"],
+    )
+
+    from ragas.langchain.evalchain import RagasEvaluatorChain
+    from ragas.metrics import (
+        LLMContextRecall,
+        Faithfulness,
+        FactualCorrectness,
+        SemanticSimilarity,
+    )
+    from ragas import evaluate
+    from ragas.llms import LangchainLLMWrapper
+
+    # from ragas.embeddings import LangchainEmbeddingsWrapper
+    # evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
+    evaluator_llm = LangchainLLMWrapper(chain)
+    # evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
+    from datasets import load_dataset
+
+    dataset = load_dataset(
+        "explodinggradients/amnesty_qa", "english_v3", trust_remote_code=True
+    )
+
+    from ragas import EvaluationDataset
+
+    eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])
+
+    metrics = [
+        LLMContextRecall(llm=evaluator_llm),
+        FactualCorrectness(llm=evaluator_llm),
+        Faithfulness(llm=evaluator_llm),
+        # SemanticSimilarity(embeddings=evaluator_embeddings)
+    ]
+    results = evaluate(dataset=eval_dataset, metrics=metrics)
+    print("results: ", results)
+
+    # Step 4: Run the Chain
+    inputs = {
+        "pdf_list": listaPDFs,
+        "user_message": serializer["user_message"],
+    }
+    # result = chain.run(inputs)
+    return Response({"msg": results})
+
+    # Step 5: Format the Output
+    # return {
+    #     "resultado": result["structured_summaries"],
+    #     "parametros-utilizados": {
+    #         "num_chunks_retrieval": serializer["num_chunks_retrieval"],
+    #         "embedding_weight": serializer["embedding_weight"],
+    #         "bm25_weight": serializer["bm25_weight"],
+    #         "context_window": serializer["context_window"],
+    #         "chunk_overlap": serializer["chunk_overlap"],
+    #         "num_k_rerank": serializer["num_k_rerank"],
+    #         "model_cohere_rerank": serializer["model_cohere_rerank"],
+    #         "more_initial_chunks_for_reranking": serializer["more_initial_chunks_for_reranking"],
+    #         "claude_context_model": serializer["claude_context_model"],
+    #         "gpt_temperature": serializer["gpt_temperature"],
+    #         "user_message": serializer["user_message"],
+    #         "model": serializer["model"],
+    #         "hf_embedding": serializer["hf_embedding"],
+    #         "chunk_size": serializer["chunk_size"],
+    #         "chunk_overlap": serializer["chunk_overlap"],
+    #         "prompt_relatorio": serializer["prompt_relatorio"],
+    #         "prompt_modelo": serializer["prompt_modelo"],
+    #     },
+    # }
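One caveat on test_ragas as committed: LangchainLLMWrapper is normally given a chat model, while here it wraps a SequentialChain built from plain lambdas, so the metrics are unlikely to run as-is. A minimal, self-contained sketch of the wiring that the commented-out lines point at (assuming ragas 0.2.x and langchain-openai are installed, and OPENAI_API_KEY is set):

    from datasets import load_dataset
    from langchain_openai import ChatOpenAI  # assumed import path
    from ragas import EvaluationDataset, evaluate
    from ragas.llms import LangchainLLMWrapper
    from ragas.metrics import Faithfulness, LLMContextRecall

    evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
    dataset = load_dataset(
        "explodinggradients/amnesty_qa", "english_v3", trust_remote_code=True
    )
    eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])
    metrics = [LLMContextRecall(llm=evaluator_llm), Faithfulness(llm=evaluator_llm)]
    print(evaluate(dataset=eval_dataset, metrics=metrics))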
1 |
import os
|
2 |
from typing import List, Dict, Tuple, Optional
|
3 |
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
|
|
18 |
import requests
|
19 |
from setup.environment import api_url
|
20 |
from rest_framework.response import Response
|
21 |
+
from langchain.schema import Document
|
22 |
+
|
23 |
+
listaContador = []
|
24 |
|
25 |
def reciprocal_rank_fusion(result_lists, weights=None):
|
26 |
"""Combine multiple ranked lists using reciprocal rank fusion"""
|
|
|
36 |
fused_scores[doc_id] += weights[i] * score
|
37 |
|
38 |
# Sort by score in descending order
|
39 |
+
sorted_results = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
|
40 |
+
|
|
|
|
|
|
|
|
|
41 |
return sorted_results
|
42 |
|
43 |
+
|
44 |
+
os.environ["LANGCHAIN_TRACING_V2"] = "true"
|
45 |
+
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
|
46 |
os.environ.get("LANGCHAIN_API_KEY")
|
47 |
+
os.environ["LANGCHAIN_PROJECT"] = "VELLA"
|
48 |
+
|
49 |
|
50 |
@dataclass
|
51 |
class DocumentChunk:
|
|
|
55 |
start_char: int
|
56 |
end_char: int
|
57 |
|
58 |
+
|
59 |
@dataclass
|
60 |
class RetrievalConfig:
|
61 |
num_chunks: int = 5
|
|
|
65 |
chunk_overlap: int = 200
|
66 |
chunk_size: int = 1000
|
67 |
|
68 |
+
|
69 |
@dataclass
|
70 |
class ContextualizedChunk(DocumentChunk):
|
71 |
context: str = ""
|
72 |
embedding: Optional[np.ndarray] = None
|
73 |
bm25_score: Optional[float] = None
|
74 |
|
75 |
+
|
76 |
class DocumentSummarizer:
|
77 |
+
|
78 |
+
def __init__(
|
79 |
+
self,
|
80 |
+
openai_api_key: str,
|
81 |
+
cohere_api_key: str,
|
82 |
+
embedding_model,
|
83 |
+
chunk_size,
|
84 |
+
chunk_overlap,
|
85 |
+
num_k_rerank,
|
86 |
+
model_cohere_rerank,
|
87 |
+
):
|
88 |
self.openai_api_key = openai_api_key
|
89 |
self.cohere_client = Client(cohere_api_key)
|
90 |
+
self.embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
|
|
|
|
|
91 |
self.text_splitter = RecursiveCharacterTextSplitter(
|
92 |
+
chunk_size=chunk_size, chunk_overlap=chunk_overlap
|
|
|
93 |
)
|
94 |
self.chunk_metadata = {} # Store chunk metadata for tracing
|
95 |
self.num_k_rerank = num_k_rerank
|
96 |
self.model_cohere_rerank = model_cohere_rerank
|
97 |
+
|
98 |
def load_and_split_document(self, pdf_path: str) -> List[DocumentChunk]:
|
99 |
"""Load PDF and split into chunks with metadata"""
|
100 |
loader = PyPDFLoader(pdf_path)
|
101 |
+
pages = (
|
102 |
+
loader.load()
|
103 |
+
) # Gera uma lista de objetos Document, sendo cada item da lista referente a UMA PÁGINA inteira do PDF.
|
104 |
chunks = []
|
105 |
char_count = 0
|
106 |
+
|
107 |
for page in pages:
|
108 |
text = page.page_content
|
109 |
+
page_chunks = self.text_splitter.split_text(
|
110 |
+
text
|
111 |
+
) # Quebra o item que é um Document de UMA PÁGINA inteira em um lista onde cada item é referente a um chunk, que são pedaços menores do que uma página.
|
112 |
+
|
113 |
for chunk in page_chunks:
|
114 |
chunk_id = str(uuid.uuid4())
|
115 |
+
start_char = text.find(
|
116 |
+
chunk
|
117 |
+
) # Retorna a posição onde se encontra o chunk dentro da página inteira
|
118 |
end_char = start_char + len(chunk)
|
119 |
+
|
120 |
+
doc_chunk = DocumentChunk( # Gera o objeto do chunk com informações adicionais, como a posição e id do chunk
|
121 |
content=chunk,
|
122 |
+
page_number=page.metadata.get("page") + 1, # 1-based page numbering
|
123 |
chunk_id=chunk_id,
|
124 |
start_char=char_count + start_char,
|
125 |
+
end_char=char_count + end_char,
|
126 |
)
|
127 |
chunks.append(doc_chunk)
|
128 |
+
|
129 |
# Store metadata for later retrieval
|
130 |
self.chunk_metadata[chunk_id] = {
|
131 |
+
"page": doc_chunk.page_number,
|
132 |
+
"start_char": doc_chunk.start_char,
|
133 |
+
"end_char": doc_chunk.end_char,
|
134 |
}
|
135 |
+
|
136 |
char_count += len(text)
|
137 |
+
|
138 |
return chunks
|
139 |
|
140 |
+
def load_and_split_text(self, text: str) -> List[DocumentChunk]:
|
141 |
+
"""Load Text and split into chunks with metadata - Criei essa função apenas para o ragas"""
|
142 |
+
page = Document(page_content=text, metadata={"page": 1})
|
143 |
+
chunks = []
|
144 |
+
char_count = 0
|
145 |
+
|
146 |
+
text = page.page_content
|
147 |
+
page_chunks = self.text_splitter.split_text(
|
148 |
+
text
|
149 |
+
) # Quebra o item que é um Document de UMA PÁGINA inteira em um lista onde cada item é referente a um chunk, que são pedaços menores do que uma página.
|
150 |
+
print("\n\n\n")
|
151 |
+
print("page_chunks: ", page_chunks)
|
152 |
+
|
153 |
+
for chunk in page_chunks:
|
154 |
+
chunk_id = str(uuid.uuid4())
|
155 |
+
start_char = text.find(
|
156 |
+
chunk
|
157 |
+
) # Retorna a posição onde se encontra o chunk dentro da página inteira
|
158 |
+
end_char = start_char + len(chunk)
|
159 |
+
|
160 |
+
doc_chunk = DocumentChunk( # Gera o objeto do chunk com informações adicionais, como a posição e id do chunk
|
161 |
+
content=chunk,
|
162 |
+
page_number=page.metadata.get("page") + 1, # 1-based page numbering
|
163 |
+
chunk_id=chunk_id,
|
164 |
+
start_char=char_count + start_char,
|
165 |
+
end_char=char_count + end_char,
|
166 |
+
)
|
167 |
+
chunks.append(doc_chunk)
|
168 |
+
|
169 |
+
# Store metadata for later retrieval
|
170 |
+
self.chunk_metadata[chunk_id] = {
|
171 |
+
"page": doc_chunk.page_number,
|
172 |
+
"start_char": doc_chunk.start_char,
|
173 |
+
"end_char": doc_chunk.end_char,
|
174 |
+
}
|
175 |
+
|
176 |
+
char_count += len(text)
|
177 |
+
|
178 |
+
return chunks
|
179 |
+
|
180 |
+
def create_vector_store(
|
181 |
+
self, chunks: List[DocumentChunk]
|
182 |
+
) -> Chroma: # Esta função nunca está sendo utilizada
|
183 |
"""Create vector store with metadata"""
|
184 |
texts = [chunk.content for chunk in chunks]
|
185 |
+
metadatas = [
|
186 |
+
{
|
187 |
+
"chunk_id": chunk.chunk_id,
|
188 |
+
"page": chunk.page_number,
|
189 |
+
"start_char": chunk.start_char,
|
190 |
+
"end_char": chunk.end_char,
|
191 |
+
}
|
192 |
+
for chunk in chunks
|
193 |
+
]
|
194 |
+
|
195 |
vector_store = Chroma.from_texts(
|
196 |
+
texts=texts, metadatas=metadatas, embedding=self.embeddings
|
|
|
|
|
197 |
)
|
198 |
return vector_store
|
199 |
|
200 |
+
def rerank_chunks( # Esta função nunca está sendo utilizada
|
201 |
+
self, chunks: List[Dict], query: str, k: int = 5
|
|
|
|
|
|
|
202 |
) -> List[Dict]:
|
203 |
"""
|
204 |
Rerank chunks using Cohere's reranking model.
|
205 |
+
|
206 |
Args:
|
207 |
chunks: List of dictionaries containing chunks and their metadata
|
208 |
query: Original search query
|
209 |
k: Number of top chunks to return
|
210 |
+
|
211 |
Returns:
|
212 |
List of reranked chunks with updated relevance scores
|
213 |
"""
|
214 |
try:
|
215 |
# Prepare documents for reranking
|
216 |
+
documents = [chunk["content"] for chunk in chunks]
|
217 |
+
|
218 |
# Get reranking scores from Cohere
|
219 |
results = self.cohere_client.rerank(
|
220 |
query=query,
|
221 |
documents=documents,
|
222 |
top_n=k,
|
223 |
+
model=self.model_cohere_rerank,
|
224 |
)
|
225 |
+
|
226 |
# Create reranked results with original metadata
|
227 |
reranked_chunks = []
|
228 |
for hit in results:
|
229 |
original_chunk = chunks[hit.index]
|
230 |
+
reranked_chunks.append(
|
231 |
+
{**original_chunk, "relevance_score": hit.relevance_score}
|
232 |
+
)
|
233 |
+
|
|
|
234 |
return reranked_chunks
|
235 |
+
|
236 |
except Exception as e:
|
237 |
logging.error(f"Reranking failed: {str(e)}")
|
238 |
return chunks[:k] # Fallback to original ordering
|
239 |
|
240 |
+
def generate_summary_with_sources( # Esta função nunca está sendo utilizada
|
241 |
+
self,
|
242 |
+
vector_store: Chroma,
|
243 |
+
query: str = "Summarize the main points of this document",
|
244 |
) -> List[Dict]:
|
245 |
"""Generate summary with source citations using reranking"""
|
246 |
# Retrieve more initial chunks for reranking
|
247 |
relevant_docs = vector_store.similarity_search_with_score(query, k=20)
|
248 |
+
|
249 |
# Prepare chunks for reranking
|
250 |
chunks = []
|
251 |
for doc, score in relevant_docs:
|
252 |
+
chunks.append(
|
253 |
+
{
|
254 |
+
"content": doc.page_content,
|
255 |
+
"page": doc.metadata["page"],
|
256 |
+
"chunk_id": doc.metadata["chunk_id"],
|
257 |
+
"relevance_score": score,
|
258 |
+
}
|
259 |
+
)
|
260 |
+
|
261 |
# Rerank chunks
|
262 |
reranked_chunks = self.rerank_chunks(chunks, query, k=self.num_k_rerank)
|
263 |
+
|
264 |
# Prepare context and sources from reranked chunks
|
265 |
contexts = []
|
266 |
sources = []
|
267 |
+
|
268 |
for chunk in reranked_chunks:
|
269 |
+
contexts.append(chunk["content"])
|
270 |
+
sources.append(
|
271 |
+
{
|
272 |
+
"content": chunk["content"],
|
273 |
+
"page": chunk["page"],
|
274 |
+
"chunk_id": chunk["chunk_id"],
|
275 |
+
"relevance_score": chunk["relevance_score"],
|
276 |
+
}
|
277 |
+
)
|
278 |
|
279 |
prompt_template = """
|
280 |
Based on the following context, provide multiple key points from the document.
|
|
|
285 |
|
286 |
Key points:
|
287 |
"""
|
288 |
+
|
289 |
+
prompt = PromptTemplate(template=prompt_template, input_variables=["context"])
|
290 |
+
|
|
|
|
|
|
|
291 |
llm = ChatOpenAI(
|
292 |
+
temperature=0, model_name="gpt-4o-mini", api_key=self.openai_api_key
|
|
|
|
|
293 |
)
|
294 |
+
|
295 |
response = llm.predict(prompt.format(context="\n\n".join(contexts)))
|
296 |
+
|
297 |
# Split the response into paragraphs
|
298 |
+
summaries = [p.strip() for p in response.split("\n\n") if p.strip()]
|
299 |
+
|
300 |
# Create structured output
|
301 |
structured_output = []
|
302 |
for idx, summary in enumerate(summaries):
|
303 |
# Associate each summary with the most relevant source
|
304 |
+
structured_output.append(
|
305 |
+
{
|
306 |
+
"content": summary,
|
307 |
+
"source": {
|
308 |
+
"page": sources[min(idx, len(sources) - 1)]["page"],
|
309 |
+
"text": sources[min(idx, len(sources) - 1)]["content"][:200]
|
310 |
+
+ "...",
|
311 |
+
"relevance_score": sources[min(idx, len(sources) - 1)][
|
312 |
+
"relevance_score"
|
313 |
+
],
|
314 |
+
},
|
315 |
}
|
316 |
+
)
|
317 |
+
|
318 |
return structured_output
|
319 |
|
320 |
+
def get_source_context(
|
321 |
+
self, chunk_id: str, window: int = 100
|
322 |
+
) -> Dict: # Esta função nunca está sendo utilizada
|
323 |
"""Get extended context around a specific chunk"""
|
324 |
metadata = self.chunk_metadata.get(chunk_id)
|
325 |
if not metadata:
|
326 |
return None
|
327 |
+
|
328 |
return {
|
329 |
+
"page": metadata["page"],
|
330 |
+
"start_char": metadata["start_char"],
|
331 |
+
"end_char": metadata["end_char"],
|
332 |
}
|
333 |
|
334 |
+
|
335 |
class ContextualRetriever:
|
336 |
+
|
337 |
+
def __init__(
|
338 |
+
self, config: RetrievalConfig, claude_api_key: str, claude_context_model
|
339 |
+
):
|
340 |
+
self.config = config # Este self.config no momento não está sendo utilizada para nada dentro desta classe. Analisar se deveria estar sendo utilizada.
|
341 |
self.claude_client = Anthropic(api_key=claude_api_key)
|
342 |
self.logger = logging.getLogger(__name__)
|
343 |
self.bm25 = None
|
344 |
self.claude_context_model = claude_context_model
|
345 |
+
|
346 |
def generate_context(self, full_text: str, chunk: DocumentChunk) -> str:
|
347 |
"""Generate contextual description using Claude"""
|
348 |
try:
|
349 |
+
# prompt = f"""<document>
|
350 |
+
# {full_text}
|
351 |
+
# </document>
|
352 |
+
# Here is the chunk we want to situate within the whole document
|
353 |
+
# <chunk>
|
354 |
+
# {chunk.content}
|
355 |
+
# </chunk>
|
356 |
+
# Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""
|
357 |
+
|
358 |
+
prompt = f"""You are a language model tasked with providing context to improve the retrieval of information from a chunk extracted from a document. Follow these steps internally (do not display reasoning or reflection in the final output):
|
359 |
+
1. **Chain of Thought (internal)**:
|
360 |
+
- Identify the document ID, which is the value between "NUM." and "- Pág".
|
361 |
+
- Identify the document name from the header.
|
362 |
+
2. **Reflection (internal)**:
|
363 |
+
- Confirm the document ID and name are correctly identified.
|
364 |
+
- Ensure the final context is concise and helpful.
|
365 |
+
3. **Final Response**:
|
366 |
+
- Provide a short context situating the *chunk* within the document, including the document ID and document name.
|
367 |
+
- Do not include any reasoning or reflection in your response.
|
368 |
+
**Example Usage:**
|
369 |
+
```
|
370 |
+
<document> {full_text} </document>
|
371 |
+
<chunk> {chunk.content} </chunk>
|
372 |
+
Please return only the succinct context (without displaying your internal reasoning), including the document ID and the document name.
|
373 |
+
```
|
374 |
+
"""
|
375 |
|
376 |
response = self.claude_client.messages.create(
|
377 |
model=self.claude_context_model,
|
378 |
max_tokens=100,
|
379 |
+
messages=[{"role": "user", "content": prompt}],
|
380 |
)
|
381 |
+
return response.content[
|
382 |
+
0
|
383 |
+
].text # O response.content é uma lista pois é passada uma lista de mensagens, e também retornado uma lista de mensagens, sendo a primeira a mais recente, que é a resposta do model
|
384 |
except Exception as e:
|
385 |
+
self.logger.error(
|
386 |
+
f"Context generation failed for chunk {chunk.chunk_id}: {str(e)}"
|
387 |
+
)
|
388 |
return ""
|
389 |
|
390 |
+
def contextualize_chunks(
|
391 |
+
self, full_text: List[Document], chunks: List[DocumentChunk]
|
392 |
+
) -> List[
|
393 |
+
ContextualizedChunk
|
394 |
+
]: # Pega um chunk e apenas adiciona uma propriedade de contexto a ela, sendo esta propriedade a resposta da função acima, que chama um Model do Claude para dizer o contexto de um chunk
|
395 |
"""Add context to all chunks"""
|
396 |
+
|
397 |
+
smaller_context = ""
|
398 |
contextualized_chunks = []
|
399 |
+
print("\n\n")
|
400 |
+
print("len(chunks): ", len(chunks))
|
401 |
for chunk in chunks:
|
402 |
+
contador_pagina = -1
|
403 |
+
while contador_pagina <= 1:
|
404 |
+
local_page = full_text[chunk.page_number + contador_pagina]
|
405 |
+
if local_page:
|
406 |
+
smaller_context += local_page.page_content
|
407 |
+
contador_pagina += 1
|
408 |
+
print("chunk.page_number: ", chunk.page_number)
|
409 |
+
context = self.generate_context(smaller_context, chunk)
|
410 |
contextualized_chunk = ContextualizedChunk(
|
411 |
content=chunk.content,
|
412 |
page_number=chunk.page_number,
|
413 |
chunk_id=chunk.chunk_id,
|
414 |
start_char=chunk.start_char,
|
415 |
end_char=chunk.end_char,
|
416 |
+
context=context,
|
417 |
)
|
418 |
contextualized_chunks.append(contextualized_chunk)
|
419 |
return contextualized_chunks
|
420 |
|
421 |
+
|
422 |
class EnhancedDocumentSummarizer(DocumentSummarizer):
|
423 |
+
|
424 |
+
def __init__(
|
425 |
+
self,
|
426 |
+
openai_api_key: str,
|
427 |
+
claude_api_key: str,
|
428 |
+
config: RetrievalConfig,
|
429 |
+
embedding_model,
|
430 |
+
chunk_size,
|
431 |
+
chunk_overlap,
|
432 |
+
num_k_rerank,
|
433 |
+
model_cohere_rerank,
|
434 |
+
claude_context_model,
|
435 |
+
prompt_relatorio,
|
436 |
+
gpt_model,
|
437 |
+
gpt_temperature,
|
438 |
+
id_modelo_do_usuario,
|
439 |
+
prompt_modelo,
|
440 |
+
):
|
441 |
+
super().__init__(
|
442 |
+
openai_api_key,
|
443 |
+
os.environ.get("COHERE_API_KEY"),
|
444 |
+
embedding_model,
|
445 |
+
chunk_size,
|
446 |
+
chunk_overlap,
|
447 |
+
num_k_rerank,
|
448 |
+
model_cohere_rerank,
|
449 |
+
)
|
450 |
self.config = config
|
451 |
+
self.contextual_retriever = ContextualRetriever(
|
452 |
+
config, claude_api_key, claude_context_model
|
453 |
+
)
|
454 |
self.logger = logging.getLogger(__name__)
|
455 |
self.prompt_relatorio = prompt_relatorio
|
456 |
self.gpt_model = gpt_model
|
|
|
458 |
self.id_modelo_do_usuario = id_modelo_do_usuario
|
459 |
self.prompt_modelo = prompt_modelo
|
460 |
|
461 |
+
def create_enhanced_vector_store(
|
462 |
+
self, chunks: List[ContextualizedChunk]
|
463 |
+
) -> Tuple[Chroma, BM25Okapi, List[str]]:
|
464 |
"""Create vector store and BM25 index with contextualized chunks"""
|
465 |
try:
|
466 |
# Prepare texts with context
|
467 |
texts = [f"{chunk.context} {chunk.content}" for chunk in chunks]
|
468 |
+
|
469 |
# Create vector store
|
470 |
+
metadatas = [
|
471 |
+
{
|
472 |
+
"chunk_id": chunk.chunk_id,
|
473 |
+
"page": chunk.page_number,
|
474 |
+
"start_char": chunk.start_char,
|
475 |
+
"end_char": chunk.end_char,
|
476 |
+
"context": chunk.context,
|
477 |
+
}
|
478 |
+
for chunk in chunks
|
479 |
+
]
|
480 |
+
|
481 |
vector_store = Chroma.from_texts(
|
482 |
+
texts=texts, metadatas=metadatas, embedding=self.embeddings
|
|
|
|
|
483 |
)
|
484 |
|
485 |
# Create BM25 index
|
|
|
496 |
raise
|
497 |
|
498 |
def retrieve_with_rank_fusion(
|
499 |
+
self, vector_store: Chroma, bm25: BM25Okapi, chunk_ids: List[str], query: str
|
|
|
|
|
|
|
|
|
500 |
) -> List[Dict]:
|
501 |
"""Combine embedding and BM25 retrieval results"""
|
502 |
try:
|
503 |
# Get embedding results
|
504 |
embedding_results = vector_store.similarity_search_with_score(
|
505 |
+
query, k=self.config.num_chunks
|
|
|
506 |
)
|
507 |
+
|
508 |
# Convert embedding results to list of (chunk_id, score)
|
509 |
embedding_list = [
|
510 |
+
(doc.metadata["chunk_id"], 1 / (1 + score))
|
511 |
for doc, score in embedding_results
|
512 |
]
|
513 |
+
|
514 |
# Get BM25 results
|
515 |
tokenized_query = query.split()
|
516 |
bm25_scores = bm25.get_scores(tokenized_query)
|
517 |
+
|
518 |
# Convert BM25 scores to list of (chunk_id, score)
|
519 |
bm25_list = [
|
520 |
(chunk_ids[i], float(score)) for i, score in enumerate(bm25_scores)
|
521 |
]
|
522 |
|
523 |
# Sort bm25_list by score in descending order and limit to top N results
|
524 |
+
bm25_list = sorted(bm25_list, key=lambda x: x[1], reverse=True)[
|
525 |
+
: self.config.num_chunks
|
526 |
+
]
|
527 |
|
528 |
# Normalize BM25 scores
|
529 |
max_bm25 = max([score for _, score in bm25_list]) if bm25_list else 1
|
|
|
532 |
# Pass the lists to rank fusion
|
533 |
result_lists = [embedding_list, bm25_list]
|
534 |
weights = [self.config.embedding_weight, self.config.bm25_weight]
|
535 |
+
|
536 |
+
combined_results = reciprocal_rank_fusion(result_lists, weights=weights)
|
537 |
+
|
|
|
|
|
|
|
538 |
return combined_results
|
539 |
|
540 |
except Exception as e:
|
|
|
542 |
raise
|
543 |
|
544 |
def generate_enhanced_summary(
|
545 |
+
self,
|
546 |
+
vector_store: Chroma,
|
547 |
bm25: BM25Okapi,
|
548 |
chunk_ids: List[str],
|
549 |
+
query: str = "Summarize the main points of this document",
|
550 |
) -> List[Dict]:
|
551 |
"""Generate enhanced summary using both vector and BM25 retrieval"""
|
552 |
try:
|
553 |
# Get combined results using rank fusion
|
554 |
ranked_results = self.retrieve_with_rank_fusion(
|
555 |
+
vector_store, bm25, chunk_ids, query
|
|
|
|
|
|
|
556 |
)
|
557 |
|
558 |
# Prepare context and track sources
|
559 |
contexts = []
|
560 |
sources = []
|
561 |
+
|
562 |
# Get full documents for top results
|
563 |
+
for chunk_id, score in ranked_results[: self.config.num_chunks]:
|
564 |
results = vector_store.get(
|
565 |
+
where={"chunk_id": chunk_id}, include=["documents", "metadatas"]
|
|
|
566 |
)
|
567 |
+
|
568 |
if results["documents"]:
|
569 |
context = results["documents"][0]
|
570 |
metadata = results["metadatas"][0]
|
571 |
+
|
572 |
contexts.append(context)
|
573 |
+
sources.append(
|
574 |
+
{
|
575 |
+
"content": context,
|
576 |
+
"page": metadata["page"],
|
577 |
+
"chunk_id": chunk_id,
|
578 |
+
"relevance_score": score,
|
579 |
+
"context": metadata.get("context", ""),
|
580 |
+
}
|
581 |
+
)
|
582 |
+
|
583 |
url_request = f"{api_url}/modelo/{self.id_modelo_do_usuario}"
|
584 |
resposta = requests.get(url_request)
|
585 |
+
|
586 |
+
if resposta.status_code != 200:
|
587 |
+
return Response(
|
588 |
+
{
|
589 |
+
"error": "Ocorreu um problema. Pode ser que o modelo não tenha sido encontrado. Tente novamente e/ou entre em contato com a equipe técnica"
|
590 |
+
}
|
591 |
+
)
|
592 |
+
|
593 |
modelo_buscado = resposta.json()["modelo"]
|
594 |
+
|
595 |
llm = ChatOpenAI(
|
596 |
temperature=self.gpt_temperature,
|
597 |
+
model_name=self.gpt_model,
|
598 |
api_key=self.openai_api_key,
|
|
|
599 |
)
|
600 |
+
|
601 |
prompt_gerar_relatorio = PromptTemplate(
|
602 |
+
template=self.prompt_relatorio, input_variables=["context"]
|
603 |
+
)
|
604 |
+
|
605 |
+
relatorio_gerado = llm.predict(
|
606 |
+
prompt_gerar_relatorio.format(context="\n\n".join(contexts))
|
607 |
)
|
|
|
|
|
608 |
|
609 |
prompt_gerar_modelo = PromptTemplate(
|
610 |
template=self.prompt_modelo,
|
611 |
+
input_variables=["context", "modelo_usuario"],
|
612 |
)
|
613 |
+
|
614 |
+
modelo_gerado = llm.predict(
|
615 |
+
prompt_gerar_modelo.format(
|
616 |
+
context=relatorio_gerado, modelo_usuario=modelo_buscado
|
617 |
+
)
|
618 |
+
)
|
619 |
+
|
620 |
# Split the response into paragraphs
|
621 |
+
summaries = [p.strip() for p in modelo_gerado.split("\n\n") if p.strip()]
|
622 |
+
|
623 |
# Create structured output
|
624 |
structured_output = []
|
625 |
for idx, summary in enumerate(summaries):
|
626 |
+
source_idx = min(idx, len(sources) - 1)
|
627 |
+
structured_output.append(
|
628 |
+
{
|
629 |
+
"content": summary,
|
630 |
+
"source": {
|
631 |
+
"page": sources[source_idx]["page"],
|
632 |
+
"text": sources[source_idx]["content"][:200] + "...",
|
633 |
+
"context": sources[source_idx]["context"],
|
634 |
+
"relevance_score": sources[source_idx]["relevance_score"],
|
635 |
+
"chunk_id": sources[source_idx]["chunk_id"],
|
636 |
+
},
|
637 |
}
|
638 |
+
)
|
639 |
+
|
640 |
return structured_output
|
641 |
|
642 |
except Exception as e:
|
643 |
self.logger.error(f"Error generating enhanced summary: {str(e)}")
|
644 |
raise
|
645 |
|
646 |
+
|
647 |
+
async def get_llm_summary_answer_by_cursor_complete(
|
648 |
+
serializer, listaPDFs=None, contexto=None
|
649 |
+
):
|
650 |
+
"""Parâmetro "contexto" só deve ser passado quando quiser utilizar o teste com ragas, e assim, não quiser passar PDFs"""
|
651 |
allPdfsChunks = []
|
652 |
|
653 |
# Configuration
|
|
|
656 |
embedding_weight=serializer["embedding_weight"],
|
657 |
bm25_weight=serializer["bm25_weight"],
|
658 |
context_window=serializer["context_window"],
|
659 |
+
chunk_overlap=serializer["chunk_overlap"],
|
660 |
)
|
661 |
|
662 |
# Initialize enhanced summarizer
|
663 |
summarizer = EnhancedDocumentSummarizer(
|
664 |
openai_api_key=os.environ.get("OPENAI_API_KEY"),
|
665 |
+
claude_api_key=os.environ.get("CLAUDE_API_KEY"),
|
666 |
config=config,
|
667 |
embedding_model=serializer["hf_embedding"],
|
668 |
chunk_overlap=serializer["chunk_overlap"],
|
|
|
674 |
gpt_model=serializer["model"],
|
675 |
gpt_temperature=serializer["gpt_temperature"],
|
676 |
id_modelo_do_usuario=serializer["id_modelo_do_usuario"],
|
677 |
+
prompt_modelo=serializer["prompt_modelo"],
|
678 |
)
|
679 |
|
680 |
+
full_text = ""
|
681 |
+
if contexto:
|
682 |
+
full_text = contexto
|
683 |
+
chunks = summarizer.load_and_split_text(full_text)
|
684 |
+
allPdfsChunks = chunks
|
685 |
+
else:
|
686 |
+
# # Load and process document
|
687 |
+
# pdf_path = "./Im_a_storyteller.pdf"
|
688 |
+
# chunks = summarizer.load_and_split_document(pdf_path)
|
689 |
+
|
690 |
+
# Load and process document
|
691 |
+
for pdf in listaPDFs:
|
692 |
+
pdf_path = pdf
|
693 |
+
chunks = summarizer.load_and_split_document(pdf_path)
|
694 |
+
allPdfsChunks = allPdfsChunks + chunks
|
695 |
+
|
696 |
+
# Get full text for contextualization
|
697 |
+
loader = PyPDFLoader(pdf_path)
|
698 |
+
pages = loader.load()
|
699 |
+
full_text = " ".join([page.page_content for page in pages])
|
700 |
|
701 |
# Contextualize chunks
|
702 |
+
contextualized_chunks = await summarizer.contextual_retriever.contextualize_chunks(
|
703 |
+
pages, allPdfsChunks
|
704 |
+
)
|
705 |
|
706 |
# Create enhanced vector store and BM25 index
|
707 |
+
vector_store, bm25, chunk_ids = summarizer.create_enhanced_vector_store(
|
708 |
+
contextualized_chunks
|
709 |
+
)
|
710 |
|
711 |
# Generate enhanced summary
|
712 |
structured_summaries = summarizer.generate_enhanced_summary(
|
713 |
+
vector_store, bm25, chunk_ids, serializer["user_message"]
|
|
|
|
|
|
|
714 |
)
|
715 |
|
716 |
# Output results as JSON
|
717 |
json_output = json.dumps(structured_summaries, indent=2)
|
718 |
print("\nStructured Summaries:")
|
719 |
print(json_output)
|
720 |
+
texto_completo = ""
|
721 |
+
for x in structured_summaries:
|
722 |
+
texto_completo = texto_completo + x["content"]
|
723 |
return {
|
724 |
+
"resultado": structured_summaries,
|
725 |
+
"texto_completo": texto_completo,
|
726 |
+
"parametros-utilizados": {
|
727 |
+
"num_chunks_retrieval": serializer["num_chunks_retrieval"],
|
728 |
+
"embedding_weight": serializer["embedding_weight"],
|
729 |
+
"bm25_weight": serializer["bm25_weight"],
|
730 |
+
"context_window": serializer["context_window"],
|
731 |
+
"chunk_overlap": serializer["chunk_overlap"],
|
732 |
+
"num_k_rerank": serializer["num_k_rerank"],
|
733 |
+
"model_cohere_rerank": serializer["model_cohere_rerank"],
|
734 |
+
"more_initial_chunks_for_reranking": serializer[
|
735 |
+
"more_initial_chunks_for_reranking"
|
736 |
+
],
|
737 |
+
"claude_context_model": serializer["claude_context_model"],
|
738 |
+
"gpt_temperature": serializer["gpt_temperature"],
|
739 |
+
"user_message": serializer["user_message"],
|
740 |
+
"model": serializer["model"],
|
741 |
+
"hf_embedding": serializer["hf_embedding"],
|
742 |
+
"chunk_size": serializer["chunk_size"],
|
743 |
+
"chunk_overlap": serializer["chunk_overlap"],
|
744 |
+
"prompt_relatorio": serializer["prompt_relatorio"],
|
745 |
+
"prompt_modelo": serializer["prompt_modelo"],
|
746 |
+
},
|
747 |
+
}
|
748 |
+
|
749 |
+
|
750 |
+
from ragas import evaluate
|
751 |
+
|
752 |
+
from langchain.chains import SequentialChain
|
753 |
+
from langchain.prompts import PromptTemplate
|
754 |
+
|
755 |
+
# from langchain.schema import ChainResult
|
756 |
+
from langchain.memory import SimpleMemory
|
757 |
+
|
758 |
+
|
759 |
+
def test_ragas(serializer, listaPDFs):
|
760 |
+
|
761 |
+
# Step 2: Setup RetrievalConfig and EnhancedDocumentSummarizer
|
762 |
+
config = RetrievalConfig(
|
763 |
+
num_chunks=serializer["num_chunks_retrieval"],
|
764 |
+
embedding_weight=serializer["embedding_weight"],
|
765 |
+
bm25_weight=serializer["bm25_weight"],
|
766 |
+
context_window=serializer["context_window"],
|
767 |
+
chunk_overlap=serializer["chunk_overlap"],
|
768 |
+
)
|
769 |
+
|
770 |
+
summarizer = EnhancedDocumentSummarizer(
|
771 |
+
openai_api_key=os.environ.get("OPENAI_API_KEY"),
|
772 |
+
claude_api_key=os.environ.get("CLAUDE_API_KEY"),
|
773 |
+
config=config,
|
774 |
+
embedding_model=serializer["hf_embedding"],
|
775 |
+
chunk_overlap=serializer["chunk_overlap"],
|
776 |
+
chunk_size=serializer["chunk_size"],
|
777 |
+
num_k_rerank=serializer["num_k_rerank"],
|
778 |
+
model_cohere_rerank=serializer["model_cohere_rerank"],
|
779 |
+
claude_context_model=serializer["claude_context_model"],
|
780 |
+
prompt_relatorio=serializer["prompt_relatorio"],
|
781 |
+
gpt_model=serializer["model"],
|
782 |
+
gpt_temperature=serializer["gpt_temperature"],
|
783 |
+
id_modelo_do_usuario=serializer["id_modelo_do_usuario"],
|
784 |
+
prompt_modelo=serializer["prompt_modelo"],
|
785 |
+
)
|
786 |
+
|
787 |
+
# Step 1: Define the components
|
788 |
+
def load_and_split_documents(pdf_list, summarizer):
|
789 |
+
"""Loads and splits PDF documents into chunks."""
|
790 |
+
all_chunks = []
|
791 |
+
for pdf_path in pdf_list:
|
792 |
+
chunks = summarizer.load_and_split_document(pdf_path)
|
793 |
+
all_chunks.extend(chunks)
|
794 |
+
return {"chunks": all_chunks}
|
795 |
+
|
796 |
+
def get_full_text_from_pdfs(pdf_list):
|
797 |
+
"""Gets the full text from PDFs for contextualization."""
|
798 |
+
full_text = []
|
799 |
+
for pdf_path in pdf_list:
|
800 |
+
loader = PyPDFLoader(pdf_path)
|
801 |
+
pages = loader.load()
|
802 |
+
text = " ".join([page.page_content for page in pages])
|
803 |
+
full_text.append(text)
|
804 |
+
return {"full_text": " ".join(full_text)}
|
805 |
+
|
806 |
+
def contextualize_chunks(full_text, chunks, contextual_retriever):
|
807 |
+
"""Adds context to chunks using Claude."""
|
808 |
+
contextualized_chunks = contextual_retriever.contextualize_chunks(
|
809 |
+
full_text, chunks
|
810 |
+
)
|
811 |
+
return {"contextualized_chunks": contextualized_chunks}
|
812 |
+
|
813 |
+
def create_vector_store(contextualized_chunks, summarizer):
|
814 |
+
"""Creates an enhanced vector store and BM25 index."""
|
815 |
+
vector_store, bm25, chunk_ids = summarizer.create_enhanced_vector_store(
|
816 |
+
contextualized_chunks
|
817 |
+
)
|
818 |
+
return {"vector_store": vector_store, "bm25": bm25, "chunk_ids": chunk_ids}
|
819 |
+
|
820 |
+
def generate_summary(vector_store, bm25, chunk_ids, query, summarizer):
|
821 |
+
"""Generates an enhanced summary using the vector store and BM25 index."""
|
822 |
+
structured_summaries = summarizer.generate_enhanced_summary(
|
823 |
+
vector_store, bm25, chunk_ids, query
|
824 |
+
)
|
825 |
+
return {"structured_summaries": structured_summaries}
|
826 |
+
|
827 |
+
# Step 3: Define Sequential Chain
|
828 |
+
chain = SequentialChain(
|
829 |
+
chains=[
|
830 |
+
lambda inputs: load_and_split_documents(inputs["pdf_list"], summarizer),
|
831 |
+
lambda inputs: get_full_text_from_pdfs(inputs["pdf_list"]),
|
832 |
+
lambda inputs: contextualize_chunks(
|
833 |
+
inputs["full_text"], inputs["chunks"], summarizer.contextual_retriever
|
834 |
+
),
|
835 |
+
lambda inputs: create_vector_store(
|
836 |
+
inputs["contextualized_chunks"], summarizer
|
837 |
+
),
|
838 |
+
lambda inputs: generate_summary(
|
839 |
+
inputs["vector_store"],
|
840 |
+
inputs["bm25"],
|
841 |
+
inputs["chunk_ids"],
|
842 |
+
inputs["user_message"],
|
843 |
+
summarizer,
|
844 |
+
),
|
845 |
+
],
|
846 |
+
input_variables=["pdf_list", "user_message"],
|
847 |
+
output_variables=["structured_summaries"],
|
848 |
+
)
|
+    from ragas.langchain.evalchain import RagasEvaluatorChain
+    from ragas.metrics import (
+        LLMContextRecall,
+        Faithfulness,
+        FactualCorrectness,
+        SemanticSimilarity,
+    )
+    from ragas import evaluate
+    from ragas.llms import LangchainLLMWrapper
+
+    # from ragas.embeddings import LangchainEmbeddingsWrapper
+    # evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
+    evaluator_llm = LangchainLLMWrapper(chain)
+    # evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
+    from datasets import load_dataset
+
+    dataset = load_dataset(
+        "explodinggradients/amnesty_qa", "english_v3", trust_remote_code=True
+    )
+
+    from ragas import EvaluationDataset
+
+    eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])
+
+    metrics = [
+        LLMContextRecall(llm=evaluator_llm),
+        FactualCorrectness(llm=evaluator_llm),
+        Faithfulness(llm=evaluator_llm),
+        # SemanticSimilarity(embeddings=evaluator_embeddings)
+    ]
+    results = evaluate(dataset=eval_dataset, metrics=metrics)
+    print("results: ", results)
+
+    # Step 4: Run the Chain
+    inputs = {
+        "pdf_list": listaPDFs,
         "user_message": serializer["user_message"],
+    }
+    # result = chain.run(inputs)
+    return Response({"msg": results})
+
+    # Step 5: Format the Output
+    # return {
+    #     "resultado": result["structured_summaries"],
+    #     "parametros-utilizados": {
+    #         "num_chunks_retrieval": serializer["num_chunks_retrieval"],
+    #         "embedding_weight": serializer["embedding_weight"],
+    #         "bm25_weight": serializer["bm25_weight"],
+    #         "context_window": serializer["context_window"],
+    #         "chunk_overlap": serializer["chunk_overlap"],
+    #         "num_k_rerank": serializer["num_k_rerank"],
+    #         "model_cohere_rerank": serializer["model_cohere_rerank"],
+    #         "more_initial_chunks_for_reranking": serializer["more_initial_chunks_for_reranking"],
+    #         "claude_context_model": serializer["claude_context_model"],
+    #         "gpt_temperature": serializer["gpt_temperature"],
+    #         "user_message": serializer["user_message"],
+    #         "model": serializer["model"],
+    #         "hf_embedding": serializer["hf_embedding"],
+    #         "chunk_size": serializer["chunk_size"],
+    #         "chunk_overlap": serializer["chunk_overlap"],
+    #         "prompt_relatorio": serializer["prompt_relatorio"],
+    #         "prompt_modelo": serializer["prompt_modelo"],
+    #     },
+    # }
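Review note: LangchainLLMWrapper expects a LangChain language model, so passing the SequentialChain above into it will most likely break as soon as a metric prompts the evaluator; the commented-out lines already point at the intended wiring. A sketch of that wiring (gpt-4o-mini is taken from the comment itself; the langchain_openai import path assumes that package is installed):

# Sketch only: evaluator LLM and embeddings backed by a chat model, as the
# commented-out lines suggest.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
# With embeddings available, the SemanticSimilarity metric commented out in
# the metrics list above can be re-enabled.

Separately, RagasEvaluatorChain is imported but never used, and ragas.langchain.evalchain appears to have been dropped in recent ragas releases, so that import alone can raise ImportError.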
resumos/serializer.py
CHANGED
@@ -29,31 +29,99 @@ class ResumoCursorSerializer(serializers.Serializer):
    chunk_overlap = serializers.IntegerField(required=False, default=200)

+    # system_prompt_relatorio = """
+    # Based on the following context, provide multiple key points from the document.
+    # For each point, create a new paragraph.
+    # Each paragraph should be a complete, self-contained insight.
+    # Include any relevant context provided.
+
+    # Context: {context}
+
+    # Key points:
+    # """
    system_prompt_relatorio = """
+    You are a language model specialized in producing concise and well-structured legal case summaries in Portuguese. You will receive a variable `context`, which contains information about a legal case. Your task is to read the `context` carefully and produce a summary report in Portuguese, following the specific format provided below. Do not include any additional comments or reasoning steps in your final answer.
+    **Instructions**:
+    1. **Chain of Thought**: Before producing your final answer, you must think through and plan your summary silently, without showing this reasoning in the final output. The final answer must only contain the required formatted report and nothing else.
+    2. **Reading the Context**: Extract the following information from `context`:
+       - The name of the defendant (réu).
+       - The crime they have been accused of (nome_do_crime).
+       - The applicable article and subsection of the Penal Code (artigo_e_inciso_do_crime).
+       - The date the accusation was accepted (data_do_recebimento).
+       - The ID of the decision document (id_do_documento).
+    3. **Prescriptive Details**: If no other interruptive or suspensive causes of prescription are mentioned, confirm that there are none.
+    4. **Formatting**: Your final answer must strictly follow the format below, in Portuguese, and replace the placeholders with the appropriate information:
+    <formato>
+    Trata-se de Ação Penal em que o Ministério Público denunciou [nome_do_reu], pela prática do [nome_do_crime] [artigo_e_inciso_do_crime], do Código Penal.
+    A denúncia foi recebida em [data_do_recebimento], conforme Decisão [id_do_documento].
+    Não há outras causas interruptivas ou suspensivas da prescrição.
+    </formato>
+    5. **Completeness**: If any piece of required information is missing in the `context`, note that explicitly in the final answer within the format.
+    **Reminder**:
+    - Do not include your chain of thought in the final output.
+    - Do not add extra information or commentary beyond the specified format.
+    - The final answer must be in Portuguese.
+    ---
+
+    **Contextual Information (provided separately):**
+    {context}
+    ---
+    **Example with a given context**:
+    - Input:
+      `context` = "Em 10/03/2021, o Ministério Público denunciou João da Silva, imputando-lhe o crime de furto qualificado, previsto no art. 155, §4º, inciso II, do Código Penal. A denúncia foi recebida em 12/03/2021, conforme Decisão nº 20210312-01. Não há menção a qualquer causa interruptiva ou suspensiva da prescrição."
+    - Expected final answer:
+    <formato>
+    Trata-se de Ação Penal em que o Ministério Público denunciou João da Silva, pela prática do furto qualificado (art. 155, §4º, inciso II do Código Penal).
+    A denúncia foi recebida em 12/03/2021, conforme Decisão 20210312-01.
+    Não há outras causas interruptivas ou suspensivas da prescrição.
+    </formato>
+    """

    user_message = "What are the main points of this document?"

+    # system_prompt_modelo = """
+    # Based on the following context, provide multiple key points from the document.
+    # For each point, create a new paragraph.
+    # Each paragraph should be a complete, self-contained insight.
+    # Include any relevant context provided.
+
+    # Context: {context}
+
+    # Modelo do usuário: {modelo_usuario}
+
+    # Key points:
+    # """
    system_prompt_modelo = """
+    You are a large language model that must produce a single final sentence in **Portuguese**. To do this, you will follow a private chain of thought and then produce a final answer. The final answer must follow the formatting and stylistic conventions shown in the user-provided model `user's template`. The information to be included in the final sentence is derived from the `context` (a report describing a legal case).
+    **Contextual Information (provided separately):**
+    {context}
+    **User Model (provided separately):**
+    {modelo_do_usuario}
+    **Instructions:**
+    1. **Goal:** Produce one single final sentence in Portuguese that matches the structure, format, and style given by `user's template`.
+    2. **Chain of Thought (private to the assistant and not to be shown in the final answer):**
+       - Carefully review the `context`, which is a legal report of a case.
+       - Identify:
+         - The defendant's name.
+         - The crime's name, its article, and any subsection (inciso).
+         - The date of receipt of the complaint (data do recebimento da denúncia).
+         - The document ID.
+       - Ensure these elements are correctly incorporated into the final sentence.
+       - Check compliance with the formatting style indicated by `user's template`.
+       - Compose the sentence following the structure from the user model.
+       - Use reflection: Before finalizing the answer, reassess if all required information is included, if the format matches the user model, and if the sentence is written correctly in Portuguese.
+    3. **Reflection Technique (private):**
+       After composing the sentence, but before presenting it as the final answer, reflect on whether:
+       - All required details from the `context` are accurately included.
+       - The sentence format strictly matches the pattern of `user's template`.
+       - The sentence is grammatically correct in Portuguese.
+    4. **Final Answer:**
+       - After completing the chain of thought and ensuring correctness through reflection, present only the final sentence in Portuguese.
+       - Do not show the chain of thought or the reflection step. Only the final formatted sentence should be visible to the user.
+    """
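Review note: the rewritten template expects the placeholders {context} and {modelo_do_usuario}, while the old commented-out version used {modelo_usuario}, so whatever fills the template has to use the new name. A sketch of the substitution, assuming plain str.format runs downstream (the sample values are invented):

# Hypothetical fill of the new template; only the placeholder names matter.
prompt = system_prompt_modelo.format(
    context="Trata-se de Ação Penal em que ...",
    modelo_do_usuario="Vistos etc. ...",
)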
class ResumoCursorCompeltoSerializer(ResumoCursorSerializer):
    system_prompt = None
    prompt_relatorio = serializers.CharField(required=False, default=system_prompt_relatorio)
@@ -69,4 +137,17 @@ class ResumoCursorCompeltoSerializer(ResumoCursorSerializer):
    more_initial_chunks_for_reranking = serializers.IntegerField(default=20)
    claude_context_model = serializers.CharField(required=False, default="claude-3-haiku-20240307")
    gpt_temperature = serializers.FloatField(default=0)
+    id_modelo_do_usuario = serializers.IntegerField(required=True)
+
+
+class RagasSerializer(ResumoCursorCompeltoSerializer):
+    files = serializers.ListField(child=serializers.FileField(), required=True)
+    id_modelo_do_usuario = serializers.IntegerField(required=False)
+    hf_embedding = serializers.CharField(required=False, default="all-MiniLM-L6-v2")
+
+
+class RagasFromTextSerializer(ResumoCursorCompeltoSerializer):
+    files = None
+    id_modelo_do_usuario = serializers.IntegerField(required=False, default=9)
+    user_message = serializers.CharField(required=True)
+    context_provided = serializers.CharField(required=False)
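For reference, a quick sketch of how the new serializers resolve their defaults (the payload is invented for illustration):

serializer = RagasFromTextSerializer(data={"user_message": "Quais os pontos principais?"})
serializer.is_valid(raise_exception=True)
data = serializer.validated_data
assert data["id_modelo_do_usuario"] == 9   # default declared above
assert "prompt_relatorio" in data          # inherited from ResumoCursorCompeltoSerializer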
resumos/views.py
CHANGED
@@ -1,11 +1,21 @@
 from rest_framework.views import APIView
+from adrf.views import APIView as AsyncAPIView
 import tempfile, os
 from rest_framework.response import Response

+from _utils.resumo_completo_cursor import (
+    get_llm_summary_answer_by_cursor_complete,
+    test_ragas,
+)
 from _utils.resumo_simples_cursor import get_llm_summary_answer_by_cursor
 from _utils.utils import DEFAULT_SYSTEM_PROMPT
+from .serializer import (
+    RagasFromTextSerializer,
+    ResumoCursorCompeltoSerializer,
+    ResumoPDFSerializer,
+    ResumoCursorSerializer,
+    RagasSerializer,
+)
 from _utils.main import get_llm_answer_summary, get_llm_answer_summary_with_embedding
 from setup.environment import default_model
 from rest_framework.parsers import MultiPartParser
@@ -13,125 +23,292 @@ from drf_spectacular.utils import extend_schema


 class ResumoView(APIView):
+    parser_classes = [MultiPartParser]
+
+    @extend_schema(
+        request=ResumoPDFSerializer,
+    )
+    def post(self, request):
+        serializer = ResumoPDFSerializer(data=request.data)
+        if serializer.is_valid(raise_exception=True):
+            listaPDFs = []
+            data = serializer.validated_data
+            model = data.get("model", default_model)
+            print("serializer.validated_data: ", serializer.validated_data)
+
+            for file in data["files"]:
+                print("file: ", file)
+                file.seek(0)
+                # Save the uploaded PDF to a temporary file on disk
+                with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
+                    for chunk in file.chunks():
+                        temp_file.write(chunk)
+                    listaPDFs.append(temp_file.name)
+            # print("listaPDFs: ", listaPDFs)
+
+            system_prompt = data.get("system_prompt", DEFAULT_SYSTEM_PROMPT)
+            resposta_llm = get_llm_answer_summary(
+                system_prompt,
+                data["user_message"],
+                listaPDFs,
+                model=model,
+                isIterativeRefinement=data["iterative_refinement"],
+            )
+
+            for file in listaPDFs:
+                os.remove(file)
+
+            return Response({"resposta": resposta_llm})


 class ResumoEmbeddingView(APIView):
+    parser_classes = [MultiPartParser]
+
+    @extend_schema(
+        request=ResumoPDFSerializer,
+    )
+    def post(self, request):
+        serializer = ResumoPDFSerializer(data=request.data)
+        if serializer.is_valid(raise_exception=True):
+            listaPDFs = []
+            data = serializer.validated_data
+            model = data.get("model", default_model)
+            print("serializer.validated_data: ", serializer.validated_data)
+
+            for file in data["files"]:
+                file.seek(0)
+                with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
+                    for chunk in file.chunks():
+                        temp_file.write(chunk)
+                    listaPDFs.append(temp_file.name)
+            print("listaPDFs: ", listaPDFs)
+
+            system_prompt = data.get("system_prompt", DEFAULT_SYSTEM_PROMPT)
+            resposta_llm = get_llm_answer_summary_with_embedding(
+                system_prompt,
+                data["user_message"],
+                listaPDFs,
+                model=model,
+                isIterativeRefinement=data["iterative_refinement"],
+            )
+
+            for file in listaPDFs:
+                os.remove(file)
+
+            return Response({"resposta": resposta_llm})


 class ResumoSimplesCursorView(APIView):
+    parser_classes = [MultiPartParser]
+
+    @extend_schema(
+        request=ResumoCursorSerializer,
+    )
+    def post(self, request):
+        serializer = ResumoCursorSerializer(data=request.data)
+        if serializer.is_valid(raise_exception=True):
+            listaPDFs = []
+            data = serializer.validated_data
+            print("\nserializer.validated_data: ", serializer.validated_data)
+
+            for file in data["files"]:
+                file.seek(0)
+                with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
+                    for chunk in file.chunks():
+                        temp_file.write(chunk)
+                    listaPDFs.append(temp_file.name)
+            print("listaPDFs: ", listaPDFs)
+
+            resposta_llm = get_llm_summary_answer_by_cursor(data, listaPDFs)
+
+            for file in listaPDFs:
+                os.remove(file)
+
+            return Response({"resposta": resposta_llm})
+
+
+class ResumoSimplesCursorCompletoView(AsyncAPIView):
+    parser_classes = [MultiPartParser]
+
+    @extend_schema(
+        request=ResumoCursorCompeltoSerializer,
+    )
+    async def post(self, request):
+        serializer = ResumoCursorCompeltoSerializer(data=request.data)
+        if serializer.is_valid(raise_exception=True):
+            print("serializer.validated_data: ", serializer.validated_data)
+            listaPDFs = []
+            data = serializer.validated_data
+
+            for file in data["files"]:
+                file.seek(0)
+                with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
+                    for chunk in file.chunks():
+                        temp_file.write(chunk)
+                    listaPDFs.append(temp_file.name)
+            print("listaPDFs: ", listaPDFs)
+
+            resposta_llm = await get_llm_summary_answer_by_cursor_complete(
+                data, listaPDFs
+            )
+
+            final = resposta_llm
+            print("final: ", final)
+
+            for file in listaPDFs:
+                os.remove(file)
+
+            return Response({"resposta": final})
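Review note: the save-upload-to-temporary-PDF loop is now repeated in four views; a small helper would remove the duplication. A sketch (the helper name is mine, not part of the commit):

import tempfile
from typing import List

def save_uploads_to_temp_pdfs(files) -> List[str]:
    """Write each uploaded file to a temporary .pdf and return the paths."""
    paths = []
    for file in files:
        file.seek(0)
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
            for chunk in file.chunks():
                temp_file.write(chunk)
            paths.append(temp_file.name)
    return paths

# Usage inside any of the views above:
# listaPDFs = save_uploads_to_temp_pdfs(serializer.validated_data["files"])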
+class RagasView(APIView):
+    parser_classes = [MultiPartParser]
+
+    @extend_schema(
+        request=RagasSerializer,
+    )
+    def post(self, request):
+        serializer = RagasSerializer(data=request.data)
+        print("serializer: ", serializer)
+        listaPDFs = []
+        if serializer.is_valid(raise_exception=True):
+            for file in serializer.validated_data["files"]:
+                file.seek(0)
+                with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
+                    for chunk in file.chunks():
+                        temp_file.write(chunk)
+                    listaPDFs.append(temp_file.name)
+
+            result = test_ragas(serializer, listaPDFs)
+
+            for file in listaPDFs:
+                os.remove(file)
+
+            return Response({"msg": result})
+
+
+class RagasFromTextView(APIView):
+    def post(self, request):
+        serializer = RagasFromTextSerializer(data=request.data)
+        if serializer.is_valid(raise_exception=True):
+            import pandas as pd
+            from datasets import Dataset
+            from ragas import evaluate
+            from ragas.metrics import (
+                faithfulness,
+                answer_relevancy,
+                answer_correctness,
+                context_precision,
+                context_recall,
+            )
+
+            os.environ.get("OPENAI_API_KEY")
+
+            df_pandas = pd.read_csv(
+                "D:/repositorios/projetos-pessoais/projeto-y-backend-hugginf-face-teste-01/vella-backend/_utils/files/ragas_testset.csv"
+            )
+            # print(df_pandas["position"])  # Print a specific column
+
+            data = {
+                "user_input": [
+                    "What is the capital of France?",
+                ],
+                "response": [],
+                "retrieved_contexts": [],
+            }
+            reference = [
+                "Paris is the capital of France. It is a major European city known for its culture."
+            ]
+
+            for x in df_pandas["user_input"]:
+                data["user_input"].append(x)
+            for x in df_pandas["reference"]:
+                reference.append(x)
+
+            print("reference: ", reference)
+
+            for i in range(len(reference)):
+                serializer.validated_data["user_message"] = data["user_input"][i]
+                resposta_llm = get_llm_summary_answer_by_cursor_complete(
+                    serializer.validated_data, contexto=reference[i]
+                )
+                data["response"].append(resposta_llm["texto_completo"])
+                lista_reference_contexts = []
+                for x in resposta_llm["resultado"]:
+                    lista_reference_contexts.append(x["source"]["text"])
+                data["retrieved_contexts"].append(lista_reference_contexts)
+
+            # Convert the collected rows to a Hugging Face Dataset
+            dataset = Dataset.from_dict(data)
+
+            # Define the metrics you want to evaluate
+            metrics = [
+                faithfulness,
+                # answer_relevancy,
+                # answer_correctness,
+                # context_precision,
+                # context_recall,
+            ]
+
+            # Evaluate the dataset using the selected metrics
+            results = evaluate(dataset, metrics)
+
+            # results.to_pandas().to_csv("./result.csv")
+            return Response({"resposta": results.to_pandas().to_string()})
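One portability note on RagasFromTextView: the CSV is read from an absolute Windows path, so the endpoint only runs on that one machine. A sketch of a path derived from the project root instead (assuming the file stays under _utils/files/ as in this commit):

import os
from django.conf import settings

# Hypothetical replacement for the hard-coded path above.
csv_path = os.path.join(settings.BASE_DIR, "_utils", "files", "ragas_testset.csv")
df_pandas = pd.read_csv(csv_path)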
setup/settings.py
CHANGED
@@ -22,7 +22,6 @@ https://docs.djangoproject.com/en/4.2/ref/settings/
 BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

-
 # Quick-start development settings - unsuitable for production
 # See https://docs.djangoproject.com/en/4.2/howto/deployment/checklist/

@@ -45,12 +44,13 @@ INSTALLED_APPS = [
     "django.contrib.messages",
     "django.contrib.staticfiles",
     "rest_framework",
+    "adrf",
     "django_filters",
     "corsheaders",
     "pdfs",
     "drf_spectacular",
     "resumos",
-    "modelos_usuarios"
+    "modelos_usuarios",
 ]

 MIDDLEWARE = [
@@ -176,4 +176,4 @@ SPECTACULAR_SETTINGS = {
     'SERVE_INCLUDE_SCHEMA': False,
     'COMPONENT_SPLIT_REQUEST': True
     # OTHER SETTINGS
-}
+}
setup/urls.py
CHANGED
@@ -5,22 +5,36 @@ from drf_spectacular.views import SpectacularSwaggerView, SpectacularAPIView


 from pdfs.views import getPDF
-from resumos.views import
+from resumos.views import (
+    RagasFromTextView,
+    RagasView,
+    ResumoView,
+    ResumoSimplesCursorView,
+    ResumoSimplesCursorCompletoView,
+)
 from modelos_usuarios.views import ListCreateModeloUsuarioView, CreateUpdateDeleteModeloUsuarioView, ListModelosPorUsuarioView

 router = routers.DefaultRouter()
 # router.register("endpoint-teste", EndpointTesteViewSet, basename="Basename do endpoint-teste")

 urlpatterns = [
+    path("api/schema/", SpectacularAPIView.as_view(), name="schema"),
+    path(
+        "swagger/", SpectacularSwaggerView.as_view(url_name="schema"), name="swagger-ui"
+    ),
     path("admin/", admin.site.urls),
+    path("", include(router.urls)),
+    path("pdf", getPDF, name="upload-pdf"),
+    path("resumo", ResumoView.as_view(), name="summary-pdf"),
+    path("resumo/cursor", ResumoSimplesCursorView.as_view(), name="summary-cursor-pdf"),
+    path(
+        "resumo/cursor-completo",
+        ResumoSimplesCursorCompletoView.as_view(),
+        name="summary-cursor-completo-pdf",
+    ),
     path("modelo", ListCreateModeloUsuarioView.as_view()),
     path("modelo/<int:pk>", CreateUpdateDeleteModeloUsuarioView.as_view()),
-    path("usuario/<int:user_id>/modelos", ListModelosPorUsuarioView.as_view())
+    path("usuario/<int:user_id>/modelos", ListModelosPorUsuarioView.as_view()),
+    path("ragas", RagasView.as_view()),
+    path("ragas/no-pdf", RagasFromTextView.as_view()),
 ]
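For a quick smoke test of the new routes once the server is up (host, port, and field values are placeholders, not part of the commit):

import requests

resp = requests.post(
    "http://localhost:8000/ragas/no-pdf",
    data={"user_message": "Quais os pontos principais?"},
)
print(resp.json())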