luanpoppe committed
Commit c625f4c · 1 Parent(s): e725020

feat: trying to add ragas

.vscode/launch.json ADDED
@@ -0,0 +1,19 @@
+{
+    // Use IntelliSense to learn about possible attributes.
+    // Hover to view descriptions of existing attributes.
+    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
+    "version": "0.2.0",
+    "configurations": [
+        {
+            "name": "Python Debugger: Django",
+            "type": "debugpy",
+            "request": "launch",
+            "args": [
+                "runserver"
+            ],
+            "django": true,
+            "autoStartBrowser": false,
+            "program": "${workspaceFolder}\\manage.py"
+        }
+    ]
+}
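For reference, the `"type": "debugpy"` entry above launches `manage.py runserver` under the debugpy adapter. A minimal attach-style alternative, sketched under the assumption that the standalone `debugpy` package is installed (the port number below is an arbitrary example, not something this commit configures):

```python
# Hypothetical attach-style counterpart to the launch configuration above:
# start a debug server inside manage.py, then connect VS Code with a
# "request": "attach" configuration instead of "launch".
import debugpy

debugpy.listen(("localhost", 5678))  # arbitrary example port
print("Waiting for the debugger to attach...")
debugpy.wait_for_client()  # blocks until a client such as VS Code connects
```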
_utils/files/ragas_testset.csv ADDED
@@ -0,0 +1,18 @@
+position,user_input,reference,response,retrieved_contexts
+0,What are the global implications of the USA Supreme Court ruling on abortion?,"The global implications of the USA Supreme Court ruling on abortion are significant. The ruling has led to limited or no access to abortion for one in three women and girls of reproductive age in states where abortion access is restricted. These states also have weaker maternal health support, higher maternal death rates, and higher child poverty rates. Additionally, the ruling has had an impact beyond national borders due to the USA's geopolitical and cultural influence globally. Organizations and activists worldwide are concerned that the ruling may inspire anti-abortion legislative and policy attacks in other countries. The ruling has also hindered progressive law reform and the implementation of abortion guidelines in certain African countries. Furthermore, the ruling has created a chilling effect in international policy spaces, empowering anti-abortion actors to undermine human rights protections.","The global implications of the USA Supreme Court ruling on abortion can be significant, as it sets a precedent for other countries and influences the global discourse on reproductive rights. Here are some potential implications:
+
+1. Influence on other countries: The Supreme Court's ruling can serve as a reference point for other countries grappling with their own abortion laws. It can provide legal arguments and reasoning that advocates for reproductive rights can use to challenge restrictive abortion laws in their respective jurisdictions.
+
+2. Strengthening of global reproductive rights movements: A favorable ruling by the Supreme Court can energize and empower reproductive rights movements worldwide. It can serve as a rallying point for activists and organizations advocating for women's rights, leading to increased mobilization and advocacy efforts globally.
+
+3. Counteracting anti-abortion movements: Conversely, a ruling that restricts abortion rights can embolden anti-abortion movements globally. It can provide legitimacy to their arguments and encourage similar restrictive measures in other countries, potentially leading to a rollback of existing reproductive rights.
+
+4. Impact on international aid and policies: The Supreme Court's ruling can influence international aid and policies related to reproductive health. It can shape the priorities and funding decisions of donor countries and organizations, potentially leading to increased support for reproductive rights initiatives or conversely, restrictions on funding for abortion-related services.
+
+5. Shaping international human rights standards: The ruling can contribute to the development of international human rights standards regarding reproductive rights. It can influence the interpretation and application of existing human rights treaties and conventions, potentially strengthening the recognition of reproductive rights as fundamental human rights globally.
+
+6. Global health implications: The Supreme Court's ruling can have implications for global health outcomes, particularly in countries with restrictive abortion laws. It can impact the availability and accessibility of safe and legal abortion services, potentially leading to an increase in unsafe abortions and related health complications.
+
+It is important to note that the specific implications will depend on the nature of the Supreme Court ruling and the subsequent actions taken by governments, activists, and organizations both within and outside the United States.","[""- In 2022, the USA Supreme Court handed down a decision ruling that overturned 50 years of jurisprudence recognizing a constitutional right to abortion.\n- This decision has had a massive impact: one in three women and girls of reproductive age now live in states where abortion access is either totally or near-totally inaccessible.\n- The states with the most restrictive abortion laws have the weakest maternal health support, higher maternal death rates, and higher child poverty rates.\n- The USA Supreme Court ruling has also had impacts beyond national borders due to the geopolitical and cultural influence wielded by the USA globally and the aid it funds.\n- SRR organizations and activists across the world have expressed fear about the ruling laying the groundwork for anti-abortion legislative and policy attacks in other countries.\n- Advocates have also observed the ruling's impact on progressive law reform and the stalling of the adoption and enforcement of abortion guidelines in certain African countries.\n- The ruling has created a chilling effect in international policy spaces, emboldening anti-abortion state and non-state actors to undermine human rights protections.""
+'The USA Supreme Court ruling on abortion has sparked intense debates and discussions not only within the country but also around the world. Many countries look to the United States as a leader in legal and social issues, so the decision could potentially influence the policies and attitudes towards abortion in other nations.'
+""The ruling may also impact international organizations and non-governmental groups that work on reproductive rights and women's health issues. Depending on the outcome, there could be shifts in funding, advocacy efforts, and collaborations with American counterparts, leading to ripple effects in the global fight for reproductive justice.""]"
_utils/resumo_completo_cursor.py CHANGED
@@ -1,4 +1,3 @@
-
 import os
 from typing import List, Dict, Tuple, Optional
 from langchain.text_splitter import RecursiveCharacterTextSplitter
@@ -19,6 +18,9 @@ from cohere import Client
 import requests
 from setup.environment import api_url
 from rest_framework.response import Response

 def reciprocal_rank_fusion(result_lists, weights=None):
     """Combine multiple ranked lists using reciprocal rank fusion"""
@@ -34,18 +36,16 @@ def reciprocal_rank_fusion(result_lists, weights=None):
         fused_scores[doc_id] += weights[i] * score

     # Sort by score in descending order
-    sorted_results = sorted(
-        fused_scores.items(),
-        key=lambda x: x[1],
-        reverse=True
-    )
-
     return sorted_results

-os.environ["LANGCHAIN_TRACING_V2"]="true"
-os.environ["LANGCHAIN_ENDPOINT"]="https://api.smith.langchain.com"
 os.environ.get("LANGCHAIN_API_KEY")
-os.environ["LANGCHAIN_PROJECT"]="VELLA"

 @dataclass
 class DocumentChunk:
@@ -55,6 +55,7 @@ class DocumentChunk:
     start_char: int
     end_char: int

 @dataclass
 class RetrievalConfig:
     num_chunks: int = 5
@@ -64,158 +65,216 @@ class RetrievalConfig:
     chunk_overlap: int = 200
     chunk_size: int = 1000

 @dataclass
 class ContextualizedChunk(DocumentChunk):
     context: str = ""
     embedding: Optional[np.ndarray] = None
     bm25_score: Optional[float] = None

 class DocumentSummarizer:
-    def __init__(self, openai_api_key: str, cohere_api_key: str, embedding_model, chunk_size, chunk_overlap, num_k_rerank, model_cohere_rerank):
         self.openai_api_key = openai_api_key
         self.cohere_client = Client(cohere_api_key)
-        self.embeddings = HuggingFaceEmbeddings(
-            model_name=embedding_model
-        )
         self.text_splitter = RecursiveCharacterTextSplitter(
-            chunk_size=chunk_size,
-            chunk_overlap=chunk_overlap
         )
         self.chunk_metadata = {}  # Store chunk metadata for tracing
         self.num_k_rerank = num_k_rerank
         self.model_cohere_rerank = model_cohere_rerank
-
     def load_and_split_document(self, pdf_path: str) -> List[DocumentChunk]:
         """Load PDF and split into chunks with metadata"""
         loader = PyPDFLoader(pdf_path)
-        pages = loader.load()  # Produces a list of Document objects, each item corresponding to ONE full PAGE of the PDF.
         chunks = []
         char_count = 0
-
         for page in pages:
             text = page.page_content
-            page_chunks = self.text_splitter.split_text(text)  # Breaks the single-page Document into a list where each item is a chunk, i.e. a piece smaller than one page.
-
             for chunk in page_chunks:
                 chunk_id = str(uuid.uuid4())
-                start_char = text.find(chunk)  # Returns the position of the chunk within the full page
                 end_char = start_char + len(chunk)
-
-                doc_chunk = DocumentChunk(  # Builds the chunk object with extra information, such as the chunk's position and id
                     content=chunk,
-                    page_number=page.metadata.get('page') + 1,  # 1-based page numbering
                     chunk_id=chunk_id,
                     start_char=char_count + start_char,
-                    end_char=char_count + end_char
                 )
                 chunks.append(doc_chunk)
-
                 # Store metadata for later retrieval
                 self.chunk_metadata[chunk_id] = {
-                    'page': doc_chunk.page_number,
-                    'start_char': doc_chunk.start_char,
-                    'end_char': doc_chunk.end_char
                 }
-
             char_count += len(text)
-
         return chunks

-    def create_vector_store(self, chunks: List[DocumentChunk]) -> Chroma:  # This function is never used
         """Create vector store with metadata"""
         texts = [chunk.content for chunk in chunks]
-        metadatas = [{
-            'chunk_id': chunk.chunk_id,
-            'page': chunk.page_number,
-            'start_char': chunk.start_char,
-            'end_char': chunk.end_char
-        } for chunk in chunks]
-
         vector_store = Chroma.from_texts(
-            texts=texts,
-            metadatas=metadatas,
-            embedding=self.embeddings
         )
         return vector_store

-    def rerank_chunks(  # This function is never used
-        self,
-        chunks: List[Dict],
-        query: str,
-        k: int = 5
     ) -> List[Dict]:
         """
         Rerank chunks using Cohere's reranking model.
-
         Args:
             chunks: List of dictionaries containing chunks and their metadata
             query: Original search query
             k: Number of top chunks to return
-
         Returns:
             List of reranked chunks with updated relevance scores
         """
         try:
             # Prepare documents for reranking
-            documents = [chunk['content'] for chunk in chunks]
-
             # Get reranking scores from Cohere
             results = self.cohere_client.rerank(
                 query=query,
                 documents=documents,
                 top_n=k,
-                model=self.model_cohere_rerank
             )
-
             # Create reranked results with original metadata
             reranked_chunks = []
             for hit in results:
                 original_chunk = chunks[hit.index]
-                reranked_chunks.append({
-                    **original_chunk,
-                    'relevance_score': hit.relevance_score
-                })
-
             return reranked_chunks
-
         except Exception as e:
             logging.error(f"Reranking failed: {str(e)}")
             return chunks[:k]  # Fallback to original ordering

-    def generate_summary_with_sources(  # This function is never used
-        self,
-        vector_store: Chroma,
-        query: str = "Summarize the main points of this document"
     ) -> List[Dict]:
         """Generate summary with source citations using reranking"""
         # Retrieve more initial chunks for reranking
         relevant_docs = vector_store.similarity_search_with_score(query, k=20)
-
         # Prepare chunks for reranking
         chunks = []
         for doc, score in relevant_docs:
-            chunks.append({
-                'content': doc.page_content,
-                'page': doc.metadata['page'],
-                'chunk_id': doc.metadata['chunk_id'],
-                'relevance_score': score
-            })
-
         # Rerank chunks
         reranked_chunks = self.rerank_chunks(chunks, query, k=self.num_k_rerank)
-
         # Prepare context and sources from reranked chunks
         contexts = []
         sources = []
-
         for chunk in reranked_chunks:
-            contexts.append(chunk['content'])
-            sources.append({
-                'content': chunk['content'],
-                'page': chunk['page'],
-                'chunk_id': chunk['chunk_id'],
-                'relevance_score': chunk['relevance_score']
-            })

         prompt_template = """
         Based on the following context, provide multiple key points from the document.
@@ -226,101 +285,172 @@ class DocumentSummarizer:

         Key points:
         """
-
-        prompt = PromptTemplate(
-            template=prompt_template,
-            input_variables=["context"]
-        )
-
         llm = ChatOpenAI(
-            temperature=0,
-            model_name="gpt-4o-mini",
-            api_key=self.openai_api_key
         )
-
         response = llm.predict(prompt.format(context="\n\n".join(contexts)))
-
         # Split the response into paragraphs
-        summaries = [p.strip() for p in response.split('\n\n') if p.strip()]
-
         # Create structured output
         structured_output = []
         for idx, summary in enumerate(summaries):
             # Associate each summary with the most relevant source
-            structured_output.append({
-                "content": summary,
-                "source": {
-                    "page": sources[min(idx, len(sources)-1)]['page'],
-                    "text": sources[min(idx, len(sources)-1)]['content'][:200] + "...",
-                    "relevance_score": sources[min(idx, len(sources)-1)]['relevance_score']
                 }
-            })
-
         return structured_output

-    def get_source_context(self, chunk_id: str, window: int = 100) -> Dict:  # This function is never used
         """Get extended context around a specific chunk"""
         metadata = self.chunk_metadata.get(chunk_id)
         if not metadata:
             return None
-
         return {
-            'page': metadata['page'],
-            'start_char': metadata['start_char'],
-            'end_char': metadata['end_char']
         }

 class ContextualRetriever:
-    def __init__(self, config: RetrievalConfig, claude_api_key: str, claude_context_model):
-        self.config = config  # This self.config is currently not used for anything inside this class. Review whether it should be.
         self.claude_client = Anthropic(api_key=claude_api_key)
         self.logger = logging.getLogger(__name__)
         self.bm25 = None
         self.claude_context_model = claude_context_model
-
     def generate_context(self, full_text: str, chunk: DocumentChunk) -> str:
         """Generate contextual description using Claude"""
         try:
-            prompt = f"""<document>
-            {full_text}
-            </document>
-            Here is the chunk we want to situate within the whole document
-            <chunk>
-            {chunk.content}
-            </chunk>
-            Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

             response = self.claude_client.messages.create(
                 model=self.claude_context_model,
                 max_tokens=100,
-                messages=[{"role": "user", "content": prompt}]
             )
-            return response.content[0].text  # response.content is a list because a list of messages is passed in and a list of messages is returned, the first item being the most recent, i.e. the model's reply
         except Exception as e:
-            self.logger.error(f"Context generation failed for chunk {chunk.chunk_id}: {str(e)}")
             return ""

-    def contextualize_chunks(self, full_text: str, chunks: List[DocumentChunk]) -> List[ContextualizedChunk]:  # Takes a chunk and simply adds a context property to it, that property being the reply of the function above, which calls a Claude model to state the chunk's context
         """Add context to all chunks"""
         contextualized_chunks = []
         for chunk in chunks:
-            context = self.generate_context(full_text, chunk)
             contextualized_chunk = ContextualizedChunk(
                 content=chunk.content,
                 page_number=chunk.page_number,
                 chunk_id=chunk.chunk_id,
                 start_char=chunk.start_char,
                 end_char=chunk.end_char,
-                context=context
             )
             contextualized_chunks.append(contextualized_chunk)
         return contextualized_chunks

 class EnhancedDocumentSummarizer(DocumentSummarizer):
-    def __init__(self, openai_api_key: str, claude_api_key: str, config: RetrievalConfig, embedding_model, chunk_size, chunk_overlap, num_k_rerank, model_cohere_rerank, claude_context_model, prompt_relatorio, gpt_model, gpt_temperature, id_modelo_do_usuario, prompt_modelo):
-        super().__init__(openai_api_key, os.environ.get("COHERE_API_KEY"), embedding_model, chunk_size, chunk_overlap, num_k_rerank, model_cohere_rerank)
         self.config = config
-        self.contextual_retriever = ContextualRetriever(config, claude_api_key, claude_context_model)
         self.logger = logging.getLogger(__name__)
         self.prompt_relatorio = prompt_relatorio
         self.gpt_model = gpt_model
@@ -328,25 +458,28 @@ class EnhancedDocumentSummarizer(DocumentSummarizer):
         self.id_modelo_do_usuario = id_modelo_do_usuario
         self.prompt_modelo = prompt_modelo

-    def create_enhanced_vector_store(self, chunks: List[ContextualizedChunk]) -> Tuple[Chroma, BM25Okapi, List[str]]:
         """Create vector store and BM25 index with contextualized chunks"""
         try:
             # Prepare texts with context
             texts = [f"{chunk.context} {chunk.content}" for chunk in chunks]
-
             # Create vector store
-            metadatas = [{
-                'chunk_id': chunk.chunk_id,
-                'page': chunk.page_number,
-                'start_char': chunk.start_char,
-                'end_char': chunk.end_char,
-                'context': chunk.context
-            } for chunk in chunks]
-
             vector_store = Chroma.from_texts(
-                texts=texts,
-                metadatas=metadatas,
-                embedding=self.embeddings
             )

             # Create BM25 index
@@ -363,37 +496,34 @@ class EnhancedDocumentSummarizer(DocumentSummarizer):
             raise

     def retrieve_with_rank_fusion(
-        self,
-        vector_store: Chroma,
-        bm25: BM25Okapi,
-        chunk_ids: List[str],
-        query: str
     ) -> List[Dict]:
         """Combine embedding and BM25 retrieval results"""
         try:
             # Get embedding results
             embedding_results = vector_store.similarity_search_with_score(
-                query,
-                k=self.config.num_chunks
             )
-
             # Convert embedding results to list of (chunk_id, score)
             embedding_list = [
-                (doc.metadata['chunk_id'], 1 / (1 + score))
                 for doc, score in embedding_results
             ]
-
             # Get BM25 results
             tokenized_query = query.split()
             bm25_scores = bm25.get_scores(tokenized_query)
-
             # Convert BM25 scores to list of (chunk_id, score)
             bm25_list = [
                 (chunk_ids[i], float(score)) for i, score in enumerate(bm25_scores)
             ]

             # Sort bm25_list by score in descending order and limit to top N results
-            bm25_list = sorted(bm25_list, key=lambda x: x[1], reverse=True)[:self.config.num_chunks]

             # Normalize BM25 scores
             max_bm25 = max([score for _, score in bm25_list]) if bm25_list else 1
@@ -402,12 +532,9 @@ class EnhancedDocumentSummarizer(DocumentSummarizer):
             # Pass the lists to rank fusion
             result_lists = [embedding_list, bm25_list]
             weights = [self.config.embedding_weight, self.config.bm25_weight]
-
-            combined_results = reciprocal_rank_fusion(
-                result_lists,
-                weights=weights
-            )
-
             return combined_results

         except Exception as e:
@@ -415,100 +542,112 @@ class EnhancedDocumentSummarizer(DocumentSummarizer):
             raise

     def generate_enhanced_summary(
-        self,
-        vector_store: Chroma,
         bm25: BM25Okapi,
         chunk_ids: List[str],
-        query: str = "Summarize the main points of this document"
     ) -> List[Dict]:
         """Generate enhanced summary using both vector and BM25 retrieval"""
         try:
             # Get combined results using rank fusion
             ranked_results = self.retrieve_with_rank_fusion(
-                vector_store,
-                bm25,
-                chunk_ids,
-                query
             )

             # Prepare context and track sources
             contexts = []
             sources = []
-
             # Get full documents for top results
-            for chunk_id, score in ranked_results[:self.config.num_chunks]:
                 results = vector_store.get(
-                    where={"chunk_id": chunk_id},
-                    include=["documents", "metadatas"]
                 )
-
                 if results["documents"]:
                     context = results["documents"][0]
                     metadata = results["metadatas"][0]
-
                     contexts.append(context)
-                    sources.append({
-                        'content': context,
-                        'page': metadata['page'],
-                        'chunk_id': chunk_id,
-                        'relevance_score': score,
-                        'context': metadata.get('context', '')
-                    })
-
             url_request = f"{api_url}/modelo/{self.id_modelo_do_usuario}"
             resposta = requests.get(url_request)
-
-            if (resposta.status_code != 200):
-                return Response({"error": "Ocorreu um problema. Pode ser que o modelo não tenha sido encontrado. Tente novamente e/ou entre em contato com a equipe técnica"})
-
             modelo_buscado = resposta.json()["modelo"]
-
             llm = ChatOpenAI(
                 temperature=self.gpt_temperature,
-                model_name=self.gpt_model,
                 api_key=self.openai_api_key,
-
             )
-
             prompt_gerar_relatorio = PromptTemplate(
-                template=self.prompt_relatorio,
-                input_variables=["context"]
             )
-
-            relatorio_gerado = llm.predict(prompt_gerar_relatorio.format(context="\n\n".join(contexts)))

             prompt_gerar_modelo = PromptTemplate(
                 template=self.prompt_modelo,
-                input_variables=["context", "modelo_usuario"]
             )
-
-            modelo_gerado = llm.predict(prompt_gerar_modelo.format(context=relatorio_gerado, modelo_usuario=modelo_buscado))
-
             # Split the response into paragraphs
-            summaries = [p.strip() for p in modelo_gerado.split('\n\n') if p.strip()]
-
             # Create structured output
             structured_output = []
             for idx, summary in enumerate(summaries):
-                source_idx = min(idx, len(sources)-1)
-                structured_output.append({
-                    "content": summary,
-                    "source": {
-                        "page": sources[source_idx]['page'],
-                        "text": sources[source_idx]['content'][:200] + "...",
-                        "context": sources[source_idx]['context'],
-                        "relevance_score": sources[source_idx]['relevance_score'],
-                        "chunk_id": sources[source_idx]['chunk_id']
                     }
-                })
-
             return structured_output

         except Exception as e:
             self.logger.error(f"Error generating enhanced summary: {str(e)}")
             raise

-def get_llm_summary_answer_by_cursor_complete(serializer, listaPDFs):
     allPdfsChunks = []

     # Configuration
@@ -517,13 +656,13 @@ def get_llm_summary_answer_by_cursor_complete(serializer, listaPDFs):
         embedding_weight=serializer["embedding_weight"],
         bm25_weight=serializer["bm25_weight"],
         context_window=serializer["context_window"],
-        chunk_overlap=serializer["chunk_overlap"]
     )

     # Initialize enhanced summarizer
     summarizer = EnhancedDocumentSummarizer(
         openai_api_key=os.environ.get("OPENAI_API_KEY"),
-        claude_api_key= os.environ.get("CLAUDE_API_KEY"),
         config=config,
         embedding_model=serializer["hf_embedding"],
         chunk_overlap=serializer["chunk_overlap"],
@@ -535,60 +674,240 @@ def get_llm_summary_answer_by_cursor_complete(serializer, listaPDFs):
         gpt_model=serializer["model"],
         gpt_temperature=serializer["gpt_temperature"],
         id_modelo_do_usuario=serializer["id_modelo_do_usuario"],
-        prompt_modelo=serializer["prompt_modelo"]
     )

-    # # Load and process document
-    # pdf_path = "./Im_a_storyteller.pdf"
-    # chunks = summarizer.load_and_split_document(pdf_path)
-
-    # Load and process document
-    for pdf in listaPDFs:
-        pdf_path = pdf
-        chunks = summarizer.load_and_split_document(pdf_path)
-        allPdfsChunks = allPdfsChunks + chunks
-
-        # Get full text for contextualization
-        loader = PyPDFLoader(pdf_path)
-        pages = loader.load()
-        full_text = " ".join([page.page_content for page in pages])

     # Contextualize chunks
-    contextualized_chunks = summarizer.contextual_retriever.contextualize_chunks(full_text, allPdfsChunks)

     # Create enhanced vector store and BM25 index
-    vector_store, bm25, chunk_ids = summarizer.create_enhanced_vector_store(contextualized_chunks)

     # Generate enhanced summary
     structured_summaries = summarizer.generate_enhanced_summary(
-        vector_store,
-        bm25,
-        chunk_ids,
-        serializer["user_message"]
     )

     # Output results as JSON
     json_output = json.dumps(structured_summaries, indent=2)
     print("\nStructured Summaries:")
     print(json_output)
     return {
-        "resultado": structured_summaries,
-        "parametros-utilizados": {
-            "num_chunks_retrieval": serializer["num_chunks_retrieval"],
-            "embedding_weight": serializer["embedding_weight"],
-            "bm25_weight": serializer["bm25_weight"],
-            "context_window": serializer["context_window"],
-            "chunk_overlap": serializer["chunk_overlap"],
-            "num_k_rerank": serializer["num_k_rerank"],
-            "model_cohere_rerank": serializer["model_cohere_rerank"],
-            "more_initial_chunks_for_reranking": serializer["more_initial_chunks_for_reranking"],
-            "claude_context_model": serializer["claude_context_model"],
-            "gpt_temperature": serializer["gpt_temperature"],
             "user_message": serializer["user_message"],
-            "model": serializer["model"],
-            "hf_embedding": serializer["hf_embedding"],
-            "chunk_size": serializer["chunk_size"],
-            "chunk_overlap": serializer["chunk_overlap"],
-            "prompt_relatorio": serializer["prompt_relatorio"],
-            "prompt_modelo": serializer["prompt_modelo"]
-        }}
 import os
 from typing import List, Dict, Tuple, Optional
 from langchain.text_splitter import RecursiveCharacterTextSplitter

 import requests
 from setup.environment import api_url
 from rest_framework.response import Response
+from langchain.schema import Document
+
+listaContador = []

 def reciprocal_rank_fusion(result_lists, weights=None):
     """Combine multiple ranked lists using reciprocal rank fusion"""

         fused_scores[doc_id] += weights[i] * score

     # Sort by score in descending order
+    sorted_results = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
+
     return sorted_results

+
+os.environ["LANGCHAIN_TRACING_V2"] = "true"
+os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
 os.environ.get("LANGCHAIN_API_KEY")
+os.environ["LANGCHAIN_PROJECT"] = "VELLA"
+

 @dataclass
 class DocumentChunk:

     start_char: int
     end_char: int

+
 @dataclass
 class RetrievalConfig:
     num_chunks: int = 5

     chunk_overlap: int = 200
     chunk_size: int = 1000

+
 @dataclass
 class ContextualizedChunk(DocumentChunk):
     context: str = ""
     embedding: Optional[np.ndarray] = None
     bm25_score: Optional[float] = None

+
 class DocumentSummarizer:
+
+    def __init__(
+        self,
+        openai_api_key: str,
+        cohere_api_key: str,
+        embedding_model,
+        chunk_size,
+        chunk_overlap,
+        num_k_rerank,
+        model_cohere_rerank,
+    ):
         self.openai_api_key = openai_api_key
         self.cohere_client = Client(cohere_api_key)
+        self.embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
         self.text_splitter = RecursiveCharacterTextSplitter(
+            chunk_size=chunk_size, chunk_overlap=chunk_overlap
         )
         self.chunk_metadata = {}  # Store chunk metadata for tracing
         self.num_k_rerank = num_k_rerank
         self.model_cohere_rerank = model_cohere_rerank
+
     def load_and_split_document(self, pdf_path: str) -> List[DocumentChunk]:
         """Load PDF and split into chunks with metadata"""
         loader = PyPDFLoader(pdf_path)
+        pages = (
+            loader.load()
+        )  # Produces a list of Document objects, each item corresponding to ONE full PAGE of the PDF.
         chunks = []
         char_count = 0
+
         for page in pages:
             text = page.page_content
+            page_chunks = self.text_splitter.split_text(
+                text
+            )  # Breaks the single-page Document into a list where each item is a chunk, i.e. a piece smaller than one page.
+
             for chunk in page_chunks:
                 chunk_id = str(uuid.uuid4())
+                start_char = text.find(
+                    chunk
+                )  # Returns the position of the chunk within the full page
                 end_char = start_char + len(chunk)
+
+                doc_chunk = DocumentChunk(  # Builds the chunk object with extra information, such as the chunk's position and id
                     content=chunk,
+                    page_number=page.metadata.get("page") + 1,  # 1-based page numbering
                     chunk_id=chunk_id,
                     start_char=char_count + start_char,
+                    end_char=char_count + end_char,
                 )
                 chunks.append(doc_chunk)
+
                 # Store metadata for later retrieval
                 self.chunk_metadata[chunk_id] = {
+                    "page": doc_chunk.page_number,
+                    "start_char": doc_chunk.start_char,
+                    "end_char": doc_chunk.end_char,
                 }
+
             char_count += len(text)
+
         return chunks

+    def load_and_split_text(self, text: str) -> List[DocumentChunk]:
+        """Load text and split into chunks with metadata - I created this function only for the ragas tests"""
+        page = Document(page_content=text, metadata={"page": 1})
+        chunks = []
+        char_count = 0
+
+        text = page.page_content
+        page_chunks = self.text_splitter.split_text(
+            text
+        )  # Breaks the single-page Document into a list where each item is a chunk, i.e. a piece smaller than one page.
+        print("\n\n\n")
+        print("page_chunks: ", page_chunks)
+
+        for chunk in page_chunks:
+            chunk_id = str(uuid.uuid4())
+            start_char = text.find(
+                chunk
+            )  # Returns the position of the chunk within the full page
+            end_char = start_char + len(chunk)
+
+            doc_chunk = DocumentChunk(  # Builds the chunk object with extra information, such as the chunk's position and id
+                content=chunk,
+                page_number=page.metadata.get("page") + 1,  # 1-based page numbering
+                chunk_id=chunk_id,
+                start_char=char_count + start_char,
+                end_char=char_count + end_char,
+            )
+            chunks.append(doc_chunk)
+
+            # Store metadata for later retrieval
+            self.chunk_metadata[chunk_id] = {
+                "page": doc_chunk.page_number,
+                "start_char": doc_chunk.start_char,
+                "end_char": doc_chunk.end_char,
+            }
+
+            char_count += len(text)
+
+        return chunks
+
+    def create_vector_store(
+        self, chunks: List[DocumentChunk]
+    ) -> Chroma:  # This function is never used
         """Create vector store with metadata"""
         texts = [chunk.content for chunk in chunks]
+        metadatas = [
+            {
+                "chunk_id": chunk.chunk_id,
+                "page": chunk.page_number,
+                "start_char": chunk.start_char,
+                "end_char": chunk.end_char,
+            }
+            for chunk in chunks
+        ]
+
         vector_store = Chroma.from_texts(
+            texts=texts, metadatas=metadatas, embedding=self.embeddings
         )
         return vector_store

+    def rerank_chunks(  # This function is never used
+        self, chunks: List[Dict], query: str, k: int = 5
     ) -> List[Dict]:
         """
         Rerank chunks using Cohere's reranking model.
+
         Args:
             chunks: List of dictionaries containing chunks and their metadata
             query: Original search query
             k: Number of top chunks to return
+
         Returns:
             List of reranked chunks with updated relevance scores
         """
         try:
             # Prepare documents for reranking
+            documents = [chunk["content"] for chunk in chunks]
+
             # Get reranking scores from Cohere
             results = self.cohere_client.rerank(
                 query=query,
                 documents=documents,
                 top_n=k,
+                model=self.model_cohere_rerank,
             )
+
             # Create reranked results with original metadata
             reranked_chunks = []
             for hit in results:
                 original_chunk = chunks[hit.index]
+                reranked_chunks.append(
+                    {**original_chunk, "relevance_score": hit.relevance_score}
+                )
+
             return reranked_chunks
+
         except Exception as e:
             logging.error(f"Reranking failed: {str(e)}")
             return chunks[:k]  # Fallback to original ordering

+    def generate_summary_with_sources(  # This function is never used
+        self,
+        vector_store: Chroma,
+        query: str = "Summarize the main points of this document",
     ) -> List[Dict]:
         """Generate summary with source citations using reranking"""
         # Retrieve more initial chunks for reranking
         relevant_docs = vector_store.similarity_search_with_score(query, k=20)
+
         # Prepare chunks for reranking
         chunks = []
         for doc, score in relevant_docs:
+            chunks.append(
+                {
+                    "content": doc.page_content,
+                    "page": doc.metadata["page"],
+                    "chunk_id": doc.metadata["chunk_id"],
+                    "relevance_score": score,
+                }
+            )
+
         # Rerank chunks
         reranked_chunks = self.rerank_chunks(chunks, query, k=self.num_k_rerank)
+
         # Prepare context and sources from reranked chunks
         contexts = []
         sources = []
+
         for chunk in reranked_chunks:
+            contexts.append(chunk["content"])
+            sources.append(
+                {
+                    "content": chunk["content"],
+                    "page": chunk["page"],
+                    "chunk_id": chunk["chunk_id"],
+                    "relevance_score": chunk["relevance_score"],
+                }
+            )

         prompt_template = """
         Based on the following context, provide multiple key points from the document.

         Key points:
         """
+
+        prompt = PromptTemplate(template=prompt_template, input_variables=["context"])
+
         llm = ChatOpenAI(
+            temperature=0, model_name="gpt-4o-mini", api_key=self.openai_api_key
         )
+
         response = llm.predict(prompt.format(context="\n\n".join(contexts)))
+
         # Split the response into paragraphs
+        summaries = [p.strip() for p in response.split("\n\n") if p.strip()]
+
         # Create structured output
         structured_output = []
         for idx, summary in enumerate(summaries):
             # Associate each summary with the most relevant source
+            structured_output.append(
+                {
+                    "content": summary,
+                    "source": {
+                        "page": sources[min(idx, len(sources) - 1)]["page"],
+                        "text": sources[min(idx, len(sources) - 1)]["content"][:200]
+                        + "...",
+                        "relevance_score": sources[min(idx, len(sources) - 1)][
+                            "relevance_score"
+                        ],
+                    },
                 }
+            )
+
         return structured_output

+    def get_source_context(
+        self, chunk_id: str, window: int = 100
+    ) -> Dict:  # This function is never used
         """Get extended context around a specific chunk"""
         metadata = self.chunk_metadata.get(chunk_id)
         if not metadata:
             return None
+
         return {
+            "page": metadata["page"],
+            "start_char": metadata["start_char"],
+            "end_char": metadata["end_char"],
         }

+
 class ContextualRetriever:
+
+    def __init__(
+        self, config: RetrievalConfig, claude_api_key: str, claude_context_model
+    ):
+        self.config = config  # This self.config is currently not used for anything inside this class. Review whether it should be.
         self.claude_client = Anthropic(api_key=claude_api_key)
         self.logger = logging.getLogger(__name__)
         self.bm25 = None
         self.claude_context_model = claude_context_model
+
     def generate_context(self, full_text: str, chunk: DocumentChunk) -> str:
         """Generate contextual description using Claude"""
         try:
+            # prompt = f"""<document>
+            # {full_text}
+            # </document>
+            # Here is the chunk we want to situate within the whole document
+            # <chunk>
+            # {chunk.content}
+            # </chunk>
+            # Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""
+
+            prompt = f"""You are a language model tasked with providing context to improve the retrieval of information from a chunk extracted from a document. Follow these steps internally (do not display reasoning or reflection in the final output):
+            1. **Chain of Thought (internal)**:
+            - Identify the document ID, which is the value between "NUM." and "- Pág".
+            - Identify the document name from the header.
+            2. **Reflection (internal)**:
+            - Confirm the document ID and name are correctly identified.
+            - Ensure the final context is concise and helpful.
+            3. **Final Response**:
+            - Provide a short context situating the *chunk* within the document, including the document ID and document name.
+            - Do not include any reasoning or reflection in your response.
+            **Example Usage:**
+            ```
+            <document> {full_text} </document>
+            <chunk> {chunk.content} </chunk>
+            Please return only the succinct context (without displaying your internal reasoning), including the document ID and the document name.
+            ```
+            """

             response = self.claude_client.messages.create(
                 model=self.claude_context_model,
                 max_tokens=100,
+                messages=[{"role": "user", "content": prompt}],
             )
+            return response.content[
+                0
+            ].text  # response.content is a list because a list of messages is passed in and a list of messages is returned, the first item being the most recent, i.e. the model's reply
         except Exception as e:
+            self.logger.error(
+                f"Context generation failed for chunk {chunk.chunk_id}: {str(e)}"
+            )
             return ""

+    def contextualize_chunks(
+        self, full_text: List[Document], chunks: List[DocumentChunk]
+    ) -> List[
+        ContextualizedChunk
+    ]:  # Takes a chunk and simply adds a context property to it, that property being the reply of the function above, which calls a Claude model to state the chunk's context
         """Add context to all chunks"""
+
+        smaller_context = ""
         contextualized_chunks = []
+        print("\n\n")
+        print("len(chunks): ", len(chunks))
         for chunk in chunks:
+            contador_pagina = -1
+            while contador_pagina <= 1:
+                local_page = full_text[chunk.page_number + contador_pagina]
+                if local_page:
+                    smaller_context += local_page.page_content
+                contador_pagina += 1
+            print("chunk.page_number: ", chunk.page_number)
+            context = self.generate_context(smaller_context, chunk)
             contextualized_chunk = ContextualizedChunk(
                 content=chunk.content,
                 page_number=chunk.page_number,
                 chunk_id=chunk.chunk_id,
                 start_char=chunk.start_char,
                 end_char=chunk.end_char,
+                context=context,
             )
             contextualized_chunks.append(contextualized_chunk)
         return contextualized_chunks

+
 class EnhancedDocumentSummarizer(DocumentSummarizer):
+
+    def __init__(
+        self,
+        openai_api_key: str,
+        claude_api_key: str,
+        config: RetrievalConfig,
+        embedding_model,
+        chunk_size,
+        chunk_overlap,
+        num_k_rerank,
+        model_cohere_rerank,
+        claude_context_model,
+        prompt_relatorio,
+        gpt_model,
+        gpt_temperature,
+        id_modelo_do_usuario,
+        prompt_modelo,
+    ):
+        super().__init__(
+            openai_api_key,
+            os.environ.get("COHERE_API_KEY"),
+            embedding_model,
+            chunk_size,
+            chunk_overlap,
+            num_k_rerank,
+            model_cohere_rerank,
+        )
         self.config = config
+        self.contextual_retriever = ContextualRetriever(
+            config, claude_api_key, claude_context_model
+        )
         self.logger = logging.getLogger(__name__)
         self.prompt_relatorio = prompt_relatorio
         self.gpt_model = gpt_model

         self.id_modelo_do_usuario = id_modelo_do_usuario
         self.prompt_modelo = prompt_modelo

+    def create_enhanced_vector_store(
+        self, chunks: List[ContextualizedChunk]
+    ) -> Tuple[Chroma, BM25Okapi, List[str]]:
         """Create vector store and BM25 index with contextualized chunks"""
         try:
             # Prepare texts with context
             texts = [f"{chunk.context} {chunk.content}" for chunk in chunks]
+
             # Create vector store
+            metadatas = [
+                {
+                    "chunk_id": chunk.chunk_id,
+                    "page": chunk.page_number,
+                    "start_char": chunk.start_char,
+                    "end_char": chunk.end_char,
+                    "context": chunk.context,
+                }
+                for chunk in chunks
+            ]
+
             vector_store = Chroma.from_texts(
+                texts=texts, metadatas=metadatas, embedding=self.embeddings
             )

             # Create BM25 index

             raise

     def retrieve_with_rank_fusion(
+        self, vector_store: Chroma, bm25: BM25Okapi, chunk_ids: List[str], query: str
     ) -> List[Dict]:
         """Combine embedding and BM25 retrieval results"""
         try:
             # Get embedding results
             embedding_results = vector_store.similarity_search_with_score(
+                query, k=self.config.num_chunks
             )
+
             # Convert embedding results to list of (chunk_id, score)
             embedding_list = [
+                (doc.metadata["chunk_id"], 1 / (1 + score))
                 for doc, score in embedding_results
             ]
+
             # Get BM25 results
             tokenized_query = query.split()
             bm25_scores = bm25.get_scores(tokenized_query)
+
             # Convert BM25 scores to list of (chunk_id, score)
             bm25_list = [
                 (chunk_ids[i], float(score)) for i, score in enumerate(bm25_scores)
             ]

             # Sort bm25_list by score in descending order and limit to top N results
+            bm25_list = sorted(bm25_list, key=lambda x: x[1], reverse=True)[
+                : self.config.num_chunks
+            ]

             # Normalize BM25 scores
             max_bm25 = max([score for _, score in bm25_list]) if bm25_list else 1

             # Pass the lists to rank fusion
             result_lists = [embedding_list, bm25_list]
             weights = [self.config.embedding_weight, self.config.bm25_weight]
+
+            combined_results = reciprocal_rank_fusion(result_lists, weights=weights)
+
             return combined_results

         except Exception as e:

             raise

     def generate_enhanced_summary(
+        self,
+        vector_store: Chroma,
         bm25: BM25Okapi,
         chunk_ids: List[str],
+        query: str = "Summarize the main points of this document",
     ) -> List[Dict]:
         """Generate enhanced summary using both vector and BM25 retrieval"""
         try:
             # Get combined results using rank fusion
             ranked_results = self.retrieve_with_rank_fusion(
+                vector_store, bm25, chunk_ids, query
             )

             # Prepare context and track sources
             contexts = []
             sources = []
+
             # Get full documents for top results
+            for chunk_id, score in ranked_results[: self.config.num_chunks]:
                 results = vector_store.get(
+                    where={"chunk_id": chunk_id}, include=["documents", "metadatas"]
                 )
+
                 if results["documents"]:
                     context = results["documents"][0]
                     metadata = results["metadatas"][0]
+
                     contexts.append(context)
+                    sources.append(
+                        {
+                            "content": context,
+                            "page": metadata["page"],
+                            "chunk_id": chunk_id,
+                            "relevance_score": score,
+                            "context": metadata.get("context", ""),
+                        }
+                    )
+
             url_request = f"{api_url}/modelo/{self.id_modelo_do_usuario}"
             resposta = requests.get(url_request)
+
+            if resposta.status_code != 200:
+                return Response(
+                    {
+                        "error": "Ocorreu um problema. Pode ser que o modelo não tenha sido encontrado. Tente novamente e/ou entre em contato com a equipe técnica"
+                    }
+                )
+
             modelo_buscado = resposta.json()["modelo"]
+
             llm = ChatOpenAI(
                 temperature=self.gpt_temperature,
+                model_name=self.gpt_model,
                 api_key=self.openai_api_key,
             )
+
             prompt_gerar_relatorio = PromptTemplate(
+                template=self.prompt_relatorio, input_variables=["context"]
+            )
+
+            relatorio_gerado = llm.predict(
+                prompt_gerar_relatorio.format(context="\n\n".join(contexts))
             )

             prompt_gerar_modelo = PromptTemplate(
                 template=self.prompt_modelo,
+                input_variables=["context", "modelo_usuario"],
             )
+
+            modelo_gerado = llm.predict(
+                prompt_gerar_modelo.format(
+                    context=relatorio_gerado, modelo_usuario=modelo_buscado
+                )
+            )
+
             # Split the response into paragraphs
+            summaries = [p.strip() for p in modelo_gerado.split("\n\n") if p.strip()]
+
             # Create structured output
             structured_output = []
             for idx, summary in enumerate(summaries):
+                source_idx = min(idx, len(sources) - 1)
+                structured_output.append(
+                    {
+                        "content": summary,
+                        "source": {
+                            "page": sources[source_idx]["page"],
+                            "text": sources[source_idx]["content"][:200] + "...",
+                            "context": sources[source_idx]["context"],
+                            "relevance_score": sources[source_idx]["relevance_score"],
+                            "chunk_id": sources[source_idx]["chunk_id"],
+                        },
                     }
+                )
+
             return structured_output

         except Exception as e:
             self.logger.error(f"Error generating enhanced summary: {str(e)}")
             raise

+
+async def get_llm_summary_answer_by_cursor_complete(
+    serializer, listaPDFs=None, contexto=None
+):
+    """The "contexto" parameter should only be passed when you want to run the ragas test and therefore do not want to pass PDFs"""
     allPdfsChunks = []

     # Configuration

         embedding_weight=serializer["embedding_weight"],
         bm25_weight=serializer["bm25_weight"],
         context_window=serializer["context_window"],
+        chunk_overlap=serializer["chunk_overlap"],
     )

     # Initialize enhanced summarizer
     summarizer = EnhancedDocumentSummarizer(
         openai_api_key=os.environ.get("OPENAI_API_KEY"),
+        claude_api_key=os.environ.get("CLAUDE_API_KEY"),
         config=config,
         embedding_model=serializer["hf_embedding"],
         chunk_overlap=serializer["chunk_overlap"],

         gpt_model=serializer["model"],
         gpt_temperature=serializer["gpt_temperature"],
         id_modelo_do_usuario=serializer["id_modelo_do_usuario"],
+        prompt_modelo=serializer["prompt_modelo"],
     )

+    full_text = ""
+    if contexto:
+        full_text = contexto
+        chunks = summarizer.load_and_split_text(full_text)
+        allPdfsChunks = chunks
+    else:
+        # # Load and process document
+        # pdf_path = "./Im_a_storyteller.pdf"
+        # chunks = summarizer.load_and_split_document(pdf_path)
+
+        # Load and process document
+        for pdf in listaPDFs:
+            pdf_path = pdf
+            chunks = summarizer.load_and_split_document(pdf_path)
+            allPdfsChunks = allPdfsChunks + chunks
+
+            # Get full text for contextualization
+            loader = PyPDFLoader(pdf_path)
+            pages = loader.load()
+            full_text = " ".join([page.page_content for page in pages])

     # Contextualize chunks
+    contextualized_chunks = await summarizer.contextual_retriever.contextualize_chunks(
+        pages, allPdfsChunks
+    )

     # Create enhanced vector store and BM25 index
+    vector_store, bm25, chunk_ids = summarizer.create_enhanced_vector_store(
+        contextualized_chunks
+    )

     # Generate enhanced summary
     structured_summaries = summarizer.generate_enhanced_summary(
+        vector_store, bm25, chunk_ids, serializer["user_message"]
     )

     # Output results as JSON
     json_output = json.dumps(structured_summaries, indent=2)
     print("\nStructured Summaries:")
     print(json_output)
+    texto_completo = ""
+    for x in structured_summaries:
+        texto_completo = texto_completo + x["content"]
     return {
+        "resultado": structured_summaries,
+        "texto_completo": texto_completo,
+        "parametros-utilizados": {
+            "num_chunks_retrieval": serializer["num_chunks_retrieval"],
+            "embedding_weight": serializer["embedding_weight"],
+            "bm25_weight": serializer["bm25_weight"],
+            "context_window": serializer["context_window"],
+            "chunk_overlap": serializer["chunk_overlap"],
+            "num_k_rerank": serializer["num_k_rerank"],
+            "model_cohere_rerank": serializer["model_cohere_rerank"],
+            "more_initial_chunks_for_reranking": serializer[
+                "more_initial_chunks_for_reranking"
+            ],
+            "claude_context_model": serializer["claude_context_model"],
+            "gpt_temperature": serializer["gpt_temperature"],
+            "user_message": serializer["user_message"],
+            "model": serializer["model"],
+            "hf_embedding": serializer["hf_embedding"],
+            "chunk_size": serializer["chunk_size"],
+            "chunk_overlap": serializer["chunk_overlap"],
+            "prompt_relatorio": serializer["prompt_relatorio"],
+            "prompt_modelo": serializer["prompt_modelo"],
+        },
+    }
+
+
+from ragas import evaluate
+
+from langchain.chains import SequentialChain
+from langchain.prompts import PromptTemplate
+
+# from langchain.schema import ChainResult
+from langchain.memory import SimpleMemory
+
+
+def test_ragas(serializer, listaPDFs):
+
+    # Step 2: Setup RetrievalConfig and EnhancedDocumentSummarizer
+    config = RetrievalConfig(
+        num_chunks=serializer["num_chunks_retrieval"],
+        embedding_weight=serializer["embedding_weight"],
+        bm25_weight=serializer["bm25_weight"],
+        context_window=serializer["context_window"],
+        chunk_overlap=serializer["chunk_overlap"],
+    )
+
+    summarizer = EnhancedDocumentSummarizer(
+        openai_api_key=os.environ.get("OPENAI_API_KEY"),
+        claude_api_key=os.environ.get("CLAUDE_API_KEY"),
+        config=config,
+        embedding_model=serializer["hf_embedding"],
+        chunk_overlap=serializer["chunk_overlap"],
+        chunk_size=serializer["chunk_size"],
+        num_k_rerank=serializer["num_k_rerank"],
+        model_cohere_rerank=serializer["model_cohere_rerank"],
+        claude_context_model=serializer["claude_context_model"],
+        prompt_relatorio=serializer["prompt_relatorio"],
+        gpt_model=serializer["model"],
+        gpt_temperature=serializer["gpt_temperature"],
+        id_modelo_do_usuario=serializer["id_modelo_do_usuario"],
+        prompt_modelo=serializer["prompt_modelo"],
+    )
+
+    # Step 1: Define the components
+    def load_and_split_documents(pdf_list, summarizer):
+        """Loads and splits PDF documents into chunks."""
+        all_chunks = []
+        for pdf_path in pdf_list:
+            chunks = summarizer.load_and_split_document(pdf_path)
+            all_chunks.extend(chunks)
+        return {"chunks": all_chunks}
+
+    def get_full_text_from_pdfs(pdf_list):
+        """Gets the full text from PDFs for contextualization."""
+        full_text = []
+        for pdf_path in pdf_list:
+            loader = PyPDFLoader(pdf_path)
+            pages = loader.load()
+            text = " ".join([page.page_content for page in pages])
+            full_text.append(text)
+        return {"full_text": " ".join(full_text)}
+
+    def contextualize_chunks(full_text, chunks, contextual_retriever):
+        """Adds context to chunks using Claude."""
+        contextualized_chunks = contextual_retriever.contextualize_chunks(
+            full_text, chunks
+        )
+        return {"contextualized_chunks": contextualized_chunks}
+
+    def create_vector_store(contextualized_chunks, summarizer):
+        """Creates an enhanced vector store and BM25 index."""
+        vector_store, bm25, chunk_ids = summarizer.create_enhanced_vector_store(
+            contextualized_chunks
+        )
+        return {"vector_store": vector_store, "bm25": bm25, "chunk_ids": chunk_ids}
+
+    def generate_summary(vector_store, bm25, chunk_ids, query, summarizer):
+        """Generates an enhanced summary using the vector store and BM25 index."""
+        structured_summaries = summarizer.generate_enhanced_summary(
+            vector_store, bm25, chunk_ids, query
+        )
+        return {"structured_summaries": structured_summaries}
+
+    # Step 3: Define Sequential Chain
+    chain = SequentialChain(
+        chains=[
+            lambda inputs: load_and_split_documents(inputs["pdf_list"], summarizer),
+            lambda inputs: get_full_text_from_pdfs(inputs["pdf_list"]),
+            lambda inputs: contextualize_chunks(
+                inputs["full_text"], inputs["chunks"], summarizer.contextual_retriever
+            ),
+            lambda inputs: create_vector_store(
+                inputs["contextualized_chunks"], summarizer
+            ),
+            lambda inputs: generate_summary(
+                inputs["vector_store"],
+                inputs["bm25"],
+                inputs["chunk_ids"],
+                inputs["user_message"],
+                summarizer,
+            ),
+        ],
+        input_variables=["pdf_list", "user_message"],
+        output_variables=["structured_summaries"],
+    )
+
+    from ragas.langchain.evalchain import RagasEvaluatorChain
+    from ragas.metrics import (
+        LLMContextRecall,
+        Faithfulness,
+        FactualCorrectness,
+        SemanticSimilarity,
+    )
+    from ragas import evaluate
+    from ragas.llms import LangchainLLMWrapper
+
+    # from ragas.embeddings import LangchainEmbeddingsWrapper
+    # evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
+    evaluator_llm = LangchainLLMWrapper(chain)
+    # evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
+    from datasets import load_dataset
+
+    dataset = load_dataset(
+        "explodinggradients/amnesty_qa", "english_v3", trust_remote_code=True
+    )
+
+    from ragas import EvaluationDataset
+
+    eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])
+
+    metrics = [
+        LLMContextRecall(llm=evaluator_llm),
+        FactualCorrectness(llm=evaluator_llm),
+        Faithfulness(llm=evaluator_llm),
+        # SemanticSimilarity(embeddings=evaluator_embeddings)
+    ]
+    results = evaluate(dataset=eval_dataset, metrics=metrics)
+    print("results: ", results)
+
+    # Step 4: Run the Chain
+    inputs = {
+        "pdf_list": listaPDFs,
         "user_message": serializer["user_message"],
+    }
+    # result = chain.run(inputs)
+    return Response({"msg": results})
+
+    # Step 5: Format the Output
+    # return {
+    #     "resultado": result["structured_summaries"],
+    #     "parametros-utilizados": {
+    #         "num_chunks_retrieval": serializer["num_chunks_retrieval"],
+    #         "embedding_weight": serializer["embedding_weight"],
+    #         "bm25_weight": serializer["bm25_weight"],
+    #         "context_window": serializer["context_window"],
+    #         "chunk_overlap": serializer["chunk_overlap"],
+    #         "num_k_rerank": serializer["num_k_rerank"],
+    #         "model_cohere_rerank": serializer["model_cohere_rerank"],
+    #         "more_initial_chunks_for_reranking": serializer["more_initial_chunks_for_reranking"],
+    #         "claude_context_model": serializer["claude_context_model"],
+    #         "gpt_temperature": serializer["gpt_temperature"],
+    #         "user_message": serializer["user_message"],
+    #         "model": serializer["model"],
+    #         "hf_embedding": serializer["hf_embedding"],
+    #         "chunk_size": serializer["chunk_size"],
+    #         "chunk_overlap": serializer["chunk_overlap"],
+    #         "prompt_relatorio": serializer["prompt_relatorio"],
+    #         "prompt_modelo": serializer["prompt_modelo"],
+    #     },
+    # }
resumos/serializer.py CHANGED
@@ -29,31 +29,99 @@ class ResumoCursorSerializer(serializers.Serializer):
29
  chunk_overlap = serializers.IntegerField(required=False, default=200)
30
 
31
 
32
  system_prompt_relatorio = """
33
- Based on the following context, provide multiple key points from the document.
34
- For each point, create a new paragraph.
35
- Each paragraph should be a complete, self-contained insight.
36
- Include any relevant context provided.
37
-
38
- Context: {context}
39
-
40
- Key points:
41
- """
42
 
43
  user_message = "What are the main points of this document?"
44
 
45
  system_prompt_modelo = """
46
- Based on the following context, provide multiple key points from the document.
47
- For each point, create a new paragraph.
48
- Each paragraph should be a complete, self-contained insight.
49
- Include any relevant context provided.
50
-
51
- Context: {context}
52
-
53
- Modelo do usuário: {modelo_usuario}
54
-
55
- Key points:
56
- """
57
  class ResumoCursorCompeltoSerializer(ResumoCursorSerializer):
58
  system_prompt = None
59
  prompt_relatorio = serializers.CharField(required=False, default=system_prompt_relatorio)
@@ -69,4 +137,17 @@ class ResumoCursorCompeltoSerializer(ResumoCursorSerializer):
69
  more_initial_chunks_for_reranking = serializers.IntegerField(default=20)
70
  claude_context_model = serializers.CharField(required=False, default="claude-3-haiku-20240307")
71
  gpt_temperature = serializers.FloatField(default=0)
72
- id_modelo_do_usuario = serializers.IntegerField(required=True)
29
  chunk_overlap = serializers.IntegerField(required=False, default=200)
30
 
31
 
32
+ # system_prompt_relatorio = """
33
+ # Based on the following context, provide multiple key points from the document.
34
+ # For each point, create a new paragraph.
35
+ # Each paragraph should be a complete, self-contained insight.
36
+ # Include any relevant context provided.
37
+
38
+ # Context: {context}
39
+
40
+ # Key points:
41
+ # """
42
  system_prompt_relatorio = """
43
+ You are a language model specialized in producing concise and well-structured legal case summaries in Portuguese. You will receive a variable `context`, which contains information about a legal case. Your task is to read the `context` carefully and produce a summary report in Portuguese, following the specific format provided below. Do not include any additional comments or reasoning steps in your final answer.
44
+ **Instructions**:
45
+ 1. **Chain of Thought**: Before producing your final answer, you must think through and plan your summary silently, without showing this reasoning in the final output. The final answer must only contain the required formatted report and nothing else.
46
+ 2. **Reading the Context**: Extract the following information from `context`:
47
+ - The name of the defendant (réu).
48
+ - The crime they have been accused of (nome_do_crime).
49
+ - The applicable article and subsection of the Penal Code (artigo_e_inciso_do_crime).
50
+ - The date the accusation was accepted (data_do_recebimento).
51
+ - The ID of the decision document (id_do_documento).
52
+ 3. **Prescription Details**: If no other interruptive or suspensive causes of prescription (statute of limitations) are mentioned, confirm that there are none.
53
+ 4. **Formatting**: Your final answer must strictly follow the format below, in Portuguese, and replace the placeholders with the appropriate information:
54
+ ```
55
+ <formato>
56
+ Trata-se de Ação Penal em que o Ministério Público denunciou [nome_do_reu], pela prática do [nome_do_crime] [artigo_e_inciso_do_crime], do Código Penal.
57
+ A denúncia foi recebida em [data_do_recebimento], conforme Decisão [id_do_documento].
58
+ Não há outras causas interruptivas ou suspensivas da prescrição.
59
+ </formato>
60
+ ```
61
+ 5. **Completeness**: If any piece of required information is missing in the `context`, note that explicitly in the final answer within the format.
62
+ **Reminder**:
63
+ - Do not include your chain of thought in the final output.
64
+ - Do not add extra information or commentary beyond the specified format.
65
+ - The final answer must be in Portuguese.
66
+ ---
67
+
68
+ **Contextual Information (provided separately):**
69
+ {context}
70
+ ---
71
+ **Example with a given context**:
72
+ - Input:
73
+ `context` = "Em 10/03/2021, o Ministério Público denunciou João da Silva, imputando-lhe o crime de furto qualificado, previsto no art. 155, §4º, inciso II, do Código Penal. A denúncia foi recebida em 12/03/2021, conforme Decisão nº 20210312-01. Não há menção a qualquer causa interruptiva ou suspensiva da prescrição."
74
+ - Expected final answer:
75
+ ```
76
+ <formato>
77
+ Trata-se de Ação Penal em que o Ministério Público denunciou João da Silva, pela prática do furto qualificado (art. 155, §4º, inciso II do Código Penal).
78
+ A denúncia foi recebida em 12/03/2021, conforme Decisão 20210312-01.
79
+ Não há outras causas interruptivas ou suspensivas da prescrição.
80
+ </formato>
81
+ """
82
 
83
  user_message = "What are the main points of this document?"
84
 
85
+ # system_prompt_modelo = """
86
+ # Based on the following context, provide multiple key points from the document.
87
+ # For each point, create a new paragraph.
88
+ # Each paragraph should be a complete, self-contained insight.
89
+ # Include any relevant context provided.
90
+
91
+ # Context: {context}
92
+
93
+ # Modelo do usuário: {modelo_usuario}
94
+
95
+ # Key points:
96
+ # """
97
  system_prompt_modelo = """
98
+ You are a large language model that must produce a single final sentence in **Portuguese**. To do this, you will follow a private chain of thought and then produce a final answer. The final answer must follow the formatting and stylistic conventions of the user-provided template (`modelo_do_usuario`). The information to be included in the final sentence is derived from the `context` (a report describing a legal case).
99
+ **Contextual Information (provided separately):**
100
+ {context}
101
+ **User Model (provided separately):**
102
+ {modelo_do_usuario}
103
+ **Instructions:**
104
+ 1. **Goal:** Produce a single final sentence in Portuguese that matches the structure, format, and style of the user's template.
105
+ 2. **Chain of Thought (private to the assistant and not to be shown in the final answer):**
106
+ - Carefully review the `context` which is a legal report of a case.
107
+ - Identify:
108
+ - The defendant’s name.
109
+ - The crime’s name, its article, and any subsection (inciso).
110
+ - The date of receipt of the complaint (data do recebimento da denúncia).
111
+ - The document ID.
112
+ - Ensure these elements are correctly incorporated into the final sentence.
113
+ - Check compliance with the formatting style indicated by the user's template.
114
+ - Compose the sentence following the structure from the user model.
115
+ - Use reflection: Before finalizing the answer, reassess if all required information is included, if the format matches the user model, and if the sentence is written correctly in Portuguese.
116
+ 3. **Reflection Technique (private):**
117
+ After composing the sentence, but before presenting it as the final answer, reflect if:
118
+ - All required details from the `context` are accurately included.
119
+ - The sentence format strictly matches the pattern of the user's template.
120
+ - The sentence is grammatically correct in Portuguese.
121
+ 4. **Final Answer:**
122
+ - After completing the chain of thought and ensuring correctness through reflection, present only the final sentence in Portuguese.
123
+ - Do not show the chain of thought or the reflection step. Only the final formatted sentence should be visible to the user.
124
+ """
125
  class ResumoCursorCompeltoSerializer(ResumoCursorSerializer):
126
  system_prompt = None
127
  prompt_relatorio = serializers.CharField(required=False, default=system_prompt_relatorio)
 
137
  more_initial_chunks_for_reranking = serializers.IntegerField(default=20)
138
  claude_context_model = serializers.CharField(required=False, default="claude-3-haiku-20240307")
139
  gpt_temperature = serializers.FloatField(default=0)
140
+ id_modelo_do_usuario = serializers.IntegerField(required=True)
141
+
142
+
143
+ class RagasSerializer(ResumoCursorCompeltoSerializer):
144
+ files = serializers.ListField(child=serializers.FileField(), required=True)
145
+ id_modelo_do_usuario = serializers.IntegerField(required=False)
146
+ hf_embedding = serializers.CharField(required=False, default="all-MiniLM-L6-v2")
147
+
148
+
149
+ class RagasFromTextSerializer(ResumoCursorCompeltoSerializer):
150
+ files = None
151
+ id_modelo_do_usuario = serializers.IntegerField(required=False, default=9)
152
+ user_message = serializers.CharField(required=True)
153
+ context_provided = serializers.CharField(required=False)
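> Editor's note: `RagasFromTextSerializer` drops the upload field, requires `user_message`, and accepts an optional `context_provided`. An illustrative request against the `/ragas/no-pdf` route added in `setup/urls.py` below (values are placeholders):

```python
import requests  # illustrative client call against a local dev server

resp = requests.post(
    "http://localhost:8000/ragas/no-pdf",
    data={
        "user_message": "Quais os pontos principais deste documento?",
        "context_provided": "Em 10/03/2021, o Ministério Público denunciou ...",
    },
)
print(resp.json())
```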
resumos/views.py CHANGED
@@ -1,11 +1,21 @@
1
  from rest_framework.views import APIView
 
2
  import tempfile, os
3
  from rest_framework.response import Response
4
 
5
- from _utils.resumo_completo_cursor import get_llm_summary_answer_by_cursor_complete
6
  from _utils.resumo_simples_cursor import get_llm_summary_answer_by_cursor
7
  from _utils.utils import DEFAULT_SYSTEM_PROMPT
8
- from .serializer import ResumoCursorCompeltoSerializer, ResumoPDFSerializer, ResumoCursorSerializer
9
  from _utils.main import get_llm_answer_summary, get_llm_answer_summary_with_embedding
10
  from setup.environment import default_model
11
  from rest_framework.parsers import MultiPartParser
@@ -13,125 +23,292 @@ from drf_spectacular.utils import extend_schema
13
 
14
 
15
  class ResumoView(APIView):
16
- parser_classes = [MultiPartParser]
17
-
18
- @extend_schema(
19
- request=ResumoPDFSerializer,
20
- )
21
- def post(self, request):
22
- serializer = ResumoPDFSerializer(data=request.data)
23
- if serializer.is_valid(raise_exception=True):
24
- listaPDFs = []
25
- data = serializer.validated_data
26
- model = serializer.validated_data.get("model", default_model)
27
- print('serializer.validated_data: ', serializer.validated_data)
28
 
29
- for file in serializer.validated_data['files']:
30
- print("file: ", file)
31
- file.seek(0)
32
- with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file: # Create a temporary file to save the uploaded PDF
33
- for chunk in file.chunks(): # Write the uploaded file content to the temporary file
34
- temp_file.write(chunk)
35
- temp_file_path = temp_file.name # Get the path of the temporary file
36
- listaPDFs.append(temp_file_path)
37
- # print('listaPDFs: ', listaPDFs)
38
 
39
- system_prompt = data.get("system_prompt", DEFAULT_SYSTEM_PROMPT)
40
- resposta_llm = get_llm_answer_summary(system_prompt, data["user_message"], listaPDFs, model=model, isIterativeRefinement=data["iterative_refinement"])
41
 
42
- for file in listaPDFs:
43
- os.remove(file)
44
 
45
- return Response({"resposta": resposta_llm})
46
 
47
  class ResumoEmbeddingView(APIView):
48
- parser_classes = [MultiPartParser]
49
-
50
- @extend_schema(
51
- request=ResumoPDFSerializer,
52
- )
53
- def post(self, request):
54
- serializer = ResumoPDFSerializer(data=request.data)
55
- if serializer.is_valid(raise_exception=True):
56
- listaPDFs = []
57
- data = serializer.validated_data
58
- model = serializer.validated_data.get("model", default_model)
59
- print('serializer.validated_data: ', serializer.validated_data)
60
 
61
- for file in serializer.validated_data['files']:
62
- file.seek(0)
63
- with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file: # Create a temporary file to save the uploaded PDF
64
- for chunk in file.chunks(): # Write the uploaded file content to the temporary file
65
- temp_file.write(chunk)
66
- temp_file_path = temp_file.name # Get the path of the temporary file
67
- listaPDFs.append(temp_file_path)
68
- print('listaPDFs: ', listaPDFs)
 
70
- system_prompt = data.get("system_prompt", DEFAULT_SYSTEM_PROMPT)
71
- resposta_llm = get_llm_answer_summary_with_embedding(system_prompt, data["user_message"], listaPDFs, model=model, isIterativeRefinement=data["iterative_refinement"])
 
73
- for file in listaPDFs:
74
- os.remove(file)
75
 
76
- return Response({"resposta": resposta_llm})
77
 
78
  class ResumoSimplesCursorView(APIView):
79
- parser_classes = [MultiPartParser]
80
-
81
- @extend_schema(
82
- request=ResumoCursorSerializer,
83
- )
84
- def post(self, request):
85
- serializer = ResumoCursorSerializer(data=request.data)
86
- if serializer.is_valid(raise_exception=True):
87
- listaPDFs = []
88
- data = serializer.validated_data
89
- print('\nserializer.validated_data: ', serializer.validated_data)
90
-
91
- for file in serializer.validated_data['files']:
92
- file.seek(0)
93
- with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file: # Create a temporary file to save the uploaded PDF
94
- for chunk in file.chunks(): # Write the uploaded file content to the temporary file
95
- temp_file.write(chunk)
96
- temp_file_path = temp_file.name # Get the path of the temporary file
97
- listaPDFs.append(temp_file_path)
98
- print('listaPDFs: ', listaPDFs)
99
-
100
- resposta_llm = get_llm_summary_answer_by_cursor(data, listaPDFs)
101
-
102
- for file in listaPDFs:
103
- os.remove(file)
104
-
105
- return Response({"resposta": resposta_llm})
106
-
107
- class ResumoSimplesCursorCompletoView(APIView):
108
- parser_classes = [MultiPartParser]
109
-
110
- @extend_schema(
111
- request=ResumoCursorCompeltoSerializer,
112
- )
113
- def post(self, request):
114
- serializer = ResumoCursorCompeltoSerializer(data=request.data)
115
- if serializer.is_valid(raise_exception=True):
116
- print('\n\n\n')
117
- print('serializer.validated_data: ', serializer.validated_data)
118
- print('\n\n\n')
119
  listaPDFs = []
120
- data = serializer.validated_data
121
- print('\nserializer.validated_data: ', serializer.validated_data)
122
 
123
- for file in serializer.validated_data['files']:
124
- file.seek(0)
125
- with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file: # Create a temporary file to save the uploaded PDF
126
- for chunk in file.chunks(): # Write the uploaded file content to the temporary file
127
- temp_file.write(chunk)
128
- temp_file_path = temp_file.name # Get the path of the temporary file
129
- listaPDFs.append(temp_file_path)
130
- print('listaPDFs: ', listaPDFs)
131
 
132
- resposta_llm = get_llm_summary_answer_by_cursor_complete(data, listaPDFs)
133
 
134
- for file in listaPDFs:
135
- os.remove(file)
136
 
137
- return Response({"resposta": resposta_llm})
1
  from rest_framework.views import APIView
2
+ from adrf.views import APIView as AsyncAPIView
3
  import tempfile, os
4
  from rest_framework.response import Response
5
 
6
+ from _utils.resumo_completo_cursor import (
7
+ get_llm_summary_answer_by_cursor_complete,
8
+ test_ragas,
9
+ )
10
  from _utils.resumo_simples_cursor import get_llm_summary_answer_by_cursor
11
  from _utils.utils import DEFAULT_SYSTEM_PROMPT
12
+ from .serializer import (
13
+ RagasFromTextSerializer,
14
+ ResumoCursorCompeltoSerializer,
15
+ ResumoPDFSerializer,
16
+ ResumoCursorSerializer,
17
+ RagasSerializer,
18
+ )
19
  from _utils.main import get_llm_answer_summary, get_llm_answer_summary_with_embedding
20
  from setup.environment import default_model
21
  from rest_framework.parsers import MultiPartParser
 
23
 
24
 
25
  class ResumoView(APIView):
26
+ parser_classes = [MultiPartParser]
27
+
28
+ @extend_schema(
29
+ request=ResumoPDFSerializer,
30
+ )
31
+ def post(self, request):
32
+ serializer = ResumoPDFSerializer(data=request.data)
33
+ if serializer.is_valid(raise_exception=True):
34
+ listaPDFs = []
35
+ data = serializer.validated_data
36
+ model = serializer.validated_data.get("model", default_model)
37
+ print("serializer.validated_data: ", serializer.validated_data)
38
+
39
+ for file in serializer.validated_data["files"]:
40
+ print("file: ", file)
41
+ file.seek(0)
42
+ with tempfile.NamedTemporaryFile(
43
+ delete=False, suffix=".pdf"
44
+ ) as temp_file: # Create a temporary file to save the uploaded PDF
45
+ for (
46
+ chunk
47
+ ) in (
48
+ file.chunks()
49
+ ): # Write the uploaded file content to the temporary file
50
+ temp_file.write(chunk)
51
+ temp_file_path = (
52
+ temp_file.name
53
+ ) # Get the path of the temporary file
54
+ listaPDFs.append(temp_file_path)
55
+ # print('listaPDFs: ', listaPDFs)
56
 
57
+ system_prompt = data.get("system_prompt", DEFAULT_SYSTEM_PROMPT)
58
+ resposta_llm = get_llm_answer_summary(
59
+ system_prompt,
60
+ data["user_message"],
61
+ listaPDFs,
62
+ model=model,
63
+ isIterativeRefinement=data["iterative_refinement"],
64
+ )
 
65
 
66
+ for file in listaPDFs:
67
+ os.remove(file)
68
 
69
+ return Response({"resposta": resposta_llm})
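> Editor's note: the save-to-tempfile loop above is repeated verbatim in every view in this file; a small shared helper (a suggested refactor, not in the source) would keep the behavior identical:

```python
import tempfile
from typing import List

def save_uploaded_pdfs(files) -> List[str]:
    """Write uploaded PDFs to temp files and return their paths."""
    paths = []
    for file in files:
        file.seek(0)
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
            for chunk in file.chunks():
                temp_file.write(chunk)
            paths.append(temp_file.name)
    return paths

# Usage in a view:
# listaPDFs = save_uploaded_pdfs(serializer.validated_data["files"])
# ...call the summarizer, then os.remove(path) for each path.
```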
 
70
 
 
71
 
72
  class ResumoEmbeddingView(APIView):
73
+ parser_classes = [MultiPartParser]
74
+
75
+ @extend_schema(
76
+ request=ResumoPDFSerializer,
77
+ )
78
+ def post(self, request):
79
+ serializer = ResumoPDFSerializer(data=request.data)
80
+ if serializer.is_valid(raise_exception=True):
81
+ listaPDFs = []
82
+ data = serializer.validated_data
83
+ model = serializer.validated_data.get("model", default_model)
84
+ print("serializer.validated_data: ", serializer.validated_data)
85
 
86
+ for file in serializer.validated_data["files"]:
87
+ file.seek(0)
88
+ with tempfile.NamedTemporaryFile(
89
+ delete=False, suffix=".pdf"
90
+ ) as temp_file: # Create a temporary file to save the uploaded PDF
91
+ for (
92
+ chunk
93
+ ) in (
94
+ file.chunks()
95
+ ): # Write the uploaded file content to the temporary file
96
+ temp_file.write(chunk)
97
+ temp_file_path = (
98
+ temp_file.name
99
+ ) # Get the path of the temporary file
100
+ listaPDFs.append(temp_file_path)
101
+ print("listaPDFs: ", listaPDFs)
102
 
103
+ system_prompt = data.get("system_prompt", DEFAULT_SYSTEM_PROMPT)
104
+ resposta_llm = get_llm_answer_summary_with_embedding(
105
+ system_prompt,
106
+ data["user_message"],
107
+ listaPDFs,
108
+ model=model,
109
+ isIterativeRefinement=data["iterative_refinement"],
110
+ )
111
 
112
+ for file in listaPDFs:
113
+ os.remove(file)
114
+
115
+ return Response({"resposta": resposta_llm})
116
 
 
117
 
118
  class ResumoSimplesCursorView(APIView):
119
+ parser_classes = [MultiPartParser]
120
+
121
+ @extend_schema(
122
+ request=ResumoCursorSerializer,
123
+ )
124
+ def post(self, request):
125
+ serializer = ResumoCursorSerializer(data=request.data)
126
+ if serializer.is_valid(raise_exception=True):
127
+ listaPDFs = []
128
+ data = serializer.validated_data
129
+ print("\nserializer.validated_data: ", serializer.validated_data)
130
+
131
+ for file in serializer.validated_data["files"]:
132
+ file.seek(0)
133
+ with tempfile.NamedTemporaryFile(
134
+ delete=False, suffix=".pdf"
135
+ ) as temp_file: # Create a temporary file to save the uploaded PDF
136
+ for (
137
+ chunk
138
+ ) in (
139
+ file.chunks()
140
+ ): # Write the uploaded file content to the temporary file
141
+ temp_file.write(chunk)
142
+ temp_file_path = (
143
+ temp_file.name
144
+ ) # Get the path of the temporary file
145
+ listaPDFs.append(temp_file_path)
146
+ print("listaPDFs: ", listaPDFs)
147
+
148
+ resposta_llm = get_llm_summary_answer_by_cursor(data, listaPDFs)
149
+
150
+ for file in listaPDFs:
151
+ os.remove(file)
152
+
153
+ return Response({"resposta": resposta_llm})
154
+
155
+
156
+ class ResumoSimplesCursorCompletoView(AsyncAPIView):
157
+ parser_classes = [MultiPartParser]
158
+
159
+ @extend_schema(
160
+ request=ResumoCursorCompeltoSerializer,
161
+ )
162
+ async def post(self, request):
163
+ serializer = ResumoCursorCompeltoSerializer(data=request.data)
164
+ if serializer.is_valid(raise_exception=True):
165
+ print("\n\n\n")
166
+ print("serializer.validated_data: ", serializer.validated_data)
167
+ print("\n\n\n")
168
+ listaPDFs = []
169
+ data = serializer.validated_data
170
+ print("\nserializer.validated_data: ", serializer.validated_data)
171
+
172
+ for file in serializer.validated_data["files"]:
173
+ file.seek(0)
174
+ with tempfile.NamedTemporaryFile(
175
+ delete=False, suffix=".pdf"
176
+ ) as temp_file: # Create a temporary file to save the uploaded PDF
177
+ for (
178
+ chunk
179
+ ) in (
180
+ file.chunks()
181
+ ): # Write the uploaded file content to the temporary file
182
+ temp_file.write(chunk)
183
+ temp_file_path = (
184
+ temp_file.name
185
+ ) # Get the path of the temporary file
186
+ listaPDFs.append(temp_file_path)
187
+ print("listaPDFs: ", listaPDFs)
188
+
189
193
+ resposta_llm = await get_llm_summary_answer_by_cursor_complete(
194
+ data, listaPDFs
195
+ )
196
+
197
+ final = resposta_llm
198
+ print("\n\n\n")
199
+ print("final: ", final)
200
+
201
+ for file in listaPDFs:
202
+ os.remove(file)
203
+
204
+ return Response({"resposta": final})
205
+
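> Editor's note: with `adrf`'s `AsyncAPIView`, `post` can be a coroutine and `await` the summarizer directly, but the synchronous temp-file writes above still run on the event loop. If that becomes a problem, blocking helpers can be offloaded with `asgiref` (a sketch, not in the source):

```python
from asgiref.sync import sync_to_async

# Offload a blocking helper so it does not stall the event loop.
resposta_llm = await sync_to_async(get_llm_summary_answer_by_cursor)(data, listaPDFs)
```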
206
+
207
+ class RagasView(APIView):
208
+ parser_classes = [MultiPartParser]
209
+
210
+ @extend_schema(
211
+ request=RagasSerializer,
212
+ )
213
+ def post(self, request):
214
+ serializer = RagasSerializer(data=request.data)
215
+ print("\n\n\n")
216
+ print("\n\n\n")
217
+ print("serializer.data: ", serializer)
218
  listaPDFs = []
219
+ if serializer.is_valid(raise_exception=True):
220
+ for file in serializer.validated_data["files"]:
221
+ file.seek(0)
222
+ with tempfile.NamedTemporaryFile(
223
+ delete=False, suffix=".pdf"
224
+ ) as temp_file: # Create a temporary file to save the uploaded PDF
225
+ for (
226
+ chunk
227
+ ) in (
228
+ file.chunks()
229
+ ): # Write the uploaded file content to the temporary file
230
+ temp_file.write(chunk)
231
+ temp_file_path = (
232
+ temp_file.name
233
+ ) # Get the path of the temporary file
234
+ listaPDFs.append(temp_file_path)
235
+
236
+ result = test_ragas(serializer, listaPDFs)
237
+
238
+ for file in listaPDFs:
239
+ os.remove(file)
240
+
241
+ return Response({"msg": result})
242
+
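> Editor's note: `RagasView` reuses `RagasSerializer`'s required `files` list, so requests must be multipart. An illustrative call (field names from the serializer, file names hypothetical):

```python
import requests  # illustrative multipart call against a local dev server

files = [
    ("files", open("caso1.pdf", "rb")),
    ("files", open("caso2.pdf", "rb")),
]
resp = requests.post(
    "http://localhost:8000/ragas",
    files=files,
    data={"user_message": "Quais os pontos principais deste documento?"},
)
print(resp.json())
```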
243
+
244
+ class RagasFromTextView(APIView):
245
+ def post(self, request):
246
+ serializer = RagasFromTextSerializer(data=request.data)
247
+ if serializer.is_valid(raise_exception=True):
248
+ from datasets import Dataset
249
+ from ragas import evaluate
250
+ from ragas.metrics import (
251
+ faithfulness,
252
+ answer_relevancy,
253
+ answer_correctness,
254
+ context_precision,
255
+ context_recall,
256
+ )
257
+ import os
258
+ from datasets import load_dataset
259
+ import pandas as pd
260
+
261
+ os.environ.get("OPENAI_API_KEY")  # NOTE: no-op as written; ragas' OpenAI-backed metrics read OPENAI_API_KEY from the environment themselves
262
+
263
+ df_pandas = pd.read_csv(
264
+ "D:/repositorios/projetos-pessoais/projeto-y-backend-hugginf-face-teste-01/vella-backend/_utils/files/ragas_testset.csv"
265
+ )
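> Editor's note: the absolute `D:/...` path only resolves on the author's machine. Since the CSV is committed under `_utils/files/`, resolving it against `BASE_DIR` from `setup/settings.py` would be portable (a suggestion, not in the source):

```python
import os

import pandas as pd
from django.conf import settings

# Resolve the committed test set relative to the project root.
csv_path = os.path.join(settings.BASE_DIR, "_utils", "files", "ragas_testset.csv")
df_pandas = pd.read_csv(csv_path)
```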
266
+ # print(df_pandas["position"]) # Print a specific column
267
+ data = {
268
+ "user_input": [
269
+ "What is the capital of France?",
270
+ ],
271
+ "response": [],
272
+ "retrieved_contexts": [],
273
+ }
274
+
275
+ reference = [
276
+ "Paris is the capital of France. It is a major European city known for its culture."
277
+ ]
278
+
279
+ for x in df_pandas["user_input"]:
280
+ data["user_input"].append(x)
281
+
282
+ for x in df_pandas["reference"]:
283
+ reference.append(x)
284
+
285
+ print("data: ", reference)
286
+
287
+ for i in range(len(reference)):
288
+ serializer.validated_data["user_message"] = data["user_input"][i]
289
+ resposta_llm = get_llm_summary_answer_by_cursor_complete(
290
+ serializer.validated_data, contexto=reference[i]
291
+ )
292
+ data["response"].append(resposta_llm["texto_completo"])
293
+ lista_reference_contexts = []
294
+ for x in resposta_llm["resultado"]:
295
+ lista_reference_contexts.append(x["source"]["text"])
296
+ data["retrieved_contexts"].append(lista_reference_contexts)
297
 
298
+ # Convert the data to a Hugging Face Dataset
299
+ dataset = Dataset.from_dict(data)
300
 
301
+ # Define the metrics you want to evaluate
302
+ metrics = [
303
+ faithfulness,
304
+ # answer_relevancy,
305
+ # answer_correctness,
306
+ # context_precision,
307
+ # context_recall,
308
+ ]
309
 
310
+ # Evaluate the dataset using the selected metrics
311
+ results = evaluate(dataset, metrics)
312
 
313
+ # results.to_pandas().to_csv("./result.csv")
314
+ return Response({"resposta": results.to_pandas().to_string()})
setup/settings.py CHANGED
@@ -22,7 +22,6 @@ https://docs.djangoproject.com/en/4.2/ref/settings/
22
  BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
23
 
24
 
25
-
26
  # Quick-start development settings - unsuitable for production
27
  # See https://docs.djangoproject.com/en/4.2/howto/deployment/checklist/
28
 
@@ -45,12 +44,13 @@ INSTALLED_APPS = [
45
  "django.contrib.messages",
46
  "django.contrib.staticfiles",
47
  "rest_framework",
 
48
  "django_filters",
49
  "corsheaders",
50
  "pdfs",
51
  "drf_spectacular",
52
  "resumos",
53
- "modelos_usuarios"
54
  ]
55
 
56
  MIDDLEWARE = [
@@ -176,4 +176,4 @@ SPECTACULAR_SETTINGS = {
176
  'SERVE_INCLUDE_SCHEMA': False,
177
  'COMPONENT_SPLIT_REQUEST': True
178
  # OTHER SETTINGS
179
- }
 
22
  BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
23
 
24
 
 
25
  # Quick-start development settings - unsuitable for production
26
  # See https://docs.djangoproject.com/en/4.2/howto/deployment/checklist/
27
 
 
44
  "django.contrib.messages",
45
  "django.contrib.staticfiles",
46
  "rest_framework",
47
+ "adrf",
48
  "django_filters",
49
  "corsheaders",
50
  "pdfs",
51
  "drf_spectacular",
52
  "resumos",
53
+ "modelos_usuarios",
54
  ]
55
 
56
  MIDDLEWARE = [
 
176
  'SERVE_INCLUDE_SCHEMA': False,
177
  'COMPONENT_SPLIT_REQUEST': True
178
  # OTHER SETTINGS
179
+ }
setup/urls.py CHANGED
@@ -5,22 +5,36 @@ from drf_spectacular.views import SpectacularSwaggerView, SpectacularAPIView
5
 
6
 
7
  from pdfs.views import getPDF
8
- from resumos.views import ResumoView, ResumoSimplesCursorView, ResumoSimplesCursorCompletoView
9
  from modelos_usuarios.views import ListCreateModeloUsuarioView, CreateUpdateDeleteModeloUsuarioView, ListModelosPorUsuarioView
10
 
11
  router = routers.DefaultRouter()
12
  # router.register("endpoint-teste", EndpointTesteViewSet, basename="Basename do endpoint-teste")
13
 
14
  urlpatterns = [
15
- path('api/schema/', SpectacularAPIView.as_view(), name='schema'),
16
- path('swagger/', SpectacularSwaggerView.as_view(url_name='schema'), name='swagger-ui'),
17
  path("admin/", admin.site.urls),
18
- path('', include(router.urls)),
19
- path('pdf', getPDF, name='upload-pdf'),
20
- path('resumo', ResumoView.as_view(), name='summary-pdf'),
21
- path('resumo/cursor', ResumoSimplesCursorView.as_view(), name='summary-cursor-pdf'),
22
- path('resumo/cursor-completo', ResumoSimplesCursorCompletoView.as_view(), name='summary-cursor-completo-pdf'),
23
  path("modelo", ListCreateModeloUsuarioView.as_view()),
24
  path("modelo/<int:pk>", CreateUpdateDeleteModeloUsuarioView.as_view()),
25
- path("usuario/<int:user_id>/modelos", ListModelosPorUsuarioView.as_view())
26
- ]
5
 
6
 
7
  from pdfs.views import getPDF
8
+ from resumos.views import (
9
+ RagasFromTextView,
10
+ RagasView,
11
+ ResumoView,
12
+ ResumoSimplesCursorView,
13
+ ResumoSimplesCursorCompletoView,
14
+ )
15
  from modelos_usuarios.views import ListCreateModeloUsuarioView, CreateUpdateDeleteModeloUsuarioView, ListModelosPorUsuarioView
16
 
17
  router = routers.DefaultRouter()
18
  # router.register("endpoint-teste", EndpointTesteViewSet, basename="Basename do endpoint-teste")
19
 
20
  urlpatterns = [
21
+ path("api/schema/", SpectacularAPIView.as_view(), name="schema"),
22
+ path(
23
+ "swagger/", SpectacularSwaggerView.as_view(url_name="schema"), name="swagger-ui"
24
+ ),
25
  path("admin/", admin.site.urls),
26
+ path("", include(router.urls)),
27
+ path("pdf", getPDF, name="upload-pdf"),
28
+ path("resumo", ResumoView.as_view(), name="summary-pdf"),
29
+ path("resumo/cursor", ResumoSimplesCursorView.as_view(), name="summary-cursor-pdf"),
30
+ path(
31
+ "resumo/cursor-completo",
32
+ ResumoSimplesCursorCompletoView.as_view(),
33
+ name="summary-cursor-completo-pdf",
34
+ ),
35
  path("modelo", ListCreateModeloUsuarioView.as_view()),
36
  path("modelo/<int:pk>", CreateUpdateDeleteModeloUsuarioView.as_view()),
37
+ path("usuario/<int:user_id>/modelos", ListModelosPorUsuarioView.as_view()),
38
+ path("ragas", RagasView.as_view()),
39
+ path("ragas/no-pdf", RagasFromTextView.as_view()),
40
+ ]