podcast-review: audio

Files changed:
- app.py +23 -2
- final_literature_review.pdf +2 -2
- prompts/manuscript_style_example.prompt +134 -0
- prompts/papers_synthesis.prompt +3 -1
- prompts/review_podcast_manus_v2.prompt +43 -0
- prompts/review_podcast_outline.prompt +61 -0
- utils/__init__.py +4 -2
- utils/__pycache__/__init__.cpython-311.pyc +0 -0
- utils/__pycache__/review_flow.cpython-311.pyc +0 -0
- utils/__pycache__/tts_utils.cpython-311.pyc +0 -0
- utils/review_flow.py +91 -2
- utils/tts_utils.py +46 -1
app.py
CHANGED

@@ -92,7 +92,7 @@ from utils.llm_utils import (
     wait_for_files_active
 )
 from utils.tts_utils import generate_tts_audio
-from utils.review_flow import process_multiple_pdfs, generate_final_review_pdf
+from utils.review_flow import process_multiple_pdfs, generate_final_review_pdf, generate_multi_speaker_podcast
 
 logging.basicConfig(level=logging.DEBUG, format="%(asctime)s - %(levelname)s - %(message)s")
 logger = logging.getLogger(__name__)
@@ -398,4 +398,25 @@ elif mode == "Write a Literature Review":
             data=final_pdf_bytes,
             file_name="final_literature_review.pdf",
             mime="application/pdf"
-        )
+        )
+        # Save final text and a base filename for podcast generation.
+        st.session_state["final_text"] = final_review_text
+        st.session_state["pdf_basename"] = "final_literature_review"
+
+    if st.session_state.get("final_text"):
+        if st.button("Generate Multi-Speaker Podcast 🎤"):
+            progress_bar = st.progress(0)
+            with st.spinner("Generating multi-speaker podcast..."):
+                try:
+                    podcast_audio = asyncio.run(
+                        generate_multi_speaker_podcast(st.session_state["final_text"], progress_bar=progress_bar)
+                    )
+                    st.audio(podcast_audio, format="audio/mp3")
+                    st.download_button(
+                        "Download Podcast Audio 📥",
+                        podcast_audio,
+                        file_name=f"{st.session_state.get('pdf_basename', 'literature_review')}_podcast.mp3",
+                        mime="audio/mp3"
+                    )
+                except Exception as e:
+                    st.error("Podcast generation failed: " + str(e))
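Note: a minimal sketch of how the new podcast entry point could be exercised outside the Streamlit app, for example as a quick local smoke test. The run_local_test helper and the file paths are illustrative, not part of this commit; like the app, it assumes a RunPod token is available in the RUNPOD_GPU environment variable.

import asyncio

from utils.review_flow import generate_multi_speaker_podcast


def run_local_test(markdown_path: str, out_path: str = "podcast_test.mp3"):
    # Read a previously generated review as plain markdown text.
    with open(markdown_path, "r", encoding="utf-8") as f:
        review_text = f.read()
    # Outside Streamlit there is no progress bar widget, so pass progress_bar=None.
    audio_bytes = asyncio.run(generate_multi_speaker_podcast(review_text, progress_bar=None))
    with open(out_path, "wb") as f:
        f.write(audio_bytes)


if __name__ == "__main__":
    run_local_test("final_literature_review.md")  # hypothetical input path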
final_literature_review.pdf
CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:7d10e19d189310b940f8735f1725bac1451b7259bdd442aa2b7385bde9026f96
+size 298472
prompts/manuscript_style_example.prompt
ADDED (+134 lines)

{
  "opening": [
    {
      "speaker": "host",
      "text": "Today, we're diving into the rapidly evolving world of Large Language Models, or LLMs, and their increasing role in qualitative data analysis. The sheer volume of textual data researchers are dealing with today is staggering, and it's really pushing the boundaries of traditional, labor-intensive analysis methods. This is where LLMs are starting to show real promise, offering potential ways to enhance efficiency. But a central question remains: how effectively can these models actually perform the nuanced tasks required in qualitative research, and what are the trade-offs?"
    },
    {
      "speaker": "expert",
      "text": "Absolutely. Qualitative data analysis, as practiced through methods like thematic analysis, as Braun and Clarke established back in 2006, is inherently time-consuming. It involves deep engagement with the data, identifying patterns, and constructing meaning. Researchers are now grappling with how to scale these processes, and LLMs, with their impressive natural language processing capabilities, seem like a potential solution. The question isn't just *if* LLMs can help, but *how* they can best be integrated into the research workflow."
    }
  ],
  "foundation": [
    {
      "speaker": "host",
      "text": "So, let's get a bit more concrete. What are some of the specific tasks within qualitative data analysis that researchers are actually applying these LLMs to? And, what were those expectations at the beginning? What are the hopes for using them in this field?"
    },
    {
      "speaker": "expert",
      "text": "We're seeing LLMs applied across a range of tasks, from deductive coding, where you're applying a pre-defined coding framework to the data, to thematic analysis, which is more about identifying emergent themes. Some researchers are also using them for annotation, for instance, classifying the sentiment or political leaning of social media posts, as Törnberg did in his 2023 study on Twitter data. Initially, there's this hope that LLMs could almost automate large parts of the qualitative analysis process – imagine drastically reducing the time spent coding vast amounts of text. And many hoped that the models could produce a nearly perfect coding result."
    },
    {
      "speaker": "expert",
      "text": "But, of course, qualitative analysis isn't just about speed; it's about depth and nuanced interpretation. So, alongside the excitement, there's also a healthy dose of skepticism, and a real need to rigorously evaluate the performance of these models."
    }
  ],
  "main_discussion_methodological_approaches": [
    {
      "speaker": "host",
      "text": "Let's move the discussion to the ways researchers implement LLMs. Let's dive into the practical side. What are the methodologies being tested? How are researchers setting this up?"
    },
    {
      "speaker": "expert",
      "text": "There's a real spectrum of approaches. Some, like Törnberg's work with political Twitter annotation, are using a 'zero-shot' approach. Essentially, this means they're leveraging the LLM's pre-existing knowledge without any task-specific training. It's testing the inherent capabilities of the model. Then you have approaches like the 'LLM-in-the-loop' framework that Dai and colleagues developed in 2023. This is a much more structured, iterative process where the LLM and human coders work together, refining the analysis in stages. It's a collaborative model, aiming to leverage the strengths of both human and artificial intelligence."
    },
    {
      "speaker": "expert",
      "text": "Another significant methodology is LACA, or LLM-Assisted Content Analysis, presented by Chew and their team, also in 2023. This is a detailed, step-by-step process for integrating LLMs into deductive coding. It even involves the LLM participating in the co-development of the codebook, which is pretty fascinating. Then they run rigorous tests to check its reliability against human coders."
    },
    {
      "speaker": "host",
      "text": "This 'LLM-in-the-loop' approach you mention sounds particularly interesting. How does that work in practice, and how does it contrast with, say, the zero-shot approach or LACA?"
    },
    {
      "speaker": "expert",
      "text": "In the 'LLM-in-the-loop' model, it's not about handing over the entire analysis to the LLM. Instead, the LLM might generate an initial set of codes or themes, and then the human researcher steps in to review, refine, and validate those codes. It's a back-and-forth process. The human provides context and expertise that the LLM might lack, ensuring the analysis remains grounded in the nuances of the data. This differs significantly from the zero-shot approach, where you're really relying on the LLM's raw ability. LACA, while also collaborative, is more focused on a pre-defined, deductive coding process, whereas 'LLM-in-the-loop' can be more flexible and adaptable to different stages of analysis, including the more inductive, theme-development stages."
    }
  ],
  "main_discussion_performance_and_metrics": [
    {
      "speaker": "host",
      "text": "Okay, now let's change the perspective a little bit: evaluation! How do we measure the success in these applications of LLMs? What are the main metrics that researchers are using?"
    },
    {
      "speaker": "expert",
      "text": "A dominant metric across the board is inter-rater reliability, or IRR. This essentially measures the level of agreement between different coders – in this case, between the LLM and human coders, or even between multiple human coders to establish a benchmark. There are different ways to calculate IRR. Kirsten and colleagues, in their 2024 paper, along with Dai's team, use Cohen's Kappa, which is a common measure. Törnberg uses Krippendorff's Alpha. And Chew and colleagues, in their LACA work, advocate for Gwet's AC1, arguing it's more robust in certain situations, particularly when you have rare codes."
    },
    {
      "speaker": "expert",
      "text": "Besides IRR, researchers are also looking at things like accuracy – how often does the LLM get the coding 'right' compared to a gold standard? – and bias, which is a huge concern. Törnberg's study, for example, examined whether ChatGPT-4 exhibited any political bias in its annotations. And then there's the issue of 'hallucinations,' where the LLM essentially invents information or makes up codes that aren't grounded in the data."
    },
    {
      "speaker": "host",
      "text": "This concept of 'hallucination' is quite specific to LLMs, isn't it? Could you elaborate on what that means in this context?"
    },
    {
      "speaker": "expert",
      "text": "Right. In the context of LLMs, 'hallucination' refers to the model generating text that is factually incorrect, nonsensical, or, in the case of qualitative coding, not supported by the actual data being analyzed. It's as if the model is 'making things up.' It's a significant concern because, in qualitative research, we're striving for interpretations that are deeply rooted in the data. A hallucinating LLM could lead to misleading or completely inaccurate findings."
    },
    {
      "speaker": "expert",
      "text": "And, to circle back a bit to inter-rater reliability, what we aim to achieve with such metrics is a measure of consistency. In traditional qualitative research, you'd have multiple human coders analyzing the same data to ensure the findings aren't just the result of one person's subjective interpretation. With LLMs, IRR helps us understand how well the model's coding aligns with human judgment, and whether it's consistent enough to be reliable."
    }
  ],
  "main_discussion_strengths_and_limitations": [
    {
      "speaker": "host",
      "text": "Let's move to a discussion of strengths and limitations. What are we learning about where LLMs excel and where they fall short in qualitative data analysis, based on the current research?"
    },
    {
      "speaker": "expert",
      "text": "One consistent finding is that LLMs show real promise for efficiency. Chew and colleagues, and Dai and their team, both demonstrate significant time savings when using LLMs for coding. We're talking about reducing coding time from minutes per document to seconds. In terms of performance, Kirsten's 2024 research shows that GPT-4, in particular, can achieve very high agreement with human coders on simpler, what they call 'semantic' coding tasks. It's almost on par with human inter-coder agreement in some cases."
    },
    {
      "speaker": "expert",
      "text": "However – and this is a crucial point – task complexity matters a lot. Kirsten and colleagues found that agreement, for both humans and LLMs, decreases as you move from these simpler, semantic coding tasks to more complex, 'latent' coding. Latent coding requires deeper interpretation, drawing inferences, and understanding underlying meanings. This is where LLMs currently struggle more. It seems they're better at identifying surface-level patterns than at grasping the deeper, more nuanced interpretations that are often central to qualitative research. Chew also finds that GPT-3.5 struggles with the formatting of codes, and does better with semantic tasks."
    },
    {
      "speaker": "host",
      "text": "You've made this crucial distinction between 'semantic' and 'latent' tasks. Could you provide a concrete example to illustrate why latent coding presents such a challenge for these models?"
    },
    {
      "speaker": "expert",
      "text": "Sure. Imagine you're analyzing interview transcripts about people's experiences with a new technology. A semantic coding task might be to identify every time someone mentions a specific feature of the technology – that's relatively straightforward. A latent coding task, however, might be to identify underlying themes of, say, 'technological anxiety' or 'empowerment.' These themes aren't always explicitly stated; they require the coder to interpret the overall tone, the context, and the subtle meanings in the language used. That's much harder for an LLM to do reliably, at least at this stage. There are some edge cases. But it's very challenging to program it."
    },
    {
      "speaker": "expert",
      "text": "It's also important to note that there are differences between LLM models. The Kirsten study consistently found GPT-4 outperforming GPT-3.5, suggesting that model choice is a significant factor. And there are also inherent limitations like those hallucinations we talked about. And while techniques like few-shot learning – giving the LLM a few examples of how to code – can help mitigate these issues, they don't eliminate them entirely."
    }
  ],
  "implications": [
    {
      "speaker": "host",
      "text": "So, let's broaden the scope. What are the practical implications of all this for qualitative researchers? And looking ahead, what are the critical challenges and future directions for this field?"
    },
    {
      "speaker": "expert",
      "text": "On the practical side, LLMs offer a pathway to significantly speed up the initial stages of qualitative analysis, especially with large datasets. They can help with tasks like identifying key terms, generating initial codes, or summarizing large volumes of text. But, and I want to emphasize this, it's crucial to remember that these are tools to *assist* human researchers, not to replace them. Kirsten and their colleagues rightly caution against a one-size-fits-all approach. They recommend carefully evaluating the specific task and choosing the right LLM and approach accordingly."
    },
    {
      "speaker": "expert",
      "text": "The ethical considerations are paramount. We need to be very mindful of potential biases in these models, and ensure that we're not introducing or amplifying those biases in our research. Transparency is also key. Researchers need to be very clear about how they're using LLMs, what prompts they're using, and what the limitations of their approach are. The 'black box' nature of some of these models is a real concern, and we need to find ways to make their reasoning more transparent and understandable. Going forward, I think a major focus will be on developing better methods for human-AI collaboration in qualitative research. We need interfaces and workflows that allow researchers to seamlessly interact with LLMs, to review and refine their outputs, and to bring their own expertise to bear on the analysis. And, of course, there's ongoing work on improving the models themselves, particularly their ability to handle those more complex, interpretive tasks."
    },
    {
      "speaker": "host",
      "text": "What specific types of tools or interfaces do you envision that could best support effective human-AI collaboration in qualitative data analysis in the future?"
    },
    {
      "speaker": "expert",
      "text": "I imagine interfaces that allow for a more fluid dialogue between the researcher and the LLM. For example, imagine being able to highlight a passage of text and ask the LLM, 'Why did you code this in this way?' or 'What other codes might be relevant here?' and receive a clear, understandable explanation. Or perhaps a system that allows you to easily compare and contrast different coding schemes generated by different LLMs, or by the LLM at different stages of the analysis. The key is to move beyond the model as a 'black box' and to create tools that empower the researcher to critically engage with the LLM's outputs and to integrate them thoughtfully into their own analysis."
    }
  ],
  "wrap": [
    {
      "speaker": "host",
      "text": "To sum things up, it's clear that Large Language Models hold considerable promise for transforming qualitative data analysis. The potential for increased efficiency, particularly with large datasets, is undeniable. But the research also highlights the crucial importance of human oversight and the need for a nuanced, task-specific approach. LLMs are powerful tools, but they are not a replacement for careful, critical thinking and interpretive expertise."
    },
    {
      "speaker": "expert",
      "text": "Precisely. We're in a period of rapid development and exploration. The studies we've discussed today, from Törnberg's work on annotation to Chew's LACA methodology, Dai's 'LLM-in-the-loop' framework, and Kirsten's investigation of task complexity, all point to a future where LLMs play an increasingly significant role in qualitative research. But it's a future that demands caution, ethical awareness, and a continued commitment to rigorous methodological standards. The focus on human-AI collaboration, rather than full automation, is key. And the path forward requires further research into bias mitigation, model improvement, and, perhaps most importantly, the development of user interfaces and methodologies that enable a seamless integration."
    }
  ]
}
prompts/papers_synthesis.prompt
CHANGED

@@ -8,7 +8,7 @@ Using the paper summaries, comparative table, and detailed outline provided above:
 - Sections as outlined in the analysis above
 - Comparative overview (featuring the provided table)
 - Conclusions and implications
-- References (
+- References (APA style)
 
 2. FORMATTING REQUIREMENTS:
 - Use markdown formatting
@@ -35,6 +35,8 @@
 - Length: 2500 words (excluding table and references)
 - Academic language appropriate to the discipline
 - APA style citations
+- Citations in text: (Author, Year) or Author (Year); for more than two authors use et al.
+- For core papers with multiple authors it may be useful to create abbreviations for the author names, e.g. BP (Brown & Pinker, 2010) or TRZ (Taylor, Reddy, & Ziegler, 2015)
 - Complete reference list
 - Refer to papers with proper academic citations, not filenames
 - Adapt style and emphasis to disciplinary norms
prompts/review_podcast_manus_v2.prompt
ADDED (+43 lines)

Generate a JSON-formatted podcast script based on a provided outline and research paper that results in a natural academic discussion. The script should follow these guidelines:

FORMAT:
{
  "segment_name": [
    {
      "speaker": "host/expert",
      "text": "content"
    }
  ]
}

TONE & STYLE:
- Begin directly with the topic; no introductions.
- Use sophisticated yet conversational language, reflecting a tone of intellectual curiosity similar to Ezra Klein's style.
- Keep the discussion academically substantive but accessible.
- Avoid overly formal or cliché expressions; refrain from using phrases like "Exactly" or "It's not only about...it's about".

SPEAKING PATTERNS:
- The host should:
  * Start with context and emphasize the importance of the topic.
  * Reference studies naturally (e.g., "What I found interesting in Smith and colleagues' study...").
  * Ask specific questions about the research findings.
  * Draw connections between different studies organically.
- The expert should:
  * Provide clear explanations of research methods and findings.
  * Reference authors naturally (e.g., "The research by Jones and colleagues...").
  * Introduce and elaborate on key points, sometimes taking control of the discussion.
  * Build on the host's questions by connecting broader research insights.

STRUCTURE:
- Organize the script into distinct segments (e.g., opening, foundation, individual paper discussions, comparative discussion, implications, and wrap-up).
- Ensure smooth transitions between segments, with either the host or expert explicitly introducing the next section.
- Gradually build complex ideas and make natural comparisons between studies.
- End with a discussion of broader implications and future research directions.

GENERAL INSTRUCTIONS:
- Base the discussion on the provided outline and research paper, ensuring that references to studies are integrated naturally.
- Do not include any special characters (e.g., asterisks, parentheses) that might affect text-to-speech readability.
- Keep the conversation engaging and focused on the substance of the research.
- Ensure that the resulting JSON is correctly formatted and adheres to the defined structure.

Generate a script that would represent approximately a 30-minute discussion.
prompts/review_podcast_outline.prompt
ADDED (+61 lines)

Create a highly detailed yet focused outline for a 30-minute research podcast episode that maintains academic rigor while being engaging and accessible:
Base it on the provided PDF literature review.

PRE-SHOW PREPARATION:
- Extract 2-3 core theoretical/practical contributions
- List technical terms requiring clarification
- Note potential complex concepts that need unpacking
- Identify natural points of connection between topics

STRUCTURAL OUTLINE:

1. OPENING (1-2 min)
- "Today we're talking about..." [frame topic in broader context]
- Why this matters now
- Quick orientation to core problem/challenge

2. FOUNDATION (4-5 min)
- Accessible entry point to complex topic
- Current state of knowledge/research
- [2 focused questions that bridge common understanding with technical depth]

3. MAIN DISCUSSION (17-19 min)
Structure as 3-4 segments, each containing:
- Precise lead-in question
- Space for expert elaboration (2-4 min)
- 1 strategic follow-up question
- Optional "Earlier you mentioned..." callbacks

Key elements to include:
- Natural progression from broader to specific insights
- Points where host can request clarification
- Moments to synthesize or connect ideas
- Strategic devil's advocate questions
- Brief "let's unpack that" interventions for technical terms

4. IMPLICATIONS (3-4 min)
- Practical applications
- Critical challenges ahead
- Future directions
[1-2 challenge questions from host]

5. WRAP (1-2 min)
- Key insights crystallized
- Broader significance
- Forward-looking statement

INTERACTION GUIDELINES:
- Host interventions should be precise and purposeful
- Allow expert to fully develop ideas before follow-up
- Use "You mentioned..." to return to important points
- Frame challenges as curious inquiry rather than debate

PACING NOTES:
- Mark natural transition points
- Note segments requiring extended expert explanation
- Identify moments for brief host guidance
- Plan smooth segment transitions
- Plan where the host or expert may go on a tangent. They are allowed to do so!

[Apply to specific content while maintaining balance between accessibility and depth]
Just return the outline - no notes, info, or remarks
utils/__init__.py
CHANGED

@@ -3,7 +3,8 @@ from .llm_utils import (
     async_generate_text,
     generate_title_reference_and_classification,
     upload_to_gemini,
-    wait_for_files_active
+    wait_for_files_active,
+    clean_json_response
 )
 
 from .file_utils import (
@@ -19,4 +20,5 @@ from .review_flow import (
     generate_final_review_pdf,
     create_comparative_table_prompt)
 
-from .tts_utils import generate_tts_audio
+from .tts_utils import (generate_tts_audio,
+                        generate_podcast_audio)
utils/__pycache__/__init__.cpython-311.pyc
CHANGED
Binary files a/utils/__pycache__/__init__.cpython-311.pyc and b/utils/__pycache__/__init__.cpython-311.pyc differ

utils/__pycache__/review_flow.cpython-311.pyc
CHANGED
Binary files a/utils/__pycache__/review_flow.cpython-311.pyc and b/utils/__pycache__/review_flow.cpython-311.pyc differ

utils/__pycache__/tts_utils.cpython-311.pyc
CHANGED
Binary files a/utils/__pycache__/tts_utils.cpython-311.pyc and b/utils/__pycache__/tts_utils.cpython-311.pyc differ
utils/review_flow.py
CHANGED

@@ -1,11 +1,14 @@
+# review_flow.py
 import os
+import json
 import time
 import asyncio
 import logging
 import streamlit as st
 from markdown_pdf import MarkdownPdf, Section
 from utils.file_utils import load_prompt, save_intermediate_output
-from utils.llm_utils import get_generation_model, async_generate_text, upload_to_gemini, wait_for_files_active
+from utils.llm_utils import get_generation_model, async_generate_text, upload_to_gemini, wait_for_files_active, clean_json_response
+from utils.tts_utils import generate_podcast_audio
 
 logger = logging.getLogger(__name__)
 
@@ -185,4 +188,90 @@ async def generate_final_review_pdf(structured_outputs):
         logger.error(f"Error generating PDF: {e}")
     progress_bar.progress(100)
 
-    return final_checked_review
+    return final_checked_review
+
+
+async def generate_multi_speaker_podcast(final_markdown: str, progress_bar=None):
+    """
+    Generate multi-speaker podcast audio from the final review markdown.
+    Uses an outline prompt, a manuscript prompt, and a style example.
+    The final_markdown (generated review text) is sent directly to the LLM.
+    Optionally updates a progress bar.
+    """
+    # Step 0: Load prompt files.
+    if progress_bar:
+        progress_bar.progress(5)
+    outline_prompt = load_prompt("./dev/dev_prompts/review_podcast_outline.prompt")
+    manuscript_prompt = load_prompt("./dev/dev_prompts/review_podcast_manus_v2.prompt")
+    style_example = load_prompt("./dev/dev_prompts/manuscript_style_example.prompt")
+
+    # Step 1: Outline generation.
+    # Include the final markdown text in the outline prompt.
+    if progress_bar:
+        progress_bar.progress(15)
+    thinking_model_name, thinking_config = get_generation_model("thinking")
+    thinking_config.system_instruction = (
+        "You are an extremely smart researcher and communicator who crafts extremely detailed outlines for science podcasts."
+    )
+    combined_outline_prompt = f"Review Text:\n{final_markdown}\n\n{outline_prompt}"
+    outline = await async_generate_text(
+        prompt=combined_outline_prompt,
+        pdf_file=None,  # No PDF needed here.
+        model_name=thinking_model_name,
+        generation_config=thinking_config
+    )
+
+    # Step 2: Manuscript generation.
+    if progress_bar:
+        progress_bar.progress(40)
+    flash_model_name, flash_config = get_generation_model("flash")
+    flash_config.response_mime_type = "application/json"
+    flash_config.system_instruction = (
+        "You are an extremely smart manuscript writer for science communication. You produce detailed and engaging content that can be used for TTS - ABSOLUTELY NO special characters like asterisks or parentheses. "
+        "Below is an example of the desired writing style:\n\n" + style_example
+    )
+    combined_manuscript_prompt = (
+        f"Review Text:\n{final_markdown}\n\n"
+        f"Outline:\n{outline}\n\n"
+        f"{manuscript_prompt}\n\n"
+        "ABSOLUTELY NO special characters like asterisks or parentheses."
+    )
+    manuscript_json_str = await async_generate_text(
+        prompt=combined_manuscript_prompt,
+        pdf_file=None,
+        model_name=flash_model_name,
+        generation_config=flash_config
+    )
+
+    manuscript_json_str = clean_json_response(manuscript_json_str)
+    try:
+        manuscript_data = json.loads(manuscript_json_str)
+    except Exception as e:
+        raise Exception("Failed to parse manuscript JSON: " + str(e))
+
+    # Build segments for TTS.
+    segments = []
+    if isinstance(manuscript_data, dict):
+        for section in manuscript_data.values():
+            if isinstance(section, list):
+                segments.extend(section)
+            elif isinstance(section, dict) and "text" in section:
+                segments.append(section)
+            elif isinstance(section, str):
+                segments.append({"speaker": "host", "text": section})
+    elif isinstance(manuscript_data, list):
+        segments = manuscript_data
+
+    if not segments:
+        raise Exception("No valid segments found in manuscript JSON for TTS.")
+
+    # Step 3: Generate podcast audio.
+    if progress_bar:
+        progress_bar.progress(60)
+    podcast_audio = generate_podcast_audio(segments)
+
+    if progress_bar:
+        progress_bar.progress(100)
+    return podcast_audio
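To make the segment-flattening step above concrete: the manuscript JSON returned by the LLM is keyed by section names (as in the style example added in this commit) and is collapsed into a single ordered list of speaker turns before TTS. A small sketch with made-up content, reusing the same flattening logic:

import json

# Illustrative manuscript JSON, shaped like the style example in this commit.
manuscript_json_str = json.dumps({
    "opening": [
        {"speaker": "host", "text": "Today we are looking at ..."},
        {"speaker": "expert", "text": "Right, and the key question is ..."}
    ],
    "wrap": [
        {"speaker": "host", "text": "To sum things up ..."}
    ]
})

manuscript_data = json.loads(manuscript_json_str)

# Same flattening logic as in generate_multi_speaker_podcast:
segments = []
if isinstance(manuscript_data, dict):
    for section in manuscript_data.values():
        if isinstance(section, list):
            segments.extend(section)

print(segments)
# -> three dicts in order: host (opening), expert (opening), host (wrap)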
utils/tts_utils.py
CHANGED

@@ -36,4 +36,49 @@ def generate_tts_audio(text, voice="af_heart", speed=1.0):
         elif status_json.get("status") in ["FAILED", "ERROR"]:
             logger.error("TTS generation failed.")
             st.error("TTS generation failed. Please try again later.")
-            raise Exception("TTS generation failed.")
+            raise Exception("TTS generation failed.")
+
+def generate_podcast_audio(segments, host_voice="am_michael", expert_voice="af_bella", silence_duration_ms=300, speed=1.0):
+    RUNPOD_API_TOKEN = os.getenv("RUNPOD_GPU")
+    headers = {
+        'Content-Type': 'application/json',
+        'Authorization': f'Bearer {RUNPOD_API_TOKEN}'
+    }
+    data_payload = {
+        "input": {
+            "mode": "podcast",
+            "segments": segments,
+            "host_voice": host_voice,
+            "expert_voice": expert_voice,
+            "silence_duration_ms": silence_duration_ms,
+            "speed": speed
+        }
+    }
+
+    run_url = "https://api.runpod.ai/v2/ozz8w092oprwqx/run"
+    print("Podcast TTS generation started, please wait...")
+    response = requests.post(run_url, headers=headers, json=data_payload)
+    if response.status_code != 200:
+        raise Exception(f"RunPod API call failed with status {response.status_code}: {response.text}")
+
+    run_id = response.json().get("id")
+    status_url = f"https://api.runpod.ai/v2/ozz8w092oprwqx/status/{run_id}"
+
+    while True:
+        time.sleep(5)
+        status_response = requests.post(status_url, headers=headers, json=data_payload)
+        status_json = status_response.json()
+        logger.debug("Podcast TTS status: %s", status_json.get("status"))
+
+        if status_json.get("status") == "COMPLETED":
+            download_url = status_json.get("output", {}).get("download_url")
+            if download_url:
+                mp3_response = requests.get(download_url)
+                if mp3_response.status_code == 200:
+                    print("Podcast TTS generation completed!")
+                    return mp3_response.content
+                else:
+                    raise Exception(f"Failed to download audio: {mp3_response.status_code}")
+        elif status_json.get("status") in ["FAILED", "ERROR"]:
+            logger.error("Podcast TTS generation failed.")
+            raise Exception("Podcast TTS generation failed.")
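A short usage sketch for the new generate_podcast_audio helper, assuming the RUNPOD_GPU environment variable holds a valid token. The example segments and output filename are illustrative only; the voice names are simply the defaults from the function signature above.

import os

from utils.tts_utils import generate_podcast_audio

segments = [
    {"speaker": "host", "text": "Welcome to the show."},
    {"speaker": "expert", "text": "Thanks, glad to be here."},
]

if os.getenv("RUNPOD_GPU"):
    audio_bytes = generate_podcast_audio(
        segments,
        host_voice="am_michael",    # default host voice
        expert_voice="af_bella",    # default expert voice
        silence_duration_ms=300,
        speed=1.0,
    )
    with open("demo_podcast.mp3", "wb") as f:  # hypothetical output path
        f.write(audio_bytes)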