FastRTC: The Real-Time Communication Library for Python

Published February 25, 2025

In the last few months, many new real-time speech models have been released and entire companies have been founded around both open and closed source models. To name a few milestones:

  • OpenAI and Google released their live multimodal APIs for ChatGPT and Gemini. OpenAI even went so far as to release a 1-800-ChatGPT phone number!
  • Kyutai released Moshi, a fully open-source audio-to-audio LLM. Alibaba released Qwen2-Audio and Fixie.ai released Ultravox - two open-source LLMs that natively understand audio.
  • ElevenLabs raised $180m in their Series C.

Despite the explosion on the model and funding side, it's still difficult to build real-time AI applications that stream audio and video, especially in Python.

  • ML engineers may not have experience with the technologies needed to build real-time applications, such as WebRTC.
  • Even code assistant tools like Cursor and Copilot struggle to write Python code that supports real-time audio/video applications. I know from experience!

That's why we're excited to announce FastRTC, the real-time communication library for Python. The library is designed to make it super easy to build real-time audio and video AI applications entirely in Python!

In this blog post, we'll walk through the basics of FastRTC by building real-time audio applications. At the end, you'll understand the core features of FastRTC:

  • 🗣️ Automatic Voice Detection and Turn Taking built-in, so you only need to worry about the logic for responding to the user.
  • 💻 Automatic UI - Built-in WebRTC-enabled Gradio UI for testing (or deploying to production!).
  • 📞 Call via Phone - Use fastphone() to get a FREE phone number to call into your audio stream (HF Token required. Increased limits for PRO accounts).
  • ⚡️ WebRTC and WebSocket support.
  • 💪 Customizable - You can mount the stream to any FastAPI app so you can serve a custom UI or deploy beyond Gradio.
  • 🧰 Lots of utilities for text-to-speech, speech-to-text, and stop word detection to get you started.

Let's dive in.

Getting Started

We'll start by building the "hello world" of real-time audio: echoing back what the user says. In FastRTC, this is as simple as:

from fastrtc import Stream, ReplyOnPause
import numpy as np

def echo(audio: tuple[int, np.ndarray]):
    # Echo the user's audio straight back; audio is a (sample_rate, audio_data) tuple
    yield audio

stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")
stream.ui.launch()

Let's break it down:

  • ReplyOnPause handles voice detection and turn taking for you, so you only need to write the logic for responding to the user. Any generator that yields audio tuples (represented as (sample_rate, audio_data)) will work.
  • The Stream class builds a Gradio UI so you can quickly test out your stream. Once you have finished prototyping, you can deploy your Stream as a production-ready FastAPI app in a single line of code - stream.mount(app), where app is a FastAPI app (a sketch of this follows below).

Here it is in action:
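
If you'd rather serve the stream from your own FastAPI app instead of the Gradio UI, here's a minimal sketch using the echo handler above. The stream.mount(app) call is the one-liner mentioned in the breakdown; the module name and uvicorn command are placeholders for your own setup:

from fastapi import FastAPI
import numpy as np

from fastrtc import ReplyOnPause, Stream

def echo(audio: tuple[int, np.ndarray]):
    # Echo the (sample_rate, audio_data) tuple straight back to the caller
    yield audio

app = FastAPI()
stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")

# Mount the stream's real-time endpoints onto the existing FastAPI app
stream.mount(app)

# Run it like any other FastAPI app, for example:
#   uvicorn main:app --port 8000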

Leveling-Up: LLM Voice Chat

The next level is to use an LLM to respond to the user. FastRTC comes with built-in speech-to-text and text-to-speech capabilities, so working with LLMs is really easy. Let's change our echo function accordingly:

import os

from fastrtc import (ReplyOnPause, Stream, get_stt_model, get_tts_model)
from openai import OpenAI

sambanova_client = OpenAI(
    api_key=os.getenv("SAMBANOVA_API_KEY"), base_url="https://api.sambanova.ai/v1"
)
stt_model = get_stt_model()
tts_model = get_tts_model()

def echo(audio):
    # Transcribe the user's speech, send it to the LLM, then stream back synthesized speech
    prompt = stt_model.stt(audio)
    response = sambanova_client.chat.completions.create(
        model="Meta-Llama-3.2-3B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    response_text = response.choices[0].message.content
    for audio_chunk in tts_model.stream_tts_sync(response_text):
        yield audio_chunk

stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")
stream.ui.launch()

We're using the SambaNova API since it's fast. The get_stt_model() will fetch Moonshine Base and get_tts_model() will fetch Kokoro from the Hub, both of which have been further optimized for on-device CPU inference. But you can use any LLM/text-to-speech/speech-to-text API or even a speech-to-speech model. Bring the tools you love - FastRTC just handles the real-time communication layer.
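
As a rough sketch of that flexibility, here is the same handler pointed at OpenAI's hosted API instead of SambaNova. The client construction, environment variable, and model name are illustrative assumptions; any OpenAI-compatible chat endpoint can be dropped in the same way:

import os

from fastrtc import ReplyOnPause, Stream, get_stt_model, get_tts_model
from openai import OpenAI

# Illustrative: any OpenAI-compatible client works here
llm_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
stt_model = get_stt_model()  # Moonshine Base
tts_model = get_tts_model()  # Kokoro

def chat(audio):
    prompt = stt_model.stt(audio)
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    for audio_chunk in tts_model.stream_tts_sync(response.choices[0].message.content):
        yield audio_chunk

stream = Stream(ReplyOnPause(chat), modality="audio", mode="send-receive")
stream.ui.launch()

Only the client and model name changed; the real-time plumbing stays the same.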

Bonus: Call via Phone

If instead of stream.ui.launch(), you call stream.fastphone(), you'll get a free phone number to call into your stream. Note that a Hugging Face token is required; PRO accounts get increased limits.
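
In code, the only change from the earlier examples is the last line (this assumes you're logged in to Hugging Face, e.g. via huggingface-cli login or an HF_TOKEN environment variable):

stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")

# Request a temporary phone number instead of launching the Gradio UI
stream.fastphone()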

You'll see something like this in your terminal:

INFO:	  Your FastPhone is now live! Call +1 877-713-4471 and use code 530574 to connect to your stream.
INFO:	  You have 30:00 minutes remaining in your quota (Resetting on 2025-03-23)

You can then call the number and it will connect you to your stream!

Next Steps

  • Read the docs to learn more about the basics of FastRTC.
  • The best way to start building is by checking out the cookbook. Find out how to integrate with popular LLM providers (including OpenAI and Gemini's real-time APIs), integrate your stream with a FastAPI app and do a custom deployment, return additional data from your handler, do video processing, and more!
  • ⭐️ Star the repo and file bug reports and feature requests!
  • Follow the FastRTC org on Hugging Face for updates and check out deployed examples!

Thank you for checking out FastRTC!
