
A curated, developer-friendly learning path for building real-time voice AI agents, from your first STT call to scaling production telephony.
Voice AI has moved from research demos into shipping products in under three years. The modern stack is converging on a clear pattern: a real-time transport layer (WebRTC or telephony), a streaming pipeline of speech-to-text → LLM → text-to-speech, and a turn-taking model that decides when the agent should speak. This list is structured to mirror that learning order: start with the foundations, pick a framework, then drill into individual components and production concerns.
Resources are tagged 🟢 Beginner, 🟡 Intermediate, or 🔴 Advanced. The list prefers free, official docs and vendor-neutral guides, and flags where authors have commercial interests.
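The streaming pattern described above can be sketched in a few lines. This is a toy illustration, not any framework's real API: each fake_* stage is a hypothetical stand-in that consumes a stream and yields results incrementally, which is the property that keeps end-to-end latency low.

```python
import asyncio

# Hypothetical stand-ins for real STT/LLM/TTS services; each stage
# consumes an async stream and yields results incrementally.

async def fake_stt(frames):
    async for frame in frames:
        yield f"word{frame}"          # partial transcript per audio frame

async def fake_llm(words):
    async for word in words:
        yield word.upper()            # one streamed token per input word

async def fake_tts(tokens):
    async for token in tokens:
        yield f"<audio:{token}>"      # one audio chunk per token

async def fake_mic(n):
    for i in range(n):
        yield i                       # pretend audio frames

async def run_pipeline(n_frames=3):
    out = []
    # Stages are composed as nested streams, so the first audio chunk
    # can come back before the last input frame is even processed.
    async for chunk in fake_tts(fake_llm(fake_stt(fake_mic(n_frames)))):
        out.append(chunk)
    return out

print(asyncio.run(run_pipeline()))
```

Real frameworks like LiveKit Agents and Pipecat implement exactly this composition, with VAD and turn detection deciding when to flush the pipeline.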
How to use this list
Read top-to-bottom if you're brand new. The recommended path:
- Foundations → understand the pipeline and latency budget
- Frameworks → pick one (LiveKit Agents or Pipecat are the safest open-source bets) and ship a hello-world
- Components (STT, TTS, LLM, VAD, turn detection) → swap pieces to learn what each layer does
- Transport & telephony → connect to a real phone number
- Evaluation, production, ethics → make it safe enough to ship
Table of contents
- Foundational concepts and learning paths
- Frameworks and orchestration platforms
- Speech-to-text (STT / ASR)
- Text-to-speech (TTS)
- LLMs for voice and real-time AI
- Voice activity detection and turn-taking
- WebRTC fundamentals
- Telephony and SIP
- Tutorials and hands-on projects
- GitHub starter repos and awesome lists
- Datasets and benchmarks
- Beginner-accessible research papers
- Evaluation and testing
- Production, deployment, and scaling
- Ethics, safety, and regulation
- Blogs and newsletters
- Podcasts
- Communities
- Conferences and events
- Hackathons and competitions
1. Foundational concepts and learning paths
Start here. These resources establish the mental model of the voice agent pipeline and the latency budget you'll fight for the rest of your career.
- Voice AI & Voice Agents: An Illustrated Primer Kwindla Hultman Kramer's free, regularly updated long-form primer. The de facto textbook for the field. 🟢 Beginner
- Voice Agent Architecture: STT, LLM, and TTS Pipelines Explained (LiveKit) Visual walkthrough of streaming patterns, turn detection, and where latency accumulates. 🟢 Beginner
- Everything You Need to Know About Voice AI Agents (Deepgram) End-to-end primer covering feature extraction, ASR, LLM reasoning, and synthesis. 🟢 Beginner
- AI Voice Agents (LiveKit Docs) The canonical "what is a voice agent" reference, covering pipeline vs. multimodal architectures and agent state. 🟢 Beginner
- Core Latency in AI Voice Agents (Twilio) Visual explanation of end-of-turn detection, silence thresholds, and smart endpointing. 🟢 Beginner
- Advice on Building Voice AI in June 2025 (Daily.co) Practical P50/P95 latency-budget guidance from Pipecat's creators. 🟡 Intermediate
- How Intelligent Turn Detection Solves the Biggest Challenge in Voice Agents (AssemblyAI) Endpointing is the most underestimated problem in voice AI; this is the clearest deep-dive. 🟡 Intermediate
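To make the latency-budget idea concrete, here is a back-of-the-envelope calculation. The per-stage numbers are illustrative assumptions, not measurements; real figures depend heavily on providers, models, and network path.

```python
# Illustrative per-stage latencies (ms) for one conversational turn.
# These are assumed values for the sake of the arithmetic, not benchmarks.
budget_ms = {
    "end-of-turn detection (silence wait)": 300,
    "STT final transcript": 150,
    "LLM time-to-first-token": 250,
    "TTS time-to-first-byte": 150,
    "network and transport": 100,
}

total = sum(budget_ms.values())
print(f"voice-to-voice latency: {total} ms")
# Roughly 800 ms voice-to-voice is where conversations start to feel
# natural; much past ~1000 ms, users begin talking over the agent.
```

The point of the exercise: no single stage dominates, so shaving 100 ms anywhere in the pipeline is worth doing.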
2. Frameworks and orchestration platforms
The frameworks below all let you wire STT, an LLM, and TTS together. For open-source production work, LiveKit Agents and Pipecat are the two safest bets; for managed dashboards, Vapi, Retell, and Bland win on time-to-first-call.
Open-source frameworks
- LiveKit Agents Voice AI Quickstart Working assistant in under 10 minutes via Python or TypeScript; runs on top of WebRTC. 🟢 Beginner
- Pipecat Quickstart Scaffolds a Deepgram + OpenAI + Cartesia pipeline you can talk to in the browser in 5 minutes. 🟢 Beginner
- Ultravox (fixie-ai/ultravox) Open-weight multimodal speech LLM (Llama/Gemma/Qwen variants) that skips the separate ASR stage for ~150 ms TTFT. 🔴 Advanced
Realtime / speech-to-speech APIs
Vendor-neutral comparisons
3. Speech-to-text (STT / ASR)
Pick one streaming STT and learn it deeply before shopping around. Deepgram, AssemblyAI, and Whisper-derivatives cover most use cases.
Commercial APIs
Open source
- openai/whisper The original repo and the de facto starting point for any DIY ASR project. 🟢 Beginner
- SYSTRAN/faster-whisper CTranslate2 reimplementation up to 4× faster with INT8; recommended for self-hosted Whisper. 🟡 Intermediate
- NVIDIA NeMo (Parakeet / Canary) Top-of-leaderboard open ASR models with streaming inference recipes. 🔴 Advanced
- Moonshine Tiny on-device ASR (~190 MB) optimized for live streaming on edge devices. 🟡 Intermediate
Benchmarks and explainers
4. Text-to-speech (TTS)
Latency, not raw quality, is what kills voice agents; prioritize providers offering true streaming with time-to-first-byte under 200 ms.
Commercial APIs
Open source
- Coqui TTS (idiap fork) Maintained fork of Coqui-TTS / XTTS v2; the most battle-tested OSS TTS toolkit. 🟡 Intermediate
- Piper (OHF-Voice/piper1-gpl) Fast local neural TTS optimized for Raspberry Pi; perfect for offline projects. 🟢 Beginner
- Kokoro 82M Tiny Apache-licensed model that tops community ELO arenas; runs on CPU. 🟢 Beginner
- F5-TTS Diffusion-transformer TTS with high-quality zero-shot voice cloning. 🟡 Intermediate
- Orpheus-TTS Llama-3B-based emotive TTS with ~200 ms streaming and emotion tags. 🟡 Intermediate
- Sesame CSM Conversational, context-aware multi-speaker TTS using a Llama backbone with the Mimi codec. 🔴 Advanced
Streaming and ethics
5. LLMs for voice and real-time AI
A voice agent's perceived intelligence is bounded by how fast the LLM streams its first token. Sub-300 ms TTFT changes the conversation feel entirely.
Low-latency inference
- Groq LPU-based inference cloud delivering ~10× faster Llama tokens/sec than commodity GPUs. 🟢 Beginner
- Cerebras Inference Wafer-scale chip inference with very high throughput on Llama models. 🟢 Beginner
- SambaNova Cloud Reconfigurable Dataflow inference; stable throughput at low latency. 🟢 Beginner
Speech-to-speech models
- OpenAI Realtime API guide Flagship S2S product with WebRTC/WebSocket transport. 🟡 Intermediate
- Google Gemini Live Real-time multimodal voice/video with barge-in and 70-language support. 🟡 Intermediate
- Moshi (kyutai-labs) Open-source full-duplex speech-text foundation model with 200 ms latency; the premier OSS S2S model to study. 🔴 Advanced
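Because TTFT is so central, it is worth instrumenting from day one. A minimal sketch that works with any token iterator; the fake_stream generator below is a hypothetical stand-in for a real streaming LLM client, which yields tokens the same way.

```python
import time

def measure_ttft(token_stream):
    """Return (seconds to first token, full text) for any token iterator."""
    start = time.perf_counter()
    ttft, parts = None, []
    for token in token_stream:
        if ttft is None:
            # Time from request start to the first streamed token.
            ttft = time.perf_counter() - start
        parts.append(token)
    return ttft, "".join(parts)

def fake_stream():
    # Stand-in for a real streaming client; the sleep simulates
    # per-token generation latency.
    for token in ["Hel", "lo", "!"]:
        time.sleep(0.01)
        yield token

ttft, text = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, text: {text!r}")
```

Track TTFT as a distribution (P50/P95), not an average; tail latency is what users notice.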
6. Voice activity detection and turn-taking
Pure VAD is no longer enough; modern agents combine acoustic VAD with a small semantic model that predicts end-of-utterance from words and prosody.
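A toy version of the acoustic half of that idea: an energy threshold stands in for a real VAD model, and a silence-duration rule stands in for smarter endpointing. The frame values and thresholds below are made up for illustration; production systems use models like Silero VAD plus a learned turn detector.

```python
FRAME_MS = 20              # assumed audio frame length
ENERGY_THRESHOLD = 0.1     # below this, a frame counts as silence (made up)
END_OF_TURN_MS = 300       # silence required before the agent may speak

def end_of_turn_index(frame_energies):
    """Index of the frame where the user's turn ends, or None if it hasn't."""
    silent_ms = 0
    for i, energy in enumerate(frame_energies):
        # Accumulate consecutive silence; any speech frame resets the clock.
        silent_ms = silent_ms + FRAME_MS if energy < ENERGY_THRESHOLD else 0
        if silent_ms >= END_OF_TURN_MS:
            return i
    return None

# 10 frames of speech, then 20 frames of near-silence:
frames = [0.5] * 10 + [0.01] * 20
print(end_of_turn_index(frames))   # fires 300 ms into the silence
```

The limitation is visible in the code: a thoughtful pause mid-sentence looks identical to a finished turn, which is exactly why the semantic models linked above exist.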
7. WebRTC fundamentals
WebRTC is the default transport for voice agents that don't run over the phone network. Understanding ICE, STUN, TURN, and SFU architecture is non-negotiable for production work.
8. Telephony and SIP
The phone network has its own physics. Once you know how to point a SIP trunk provider at LiveKit or Pipecat, you can ship.
9. Tutorials and hands-on projects
Pick one tutorial and finish it before starting another. Voice AI is unforgiving of half-built pipelines.
- LiveKit Voice AI Quickstart Official 10-minute walkthrough in Python or Node with starter templates. 🟢 Beginner
- Build Your First AI Voice Agent in Python (LiveKit) End-to-end Python tutorial covering streaming, latency, and deployment. 🟢 Beginner
- Pipecat Quickstart Build and deploy a Deepgram + OpenAI + Cartesia bot in roughly 10 minutes. 🟢 Beginner
- How to Build a Real-Time Voice Agent with Pipecat (AssemblyAI) Production-oriented walkthrough including local testing and Pipecat Cloud deployment. 🟡 Intermediate
- Deepgram Build a Voice AI Agent Step-by-step guide wiring Deepgram STT, GPT, and Aura TTS. 🟢 Beginner
- Build a Voice Assistant with Twilio ConversationRelay + LiteLLM Provider-agnostic tutorial supporting OpenAI, Anthropic, or DeepSeek. 🟡 Intermediate
- freeCodeCamp Build Advanced AI Agents (LiveKit, Exa, LangChain) Free 3-part video course covering interactive voice agents end-to-end. 🟢 Beginner
- freeCodeCamp Private On-Device Voice Assistant Hands-on local stack with Whisper, a local LLM, and system TTS. 🟡 Intermediate
10. GitHub starter repos and awesome lists
Clone these instead of writing boilerplate from scratch.
- livekit/agents The flagship open-source Python/Node framework for production voice agents. 🟢 → 🔴
- pipecat-ai/pipecat Vendor-neutral framework with 40+ STT/LLM/TTS service plugins. 🟢 → 🔴
- livekit-examples/agent-starter-python Production-ready starter with Dockerfile, eval suite, turn detector, and core plugins. 🟢 Beginner
- livekit-examples (org) Official collection of LiveKit Python/React/Swift/Android starters. 🟢 Beginner
- pipecat-ai/pipecat-examples Sample apps for push-to-talk, websocket, telephony, and multimodal use cases. 🟢 → 🟡
- elevenlabs/elevenlabs-examples Runnable Next.js and Python examples for TTS, STT, and real-time agents. 🟢 Beginner
- vocodedev/vocode-core Open-source modular framework for voice-LLM agents on phone, Zoom, or system audio. 🟡 Intermediate (less actively maintained than LiveKit/Pipecat)
- kwindla/macos-local-voice-agents Pipecat example hitting sub-800 ms voice-to-voice latency entirely on M-series Macs. 🟡 Intermediate
- zzw922cn/awesome-speech-recognition-speech-synthesis-papers Comprehensive curated index of ASR, TTS, voice conversion, and speech-LLM papers. 🟡 Intermediate
- wildminder/awesome-ai-voice Up-to-date 2025–2026 list of open-source TTS and voice-cloning models.
- CorentinJ/Real-Time-Voice-Cloning Classic 5-second voice cloning project for understanding TTS fundamentals. 🟡 Intermediate
11. Datasets and benchmarks
Youβll rarely train from scratch, but knowing which dataset a model was trained on explains its accents, languages, and failure modes.
- LibriSpeech ASR Corpus ~1,000 hours of English audiobooks; nearly every ASR paper benchmarks against it. 🟢 Beginner
- Mozilla Common Voice Crowdsourced multilingual dataset (100+ languages); the easiest legal way to fine-tune ASR. 🟢 Beginner
- Common Voice on HuggingFace One-line load_dataset() access for hands-on experiments. 🟢 Beginner
- Open ASR Leaderboard Live comparison of 60+ ASR models on WER and real-time factor. 🟢 Beginner
- Artificial Analysis Speech Independent benchmarks of commercial STT and TTS providers. 🟢 Beginner
- LJSpeech Dataset ~24 hours of single-speaker English audio; baseline corpus for Tacotron 2 and VITS. 🟢 Beginner
- VCTK Corpus ~110 English speakers with diverse accents; widely used for multi-speaker TTS. 🟡 Intermediate
- VoxCeleb (Oxford VGG) Million-utterance "in the wild" dataset for speaker identification and verification. 🟡 Intermediate
12. Beginner-accessible research papers
These are the landmark papers behind the models you'll actually use. Read the Whisper and Common Voice papers first; they're unusually approachable.
13. Evaluation and testing
You can't ship what you can't measure. Voice-agent evaluation is fundamentally probabilistic: the same conversation can pass on one run and fail on the next, so simulation and statistics matter more than fixed test cases.
- Coval Voice AI Testing Platform Defines the core voice-agent metrics: TTFB, WER, resolution rate, simulated accents, and interruptions. 🟢 Beginner
- Coval How to Evaluate Voice Agents (Practical Guide) One of the most cited 2025 guides on probabilistic vs. deterministic evaluation. 🟢 Beginner
- Cekura Metrics Overview Predefined metrics, instruction-following checks, and simulation framework. 🟢 Beginner
- Cekura Performance Testing for Voice Agents Practical 2025 guide on multi-turn simulation and edge-case generation. 🟡 Intermediate
- Hamming AI Production-focused QA platform with simulation, load testing, and 50+ metrics. 🟡 Intermediate
- Hamming Voice Agent Evaluation Metrics Guide Reference of latency percentiles, WER, MOS-style quality, and task completion with formulas. 🟡 Intermediate
- LiveKit Understand and Improve Agent Latency Per-turn latency metrics (e2e, LLM TTFT, TTS TTFB) and where to optimize. 🟡 Intermediate
- Twilio How Do You Know if Your Voice AI Agents Are Working? Vendor-neutral 2025 guide arguing for business-outcome metrics over raw WER/latency. 🟢 Beginner
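WER comes up in nearly every resource above, and it is worth knowing that it is just Levenshtein edit distance over words, normalized by the reference length. A minimal self-contained implementation for intuition (use a maintained library for real evals):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)

# One inserted word against a 4-word reference: 1/4.
print(wer("turn detection is hard", "turn detection is very hard"))
```

Note that WER can exceed 1.0 when the hypothesis is much longer than the reference, which is one reason the Twilio guide above argues for outcome metrics alongside it.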
14. Production, deployment, and scaling
Real production voice infrastructure is the hardest unsolved problem in this space. Read these before quoting anyone a per-minute price.
15. Ethics, safety, and regulation
If you're shipping a voice agent in 2026, disclosure and consent are no longer optional. The FCC and EU AI Act both have teeth.
16. Blogs and newsletters
Subscribe to two or three to stay current; the field moves quickly.
- LiveKit Blog Engineering deep-dives on WebRTC, agents framework releases, and production patterns.
- Deepgram Learn Tutorials on STT/TTS, voice agent design, evals, and pipeline architecture.
- Cartesia Blog State-space TTS models, Sonic releases, and yearly "State of Voice AI" reports.
- ElevenLabs Blog Product and research announcements with implementation notes.
- Daily.co Blog (Pipecat) Posts from Pipecatβs maintainers covering scaling and feature releases.
- Voice AI & Voice Agents Illustrated Primer Free, regularly-updated long-form primer.
- Latent Space (swyx & Alessio) AI Engineer newsletter and podcast with frequent voice-AI episodes.
- Voice AI Newsletter (Krisp) "Future of Voice AI" interview series with founders; published weekly in 2025.
- Voice AI Weekly (Vapi) Weekly Substack rounding up news, products, and tools.
- Voicebot.ai (Synthedia) Long-running daily news and paid newsletter on industry trends.
17. Podcasts
18. Communities
19. Conferences and events
- AI Engineer World's Fair The biggest AI-engineering conference; the Voice track has hosted major launches from ElevenLabs, Vapi, LiveKit, Pipecat, and Cartesia. 🟢 Beginner
- AI Engineer YouTube channel All World's Fair and Summit talks are posted free; the best library of recent voice-AI talks. 🟢 Beginner
- AI Engineer Summit Online Voice playlist Curated playlist including voice-track sessions from leading labs. 🟢 Beginner
- AIEWF 2025 Recap (Latent Space) Written deep-dive into 2025's voice-track talks and major launches. 🟢 Beginner
- VOICE & AI (Modev) Long-running voice technology conference with a broader CX and voicebot focus. 🟢 Beginner
- Project Voice The main U.S. event for conversational AI across voice, text, and chat. 🟢 Beginner
- Interspeech The top academic speech-science conference; intimidating, but worth knowing since most landmark papers debut here. 🔴 Advanced
20. Hackathons and competitions
Suggested learning path
- Week 1 Foundations: Read the LiveKit pipeline post and Voice AI Illustrated Primer (sections 1, 7).
- Week 2 First agent: Finish the LiveKit or Pipecat quickstart end-to-end (sections 2, 9).
- Week 3 Components: Swap STT, TTS, and LLM providers; benchmark latency (sections 3, 4, 5).
- Week 4 Turn-taking & telephony: Add Silero VAD and a turn detector; connect a SIP trunk (sections 6, 8).
- Week 5 Production: Add evaluation, observability, and read the FCC/EU AI Act material (sections 13, 14, 15).
- Ongoing: Subscribe to two newsletters and join a voice AI community on LinkedIn (sections 16, 17, 18).
Contributing
Pull requests welcome. Resources must have been active within the last 12 months, be accessible to developers, and be vendor-neutral or clearly labeled when authored by a commercial party. Open an issue to suggest additions or removals.