
The ultimate Voice AI Stack
Apr 2, 2025
Voice AI is fundamentally transforming a multitude of industries. The rapid evolution of voice models has accelerated growth across both vertical and horizontal applications, establishing voice as a primary interface for next-generation AI systems. However, creating a voice agent that feels natural while delivering powerful functionality remains a challenge.
This article breaks down the complete voice AI stack, examining how companies build, evaluate, and scale their voice agents as their applications grow.
Note: The providers mentioned in this article are not listed based on our preference but rather on customer research and feedback we gathered while writing it.
Types of Voice Architecture
Voice AI architectures have evolved significantly over the past year. While speech-to-speech models are gaining traction, the majority of the market still relies on cascading architectures—where there’s still plenty of room for optimization across its layers.
Cascading Architecture
Cascading architecture follows three distinct sequential steps:
Speech-to-Text (STT): Converting spoken language into written text
Language Model Processing: Understanding and generating appropriate responses
Text-to-Speech (TTS): Converting text responses back into natural-sounding speech
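To make the flow concrete, here's a minimal sketch of the three stages wired together. The stage functions are hypothetical stubs standing in for whichever providers you pick, not any specific vendor's API:

```python
# Minimal cascading pipeline: STT -> LLM -> TTS.
# Each stage function is a hypothetical stub; in a real system you would
# swap in client calls to your chosen providers behind the same interface.

def transcribe(audio: bytes) -> str:
    """Speech-to-Text: convert spoken audio into text (stub)."""
    return "what are your opening hours?"

def generate_reply(text: str) -> str:
    """Language model processing: produce a response (stub)."""
    return "We're open from 9am to 6pm, Monday through Friday."

def synthesize(text: str) -> bytes:
    """Text-to-Speech: render the reply as audio (stub)."""
    return text.encode()  # placeholder for real audio bytes

def handle_turn(audio_in: bytes) -> bytes:
    # Each stage blocks on the previous one, which is where the
    # cascade's latency (and loss of non-textual context) comes from.
    return synthesize(generate_reply(transcribe(audio_in)))
```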
While functional and relatively straightforward to implement, cascading architectures face two significant limitations:
High latency: Processing times can exceed 1000ms (compared to the 200-500ms response gaps typical of human conversation)
Information loss: Non-textual context and emotional cues often get lost in translation
Speech-to-Speech Models
Emerging speech-to-speech models represent the cutting edge of voice AI, with OpenAI and Google leading the way with initial releases. These models bypass the text conversion stage entirely, processing audio directly into audio responses. While promising for reducing latency and preserving emotional context, these approaches still face challenges with precise control and have not yet seen widespread adoption.
Full-Stack Voice Orchestration Platforms
These comprehensive platforms handle the entire infrastructure needed for voice applications. Popular examples include Vapi, Retell AI, and Bland, which provide integrated solutions for quickly deploying voice applications.
Core Components of the Voice AI Stack & Where to Pull Levers
A complete voice AI stack consists of several critical components, each contributing to the overall user experience. Here’s an overview and where it’s best to pull levers to optimize your stack.
1. Speech-to-Text (STT)
Leading providers include:
Deepgram: Focuses on real-time transcription with specialized models for different contexts
AssemblyAI: Known for high accuracy and specialized features like sentiment analysis
Whisper (OpenAI): Offers robust multilingual capabilities and high accuracy
Where to pull levers:
Consider fine-tuning for domain-specific vocabulary
Optimize for your users' accents and speech patterns
Implement error tracking and recovery mechanisms
When choosing your STT provider, make sure to optimize for time-to-first-token. And if your application has a highly specialized use case with hard-to-pronounce words, consider fine-tuning your STT model.
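As a concrete way to compare providers on that metric, here's a small sketch that times how long a streaming transcription takes to produce its first partial result. It assumes only that the SDK exposes some iterator of partial transcripts, which every major streaming STT API does in one form or another:

```python
import time

def time_to_first_token(partials) -> float:
    """Seconds until a streaming STT session yields its first partial.

    `partials` is assumed to be an iterator of partial transcripts,
    started just before this call; the exact object varies by SDK.
    """
    start = time.monotonic()
    for _ in partials:
        return time.monotonic() - start  # first partial arrived
    return float("inf")  # stream ended without producing anything
```

Run the same recorded audio through each candidate provider and compare the distributions, not just single runs.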
2. LLM
GPT-4o (OpenAI) for general applications: the go-to model, given it performs well on latency, instruction following, and tool calling.
Google Gemini 2.0 Flash: if your application relies heavily on tool calling and instruction following, it's worth testing; Coval has seen instances where Gemini performs better on strict instruction following.
Domain-specific models for specialized use cases in healthcare, finance, etc.
Where to pull levers:
Optimize prompting strategies and context management
Consider fine-tuning for specific use cases
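One concrete context-management tactic is capping how much conversation history goes into each request, since prompt length drives both latency and cost. A rough sketch, using character counts as a stand-in for real token counting (which you'd do with your model's tokenizer):

```python
MAX_PROMPT_CHARS = 8_000  # illustrative budget; tune per model and tokenizer

def build_messages(system_prompt: str, history: list[dict]) -> list[dict]:
    """Keep the system prompt plus as many recent turns as fit the budget."""
    kept, used = [], len(system_prompt)
    for turn in reversed(history):  # walk from the newest turn backwards
        used += len(turn["content"])
        if used > MAX_PROMPT_CHARS:
            break
        kept.append(turn)
    return [{"role": "system", "content": system_prompt}] + kept[::-1]
```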
3. Text-to-Speech (TTS)
Top solutions include:
ElevenLabs: Known for low latency while supporting multiple languages
Cartesia: Specializes in natural-sounding speech with emotional range
Rime: Focuses on training models with local accents for more authentic speech
Play.ht: Offers a wide range of voices and customization options
Amazon Polly: Provides reliable performance with good multilingual support and wide adoption among enterprises
Hume: Strong focus on the emotional layer
Where to pull levers:
Generally best sourced from specialized providers
Select based on latency requirements, voice quality needs, and per-language benchmarks if you support multiple languages
Consider custom voices for brand differentiation
4. Turn Detection
Critical for natural conversation flow, turn detection determines when a user has finished speaking and the AI should respond. Leading solutions include:
LiveKit: Offers robust audio streaming with turn detection capabilities
Pipecat: Offers an open-source, community-driven, native audio turn detection model
Custom implementations that use silence detection, prosodic features, and contextual cues
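As a baseline for the custom route, silence detection alone can be done with a simple energy threshold plus a hangover period. The thresholds below are illustrative and assume 16-bit PCM in 20ms frames; production endpointers layer prosodic and contextual cues on top so users don't get cut off mid-thought:

```python
import struct

SILENCE_RMS = 500     # energy threshold for 16-bit PCM (illustrative)
HANGOVER_FRAMES = 25  # ~500ms of silence at 20ms frames before committing

def frame_rms(frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return (sum(s * s for s in samples) / max(len(samples), 1)) ** 0.5

def end_of_turn(frames) -> bool:
    """Return True once enough consecutive quiet frames are observed."""
    quiet = 0
    for frame in frames:
        quiet = quiet + 1 if frame_rms(frame) < SILENCE_RMS else 0
        if quiet >= HANGOVER_FRAMES:
            return True
    return False
```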
5. Emotional Engine (Emerging Component)
These systems analyze voice signals to detect emotional cues, allowing the AI to adjust its responses accordingly:
Sentiment analysis: Detecting positive, negative, or neutral sentiment
Emotional state recognition: Identifying specific emotions like happiness, frustration, or confusion
Stress level assessment: Measuring tension in the voice to adapt conversation style
Deepgram offers audio intelligence models that score sentiment and recognize caller intent, while Hume AI has developed a “voice-based LLM” that aims to predict emotions and cadence.
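What you do with these signals matters as much as detecting them. As an illustration, a detected sentiment score in the [-1, 1] range might steer the agent's behavior like this (the parameter names are hypothetical knobs, not any vendor's API):

```python
def adapt_style(sentiment: float) -> dict:
    """Map a sentiment score in [-1, 1] to response-shaping parameters.

    The keys below are hypothetical knobs you might expose to your own
    prompt and TTS layers, not a specific provider's API.
    """
    if sentiment < -0.3:  # caller sounds frustrated
        return {"tone": "calm", "pace": "slow", "offer_escalation": True}
    if sentiment > 0.3:   # caller sounds happy
        return {"tone": "upbeat", "pace": "normal", "offer_escalation": False}
    return {"tone": "neutral", "pace": "normal", "offer_escalation": False}
```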
6. Transport Layer
The infrastructure that handles audio transmission:
WebRTC: Web Real-Time Communication for browser-based applications
WebSockets: For lightweight, bidirectional communication
If you care about latency, we recommend using WebRTC instead of WebSockets.
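The core reason is that WebSockets run over TCP, so one lost packet stalls every audio frame queued behind it (head-of-line blocking), while WebRTC's UDP-based media transport can drop a frame and move on. For comparison, here's roughly what the WebSocket path looks like, using the `websockets` library and a hypothetical endpoint URL:

```python
import asyncio
import websockets  # pip install websockets

async def stream_audio(frames, uri: str = "wss://example.com/audio"):
    """Push raw PCM frames over a WebSocket at real-time pace.

    A lost TCP packet delays every frame behind it, which is why
    latency-sensitive voice apps tend to prefer WebRTC instead.
    """
    async with websockets.connect(uri) as ws:
        for frame in frames:
            await ws.send(frame)       # binary frame of raw PCM
            await asyncio.sleep(0.02)  # pace at ~20ms per frame
```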
Where to pull levers:
Generally best left to off-the-shelf solutions
Focus on reliability and latency rather than customization
Voice Orchestration Models – Optimizing for Speed to Market
Companies beginning their voice AI journey typically prioritize quick deployment and iteration over customization. At this stage, the focus is on:
End-to-end platforms that offer quick setup and deployment
Strong community support and comprehensive documentation
Integrated VOIP capabilities for seamless communication
Flexibility to swap components as needs evolve
Popular end-to-end platforms include:
Retell
Vapi
Bland
Pipecat Cloud
LiveKit Cloud
If you prefer working with smaller startups, other providers entering the market include Vocode, Leaping AI, and Sindarin.
These solutions typically integrate with Twilio for VOIP capabilities and phone numbers, providing a complete solution for getting to market quickly. For early-stage companies, the specific technology choices matter less than the ability to launch, learn, and iterate rapidly.
What makes a great Voice AI Stack?
Modular architectures to easily swap components as needed. This is probably the most important point: make sure your platform allows you to scale and doesn't lock you in from the beginning. While voice orchestration platforms are great for getting started, your moat will develop as you specialize your stack for your use case.
Redundancy plans to prevent single points of failure
Strong evaluation frameworks to benchmark performance across providers
Continuous monitoring systems to track quality and detect issues
When to Evolve Your Stack
As voice AI applications mature, companies often reach inflection points that necessitate stack evolution. These typically occur when:
Performance requirements become more specific: Needing particular latency benchmarks or accuracy in specialized domains
Custom voice development becomes a competitive advantage
Cost optimization becomes critical at scale
Domain-specific requirements emerge that generic solutions can't address
At this stage, companies often shift toward more flexible orchestration models, such as:
Open-source frameworks like Pipecat or LiveKit
Custom component integration for specific parts of the stack
Hybrid approaches that maintain some packaged components while replacing others
These flexible frameworks provide options to:
Run applications locally or in custom cloud environments
Use different transport layers based on specific needs
Configure and deploy applications according to unique requirements
Implement redundancy across critical components
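Redundancy at the component level can be as simple as an ordered failover wrapper. A sketch for TTS, where `providers` is an ordered list of callables wrapping your primary and backup vendors (the wrapper signature is an assumption, not a standard interface):

```python
def synthesize_with_fallback(text: str, providers) -> bytes:
    """Try each TTS provider in order, failing over on any error.

    Each provider callable maps text -> audio bytes; in production you
    would also enforce a per-call latency budget so a slow primary
    can't stall the conversational turn.
    """
    last_err = None
    for synth in providers:
        try:
            return synth(text)
        except Exception as err:  # network error, rate limit, outage...
            last_err = err        # remember it and fall through to backup
    raise RuntimeError("all TTS providers failed") from last_err
```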
How to Choose Providers and What to Build
Selecting the right components for your voice AI stack requires a methodical approach:
Define Your Critical Metrics
Latency: Maximum acceptable response time
Accuracy: Required transcription and understanding precision
Voice quality: Natural speech requirements
Language/accent support: Coverage needed for your user base
Cost: Budget constraints at scale
Benchmark Potential Providers
Create standardized tests covering your specific use cases
Test across different user accents and speech patterns
Measure both average and p95/p99 performance
Evaluate integration complexity and long-term support
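The p95/p99 point deserves emphasis: averages hide the slow calls that actually frustrate users. Given latency samples from repeated test runs, the percentiles are straightforward to compute with the standard library:

```python
import statistics

def latency_summary(samples_ms: list[float]) -> dict:
    """Average plus tail percentiles over end-to-end latency samples."""
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "avg_ms": statistics.fmean(samples_ms),
        "p95_ms": qs[94],  # 95th percentile: the "bad call" experience
        "p99_ms": qs[98],  # 99th percentile: the worst-case tail
    }
```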
Build vs. Buy Decision Framework
For each component, evaluate:
Is this a core differentiator for your product?
Do off-the-shelf solutions meet your specific requirements?
What is the maintenance cost of a custom solution?
How critical is this component to your overall user experience?
Most successful companies build custom solutions only where they provide strategic advantage, while leveraging specialized providers for components where differentiation isn't critical.
Evaluating Your Voice AI Stack
Testing voice AI presents unique challenges compared to traditional software testing due to its multi-turn, probabilistic nature. A comprehensive evaluation strategy addresses several key dimensions:
Component-Level Performance
Each part of the stack requires individual evaluation:
STT: Word Error Rate (WER), accuracy on domain-specific terms
LLM: Response relevance, factual accuracy / hallucination, instruction following
TTS: Voice naturalness, pronunciation accuracy, emotional appropriateness
End-to-end: Total latency, conversation success rate
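WER is the standard STT metric and simple enough to compute yourself: it's the word-level edit distance between the reference transcript and the hypothesis, divided by the reference length. A minimal implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# word_error_rate("turn the lights off", "turn lights of") -> 0.5
```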
Testing Challenges
Voice AI evaluation faces several unique difficulties:
Probabilistic outcomes: Unlike traditional software with fixed inputs/outputs
Multi-turn dynamics: Conversations that build on previous exchanges
Non-binary results: Success often involves trade-offs between metrics
Dataset limitations: Finding or creating representative test data
Metric development: Defining what constitutes "good" performance
Building Effective Evaluation Systems
To overcome these challenges:
Create synthetic datasets that represent your specific use cases
Develop comprehensive metrics across technical and user experience dimensions
Implement continuous monitoring to track performance trends
Establish benchmarks for each component and the overall system
Build automated testing pipelines for regression detection
Most companies start with basic testing of core flows but should evolve to comprehensive evaluation as they scale. This includes automated testing, real-world scenario validation, and detailed performance analytics across all stack components.
Metrics + Datasets: The Hard Part
The most challenging aspects of voice AI evaluation are:
Finding representative data: Collecting diverse samples across accents, background conditions, and conversation types
Building meaningful metrics: Creating measures that correlate with actual user satisfaction
Balancing competing factors: Managing trade-offs between speed, accuracy, and naturalness
Testing edge cases: Identifying and handling unusual but important scenarios
Companies that excel in voice AI invest heavily in evaluation infrastructure, recognizing that systematic testing is essential for building reliable, high-quality voice experiences.
Enterprise Takeaways
For enterprises considering voice AI implementation, several key considerations emerge:
Strategic Implementation
Start with defined use cases that deliver clear value
Choose modular architectures that can evolve with your needs
Invest in evaluation infrastructure from the beginning
Build with compliance and security as foundational requirements
Provider Selection
Benchmark against your specific requirements, not generic metrics
Consider support and longevity alongside current performance
Evaluate integration complexity with existing systems
Test with your actual user demographics and use cases
Risk Management
Implement robust testing and monitoring strategies
Benchmark performance regularly across all components
Plan redundancy for critical system elements
Define clear escalation paths for handling system failures
Conduct privacy and security reviews specific to voice data
The key to success is finding the right balance between leveraging existing platforms for speed and building custom components where they provide strategic advantage. Custom benchmark testing based on your specific stack and use cases will be crucial in making these decisions.
If you're curious to get more insights and updates on the ultimate voice AI Stack, visit http://theultimatevoiceaistack.com/