In a wide-ranging Q&A, ElevenLabs co-founder Mati Staniszewski discusses how audio models work, the company's business strategy, the conversational Turing Test, and the evolution of voice agents in the AI landscape.
In a recent interview with Stripe co-founder John Collison on the Cheeky Pint podcast, ElevenLabs co-founder Mati Staniszewski provided an in-depth look at the company's approach to voice AI technology, its business model, and the broader implications for conversational AI systems.
How Audio Models Actually Work
Staniszewski explained that ElevenLabs' audio models operate through a sophisticated pipeline that begins with understanding the semantic meaning of text and translating that into acoustic features. The company's models are trained on vast datasets of human speech across multiple languages and voices, learning the subtle nuances that make speech sound natural.
The key innovation, according to Staniszewski, is the model's ability to maintain consistency across long-form content while adapting to different speaking styles and emotional contexts. Unlike earlier text-to-speech systems that sounded robotic or monotonous, ElevenLabs' models can vary pitch, pace, and emphasis based on the content being read.
"The model doesn't just read words," Staniszewski said. "It understands the intent behind them and adjusts delivery accordingly."
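The idea that delivery is adjusted to intent, not just read off the words, can be illustrated with a toy sketch. This is not ElevenLabs' actual architecture; the classes, intent categories, and prosody numbers below are illustrative stand-ins for the text-to-prosody planning stage a real system would feed into a waveform model.

```python
# Toy sketch of intent-aware delivery planning: text -> intent -> prosody.
# All names and values are illustrative, not a real TTS system.
from dataclasses import dataclass

@dataclass
class AcousticFeatures:
    pitch: float     # relative pitch multiplier
    pace: float      # relative speaking rate
    emphasis: float  # stress level, 0..1

def analyze_intent(text: str) -> str:
    """Crude stand-in for semantic analysis: classify the utterance type."""
    if text.endswith("?"):
        return "question"
    if text.endswith("!"):
        return "exclamation"
    return "statement"

def plan_delivery(intent: str) -> AcousticFeatures:
    """Map intent to prosody, mirroring the 'adjusts delivery' idea."""
    table = {
        "question":    AcousticFeatures(pitch=1.15, pace=1.0, emphasis=0.6),
        "exclamation": AcousticFeatures(pitch=1.10, pace=1.1, emphasis=0.9),
        "statement":   AcousticFeatures(pitch=1.00, pace=1.0, emphasis=0.4),
    }
    return table[intent]

def synthesize(text: str) -> AcousticFeatures:
    """Text in, prosody plan out; a real system would emit a waveform."""
    return plan_delivery(analyze_intent(text))

print(synthesize("Is anyone there?"))  # question -> raised pitch
```

A production model learns these mappings from data rather than using a lookup table, but the conceptual flow, semantics first, acoustics second, is the same.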
Business Model and Market Position
ElevenLabs has positioned itself as a research company focused on making audio accessible across languages and voices. The company offers both consumer and enterprise solutions, with pricing tiers that scale based on usage volume and feature access.
Staniszewski revealed that the company's revenue model is built around three main pillars:
- API access for developers and businesses integrating voice capabilities into their applications
- Direct-to-consumer tools for content creators and individual users
- Enterprise licensing for large-scale deployments
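For the first pillar, the typical integration pattern is a simple HTTP call from the developer's application to a text-to-speech endpoint. The sketch below uses only the standard library; the base URL, header names, and payload fields are assumptions for illustration, not ElevenLabs' documented API, so consult the official docs for the real interface.

```python
# Hedged sketch of a TTS API integration: build (but don't send) a request.
# Endpoint, auth header, and payload shape are illustrative assumptions.
import json
import urllib.request

def build_tts_request(api_key: str, voice_id: str, text: str,
                      base_url: str = "https://api.example.com/v1"
                      ) -> urllib.request.Request:
    """Construct a POST request asking the service to speak `text`."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/text-to-speech/{voice_id}",
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

req = build_tts_request("MY_KEY", "narrator-1", "Hello, world.")
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would return audio bytes to play or save, which is the usual shape of such integrations.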
The company has seen particular success in the audiobook and gaming industries, where high-quality voice synthesis can dramatically reduce production costs while maintaining or improving quality.
The Conversational Turing Test
One of the more intriguing topics discussed was the concept of a "conversational Turing Test," a modern evolution of Alan Turing's original test for machine intelligence. Staniszewski suggested that as voice AI becomes more sophisticated, the ability to engage in natural, context-aware conversation may become the new benchmark for AI capability.
"We're moving beyond just generating speech to creating systems that can truly converse," he explained. "The question isn't just whether a machine can sound human, but whether it can understand and respond appropriately in real-time dialogue."
This shift has significant implications for applications ranging from customer service to personal assistants, where the quality of interaction matters as much as the accuracy of responses.
Voice Agents and the Future
Looking ahead, Staniszewski sees voice agents becoming increasingly central to human-computer interaction. He predicts that within the next few years, voice will become the primary interface for many applications, particularly as models improve in their ability to handle complex, multi-turn conversations.
The company is already working on what Staniszewski calls "expressive mode," a technology that adds emotional nuance to synthetic speech, making it sound more natural and engaging. This technology is available in over 70 languages with ultra-low latency, positioning ElevenLabs at the forefront of voice AI development.
Competitive Landscape
When asked about competition from tech giants like Google, Amazon, and Microsoft, Staniszewski acknowledged the challenge but emphasized ElevenLabs' focus on research and quality. "We're not trying to build the broadest platform," he said. "We're trying to build the best voice technology."
The company's approach appears to be paying off, with adoption growing across industries that require high-quality voice synthesis. However, the space remains highly competitive, with major players investing heavily in their own voice AI capabilities.
Technical Challenges
Staniszewski was candid about the technical hurdles still facing voice AI. These include:
- Latency: Reducing the time between input and output to make conversations feel natural
- Context retention: Maintaining conversational context across long interactions
- Emotional range: Expanding the spectrum of emotions that can be expressed naturally
- Multilingual consistency: Ensuring quality across all supported languages
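The latency challenge is commonly quantified as "time to first audio": the delay between sending text and receiving the first audio chunk back. The sketch below simulates a streaming synthesizer with a toy generator so the measurement pattern is runnable without any real service; the chunk sizes and delays are made up.

```python
# Measuring time-to-first-audio against a simulated streaming TTS call.
# The generator is a stand-in; delays and chunk sizes are illustrative.
import time

def fake_streaming_tts(text: str, chunk_delay_s: float = 0.005):
    """Stand-in for a streaming TTS call: yields audio chunks with delays."""
    for _word in text.split():
        time.sleep(chunk_delay_s)  # pretend synthesis work per chunk
        yield b"\x00" * 320        # 10 ms of fake 16 kHz 16-bit audio

def time_to_first_audio(stream) -> float:
    """Seconds elapsed before the first chunk arrives."""
    start = time.perf_counter()
    next(stream)                   # block until the first chunk
    return time.perf_counter() - start

ttfa = time_to_first_audio(fake_streaming_tts("hello there voice agents"))
print(f"time to first audio: {ttfa * 1000:.1f} ms")
```

For conversation to feel natural, systems aim to keep this number in the low hundreds of milliseconds, which is why streaming the audio rather than waiting for the full utterance matters.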
Despite these challenges, Staniszewski expressed confidence that continued advances in model architecture and training techniques would address many of these issues in the coming years.
The Broader AI Landscape
The interview also touched on the broader AI landscape, with Staniszewski noting that voice AI represents just one piece of a larger puzzle. He sees significant potential in combining voice technology with other AI capabilities, such as computer vision and reasoning, to create more comprehensive AI systems.
"The future isn't just about better voices," he said. "It's about creating AI that can understand and interact with the world in more human-like ways."
What This Means for the Industry
ElevenLabs' progress and vision highlight several key trends in the AI industry:
- Specialization pays off: By focusing on voice AI specifically, ElevenLabs has been able to achieve technical advantages over more generalized approaches
- Quality over quantity: The company's emphasis on natural-sounding speech demonstrates that users value quality interactions over basic functionality
- Enterprise adoption is accelerating: As voice AI improves, more businesses are finding practical applications for the technology
- The conversational interface is maturing: Advances in voice AI are making conversational interfaces more viable for a wider range of applications
As the AI industry continues to evolve, companies like ElevenLabs that combine technical excellence with clear business strategy are likely to play an increasingly important role in shaping how humans interact with machines.
Listen to the full interview on Cheeky Pint
