In a wide-ranging Q&A, ElevenLabs co-founder Mati Staniszewski discusses how audio models work, the company's business strategy, the conversational Turing Test, and the evolution of voice agents in the AI landscape.
In a recent interview with Stripe co-founder John Collison on the Cheeky Pint podcast, ElevenLabs co-founder Mati Staniszewski provided an in-depth look at the company's approach to voice AI technology, its business model, and the broader implications for conversational AI systems.
How Audio Models Actually Work
Staniszewski explained that ElevenLabs' audio models operate through a sophisticated pipeline that begins with understanding the semantic meaning of text and translating that into acoustic features. The company's models are trained on vast datasets of human speech across multiple languages and voices, learning the subtle nuances that make speech sound natural.
The key innovation, according to Staniszewski, is the model's ability to maintain consistency across long-form content while adapting to different speaking styles and emotional contexts. Unlike earlier text-to-speech systems that sounded robotic or monotonous, ElevenLabs' models can vary pitch, pace, and emphasis based on the content being read.
"The model doesn't just read words," Staniszewski said. "It understands the intent behind them and adjusts delivery accordingly."
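The idea that delivery is adjusted to intent, not just read off the words, can be illustrated with a toy sketch. This is not ElevenLabs' actual architecture; the classes, intent categories, and prosody numbers below are illustrative stand-ins for the text-to-prosody planning stage a real system would feed into a waveform model.

```python
# Toy sketch of intent-aware delivery planning: text -> intent -> prosody.
# All names and values are illustrative, not a real TTS system.
from dataclasses import dataclass

@dataclass
class AcousticFeatures:
    pitch: float     # relative pitch multiplier
    pace: float      # relative speaking rate
    emphasis: float  # stress level, 0..1

def analyze_intent(text: str) -> str:
    """Crude stand-in for semantic analysis: classify the utterance type."""
    if text.endswith("?"):
        return "question"
    if text.endswith("!"):
        return "exclamation"
    return "statement"

def plan_delivery(intent: str) -> AcousticFeatures:
    """Map intent to prosody, mirroring the 'adjusts delivery' idea."""
    table = {
        "question":    AcousticFeatures(pitch=1.15, pace=1.0, emphasis=0.6),
        "exclamation": AcousticFeatures(pitch=1.10, pace=1.1, emphasis=0.9),
        "statement":   AcousticFeatures(pitch=1.00, pace=1.0, emphasis=0.4),
    }
    return table[intent]

def synthesize(text: str) -> AcousticFeatures:
    """Text in, prosody plan out; a real system would emit a waveform."""
    return plan_delivery(analyze_intent(text))

print(synthesize("Is anyone there?"))  # question -> raised pitch
```

A production model learns these mappings from data rather than using a lookup table, but the conceptual flow, semantics first, acoustics second, is the same.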
Business Model and Market Position
ElevenLabs has positioned itself as a research company focused on making audio accessible across languages and voices. The company offers both consumer and enterprise solutions, with pricing tiers that scale based on usage volume and feature access.
Staniszewski revealed that the company's revenue model is built around three main pillars:
- API access for developers and businesses integrating voice capabilities into their applications
- Direct-to-consumer tools for content creators and individual users
- Enterprise licensing for large-scale deployments
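For the first pillar, the typical integration pattern is a simple HTTP call from the developer's application to a text-to-speech endpoint. The sketch below uses only the standard library; the base URL, header names, and payload fields are assumptions for illustration, not ElevenLabs' documented API, so consult the official docs for the real interface.

```python
# Hedged sketch of a TTS API integration: build (but don't send) a request.
# Endpoint, auth header, and payload shape are illustrative assumptions.
import json
import urllib.request

def build_tts_request(api_key: str, voice_id: str, text: str,
                      base_url: str = "https://api.example.com/v1"
                      ) -> urllib.request.Request:
    """Construct a POST request asking the service to speak `text`."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/text-to-speech/{voice_id}",
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

req = build_tts_request("MY_KEY", "narrator-1", "Hello, world.")
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would return audio bytes to play or save, which is the usual shape of such integrations.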
The company has seen particular success in the audiobook and gaming industries, where high-quality voice synthesis can dramatically reduce production costs while maintaining or improving quality.
The Conversational Turing Test
One of the more intriguing topics discussed was the concept of a "conversational Turing Test," a modern evolution of Alan Turing's original test for machine intelligence. Staniszewski suggested that as voice AI becomes more sophisticated, the ability to engage in natural, context-aware conversation may become the new benchmark for AI capability.
"We're moving beyond just generating speech to creating systems that can truly converse," he explained. "The question isn't just whether a machine can sound human, but whether it can understand and respond appropriately in real-time dialogue."
This shift has significant implications for applications ranging from customer service to personal assistants, where the quality of interaction matters as much as the accuracy of responses.
Voice Agents and the Future
Looking ahead, Staniszewski sees voice agents becoming increasingly central to human-computer interaction. He predicts that within the next few years, voice will become the primary interface for many applications, particularly as models improve in their ability to handle complex, multi-turn conversations.
The company is already working on what Staniszewski calls "expressive mode," a technology that adds emotional nuance to synthetic speech, making it sound more natural and engaging. This technology is available in over 70 languages with ultra-low latency, positioning ElevenLabs at the forefront of voice AI development.
Competitive Landscape
When asked about competition from tech giants like Google, Amazon, and Microsoft, Staniszewski acknowledged the challenge but emphasized ElevenLabs' focus on research and quality. "We're not trying to build the broadest platform," he said. "We're trying to build the best voice technology."
The company's approach appears to be paying off, with adoption growing across industries that require high-quality voice synthesis. However, the space remains highly competitive, with major players investing heavily in their own voice AI capabilities.
Technical Challenges
Staniszewski was candid about the technical hurdles still facing voice AI. These include:
- Latency: Reducing the time between input and output to make conversations feel natural
- Context retention: Maintaining conversational context across long interactions
- Emotional range: Expanding the spectrum of emotions that can be expressed naturally
- Multilingual consistency: Ensuring quality across all supported languages
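The latency challenge is commonly quantified as "time to first audio": the delay between sending text and receiving the first audio chunk back. The sketch below simulates a streaming synthesizer with a toy generator so the measurement pattern is runnable without any real service; the chunk sizes and delays are made up.

```python
# Measuring time-to-first-audio against a simulated streaming TTS call.
# The generator is a stand-in; delays and chunk sizes are illustrative.
import time

def fake_streaming_tts(text: str, chunk_delay_s: float = 0.005):
    """Stand-in for a streaming TTS call: yields audio chunks with delays."""
    for _word in text.split():
        time.sleep(chunk_delay_s)  # pretend synthesis work per chunk
        yield b"\x00" * 320        # 10 ms of fake 16 kHz 16-bit audio

def time_to_first_audio(stream) -> float:
    """Seconds elapsed before the first chunk arrives."""
    start = time.perf_counter()
    next(stream)                   # block until the first chunk
    return time.perf_counter() - start

ttfa = time_to_first_audio(fake_streaming_tts("hello there voice agents"))
print(f"time to first audio: {ttfa * 1000:.1f} ms")
```

For conversation to feel natural, systems aim to keep this number in the low hundreds of milliseconds, which is why streaming the audio rather than waiting for the full utterance matters.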
Despite these challenges, Staniszewski expressed confidence that continued advances in model architecture and training techniques would address many of these issues in the coming years.
The Broader AI Landscape
The interview also touched on the broader AI landscape, with Staniszewski noting that voice AI represents just one piece of a larger puzzle. He sees significant potential in combining voice technology with other AI capabilities, such as computer vision and reasoning, to create more comprehensive AI systems.
"The future isn't just about better voices," he said. "It's about creating AI that can understand and interact with the world in more human-like ways."
What This Means for the Industry
ElevenLabs' progress and vision highlight several key trends in the AI industry:
- Specialization pays off: By focusing on voice AI specifically, ElevenLabs has been able to achieve technical advantages over more generalized approaches
- Quality over quantity: The company's emphasis on natural-sounding speech demonstrates that users value quality interactions over basic functionality
- Enterprise adoption is accelerating: As voice AI improves, more businesses are finding practical applications for the technology
- The conversational interface is maturing: Advances in voice AI are making conversational interfaces more viable for a wider range of applications
As the AI industry continues to evolve, companies like ElevenLabs that combine technical excellence with clear business strategy are likely to play an increasingly important role in shaping how humans interact with machines.
Listen to the full interview on Cheeky Pint
