While AI voices have revolutionized speech synthesis for the sighted world, blind users who rely on screen readers are stuck with aging, unmaintained technology. A deep dive into the technical and usability barriers preventing modern AI TTS from serving the accessibility community.
The text-to-speech (TTS) revolution that has transformed smartphones, smart speakers, and navigation systems has largely bypassed the community that needs it most: blind screen reader users. For over three decades, the voices used by the majority of blind people in the English-speaking Western world have remained fundamentally unchanged, creating a growing technological chasm between mainstream TTS advancement and accessibility needs.
The core of this divide lies in fundamentally different requirements. While sighted users overwhelmingly prefer natural, conversational voices that sound human, blind users who depend on TTS for information consumption prioritize different characteristics: speed, clarity, predictability, and efficiency. The result is a preference for voices that may sound somewhat robotic but remain intelligible at 800 to 900 words per minute, roughly four to six times the pace of typical conversational speech (around 150 to 200 WPM).
The Eloquence Problem
The dominant voice in this space is called Eloquence, a 32-bit voice last updated in 2003. Its popularity is so overwhelming that Apple eventually added it to iPhone, Mac, Apple TV, and Apple Watch, but only through an emulation layer. The voice's source code remains lost, and even large companies like Apple haven't been able to recompile it for modern systems.
As the NVDA screen reader transitions from 32-bit to 64-bit architecture, maintaining Eloquence compatibility has become increasingly complex. Community developers have spent countless hours creating workarounds, but these stopgap solutions are fundamentally unsustainable. The Eloquence libraries contain known security vulnerabilities that cannot be patched, forcing users to accept the risks or abandon the voice entirely.
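To make the shape of those workarounds concrete, here is a minimal sketch of the common out-of-process pattern: the 64-bit screen reader drives a separate 32-bit host process that actually loads the legacy library. The helper executable and the length-prefixed framing are hypothetical, invented for illustration; real bridges are considerably more involved.

```python
import struct
import subprocess

# Hypothetical 32-bit helper that loads the legacy synthesizer DLL and
# writes raw PCM back over stdout. Name and protocol are illustrative.
HELPER_EXE = "eloquence_host32.exe"


class Bridge32:
    """Drives a 32-bit TTS host from a 64-bit process over pipes."""

    def __init__(self) -> None:
        self.proc = subprocess.Popen(
            [HELPER_EXE], stdin=subprocess.PIPE, stdout=subprocess.PIPE
        )

    def speak(self, text: str) -> bytes:
        payload = text.encode("utf-8")
        # Length-prefixed frame: 4-byte little-endian size, then UTF-8 text.
        self.proc.stdin.write(struct.pack("<I", len(payload)) + payload)
        self.proc.stdin.flush()
        # The helper replies with a 4-byte PCM length, then the samples.
        (size,) = struct.unpack("<I", self.proc.stdout.read(4))
        return self.proc.stdout.read(size)
```

Every utterance now pays inter-process overhead, and a crash in the 32-bit host silences the voice entirely, which is exactly why these stopgaps are unsustainable.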
The International Language Gap
For users who speak languages other than English, the situation is even more dire. Modern TTS systems designed for sighted users often produce voices that are inefficient, overly conversational, and slow for screen reader use. While espeak-ng attempts to address this by supporting hundreds of languages, it carries its own baggage:
- Questionable Language Support: Many language implementations were added based on pronunciation rules extracted from Wikipedia articles without consulting native speakers.
- Legacy Architecture: The system is based directly on Speak, a TTS engine written in 1995 for Acorn's RISC OS computers. Modern users inherit design decisions made for an operating system that has all but disappeared.
- Maintenance Concerns: The espeak-ng repository shows only one or two active maintainers—better than Eloquence's zero, but still precarious for critical accessibility software.
Testing Modern AI TTS for Screen Readers
Over the holiday break, I evaluated two modern AI-based TTS systems—Supertonic and Kitten TTS—to determine if they could be integrated into NVDA. Both advertised themselves as fast, GPU-free, and responsive. However, testing revealed four fundamental issues that make current AI TTS unsuitable for screen reader use.
1. Dependency Bloat
Bundling these systems as NVDA addons requires including a vast number of Python packages. Kitten TTS needs approximately 103 dependencies, while Supertonic requires just over 30. The standard NVDA addon building system doesn't support automatic dependency management, forcing developers to manually copy and include these packages in repositories. Loading these dependencies directly into NVDA causes:
- Slower screen reader startup
- Increased system resource usage
- Security exposure from unpatched vulnerabilities in any of the bundled libraries
For screen readers that require system-wide access, this dependency bloat presents significant security and performance concerns.
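For context, "bundling" here means vendoring: the addon ships the packages inside its own folder and prepends that folder to the import path when it loads. A minimal sketch of the pattern, assuming a deps folder inside the addon (the folder name and the specific packages are my own examples):

```python
import os
import sys

# Point Python at the packages vendored inside the addon's folder.
ADDON_DIR = os.path.dirname(os.path.abspath(__file__))
DEPS_DIR = os.path.join(ADDON_DIR, "deps")
if DEPS_DIR not in sys.path:
    sys.path.insert(0, DEPS_DIR)

# Heavy imports now resolve to the bundled copies and are loaded inside
# the screen reader's own process, with all the cost that implies.
import numpy  # noqa: E402
import onnxruntime  # noqa: E402
```

Every vendored copy is frozen at whatever version the addon shipped with, so a security fix in any of those 30 or 103 packages reaches users only when the addon author cuts a new release.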
2. Accuracy Deficits
Modern AI TTS systems are optimized for natural-sounding speech, often at the expense of accuracy. In testing, both Supertonic and Kitten TTS exhibited:
- Word skipping
- Incorrect number pronunciation
- Truncated short utterances
- Ignored prosody cues from punctuation
Kitten TTS performed slightly better by using a deterministic phonemizer (the same rule-based frontend espeak uses) to decide pronunciation, leaving only audio generation to the neural model. Even this hybrid approach, however, falls short of the precision screen reader users need, where a single mispronounced or skipped word can mean critical information loss.
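The split is easy to see in code. Below is a minimal sketch using the phonemizer Python package, which wraps the same espeak frontend: the text-to-phoneme step is deterministic and inspectable, and only the phoneme-to-audio step would be handed to a neural model (shown here as a placeholder). It assumes phonemizer and an espeak backend are installed.

```python
from phonemizer import phonemize

text = "Read 1,024 lines by 9:45 AM."

# Deterministic step: espeak's rule-based frontend maps text to IPA.
# The same input always produces the same phoneme string.
phonemes = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phonemes)

# Non-deterministic step (placeholder): a neural model renders audio
# from the phonemes. Errors here distort sound, not word identity.
# audio = neural_acoustic_model(phonemes)
```

Fixing word identity before the network ever runs is why Kitten TTS mispronounced less than fully end-to-end models in my testing, but it does nothing about failures downstream of the phonemes.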
3. Speed Limitations
Screen reader users need TTS systems that can begin generating speech immediately, not after processing entire text chunks. Traditional TTS engines like Eloquence can start speaking within milliseconds of receiving text. In contrast:
- Supertonic: Can stream audio as it becomes available, but still requires initial processing time
- Kitten TTS: Cannot begin speaking until the entire audio chunk is generated
Both systems also top out at speaking rates far below the 800-900 WPM benchmark that power users rely on. The startup latency is especially problematic because screen reader users constantly jump through text and interrupt speech, which demands the ability to discard and restart generation instantly.
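A back-of-the-envelope calculation shows how tight the latency budget actually is. The 300 ms figure below is illustrative, not a measurement of either engine:

```python
# Time-to-first-audio budget at screen reader listening speeds.
wpm = 900                      # power-user listening rate
words_per_second = wpm / 60    # 15 words per second
ms_per_word = 1000 / words_per_second
print(f"{ms_per_word:.0f} ms per word")  # ~67 ms

# Arrowing through a list triggers a fresh utterance on every keystroke.
# If first audio takes 300 ms, the interface feels 4-5 words "behind".
first_audio_ms = 300
print(f"perceived lag: ~{first_audio_ms / ms_per_word:.1f} words")
```

At those rates an engine has well under 100 ms to produce its first sample, which is the regime traditional formant synthesizers were built for.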
4. Lack of Real-Time Control
Traditional TTS systems offer extensive parameter controls: pitch, speed, volume, breathiness, roughness, head size, and more. These allow users to customize voices to exact needs and adjust characteristics in real-time based on text formatting or content type.
AI TTS models, trained on recordings of specific speakers, inherit fixed characteristics from that training data. While Supertonic and Kitten TTS offer basic speed control, its effect varies widely between voices and even between utterances. This is a significant loss of functionality that many blind users depend on for efficient information consumption.
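For comparison, NVDA's driver model assumes this kind of control exists. Here is a stripped-down sketch based on NVDA's public SynthDriver interface; the driver name is invented, the speak and cancel bodies are stubbed, and a working driver would also implement the _get_/_set_ property pairs behind each setting:

```python
from synthDriverHandler import SynthDriver as _BaseSynthDriver


class SynthDriver(_BaseSynthDriver):
    """Skeleton NVDA synth driver declaring the expected live controls."""

    name = "exampleSynth"  # hypothetical driver name
    description = "Example synthesizer"

    # NVDA expects each of these to be adjustable mid-session from the
    # keyboard, taking effect on the very next utterance.
    supportedSettings = (
        _BaseSynthDriver.RateSetting(),
        _BaseSynthDriver.PitchSetting(),
        _BaseSynthDriver.InflectionSetting(),
        _BaseSynthDriver.VolumeSetting(),
    )

    @classmethod
    def check(cls) -> bool:
        return True  # report whether the engine is available

    def speak(self, speechSequence):
        pass  # hand the sequence to the engine

    def cancel(self):
        pass  # discard all queued audio immediately
```

A model whose voice is baked into a fixed speaker embedding simply has nothing to wire most of these knobs to.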
The Path Forward
The issues identified with Supertonic and Kitten TTS aren't unique to those systems. Other models, such as Kokoro, exhibit the same problems, often more severely. They stem from a fundamental architectural mismatch: AI TTS is optimized for naturalness, while screen reader TTS is optimized for efficiency.
Several potential solutions exist, each with significant challenges:
Open-Source Eloquence Implementation: The ideal solution would be reimplementing Eloquence as a set of open-source libraries. Doing so would require expertise in linguistics, digital signal processing, audiology, and programming, a combination rarely found in one team, and would likely cost several million dollars.
Blastbay Studios' Approach: This company has created TTS voices using modern technology while attempting to meet blind users' needs. However, it remains a closed-source product with a single maintainer and still suffers from pronunciation accuracy issues.
AI-Assisted Development: Perhaps future AI systems could be prompted to create TTS systems meeting accessibility standards, though this remains speculative.
Community Mobilization: Articles like this one aim to raise awareness and bring the accessibility community together to recognize the problems and develop solutions.
Current Reality
For now, screen reader users face an uncomfortable reality: the voices that work best for their needs are becoming increasingly untenable, while modern alternatives don't meet their requirements. The community will likely have to settle for "good enough" solutions that are nowhere near as fast and efficient as current systems.
Personally, I'll continue maintaining Eloquence compatibility for as long as possible, until the layers of emulation and bridges make real-time use impossible. Until then, the accessibility community remains in a holding pattern: dependent on aging technology while waiting for the TTS field to recognize and address its unique needs.
The gap between mainstream AI TTS advancement and accessibility requirements continues to widen. Bridging it will require not just technical innovation, but a fundamental shift in how TTS developers prioritize and understand the needs of blind users.
