Voice interfaces demand a fundamental shift in how we think about content design, moving from spatial, visual layouts to temporal, conversational flows that honor the primacy of spoken language.

"Conversation is not a new interface. It's the oldest interface." —Erika Hall, Conversational Design
We've been having conversations for thousands of years. Whether to convey information, conduct transactions, or simply to check in on one another, people have yammered away, chattering and gesticulating, through spoken conversation for countless generations. Only in the last few millennia have we begun to commit our conversations to writing, and only in the last few decades have we begun to outsource them to the computer, a machine that shows much more affinity for written correspondence than for the slangy vagaries of spoken language.
The Messiness of Speech vs. the Cleanliness of Text
Computers struggle with voice because speech is more primordial than writing. To have successful conversations with us, machines must grapple with the messiness of human speech: the disfluencies and pauses, the gestures and body language, and the variations in word choice and spoken dialect that can stymie even the most carefully crafted human-computer interaction. In human-to-human scenarios, spoken language has the privilege of face-to-face contact, where we can readily interpret nonverbal social cues.
In contrast, written language immediately concretizes as we commit it to record and retains usages long after they become obsolete in spoken communication (the salutation "To whom it may concern," for example), generating its own fossil record of outdated terms and phrases. Because it tends to be more consistent, polished, and formal, written text is fundamentally much easier for machines to parse and understand. Spoken language has no such luxury.
Besides the nonverbal cues that decorate conversations with emphasis and emotional context, there are also verbal cues and vocal behaviors that modulate conversation in nuanced ways: how something is said, not what. Whether rapid-fire, low-pitched, or high-decibel, whether sarcastic, stilted, or sighing, our spoken language conveys much more than the written word could ever muster.
Three Conversational Modes
According to Michael McTear, Zoraida Callejas, and David Griol in The Conversational Interface, the motivations for interacting with voice interfaces largely mirror the reasons we initiate conversations with other people. Generally, we start up a conversation because:
- We need something done (such as a transaction)
- We want to know something (information of some sort)
- We are social beings and want someone to talk to (conversation for conversation's sake)
These three categories—transactional, informational, and prosocial—characterize essentially every voice interaction: a single conversation from beginning to end that realizes some outcome for the user, starting with the voice interface's first greeting and ending with the user exiting the interface.
A conversation in our human sense—a chat between people that leads to some result and lasts an arbitrary length of time—could encompass multiple transactional, informational, and prosocial voice interactions in succession. In other words, a voice interaction is a conversation, but a conversation is not necessarily a single voice interaction.
The Prosocial Problem
Purely prosocial conversations are more gimmicky than captivating in most voice interfaces, because machines don't yet have the capacity to really want to know how we're doing and to do the sort of glad-handing humans crave. There's also ongoing debate as to whether users actually prefer the sort of organic human conversation that begins with a prosocial voice interaction and shifts seamlessly into other types.
In Voice User Interface Design, Michael Cohen, James Giangola, and Jennifer Balogh recommend sticking to users' expectations by mimicking how they interact with other voice interfaces rather than trying too hard to be human—potentially alienating them in the process. This leaves us with two viable conversational genres: transactional and informational.
Transactional Voice Interactions
Transactional conversations realize some outcome. They're direct, to the point, and economical, quickly dispensing with pleasantries. Each progressive disclosure reveals more and more of the desired outcome: a service rendered or a product delivered.
Consider this pizza ordering dialogue:
Alison: Hey, how's it going?
Burhan: Hi, welcome to Crust Deluxe! It's cold out there. How can I help you?
Alison: Can I get a Hawaiian pizza with extra pineapple?
Burhan: Sure, what size?
Alison: Large.
Burhan: Anything else?
Alison: No thanks, that's it.
Burhan: Something to drink?
Alison: I'll have a bottle of Coke.
Burhan: You got it. That'll be $13.55 and about fifteen minutes.
The conversation moves efficiently toward its goal without unnecessary detours.
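To make the shape of that exchange concrete, here is a minimal sketch of how a transactional flow like this might be modeled as slot filling, where each turn fills in one more piece of the order. The intent and slot names are illustrative and aren't tied to any particular voice platform.

```typescript
// A minimal sketch of the pizza order above as a slot-filling flow.
// Intent and slot names are hypothetical, not a real platform's schema.

interface OrderPizzaIntent {
  name: "OrderPizza";
  slots: {
    pizzaType?: string;                    // "Hawaiian"
    size?: "small" | "medium" | "large";   // "Large"
    drink?: string;                        // "bottle of Coke"
  };
}

// Prompts the interface uses to fill whichever slot is still missing.
const slotPrompts: Record<string, string> = {
  pizzaType: "What kind of pizza would you like?",
  size: "Sure, what size?",
  drink: "Something to drink?",
};

// Returns the next prompt, or a confirmation once every slot is filled.
function nextTurn(intent: OrderPizzaIntent): string {
  for (const slot of ["pizzaType", "size", "drink"] as const) {
    if (intent.slots[slot] === undefined) {
      return slotPrompts[slot];
    }
  }
  return "You got it. That'll be $13.55 and about fifteen minutes.";
}
```

Each user utterance fills a slot; each system turn either asks for the next missing slot or confirms the completed transaction.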
Informational Voice Interactions
Informational conversations are investigative quests for the truth—research expeditions to gather data, news, or facts. They might be more long-winded than transactional conversations by necessity, with responses that are lengthier, more informative, and carefully communicated so the customer understands the key takeaways.
Here's the same pizza shop, but for information gathering:
Alison: Hey, how's it going?
Burhan: Hi, welcome to Crust Deluxe! It's cold out there. How can I help you?
Alison: Can I ask a few questions?
Burhan: Of course! Go right ahead.
Alison: Do you have any halal options on the menu?
Burhan: Absolutely! We can make any pie halal by request. We also have lots of vegetarian, ovo-lacto, and vegan options. Are you thinking about any other dietary restrictions?
Alison: What about gluten-free pizzas?
Burhan: We can definitely do a gluten-free crust for you, no problem, for both our deep-dish and thin-crust pizzas. Anything else I can answer for you?
Alison: That's it for now. Good to know. Thanks!
Burhan: Anytime, come back soon!
The goal here is fact-finding, not action completion.
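The same sketch-level modeling works for informational exchanges: each question maps to a short, self-contained answer, and the system closes every turn by inviting a follow-up. The intent names and response copy below are illustrative only.

```typescript
// A sketch of the informational exchange above: question intents mapped to
// concise spoken answers. Names and wording are illustrative.

type QuestionIntent = "HalalOptions" | "GlutenFreeOptions";

const answers: Record<QuestionIntent, string> = {
  HalalOptions:
    "Absolutely! We can make any pie halal by request. We also have lots of " +
    "vegetarian, ovo-lacto, and vegan options.",
  GlutenFreeOptions:
    "We can definitely do a gluten-free crust for both our deep-dish and " +
    "thin-crust pizzas.",
};

// Informational turns answer the question, then keep the floor open.
function answer(intent: QuestionIntent): string {
  return `${answers[intent]} Anything else I can answer for you?`;
}
```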
Voice Interface Evolution
Interactive Voice Response (IVR) Systems
IVR systems emerged in the early 1990s as the first true voice interfaces that engaged in authentic conversation. Intended as an alternative to overburdened customer service representatives, they allowed organizations to reduce their reliance on call centers but soon became notorious for their clunkiness. These systems were primarily designed as metaphorical switchboards to guide customers to a real phone agent ("Say Reservations to book a flight or check an itinerary").
Despite their functional issues and users' frustration with their inability to speak to an actual human right away, IVR systems proliferated across a variety of industries. They're great for highly repetitive, monotonous conversations that generally don't veer from a single format, but they have a reputation for less scintillating conversation than we're used to in real life.
Screen Readers
Parallel to IVR evolution was the invention of the screen reader, a tool that transcribes visual content into synthesized speech. For blind or visually impaired website users, it's the predominant method of interacting with text, multimedia, or form elements. Screen readers represent perhaps the closest equivalent we have today to an out-of-the-box implementation of content delivered through voice.
The first screen reader known by that moniker was developed for the BBC Micro and NEC Portable by the Research Centre for the Education of the Visually Handicapped at the University of Birmingham in 1986. That same year, Jim Thatcher created the first IBM Screen Reader for text-based computers, later recreated for computers with graphical user interfaces.
With the rapid growth of the web in the 1990s, the demand for accessible tools exploded. Thanks to the introduction of semantic HTML and especially ARIA roles beginning in 2008, screen readers started facilitating speedy interactions with web pages that ostensibly allow disabled users to traverse the page as an aural and temporal space rather than a visual and physical one.
As Aaron Gustafson writes in A List Apart, screen readers "provide mechanisms that translate visual design constructs—proximity, proportion, etc.—into useful information. At least they do when documents are authored thoughtfully."
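As a loose illustration of what "authored thoughtfully" can mean in practice, the sketch below builds semantic landmarks and an ARIA label, the kind of structure a screen reader can traverse as an aural space rather than a visual one. The element choices and label text are illustrative.

```typescript
// A minimal sketch of thoughtfully authored structure for screen readers:
// semantic landmark elements plus an ARIA label. Content is illustrative.

const nav = document.createElement("nav");
nav.setAttribute("aria-label", "Primary"); // typically announced with the navigation landmark

const main = document.createElement("main"); // a landmark users can jump to directly

const heading = document.createElement("h1");
heading.textContent = "Today's weather forecast";
main.append(heading);

document.body.append(nav, main);
```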
The Screen Reader Problem
There's one significant problem with screen readers: they're difficult to use and unremittingly verbose. The visual structures of websites and web navigation don't translate well to screen readers, sometimes resulting in unwieldy pronouncements that name every manipulable HTML element and announce every formatting change. For many screen reader users, working with web-based interfaces exacts a cognitive toll.
Accessibility advocate and voice engineer Chris Maury considers why the screen reader experience is ill-suited to users relying on voice:
From the beginning, I hated the way that Screen Readers work. Why are they designed the way they are? It makes no sense to present information visually and then, and only then, translate that into audio. All of the time and energy that goes into creating the perfect user experience for an app is wasted, or even worse, adversely impacting the experience for blind users.
In many cases, well-designed voice interfaces can speed users to their destination better than long-winded screen reader monologues. Visual interface users have the benefit of darting around the viewport freely to find information, ignoring areas irrelevant to them. Blind users, meanwhile, are obligated to listen to every utterance synthesized into speech and therefore prize brevity and efficiency.

Voice Assistants
Voice assistants are akin to personal concierges that can answer questions, schedule appointments, conduct searches, and perform other common day-to-day tasks. They're rapidly gaining more attention from accessibility advocates for their assistive potential.
The vision predates reality. In 1987, Apple published a demonstration video depicting the Knowledge Navigator, a voice assistant that could transcribe spoken words and recognize human speech to a great degree of accuracy. Then, in 2001, Tim Berners-Lee and others formulated their vision for a Semantic Web "agent" that would perform typical errands like "checking calendars, making appointments, and finding locations."
It wasn't until 2011 that Apple's Siri finally entered the picture, making voice assistants a tangible reality for consumers.
Programmability Spectrum
There's considerable variation in how programmable and customizable certain voice assistants are:
- Locked down: Apple's Siri and Microsoft's Cortana can't be extended beyond their existing capabilities. It isn't possible to program Siri to perform arbitrary functions, because there's no means by which developers can interact with Siri at a low level, apart from predefined categories of tasks.
- Programmable: Amazon Alexa and Google Home offer a core foundation on which developers can build custom voice interfaces. Amazon offers the Alexa Skills Kit, while Google Home offers the ability to program arbitrary Google Assistant skills. Users can choose from among thousands of custom-built skills within both ecosystems.
Many development platforms like Google's Dialogflow have introduced omnichannel capabilities so users can build a single conversational interface that then manifests as a voice interface, textual chatbot, and IVR system upon deployment.
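To give a rough sense of what building on that foundation looks like, here is a minimal sketch of a custom Alexa handler using the Alexa Skills Kit SDK for Node.js (ask-sdk-core). The intent name and spoken response are placeholders; the handler shape follows the SDK's canHandle/handle pattern.

```typescript
// A minimal sketch of a custom Alexa skill handler (ask-sdk-core).
// "WeatherForecastIntent" and the response copy are illustrative.

import * as Alexa from "ask-sdk-core";

const WeatherIntentHandler: Alexa.RequestHandler = {
  canHandle(handlerInput) {
    return (
      Alexa.getRequestType(handlerInput.requestEnvelope) === "IntentRequest" &&
      Alexa.getIntentName(handlerInput.requestEnvelope) === "WeatherForecastIntent"
    );
  },
  handle(handlerInput) {
    // Voice content: a short, self-contained spoken response.
    return handlerInput.responseBuilder
      .speak("Today will be sunny with a high of twenty-two degrees.")
      .getResponse();
  },
};

export const handler = Alexa.SkillBuilders.custom()
  .addRequestHandlers(WeatherIntentHandler)
  .lambda();
```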
Voice Content: The New Microcontent
Voice content is content delivered through voice. To preserve what makes human conversation so compelling, voice content needs to be free-flowing and organic, contextless and concise—everything written content isn't.
Our world is replete with voice content: screen readers reciting website content, voice assistants rattling off a weather forecast, and automated phone hotline responses governed by IVR systems.
The Macrocontent Problem
Websites are colossal vaults of macrocontent: lengthy prose that can extend for infinitely scrollable miles in a browser window. But voice interfaces require something different.
Technologist Anil Dash defined microcontent in 2002 as permalinked pieces of content that stay legible regardless of environment:
A day's weather forecast, the arrival and departure times for an airplane flight, an abstract from a long publication, or a single instant message can all be examples of microcontent.
I'd update Dash's definition to include all examples of bite-sized content that go well beyond written communiqués. Today we encounter microcontent in interfaces where a small snippet of copy is displayed alone, unmoored from the browser, like a textbot confirmation of a restaurant reservation.
The Temporal Nature of Voice Content
Microcontent offers the best opportunity to gauge how your content can be stretched to the very edges of its capabilities, informing delivery channels both established and novel. As microcontent, voice content is unique because it's an example of how content is experienced in time rather than in space.
We can glance at a digital sign underground for an instant and know when the next train is arriving, but voice interfaces hold our attention captive for periods of time that we can't easily escape or skip—something screen reader users are all too familiar with.
Because microcontent is fundamentally made up of isolated blobs with no relation to the channels where they'll eventually end up, we need to ensure that our microcontent truly performs well as voice content. This means focusing on the two most important traits of robust voice content:
- Voice content legibility
- Voice content discoverability
Both have to do with how voice content manifests in perceived time and space.
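As a loose sketch, a single piece of voice-ready microcontent might carry both traits explicitly: a concise spoken form (legibility) and the utterances that lead a user to it (discoverability). The field names below are illustrative, not a prescribed schema.

```typescript
// A sketch of one piece of microcontent shaped for voice delivery.
// Field names are hypothetical; the point is that each item is short,
// self-contained, and findable by ear.

interface VoiceContentItem {
  spokenText: string;   // legibility: written to be heard, not read
  utterances: string[]; // discoverability: phrases that should reach this content
  reprompt?: string;    // optional follow-up to keep the conversation moving
}

const nextTrain: VoiceContentItem = {
  spokenText: "The next train arrives in four minutes.",
  utterances: ["when is the next train", "next train time", "next arrival"],
  reprompt: "Would you like the arrival after that?",
};
```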
Designing for the Oldest Interface
The shift to voice interfaces represents more than a technological evolution—it's a return to humanity's original interface. For thousands of years, we've refined the art of conversation, developing nuanced systems of turn-taking, clarification, and context-switching that computers are only beginning to understand.
The challenge for designers and content strategists isn't simply to translate existing written content into spoken form. It's to recognize that voice interfaces demand fundamentally different thinking about information architecture, user flow, and content structure.
Transactional interactions require efficiency and directness. Informational interactions require clarity and thoroughness. And both require an understanding that we're designing for temporal experiences, not spatial ones.
As we continue to build voice interfaces, we must remember that the goal isn't to replicate the visual web in audio form. It's to create experiences that honor the primacy of speech while leveraging the unique capabilities of technology. The screen reader experience teaches us what happens when we simply layer audio on top of visual design. The evolution of voice assistants shows us what's possible when we design specifically for voice from the ground up.
The oldest interface is becoming new again, and with it comes the opportunity to rethink not just how we deliver content, but how we connect with people through conversation itself.
