Multimodal and Voice Interaction

Multimodal interaction combines two or more input or output channels, such as speech and gesture; voice interaction lets users operate a system by speaking to it. Both aim for more natural, flexible communication with computers.

Definition

Multimodal interaction is interaction in which the user communicates through more than one modality and the system may interpret those modalities jointly. Voice interaction is interaction through spoken language. Conversational interfaces structure such interaction as a dialogue between user and system.

Scope

This topic covers speech-based and multimodal interfaces: voice user interfaces and conversational interaction, the combination of modalities such as speech with pointing or gesture, the fusion and disambiguation of multiple inputs, and the design issues of error, context, and feedback in these settings. It does not cover the underlying speech-recognition or natural-language algorithms, which belong to artificial intelligence, nor unimodal touch and gesture, treated under touch and gesture interaction.

Core questions

  • How can combining modalities such as speech and gesture improve interaction?
  • What advantages and limits do voice and conversational interfaces have?
  • How does a system fuse and disambiguate inputs from different modalities?
  • How should multimodal and voice interfaces handle errors and context?

Key concepts

  • voice user interface
  • conversational interface
  • multimodal fusion
  • complementary vs redundant modalities
  • speech and gesture combination
  • dialogue and turn-taking
  • error recovery
  • context and grounding

Key theories

Combining voice and gesture
Bolt's 'Put-that-there' system demonstrated that combining spoken commands with pointing lets users resolve references naturally: the user says 'put that there' while pointing first at an object and then at a location. It is an early illustration of complementary modalities.
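The idea of resolving spoken references through pointing can be sketched as follows. This is a minimal illustration, not Bolt's implementation: the Word and Pointing types, the DEICTIC word list, and the max_gap time window are all assumptions made for the example. Each deictic word is replaced by the pointing target closest to it in time.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    time: float  # seconds since the utterance started (assumed timestamp)


@dataclass
class Pointing:
    target: str  # what the pointer was on: an object or a location
    time: float


# Hypothetical list of words that need a pointing gesture to be understood.
DEICTIC = {"that", "there", "this", "here"}


def resolve_deictics(words, pointings, max_gap=1.0):
    """Replace each deictic word with the unused pointing target nearest in time."""
    resolved = []
    unused = list(pointings)
    for word in words:
        if word.text not in DEICTIC or not unused:
            resolved.append(word.text)
            continue
        nearest = min(unused, key=lambda p: abs(p.time - word.time))
        if abs(nearest.time - word.time) <= max_gap:
            unused.remove(nearest)  # each gesture resolves at most one word
            resolved.append(nearest.target)
        else:
            resolved.append(word.text)  # no gesture close enough: leave unresolved
    return resolved


words = [Word("put", 0.0), Word("that", 0.4), Word("there", 1.3)]
pointings = [Pointing("blue_square", 0.5), Pointing("(120, 80)", 1.2)]
print(resolve_deictics(words, pointings))
# → ['put', 'blue_square', '(120, 80)']
```

Neither channel is sufficient alone here: speech supplies the action, pointing supplies the object and the destination.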
Principles of multimodal interaction
Oviatt argued against common assumptions about multimodal use, showing that users do not simply duplicate input across modalities and that well-designed fusion of complementary modalities can improve robustness and efficiency.
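How fusion of complementary modalities can improve robustness may be easier to see in a small sketch. Here each recognizer returns a ranked list of hypotheses, and the system picks the pair that is both semantically compatible and jointly most plausible. The scores, the COMPATIBLE table, and the product-of-scores rule are illustrative assumptions, not the method of any particular system.

```python
from itertools import product

# Hypothetical n-best lists: (interpretation, recognizer confidence).
speech_nbest = [("delete", 0.52), ("zoom", 0.48)]
gesture_nbest = [("area", 0.70), ("point", 0.30)]

# Hypothetical constraints: which command makes sense with which gesture.
COMPATIBLE = {("delete", "point"), ("zoom", "area")}


def fuse(speech_nbest, gesture_nbest, compatible):
    """Return the best compatible (speech, gesture, score) triple, or None."""
    candidates = [
        (s_score * g_score, s, g)
        for (s, s_score), (g, g_score) in product(speech_nbest, gesture_nbest)
        if (s, g) in compatible
    ]
    if not candidates:
        return None  # no consistent joint reading: the caller should ask the user
    score, s, g = max(candidates)
    return s, g, score


command, gesture, score = fuse(speech_nbest, gesture_nbest, COMPATIBLE)
print(command, gesture)
# → zoom area
```

In this example the top speech hypothesis alone would have been "delete", but the gesture evidence makes "zoom" over an area the better joint reading, so one modality compensates for an error in the other.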
Conversational interface design
Conversational interfaces model interaction as dialogue, requiring attention to turn-taking, grounding, error recovery, and the management of context so that spoken or text exchanges remain coherent and useful.
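A minimal sketch of turn-taking, context, and grounding, assuming a hypothetical ride-booking task: the RideDialogue class, its slot names, and its prompts are invented for illustration, and user input is assumed to arrive already parsed into a dictionary. Slots persist across turns (context), and the system confirms its understanding before acting (grounding).

```python
REQUIRED_SLOTS = ("destination", "time")  # hypothetical task definition
PROMPTS = {
    "destination": "Where would you like to go?",
    "time": "What time would you like to leave?",
}


class RideDialogue:
    def __init__(self):
        self.slots = {}  # context carried across turns
        self.awaiting_confirmation = False
        self.done = False

    def respond(self, parsed):
        """Take one parsed user turn and return the system's next turn."""
        if self.awaiting_confirmation and "confirm" in parsed:
            self.awaiting_confirmation = False
            if parsed["confirm"]:
                self.done = True
                return "Booked."
            # Error recovery: discard the misunderstood request and start over.
            self.slots.clear()
            return "Sorry about that. " + PROMPTS["destination"]
        # Keep whatever the user supplied, including corrections to earlier slots.
        self.slots.update({k: v for k, v in parsed.items() if k in REQUIRED_SLOTS})
        for slot in REQUIRED_SLOTS:
            if slot not in self.slots:
                return PROMPTS[slot]  # ask only for what is still missing
        self.awaiting_confirmation = True
        return f"A ride to {self.slots['destination']} at {self.slots['time']}, is that right?"


dialogue = RideDialogue()
print(dialogue.respond({"destination": "the airport"}))
print(dialogue.respond({"time": "9 am"}))
print(dialogue.respond({"confirm": True}))
```

Because the destination is remembered from the first turn, the user never has to repeat it; a reply such as {"time": "10 am"} at the confirmation step would update the slot and trigger a fresh confirmation.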

Practical relevance

Voice and conversational interfaces power smart speakers, virtual assistants, and in-car systems, supporting hands-free and eyes-free use. Multimodal designs can make systems more robust and more accessible, including for users who cannot use conventional input devices, though they raise their own error-handling and privacy concerns.

History

Bolt's 1980 'Put-that-there' system pioneered combined voice and gesture interaction. Research through the 1990s, including systems such as QuickSet, developed multimodal fusion, and Oviatt's work corrected misconceptions about how people use multiple modalities. Advances in speech recognition led to widespread voice assistants and conversational interfaces in the 2010s.

Key figures

  • Richard A. Bolt
  • Sharon Oviatt
  • Philip R. Cohen
  • Michael McTear

Seminal works

  • bolt1980
  • oviatt1999
  • cohen1997

Frequently asked questions

Is multimodal interaction just offering several input options?
Not exactly. Offering alternative inputs is one benefit, but true multimodal interaction can interpret modalities together, so speech and a pointing gesture jointly specify a command. This can resolve ambiguity and improve robustness in ways that separate, independent inputs cannot.
Why do voice interfaces still struggle in some settings?
Voice depends on accurate speech recognition and on resolving ambiguous or context-dependent requests, which are hard in noisy environments or open-ended tasks. Voice also lacks the persistent visual feedback of screens, so designers must carefully manage confirmation, error recovery, and what the system can and cannot do.
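One common way to manage confirmation without persistent visual feedback is to let recognition confidence decide how cautiously the system replies. The sketch below assumes the recognizer reports a confidence between 0 and 1; the thresholds, function names, and example phrases are hypothetical choices for illustration.

```python
def choose_action(confidence, accept_at=0.85, confirm_at=0.50):
    """Map a recognition confidence to a dialogue strategy (thresholds are assumed)."""
    if confidence >= accept_at:
        return "accept"
    if confidence >= confirm_at:
        return "confirm"
    return "reprompt"


def reply(heard, confidence):
    """Build the spoken response for what the recognizer thinks it heard."""
    action = choose_action(confidence)
    if action == "accept":
        # Implicit confirmation: echo the request so the user can catch an error.
        return f"OK, {heard}."
    if action == "confirm":
        # Explicit confirmation: ask before acting.
        return f"Did you say: {heard}?"
    # Low confidence: apologise and remind the user what the system can do.
    return "Sorry, I didn't catch that. You can say things like 'play music' or 'set a timer'."


print(reply("set a timer for ten minutes", 0.93))
print(reply("set a timer for ten minutes", 0.62))
print(reply("set a timer for ten minutes", 0.20))
```

Echoing the request in the accept case stands in for the feedback a screen would give, while the reprompt tells the user what the system can do instead of failing silently.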
