Neural dynamics of auditory-visual speech fusion

van Wassenhove, V.1, Grant, K. W.2, and Poeppel D.1,3
1Neuroscience and Cognitive Science graduate program, Department of Biology, Cognitive Neuroscience of Language Laboratory, University of Maryland, College Park, MD; 2Walter Reed Army Medical Center, Army Audiology and Speech Center, Washington, DC; 3Department of Linguistics, University of Maryland, College Park, MD

How does the nervous system fuse two separate streams of sensory information, for which timing and spatial location may be the only source of redundancy? Multisensory fusion is classically accounted for by neural convergence onto multisensory integration sites, where spatio-temporal coincidence of auditory-visual (AV) events provides a sufficient constraint for the integration process.

Why, then, does an audio [pa] dubbed onto a visual [ka] result in perceiving [ta] (i.e., a unified percept), while an audio [ka] dubbed onto a visual [pa] does not (McGurk and MacDonald, 1976)? Here, we take an information-theoretic perspective and provide evidence for the necessity of an abstract (and modality-independent) representation of speech.

First, electrophysiological (EEG) recordings show that in natural AV syllables, visual information facilitates the neural processing of auditory speech, such that the more salient the visual inputs are, the faster the auditory processing of speech. In the bimodal condition, the shortened latency of auditory-evoked potentials (tens of milliseconds) was accompanied by an amplitude decrease (~200 ms), suggesting that neural computations in the gamma (~40 Hz) and theta (4-7 Hz) ranges are crucial for the processing of AV speech.

Second, EEG recordings using desynchronized AV speech stimuli show that cognitive demands (i.e., identifying vs. judging the synchrony of AV stimuli) are associated with two different global cortical states of activation in the gamma and theta ranges, suggesting implicit modulation of information extraction in AV speech.

Furthermore, it is in the gamma/theta interaction that hemisphere-specific computations could be dissociated. For instance, interactions in the right hemisphere correlated with the perception of temporal order.

Our results strongly support an 'analysis-by-synthesis' account of AV speech integration. A forward model of AV speech fusion is proposed to account for these results and will be discussed in the context of current multisensory neural convergence and feedback models of multisensory integration.