Competing interests: The authors have declared that no competing interests exist.

In speech perception, we unconsciously process a continuous auditory stream with a complex time-frequency structure, one that contains no fixed, highly reproducible, or evident boundaries between the different perceptual elements that we detect in the stream of speech.
Phonemes [ 1 ] or syllables [ 2 ], the building-blocks of speech, are sophisticated perceptual entities. Through a long evolutionary process, human brains have learned to extract certain auditory primitives from the speech signal and associate them with different perceptual categories. Which acoustic features are extracted and used to perceive speech remains unknown, largely because of the lack of an experimental method enabling the direct visualization of auditory cue extraction.
The aim of this paper is to propose and demonstrate the validity of adapting the classification image framework to directly visualize the auditory functional cues actually used by individual listeners processing speech. Speech is a continuous waveform comprising an alternation of harmonic and non-harmonic acoustic segments.
Periodic sounds are caused by vibrations of the vocal folds and are shaped by resonances of the vocal tract to produce formants in the acoustic signal [5]. Formants thus correspond to local energy maxima in the spectral envelope of the signal and are present in vocalic sounds. There are typically 4 to 5 formants, depending on the phoneme considered. Formants cover a frequency range extending up to 4 to 6 kHz, with approximately one formant per kHz.
Each vowel appears to be loosely tied to a specific formantic structure, essentially determined by the frequencies of the first two formants, F1 and F2. Perturbations of the acoustic flux created by the rapid occlusion or release of the airflow generate silences, hisses, bursts, or explosions that constitute the core of consonantal sounds. Their presence transitorily inflects the formant trajectories, thus creating brief formantic transitions. Although the first effect is clearly due to the partial overlapping of articulatory commands between adjacent phonemes, the exact nature of the compensation phenomenon remains undetermined [8–11].
Coarticulation introduces internal variations into the system, referred to as allophonic variations: a range of different formantic structures will be perceived as the same phoneme. This phenomenon makes the system more resistant to intra- and inter-speaker variations, but it also makes the problem of learning to associate acoustic cues with phonemic percepts more difficult, and it leaves the reverse-engineering problem of designing automatic speech recognition and comprehension systems largely unresolved [12].
The precise mechanism underlying the transformation from continuous acoustical properties into the presence or absence of some acoustic cues, and finally into a discrete perceptual unit, remains undetermined. The acoustic-phonetic interface has been studied extensively. Many studies on this topic have been conducted under experimental conditions involving stimuli that were degraded in a controlled fashion, in order to narrow the problem to a small number of possible cues.
Among the most well-known attempts is the series of papers published by the Haskins Laboratories on the relationship between second formant transition and stop consonant perception using synthetic speech [ 13 , 14 ]. However, their conclusions are inherently limited by the non-naturalness of the synthesized stimuli: the variations of synthetic stimuli are restricted to a small number of cues, and thus they may not be processed in the same manner as natural stimuli.
As before, the question remains: are the acoustic cues evidenced with synthetic speech identical to those used for natural speech? The resistance of speech intelligibility to drastic signal reductions, such as those noted above, could rely on secondary perceptual cues not used in natural listening situations.
Scientists seeking to address this problem will ultimately be required to use natural speech production as stimuli.
In this context, a recent solution demonstrates the merits of using a masking noise on natural speech utterances to isolate the regions of the spectrogram crucial for identifying a particular phoneme. The technique, initially proposed by [17], involves masking natural speech utterances with noise at various signal-to-noise ratios (SNRs). By reverse-engineering the processing of speech in the brain, it has become possible to reveal the encoding of sub-phonological information in the auditory cortex [20, 21].
One such solution has been to record the firing-rate modulations of individual auditory neurons in response to specific stimuli to derive their spectrotemporal receptive fields (STRFs), which are a linear approximation of the time-frequency function of the neuron.
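The STRF idea can be illustrated with a toy reverse-correlation simulation (a sketch of the analysis, not of the recording procedure): for a model neuron driven by a white-noise spectrogram, averaging the stimulus history weighted by the firing rate recovers the neuron's linear time-frequency filter. All dimensions and the ground-truth filter below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth STRF (frequency x time-lag): an excitatory
# region with an inhibitory sideband, as described in the text.
n_freq, n_lags = 16, 10
true_strf = np.zeros((n_freq, n_lags))
true_strf[6:9, 2:5] = 1.0     # excitatory region
true_strf[10:13, 2:5] = -0.5  # inhibitory sideband

# White-noise spectrogram stimulus (unit variance in every bin).
T = 20000
stim = rng.standard_normal((n_freq, T))

# Model neuron: linear filtering by the STRF, then half-wave rectification.
drive = np.zeros(T)
for lag in range(n_lags):
    drive[n_lags:] += true_strf[:, lag] @ stim[:, n_lags - lag : T - lag]
rate = np.maximum(drive, 0.0)

# Reverse correlation: stimulus history averaged with firing-rate weights
# (for white noise this recovers the linear filter up to a scale factor).
sta = np.zeros((n_freq, n_lags))
for lag in range(n_lags):
    sta[:, lag] = stim[:, n_lags - lag : T - lag] @ rate[n_lags:]
sta /= rate[n_lags:].sum()
```

The recovered map peaks in the excitatory region and goes negative in the inhibitory sideband, mirroring the excitatory/inhibitory structure reported for auditory neurons.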
This technique has been widely used in birds, specifically with conspecific birdsongs as stimuli [22, 23]. These studies have demonstrated that auditory neurons are tuned to specific time-frequency regions, surrounded by one or more inhibitory regions. Spectrotemporal filters are assumed to be somewhat similar for human auditory neurons.
Electrocorticographic (ECoG) recordings have enabled the estimation of average STRFs for small groups of human auditory neurons in epileptic patients [24], thereby strengthening the idea that the basic auditory cues for humans are also composed of an excitatory region surrounded by inhibitory regions.
As a next step, [20] gathered STRFs from clusters of functionally similar neurons. They obtained the first images of the encoding of acoustic cues for several phonetic features, as well as the tuning of neurons to frequencies corresponding to formant values.
Although these results represent a major breakthrough in understanding how speech sounds are primarily decoded along the primary auditory pathway, it is difficult to infer how this information is combined to facilitate the identification of one phoneme rather than another phoneme. Computational models have been proposed [ 25 ] that link the STRF with a multiresolution representation of speech sounds in the auditory cortex. This approach could provide a unified model of the transformation of a speech input from the cochlea to the midbrain.
However, this account currently remains theoretical, because of the lack of a method allowing the observation of the use of acoustic cues in normal (non-epileptic) participants, in large-group studies, or in studies of the individual variations of these processes. In a previous paper [26], we demonstrated the feasibility of addressing this gap in the auditory domain by adapting a method designed to identify the primitives of simple perceptual tasks: the classification image technique.
Inspired by an auditory tone-in-noise detection experiment by Ahumada and Lovell [27], classification images were subsequently developed in the visual domain and successfully used to study Vernier acuity [28], perceptual learning [29, 30], the interpolation of illusory contours [31], the detection of luminance [32] and chromatic [33] modulations, and, recently, face pareidolia [34]. The rationale underlying this technique is that if the time-frequency coordinates at which the noise interferes with the decision of the observer are known, then the regions on which the observer focuses to perform the task are also known.
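This rationale can be sketched with a simulated observer in the style of Ahumada and Lovell's reverse-correlation logic; the task, internal template, and all numbers below are hypothetical, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy yes/no detection task on a 1-D "time-frequency" vector: the
# simulated observer only uses dimensions 20-29 (its internal template).
n_dim, n_trials = 64, 20000
template = np.zeros(n_dim)
template[20:30] = 1.0

signal_present = rng.integers(0, 2, n_trials).astype(bool)
noise = rng.standard_normal((n_trials, n_dim))
stimuli = noise + 0.5 * signal_present[:, None] * template

# Observer: respond "yes" when the template match exceeds a criterion
# placed halfway between the expected absent and present match values.
criterion = 0.5 * (template @ template) / 2
responses = stimuli @ template > criterion

# Classification image: noise averaged by response, "yes" minus "no".
# Dimensions where the noise drives the decision stand out from the rest.
ci = noise[responses].mean(axis=0) - noise[~responses].mean(axis=0)
```

Only the dimensions the observer actually weights emerge in the classification image, which is exactly the property the ACI method exploits.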
By fitting the decision weights corresponding to every pixel of the representation, it became possible to draw a time-frequency map of the categorization strategy and directly visualize which parts of the stimulus are crucial for the decision. In the first report on ACIs, we reported only individual data from three volunteers and used two speech productions as targets, thus leaving unanswered the question of the specificity of the obtained ACIs to these particular utterances. In the present study, we aimed to (1) further develop the method and complete a first group study to extend the feasibility of the method to group studies; (2) apply statistical tests permitting the evaluation of statistical significance within or between classification images; and (3) explore the specificity of the ACI to the utterances used as targets.
Seventeen native speakers of French with no background in phonetics or phonology participated in this study. They obtained scores within normal ranges on all tests (S1 Table). The analyses are based on the answers of 16 participants (12 females). Four speech samples were used as targets. The 4 stimuli were obtained by removing the silent gap between the two syllables, so as to align the onset of the second syllable at the same temporal position, and by then equating the 4 sounds in root-mean-square level and in duration.
The resulting speech signals sounded perfectly natural and were perfectly intelligible in quiet.
Each stimulus in this experiment consisted of one target signal embedded in additive Gaussian noise at a given SNR using Equation 1. The sampling rate of the original sounds was 48 kHz. All stimuli were root-mean-square normalized and were then preceded by 75 ms of Gaussian noise with a Gaussian fade-in to avoid abrupt onsets. The cochleograms of the 4 stimuli are shown in Fig. Parameters for spectral and temporal resolution are identical to those used to derive the ACIs (see details in the main text).
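Since Equation 1 is not reproduced here, the sketch below assumes the standard power-based definition of SNR when mixing a target into Gaussian noise; the 440 Hz tone is a stand-in for a speech target.

```python
import numpy as np

def embed_in_noise(target, snr_db, rng):
    """Return target + Gaussian noise, with the noise scaled so that
    10 * log10(P_signal / P_noise) equals snr_db (power-based SNR)."""
    noise = rng.standard_normal(target.shape)
    p_signal = np.mean(target ** 2)
    p_noise = np.mean(noise ** 2)
    noise *= np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return target + noise

rng = np.random.default_rng(0)
t = np.arange(48000) / 48000.0             # 1 s at 48 kHz
target = np.sin(2 * np.pi * 440.0 * t)     # stand-in "speech" target
stim = embed_in_noise(target, snr_db=-3.0, rng=rng)
```

Scaling the noise (rather than the target) keeps the target level fixed across trials, which is convenient when the same utterances are reused at many SNRs.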
They completed a set of 10,000 trials consisting of 2,500 noisy presentations of each of the 4 speech signals, presented in random order.
Participants were allowed to replay the stimulus before entering their response. Given the long duration of the experiment (approximately 4 h), we divided it into 20 sessions of 500 trials completed over 4 days to avoid mental and auditory fatigue. Sessions were separated by breaks of at least 3 min.
In addition, there was a short practice block before the beginning of the experiment that was similar to the test phase, except that the correct answers were displayed after each trial. The SNR was increased by one step after each incorrect response and decreased by one step after three consecutive correct responses from the last change in stimulus intensity. The method used for deriving ACIs has been previously detailed [ 26 ]. A summary is provided below, with a focus on several improvements that have been introduced since the publication of the first version.
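The adaptive rule described above (SNR raised one step after each error, lowered one step after three consecutive correct responses since the last change) can be written as a small update function; the 1.5 dB step size is an assumption, not taken from the paper.

```python
def staircase_update(snr_db, correct, streak, step_db=1.5):
    """One-up / three-down staircase on SNR: raise SNR after every error,
    lower it after three consecutive correct responses since the last
    change, otherwise leave it unchanged (step size is an assumption)."""
    if not correct:
        return snr_db + step_db, 0   # easier after an error; reset streak
    streak += 1
    if streak == 3:
        return snr_db - step_db, 0   # harder after three correct in a row
    return snr_db, streak

# Usage: trace the SNR over a short run of responses.
snr, streak = 0.0, 0
history = []
for correct in [True, True, True, False, True, True, True]:
    snr, streak = staircase_update(snr, correct, streak)
    history.append(snr)
```

A three-down/one-up rule of this kind converges toward the SNR yielding roughly 79% correct responses, keeping listeners near threshold throughout the session.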
The same preprocessing was applied to all noise and speech sounds before analysis. The vertical axis of the cochleogram represents the center frequencies of the auditory filters. Two additional processing stages are implemented in this function to mimic the non-linear behavior of the hair cells: a half-wave rectifier followed by an automatic gain control modeling the neural adaptation mechanism, and a difference between adjacent filters to sharpen the frequency response.
Finally, the output of the model is decimated in time to a lower sample rate. The cochleograms of our 4 stimuli are presented in Fig. The cochleogram of the noise on each trial i was calculated and is used below in its vectorized form. Phoneme categorization is regarded in this context as a simple template-matching process between the input sound and two mental representations of the targets stored in memory.
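As a rough illustration of this kind of preprocessing (a bank of band-pass filters, half-wave rectification, a compressive stand-in for the gain control, and temporal decimation), here is a minimal sketch; it omits the adjacent-filter difference and uses illustrative Gaussian filter shapes rather than the auditory model actually used in the paper.

```python
import numpy as np

def simple_cochleogram(signal, fs, center_freqs, decim=128):
    """Crude cochleogram sketch: Gaussian band-pass filters applied in
    the frequency domain, half-wave rectification, a static compression
    as a stand-in for the automatic gain control, and temporal
    decimation by averaging over non-overlapping windows."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    rows = []
    for cf in center_freqs:
        bw = 0.2 * cf                        # assumed roughly constant-Q
        gain = np.exp(-0.5 * ((freqs - cf) / bw) ** 2)
        band = np.fft.irfft(spectrum * gain, n=len(signal))
        env = np.maximum(band, 0.0) ** 0.3   # rectify, then compress
        n = len(env) // decim * decim
        rows.append(env[:n].reshape(-1, decim).mean(axis=1))
    return np.array(rows)                    # shape: (channels, frames)

fs = 48000
t = np.arange(fs) / fs
probe = np.sin(2 * np.pi * 1000.0 * t)       # 1 kHz probe tone
cfs = [250, 500, 1000, 2000, 4000]
cg = simple_cochleogram(probe, fs, cfs)
```

Feeding a 1 kHz probe tone through the bank concentrates energy in the channel centered at 1 kHz, which is the basic behavior a cochleogram front end should exhibit.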
The decision template corresponds to a particular linear weighting of the noise profile and is specific to the two targets involved in the task. The output of the dot product is added to an intercept term to yield a linear predictor, which is then transformed nonlinearly, through the psychometric function, into a probability ranging between 0 and 1.
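Under these assumptions, the decision stage amounts to a logistic GLM; the sketch below uses an arbitrary random template and a logistic link as the psychometric function (both are illustrative choices, not values from the paper).

```python
import numpy as np

def response_probability(noise_profile, decision_template, intercept):
    """Decision stage sketched in the text: a dot product between the
    vectorized noise cochleogram and the decision template, plus an
    intercept, passed through a logistic psychometric function to give
    P(response = target A), a value strictly between 0 and 1."""
    linear_predictor = decision_template @ noise_profile + intercept
    return 1.0 / (1.0 + np.exp(-linear_predictor))

rng = np.random.default_rng(0)
template = 0.1 * rng.standard_normal(100)   # arbitrary example template
p = response_probability(rng.standard_normal(100), template, intercept=0.2)
```

With a zero noise profile and a zero intercept the observer is maximally uncertain (probability 0.5); the template weights quantify how noise at each time-frequency point pushes the decision one way or the other.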
It is important to note that the GLM does not simulate the internal processing of the human speech perception system. However, it is useful for determining which variations of the stimulus affect human perception. Thus, our main goal was to approximate the decision template with an estimator, the ACI. Rather than directly estimating the model parameters by simple maximization of the log-likelihood, we introduced a smoothness prior during the optimization of the GLM. The main idea is to place constraints on the parameter values during the estimation process.
This method has been shown to be efficient for preventing the overfitting problem inherent in maximum likelihood estimation when processing a large number of parameters.
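A minimal version of such penalized estimation can be written as gradient ascent on the log-likelihood minus a quadratic smoothness penalty; the 1-D first-difference penalty, the simulated smooth template, and all hyperparameters below are illustrative stand-ins for the 2-D time-frequency prior used on real data.

```python
import numpy as np

def fit_smooth_glm(X, y, lam=0.5, lr=0.1, n_iter=3000):
    """Logistic-regression weights fitted by gradient ascent on the
    log-likelihood minus a smoothness penalty proportional to ||D w||^2,
    where D takes differences between adjacent weights (a 1-D stand-in
    for a 2-D time-frequency smoothness prior)."""
    n, d = X.shape
    w = np.zeros(d)
    D = np.diff(np.eye(d), axis=0)     # first-difference operator
    P = D.T @ D                        # quadratic penalty matrix
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w += lr * (X.T @ (y - p) / n - lam * (P @ w))
    return w

# Simulated observer whose true decision template is a smooth bump.
rng = np.random.default_rng(0)
d, n = 40, 5000
true_w = np.exp(-0.5 * ((np.arange(d) - 20) / 4.0) ** 2)
X = rng.standard_normal((n, d))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(float)
w_hat = fit_smooth_glm(X, y)
```

Because the penalty discourages large jumps between neighboring weights, the estimate stays close to the smooth underlying template instead of overfitting trial-by-trial noise, which is the point of introducing the prior.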