Speech perception is the process by which the sounds of language are heard, interpreted, and understood. The study of speech perception is closely linked to the fields of phonology and phonetics in linguistics and cognitive psychology and perception in psychology. Research in speech perception seeks to understand how human listeners recognize speech sounds and use this information to understand spoken language. Speech perception research has applications in building computer systems that can recognize speech, in improving speech recognition for hearing- and language-impaired listeners, and in foreign-language teaching.
The process of perceiving speech begins at the level of the sound signal and the process of audition. (For a complete description of the process of audition see Hearing.) After processing the initial auditory signal, speech sounds are further processed to extract acoustic cues and phonetic information. This speech information can then be used for higher-level language processes, such as word recognition.
Acoustic cues
Figure 1: Spectrograms of the syllables "dee" (top), "dah" (middle), and "doo" (bottom), showing how the onset formant transitions that perceptually define the consonant /d/ differ depending on the identity of the following vowel. Formants are highlighted by red dotted lines; transitions are the bending beginnings of the formant trajectories.
Acoustic cues are sensory cues contained in the speech sound signal which are used in speech perception to differentiate speech sounds belonging to different phonetic categories. For example, one of the most studied cues in speech is voice onset time or VOT. VOT is a primary cue signaling the difference between voiced and voiceless plosives, such as "b" and "p". Other cues differentiate sounds that are produced at different places of articulation or manners of articulation. The speech system must also combine these cues to determine the category of a specific speech sound. This is often thought of in terms of abstract representations of phonemes. These representations can then be combined for use in word recognition and other language processes.
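As a toy illustration of how a single acoustic cue such as VOT might be mapped onto a voicing category, the following Python sketch classifies a bilabial plosive as /b/ or /p/ using an assumed 25 ms category boundary; the threshold and the example values are illustrative, not measurements from any study cited here.

```python
def classify_bilabial_plosive(vot_ms: float, boundary_ms: float = 25.0) -> str:
    """Toy classifier: map voice onset time (ms) to a voicing category.

    The 25 ms boundary is an illustrative assumption; real listeners'
    boundaries vary with place of articulation, speech rate, and context
    (see "Lack of invariance" below).
    """
    return "/b/ (voiced)" if vot_ms < boundary_ms else "/p/ (voiceless)"

# Example tokens: a pre-voiced token (-60 ms), a short-lag token (+10 ms),
# and a long-lag aspirated token (+70 ms).
for vot in (-60.0, 10.0, 70.0):
    print(f"VOT = {vot:+.0f} ms -> {classify_bilabial_plosive(vot)}")
```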
It is not easy to identify what acoustic cues listeners are sensitive to when perceiving a particular speech sound:
At first glance, the solution to the problem of how we perceive speech seems deceptively simple. If one could identify stretches of the acoustic waveform that correspond to units of perception, then the path from sound to meaning would be clear. However, this correspondence or mapping has proven extremely difficult to find, even after some forty-five years of research on the problem.
If a specific aspect of the acoustic waveform indicated one linguistic unit, a series of tests using speech synthesizers would be sufficient to determine such a cue or cues. However, there are two significant obstacles:
- One acoustic aspect of the speech signal may cue different linguistically relevant dimensions. For example, the duration of a vowel in English can indicate whether or not the vowel is stressed, or whether it is in a syllable closed by a voiced or a voiceless consonant, and in some cases (like American English /ɛ/ and /æ/) it can distinguish the identity of vowels. Some experts even argue that duration can help in distinguishing what is traditionally called short and long vowels in English.
- One linguistic unit can be cued by several acoustic properties. For example, in a classic experiment, Alvin Liberman (1957) showed that the onset formant transitions of /d/ differ depending on the following vowel (see Figure 1) but they are all interpreted as the phoneme /d/ by listeners.
Linearity and the segmentation problem
Figure 2: A spectrogram of the phrase "I owe you". There are no clearly distinguishable boundaries between speech sounds.
Although listeners perceive speech as a stream of discrete units (phonemes, syllables, and words), this linearity is difficult to see in the physical speech signal (see Figure 2 for an example). Speech sounds do not strictly follow one another; rather, they overlap. A speech sound is influenced by the ones that precede and the ones that follow. This influence can even be exerted at a distance of two or more segments (and across syllable- and word-boundaries).
Because the speech signal is not linear, there is a problem of segmentation. It is difficult to delimit a stretch of speech signal as belonging to a single perceptual unit. As an example, the acoustic properties of the phoneme /d/ will depend on the production of the following vowel (because of coarticulation).
Lack of invariance
The research and application of speech perception must deal with several problems which result from what has been termed the lack of invariance. Reliable constant relations between a phoneme of a language and its acoustic manifestation in speech are difficult to find. There are several reasons for this:
Context-induced variation
Phonetic environment affects the acoustic properties of speech sounds. For example, /u/ in English is fronted when surrounded by coronal consonants. Or, the voice onset times marking the boundary between voiced and voiceless plosives differ for labial, alveolar and velar plosives, and they shift under stress or depending on the position within a syllable.
Variation due to differing speech conditions
One important factor that causes variation is differing speech rate. Many phonemic contrasts are constituted by temporal characteristics (short vs. long vowels or consonants, affricates vs. fricatives, plosives vs. glides, voiced vs. voiceless plosives, etc.) and they are certainly affected by changes in speaking tempo. Another major source of variation is articulatory carefulness vs. sloppiness which is typical for connected speech (articulatory "undershoot" is obviously reflected in the acoustic properties of the sounds produced).
Variation due to different speaker identity
The resulting acoustic structure of concrete speech productions depends on the physical and psychological properties of individual speakers. Men, women, and children generally produce voices having different pitch. Because speakers have vocal tracts of different sizes (due to sex and age especially), the resonant frequencies (formants), which are important for recognition of speech sounds, will vary in their absolute values across individuals (see Figure 3 for an illustration of this). Research shows that infants at the age of 7.5 months cannot recognize information presented by speakers of different genders; however, by the age of 10.5 months, they can detect the similarities. Dialect and foreign accent can also cause variation, as can the social characteristics of the speaker and listener.
Perceptual constancy and normalization
Figure 3: The left panel shows the three peripheral American English vowels /i/, /ɑ/, and /u/ in a standard F1-by-F2 plot (in Hz). The mismatch between male, female, and child values is apparent. In the right panel, formant distances (in Bark) rather than absolute values are plotted using the normalization procedure proposed by Syrdal and Gopal in 1986. Formant values are taken from Hillenbrand et al. (1995).
Despite the great variety of different speakers and different conditions, listeners perceive vowels and consonants as constant categories. It has been proposed that this is achieved by means of the perceptual normalization process in which listeners filter out the noise (i.e. variation) to arrive at the underlying category. Vocal-tract-size differences result in formant-frequency variation across speakers; therefore a listener has to adjust his/her perceptual system to the acoustic characteristics of a particular speaker. This may be accomplished by considering the ratios of formants rather than their absolute values. This process has been called vocal tract normalization (see Figure 3 for an example). Similarly, listeners are believed to adjust the perception of duration to the current tempo of the speech they are listening to – this has been referred to as speech rate normalization.
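A minimal sketch of the formant-ratio idea mentioned above: instead of comparing absolute formant frequencies, a listener (or a model) might compare ratios such as F2/F1 and F3/F2, which depend less on vocal-tract size. The formant values below are rough illustrative figures for the vowel /i/, not measured data, and the ratio scheme is only one of several normalization procedures proposed in the literature.

```python
def formant_ratios(f1_hz: float, f2_hz: float, f3_hz: float) -> tuple[float, float]:
    """Return speaker-size-robust ratios (F2/F1, F3/F2) for one vowel token."""
    return (f2_hz / f1_hz, f3_hz / f2_hz)

# Illustrative (not measured) formants for the vowel /i/:
adult_male = (270.0, 2290.0, 3010.0)   # low F1, high F2
child      = (370.0, 3200.0, 3730.0)   # all formants shifted upward

print("absolute formants differ:", adult_male, child)
print("ratios are closer:",
      formant_ratios(*adult_male), formant_ratios(*child))
```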
Whether or not normalization actually takes place, and what its exact nature is, remains a matter of theoretical controversy (see theories below). Perceptual constancy is a phenomenon not specific to speech perception only; it exists in other types of perception too.
Categorical perception
Figure 4: Example identification (red) and discrimination (blue) functions.
Categorical perception is involved in processes of perceptual differentiation. People perceive speech sounds categorically, that is to say, they are more likely to notice the differences between categories (phonemes) than within categories. The perceptual space between categories is therefore warped, the centers of categories (or "prototypes") working like a sieve or like magnets for incoming speech sounds.
In an artificial continuum between a voiceless and a voiced bilabial plosive, each new step differs from the preceding one in the amount of VOT. The first sound is a pre-voiced [b], i.e. it has a negative VOT. Then, increasing the VOT, it reaches zero, i.e. the plosive is a plain unaspirated voiceless [p]. Gradually, adding the same amount of VOT at a time, the plosive is eventually a strongly aspirated voiceless bilabial [pʰ]. (Such a continuum was used in an experiment by Lisker and Abramson in 1970. The sounds they used are available online.) In this continuum of, for example, seven sounds, native English listeners will identify the first three sounds as /b/ and the last three sounds as /p/ with a clear boundary between the two categories. A two-alternative identification (or categorization) test will yield a discontinuous categorization function (see red curve in Figure 4).
In tests of the ability to discriminate between two sounds with varying VOT values but having a constant VOT distance from each other (20 ms for instance), listeners are likely to perform at chance level if both sounds fall within the same category and at nearly 100% level if each sound falls in a different category (see the blue discrimination curve in Figure 4).
The conclusion to make from both the identification and the discrimination test is that listeners will have different sensitivity to the same relative increase in VOT depending on whether or not the boundary between categories was crossed. Similar perceptual adjustment is attested for other acoustic cues as well.
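The relationship between the identification and discrimination functions described above can be sketched with a simple model: identification probability is assumed to follow a logistic function of VOT, and discrimination of a pair is predicted from identification in the spirit of the classical Haskins-style account (pairs labelled differently are discriminated; pairs labelled the same are guessed at chance). The 25 ms boundary and the slope are illustrative assumptions.

```python
import math

def p_identify_p(vot_ms: float, boundary_ms: float = 25.0, slope: float = 0.5) -> float:
    """Probability of labelling a token /p/ (logistic identification function)."""
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary_ms)))

def p_discriminate(vot_a: float, vot_b: float) -> float:
    """Haskins-style prediction: discrimination succeeds only when the two
    tokens receive different labels; otherwise responses are at chance (0.5)."""
    pa, pb = p_identify_p(vot_a), p_identify_p(vot_b)
    p_different_labels = pa * (1 - pb) + pb * (1 - pa)
    return p_different_labels + 0.5 * (1 - p_different_labels)

# Pairs 20 ms apart: within-category pairs stay near 50%, while the pairs
# straddling the boundary are discriminated far better.
for a in range(-10, 61, 10):
    print(f"{a:+3d} vs {a + 20:+3d} ms: {p_discriminate(a, a + 20):.2f}")
```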
Top-down influences
In a classic experiment, Richard M. Warren (1970) replaced one phoneme of a word with a cough-like sound. Perceptually, his subjects restored the missing speech sound without any difficulty and could not accurately identify which phoneme had been disturbed, a phenomenon known as the phonemic restoration effect. Therefore, the process of speech perception is not necessarily uni-directional.
Another basic experiment compared recognition of naturally spoken words within a phrase versus the same words in isolation, finding that perception accuracy usually drops in the latter condition. To probe the influence of semantic knowledge on perception, Garnes and Bond (1976) similarly used carrier sentences where target words only differed in a single phoneme (bay/day/gay, for example) whose quality changed along a continuum. When put into different sentences that each naturally led to one interpretation, listeners tended to judge ambiguous words according to the meaning of the whole sentence. That is, higher-level language processes connected with morphology, syntax, or semantics may interact with basic speech perception processes to aid in recognition of speech sounds.
It may be the case that it is not necessary, and maybe even not possible, for a listener to recognize phonemes before recognizing higher units, like words for example. After obtaining at least a fundamental piece of information about the phonemic structure of the perceived entity from the acoustic signal, listeners can compensate for missing or noise-masked phonemes using their knowledge of the spoken language. Compensatory mechanisms might even operate at the sentence level, such as in learned songs, phrases, and verses, an effect backed up by neural coding patterns consistent with the missed continuous speech fragments, despite the lack of all relevant bottom-up sensory input.
Acquired language impairment
The earliest hypotheses of speech perception were used with patients who had acquired an auditory comprehension deficit, also known as receptive aphasia. Since then, many disabilities have been classified, which has resulted in a more precise definition of "speech perception". The term "speech perception" describes the process of interest that employs sub-lexical contexts in the probing process. It consists of many different language and grammatical functions, such as: features, segments (phonemes), syllabic structure (units of pronunciation), phonological word forms (how sounds are grouped together), grammatical features, morphemes (prefixes and suffixes), and semantic information (the meaning of the words). In the early years, researchers were more interested in the acoustics of speech. For instance, they looked at the differences between /ba/ and /da/, but now research has been directed at the brain's response to the stimuli. In recent years, a model has been developed to create a sense of how speech perception works; this model is known as the dual stream model. This model has drastically changed how psychologists look at perception. The first section of the dual stream model is the ventral pathway. This pathway incorporates the middle temporal gyrus, the inferior temporal sulcus, and perhaps the inferior temporal gyrus. The ventral pathway maps phonological representations onto lexical or conceptual representations, that is, the meaning of the words. The second section of the dual stream model is the dorsal pathway. This pathway includes the sylvian parietotemporal region, the inferior frontal gyrus, the anterior insula, and the premotor cortex. Its primary function is to take sensory or phonological stimuli and transfer them into an articulatory-motor representation (the formation of speech).
Aphasia
Aphasia is an impairment of language processing caused by damage to the brain. Different parts of language processing are impacted depending on the area of the brain that is damaged, and aphasia is further classified based on the location of injury or constellation of symptoms. Damage to Broca's area of the brain often results in expressive aphasia which manifests as impairment in speech production. Damage to Wernicke's area often results in receptive aphasia where speech processing is impaired.
Aphasia with impaired speech perception typically shows lesions or damage located in the left temporal or parietal lobes. Lexical and semantic difficulties are common, and comprehension may be affected.
Agnosia
Agnosia is "the loss or diminution of the ability to recognize familiar objects or stimuli usually as a result of brain damage". There are several different kinds of agnosia that affect every one of our senses, but the two most common related to speech are speech agnosia and phonagnosia.
Speech agnosia: Pure word deafness, or speech agnosia, is an impairment in which a person maintains the ability to hear, produce speech, and even read speech, yet is unable to understand or properly perceive speech. These patients seem to have all of the skills necessary to properly process speech, yet they appear to have no experience associated with speech stimuli. Patients have reported, "I can hear you talking, but I can't translate it". Even though they are physically receiving and processing the stimuli of speech, without the ability to determine the meaning of the speech they are essentially unable to perceive the speech at all. No treatments have been found, but from case studies and experiments it is known that speech agnosia is related to lesions in the left hemisphere or in both hemispheres, specifically right temporoparietal dysfunctions.
Phonagnosia: Phonagnosia is associated with the inability to recognize any familiar voices. In these cases, speech stimuli can be heard and even understood but the association of the speech with a certain voice is lost. This can be due to "abnormal processing of complex vocal properties (timbre, articulation, and prosody—elements that distinguish an individual voice)". There is no known treatment; however, there is a case report of an epileptic woman who began to experience phonagnosia along with other impairments. Her EEG and MRI results showed "a right cortical parietal T2-hyperintense lesion without gadolinium enhancement and with discrete impairment of water molecule diffusion". So although no treatment has been discovered, phonagnosia can be correlated with postictal parietal cortical dysfunction.
Infant speech perception
Infants begin the process of language acquisition by being able to detect very small differences between speech sounds. They can discriminate all possible speech contrasts (phonemes). Gradually, as they are exposed to their native language, their perception becomes language-specific, i.e. they learn how to ignore the differences within phonemic categories of the language (differences that may well be contrastive in other languages – for example, English distinguishes two voicing categories of plosives, whereas Thai has three categories; infants must learn which differences are distinctive in their native language and which are not). As infants learn how to sort incoming speech sounds into categories, ignoring irrelevant differences and reinforcing the contrastive ones, their perception becomes categorical. Infants learn to contrast different vowel phonemes of their native language by approximately 6 months of age. The native consonantal contrasts are acquired by 11 or 12 months of age. Some researchers have proposed that infants may be able to learn the sound categories of their native language through passive listening, using a process called statistical learning. Others even claim that certain sound categories are innate, that is, they are genetically specified (see discussion about innate vs. acquired categorical distinctiveness).
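One way the statistical (distributional) learning idea has been modelled is by assuming that learners track the distribution of a cue such as VOT and infer the clusters it contains. The sketch below fits a two-component Gaussian mixture to synthetic VOT values; the data are invented for illustration, and the use of scikit-learn is an implementation choice, not a claim about the cited studies.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic VOT values (ms) drawn from two clusters, standing in for the
# short-lag (/b/-like) and long-lag (/p/-like) tokens a learner might hear.
vot = np.concatenate([
    rng.normal(10, 8, 500),    # short-lag cluster
    rng.normal(60, 12, 500),   # long-lag cluster
]).reshape(-1, 1)

# A learner that assumes two categories recovers means near 10 and 60 ms.
gmm = GaussianMixture(n_components=2, random_state=0).fit(vot)
print("estimated category means (ms):", sorted(gmm.means_.ravel().round(1)))

# Categorize a new token by its posterior probability over the learned clusters.
print("posterior for a 30 ms token:", gmm.predict_proba([[30.0]]).round(2))
```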
If day-old babies are presented with their mother's voice speaking normally, abnormally (in monotone), and a stranger's voice, they react only to their mother's voice speaking normally. When a human sound and a non-human sound are played, babies turn their heads only to the source of the human sound. It has been suggested that auditory learning begins as early as the prenatal period.
One of the techniques used to examine how infants perceive speech, besides the head-turn procedure mentioned above, is measuring their sucking rate. In such an experiment, a baby is sucking a special nipple while presented with sounds. First, the baby's normal sucking rate is established. Then a stimulus is played repeatedly. When the baby hears the stimulus for the first time the sucking rate increases but as the baby becomes habituated to the stimulation the sucking rate decreases and levels off. Then, a new stimulus is played to the baby. If the baby perceives the newly introduced stimulus as different from the background stimulus the sucking rate will show an increase. The sucking-rate and the head-turn method are some of the more traditional, behavioral methods for studying speech perception. Among the new methods (see Research methods below) that help us to study speech perception, near-infrared spectroscopy is widely used in infants.
It has also been discovered that even though infants' ability to distinguish between the different phonetic properties of various languages begins to decline around the age of nine months, it is possible to reverse this process through sufficient exposure to a new language. In a research study by Patricia K. Kuhl, Feng-Ming Tsao, and Huei-Mei Liu, it was discovered that if infants are spoken to and interacted with by a native speaker of Mandarin Chinese, they can actually be conditioned to retain their ability to distinguish speech sounds within Mandarin that are very different from speech sounds found within the English language. This shows that, given the right conditions, it is possible to prevent infants from losing the ability to distinguish speech sounds in languages other than their native language.
Cross-language and second-language
A large amount of research has studied how users of a language perceive foreign speech (referred to as cross-language speech perception) or second-language speech (second-language speech perception). The latter falls within the domain of second language acquisition.
Languages differ in their phonemic inventories. Naturally, this creates difficulties when a foreign language is encountered. For example, if two foreign-language sounds are assimilated to a single mother-tongue category the difference between them will be very difficult to discern. A classic example of this situation is the observation that Japanese learners of English will have problems with identifying or distinguishing English liquid consonants /l/ and /r/ (see Perception of English /r/ and /l/ by Japanese speakers).
Best (1995) proposed a Perceptual Assimilation Model which describes possible cross-language category assimilation patterns and predicts their consequences. Flege (1995) formulated a Speech Learning Model which combines several hypotheses about second-language (L2) speech acquisition and which predicts, in simple words, that an L2 sound that is not too similar to a native-language (L1) sound will be easier to acquire than an L2 sound that is relatively similar to an L1 sound (because it will be perceived as more obviously "different" by the learner).
In language or hearing impairment
Research in how people with language or hearing impairment perceive speech is not only intended to discover possible treatments. It can provide insight into the principles underlying non-impaired speech perception. Two areas of research can serve as an example:
Listeners with aphasia
Aphasia affects both the expression and reception of language. The two most common types, expressive aphasia and receptive aphasia, both affect speech perception to some extent. Expressive aphasia causes moderate difficulties for language understanding. The effect of receptive aphasia on understanding is much more severe. It is generally agreed that people with aphasia suffer from perceptual deficits. They usually cannot fully distinguish place of articulation and voicing. As for other features, the difficulties vary. It has not yet been proven whether low-level speech-perception skills are affected in aphasia sufferers or whether their difficulties are caused by higher-level impairment alone.
Listeners with cochlear implants
Cochlear implantation restores access to the acoustic signal in individuals with sensorineural hearing loss. The acoustic information conveyed by an implant is usually sufficient for implant users to properly recognize speech of people they know even without visual cues. For cochlear implant users, it is more difficult to understand unknown speakers and sounds. The perceptual abilities of children who received an implant after the age of two are significantly better than those of individuals who were implanted in adulthood. A number of factors have been shown to influence perceptual performance, specifically: duration of deafness prior to implantation, age of onset of deafness, age at implantation (such age effects may be related to the Critical period hypothesis) and the duration of implant use. There are differences between children with congenital and acquired deafness. Postlingually deaf children have better results than the prelingually deaf and adapt to a cochlear implant faster. In both children with cochlear implants and children with normal hearing, sensitivity to vowels and voice onset time develops before the ability to discriminate place of articulation. Several months following implantation, children with cochlear implants can normalize speech perception.
Noise
One of the fundamental problems in the study of speech is how to deal with noise. This is illustrated by the difficulty computer speech recognition systems have with recognizing human speech. While they can do well at recognizing speech if trained on a specific speaker's voice and under quiet conditions, these systems often do poorly in more realistic listening situations where humans would understand speech with relative ease. To emulate processing patterns that would be held in the brain under normal conditions, prior knowledge is a key neural factor, since a robust learning history may to an extent override the extreme masking effects involved in the complete absence of continuous speech signals.
Music-language connection
Research into the relationship between music and cognition is an emerging field related to the study of speech perception. Originally it was theorized that the neural signals for music were processed in a specialized "module" in the right hemisphere of the brain and, conversely, that the neural signals for language were processed by a similar "module" in the left hemisphere. However, using technologies such as fMRI, research has shown that two regions of the brain traditionally considered to process speech exclusively, Broca's and Wernicke's areas, also become active during musical activities such as listening to a sequence of musical chords. Other studies, such as one performed by Marques et al. in 2006, showed that 8-year-olds who were given six months of musical training showed an increase in both their pitch detection performance and their electrophysiological measures when made to listen to an unknown foreign language.
Conversely, some research has revealed that, rather than music affecting our perception of speech, our native speech can affect our perception of music. One example is the tritone paradox, in which a listener is presented with two computer-generated tones (such as C and F-sharp) that are half an octave (or a tritone) apart and is then asked to determine whether the pitch of the sequence is descending or ascending. One such study, performed by Diana Deutsch, found that the listener's interpretation of ascending or descending pitch was influenced by the listener's language or dialect, showing variation between listeners raised in the south of England and those raised in California, and between listeners from Vietnam and native English speakers from California. A second study, performed in 2006 on a group of English speakers and 3 groups of East Asian students at the University of Southern California, discovered that English speakers who had begun musical training at or before age 5 had an 8% chance of having perfect pitch.
Speech phenomenology
The experience of speech
Casey O'Callaghan, in his article Experiencing Speech, analyzes whether "the perceptual experience of listening to speech differs in phenomenal character" with regards to understanding the language being heard. He argues that an individual's experience when hearing a language they comprehend, as opposed to their experience when hearing a language they have no knowledge of, displays a difference in phenomenal features which he defines as "aspects of what an experience is like" for an individual.
If a subject who is a monolingual native English speaker is presented with a stimulus of speech in German, the string of phonemes will appear as mere sounds and will produce a very different experience than if exactly the same stimulus was presented to a subject who speaks German.
He also examines how speech perception changes when one learns a language. If a subject with no knowledge of the Japanese language were presented with a stimulus of Japanese speech and then were given the exact same stimulus after being taught Japanese, this same individual would have an extremely different experience.
Research methods
The methods used in speech perception research can be roughly divided into three groups: behavioral, computational, and, more recently, neurophysiological methods.
Behavioral methods
Behavioral experiments are based on an active role of a participant, i.e. subjects are presented with stimuli and asked to make conscious decisions about them. This can take the form of an identification test, a discrimination test, similarity rating, etc. These types of experiments help to provide a basic description of how listeners perceive and categorize speech sounds.
Sinewave Speech
Speech perception has also been analyzed through sinewave speech, a form of synthetic speech where the human voice is replaced by sine waves that mimic the frequencies and amplitudes present in the original speech. When subjects are first presented with this speech, the sinewave speech is interpreted as random noise. But when the subjects are informed that the stimulus is actually speech and are told what is being said, "a distinctive, nearly immediate shift occurs" in how the sinewave speech is perceived.
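A minimal sketch of how a sinewave-speech stimulus is constructed: each formant track is replaced by a single time-varying sinusoid and the sinusoids are summed. In practice the formant tracks are estimated from a natural recording (for example by LPC analysis); here they are hand-specified, illustrative values for a /da/-like syllable.

```python
import numpy as np

SR = 16000          # sample rate (Hz)
DUR = 0.5           # duration (s)
t = np.linspace(0.0, DUR, int(SR * DUR), endpoint=False)

# Hand-specified, illustrative formant tracks (Hz); a real sinewave-speech
# stimulus would derive these from a recorded utterance.
f1 = np.linspace(400, 700, t.size)     # F1 rises into the vowel
f2 = np.linspace(1800, 1200, t.size)   # F2 falls from the alveolar locus
f3 = np.full(t.size, 2500.0)           # F3 roughly steady

def tone(freq_track: np.ndarray) -> np.ndarray:
    """Sinusoid whose instantaneous frequency follows the given track."""
    phase = 2 * np.pi * np.cumsum(freq_track) / SR
    return np.sin(phase)

# Sinewave "speech": one sinusoid per formant, summed, with none of the
# harmonics, noise, or broadband detail of a natural voice.
signal = 0.5 * tone(f1) + 0.3 * tone(f2) + 0.2 * tone(f3)
print(signal.shape, float(signal.min()), float(signal.max()))
```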
Computational methods
Computational modeling has also been used to simulate how speech may be processed by the brain to produce behaviors that are observed. Computer models have been used to address several questions in speech perception, including how the sound signal itself is processed to extract the acoustic cues used in speech, and how speech information is used for higher-level processes, such as word recognition.
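As a toy example of the second kind of question (how extracted speech information feeds word recognition), the sketch below scores each word in a tiny lexicon by how probable its phoneme sequence is given per-segment probabilities from a hypothetical acoustic front end. The lexicon, probabilities, and scoring rule are illustrative assumptions, not a description of any specific published model such as TRACE.

```python
import math

# Per-position phoneme probabilities, as might come out of an acoustic front end.
observed = [
    {"b": 0.7, "p": 0.3},          # first segment: probably voiced
    {"æ": 0.6, "ɛ": 0.4},          # second segment: probably /æ/
    {"t": 0.5, "d": 0.5},          # final segment: ambiguous
]

LEXICON = {"bat": ["b", "æ", "t"], "bet": ["b", "ɛ", "t"],
           "pad": ["p", "æ", "d"], "pet": ["p", "ɛ", "t"]}

def word_score(phonemes: list[str]) -> float:
    """Log-probability that the observed segment probabilities match the word."""
    return sum(math.log(observed[i].get(ph, 1e-6)) for i, ph in enumerate(phonemes))

scores = {w: round(word_score(p), 2) for w, p in LEXICON.items()}
best = max(scores, key=scores.get)
print(scores, "-> best match:", best)
```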
Neurophysiological methods
Neurophysiological methods rely on utilizing information stemming from more direct and not necessarily conscious (pre-attentive) processes. Subjects are presented with speech stimuli in different types of tasks and the responses of the brain are measured. The brain itself can be more sensitive than it appears to be through behavioral responses. For example, the subject may not show sensitivity to the difference between two speech sounds in a discrimination test, but brain responses may reveal sensitivity to these differences. Methods used to measure neural responses to speech include event-related potentials, magnetoencephalography, and near-infrared spectroscopy. One important response used with event-related potentials is the mismatch negativity, which occurs when speech stimuli are acoustically different from a stimulus that the subject heard previously.
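A minimal sketch of how a mismatch negativity is typically quantified: epochs time-locked to the frequent ("standard") stimulus and to the rare ("deviant") stimulus are averaged separately, and the MMN appears as a negative deflection in the deviant-minus-standard difference wave. The arrays below are synthetic placeholders standing in for real EEG epochs.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 300                       # samples of one EEG channel per epoch

# Synthetic epochs (trials x time): deviants carry an extra negative
# deflection in a mid-latency window to stand in for a real MMN.
standard_epochs = rng.normal(0.0, 1.0, size=(200, n_samples))
deviant_epochs = rng.normal(0.0, 1.0, size=(50, n_samples))
deviant_epochs[:, 150:250] -= 2.0

# Average within each condition, then take the difference wave.
erp_standard = standard_epochs.mean(axis=0)
erp_deviant = deviant_epochs.mean(axis=0)
mmn_wave = erp_deviant - erp_standard

print("mean difference in the MMN window:", round(float(mmn_wave[150:250].mean()), 2))
```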
Neurophysiological methods were introduced into speech perception research for several reasons:
- Behavioral responses may reflect late, conscious processes and be affected by other systems such as orthography, and thus they may mask a listener's ability to recognize sounds based on lower-level acoustic distributions.
- Without the necessity of taking an active part in the test, even infants can be tested; this feature is crucial in research into acquisition processes.
- The possibility of observing low-level auditory processes independently from the higher-level ones makes it possible to address long-standing theoretical issues, such as whether or not humans possess a specialized module for perceiving speech, or whether or not some complex acoustic invariance (see lack of invariance above) underlies the recognition of a speech sound.
Theories
Motor theory
Some of the earliest work in the study of how humans perceive speech sounds was conducted by Alvin Liberman and his colleagues at Haskins Laboratories. Using a speech synthesizer, they constructed speech sounds that varied in place of articulation along a continuum from /bɑ/ to /dɑ/ to /ɡɑ/. Listeners were asked to identify which sound they heard and to discriminate between two different sounds. The results of the experiment showed that listeners grouped sounds into discrete categories, even though the sounds they were hearing were varying continuously. Based on these results, they proposed the notion of categorical perception as a mechanism by which humans can identify speech sounds.
More recent research using different tasks and methods suggests that listeners are highly sensitive to acoustic differences within a single phonetic category, contrary to a strict categorical account of speech perception.
To provide a theoretical account of the categorical perception data, Liberman and colleagues worked out the motor theory of speech perception, where "the complicated articulatory encoding was assumed to be decoded in the perception of speech by the same processes that are involved in production" (this is referred to as analysis-by-synthesis). For instance, the English consonant /d/ may vary in its acoustic details across different phonetic contexts (see above), yet all /d/'s as perceived by a listener fall within one category (voiced alveolar plosive) and that is because "linguistic representations are abstract, canonical, phonetic segments or the gestures that underlie these segments". When describing units of perception, Liberman later abandoned articulatory movements and proceeded to the neural commands to the articulators and even later to intended articulatory gestures, thus "the neural representation of the utterance that determines the speaker's production is the distal object the listener perceives". The theory is closely related to the modularity hypothesis, which proposes the existence of a special-purpose module, which is supposed to be innate and probably human-specific.
The theory has been criticized in terms of not being able to "provide an account of just how acoustic signals are translated into intended gestures" by listeners. Furthermore, it is unclear how indexical information (e.g. talker-identity) is encoded/decoded along with linguistically relevant information.
Exemplar theory
Exemplar models of speech perception differ from the other theories described in this section, which suppose that there is no connection between word- and talker-recognition and that the variation across talkers is "noise" to be filtered out.
The exemplar-based approaches claim listeners store information for both word- and talker-recognition. According to this theory, particular instances of speech sounds are stored in the memory of a listener. In the process of speech perception, the remembered instances of e.g. a syllable stored in the listener's memory are compared with the incoming stimulus so that the stimulus can be categorized. Similarly, when recognizing a talker, all the memory traces of utterances produced by that talker are activated and the talker's identity is determined. Supporting this theory are several experiments reported by Johnson that suggest that our signal identification is more accurate when we are familiar with the talker or when we have visual representation of the talker's gender. When the talker is unpredictable or the sex misidentified, the error rate in word-identification is much higher.
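The core claim of exemplar-based categorization can be sketched as follows: every stored token keeps its acoustic values (and could keep its talker label), and a new token is categorized by summing its similarity to all stored exemplars of each category. The exponential similarity rule and the exemplar values below are illustrative choices, not parameters from any particular study.

```python
import math
from collections import defaultdict

# Stored exemplars: (category, talker, acoustic value), here a single cue (VOT, ms).
exemplars = [
    ("/b/", "talker_A", 8.0), ("/b/", "talker_A", 12.0), ("/b/", "talker_B", 15.0),
    ("/p/", "talker_A", 55.0), ("/p/", "talker_B", 62.0), ("/p/", "talker_B", 70.0),
]

def similarity(x: float, y: float, sensitivity: float = 0.1) -> float:
    """Exponential similarity: nearby exemplars contribute more (illustrative rule)."""
    return math.exp(-sensitivity * abs(x - y))

def categorize(vot_ms: float) -> dict[str, float]:
    """Sum similarity to every stored exemplar, grouped by category, then normalize."""
    support: dict[str, float] = defaultdict(float)
    for category, _talker, value in exemplars:
        support[category] += similarity(vot_ms, value)
    total = sum(support.values())
    return {cat: round(s / total, 2) for cat, s in support.items()}

print(categorize(20.0))   # closer to the stored /b/ tokens
print(categorize(58.0))   # closer to the stored /p/ tokens
```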
The exemplar models face several objections, two of which are (1) insufficient memory capacity to store every utterance ever heard and (2), concerning the ability to produce what was heard, the question of whether the talker's own articulatory gestures are also stored or computed when producing utterances that would sound like the auditory memories.
Acoustic landmarks and distinctive features
Kenneth N. Stevens proposed acoustic landmarks and distinctive features as a relation between phonological features and auditory properties. According to this view, listeners are inspecting the incoming signal for the so-called acoustic landmarks which are particular events in the spectrum carrying information about gestures which produced them. Since these gestures are limited by the capacities of humans' articulators and listeners are sensitive to their auditory correlates, the lack of invariance simply does not exist in this model. The acoustic properties of the landmarks constitute the basis for establishing the distinctive features. Bundles of them uniquely specify phonetic segments (phonemes, syllables, words).
In this model, the incoming acoustic signal is believed to be first processed to determine the so-called landmarks which are special spectral events in the signal; for example, vowels are typically marked by higher frequency of the first formant, consonants can be specified as discontinuities in the signal and have lower amplitudes in lower and middle regions of the spectrum. These acoustic features result from articulation. In fact, secondary articulatory movements may be used when enhancement of the landmarks is needed due to external conditions such as noise. Stevens claims that coarticulation causes only limited and moreover systematic and thus predictable variation in the signal which the listener is able to deal with. Within this model therefore, what is called the lack of invariance is simply claimed not to exist.
Landmarks are analyzed to determine certain articulatory events (gestures) which are connected with them. In the next stage, acoustic cues are extracted from the signal in the vicinity of the landmarks by means of mental measuring of certain parameters such as frequencies of spectral peaks, amplitudes in low-frequency region, or timing.
The next processing stage comprises acoustic-cue consolidation and derivation of distinctive features. These are binary categories related to articulation (for example [+/- high], [+/- back], [+/- round lips] for vowels; [+/- sonorant], [+/- lateral], or [+/- nasal] for consonants).
Bundles of these features uniquely identify speech segments (phonemes, syllables, words). These segments are part of the lexicon stored in the listener's memory. Its units are activated in the process of lexical access and mapped on the original signal to find out whether they match. If not, another attempt with a different candidate pattern is made. In this iterative fashion, listeners thus reconstruct the articulatory events which were necessary to produce the perceived speech signal. This can be therefore described as analysis-by-synthesis.
This theory thus posits that the distal objects of speech perception are the articulatory gestures underlying speech. Listeners make sense of the speech signal by referring to them. The model belongs to those referred to as analysis-by-synthesis.
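A highly simplified sketch of the final lexical-matching stage described above: each lexical candidate is stored as a sequence of distinctive-feature bundles, and the bundles derived from the signal's landmarks are compared against the candidates until one matches. The iterative analysis-by-synthesis loop is reduced here to a plain lookup, and the feature values and the two-word lexicon are invented for illustration.

```python
# Segments are represented as bundles of binary distinctive features.
Bundle = frozenset  # e.g. frozenset({"+voice", "+alveolar", "-continuant"})

LEXICON: dict[str, list[Bundle]] = {
    "do": [frozenset({"+voice", "+alveolar", "-continuant"}),
           frozenset({"+high", "+back", "+round"})],
    "to": [frozenset({"-voice", "+alveolar", "-continuant"}),
           frozenset({"+high", "+back", "+round"})],
}

def match_word(derived: list[Bundle]) -> str | None:
    """Compare feature bundles derived from the signal against each candidate;
    return the first lexical entry whose bundles all match, else None."""
    for word, stored in LEXICON.items():
        if len(stored) == len(derived) and all(s == d for s, d in zip(stored, derived)):
            return word
    return None

# Bundles "derived" from the acoustic landmarks of an utterance of "do":
observed = [frozenset({"+voice", "+alveolar", "-continuant"}),
            frozenset({"+high", "+back", "+round"})]
print(match_word(observed))   # -> "do"
```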
Fuzzy-logical model
The fuzzy logical theory of speech perception developed by Dominic Massaro proposes that people remember speech sounds in a probabilistic, or graded, way. It suggests that people remember descriptions of the perceptual units of language, called prototypes. Within each prototype various features may combine. However, features are not just binary (true or false); there is a fuzzy value corresponding to how likely it is that a sound belongs to a particular speech category. Thus, when perceiving a speech signal our decision about what we actually hear is based on the relative goodness of the match between the stimulus information and the values of particular prototypes. The final decision is based on multiple features or sources of information, even visual information (this explains the McGurk effect). Computer models of the fuzzy logical theory have been used to demonstrate that the theory's predictions of how speech sounds are categorized correspond to the behavior of human listeners.
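The fuzzy logical model's decision rule is commonly described as a relative goodness of match: the fuzzy support each source of information gives to a prototype is combined multiplicatively across sources and then normalized across prototypes. The support values below are invented; only the multiply-then-normalize rule is taken from descriptions of the FLMP.

```python
def flmp(support_by_source: dict[str, dict[str, float]]) -> dict[str, float]:
    """Combine fuzzy support values (0..1) for each prototype across sources
    multiplicatively, then normalize (relative goodness of match)."""
    prototypes = next(iter(support_by_source.values())).keys()
    combined = {}
    for proto in prototypes:
        product = 1.0
        for source in support_by_source.values():
            product *= source[proto]
        combined[proto] = product
    total = sum(combined.values())
    return {p: round(v / total, 2) for p, v in combined.items()}

# Invented support values: the audio weakly favours /da/ while the visual
# (lip-reading) information favours /ga/ -- the kind of conflict behind
# the McGurk effect mentioned above.
print(flmp({
    "auditory": {"/da/": 0.6, "/ga/": 0.4},
    "visual":   {"/da/": 0.2, "/ga/": 0.8},
}))
```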
Speech mode hypothesis
The speech mode hypothesis is the idea that the perception of speech requires the use of specialized mental processing. It is an offshoot of Fodor's modularity theory (see modularity of mind) and utilizes a vertical processing mechanism where limited stimuli are processed by special-purpose, stimulus-specific areas of the brain.
Two versions of speech mode hypothesis:
- Weak version – listening to speech engages previous knowledge of language.
- Strong version – listening to speech engages specialized speech mechanisms for perceiving speech.
Three important experimental paradigms have evolved in the search for evidence for the speech mode hypothesis: dichotic listening, categorical perception, and duplex perception. Research in these areas has found that there may not be a specific speech mode but instead one for auditory codes that require complicated auditory processing. It also seems that modularity is learned in perceptual systems. Despite this, the evidence and counter-evidence for the speech mode hypothesis remain unclear and need further research.
Direct realist theory
The direct realist theory of speech perception (mostly associated with Carol Fowler) is a part of the more general theory of direct realism, which postulates that perception allows us to have direct awareness of the world because it involves direct recovery of the distal source of the event that is perceived. For speech perception, the theory asserts that the objects of perception are actual vocal tract movements, or gestures, and not abstract phonemes or (as in the Motor Theory) events that are causally antecedent to these movements, i.e. intended gestures. Listeners perceive gestures not by means of a specialized decoder (as in the Motor Theory) but because information in the acoustic signal specifies the gestures that form it. By claiming that the actual articulatory gestures that produce different speech sounds are themselves the units of speech perception, the theory bypasses the problem of lack of invariance.
See also
- Related to the case study of Genie (feral child)
- Neurocomputational speech processing
- Multisensory integration
- Origin of speech
- Speech-Language Pathology
- Motor theory of speech perception
References
- Nygaard, L.C., Pisoni, D.B. (1995). "Speech Perception: New Directions in Research and Theory". In J.L. Miller; P.D. Eimas (eds.). Handbook of Perception and Cognition: Speech, Language, and Communication. San Diego: Academic Press.
- Klatt, D.H. (1976). "Linguistic uses of segmental duration in English: Acoustic and perceptual evidence". Journal of the Acoustical Society of America. 59 (5): 1208–1221. Bibcode:1976ASAJ...59.1208K. doi:10.1121/1.380986. PMID 956516.
- Halle, M., Mohanan, K.P. (1985). "Segmental phonology of modern English". Linguistic Inquiry. 16 (1): 57–116.
- Liberman, A.M. (1957). "Some results of research on speech perception" (PDF). Journal of the Acoustical Society of America. 29 (1): 117–123. Bibcode:1957ASAJ...29..117L. doi:10.1121/1.1908635. hdl:11858/00-001M-0000-002C-5789-A. Archived from the original (PDF) on 2016-03-03. Retrieved 2007-05-17.
- Fowler, C.A. (1995). "Speech production". In J.L. Miller; P.D. Eimas (eds.). Handbook of Perception and Cognition: Speech, Language, and Communication. San Diego: Academic Press.
- Hillenbrand, J.M., Clark, M.J., Nearey, T.M. (2001). "Effects of consonant environment on vowel formant patterns". Journal of the Acoustical Society of America. 109 (2): 748–763. Bibcode:2001ASAJ..109..748H. doi:10.1121/1.1337959. PMID 11248979. S2CID 10751216.
- Lisker, L., Abramson, A.S. (1967). "Some effects of context on voice onset time in English plosives" (PDF). Language and Speech. 10 (1): 1–28. doi:10.1177/002383096701000101. PMID 6044530. S2CID 34616732. Archived from the original (PDF) on 2016-03-03. Retrieved 2007-05-17.
- Hillenbrand, J., Getty, L.A., Clark, M.J., Wheeler, K. (1995). "Acoustic characteristics of American English vowels". Journal of the Acoustical Society of America. 97 (5 Pt 1): 3099–3111. Bibcode:1995ASAJ...97.3099H. doi:10.1121/1.411872. PMID 7759650. S2CID 10104073.
- Houston, Derek M.; Juscyk, Peter W. (October 2000). "The role of talker-specific information in word segmentation by infants" (PDF). Journal of Experimental Psychology: Human Perception and Performance. 26 (5): 1570–1582. doi:10.1037/0096-1523.26.5.1570. PMID 11039485. Archived from the original (PDF) on 2014-04-30. Retrieved 1 March 2012.
- Hay, Jennifer; Drager, Katie (2010). "Stuffed toys and speech perception". Linguistics. 48 (4): 865–892. doi:10.1515/LING.2010.027. S2CID 143639653.
- Syrdal, A.K.; Gopal, H.S. (1986). "A perceptual model of vowel recognition based on the auditory representation of American English vowels". Journal of the Acoustical Society of America. 79 (4): 1086–1100. Bibcode:1986ASAJ...79.1086S. doi:10.1121/1.393381. PMID 3700864.
- Strange, W. (1999). "Perception of vowels: Dynamic constancy". In J.M. Pickett (ed.). The Acoustics of Speech Communication: Fundamentals, Speech Perception Theory, and Technology. Needham Heights (MA): Allyn & Bacon.
- Johnson, K. (2005). "Speaker Normalization in speech perception" (PDF). In Pisoni, D.B.; Remez, R. (eds.). The Handbook of Speech Perception. Oxford: Blackwell Publishers. Retrieved 2007-05-17.
- Trubetzkoy, Nikolay S. (1969). Principles of phonology. Berkeley and Los Angeles: University of California Press. ISBN 978-0-520-01535-7.
- Iverson, P., Kuhl, P.K. (1995). "Mapping the perceptual magnet effect for speech using signal detection theory and multidimensional scaling". Journal of the Acoustical Society of America. 97 (1): 553–562. Bibcode:1995ASAJ...97..553I. doi:10.1121/1.412280. PMID 7860832.
- Lisker, L., Abramson, A.S. (1970). "The voicing dimension: Some experiments in comparative phonetics" (PDF). Proc. 6th International Congress of Phonetic Sciences. Prague: Academia. pp. 563–567. Archived from the original (PDF) on 2016-03-03. Retrieved 2007-05-17.
- Warren, R.M. (1970). "Restoration of missing speech sounds". Science. 167 (3917): 392–393. Bibcode:1970Sci...167..392W. doi:10.1126/science.167.3917.392. PMID 5409744. S2CID 30356740.
- Garnes, S., Bond, Z.S. (1976). "The relationship between acoustic information and semantic expectation". Phonologica 1976. Innsbruck. pp. 285–293.
- Jongman A, Wang Y, Kim BH (December 2003). "Contributions of semantic and facial information to perception of nonsibilant fricatives" (PDF). J. Speech Lang. Hear. Res. 46 (6): 1367–77. doi:10.1044/1092-4388(2003/106). hdl:1808/13411. PMID 14700361. Archived from the original (PDF) on 2013-06-14. Retrieved 2017-09-14.
- Cervantes Constantino, F; Simon, JZ (2018). "Restoration and Efficiency of the Neural Processing of Continuous Speech Are Promoted by Prior Knowledge". Frontiers in Systems Neuroscience. 12 (56): 56. doi:10.3389/fnsys.2018.00056. PMC 6220042. PMID 30429778.
- Poeppel, David; Monahan, Philip J. (2008). "Speech Perception: Cognitive Foundations and Cortical Implementation". Current Directions in Psychological Science. 17 (2): 80–85. doi:10.1111/j.1467-8721.2008.00553.x. ISSN 0963-7214. S2CID 18628411.
- Hickok G, Poeppel D (May 2007). "The cortical organization of speech processing". Nat. Rev. Neurosci. 8 (5): 393–402. doi:10.1038/nrn2113. PMID 17431404. S2CID 6199399.
- Hessler, Dorte; Jonkers, Bastiaanse (December 2010). "The influence of phonetic dimensions on aphasic speech perception". Clinical Linguistics and Phonetics. 12. 24 (12): 980–996. doi:10.3109/02699206.2010.507297. PMID 20887215. S2CID 26478503.
- "Definition of AGNOSIA". www.merriam-webster.com. Retrieved 2017-12-15.
- Howard, Harry (2017). "Welcome to Brain and Language". Welcome to Brain and Language.
- Lambert, J. (1999). "Auditory Agnosia with relative sparing of speech perception". Neurocase. 5 (5): 71–82. doi:10.1016/s0010-9452(89)80007-3. PMID 2707006.
- Rocha, Sofia; Amorim, José Manuel; Machado, Álvaro Alexandre; Ferreira, Carla Maria (2015-04-01). "Phonagnosia and Inability to Perceive Time Passage in Right Parietal Lobe Epilepsy". The Journal of Neuropsychiatry and Clinical Neurosciences. 27 (2): e154 – e155. doi:10.1176/appi.neuropsych.14040073. ISSN 0895-0172. PMID 25923865.
- Minagawa-Kawai, Y., Mori, K., Naoi, N., Kojima, S. (2006). "Neural Attunement Processes in Infants during the Acquisition of a Language-Specific Phonemic Contrast". The Journal of Neuroscience. 27 (2): 315–321. doi:10.1523/JNEUROSCI.1984-06.2007. PMC 6672067. PMID 17215392.
- Crystal, David (2005). The Cambridge Encyclopedia of Language. Cambridge: CUP. ISBN 978-0-521-55967-6.
- Kuhl, Patricia K.; Feng-Ming Tsao; Huei-Mei Liu (July 2003). "Foreign-language experience in infancy: Effects of short-term exposure and social interaction on phonetic learning". Proceedings of the National Academy of Sciences. 100 (15): 9096–9101. Bibcode:2003PNAS..100.9096K. doi:10.1073/pnas.1532872100. PMC 166444. PMID 12861072.
- Iverson, P., Kuhl, P.K., Akahane-Yamada, R., Diesh, E., Thokura, Y., Kettermann, A., Siebert, C. (2003). "A perceptual interference account of acquisition difficulties for non-native phonemes". Cognition. 89 (1): B47 – B57. doi:10.1016/S0010-0277(02)00198-1. PMID 12499111. S2CID 463529.
- Best, C. T. (1995). "A direct realist view of cross-language speech perception: New Directions in Research and Theory". In Winifred Strange (ed.). Speech perception and linguistic experience: Theoretical and methodological issues. Baltimore: York Press. pp. 171–204.
- Flege, J. (1995). "Second language speech learning: Theory, findings and problems". In Winifred Strange (ed.). Speech perception and linguistic experience: Theoretical and methodological issues. Baltimore: York Press. pp. 233–277.
- Uhler; Yoshinaga-Itano; Gabbard; Rothpletz; Jenkins (March 2011). "Infant speech perception in young cochlear implant users". Journal of the American Academy of Audiology. 22 (3): 129–142. doi:10.3766/jaaa.22.3.2. PMID 21545766.
- Csépe, V.; Osman-Sagi, J.; Molnar, M.; Gosy, M. (2001). "Impaired speech perception in aphasic patients: event-related potential and neuropsychological assessment". Neuropsychologia. 39 (11): 1194–1208. doi:10.1016/S0028-3932(01)00052-5. PMID 11527557. S2CID 17307242.
- Loizou, P. (1998). "Introduction to cochlear implants". IEEE Signal Processing Magazine. 39 (11): 101–130. doi:10.1109/79.708543.
- Deutsch, Diana; Henthorn, Trevor; Dolson, Mark (Spring 2004). "Speech patterns heard early in life influence later perception of the tritone paradox" (PDF). Music Perception. 21 (3): 357–72. doi:10.1525/mp.2004.21.3.357. Retrieved 29 April 2014.
- Marques, C.; et al. (2007). "Musicians detect pitch violation in foreign language better than nonmusicians: Behavioral and electrophysiological evidence". Journal of Cognitive Neuroscience. 19: 1453–1463.
- O'Callaghan, Casey (2010). "Experiencing Speech". Philosophical Issues. 20: 305–327. doi:10.1111/j.1533-6077.2010.00186.x.
- McClelland, J.L. & Elman, J.L. (1986). "The TRACE model of speech perception" (PDF). Cognitive Psychology. 18 (1): 1–86. doi:10.1016/0010-0285(86)90015-0. PMID 3753912. S2CID 7428866. Archived from the original (PDF) on 2007-04-21. Retrieved 2007-05-19.
- Kazanina, N., Phillips, C., Idsardi, W. (2006). "The influence of meaning on the perception of speech sounds" (PDF). PNAS. Vol. 30. pp. 11381–11386. Retrieved 2007-05-19.
- Gocken, J.M. & Fox R.A. (2001). "Neurological Evidence in Support of a Specialized Phonetic Processing Module". Brain and Language. 78 (2): 241–253. doi:10.1006/brln.2001.2467. PMID 11500073. S2CID 28469116.
- Dehaene-Lambertz, G.; Pallier, C.; Serniclaes, W.; Sprenger-Charolles, L.; Jobert, A.; Dehaene, S. (2005). "Neural correlates of switching from auditory to speech perception" (PDF). NeuroImage. 24 (1): 21–33. doi:10.1016/j.neuroimage.2004.09.039. PMID 15588593. S2CID 11899232. Retrieved 2007-07-04.
- Näätänen, R. (2001). "The perception of speech sounds by the human brain as reflected by the mismatch negativity (MMN) and its magnetic equivalent (MMNm)". Psychophysiology. 38 (1): 1–21. doi:10.1111/1469-8986.3810001. PMID 11321610.
- Liberman, A.M., Harris, K.S., Hoffman, H.S., Griffith, B.C. (1957). "The discrimination of speech sounds within and across phoneme boundaries" (PDF). Journal of Experimental Psychology. 54 (5): 358–368. doi:10.1037/h0044417. PMID 13481283. S2CID 10117886. Retrieved 2007-05-18.
- Liberman, A.M., Cooper, F.S., Shankweiler, D.P., & Studdert-Kennedy, M. (1967). "Perception of the speech code" (PDF). Psychological Review. 74 (6): 431–461. doi:10.1037/h0020279. PMID 4170865. Retrieved 2007-05-19.
- Liberman, A.M. (1970). "The grammars of speech and language" (PDF). Cognitive Psychology. 1 (4): 301–323. doi:10.1016/0010-0285(70)90018-6. Archived from the original (PDF) on 2015-12-31. Retrieved 2007-07-19.
- Liberman, A.M. & Mattingly, I.G. (1985). "The motor theory of speech perception revised" (PDF). Cognition. 21 (1): 1–36. CiteSeerX 10.1.1.330.220. doi:10.1016/0010-0277(85)90021-6. PMID 4075760. S2CID 112932. Archived from the original (PDF) on 2021-04-15. Retrieved 2007-07-19.
- Hayward, Katrina (2000). Experimental Phonetics: An Introduction. Harlow: Longman.
- Stevens, K.N. (2002). "Toward a model of lexical access based on acoustic landmarks and distinctive features" (PDF). Journal of the Acoustical Society of America. 111 (4): 1872–1891. Bibcode:2002ASAJ..111.1872S. doi:10.1121/1.1458026. PMID 12002871. Archived from the original (PDF) on 2007-06-09. Retrieved 2007-05-17.
- Massaro, D.W. (1989). "Testing between the TRACE Model and the Fuzzy Logical Model of Speech perception". Cognitive Psychology. 21 (3): 398–421. doi:10.1016/0010-0285(89)90014-5. PMID 2758786. S2CID 7629786.
- Oden, G.C., Massaro, D.W. (1978). "Integration of featural information in speech perception". Psychological Review. 85 (3): 172–191. doi:10.1037/0033-295X.85.3.172. PMID 663005.
- Ingram, John. C.L. (2007). Neurolinguistics: An Introduction to Spoken Language Processing and its Disorders. Cambridge: Cambridge University Press. pp. 113–127.
- Parker, Ellen M.; R.L. Diehl; K.R. Kluender (1986). "Trading Relations in Speech and Non-speech". Attention, Perception, & Psychophysics. 39 (2): 129–142. doi:10.3758/bf03211495. PMID 3725537.
- Randy L. Diehl; Andrew J. Lotto; Lori L. Holt (2004). "Speech perception". Annual Review of Psychology. 55 (1): 149–179. doi:10.1146/annurev.psych.55.090902.142028. PMID 14744213. S2CID 937985.
External links
- Dedicated issue of Philosophical Transactions B on the Perception of Speech. Some articles are freely available.
This article s lead section may be too short to adequately summarize the key points Please consider expanding the lead to provide an accessible overview of all important aspects of the article January 2021 Speech perception is the process by which the sounds of language are heard interpreted and understood The study of speech perception is closely linked to the fields of phonology and phonetics in linguistics and cognitive psychology and perception in psychology Research in speech perception seeks to understand how human listeners recognize speech sounds and use this information to understand spoken language Speech perception research has applications in building computer systems that can recognize speech in improving speech recognition for hearing and language impaired listeners and in foreign language teaching The process of perceiving speech begins at the level of the sound signal and the process of audition For a complete description of the process of audition see Hearing After processing the initial auditory signal speech sounds are further processed to extract acoustic cues and phonetic information This speech information can then be used for higher level language processes such as word recognition Acoustic cuesFigure 1 Spectrograms of syllables dee top dah middle and doo bottom showing how the onset formant transitions that define perceptually the consonant d differ depending on the identity of the following vowel Formants are highlighted by red dotted lines transitions are the bending beginnings of the formant trajectories Acoustic cues are sensory cues contained in the speech sound signal which are used in speech perception to differentiate speech sounds belonging to different phonetic categories For example one of the most studied cues in speech is voice onset time or VOT VOT is a primary cue signaling the difference between voiced and voiceless plosives such as b and p Other cues differentiate sounds that are produced at different places of articulation or manners of articulation The speech system must also combine these cues to determine the category of a specific speech sound This is often thought of in terms of abstract representations of phonemes These representations can then be combined for use in word recognition and other language processes It is not easy to identify what acoustic cues listeners are sensitive to when perceiving a particular speech sound At first glance the solution to the problem of how we perceive speech seems deceptively simple If one could identify stretches of the acoustic waveform that correspond to units of perception then the path from sound to meaning would be clear However this correspondence or mapping has proven extremely difficult to find even after some forty five years of research on the problem If a specific aspect of the acoustic waveform indicated one linguistic unit a series of tests using speech synthesizers would be sufficient to determine such a cue or cues However there are two significant obstacles One acoustic aspect of the speech signal may cue different linguistically relevant dimensions For example the duration of a vowel in English can indicate whether or not the vowel is stressed or whether it is in a syllable closed by a voiced or a voiceless consonant and in some cases like American English ɛ and ae it can distinguish the identity of vowels Some experts even argue that duration can help in distinguishing of what is traditionally called short and long vowels in English One linguistic unit can be cued by several acoustic 
- One linguistic unit can be cued by several acoustic properties. For example, in a classic experiment, Alvin Liberman (1957) showed that the onset formant transitions of /d/ differ depending on the identity of the following vowel (see Figure 1), but they are all interpreted as the phoneme /d/ by listeners.

Linearity and the segmentation problem
Figure 2: A spectrogram of the phrase "I owe you". There are no clearly distinguishable boundaries between speech sounds.
Although listeners perceive speech as a stream of discrete units[citation needed] (phonemes, syllables, and words), this linearity is difficult to see in the physical speech signal (see Figure 2 for an example). Speech sounds do not strictly follow one another; rather, they overlap. A speech sound is influenced by the ones that precede it and the ones that follow it. This influence can even be exerted at a distance of two or more segments, and across syllable and word boundaries.
Because the speech signal is not linear, there is a problem of segmentation. It is difficult to delimit a stretch of the speech signal as belonging to a single perceptual unit. As an example, the acoustic properties of the phoneme /d/ will depend on the production of the following vowel because of coarticulation.

Lack of invariance
The research and application of speech perception must deal with several problems which result from what has been termed the lack of invariance. Reliable, constant relations between a phoneme of a language and its acoustic manifestation in speech are difficult to find. There are several reasons for this:
- Context-induced variation. The phonetic environment affects the acoustic properties of speech sounds. For example, /u/ in English is fronted when surrounded by coronal consonants. Likewise, the voice onset time values marking the boundary between voiced and voiceless plosives differ for labial, alveolar, and velar plosives, and they shift under stress or depending on the position within a syllable.
- Variation due to differing speech conditions. One important factor that causes variation is differing speech rate. Many phonemic contrasts are constituted by temporal characteristics (short vs. long vowels or consonants, affricates vs. fricatives, plosives vs. glides, voiced vs. voiceless plosives, etc.), and they are certainly affected by changes in speaking tempo. Another major source of variation is articulatory carefulness versus sloppiness, which is typical for connected speech: articulatory "undershoot" is obviously reflected in the acoustic properties of the sounds produced.
- Variation due to different speaker identity. The resulting acoustic structure of concrete speech productions depends on the physical and psychological properties of individual speakers. Men, women, and children generally produce voices with different pitch. Because speakers have vocal tracts of different sizes (due to sex and age especially), the resonant frequencies (formants), which are important for the recognition of speech sounds, will vary in their absolute values across individuals (see Figure 3 for an illustration of this). Research shows that infants at the age of 7.5 months cannot recognize information presented by speakers of different genders; however, by the age of 10.5 months they can detect the similarities. Dialect and foreign accent can also cause variation, as can the social characteristics of the speaker and listener.
Perceptual constancy and normalization
Figure 3: The left panel shows the three peripheral American English vowels /i/, /ɑ/, and /u/ in a standard F1-by-F2 plot (in Hz). The mismatch between male, female, and child values is apparent. In the right panel, formant distances in Bark rather than absolute values are plotted, using the normalization procedure proposed by Syrdal and Gopal in 1986. Formant values are taken from Hillenbrand et al. (1995).
Despite the great variety of different speakers and different conditions, listeners perceive vowels and consonants as constant categories. It has been proposed that this is achieved by means of the perceptual normalization process, in which listeners filter out the noise (i.e., variation) to arrive at the underlying category. Vocal-tract size differences result in formant-frequency variation across speakers; therefore, a listener has to adjust his or her perceptual system to the acoustic characteristics of a particular speaker. This may be accomplished by considering the ratios of formants rather than their absolute values. This process has been called vocal tract normalization (see Figure 3 for an example). Similarly, listeners are believed to adjust their perception of duration to the current tempo of the speech they are listening to; this has been referred to as speech rate normalization.
Whether or not normalization actually takes place, and what its exact nature is, is a matter of theoretical controversy (see theories below). Perceptual constancy is a phenomenon not specific to speech perception; it exists in other types of perception too.
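The formant-ratio idea behind vocal tract normalization can be illustrated with a short sketch. The snippet below is a minimal illustration, not the Syrdal and Gopal procedure itself: it converts formant frequencies to the Bark scale using Traunmüller's (1990) approximation and describes a vowel by Bark-scale formant distances (F1 minus F0, F2 minus F1) rather than absolute Hertz values; the formant figures are invented round numbers, not measurements from Hillenbrand et al. (1995).

```python
import math

def hz_to_bark(f_hz: float) -> float:
    """Convert a frequency in Hz to the Bark scale (Traunmüller 1990 approximation)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_differences(f0: float, f1: float, f2: float) -> tuple:
    """Speaker-relative vowel description: Bark-scale F1-F0 and F2-F1 distances."""
    b0, b1, b2 = hz_to_bark(f0), hz_to_bark(f1), hz_to_bark(f2)
    return (b1 - b0, b2 - b1)

# Invented /i/-like tokens from an adult male and a child: the absolute formant
# values differ by hundreds of Hz, yet the tokens are heard as the same vowel.
male_i  = bark_differences(f0=120, f1=300, f2=2300)
child_i = bark_differences(f0=260, f1=430, f2=3200)

print(f"male  /i/: F1-F0 = {male_i[0]:.2f} Bark, F2-F1 = {male_i[1]:.2f} Bark")
print(f"child /i/: F1-F0 = {child_i[0]:.2f} Bark, F2-F1 = {child_i[1]:.2f} Bark")
# The distance-based (speaker-normalized) descriptions are far more comparable
# across talkers than the raw frequencies, which is the point of formant-ratio
# normalization.
```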
Categorical perception
Figure 4: Example identification (red) and discrimination (blue) functions.
Categorical perception is involved in processes of perceptual differentiation. People perceive speech sounds categorically; that is to say, they are more likely to notice the differences between categories (phonemes) than within categories. The perceptual space between categories is therefore warped, with the centers of categories (or "prototypes") working like a sieve or like magnets for incoming speech sounds.
In an artificial continuum between a voiceless and a voiced bilabial plosive, each new step differs from the preceding one in the amount of VOT. The first sound is a pre-voiced [b], i.e. it has a negative VOT. Then, increasing the VOT, it reaches zero, i.e. the plosive is a plain unaspirated voiceless [p]. Gradually, adding the same amount of VOT at each step, the plosive eventually becomes a strongly aspirated voiceless bilabial [pʰ]. (Such a continuum was used in an experiment by Lisker and Abramson in 1970; the sounds they used are available online.) In such a continuum of, for example, seven sounds, native English listeners will identify the first three sounds as /b/ and the last three sounds as /p/, with a clear boundary between the two categories. A two-alternative identification (or categorization) test will yield a discontinuous categorization function (see the red curve in Figure 4).
In tests of the ability to discriminate between two sounds with varying VOT values but a constant VOT distance from each other (20 ms, for instance), listeners are likely to perform at chance level if both sounds fall within the same category and at a nearly 100% level if each sound falls in a different category (see the blue discrimination curve in Figure 4).
The conclusion to draw from both the identification and the discrimination test is that listeners will have different sensitivity to the same relative increase in VOT depending on whether or not the boundary between categories was crossed. Similar perceptual adjustment is attested for other acoustic cues as well.
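The shape of those two curves can be sketched with a toy simulation. The snippet below is illustrative only and is not Lisker and Abramson's stimulus set: it assumes a seven-step VOT continuum, models identification as a logistic function of VOT, and derives a predicted two-step discrimination score from the identification probabilities (a listener is assumed to tell two tokens apart only when they are assigned to different categories, and to guess at chance otherwise), which reproduces the characteristic peak at the category boundary.

```python
import math

# Hypothetical 7-step VOT continuum (ms), roughly pre-voiced /b/ to aspirated /p/.
vot_steps = [-10, 0, 10, 20, 30, 40, 50]
BOUNDARY_MS = 25.0   # assumed category boundary
SLOPE = 0.35         # assumed steepness of the identification function

def p_identify_p(vot_ms: float) -> float:
    """Probability of labelling a token as /p/ (logistic identification function)."""
    return 1.0 / (1.0 + math.exp(-SLOPE * (vot_ms - BOUNDARY_MS)))

def p_discriminate(vot_a: float, vot_b: float) -> float:
    """Predicted discrimination of a pair: correct whenever the two tokens get
    different category labels, chance (0.5) whenever they get the same label."""
    pa, pb = p_identify_p(vot_a), p_identify_p(vot_b)
    p_different_labels = pa * (1 - pb) + pb * (1 - pa)
    return 0.5 + 0.5 * p_different_labels

print("VOT (ms)  P(/p/)  discrimination vs. next step")
for a, b in zip(vot_steps, vot_steps[1:]):
    print(f"{a:7.0f}   {p_identify_p(a):5.2f}   {p_discriminate(a, b):.2f}")
# Identification jumps abruptly near the boundary, and predicted discrimination
# peaks for pairs that straddle it: the signature of categorical perception.
```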
Top-down influences
In a classic experiment, Richard M. Warren (1970) replaced one phoneme of a word with a cough-like sound. Perceptually, his subjects restored the missing speech sound without any difficulty and could not accurately identify which phoneme had been disturbed, a phenomenon known as the phonemic restoration effect. Therefore, the process of speech perception is not necessarily uni-directional. Another basic experiment compared recognition of naturally spoken words within a phrase versus the same words in isolation, finding that perception accuracy usually drops in the latter condition. To probe the influence of semantic knowledge on perception, Garnes and Bond (1976) similarly used carrier sentences in which target words differed only in a single phoneme (bay/day/gay, for example) whose quality changed along a continuum. When put into different sentences that each naturally led to one interpretation, listeners tended to judge ambiguous words according to the meaning of the whole sentence. That is, higher-level language processes connected with morphology, syntax, or semantics may interact with basic speech perception processes to aid in the recognition of speech sounds.
It may be the case that it is not necessary, and maybe even not possible, for a listener to recognize phonemes before recognizing higher units, like words for example. After obtaining at least a fundamental piece of information about the phonemic structure of the perceived entity from the acoustic signal, listeners can compensate for missing or noise-masked phonemes using their knowledge of the spoken language. Compensatory mechanisms might even operate at the sentence level, such as in learned songs, phrases, and verses, an effect backed up by neural coding patterns consistent with the missed continuous speech fragments despite the lack of all relevant bottom-up sensory input.

Acquired language impairment
The first hypothesis of speech perception was used with patients who acquired an auditory comprehension deficit, also known as receptive aphasia. Since then, many disabilities have been classified, which resulted in a true definition of speech perception. The term "speech perception" describes the process of interest that employs sub-lexical contexts to the probe process. It consists of many different language and grammatical functions, such as: features, segments (phonemes), syllabic structure (unit of pronunciation), phonological word forms (how sounds are grouped together), grammatical features, morphemic prefixes and suffixes, and semantic information (the meaning of the words). In the early years, researchers were more interested in the acoustics of speech. For instance, they were looking at the differences between /ba/ and /da/, but now research has been directed toward the response in the brain to the stimuli. In recent years, a model has been developed to make sense of how speech perception works; this model is known as the dual stream model. This model has drastically changed how psychologists look at perception. The first section of the dual stream model is the ventral pathway. This pathway incorporates the middle temporal gyrus, the inferior temporal sulcus, and perhaps the inferior temporal gyrus. The ventral pathway maps phonological representations onto lexical or conceptual representations, which is the meaning of the words. The second section of the dual stream model is the dorsal pathway. This pathway includes the sylvian parieto-temporal area, the inferior frontal gyrus, the anterior insula, and the premotor cortex. Its primary function is to take the sensory or phonological stimuli and transfer them into an articulatory-motor representation (formation of speech).

Aphasia
Aphasia is an impairment of language processing caused by damage to the brain. Different parts of language processing are impacted depending on the area of the brain that is damaged, and aphasia is further classified based on the location of injury or constellation of symptoms. Damage to Broca's area of the brain often results in expressive aphasia, which manifests as impairment in speech production. Damage to Wernicke's area often results in receptive aphasia, where speech processing is impaired. Aphasia with impaired speech perception typically shows lesions or damage located in the left temporal or parietal lobes. Lexical and semantic difficulties are common, and comprehension may be affected.

Agnosia
Agnosia is "the loss or diminution of the ability to recognize familiar objects or stimuli, usually as a result of brain damage". There are several different kinds of agnosia that affect every one of our senses, but the two most commonly related to speech are speech agnosia and phonagnosia.
Speech agnosia
Pure word deafness, or speech agnosia, is an impairment in which a person maintains the ability to hear, produce speech, and even read speech, yet they are unable to understand or properly perceive speech. These patients seem to have all of the skills necessary in order to properly process speech, yet they appear to have no experience associated with speech stimuli. Patients have reported, "I can hear you talking, but I can't translate it". Even though they are physically receiving and processing the stimuli of speech, without the ability to determine the meaning of the speech they are essentially unable to perceive the speech at all. No treatments have been found, but from case studies and experiments it is known that speech agnosia is related to lesions in the left hemisphere or in both hemispheres, specifically right temporoparietal dysfunctions.
Phonagnosia
Phonagnosia is associated with the inability to recognize familiar voices. In these cases, speech stimuli can be heard and even understood, but the association of the speech with a certain voice is lost. This can be due to abnormal processing of complex vocal properties (timbre, articulation, and prosody), elements that distinguish an individual voice. There is no known treatment; however, there is a case report of an epileptic woman who began to experience phonagnosia along with other impairments. Her EEG and MRI results showed a right cortical parietal T2-hyperintense lesion without gadolinium enhancement and with discrete impairment of water molecule diffusion. So, although no treatment has been discovered, phonagnosia can be correlated with postictal parietal cortical dysfunction.

Infant speech perception
Infants begin the process of language acquisition by being able to detect very small differences between speech sounds. They can discriminate all possible speech contrasts (phonemes). Gradually, as they are exposed to their native language, their perception becomes language-specific, i.e. they learn how to ignore the differences within phonemic categories of the language, differences that may well be contrastive in other languages. For example, English distinguishes two voicing categories of plosives, whereas Thai has three; infants must learn which differences are distinctive in their native language and which are not. As infants learn how to sort incoming speech sounds into categories, ignoring irrelevant differences and reinforcing the contrastive ones, their perception becomes categorical. Infants learn to contrast different vowel phonemes of their native language by approximately 6 months of age. The native consonantal contrasts are acquired by 11 or 12 months of age. Some researchers have proposed that infants may be able to learn the sound categories of their native language through passive listening, using a process called statistical learning. Others even claim that certain sound categories are innate, that is, genetically specified (see the discussion about innate vs. acquired categorical distinctiveness).
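One way to make the statistical-learning idea concrete is distributional learning over a single phonetic cue. The sketch below is a toy illustration rather than a model of infant learning: it assumes unlabelled VOT values sampled from two hypothetical categories and fits a two-component Gaussian mixture (scikit-learn's GaussianMixture) to show how category-like structure can be recovered from exposure statistics alone, without any labels.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical exposure data: unlabelled VOT values (ms) drawn from two
# underlying categories, e.g. short-lag /b/-like and long-lag /p/-like tokens.
vot_exposure = np.concatenate([
    rng.normal(loc=5.0, scale=8.0, size=400),    # /b/-like tokens
    rng.normal(loc=45.0, scale=12.0, size=400),  # /p/-like tokens
]).reshape(-1, 1)

# Unsupervised "distributional learning": fit a two-component mixture to the
# raw cue distribution; no category labels are ever provided.
gmm = GaussianMixture(n_components=2, random_state=0).fit(vot_exposure)

means = np.sort(gmm.means_.ravel())
print(f"learned category centres: {means[0]:.1f} ms and {means[1]:.1f} ms")

# The learned posterior acts like an identification function over the cue.
test_vots = np.array([[0.0], [20.0], [40.0]])
long_lag_component = np.argmax(gmm.means_)
print("P(long-lag category) for 0, 20, 40 ms:",
      np.round(gmm.predict_proba(test_vots)[:, long_lag_component], 2))
```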
If day-old babies are presented with their mother's voice speaking normally, their mother's voice speaking abnormally (in monotone), and a stranger's voice, they react only to their mother's voice speaking normally. When a human and a non-human sound are played, babies turn their head only toward the source of the human sound. It has been suggested that auditory learning begins already in the pre-natal period.
One of the techniques used to examine how infants perceive speech, besides the head-turn procedure mentioned above, is measuring their sucking rate. In such an experiment, a baby sucks a special nipple while being presented with sounds. First, the baby's normal sucking rate is established. Then a stimulus is played repeatedly. When the baby hears the stimulus for the first time, the sucking rate increases, but as the baby becomes habituated to the stimulation, the sucking rate decreases and levels off. Then a new stimulus is played to the baby. If the baby perceives the newly introduced stimulus as different from the background stimulus, the sucking rate will show an increase. The sucking-rate and head-turn methods are some of the more traditional behavioral methods for studying speech perception. Among the newer methods (see Research methods below) that help us study speech perception, near-infrared spectroscopy is widely used with infants.
It has also been discovered that, even though infants' ability to distinguish between the different phonetic properties of various languages begins to decline around the age of nine months, it is possible to reverse this process by exposing them to a new language in a sufficient way. In a research study by Patricia K. Kuhl, Feng-Ming Tsao, and Huei-Mei Liu, it was discovered that if infants are spoken to and interacted with by a native speaker of Mandarin Chinese, they can be conditioned to retain their ability to distinguish speech sounds within Mandarin that are very different from the speech sounds found within the English language. This shows that, given the right conditions, it is possible to prevent infants' loss of the ability to distinguish speech sounds in languages other than their native language.

Cross-language and second-language speech perception
A large amount of research has studied how users of a language perceive foreign speech (referred to as cross-language speech perception) or second-language speech (second-language speech perception). The latter falls within the domain of second language acquisition.
Languages differ in their phonemic inventories. Naturally, this creates difficulties when a foreign language is encountered. For example, if two foreign-language sounds are assimilated to a single mother-tongue category, the difference between them will be very difficult to discern. A classic example of this situation is the observation that Japanese learners of English have problems identifying or distinguishing the English liquid consonants /l/ and /r/ (see Perception of English /r/ and /l/ by Japanese speakers).
Best (1995) proposed a Perceptual Assimilation Model, which describes possible cross-language category assimilation patterns and predicts their consequences. Flege (1995) formulated a Speech Learning Model, which combines several hypotheses about second-language (L2) speech acquisition and which predicts, in simple words, that an L2 sound that is not too similar to a native-language (L1) sound will be easier to acquire than an L2 sound that is relatively similar to an L1 sound, because it will be perceived as more obviously different by the learner.
In language or hearing impairment
Research into how people with language or hearing impairment perceive speech is not only intended to discover possible treatments. It can also provide insight into the principles underlying non-impaired speech perception. Two areas of research can serve as examples:
- Listeners with aphasia. Aphasia affects both the expression and the reception of language. The two most common types, expressive aphasia and receptive aphasia, both affect speech perception to some extent. Expressive aphasia causes moderate difficulties for language understanding; the effect of receptive aphasia on understanding is much more severe. It is agreed that aphasics suffer from perceptual deficits: they usually cannot fully distinguish place of articulation and voicing. As for other features, the difficulties vary. It has not yet been proven whether low-level speech perception skills are affected in aphasia sufferers or whether their difficulties are caused by higher-level impairment alone.
- Listeners with cochlear implants. Cochlear implantation restores access to the acoustic signal in individuals with sensorineural hearing loss. The acoustic information conveyed by an implant is usually sufficient for implant users to properly recognize the speech of people they know, even without visual cues. For cochlear implant users, it is more difficult to understand unknown speakers and sounds. The perceptual abilities of children who received an implant after the age of two are significantly better than those of people who were implanted in adulthood. A number of factors have been shown to influence perceptual performance, specifically: duration of deafness prior to implantation, age of onset of deafness, age at implantation (such age effects may be related to the critical period hypothesis), and duration of implant use. There are differences between children with congenital and acquired deafness. Postlingually deaf children have better results than prelingually deaf children and adapt to a cochlear implant faster. In both children with cochlear implants and children with normal hearing, vowels and voice onset time become prevalent in development before the ability to discriminate place of articulation. Several months following implantation, children with cochlear implants can normalize speech perception.

Noise
One of the fundamental problems in the study of speech is how to deal with noise. This is shown by the difficulty that computer recognition systems have with recognizing human speech. While these systems can do well at recognizing speech if trained on a specific speaker's voice and under quiet conditions, they often do poorly in more realistic listening situations where humans would understand speech with relative ease. To emulate the processing patterns that would be held in the brain under normal conditions, prior knowledge is a key neural factor, since a robust learning history may, to an extent, override the extreme masking effects involved in the complete absence of continuous speech signals.

Music-language connection
Research into the relationship between music and cognition is an emerging field related to the study of speech perception. Originally it was theorized that the neural signals for music were processed in a specialized "module" in the right hemisphere of the brain, while the neural signals for language were processed by a similar "module" in the left hemisphere. However, using technologies such as fMRI machines, research has shown that two regions of the brain traditionally considered exclusively to process speech, Broca's and Wernicke's areas, also become active during musical activities such as listening to a sequence of musical chords. Other studies, such as one performed by Marques et al. in 2006, showed that 8-year-olds who were given six months of musical training showed an increase both in their pitch-detection performance and in their electrophysiological measures when made to listen to an unknown foreign language.
Conversely, some research has revealed that, rather than music affecting our perception of speech, our native speech can affect our perception of music. One example is the tritone paradox, in which a listener is presented with two computer-generated tones (such as C and F-sharp) that are half an octave (a tritone) apart and is then asked to determine whether the pitch of the sequence is descending or ascending. One such study, performed by Diana Deutsch, found that the listener's interpretation of ascending or descending pitch was influenced by the listener's language or dialect, showing variation between those raised in the south of England and those in California, and between those from Vietnam and those in California whose native language was English. A second study, performed in 2006 on a group of English speakers and three groups of East Asian students at the University of Southern California, discovered that English speakers who had begun musical training at or before age five had an 8% chance of having perfect pitch.
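The "computer-generated tones" used in tritone-paradox experiments are octave-ambiguous (Shepard-style) tones. The sketch below is only a rough illustration of how such a stimulus pair could be built; the sample rate, spectral envelope, and frequencies are arbitrary choices for demonstration, not Deutsch's actual stimuli.

```python
import numpy as np

SR = 22050          # sample rate (Hz), an arbitrary choice
DUR = 0.5           # tone duration (s)

def shepard_tone(pitch_class_hz: float, n_octaves: int = 6) -> np.ndarray:
    """Octave-ambiguous tone: octave-spaced sinusoids under a fixed bell-shaped
    spectral envelope, so the pitch class is clear but the octave is not."""
    t = np.arange(int(SR * DUR)) / SR
    centre = np.log2(440.0)                 # envelope centred near A4 (assumption)
    tone = np.zeros_like(t)
    for k in range(n_octaves):
        f = pitch_class_hz * 2 ** k
        weight = np.exp(-0.5 * ((np.log2(f) - centre) / 1.5) ** 2)
        tone += weight * np.sin(2 * np.pi * f * t)
    return tone / np.abs(tone).max()

# Two tones a tritone (half an octave) apart, e.g. C followed by F-sharp.
c_tone  = shepard_tone(65.41)               # C pitch class
fs_tone = shepard_tone(65.41 * 2 ** 0.5)    # F-sharp, half an octave away
pair = np.concatenate([c_tone, np.zeros(int(0.1 * SR)), fs_tone])
# Whether this C -> F-sharp pair is heard as rising or falling is ambiguous by
# design; Deutsch's finding is that listeners' judgments pattern with their
# language background.
```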
Speech phenomenology
The experience of speech
Casey O'Callaghan, in his article "Experiencing Speech", analyzes whether the perceptual experience of listening to speech differs in phenomenal character with regard to understanding the language being heard. He argues that an individual's experience when hearing a language they comprehend, as opposed to their experience when hearing a language they have no knowledge of, displays a difference in phenomenal features, which he defines as aspects of what an experience is like for an individual.
If a subject who is a monolingual native English speaker is presented with a stimulus of speech in German, the string of phonemes will appear as mere sounds and will produce a very different experience than if exactly the same stimulus were presented to a subject who speaks German.
He also examines how speech perception changes when one is learning a language. If a subject with no knowledge of the Japanese language were presented with a stimulus of Japanese speech, and then given the exact same stimulus after being taught Japanese, this same individual would have an extremely different experience.

Research methods
The methods used in speech perception research can be roughly divided into three groups: behavioral, computational, and, more recently, neurophysiological methods.
Behavioral methods
Behavioral experiments are based on an active role of the participant, i.e. subjects are presented with stimuli and asked to make conscious decisions about them. This can take the form of an identification test, a discrimination test, similarity rating, etc. These types of experiments help to provide a basic description of how listeners perceive and categorize speech sounds.
Sinewave speech
Speech perception has also been analyzed through sinewave speech, a form of synthetic speech where the human voice is replaced by sine waves that mimic the frequencies and amplitudes present in the original speech. When subjects are first presented with this speech, the sinewave speech is interpreted as random noises. But when the subjects are informed that the stimuli actually are speech and are told what is being said, a distinctive, nearly immediate shift occurs in how the sinewave speech is perceived.
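The core idea of sinewave speech, replacing the formants with time-varying sinusoids, can be sketched in a few lines. The snippet below is a simplified illustration, not a faithful replica of research stimuli: it assumes three hypothetical formant tracks for a short utterance and synthesizes one sinusoid per track, each following that track's frequency (amplitude contours, which real sinewave speech also copies from the original utterance, are held constant here for brevity).

```python
import numpy as np

SR = 16000                       # sample rate (Hz)
DUR = 0.6                        # duration of the toy utterance (s)
t = np.arange(int(SR * DUR)) / SR

# Hypothetical formant tracks (Hz): three smoothly moving "formants", e.g. a
# rough /da/-like transition. Real sinewave speech uses tracks measured from a
# natural utterance.
f1 = np.linspace(300, 700, t.size)      # F1 rising
f2 = np.linspace(2200, 1200, t.size)    # F2 falling
f3 = np.full(t.size, 2600.0)            # F3 roughly steady

def sine_from_track(freq_track: np.ndarray) -> np.ndarray:
    """Sinusoid whose instantaneous frequency follows a formant track."""
    phase = 2 * np.pi * np.cumsum(freq_track) / SR
    return np.sin(phase)

# Three time-varying sinusoids stand in for the three lowest formants; summed
# together they form a crude sinewave-speech token.
sinewave_speech = (sine_from_track(f1) + sine_from_track(f2) + sine_from_track(f3)) / 3.0
```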
Computational methods
Computational modeling has also been used to simulate how speech may be processed by the brain to produce the behaviors that are observed. Computer models have been used to address several questions in speech perception, including how the sound signal itself is processed to extract the acoustic cues used in speech, and how speech information is used for higher-level processes, such as word recognition.
Neurophysiological methods
Neurophysiological methods rely on information stemming from more direct and not necessarily conscious (pre-attentive) processes. Subjects are presented with speech stimuli in different types of tasks, and the responses of the brain are measured. The brain itself can be more sensitive than it appears to be through behavioral responses. For example, a subject may not show sensitivity to the difference between two speech sounds in a discrimination test, but brain responses may reveal sensitivity to these differences. Methods used to measure neural responses to speech include event-related potentials, magnetoencephalography, and near-infrared spectroscopy. One important response used with event-related potentials is the mismatch negativity, which occurs when speech stimuli are acoustically different from a stimulus that the subject heard previously.
Neurophysiological methods were introduced into speech perception research for several reasons. Behavioral responses may reflect late, conscious processes and be affected by other systems such as orthography, and thus they may mask a speaker's ability to recognize sounds based on lower-level acoustic distributions. Without the necessity of taking an active part in the test, even infants can be tested; this feature is crucial in research into acquisition processes. The possibility of observing low-level auditory processes independently from higher-level ones makes it possible to address long-standing theoretical issues, such as whether or not humans possess a specialized module for perceiving speech, or whether or not some complex acoustic invariance (see lack of invariance above) underlies the recognition of a speech sound.
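As an illustration of how a mismatch response is typically quantified, the sketch below computes a deviant-minus-standard difference wave from already-epoched data. It is a minimal sketch using synthetic numbers and a single channel; real analyses involve filtering, artifact rejection, and baseline correction, and would normally rely on a dedicated EEG/MEG package.

```python
import numpy as np

SR = 250                                   # sampling rate (Hz)
N_SAMPLES = int(0.5 * SR)                  # 500 ms epochs
times_ms = np.arange(N_SAMPLES) / SR * 1000

rng = np.random.default_rng(1)

# Synthetic epochs (trials x samples) for one electrode: the "deviant" epochs
# contain an extra negative deflection around 150-250 ms to mimic an MMN.
standard = rng.normal(0, 2.0, size=(200, N_SAMPLES))
deviant = rng.normal(0, 2.0, size=(50, N_SAMPLES))
mmn_window = (times_ms > 150) & (times_ms < 250)
deviant[:, mmn_window] -= 1.5              # simulated mismatch negativity

# Average over trials, then subtract: the MMN is read off the difference wave.
difference_wave = deviant.mean(axis=0) - standard.mean(axis=0)
peak_idx = difference_wave.argmin()
print(f"most negative difference: {difference_wave[peak_idx]:.2f} uV "
      f"at {times_ms[peak_idx]:.0f} ms")
```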
Theories
Motor theory
Some of the earliest work in the study of how humans perceive speech sounds was conducted by Alvin Liberman and his colleagues at Haskins Laboratories. Using a speech synthesizer, they constructed speech sounds that varied in place of articulation along a continuum from /bɑ/ to /dɑ/ to /ɡɑ/. Listeners were asked to identify which sound they heard and to discriminate between two different sounds. The results of the experiment showed that listeners grouped sounds into discrete categories, even though the sounds they were hearing varied continuously. Based on these results, they proposed the notion of categorical perception as a mechanism by which humans can identify speech sounds. More recent research using different tasks and methods suggests that listeners are highly sensitive to acoustic differences within a single phonetic category, contrary to a strict categorical account of speech perception.
To provide a theoretical account of the categorical perception data, Liberman and colleagues worked out the motor theory of speech perception, where the complicated articulatory encoding was assumed to be decoded in the perception of speech by the same processes that are involved in production (this is referred to as analysis-by-synthesis). For instance, the English consonant /d/ may vary in its acoustic details across different phonetic contexts (see above), yet all /d/'s as perceived by a listener fall within one category (voiced alveolar plosive), and that is because linguistic representations are abstract, canonical phonetic segments or the gestures that underlie these segments. When describing units of perception, Liberman later abandoned articulatory movements and proceeded to the neural commands to the articulators, and even later to intended articulatory gestures; thus the neural representation of the utterance that determines the speaker's production is the distal object the listener perceives. The theory is closely related to the modularity hypothesis, which proposes the existence of a special-purpose module that is supposed to be innate and probably human-specific.
The theory has been criticized for not being able to provide an account of just how acoustic signals are translated into intended gestures by listeners. Furthermore, it is unclear how indexical information (e.g. talker identity) is encoded and decoded along with linguistically relevant information.

Exemplar theory
Exemplar models of speech perception differ from the theories mentioned above, which suppose that there is no connection between word recognition and talker recognition and that the variation across talkers is "noise" to be filtered out.
The exemplar-based approaches claim that listeners store information for both word recognition and talker recognition. According to this theory, particular instances of speech sounds are stored in the memory of a listener. In the process of speech perception, the remembered instances of, for example, a syllable stored in the listener's memory are compared with the incoming stimulus so that the stimulus can be categorized. Similarly, when recognizing a talker, all the memory traces of utterances produced by that talker are activated and the talker's identity is determined. Supporting this theory are several experiments reported by Johnson, which suggest that our signal identification is more accurate when we are familiar with the talker or when we have a visual representation of the talker's gender. When the talker is unpredictable or the sex is misidentified, the error rate in word identification is much higher.
The exemplar models have to face several objections, two of which are (1) insufficient memory capacity to store every utterance ever heard and, (2) concerning the ability to produce what was heard, whether the talker's own articulatory gestures are also stored or computed when producing utterances that would sound like the auditory memories.
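A minimal sketch of exemplar-style categorization is shown below. It is illustrative only, not a published exemplar model: the stored tokens are hypothetical (F1, F2) values remembered together with both a vowel category and a talker label, and a new token is classified by summing its similarity to every stored exemplar, so the same memory store supports both category and talker decisions.

```python
import math
from collections import defaultdict

# Hypothetical exemplar memory: (F1 Hz, F2 Hz) tokens, each remembered together
# with its vowel category and its talker.
exemplars = [
    ((310, 2200), "i", "talker_A"), ((330, 2250), "i", "talker_B"),
    ((320, 2150), "i", "talker_A"), ((700, 1100), "a", "talker_A"),
    ((730, 1050), "a", "talker_B"), ((680, 1150), "a", "talker_B"),
]

def similarity(x, y, c=0.005):
    """Exponential similarity that decays with the distance between tokens."""
    return math.exp(-c * math.dist(x, y))

def categorize(token, label_index):
    """Sum similarity to all exemplars sharing each label (0 = vowel, 1 = talker)."""
    support = defaultdict(float)
    for stored, vowel, talker in exemplars:
        label = (vowel, talker)[label_index]
        support[label] += similarity(token, stored)
    total = sum(support.values())
    return {label: round(s / total, 2) for label, s in support.items()}

new_token = (340, 2100)   # an incoming vowel token
print("vowel decision: ", categorize(new_token, 0))   # category recognition
print("talker decision:", categorize(new_token, 1))   # talker recognition, same store
```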
Acoustic landmarks and distinctive features
Kenneth N. Stevens proposed acoustic landmarks and distinctive features as a relation between phonological features and auditory properties. According to this view, listeners inspect the incoming signal for so-called acoustic landmarks, which are particular events in the spectrum carrying information about the gestures which produced them. Since these gestures are limited by the capacities of humans' articulators, and listeners are sensitive to their auditory correlates, the lack of invariance simply does not exist in this model. The acoustic properties of the landmarks constitute the basis for establishing the distinctive features. Bundles of them uniquely specify phonetic segments (phonemes, syllables, words).
In this model, the incoming acoustic signal is believed to be first processed to determine the so-called landmarks, which are special spectral events in the signal: for example, vowels are typically marked by a higher frequency of the first formant, and consonants can be specified as discontinuities in the signal and have lower amplitudes in the lower and middle regions of the spectrum. These acoustic features result from articulation. In fact, secondary articulatory movements may be used when enhancement of the landmarks is needed due to external conditions such as noise. Stevens claims that coarticulation causes only limited and, moreover, systematic and thus predictable variation in the signal, which the listener is able to deal with. Within this model, therefore, what is called the lack of invariance is simply claimed not to exist.
Landmarks are analyzed to determine certain articulatory events (gestures) which are connected with them. In the next stage, acoustic cues are extracted from the signal in the vicinity of the landmarks by means of mental measuring of certain parameters, such as the frequencies of spectral peaks, amplitudes in the low-frequency region, or timing.
The next processing stage comprises acoustic-cue consolidation and the derivation of distinctive features. These are binary categories related to articulation (for example [+/- high], [+/- back], [+/- round] (lips) for vowels; [+/- sonorant], [+/- lateral], or [+/- nasal] for consonants).
Bundles of these features uniquely identify speech segments (phonemes, syllables, words). These segments are part of the lexicon stored in the listener's memory. Its units are activated in the process of lexical access and mapped onto the original signal to find out whether they match. If not, another attempt with a different candidate pattern is made. In this iterative fashion, listeners thus reconstruct the articulatory events which were necessary to produce the perceived speech signal. This can therefore be described as analysis-by-synthesis.
This theory thus posits that the distal object of speech perception is the articulatory gestures underlying speech. Listeners make sense of the speech signal by referring to them. The model belongs to those referred to as analysis-by-synthesis.
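A toy version of the first processing stage, locating candidate landmarks as abrupt spectral-energy changes, might look like the sketch below. It is only a rough illustration of the idea, not Stevens' actual lexical-access model: it tracks short-time energy in a low-frequency band and flags frames where that energy rises or falls sharply, the kind of discontinuity that consonantal landmarks produce.

```python
import numpy as np
from scipy.signal import butter, sosfilt

SR = 16000
FRAME = int(0.010 * SR)            # 10 ms frames

def candidate_landmarks(signal: np.ndarray, jump_db: float = 9.0) -> np.ndarray:
    """Frame indices where low-band energy changes abruptly between frames."""
    # Band-limit to roughly the F1 region (200-1000 Hz) before measuring energy.
    sos = butter(4, [200, 1000], btype="bandpass", fs=SR, output="sos")
    low_band = sosfilt(sos, signal)
    n_frames = len(low_band) // FRAME
    frames = low_band[: n_frames * FRAME].reshape(n_frames, FRAME)
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-10)
    # A large frame-to-frame rise or fall is treated as a candidate landmark.
    return np.where(np.abs(np.diff(energy_db)) > jump_db)[0] + 1

# Synthetic demo signal: silence, a vowel-like tone burst, then silence again,
# giving an abrupt onset and offset for the detector to find.
t = np.arange(int(0.3 * SR)) / SR
demo = np.zeros_like(t)
voiced = (t > 0.1) & (t < 0.2)
demo[voiced] = 0.5 * np.sin(2 * np.pi * 300 * t[voiced])
print("candidate landmark frames:", candidate_landmarks(demo))
```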
Fuzzy-logical model
The fuzzy-logical theory of speech perception, developed by Dominic Massaro, proposes that people remember speech sounds in a probabilistic, or graded, way. It suggests that people remember descriptions of the perceptual units of language, called prototypes. Within each prototype, various features may combine. However, features are not just binary (true or false); there is a fuzzy value corresponding to how likely it is that a sound belongs to a particular speech category. Thus, when perceiving a speech signal, our decision about what we actually hear is based on the relative goodness of the match between the stimulus information and the values of particular prototypes. The final decision is based on multiple features or sources of information, even visual information (this explains the McGurk effect). Computer models of the fuzzy-logical theory have been used to demonstrate that the theory's predictions of how speech sounds are categorized correspond to the behavior of human listeners.
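The way the model combines several graded sources of evidence can be illustrated with a two-alternative sketch. The snippet below is a simplified illustration of fuzzy-logical-style integration, with invented feature values: the auditory and visual support for /ba/ are each expressed as fuzzy truth values between 0 and 1, combined multiplicatively, and normalized against the support for a competing /da/ prototype. This is how an ambiguous acoustic token paired with clear /da/-like lip information can tip the decision (the McGurk-style situation).

```python
def flmp_two_alternatives(auditory_ba: float, visual_ba: float) -> float:
    """Fuzzy-logical style integration for a /ba/ vs /da/ decision.
    Inputs are fuzzy truth values (0-1) for how well each source supports /ba/;
    support for /da/ is taken as the complement. Sources are combined
    multiplicatively and then normalized."""
    support_ba = auditory_ba * visual_ba
    support_da = (1.0 - auditory_ba) * (1.0 - visual_ba)
    return support_ba / (support_ba + support_da)

# Invented feature values:
# clear audio and clear video that agree -> near-certain /ba/
print(flmp_two_alternatives(auditory_ba=0.9, visual_ba=0.9))   # ~0.99
# ambiguous audio plus /da/-like lip movements -> the visual source dominates
print(flmp_two_alternatives(auditory_ba=0.5, visual_ba=0.1))   # ~0.10
```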
Speech mode hypothesis
The speech mode hypothesis is the idea that the perception of speech requires the use of specialized mental processing. It is an offshoot of Fodor's modularity theory (see modularity of mind). It utilizes a vertical processing mechanism in which limited stimuli are processed by special-purpose, stimulus-specific areas of the brain.
There are two versions of the speech mode hypothesis:
- Weak version: listening to speech engages previous knowledge of language.
- Strong version: listening to speech engages specialized speech mechanisms for perceiving speech.
Three important experimental paradigms have evolved in the search for evidence for the speech mode hypothesis. These are dichotic listening, categorical perception, and duplex perception. Through research in these categories it has been found that there may not be a specific speech mode but instead one for auditory codes that require complicated auditory processing. It also seems that modularity is learned in perceptual systems. Despite this, the evidence and counter-evidence for the speech mode hypothesis are still unclear and need further research.

Direct realist theory
The direct realist theory of speech perception (mostly associated with Carol Fowler) is a part of the more general theory of direct realism, which postulates that perception allows us to have direct awareness of the world because it involves direct recovery of the distal source of the event that is perceived. For speech perception, the theory asserts that the objects of perception are actual vocal tract movements, or gestures, and not abstract phonemes or (as in the motor theory) events that are causally antecedent to these movements, i.e. intended gestures. Listeners perceive gestures not by means of a specialized decoder (as in the motor theory) but because information in the acoustic signal specifies the gestures that form it. By claiming that the actual articulatory gestures that produce different speech sounds are themselves the units of speech perception, the theory bypasses the problem of the lack of invariance.

See also
- Related to the case study of Genie (feral child)
- Neurocomputational speech processing
- Multisensory integration
- Origin of speech
- Speech-language pathology
- Motor theory of speech perception

References
- Nygaard, L. C.; Pisoni, D. B. (1995). "Speech Perception: New Directions in Research and Theory". In J. L. Miller & P. D. Eimas (eds.), Handbook of Perception and Cognition: Speech, Language, and Communication. San Diego: Academic Press.
- Klatt, D. H. (1976). "Linguistic uses of segmental duration in English: Acoustic and perceptual evidence". Journal of the Acoustical Society of America 59(5): 1208–1221. doi:10.1121/1.380986.
- Halle, M.; Mohanan, K. P. (1985). "Segmental phonology of modern English". Linguistic Inquiry 16(1): 57–116.
- Liberman, A. M. (1957). "Some results of research on speech perception". Journal of the Acoustical Society of America 29(1): 117–123. doi:10.1121/1.1908635.
- Fowler, C. A. (1995). "Speech production". In J. L. Miller & P. D. Eimas (eds.), Handbook of Perception and Cognition: Speech, Language, and Communication. San Diego: Academic Press.
- Hillenbrand, J. M.; Clark, M. J.; Nearey, T. M. (2001). "Effects of consonant environment on vowel formant patterns". Journal of the Acoustical Society of America 109(2): 748–763. doi:10.1121/1.1337959.
- Lisker, L.; Abramson, A. S. (1967). "Some effects of context on voice onset time in English plosives". Language and Speech 10(1): 1–28. doi:10.1177/002383096701000101.
- Hillenbrand, J.; Getty, L. A.; Clark, M. J.; Wheeler, K. (1995). "Acoustic characteristics of American English vowels". Journal of the Acoustical Society of America 97(5 Pt 1): 3099–3111. doi:10.1121/1.411872.
- Houston, D. M.; Jusczyk, P. W. (2000). "The role of talker-specific information in word segmentation by infants". Journal of Experimental Psychology: Human Perception and Performance 26(5): 1570–1582. doi:10.1037/0096-1523.26.5.1570.
- Hay, J.; Drager, K. (2010). "Stuffed toys and speech perception". Linguistics 48(4): 865–892. doi:10.1515/LING.2010.027.
- Syrdal, A. K.; Gopal, H. S. (1986). "A perceptual model of vowel recognition based on the auditory representation of American English vowels". Journal of the Acoustical Society of America 79(4): 1086–1100. doi:10.1121/1.393381.
- Strange, W. (1999). "Perception of vowels: Dynamic constancy". In J. M. Pickett (ed.), The Acoustics of Speech Communication: Fundamentals, Speech Perception Theory, and Technology. Needham Heights, MA: Allyn & Bacon.
- Johnson, K. (2005). "Speaker normalization in speech perception". In D. B. Pisoni & R. Remez (eds.), The Handbook of Speech Perception. Oxford: Blackwell Publishers.
- Trubetzkoy, N. S. (1969). Principles of Phonology. Berkeley and Los Angeles: University of California Press. ISBN 978-0-520-01535-7.
- Iverson, P.; Kuhl, P. K. (1995). "Mapping the perceptual magnet effect for speech using signal detection theory and multidimensional scaling". Journal of the Acoustical Society of America 97(1): 553–562. doi:10.1121/1.412280.
- Lisker, L.; Abramson, A. S. (1970). "The voicing dimension: Some experiments in comparative phonetics". In Proceedings of the 6th International Congress of Phonetic Sciences. Prague: Academia. pp. 563–567.
- Warren, R. M. (1970). "Restoration of missing speech sounds". Science 167(3917): 392–393. doi:10.1126/science.167.3917.392.
- Garnes, S.; Bond, Z. S. (1976). "The relationship between acoustic information and semantic expectation". In Phonologica 1976. Innsbruck. pp. 285–293.
- Jongman, A.; Wang, Y.; Kim, B. H. (2003). "Contributions of semantic and facial information to perception of nonsibilant fricatives". Journal of Speech, Language, and Hearing Research 46(6): 1367–1377. doi:10.1044/1092-4388(2003/106).
- Cervantes Constantino, F.; Simon, J. Z. (2018). "Restoration and efficiency of the neural processing of continuous speech are promoted by prior knowledge". Frontiers in Systems Neuroscience 12: 56. doi:10.3389/fnsys.2018.00056.
- Poeppel, D.; Monahan, P. J. (2008). "Speech perception: Cognitive foundations and cortical implementation". Current Directions in Psychological Science 17(2): 80–85. doi:10.1111/j.1467-8721.2008.00553.x.
- Hickok, G.; Poeppel, D. (2007). "The cortical organization of speech processing". Nature Reviews Neuroscience 8(5): 393–402. doi:10.1038/nrn2113.
- Hessler, D.; Jonkers; Bastiaanse (2010). "The influence of phonetic dimensions on aphasic speech perception". Clinical Linguistics and Phonetics 24(12): 980–996. doi:10.3109/02699206.2010.507297.
- "Definition of agnosia". www.merriam-webster.com. Retrieved 2017-12-15.
- Howard, Harry (2017). "Welcome to Brain and Language".
- Lambert, J. (1999). "Auditory agnosia with relative sparing of speech perception". Neurocase 5(5): 71–82. doi:10.1016/s0010-9452(89)80007-3.
- Rocha, S.; Amorim, J. M.; Machado, A. A.; Ferreira, C. M. (2015). "Phonagnosia and inability to perceive time passage in right parietal lobe epilepsy". The Journal of Neuropsychiatry and Clinical Neurosciences 27(2): e154–e155. doi:10.1176/appi.neuropsych.14040073.
- Minagawa-Kawai, Y.; Mori, K.; Naoi, N.; Kojima, S. (2006). "Neural attunement processes in infants during the acquisition of a language-specific phonemic contrast". The Journal of Neuroscience 27(2): 315–321. doi:10.1523/JNEUROSCI.1984-06.2007.
- Crystal, David (2005). The Cambridge Encyclopedia of Language. Cambridge: CUP. ISBN 978-0-521-55967-6.
- Kuhl, P. K.; Tsao, F.-M.; Liu, H.-M. (2003). "Foreign-language experience in infancy: Effects of short-term exposure and social interaction on phonetic learning". Proceedings of the National Academy of Sciences 100(15): 9096–9101. doi:10.1073/pnas.1532872100.
- Iverson, P.; Kuhl, P. K.; Akahane-Yamada, R.; Diesch, E.; Tohkura, Y.; Kettermann, A.; Siebert, C. (2003). "A perceptual interference account of acquisition difficulties for non-native phonemes". Cognition 89(1): B47–B57. doi:10.1016/S0010-0277(02)00198-1.
- Best, C. T. (1995). "A direct realist view of cross-language speech perception". In Winifred Strange (ed.), Speech Perception and Linguistic Experience: Theoretical and Methodological Issues. Baltimore: York Press. pp. 171–204.
- Flege, J. (1995). "Second language speech learning: Theory, findings and problems". In Winifred Strange (ed.), Speech Perception and Linguistic Experience: Theoretical and Methodological Issues. Baltimore: York Press. pp. 233–277.
- Uhler; Yoshinaga-Itano; Gabbard; Rothpletz; Jenkins (2011). "Infant speech perception in young cochlear implant users". Journal of the American Academy of Audiology 22(3): 129–142. doi:10.3766/jaaa.22.3.2.
- Csepe, V.; Osman-Sagi, J.; Molnar, M.; Gosy, M. (2001). "Impaired speech perception in aphasic patients: Event-related potential and neuropsychological assessment". Neuropsychologia 39(11): 1194–1208. doi:10.1016/S0028-3932(01)00052-5.
- Loizou, P. (1998). "Introduction to cochlear implants". IEEE Signal Processing Magazine: 101–130. doi:10.1109/79.708543.
- Deutsch, D.; Henthorn, T.; Dolson, M. (2004). "Speech patterns heard early in life influence later perception of the tritone paradox". Music Perception 21(3): 357–372. doi:10.1525/mp.2004.21.3.357.
- Marques, C. et al. (2007). "Musicians detect pitch violation in a foreign language better than nonmusicians: Behavioral and electrophysiological evidence". Journal of Cognitive Neuroscience 19: 1453–1463.
- O'Callaghan, Casey (2010). "Experiencing Speech". Philosophical Issues 20: 305–327. doi:10.1111/j.1533-6077.2010.00186.x.
- McClelland, J. L.; Elman, J. L. (1986). "The TRACE model of speech perception". Cognitive Psychology 18(1): 1–86. doi:10.1016/0010-0285(86)90015-0.
- Kazanina, N.; Phillips, C.; Idsardi, W. (2006). "The influence of meaning on the perception of speech sounds". PNAS 103(30): 11381–11386.
- Gocken, J. M.; Fox, R. A. (2001). "Neurological evidence in support of a specialized phonetic processing module". Brain and Language 78(2): 241–253. doi:10.1006/brln.2001.2467.
- Dehaene-Lambertz, G.; Pallier, C.; Serniclaes, W.; Sprenger-Charolles, L.; Jobert, A.; Dehaene, S. (2005). "Neural correlates of switching from auditory to speech perception". NeuroImage 24(1): 21–33. doi:10.1016/j.neuroimage.2004.09.039.
- Näätänen, R. (2001). "The perception of speech sounds by the human brain as reflected by the mismatch negativity (MMN) and its magnetic equivalent (MMNm)". Psychophysiology 38(1): 1–21. doi:10.1111/1469-8986.3810001.
- Liberman, A. M.; Harris, K. S.; Hoffman, H. S.; Griffith, B. C. (1957). "The discrimination of speech sounds within and across phoneme boundaries". Journal of Experimental Psychology 54(5): 358–368. doi:10.1037/h0044417.
- Liberman, A. M.; Cooper, F. S.; Shankweiler, D. P.; Studdert-Kennedy, M. (1967). "Perception of the speech code". Psychological Review 74(6): 431–461. doi:10.1037/h0020279.
- Liberman, A. M. (1970). "The grammars of speech and language". Cognitive Psychology 1(4): 301–323. doi:10.1016/0010-0285(70)90018-6.
- Liberman, A. M.; Mattingly, I. G. (1985). "The motor theory of speech perception revised". Cognition 21(1): 1–36. doi:10.1016/0010-0277(85)90021-6.
- Hayward, Katrina (2000). Experimental Phonetics: An Introduction. Harlow: Longman.
- Stevens, K. N. (2002). "Toward a model of lexical access based on acoustic landmarks and distinctive features". Journal of the Acoustical Society of America 111(4): 1872–1891. doi:10.1121/1.1458026.
- Massaro, D. W. (1989). "Testing between the TRACE model and the fuzzy logical model of speech perception". Cognitive Psychology 21(3): 398–421. doi:10.1016/0010-0285(89)90014-5.
- Oden, G. C.; Massaro, D. W. (1978). "Integration of featural information in speech perception". Psychological Review 85(3): 172–191. doi:10.1037/0033-295X.85.3.172.
- Ingram, John C. L. (2007). Neurolinguistics: An Introduction to Spoken Language Processing and its Disorders. Cambridge: Cambridge University Press. pp. 113–127.
- Parker, E. M.; Diehl, R. L.; Kluender, K. R. (1986). "Trading relations in speech and nonspeech". Attention, Perception, & Psychophysics 39(2): 129–142. doi:10.3758/bf03211495.
- Diehl, R. L.; Lotto, A. J.; Holt, L. L. (2004). "Speech perception". Annual Review of Psychology 55(1): 149–179. doi:10.1146/annurev.psych.55.090902.142028.