Making sense of allophones
Allophones aren't random
People don't make different sounds in different contexts because they're
bored and have nothing better to do with their time. The small
differences make sense.
Two examples from Rogers of allophonic differences:
- Voiceless consonants are longer at the end of a word than voiced consonants are. E.g., [nɪp:] vs. [nɪb], [rejt:] vs. [rejd]
- Vowels are longer before a voiced sound than before a voiceless sound.
Try to say a very long [p]. Now try to say a very long [b], keeping your
vocal cords vibrating -- you'll quickly progress from normal through
chipmunk to balloon. It makes sense not to hold voiced oral stops for as
long as you hold voiceless ones.
Equally stressed syllables tend to take up about the same amount of time.
If a final voiced consonant is shorter, that leaves more time for the
vowel to take up.
Two goals in speaking
- Make life easy for your mouth.
- Make life easy for your listener.
You can't always satisfy both goals at once. When pronouncing hid,
you want your listener to be able to tell that you aren't saying
hit, so prolonging the final [d] might be helpful. But you want to
accomplish this with the least effort possible, and prolonging the final
[d] will turn you into a chipmunk.
Several of the small contextual differences between allophones can be seen
as attempts to satisfy one or both of these goals.
Assimilation
Assimilation is when a sound becomes more like its environment. E.g.,
- In many languages, a stop becomes voiced between two (voiced) vowels.
- In English, a vowel is somewhat nasal before a nasal consonant, and very
nasal between two nasal consonants.
For the speaker: assimilation keeps the articulators from having to make
the sudden fast movements that would be required if the idealized slicing
view of segments were true. For the listener: spreading a feature like
voicing or nasality out over a longer period of time can often make it
easier to hear (though it can also destroy contrasts that used to be in
the segments that changed).
Enhancement
Speakers will often simultaneously make several gestures that have similar
acoustic effects.
E.g., the R sound in English is usually pronounced with three gestures:
- an apico-postalveolar approximant, [ɹ]
- slight lip rounding
- a slight constriction of the throat (a radico-pharyngeal approximant).
All three gestures have a similar acoustic effect and reinforce or enhance
each other.
For the speaker: it may seem harder to do three gestures than one,
but each of the three can be smaller and less obtrusive than if one were
used alone. For the listener: exaggerating the acoustic effect makes it
harder to mistake.
For people with some physical problems, there's often no choice but to use
different gestures that have similar acoustic effects.
Multiple cues
Listeners will pay attention to all relevant information that can help
distinguish sounds. In deciding whether a final stop is voiced, a
listener won't just listen for vocal cord vibration, but for:
- vocal cord vibration
- the pitch of the preceding vowel
- the relative lengths of the vowel and consonant
- the transition between the vowel and consonant
among other things. (In some situations, the relative length of the
consonant and vowel can be a more reliable cue than whether you hear vocal
cord vibration.)
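The idea of weighing several cues at once can be illustrated with a toy sketch. The cue names, scores, and weights below are hypothetical, chosen only to show the logic of combining weighted evidence, not measured values:

```python
# Toy illustration: a listener combining several acoustic cues to decide
# whether a final stop is voiced. Cue names and weights are hypothetical.

def judge_final_stop(cues, weights=None):
    """Return 'voiced' or 'voiceless' from weighted evidence.

    `cues` maps cue names to scores in [-1, 1], where positive values
    point toward 'voiced' (e.g. a long preceding vowel, audible vocal
    cord vibration) and negative values toward 'voiceless'.
    """
    if weights is None:
        # Hypothetical weights: the length ratio is weighted highest,
        # since the notes say it can be the more reliable cue.
        weights = {
            "vocal_cord_vibration": 1.0,
            "preceding_vowel_pitch": 0.5,
            "vowel_consonant_length_ratio": 1.5,
            "vowel_consonant_transition": 0.75,
        }
    score = sum(weights.get(name, 0.0) * value
                for name, value in cues.items())
    return "voiced" if score > 0 else "voiceless"

# A token with weak voicing but a long vowel and short consonant can
# still be judged 'voiced': the length cue outweighs the others.
print(judge_final_stop({
    "vocal_cord_vibration": -0.2,
    "vowel_consonant_length_ratio": 0.8,
}))  # -> voiced
```

The point of the weighted sum is that no single cue is decisive: each piece of evidence nudges the decision, and a strong cue can override a weak, conflicting one.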
In cases of enhancing gestures or of features with multiple cues,
children often focus on the wrong one. E.g., [wæbɪt] instead of [ɹæbɪt];
a child who substitutes a short [t] for all final voiced stops and a long
[t:] for all voiceless ones.