Making sense of allophones
Allophones aren't random
People don't make different sounds in different contexts because they're
bored and have nothing better to do with their time. The small
differences make sense.
Two examples from Rogers of allophonic differences:
- Voiceless consonants are longer at the end of a word than voiced consonants are. E.g., [nɪp:] vs. [nɪb], [rejt:] vs. [rejd]
- Vowels are longer before a voiced sound than before a voiceless sound.
Try to say a very long [p]. Now try to say a very long [b], keeping your
vocal cords vibrating -- you'll quickly progress from normal through
chipmunk to balloon. It makes sense not to hold voiced oral stops for as
long as you hold voiceless ones.
Equally stressed syllables tend to take up about the same amount of time.
If a final voiced consonant is shorter, that leaves more time for the
vowel to take up.
Two goals in speaking
- Make life easy for your mouth.
- Make life easy for your listener.
You can't always satisfy both goals at once. When pronouncing hid,
you want your listener to be able to tell that you aren't saying
hit, so prolonging the final [d] might be helpful. But you want to
accomplish this with the least effort possible, and prolonging the final
[d] will turn you into a chipmunk.
Several of the small contextual differences between allophones can be seen
as attempts to satisfy one or both of these goals.
Assimilation
Assimilation is when a sound becomes more like its environment. E.g.,
- In many languages, a stop becomes voiced between two (voiced) vowels.
- In English, a vowel is somewhat nasal before a nasal consonant, and very
nasal between two nasal consonants.
For the speaker: assimilation keeps the articulators from having to make
the sudden fast movements that would be required if the idealized slicing
view of segments were true. For the listener: spreading a feature like
voicing or nasality out over a longer period of time can often make it
easier to hear (though it can also destroy contrasts that used to be in
the segments that changed).
Enhancement
Speakers will often simultaneously make several gestures that have similar
acoustic effects.
E.g., the R sound in English is usually pronounced with three gestures:
- an apico-postalveolar approximant, [ɹ]
- slight lip rounding
- a slight constriction of the throat (a radico-pharyngeal approximant).
All three gestures have a similar acoustic effect and reinforce or enhance
each other.
For the speaker: it may seem harder to do three gestures than one,
but each of the three can be smaller and less obtrusive than if one were
used alone. For the listener: exaggerating the acoustic effect makes it
harder to mistake.
For people with some physical problems, there's often no choice but to use
different gestures that have similar acoustic effects.
Multiple cues
Listeners will pay attention to all relevant information that can help
distinguish sounds. In deciding whether a final stop is voiced, a
listener won't just listen for vocal cord vibration, but for:
- vocal cord vibration
- the pitch of the preceding vowel
- the relative lengths of the vowel and consonant
- the transition between the vowel and consonant
among other things. (In some situations, the relative length of the
consonant and vowel can be a more reliable cue than whether you hear vocal
cord vibration.)
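The idea of weighing several cues at once can be illustrated with a toy sketch. The cue names, scores, and weights below are hypothetical, chosen only to show the logic of combining weighted evidence, not measured values:

```python
# Toy illustration: a listener combining several acoustic cues to decide
# whether a final stop is voiced. Cue names and weights are hypothetical.

def judge_final_stop(cues, weights=None):
    """Return 'voiced' or 'voiceless' from weighted evidence.

    `cues` maps cue names to scores in [-1, 1], where positive values
    point toward 'voiced' (e.g. a long preceding vowel, audible vocal
    cord vibration) and negative values toward 'voiceless'.
    """
    if weights is None:
        # Hypothetical weights: the length ratio is weighted highest,
        # since the notes say it can be the more reliable cue.
        weights = {
            "vocal_cord_vibration": 1.0,
            "preceding_vowel_pitch": 0.5,
            "vowel_consonant_length_ratio": 1.5,
            "vowel_consonant_transition": 0.75,
        }
    score = sum(weights.get(name, 0.0) * value
                for name, value in cues.items())
    return "voiced" if score > 0 else "voiceless"

# A token with weak voicing but a long vowel and short consonant can
# still be judged 'voiced': the length cue outweighs the others.
print(judge_final_stop({
    "vocal_cord_vibration": -0.2,
    "vowel_consonant_length_ratio": 0.8,
}))  # -> voiced
```

The point of the weighted sum is that no single cue is decisive: each piece of evidence nudges the decision, and a strong cue can override a weak, conflicting one.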
In cases of enhancing gestures or of features with multiple cues,
children often focus on the wrong one. E.g., [wæbɪt] instead of [ɹæbɪt];
a child who substitutes a short [t] for all final voiced stops and a long
[t:] for all voiceless ones.