Voice Quality: What Is Most Characteristic About "You" in Speech

by Ingo R. Titze and Brad H. Story


We send multiple messages when we speak. Some are linguistic and some paralinguistic, meaning that they are independent of the words that we utter. Such paralinguistic messages concern our health, our mood, our genetic makeup, and our upbringing. Many of them are encoded in voice quality, which in the most general sense is everything in the acoustic signal other than overall pitch, loudness, and phonetic contrast (vowels and consonants).

Descriptions of voice quality have traditionally consisted of qualitative terms such as warm, shrill, twangy, creaky, shrieky, breathy, yawny, gravelly, hoarse, ringing, dull, nasal, resonant, rough, and pressed. Although these terms are commonly used in both clinical and non-clinical situations, their acoustic and articulatory correlates have not been well defined. In comparison, the characteristics of vocal registers have been somewhat better defined and are often given the generally accepted labels of modal, fry, and falsetto in speech, and chest, head (or mixed), falsetto, and whistle in singing.

Work is now ongoing to address a few of these voice qualities on a physiologic and acoustic level. Much of this work is supported by the National Institute on Deafness and Other Communication Disorders, which believes that better treatment and care of the human voice can be achieved if clinicians can discriminate these vocal qualities and relate them to abnormality.

Our own efforts have been directed toward a better understanding of vocal tract shapes that, together with laryngeal (voice source) adjustments, produce certain voice qualities. Magnetic resonance imaging (MRI) has been our primary tool for studying the vocal tract shape. Images are acquired while a speaker repeats a specific vowel or consonant many times. Subsequent image analysis is used to assess the cross-sectional area of the airway along its extent from the larynx to the lips, effectively creating a continuously variable tubular representation of the vocal tract shape. At this point, four speakers have been imaged with this method. The first two (a male and a female) produced most of the vowels and consonants of American English, while the second set of speakers (another male and female) were asked to produce only four vowels, but with four different voice qualities.

A statistical analysis of each speaker's set of vocal tract shapes (for the different vowel or quality conditions), together with calculations of frequency response functions, has revealed that the average vocal tract shape produces acoustic characteristics that seem to be unique both to the speaker and to the particular voice quality used during the image collection. This average shape is similar to that expected for a neutral or schwa vowel /ə/, and roughly produces the vowel sound heard when speaking "uh-huh." Figures 1a and 1b show average vocal tract shapes, in tubular form, for a male and female speaker, respectively. Note that this representation results from plotting the measured cross-sectional areas as equivalent circular elements. What makes these average shapes acoustically "neutral" is that their lower formant (resonance) frequencies (specifically F1, F2, and F3) are nearly equally spaced.
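To make the "equally spaced" remark concrete, the following minimal Python sketch uses the textbook idealization of a uniform tube that is closed at the larynx and open at the lips, whose resonances fall at odd multiples of c/4L. The tract length (17.5 cm) and speed of sound (350 m/s) are illustrative assumptions, not measurements from this study, and the function names are hypothetical. The radius helper simply inverts A = πr², which is what plotting areas as equivalent circular elements amounts to.

```python
import math

def equivalent_radius(area_cm2):
    """Radius (cm) of a circle having the given cross-sectional area (cm^2)."""
    return math.sqrt(area_cm2 / math.pi)

def uniform_tube_formants(length_cm, n_formants=3, sound_speed_cm_s=35000.0):
    """Resonances (Hz) of an idealized uniform tube, closed at one end, open at the other.

    F_n = (2n - 1) * c / (4 * L); the default sound speed is an assumed value.
    """
    return [(2 * n - 1) * sound_speed_cm_s / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]

# Assumed tract length of 17.5 cm (illustrative only).
formants = uniform_tube_formants(17.5)
print(formants)  # [500.0, 1500.0, 2500.0]
```

Under these assumptions the first three resonances are 1000 Hz apart. A real neutral tract is not a perfectly uniform tube, so its formants are only approximately evenly spaced.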

Statistical analysis of the tract shape has also suggested that all other vowel shapes (and many consonant shapes) are linear combinations of the neutral shape and three eigen-shapes. These eigen-shapes, similar to eigenmodes in vibration theory, can be used as orthogonal building blocks (with proper coefficients) to construct the vocal tract shapes used in any given spoken language. Thus, an interesting dichotomy arises between the shaping of the vocal tract airway for linguistic versus paralinguistic messages: the neutral shape is used primarily to code speaker identity (voice quality), whereas the eigen-shapes are used primarily to code the speech.
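The idea of a neutral shape plus weighted orthogonal components can be illustrated with a principal component analysis. The sketch below is not the authors' analysis: it builds synthetic area functions (made-up data, with a hypothetical 10 shapes and 44 tube sections) and uses NumPy's singular value decomposition to extract a mean shape and three orthonormal components, then rebuilds every shape from the mean plus a weighted sum of those components.

```python
import numpy as np

rng = np.random.default_rng(0)

n_shapes, n_sections = 10, 44          # hypothetical counts, for illustration only
x = np.linspace(0.0, 1.0, n_sections)  # normalized distance, larynx -> lips

# Synthetic "area functions": a base shape plus three smooth deformation patterns.
base = 2.0 + np.sin(np.pi * x)
patterns = np.stack([np.cos(np.pi * x),
                     np.cos(2 * np.pi * x),
                     np.cos(3 * np.pi * x)])
weights = rng.normal(size=(n_shapes, 3))
areas = base + weights @ patterns      # shape: (n_shapes, n_sections)

neutral = areas.mean(axis=0)           # average ("neutral") shape
deviations = areas - neutral

# Rows of vt are orthonormal directions of variation in the deviations.
_, singular_values, vt = np.linalg.svd(deviations, full_matrices=False)
eigen_shapes = vt[:3]

# Coefficients of each shape on the three components, then reconstruction.
coeffs = deviations @ eigen_shapes.T
reconstructed = neutral + coeffs @ eigen_shapes

max_error = np.abs(reconstructed - areas).max()
print(f"max reconstruction error: {max_error:.2e}")
```

Because the synthetic data were generated from exactly three patterns, three components reconstruct them to within floating-point error. Measured vocal tract data would leave a residual, and how many components are adequate is an empirical question.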




Figure 1. Tubular representations of the neutral (average) vocal tract shapes, drawn from larynx to lips, for (a) a male and (b) a female.


That raises the next question: what does the neutral vocal tract shape depend on, and how can it be controlled by the speaker? Genetics plays a major role, as do age, culture, dialect, and vocal training. In part, we can control our own voice quality, for example by maintaining a forward tongue position throughout our speech, by maintaining a lower larynx position (as if yawning), by keeping our lips somewhat rounded or pursed, or by clenching our teeth. With these gestures, we "soft-wire" our neutral vocal tract shape, virtually on an instantaneous basis. Culture, dialect, and vocal training are somewhat more firm-wired, whereas the effects of genetics and age are basically hard-wired.

A recent focus has been on perception of the voice qualities yawny, twangy, and ringing. Rather than having human speakers produce these qualities in varying degrees (which is possible), computer simulation was used to generate sounds for listeners. The sentence-length utterance /ya-ya-ya-ya-ya/ was used. It is an utterance frequently used by speakers to convey the meaning "I know, I know" or "I've heard this before." It lends itself well to computer simulation because it can have natural intonation and is entirely voiced (it contains no voiceless consonants, which often do not sound natural and thus draw attention to themselves).

Figure 2.  Modification of cross-sectional areas to transform a "normal" voice quality to "yawny."


The results were that listeners judged voices to be yawny when the pharynx and larynx were widened and the vocal folds were spread slightly apart during voicing. A lengthening of the vocal tract also added to the perception of yawn quality. Figure 2 shows how the cross-sectional area of the vocal tract was altered from the larynx (left) to the lips (right) in two vowels taken from the /ya-ya-.../ simulation; /a/-like shapes are shown on the top and /i/-like shapes are shown on the bottom. The solid line is the "yawny" area function and the dotted line is the normal area function. Note that the pharyngeal area was enlarged for both vowels, the epilarynx tube (which is the vocal tract portion within the larynx) was widened, and the vocal tract was lengthened. For the twang quality (not shown in the figure), the opposite adjustments were made: the back of the vocal tract (pharynx and epilarynx tube) was narrowed, the vocal tract was shortened, and the vocal folds were adducted more. Basically, twang was an acoustically and aerodynamically high-impedance configuration and yawn was an acoustically and aerodynamically low-impedance configuration. Figure 3 shows how the listeners rated the two qualities (1 equals least preference for the quality and 10 equals greatest preference for the quality). On the abscissa are the high-impedance and low-impedance conditions described above. It is evident that the qualities were clear and distinct to the listeners.
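The direction of these adjustments can be sketched in a few lines of Python. This is only an illustration of the idea, not the simulation used in the study: the function name, the sample area values, the fraction of the tract treated as "back," and the scaling factors are all hypothetical, and the vocal fold adjustments (spreading or adduction) are not modeled at all.

```python
def reshape_tract(areas_cm2, section_length_cm, back_fraction=0.4,
                  back_area_factor=1.5, length_factor=1.1):
    """Scale the back (larynx-end) areas and the overall tract length.

    Factors above 1 widen and lengthen (yawn-like); factors below 1
    narrow and shorten (twang-like). All defaults are made-up values.
    """
    n_back = int(round(back_fraction * len(areas_cm2)))
    new_areas = [a * back_area_factor if i < n_back else a
                 for i, a in enumerate(areas_cm2)]
    return new_areas, section_length_cm * length_factor

# Made-up area function, ordered from larynx to lips (cm^2).
normal = [0.5, 0.5, 1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.5]

yawny, yawny_dx = reshape_tract(normal, section_length_cm=1.75)
twangy, twangy_dx = reshape_tract(normal, section_length_cm=1.75,
                                  back_area_factor=0.5, length_factor=0.9)

print(yawny[:4], twangy[:4])
```

Keeping one function with factors above or below 1 mirrors the statement that twang uses the opposite adjustments from yawn.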

Figure 3. Listeners' ratings of yawn and twang qualities.


Without showing the data, we also mention the result for the third quality, ring. This quality is characteristic of opera singers who wish to be heard (unamplified) over long distances. Our findings were that ring is produced with a relatively narrow epilarynx tube and a relatively wide pharynx, a conclusion reached by Johan Sundberg many years ago. Thus, ring is in effect a mixture of twang and yawn. It has the benefit of a high vocal tract input impedance (like the mouthpiece of a trumpet) with a laryngeal resonance known as the singer's formant, and it also has a low first formant, giving both brilliance and warmth to the sound quality.

In conclusion, voice qualities used in speech and singing appear to have clearly definable vocal tract and vocal fold adjustments. Computer simulation is helpful in quantifying these adjustments and relating the qualities to each other. But many more qualities need to be studied, some of which may involve multiple sound sources (e.g., true folds, false folds, aryepiglottic folds, and perhaps air vortices shed by the larynx) and nonlinear coupling between resonators and these sound sources.

Ingo R. Titze is a University of Iowa Foundation Distinguished Professor in Speech Pathology, Biomedical Engineering, and Music. He directs the National Center for Voice and Speech. Brad H. Story is an Assistant Professor in the Department of Speech and Hearing Sciences at the University of Arizona.