The Voice as a Musical Instrument



The preceding chapters of this book have concentrated on impulsively excited tones that die away: clangs, drum thumps, guitar pluckings, and the sounds made by the stringed keyboard instruments. It is now time to consider sound sources that are capable of producing a sustained tone. This chapter will be devoted to the human voice, after which we will take up the orchestral brasses, woodwinds, and stringed instruments. (In chapter 16 we devoted some attention to two sustained-tone instruments, the pipe organ and its electronic counterpart, but our interest was restricted to the pitch relationships of their sounds, and we took no account of the ways in which these sounds are generated.)


In the present chapter we will consider how voice sounds are generated and how these sounds are modified in the mouth and nose cavities before being radiated into the room, after which we will look into some of the implications of these operations for speech and for music. Our interest in the sound production processes of the voice is twofold. On the one hand, the singing voice has considerable musical significance; on the other hand, several of its acoustical aspects provide us with a particularly good introduction to much that is important in the nature of woodwinds, brasses, and bowed string instruments.


19.1. The Voice: A Source of Controllable Sound


One has only to listen for a moment to a singer to realize that the voice is a sound source whose pitch is controllable. In physical terms this means that the human voice can produce acoustic signals having repetition rates that can be varied over a large range. The fact that a singer can enunciate different sustained sounds (e.g., one vowel or another) while maintaining his pitch suggests further that the other important aspect of a sustained sound (the amplitudes of its sinusoidal components) is subject to control. It may seem curious in a book on musical acoustics that we will be giving a fair amount of attention in this chapter to speech sounds, particularly vowels. They prove to be useful to a study of musical acoustics for two reasons. First, they are a musical element of singing quite aside from their information-carrying function. Second, the ways in which recognizable word sounds are shaped out of the original relatively featureless vibration recipe from our vocal cords can give us considerable insight into acoustic connections between tone color, pitch, and the strengths of the partials we hear.


The relationship between vowel sounds and tone color can be illustrated if we imagine building a pair of musical keyboard instruments; one instrument uses the sound component recipe for a particular vowel sung at C4 as a basis for constructing its tones (by transposition), while the other similarly made instrument uses the recipe for a different vowel sung at the same pitch. We would be unanimous in recognizing that the two instruments have distinctly different tone colors, even though very few of us would recognize that the sounds from the two keyboards were copies of spoken vowels. Contrast this with what happens when two of your friends sing or enunciate a wide variety of words at a wide variety of pitches; their voices will retain some kind of overall tone color or flavor through all this that allows us to recognize them as the voices of specific people. Obviously musical sounds, including voices, have a tone color that is connected in a nontrivial fashion to their vibration recipes, quite aside from processing complication introduced by room acoustics.


It is fortunate indeed for our present purposes that the human voice mechanism separates itself very easily into unambiguously recognizable functional parts, each of which can be thought about in isolation. Once we have examined the various parts separately, we can put everything back together to make the central part of what Peter Denes and Elliot Pinson of the Bell Telephone Laboratories have called the speech chain.[1] In our investigations in this chapter we will focus our attention almost entirely on the vibration physics of vocal sound production; this means that we plan to ignore the mental and neurophysiological processes governing the selection and formation of voice sounds.

Figure 19.1 is a block diagram of the voice mechanism as it concerns us. The labels within most of the boxes give ordinary names to the various physiological objects with which we are dealing, while the words written above these boxes describe the acoustical function or nature of these objects. The box marked "sibilants, etc.," does not quite fit into the labeling scheme just described. It serves simply as a graphical device for reminding us that the production of sounds like s, sh, k, t, and th involves an auxiliary, broadband (multicomponent) random source which can be located almost anywhere within the vocal cavity region. When speech sounds are made, the larynx may or may not itself be vibrating to produce an oscillatory flow of air; it is this choice that makes the distinction between the voiced and the unvoiced consonants.


We may quite properly think of the larynx as being what we defined in chapter 11 as a simple source. This simple source feeds into a small, very elongated (i.e., more or less one-dimensional) room of complex shape formed by the vocal cavities. Our study in chapter 11 of the acoustical response of rooms to excitation by such a source should have prepared us for the idea that the sound pressure at any given point in the vocal cavity (away from the source) will depend drastically both on the excitation frequency and on the point of observation. We should also recognize that (wherever we observe it) the acoustical response will be particularly large if the excitation frequency components of the source match one or another of the characteristic vibrational modes of the cavity.


Over and over in this book we have met examples of the way in which alterations in the structure of a vibrating object, and more particularly of its boundaries, can alter the frequencies of its characteristic modes. In the course of speaking or singing, one continually alters the shape of one's vocal cavities. The production of each particular vowel or consonant is associated with a fairly well-defined shape for the cavities, and therefore with a particular pattern of strong and weak responses to the various sinusoidal components of the airflow controlled by the vocal cords.


As we explore what happens inside the vocal cavity to the sound produced by the vocal cords, we will confine our attention to what happens at the mouth aperture. (The nose aperture, which is also used separately or with the mouth, has very similar properties; therefore we need make no further mention of it.) At the mouth opening, the oscillatory flow of air depends on the relation between the excitation frequency (from the larynx) and the various resonances of the vocal cavity. The mouth, of course, also has acoustical importance since it serves as the source for sounds as we hear them in the room. (The specific things going on acoustically inside the vocal cavity that we do not have time to explore are well understood. Research is done by using a tiny probe microphone to measure the sound pressure set up at various points inside the vocal cavity; also, motion pictures have been made of the movements of the vocal cords.)


In the next two sections we will first consider the way in which the flesh folds that are known as the vocal cords set themselves into oscillation at a frequency corresponding to the speaker's or singer's desired pitch, and then we will enquire into the particular ways in which the resulting oscillatory flow from the larynx has its vibration recipe modified on its way through the vocal cavities to the room and thence to our ears. The various patterns of these modifications are what make different voice sounds recognizable.


The vocal cords, which do the actual vibrating in the larynx, are flaplike folds of muscle attached to the interior of the larynx in such a way as to produce a slitlike opening through which air can pass. The cords are capable of assuming a wide variety of shapes and spacings. When we breathe normally, they pull themselves back out of the way, so as to leave an unobstructed air passage. When we whisper, they are held close enough together that air flowing between them generates a rushing or hissing sound made up of roughly equal amounts of all possible frequency components ("white" noise); the vocal tract can operate on this random collection of closely spaced sinusoidal components to produce intelligible speech, even though the sound has a radiated sound pressure spectrum in the room quite different from that of normal speech. When one phonates (produces vocal sound) normally, the cords are given a shape and spacing that permits the aerodynamic forces which arise from the air flowing between them to set them into oscillation. However, the speed of the airflow only slightly influences the frequency of this oscillation; the predominant control comes from the mass of the vocal cords and the muscle tension set up in them. The oscillation of the cords is of such a nature that they alternately approach one another and recede, bringing about a corresponding oscillatory decrease and increase in the amount of air that is permitted to flow between them. Not only can the speaker choose the frequency of oscillation of the cords (and so the pitch of the resulting sounds), he can also choose to have the cords swing with sufficient amplitude that they can press together during a controllable portion of each oscillatory cycle. Under these conditions, the flow consists of momentary puffs of air whose duration can be adjusted more or less independently of their repetition rate. As a result the singer is provided additionally with an adjustable recipe for his internal sound source, and therefore with one of his means for altering the tone color of his music.


As an initial step in our quest for understanding how the air passing between vocal cords can maintain their oscillations, we should remind ourselves of a few facts about the motion of fluids and some of the initial consequences of these facts. Most of us are quite familiar with these facts in an everyday way, even if we have not thought about them formally or tried to describe them in words. Because of their basic importance to our understanding of many things we will examine in the rest of this book (not just in connection with the maintenance of oscillations), I shall set down these basic ideas as the first few members of a set of numbered statements to which we can easily make reference whenever the need arises.


1. Fluids (including air) tend to flow from regions of high pressure toward regions where the pressure is low.


2. As a consequence of the influence of pressure on fluid flow, we recognize that if we see an increasing flow velocity of a fluid as it moves from one point to another in its travels, we can deduce that the pressure at a high-velocity spot must be lower than at the low-velocity point from which the fluid came. One cannot speed anything up without arranging to have an excess of force acting behind it.


3. When a fluid flows steadily and continuously in a long duct, we expect the velocity of …the duct than in the wider parts.

  Statement 3 is simply a recognition of the fact that, for fluid flowing in a leak-free duct, a fixed volume of fluid passes any given point per second. Where the pipe cross-section is large, many small "chunks" of the slow-moving fluid travel abreast of one another; in the narrower parts these must run quickly through the constriction in single file.

4. A joint implication of statements 2 and 3 is that we should expect the fluid pressure in the narrow parts of a long duct to be lower than it is in the broad parts.


  The argument leading to statement 4 runs thus: in a leak-proof pipe any given small chunk of fluid (which you might wish to identify by squirting in a tiny droplet of oil) finds itself accelerating to a higher velocity as it enters a narrow region, and then slowing back down as it continues on into a broader part of the duct. Looking at things from the point of view of the small piece of fluid, we realize that it will not change its state of motion unless a force acts on it. It speeds up as it enters a constriction; therefore, the pressure behind it must be greater than in the constricted region it is approaching. Similarly, it slows down as it leaves the constriction; therefore, an excess pressure must be acting on its front surface to retard it. The quantitative expression of statement 4 and an elucidation of some of its remarkable consequences were first worked out by the Swiss physicist Daniel Bernoulli in 1738. The formal expression of our statement 4 is known as Bernoulli's Theorem for Steady Flow.

5. The presence of viscous friction that is normally found in a fluid and between the fluid and its containing walls does not change the qualitative correctness of statements 1 through 4. However, it leads to a reduction in the total amount of fluid that passes through the system per second under the influence of a given driving pressure of the source.
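Statements 3 and 4 can be tried out with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: the flow is steady and frictionless (statement 5 is ignored), the air density is taken as roughly 1.2 kg/m^3, and the duct areas and inlet velocity are made-up numbers, not measurements of any real windpipe or larynx.

```python
# Illustrative numbers only; none of these values come from the text.
AIR_DENSITY = 1.2  # kg/m^3, approximate density of room air (assumption)


def velocity_in_section(v_wide, area_wide, area_narrow):
    """Statement 3: the same volume per second passes every cross-section."""
    return v_wide * area_wide / area_narrow


def pressure_drop(v_wide, v_narrow, density=AIR_DENSITY):
    """Statement 4 for steady, frictionless flow: p_wide - p_narrow, in pascals."""
    return 0.5 * density * (v_narrow**2 - v_wide**2)


v_wide = 1.0  # m/s in the broad part of the duct (hypothetical)
v_narrow = velocity_in_section(v_wide, area_wide=2.0e-4, area_narrow=0.2e-4)
drop = pressure_drop(v_wide, v_narrow)
print(f"velocity in constriction: {v_narrow:.1f} m/s")
print(f"pressure is lower there by: {drop:.1f} Pa")
```

Narrowing the duct tenfold raises the velocity tenfold, and the pressure in the constriction falls accordingly. Friction would reduce the total flow but, as statement 5 says, would not reverse the direction of these effects.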


We are now provided with the information needed for a look at the vocal cords in their role as oscillators. If a mechanical engineer were asked to design a simplified machine that worked in much the same way as the vocal cords, he might very well come up with something of the sort shown in figure 19.2. Air from the lungs flows in the diagram from left to right through a large-diameter duct (A) which corresponds to the windpipe or trachea. The air then flows through a constriction (B) and out again into an enlarged portion of the duct (C), which is the beginning of the vocal tract. The upper boundary of the constriction consists chiefly of a mass M mounted on a spring having a stiffness coefficient S, the mass being free to oscillate smoothly up and down along a carefully fitted guide. This guide is made leakproof by means of some grease, which also serves to lubricate the guide. Our engineer has chosen to represent one of the two vocal cords by this spring-mass system (with viscous damping D provided by the sealing grease). The other cord would move symmetrically with the first under the influence of similar forces, and so can be left out of our initial consideration.


If no air is sent through our iron larynx, it is easy for us to see that the natural frequency of oscillation of the mass M is proportional to the quantity $\sqrt{S/M}$, and that if it is pulled aside and released, the oscillations will die away with a halving time proportional to M/D (see sec. 6.1). It is this natural frequency which the singer changes as he shifts from one musical pitch to another.

If the airstream is turned on, we recognize on the basis of statement 4 above that the air pressure at (B) will be reduced relative to what it is both at (A) and at (C). If the mass moves downward, further constricting the opening, two opposing things will happen. Narrowing the aperture will increase the speed of the air motion at (B), as a result of which the pressure here is also reduced, thus tending to suck the mass even farther down. On the other hand, the added frictional resistance produced in the narrowed opening will (if the lung pressure is kept the same) reduce the total volume of air that flows past per second. As a result, the flow-dependent pressure will not change in quite the way we would otherwise expect. When everything so far is taken into account, we find that the presence of flowing air causes the mass to feel an aerodynamic force that has two recognizable components: a steady inward force, plus one which fluctuates as the mass vibrates in and out. We shall call this last, fluctuating part the oscillatory Bernoulli force.


Let us see how the presence of flow can be expected to modify the sinusoidal oscillation which would normally result from the interaction of the spring with the mass. The steady part of the flow-induced force pulls M in against the elasticity of the spring to a new equilibrium position in which the aperture is slightly reduced. We find further that as the mass oscillates, the other flow-induced force component acts along with the spring as an additional restoring force tending to pull the mass back toward its altered equilibrium position. It is thus perfectly permissible for us at this stage in our thinking to consider the joint action of the spring and the airflow as being equivalent to the action of a single spring having a somewhat larger stiffness coefficient. The conclusion follows then that the natural frequency of oscillation of our imitation vocal cord is slightly raised by the existence of an airflow past it. Notice, however, that we have not yet found anything that can counteract the damping effect of the lubricating grease. In other words, we have not yet discovered any means whereby the flowing current of air can initiate or maintain oscillations of the vocal cord.
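The proportionalities quoted for the spring-mass model, and the small upward shift of frequency caused by the airflow, can be put into numbers. What follows is a minimal sketch, assuming the standard damped-oscillator result that a friction force of magnitude D times velocity makes the amplitude decay as exp(-Dt/2M). The values of M, S, D, and the extra aerodynamic stiffness S_AIR are hypothetical, chosen only to give a frequency near 100 Hz.

```python
import math


def natural_frequency(stiffness, mass):
    """Frequency in Hz of an undamped spring-mass oscillator: sqrt(S/M) / (2*pi)."""
    return math.sqrt(stiffness / mass) / (2.0 * math.pi)


def halving_time(mass, damping):
    """Time for the amplitude to fall by half, assuming decay as exp(-D*t / (2*M))."""
    return 2.0 * mass * math.log(2.0) / damping


# Hypothetical values, not taken from the text or from any measurement.
M = 1.0e-4    # kg, moving mass
S = 40.0      # N/m, spring stiffness
D = 0.01      # N*s/m, viscous damping of the grease
S_AIR = 4.0   # N/m, extra restoring stiffness from the oscillatory Bernoulli force

f_still = natural_frequency(S, M)
f_flow = natural_frequency(S + S_AIR, M)
print(f"no airflow:   {f_still:.1f} Hz, halving time {halving_time(M, D) * 1000:.1f} ms")
print(f"with airflow: {f_flow:.1f} Hz")
```

Treating the airflow as a second spring raises the frequency but leaves the halving time untouched, which is exactly the difficulty noted above: nothing in the model so far works against the damping.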


Let us digress a moment now and examine the motion of a child on a swing, and notice what we must do while pushing him. This examination will suggest to us what to look for in the larynx, which is a device whose cords are of course known to oscillate. As a child swings back and forth, we recognize first the springlike restoring force that arises from the joint effect of his weight and of the oblique rope which supports him. As we learned in chapter 6, this force acts in a direction opposite to the child's displacement; it determines the frequency of oscillation according to a familiar formula. Once the child is pulled to one side and released, he swings in ever-decreasing arcs; the decrease is the result of the viscous friction of the air through which he moves (see fig. 10.5). Notice that the viscous friction is a damping force that acts in a direction opposite to the motion of the child. The contrast between the restoring force and the damping force can be made clear if we realize that the restoring force is zero at midswing, where the damping force on the rapidly moving child reaches a maximum. Conversely, the damping force falls to zero as the child comes to rest at the limits of his travel, which are the points at which the restoring force has its largest value.


If we wish to maintain the swinging motion of the child, it seems pretty obvious that it is necessary to do our pushing in the direction of the child's motion. More accurately, we realize that if we push on him over an appreciable fraction of the time of one cycle, at least the predominant share of our pushing should take place in the helpful direction. Let us distill these ideas into the sixth of our numbered statements:


6. Because the damping force on a vibrating object always acts to oppose the motion of the object, any successful attempt to maintain the oscillation requires the application of a periodic force that acts (at least predominantly) in the same direction as the motion.


Let us now go back to our artificial larynx to seek the missing force contribution that meets the requirements laid down in statement 6. Our model at this point is too simple in that it takes insufficient account of the fact that the airflow is by no means steady: it increases and decreases as the valve opens and closes. In the case of unsteady flow, Bernoulli's theorem does not quite hold true. Because of the inertia of the moving air, the velocity of air flowing through a constriction cannot instantaneously readjust itself as the aperture is changed. In other words, the sinusoidally varying aperture determined by our oscillating mass has passing through it an airflow whose variations lag behind by a small amount.

Figure 19.3 will allow us to see how the oscillation is maintained. At the top of the diagram we see a curve that represents the sinusoidal up-and-down oscillations of the mass M. The bottom part of the figure shows the corresponding variation of the flow-induced oscillatory Bernoulli force that acts upon it. Notice that the force reaches its upward and downward maxima at instants of time that are slightly later than those at which the maximum excursions of the mass itself take place. To help us recognize the relationship between the Bernoulli force and the direction of motion of M, all parts of the displacement curve that correspond to downward motion are so labeled, and they are also drawn using a beaded line. In similar fashion, those parts of the force curve that represent a downward urging on the mass are labeled and drawn with a beaded line. The parts of the two curves corresponding respectively to upward motion and upward force are also labeled, and are drawn using plain lines. In the middle area of figure 19.3 we find a series of shaded boxes which call our attention to those periods of time during which the Bernoulli force acts in the same direction as the motion of the vocal-cord surrogate M. These are the times during which the force contributes to the maintenance of oscillation. Notice that these intervals of "helpful" interaction are longer than the intervening periods during which the force tends to diminish the oscillation. The net action is therefore of the sort needed for the maintenance of oscillation, according to the requirements of statement 6.
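The bookkeeping behind the shaded boxes can be checked numerically. The sketch below rests on assumptions of our own: the displacement is taken as sin(theta), the oscillatory Bernoulli force as a restoring force of unit amplitude delayed by a small phase lag, and the lag of 0.2 radian is an arbitrary illustrative value. The function name is hypothetical.

```python
import math


def helpful_fraction_and_work(lag, steps=100_000):
    """Step through one cycle; return (fraction of the cycle in which force and
    motion have the same direction, net work done on the mass per cycle)."""
    helpful = 0
    work = 0.0
    d_theta = 2.0 * math.pi / steps
    for i in range(steps):
        theta = (i + 0.5) * d_theta
        velocity = math.cos(theta)         # displacement is sin(theta)
        force = -math.sin(theta - lag)     # restoring force, delayed by `lag`
        if force * velocity > 0.0:
            helpful += 1
        work += force * velocity * d_theta
    return helpful / steps, work


for lag in (0.0, 0.2):
    fraction, work = helpful_fraction_and_work(lag)
    print(f"lag {lag:.1f} rad: helpful {fraction:.3f} of the cycle, net work {work:.3f}")
```

With no lag the force helps and hinders for equal times and does no net work, which is the purely springlike case. Any delay tips the balance: in this idealization the helpful fraction is 1/2 + lag/pi and the net work per cycle is pi times sin(lag), so energy is fed into the oscillation as statement 6 requires.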


Detailed study of our mechanical model of the larynx shows that it has all of the major properties of the real larynx, but lacks some of the subtler features.

James Flanagan and his coworkers at the Bell Telephone Laboratories have found, however, that almost everything can be well accounted for with only a slight elaboration of our simple machine.[2] All that is required is the provision of two adjacent movable lumps of matter, each with its own spring and damper, plus a coupling spring between them. This makes the whole larynx model into a cousin of the two-mass chain, with consequences some of which you will be able to guess with the help of what is said in sections 6.3 and 10.5.

We will close this section with a brief look at the actual flow patterns (and their sinusoidal components) that come through the larynx to act as a sound source for the rest of the vocal system.[3] The patterns range between the two limiting forms shown in figure 19.4. The top part of the figure shows the successive puffs of air produced when a man sings a note a little above G2 (100 Hz) with a relatively high breath pressure and fairly close initial spacing of the vocal cords. Notice first of all that the successive puffs of air are quite uniformly spaced (0.01 seconds apart), giving a well-defined repetition rate. This tells us that the partials are harmonically related. During each puff, the flow rises fairly quickly to a somewhat spiky peak, and then decreases in a slightly wiggly fashion. Notice further that the flow ceases completely for about one-third of each cycle, during the interval when the two cords have pressed themselves together.


The lower part of figure 19.4 shows the other extreme in voice production. A gentle stream of air is sent past the cords, flowing just strongly enough to keep them vibrating. The cords do not close completely, however, so that the flow is never shut off altogether. The waveform here is not as spiky as before, being shaped more like a slightly skewed sinusoid. We will postpone until later in the chapter any consideration of the implications of the slight irregularities existing between successive pulsations.


The vibration recipes for the two flow patterns illustrated in figure 19.4 differ chiefly in the relative amplitudes of the first half dozen pairs of corresponding partials. In the spiky waveform, partials from 1 to about 6 are of roughly equal amplitude, whereas above this the amplitude of the nth component is about $1/n^2$ as large as that of the first partial. In the more rounded signal, the 100-Hz fundamental component is considerably stronger than the other harmonic components, say 4 or 5 times the amplitude of partial 2, after which the amplitudes fall away with extreme rapidity.


For ordinary speech we may safely assume a pattern of flow intermediate between the two we have just considered. This intermediate pattern has a slightly skewed triangular shape. The flow is reduced to zero only momentarily, and the pattern shows a slightly rounded peak at the top. This shape is almost precisely what one sees at the start-up of a guitar string that is plucked somewhat to one side of center. This means that if we want the recipe for a typical intermediate voice sound, we can take over exactly the same recipe described in section 7.2, as modified by the corner-rounding explained in section 8.4. That is, the amplitude $A_n$ of the nth harmonic partial is primarily related to the fundamental amplitude $A_1$ by the formula $A_n = A_1/n^2$, with a few partials being weakened because their nodes (in time now instead of in space along the string) lie near the top corner of the waveform (the analog of the plucking point). A communications engineer would describe a recipe like this as having a few "zeros" in it, with the shape being outlined by an "envelope" that falls at the rate of 12 dB per octave.
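The link between the 1/n^2 recipe and the engineer's "12 dB per octave" is a matter of simple arithmetic, sketched here. The sketch deals only with the smooth envelope and ignores the few weakened partials (the "zeros"); the function names are ours, introduced only for illustration.

```python
import math


def partial_amplitude(n, a1=1.0):
    """Envelope of the intermediate voice recipe: amplitude of partial n is A1 / n^2."""
    return a1 / n**2


def level_db(amplitude, reference=1.0):
    """Amplitude expressed in decibels relative to the reference amplitude."""
    return 20.0 * math.log10(amplitude / reference)


# Each doubling of n is a rise of one octave in the frequency of the partial.
for n in (1, 2, 4, 8):
    print(f"partial {n}: {level_db(partial_amplitude(n)):6.1f} dB relative to partial 1")
```

Each doubling of the partial number divides the amplitude by four, which is a drop of 20 log10(4), or very nearly 12 dB.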


19.3. Sound Transmission through the Vocal Cavities and into the Room


The vocal tract, which extends from the larynx to the mouth (and/or nose) aperture, has the duty of transforming the rather simple airflow spectrum provided by the vocal cords into the recognizable acoustical patterns needed for speech and music. We have already learned in broad outline that the larynx, acting as a source, feeds one point in an elongated, roughly tubular, one-dimensional "room" whose set of natural frequencies can be adjusted (by movements of the tongue, lips, etc.). The mouth aperture is a sort of window at the far end of this room, acting in its turn as a simple source for the excitation of the vibrational modes of the three-dimensional room in which we can imagine we are listening.


The pressure variations produced by the larynx in the vocal tract, and thence the strength of the resulting source at the mouth, depend in a simple way on the adjustable resonance properties of the vocal tract. The pressure amplitudes produced for the various voice partials in the room surrounding the listener do not, however, have a simple proportionality to the strengths of the corresponding airflow components from the mouth. Simple sources radiating into a three-dimensional room have the fundamental property (mentioned earlier in connection with the discussion of figure 11.3) that the room-averaged sound pressure resulting from a given source strength is larger for high-frequency sources than for those oscillating more slowly. More precisely, for every doubling of frequency, there is a doubling of sound pressure in the room, provided the source strength is kept constant. A telephone engineer would say that the sound pressure in a room due to a constant-strength source rises at the rate of 6 dB/octave. The physical explanation of this relative emphasis at high frequencies is to be found in the rapidly increasing number of off-resonance room modes whose collected responses make up so much of the sound in a room (see sec. 11.4). There is no corresponding increase in the number of modes at high frequencies in a one-dimensional (i.e., long and narrow) room, which explains why we do not find a similar "treble boost" taking place at the junction of larynx and vocal tract.
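It may help to see how the 6-dB/octave treble boost of the room combines with a source recipe that itself falls at 12 dB per octave. The sketch below is a deliberately stripped-down illustration: it assumes a flow recipe whose partial amplitudes go as 1/n^2 and a room pressure proportional to frequency at constant source strength, and it leaves out the vocal tract resonances and the ear entirely. The function names are hypothetical.

```python
import math


def source_level_db(n):
    """Flow component n relative to component 1, for a 1/n^2 recipe (-12 dB/octave)."""
    return 20.0 * math.log10(1.0 / n**2)


def radiation_gain_db(n):
    """Room pressure per unit source strength, proportional to frequency (+6 dB/octave)."""
    return 20.0 * math.log10(n)


def room_level_db(n):
    """Room-averaged pressure of partial n relative to partial 1, resonances ignored."""
    return source_level_db(n) + radiation_gain_db(n)


for n in (1, 2, 4, 8):
    print(f"partial {n}: source {source_level_db(n):6.1f} dB, in room {room_level_db(n):6.1f} dB")
```

Under these assumptions the pressure recipe in the room falls at only about 6 dB per octave, half as steeply as the flow recipe at the mouth.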


In addition to the systematic effect of the mouth's radiation behavior on the sound pressure recipe, we need to take into account the fact that our ears themselves have progressively greater sensitivity for high frequencies (up to about 3500 Hz) than they have for lower frequencies. In what follows, both effects will be taken into account, and the discussion will be confined to the loudnesses, expressed in sones (see secs. 13.4 and 13.6), of the individual voice partials that someone would perceive if they came to his ear one by one, on the assumption that he is listening only a short distance away from the mouth of the singer or the person speaking. We will give the name loudness recipe or loudness spectrum to the description of the strengths of the various partials calculated in this way for a given vocal tone.


The top part of figure 19.5 shows the loudness recipe that is typical of the vowel [ah] steadily pronounced as in the word father by a man who pitches his voice 35 cents above G2.[4] The sinusoidal components of his voice sounding at this pitch will be exact multiples of 100 Hz. If the fundamental component of this sound reaches the listener's ear to produce a loudness of a trifle over two sones (as shown), the second partial would be heard at about 4.2 sones, etc. Notice that partial number 7 is very loud. We notice further that the loudness of the 11th partial is also greater than that of its adjacent neighbors. In similar fashion the 26th harmonic is also emphasized in the overall loudness spectrum of our 100-Hz tone.
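The statement that a voice pitched 35 cents above G2 has components at exact multiples of 100 Hz is easily verified. The sketch below assumes equal temperament with A4 tuned to 440 Hz (an assumption on our part, since no tuning standard is named here), in which G2 lies 26 semitones below A4.

```python
def shift_by_cents(frequency_hz, cents):
    """Raise a frequency by the given number of cents (1200 cents per octave)."""
    return frequency_hz * 2.0 ** (cents / 1200.0)


# Assumption: equal temperament, A4 = 440 Hz; G2 is 26 semitones below A4.
G2_HZ = 440.0 * 2.0 ** (-26.0 / 12.0)

voice_hz = shift_by_cents(G2_HZ, 35.0)
print(f"G2 = {G2_HZ:.2f} Hz; 35 cents higher = {voice_hz:.2f} Hz")
```

On this tuning G2 comes out very close to 98 Hz, and the 35-cent rise brings the fundamental to 100 Hz within a small fraction of a hertz.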

The lower half of figure 19.5 shows the loudness spectrum associated with a 220-Hz (A3) tone produced by the same man if he keeps his jaw, tongue, and lip positions unchanged from those used for the 100-Hz tone. The pitch of this tone is somewhat more than an octave higher than the first, but we would still agree that the same [ah] vowel is being produced. Notice that the overall shapes of the two loudness spectra are similar: in both cases we find a particularly strong component in the region from 600 to 700 Hz, another near 1100 Hz, and a third one lying near 2600 Hz that is louder than its neighbors. In between these loud components we find weaker ones, and the strengths of these in the two tones are quite similar as long as we confine our attention to some particular frequency region. For example, the 20th partial of the 100-Hz tone and the 9th one of A-220 both lie close to 2000 Hz and have loudnesses of about 2 sones.


The common element of the two differently pitched [ah] sounds that we have examined is the presence of especially strong components near 700, 1100, and 2600 Hz, and the existence of frequency regions near 900 and 2000 Hz and below about 300 Hz in which the partials are especially weak. The explanation of these peaks and dips in the loudness spectrum is easy to find: the peaks correspond to the characteristic frequencies of the particular vocal tract air column used by our subject when he is asked to pronounce the vowel [ah], and the dips arise from the tendency for cancellation between the in-phase responses of a higher mode driven below resonance and of a lower mode driven above its natural frequency. These matters were carefully discussed in section 10.5.
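One can find by a short calculation which partial of each tone falls closest to each of the strong regions. The sketch below uses the three peak frequencies quoted above for [ah] and the two fundamentals of figure 19.5; the helper function is our own illustration and knows nothing about loudness, only about where the harmonics lie.

```python
AH_PEAKS_HZ = (700.0, 1100.0, 2600.0)  # strong regions quoted in the text for [ah]


def nearest_harmonic(fundamental_hz, target_hz):
    """Serial number of the harmonic of the fundamental lying closest to target_hz."""
    return max(1, round(target_hz / fundamental_hz))


for fundamental in (100.0, 220.0):
    numbers = [nearest_harmonic(fundamental, peak) for peak in AH_PEAKS_HZ]
    frequencies = [n * fundamental for n in numbers]
    print(f"{fundamental:.0f}-Hz tone: partials {numbers} at {frequencies} Hz")
```

For the 100-Hz tone the calculation picks out partials 7, 11, and 26, the same ones singled out in the upper spectrum. For the 220-Hz tone it picks out partials 3, 5, and 12, at 660, 1100, and 2640 Hz: different serial numbers, but the same frequency regions.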


What is often called the spectrum envelope of the [ah] sound is a smooth curve drawn to indicate the pattern of loudness of this vowel, regardless of what fundamental voice frequency is used for its production. This spectrum envelope is almost exactly the ordinary resonance response curve measured between the point of original excitation and the position of the detector. Figure 11.3 is an example of such a curve measured between two points in a room, while fig. 10.14 shows the corresponding transmission for vibration between points on a metal tray. In this chapter we are using a slightly modified version of these transmission curves, since we want to make allowance for the properties of the ear itself.


The middle part of figure 19.6 is the loudness spectrum envelope for [ah]; the top and bottom parts of the figure show the corresponding envelopes for the vowels [oo], the middle sound of the word pool, and [ee], whose sound is found in the word feet. Each recognizable vocal sound that we produce is associated with its own particular arrangement of characteristic mode frequencies for the vocal tract, and each of these is brought about by a particular shaping of the air column.


We are now in a position to summarize and slightly extend the basic ideas of vocal sound production as we have met them so far. This summary is an abbreviated paraphrase of the opening remarks in the present-day classic study, Acoustic Theory of Speech Production, by the Swedish scientist Gunnar Fant, who is director of the Speech Transmission Laboratory at the Royal Institute of Technology in Stockholm. [5]


1. The vocal cords oscillate at a frequency determined primarily by their mass and tension, with frictional losses being restored by means of aerodynamic (Bernoulli) forces produced by the stream of air from the lungs.


2. This oscillation of the vocal cords transmits roughly triangular puffs of air into the vocal tract. The repetition rate of these puffs is equal to the vibration rate of the cords. The vibration of the cords, and therefore the shape of the resulting puffs, varies slightly from cycle to cycle, even when an attempt is made to generate a perfectly steady sound. 

3. A voice source (as heard in the room) is characterized by a spectrum envelope. Each vowel (and consonant) sound that one may wish to produce has its own characteristic spectrum envelope. The peaks and dips of any such spectrum envelope are determined by the frequencies of the characteristic vibrational modes of the corresponding vocal tract configuration.


4. The peaks that are observed in the spectrum envelope are called formants. Conventionally one assigns an identifying serial number to these formant peaks, formant 1 being the one having the lowest frequency.


5. For males the first formant peak of any vocal sound lies in the frequency region between 150 and 850 Hz, the second in the range between 500 and 2500 Hz, and the third and fourth in the 1500-to-3500-Hz and 2500-to-4800-Hz regions.


6. As a consequence of the one-dimensional, long and narrow nature of the vocal tract, the average spacing of the formant frequencies is roughly constant. Its length is such that for males the average spacing is about 1000 Hz. Because of these limitations, it is not possible for a person to achieve every arbitrarily chosen pattern of formants within the ranges given above.


7. Two people uttering the "same" sound will generally use slightly different formant frequencies, partly because of differences in their regional accent, and partly because of differences in the dimensions of their vocal tracts. Women's formants generally lie about 17 percent higher, and children's about 25 percent higher, than those typical of men.


8. The first three formants dominate the recognizability of speech, and much intelligibility is retained if only two formants are present.
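Item 6 can be illustrated with the simplest textbook idealization of the vocal tract. This idealization is an assumption added here, not part of Fant's summary: a uniform tube closed at the vocal cords and open at the lips, whose resonances fall at odd multiples of c/4L. The tract length of 17.5 cm and the sound speed of 350 m/s are assumed round numbers, and the function name is hypothetical.

```python
def uniform_tube_formants(length_m, count, sound_speed=350.0):
    """Resonances (Hz) of an idealized tube closed at one end and open at the other."""
    return [(2 * n - 1) * sound_speed / (4.0 * length_m)
            for n in range(1, count + 1)]

formants = uniform_tube_formants(0.175, 4)          # assumed 17.5-cm tract
spacings = [b - a for a, b in zip(formants, formants[1:])]

print([round(f) for f in formants])   # formant frequencies in Hz
print([round(s) for s in spacings])   # spacing between neighbouring formants in Hz
```

In this idealized tube the spacing between neighbouring resonances is c/2L, so a longer tract gives a smaller spacing. A real vowel departs from the even pattern because the tract is not uniform in cross-section.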


The importance of the formant peaks, and in particular of the frequencies of these peaks, suggests that a sound made up of a few inharmonically related sinusoids, each of which is matched to one of the formant frequencies of a particular vowel, might be heard as giving that particular vowel. For example, we might guess that the [ah] sound could arise from the simultaneous sounding of components at 700, 1100, and 2600 Hz, or that [oo] would be produced by components at 300, 625, and 2500 Hz. This does not in general prove to be the case.


We consider next the much more serious problem of the possibility of ambiguity in the recognition of a given formant pattern, and learn of the way in which our ears exploit the information available to them to resolve the ambiguity. Suppose for example that our experimental subject is asked to produce exactly the same [ah] sound that led to the spectra shown in figure 19.5, except that he is to use a frequency of 440 Hz as the fundamental frequency rather than the 100- and 220-Hz values he used before. For a man to sound a 440-Hz tone generally requires a shift to what is called the falsetto, a type of sound production that is understandable in terms of a double-mass vocal cord model in which the motion is a combination of mode-1 and mode-2 oscillations. The relationship between walking and running is an analogous piece of physics in which we recognize differing combinations of two characteristic modes of oscillation. The loudness spectrum for the higher-pitched 440-Hz sound is readily deduced from the one appropriate to the 220-Hz tone an octave lower: one has only to obliterate the odd-numbered components from the lower diagram in figure 19.5. Elimination of the odd components appears (at least on paper) to do a rather destructive thing to the recognizability of the formant pattern, since the strong components at 660 and 1100 Hz are eliminated, along with the noticeably weak one close to 2000 Hz. The remaining partials (harmonics of 440 Hz) are indicated in the diagram by crosses drawn above each one of them, so that your eye can more easily visualize a rather broad implied formant hump extending from around 200 Hz to nearly 1500 Hz, together with a spike at 2640 Hz belonging to the strong 6th harmonic of the 440-Hz tone. Comparison of this implied spectrum envelope with the envelope given for [oo] at the top of figure 19.6 shows that the two have a very similar appearance. This means that these two vowels would be hard to distinguish when spoken at a pitch corresponding to 440 Hz. There would of course be no difficulty in distinguishing the 440-Hz version of [ee] from the other two sounds.
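The bookkeeping behind "obliterating the odd-numbered components" is easy to reproduce. A small sketch in Python follows; the function name and the 2700-Hz upper limit are choices made for this illustration only.

```python
def harmonics(fundamental_hz, upper_limit_hz):
    """Harmonic frequencies of a tone, up to upper_limit_hz."""
    count = int(upper_limit_hz // fundamental_hz)
    return [n * fundamental_hz for n in range(1, count + 1)]

low = harmonics(220, 2700)     # partials of the 220-Hz tone
high = harmonics(440, 2700)    # partials of the 440-Hz tone
removed = [f for f in low if f not in high]   # odd-numbered partials of 220 Hz

print("kept:   ", high)
print("removed:", removed)
```

The removed list contains the components at 660, 1100, and 1980 Hz named in the text, while the kept list ends with the 2640-Hz sixth harmonic of the 440-Hz tone.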


The resolution of the ambiguity proves to be straightforward. The fact that the repetitive motion of the vocal cords is not precisely regular (due in part to inescapable muscle tremor and in part to certain aerodynamic instabilities of flow) means among other things that there is a continual fluctuation of the fundamental frequency, a sort of random vibrato. A typical extent for this fluctuation is 0.5 percent, corresponding to variations of 2.2 Hz, 4.4 Hz, and 6.6 Hz at the first three harmonics of 440 Hz. Since the component near 440 Hz is fluctuating a little in frequency, the strength of this partial also fluctuates as the excitation slides up and down on the resonance curve of the vocal tract. For instance, an upward fluctuation of frequency brings this component closer to the first formant resonance, and so increases the loudness of what we hear. At 440 Hz, then, our ear is supplied with the information that the spectrum envelope curve is steeply rising toward high frequencies (verify this by looking at the slope of the curve for [ah] at 440 Hz in fig. 19.6). This tells our ears that a formant peak lies a little above 440 Hz. In an exactly similar fashion, fluctuations of the 880-Hz second partial inform us that in this neighborhood the spectrum envelope is roughly horizontal (i.e., this component lies at either the top of a formant peak or at the bottom of a dip in the spectrum envelope). To continue, the downward slope to the response curve brought to light by fluctuations of the third harmonic (around 1320 Hz) implies the existence of a formant peak lying below this frequency. Let us put these various pieces of information together now to see how completely the ambiguity has resolved itself. The behavior of partial 1 tells us there is a peak on the high-frequency side of it. This missing peak must lie between partials 1 and 2 since partial 2 could not possibly be at the top of a peak and still match partial 1 in loudness. A similar argument establishes the presence of formant 2 between partials 2 and 3.
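The way a small pitch fluctuation samples the local slope of the spectrum envelope can be sketched numerically. The envelope below is a toy model, not the measured curve of figure 19.6: it is an assumed sum of simple resonance peaks placed at the three [ah] formant frequencies quoted earlier, each with an assumed half-width of 200 Hz. The function names are hypothetical.

```python
FORMANTS_HZ = (700.0, 1100.0, 2600.0)   # [ah] peaks quoted in the text
HALF_WIDTH_HZ = 200.0                   # assumed half-width of each toy peak

def envelope(freq_hz):
    """Toy spectrum envelope: a sum of simple resonance peaks (an assumption)."""
    return sum(1.0 / (1.0 + ((freq_hz - f) / HALF_WIDTH_HZ) ** 2)
               for f in FORMANTS_HZ)

def probe(fundamental_hz, partial, fraction=0.005):
    """Return (frequency excursion in Hz, envelope slope sampled by that excursion)."""
    centre = partial * fundamental_hz
    excursion = fraction * centre
    slope = (envelope(centre + excursion) - envelope(centre - excursion)) / (2.0 * excursion)
    return excursion, slope

for n in (1, 2, 3):
    excursion, slope = probe(440.0, n)
    print(f"partial {n}: +/-{excursion:.1f} Hz, envelope slope {slope:+.5f} per Hz")
```

In this toy model the slope is positive at partial 1 and negative at partial 3, while partial 2 gives the smallest slope of the three because it sits close to the bottom of the dip between the two lower peaks.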


There is an even more clear-cut way in which our hearing process manages to keep track of the formant locations that might otherwise sandwich themselves between the voice harmonics. In speaking and singing, one is constantly going from one sound to another, and each formant moves smoothly from its position for one part of the utterance to that belonging to the next part. If the pitch is maintained constant throughout, we have the spectrum envelope moving past the fixed voice harmonics to plot out their shapes in time, just as we earlier found that pitch fluctuations are able to explore the shape of a fixed formant pattern. In actual speech and singing, of course, both processes are going on continually as we raise and lower the pitch of our voices and simultaneously change the formant patterns belonging to the separate parts of the words we are enunciating.


19.4. The Male Voice and the "Singer's Formant"


The bass-baritone voice can be thought of as a musical instrument whose lowest note has a fundamental frequency lying in the region of 80 Hz (near E2), with its top note (near F4) having a fundamental in the neighborhood of 350 Hz. In this section we will seek some of the musically relevant elements that characterize the tones of this vocal instrument (which elements are typical also of the higher male voices), and learn how the singer can make alterations in his mode of tone production. We will ignore the verbal communication aspects of singing, considering only those musical effects that might be noticed by a listener who is not acquainted with the language being sung.


The relatively stable and featureless source spectrum generated in a singer's larynx is operated on by his vocal tract to produce the elaborately shaped and rapidly varying audible spectrum that comes to our ears as the singer goes from note to note and from vowel to vowel (see secs. 19.2 and 19.3). While we are listening to a singer, our nervous system (in the midst of its many other duties) deduces a kind of running average and seeks correlations over successive brief but overlapping spans of time; this continual processing gives us a good perceptual idea of the common element in the singer's varied sounds, this common element being the source spectrum generated by his larynx. When the puffs of air are short and spiky, we say that the singer is using a light or bright voice. The darker voice colors are associated with a rounded, smoothed-out pattern of airflow (see fig. 19.4).


Digression on the Extraction of Average Properties: The LTAS.


The following laboratory technique is based on a much simplified cousin of the way in which our nervous system works to extract the common elements of a sound. A sound is tape-recorded over a suitably chosen interval of time; this tape is then made into a loop and played over and over into an electronic analyzer that picks out successive frequency bands (say 50 or 100 Hz wide) and measures the aggregate strength of the partials lying within them, averaging the results of each measurement over the entire duration of the passage. If we wish to apply this procedure to a singer's voice, the recording must be long enough that the singer has had time for several repetitions of a substantial fraction of his voice's repertory of pitches and vowels. Under these conditions, the long time average spectrum (abbreviated LTAS) gives us something that is a close cousin to the larynx spectrum as modified by the "treble boost" property of the mouth-to-room coupling. The peaks and dips of the vocal tract transmission for various enunciations tend to average themselves out when various pitches are sung in an LTAS, leaving evidence of their statistical aggregate in the form of a somewhat accentuated region near 450 Hz, analogous to the mouth-aperture trend toward accentuated high frequencies that was just mentioned. (While the LTAS technique has many uses in the study of musical sounds, one cannot use it trivially to deduce such things as the flow spectrum at the reed end of a woodwind, or the force spectrum at the bowing point of a violin, despite their apparent analogy with the excitation spectrum from the larynx.)
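A digital counterpart of this tape-loop procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a direct Fourier sum stands in for the electronic analyzer, every frame is weighted equally in the average, and the short synthetic "recording" with components at 450 and 1200 Hz is hypothetical. A real measurement would use a much longer passage.

```python
import cmath
import math

def ltas(samples, sample_rate, frame_len, band_hz):
    """Average power in each band_hz-wide band, taken over all whole frames."""
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        raise ValueError("recording is shorter than one frame")
    bin_hz = sample_rate / frame_len
    n_bands = math.ceil((sample_rate / 2) / band_hz)
    totals = [0.0] * n_bands
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        for k in range(1, frame_len // 2):          # skip the constant (DC) term
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            totals[int(k * bin_hz // band_hz)] += abs(coeff) ** 2
    return [t / n_frames for t in totals]

# Synthetic "recording": a strong 450-Hz component plus a weaker one at 1200 Hz.
RATE = 8000
signal = [math.sin(2 * math.pi * 450 * t / RATE)
          + 0.3 * math.sin(2 * math.pi * 1200 * t / RATE)
          for t in range(1600)]

spectrum = ltas(signal, RATE, frame_len=160, band_hz=100)
loudest = max(range(len(spectrum)), key=spectrum.__getitem__)
print(f"strongest band: {loudest * 100}-{(loudest + 1) * 100} Hz")
```

The result is a list of band powers, not loudnesses; converting to a loudness curve of the kind plotted in this chapter would require an additional model of the ear that is not attempted here.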


In the above digression and the immediately preceding paragraph we have considered an aspect of vocal sounds whose description remains fairly constant even when the singing pitch is altered. It proves possible to make statements of the sort, "we learn from a certain singer's LTAS that the higher partials of his voice become successively weaker at the rate of 12 dB/octave," without having to specify the repetition rate of the source. The musical relevance of this possibility comes at present from the fact that our hearing mechanism is able to extricate an auditory version of this same information. It is time now to look at the interplay between a constant element of a given vowel (its formants) and the variations in pitch that are the basis of singing.


The fact that the upper two-thirds of the bass-baritone singing instrument's range overlaps the lower third of the 150-to-850-Hz range of the first voice formant guarantees the impossibility of specifying the amplitude relation between successive partials measured in the room without also specifying the singing pitch. Thus we deduce from figure 19.6 that the 700-Hz first formant for the vowel sound [ah] lies three octaves plus about a semitone above the 80-Hz bottom note singable by a typical male voice, so that the strongest partials of the E2 note (as we hear it) will be the 8th and 9th. On the other hand, the top note of our hypothetical male singing instrument has a 350-Hz fundamental frequency, so that when it sounds [ah] while singing F4, this same first formant will cause the 2nd partial to come to our ears most strongly.
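Both deductions in this paragraph reduce to a frequency ratio. A short sketch in Python, with hypothetical helper names, uses the round figures of 80, 350, and 700 Hz; with those round figures the excess over three octaves comes out between one and two equal-tempered semitones.

```python
import math

def interval_semitones(low_hz, high_hz):
    """Size of the musical interval between two frequencies, in semitones."""
    return 12.0 * math.log2(high_hz / low_hz)

def bracketing_partials(fundamental_hz, formant_hz):
    """Serial numbers of the harmonics just below and just above a formant."""
    ratio = formant_hz / fundamental_hz
    return math.floor(ratio), math.ceil(ratio)

octaves, semitones = divmod(interval_semitones(80.0, 700.0), 12.0)
print(f"{int(octaves)} octaves plus {semitones:.1f} semitones")
print(bracketing_partials(80.0, 700.0))    # bottom note, near E2
print(bracketing_partials(350.0, 700.0))   # top note, near F4
```

When the two numbers returned by bracketing_partials coincide, a single partial lies exactly on the formant, as happens for the 350-Hz note.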


We, as listeners experienced with human speech, would have no difficulty in recognizing the vowel [ah] as produced by our singer at either of the above-mentioned pitch extremes. On the other hand, as musicians interested in tone color who imagine ourselves to be listening to abstract sounds, we might not be willing to say that the singer produces the same tone color when he sings [ah] at the bottom of his range as he does at the top of it. Let us sharpen up the contrast between the musical and verbal versions of our perceptions with the help of an example mentioned early in section 19.1. Suppose we tape record the sound of a singer producing the sustained vowel [ee] at the pitch C4, and then play this tape back at various speeds so as to transpose the tone to all the semitones of the musical scale. In this process the formant frequencies (peaks in the spectrum envelope) are transposed to higher and lower frequencies, along with the partials of the tone itself. An engineer would say that the spectra of the resulting tones all have the same shape, and he could deduce from the bottom curve in figure 19.6 that the fundamental component (which was originally at 261.6 Hz) is more than 3 times as loud as partial 2 and about 15 times louder than partial 3 (lying near 784 Hz); partial 6 is almost inaudible since it lies at the dip in the formant curve (near 1570 Hz), while partials 8 and 9 straddle the second formant peak and so are about as strong as partial 3. If this description omits the frequency designations (which were purely explanatory), leaving only the serial numbers of the various partials and their relative strengths, the above statements remain true for the entire scale of transposed notes, as already noted by the engineer.


As long as we do not wander more than an octave or two on our scale above or below C4, our musical ears would agree with the engineer's description given above in the sense that they would recognize that all these [ee] sounds have a rather constant tone color. At a subtler level of listening we would detect a slow trend toward what many people would call brightness or lightness in this sort of sound as we go up the scale, and a corresponding darkening as we go down. This description of relative lightness or darkness, however, is not associated with quite the same sort of acoustical change that we find associated with these adjectives when a singer changes the excitation recipe from his larynx.


If we change our mode of listening to that used in recognizing human speech, we find, on the other hand, that the sound of our tape playbacks would not preserve the [ee] vowel character very far as we go up or down in the scale from the C4 starting point. This is because in playback the formant frequencies themselves are being shifted, thus destroying the identifying marks of the vowel. To be sure, no trouble at all comes from going up or down the scale by a major third because this leaves us within the 25-percent range spanned by the average formant frequencies of men, women, and children. Experiment shows, however, that a 50-percent shift of the formants by the transposition of our tones up or down by a musical fifth will change speech sounds enough to hinder intelligibility seriously.
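The percentages quoted for the major third and the fifth follow from the speed ratio of the tape. A minimal sketch in Python, assuming equal-tempered semitones (an assumption of this example; the function name is hypothetical):

```python
def formant_shift_percent(semitones):
    """Percentage change in every frequency when a tape is transposed upward."""
    return (2.0 ** (semitones / 12.0) - 1.0) * 100.0

for name, steps in (("major third", 4), ("fifth", 7)):
    shift = formant_shift_percent(steps)
    print(f"{name}: formants shifted by about {shift:.0f} percent")
```

Because a change of tape speed multiplies every frequency by the same factor, the partials and the formants move together, which is why the engineer's description by serial number survives the transposition while the vowel identity does not.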


Opera singers and others who perform with large orchestral accompaniment have developed several very interesting ways of coping with the problem of being heard recognizably. While parts of the two phenomena I shall describe here have been recognized for several decades, our understanding of their implications has been clarified greatly by the recent work of Johan Sundberg at the Speech Transmission Laboratory in Stockholm.[5]

Let us first investigate the acoustical nature of the singer-versus-orchestra audibility problem, so that its solution can be made intelligible. To begin with, we must be aware that the shape of the long time average sound pressure spectrum (LTAS, see the previous digression in this section) of orchestral music is very much the same, whether one measures a Mozart violin concerto or an operatic overture by Wagner. There are of course small differences, and loud passages in particular have an LTAS with a slight increase of their high-frequency components relative to the low-frequency ones. We can describe the sound pressure level (decibel) version of the orchestral LTAS by saying that it rises quickly from low frequencies to a peak near 450 Hz, and then falls away with an average slope of about 9 dB/octave. The actual measured spectrum can be translated into the corresponding loudness curve, which gives at any frequency the loudness that that particular segment of the spectrum would have if it were heard by itself. We find here that the peak at 450 Hz has now become very marked indeed, falling to half loudness on the two sides of the peak at about 150 and 900 Hz. The loudness is roughly constant from 1000 Hz to about 2500 Hz, above which it decreases steadily to nothing at the upper limit of hearing. A typical example of this behavior is shown by the smooth curve in figure 19.7.
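The phrase "an average slope of about 9 dB/octave" can be turned into numbers with a one-line formula. The sketch below treats the fall above the 450-Hz peak as an idealized straight line on a decibel-versus-octave plot, which is a simplification of the measured curve; the function name is hypothetical.

```python
import math

def level_below_peak_db(freq_hz, peak_hz=450.0, slope_db_per_octave=9.0):
    """Drop in sound pressure level relative to the peak, above the peak only."""
    if freq_hz < peak_hz:
        raise ValueError("the straight-line description applies only above the peak")
    return slope_db_per_octave * math.log2(freq_hz / peak_hz)

for f in (900.0, 1800.0, 3600.0):
    print(f"{f:.0f} Hz: about {level_below_peak_db(f):.0f} dB below the peak")
```

These are sound pressure levels, not loudnesses; the loudness curve described in the rest of the paragraph has a different shape because of the properties of the ear.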

The LTAS for ordinary speech and ordinary singing (but not for singing in the large-scale, operatic style) has a shape that is roughly similar to what we have just described as belonging to an orchestra. This remark provides us with at least an indication that a singer might have problems being heard; he apparently does not sound very different (in one sense) from the orchestra, and it is unlikely that he can overpower it through sheer vocal exertion. If the LTAS of an orchestra and an ordinary voice are quite similar, we would expect a certain amount of masking to take place (see chap. 13). When one listens in a room to pairs of sinusoids, fluctuations in the transmission of both the masking and the masked sound from source to our ears normally make masking unimportant. However, when there are many components from various sources having frequencies within the ear's critical bandwidth (about four semitones) centered around the frequency of the test sinusoid, masking can be a problem. Sundberg has found in a preliminary way that, for a single sinusoid to be audible in the presence of a noise source whose spectrum has been given the same shape as the orchestral LTAS, this single sinusoid must have a pressure amplitude roughly equal to the aggregate masking sound pressure of the noise that is within the critical bandwidth surrounding the sinusoid. In a musical surrounding one might expect the highly organized harmonic components of the singer's voice to survive masking somewhat better than this, because they can advertise themselves quite well as a single entity: that is, they have exactly synchronized beginnings and endings, precisely tracking vibratos, and well-defined patterns of swelling and diminishing as the formants change during articulation. All these things prove to be somewhat effective, particularly since many of these patterns of change are quite different from those that help characterize the various orchestral instruments. Nevertheless, the sheer weight of numbers leads to trouble when one man tries to make himself heard above the sounds from many. Furthermore, the overall similarity of the orchestral LTAS and its ordinary vocal counterpart guarantees that at no place in the frequency range do the voice partials have a chance to predominate over their orchestral setting, and so "carry" their weaker brothers to our attention.
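Sundberg's rough criterion can be written out as a small calculation. The sketch below is illustrative only: it assumes that the band of about four semitones is centered on the test frequency with equal musical intervals on either side, and that the pressures of the masking components combine as a root-sum-square. Neither detail is spelled out in the text. The component list and all function names are hypothetical.

```python
import math

def critical_band_edges(centre_hz, width_semitones=4.0):
    """Lower and upper edges of a band width_semitones wide around centre_hz."""
    half = 2.0 ** (width_semitones / 24.0)
    return centre_hz / half, centre_hz * half

def aggregate_pressure(components, centre_hz):
    """Root-sum-square pressure of the (frequency, pressure) pairs inside the band."""
    low, high = critical_band_edges(centre_hz)
    return math.sqrt(sum(p * p for f, p in components if low <= f <= high))

def survives_masking(test_hz, test_pressure, components):
    """True if the test sinusoid at least equals the aggregate masker pressure."""
    return test_pressure >= aggregate_pressure(components, test_hz)

# Hypothetical orchestral components: (frequency in Hz, pressure in arbitrary units)
orchestra = [(400.0, 3.0), (440.0, 4.0), (470.0, 1.0), (900.0, 5.0)]

print(critical_band_edges(440.0))
print(survives_masking(440.0, 4.0, orchestra))
print(survives_masking(440.0, 6.0, orchestra))
```

The strong 900-Hz component plays no part in the verdict for a 440-Hz test tone because it lies well outside the band, which is the sense in which a singer gains by placing energy where the orchestra has little.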


The first of the acoustical alterations cultivated by the operatic singer to help him in the audibility contest is his habit of singing with a vocal cord placement and lung pressure relationship that produce short, sharp puffs of air in the output of his larynx. By this means he can, as we have seen earlier, strengthen the upper partials in his voice. The increased audibility of these upper partials helps us to follow the rest of his voice components through the orchestral sound.


The second large-voice acoustical phenomenon we will consider is the so-called singer's formant. At least 25 years ago it was noticed that skilled male operatic singers did not sing words with quite the arrangement of formants that they would use in speaking those same words. Many of these differences are relatively small, and for present purposes unimportant. However, there is one very significant alteration that turns out to contribute enormously to the audibility of a singer who competes with an orchestra. Tucked in among the other formants of his voice is a very strongly marked extra one lying somewhere in the region between 2500 and 3000 Hz. When we measure the various speech sounds one by one in an operatic singer's voice, we find that this particular formant has a frequency that is independent of the placement of the other, more ordinary formants. The enormous contribution of the singer's formant to his audibility can readily be understood by comparing the loudness LTAS for ordinary music (solid line in fig. 19.7) with the one obtained by Sundberg for the tenor Jussi Björling singing with loud orchestral accompaniment, which is shown as the beaded line in figure 19.7.


The fact that the singer's formant is independent of the placement of the other formants tells us that this formant arises from resonances in some part of the vocal tract that somehow escapes the influence of the ordinary changes in its shape. We can make good use of the ideas of wave impedance (which were first met in section 17.1) to help ourselves find the origin of the singer's formant. The vocal cords form an adjustable closure at the bottom of a small tube (the larynx tube) which is a little more than 2 cm long. The larynx tube has a slight bulge at its lower end, and its upper end opens into a somewhat enlarged throat region which then connects with mouth and nose cavities. The operatic singer has learned to exaggerate the change in cross-section that exists at the junction of the larynx tube and the throat, thus increasing the discontinuity of wave impedance between the two ducts. The second digression in section 17.1 explains that if two parts of a large system have drastically different wave impedances, it is permissible to think about the characteristic frequencies of each part more or less independently. Sundberg has shown that the first characteristic mode of vibration of air in the short larynx tube is associated with the singer's formant. The excitation in the short tube is given its acoustical identity by the trained singer's ability to provide a strong discontinuity in the cross-section at its upper end. If the discontinuity is not emphasized, the larynx tube is merely part of the "room" of irregular shape called the vocal tract. If we like, the operatic singer's larynx tube can be thought of as a miniature vocal tract in its own right, whose upper end serves as a kind of mouth which excites the long narrow room provided by the rest of the vocal tract. In this way of looking at things, the singer's formant is the first formant of the miniature vocal tract. In other words, the oscillatory flow recipe from the larynx is first given, in the short tube of the larynx, a strongly peaked boost in the 2500-to-3000-Hz region before it is passed on for a more familiar type of processing by the rest of the vocal system.

To summarize, the trained operatic male voice is produced by a singer who has learned to cope with his orchestral accompaniment by means of several changes in his acoustical output. First of all, he can generate a flow pattern from his larynx whose higher partials become progressively weaker at a more gradual rate than those used in ordinary speech or in a smaller-scale type of singing. In addition, he has learned (sometimes at the expense of a certain amount of strain, or even discomfort) to pull the lower end of his vocal tract into a shape that permits the production of the singer's formant. Finally, he tends to use a fair amount of vibrato, which adds a great deal of recognizability to the various sinusoidal components of his voice by providing them with a synchronized pulsation in frequency and amplitude (as they sweep across their various formants). Such synchronized variations in an otherwise complex signal are of course exactly the sort of things our auditory recognition machine works well upon. The synchronized pulsations of vibrato are one more common element in the singer's sound which we can seize upon as our ears pursue his voice through the music.


The special skills of the male operatic singer have, as we have seen, a particular value to him in his chosen profession, but they are not an entirely unmixed blessing. The singer's formant, whose frequency is essentially unchangeable, can become a harsh and obtrusive element sawing away on the listener's consciousness. This harshness can be avoided to some degree if the performer is artistic enough to vary his singer's formant from nothing on up to its maximum prominence, changing its magnitude as his musical surroundings change. Similarly, his customary form of vibrato, which runs continually and at its own pace completely independent of the rhythmic pattern of the music, can give great audibility to his voice precisely because of the individuality of its pattern. However, any piece of music is likely to require a resourceful musician to employ once again the full range of variation, from no vibrato at all, through one which comes and goes during the longer notes, to the more fixed variety whose function we have already described. In short, maximum audibility is not automatically advantageous-a voice whose rich variability is skillfully made to appear and disappear in various ways provides a marvelous vehicle for the display of true artistry.


19.5. Formant Tuning and the Soprano Singing Voice


The soprano singer uses tones from the upper portion of the range for human voices. The relationships between a soprano's relatively high voice frequencies and those of the formants she uses for speech will help us understand several of her practices that are quite different from those of her male colleague. A particularly striking practice of some sopranos will be the subject of this section.


One evening in the fall of 1971 my wife and I noticed an arresting and most attractive quality in the sounds we heard in a recording by the soprano Teresa Stich-Randall as she performed the aria Porgi amor from Mozart's opera The Marriage of Figaro.[7] Whenever a note of the aria persisted a little, she seemed to be "tuning" one or another of the vowel formants to a harmonic component of the voice spectrum. It did not seem possible for her to start each note with this formant matching already complete, but the adjustment would take place rather quickly, making the tone "bloom" in a most pleasing way. Enquiry among singers shows that this mode of singing is not in general consciously cultivated. As a matter of fact, only a few singers do it with the precision that first brought it to our attention. Many listeners also seem to find it difficult at first to focus their attention on these acoustical changes, though most will say they find the resulting tone color admirable. It was easy for me to recognize this soprano's tuning process, since it was simply a new example of what I am accustomed to listen for as I alter the resonances of musical instruments by shading a woodwind tone hole with my finger or by moving an object in and out of the bell of a woodwind or a brass instrument. Such effects are important when I am asked to work on an instrument, because they act as a guide to more permanent adjustments to its physical structure.


To help us see what is going on when a singer tunes her formants in the way we noticed on the recording, we will look at a specific example. We will suppose that our soprano, while singing a word having the vowel sound [oo], comes to rest on the note B4♭ in the middle of the treble staff. At this point she is producing a tone made up of harmonic partials whose frequencies are 466.2, 932.3, 1398.5, . . . Hz. Clearly, the first partial of her voice lies somewhat above the 350-Hz position we would expect her to give formant 1 (17 percent above the 300-Hz value shown in the top part of fig. 19.6 for a male voice). The singer alters her tongue, jaw, and lip positions a little bit from her normal way of producing the [oo] sound, in such a way as to raise this formant to match the fundamental component of her voice sound. Our meticulous singer is next called on to sing a word having the vowel sound [ah] while producing the note D5, whose frequency components lie at 587.3, 1174.7, 1762.0, . . . Hz. While sustaining her note she can make a small downward adjustment in the frequency of her 1287-Hz second formant to make it coincide with the second voice partial.
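The numbers in this example can be reproduced from the equal-tempered scale and the 17-percent rule. A minimal sketch in Python, assuming A4 = 440 Hz and taking the two notes to be one and five semitones above A4 (which gives the 466.2-Hz and 587.3-Hz fundamentals quoted above); the helper names are hypothetical.

```python
A4_HZ = 440.0
FEMALE_FACTOR = 1.17    # women's formants about 17 percent above men's

def note_hz(semitones_from_a4):
    """Equal-tempered frequency a given number of semitones above A4."""
    return A4_HZ * 2.0 ** (semitones_from_a4 / 12.0)

def retune_percent(formant_hz, partial_hz):
    """Percentage by which a formant must move to sit on a partial."""
    return (partial_hz / formant_hz - 1.0) * 100.0

first_note = note_hz(1)     # fundamental near 466.2 Hz
second_note = note_hz(5)    # fundamental near 587.3 Hz

oo_formant1 = 300.0 * FEMALE_FACTOR    # male [oo] value read from fig. 19.6
ah_formant2 = 1100.0 * FEMALE_FACTOR   # male [ah] value quoted in the text

print([round(n * first_note, 1) for n in (1, 2, 3)])
print(f"[oo] formant 1 at {oo_formant1:.0f} Hz must move "
      f"{retune_percent(oo_formant1, first_note):+.0f} percent")
print([round(n * second_note, 1) for n in (1, 2, 3)])
print(f"[ah] formant 2 at {ah_formant2:.0f} Hz must move "
      f"{retune_percent(ah_formant2, 2 * second_note):+.0f} percent")
```

A positive result means the formant has to be raised to meet the partial, and a negative one means it has to be lowered.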


In 1972 Johan Sundberg made a set of observations on the way a professional soprano placed her formants while singing various vowels. He found that singers tend to align their formant frequencies in approximately the way just described, although his experimental subject did not align her formant tunings as closely as do certain singers whom I have noticed. However, the general behavior observed by Sundberg is entirely consistent with the possibility of exact tuning.

Figure 19.8 shows the kind of things that a soprano can do if she wishes (and is able) to make close tunings of her own voice formants to the voice frequencies required by the musical circumstances. Marks for the chromatic scale notes between C4 and A5# are arranged along the bottom axis of the figure, along with an indication for the fundamental frequencies belonging to these notes. The vertical axis is marked off with a frequency scale to indicate the voice-partial frequencies, and those of various formants. The solid-line curves that rise toward the right show the trend of the fundamental frequency and of its harmonics as one sings up the scale. Each curve is numbered at its left-hand end to indicate the harmonic to which it refers.


The sequence of dots along the lowest part of the graph shows the way in which the frequency of formant one varies if one sings either [oo] or [ee] up the chromatic scale between C4 and A5#. This formant frequency is about 350 Hz for all notes below D4, and therefore is not close to any of the voice harmonics. When the singer gets to E4, formant one for these two vowels has a frequency that matches that of her voice fundamental. As she sings further up the scale, she opens her mouth progressively wider, moves her jaw, etc., to keep formant one in tune with partial 1, even though their frequency rises from 329.6 Hz all the way up to 932.3 Hz. In other words, over a great part of her singing range a soprano is able to strengthen partial 1 by letting it ride on the peak of the first formant of either [oo] or [ee].


The next progression of dots above the one we have just discussed shows what happens semitone-by-semitone to formant one of the vowel [ah] as our fine-tuned singer progresses up the scale. Below E4, this formant cannot be brought into tune with a voice partial. From E4 to about G4 it is possible for vocal-tract adjustments to be made matching formant one with the frequency of the singer's second partial. Above this point in the scale, there is no reasonable way to bring the first formant belonging to [ah] into resonance until we come to E5. Beyond this the voice fundamental has risen sufficiently that it can be used to guide the matching of the first formant of [ah] as well as those belonging to the [oo] and [ee] sounds recognized earlier.


Just above the dots showing the first formant behavior of [ah] we find a similar sequence for the variation of formant two belonging to [oo]. This formant can come under the control of the second voice harmonic from about A4# all the way to the top of the range. Notice that above A4# the singer has the possibility of keeping both fundamental and second harmonic of her singing pitch in tune with formants of [oo]. Whether she does this, or picks one or the other, or tunes neither to the formants of [oo] presumably would depend on her skill and also on the time available. There is also the possibility that for some singing pitches it is not physiologically possible to attain both matchings simultaneously.


The second formant of [ah] jogs along in the general neighborhood of 1200 Hz over the whole singing range, although it becomes a candidate for tuning below D4# and in the immediate neighborhoods of G4 and D5. Sundberg found no evidence for an attempt at tuning the second formant of [ee], as indicated by the gently sloping row of dots at the top of the diagram. He finds this same lack of influence of upper partials on the tuning for the second formants of two or three other vowels, all of which lie very close to that shown for [ee]. This observed lack of influence of the higher partials is consistent with my own experience in the adjustment of wind instruments. If one can get two or three air column resonances accurately lined up with the lower partials of the sound spectrum, the listener and the player are very pleased with the result. Evidence in support of this observation can be traced in instrument making and performance practice at least back to 1720.
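The notes at which a formant near 1200 Hz becomes a candidate for tuning can be located by asking, for each note of the scale, how far the nearest voice harmonic lies from the formant. The Python sketch below does this under stated assumptions: equal temperament with A4 = 440 Hz, a fixed formant at exactly 1200 Hz, and an arbitrary tolerance of 50 cents standing in for "close enough to be pulled into tune". The tolerance and all function names are illustrative and are not taken from Sundberg's data.

```python
import math

# Sketch: which notes put a harmonic close to a fixed formant frequency?
# Assumes equal temperament (A4 = 440 Hz) and an arbitrary 50-cent tolerance.

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_frequency(midi_number):
    """Equal-tempered frequency in Hz for a MIDI note number (A4 = 69)."""
    return 440.0 * 2.0 ** ((midi_number - 69) / 12.0)


def note_label(midi_number):
    """Label in the book's style, e.g. 'D4' or 'A5#'."""
    name = NOTE_NAMES[midi_number % 12]
    return f"{name[0]}{midi_number // 12 - 1}{name[1:]}"


def nearest_harmonic_offset(f0, formant_hz):
    """Return (harmonic number, offset in cents) for the harmonic nearest the formant."""
    lower = max(1, int(formant_hz // f0))
    best = None
    for k in (lower, lower + 1):
        cents = 1200.0 * math.log2(k * f0 / formant_hz)
        if best is None or abs(cents) < abs(best[1]):
            best = (k, cents)
    return best


def tuning_candidates(formant_hz=1200.0, tolerance_cents=50.0,
                      low_midi=60, high_midi=82):
    """Notes from C4 to A5# whose nearest harmonic lies within the tolerance."""
    found = []
    for midi_number in range(low_midi, high_midi + 1):
        k, cents = nearest_harmonic_offset(note_frequency(midi_number), formant_hz)
        if abs(cents) <= tolerance_cents:
            found.append((note_label(midi_number), k, round(cents)))
    return found


if __name__ == "__main__":
    for label, harmonic, cents in tuning_candidates():
        print(f"{label}: harmonic {harmonic} lies {cents:+d} cents from the formant")
```

With these assumed settings the notes picked out are D4 (fourth harmonic), G4 (third harmonic), and D5 (second harmonic), in keeping with the regions named in the text. A different tolerance or formant frequency would of course shift the boundaries.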


Let us ask now what musical resources are made available to a singer who can tune one or two of her vowel formants to match at least approximately the harmonic components of the note she is producing. Sundberg points out that the most obvious advantage that comes from even an approximate tuning of the first formant is a very large increase in the loudness of the sound a singer can achieve for a given vocal effort. Not only will this be of use when she must compete with strong accompaniments, but also in more normal musical surroundings it has the advantage of increasing the range of dynamics that she can produce between a just-audible pianissimo and the fortissimo level that corresponds to the maximum effort of which she is capable.


There is a subtler effect of considerable musical importance which can be noticed when there is exact tuning of any formant. We learned earlier in this chapter that the inherent unsteadiness of the vocal cord oscillations gives rise to minute fluctuations in both amplitude and frequency of the sinusoidal components of the airflow recipe. In the closing part of section 19.3 we noticed that fluctuations in the frequency of a voice partial located on the sloping side of a formant peak give rise to fluctuations in the amplitude of the component as it is given to the room. In other words, there is more amplitude unsteadiness to be detected in the radiated sound than is present in the original excitation recipe from the larynx. When, however, the voice partial finds itself perched at the rounded top of a formant peak, the frequency fluctuations no longer give rise to additional amplitude variations, and the tone takes on a particular smoothness and fullness. Once again it should be remarked that my first awareness of the perceptual importance of an altered relationship between the two kinds of source unsteadiness came from study of the analogous behavior of orchestral wind instruments. This also led to the development of a simple but highly precise method for the measurement of air column resonance frequencies.
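The difference between a partial sitting on the slope of a formant and one perched on its peak can be put into rough numbers. The Python sketch below is only an illustration: it assumes an idealized single-resonance shape for the formant, with a hypothetical center of 700 Hz and bandwidth of 100 Hz, and a hypothetical frequency unsteadiness of plus or minus 0.5 percent. None of these values comes from the text.

```python
import math

# Sketch: extra amplitude unsteadiness produced by frequency jitter.
# The resonance shape and every numerical value here are assumptions.


def formant_gain(freq_hz, center_hz, bandwidth_hz):
    """Relative amplitude passed by an idealized single resonance peak."""
    detuning = (freq_hz - center_hz) / (bandwidth_hz / 2.0)
    return 1.0 / math.sqrt(1.0 + detuning ** 2)


def amplitude_swing_db(partial_hz, jitter_fraction, center_hz, bandwidth_hz):
    """Peak-to-peak level change (dB) as the partial wanders by +/- jitter_fraction."""
    steps = 201
    gains = []
    for i in range(steps):
        offset = -jitter_fraction + 2.0 * jitter_fraction * i / (steps - 1)
        gains.append(formant_gain(partial_hz * (1.0 + offset), center_hz, bandwidth_hz))
    return 20.0 * math.log10(max(gains) / min(gains))


if __name__ == "__main__":
    on_peak = amplitude_swing_db(700.0, 0.005, 700.0, 100.0)
    on_slope = amplitude_swing_db(650.0, 0.005, 700.0, 100.0)
    print(f"partial on the peak : {on_peak:.3f} dB of added level fluctuation")
    print(f"partial on the slope: {on_slope:.3f} dB of added level fluctuation")
```

In this idealized case the same frequency wander produces a level fluctuation on the slope that is more than ten times the one produced at the peak, which is the sense of the smoothness described above.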


Whether or not a singer tunes a formant precisely to a voice partial, we recognize that her use of vibrato will have a very marked effect on the overall tone. The vibrato is of course a smoothly varying fluctuation in frequency which varies almost sinusoidally half a dozen times per second. This makes for a corresponding variation in the loudness of any partial that lies on the side of a formant peak. If the vibrato centers itself to vary equally on either side of a formant peak, the loudness drops briefly twice per cycle of the vibrato, as its excitation frequency slides down alternately on the two sides of the formant peak.
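The twice-per-cycle loudness dip can be checked with a small simulation. The Python sketch below assumes an idealized single-resonance formant (hypothetical center 700 Hz, bandwidth 100 Hz), a sinusoidal vibrato at 6 cycles per second (the "half a dozen" of the text), and a hypothetical vibrato depth of 3 percent in frequency. It counts the local minima of the transmitted amplitude during one second.

```python
import math

# Sketch: count loudness dips per second for a partial carrying vibrato.
# The formant shape, its center and bandwidth, and the vibrato depth are assumed.


def formant_gain(freq_hz, center_hz, bandwidth_hz):
    """Relative amplitude passed by an idealized single resonance peak."""
    detuning = (freq_hz - center_hz) / (bandwidth_hz / 2.0)
    return 1.0 / math.sqrt(1.0 + detuning ** 2)


def count_loudness_dips(partial_center_hz, formant_center_hz=700.0,
                        bandwidth_hz=100.0, vibrato_rate_hz=6.0,
                        vibrato_depth_fraction=0.03, duration_s=1.0,
                        samples_per_second=6000):
    """Number of local minima in the transmitted amplitude over the duration."""
    sample_count = int(duration_s * samples_per_second)
    gains = []
    for i in range(sample_count + 1):
        t = i / samples_per_second
        swing = vibrato_depth_fraction * math.sin(2.0 * math.pi * vibrato_rate_hz * t)
        freq = partial_center_hz * (1.0 + swing)
        gains.append(formant_gain(freq, formant_center_hz, bandwidth_hz))
    dips = 0
    for i in range(1, len(gains) - 1):
        if gains[i] < gains[i - 1] and gains[i] <= gains[i + 1]:
            dips += 1
    return dips


if __name__ == "__main__":
    print("centered on the peak:", count_loudness_dips(700.0), "dips per second")
    print("on the lower slope  :", count_loudness_dips(650.0), "dips per second")
```

With the vibrato centered on the peak the sketch finds twelve dips per second, two for each of the six vibrato cycles; with the partial held entirely on the lower slope there are six, one per cycle.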


19.6. Intermediate Voices and Various Musical Implications


You will perhaps be wondering by now whether the male singer tunes formants to the harmonic partials of his voice after the manner of the soprano, and you may also be curious to know whether she borrows his custom of generating a singer's formant. The answers to these questions lead us toward an understanding of the ways in which tenors and altos cope with the musical demands made on their voices, which lie acoustically in the region between the high and low voices we have been studying.


Because the male voice has formant peaks whose widths are comparable to the distance between its closely spaced harmonics (see the top part of fig. 19.5), very little change in the loudness of such a voice would be expected when formant tuning takes place. The loudness contributed by a pair of partials that straddle a formant peak is not very different from that produced when one of these lies exactly on the peak while the other one is displaced some distance down along the shoulder. To be sure, we can expect to find in the low voice a slight and rather pleasant change of tone color caused, in passing, by ordinary vowel changes and by vibrato, as discussed in earlier sections.


The soprano makes almost no use of the singer's formant that is an important resource of the male singer. We have learned that her habit of formant tuning already gives her a powerful weapon in the battle for audibility (quite aside from its important aesthetic function). Thus she has no particular reason to seek additional reinforcements. Sundberg finds in addition that the muscular requirements that must be met to produce the singer's formant are sometimes incompatible with the adjustments that many of these same muscles must make in tuning the formants.


Singers whose voices lie between the bass and the soprano are apt to borrow heavily from the techniques used by their higher- and lower-pitched neighbors. Thus the alto will frequently use the singer's formant. In the same way one gets more than a hint of formant tuning when tenors and altos use the higher parts of their registers, where the technique becomes acoustically more effective.


Most singers, throughout their musical range, constantly (though usually unconsciously) manipulate the vocal tract formants to place their frequencies at musically useful spots. These modifications in formant frequencies provide the major explanation for the difficulty we often have in understanding the words of a song. The patterns we are accustomed to use for the identification of spoken words are modified in music to meet other requirements. Often the words used in a musical setting require a high degree of understandability (for instance, in musical comedy, light opera, and lieder singing). In this type of music the singer and the composer both face an extremely difficult challenge, quite aside from the question of competition with an accompaniment, since both must constantly work toward getting the right word sounds together with the right pitches.


Before we leave the singers for a study of other musical instruments, we should notice one more feature of their tone production which is of considerable musical importance. The inherent unsteadiness of the vocal cord motion produces, as we have seen, a slight fluctuation in the amplitudes and frequencies of the various voice partials, even when there is no deliberate vibrato. It is useful to recast our description of the resulting sound by recognizing that each unsteady partial is in fact a closely spaced clump of randomly arranged steady sinusoids; the strongest members of these clumps have very nearly the nominal frequency of the partial, with weaker components being spread over a narrow surrounding region of frequency. For some voices, each of these narrow-band clumps of sound is spread across a pitch range of about 15 cents; for others it is as narrow as 5 cents. My own voice lies in the middle of this classification.


We have already learned in our study of the piano the useful consequences of having multiple clumps of partials (see sec. 17.3). For singers the same consequences are manifested, but in a broader and smoother way. The beat phenomenon (which is so pronounced between pairs of sinusoids) is very little heard between the sounds of two slightly mistuned clumps of partials. For this reason, then, slight errors of tuning between two singers produce far less clashing and roughness than would arise, for example, from similar errors in the tuning of two electric organ tones whose partials are made up of single sinusoids. Curiously enough, the slight smearing of the partials of a singer's tone does not prevent the production of audible heterodyne components (see chap. 14). As a matter of fact, the production of difference tones, as defined in the digression in section 14.4, is particularly easy to demonstrate with the help of two sopranos.


The following example will show how the natural small fluctuations of the voice affect the generation of heterodyne components. Suppose we feed two clumps of components, P and Q, to a nonlinear device such as the human ear, P being centered at 300 Hz and Q at 450 Hz. Let us assume for the sake of numerical simplicity that in both cases the smearing width of the clumps is one percent, so that in P the components are spread over a range of 3 Hz, while in Q they extend over 4.5 Hz. The simplest heterodyne components that are born of this pair of sounds are clumps which are centered at the following frequencies:

2P = 600 Hz, 2Q = 900 Hz, (P + Q) = 750 Hz, (Q - P) = 150 Hz


The extent of the smearing of the resulting partials at these various locations depends jointly on the widths of the ancestral clumps and on the details of the strengths of the partials which are distributed within them. The spread of the heterodyne clumps at 600, 900, etc., Hz might be something like the following: 4.2, 6.4, 5.4, 5.4 Hz. In every case the width of a heterodyne clump is somewhat broader than the widths of its ancestors.
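The numbers in this example are easy to reproduce. The text does not say how the widths of the ancestral clumps combine; the Python sketch below assumes that the components within each clump fluctuate independently, so that the widths add in quadrature (root sum of squares). That rule is an assumption of the sketch, although it happens to give values matching the illustrative figures quoted above. The function name and dictionary keys are likewise illustrative.

```python
import math

# Sketch: centers and widths of the simplest heterodyne clumps.
# Quadrature addition of widths is an assumption, not a rule stated in the text.


def heterodyne_clumps(p_hz=300.0, q_hz=450.0, smear_fraction=0.01):
    """Map each heterodyne clump to (center frequency in Hz, width in Hz)."""
    p_width = p_hz * smear_fraction  # 3 Hz in the example
    q_width = q_hz * smear_fraction  # 4.5 Hz in the example
    return {
        "2P": (2.0 * p_hz, math.hypot(p_width, p_width)),
        "2Q": (2.0 * q_hz, math.hypot(q_width, q_width)),
        "P+Q": (p_hz + q_hz, math.hypot(p_width, q_width)),
        "Q-P": (abs(q_hz - p_hz), math.hypot(p_width, q_width)),
    }


if __name__ == "__main__":
    for name, (center, width) in heterodyne_clumps().items():
        print(f"{name:4s} centered at {center:5.0f} Hz, width about {width:.1f} Hz")
```

Notice that the difference-tone clump at 150 Hz is as wide in hertz as the sum-tone clump at 750 Hz, so that in proportion to its own frequency it is much the most smeared of the four.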

If you refer back to our investigation in section 14.4 of the special relationships between musical sounds, it will be apparent that the broadening of spectrum components into clumps by voice instabilities by no means destroys these relationships. It does, however, remove the clearcut, all-or-nothing nature of the beat-free intervals, converting them into a sort of pastel version. This gives the composer a range between consonance and dissonance as he writes his chords, making many things musically possible that are not successful when he writes for instruments whose tones are made up of strictly sinusoidal (single-component) partials.


19.7. Examples, Experiments, and Questions


1. Close your lips around one open end of a long piece of tubing with a 20-to-25-mm diameter and sing a slowly rising glissando from your bottom note. You will find certain sharply defined pitches at which it is essentially impossible to produce any sound at all. Your vocal cords will insist on jumping to either a higher or a lower frequency of oscillation in a most unsettling and unfamiliar manner. For a pipe that is 150 cm long, a voice will act in this way at frequencies close to 90, 185, and 265 Hz (only the highest of these is likely to be reachable by a woman); if the pipe is 100 cm long, the disruptions occur near 130 and 250 Hz; for a 50 cm pipe, the effect takes place at a lowest frequency near 245 Hz.[9] You may wish to verify that, as the piece of tubing is progressively shortened, its disruptive effects become progressively weaker, and the frequencies at which they occur rise above 1000 Hz, which carries the phenomenon out of the singing range for most of us.


The upsetting effects produced by a piece of pipe on the vocal cord oscillations take place at very narrowly defined frequencies, between which nothing unusual is noticed in the "feel" of the experimenter's larynx. Since the effect disappears completely as the pipe is shortened, it was indeed correct in section 19.1 to treat the vocal cords as a normally autonomous self-oscillating system which is not itself much influenced by the varying acoustical properties of the vocal tract to which it is coupled.


2. Several experiments having to do with formants can be done with a piece of hard-walled tubing about 15 cm long with a diameter large enough (50 mm or so) to fit around your ear while you press the pipe airtight against the side of your head. With the pipe in place, listen to the rushing sound produced by its response to random noise in the room as you progressively close off the open end by sliding the flat of your hand across it. The resonances of the cavity impose on the room noise a spectrum envelope having formantlike behavior, so that you hear something like a progression of whispered vowel sounds. The lowest three formantlike frequencies associated with this cavity will be close to the following values:


Wide open end:                   520,   1560,   2600 Hz

Half the end area blocked:       412,   1357,   2425 Hz

Three-quarters blocked:          374,   1310,   2390 Hz

Nine-tenths blocked:             321,   1259,   2359 Hz

The last of these will give you a rough imitation of an [oo] sound, even though the formants do not coincide with those given in the top part of figure 19.6.
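The wide-open values in the first row can be estimated from the familiar resonances of a pipe closed at one end (here by the side of the head) and open at the other, whose frequencies are odd multiples of the speed of sound divided by four times the effective length. The Python sketch below assumes a speed of sound of 345 m/s and an open-end correction of 0.6 times the pipe radius; both values, and the function name, are assumptions made for illustration. It does not attempt to model the partly blocked cases in the remaining rows.

```python
# Sketch: lowest resonances of a pipe closed at one end and open at the other.
# The speed of sound and the end-correction factor are assumed values.


def closed_open_resonances(length_m, radius_m, speed_of_sound=345.0,
                           count=3, end_correction_factor=0.6):
    """First `count` resonance frequencies in Hz (odd quarter-wave series)."""
    effective_length = length_m + end_correction_factor * radius_m
    return [(2 * n - 1) * speed_of_sound / (4.0 * effective_length)
            for n in range(1, count + 1)]


if __name__ == "__main__":
    # Pipe from the experiment: about 15 cm long and 50 mm in diameter.
    for frequency in closed_open_resonances(0.15, 0.025):
        print(f"{frequency:7.1f} Hz")
```

Under these assumptions the estimates fall within about one percent of the 520, 1560, and 2600 Hz listed for the wide-open pipe.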


3. If you sing a vowel sound in the presence of a piano whose dampers are lifted, many of the strings will be set into vibration. When your tone ceases, these strings will be heard to give back a crude but often recognizable echo of your vowel. This phenomenon can be exploited in many ways. For instance, you could hold down only the key whose note name is the same as that of the tone you sing, on the assumption that the various string modes will respond to your sound. Why will this experiment work better if you simultaneously hold down three keys, corresponding to the pitch of the note you are singing plus the ones a semitone above and below? Numerous other combinations of selectively damped or undamped strings will suggest themselves for your experimentation.


4. Playing back various long-sustained vowels on a tape recorder at a speed greater or less than that used in recording them can make quite startling changes in what they sound like. For example, playing [ah] back at half speed turns it into [oh] despite the fact that the first-formant frequencies for these two vowels are in the ratio 0.77, while the second formant ratio is 0.7, and the higher formants have ratios close to unity. The tape recorder running at half speed of course produces a ratio of 0.5 for all frequencies. Do you expect that a double-speed playback of [oh] will necessarily give an [ah]?


5. Deep-sea divers must work under conditions in which the atmosphere they breathe is under very high pressure. To prevent "the bends," this atmosphere generally has helium gas mixed in with the oxygen that is necessary to sustain life. In such an atmosphere, the speed of sound and hence the frequencies of the voice formants are raised considerably. In contrast to this change, why would you expect only a small change in the oscillation frequency of the diver's vocal cords, and so also in the pitch of his voice? There is a considerable disruption of the intelligibility of speech when diving, caused in part by the changes listed above and in part because the production of consonants is deranged through changes in the air viscosity and density. Taking everything into account, would you expect greater disruption of speech intelligibility for men or women divers? Would you expect the diver to have trouble understanding what he hears over the telephone from his helper who is at the water's surface?
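The size of the formant shift can be estimated from the ideal-gas expression for the speed of sound, c = sqrt(gamma R T / M), since the resonance frequencies of a cavity of fixed shape scale in proportion to c. The Python sketch below is a rough illustration only. It assumes ideal-gas behavior (so that the result does not depend on pressure), a hypothetical mixture of 80 percent helium and 20 percent oxygen (real diving mixtures vary and generally contain much less oxygen at depth), air approximated as 79 percent nitrogen and 21 percent oxygen, and an unchanged vocal tract shape. The names used are illustrative.

```python
import math

# Sketch: ideal-gas speed of sound in a breathing mixture.
# Mixture fractions and the ideal-gas treatment are assumptions.

GAS_CONSTANT = 8.314462618  # J/(mol K)

# name: (molar mass in kg/mol, molar heat capacity at constant volume in units of R)
GASES = {
    "He": (4.0026e-3, 1.5),
    "O2": (31.998e-3, 2.5),
    "N2": (28.014e-3, 2.5),
}


def sound_speed(mixture, temperature_k=293.15):
    """Speed of sound in m/s for a dict of gas name -> mole fraction."""
    molar_mass = sum(fraction * GASES[gas][0] for gas, fraction in mixture.items())
    cv_over_r = sum(fraction * GASES[gas][1] for gas, fraction in mixture.items())
    gamma = (cv_over_r + 1.0) / cv_over_r  # Cp/Cv for an ideal-gas mixture
    return math.sqrt(gamma * GAS_CONSTANT * temperature_k / molar_mass)


if __name__ == "__main__":
    air = sound_speed({"N2": 0.79, "O2": 0.21})
    heliox = sound_speed({"He": 0.80, "O2": 0.20})
    print(f"air   : {air:6.1f} m/s")
    print(f"heliox: {heliox:6.1f} m/s")
    print(f"formants scale by roughly {heliox / air:.2f}")
```

For this assumed mixture every formant is raised by a factor somewhat under two, while nothing in the calculation bears on the mechanical oscillation of the vocal cords themselves.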


6. Sound spectrographs are immensely useful laboratory tools for displaying visually the changing patterns of strong and weak partials in the sounds of human speech. It is inherent in the nature of these devices that sufficient speed to follow rapidly changing sounds is attained at the expense of an ability to measure accurately the frequencies of the individual partials; a sound spectrograph shows only the general outline of the behavior of the formants.


From comic strips and television shows one sometimes gets the impression that prints generated by the sound spectrograph can be used to identify criminals in the same dependable way that is possible with fingerprints. You might find it interesting to list for yourself a few of the important aspects of human speech recognition which cannot be displayed by such a device. It turns out that the most dependable identifications are made by expert human listeners who supplement the evidence of their ears with several instruments, including the spectrograph.[10]


7. It is sometimes possible to describe the tone quality of musical instruments by telling what vowel their tone imitates (e.g., the [aw] sound attributed to the English horn). This occasionally tempts people to draw the erroneous conclusion that the spectrum of the instrument resembles that of the vowel. In the late nineteenth and early twentieth centuries, studies of human speech were generally able to uncover only the strongest formant (usually the first), which led to a particularly trivial characterization of instrumental tone color. A vivid example of the acoustical disparity between a musical sound and its vocal imitation is the "whet" sound that was attributed (in part 5 of section 17.7) to the sound of brushed-across piano strings. When one enunciates this word, the first formant starts near 300 Hz, rises steadily to about 700 Hz, and then falls to 250 Hz. The second formant meanwhile starts at 650 Hz, rises above 1000 Hz, and then dips to 900 Hz before rising fairly smoothly to 2250 Hz. The third formant has a slowly rising trend from 2500 Hz to about 3200 Hz. Meanwhile, the sound spectrum of the stroked upper strings of a piano has a fundamental component that steadily rises from about 2100 Hz at C7 to about 4200 Hz at C8, while the second harmonic covers a similar variation at double frequency, ending up at 8400 Hz. It would be interesting to know how our nervous system operates on such complexities to give us impressions of speechlike sounds when we listen to musical instruments.



1. An introduction to the speech process is to be found in the paperback book by Peter B. Denes and Elliot N. Pinson, The Speech Chain (Garden City: Doubleday Anchor Books, 1973). Another introductory paperback is that of Peter Ladefoged, Elements of Acoustic Phonetics (Chicago: University of Chicago Press, 1962).

2. J. L. Flanagan, K. Ishizaka, and K. L. Shipley, "Synthesis of Speech from a Dynamic Model of the Vocal Cords and Vocal Tract," Bell System Tech. J. 54 (1975): 485-506. See also Arend Bouhuys, ed., "Sound Production in Man," Annals of the New York Academy of Sciences 155 (1968): 1-381.

3. See, for example, James L. Flanagan, Speech Analysis, Synthesis and Perception, 2d ed. (New York: Springer-Verlag, 1972), pp. 49, 233, and 250. This book is one of the basic sources of information today about the mechanisms of human speech. See also Gunnar Fant, Acoustic Theory of Speech Production (The Hague: Mouton, 1970), p. 271. This is the other major reference book on the speech process.

4. J. L. Flanagan, "Voices of Men and Machines," J. Acoust. Soc. Am. 51 (1972): 1375-87, and James L. Flanagan, "The Synthesis of Speech," Scientific American, February 1972, pp. 48-58. The curves in figures 19.5 and 19.6 are calculated on the basis of data found in Fant, Acoustic Theory of Speech Production, pp. 109, 110, and 126. See also Flanagan, Speech Analysis, pp. 276-82.

5. Published at The Hague: Mouton, 1970.

6. J. Sundberg, "A Perceptual Function of the 'Singing Formant,'" Royal Institute of Technology (KTH), Speech Transmission Laboratory, Stockholm, Quarterly Progress and Status Report, 15 October 1972, pp. 61-63, and also Johan Sundberg, "Articulatory interpretation of the 'singing formant,'" J. Acoust. Soc. Am. 55 (1974): 838-44.

7. Teresa Stich-Randall Sings Mozart Arias (phonograph recording), Westminster WST-17046.

8. J. Sundberg, "Formant Technique in a Professional Female Singer," Acustica 32 (1975): 89-96. See also Huston Smith, Kenneth N. Stevens, and Raymond S. Tomlinson, "On an Unusual Mode of Chanting by Tibetan Lamas," J. Acoust. Soc. Am. 41 (1967): 1262-64.

9. Bertil Kagen and Wilhelm Trendelenburg, "Zur Kenntnis der Wirkung von künstlichen Ansatzrohren auf die Stimmschwingungen," Archiv für die gesamte Phonetik 1 (1937): 129-50.

10. Richard H. Bolt, Franklin S. Cooper, Edward E. David, Jr., Peter B. Denes, James M. Pickett, and Kenneth N. Stevens, "Identification of a Speaker by Speech Spectrograms," Science 166 (17 October 1969): 338-43; Michael H. L. Hecker, Speaker Recognition: An Interpretative Survey of the Literature, ASHA Monographs No. 16 (Washington, D.C.: American Speech and Hearing Association, 1971); and Harry Hollien, "Peculiar case of 'voiceprints'" (letter), J. Acoust. Soc. Am. 56 (1974): 210-13.