Pitch and Timbre: Definition, Meaning and Use


A.J.M. Houtsma





Pitch and timbre are terms frequently used in studies on sound perception. Despite the existence of formal definitions, these terms are often used ambiguously in the literature. This paper is intended as a review of the ANSI definitions and their shortcomings, of modern ways to define the concepts operationally, and of the various dependencies of pitch and timbre on physical attributes of sound. Finally, their independent functioning in speech, their mutually dependent functioning in music, and their mediating role in object recognition will be discussed.



The terms pitch and timbre refer to subjective, perceptual attributes of sound that play an important role in the perception of speech and music. Because the attributes referred to are subjective, they can be examined only by psychophysical methods and cannot be measured by direct physical means.

A first requirement for meaningful use of terms like pitch and timbre is a definition. Formal definitions do exist (American National Standards Institute 1973), but appear not to be very useful. In contemporary literature they are being replaced by definitions that are less formal and more operational. The shortcomings of the ANSI definitions and current ideas of what the terms should mean are discussed in the first section of this paper.

A second question is the dependence of each of these subjective attributes on physical attributes of sound such as frequency, intensity, spectral shape or temporal envelope. This will be reviewed in the second section of the paper.

A third issue is the interrelation between the two subjective attributes in practice. In speech, pitch contours of spoken sentences are the principal carriers of prosodic information (if we disregard the so-called tone languages, where pitch contours also convey semantic information). Timbres or timbre transitions, on the other hand, enable us to identify phonemes or phoneme clusters, resulting in understanding of what is being said. Pitch and timbre patterns each have a specific function and appear to act quite independently of one another. In music, however, pitch and timbre have a mutual relationship and dependence imposed by requirements of consonance when sounds occur together. This is discussed in the third section of the paper.

The last section deals with the cognitive issue of object identification. In speech, vowels and consonants are recognized on the basis of their formant structure, i.e., their spectral shape. In subjective terms this implies the use of timbre-like cues. In music, the sound of particular musical instruments or instrument combinations is recognized on the basis of perceived timbre. A question is whether timbre recognition is synonymous with the recognition of a sound source, i.e., a particular musical instrument, or whether timbre represents a separate perceptual space which mediates in the recognition of musical objects.




The American National Standards Institute (1973) defines loudness as "...that intensive attribute of auditory sensation in terms of which sounds may be ordered on a scale extending from soft to loud", pitch as "...that attribute of auditory sensation in terms of which sounds may be ordered on a scale extending from low to high", and timbre as "...that attribute of auditory sensation in terms of which a listener can judge that two sounds, similarly presented and having the same loudness and pitch, are different". Timbre is therefore defined in a purely negative manner as "everything that is not loudness, pitch, or spatial perception".


A review of modern psychoacoustical literature reveals that the ANSI definitions have not been particularly useful. The main reason is that they do not really provide a clear distinction between the three subjective attributes of sound. For instance, the difference between the definitions of loudness and pitch rests almost entirely on the semantic difference between the endpoints of the scales soft-loud vs. low-high. This has led to some interesting confusions in the literature. Tanner and Rivette (1964) reported that speakers of Punjabi, one of the many Indian languages, had unusually large difference limens for frequency. They had asked their subjects, following common practice, to discriminate between two tones by telling which of the two was the higher one.


Burns and Sampat (1980), who repeated the experiment, found that frequency difference limens became perfectly normal if the instructions to the subjects accounted for the fact that in the Punjabi language the same word is used to indicate that a sound is "high in pitch" or "loud". It has also been established that ratings along a scale that ranges from dull to sharp account for most of the variance of timbre ratings of complex sounds (von Bismarck 1974). The difference between a low-high and a dull-sharp scale appears to be semantically quite small and easy to confuse. Therefore, many timbre effects and phenomena may have been wrongly identified as pitch effects in the literature.


The pitch of periodically interrupted noise (Miller & Taylor 1948), the (missing fundamental) pitch of nonsimultaneous successive harmonics (Hall & Peters 1981), and binaural edge pitch (Klein & Hartmann 1981) were all identified and measured by means of discrimination or matching experiments, where only ordinal properties of the sensation play a role. The reported pitch effects could therefore, entirely consistent with the definitions and the experimental evidence, very well have been timbre effects.

Looking at the everyday use of the word pitch from a musical viewpoint, one observes it actually entails much more than the ANSI definition gives credit for. Although pitch space is a continuum, it is, at least in Western music, treated as a collection of steps, where pitch intervals correspond to certain well-defined frequency ratios. The fact that we not only can tell that one sound is higher than another, but also can compare and identify the magnitude of pitch steps, is hardly accounted for in the ANSI definition. Modern pitch studies therefore often use experimental paradigms that involve musical interval or melody identification. Their explicit or underlying operational definition is that pitch is the subjective correlate of each one of the acoustical events in a musically meaningful sequence of tones (Houtsma & Goldstein 1972; Roederer 1979).
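The correspondence between pitch steps and well-defined frequency ratios can be made concrete: in the equally tempered system an interval spans 12 times the base-2 logarithm of the frequency ratio. A minimal sketch (an illustration added here, not part of the original study):

```python
import math

def interval_in_semitones(f1, f2):
    """Return the musical interval between two frequencies,
    expressed in equal-tempered semitones (12 per octave)."""
    return 12 * math.log2(f2 / f1)

# A perfect fifth (ratio 3:2) is close to 7 semitones;
# an octave (ratio 2:1) is exactly 12.
fifth = interval_in_semitones(440.0, 660.0)
octave = interval_in_semitones(440.0, 880.0)
```

Interval identification tasks implicitly test whether listeners perceive such log-frequency distances, not merely the ordinal relation between two tones.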

Although pitch is a single perceptual attribute, it is as such not necessarily one-dimensional. Music psychologists in the past, for instance, have pointed out that in one sense two notes that are a semitone apart are closer than two notes that are separated by an octave, but that in another sense the octave notes are much more closely related and more easily confused than the semitone. Think of the typical octave errors often made in absolute pitch judgements. This has led to rather complex pitch-space representations in which the chromatic tone scale, the circle of fifths, octave circularity and other properties are all accounted for (Shepard 1982). Three such representations are shown in Fig. 1, in increasing order of complexity.


The word timbre is used almost exclusively in the psychoacoustic literature with respect to music, and is hardly found in the speech literature. In speech the perceptual entity is a phoneme (vowel or consonant), and not some arbitrary point in timbre space. In music-related studies timbre has always been treated as a multidimensional continuum in which any point is potentially meaningful. It has been established by rating and multidimensional scaling techniques that the space can be adequately described in four subjective dimensions (dull-sharp, compact-scattered, colorful-colorless and full-empty) which are linked to physical dimensions such as spectral energy distribution, amount of high-frequency energy in the attack, and degree of synchronicity of high-harmonic transients (von Bismarck 1974; Grey 1977). A modern development, somewhat analogous to what is observed in speech research, is to consider timbre in close connection with object identification, both for musical and natural, environmental sounds (Handel 1995).







The perceived pitch of a sound depends most of all on its frequency. For a pure tone this is rather unambiguous since there is only one frequency. For a complex sound, where several frequencies are involved, pitch salience depends on the degree to which partials are harmonic. For a harmonic sound the pitch depends on the frequency of the fundamental, no matter whether it is physically present or not (for a review, see Houtsma 1995). For inharmonic complex tones such as church or carillon bells, the pitch depends on the frequencies of certain partials and is usually less salient than for harmonic tones.


The pitch of a pure tone can also be influenced by changing its intensity (Terhardt 1974), duration (Doughty & Garner 1948), attack/decay envelope (Hartmann 1978), the amount of (partial) masking noise (Terhardt & Fastl 1971), and the ear to which the tone is presented (van der Brink 1970). From a viewpoint of music practice this would seem to wreak havoc in a music performance, because tones would have to be constantly compensated for these effects. Fortunately, the mentioned pitch dependency effects are mostly absent when complex tones are used, as is the case in most music.





The timbre of a sound, being itself a multidimensional attribute, depends on several physical variables. There is, in the first place, the frequency content and the spectral profile of the sound. Because the human ear has limited frequency resolution power, the spectral composition vector can often be reduced to a vector representing the amount of instantaneous acoustic power in each critical band (Plomp 1970) without much perceptual loss of information. This is limited by the fact that phase relations between spectrally unresolved harmonics have an audible influence on timbre (Goldstein 1967). Finally, the temporal envelope of an instrumental sound, including attack, decay and modulation of the steady-state portion, influences the perceived timbre to such an extent that changes in any of them can make the sound of an instrument unrecognizable (Berger 1964).
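The reduction of a line spectrum to a power-per-critical-band vector can be sketched as follows. The Bark-scale formula is the Traunmüller/Zwicker approximation; the hard band edges are a simplification of real auditory filters:

```python
import math

def bark(f):
    """Critical-band rate in Bark (Traunmueller/Zwicker approximation)."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def critical_band_power(partial_freqs, partial_powers, n_bands=24):
    """Reduce a line spectrum to a vector of acoustic power per
    critical band, in the spirit of Plomp (1970)."""
    bands = [0.0] * n_bands
    for f, p in zip(partial_freqs, partial_powers):
        z = min(int(bark(f)), n_bands - 1)  # band index of this partial
        bands[z] += p
    return bands
```

Two partials falling in the same band are merged; phase relations within a band, which Goldstein (1967) showed to be audible, are exactly what this representation discards.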


Empirical distinction

Looking at all dependencies listed above, one concludes that the sound attributes pitch and timbre have not only been defined in a rather ambiguous manner, but also depend to some extent on the same physical variables. It is therefore not surprising that there is confusion in the literature about pitch and timbre, and that effects sometimes have been misnamed.

As an illustration of how one could possibly distinguish the two attributes empirically, an experiment will be reviewed which I reported more than a decade ago (Houtsma 1984). Seven different types of sound were selected, all having been reported in the literature to evoke some kind of pitch sensation. With each of the sound types, random 4-note sequences were played by sampling notes of a diatonic scale. Subjects were asked to play the perceived sequence ('melody') back on an 8-note keyboard. All, of course, were familiar with keyboard playing and were able to play tunes 'by ear' without sheet music. The data were analyzed by computing a correlation coefficient between presented and played-back sequences. Such a coefficient is unity if all sequences are played back correctly. The coefficient remains high as long as the order (up/down movement) between presented and perceived sequences is preserved. This can be seen in Fig. 2a, where correlation scores are shown for each subject and for each of the seven different types of sound.


Fig. 2.


The second type of analysis, shown in Fig. 2b, was an actual count of the notes that were correctly played back and therefore must have been perceived correctly. A high score in this manner of counting requires not only ordinal (low vs. high), but also interval (step size) perception. By comparing both types of analysis one can distinguish those sensations that have only ordinal properties from those that have interval or ratio properties. It was found that some sounds yielded high scores with both types of analysis (stimuli 1, 2, and 3, indicating real pitch effects), some yielded high correlation scores and poor identification scores (stimuli 4, 5 and 6, suggesting timbre effects), and one yielded low scores on both counts (stimulus 7, suggesting neither pitch nor timbre sensations).




Although the attributes pitch and timbre have been defined as separate concepts, they are at least partially tied to the same physical attributes of sound. One may wonder, therefore, to what extent the subjective attributes themselves are dependent or independent of one another. It appears that this is different for speech and for music.


In speech, a pitch contour is a rather continuous melody-like pattern that is evoked by vowels and voiced consonants, and is directly related to the vibration rate of the vocal cords. It carries prosodic information, and control of intonation patterns is subject to rather strict, language-dependent rules ('t Hart, Collier & Cohen 1990). The timbre of a speech sound, although not commonly named this way in the speech literature, is different for each phoneme and depends physically on the shape of the glottal air flow pulse and the instantaneous shape and length of the vocal tract (throat, oral and nasal cavities). It seems that the two work quite independently of one another. The same sentence can be spoken in different intonation patterns, or even without any intonation (whispered), and still sound perfectly intelligible and natural. If natural speech is artificially manipulated, however, some bounds to this independence are found, as is illustrated by an example presented in the next section.


In music (including singing) there is the extra constraint, not found in speech, that sounds usually occur simultaneously and are intended to create well controlled sensations of consonance and dissonance. This imposes tight constraints on the choices of pitch steps (frequency ratios of scales) and timbres (frequency contents of instruments used). Traditional tone scales, from the Pythagorean tuning to the modern equally tempered tone system, were developed to minimize dissonant beat sensations and maximize musical flexibility (ability to play in different keys). The implicit, underlying assumption always seems to be that the musical sounds to be used are harmonic sounds. This is true for the human voice, wind instruments and bowed string instruments, but only approximately true for freely vibrating string instruments (piano, guitar) and tonal percussion instruments (xylophone, carillon bells). Today it is easy to create computer sounds with any desired degree of inharmonicity. It is important from a musical viewpoint, and possible from a computational viewpoint, to create the most appropriate tone scale, given any complex tone structure (Sethares 1993).
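The computation behind matching scale to timbre can be sketched with a sensory-dissonance model along the lines of the parameterization used by Sethares (1993); the constants below follow that parameterization and are assumptions of this sketch:

```python
import math

def pair_dissonance(f1, f2, a1=1.0, a2=1.0):
    """Sensory dissonance of two pure tones: zero at unison, maximal
    at roughly a quarter of a critical band, negligible for wide
    frequency separations."""
    fmin, df = min(f1, f2), abs(f2 - f1)
    s = 0.24 / (0.021 * fmin + 19.0)   # scales the curve with register
    return a1 * a2 * (math.exp(-3.5 * s * df) - math.exp(-5.75 * s * df))

def total_dissonance(partials1, partials2):
    """Dissonance of two complex tones: sum over all partial pairs."""
    return sum(pair_dissonance(f1, f2)
               for f1 in partials1 for f2 in partials2)
```

For harmonic tones this predicts dissonance minima at simple frequency ratios such as the octave and fifth; for inharmonic partials the minima shift, which is the computational argument for fitting the tone scale to the tone structure.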




The subject of timbre perception has in some recent publications become closely tied to the topic of auditory object recognition (Handel 1995). This is a development quite analogous to what already has occurred in the speech perception tradition. A somewhat simplistic theory would be that each musical instrument has its own unique timbre over its entire playing range. It is easy to see that such a theory leads to serious problems. No two bassoons sound exactly alike. Nevertheless, we are able to recognize each bassoon sound as such. This problem is the same as the invariance problem found in the speech literature (Stevens & Blumstein 1978). It is also well documented that spectrotemporal profiles vary quite drastically from one note to the other when going through the playing range of a musical instrument (Brown 1991). Conversely, if one synthesizes a tone scale of an instrument by giving all notes the same relative spectrotemporal profile, the instrument sounds very unnatural and is unrecognizable as such (Houtsma, Rossing & Wagenaars 1987). Therefore, if a particular constant timbre were associated with the sounds of a musical instrument, the relationship between timbre and physical sound attributes would become very loose or nonexistent. This would be very unappealing. Such a constant-timbre theory would also be unable to account for the fact that we can often hear the difference between two instruments of the same kind.


It is much more likely that object recognition occurs in steps on two different levels, as is illustrated in Fig. 3. On a perceptual level a transformation is made from the sound, represented by a point in complex physical space, to sensation, represented by a point in a complex perceptual space. This latter space can be divided into loudness, pitch and timbre subspaces. The space is continuous, and any point in this space is potentially meaningful. Any change from one point to another in this space is detectable as long as the change is large enough to exceed internal noise constraints. Object recognition occurs at a more central, cognitive level where points or contours in perceptual space are through daily experience associated with certain explicit labels such as a vowel /a/ spoken by a child, an automobile horn, or a series of notes played on a clarinet. This process is essentially identical for speech, music and environmental sounds.
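The two-level scheme can be caricatured in a few lines: a continuous perceptual vector, plus a cognitive lookup that attaches the nearest learned label. The prototype coordinates below are purely hypothetical placeholders, not measured perceptual data:

```python
# Hypothetical prototypes: points in a (loudness, pitch, timbre)
# perceptual space associated through experience with labels.
PROTOTYPES = {
    "clarinet": (0.2, 0.8, 0.1),
    "bassoon": (0.7, 0.3, 0.4),
    "vowel /a/": (0.5, 0.5, 0.9),
}

def recognize(percept):
    """Cognitive step: attach the label of the nearest stored prototype
    to a point in continuous perceptual space."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(PROTOTYPES, key=lambda label: dist2(percept, PROTOTYPES[label]))

print(recognize((0.25, 0.75, 0.15)))
```

The perceptual level stays continuous (every point is meaningful); only the second step is categorical, which is why two audibly different bassoons can still receive the same label.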


Finally, in trying to study timbre perception separately from object recognition, it may be interesting to think about the feasibility of a timbre-matching experiment. After all, loudness- and pitch-matching techniques are very common in psychoacoustics.


Fig. 3. Schematic representation of auditory stimuli at the physical, sensory and cognitive levels.




One can match the loudnesses of two sounds that differ in frequency (Fletcher & Munson 1933) or spectral content (Houtsma, Durlach & Braida 1980), and one can match the pitches of two sounds that differ in intensity (Stevens 1935) or spectral content (Schouten, Ritsma & Cardozo 1962). Would it be possible to match the timbres of two sounds across differences in intensity (loudness) and fundamental frequency (pitch)? The author is not aware of any such experiment reported in the psychoacoustical literature. It would, for practical reasons, not be an easy experiment to do for a subject because of the potentially large number of physical parameters to manipulate, with many dials to be adjusted. Nevertheless, the speech literature (e.g., Peterson & Barney 1952) shows that formant frequencies (i.e., peaks in the spectral weight function) for vowels spoken by a female voice are typically 15 percent higher than the same vowels spoken by a male voice. Intonation (pitch) patterns of female voices are typically an octave higher than those of male voices.


Our own experience with laboratory speech synthesis confirms that if one takes a vowel sound of a male voice, doubles the frequencies of all partials, and increases the formant frequencies by about 15 percent, one obtains a very good match of the vowel, now spoken by a natural-sounding female voice. This attests that timbre matching is in principle possible, although the speech example of a good phoneme match may not really imply a timbre match. Analogously, one could expect that a bassoon player, when asked to adjust spectral parameters of a high-frequency sound to match the timbre of a low-frequency bassoon note, will consistently come up with settings that correspond to a high-frequency bassoon note. This does not necessarily mean that timbres are matched.
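The male-to-female transformation described above can be sketched with a toy source-filter model: harmonics of the fundamental weighted by a spectral envelope with peaks at the formant frequencies. The Gaussian envelope and the particular formant values are illustrative assumptions, not the synthesis method actually used:

```python
import math

def formant_envelope(f, formants, bw=120.0):
    """Toy spectral envelope: a sum of Gaussian peaks at the formant
    frequencies (a stand-in for real vocal-tract resonances)."""
    return sum(math.exp(-((f - fc) / bw) ** 2) for fc in formants)

def vowel_partials(f0, formants, n=30):
    """Harmonics of f0, each weighted by the formant envelope:
    a list of (frequency, amplitude) pairs."""
    return [(k * f0, formant_envelope(k * f0, formants))
            for k in range(1, n + 1)]

# Male /a/-like vowel (illustrative formant values):
MALE_FORMANTS = [730.0, 1090.0, 2440.0]
male = vowel_partials(f0=110.0, formants=MALE_FORMANTS)

# Female version per the recipe in the text: double the fundamental,
# raise the formant frequencies by about 15 percent.
female = vowel_partials(f0=220.0,
                        formants=[f * 1.15 for f in MALE_FORMANTS])
```

Note that the two changes are independent: doubling f0 moves the harmonic comb (pitch), while scaling the formants moves the envelope peaks (timbre), which is exactly the separation the timbre-matching question exploits.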




The main conclusions from the material presented in this paper are:

1. Because of their subjective nature, the parameters pitch and timbre should never be presented as independent variables in perception studies. Doing so would amount to describing one unknown in terms of other unknowns.

2. The roles of the attributes pitch and timbre in the perception of speech, music and environmental sounds are very similar.

3. In music, any study of pros and cons of certain temperaments or tone scales should include a consideration of the spectral composition of the sounds used to realize the music.

4. Linking timbre perception too exclusively with auditory object recognition would risk repeating the history of categorical perception in speech.




References

American National Standards Institute (1973). American national psychoacoustical terminology. S3.20. New York: American Standards Association.

Berger, K.W. (1964). Some factors in the recognition of timbre. Journal of the Acoustical Society of America, 36, 1888-1891.

Bismarck, G. von (1974). Timbre of steady sounds: A factorial investigation of its verbal attributes. Acustica, 30, 146-159.

Blumstein, S.E. & Stevens, K.N. (1979). Acoustic invariance in speech production: Evidence from measurements of the spectral characteristics of stop consonants. Journal of the Acoustical Society of America, 66, 1001-1017.

Brink, G. van der (1970). Experiments on binaural diplacusis and tone perception. In R. Plomp & G.F. Smoorenburg (eds.), Frequency Analysis and Periodicity Detection in Hearing (pp. 362-374). Leiden: Sijthoff.

Brown, J.C. (1991). Calculation of a constant-Q spectral transform. Journal of the Acoustical Society of America, 89, 425-434.

Burns, E.M. & Sampat, K.S. (1980). A note on possible culture-bound effects in frequency discrimination. Journal of the Acoustical Society of America, 68, 1886-1888.

Doughty, J.M. & Garner, W.R. (1948). Pitch characteristics of short tones II: pitch as a function of duration. Journal of Experimental Psychology, 38, 478-494.

Fletcher, H. & Munson, W.A. (1933). Loudness, its definition, measurement and calculation. Journal of the Acoustical Society of America, 5, 82-108.

Goldstein, J.L. (1967). Auditory spectral filtering and monaural phase perception. Journal of the Acoustical Society of America, 41, 458-478.

Grey, J.M. (1977). Multidimensional perceptual scaling of musical timbres. Journal of the Acoustical Society of America, 61, 1270-1277.

Hall, J.W. III & Peters, R.W. (1981). Pitch of nonsimultaneous successive harmonics in quiet and noise. Journal of the Acoustical Society of America, 69, 509-513.

Handel, S. (1995). Timbre perception and auditory object identification. In B.C.J. Moore (ed.), Hearing (pp. 425-461). New York: Academic Press.

Hart, J. 't, Collier, R. & Cohen, A. (1990). A Perceptual Study of Intonation. Cambridge, UK: Cambridge University Press.


Hartmann, W.M. (1978). The effect of amplitude envelope on the pitch of sine wave tones. Journal of the Acoustical Society of America, 63, 1105-1113.

Houtsma, A.J.M. & Goldstein, J.L. (1972). The central origin of the pitch of complex tones: evidence from musical interval recognition. Journal of the Acoustical Society of America, 51, 520-529.

Houtsma, A.J.M., Durlach, N.I. & Braida, L.D. (1980). Intensity perception XI. Experimental results on the relation of intensity resolution to loudness matching. Journal of the Acoustical Society of America, 68, 807-813.

Houtsma, A.J.M. (1995). Pitch perception. In B.C.J. Moore (ed.), Hearing (pp. 267-295). New York: Academic Press.

Houtsma, A.J.M. (1984). Pitch salience of various complex sounds. Music Perception, 1, 296-307.

Houtsma, A.J.M., Rossing, T.D. & Wagenaars, W.M. (1987). Auditory Demonstrations (Compact Disc Philips 1126-061). Acoustical Society of America, Woodbury NY, USA.

Klein, M.A. & Hartmann, W.M. (1981). Binaural edge pitch. Journal of the Acoustical Society of America, 70, 51-61.

Miller, G.A. & Taylor, W.G. (1948). The perception of repeated bursts of noise. Journal of the Acoustical Society of America, 20, 171-182.

Peterson, G.E. & Barney, H.L. (1952). Control methods used in a study of vowels. Journal of the Acoustical Society of America, 24, 175-184.

Plomp, R. (1970). Timbre as a multidimensional attribute of complex tones. In R. Plomp & G.F. Smoorenburg (eds.), Frequency Analysis and Periodicity Detection in Hearing (pp. 397-414). Leiden: Sijthoff.

Roederer, J.G. (1979). Introduction to the Physics and Psychophysics of Music. Heidelberg: Springer-Verlag.

Schouten, J.F., Ritsma, R.J. & Cardozo, B.L. (1962). Pitch of the residue. Journal of the Acoustical Society of America, 34, 1418-1424.

Sethares, W.A. (1993). Local consonance and the relationships between timbre and scale. Journal of the Acoustical Society of America, 94, 1218-1228.

Shepard, R.N. (1982). Structured representations of musical pitch. In D. Deutsch (ed.), The Psychology of Music (pp. 343-390). New York: Academic Press.

Stevens, S.S. (1935). The relation of pitch to intensity. Journal of the Acoustical Society of America, 5, 150-154.

Tanner, W.P. & Rivette, C.L. (1964). Experimental study of 'tone deafness'. Journal of the Acoustical Society of America, 36, 1465-1467.

Terhardt, E. (1974). Pitch, consonance, and harmony. Journal of the Acoustical Society of America, 55, 1061-1069.

Terhardt, E. & Fastl, H. (1971). Zum Einfluß von Störtönen und Störgeräuschen auf die Tonhöhe von Sinustönen. Acustica, 25, 53-61.