Acoustics, psychoacoustics and spectral music
Daniel Pressnitzer and Stephen McAdams 1,2
(1) Institut de Recherche et de Coordination
Acoustique/Musique (IRCAM), Paris, France.
(2) Laboratoire de Psychologie Experimentale (CNRS), University
Rene Descartes, EPHE, Paris, France and IRCAM, Paris, France.
1 Introduction
The aim of this article is to examine the points at which acoustics, psychoacoustics
and what has been called "spectral
music" meet. The motivations behind this bringing together of three
more or less barbaric words are to be found in the principle of the spectral
approach itself.
The tonal system is governed by a set of harmonic
rules that embody a compromise between the will to modulate among keys, the
system of symbolically notating music, the available set of instruments,
certain laws of acoustics, as well as many other concerns. This imposing
edifice was patiently constructed by an accumulation of experience and
benefited from a slow, cultural maturation. But the bases of this edifice were
shaken in the evolution of contemporary music by recent developments in our
relation to sound: previously of a fleeting and evanescent, ungraspable nature,
sound has been captured and manipulated by way of recording technology. The
theory of signals, associated with the computational power of modern
computers, has made it possible to analyze sound, to understand its fine
structure, and to fashion it at will. The potential musical universe has thus
"exploded" in a certain sense. Sound synthesis opens truly unheard-of
perspectives, extending the act of composition to the sound material itself.
The distinctions between note, frequency, timbre, and harmony become fuzzy, or
even irrelevant, and accumulated traditional experience finds itself impotent
to organize the emerging sound world.
After such a shock, new means of formalizing and structuring needed to be defined. Rather than establish a series of arbitrary rules, the spectral intuition consisted in founding compositional systems on the structure of sound, and thus in deriving fields of musical relations from sound itself. The wager of such an approach is to give to a listener reference points that are naturally understandable, while allowing the use of the new potential offered by micro-compositional work at the level of sound. In other words, the structures sought should be latently intelligible, since the elements necessary for their comprehension are contained in the materials. If the work of the forerunners of this approach was founded only on intuition and experimentation, the will to go further, to not let oneself be enclosed by a limited number of effects or gestures, requires a more rigorous conceptualization and formalization of the fundamental ideas. A perfect understanding of acoustic phenomena is thus necessary and we will address this domain by insisting on the importance of different representations of sound. Though it is necessary, this comprehension is not sufficient: what counts in the end is certainly (at least in the logic of the spectral approach) what is perceived and understood by the listener. It is at this level that psychoacoustics, which extends and validates the reflection on purely physical structures, enters the picture.
As such, it is not by a will to "scientism"
at any cost that the spectral composers were undoubtedly drawn to interest
themselves in these disciplines, but simply because of a necessity that
proceeds from their approach. Numerous questions thus naturally come to the
fore: What in fact is a sound? What are its possible representations? What are
the interpretations made by perception to extract from a sound what is relevant
for the listener? Can we exploit them musically? Can we speak of "sound
objects" in our psychological representations? How do we think music?
Doubts can also appear, notably with respect to a fundamental aspect of music:
if tonal harmony is considered as a sort of syntax, allowing expression by
changes in tension that occur as one deviates from certain rules, it relies
undoubtedly on a strong, even implicit, cultural learning. Is it possible, in
no longer using this solid, conventional prop, to find a basis contained in the
material of sound, for the expression of tension? In an attempt to respond to
these questions, and especially to incite new ones, we will present a set of
facts concerning sound and its perception, starting with its birth in the
acoustic world.
2 The acoustic world
2.1 Representations
A vibrating body creates in the surrounding air the propagation of a pressure
wave, in the same manner that the agitation of an object on a surface of water
provokes the propagation of wavelets. This is the physical reality of sound,
the variation of acoustic pressure over time, and this reality is unique. It
can, nevertheless, be represented in different ways according to the
information that one wishes to emphasize. Take the example of a chord from the
piece Streamlines by Joshua Fineberg (1995). Its classic musical representation is the score (Fig. 1; see footnote 1). Centuries of experience allow the use of this symbolic representation for compositional purposes, but in fact it constitutes more a set of instructions to the performers than a description of the sound actually produced. For example, if the instrumentation changes, the music transcribed in the score is transformed. In Baroque music, where exact instrumentation was often not specified, and in some pieces of contemporary music, where the instrumentation is specified vaguely, as in the percussion piece Ionisation by Edgard Varèse (1933), the sound structure itself may be completely different from one rendering to the next.
Footnote 1: Classic up to a certain point, considering the presence here of quarter-tones!
Figure 1: Excerpt from the score of Streamlines by Joshua Fineberg. The excerpt was reorchestrated by the composer for the purposes of psychoacoustic experimentation. The aggregate analyzed in subsequent figures is the second one at the end of measure one.

To obtain a trace of a specific sound, it is possible
to capture it through a recording device that will convert the pressure
variations into something that can be visualized (Fig. 2). This representation of
the pressure wave reflects all of the fine-grained temporal evolution (within
the resolution limits of the visual representation). It is therefore
particularly well-adapted to manipulations of sound such as cutting and
splicing, reversal in time or repetition. In its early days, musique concrète, because of the available
technology, worked with razor blades on tapes containing the exact magnetic transcription of this temporal wave and thus used a lot
of these kinds of transformations (see footnote 2). However, this temporal image does not
translate in an obvious fashion the different pitches that it is possible to
distinguish in listening attentively to such a chord.
A representation that allows this is the one based on
Fourier's theory. This theory states that any complex signal can be decomposed
into a sum of sinusoidal waves, over an infinite time frame, by specifying
precisely their relative amplitudes and phases. It is thus possible to
decompose a complex sound into a sum of sine tones, which are called the
partials of the sound, the set of which form its spectrum (Fig. 3). This
Fourier transform represents well the same chord, but in a different form,
unveiling its frequency content.
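As a minimal sketch of the decomposition described above, the following Python computes a discrete Fourier transform directly from its definition and reads off the partials of a toy signal. The two-partial test signal, the sample rate, and the function names are illustrative assumptions; this is not the analysis that was applied to the Streamlines chord.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Amplitude of each sinusoidal component, from bin 0 up to the Nyquist bin."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        # Correlate the signal with a complex sinusoid completing k cycles per frame.
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(signal))
        # The 2/n factor recovers the amplitude of a real sinusoid
        # (it overstates bin 0 and the Nyquist bin, which are empty here).
        magnitudes.append(2 * abs(total) / n)
    return magnitudes


def strongest_partials(signal, sample_rate, count):
    """Return the `count` largest components as (frequency in Hz, amplitude) pairs."""
    magnitudes = dft_magnitudes(signal)
    bin_width = sample_rate / len(signal)
    ranked = sorted(range(len(magnitudes)), key=lambda k: magnitudes[k], reverse=True)
    return sorted((k * bin_width, magnitudes[k]) for k in ranked[:count])


# Hypothetical toy "sound": two partials at 100 Hz and 250 Hz.
SAMPLE_RATE = 1000  # samples per second (assumed value)
toy_signal = [math.sin(2 * math.pi * 100 * i / SAMPLE_RATE)
              + 0.5 * math.sin(2 * math.pi * 250 * i / SAMPLE_RATE)
              for i in range(500)]

partials = strongest_partials(toy_signal, SAMPLE_RATE, 2)
```

Both partials fall exactly on analysis bins here, so the spectrum shows two clean peaks. A recorded sound would also show noise and the smearing caused by the finite analysis duration.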
Footnote 2: One should note in passing the influence of the tools used on the end result.
[Figure 2 (upper panel) and Figure 3 (lower panel)]
Figure 3 (lower panel): Fourier transform (frequency spectrum) of the second aggregate in Fig. 1. The partials of the harmonic tones produced by the different instruments can be seen emerging from the noise as thin vertical peaks.
Figure 4 [not scanned]: Short-term Fourier transforms displayed as spectrograms. The amplitude of the components is indicated by their darkness. The time-frequency trade-off chosen in (a) uses a long analysis window. The frequencies of the different partials can thus be seen very accurately as horizontal stripes, but the fine evolution of their amplitudes is lost. On the other hand, the short window chosen in (b) better preserves the temporal evolution, but blurs the frequency representation. Different parameters, such as the overlap factor between windows, or the display formatting, can be adjusted to get a "better" picture. But which is the best picture of what we hear?
This knowledge of frequency components present in the
sound, which may under certain conditions be heard as "spectral
pitches" (Terhardt, 1974), is at the origin of the
emblematic idea of the music that is, appropriately, called
"spectral". As an example among others, Jonathan Harvey's piece Mortuos plango, vivos voco
(1980) uses the spectrum of a bell sound and its transformations as a
foundation for the harmony. If such a representation is a source of fertile
inspiration, it makes the temporal information about the sound no longer
explicit in the transform as the analysis is (theoretically) over an infinite
duration.
The true nature of a sound phenomenon as we perceive
it is double: it evolves over time, which is represented by the temporal wave,
and it also has a certain frequency content, visible
in the spectrum. The short-term Fourier transform reconciles these two types of
information. It is thus called a "time-frequency representation". The sound is sliced
up, by an analysis time window, into successive instants. The Fourier transform
is then computed for successive instants by sliding the window over the
temporal waveform bit by bit. In this way the evolution of the frequency
content of the sound is represented over time. The result of this analysis can
be presented either in the form of a spectrogram (Fig. 4), having graphical
similarities to a musical score, or in the form of a three-dimensional
perspective plot that expresses the acoustic variations in terms of frequency,
amplitude, and time (see footnote 3). This type of analysis is essential for performing additive synthesis, or its orchestral derivations. In Partiels (1975), for example, Gérard Grisey explores the sound of a trombone by assigning to different instruments the production of a given partial of the trombone spectrum analyzed with its dynamic temporal evolution. The representation as a short-term Fourier transform is quite general, but it should be noted that there is an inevitable compromise between the temporal resolution (the duration analyzed) and the frequency resolution (the analysis precision). For a high frequency precision, a long analysis duration is needed, and so there is a loss of temporal precision. There are other representations that allow an optimization of this compromise (Loughlin et al., 1993), some, such as the wavelet transform, adapting it to the frequencies analyzed (Combes et al., 1989). A given sound thus has various equivalent representations.
Footnote 3: It should be noted that it is possible to obtain spectrograms in an analogous fashion with a bank of filters. From a theoretical point of view, the two descriptions of the short-term transform (running window and filter bank) are equivalent (Allen & Rabiner, 1977).
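The sliding-window analysis and its time-frequency compromise can be sketched as follows. This is a bare-bones illustration under stated assumptions (a Hann window, a toy signal that jumps from 100 Hz to 200 Hz, hypothetical function names), not the analysis software used for the figures.

```python
import cmath
import math


def resolutions(window_length, sample_rate):
    """Return (frequency resolution in Hz, window duration in seconds)."""
    return sample_rate / window_length, window_length / sample_rate


def stft(signal, window_length, hop):
    """Magnitude spectra of successive windowed slices of the signal."""
    # A Hann window (an assumed choice) tapers each slice to limit spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / window_length)
              for i in range(window_length)]
    frames = []
    for start in range(0, len(signal) - window_length + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(window_length)]
        frames.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * i / window_length)
                    for i, x in enumerate(frame)))
            for k in range(window_length // 2 + 1)
        ])
    return frames


def peak_bin(spectrum):
    """Index of the strongest frequency bin in one frame."""
    return max(range(len(spectrum)), key=lambda k: spectrum[k])


SAMPLE_RATE = 1000  # samples per second (assumed value)
# Toy signal: one second long, 100 Hz in the first half, 200 Hz in the second.
toy_signal = [math.sin(2 * math.pi * (100 if i < 500 else 200) * i / SAMPLE_RATE)
              for i in range(1000)]

# Long window: 2 Hz frequency resolution, but each frame covers half a second.
long_frames = stft(toy_signal, window_length=500, hop=250)
# Short window: 50 ms frames, but only 20 Hz frequency resolution.
short_frames = stft(toy_signal, window_length=50, hop=25)
```

With the long window the two frequencies are located to within 2 Hz, but the moment of the change is spread over a half-second frame. With the short window the change is located to within 50 ms, but frequencies are only known to within 20 Hz.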
2.2 The idea of continuum
In the physical world, frequency, time, and intensity are considered as continuous
dimensions. Music, on the other hand, has been built on discrete scales of
pitch and duration made necessary, among many other reasons, by the desire to
notate events and by instrumental playing constraints. The different representations
we just presented, associated with sound synthesis, give us access to the
physical continua. This allows us, as
Another continuum is revealed by these mathematical
representations. What difference is there between the spectrum of a note
associated with a timbre and the spectrum of a chord considered as an element
of harmony? The answer is to be found on the computer screen: at first sight,
there isn't any! A simple note is a collection of spectral components, thus a
chord; and a chord is a collection of partials, thus a timbre. Sound synthesis
allows the organization of the note itself, introducing harmony into timbre,
and reciprocally sound analysis can introduce timbre as a generator of harmony.
This ambiguity is strikingly demonstrated in Jean-Claude Risset's
piece Mutations (1969), where the
same material is treated alternately as a harmonic chord or a gong-like timbre.
If harmony and timbre are so intimately linked, the clear-cut traditional
classification of chords between perfect consonances,
imperfect consonances and dissonances may become irrelevant. Timbre
manipulation opens up the possibility to look for a continuous scale that could
reproduce, in some respects, the expressive means associated with the tonal
notions of consonance and dissonance (see footnote 5).
Footnote 4: Literally a "slicing up of the octave", i.e. the equal-tempered scale.
Footnote 5: In the Western early polyphonic period, the scale of consonance and dissonance contained up to six different degrees. The later simplification of this scale can be paralleled with the progressive affirmation of syntactic tonal rules (Tenney, 1988).
Figure 5: Beats between pure tones. In the temporal domain, the addition of two sine waves with slightly different frequencies produces slow variations in the amplitude envelope (a). In the spectral domain, the beating tone pair is seen as two adjacent components (b).
The exploration of such a dimension has been undertaken by many composers, essentially in an intuitive fashion. Tristan Murail, for example, has ordered timbres and aggregates with a measure of inharmonicity (Désintégrations, 1982). Kaija Saariaho has defined a sound/noise axis intended to reproduce the [...] Joshua Fineberg has adopted a hierarchy founded on the pitch of virtual fundamentals (Streamlines, 1995). Is it possible that a single phenomenon is hidden behind these
different intuitive criteria? Hermann von Helmholtz
proposed an axis of reflection in drawing our attention to the attribute of
sound that he called "roughness" (von Helmholtz,
1877). Two pure tones produced simultaneously and having closely related
frequencies create amplitude fluctuations in the waveform that are called
"beats" (Fig. 5). These fluctuations, according to their rate of
beating, can give rise to a grainy quality in the sound. An example of this
rough quality can be heard, emerging from silence, at the beginning of Jour, Contre-Jour (Grisey, 1980). Helmholtz thought
he saw in roughness the acoustic basis for the dissonance of musical intervals.
Western music employs principally instruments with harmonic spectra. The
partials of their complex spectra are superimposed when an interval is played,
resulting in beats if their frequencies do not coincide perfectly. Intervals
with simple frequency ratios, such as the octave or the fifth, do have a
significant degree of harmonic coincidence and thus less beating. However,
intervals such as the tritone create a situation
where harmonics of one note beat with those of the other note (Fig. 6). Coming
back to the previous examples, an inharmonic sound
can be a source of roughness when superimposed on harmonic sounds; sounds
described as noisy can often be rough; a sound with a very low fundamental
frequency has partials that are quite close to one another which creates beating. As such, it may be that the same acoustic
feature guided the composers mentioned above in the elaboration of their
"harmonic" criteria. If this were the case, such a feature could
be used to define a new continuum related to the vast and complex notion of
musical dissonance, as some kind of an "acoustic nucleus" for it
(Mathews & Pierce, 1980).
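The beating of two pure tones, and the way harmonic coincidence reduces it, can be illustrated with a short sketch. The function names, the 200 Hz fundamental, the six partials per tone, and the 50 Hz "closeness" threshold are all arbitrary assumptions chosen for illustration; the threshold in particular is a crude stand-in, not a perceptual measure of roughness.

```python
import math


def beat_rate(f1, f2):
    """Rate in Hz of the amplitude fluctuation produced by two pure tones."""
    return abs(f1 - f2)


def two_tone_sample(f1, f2, t):
    """Sum of two unit-amplitude sine waves at time t (seconds)."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)


def carrier_times_envelope(f1, f2, t):
    """The same signal written as a mean-frequency carrier with a slow envelope."""
    envelope = 2 * math.cos(math.pi * (f1 - f2) * t)
    carrier = math.sin(math.pi * (f1 + f2) * t)
    return envelope * carrier


def close_partial_pairs(f0, ratio, n_partials=6, max_difference=50.0):
    """Pairs of partials from two harmonic tones that nearly, but not exactly, coincide."""
    pairs = []
    for i in range(1, n_partials + 1):
        for j in range(1, n_partials + 1):
            difference = abs(i * f0 - j * f0 * ratio)
            if 0 < difference < max_difference:
                pairs.append((i, j, difference))
    return pairs


# A fifth (3:2) and a tritone (45:32) above an assumed 200 Hz fundamental.
fifth_pairs = close_partial_pairs(200.0, 3 / 2)
tritone_pairs = close_partial_pairs(200.0, 45 / 32)
```

In this toy setting the fifth yields no closely spaced partial pairs, because its partials either coincide exactly or lie at least 100 Hz apart, whereas the tritone yields two pairs that differ by 37.5 Hz and 43.75 Hz and would therefore beat.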
One might be overwhelmed before the immense field of
possibilities that is thus opened.

Figure 6: Superposition of two complex harmonic tones separated by a small interval. With slightly different fundamental frequencies, beats occur between adjacent partials. If the frequency ratio were a simple one, such as 2:1 (octave) or 3:2 (fifth), the coincidence of many of the partials would minimize these beats. (Axes: frequency; the label 45/32 appears on the figure.)
The mass of data available to the composer, derived from progressively more precise and sophisticated acoustic analyses, together with the abundance of numerical output from analysis programs, at times redundant or perceptually irrelevant, can mask the salient features of musical sound. For example,
when the interval between two pure tones is large enough, they are heard
separately and without beats, resulting in no roughness whatever, even though
the beats still exist in the acoustic world as can be seen on an oscilloscope.
So what happened to them?
3 Perception
3.1 The ear: from without to within
The notion of perception is implicitly contained in the word "sound". A
sound is not just any kind of variation in acoustic pressure, but a pressure
variation that can generally be heard: our ears must be able to code its
features. This coding is conditioned by the physiology of the peripheral
auditory system. Thus, the very first given of psychoacoustics is the
definition of the realm of validity of the word "sound", in other
words the audible field (Fig. 7). An audiogram traces the average hearing
threshold: that is, the intensity necessary to just detect a pure tone of a
particular frequency. This curve simply translates our capacity to detect a
sound signal and is a poor way to characterize the auditory system. The
relation between the pressure wave and what we hear of it can only be
understood by studying certain physiological mechanisms of perception.
The air vibrations of a sound wave are transmitted and
amplified by the external and middle ears: the pinna,
the ear canal, the eardrum, the middle ear ossicles, up to the cochlea. These vibrations are
communicated in the inner ear to the basilar membrane, and the waves propagate
and are damped along this membrane. The stiffness of the membrane varies along its
length. Due to this property the higher frequencies create a maximal
displacement at its base (near the ossicles), while
lower frequencies maximally stimulate the other (apical) end.
These deformations then result in electrochemical changes in the hair cells that are
arranged along the length of the membrane. These cells in turn stimulate the fibers of the auditory nerve, along which electrochemical
impulses are sent toward the brain.
Figure 7: The audible field. This diagram represents the regions of frequency and acoustic pressure in which vibrations are heard as sound.
The essential point to understand here is that the
first thing that happens to an acoustic signal in the inner ear is some kind of
a spectral analysis. In fact, starting from the temporal wave, the basilar
membrane spatially decomposes the signal into frequency bands. The second
important point is that the hair cells, in addition to coding the frequency
position of the signal components, also preserve to a certain degree their
temporal information by producing neural firings at precise moments of the
stimulating wave they are responding to. This phenomenon, called phase-locking,
decreases with increasing frequency and eventually disappears between about
2000 and 4000 Hz, near the upper end of the range of musical pitch. The auditory
system thus performs a double coding of the sound, both spectral and temporal,
in such a way that all the cues present in both kinds of representation may be
available simultaneously in the sensory representation sent to the brain.
3.2 More than just a transmission
The coding mechanisms introduce certain phenomena that generate paradoxes and
ambiguities. Sound components called difference tones or combination tones are
a first example of the necessity to become interested in perception in addition
to physical representations.


Figure 8: The critical band. The width of the critical band depends on the center frequency being considered. After Moore and Glasberg (1983).
Two pure tones presented simultaneously to the auditory system stimulate the
basilar membrane at positions that are associated with their respective
frequencies, but also at positions corresponding to frequencies that are the
completion, toward lower frequencies, of the harmonic series. The causes and behavior of all these distortion products are not fully
understood. However, it is easy to hear the difference tone that corresponds
to the simple difference between the frequencies physically presented to the
ear. This tone is all the more audible at higher levels. It has even been used
compositionally, for example, in György Ligeti's Zehn Stücke für Bläserquintett (1968).
The phenomenon can be heard at the end of the first piece, for instance. The
difference tone belongs to the world of physiology and perception: even if it
is not present in the stimulating waveform, it is created physically in the
inner ear. It can also create auditory beats in the same way as a "real"
sound since it is mechanically present on the basilar membrane.
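The arithmetic behind these distortion products can be sketched briefly. The simple difference tone is as described above; the additional family of lower combination tones, at the lower frequency minus whole multiples of the difference, is an assumption brought in from the general psychoacoustic literature to illustrate the "completion of the harmonic series". The function names and the example frequencies are hypothetical.

```python
def difference_tone(f1, f2):
    """Simple difference tone between two primary frequencies, in Hz."""
    return abs(f2 - f1)


def lower_combination_tones(f1, f2, count=3):
    """Frequencies f_low - k * (f_high - f_low) for k = 1..count, while positive."""
    low, high = sorted((f1, f2))
    step = high - low
    tones = []
    for k in range(1, count + 1):
        frequency = low - k * step
        if frequency <= 0:
            break  # no further tones below zero frequency
        tones.append(frequency)
    return tones


# Assumed example: primaries at 1000 Hz and 1200 Hz,
# i.e. the 5th and 6th harmonics of a 200 Hz series.
simple_difference = difference_tone(1000, 1200)
lower_tones = lower_combination_tones(1000, 1200)
```

For these primaries the difference tone falls at 200 Hz and the lower combination tones at 800, 600, and 400 Hz, all members of the same 200 Hz harmonic series as the primaries. How audible each product is depends on level and on the listener, which this arithmetic does not capture.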
The frequency decomposition realized by the basilar
membrane is mechanical: the displacements of the membrane are not limited to
specific points but are spread out over a portion of it. If two components of a
complex signal are close in frequency, these displacements will overlap. There
is thus a minimal resolution, called the critical band (Fig. 8).
Figure 9: Simultaneous masking. A sine wave and a narrow band of noise are presented simultaneously. The sine wave is at a frequency just below (a) or just above (c) that of the noise band. In the first case (b), although the frequency difference is the same between the two, the sine tone is heard. In the other case (d), the noise's excitation pattern swamps the sine wave's, and the latter is not heard even though their frequency separation remained the same compared to (a).
This selectivity limit is obvious in the masking
phenomenon. Simultaneous masking is related to the overlap of excitation patterns
on the basilar membrane and to the amount of activity present in the auditory
nerve that represents each sound component present. Simultaneous masking can be
conceived as a kind of "swamping" of the neural activity due to one
sound by that of another (usually more intense) sound. For example, high level
components (noise, partial) create a level of activity that overwhelms that
created by lower-level components, which are subsequently not perceived at all
or are perceived as being of lower level than they would if presented alone.
The different components of a complex sound can also interact, mutually masking
one another, some having a sensation level that is lower than their actual
physical level would lead one to expect. Masking relations are determined
largely by the excitation pattern on the basilar membrane. This pattern is
actually asymmetrical, extending more to the high frequency side than to the
low-frequency side, and all the more so as the level of the sound increases.
Therefore, the frequency and amplitude relations between sounds will affect
their masking relations in a non-trivial way (Fig. 9). The knowledge of these
relations is nevertheless essential to understand which part of a musical
message will actually be perceived.


Figure 10: Perceived roughness of a beating tone pair. Roughness is plotted as a function of the frequency difference between the two tones, expressed as a percentage of a critical bandwidth. When the tones are further apart than 100% of the critical bandwidth, they cannot interact and no roughness is heard. Maximum roughness is perceived when the two tones are separated by approximately 25% of the critical band. After Plomp and Levelt (1965).
The critical band also influences the perception of
beats between two tones. Acoustically, the rate of the beats increases with
their frequency difference. As such, as the two pure tones are mistuned from
unison, we should hear beats that result from their interaction becoming
progressively more rapid. This is in fact what happens at the beginning of the
separation. But very soon, (after approximately 10 beats per second or a 10-Hz
frequency difference) our perception changes from a slow fluctuation in
amplitude toward an experience of more and more rapid fluctuations, that
produce roughness. Finally, if the separation becomes large with respect to the
critical band, the strength of the sensation of beating diminishes, leaving us
with the perception of two resolved pure tones. Three very different perceptual
regions can therefore arise from the same acoustical stimulus. Let us come back
for a minute to our on-going roughness example. The roughness of beating tone
pairs (measured thanks to experiments involving judgments by human listeners)
has been found to depend not on the absolute frequency difference, but rather
on the frequency difference related to the width of the critical band for a
given center frequency (Fig. 10). Roughness should
not therefore be thought of as an acoustic feature of sound; it definitely
belongs to the world of perception. This has several consequences. As the width
of the band varies (Fig. 8), a given pitch interval won't have the same
roughness in different registers. Thirds, for example, are free of roughness in
the upper register but can be quite rough in the lower one. To be able to
predict that, one needs some kind of model that could extract the relevant
features from the acoustic signal and combine them.
Figure 11: Output of a model of auditory processing derived from that of Patterson et al. (1995). The three adjacent lines around 900 Hz visible in the spectrogram of Fig. 4(a) are revealed as producing audible beats (variations in level across frequency regions), causing roughness. The time-frequency trade-off is here adapted to perception in each critical band. Ideally, what we see is what we would hear.
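The register dependence of roughness described above can be illustrated with a deliberately crude sketch. Two assumptions should be stated loudly: the bandwidth polynomial is the one commonly attributed to Moore and Glasberg (1983), with coefficients quoted from memory that should be checked against that source, and the tent-shaped roughness curve is a stylized stand-in that only reproduces the two landmarks given in Fig. 10 (maximum near 25% of the critical band, nothing beyond 100%). It is not the measured Plomp and Levelt curve, and it ignores the slow-beat region near unison.

```python
def critical_bandwidth_hz(center_hz):
    """Auditory-filter bandwidth in Hz (polynomial attributed to Moore and Glasberg, 1983)."""
    f_khz = center_hz / 1000.0
    return 6.23 * f_khz ** 2 + 93.39 * f_khz + 28.52


def stylized_roughness(fraction_of_band):
    """Tent-shaped stand-in: 0 at unison, peak of 1 at 25%, 0 from 100% upward."""
    if fraction_of_band <= 0 or fraction_of_band >= 1:
        return 0.0
    if fraction_of_band <= 0.25:
        return fraction_of_band / 0.25
    return (1 - fraction_of_band) / 0.75


def pair_roughness(f_low, f_high):
    """Stylized roughness of two pure tones, scaled by the local critical bandwidth."""
    center = (f_low + f_high) / 2
    return stylized_roughness((f_high - f_low) / critical_bandwidth_hz(center))


# The same interval, a major third (5:4), in two registers (assumed frequencies).
low_third = pair_roughness(100.0, 125.0)
high_third = pair_roughness(2000.0, 2500.0)
```

In the low register the 25 Hz separation is roughly two thirds of the local bandwidth, so the stylized curve returns a clearly non-zero value. In the high register the 500 Hz separation exceeds the bandwidth and the value is zero. Real instrumental tones have many partials, each pair of which would contribute, so this pure-tone sketch understates the effect.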
3.3 Modelling
We have seen that not everything we can hear in a sound is obvious in any of its physical representations. In the case of roughness, these representations can even be seriously misleading. Shouldn't it be possible to propose models that allow one to predict, on the basis of data obtained from psychoacoustics, which percepts would be induced by a physical stimulus? It is necessary to take certain precautions: a classic psychoacoustic study precisely characterizes a particular phenomenon by creating artificial stimuli and by analyzing the judgments of listeners within a controlled laboratory situation. Quantitative data are thus carefully obtained for each of the phenomena mentioned above and for many others as well. The relations obtained can serve as the basis for models, but in general each model describes a particular mechanism within the constraints we just mentioned. In using these models for musical purposes, it is necessary to take into account a large number of phenomena, which of course interact in a complex way. It is extremely complicated, not to say impossible, to establish a coherent ensemble from all of these sundry parts.
However, some of these phenomena are beginning to be
understood in physiological terms, from which derives the idea of modelling
the causes rather than reproducing the effects. In modelling the behavior of the human ear, all of the interactions and
distortions are taken into account in an implicit manner (to the extent that
the model captures well the properties of the auditory system). In such
physiological models (Patterson et al., 1995; Seneff,
1988), a new representation of the sound signal is in fact proposed.

This representation provides an image of what is effectively heard, within the limits
of our ability to learn to read it. The previous representations were entirely
oriented toward the mathematical description of the signal. Physiological
models are, on the contrary, adapted to our perception. The sound synthesis
process could be oriented by such representations, recentering
the work on the relevant perceptual parameters (Cosi
et al., 1994). The analysis can benefit as well: for example, roughness, hidden
in the other representations since it belongs intrinsically to the world of
sensation, is here revealed by the fluctuations within each critical band (Fig.
11). These fluctuations are a reflection of the perceived grainy quality. The
images produced by such models thus potentially characterize the evoked
sensations. The sensations are however only very basic bricks upon which the
mental representation of the acoustic world that surrounds us is organized.
4 Auditory scene organization
4.1 Auditory representations
When listening to a noisy environment, or to a piece of music, our auditory experience is usually quite different from the collection of interleaved sinusoids, with frequencies and amplitudes varying over time, that nonetheless constitute the only available information that reaches the ears. Quite to the contrary, we structure the acoustic world in terms of coherent entities that we can generally detect, separate, localize, and identify. For example, in a concert hall we hear separately the melody played by a flute soloist, the cello ensemble, a sudden percussion entry, and our neighbor sighing - in certain concerts. This capacity is quite impressive. The chaotic form of the time-frequency representations of the superposition of all these vibrating sources, that resemble the peripheral analysis we just described, is totally unintelligible to the human eye and even to the most powerful computers. Auditory organization is nonetheless of vital importance for the survival of the species, if only to be able to distinguish the flute solo from a fire alarm! This importance allows us to state that a listener will always try, whatever the situation or listening strategy, to structure the acoustic world that confronts his or her ears. The creation of a structured representation is what allows music to be more than a simple succession of percepts.
The metaphor of the auditory image intuitively incorporates the mode of
structuring that is used. An auditory image can be defined as a psychological
representation of a sound entity that reveals a certain
coherence in its acoustic behavior (McAdams, 1984).
This definition is broad enough to allow it to be employed at several levels: a
single percussion sound is one auditory image, the collection of events
composing the rapid melody played by the flute is another, all those emitted by
the cello section playing in harmony a third. From
research attempting to make sense of all the possible sound cues that the brain
uses to organize the sound world appropriately, it seems that there are two
principal modes involved in auditory image formation: perceptual fusion of
simultaneously present acoustic components and the grouping of successive
events into streams.
4.2 Vertical organization: perceptual fusion
One of the first immediate effects of vertical organization is the grouping together in a same image of the multiple partials of a complex sound spectrum, as analyzed by the ear, into coherent parts. This kind of organization allows us to hear a note played by a violin as a single note rather than a collection of harmonic partials. The main object of vertical organization is therefore at each and every instant to group what is likely to come from the same acoustic source, and to separate it from what is coming from different acoustic sources. One of the characteristics of music, as we shall see, is to constantly try to break down this simple rule. However, the cues used by the auditory system to form vertical images remain the same in musical and non-musical contexts. Understanding them is of course the key to being able to go beyond the simple equivalence between a vertical image and an acoustic source.
The position of a source in space
introduces differences between the waves received by each ear (time delays,
intensity differences) that allow a listener to localize it. A first cue of
grouping is thus made available, one image for each spatial source.
Localization can play a musical role, as in the religious antiphonal music of Gabrieli as early as 1600. Compositional writing for the
classical orchestra, with its codified disposition of the instruments,
integrates more or less consciously the cues of localization in its form. It is
thus undoubtedly to favor the formation of "vertical" auditory images that some
composers using tape or live electronics do not hesitate to spatialize
their scores. Nevertheless, when one listens to sound emitted by a single loudspeaker, and thus originating from a single physical
source, it is possible to have a more or less clear representation of
different auditory images. Imagine for a moment listening to a monophonic
recording of a wind quintet played over a loudspeaker: in general you should
have no trouble perceptually segregating the five sources. Other strategies are
thus available to the ear, based on regularities in the environment that would
have been learned through an evolutionary process (Bregman,
1990). It is highly unlikely, for example, that different partials start and
stop at exactly the same time if they do not come from the same source. The ear
tends to group together partials that start together. If several partials
evolve over time in similar fashion (this is called the common fate
regularity), they will have a high probability of coming from the same source:
as a matter of fact, a modulation of the amplitude or of the fundamental
frequency of a natural sound does affect all of its partials in a similar way.
And finally, a harmonic series will probably come from the same sound source due
to the physics of sound production in forced vibration systems such as bowed
strings and blown air columns. All these cues can be used together to form
vertical images.
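As a purely illustrative sketch of two of the cues just listed (onset synchrony and harmonicity), the following Python fragment groups a handful of partials. The `Partial` type, the 30 ms onset tolerance, the harmonicity tolerance, and the choice of the lowest partial as a candidate fundamental are all assumptions made for the example. They are not values or procedures taken from the studies cited here, and the auditory system certainly does not work in such a simple sequential fashion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partial:
    freq_hz: float   # frequency of the partial
    onset_s: float   # time at which the partial starts


def group_by_onset(partials, tolerance_s=0.03):
    """Group partials whose onsets follow each other within tolerance_s.

    tolerance_s is a hypothetical value chosen for illustration only.
    """
    groups = []
    for p in sorted(partials, key=lambda q: q.onset_s):
        if groups and p.onset_s - groups[-1][-1].onset_s <= tolerance_s:
            groups[-1].append(p)      # starts together with the previous partial
        else:
            groups.append([p])        # asynchronous onset: open a new group
    return groups


def is_harmonic(group, tolerance=0.02):
    """Check that every partial lies near an integer multiple of the lowest one.

    The deviation is measured in units of harmonic number; taking the lowest
    partial as the fundamental is a simplifying assumption.
    """
    f0 = min(p.freq_hz for p in group)
    return all(abs(p.freq_hz / f0 - round(p.freq_hz / f0)) <= tolerance
               for p in group)


# Two hypothetical sources mixed on one loudspeaker, with different onsets.
mixture = [
    Partial(440.0, 0.00), Partial(880.0, 0.00), Partial(1320.0, 0.01),
    Partial(110.0, 0.25), Partial(220.0, 0.25), Partial(330.0, 0.26),
]
groups = group_by_onset(mixture)
for g in groups:
    print([p.freq_hz for p in g], "harmonic" if is_harmonic(g) else "inharmonic")
```

In this toy mixture the two cues agree, which is the usual situation with natural sources. The musically interesting cases discussed later arise precisely when a composer sets such cues against each other.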
4.3 Horizontal organization: streams
The
second level of the auditory image metaphor enters the realm of temporal
evolution and studies the formation of auditory streams. A stream is a sequence
of events that can be considered to come from the same source. A stream constitutes
a single auditory image that is distributed over time. For example, the voice
of someone speaking or a melody played on a musical instrument possesses a
certain perceptual unity and thus forms a coherent image.
The general law governing the formation of streams
appears to be based on spectral continuity (McAdams & Bregman,
1979). This law also reflects some kind of regularity in the environment, as a
sound source tends to change its parameters progressively over time. Continuity
is evaluated according to several cues. The most studied ones were frequency
and time proximity (van Noorden, 1975). Events that
are close according to these dimensions will tend to form a separate stream,
just as melodies having a small range and rapid tempi in different registers
from one another will be segregated into different voices. There is a trade-off
between time and frequency proximity for stream formation. Actually, most
combinations of these two parameters give rise to an ambiguous perception where
streams can be voluntarily built or modified. Timbral
similarity is another grouping cue brought into play by orchestration.
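The role of frequency proximity can be caricatured in a few lines of Python. The sketch below is only a toy: each incoming tone joins the existing stream whose last tone is closest in pitch, provided the jump does not exceed a threshold, and otherwise opens a new stream. The function names and the 5-semitone threshold are hypothetical, and tempo is ignored altogether, whereas the trade-off described above implies that any realistic threshold would depend on the rate of the events.

```python
import math


def semitones(f1_hz, f2_hz):
    """Absolute pitch distance between two frequencies, in semitones."""
    return abs(12.0 * math.log2(f2_hz / f1_hz))


def assign_streams(tone_freqs_hz, max_jump_semitones=5.0):
    """Greedy assignment of successive tones to streams by frequency proximity.

    max_jump_semitones is an illustrative assumption, not a measured boundary.
    """
    streams = []  # each stream is a list of frequencies in order of arrival
    for f in tone_freqs_hz:
        candidates = [s for s in streams
                      if semitones(s[-1], f) <= max_jump_semitones]
        if candidates:
            nearest = min(candidates, key=lambda s: semitones(s[-1], f))
            nearest.append(f)
        else:
            streams.append([f])
    return streams


# Alternating tones one semitone apart, then one octave apart.
close_sequence = assign_streams([440.0, 466.16, 440.0, 466.16])
far_sequence = assign_streams([440.0, 880.0, 440.0, 880.0])
print(len(close_sequence), len(far_sequence))
```

With these inputs the narrow alternation stays in one stream while the octave alternation splits into two, a crude analogue of a single melodic line breaking into two voices.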
The formation of auditory streams has dramatic effects
on the perception of the acoustic events. Judging the timing between two
successive events is usually a trivial task; however, if the events are part of
two different streams the judgments become impossible
(Bregman & Campbell, 1971). Melody recognition
is also affected by the grouping of all the correct notes in the same stream
(Dowling & Harwood, 1986).
4.4 Emergent attributes
Vertical
and horizontal grouping mechanisms can interact in a complex manner. It is
more than likely that at some moment of an evolving auditory scene, energy in a
certain frequency region could be attributed to several different vertical
images, or to an ongoing stream. In this case, choices are made and once this
energy has been attributed to an auditory image, it is taken away from the
others. This "stealing" of energy between images can lead to the
consequence that attributes such as pitch and brightness (Bregman
& Pinker, 1978) or loudness (McAdams, Botte and
Drake, in press) or roughness (Wright & Bregman,
1987) can be altered by auditory organization. The expression "sound
object", if used regardless of its original historical context of musique concrete, can therefore be misleading.
A sound object implies a sound event possessing a basic unity, defined by some characterizable features. It is in fact the case that all
sound properties are dependent on dynamic relations with the context, within
which attributes as important as pitch, loudness, roughness, and other
dimensions of timbre can find themselves significantly changed as streams and
vertical images are formed and reorganized. There exists no indivisible,
stable sound object, but rather auditory images that possess more or less coherence
in a dynamic relation.
Each vertical auditory image, once it is formed and
only then, possesses emergent attributes. These emergent attributes are born of
the fusion of its components. An emergent attribute is different from the sum
of elementary attributes of the components contributing to the image. The
addition of spectral components higher in frequency to a pure tone can lead to
the perception of a pitch lower than that of the original pure tone, as in the case
of the missing fundamental (Schouten, 1940;
Terhardt, 1974). The recognition of a certain sound source emerges from the
fusion of many components into a global timbre, each of them taken alone often
being unable to unveil the sound origin (McAdams & Bigand,
1993). Correct auditory grouping is therefore essential to build these emergent
attributes and grasp a representation of a natural environment.
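The missing fundamental can be illustrated numerically. The sketch below is a toy demonstration, not the model of Schouten or Terhardt: it uses a plain autocorrelation as a stand-in for a periodicity-based pitch estimate, and the sample rate, duration, search range, and function names are assumptions chosen for the example. It compares a 400 Hz pure tone with the same tone to which components at 600 and 800 Hz have been added.

```python
import math

SAMPLE_RATE = 8000  # Hz, illustrative choice


def sine_mixture(freqs_hz, duration_s=0.1, sample_rate=SAMPLE_RATE):
    """Sum of equal-amplitude sinusoids at the given frequencies."""
    n = int(sample_rate * duration_s)
    return [sum(math.sin(2.0 * math.pi * f * i / sample_rate) for f in freqs_hz)
            for i in range(n)]


def autocorrelation_pitch(signal, sample_rate=SAMPLE_RATE, fmin=50.0, fmax=1000.0):
    """Return sample_rate / lag for the lag with the largest raw autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


pure_tone_pitch = autocorrelation_pitch(sine_mixture([400.0]))
complex_pitch = autocorrelation_pitch(sine_mixture([400.0, 600.0, 800.0]))
print(pure_tone_pitch, complex_pitch)
```

Under these assumptions the estimate for the three-component sound falls at 200 Hz, the common period of the components, although no energy is present at that frequency. The value belongs to the fused image as a whole and to none of its components taken alone.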
4.5 Auditory images and musical structures
The usual result of auditory organization is to build stable auditory images, each corresponding to a sound source, so that the sources can be recognized. In a musical context, the emergent attributes often come from "chimeric" sound sources. If each instrument of the symphonic orchestra were to be heard as a single source, the perception of the musical structures imagined by the composer would certainly be quite difficult - just as difficult as if all the instruments were fused into a single auditory image. Auditory organization has therefore been explored by composers for a long time. Deceiving the processes of horizontal organization allowed the writing of virtual or implied polyphony, where a monophonic instrument expresses more than one voice at the same time. Tricking the processes of vertical organization is the key to orchestration, where unheard-of and augmented timbres can be created by fusing different instruments. It can also have radical structural consequences. In his piece Lontano (1967), Ligeti creates dense structures within which, because of vertical fusion cues, the instruments cannot be distinguished. He then remarks that "polyphony is written but one hears harmony". A few years later, organization cues that induce the exact opposite consequences were to be used by the same composer to write the San Francisco Polyphony (1973-74), this time clearly heard as polyphony.
The organization of the auditory scene can even be an
argument for musical structure. A first example of
this is Mutations by Risset, where the transformation of a chord into timbre that we mentioned earlier is
done with the help of auditory organization cues. The inharmonic
structure of the chord requires the convergence of other cues such as synchrony
to be heard as a fused timbre. Another even more extreme example is the piece
Desintegrations by Tristan Murail (1982). In this work, a set of intervals is first
heard fused in a section with a rapid tempo, comprising a complex melodic line,
though it will be heard in a "disintegrated" version in a nearly
static section. This structure is expressed by using different organizational cues
(see Figs. 12, 13).
5 Listening and cognition
5.1 Memory
Of
course, the bottom-up processes that we have described so far, that took us
from acoustic vibrations to auditory images, are not the only ones to intervene
in musical listening, nor even in the formation of auditory images. A complex
set of high-level cognitive processes also comes into play. Without pretending
to describe exhaustively and precisely these processes (insofar as this is
possible), we would like in concluding to address certain cognitive aspects of
listening, stressing the ones that are associated more notably with memory.
Various notions are implicated in music cognition: we
might cite attention, cultural knowledge, and temporal organization in
perception. A common characteristic seems to tie together these kinds of
processes as they play a role in listening: memory. Memory is linked to
attending in the sense that attention seems to be predisposed to focus on
events that are expected on the basis of cultural knowledge abstracted from past
experience (Jones & Yee, 1993). Memory allows listeners to implicitly
learn the basic rules of the musical culture to which they belong (Krumhansl, 1990; Bigand, 1993).
Finally the very notion of time, so essential to music, derives from our
ability to mentally establish event sequences through the use of memory (Damasio et al., 1985).
The auditory mode of recall is remarkably powerful.
Crowder and Morton (1969) have shown, in a task requiring listeners to recall a
list presented either visually or auditorily, that
the auditory modality has a net advantage over the visual modality for the
later elements in the list. The hypothesis advanced to explain this
superiority is the existence of a sensory storage: a sort of "echoic"
memory, specific to hearing, that conserves the
stimulus trace for a brief period of time. This hypothesis has since been
refined and there are most likely several different retention intervals (Cowan,
1984). One of these intervals would be on the order of several hundreds of
milliseconds and another one on the order of several seconds. The first storage
would be related to sensation, constituting a sort of "perceptual
present", while the second one would serve as a basis for what is called
working memory. These fairly short durations raise all kinds of questions
concerning the possibility of apprehending structures extended through time
and carried by sound. The stimulus trace vanishes from echoic memory within a few
hundred milliseconds, and working memory cannot hold more than a few items.
Still, we can experience a perception of form and
meaning over time with spoken language, for instance, which is also carried by
sound. So is there a long-term storage of a facsimile of the acoustic stimuli
we receive? In fact, there is a great deal of both behavioral and neurophysiological
data that leads us to believe that there is no memory center
where this kind of storage takes place in the brain (Rosenfeld, 1988). Memory
would more likely be distributed, as a by-product of cognitive processing, in
the form of potential representations. In
other words, in the presence of a stimulus, brain activity extracts
the relevant features, makes generalizations, and forms
categories. This theory has some physiological basis. Perception activates the primary
sensory cortices in the brain, which through their activity in time detect and
encode different features of sound. These activity patterns are then projected
onto association cortex. Each neuron in association cortex receives a large
number of connections from various areas of the brain, which allows for
generalization: stimuli sharing similar features will activate the same groups
of neurons, reinforcing the connections between the concerned sensory cortices
and these neurons. When a new stimulus is perceived, if it is similar enough to
a potential representation already memorized, it will be categorized as a
member of the same family. In recall or in imagination or dreaming, or also in
forming categories, it would be the convergence zones in association cortex that then activate the
sensory cortices in the reverse direction (Damasio,
1994).
The transformation of stimuli into potential
representations, which are not reproductions but rather abstractions of
features of a stimulus, can help to interpret a study on the memory for
melodies by Crowder (1993). The recall of melodies seems first of all to be
based on the pitch contour, if recall follows shortly after learning has taken
place. However, if recall is delayed in time, the influence
of contour decreases to the benefit of pitch relations: an abstraction
has been performed from absolute pitch, an early level of perception, to relative
pitch intervals and relations within a tonal framework. It is this kind of
abstraction that seems to be stored in long-term memory.
This leads us back to what is implied in the possible
use of different continua, for only discrete scales allow a classification of
perceptual values that is apt to be coded in the brain in terms of abstract
relations. In fact, the existence of discrete degrees dividing the octave is
one of the rare constants that can be found throughout nearly all cultures
(Dowling & Harwood, 1986). Further, numerous studies on musical cognition
seem to converge toward the importance of metric and rhythmic hierarchies and
thus of the discretization of time (Lerdahl & Jackendoff, 1983;
Clarke, 1987). The question is also raised for the use of timbre, seen as a
potential carrier of structure, and its various, apparently continuous dimensions (McAdams, 1989). These considerations on memory
thus have a direct influence on musical structures, if one wants them to be
intelligible.
5.2 Arousal
After all
these considerations about acoustics, psychoacoustics and cognition, a text
about music would seem to politely avoid an uneasy point if it did not address,
even superficially, the question of arousal (or emotion) in music. This issue,
which may be considered at first glance to be beyond the field of scientific
investigation, or with one foot in the aesthetics domain, is in fact the
subject of numerous psychological studies that could find an application in the
musical domain. So, what is arousal for the cognitive psychologist?
An interesting answer is proposed by Mandler (1984). Human beings have a certain number of
schemas, some innate and directly linked to survival, others acquired and
possibly modifiable, that are linked to past experience. Perceptions are thus
evaluated in terms of expectancies. In the case of a perception that conforms
to our expectancies, one might expect that little cognitive activity is
necessary. On the other hand, a perception that goes against expectancy triggers
both an emotional reaction and cognitive processing,
the latter in order to adapt our representation of the external world to what
has been perceived. Damasio goes even further, in showing by way of numerous examples from neuropathology
that preceding arousal is not only sufficient, but also absolutely necessary
for the correct operation of many cognitive processes such as socially relevant
decision making and personal planning for the future (Damasio,
1994). If arousal occupies such an important place in our cognitive processing
(usually associated with "pure reason"), the question of the relation
between emotion and music takes on a new dimension.
Let us try to make a parallel between what we learned
from cognitive psychology and what happens in music listening. Bregman (1990) clearly states that schemas influence the
way we organize the auditory scene, by providing us with a mechanism to extract
certain things that we are seeking within that scene. Another well-established
schema is our knowledge of tonality. These schemas, and especially the latter,
will give rise to expectancies, which, being violated or not, will evoke
arousal. In the domain of musical tension the work of Bigand
(1993) has demonstrated the influence that implicit knowledge of tonality has
on music perception in both professional musicians and non-musician listeners.
Asked to judge the degree of tension of an interrupted melody, the listeners
clearly expressed expectancies linked to tonality. Bharucha and Olney
(1989) have simulated with neural nets the expectancies of Western and Indian
listeners presented with their respective musics,
and their erroneous comprehension when faced with the music of the other culture.
The arousal (the feeling of "tension") can here be seen as the
result of an enforcement of cultural rules that give rise to expectancies.
These rules, assumed to be shared by listeners, could be called external
referents. Sounds charged with an extra-musical meaning can also be considered
as playing with external referents. On the other hand, there are also reactions
to music that originate in the musical material, based on its immediate
qualities, and which generate expectancies over time. Auditory equivalents
of the Gestalt good-continuation laws (Koffka, 1935) have
been proposed to generate expectancies that can give rise to arousal if they
are violated (Meyer, 1956). Another immediate sound quality, roughness, has
been shown to play a part in tension perception in the tonal context along with
these other factors. The self-generated expectancies acquire even greater
importance if external referents are kept in the background (Pressnitzer et al., 1996). In this case, the embedded
structure is revealed by cognitive mechanisms and thus depends to a great
extent on the organization into auditory images, and thus on perception, which
in turn depends on the physical phenomena.
To finish with our roughness example, it is clear that
in tonal music, the influence of roughness is altered by the deep implicit or
explicit acculturation that we have of basic harmonic rules. (We
might, however, ask ourselves whether these rules, which actually modulate the
perceived roughness, do not in fact have an unconscious foundation based on roughness,
at least early in musical life.) However, when
the composer leaves the well-trodden path of convention, the mastery of a cue allowing
the expression of a simple and immediate tension can become primordial again.
For
example, the piece La tempesta
d'apres Giorgione by Hugues Dufourt (1977) is
constructed as a movement of latent tension going (nearly) all the way to its
own paroxysm. This piece employs wind instruments in a very low register. In
the low register the spacing between the partials of the instrument sounds is very small compared to the critical
band. Different partials thus fall into the same critical band: this induces a
perception of roughness. The same material transposed to a medium or high
register would lose its meaning to a large extent. As such, acoustics (the
beats) and psychoacoustics (perception of these beats) contribute to the
structure of the piece! Gerard Grisey can therefore
exclaim, "we are musicians and our model is
sound, not literature, sound, not mathematics, sound, not theater,
plastic arts, quantum theory, geology, astrology or acupuncture" (Grisey, 1984, p. 22, as quoted by Wilson, 1989). Spectral
music, in its search for expression through the material itself, without hidden
or conventional reference, makes possible the recourse to certain data from
acoustics and psychoacoustics not only to justify certain choices a posteriori, but also as a means of formalizing musical
processes.
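The register argument made for La tempesta can be checked with a small calculation. The sketch below counts, for a harmonic sound, how many pairs of adjacent partials are closer together than one critical bandwidth. It assumes the analytic approximation of the critical bandwidth proposed by Zwicker and Terhardt (1980), which is not the estimate used elsewhere in this article. The fundamentals of 55 Hz and 440 Hz, the choice of ten partials, and the function names are hypothetical illustrations rather than an analysis of the score.

```python
def critical_bandwidth_hz(center_hz):
    """Critical bandwidth at center_hz (Zwicker & Terhardt 1980 approximation).

    Using this particular formula is an assumption made for the example.
    """
    khz = center_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * khz ** 2) ** 0.69


def count_unresolved_pairs(f0_hz, n_partials=10):
    """Count adjacent harmonic pairs spaced by less than one critical bandwidth.

    Adjacent harmonics are always f0_hz apart; the bandwidth is evaluated
    midway between the two partials of each pair.
    """
    count = 0
    for k in range(1, n_partials):
        center = (k + 0.5) * f0_hz
        if f0_hz < critical_bandwidth_hz(center):
            count += 1
    return count


low_register = count_unresolved_pairs(55.0)    # a very low wind-instrument note
mid_register = count_unresolved_pairs(440.0)   # the same spectrum three octaves up
print(low_register, mid_register)
```

With the low fundamental every adjacent pair among the first ten partials shares a critical band, whereas after transposition only some of the upper pairs do. This is consistent with the observation that the same material would lose much of its roughness, and hence its meaning, in a higher register.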
Conclusion
Throughout
this article, we have attempted to evoke a set of facts derived from acoustics,
psychoacoustics, and cognitive psychology, that have a
certain resonance with the spectral approach. Through the example of
roughness, we showed how a part of an essential musical feature such as dissonance could be embedded in the material itself,
involving the knowledge of its acoustic ground, the study of its transformation
through perception, and how it is put into perspective through the influence of
cognitive processes involved in listening. The mutual curiosity of scientists
and musicians is not recent, but it has often had different motivations.
Concerning our example, the question of why certain intervals are consonant has
sparked interest throughout history. When Pythagoras was interested in
intervals, it took the form of a new manifestation of the omnipresence of
numbers. Later, with Kircher, divine intervention was
sought in harmonic ratios. Kepler, Galileo, Leibniz,
Euler, Helmholtz, among others, all made attempts to
propose an explanation of their own (Assayag & Cholleton, 1995).
The goal here is much more modest. To bring this goal to light we
would like to cite a musical custom in
"After
a while, the audience (the hosts) becomes very deeply moved. Some of them burst
into tears. Then, in reaction to the sorrow they have been made to feel, they
jump angrily and burn the dancers on the shoulders with the torches used to
light the ceremony. The dancers continue their performance without showing any
sign of pain. The dancing and singing with the concomitant weeping and burning
continue all night with brief rest periods between songs." (p. 128).
Evoking the name of a place to which a tragic memory
is perhaps attached (the death of family or friends, for example) has triggered
such reactions in the listeners. Obviously, the singer is not interested in
acoustics, nor even in psychoacoustics, and why would
he be, since he is leaning on external referents to trigger arousal? The
point here is that he knows very little about the referents he is using, being
a stranger to the territory of his hosts. If only the small amount of data concerning
music perception that have been presented here could
be used to avoid such embarrassing situations...
References
Allen, J. & Rabiner, L. (1977).
A unified approach to short-time Fourier analysis.
IEEE,
65:1558-1564.
Assayag, G. & Cholleton, J. P. (1995). La musique et les nombres. La Recherche, 26(278):804-810.
Bharucha, J. J.
& Olney, K. L. (1989). Tonal cognition, artificial intelligence and
neural nets.
Contemporary
Music Review,
4(1):341-356.
Bigand,
E. (1993). The influence of implicit harmony, rhythm and musical training on
the abstraction of tension-relaxation schemas in tonal music phrases. Contemporary Music Review, 9:123-137.
Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organisation of Sound. M.I.T. Press.
Bregman, A. S. & Campbell, J. (1971).
Primary auditory stream segregation and perception of order
in rapid sequences of tones. Journal of Experimental
Psychology, 89:244-249.
Bregman,
A. S. & Pinker, S. (1978). Auditory streaming and the
building of timbre. Canadian
Journal of Psychology, 32:19-31.
Clarke, E. F. (1987). Levels of
structure in the organisation of musical time. Contemporary Music Review, 2(1):211-238.
Combes, J. M., Grossmann, A., & Tchamitchian,
P. (1989). Wavelets. Springer Verlag.
Cosi, P., De Poli, G., & Lauzzana, G. (1994). Auditory modelling and self-organizing
neural network for timbre classification. Journal of New Music Research, 23:71-98.
Cowan, N. (1984). On short and long auditory stores. Psychological Bulletin, 2:341-370.
Crowder, R. G. (1993).
Auditory memory. In McAdams, S. & Bigand, E., eds., Thinking
in Sound, pages 113-140, Oxford. Clarendon Press.
Crowder, R. G. & Morton, J.
(1969). Precategorical acoustic storage. Perception
and Psychophysics, 5:365-373.
Damasio,
A. R. (1994). Descartes' Error: Emotion,
Reason, and the Human Brain. Grosset/Putnam, New York.
Damasio,
A. R., Eslinger, P. J., Damasio,
H., Van Hoesen, G., & Cornell, S. (1985). Multimodal amnesic syndrome following bilateral temporal and basal
forebrain damage. Archives of
Neurology, 42:109-133.
Dowling, W. J. & Harwood, D.
L. (1986). Music Cognition. Academic
Press, London.
Greenwood, D. D. (1961). Critical bandwidth and the
frequency coordinates of basilar membrane.
J. Acoust. Soc. Am., 33:1344-1356.
Grisey, G. (1984).
La musique : le devenir des sons.
In Darmstadter Beitrage zur Neuen Musik, volume XIX, Mainz.
Jones, M. R. & Yee, W.
(1993). Attending to auditory events: the
role of temporal organisation. In McAdams, S. & Bigand,
E., eds., Thinking in Sound, pages
69-112,
Koffka,
K. (1935). Principles of Gestalt Psychology. Harcourt, Brace, & World.
Krumhansl,
C. L. (1990). Cognitive
Foundations of Musical Pitch.
Lerdahl, F. & Jackendoff, R.
(1983). A
Generative Theory of Tonal Music.
M.I.T
Press,
Loughlin, P. J., Atlas, L. E., & Pitton,
J. W. (1993). Advanced time-frequency
representations for speech processing. In Cooke, M.,
Beet, S., & Crawford, M., eds., Visual
Representations of Speech Signals, pages 27-53, New York. John Wiley and Sons.
Mandler, G.
(1984). Mind and Body.
Mathews, M. V. & Pierce, J.
R. (1980). Harmony and nonharmonic partials. J. Acoust. Soc. Am., 68(5):1252-1257.
McAdams, S. (1984). Spectral fusion, spectral parsing and the formation of auditory images.
Doctoral dissertation, Stanford University, Stanford, California.
McAdams, S. (1989). Psychological
constraints on form bearing dimensions in music. Contemporary Music Review, 4:181-198.
von Helmholtz,
H. L. F. (1877). On the
Sensations of Tone as the Physiological Basis for the Theory of Music. 2nd. Ed. trans. A. J.
Ellis (1885), from German 4th Ed., Dover, New York (1954).
Wilson,
P. N. (1989). Vers une ecologie des sons: Partiels de Gerard
Grisey et l'esthetique du groupe de l'Itineraire.
Entretemps, 8:55-81.
Wright, J. K. & Bregman,
A. (1987). Auditory stream
segregation and the control of dissonance in polyphonic music. Contemporary Music Review, 2(1):63-92.