CONTENTS
1. Auditory Grouping Phenomena
1.1. Parsing of Sounds of Complex Spectral Composition
1.1.1. Harmonicity of Spectral Components
1.1.2. Time-Variant Relationships
1.1.3. Familiarity
1.2. Grouping of Sound Sequences in Space
1.2.1. Auditory Illusory Conjunctions
1.2.2. The Scale Illusion
1.2.3. Grouping of Nonsimultaneous Sound Sequences
1.2.4. The Hypothesis of a Slow Switching Mechanism
1.2.5. The Octave Illusion
1.2.6. Grouping of Phase-Shifted Tones
1.3. Grouping of Rapid Sound Sequences
1.3.1. Grouping by Frequency
1.3.2. Grouping by Frequency Proximity
1.3.3. Temporal Coherence as a Function of Frequency Proximity and Tempo
1.3.4. Grouping by Frequency Proximity in Relation to Repetition
1.3.5. Frequency Proximity and the Perception of Temporal Relationships
1.3.6. Grouping by Good Continuation
1.3.7. Grouping by Sound Quality
1.3.8. Grouping by Amplitude
1.3.9. Grouping by Temporal Position
1.3.10. Grouping by Spatial Location
1.3.11. Closure: The Auditory Continuity Effect
1.4. Grouping and Selective Attention
1.4.1. Voluntary and Involuntary Grouping
1.4.2. Consequences of Attention Focusing
2. Shape Analysis for Pitch Structures
2.1. Auditory Shape Analysis as a Multileveled Process
2.2. Passive Versus Active Processing
2.3. Feature Abstraction
2.3.1. Octave Equivalence
2.3.2. Interval and Chord Equivalence
2.3.3. Categorical Perception of Musical Intervals
2.3.4. Global Cues
2.3.5. Interval Class
2.4. Higher-Order Abstractions
2.5. Hierarchical Encoding of Pitch Sequences
2.6. The Influence of Short-Term Memory on Perception of Pitch Patterns
2.6.1. Interference Effects in Short-Term Memory for Pitch
2.6.2. Facilitation Through Repetition in Short-Term Memory for Pitch
2.6.3. The Influence of Relational Context on Pitch Comparison Judgments
2.7. Contour as a Cue in Recognition of Pitch Patterns
2.8. Scale and Key Structure in Recognition of Pitch Patterns
2.9. Memory for Hierarchically Organized Pitch Patterns
3. Analysis of Timbre
3.1. Timbre and Fourier Analysis
3.2. Investigation of Timbre by Analysis and Synthesis
3.3. Multidimensional Models of Timbre
3.4. Role of Context in Timbre Perception
4. Perception of Temporal Relationships
4.1. Perception of Temporal Order
4.1.1. Modes of Order Perception
4.1.2. Perception of the Order of Two Events
4.1.3. Perception of the Order of Three or More Events
4.1.4. Order Perception in Continuously Cycling Sound Patterns
4.1.5. Theories of Order Perception
4.2. Perception of Rhythm
4.2.1. Subjective Rhythmic Grouping
4.2.2. Grouping by Temporal Proximity
4.2.3. Grouping by Accent
4.2.4. Grouping by Other Principles
4.2.5. The Run Principle and the Gap Principle
4.2.6. Rhythmic Hierarchies
5. Summary
Notes
References
Research
on hearing has traditionally been concerned with simple detection,
discrimination, and scaling tasks. However, the last decade has seen a
flowering of interest in higher-level mechanisms concerned with auditory
grouping, shape perception, memory, and so on. This new development has been
due largely to technological advances that have enabled researchers to generate
complex auditory stimuli with precision and flexibility. Those entering the
field have been rewarded by the discovery of an elaborately structured and
highly differentiated system that possesses some remarkable properties.
Two
major influences on research into auditory pattern recognition may be
identified. The first stems from related work in perceptual and cognitive
psychology. For example, the multileveled
approach to auditory shape perception has been strongly motivated by
theoretical and experimental work on the perception
of visual shape. As another example, research into memory for sound structures
has been influenced by findings on memory for verbal materials.
A
second major influence derives from music theory. Fundamental concepts such as
octave equivalence and interval equivalence have been in the mainstream of
traditional music theory since the time of Pythagoras. Several developments in
contemporary music theory have also provided input. For example, the theory of
12-tone composition, developed by Schoenberg, is based on an implied theory of
shape analysis for pitch structures. Another example is the hierarchical theory
of tonal music, developed early in this century by Schenker, which has points
of similarity with the theory of transformational grammar developed later by
Chomsky. In addition, composers of electronic and computer music have provided
the major impetus to recent experimental work on the perception of sound
quality or timbre, an area of research with broad implications for auditory
perception in general.
This
chapter is divided into four main sections. In the first, auditory grouping
phenomena are investigated. This section deals with questions concerning the
perceptual fusion and separation of components of a
complex sound spectrum, the grouping of sound elements emanating from different
spatial locations, and the grouping of sounds that occur in rapid succession. The
second section is concerned with the perception and recognition of patterns
formed of pitch combinations. The third section deals with the perception of
timbre or sound quality. The fourth section is concerned with the perception of
temporal order and of rhythm. The final section summarizes the findings in
these different subfields.
1. AUDITORY GROUPING PHENOMENA
We may
distinguish two basic but interrelated questions in considering how the
auditory system groups stimuli into perceptual configurations. The first
involves the stimulus dimensions along which grouping principles operate. When
presented with a complex signal, the auditory system may group elements
according to some rule based on frequency, on amplitude, on temporal or spatial
position, or on some multidimensional attribute such as timbre. As will be
shown, any of these attributes may serve as a basis for grouping, and further,
there are complex and rigid rules determining which attribute will be used. Such
rules can often be well interpreted in terms of strategies that are most likely
to lead to the correct conclusions in interpreting our auditory environment. Second,
we may enquire into the principles that govern grouping along any given
dimension. The Gestalt psychologists proposed that we form groupings on the
basis of certain simple principles, such as proximity, good continuation,
similarity, and common fate (Wertheimer, 1923). As described elsewhere in this Handbook,
these have been shown to be important descriptive principles for grouping
in vision. We shall show here that this is true for hearing also. It may
plausibly be argued that grouping in conformity with such principles enables us
to interpret our environment most effectively (Bregman, 1978; D. Deutsch,
1975c; Gregory, 1970; Hochberg, 1974; Sutherland, 1973). Sounds that are
similar are likely to be coming from the same source, and sounds that are
dissimilar are likely to be coming from different sources. A sequence of sounds
is more likely to be coming from a single source if it contains frequency
transitions that are gradual rather than abrupt. Components of a sound spectrum
that modulate in synchrony are more likely to be coming from a single source
than those that modulate out of synchrony.
The view of auditory grouping as a process of unconscious inference
may be traced to Helmholtz (1859/1954) (see Note 1). He speculated how, given
the complex, time-variant spectrum produced by several musical instruments
playing simultaneously, the listener reconstructs the auditory environment so
that some components of the spectrum fuse perceptually to produce the
impression of a single sound, while others are heard as separate melodic lines
sounding in parallel. He wrote:
Now there are many circumstances which assist us
first in separating the musical tones arising from different sources, and
secondly, in
keeping together the partial tones of each separate source. Thus when one
musical tone is heard for some time before being joined by the second, and then
the second continues after the first has ceased, the separation in sound is
facilitated by the succession of time. We have already heard the first musical
tone by itself and hence know immediately what we have to deduct from the compound effect for the effect of
this first tone. Even when several parts proceed in the same rhythm in
polyphonic music, the mode in which the tones of different instruments and voices
commence, the nature of their increase in force, the certainty with which they
are held and the manner in which they die
off, are generally slightly
different for each ... but besides all this, in good part music,
especial care is taken to facilitate the
separation of the parts by the ear. In polyphonic music proper, where
each part has its own distinct melody, a principal means of clearly separating
the progression of each part has always consisted in making them proceed in
different rhythms and on different divisions of the bars.... All these helps fail in the resolution of musical tones into their constituent partials. When a
compound tone commences to sound, all its partial tones commence with the same
comparative strength; when it swells, all of them generally swell
uniformly; when it ceases, all cease simultaneously.
Hence no opportunity is generally given for hearing them separately and
independently. (pp. 59-60)
1.1. Parsing of Sounds of Complex Spectral Composition
A
basic task for auditory theory is to determine the relationships between
elements of an ongoing sound spectrum that give rise to the perception of a
single sound and those that give rise to the perception of several simultaneous
sounds. Without these processes of fusion and separation, intelligible
listening would not be possible. Presumably mechanisms have evolved that cause
us to fuse together those elements of the sound spectrum that are likely to be
coming from the same source and to separate out those elements that are likely
to be coming from different sources. Three factors will be considered here. The
first is harmonicity of spectral components; the second is synchronicity; the
third is familiarity with certain sound complexes.
1.1.1. Harmonicity of Spectral Components.
It has been
argued from various lines of evidence that harmonic sounds are more likely to
be perceived as fused than are nonharmonic sounds (see Note 2). Stringed and
blown instruments have partials that are harmonic or nearly harmonic, and such
partials unite to produce the impression of a single tone. In contrast, bells
and gongs have partials that are nonharmonic, and these produce more diffuse
sound impressions (Mathews & Pierce, 1980). De Boer (1976) has shown that
harmonic complexes tend to produce unitary and unequivocal pitch sensations,
whereas certain types of nonharmonic complex do not merge, but instead produce
multiple pitch sensations. Since most forced vibration systems such as the
voice have partials whose frequencies are harmonic or close to harmonic, such
findings are as expected on the hypothesis that our auditory system has evolved
to interpret sound patterns in terms of the sources from which they emanate.
We may
next enquire whether the phase relations between the partials of a tone affect
the fusion of its image. This question was investigated by Kubovy (Note 3). He
created a set of harmonically related sinusoids, all of equal amplitude, and
all beginning with a positive zero-crossing and
therefore having a common zero-crossing at the frequency of the fundamental. One
of these sinusoids was then moved out of phase for a few hundred milliseconds. It
was then moved back into phase, while another was moved out, and so on. A
perceptual segregation was produced by these means, so that a melody was heard
that corresponded to the out-of-phase sinusoids.
Later,
Kubovy and Jordan (1979) constructed stimuli consisting of the third to
fourteenth harmonics of a 200-Hz fundamental, which were played in the sine
phase. At intervals of roughly 300 msec, the phases of all components but one
were reset to 0°, and the phase of the
remaining component was set to a different phase angle. The out-of-phase
components formed a scale that either ascended or descended, and subjects
judged the direction of this scale. The results are shown in Figure 32.1. It
can be seen that for phase shifts greater than 40° subjects showed
near-perfect identification of scale direction. These experiments therefore
demonstrate the perceptual effect of phase relationships on the fusion of
single tones composed of harmonically related complexes: Phase shifting a
component of the complex results in its perceptual segregation.
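
The stimulus construction described above can be sketched as follows. This is only an illustrative sketch, not the authors' synthesis procedure: the function name, the sample rate, the segment duration, and the amplitude normalization are assumptions introduced here. It sums the third to fourteenth harmonics of a 200-Hz fundamental in sine phase and optionally offsets the phase of one component.

```python
import math

def harmonic_complex(f0=200.0, harmonics=range(3, 15), shifted=None,
                     shift_deg=0.0, duration=0.3, sample_rate=16000):
    """Sum equal-amplitude, sine-phase harmonics of f0.

    If `shifted` names one harmonic number, that component alone is
    given a phase offset of `shift_deg` degrees (hypothetical parameter names).
    """
    n_samples = int(round(duration * sample_rate))
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        value = 0.0
        for h in harmonics:
            # All components start at a positive zero-crossing (phase 0)
            # except the single out-of-phase target.
            phase = math.radians(shift_deg) if h == shifted else 0.0
            value += math.sin(2 * math.pi * h * f0 * t + phase)
        # Normalize by the number of components (an arbitrary scaling choice).
        samples.append(value / len(harmonics))
    return samples

in_phase = harmonic_complex()
target_out = harmonic_complex(shifted=5, shift_deg=90.0)
```

With every component in sine phase, the waveform passes through zero once per period of the fundamental; shifting one component removes that common zero-crossing for the target alone.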
Tones
whose fundamental frequencies are related by simple ratios fuse more readily
than tones that are not so related. In a demonstration of this phenomenon,
Rasch (1978) presented two chords in succession. The lower tones of each chord
were identical, and the higher tones formed a sequence that either ascended or
descended. The subjects' task was to judge whether the higher tones formed a
"low-high" sequence or a "high-low" sequence. Detection
thresholds were taken as the measure of the extent to which the subjects could
separate out the component tones of each chord. The lower tones all had a
fundamental frequency of 250 Hz. The higher tones had fundamental frequencies
that either were 500 and 750 Hz or deviated slightly from these values.
The results of the experiment are shown in Figure 32.2. It can be seen that, as the relationships formed by the fundamental frequencies of the higher and lower tones deviated from simple ratios, detection performance gradually improved, indicating a decreased tendency to fuse together the higher and lower components of the chords.

Figure 32.1. Percentage of correct
identification of phase-shifted target tones as a function of phase shift in
degrees. The stimuli consisted of the third to fourteenth harmonics of a
200-Hz fundamental, which were played in the sine phase. At intervals of around
300 msec, the phases of all components except one were reset to 0°, and the phase of the remaining component was set to a
different phase angle. The out-of-phase components formed a scale that either
ascended or descended, and subjects identified the direction of the scale. Near-perfect
identification was shown for phase shifts greater than 40 deg. (From M. Kubovy
& R.

Figure 32.2.
Detection thresholds for higher tones in the presence of
lower tones. Two chords were presented in succession. The lower tones of
the chords were both at 250 Hz, and the higher tones formed either a
"low-high" sequence or a "high-low"
sequence. Either higher tones were at 500 and 750 Hz, or they deviated
slightly from these values. Subjects judged whether a "low-high"
sequence or a "high-low" sequence had been presented. Detection
thresholds fell gradually with increasing deviation from the 500-Hz and 750-Hz
values, in roughly symmetrical fashion.
1.1.2. Time-Variant Relationships. One factor that
may be hypothesized to contribute to the impression of a single fused sound is
coordinated modulation in the steady state. In forced vibration systems, any
perturbation of the driving force will result in perturbations of components of
the spectrum that are proportional to their frequencies. Thus a complex of sinusoids that is modulated
in correlation is likely to be emanating from a single source. McNabb and
Chowning (quoted by McAdams, 1982) have demonstrated informally that a harmonic
tone complex with a spectral power distribution conforming to that of a vowel
produces only a weak vocal sensation, and only weak perceptual
fusion. However,
if a small amount of frequency modulation is superimposed on all the spectral
components simultaneously, they sound strongly fused. Similar observations
have been reported informally by McAdams (1982).
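
The notion of coordinated modulation can be made concrete with a short sketch. The fundamental, modulation rate, and modulation depth below are hypothetical values chosen only for illustration; the text specifies only "a small amount" of frequency modulation. A single modulation function is applied to all partials, so each partial is displaced by an amount proportional to its own frequency and the harmonic ratios are preserved at every instant.

```python
import math

def partial_frequencies(t, f0=200.0, n_partials=5, mod_rate=6.0, mod_depth=0.01):
    """Instantaneous frequencies (Hz) of harmonic partials at time t seconds.

    All partials share one modulation function (assumed sinusoidal here).
    """
    # One common factor perturbs the source as a whole...
    factor = 1.0 + mod_depth * math.sin(2 * math.pi * mod_rate * t)
    # ...so partial k moves by k times the shift of the fundamental.
    return [k * f0 * factor for k in range(1, n_partials + 1)]

peak = partial_frequencies(1.0 / 24.0)  # quarter of a 6-Hz modulation cycle
```

At the modulation peak in this example, the fundamental rises by about 2 Hz and the third partial by about 6 Hz, while the ratio between them remains 3:1.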
By the same token, if we hear a complex of sinusoids
with uncorrelated modulation functions, the likelihood is that the components
of the complex are emanating from different sources. McAdams (1982) reports an
informal experiment employing a complex stimulus in which a transition was made
from perfectly correlated to two uncorrelated frequency modulation functions. For
harmonic tone complexes, the listener's percept shifted from a single fused
image to two distinct images. The effect was uncertain for inharmonic tone
complexes.
A related finding was obtained by Rasch (1978),
using the sequence detection task described in Section 1.1.1. He showed that,
when the higher tones of the chords were frequency modulated while the lower
tones remained unmodulated, detection of whether the chords formed a
"low-high" sequence or a "high-low" sequence was enhanced,
so that uncorrelated modulation resulted in decreased fusion of the
simultaneously presented tones.
How does onset asynchrony of two simultaneous tones
affect perceptual fusion? Rasch (1978) used the same detection task to study
the effect of delaying the lower tones of the chords relative to the higher
tones. As shown in Figure 32.3, detection performance was strongly influenced
by this manipulation. Each 10 msec of delay was associated with roughly a 10-dB
downward shift of threshold. For a delay of 30 msec, threshold for perception
of the high tone was close to that for the high tone presented alone.
Rasch further noted that the phenomenological
effect of asynchrony was
very strong. Whereas in the synchronous conditions a single "sound object" was perceived, in the asynchronous conditions
the two tones stood apart very clearly. However, the onsets of the two tones
were not separately audible, so that they were perceived as two separate but
simultaneous sounds. This is an example of the continuity effect (see Section 1.3.11).
A related finding was obtained by Bregman and Pinker
(1978). These authors presented a two-tone complex in alternation with a third
tone and introduced various conditions of onset-offset asynchrony between the
simultaneous tones in the complex. As the degree of asynchrony increased, the
likelihood also increased that one of the simultaneous tones would form a
melodic stream with the third tone. Bregman and Pinker argued that the
asynchrony of the simultaneous tones resulted in a decreased tendency for these
tones to be treated as coming from the same source and so facilitated a sequential
organization by frequency proximity between one of these simultaneous tones and
the alternating tone.

Figure 32.3.
Detection thresholds for higher tones in the presence
of lower tones. The paradigm used was as described in Figure 32.2. The lower tones
were at 250 Hz,
and the higher tones were at 500 Hz and 750 Hz. Either the higher tones ended
simultaneously with the lower tones (solid line), or they ended immediately
following onset of the lower tones (dashed line). Thresholds were virtually
unaffected by amount of overlap but were strongly affected by delay of the
lower tones. Each 10 msec of delay produced roughly a 10-dB downward shift in
threshold. (From R. A. Rasch, The perception of simultaneous notes such as in
polyphonic music, Acustica, 1978, 40. Reprinted with permission.)
Dannenbring
and Bregman (1978) investigated the effects of several variables on the
tendency of one component of a complex tone either to fuse with the other
components or alternatively to be pulled out into a different melodic stream. The
stimuli consisted of a complex of three pure tones (at 500, 1000, and 2000 Hz)
that alternated repeatedly with a single "captor" pure tone (at 500,
1000, or 2000 Hz). The amplitudes of the components of the complex tone either
were equal or increased or decreased with frequency. The amplitude of the
"captor" tone was always equal to that of the "target"
component of the frequency with which it alternated. The relative onsets and
offsets of the components of the complex tone were also varied. Subjects judged
the repetition rate of the captor tone. If this rate was judged to be slow, the
components of the complex tone were considered to be fused into a single unit. However,
if this rate was judged to be fast, the target component of the complex tone
was considered to have been pulled into the same stream as the captor.
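
The logic of this rate judgment can be illustrated with a small sketch. The cycle duration and the assumption of strictly isochronous alternation are hypothetical; the text does not give the timing values used. If the target is captured, the stream at the captor frequency contains two tones per cycle (captor and target) rather than one, so its repetition rate doubles.

```python
def captor_stream_rate(cycle_ms, target_captured):
    """Tones per second heard in the stream at the captor frequency.

    `cycle_ms` is the duration of one complex-plus-captor cycle
    (a hypothetical parameter; isochronous alternation is assumed).
    """
    # Fused complex: only the captor itself recurs at this frequency.
    # Captured target: captor and target alternate within the same stream.
    tones_per_cycle = 2 if target_captured else 1
    return tones_per_cycle * 1000.0 / cycle_ms

slow = captor_stream_rate(500.0, target_captured=False)
fast = captor_stream_rate(500.0, target_captured=True)
```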
Various
findings emerged from this study. First, the tendency for the formation of
melodic streams was found to be greater when the repeating tone was at 500 Hz
than when the tone was at one of the other two frequencies. Second, the
tendency to fusion was greatest for tones in which the relative amplitudes of
the components decreased with frequency, a situation most like that commonly
encountered in the natural environment. Third, when the target components led
the other components of the complex tone at onset, there was an increased
tendency to produce melodic streams. This was also true when the target
component lagged the other components at offset. However, when the target
component lagged the others at onset or led them at offset, no such effect
occurred.
The
effects of fusion and separation of two gliding tones were studied by Steiger
and Bregman (1981). Here the tones glided in parallel on a log frequency scale,
and the glides were repeatedly presented in alternation with a pure tone
"captor" glide. Subjects judged whether the stimulus was
"fused" (i.e., whether the sequence appeared as an isochronous
alternation of a pure tone with a rich tone) or "decomposed" (i.e.,
whether the sequence appeared to contain three tones in each cycle). The
tendency for the stimulus to be judged as decomposed was enhanced when the
captor and target glides were in the same frequency range, and also when the
captor and target glides had the same orientation.
A
sudden change in the amplitude of a component of a tone complex can cause this
component to stand out perceptually. This was demonstrated by Kubovy (Note 4).
He presented subjects with an eight-tone chord whose components were successively
turned off abruptly for 80 msec and then restored to full amplitude. This
manipulation occurred at a rate of three per second. The subjects perceived a
melody that corresponded to the order in which the tones were subjected to this
momentary amplitude disparity. For this pitch segregation effect to occur, it
was necessary that the frequency spacing between successive tones be greater
than the critical band.
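
A minimal sketch of the amplitude manipulation follows. The 80-msec gap and the rate of three per second are taken from the description above; the function name, the representation of the melody as a list of component indices, and the cycling through that list are assumptions made for illustration.

```python
def component_amplitudes(t, order, n_components=8, gap=0.080, rate=3.0):
    """Amplitude (1.0 = full, 0.0 = off) of each chord component at time t seconds.

    `order` lists the component indices to be silenced in turn (the "melody").
    """
    period = 1.0 / rate                  # one gap event every 1/3 second
    slot = int(t // period)              # index of the current gap event
    within = t - slot * period           # time elapsed since that event began
    amps = [1.0] * n_components
    if within < gap:
        # The selected component is switched off abruptly for 80 msec.
        amps[order[slot % len(order)]] = 0.0
    return amps

melody = [0, 3, 1]
during_first_gap = component_amplitudes(0.01, melody)
between_gaps = component_amplitudes(0.10, melody)
```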
1.1.3. Familiarity. Sounds with familiar spectral shapes, such as human voices and
musical instrument tones, appear to fuse more readily than sounds with
unfamiliar spectral shapes. Informal observations show that the percept of a
particular vowel is lost when its spectral envelope is shifted slightly in
frequency, even though the relative amplitudes are preserved. Other factors such
as the relative growth and decay of individual partials also appear to
contribute to familiarity. Unfortunately no quantitative data on the issue are
available at present.
1.2. Grouping of Sound Sequences in Space
A
useful technique for studying grouping phenomena in hearing is to present two
different pitch sequences in parallel, one to the left of the listener, and the
other to the right. In most experiments, stimuli have been presented
dichotically via headphones; however, in some experiments stimuli have been
presented via spatially separated loudspeakers. This technique enables
different stimulus dimensions to be set in opposition to each other as bases
for grouping. Thus, for example, grouping by frequency or by amplitude may be
opposed to grouping by spatial location. At the same time, different principles
governing grouping along any given dimension may be set in opposition to each
other. For example, grouping by proximity may be opposed to grouping by good
continuation. This section describes findings obtained with this technique and
discusses their theoretical implications.
1.2.1. Auditory Illusory Conjunctions. When two
sequences of tones emanate simultaneously from different regions of space, and
the onsets and offsets of these tones are synchronous, striking perceptual
illusions are generally produced. We may characterize a tonal stimulus as a
bundle of attribute values, that is, as having a pitch, a location, a loudness, and a timbre. In the situation just outlined,
these bundles of attribute values fragment and recombine, so that illusory
conjunctions result. (See also Treisman, Chapter 35.)
This anomalous recombination suggests that all auditory stimuli are at some
stage in the processing system fragmented into their separate attributes and
that this process of fragmentation is followed by a process of perceptual
synthesis in which the different attribute values are recombined. Under most
circumstances the stimuli are reconstructed correctly; however, we should not
assume that this necessarily occurs.
Striking
individual differences are manifest in the types of illusion that are produced
in this situation. Further, these differences correlate strongly with
handedness and may be related to patterns of cerebral dominance. This implies
that they have an innate basis.
1.2.2. The Scale Illusion. One example of the
creation of strong illusory conjunctions is provided in the scale illusion (D.
Deutsch, 1975c, 1975e). The configuration that produced the illusion is
illustrated in Figure 32.4(a). It can be seen that this consisted of a major
scale (see Note 5), which was presented simultaneously in both ascending and
descending form. When a tone from the ascending scale was delivered to one ear,
a tone from the descending scale was simultaneously delivered to the other ear,
and successive tones in each scale alternated from ear to ear. This pattern was
presented ten times without pause. All tones were sine waves of
equal amplitude and 250 msec in duration.
When
presented with this configuration, no subject perceived the sequence of tones
that was delivered to one ear or to the other, and none perceived a full
ascending or descending scale. Instead, the successive tones were always
grouped together on the basis of frequency range. All subjects perceived a
sequence of four tones that repeatedly descended and then ascended. Beyond
this, percepts were divisible into two categories. Most subjects also perceived
a second stream of lower tones that repeatedly ascended and then descended. The
second stream moved in contrary motion to the first [Figure 32.4(b)]. This
percept therefore included all the pitches in the configuration; however, these
were separated into two streams on the basis of frequency range.
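
The stimulus pattern and the most common percept can be sketched as follows. This is an illustration only: the choice of C major, the ear that receives the first ascending tone, and the function names are assumptions. Tones are represented as scale-degree indices (0 = lowest, 7 = highest). Grouping by frequency range is modeled simply as taking the higher and the lower tone of each simultaneous pair.

```python
NOTE_NAMES = ["C", "D", "E", "F", "G", "A", "B", "C'"]  # assumed C major scale

def scale_illusion_stimulus(first_ear_for_ascending="left"):
    """Return one dict per time step mapping ear -> scale-degree index."""
    other = {"left": "right", "right": "left"}
    ascending = list(range(8))
    descending = ascending[::-1]
    events = []
    ear = first_ear_for_ascending
    for up, down in zip(ascending, descending):
        # One ear receives the ascending-scale tone, the other the descending one.
        events.append({ear: up, other[ear]: down})
        # Successive tones of each scale alternate from ear to ear.
        ear = other[ear]
    return events

def group_by_frequency_range(events):
    """Split each simultaneous pair into a higher and a lower stream."""
    higher = [max(e.values()) for e in events]
    lower = [min(e.values()) for e in events]
    return higher, lower

events = scale_illusion_stimulus()
left_ear = [e["left"] for e in events]
higher, lower = group_by_frequency_range(events)
higher_names = [NOTE_NAMES[i] for i in higher]
```

Neither ear receives a smooth scale, yet the two frequency-range streams each form a four-tone pattern that reverses direction, moving in contrary motion to one another.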

Table 32.1. Numbers of Right-Handers and Left-Handers Perceiving Both
the Higher and the Lower Pitch Sequences in the Scale Illusion ("Both"), and Those Perceiving Only
the Higher Pitches ("Single")
Streams
Handedness Both Single
The
right-handers tended significantly to hear both
streams; however, the left-handers did not show such a tendency (from
D. Deutsch, Two-channel listening to musical scales, Journal of the
Acoustical Society of America, 1975, 57. Reprinted with
permission.)
A
minority of subjects perceived instead only one stream of four tones that
repeatedly descended and then ascended. This corresponded to the higher
sequence of tones; little or nothing of the lower sequence was perceived.
Table
32.1 shows the numbers of right-handed and left-handed subjects who obtained
these two categories of percept. As can be seen, the two handedness groups
differed significantly on this measure. Further, in considering those subjects
who perceived both streams, significant differences between the two handedness
groups also emerged. Most right-handers obtained an illusion whereby the higher
tones all appeared localized in one ear and the lower tones in the other ear. As
shown in Table 32.2, there was a highly significant tendency to perceive the
higher tones in the right ear and the lower tones in the left ear, and also to
maintain a given localization pattern when the earphone positions were
reversed. The remaining right-handers obtained a variety of idiosyncratic
localization percepts, as did those who perceived only one stream. Most
left-handed subjects who perceived both streams also localized all the higher
tones in one ear and all the lower tones in the other ear. However, as shown in
Table 32.2, these subjects did not display the same localization tendencies as
did the right-handers. The remaining left-handed subjects reported a variety of
idiosyncratic localization percepts.
Table 32.2. Localization Patterns in the Scale Illusion, Displayed for Those
Subjects who Perceived All the Higher Tones in One Ear
and All the Lower Tones in the Other Ear

Figure 32.4.
(a) Stimulus configuration that produced the scale illusion. This consisted of
a major scale, presented simultaneously in both ascending and descending form. When
a tone from the ascending scale was delivered to one ear, a tone from the
descending scale was simultaneously delivered to the other ear, and successive
tones in each scale alternated from ear to ear. All tones were of equal
amplitude and 250 msec in duration. There were no pauses between tones. (b)
Percept most commonly obtained. This consisted of two melodic lines, a higher
one and a lower one, that moved in contrary motion. The higher tones all
appeared to be emanating from one earphone, and the lower tones from the other
earphone. (From D. Deutsch, Two-channel listening to musical scales, Journal
of the Acoustical Society of America, 1975, 57. Reprinted
with permission.)
To
summarize these findings, in considering what attribute was used as a
basis for grouping, organization by spatial location never occurred; rather
organization was always on the basis of frequency (see also Kubovy, 1981). Second,
in considering which principle was used, organization was always on the
basis of frequency proximity. Either listeners heard two melodic lines, one
corresponding to the higher tones and the other to the lower tones, or they
heard the higher tones alone. Third, there were substantial individual
differences in the way that this configuration was perceived, both in terms of what
was perceived and in terms of where the sounds appeared to be coming
from. These individual differences correlated strongly with handedness.
Auditory
illusory conjunctions have been shown to occur under broader circumstances
also.
This
powerful illusion appears as a good example of unconscious
inference in perception. Our auditory environment is very complex,
and the assignment of sounds to their sources is rendered difficult by the
presence of echoes and reverberation. So when a sound mixture is presented such
that both ears are stimulated simultaneously, we cannot judge from first-order
localization cues (see Note 7) alone which components of the total spectrum
should be assigned to which source. We therefore need to utilize other cues in
making such judgments. One such cue is similarity of frequency spectrum.
Similar sounds are likely to be coming from the same source, and different
sounds from different sources. It is therefore reasonable for the listener to
conclude that tones in one frequency range are coming from one source, and that
tones from a different frequency range are coming from another source. The
tones are therefore perceptually reorganized in space in accordance with this
interpretation (D. Deutsch, 1975c).
1.2.3. Grouping of Nonsimultaneous Sound Sequences. If the above line of reasoning is correct, we
should expect that perceptual grouping of parallel pitch sequences would be
strongly influenced by the salience of the first-order localization cues. If,
in contrast to the conditions just described, such cues were strong and
unambiguous, channeling by spatial location would be expected to take
precedence over channeling by frequency range. One can produce such a situation
by employing sequences in which the tones at the two ears are clearly separated
in time.
To
examine this hypothesis, perceptual grouping was examined as a function of the
temporal relationships between the signals arriving at the two ears (D.
Deutsch, 1979a). Subjects were asked to identify rapid melodic patterns whose
component tones switched from ear to ear. In one set of conditions, input was to
one ear at a time; in another set, input was to both ears simultaneously. It
was predicted that when input was to one ear at a time identification of the
melody should be difficult, reflecting perceptual grouping by spatial location.
However, when both ears receive input simultaneously, identification of the
melody should be much easier.
Subjects
were presented with sequences of pure tones. Each sequence consisted of ten
repetitions of a basic eight-tone melody. All tones were of equal amplitude and
30 msec in duration, with tones within a melody separated by 100-msec pauses. Two
such melodies were employed, and the subjects identified on each trial which of
these had been presented.
The
experiment employed four conditions, which are illustrated in Figure 32.5. In
Condition A, all tones of the melody were presented simultaneously to both
ears. In Condition B, the component tones of the melody were distributed in random fashion between the ears. Condition C was
identical to Condition B except that the melody was accompanied by a drone. Whenever
a tone from the melody was presented to the right ear, the drone was
simultaneously presented to the left ear, and vice versa. Condition D was
identical to Condition C except that the drone was always presented to the same
ear as the tone from the melody.
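The four conditions can be read as a rule that assigns the melody tone and the drone to ears on each tone position. The following is a minimal Python sketch of that rule, not the original procedure: the function names, the "L"/"R" coding, and the use of a seeded uniform random choice for the ear of each melody tone are all assumptions made for illustration.

```python
import random


def ear_assignment(condition, n_tones=8, seed=0):
    """Return, for each tone position, a pair (melody_ears, drone_ears).

    Each member of the pair is a set drawn from {"L", "R"}.
    The uniform random ear choice is an illustrative assumption.
    """
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_tones):
        if condition == "A":
            # Melody tone delivered to both ears at once, no drone.
            schedule.append(({"L", "R"}, set()))
            continue
        ear = rng.choice("LR")
        other = "R" if ear == "L" else "L"
        if condition == "B":
            drone = set()      # melody switches ears, no drone
        elif condition == "C":
            drone = {other}    # contralateral drone
        elif condition == "D":
            drone = {ear}      # ipsilateral drone
        else:
            raise ValueError("condition must be 'A', 'B', 'C', or 'D'")
        schedule.append(({ear}, drone))
    return schedule


def both_ears_stimulated(schedule):
    """True if every tone position delivers input to both ears."""
    return all(melody | drone == {"L", "R"} for melody, drone in schedule)
```

Under these assumptions, `both_ears_stimulated` is true for Conditions A and C and false for Conditions B and D, which is the distinction the conditions were designed to draw.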
The
percentages of correct identifications of the melodies in the different
conditions of the experiment are shown in Figure 32.5. It can be seen that
excellent performance was obtained in Condition A, in which the melodies were
presented binaurally. In contrast, performance in Condition B, in which the
tones from the melodies were distributed between the ears, was very poor. The
procedure of switching the tones from ear to ear thus produced a considerable
decrement in identification performance. However, in Condition C, in which a
contralateral drone was presented so that input was to both ears
simultaneously, the performance level was again very high. This finding cannot
be attributed to processing the harmonic relationships between the drone and
the melody because in Condition D, in which the drone was presented to the same
ear as the melody component, performance was below chance. In this last
condition, input was no longer to the two ears simultaneously.
This
experiment demonstrates that temporal relationships between tones emanating
from different spatial locations are important factors in determining how the
tones are perceptually grouped. When signals are emanating from two locations
simultaneously, as in Conditions A and C, it is easy to integrate the
information arriving at the two ears into a single perceptual stream. However,
when the signals coming from the two locations are clearly separated in time,
as in Conditions B and D, grouping by spatial location is so powerful as to
prevent the listener from combining the tones to produce an integrated percept.
We may next enquire what happens in the intermediate case, where inputs to the two ears overlap but are not strictly synchronous. This condition brings us closer to normal listening, and also to the case where streams of speech are presented in parallel to both ears. A second experiment investigated the effects of onset-offset asynchrony between the components of the melody and the contralateral drone. In the asynchronous conditions, all tones were again 30 msec in duration, and the drone either led or lagged the melody components by 15 msec

Figure 32.5. Percentage of errors in
identification of melodic patterns when the component tones of the patterns
switched between ears. On each trial, ten repetitions of a basic
eight-tone pattern were presented. All tones were 30 msec in duration, and
tones within a pattern were separated by 100-msec pauses. Two such melodies
were employed, and subjects identified on each trial which of these had been
presented. In Condition A (melody presented
binaurally) excellent performance was obtained. In Condition B (melody distributed
between ears) performance was very poor. In Condition C (contralateral drone
accompanying melody) performance levels were again high. In Condition D
(ipsilateral drone accompanying melody) performance was below chance. (From D. Deutsch, Binaural integration of melodic patterns,
Perception and Psychophysics, 1979, 25. Reprinted with permission.)
or the right ear tones led or lagged the left ear tones by 15 msec. Performance
levels in these conditions were significantly lower than when the melody components
and the drone were strictly synchronous, and they were also significantly
higher than when the melody components switched between ears without an
accompanying drone. This is as expected on the present line of reasoning.
A
similar experiment was performed by Judd (1979). Two repeating stimulus
patterns were constructed from four square-wave tones, each 100 msec in
duration. The two patterns are shown in Figure 32.6. It can be seen that,
taking each channel separately and treating the patterns as cyclically
repeating, the tones in the two patterns were identically ordered. However,
when the channels were combined, two different melodic patterns emerged
instead. Subjects were presented with pairs of these patterns and were required
to judge whether the members of each pair were the same or different. On half
of the trials, the silent gaps between the tones were replaced by noise. It was
found that performance was better in the noise-filler condition than in the
silent-gap condition. Judd interpreted this finding as due to the noise
degrading the localization information, which encouraged grouping of successive
tones on the basis of frequency range rather than spatial location.
Schubert
and Parker (1956) performed an experiment that may be interpreted similarly. These
authors measured the amount of interference in speech perception that was
produced by switching the signal from ear to ear. They found that adding noise
to the contralateral ear reduced this interference effect (Figure 32.7). It may
plausibly be argued that the ongoing speech-noise signal was interpreted by the
listener in terms of two sources, one emitting noise and the other emitting
speech, whereas the ongoing speech-silence signal was interpreted by the
listener in terms of two independent speech sources.
1.2.4. The Hypothesis of a Slow Switching Mechanism. The problem of degradation of processing
when information is switched from ear to ear has been addressed in other
contexts. For instance, Cherry and
A
related paradigm involves recall of lists of digits that are dichotically presented.
When two such dichotic lists were delivered at fast rates, recall was found to
be better by ear than by temporal order, the latter task requiring switching
between ears (Broadbent, 1954, 1958).
Figure 32.6. Stimulus configurations employed to investigate the effect of contralateral noise on the ability to discriminate melodic patterns whose component tones alternated between ears. Tones were 100 msec in duration, with fundamental frequencies of (1) 912 Hz, (2) 1024 Hz, (3) 1150 Hz, and (4) 1290 Hz. Discrimination performance was enhanced when the gaps between the tones were replaced by noise. (From T. Judd, Comments on Deutsch's musical scale illusion, Perception and Psychophysics, 1979, 26. Reprinted with permission.)

Further,
subjects showed poorer recall of successive lists of digits when these were
presented alternately to the two ears than when they were presented
binaurally (A. Treisman, 1971). This finding
cannot be ascribed to perceptual interference with the basic units of speech,
since there was no disruption of the verbal items in these experiments. Some
difficulty in the ability to switch attention between the ears was therefore
hypothesized.
In
contrast to the above arguments for a switching limitation, powerful general
arguments may be made against the idea that information from the two ears
cannot be dealt with in rapid succession.
Figure 32.7. Percentages of words correctly repeated as a function of rate at which the speech signal was switched from ear to ear. The lower curve shows the results for trials with silence in the contralateral ear. The upper (dotted) curve shows the results for trials in which noise was delivered to the contralateral ear. The contralateral noise resulted in enhanced speech intelligibility, especially at switching rates of around 4 Hz, where intelligibility was otherwise substantially reduced. (From E. D. Schubert & C. D. Parker, Addition to Cherry's findings on switching speech between two ears, Journal of the Acoustical Society of America, 1956, 27. Reprinted with permission.)

In
everyday listening, the information arriving at the two ears is never
identical, and the running cross-correlations performed on
this information are very important for several functions. One such
function is localization, and another is the suppression of echoes and
reverberation (Haas, 1951; Tobias & Schubert, 1959; Wallach, Newman, &
Rosenzweig, 1949). The auditory elements that are
compared for such functions may be separated by only a few microseconds. Such
an ability to utilize information entering the two ears in rapid succession is
not consistent with the notion of a slow switching mechanism.
Two
conflicting sets of phenomena have therefore been reported, one arguing for a
decrement in processing information where rapid switching between ears is
involved, and the other arguing against such a decrement. We may resolve this
conflict on the following line of reasoning. An important function of our
auditory system is to separate out the signals emanating from different
sources. If such perceptual separations were not accomplished, we would not know
which elements of the acoustic spectrum to link together so as to form higher-order
abstractions. It is necessary, therefore, that there exist mechanisms that
inhibit the formation of higher-order linkages between acoustic elements that
are likely to be emanating from different sources. Since our acoustic
environment is very complex, such mechanisms must be flexible and employ
multiple criteria. Thus certain configurations involving input to the two ears
would be interpreted as coming from the same source,
so that integration of this information should be easy. Yet other
configurations would best be interpreted as emanating from different sources,
so that integration should be difficult. According to this hypothesis, when a
decrement in integrating information arriving at the ears occurs, this is due
not to capacity limitation, but rather to a mechanism that we have evolved to
prevent confusion in monitoring our auditory environment (see Bregman, 1978, 1981, for an analogous argument based on findings
involving various monaural tasks).
1.2.5. The Octave Illusion. In the experiments described in Section 1.2.2, when tones
were presented to both ears simultaneously with synchronous onsets and
offsets, sequential grouping by frequency proximity was the rule. Grouping by
ear of input occurred only when there were temporal separations between the
stimuli presented to the two ears. We now turn to an examination of certain
situations in which grouping by ear of input occurs even though such input is
strictly simultaneous. It will be seen that this happens only under special
conditions of frequency relationship between the tones presented in sequence at
the two ears.
One
such situation is illustrated in Figure 32.8(a). This shows the stimulus
pattern that gives rise to the octave illusion (D. Deutsch, 1974, 1975c). It
can be seen that two tones that were spaced an octave apart (400 and 800 Hz)
were repeatedly presented in alternation. The identical sequence was delivered
to the two ears simultaneously; however, when the right ear received the high
tone the left ear received the low tone and vice versa. So in fact the listener
was presented with a single, continuous, two-tone chord, but the ear of input
for each component switched repeatedly.
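The stimulus just described can be written down as a list of dichotic chords. The sketch below is a minimal illustration in Python; the function name and the `right_starts_high` parameter are hypothetical, and no tone duration is modeled because none is given in this passage. It shows that every chord contains both 400 and 800 Hz, with only the ear of input changing.

```python
LOW_HZ = 400   # lower tone of the octave pair
HIGH_HZ = 800  # upper tone of the octave pair


def octave_illusion_sequence(n_chords, right_starts_high=True):
    """Return a list of (left_hz, right_hz) pairs, one per dichotic chord.

    The two ears always receive different tones, and the assignment
    reverses on every successive chord.
    """
    sequence = []
    for i in range(n_chords):
        right_is_high = (i % 2 == 0) == right_starts_high
        if right_is_high:
            sequence.append((LOW_HZ, HIGH_HZ))
        else:
            sequence.append((HIGH_HZ, LOW_HZ))
    return sequence
```

For example, `octave_illusion_sequence(4)` yields `[(400, 800), (800, 400), (400, 800), (800, 400)]`: taken over both ears the same two-tone chord is present throughout, while each ear alone receives an alternating high-low sequence.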
This
configuration produced a number of illusory percepts, the most common of which
is illustrated in Figure 32.8(b). It can be seen that this consisted of a
single tone that alternated from ear to ear, and whose pitch simultaneously
alternated from one octave to another in synchrony with the localization shift.

When
the earphones were placed in reverse position, most listeners found that the apparent
locations of the high and low tones remained fixed. Thus it seemed to these
listeners that the earphone that had been producing the high tones was now
producing the low tones, and that the earphone that had been producing the low
tones was now producing the high tones.
If we
assume that there are two separate brain mechanisms, one for determining what
pitch we hear and the other for determining where the sound is located, we are
in a position to advance an explanation for this illusion. The model is diagrammed
in Figure 32.9. To determine the perceived pitch, the information arriving at
one ear is followed, and the information arriving at the other ear is
suppressed. However, each tone is localized in the ear receiving the
higher-frequency signal, regardless of which frequency is in fact perceived (D.
Deutsch, 1975c). The combined output of these two mechanisms, for the case of
the listener whose pitch percept corresponds to the frequencies presented to
the right ear, should result in the percept of a high tone to the right
alternating with a low tone to the left. For the case of the listener whose
pitch percept corresponds to the frequencies presented to the left ear instead,
the resultant percept should be that of a high tone to the left alternating
with a low tone to the right.
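The two-mechanism account can be expressed as a small rule applied to each dichotic chord. The following Python sketch is an illustrative reading of the model, not an implementation taken from the original work; the function name, the "L"/"R" coding, and the `dominant_ear` parameter are assumptions. Pitch is taken from the dominant ear, and location is taken from whichever ear receives the higher frequency.

```python
def predicted_percept(sequence, dominant_ear="R"):
    """Predict (pitch_hz, side) for each chord in a dichotic sequence.

    sequence     : list of (left_hz, right_hz) pairs
    dominant_ear : "L" or "R", the ear followed by the pitch mechanism
    """
    if dominant_ear not in ("L", "R"):
        raise ValueError("dominant_ear must be 'L' or 'R'")
    percept = []
    for left_hz, right_hz in sequence:
        if left_hz == right_hz:
            # The model as described covers only differing frequencies.
            raise ValueError("the two ears must receive different frequencies")
        # Pitch mechanism: follow one ear, suppress the other.
        pitch = right_hz if dominant_ear == "R" else left_hz
        # Localization mechanism: follow the higher-frequency signal.
        side = "R" if right_hz > left_hz else "L"
        percept.append((pitch, side))
    return percept


# Two successive chords of the alternating 400/800 Hz pattern.
stimulus = [(400, 800), (800, 400)]
right_dominant = predicted_percept(stimulus, "R")
left_dominant = predicted_percept(stimulus, "L")
```

With these inputs, `right_dominant` is `[(800, "R"), (400, "L")]`, a high tone to the right alternating with a low tone to the left, and `left_dominant` is `[(400, "R"), (800, "L")]`, a high tone to the left alternating with a low tone to the right, matching the two cases described above.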
This
model received confirmation in a further experiment (D. Deutsch & Roll,
1976). Subjects were presented with the basic pattern shown in Figure 32.10(a).
This again employed tones standing in octave relation. It can be seen that one
ear received three high tones followed by two low tones, while simultaneously
the other ear received three low tones followed by two high tones. This basic
pattern was repeatedly presented ten times without pause.
As
expected from the model, most subjects perceived a pattern of pitches that
corresponded to the frequencies presented either to the right ear or to the
left ear. In other words, they heard a repeating sequence consisting either of
three high tones followed by two low tones, or of three low tones followed by
two high tones. However, each tone was localized in the ear that received the
higher frequency. This is illustrated in Figure 32.10(b). When Channel A was
presented to the right ear and Channel B to the left, the listener heard a
repeating sequence of three high tones to the right followed by two low tones
to the left. When, however, Channel A was presented to the left ear and Channel
B to the right, the listener now heard a repeating sequence of two high tones
to the right followed by three low tones to the left.
Most subjects in the D. Deutsch (1974) experiment perceived a single high tone in one ear alternating with a single low tone in the other ear.
Figure 32.9. Diagram showing how the outputs of the pitch and localization mechanisms combine to produce the octave illusion. Filled boxes indicate high tones (800 Hz) and unfilled boxes indicate low tones (400 Hz). The pitch mechanism follows the sequence of frequencies presented to one (dominant) ear rather than to the other. However, the localization mechanism follows the higher-frequency signal, regardless of whether the higher or the lower frequency is perceived. The outputs of these two mechanisms combine to produce the percept of a high tone in one ear alternating with a low tone in the other ear. (From D. Deutsch, The octave illusion and auditory perceptual integration, in J. V. Tobias & E. D. Schubert (Eds.), Hearing research

Figure 32.10. Stimulus patterns and percepts in experiment to test hypothesized basis for the octave illusion. Filled boxes represent tones of 800 Hz and unfilled boxes represent tones of 400 Hz. The basic patterns shown were presented ten times without pause. In accordance with the hypothesis, most subjects reported the pattern of pitches that was presented to the right ear; yet all subjects localized each tone to the ear receiving the higher-frequency signal. (From D. Deutsch & P. L. Roll, Separate 'what' and 'where' decision mechanisms in processing a dichotic tonal sequence, Journal of Experimental Psychology: Human Perception and Performance, 2. Copyright 1976 by American Psychological Association. Reprinted with permission.)

However,
some subjects instead perceived a single tone that alternated from
ear to ear, whose pitch either did not change or changed only slightly with a
shift in its apparent location. Other subjects heard more complex patterns,
such as two low tones that alternated from ear to ear with an intermittent high
tone in one ear. Such patterns were usually unstable, exhibiting frequent
changes with continued listening.
The
individual differences in perception of this illusion were found to correlate
with handedness. As shown in Table 32.3, the proportion of subjects reporting
complex percepts was substantially higher in the left-handed than in the
right-handed population (see also Craig, 1979). A second handedness correlate concerned the localization
patterns for the high and low tones. As shown in Table 32.4, most right-handers
heard the high tone on the right and the low tone on the left, regardless of
the positions of the earphones (see also Geffen & Reynolds, 1982; McClurkin
& Hall, 1981). In contrast, the left-handers did not show a significant
tendency to localize the high and low tones
Table 32.3.

Percentages
of right-handers and left-handers are displayed. "Octave"
indicates the percept of a single tone that alternates from ear
to ear, whose pitch simultaneously alternates from one octave to
the other. "Single Pitch" indicates the percept of a single
tone that alternates from ear to ear, whose pitch either does
not change or shifts slightly with a change in localization. "Complex"
comprises a number of different complex percepts. The proportion of
subjects obtaining complex percepts was considerably higher among
left-handers than among right-handers. (From D. Deutsch,
An auditory illusion, Nature, 251. Copyright 1974 by Macmillan Journals Ltd. Reprinted with permission.)
Table 32.4.

Each subject was given two
presentations of the sequence, for 20 sec each time, with earphones
placed first one way and then the other. The numbers of
right-handers and left-handers obtaining a given localization pattern
are displayed. RR: High tone localized in the right ear and low tone
in the left on both presentations. LL: High tone localized in the
left ear and low tone in the right on both presentations. Both: High tone
localized in the right ear and low tone in the left on one
presentation; and high tone localized in the left ear and low
tone in the right on the other. Right-handers tended strongly to hear
the high tone in the right and the low tone in the left; however,
left-handers did not display this tendency either way, and showed a
greater tendency to change their localization patterns.
Given the strong correlates
with handedness in perception of the octave illusion, it is interesting to
consider the neurological differences on which such correlates might be based. The
overwhelming majority of right-handers are left-hemisphere dominant,
but this is true of only about two-thirds of left-handers. Further, the
majority of right-handers have a clear dominance of the left hemisphere;
however, a substantial proportion of left-handers have some bilateral
representation (Goodglass & Quadfasel, 1954; Hécaen & de Ajuriaguerra,
1964; Hécaen & Piercy, 1956; Milner, Branch, & Rasmussen, 1966;
Subirana, 1969; and Zangwill, 1960). It appears reasonable to assume
that these
patterns of dominance are reflected in percepts of the octave illusion in two
ways. First, the localization of the high tone on the right and the low tone on
the left reflects left-hemisphere dominance, with the localization of the high
tone on the left and the low tone on the right reflecting right-hemisphere
dominance. Second, unambiguous localization patterns reflect clear dominance,
with complex percepts reflecting more cerebral equipotentiality.
Localization
patterns have been shown to correlate not only with handedness, but also with
familial handedness background. In a study by D. Deutsch (1983b), subjects
with left- or mixed-handed parents or siblings were found less likely to
localize the high tone on the right and the low tone on the left than were
subjects without left- or mixed-handed parents or siblings. This was found true
for right-handed, mixed-handed, and left-handed populations.
A
further question of interest is whether the interactions underlying the
localization and pitch effects in the octave illusion occur between pathways
conveying information from the two ears, or whether instead pathways conveying
information from different regions of auditory space are involved. To
investigate this question, the stimuli were presented through spatially separated
loudspeakers rather than earphones (D. Deutsch, 1974, 1975c). An analogous
illusion was obtained under these conditions: The subjects perceived a high
tone that appeared to be coming from one speaker, which alternated with a low
tone that appeared to be coming from the other speaker. This effect was
obtained even with the two speakers placed side by side, facing the listener,
which shows that highly specific regions of auditory space were involved here.
We
shall now consider only what sequence of pitches is perceived in the octave
illusion and leave aside the issue of where the tones appear to be located. In
the octave illusion, channeling of pitch sequences was always on the basis of
spatial location. However, in the scale illusion, channeling was always on the
basis of frequency proximity instead. Yet the stimuli producing these two
illusions were in several ways very similar. In both cases, repeating sequences
of sine-wave tones at equal amplitudes and durations were presented, with
synchronous onsets and offsets. Also in both cases, the frequencies presented
to one ear always differed from the frequencies simultaneously presented to the
other ear. Nevertheless, radically different channeling strategies arose in
response to these two stimulus patterns. It is particularly noteworthy that,
when two tones standing in octave relation were simultaneously presented in the
scale illusion, both these tones were generally perceived. But when two tones
standing in octave relation were simultaneously presented in the octave
illusion, only one of these tones was generally perceived. Such differences in
channeling strategy must therefore arise from differences in the patterns of
frequency relationship between successive tones.
Another
characteristic of the stimulus producing the octave illusion was that the
frequency emanating from one side of space was always the same as the frequency
that had just emanated from the opposite side. It therefore seemed plausible to
hypothesize that this sequential relationship was responsible for producing
channeling by spatial location. A further set of experiments was performed to
test this hypothesis (D. Deutsch, 1980a, 1981).
In the first experiment, listeners were presented with sequences consisting of 20 dichotic chords. Two conditions were compared, using the basic patterns illustrated in Figure 32.11(a).
Figure 32.11.
(a)
Configurations used in first experiment examining effects of sequential
interactions on ear dominance. Each sequence consisted of 20 dichotic chords.
In Condition 1, the two ears received the same frequencies in succession;
however, this was not true in Condition 2. (b) Percentage of following of
nondominant ear in these two conditions, as a function of amplitude differences
between the tones at the two ears. In Condition 1, the dominant ear was followed
until a critical level of amplitude relationship was reached, and the
nondominant ear was followed beyond this level. However, there was no following
on the basis of ear of input in Condition 2. (From D. Deutsch, Ear dominance
and sequential interactions, Journal of the Acoustical Society of

The pattern
in Condition 1 consisted of the repetitive presentation of a single chord. The
tones comprising this chord stood in octave relation and alternated from ear to
ear in such a way that when the high tone was in the right ear the low tone was
in the left ear and vice versa. Here the two ears received the same frequencies
in succession. The sequence presented to the right ear began with the high tone
and ended with the low tone on half of the trials, while this order was reversed
on the other half. The subjects were asked to judge whether the sequence began
with the high tone and ended with the low tone or whether it began with the low
tone and ended with the high tone. It was thus possible to infer which ear was
being followed for pitch.
In Condition 2, the basic pattern consisted of the repetitive presentation of
two dichotic chords in alternation. The tones comprising the first chord formed
an octave and the second a
minor third; thus the entire four-tone combination constituted a major triad.
Note that here the two ears did not receive the same frequencies in succession.
The right ear received the higher tone of the first chord and the lower tone of
the last chord on half of the trials. The order was reversed on the other half
of the trials.