AUDITORY PATTERN RECOGNITION
DIANA DEUTSCH
University of California, San Diego, La Jolla, California


CONTENTS

1. Auditory Grouping Phenomena
  1.1. Parsing of Sounds of Complex Spectral Composition
    1.1.1. Harmonicity of Spectral Components
    1.1.2. Time-Variant Relationships
    1.1.3. Familiarity
  1.2. Grouping of Sound Sequences in Space
    1.2.1. Auditory Illusory Conjunctions
    1.2.2. The Scale Illusion
    1.2.3. Grouping of Nonsimultaneous Sound Sequences
    1.2.4. The Hypothesis of a Slow Switching Mechanism
    1.2.5. The Octave Illusion
    1.2.6. Grouping of Phase-Shifted Tones
  1.3. Grouping of Rapid Sound Sequences
    1.3.1. Grouping by Frequency
    1.3.2. Grouping by Frequency Proximity
    1.3.3. Temporal Coherence as a Function of Frequency Proximity and Tempo
    1.3.4. Grouping by Frequency Proximity in Relation to Repetition
    1.3.5. Frequency Proximity and the Perception of Temporal Relationships
    1.3.6. Grouping by Good Continuation
    1.3.7. Grouping by Sound Quality
    1.3.8. Grouping by Amplitude
    1.3.9. Grouping by Temporal Position
    1.3.10. Grouping by Spatial Location
    1.3.11. Closure: The Auditory Continuity Effect
  1.4. Grouping and Selective Attention
    1.4.1. Voluntary and Involuntary Grouping
    1.4.2. Consequences of Attention Focusing
2. Shape Analysis for Pitch Structures
  2.1. Auditory Shape Analysis as a Multileveled Process
  2.2. Passive Versus Active Processing
  2.3. Feature Abstraction
    2.3.1. Octave Equivalence
    2.3.2. Interval and Chord Equivalence
    2.3.3. Categorical Perception of Musical Intervals
    2.3.4. Global Cues
    2.3.5. Interval Class
  2.4. Higher-Order Abstractions
  2.5. Hierarchical Encoding of Pitch Sequences
  2.6. The Influence of Short-Term Memory on Perception of Pitch Patterns
    2.6.1. Interference Effects in Short-Term Memory for Pitch
    2.6.2. Facilitation Through Repetition in Short-Term Memory for Pitch
    2.6.3. The Influence of Relational Context on Pitch Comparison Judgments
  2.7. Contour as a Cue in Recognition of Pitch Patterns
  2.8. Scale and Key Structure in Recognition of Pitch Patterns
  2.9. Memory for Hierarchically Organized Pitch Patterns
3. Analysis of Timbre
  3.1. Timbre and Fourier Analysis
  3.2. Investigation of Timbre by Analysis and Synthesis
  3.3. Multidimensional Models of Timbre
  3.4. Role of Context in Timbre Perception
4. Perception of Temporal Relationships
  4.1. Perception of Temporal Order
    4.1.1. Modes of Order Perception
    4.1.2. Perception of the Order of Two Events
    4.1.3. Perception of the Order of Three or More Events
    4.1.4. Order Perception in Continuously Cycling Sound Patterns
    4.1.5. Theories of Order Perception
  4.2. Perception of Rhythm
    4.2.1. Subjective Rhythmic Grouping
    4.2.2. Grouping by Temporal Proximity
    4.2.3. Grouping by Accent
    4.2.4. Grouping by Other Principles
    4.2.5. The Run Principle and the Gap Principle
    4.2.6. Rhythmic Hierarchies
5. Summary
Notes
References

Research on hearing has traditionally been concerned with simple detection, discrimination, and scaling tasks. However, the last decade has seen a flowering of interest in higher-level mechanisms concerned with auditory grouping, shape perception, memory, and so on. This new development has been due largely to technological advances that have enabled researchers to generate complex auditory stimuli with precision and flexibility. Those entering the field have been rewarded by the discovery of an elaborately structured and highly differentiated system that possesses some remarkable properties.

Two major influences on research into auditory pattern recognition may be identified. The first stems from related work in perceptual and cognitive psychology. For example, the multileveled approach to auditory shape perception has been strongly motivated by theoretical and experimental work on the perception of visual shape. As another example, research into memory for sound structures has been influenced by findings on memory for verbal materials.

A second major influence derives from music theory. Fundamental concepts such as octave equivalence and interval equivalence have been in the mainstream of traditional music theory since the time of Pythagoras. Several developments in contemporary music theory have also provided input. For example, the theory of 12-tone composition, developed by Schoenberg, is based on an implied theory of shape analysis for pitch structures. Another example is the hierarchical theory of tonal music, developed early in this century by Schenker, which has points of similarity with the theory of transformational grammar developed later by Chomsky. In addition, composers of electronic and computer music have provided the major impetus to recent experimental work on the perception of sound quality or timbre, an area of research with broad implications for auditory perception in general.

This chapter is divided into four main sections. In the first, auditory grouping phenomena are investigated. This section deals with questions concerning the perceptual fusion and separation of components of a complex sound spectrum, the grouping of sound elements emanating from different spatial locations, and the grouping of sounds that occur in rapid succession. The second section is concerned with the perception and recognition of patterns formed of pitch combinations. The third section deals with the perception of timbre or sound quality. The fourth section is concerned with the perception of temporal order and of rhythm. The final section summarizes the findings in these different subfields.

1. AUDITORY GROUPING PHENOMENA

We may distinguish two basic but interrelated questions in considering how the auditory system groups stimuli into perceptual configurations. The first involves the stimulus dimensions along which grouping principles operate. When presented with a complex signal, the auditory system may group elements according to some rule based on frequency, on amplitude, on temporal or spatial position, or on some multidimensional attribute such as timbre. As will be shown, any of these attributes may serve as a basis for grouping, and further, there are complex and rigid rules determining which attribute will be used. Such rules can often be well interpreted in terms of strategies that are most likely to lead to the correct conclusions in interpreting our auditory environment. Second, we may enquire into the principles that govern grouping along any given dimension. The Gestalt psychologists proposed that we form groupings on the basis of certain simple principles, such as proximity, good continuation, similarity, and common fate (Wertheimer, 1923). As described elsewhere in this Handbook, these have been shown to be important descriptive principles for grouping in vision. We shall show here that this is true for hearing also. It may plausibly be argued that grouping in conformity with such principles enables us to interpret our environment most effectively (Bregman, 1978; D. Deutsch, 1975c; Gregory, 1970; Hochberg, 1974; Sutherland, 1973). Sounds that are similar are likely to be coming from the same source, and sounds that are dissimilar are likely to be coming from different sources. A sequence of sounds is more likely to be coming from a single source if it contains frequency transitions that are gradual rather than abrupt. Components of a sound spectrum that modulate in synchrony are more likely to be coming from a single source than those that modulate out of synchrony.

The view of auditory grouping as a process of unconscious inference may be traced to Helmholtz (1859/1954) (see Note 1). He speculated how, given the complex, time-variant spectrum produced by several musical instruments playing simultaneously, the listener reconstructs the auditory environment so that some components of the spectrum fuse perceptually to produce the impression of a single sound, while others are heard as separate melodic lines sounding in parallel. He wrote:

Now there are many circumstances which assist us first in separating the musical tones arising from different sources, and secondly, in keeping together the partial tones of each separate source. Thus when one musical tone is heard for some time before being joined by the second, and then the second continues after the first has ceased, the separation in sound is facilitated by the succession of time. We have already heard the first musical tone by itself and hence know immediately what we have to deduct from the compound effect for the effect of this first tone. Even when several parts proceed in the same rhythm in polyphonic music, the mode in which the tones of different instruments and voices commence, the nature of their increase in force, the certainty with which they are held and the manner in which they die off, are generally slightly different for each ... but besides all this, in good part music, especial care is taken to facilitate the separation of the parts by the ear. In polyphonic music proper, where each part has its own distinct melody, a principal means of clearly separating the progression of each part has always consisted in making them proceed in different rhythms and on different divisions of the bars.... All these helps fail in the resolution of musical tones into their constituent partials. When a compound tone commences to sound, all its partial tones commence with the same comparative strength; when it swells, all of them generally swell uniformly; when it ceases, all cease simultaneously. Hence no opportunity is generally given for hearing them separately and independently. (pp. 59-60)

1.1. Parsing of Sounds of Complex Spectral Composition

A basic task for auditory theory is to determine the relationships between elements of an ongoing sound spectrum that give rise to the perception of a single sound and those that give rise to the perception of several simultaneous sounds. Without these processes of fusion and separation, intelligible listening would not be possible. Presumably mechanisms have evolved that cause us to fuse together those elements of the sound spectrum that are likely to be coming from the same source and to separate out those elements that are likely to be coming from different sources. Three factors will be considered here. The first is harmonicity of spectral components; the second is synchronicity; the third is familiarity with certain sound complexes.

1.1.1. Harmonicity of Spectral Components. It has been argued from various lines of evidence that harmonic sounds are more likely to be perceived as fused than are nonharmonic sounds (see Note 2). Stringed and blown instruments have partials that are harmonic or nearly harmonic, and such partials unite to produce the impression of a single tone. In contrast, bells and gongs have partials that are nonharmonic, and these produce more diffuse sound impressions (Mathews & Pierce, 1980). De Boer (1976) has shown that harmonic complexes tend to produce unitary and unequivocal pitch sensations, whereas certain types of nonharmonic complex do not merge, but instead produce multiple pitch sensations. Since most forced vibration systems such as the voice have partials whose frequencies are harmonic or close to harmonic, such findings are as expected on the hypothesis that our auditory system has evolved to interpret sound patterns in terms of the sources from which they emanate.

We may next enquire whether the phase relations between the partials of a tone affect the fusion of its image. This question was investigated by Kubovy (Note 3). He created a set of harmonically related sinusoids, all of equal amplitude, and all beginning with a positive zero-crossing and therefore having a common zero-crossing at the frequency of the fundamental. One of these sinusoids was then moved out of phase for a few hundred milliseconds. It was then moved back into phase, while another was moved out, and so on. A perceptual segregation was produced by these means, so that a melody was heard that corresponded to the out-of-phase sinusoids.

Later, Kubovy and Jordan (1979) constructed stimuli consisting of the third to fourteenth harmonics of a 200-Hz fundamental, which were played in the sine phase. At intervals of roughly 300 msec, the phases of all components but one were reset to 0°, and the phase of the remaining component was set to a different phase angle. The out-of-phase components formed a scale that either ascended or descended, and subjects judged the direction of this scale. The results are shown in Figure 32.1. It can be seen that for phase shifts greater than 40° subjects showed near-perfect identification of scale direction. These experiments therefore demonstrate the perceptual effect of phase relationships on the fusion of single tones composed of harmonically related complexes: Phase shifting a component of the complex results in its perceptual segregation.
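A stimulus of the kind just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' procedure: the sample rate, the exact 300-msec segment length, the equal component amplitudes, the 60° shift, and the choice of which harmonics carry the "scale" are all hypothetical. Because each 300-msec segment holds a whole number of cycles of every harmonic of 200 Hz, the unshifted components run continuously across segment boundaries, and only the target component changes phase.

```python
import math

SAMPLE_RATE = 44100        # samples per second; assumed, not stated in the text
FUNDAMENTAL_HZ = 200.0     # from the text
HARMONICS = range(3, 15)   # third to fourteenth harmonics, from the text
SEGMENT_MS = 300           # the text says "roughly 300 msec"; exactly 300 is assumed


def complex_segment(shifted_harmonic, shift_deg):
    """One segment: all harmonics in sine phase except one, which is phase shifted."""
    n_samples = SAMPLE_RATE * SEGMENT_MS // 1000
    shift_rad = math.radians(shift_deg)
    samples = []
    for i in range(n_samples):
        t = i / SAMPLE_RATE
        total = 0.0
        for h in HARMONICS:
            phase = shift_rad if h == shifted_harmonic else 0.0
            total += math.sin(2.0 * math.pi * h * FUNDAMENTAL_HZ * t + phase)
        # Equal amplitudes are assumed; dividing by the count keeps the sum in [-1, 1].
        samples.append(total / len(HARMONICS))
    return samples


def scale_stimulus(target_harmonics, shift_deg):
    """Concatenate segments whose shifted harmonic steps through target_harmonics."""
    signal = []
    for h in target_harmonics:
        signal.extend(complex_segment(h, shift_deg))
    return signal


# Hypothetical ascending "scale" of phase-shifted components with a 60-degree shift.
ascending = scale_stimulus([4, 5, 6], 60.0)
```

Reversing the list of target harmonics gives the corresponding descending sequence.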

Tones whose fundamental frequencies are related by simple ratios fuse more readily than tones that are not so related. In a demonstration of this phenomenon, Rasch (1978) presented two chords in succession. The lower tones of each chord were identical, and the higher tones formed a sequence that either ascended or descended. The subjects' task was to judge whether the higher tones formed a "low-high" sequence or a "high-low" sequence. Detection thresholds were taken as the measure of the extent to which the subjects could separate out the component tones of each chord. The lower tones all had a fundamental frequency of 250 Hz. The higher tones had fundamental frequencies that either were 500 and 750 Hz or deviated slightly from these values.

The results of the experiment are shown in Figure 32.2. It can be seen that, as the relationships formed by the fundamental frequencies of the higher and lower tones deviated from simple ratios, detection performance gradually improved, indicating a decreased tendency to fuse together the higher and lower components of the chords.

Figure 32.1. Percentage of correct identification of phase-shifted target tones as a function of phase shift in degrees. The stimuli consisted of the third to fourteenth harmonics of a 200-Hz fundamental, which were played in the sine phase. At intervals of around 300 msec, the phases of all components except one were reset to 0°, and the phase of the remaining component was set to a different phase angle. The out-of-phase components formed a scale that either ascended or descended, and subjects identified the direction of the scale. Near-perfect identification was shown for phase shifts greater than 40°. (From M. Kubovy & R. Jordan, Tone-segregation by phase: On the phase sensitivity of the single ear, Journal of the Acoustical Society of America, 1979, 66. Reprinted by permission.)

Figure 32.2. Detection thresholds for higher tones in the presence of lower tones. Two chords were presented in succession. The lower tones of the chords were both at 250 Hz, and the higher tones formed either a "low-high" sequence or a "high-low" sequence. Either the higher tones were at 500 and 750 Hz, or they deviated slightly from these values. Subjects judged whether a "low-high" sequence or a "high-low" sequence had been presented. Detection thresholds fell gradually with increasing deviation from the 500-Hz and 750-Hz values, in roughly symmetrical fashion.

1.1.2. Time-Variant Relationships. One factor that may be hypothesized to contribute to the impression of a single fused sound is coordinated modulation in the steady state. In forced vibration systems, any perturbation of the driving force will result in perturbations of components of the spectrum that are proportional to their frequencies. Thus a complex of sinusoids that is modulated in correlation is likely to be emanating from a single source. McNabb and Chowning (quoted by McAdams, 1982) have demonstrated informally that a harmonic tone complex with a spectral power distribution conforming to that of a vowel produces only a weak vocal sensation, and only weak perceptual fusion. However, if a small amount of frequency modulation is superimposed on all the spectral components simultaneously, they sound strongly fused. Similar observations have been reported informally by McAdams (1982).

By the same token, if we hear a complex of sinusoids with uncorrelated modulation functions, the likelihood is that the components of the complex are emanating from different sources. McAdams (1982) reports an informal experiment employing a complex stimulus in which a transition was made from perfectly correlated to two uncorrelated frequency modulation functions. For harmonic tone complexes, the listener's percept shifted from a single fused image to two distinct images. The effect was uncertain for inharmonic tone complexes.
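The distinction between correlated and uncorrelated modulation can be made concrete with a small sketch. It assumes a simple model in which the k-th harmonic has instantaneous frequency k · f0 · (1 + m_k(t)), so that each perturbation is proportional to the component's frequency. The sinusoidal modulators, their rates and depths, and the 220-Hz fundamental are hypothetical choices for illustration only and are not the stimuli used in the informal experiments cited above.

```python
import math


def vibrato(rate_hz, depth, phase=0.0):
    """Return a modulation function m(t) giving a fractional frequency deviation."""
    return lambda t: depth * math.sin(2.0 * math.pi * rate_hz * t + phase)


def instantaneous_frequencies(f0, harmonics, modulators, t):
    """Frequency of each harmonic at time t: k * f0 * (1 + m_k(t))."""
    return [k * f0 * (1.0 + m(t)) for k, m in zip(harmonics, modulators)]


def ratios_to_lowest(freqs):
    """Frequency of each component relative to the lowest one."""
    return [f / freqs[0] for f in freqs]


harmonics = [1, 2, 3, 4]
common = vibrato(6.0, 0.01)        # hypothetical 6-Hz modulation, 1% depth
other = vibrato(4.5, 0.01, 1.0)    # hypothetical independent modulator

# Correlated case: every component shares one modulator, so the harmonic
# ratios 1:2:3:4 are preserved at every instant.
correlated = [common] * len(harmonics)

# Uncorrelated case: two subsets follow different modulators, so the ratios
# between the subsets drift over time.
uncorrelated = [common, other, common, other]

t = 0.013
print(ratios_to_lowest(instantaneous_frequencies(220.0, harmonics, correlated, t)))
print(ratios_to_lowest(instantaneous_frequencies(220.0, harmonics, uncorrelated, t)))
```

In the correlated case the printed ratios stay at the harmonic values, whereas in the uncorrelated case the components driven by the second modulator depart from them.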

A related finding was obtained by Rasch (1978), using the sequence detection task described in Section 1.1.1. He showed that, when the higher tones of the chords were frequency modulated while the lower tones remained unmodulated, detection of whether the chords formed a "low-high" sequence or a "high-low" sequence was enhanced, so that uncorrelated modulation resulted in decreased fusion of the simultaneously presented tones.

How does onset asynchrony of two simultaneous tones affect perceptual fusion? Rasch (1978) used the same detection task to study the effect of delaying the lower tones of the chords relative to the higher tones. As shown in Figure 32.3, detection performance was strongly influenced by this manipulation. Each 10 msec of delay was associated with roughly a 10-dB downward shift of threshold. For a delay of 30 msec, threshold for perception of the high tone was close to that for the high tone presented alone.

Rasch further noted that the phenomenological effect of asynchrony was very strong. Whereas in the synchronous conditions a single "sound object" was perceived, in the asynchronous conditions the two tones stood apart very clearly. However, the onsets of the two tones were not separately audible, so that they were perceived as two separate but simultaneous sounds. This is an example of the continuity effect (see Section 1.3.11).

A related finding was obtained by Bregman and Pinker (1978). These authors presented a two-tone complex in alternation with a third tone and introduced various conditions of onset-offset asynchrony between the simultaneous tones in the complex. As the degree of asynchrony increased, the likelihood also increased that one of the simultaneous tones would form a melodic stream with the third tone. Bregman and Pinker argued that the asynchrony of the simultaneous tones resulted in a decreased tendency for these tones to be treated as coming from the same source and so facilitated a sequential organization by frequency proximity between one of these simultaneous tones and the alternating tone.

Figure 32.3. Detection thresholds for higher tones in the presence of lower tones. The paradigm used was as described in Figure 32.2. The lower tones were at 250 Hz, and the higher tones were at 500 Hz and 750 Hz. Either the higher tones ended simultaneously with the lower tones (solid line), or they ended immediately following onset of the lower tones (dashed line). Thresholds were virtually unaffected by amount of overlap but were strongly affected by delay of the lower tones. Each 10 msec of delay produced roughly a 10-dB downward shift in threshold. (From R. A. Rasch, The perception of simultaneous notes such as in polyphonic music, Acustica, 1978, 40. Reprinted with permission.)

Dannenbring and Bregman (1978) investigated the effects of several variables on the tendency of one component of a complex tone either to fuse with the other components or alternatively to be pulled out into a different melodic stream. The stimuli consisted of a complex of three pure tones (at 500, 1000, and 2000 Hz) that alternated repeatedly with a single "captor" pure tone (at 500, 1000, or 2000 Hz). The amplitudes of the components of the complex tone either were equal or increased or decreased with frequency. The amplitude of the "captor" tone was always equal to that of the "target" component of the frequency with which it alternated. The relative onsets and offsets of the components of the complex tone were also varied. Subjects judged the repetition rate of the captor tone. If this rate was judged to be slow, the components of the complex tone were considered to be fused into a single unit. However, if this rate was judged to be fast, the target component of the complex tone was considered to have been pulled into the same stream as the captor.

Various findings emerged from this study. First, the tendency for the formation of melodic streams was found to be greater when the repeating tone was at 500 Hz than when the tone was at one of the other two frequencies. Second, the tendency to fusion was greatest for tones in which the relative amplitudes of the components decreased with frequency, a situation most like that commonly encountered in the natural environment. Third, when the target component led the other components of the complex tone at onset, there was an increased tendency to produce melodic streams. This was also true when the target component lagged the other components at offset. However, when the target component lagged the others at onset or led them at offset, no such effect occurred.

The effects of fusion and separation of two gliding tones were studied by Steiger and Bregman (1981). Here the tones glided in parallel on a log frequency scale, and the glides were repeatedly presented in alternation with a pure tone "captor" glide. Subjects judged whether the stimulus was "fused" (i.e., whether the sequence appeared as an isochronous alternation of a pure tone with a rich tone) or "decomposed" (i.e., whether the sequence appeared to contain three tones in each cycle). The tendency for the stimulus to be judged as decomposed was enhanced when the captor and target glides were in the same frequency range, and also when the captor and target glides had the same orientation.

A sudden change in the amplitude of a component of a tone complex can cause this component to stand out perceptually. This was demonstrated by Kubovy (Note 4). He presented subjects with an eight-tone chord whose components were successively turned off abruptly for 80 msec and then restored to full amplitude. This manipulation occurred at a rate of three per second. The subjects perceived a melody that corresponded to the order in which the tones were subjected to this momentary amplitude disparity. For this pitch segregation effect to occur, it was necessary that the frequency spacing between successive tones be greater than the critical band.
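The gating schedule in this demonstration can be sketched as follows. The 80-msec gap, the rate of three per second, and the eight components come from the description above. Everything else is an assumption made for illustration: that each gap begins at the start of its one-third-second period, that the schedule wraps around once every component has been gated, and the particular gating order used in the example.

```python
GAP_MS = 80            # duration of each silent gap, from the text
GAPS_PER_SECOND = 3    # rate of the manipulation, from the text
N_COMPONENTS = 8       # eight-tone chord, from the text


def silenced_component(t_ms, order):
    """Return the index of the chord component that is off at time t_ms, or None.

    `order` lists component indices in the order in which they are gated.
    The schedule wraps around when the list is exhausted (an assumption).
    """
    slot = (t_ms * GAPS_PER_SECOND) // 1000          # which gap period t_ms falls in
    slot_start_ms = slot * 1000 / GAPS_PER_SECOND    # onset of that period
    if t_ms - slot_start_ms < GAP_MS:
        return order[slot % len(order)]
    return None


def amplitudes(t_ms, order):
    """Amplitude (0.0 or 1.0) of each chord component at time t_ms."""
    off = silenced_component(t_ms, order)
    return [0.0 if i == off else 1.0 for i in range(N_COMPONENTS)]


# Hypothetical gating order; the melody heard would follow this sequence.
order = [2, 5, 0, 7, 3, 1, 6, 4]
print(amplitudes(0, order))     # component 2 is off during the first gap
print(amplitudes(200, order))   # all components sounding between gaps
```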

1.1.3. Familiarity. Sounds with familiar spectral shapes, such as human voices and musical instrument tones, appear to fuse more readily than sounds with unfamiliar spectral shapes. Informal observations show that the percept of a particular vowel is lost when its spectral envelope is shifted slightly in frequency, even though the relative amplitudes are preserved. Other factors such as the relative growth and decay of individual partials also appear to contribute to familiarity. Unfortunately no quantitative data on the issue are available at present.

1.2. Grouping of Sound Sequences in Space

A useful technique for studying grouping phenomena in hearing is to present two different pitch sequences in parallel, one to the left of the listener, and the other to the right. In most experiments, stimuli have been presented dichotically via headphones; however, in some experiments stimuli have been presented via spatially separated loudspeakers. This technique enables different stimulus dimensions to be set in opposition to each other as bases for grouping. Thus, for example, grouping by frequency or by amplitude may be opposed to grouping by spatial location. At the same time, different principles governing grouping along any given dimension may be set in opposition to each other. For example, grouping by proximity may be opposed to grouping by good continuation. This section describes findings obtained with this technique and discusses their theoretical implications.

1.2.1. Auditory Illusory Conjunctions. When two sequences of tones emanate simultaneously from different regions of space, and the onsets and offsets of these tones are synchronous, striking perceptual illusions are generally produced. We may characterize a tonal stimulus as a bundle of attribute values, that is, as having a pitch, a location, a loudness, and a timbre. In the situation just outlined, these bundles of attribute values fragment and recombine, so that illusory conjunctions result. (See also Treisman, Chapter 35.) This anomalous recombination suggests that all auditory stimuli are at some stage in the processing system fragmented into their separate attributes and that this process of fragmentation is followed by a process of perceptual synthesis in which the different attribute values are recombined. Under most circumstances the stimuli are reconstructed correctly; however, we should not assume that this necessarily occurs.

Striking individual differences are manifest in the types of illusion that are produced in this situation. Further, these differences correlate strongly with handedness and may be related to patterns of cerebral dominance. This implies that they have an innate basis.

1.2.2. The Scale Illusion. One example of the creation of strong illusory conjunctions is provided in the scale illusion (D. Deutsch, 1975c, 1975e). The configuration that produced the illusion is illustrated in Figure 32.4(a). It can be seen that this consisted of a major scale (see Note 5), which was presented simultaneously in both ascending and descending form. When a tone from the ascending scale was delivered to one ear, a tone from the descending scale was simultaneously delivered to the other ear, and successive tones in each scale alternated from ear to ear. This pattern was repeatedly presented ten times without pause. All tones were sine waves of equal amplitude and 250 msec in duration.

When presented with this configuration, no subject perceived the sequence of tones that was delivered to one ear or to the other, and none perceived a full ascending or descending scale. Instead, the successive tones were always grouped together on the basis of frequency range. All subjects perceived a sequence of four tones that repeatedly descended and then ascended. Beyond this, percepts were divisible into two categories. Most subjects also perceived a second stream of lower tones that repeatedly ascended and then descended. The second stream moved in contrary motion to the first [Figure 32.4(b)]. This percept therefore included all the pitches in the configuration; however, these were separated into two streams on the basis of frequency range.
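The contrast between what is delivered to each ear and what is most commonly heard can be illustrated with a short sketch. The specific pitches (a C major scale from C4 to C5, written as MIDI note numbers) and the choice of which ear receives the first tone of the ascending scale are assumptions made for the example. The regrouping step simply sorts each simultaneous pair by pitch height; it describes the most common percept and is not offered as a model of the mechanism that produces it.

```python
# C major scale, C4 to C5, as MIDI note numbers (an assumed choice of scale).
ASCENDING = [60, 62, 64, 65, 67, 69, 71, 72]
DESCENDING = list(reversed(ASCENDING))


def scale_illusion_channels(ascending, descending):
    """Distribute the two scales so that successive tones of each alternate between ears."""
    left, right = [], []
    for i, (up, down) in enumerate(zip(ascending, descending)):
        if i % 2 == 0:
            left.append(up)      # ascending scale starts in the left ear (assumed)
            right.append(down)
        else:
            left.append(down)
            right.append(up)
    return left, right


def group_by_frequency_range(left, right):
    """Regroup each simultaneous pair into a higher and a lower stream."""
    higher = [max(a, b) for a, b in zip(left, right)]
    lower = [min(a, b) for a, b in zip(left, right)]
    return higher, lower


left, right = scale_illusion_channels(ASCENDING, DESCENDING)
higher, lower = group_by_frequency_range(left, right)

print("left ear :", left)     # jagged sequence of large pitch leaps
print("right ear:", right)
print("higher   :", higher)   # four tones descending, then ascending
print("lower    :", lower)    # four tones ascending, then descending
```

Neither ear receives a smooth melodic line, yet the two regrouped streams move stepwise in contrary motion, as in the percept described above.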


Table 32.1. Numbers of Right-Handers and Left-Handers Perceiving Both the Higher and the Lower Pitch Sequences in the Scale Illusion ("Both"), and Those Perceiving Only the Higher Pitches ("Single")
Streams
Handedness Both Single

The right-handers tended significantly to hear both streams; however, the left-handers did not show such a tendency. (From D. Deutsch, Two-channel listening to musical scales, Journal of the Acoustical Society of America, 1975, 57. Reprinted with permission.)

A minority of subjects perceived instead only one stream of four tones that repeatedly descended and then ascended. This corresponded to the higher sequence of tones; little or nothing of the lower sequence was perceived.

Table 32.1 shows the numbers of right-handed and left-handed subjects who obtained these two categories of percept. As can be seen, the two handedness groups differed significantly on this measure. Further, in considering those subjects who perceived both streams, significant differences between the two handedness groups also emerged. Most right-handers obtained an illusion whereby the higher tones all appeared localized in one ear and the lower tones in the other ear. As shown in Table 32.2, there was a highly significant tendency to perceive the higher tones in the right ear and the lower tones in the left ear, and also to maintain a given localization pattern when the earphone positions were reversed. The remaining right-handers obtained a variety of idiosyncratic localization percepts, as did those who perceived only one stream. Most left-handed subjects who perceived both streams also localized all the higher tones in one ear and all the lower tones in the other ear. However, as shown in Table 32.2, these subjects did not display the same localization tendencies as did the right-handers. The remaining left-handed subjects reported a variety of idiosyncratic localization percepts.

Table 32.2. Localization Patterns in the Scale Illusion, Displayed for Those Subjects who Perceived All the Higher Tones in One Ear and All the Lower Tones in the Other Ear

Figure 32.4. (a) Stimulus configuration that produced the scale illusion. This consisted of a major scale, presented simultaneously in both ascending and descending form. When a tone from the ascending scale was delivered to one ear, a tone from the descending scale was simultaneously delivered to the other ear, and successive tones in each scale alternated from ear to ear. All tones were of equal amplitude and 250 msec in duration. There were no pauses between tones. (b) Percept most commonly obtained. This consisted of two melodic lines, a higher one and a lower one, that moved in contrary motion. The higher tones all appeared to be emanating from one earphone, and the lower tones from the other earphone. (From D. Deutsch, Two-channel listening to musical scales, Journal of the Acoustical Society of America, 1975, 57. Reprinted with permission.)

To summarize these findings, in considering what attribute was used as a basis for grouping, organization by spatial location never occurred; rather organization was always on the basis of frequency (see also Kubovy, 1981). Second, in considering which principle was used, organization was always on the basis of frequency proximity. Either listeners heard two melodic lines, one corresponding to the higher tones and the other to the lower tones, or they heard the higher tones alone. Third, there were substantial individual differences in the way that this configuration was perceived, both in terms of what was perceived and in terms of where the sounds appeared to be coming from. These individual differences correlated strongly with handedness.

Auditory illusory conjunctions have been shown to occur under broader circumstances also. Butler (1979) presented the scale illusion pattern either through headphones, or through spatially separated loudspeakers in a free sound field environment. The subjects, who were musically trained, notated separately the sequence coming from the speaker on their right and the sequence coming from the speaker on their left. Almost all notations reflected grouping by frequency range. Thus a higher melodic line appeared to be coming from one speaker, and a lower melodic line appeared to be coming from the other. Even when the tones were generated on a piano, and when in addition differences in timbre and loudness were introduced between the tones coming from the two speakers, essentially the same results were obtained. Similar results were also obtained when different contrapuntal patterns (see Note 6) were employed. Further, when differences in timbre were introduced, no subject was able to identify these differences correctly. Instead, all subjects perceived a change in tone quality that appeared to be emanating from both headphones or both loudspeakers. Butler concluded that under these conditions channeling by pitch range is so strong as to persist through a wide range of timbral changes, changes in envelope characteristics of the tones, imperfectly timed attacks, inconsistencies in duration, and varying loudness.

This powerful illusion appears as a good example of unconscious inference in perception. Our auditory environment is very complex, and the assignment of sounds to their sources is rendered difficult by the presence of echoes and reverberation. So when a sound mixture is presented such that both ears are stimulated simultaneously, we cannot judge from first-order localization cues (see Note 7) alone which components of the total spectrum should be assigned to which source. We therefore need to utilize other cues in making such judgments. One such cue is similarity of frequency spectrum. Similar sounds are likely to be coming from the same source, and different sounds from different sources. It is therefore reasonable for the listener to conclude that tones in one frequency range are coming from one source, and that tones from a different frequency range are coming from another source. The tones are therefore perceptually reorganized in space in accordance with this interpretation (D. Deutsch, 1975c).

1.2.3. Grouping of Nonsimultaneous Sound Sequences. If the above line of reasoning is correct, we should expect that perceptual grouping of parallel pitch sequences would be strongly influenced by the salience of the first-order localization cues. If, in contrast to the conditions just described, such cues were strong and unambiguous, channeling by spatial location would be expected to take precedence over channeling by frequency range. One can produce such a situation by employing sequences in which the tones at the two ears are clearly separated in time.

To test this hypothesis, perceptual grouping was examined as a function of the temporal relationships between the signals arriving at the two ears (D. Deutsch, 1979a). Subjects were asked to identify rapid melodic patterns whose component tones switched from ear to ear. In one set of conditions, input was to one ear at a time; in another set, input was to both ears simultaneously. It was predicted that when input was to one ear at a time, identification of the melody would be difficult, reflecting perceptual grouping by spatial location. However, when both ears received input simultaneously, identification of the melody would be much easier.

Subjects were presented with sequences of pure tones. Each sequence consisted of ten repetitions of a basic eight-tone melody. All tones were of equal amplitude and 30 msec in duration, with tones within a melody separated by 100-msec pauses. Two such melodies were employed, and the subjects identified on each trial which of these had been presented.

The experiment employed four conditions, which are illustrated in Figure 32.5. In Condition A, all tones of the melody were presented simultaneously to both ears. In Condition B, the component tones of the melody were distributed in random fashion between the ears. Condition C was identical to Condition B except that the melody was accompanied by a drone. Whenever a tone from the melody was presented to the right ear, the drone was simultaneously presented to the left ear, and vice versa. Condition D was identical to Condition C except that the drone was always presented to the same ear as the tone from the melody.
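
The four ear-assignment schemes can be made concrete with a short sketch. The Python below is only an illustration: the function names, the uniform random choice of ear for each tone, and the fixed seed are assumptions of this sketch, not details of the original procedure. It records, for each melody tone, which ears receive the melody and which receive the drone, and then checks whether both ears are stimulated at every tone.

```python
import random


def ear_schedule(condition, n_tones=8, seed=0):
    """Return one (melody_ears, drone_ears) pair of sets per melody tone.

    Ears are named 'L' and 'R'. The random ear choice is an assumption
    of this sketch; the text says only that tones were distributed in
    random fashion between the ears.
    """
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_tones):
        if condition == "A":
            # Melody presented to both ears, no drone.
            schedule.append(({"L", "R"}, set()))
            continue
        melody_ear = rng.choice("LR")
        other_ear = "R" if melody_ear == "L" else "L"
        if condition == "B":
            drone_ears = set()            # no drone
        elif condition == "C":
            drone_ears = {other_ear}      # contralateral drone
        elif condition == "D":
            drone_ears = {melody_ear}     # ipsilateral drone
        else:
            raise ValueError("condition must be 'A', 'B', 'C', or 'D'")
        schedule.append(({melody_ear}, drone_ears))
    return schedule


def both_ears_stimulated(schedule):
    """True if every tone position delivers some input to both ears."""
    return all((melody | drone) == {"L", "R"} for melody, drone in schedule)


for condition in "ABCD":
    print(condition, both_ears_stimulated(ear_schedule(condition)))
```

Under these assumptions, Conditions A and C stimulate both ears at every tone, whereas Conditions B and D deliver input to one ear at a time.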

The percentages of correct identifications of the melodies in the different conditions of the experiment are shown in Figure 32.5. It can be seen that excellent performance was obtained in Condition A, in which the melodies were presented binaurally. In contrast, performance in Condition B, in which the tones from the melodies were distributed between the ears, was very poor. The procedure of switching the tones from ear to ear thus produced a considerable decrement in identification performance. However, in Condition C, in which a contralateral drone was presented so that input was to both ears simultaneously, the performance level was again very high. This finding cannot be attributed to processing the harmonic relationships between the drone and the melody because in Condition D, in which the drone was presented to the same ear as the melody component, performance was below chance. In this last condition, input was no longer to the two ears simultaneously.

This experiment demonstrates that temporal relationships between tones emanating from different spatial locations are important factors in determining how the tones are perceptually grouped. When signals are emanating from two locations simultaneously, as in Conditions A and C, it is easy to integrate the information arriving at the two ears into a single perceptual stream. However, when the signals coming from the two locations are clearly separated in time, as in Conditions B and D, grouping by spatial location is so powerful as to prevent the listener from combining the tones to produce an integrated percept.

We may next enquire what happens in the intermediate case, where inputs to the two ears overlap but are not strictly synchronous. This condition brings us closer to normal listening, and also to the case where streams of speech are presented in parallel to both ears. A second experiment investigated the effects of onset-offset asynchrony between the components of the melody and the contralateral drone. In the asynchronous conditions, all tones were again 30 msec in duration, and the drone either led or lagged the melody components by 15 msec or the right ear tones led or lagged the left ear tones by 15 msec. Performance levels in these conditions were significantly lower than when the melody components and the drone were strictly synchronous, and they were also significantly higher than when the melody components switched between ears without an accompanying drone. This is as expected on the present line of reasoning.

Figure 32.5. Percentage of errors in identification of melodic patterns when the component tones of the patterns switched between ears. On each trial, ten repetitions of a basic eight-tone pattern were presented. All tones were 30 msec in duration, and tones within a pattern were separated by 100-msec pauses. Two such melodies were employed, and subjects identified on each trial which of these had been presented. In Condition A (melody presented binaurally) excellent performance was obtained. In Condition B (melody distributed between ears) performance was very poor. In Condition C (contralateral drone accompanying melody) performance levels were again high. In Condition D (ipsilateral drone accompanying melody) performance was below chance. (From D. Deutsch, Binaural integration of melodic patterns, Perception and Psychophysics, 1979, 25. Reprinted with permission.)

A similar experiment was performed by Judd (1979). Two repeating stimulus patterns were constructed from four square-wave tones, each 100 msec in duration. The two patterns were as shown in Figure 32.6. It can be seen that, taking each channel separately and treating the patterns as cyclically repeating, the tones in the two patterns were identically ordered. However, when the channels were combined, two different melodic patterns emerged instead. Subjects were presented with pairs of these patterns and were required to judge whether the members of each pair were the same or different. On half of the trials, the silent gaps between the tones were replaced by noise. It was found that performance was better in the noise-filler condition than in the silent-gap condition. Judd interpreted this finding as due to the noise degrading the localization information, which encouraged grouping of successive tones on the basis of frequency range rather than spatial location.

Schubert and Parker (1956) performed an experiment that may be interpreted similarly. These authors measured the amount of interference in speech perception that was produced by switching the signal from ear to ear. They found that adding noise to the contralateral ear reduced this interference effect (Figure 32.7). It may plausibly be argued that the ongoing speech-noise signal was interpreted by the listener in terms of two sources, one emitting noise and the other emitting speech, whereas the ongoing speech-silence signal was interpreted by the listener in terms of two independent speech sources.

1.2.4. The Hypothesis of a Slow Switching Mechanism. The problem of degradation of processing when information is switched from ear to ear has been addressed in other contexts. For instance, Cherry and Taylor (1954) studied the intelligibility of speech that switched back and forth from ear to ear. They found that intelligibility dropped substantially at alternation rates of around 3 Hz and interpreted these findings in terms of a limitation in the rate at which we are able to switch attention between ears. However, Huggins (1964) found that the maximum dip in intelligibility shifted in parallel with a shift in the rate of the presented speech. He argued from this result that the performance decrement was due to interference in processing the basic units of speech, and not to a limitation in attention switching time.

A related paradigm involves recall of lists of digits that are dichotically presented. When two such dichotic lists were delivered at fast rates, recall was found to be better by ear than by temporal order, the latter task requiring switching between ears (Broadbent, 1954, 1958).

Figure 32.6. Stimulus configurations employed to investigate the effect of contralateral noise on the ability to discriminate melodic patterns whose component tones alternated between ears. Tones were 100 msec in duration, with fundamental frequencies of (1) 912 Hz, (2) 1024 Hz, (3) 1150 Hz, and (4) 1290 Hz. Discrimination performance was enhanced when the gaps between the tones were replaced by noise. (From T. Judd, Comments on Deutsch's musical scale illusion, Perception and Psychophysics, 1979, 26. Reprinted with permission.)

Further, subjects showed poorer recall of successive lists of digits when these were presented alternately to the two ears than when they were presented binaurally (A. Treisman, 1971). This finding cannot be ascribed to perceptual interference with the basic units of speech, since there was no disruption of the verbal items in these experiments. Some difficulty in the ability to switch attention between the ears was therefore hypothesized.

In contrast to the above arguments for a switching limitation, powerful general arguments may be made against the idea that information from the two ears cannot be dealt with in rapid succession.

Figure 32.7. Percentages of words correctly repeated as a function of rate at which the speech signal was switched from ear to ear. The lower curve shows the results for trials with silence in the contralateral ear. The upper (dotted) curve shows the results for trials in which noise was delivered to the contralateral ear. The contralateral noise resulted in enhanced speech intelligibility, especially at switching rates of around 4 Hz, where intelligibility was otherwise substantially reduced. (From E. D. Schubert & C. D. Parker, Addition to Cherry's findings on switching speech between two ears, Journal of the Acoustical Society of America, 1956, 27. Reprinted with permission.)

In everyday listening, the information arriving at the two ears is never identical, and the running cross correlations performed on this information are very important for several functions. One such function is localization, and another is the suppression of echoes and reverberation (Haas, 1951; Tobias & Schubert, 1959; Wallach, Newman, & Rosenzweig, 1949). The auditory elements that are compared for such functions may be separated by only a few microseconds. Such an ability to utilize information entering the two ears in rapid succession is not consistent with the notion of a slow switching mechanism.
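
The idea of a running cross correlation can be illustrated with a toy calculation. The Python sketch below is not a model of the auditory system: the sampling rate, the noise signal, the 5-microsecond delay, and the function names are all assumptions chosen for illustration. It delays one "ear" signal relative to the other and recovers the interaural delay as the lag at which the cross correlation is largest.

```python
import random

SAMPLE_RATE = 1_000_000  # samples per second (assumed for illustration)


def delayed_copy(signal, delay):
    """Return the signal delayed by `delay` samples, zero-padded at the start."""
    return [0.0] * delay + signal[:len(signal) - delay]


def best_lag(left, right, max_lag):
    """Return the lag in samples at which the right signal best matches the left.

    A positive lag means the right signal trails the left signal.
    """
    def correlation(lag):
        total = 0.0
        for n, value in enumerate(left):
            m = n + lag
            if 0 <= m < len(right):
                total += value * right[m]
        return total

    return max(range(-max_lag, max_lag + 1), key=correlation)


rng = random.Random(1)                      # fixed seed for repeatability
left = [rng.gauss(0.0, 1.0) for _ in range(2000)]
right = delayed_copy(left, 5)               # right ear trails by 5 samples
lag = best_lag(left, right, max_lag=20)
lag_microseconds = lag * 1_000_000 / SAMPLE_RATE
print(lag, lag_microseconds)
```

A broadband noise signal is used here because its cross correlation has a single clear peak; a pure tone would give peaks at every multiple of its period.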

Two conflicting sets of phenomena have therefore been reported, one arguing for a decrement in processing information where rapid switching between ears is involved, and the other arguing against such a decrement. We may resolve this conflict on the following line of reasoning. An important function of our auditory system is to separate out the signals emanating from different sources. If such perceptual separations were not accomplished we would not know which elements of the acoustic spectrum to link together so as to form higher-order abstractions. It is necessary, therefore, that there exist mechanisms that inhibit the formation of higher-order linkages between acoustic elements that are likely to be emanating from different sources. Since our acoustic environment is very complex, such mechanisms must be flexible and employ multiple criteria. Thus certain configurations involving input to the two ears would be interpreted as coming from the same source, so that integration of this information should be easy. Yet other configurations would best be interpreted as emanating from different sources, so that integration should be difficult. According to this hypothesis, when a decrement in integrating information arriving at the ears occurs, this is due not to capacity limitation, but rather to a mechanism that we have evolved to prevent confusion in monitoring our auditory environment (see Bregman, 1978, 1981, for an analogous argument based on findings involving various monaural tasks).

1.2.5. The Octave Illusion. In the experiments described in Section 1.2.2, when tones were presented to both ears simultaneously with synchronous onsets and offsets, sequential grouping by frequency proximity was the rule. Grouping by ear of input occurred only when there were temporal separations between the stimuli presented to the two ears. We now turn to an examination of certain situations in which grouping by ear of input occurs even though such input is strictly simultaneous. It will be seen that this happens only under special conditions of frequency relationship between the tones presented in sequence at the two ears.

One such situation is illustrated in Figure 32.8(a). This shows the stimulus pattern that gives rise to the octave illusion (D. Deutsch, 1974, 1975c). It can be seen that two tones that were spaced an octave apart (400 and 800 Hz) were repeatedly presented in alternation. The identical sequence was delivered to the two ears simultaneously; however, when the right ear received the high tone the left ear received the low tone and vice versa. So in fact the listener was presented with a single, continuous, two-tone chord, but the ear of input for each component switched repeatedly.
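
The stimulus just described is simple enough to write down explicitly. The following Python sketch is one possible rendering, offered under stated assumptions: the function names, the 8000-Hz sampling rate, and the choice of the right ear as the one that starts on the high tone are not taken from the original study. The 250-msec tone duration and 20-sec total duration follow the caption of Figure 32.8.

```python
import math


def chord_schedule(n_tones, low_hz=400.0, high_hz=800.0):
    """Return a list of (left_hz, right_hz) pairs, one per tone position.

    The right ear starts on the high tone (an assumption of this sketch),
    and the two ears exchange frequencies at every step.
    """
    schedule = []
    for i in range(n_tones):
        if i % 2 == 0:
            schedule.append((low_hz, high_hz))
        else:
            schedule.append((high_hz, low_hz))
    return schedule


def synthesize(schedule, tone_sec=0.25, sample_rate=8000):
    """Render the schedule as two lists of sine-wave samples (left, right).

    Tones follow one another without gaps. With 250-msec tones at 400 and
    800 Hz, each tone spans a whole number of cycles.
    """
    samples_per_tone = int(round(tone_sec * sample_rate))
    left, right = [], []
    for left_hz, right_hz in schedule:
        for n in range(samples_per_tone):
            t = n / sample_rate
            left.append(math.sin(2 * math.pi * left_hz * t))
            right.append(math.sin(2 * math.pi * right_hz * t))
    return left, right


# 20 sec of 250-msec tones gives 80 dichotic chords.
schedule = chord_schedule(80)
# Every chord contains both frequencies; only the ear of input changes.
print(all({l, r} == {400.0, 800.0} for l, r in schedule))
```

Calling `synthesize(schedule)` would yield the two channels as plain lists of samples, which could then be written to a stereo file with any audio library.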

This configuration produced a number of illusory percepts, the most common of which is illustrated in Figure 32.8(b). It can be seen that this consisted of a single tone that alternated from ear to ear, and whose pitch simultaneously alternated from one octave to another in synchrony with the localization shift.

Figure 32.8. (a) Stimulus pattern giving rise to the octave illusion. Musical notation is approximate. The lower tones were at 400 Hz and the higher tones were at 800 Hz. All tones were of equal amplitude and 250 msec in duration. There were no gaps between tones. The sequence was continuously presented for 20 sec. (b) Percept most commonly obtained. This consisted of a single tone that alternated from ear to ear and whose pitch simultaneously alternated from one octave to the other in synchrony with the localization shift. (From D. Deutsch, An auditory illusion, Nature, 251. Copyright 1974 by Macmillan Journals Ltd. Reprinted with permission.)

When the earphones were placed in reverse position, most listeners found that the apparent locations of the high and low tones remained fixed. Thus it seemed to these listeners that the earphone that had been producing the high tones was now producing the low tones, and that the earphone that had been producing the low tones was now producing the high tones.

If we assume that there are two separate brain mechanisms, one for determining what pitch we hear and the other for determining where the sound is located, we are in a position to advance an explanation for this illusion. The model is diagrammed in Figure 32.9. To determine the perceived pitch, the information arriving at one ear is followed, and the information arriving at the other ear is suppressed. However, each tone is localized in the ear receiving the higher-frequency signal, regardless of which frequency is in fact perceived (D. Deutsch, 1975c). The combined output of these two mechanisms, for the case of the listener whose pitch percept corresponds to the frequencies presented to the right ear, should result in the percept of a high tone to the right alternating with a low tone to the left. For the case of the listener whose pitch percept corresponds to the frequencies presented to the left ear instead, the resultant percept should be that of a high tone to the left alternating with a low tone to the right.

This model received confirmation in a further experiment (D. Deutsch & Roll, 1976). Subjects were presented with the basic pattern shown in Figure 32.10(a). This again employed tones standing in octave relation. It can be seen that one ear received three high tones followed by two low tones, while simultaneously the other ear received three low tones followed by two high tones. This basic pattern was repeatedly presented ten times without pause.

As expected from the model, most subjects perceived a pattern of pitches that corresponded to the frequencies presented either to the right ear or to the left ear. In other words, they heard a repeating sequence consisting either of three high tones followed by two low tones, or of three low tones followed by two high tones. However, each tone was localized in the ear that received the higher frequency. This is illustrated in Figure 32.10(b). When Channel A was presented to the right ear and Channel B to the left, the listener heard a repeating sequence of three high tones to the right followed by two low tones to the left. When, however, Channel A was presented to the left ear and Channel B to the right, the listener now heard a repeating sequence of two high tones to the right followed by three low tones to the left.
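
The two-mechanism account can be stated as a small rule and checked against the percepts reported above. The Python sketch below is an illustrative reading of the model, not the authors' own formulation: the function name, the "L"/"R" labels, and the list representation of the stimulus are assumptions. Pitch is taken from the dominant ear, and location is assigned to whichever ear receives the higher frequency. The case of equal frequencies at the two ears does not arise in these stimuli and is not handled specially.

```python
def predicted_percept(left_hz, right_hz, dominant_ear="R"):
    """Predict a (pitch_hz, location) pair for each dichotic chord.

    Pitch mechanism: follow the frequency at the dominant ear.
    Localization mechanism: place the tone at the ear that receives
    the higher frequency, whichever pitch is perceived.
    """
    percept = []
    for left, right in zip(left_hz, right_hz):
        pitch = right if dominant_ear == "R" else left
        location = "R" if right > left else "L"
        percept.append((pitch, location))
    return percept


HIGH, LOW = 800, 400

# Octave illusion, right ear followed for pitch:
# a high tone on the right alternates with a low tone on the left.
print(predicted_percept([LOW, HIGH], [HIGH, LOW], dominant_ear="R"))

# Deutsch and Roll pattern: Channel A is three high tones then two low,
# Channel B is three low tones then two high.
channel_a = [HIGH] * 3 + [LOW] * 2
channel_b = [LOW] * 3 + [HIGH] * 2
# Channel A to the right ear: three high tones right, two low tones left.
print(predicted_percept(channel_b, channel_a, dominant_ear="R"))
# Channel A to the left ear: three low tones left, two high tones right.
print(predicted_percept(channel_a, channel_b, dominant_ear="R"))
```

Setting `dominant_ear="L"` for the octave illusion pattern gives the mirror-image prediction of a high tone on the left alternating with a low tone on the right.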

Most subjects in the D. Deutsch (1974) experiment perceived a single high tone in one ear alternating with a single low tone in the other ear.

Figure 32.9. Diagram showing how the outputs of the pitch and localization mechanisms combine to produce the octave illusion. Filled boxes indicate high tones (800 Hz) and unfilled boxes indicate low tones (400 Hz). The pitch mechanism follows the sequence of frequencies presented to one (dominant) ear rather than to the other. However, the localization mechanism follows the higher-frequency signal, regardless of whether the higher or the lower frequency is perceived. The outputs of these two mechanisms combine to produce the percept of a high tone in one ear alternating with a low tone in the other ear. (From D. Deutsch, The octave illusion and auditory perceptual integration, in J. V. Tobias & E. D. Schubert (Eds.), Hearing research

Figure 32.10. Stimulus patterns and percepts in experiment to test hypothesized basis for the octave illusion. Filled boxes represent tones of 800 Hz and unfilled boxes represent tones of 400 Hz. The basic patterns shown were presented ten times without pause. In accordance with the hypothesis, most subjects reported the pattern of pitches that was presented to the right ear; yet all subjects localized each tone to the ear receiving the higher-frequency signal. (From D. Deutsch & P. L. Roll, Separate 'what' and 'where' decision mechanisms in processing a dichotic tonal sequence, Journal of Experimental Psychology: Human Perception and Performance, 2. Copyright 1976 by American Psychological Association. Reprinted with permission.)

However, some subjects instead perceived a single tone that alternated from ear to ear, whose pitch either did not change or changed only slightly with a shift in its apparent location. Other subjects heard more complex patterns, such as two low tones that alternated from ear to ear with an intermittent high tone in one ear. Such patterns were usually unstable, exhibiting frequent changes with continued listening.

The individual differences in perception of this illusion were found to correlate with handedness. As shown in Table 32.3, the proportion of subjects reporting complex percepts was substantially higher in the left-handed than in the right-handed population (see also Craig, 1979). A second handedness correlate concerned the localization patterns for the high and low tones. As shown in Table 32.4, most right-handers heard the high tone on the right and the low tone on the left, regardless of the positions of the earphones (see also Geffen & Reynolds, 1982; McClurkin & Hall, 1981). In contrast, the left-handers did not show a significant tendency to localize the high and low tones

Table 32.3.

Percentages of right-handers and left-handers are displayed. "Octave" indicates the percept of a single tone that alternates from ear to ear, whose pitch simultaneously alternates from one octave to the other. "Single Pitch" indicates the percept of a single tone that alternates from ear to ear, whose pitch either does not change or shifts slightly with a change in localization. "Complex" comprises a number of different complex percepts. The proportion of subjects obtaining complex percepts was considerably higher among left-handers than among right-handers. (From D. Deutsch, An auditory illusion, Nature, 251. Copyright 1974 by Macmillan Journals Ltd. Reprinted with permission.)

Table 32.4.

Each subject was given two presentations of the sequence, for 20 sec each time, with earphones placed first one way and then the other. The numbers of right-handers and left-handers obtaining a given localization pattern are displayed. RR: High tone localized in the right ear and low tone in the left on both presentations. LL: High tone localized in the left ear and low tone in the right on both presentations. Both: High tone localized in the right ear and low tone in the left on one presentation; and high tone localized in the left ear and low tone in the right on the other. Right-handers tended strongly to hear the high tone in the right and the low tone in the left; however, left-handers did not display this tendency either way, and showed a greater tendency to change their localization patterns.

Given the strong correlates with handedness in perception of the octave illusion, it is interesting to consider the neurological differences on which such correlates might be based. The overwhelming majority of right-handers are left-hemisphere dominant, but this is true of only about two-thirds of left-handers. Further, the majority of right-handers have a clear dominance of the left hemisphere; however, a substantial proportion of left-handers have some bilateral representation (Goodglass & Quadfasel, 1954; Hécaen & de Ajuriaguerra, 1964; Hécaen & Piercy, 1956; Milner, Branch, & Rasmussen, 1966; Subirana, 1969; Zangwill, 1960). It appears reasonable to assume that these patterns of dominance are reflected in percepts of the octave illusion in two ways. First, the localization of the high tone on the right and the low tone on the left reflects left-hemisphere dominance, with the localization of the high tone on the left and the low tone on the right reflecting right-hemisphere dominance. Second, unambiguous localization patterns reflect clear dominance, with complex percepts reflecting more cerebral equipotentiality.

Localization patterns have been shown to correlate not only with handedness, but also with familial handedness background. In a study by D. Deutsch (1983b), subjects with left- or mixed-handed parents or siblings were found less likely to localize the high tone on the right and the low tone on the left than were subjects without left- or mixed-handed parents or siblings. This was found true for right-handed, mixed-handed, and left-handed populations.

A further question of interest is whether the interactions underlying the localization and pitch effects in the octave illusion occur between pathways conveying information from the two ears, or whether instead pathways conveying information from different regions of auditory space are involved. To investigate this question, the stimuli were presented through spatially separated loudspeakers rather than earphones (D. Deutsch, 1974, 1975c). An analogous illusion was obtained under these conditions: The subjects perceived a high tone that appeared to be coming from one speaker, which alternated with a low tone that appeared to be coming from the other speaker. This effect was obtained even with the two speakers placed side by side, facing the listener, which shows that highly specific regions of auditory space were involved here.

We shall now consider only what sequence of pitches is perceived in the octave illusion and leave aside the issue of where the tones appear to be located. In the octave illusion, channeling of pitch sequences was always on the basis of spatial location. However, in the scale illusion, channeling was always on the basis of frequency proximity instead. Yet the stimuli producing these two illusions were in several ways very similar. In both cases, repeating sequences of sine-wave tones at equal amplitudes and durations were presented, with synchronous onsets and offsets. Also in both cases, the frequencies presented to one ear always differed from the frequencies simultaneously presented to the other ear. Nevertheless, radically different channeling strategies arose in response to these two stimulus patterns. It is particularly noteworthy that, when two tones standing in octave relation were simultaneously presented in the scale illusion, both these tones were generally perceived. But when two tones standing in octave relation were simultaneously presented in the octave illusion, only one of these tones was generally perceived. Such differences in channeling strategy must therefore arise from differences in the patterns of frequency relationship between successive tones.

Another characteristic of the stimulus producing the octave illusion was that the frequency emanating from one side of space was always the same as the frequency that had just emanated from the opposite side. It therefore seemed plausible to hypothesize that this sequential relationship was responsible for producing channeling by spatial location. A further set of experiments was performed to test this hypothesis (D. Deutsch, 1980a, 1981).

In the first experiment, listeners were presented with sequences consisting of 20 dichotic chords. Two conditions were compared, using the basic patterns illustrated in Figure 32.11(a).

Figure 32.11. (a) Configurations used in first experiment examining effects of sequential interactions on ear dominance. Each sequence consisted of 20 dichotic chords. In Condition 1, the two ears received the same frequencies in succession; however, this was not true in Condition 2. (b) Percentage of following of nondominant ear in these two conditions, as a function of amplitude differences between the tones at two ears. In Condition 1, the dominant ear was followed until a critical level of amplitude relationship was reached, and the nondominant ear was followed beyond this level. However, there was no following on the basis of ear of input in Condition 2. (From D. Deutsch, Ear dominance and sequential interactions, Journal of the Acoustical Society of America, 1980, 67. Reprinted with permission.)

The pattern in Condition 1 consisted of the repetitive presentation of a single chord. The tones comprising this chord stood in octave relation and alternated from ear to ear in such a way that when the high tone was in the right ear the low tone was in the left ear and vice versa. Here the two ears received the same frequencies in succession. The sequence presented to the right ear began with the high tone and ended with the low tone on half of the trials, while this order was reversed on the other half. The subjects were asked to judge whether the sequence began with the high tone and ended with the low tone or whether it began with the low tone and ended with the high tone. It was thus possible to infer which ear was being followed for pitch.

In Condition 2, the basic pattern consisted of the repetitive presentation of two dichotic chords in alternation. The tones comprising the first chord formed an octave and the second a minor third; thus the entire four-tone combination constituted a major triad. Note that here the two ears did not receive the same frequencies in succession. The right ear received the higher tone of the first chord and the lower tone of the last chord on half of the trials. The order was reversed on the other half of the trials.
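
The sequential property that distinguishes the two conditions can be checked mechanically. In the Python sketch below, all frequencies are hypothetical: the text does not give the values used in this experiment, so 400 and 800 Hz are borrowed from the octave illusion for Condition 1, and a just-intonation major triad in the ratio 4:5:6:8 (400, 500, 600, and 800 Hz) is assumed for Condition 2. The function names are likewise illustrative.

```python
def alternate(first, second, n_chords=20):
    """Repeat two (left_hz, right_hz) chords in alternation for n_chords chords."""
    return [first if i % 2 == 0 else second for i in range(n_chords)]


def ears_exchange_frequencies(sequence):
    """True if each ear always receives the frequency just sent to the other ear."""
    return all(
        left == prev_right and right == prev_left
        for (prev_left, prev_right), (left, right) in zip(sequence, sequence[1:])
    )


# Condition 1: one octave chord whose tones swap ears on every presentation.
condition_1 = alternate((400, 800), (800, 400))

# Condition 2: an octave chord (400 + 800) alternating with a minor third
# (500 + 600). The right ear takes the higher tone of the first chord and
# the lower tone of the second.
condition_2 = alternate((400, 800), (600, 500))

print(ears_exchange_frequencies(condition_1))
print(ears_exchange_frequencies(condition_2))
```

With these assumed values, the two ears exchange frequencies throughout Condition 1, while in Condition 2 no frequency presented to one ear is ever presented to the other.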