Using brief glimpses to decompose mixtures

A. S. Bregman

 

The processes of perceiving speech and music differ in many ways. However, there is a basic process common to the two. I have called it primitive auditory scene analysis (Bregman, 1990). The name scene analysis refers to the problem that any sense modality must face when it encounters more than one external object or event at the same time. It must decide which bits of sensory evidence have arisen from the same environmental event or object. There seem to be two approaches to solving this problem: primitive scene analysis and schema-based scene analysis. The primitive approach works without specific knowledge of particular classes of signals, whereas the schema-based approach uses knowledge of specific classes of events and objects in the world. For example, in speech perception, schema-based segregation would use knowledge of the acoustic patterns of a particular language and of a particular speaker, and perhaps of the properties of the human vocal tract. In music perception, schema-based segregation would take advantage of the perceiver's knowledge of the type of music, the types of instruments, and even the particular piece being played, if the listener was familiar with it.

It is the primitive process that is common to speech and music. The auditory system seems to deal with mixtures by using a general strategy that works independently of the type of acoustic signal involved. First it breaks down the incoming signal into a large number of simultaneous and successive properties. Then it decides on how to group these local properties into separate sets that describe separate environmental events. It does so by examining the properties and looking for relations between them that tend to hold between different bits of sensory evidence that have been derived from the same environmental sound, regardless of the type of signal involved. For example, several simultaneously detected frequencies might be harmonically related to one another. The value of using any particular relation to decide on grouping is derived from a corresponding regularity in the environment. In the case of harmonicity, many sorts of vibration create frequency components related to a common fundamental. In a normal environment, if a large number of simultaneous frequency components are discovered to relate to a common fundamental, this relation is unlikely to have resulted from the chance co-occurrence of unrelated signals.
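The harmonicity relation can be made concrete with a small sketch. It is only an illustration, not a model of the auditory system: it assumes that the components have already been resolved into a list of frequencies, that a candidate fundamental is supplied, and that a component counts as harmonic when it lies within a hypothetical relative tolerance (3 percent here) of an integer multiple of that fundamental.

```python
def split_by_harmonicity(freqs_hz, f0_hz, tolerance=0.03):
    """Partition components into harmonics of f0_hz and leftovers.

    `tolerance` is an assumed relative mistuning limit, not a measured value.
    """
    harmonic, leftover = [], []
    for f in freqs_hz:
        n = round(f / f0_hz)  # nearest harmonic number
        if n >= 1 and abs(f - n * f0_hz) <= tolerance * f:
            harmonic.append(f)
        else:
            leftover.append(f)
    return harmonic, leftover


# Three components fit a 100 Hz fundamental; two are mistuned by 30 Hz.
grouped, rest = split_by_harmonicity([200, 300, 400, 530, 870], f0_hz=100)
print(grouped, rest)
```

The components that share the fundamental would be candidates for grouping into one sound, and the leftovers would be assigned elsewhere.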

The failure of appropriate segregation leads to a loss of identity of the component sounds. This should be described as fusion, but is sometimes mistakenly called masking. These two descriptions both imply a loss of separate perception, but they have slightly different meanings. Masking implies that the masked component is making no contribution at all to perception. Fusion, on the other hand, means that the individual component is not being heard as a separate auditory entity but is still contributing its properties to a more global experience. Most of the research literature does not distinguish between these two concepts and it is likely that many cases that are described as masking are actually examples of fusion. Masking is probably a complete wiping out of the sensory information but fusion is simply a case in which the information is not segregated appropriately. The added stimulus does not really mask; it camouflages.

When we study the segregation and fusion of signals in the laboratory we have two favorite paradigms: In one, we alternate the components so that there is no overlap in time and look at sequential integration. In the other, we make them occur at exactly the same time and ask how certain variables affect our ability to "hear out" the sounds. In both cases, we usually use tones of steady intensity. However, in the world outside the laboratory, sounds are not so regular. They occur neither in strict alternation nor in exact synchronization. Instead they typically overlap, but only partially. In addition, they rarely stay at a fixed intensity as they do in the laboratory, but fluctuate in loudness. Take, for example, the sounds made by two persons talking at the same time. It is likely that there will be certain moments at which one voice dominates the spectrum received by the listener. At a later moment the intensities might be equal, or reversed.

This irregularity in the input actually opens up possibilities that can be exploited by primitive scene analysis. For example, it can employ what I have called the "old-plus-new" heuristic (Bregman, 1990). This strategy says that if the auditory system encounters a change in the spectrum, it should try to interpret the change as an old sound continuing with a new sound added to it. It should search for the old spectral components or other features of the earlier sound, or for something closely resembling them, in the new spectrum. If it finds them, it should continue to group these as parts of a single sound, and to take the auditory components or features that are left over and group these as possibly coming from a second sound source. We can see how this heuristic is justified by the properties of normal acoustic environments. It is very rare that a new sound starts precisely at the instant at which its predecessor stops. Therefore we usually get a glimpse of one sound before it enters into a mixture with a second.
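The bookkeeping implied by the heuristic can be sketched as follows. This is a deliberately crude illustration under stated assumptions: spectra are reduced to lists of partial frequencies, "closely resembling" is reduced to a hypothetical frequency tolerance of 5 Hz, and amplitudes are ignored. The partials used are those of the harmonic demonstration described later in the text.

```python
def old_plus_new(old_partials_hz, mixture_partials_hz, tolerance_hz=5.0):
    """Split a new spectrum into a continuation of the old sound and a residual.

    `tolerance_hz` is an assumed matching criterion, not an empirical value.
    """
    continued, residual = [], []
    for f in mixture_partials_hz:
        if any(abs(f - g) <= tolerance_hz for g in old_partials_hz):
            continued.append(f)   # grouped with the earlier sound
        else:
            residual.append(f)    # left over: candidate second source
    return continued, residual


old = [600, 700, 800, 900]  # sound heard alone first
mixture = [200, 300, 400, 600, 700, 800, 900, 1100, 1200, 1300]
continued, residual = old_plus_new(old, mixture)
print(continued, residual)
```

Everything that matches the earlier glimpse is kept with the old sound, and the remainder is offered as a description of the added one.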

Turning our attention to the laboratory, we can find a number of phenomena, discovered there, that can be connected to one another by relating them to the old-plus-new heuristic. I shall run through a few of these, showing, in each case, how this can be done. The first is the capturing of a partial from a complex tone. As done by van Noorden or by Bregman and Pinker (see Bregman, 1990), it involves the alternation of a copy of the partial taken alone with the full complex spectrum. As a result, the listener experiences the pure-sounding partial twice on each cycle, once when the pure tone is presented alone and once when the complex tone occurs.

So far the research has only tried to capture out a partial from a complex tone. However, in normal life we are trying to capture an already complex sound out from a mixture that is even more complex. The mixture is camouflaging the target. In such cases, if the target sound comes on before the embedding sounds, this asynchrony may serve to make the earlier-onset sound audible throughout the duration of the added ones.

Here is an example of the "old-plus-new" heuristic making a tone audible as a separate entity when it would normally be fused with one that accompanies it. The target tone is formed of harmonics 6, 7, 8, and 9 of a 100 Hz fundamental. The camouflaging tone is harmonics 2, 3, 4, 11, 12, and 13 of the same fundamental. Each harmonic of the camouflaging tone has 0.6 the amplitude of the harmonics in the target tone. First we hear the camouflaging tone alone, pulsing at 1-second intervals. [Sound example 1] Then we hear the target and the camouflaging tone together, pulsing at 1-second intervals. [Sound example 2] The only difference is one of timbre. The two tones are fused in the last example. Next we hear an extreme form of asynchrony, with the target tone present all the time, joined repeatedly by the camouflaging tone. [Sound example 3] This time we are aware of a continuous tone that is present during the occurrences of the camouflaging tone. The glimpse at the target tone alone has made it audible as a part of the target-plus-camouflage mixture. We can show that the perception of the target during the camouflager is not an illusion by omitting it whenever the camouflager is on. [Sound example 4] The target tone no longer seems continuous. It simply alternates with the camouflager.
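Readers who wish to approximate these four patterns can start from a sketch like the one below. The harmonic numbers, the 100 Hz fundamental, the 0.6 amplitude ratio, and the 1-second repetition period come from the description above; the sample rate, the 0.5-second burst duration, the sine phases, and the abrupt (unramped) gating are assumptions made only for illustration, so the result is not a reconstruction of the original sound examples.

```python
import math

SAMPLE_RATE = 8000              # Hz; assumed, not stated in the text
F0_HZ = 100.0                   # fundamental, from the text
TARGET = (6, 7, 8, 9)           # target harmonics, from the text
CAMOUFLAGE = (2, 3, 4, 11, 12, 13)
CAMOUFLAGE_GAIN = 0.6           # per-harmonic amplitude relative to target
PERIOD_S = 1.0                  # pulsing at 1-second intervals
BURST_S = 0.5                   # assumed burst duration within each period


def complex_tone(harmonics, gain, n_samples):
    """Sum of equal-amplitude sine harmonics of F0_HZ."""
    return [
        gain * sum(math.sin(2.0 * math.pi * h * F0_HZ * i / SAMPLE_RATE)
                   for h in harmonics)
        for i in range(n_samples)
    ]


def burst_gate(n_samples):
    """1.0 during each burst, 0.0 otherwise (no onset ramps)."""
    period = int(PERIOD_S * SAMPLE_RATE)
    burst = int(BURST_S * SAMPLE_RATE)
    return [1.0 if i % period < burst else 0.0 for i in range(n_samples)]


def build_examples(duration_s=3.0):
    """Return the four patterns keyed by sound-example number."""
    n = int(duration_s * SAMPLE_RATE)
    target = complex_tone(TARGET, 1.0, n)
    camo = complex_tone(CAMOUFLAGE, CAMOUFLAGE_GAIN, n)
    gate = burst_gate(n)
    return {
        1: [g * c for g, c in zip(gate, camo)],                 # camouflager alone
        2: [g * (t + c) for g, t, c in zip(gate, target, camo)],  # synchronous
        3: [t + g * c for g, t, c in zip(gate, target, camo)],    # target continuous
        4: [(1.0 - g) * t + g * c
            for g, t, c in zip(gate, target, camo)],            # strict alternation
    }


examples = build_examples(duration_s=2.0)
```

Patterns 2 and 3 contain identical mixtures during each burst; they differ only in whether the target is also present between bursts, which is the glimpse that the heuristic exploits.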

There is another well-known effect of the old-plus-new heuristic: it is called apparent continuity, auditory induction, or the continuity illusion. It occurs when a short part of a sound or sequence of sounds is removed and a louder masking sound inserted in its place. Under certain conditions, the listener hears the original sound continue through the loud interruption.

This phenomenon is not an illusion. The sound that is "restored" is actually present in the masking sound. The latter is called a masking sound because it would have masked the original sound had it really been there. However, in order to mask it, the same neural activity that would normally have been instigated by the original sound would have had to be present as part of the activity instigated by the masker. Therefore the original sound, or at least its neural effect, is there, but in the same sense that a statue is present in an uncut stone. One could say that the rest of the stone was masking the statue if one chose to. In the auditory case, the presence of the undisguised signal, before and after the masker, allows the perceptual system to carve out a portion of the neural stimulation to act as a continuation of the signal.

We can see that this effect is very similar to Sound example 3, in which a four-harmonic sound was able to carve out a corresponding target from within a ten-harmonic tone. We did not think of it as an illusion because we knew that the four-harmonic target was really there. However, its perceived timbre was not there. The ten-harmonic sound did not have an audible component that sounded like the four-harmonic target. When we heard it, we were creating a timbre from the extracted components which would not have been there if the capturing tone did not precede and follow the more complex one. The ten-harmonic tone was being carved apart by the old-plus-new heuristic.

Let us give the label A to the whole continuing sound, the label A1 to the portion that enters the interrupting sound, and A2 to the portion that exits from it. Let us call the interrupting sound B. Perceived continuity occurs when auditory scene analysis interprets the sequence of sounds A1, B, A2, as a second sound laid on top of a continuing one. This interpretation depends on finding sensory evidence, within B, that A continues. In other words, B is interpreted as really a sum of A and some other sound. Because it is an interpretation, the auditory system will make it only if it is preferable to other possible interpretations.

One alternative interpretation is that A2 is not actually the same sound as A1. How does the auditory system rule out this possibility? Its strategy is to compare A1 and A2. If A2 does not continue the properties of A1 appropriately, continuity is not heard. For example, Valter Ciocca and I studied the perceived continuity of gliding tones through an interrupting loud noise burst (Ciocca & Bregman, 1987). In reality the tones were never present in the noise, but the listeners did not know this. They were asked to tell us how sure they were that the glide was present during the noise. We found certain relations between the A1 and A2 glides to be important in favoring the continuity. A2 should start near the frequency at which A1 finishes. They should have the same slope (on log-frequency-by-time coordinates) so that the two together make a constant-slope glide; alternatively, A2 should be the exact retrograde of A1 so that A1 and A2 form a V pattern.
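The relations just listed can be written out as a simple classifier. This sketch is an assumption-laden paraphrase of the verbal description, not the analysis used in the experiment: each glide is reduced to a hypothetical (start frequency, end frequency, duration) triple, slopes are measured in octaves per second, the two tolerances are arbitrary, and the time occupied by the noise is ignored.

```python
import math


def slope_oct_per_s(start_hz, end_hz, duration_s):
    """Slope of a glide on log-frequency-by-time coordinates."""
    return math.log2(end_hz / start_hz) / duration_s


def glide_relation(a1, a2, freq_tol_oct=0.1, slope_tol=0.05):
    """Classify the A1/A2 relation; both tolerances are assumed values.

    a1, a2: (start_hz, end_hz, duration_s)
    """
    # A2 should start near the frequency at which A1 finishes.
    if abs(math.log2(a2[0] / a1[1])) > freq_tol_oct:
        return "neither"
    s1 = slope_oct_per_s(*a1)
    s2 = slope_oct_per_s(*a2)
    if abs(s1 - s2) <= slope_tol:
        return "constant-slope glide"
    if abs(s1 + s2) <= slope_tol:   # equal and opposite slopes
        return "V pattern"
    return "neither"


rising = (500.0, 1000.0, 1.0)
print(glide_relation(rising, (1000.0, 2000.0, 1.0)))
print(glide_relation(rising, (1000.0, 500.0, 1.0)))
print(glide_relation(rising, (2000.0, 4000.0, 1.0)))
```

Only the first two pairings satisfy the conditions described as favoring continuity; the third has the right slope but starts an octave away from where A1 ended.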

Another possible interpretation of an A1-B-A2 signal is that A went off during B and then came on again. Any evidence of an auditory onset at the boundary between B and A2 will favor this interpretation. In other words, if A is to be heard as continuous, some portion of the auditory activity at the end of B must be of the right sort to connect up smoothly with that activated by A2. Recently Valter Ciocca, Robert Guadagno and I have found evidence, in an unpublished experiment, that this match in the neural activity at the B-A2 boundary is not done at the auditory periphery. In our experiment the subjects were asked to judge the discontinuity of A (the whole sound represented by A1 and A2). We manipulated the perceived locations of A1, B, and A2. But we did so without changing the relative loudness or onset time of the components in the two ears. As far as each ear, taken alone, was concerned, all the conditions in the experiment were the same. Only an interaural comparison of phase affected the perceived (intracranial) location. A1 and A2 were both 600 Hz tones. B was a burst of noise. The onsets and offsets of abutting sounds were overlapped by 5 msec and had 5 msec rise/decay times to avoid sudden discontinuities. We found that the major factor influencing continuity was whether the noise, B, and the following tone, A2, were perceived as coming from the same side of the head. When we listened to the stimuli ourselves, we noticed that in conditions where B and A2 seemed to be at different locations, the onset of A2 was more audible. The subjects must have been basing their judgments mostly on this effect. It appears that there was a masking of the onset of A2 by B but that this was more effective when the sounds seemed to be coming from the same place. If this is true, then the masking must have been of a type that we are not familiar with. It must be central, because the monaural intensities and timings were held constant. And it must be specific to perceived locations.

There is another interpretation that the auditory system might give to an A1-B-A2 sequence. It might hear it as A1 changing into B and then back into A2. If the change is slow and gradual, this is a plausible interpretation. However, if the change to B is sudden, and A2 closely resembles A1, it is more reasonable to hear it as a case of a second sound briefly overlapping an unchanging A.

Here is a continuity effect that illustrates this principle. You can easily hear it for yourself with the simplest of equipment. You play the same pure tone over the left and right channels of stereo headphones. (I assume it could be any steady sound, but I have not tested this assumption.) Then you turn the intensity of one of these channels, say the left, up and down repeatedly between two values: zero and the intensity of the right channel. During the rise and fall of the left channel, the relative intensity at the two ears changes. If the auditory system were to treat the whole sequence as a change in a single sound, it would hear the sound as moving back and forth between the right ear and the center of the head. If it treated it as an overlap of two sounds, it would hear a steady tone on the right side and a pulsing tone at the left. One could think of this second interpretation as resulting from the old-plus-new heuristic. The auditory system would be tracking the right-ear neural activity through the variations in interaural intensity and finding that it can treat the right ear as an old unchanging signal with a new left-hand signal added periodically. It is easy to experiment on the factors that encourage either interpretation. Let us give the label V to the variable channel. The two-sound interpretation is favored by sudden changes in V's intensity, and by keeping V at the higher intensity for shorter periods of time relative to the time at zero (Bregman, 1990, pp. 309-310).

While this example involves two ears, a similar demonstration can be created with a monaural signal. Just raise and lower a steady monaural signal between two fixed intensity levels. This can be heard either as a single tone changing in intensity or as a softer tone continuing unchanged and joined periodically by a second tone that has the same spectrum as the steady one. This effect has been called homophonic continuity by Warren (e.g., 1982, ch. 2). I have observed that the same factors that favor the two-tone interpretation in the dichotic case also promote it in this one: sudden rises in intensity and short relative durations at maximum intensity.
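Both the dichotic and the monaural demonstrations can be approximated with a single envelope generator, which makes it easy to vary the two factors mentioned: the suddenness of the change (`ramp_s`) and the time spent at the higher level (`high_s`). The sketch below is a starting point only. The tone frequency, sample rate, period, linear amplitude scale, and the softer level of 0.5 are assumptions, not values from the text.

```python
import math


def trapezoid_envelope(n_samples, sample_rate, period_s, high_s, ramp_s,
                       low, high):
    """Repeating rise / hold / fall / rest envelope between `low` and `high`."""
    env = []
    for i in range(n_samples):
        t = (i / sample_rate) % period_s
        if t < ramp_s:                                  # rising ramp
            level = low + (high - low) * t / ramp_s
        elif t < ramp_s + high_s:                       # hold at high level
            level = high
        elif t < 2.0 * ramp_s + high_s:                 # falling ramp
            level = high - (high - low) * (t - ramp_s - high_s) / ramp_s
        else:                                           # rest at low level
            level = low
        env.append(level)
    return env


def tone(freq_hz, n_samples, sample_rate):
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]


def dichotic_demo(freq_hz=440.0, sample_rate=8000, duration_s=2.0,
                  period_s=0.5, high_s=0.1, ramp_s=0.0):
    """Right channel steady; left channel V varies between zero and full level."""
    n = int(duration_s * sample_rate)
    steady = tone(freq_hz, n, sample_rate)
    env = trapezoid_envelope(n, sample_rate, period_s, high_s, ramp_s, 0.0, 1.0)
    left = [e * s for e, s in zip(env, steady)]
    return left, steady


def monaural_demo(freq_hz=440.0, sample_rate=8000, duration_s=2.0,
                  period_s=0.5, high_s=0.1, ramp_s=0.0, low=0.5):
    """One channel alternating between a softer and a louder level."""
    n = int(duration_s * sample_rate)
    env = trapezoid_envelope(n, sample_rate, period_s, high_s, ramp_s, low, 1.0)
    return [e * s for e, s in zip(env, tone(freq_hz, n, sample_rate))]


left, right = dichotic_demo()
mono = monaural_demo()
```

Setting `ramp_s` to zero with a small `high_s` gives the sudden, brief increments said to favor the two-sound interpretation; lengthening the ramp should push the percept toward a single changing sound.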

So far I have mainly been discussing the perceived properties of the continuing tone, A. However, the old-plus-new heuristic, in treating the period of the interruption as a mixture of two sounds, must also form a description of the interrupting one. There is evidence, at least in some circumstances, that this description is formed from the acoustic evidence that is left over after the continuing sound has claimed its own components.

An example can be seen when a long sound, A, consisting of a noise that is restricted in bandwidth, alternates with a shorter noise burst, B, that has a wider bandwidth (Warren, 1982, ch. 2). Sound A strips itself out of B leaving behind a residual. I have created a pair of demonstrations that show this effect clearly. In both of them, the short interrupting noise is in the 0-2000 Hz band. In the first demonstration, this noise is alternated with a longer one that is in the lower half of its band. The listener hears a low continuous noise, joined periodically by a short burst of high-pitched noise. In the second example, the longer noise, A, is in the upper half of the band. So the listener hears a continuous high-pitched noise, but this time the accompanying short bursts are low in pitch. It should be emphasized, however, that the interrupting burst is identical in the two examples. It spans both halves of the range. However, it sounds different because different halves of it are being stripped out and leaving the other half. Playing the two demonstrations in alternation allows the listener to hear that the residuals heard in one demonstration sound like the long tone of the other. They are the same halves of the spectrum in both cases. [Sound example 5]
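The stripping-out of A from B amounts to a subtraction of frequency bands, which a few lines can illustrate. The sketch assumes idealized rectangular bands described only by their edges, and it takes "lower half" and "upper half" to mean 0-1000 Hz and 1000-2000 Hz; neither assumption is stated in the text.

```python
def residual_bands(burst, continuing):
    """Parts of the `burst` band (lo_hz, hi_hz) not claimed by `continuing`."""
    b_lo, b_hi = burst
    c_lo, c_hi = continuing
    pieces = []
    if c_lo > b_lo:                       # something left below A
        pieces.append((b_lo, min(c_lo, b_hi)))
    if c_hi < b_hi:                       # something left above A
        pieces.append((max(c_hi, b_lo), b_hi))
    return pieces


burst = (0, 2000)                         # identical in both demonstrations
print(residual_bands(burst, (0, 1000)))      # A in the lower half
print(residual_bands(burst, (1000, 2000)))   # A in the upper half
```

The same burst yields a high residual in the first case and a low one in the second, which parallels what the listener reports.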

Treating this pattern, and the earlier one with harmonic components, as examples of perceptual organization may seem like a joke. After all, the signals could have been made by periodically adding an extra sound to a continuous one. We would expect to hear it as such. However, it is important to realize that we always hear it this way, regardless of how it was made. This is the fact that we are trying to explain.

The perception of the residual as a separate sound may owe something to the "enhancement effect". If a sound is played for some time and then the intensities of parts of the spectrum change, the auditory system seems to intensify its perception of those changes (e.g., Summerfield, Sidwell, & Nelson, 1987). A dramatic example is obtained when the negative spectrum of a vowel (which has valleys where the vowel would have peaks) is followed by a flat spectrum. The latter then sounds like the positive vowel, because the incremented spectral regions, at the moment when the valleys are suddenly filled up, are heard as peaks. A case of the enhancement effect that more clearly shows its practical value can be created in a situation in which a listener is asked to recognize two vowels that are played together (Summerfield & Assmann, 1989). When one of the vowels is played first briefly before being joined by the second, recognition of the second one is enhanced (relative to a condition in which the two vowels start at the same time). A brief glimpse of one vowel has allowed the decomposition of the mixture.
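The arithmetic behind the negative-vowel example can be checked with a toy calculation. Everything numerical here is hypothetical: the five band levels are invented, the "negative" spectrum is formed by mirroring levels about their midrange, and the assumption that perceived peaks simply follow the size of the increment is a simplification for illustration, not a claim about the cited experiments.

```python
def level_increments(previous_db, current_db):
    """Per-band rise in level (dB) from one spectrum to the next; falls count as 0."""
    return [max(0.0, c - p) for p, c in zip(previous_db, current_db)]


vowel_db = [40.0, 60.0, 40.0, 55.0, 40.0]     # hypothetical; peaks in bands 1 and 3
mirror = max(vowel_db) + min(vowel_db)
negative_db = [mirror - v for v in vowel_db]  # valleys where the vowel has peaks
flat_db = [max(negative_db)] * len(vowel_db)  # flat spectrum that follows

increments = level_increments(negative_db, flat_db)
print(increments)
```

The bands that rise when the valleys are filled are exactly the bands where the positive vowel has its peaks, which is why the flat spectrum can take on a vowel-like quality.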

The enhancement effect may be involved in my previous two examples. In the case in which low noise is interrupted by wider-band noise, the B components that are not present in A1 are the ones whose intensities change at the A1-B boundary. They may be enhanced, and this may help them to be heard as a separate added tone. Those whose intensities do not change are grouped with the previous spectrum. While the enhancement effect does not explain the actual grouping that is perceived in these demonstrations, it can explain why the residual is perceived so clearly despite the fact that it is spectrally continuous with the other half of the spectrum. In short, the enhancement effect probably contributes to the formation of residuals. When a sound has been going on for some time, we already know its properties. It becomes more important to perceive the properties of added sounds and give them separate identities.

To recapitulate, I have tried to show that the capturing of one or more harmonics from a complex tone, the release from fusion resulting from asynchronous onsets, apparent continuity, and the enhancement effect are all aspects of a process whereby the auditory system deals with mixtures of sounds by using clear glimpses of individual sounds to help deal with the mixtures that follow them.

Now a word about the differences between speech and music. It is true that, in both, the human ear must deal with the problem of the mixture of target sounds with others that get in the way. Yet the situation is somewhat different in the two cases. When we listen to a voice, other sounds usually intrude by accident. We do not want to fuse two voices, only the various parts of the same one. In other words, we are interested in the actual acoustic sources. However, in music, particularly orchestral music, the composer may not want the listener to experience the contribution of each instrument as a separate sound. It may be more important to fuse groups of instruments into different melodic lines, or musical layers, and to keep these newly constructed streams distinct from one another. These goals can be achieved by providing the listeners with separate glimpses at the component voices before requiring them to be followed into mixtures. This is done by arranging that the instruments within a component line should be synchronous while the onsets of notes in different lines should frequently be asynchronous. Exact arrangements of this sort are made difficult for orchestras by the inherent asynchronies that one finds with players whose notes nominally begin at the same time (Rasch, 1979).

David Huron has recently found evidence that composers use this approach of controlling synchronies deliberately (Huron, 1989). He was able to show, in a fairly broad sample of keyboard music by J. S. Bach, that as the number of musical voices increased from two to six, the composer used more asynchronies, presumably to allow the listener to follow the separate parts better by catching more glimpses of them in relative isolation.

Finally, it is true that I have pointed out some similarities in a number of phenomena in which hearing a sound in isolation allows a listener to decompose a more complex mixture of which the sound is a part. However, while the analogies are suggestive, we need more research to determine whether the same auditory processes are involved or whether different ones are simply serving the same functional goals.

References

Bregman, A.S. (1990). Auditory scene analysis: The perceptual organization of sound. Cambridge, Mass.: The MIT Press.

Ciocca, V., & Bregman, A.S. (1987). Perceived continuity of gliding and steady-state tones through interrupting noise. Perception and Psychophysics, 42, 476-484.

Huron, D. (1989). Voice segregation in selected polyphonic keyboard works by J.S. Bach. Ph.D. dissertation, School of Music, University of Nottingham.

Rasch, R.A. (1979). Synchronization in performed ensemble music. Acustica, 43, 121-131.

Summerfield, Q., & Assmann, P.F. (1989). Auditory enhancement and the perception of concurrent vowels. Perception and Psychophysics, 45, 529-536.

Summerfield, Q., Sidwell, A., & Nelson, T. (1987). Auditory enhancement of changes in spectral amplitude. Journal of the Acoustical Society of America, 81 (3), 700-708.

Warren, R.M. (1982). Auditory perception: A new synthesis. New York: Pergamon Press.