Before going further into actual sound system issues and mathematics, it is important to know the significance of human sound perception. The aim in this section is to shed some light on the physiological, neuropsychological and cognitive mechanisms which take part in our hearing of sound. Mostly, it is comprised of a brief treatment of the field of psychoacoustics, although an amount of physics, human anatomy and relevant psychology is explained as well—all these have an important role in explaining what sound is to us. This section is a long one, the reason being that not many things in sound and music are interesting to humans outside the context of how we hear and interpret what was heard. Were it not for our sense of hearing and all the cognitive processing that goes along with it, the physical phenomenon of sound would be only just that—a physical phenomenon. Not many would have any interest in such a thing.
Knowledge of how we hear and why is thus paramount to understanding the
relevance of the many algorithms, mathematical constructs and the general
discipline of audio signal processing we encounter further on. This
understanding helps explain why some synthesis methods are preferred over
others, what it is that separates pleasant, harmonious music from horrifying
noise, what it is that comprises the pitch, timbre and loudness of an
instrument, what makes some sounds especially “natural” or “fat”,
where the characteristic sound of some particular brand of equipment comes from
and what assumptions and simplifications can be made in storing, producing and
modifying sound signals. Basic knowledge of psychoacoustics can also help avoid
some of the common pitfalls in composition and sound processing and suggest
genuine extensions to one’s palette of musical expression.
To put it shortly, psychoacoustics is the field of science which studies how we perceive sound and extract meaningful data from acoustical signals. It concerns itself primarily with low level functions of the auditory system and thus doesn’t much overlap with the study of music or æsthetics. Basic psychoacoustical research is mainly directed toward such topics as directional hearing, pitch, timbre and loudness perception, auditory scene analysis (the separation of sound sources and acoustical parameters from sound signals) and related lower functions, such as the workings of our ears, neural coding of auditory signals, the mechanisms of interaction between multiple simultaneously heard sound sources, neural pathways from ears to the auditory cortex, their development and the role of evolution in the development of hearing. Psychoacoustical research has resulted in an enormous amount of data which can readily be applied to sound compression, representation, production and processing, musicology, machine hearing, speech recognition and composition, to give just a few examples. The reason why such a long part of this document is devoted to psychoacoustics is that although one can understand sound synthesis and effects fairly well just by grasping the relevant mathematics, one cannot truly get a hold of their underlying principles, shortcomings and precise mechanisms of action before considering how the resulting sound is heard. Human auditory perception is a rather quirky and complicated beast, and it often happens that sheer intuition simply doesn’t cut it.
There are three main parts in the human ear: outer, middle and inner ear. The outer ears include ear lobes (pinnæ) and the ear canals. Between outer and middle ear, the tympanic membrane (or eardrum) resides. The middle ear is a cavity in which three bones (called malleus, incus and stapes) reside. Malleus is attached to the tympanic membrane, stapes to the oval window which separates the inner and middle ears. Incus connects these two. The three bones (collectively called ossicles) form a kind of lever which transmits vibrations from the tympanic membrane to the fluid filled inner ear, providing an impedance match between the easily compressed air in the outer ear and the noncompressible fluid in the inner. To these bones the smallest muscles in our body, the middle ear muscles, are attached. They serve to dampen the vibration of the ossicles when high sound pressures are met, thereby protecting the inner ear from excessive vibration. The inner ear is composed of the cochlea, a little, bony, seashell shaped apparatus which eventually senses sound waves, and the vestibular apparatus. All these structures are incredibly small—the ear canal measures about 3 centimeters in length and about one half in diameter, the middle ear is about 2 cubic centimeters in volume and the cochlea is, when rolled straight, about 35 millimeters in length and 2 millimeters in diameter.
In the cochlea, we find an even smaller level of detail. The cochlea is divided into three longitudinal compartments, scala vestibuli, scala tympani and scala media. The first two are connected through the apex in the far end of the cochlea, the middle one is separate from the others. The vibrations from the middle ear reach the cochlea through the oval window which resides in the outer end of scala vestibuli. In the outer end of scala tympani, the round window connects the cochlea to the middle ear for the second time. Vibrations originate from the oval window, set the intermediate membranes (Reissner’s membrane between scala vestibuli and scala media, the basilar membrane between scala media and scala tympani) in movement and get damped upon reaching the round window. On the floor of scala media, under the tectorial membrane, lies the organ of Corti. This is where neural impulses associated with sound reception get generated. From the bottom of the organ of Corti, the auditory nerve emanates, headed for the auditory nuclei of the brain stem.
The organ of Corti is the focal point of attention in many basic psychoacoustical studies. It is a complex organ, so we will have to simplify its operation a bit. For a more complete description, see Kan01. That book is also a good general reference on neural structures. On top of the Corti organ stand two lines of hair cells, covered with stereocilia (small hairs of actin filaments which sense movement). On the outside of the cochlea is the triple line of outer hair cells, on the inside the single line of inner hair cells. The other ends of the stereocilia are embedded in the overlying tectorial membrane. This arrangement means that whenever the basilar membrane twitches, the stereocilia get bent between it and the tectorial membrane. Obviously, the pressure changes in scala vestibuli result in just such action, which means that sound results in bent stereocilia. This in turn leads to neural impulses being generated. These are led by the afferent auditory nerves towards the brain. The inner and outer hair cells are innervated rather differently: it seems that the inner ones are mainly associated with louder and the outer with quieter sounds (see Gol01). Also, some efferent innervation reaches the outer hair cells, so it is conceivable that the ear may adapt under neural control, possibly to aid in selective attention (Kan01).
By now, the basic function of the ear should be quite clear. However, nothing has been said about how the ear codes the signals. It is well known that neurons cannot fire at rates exceeding 500‐1000Hz. Neurons also primarily operate on binary pulses (action potentials)—there either is a pulse or there is not. Direct encoding of waveforms does not come into question, then. And how about amplitude? To answer these questions, something more has to be said about the structure of the cochlea.
When we look more closely at the large scale structure of the organ of Corti, we see a few interesting things. First, the width of the basilar membrane varies over the length of the cochlea. Near the windows, the membrane is quite narrow whereas near the apex, the membrane is quite a bit wider. Similarly, the thickness and stiffness vary—near the windows they are considerable whereas near the apex they are much less. And the same kind of variation is repeated in the hair cells and their stereocilia—near the apex, longer, more flexible hairs prevail over the stiffer, shorter stereocilia of the base of the cochlea. All this has a serious impact on the vibrations caused in the Corti organ by sound—vibrations of higher frequency tend to cause response mainly near the windows where the characteristic vibrational frequency of the basilar membrane is higher, lower frequencies primarily excite the hair cells near the apex. This means that the organ of Corti performs physical frequency separation on the sound. This separation is further amplified by the varying electrical properties of the hair cells, which seem to make the cells more prone to excitation at specific frequencies. All in all, the ear has an impressive apparatus for filter bank analysis. From the inner ear, this frequency decomposed information is moved by the auditory nerve. The nerve fibers are also sorted by frequency, a pattern repeated all over the following neural structures. This is called tonotopic organization.
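The frequency-to-place mapping described above can be put into numbers. The following is only a rough sketch: it assumes Greenwood's empirical place-frequency function with constants commonly quoted for the human cochlea (A = 165.4, a = 2.1, k = 0.88). Neither the function nor the constants come from this text; the 35 mm length is the unrolled figure given earlier.

```python
import math

# Assumed constants: Greenwood's empirical fit for the human cochlea.
# They are not taken from this text.
A = 165.4     # scaling constant, Hz
ALPHA = 2.1   # slope per unit of relative length
K = 0.88      # low-frequency correction

COCHLEA_LENGTH_MM = 35.0  # unrolled length quoted earlier in the text


def place_to_frequency(x_rel):
    """Characteristic frequency (Hz) at relative distance x_rel from the apex.

    x_rel = 0.0 is the apex, x_rel = 1.0 is the base (near the windows).
    """
    return A * (10.0 ** (ALPHA * x_rel) - K)


def frequency_to_place(freq_hz):
    """Inverse map: relative distance from the apex for a given frequency."""
    return math.log10(freq_hz / A + K) / ALPHA


if __name__ == "__main__":
    for f in (100.0, 1000.0, 10000.0):
        x = frequency_to_place(f)
        print(f"{f:8.0f} Hz -> {x * COCHLEA_LENGTH_MM:5.1f} mm from the apex")
```

Under these assumptions the map is roughly logarithmic over most of its range, so each octave occupies a similar stretch of the membrane, with low frequencies near the apex and high frequencies near the windows.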
When sound amplitude varies, the information should be coded somehow. This is an area of study which is still going on strong. This means that a complete description cannot be given here, but most relevant points are mentioned, anyhow. One mechanism of coding involves the relative firing frequencies and amplitudes of the individual neurons—the more excitation there is, the more there is neural activity in the relevant auditory neurons. Since frequency information is carried mainly by tonotopic mapping of the neurons, this doesn’t pose a problem of data integrity. A second mechanism which seems to augment the transmission is based on the fact that as louder sounds impinge upon the ear, the width of resonance on the basilar membrane increases. This may cloud the perception of nearby frequencies but can also be used to deduce the amplitude of the dominant component. The efferent innervation to the outer hair cells and afferent axons from the inner hair cells also seem to play a part in loudness perception—there is evidence suggesting that the ear can adapt to loud sounds, keeping the ranges in check.
So now we have a rough picture of how amplitude and frequency content are carried over the auditory nerve. How about time information, then? Considering the high time accuracy of our hearing (in the order of milliseconds, at best), mere large scale time variation in neural activity (governed by the many resonating structures on the signal path and the inherent limitation on neuron firing rate) does not seem to explain everything. When investigating this mystery, researchers ran into an interesting phenomenon, namely, phase locking, which has also served as a complementary explanation to high frequency pitch perception. It seems that hair cells, in addition to firing more often when heavily excited, tend to fire at specific points of the vibratory motion. This means that the firing of multiple neurons, although mutually asynchronous, concentrates at a specific point of the vibratory motion. This phenomenon has been experimentally demonstrated for frequencies as high as 8kHz. It is conceivable, then, that many neurons working in conjunction could directly carry significantly higher frequencies than their maximum firing rate would at first sight suggest. This result has been experimentally confirmed as well as its role in conveying accurate phase information to the brain (this is important in measuring interaural phase differences and, consequently, plays a big part in directional hearing). It also serves as a basis for modern theories of pitch perception through what is called periodicity pitch, pitch determination through the period of a sound signal. The idea of concerted action of phase locked neurons carrying frequency information is called the volley principle and augments the frequency analysis (place principle) interpretation introduced above.
This time domain coding is extremely important because it seems that accurate frequency discrimination cannot be explained without it—place codings display a serious lack of frequency selectivity, even after considerable neural processing and enhancement.
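The volley principle described above can be illustrated with a toy simulation. This is a sketch under strong simplifying assumptions, not a physiological model: each hypothetical neuron fires only at one fixed phase of the stimulus cycle, never more often than an assumed maximum rate, and otherwise at random. All parameter values are made up for the illustration.

```python
import math
import random


def simulate_volley(stim_hz=2000.0, max_rate_hz=500.0, n_neurons=20,
                    n_cycles=100, fire_prob=0.5, seed=1):
    """Return one spike train per neuron, as lists of stimulus cycle indices.

    Every spike falls on the same phase of a cycle (phase locking), and a
    neuron must wait at least stim_hz / max_rate_hz cycles between spikes.
    """
    rng = random.Random(seed)
    # Minimum spacing between spikes of one neuron, in stimulus cycles.
    min_gap = math.ceil(stim_hz / max_rate_hz)
    trains = []
    for _ in range(n_neurons):
        spikes = []
        last = -min_gap  # allows a spike on the very first cycle
        for cycle in range(n_cycles):
            ready = cycle - last >= min_gap
            if ready and rng.random() < fire_prob:
                spikes.append(cycle)
                last = cycle
        trains.append(spikes)
    return trains


def pooled_coverage(trains, n_cycles):
    """Fraction of stimulus cycles marked by at least one neuron's spike."""
    covered = set()
    for spikes in trains:
        covered.update(spikes)
    return len(covered) / n_cycles


if __name__ == "__main__":
    trains = simulate_volley()
    busiest = max(len(s) for s in trains)
    print("most spikes from a single neuron:", busiest, "in 100 cycles")
    print("cycles covered by the pooled train:", pooled_coverage(trains, 100))
```

No single neuron in this sketch can mark more than every fourth cycle, yet the pooled train marks nearly all of them, so the stimulus period remains recoverable from the population as a whole.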
Now the function of the auditory system has been described up to the auditory
nerve. What about after that? The eighth cranial nerve, most of which is an
extension of the innervation of the inner ear (the rest being mainly concerned
with the sense of balance), carries the auditory traffic to the brain stem. Here
the auditory nerve passes through the cochlear nuclei, which start
the neural processing and feature extraction process. Upon entering the cochlear
nucleus, the auditory nerve is divided in two. The upper branch goes to the
upper back quarter of the nucleus while the lower branch innervates the lower
back quarter and the front half. The cochlear nuclei display clear tonotopic
organization, with high frequencies mapped to the centre and lower frequencies
mapped to the surface. The ventral (front) side of the nucleus is made
up of two kinds of cells, bushy and stellate (starlike).
Stellate cells respond to single neural input pulses by a series of evenly
spaced action potentials of a cell dependent frequency (this is called a
chopper response). The stellate cells have long, rather simple
dendrites. This suggests that the stellate cells gather pulses from many lower
level neurons, and extract precision frequency information from their
asynchronous outputs. Their presence supports one of the theories of frequency
discrimination, which speculates on the presence of a “timing reference” in
the brain. The bushy cells, on the other hand, have a fairly compact array of
highly branched dendrites (whence the name) and respond to depolarization with a
single output pulse. This suggests they are probably more concerned with
time‐domain processing. It seems bushy cells extract and signal the onset time
of different frequencies in a sound stimulus. There are also cells, called
pausers, which react to stimuli by first chopping a while, then
stopping, and after a while starting again. These may have something to do with
estimating time intervals and/or offset detection.
Following the cochlear nuclei, the auditory pathway is divided in three. The dorsal (back side) acoustic stria crosses the medulla, along with the intermediate acoustic stria. The most important, however, is the trapezoid body, which is destined for the next important processing centre, the superior olivary nucleus. The olives are a prime ingredient in directional hearing. Both nuclei receive axons from both the ipsilateral (same side) and contralateral (opposite side) cochlear nuclei. The medial (closer to the centre of the body) and lateral (closer to the sides of the body) portions of the nuclei serve different functions: the medial part is concerned with measuring interaural time differences while the lateral half processes interaural intensity information. Time differences are measured by neurons which integrate the information arriving from both ears: since propagation in the preceding neurons is not instantaneous and the signals from the ears tend to travel in opposite directions along the pathways, this system works as a kind of correlator. The coincidence detector is arranged so that neurons closer to the opposite side of the sound source tend to respond to it. Similarly, the intensities are processed: contralateral signals excite and ipsilateral signals inhibit the response of the intensity detector. These functions are carried out separately for different frequency bands and are duplicated in both superior olivary nuclei, although the dynamics of the detection process mainly place the response on the opposite side of the signal source.
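The correlator idea mentioned above has a direct signal processing analogue: slide one ear's signal against the other and pick the shift at which they agree best. The sketch below is only that analogue, not a model of the olivary neurons. The sample rate, the noise burst stimulus and the function names are assumptions made for the illustration.

```python
import random

SAMPLE_RATE = 44100  # Hz, hypothetical


def make_ear_signals(delay_samples, n_samples=2000, seed=0):
    """Return (left, right): a noise burst and a copy delayed by delay_samples.

    Only non-negative delays are supported, i.e. the source is on the left.
    """
    rng = random.Random(seed)
    left = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    right = [0.0] * delay_samples + left[:n_samples - delay_samples]
    return left, right


def estimate_itd_samples(left, right, max_lag=40):
    """Lag (in samples) at which right best matches left.

    A positive result means the right ear signal arrives later.
    """
    n = len(left)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Correlation of left with right shifted back by `lag` samples.
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


if __name__ == "__main__":
    left, right = make_ear_signals(delay_samples=20)
    lag = estimate_itd_samples(left, right)
    print("estimated interaural delay:", 1000.0 * lag / SAMPLE_RATE, "ms")
```

A broadband noise burst is used here because its correlation has a single sharp peak; with a steady sinusoid the peak would repeat every period and the estimate would become ambiguous.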
After leaving the olives, the axons rejoin their crossed and uncrossed friends from the cochlear nuclei. They then progress upwards—this time the bundle of axons is called the lateral lemniscus. The lemniscus ascends first through pons where an amount of crossing between the lateral pathways is observed. This happens through Probst’s commissure which mainly contains axons from the nuclei of the lateral lemniscus. From here, the lane continues upward to the midbrain (more specifically to the inferior colliculus) where all the axons finally synapse. This time there seems not to be any extensive crossing. It would appear that the inferior colliculus has something to do with orientation and sound‐sight coordination—the superior colliculus deals with eye sight and there are some important connections to be observed. Also, there is good evidence that topographic organization according to the spatial location of the sound is present in the inferior colliculi. It is noticeable that while we trace the afferent auditory pathway through to the lateral lemniscus and the inferior colliculus, the firing pattern of the neurons changes from flow‐like excitation to an onset/offset oriented kind. More on this can be found in the sections on transients and time processing. The pathway is then extended upwards to the medial geniculate nuclei just below the forebrain which then, finally, projects to the primary auditory cortex on the cerebrum.
One special thing to note about the geniculate nuclei is that they, too, are divided into parts with apparently different duties. The ventral portion displays tonotopic organization, whereas the dorsal and medial (magnocellular) parts do not. The ventral part projects to tonotopically organized areas of the cortex, the dorsal part nontonotopic ones and the medial part to both. In addition, the magnocellular medial geniculate nuclei display a certain degree of lability/plasticity. This means it may have considerable part in how learning affects our hearing. A noteworthy fact is that the nontonotopically organized parts of the geniculate nuclei and the cortex are considerably less well known than their tonotopic counterparts—complex, musically relevant mappings might be found there, in the future. Throughout the journey, connections to and from the reticular formation (which deals with sensomotoric integration, controls motivation and maintains the arousal and alertness in the rest of the central nervous system) are observed. Finally, the auditory cortex is located on the surface of the temporal lobes. And just to add to the fun, there is extensive crossing here, as well. This time it takes place through corpus callosum, the highway between the right and left cerebral hemispheres.
On the way to the auditory cortex, extensive mangling of information has already taken place. It is seen, for example, that although the tonotopic organization has survived all the way through the complex pathways, it has been multiplied, so that there are now not one but several frequency maps present on the auditory cortex. The structural organization is more complex, here, also. Like most of the cortex, the auditory cortex is both organized into six neuronal layers (which mainly contain neuronal cell bodies) and into columns (which reach through the layers). The layers show their usual pattern of external connections: layer IV receives the input, layer V projects back towards the medial geniculate body and layer VI to the inferior colliculus. The columns, on the other hand, serve more specialized functions and the different types are largely interspersed among one another. Binaural columns, for instance, show an alternating pattern of suppression and addition columns, that is, columns which differentiate between interaural features and those which do not, respectively. Zoning of callosally connected and nonconnected areas is also observed. Further, one must not forget that there exist areas in the brain which are mainly concerned with speech production and reception (the areas of Broca and Wernicke, respectively). They are specific to humans although some similar formations are present in the brain of other animals, especially if they are highly dependent on auditory processing (bats and dolphins are examples with their echo location and communication capabilities).
All in all, the functional apparatus of the brain concerned with auditory
analysis is of considerable size and complexity. One of the distinctive features
of this apparatus is the extensive crossing between the two processing
chains—one of the most peculiar aspects of hearing is that while the usual rule
of “processing on the wrong side” is generally observed, the crossing
distributes the processing load so that even quite severe lesions and extensive
damage to the cortex need not greatly disturb auditory functions.
In the last section, it became apparent that the brain has an extensive
apparatus for extracting both time and frequency information from sounds. In
fact, there are two separate pathways for information: one for frequency domain
and the other for time domain data. This has far reaching consequences for how
we hear sound signals. First of all, it means that any perceptually significant
analysis or classification of sound must include both time and frequency. This
is often forgotten in the traditional Fourier analysis based reasoning on sound
characteristics. Second, it draws a kind of dividing line between sound signals
whose main content to us is in one of the domains. Here, this division is used
to give meaning to the often encountered terms transient and
steady‐state; we take the first to mean “time oriented”, and the
second “frequency oriented”. Another (rather more rigorous) definition of
steady‐state is based on statistics. In this context, a signal is called
steady‐state if it is stationary in the short term and
transient if it is not.
The root of this terminology lies in linear system analysis. There steady‐state means that a clean, often periodic or almost constant excitation pattern has been present long enough so that Fourier based analysis gives proper results. Formally, when exposed to one‐sided inputs (non‐zero only if time is positive), linear systems exhibit output which can be decomposed into two additive parts: a sum of exponentially decaying components which depends on the system and a sustained part which depends on both the excitation and the system. The former is the transient part, the latter steady‐state. Intuitively, transients are responses which arise from changes of state—from one constant input or excitation function to another. They are problematic, since they often correspond to unexpected or rare events; it is often desired that the system spend most of its time in its easiest to predict state, a steady‐state. Because transients are heavily time‐localized, they defy the usefulness of traditional Fourier based methods.
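The decomposition into a decaying part and a sustained part can be checked on the simplest possible case. The sketch below assumes a first-order low-pass system, y' = -a·y + a·u, driven from rest by a cosine switched on at time zero; the system and all constants are hypothetical choices for illustration. The closed-form transient and steady-state parts are compared against a direct numerical integration.

```python
import math

A_RATE = 50.0                 # hypothetical system pole (1/s)
OMEGA = 2.0 * math.pi * 10.0  # hypothetical drive frequency (rad/s)


def steady_state(t):
    """Sustained part: depends on both the excitation and the system."""
    gain = A_RATE / math.hypot(A_RATE, OMEGA)
    phase = -math.atan2(OMEGA, A_RATE)
    return gain * math.cos(OMEGA * t + phase)


def transient(t):
    """Decaying part: its time constant depends on the system alone."""
    c = -(A_RATE ** 2) / (A_RATE ** 2 + OMEGA ** 2)
    return c * math.exp(-A_RATE * t)


def simulate(t_end, dt=1e-4):
    """Integrate y' = -a*y + a*cos(w*t), y(0) = 0, with classical RK4."""
    def f(t, y):
        return -A_RATE * y + A_RATE * math.cos(OMEGA * t)

    n_steps = int(round(t_end / dt))
    t, y = 0.0, 0.0
    samples = [(t, y)]
    for k in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + dt / 2, y + dt * k1 / 2)
        k3 = f(t + dt / 2, y + dt * k2 / 2)
        k4 = f(t + dt, y + dt * k3)
        y += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t = (k + 1) * dt
        samples.append((t, y))
    return samples


if __name__ == "__main__":
    worst = max(abs(y - (steady_state(t) + transient(t)))
                for t, y in simulate(0.2))
    print("largest mismatch between simulation and decomposition:", worst)
    print("transient part remaining at t = 0.2 s:", transient(0.2))
```

In this example the transient has all but vanished after a few time constants, after which the output is described completely by the gain and phase of the system at the drive frequency, which is exactly the regime where Fourier based reasoning applies.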
In acoustics and music, the situation is similar in that frequency oriented methods tend to fail when transients are present. Moreover, in music, transients often correspond to excitatory motions on behalf of the performer (plucking a bow, striking a piano key, tonguing the reed while playing an oboe etc.), and so involve
All these together mean that pure frequency domain analyses do not explain complex sounds clearly enough, since they do not take into account the time‐variant, stochastic or nonlinear aspects of the event. From an analytical point of view, a time‐frequency analysis is needed. Some of these are mentioned in the math section. The fourth item in the list above deserves special attention because it is characteristic of vocal sounds: consonants are primarily recognized by the trajectories (starting points, relative amplitudes and speed of movement) of the partials present in the following phoneme (Dow01). Usually consonants consist of a brief noisy period followed by the partials of the next phoneme sliding into place, beginning from positions characteristic of the consonant. This happens because consonants are mostly based on restricting air passage through the vocal tract (this and the following release produce the noise), because the following phoneme usually exhibits different formant frequencies (causing a slide from the configuration of the consonant) and, finally, because consonants are mostly very short compared to vowels.
What is the perceptual significance of our transient vs. steady
classification, then? To see this, we must consider speech, first. In the spoken
language, two general categories of sounds are recognized: vowels and
consonants. They are characterized by vowels being “voiced”, often quite
long and often having a more or less clear pitch as opposed to consonants being
short, sometimes noiselike (such as the pronunciation of the letter “s”)
and mostly unpitched. Vowels arise from nicely defined vibratory patterns in the
vocal tract which are excited by a relatively steady pulse train from the vocal
cords, whereas consonants mostly arise from constrictions of the vocal tract and
the attendant turbulence, impulsive release (like when pronouncing a “p”,
or one of the other plosives) or nonlinear vibration (like the letter “r”).
Now, a clear pattern shows here. Consonants tend to be transient in nature,
while vowels are mostly steady‐state. This is very important because most of the
higher audio processing in humans has been shaped by the need to understand
speech. This connection between vowel/consonant and steady/transient
classification has also been demonstrated in a more formal setting: in listening
experiments, people generally tend to hear periodic and quasi‐periodic sounds as
being vowel‐like while noises, inharmonic waveforms and nonlinear phenomena tend
to be heard as consonants. Some composers have also created convincing illusions
such as “speech music” by proper orchestration: when suitable portions of
transient and steady‐state material are present in the music in some semi‐logical
order, people tend to hear a faint speech like quality in the result. The
current generation of commercial synthesizers also demonstrates the point: today,
modulatory possibilities and time evolution of sounds often outweigh the basic
synthesis method in importance as a buying criterion. The music of
the day relies greatly on evolving, complex sounds instead of the traditional
one‐time note event structure.
It is kind of funny how little attention time information has received in classical studies, although one of the classic experiments in psychoacoustics tells us what importance brief, transient behavior of sound signals has. In the experiment, we record instrumental sounds. We then cut out the beginning of the sound (the portion before the sound has stabilized into a quasi‐periodic waveform). In listening experiments, samples brutalized this way are quite difficult to recognize as being from the original instrument. Furthermore, if we splice together the end of one sample and the beginning of another, the compound sound is mostly recognized as being from the instrument of the beginning part. In a musical context, the brief transient in the beginning of almost all notes is called the attack. For a long time, it eluded any closer inspection and even nowadays, it is exceedingly difficult to synthesize if anything but a thorough physical model of the instrument is available.
This kind of high importance of transient characteristics in sound is best understood through two complementary explanations. First, from an evolutionary point of view, time information is essential to survival—if it makes a sudden loud noise, it may be coming to eat you or falling on you.
You need rapid classification as to what the source of the sound is and where it is at. From a physical point of view, there may also be considerably more information in transient sound events than in steady‐state (and especially periodic) sound—since high frequency signals are generated in nature by vibrational modes in bodies which have higher energies, they tend to occur only briefly and die out quickly. In addition to that, most natural objects tend to emit quasi‐periodic sound after a while has passed since the initial excitation. These two facts together mean that, first, upper frequencies and highly inharmonic content tend to concentrate on the transient part of a sound and, second, the following steady‐state portion often becomes rather nondistinctive.
So the steady‐state part is certainly not the best part to look at if source classification is the issue. The other part of the equation is the neural excitation patterns generated by different kinds of signals: transients tend to generate excitation in greater quantities and more unpredictably. Since unpredictability equals entropy equals information, transients tend to have a significant role in conveying useful data. This is seen in another way by observing that periodic sounds leave the timing pathway of the brain practically dead: only spectral information is carried and, as is explained in following sections, spectra are not sensed very precisely by humans. The difference is kind of like that between watching photos and watching a movie. In addition to that, such effects as masking and the inherent normalisation with regard to the surrounding acoustic space greatly limit the precision of spectral reception.
Aside from their important role in classifying sound sources, transient features also serve a complementary role in sound localization. This is most clearly seen in auditory physiology: our brain processes interaural time differences instead of phase differences and has separate circuitry for detecting the onset of sonic events. This means that transient sounds are the easiest to locate. Experiments back this claim: the uncertainty in sound localization is greatest when steady‐state, periodic sounds are used as stimuli.
Until now, we have tacitly assumed that the ear performs like a measuring instrument: if some features are present in a sound, we hear them. In reality, this is hardly the case. As everybody knows, it is often quite difficult to hear and understand speech in a noisy environment. The main source of such blurring is masking, a phenomenon in which energy present in some range of frequencies lessens or even abolishes the sensation of energy in some other range. Masking is a complex phenomenon: it works both ipsilaterally and contralaterally, and masking effects extend both forwards and backwards in time. It is highly relevant to both practical applications (e.g. perceptual compression) and psychoacoustic theory (for instance, in models of consonance and amplitude perception). This also means that masking has been quite thoroughly investigated over the years. The bulk of research into masking involves experiments with sinusoids or narrow‐band noise masking a single sinusoid. Significant amounts of data are available on forward and backward masking as well. It seems most forms of masking can be explained at an extremely low (almost physical) level by considering the time dynamics of the organ of Corti under sonic excitation. This is not the case for contralateral masking, though, and it seems this form of masking stands separately from the others. Currently it is thought that contralateral masking is mediated through the olivo‐cochlear descending pathway by means of direct inhibition of the cochlea in the opposite ear. (Masking like this is called central, whereas normal masking by sound conducted through bone across the skull is called transcranial.)
Masking is a rather straightforward mechanism, which can be studied with relative ease by presenting test signals of different amplitudes and frequencies to test subjects in the presence of a fixed masking signal. The standard way to give the result of such an experiment is to divide the frequency‐amplitude plane into parts according to the effect produced by a test signal with the respective attributes while the mask stays constant. The main feature is the masking threshold which determines the limit below which the masked signal is not heard at all. This curve has a characteristic shape with steep roll‐off below the mask frequency and a much slower, uneven descent above. This means that masking mostly reaches upwards with only little effect on frequencies below that of the mask. At each multiple of the mask frequency we see some dipping because of beating effects with harmonic distortion components of the mask. Above the threshold we see areas of perfect separation, roughness, difference tones and beating.
A lot is known about masking when both the mask and the masked are simple, well‐behaved signals devoid of any time information. But how about sound in general? First we must consider what happens with arbitrary static spectra. In this case one proper (and indeed much used) way is to take masking to be additive. That is, the mask contributions of the individual frequencies are added together to obtain the amount of masking imposed on some fixed frequency.
So additivity is nice. But does it hold in general? Not quite. Since the ear is not exactly linear, some additional frequencies always arise. These are not included in our masking computation and can sometimes make a difference. Also, in the areas where our hearing begins to roll off (very low and very high frequencies), some exceptions to the additivity must be made. Since masking mainly stretches upwards, this is mostly relevant in the low end of the audio spectrum—low pitched sounds do not mask higher ones quite as well as we would expect. Further, beating between different partials of the masking and the masked can sometimes cause additivity to be too strict an assumption. This is why practical calculations sometimes err on the safe side and take maximums instead of sums. This works because removing all content other than the frequency (band) whose masking effect was the greatest will still leave the signal masked. It is proper to expect that putting the rest of the mask back in will not reduce the total masking effect.
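The difference between the additive rule and the more cautious maximum rule can be made concrete with a minimal Python sketch. It is not a calibrated model: the triangular spreading shape, the slope values (in decibels per octave), the choice to add contributions in the power domain and the function names are all assumptions made purely for illustration.

```python
import math

def mask_level_db(masker_hz, masker_db, probe_hz,
                  lower_slope=60.0, upper_slope=20.0):
    """Masking (dB) that one sinusoidal masker imposes at probe_hz.

    Assumed shape: steep roll-off below the masker, slower descent above.
    The slopes (dB per octave) are made-up illustrative values.
    """
    octaves = math.log2(probe_hz / masker_hz)
    slope = upper_slope if octaves >= 0 else lower_slope
    return masker_db - slope * abs(octaves)

def combined_masking_db(maskers, probe_hz, rule="sum"):
    """Combine the contributions of several (frequency, level) maskers."""
    levels = [mask_level_db(f, db, probe_hz) for f, db in maskers]
    if rule == "max":
        # Cautious rule: keep only the single strongest contribution.
        return max(levels)
    # Additive rule: contributions are summed as powers (an assumption).
    return 10.0 * math.log10(sum(10.0 ** (level / 10.0) for level in levels))

maskers = [(500.0, 70.0), (2000.0, 70.0)]
additive = combined_masking_db(maskers, 1000.0, rule="sum")
cautious = combined_masking_db(maskers, 1000.0, rule="max")
```

Under these assumptions the maximum can never exceed the power sum, so a calculation based on it never promises more masking than the additive one would.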
The above discussion concerns steady spectra. In contrast, people hear time
features in sounds as well. So there is still the question of how the masking
effect of a particular sound develops in time. When we study masking effects
with brief tone bursts, we find that masking extends some tens of milliseconds
(often quoted as 50ms) backwards and one to two hundred milliseconds forward in
time. The effect drops approximately exponentially as the temporal separation of
the mask and the masked increases. These results too can be explained by
considering what happens in the basilar membrane of the ear when sonic
excitation is applied—it seems backward and forward
masking, as these are respectively called, are the result of the basilar
membrane’s inherently resonant nature. The damped vibrations set off by sound
waves do not set in or die out abruptly, but instead some temporal integration
is always observed. This same integration is what causes the loudness of very
short sounds to be proportional to their total energy instead of the absolute
amplitude: since it takes some time for the vibration (and, especially, the
characteristic vibrational envelope) to set in, the ear can only measure the
total amount of vibration taking place, and ends up measuring energy across a
wide band of frequencies. Similarly, any variations in the amplitudes of sound
frequencies are smoothed out, leading to the ear having a kind of time
constant which limits its temporal accuracy.
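The asymmetric, roughly exponential decay of masking around a brief masker can be sketched as follows. This is only an illustration: the function name, the decision to let the amount of masking in decibels decay exponentially, and both time constants are assumptions, chosen so that the effect has nearly vanished by about 50ms backwards and 200ms forwards.

```python
import math

def temporal_masking_db(simultaneous_db, offset_ms,
                        backward_tau_ms=10.0, forward_tau_ms=40.0):
    """Masking remaining at offset_ms from the masker.

    offset_ms > 0: the probe follows the masker (forward masking).
    offset_ms < 0: the probe precedes the masker (backward masking).
    Both time constants are hypothetical illustration values.
    """
    tau = forward_tau_ms if offset_ms >= 0 else backward_tau_ms
    return simultaneous_db * math.exp(-abs(offset_ms) / tau)

# Masking left 40 ms after and 40 ms before a masker producing 40 dB of
# simultaneous masking.
after = temporal_masking_db(40.0, 40.0)
before = temporal_masking_db(40.0, -40.0)
```

With these values forward masking is still clearly present at a separation where backward masking has all but disappeared, which is the asymmetry described above.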
Closely tied to masking (and, indeed, many other aspects of human hearing)
are the concepts of critical bandwidth and critical bands.
The critical bandwidth is defined as that width of a noise band beyond which
increasing the bandwidth does not increase the masking effect imposed by the
noise signal upon a sinusoid placed at the center frequency of the band. The
critical bandwidth varies across the spectrum, being approximately one third of
an octave in size, except below 500Hz, where the width is more or less constant at
100Hz. This concept has many uses
and interpretations because in a way, it measures the spectral accuracy of our
ear. Logically enough, a critical band is a frequency band with the width of one
critical bandwidth. Through some analysis of auditory physiology we find that a
critical band roughly corresponds to a constant number of hair cells in the
organ of Corti. In some expositions, critical bands are thought of as having a
fixed center frequency and bandwidth. Although such a view is appealing from an
application standpoint, no physiological evidence of direct banking of
any kind is found in the inner ear or the auditory pathway, so it seems that
this way of thinking is somewhat erroneous. Instead, we should think of critical
bands as giving the size and shape of a minimum discernible spectral unit
of a kind: in measuring the loudness of a particular sound, the amplitude for each
frequency is always averaged over the critical band corresponding to the
frequency. (This amounts to lowpass filtering (i.e. smoothing) of the perceived
spectral envelope.) This effect is illustrated by the fact that people can
rarely discern fluctuations in the spectral envelope of a sound which are less
than one critical band in width.
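The rule of thumb given above for the critical bandwidth is easy to put into code. The sketch below follows the text literally (100Hz below 500Hz, a third of an octave elsewhere), so it has a small jump at 500Hz that real hearing does not. The function names, and the choice to evaluate the bandwidth at the mean of two frequencies, are assumptions.

```python
def critical_bandwidth_hz(center_hz):
    """Rough critical bandwidth: 100 Hz below 500 Hz, a third octave above."""
    if center_hz < 500.0:
        return 100.0
    # A third-octave band reaches one sixth of an octave to either side.
    return center_hz * (2.0 ** (1.0 / 6.0) - 2.0 ** (-1.0 / 6.0))

def within_one_critical_band(freq_a_hz, freq_b_hz):
    """True if two frequencies are closer than one critical bandwidth.

    The bandwidth is taken at their mean frequency (an assumed convention).
    """
    center = (freq_a_hz + freq_b_hz) / 2.0
    return abs(freq_a_hz - freq_b_hz) < critical_bandwidth_hz(center)

narrow = critical_bandwidth_hz(200.0)   # constant region
wide = critical_bandwidth_hz(1000.0)    # third-octave region
```

Because no fixed band edges are involved, the band simply slides along with whatever frequency is being considered, in keeping with the view that critical bands are not a fixed bank of filters.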
Considering the complexity of the analysis taking place in the auditory
pathway, it is no wonder that few parameters of sound signals are translated
directly into perceptually significant measures. This is the case with amplitude
too—the nearest perceptual equivalent, loudness, consists of much
more than a simple translation of signal amplitude. First, the sensitivity of
the human ear is greatly frequency dependent—pronounced sensitivity is found at
the frequencies which are utilized by speech. This is mainly due to
physiological reasons (i.e. the ear canal has its prominent resonance on these
frequencies and the transduction and detection mechanisms cause uneven frequency
response and limit the total range of frequency perception). There are also
significant psychoacoustic phenomena involved. In particular, humans tend to
normalize sounds. This means that the parameters of the acoustic environment we
listen to a sound in are separated from the properties of the sound source. For
instance, we tend to hear sounds with similar energies as being
of unequal loudness if our brain concludes that they are coming from differing
distances. Further, such phenomena as masking can cause significant parts of a
sound to be shielded from us, effectively reducing the perceived
loudness. We also follow very subtle cues in sounds to deduce the source
parameters. One example is the fact that a sound with significant high frequency
content usually has a higher perceived loudness than a sound with similar
amplitude and energy but less high end. This is a learned association—we know
that usually objects emit higher frequencies if they are excited more
vigorously. Phenomena such as these are of great value to synthesists since they
allow us to use simple mathematical constructs (such as low order lowpass
filters) to create perceptually plausible synthesized instruments. On the other
hand, they tend to greatly complicate analysis.
If we take a typical, simple and single sound and look at its loudness, we can often neglect most of the complicated mechanisms of perception and look directly at the physical parameters of the sound. Especially, this is the case with sinusoids since they have no spectral content apart from their frequency. Thus most of the theory of loudness perception is formulated in terms of pure sine waves at different frequencies. It is mostly this theory that I will outline in the remainder of this section.
Figure 1 Equiphon contours for the range of human hearing in a free field experiment, according to Robinson and Dadson. At 1kHz, the phon contours correspond to the decibels. All sinusoids on the same contour (identified by sound pressure level and frequency) appear to have identical loudness to a human listener. It is seen that the dynamic range and threshold of hearing are worst in the low frequency end of the spectrum. Also, it is quite evident that at high sound pressure levels, less dependency on frequency is observed (i.e. the upper contours are flatter than the lower ones).
Decibels are nice but they have two problems: they do not take the frequency of the signal into account and also show poor correspondence with perceived loudness at low SPLs. The puzzle is solved in two steps. First, we construct a scale where the frequency dependency is taken into account. This is done by picking a reference frequency (1kHz, since this is where we defined the zero level for SPLs) and then examining how intense sounds at different frequencies need to be to achieve loudness similar to their 1kHz counterparts. After that we connect sounds with similar loudnesses across frequencies. The resulting curves are called equiphon contours and are shown in the graph from Robinson and Dadson. We get a new unit, the phon, which tells loudness in terms of SPL at 1kHz. That 1kHz is the reference point shows in that there the resulting decibel to phon mapping is an identity. Elsewhere we see the frequency dependency of hearing: following the 60 phon contour, we see that to get the same loudness which results from presenting a 1kHz, 60dB SPL sine wave, we must use a 90dB SPL sine wave at 30Hz or 55dB sine wave at 4kHz. We also see that the higher the sound pressure level, the less loudness depends on frequency (the isophon contours are straighter in the upper portion of the picture).
The phon is not an absolute unit: it presents loudness relative to the
loudness at 1kHz. Knowing the
phons, we cannot say one sound is twice as loud as another one—this would be
like saying that a five star hotel is five times better than a one star motel,
i.e. senseless. Instead, we would wish an absolute perceptual unit. All that
remains to be done is to get the phons at some frequency (preferably at
1kHz since the SPL‐to‐phon mapping is
simplest there) to match our perception. This is done by defining yet another
unit, the sone. When this is accomplished, we can first use the
equiphon contours to map any SPL to its equivalent loudness in phons at 1kHz and then the mapping to sones to get
a measure of absolute loudness. The other way around, if we want a certain
amount of sones, we first get the amount of phons at 1kHz and then move along the equiphon contours to get the
amount of decibels (SPL) at the desired frequency. Experimentally we get a
power law between loudness and sound pressure: at 1kHz, the loudness in sones
obeys a power function of sound pressure with an exponent of about 0.6, 40
phons being equal to 1 sone. Above 40 phons this amounts to the sones roughly
doubling for every 10 phon increase. (Near 0 phons, that is at the threshold of
hearing, the simple power law breaks down and loudness falls off toward 0
sones.) This way, at low levels small changes in sones correspond to large
differences in phons, while at high levels the same change in sones spans
only a fraction of a phon. In effect, at low levels a
perceptually uniform volume slider works real fast in decibels while at higher
levels, it settles into a plain power law in amplitude.
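For levels above about 40 phons, the conversion between phons and sones at 1kHz can be sketched in a few lines of Python. The sketch reads the exponent of 0.6 as applying to sound pressure: 10 phons is a factor of 10^0.5 in pressure, and raising that to the power 0.6 gives very nearly 2. That reading, the function names and the restriction to moderate and high levels are assumptions; near the threshold of hearing the rule is not valid.

```python
import math

def phons_to_sones(phons):
    """Rule of thumb: 1 sone at 40 phons, doubling for every 10 phons.

    Only meant for levels of roughly 40 phons and up.
    """
    return 2.0 ** ((phons - 40.0) / 10.0)

def sones_to_phons(sones):
    """Inverse of phons_to_sones; sones must be positive."""
    return 40.0 + 10.0 * math.log2(sones)

reference = phons_to_sones(40.0)   # the defining point of the sone scale
louder = phons_to_sones(60.0)      # 20 phons up, i.e. two doublings
```

Going the other way, a sound meant to be twice as loud as one at 60 phons should be presented at sones_to_phons(2 * phons_to_sones(60.0)), that is, 10 phons higher.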
All the previous development assumes that the sounds are steady and of considerable duration. If we experiment with exceedingly short stimuli, different results emerge. Namely, we observe considerable clouding of accuracy in the percepts and signs of temporal integration. This means that as we go to very short sounds and finally impulses which approach or are below the temporal resolution of the organ of Corti, the total energy in the sound becomes the dominant measure of loudness. At the same time, loudness resolution degrades so that only a few separate levels of loudness can be distinguished. Similarly, when dealing with sound stimuli, the presence of transients becomes an important factor in determining the lower threshold of hearing: transient content (e.g. rapid onset of sinusoidal inputs and fluctuation in the amplitude envelopes) tends to lower the threshold while at the same time clouding the reception of steady‐state loudness.
Finally, a few words must be said about the loudness of complex sounds. As was explained in the previous section, sinusoidal sounds close to each other tend to mask one another. If the sounds are far enough from one another (more than one critical bandwidth apart) and the higher one is sufficiently loud, they are heard as separate and contribute separately to loudness. In this case sones are roughly added. Since masking is most pronounced in the upward direction, a sound affects the perception of lower frequencies considerably less than that of higher ones, so in a sufficiently rapidly decaying spectrum the lower partials dominate loudness perception. Also, sinusoids closer than the critical bandwidth are merged by hearing so their contribution to loudness is less than the sum of their separate contributions. The same applies to narrow‐band (bandwidth less than the critical bandwidth) noise. If beating is produced, it may, depending on its frequency, increase, decrease or blur perceived loudness. Similarly, harmonics (whether actually present or born in the ear) of low frequency tones and the presence of transients may aid in the perception of the fundamental, thus affecting the audibility of real life musical tones as compared to the sine waves used in the construction of the above equiphon graph.
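The rough addition of sones for well separated partials can be illustrated with a small Python sketch. It assumes that every partial sits in its own critical band, that mutual masking can be ignored and that the usual rule of thumb (1 sone at 40 phons, doubling every 10 phons) is adequate. The function names are hypothetical.

```python
import math

def phons_to_sones(phons):
    """Rule of thumb, assumed valid above about 40 phons."""
    return 2.0 ** ((phons - 40.0) / 10.0)

def sones_to_phons(sones):
    return 40.0 + 10.0 * math.log2(sones)

def total_loudness_phons(partial_phons):
    """Loudness level of partials assumed to lie in separate critical bands."""
    total_sones = sum(phons_to_sones(p) for p in partial_phons)
    return sones_to_phons(total_sones)

# Two widely separated sinusoids, each at a loudness level of 60 phons.
two_tones = total_loudness_phons([60.0, 60.0])
```

Here the two 60 phon tones add up to 8 sones, or 70 phons. Two equally intense sinusoids falling inside a single critical band would combine to less than this, since they are merged by hearing rather than counted separately.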
For signals with continuous spectra (such as wideband noise), models of loudness perception are almost always heavily computational. They usually utilize a filterbank analysis followed by conversion into the Bark scale, a masking simulation and averaging. Wideband signals also have the problem of not exactly following the conversion rules for decibels, phons and sones: white noise, for instance, tends to be heard as relatively too loud if its SPL is small and too quiet if its SPL is high.
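The conversion into the Bark scale is usually done with a closed form approximation. The sketch below uses one widely quoted formula, commonly attributed to Zwicker and Terhardt. The text does not say which approximation is meant, so this choice and the function name are assumptions.

```python
import math

def hz_to_bark(freq_hz):
    """Approximate critical band rate (in Bark) for a frequency in Hz."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

# Critical band rates for a few frequencies across the audio range.
rates = [hz_to_bark(f) for f in (100.0, 1000.0, 10000.0)]
```

One Bark corresponds to one critical bandwidth, so the difference between the Bark values of two frequencies tells how many critical bands lie between them.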
From looking at the structure of our auditory system, it seems like quite considerable machinery is assigned to temporal processing. Furthermore, it seems like time plays an important role in every aspect of auditory perception—even more so than in the context of the other senses. This is to be expected, of course: sound as we perceive it has few degrees of freedom in addition to time.
The importance of time processing shows in the fact that it starts at an extremely early stage in the auditory pathway, namely, in the cochlear nuclei. The bushy cells mentioned earlier seem to be responsible for detecting the onset of sounds at different frequency ranges. Excitation of the bushy cells elicits a phasic response (onset produces a response, continued excitation does not) as opposed to the tonic (continued excitation produces a continued response) pattern most often observed higher up the auditory tract. This way, the higher stages of auditory processing receive a more event centric view of sound as opposed to the flow‐like, tonic patterns of the lower auditory pathway. The pauser cells may be responsible for detecting sound offsets. This way sound energy in different frequency bands is segregated into time‐limited events. This time information is what drives most of our auditory reflexes, such as startle, orientation and protective reflexes. As such, it hardly comes as a surprise that heavy connections to the reticular formation (which controls arousal and motivation, amongst other things) are observed throughout the auditory pathway.
In other animals, and especially those which rely heavily on hearing to survive (e.g. bats, whales and owls), specialized cells which extract certain temporal features from sound stimuli have been found. For instance, in bats certain speed ranges of frequency sweeps are mapped laterally on the auditory cortex Kan01. This makes it possible for the bat to use Doppler shifts to correctly echolocate approaching obstacles and possible prey. These cells are very selective—they respond best to sounds which nearly approximate the squeals sent by the bat, excepting the frequency shift. This leads to good noise immunity. Cells similarly sensitive to certain modulation effects have been found in almost all mammals and there is some evidence people are no exception. For instance, amplitude modulation in the range ?? to ??Hz displays high affinity for a group of cells in the ???????. TEMP!!! Also, the nonlinearity of the organ of Corti makes AM appear in the neural input to the cochlear nuclei as‐is. Mechanisms like these may be what makes it possible for us to follow rapid melodies, rhythmic lines and the prosody of speech without difficulty. It is also probable that they serve a role in helping separate phonemes from each other when they follow in rapid succession. Without such detection mechanisms it is quite difficult to see how consonants are so clearly perceived by the starting frequencies and the relative motion of the partials present. These mechanisms may even be lent to the interpretation of formant envelopes (and, thus, the discrimination of vowels) through the minute amplitude fluctuations in the partials of a given speech sound. (As was discussed in section 4.6, such flutter is caused by involuntary random vibrato in the period of the glottal excitation pulse train.)
The importance of time features has been heavily stressed above. However, we have yet to discuss quantitatively the sensitivity of our ears to nonstationary signals. One reason for deferring the issue until now is that it is not entirely clear what we mean by it. We would like some objective measure of the time sensitivity of the ear, in a sense, a time constant. Some of the more important temporal measures are the time required to detect a gap in some sound signal, the time taken before two overlapped sonic events can be heard as being separate, the mean repetition rate at which a recurring sonic event fuses into a coherent, single whole and the rate at which a masking effect at a certain frequency builds up when the mask is applied or evaporates after the mask is gone. The first hints at a discrimination test, the second is clearly a matter of categorical perception and multidimensional study and from the third on we walk in the regime of continuous temporal integration.
The time required to hear a sonic gap varies somewhat over the audio bandwidth. For a first order approximation, we might say that to effect a discontinuous percept, we need some constant number of periods of silence. But looking a bit closer, this number also depends on the amplitude and timbral composition of the sound. Voice band sinusoids are probably the easiest, complex sounds with lots of noise content and expected behavior the most complicated. In the context of rich spectra, temporal smearing of over 50ms can occur. With a 1kHz 80dB sinusoid, a gap of 3‐4 cycles (that is, 3‐4ms) is enough to effect a discontinuous percept. Often sounds overlaid with expectations (such as a continuously ascending sinusoid) lend themselves to a sort of perceptual extrapolation: even if the percept is broken by wideband noise, our auditory system tries to fill the gap and we may well hear the sound continue through the pause. The effect is even more pronounced when a sound is masked by another one. This will be discussed further down, in connection with the pattern recognition aspects of hearing.
The minimum length of audible gaps is one but only one measure of the ear’s time resolution. In fact, it is a very simplistic one. Another common way to describe the resolution is to model our time perception through a kind of lowpass filtering (integration or blurring) operation. In this case, we try to determine the time constant of the ear. The time constant of this conceptual filter is then used to predict whether two time adjacent phenomena are heard as separate or if they are fused into one. The first thing we notice is that the time constant varies for different frequency ranges.
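The lowpass view of time resolution can be tried out with a one‐pole smoothing filter applied to an amplitude envelope. The sketch below is only an illustration of the idea: the text leaves the actual time constant open, so the value of 10ms used here is purely hypothetical, as are the function names and the sample rate.

```python
import math

def smooth_envelope(envelope, sample_rate_hz, tau_ms):
    """One-pole lowpass (leaky integrator) over an amplitude envelope."""
    samples_per_tau = sample_rate_hz * tau_ms / 1000.0
    alpha = 1.0 - math.exp(-1.0 / samples_per_tau)
    state = envelope[0]
    smoothed = []
    for value in envelope:
        state += alpha * (value - state)
        smoothed.append(state)
    return smoothed

def gap_depth(gap_ms, tau_ms, sample_rate_hz=10000):
    """Lowest smoothed level of a steady sound interrupted by a silent gap."""
    gap_samples = int(round(gap_ms * sample_rate_hz / 1000.0))
    steady = [1.0] * 100
    envelope = steady + [0.0] * gap_samples + steady
    return min(smooth_envelope(envelope, sample_rate_hz, tau_ms))

# tau_ms=10.0 is a made-up time constant, not a measured one.
short_gap = gap_depth(3.0, tau_ms=10.0)
long_gap = gap_depth(30.0, tau_ms=10.0)
```

With this filter the smoothed level at the end of a gap is exp(-gap/tau) of the steady level, so the short gap leaves only a shallow dip while the long one nearly empties the integrator. A detection model would then compare the dip against some threshold, which would be yet another assumption.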
‐so what’s the value?
When we look at simple stimuli, we get some nice, consistent measures of the
temporal behavior of our hearing. But as always, when higher level phenomena are
considered as well, things become complicated. It seems that the so called
psychological time is a very complicated beast. For instance, it has
been shown that dichotic hearing can precipitate significant difference in
perceived time spans as compared to listening to the same material
monophonically. In an experiment in which a series of equidistant pulse sounds
were presented at different speeds and relative amplitudes via two headphones,
it was possible to fool test subjects into estimating the tempo of the pulse train
to be anywhere between the actual tempo and its duplicate, on a continuous
scale. This means that the phenomenon isn’t so much a question of
locking onto the sound in a particular fashion (hearing every other
pulse, for instance) but rather a genuine phenomenon of our time perception.
This experiment has a partial explanation in the theory of auditory perception,
which states that the processing of segregated streams of sound (in this case,
the trains of clicks in the two ears) is mostly independent but that the degree
of independence depends on the strength of separation of the streams. This
disjoint processing can give rise to some rather unintuitive effects. First of
all, time no longer has the easy, linear structure the Western world attributes
to it: segregated streams all more or less have their own, linear time. The
implication is that time phenomena which are strongly segregated are largely
incommensurate. This is demonstrated by the fact that a short gap within a
sentence played to test subjects is surprisingly difficult to place within the
sentence afterwards: the subjects know that there was a gap (and even what
alternative material was possibly played in the gap) but cannot place the gap
with any certainty (as in “it was after the word ‘is’”). In effect, the order
of time events has gone from total (a common linear scale on which everything
can be compared) to partial (there are incommensurate events which cannot be
placed with respect to each other).
Further complicating the equation, we know that to some degree our perception of rhythm and time is relative. The traditional point of comparison is the individual’s heartbeat but the relative state of arousal (i.e. whether we are just about to go to sleep or hyperaroused by a fight‐or‐flight reaction) probably has an even more pronounced effect. This may in part explain why certain genres of music are mostly listened to at certain times of the day. A fun experiment in relative time perception is to listen to some pitched music with a regular time while yawning, dozing off or…getting high. All these should cause profound distortions in both time and pitch quite like they do with the general state of arousal of a person.
‐Vesa Valimaki’s work on time masking etc.
‐volley theory (especially in the low register)
‐virtual vs. real pitch
‐nonlinearity/missing fundamental problem
‐spectral pitch (place theory interpretation for acute tones)
‐formants/spectral envelope
‐phase difference (grave)
‐amplitude gradient (acute)
‐indetermination in between registers
‐amplitude envelopes important (acute)
‐connection to the concept of group delay
‐relative reverb/early reflections as size/distance cues
‐the poor performance of generic computational models as proof of the
acuity of these processes
‐common features
‐occlusion
‐the old+new heuristic
‐layers: neurological and cognitive
‐attention: effects on both layers/selection
‐orientation (reflexes+attention)
‐pattern recognition
‐what’s here?
‐perceptual time
‐e.g. dichotic clicks seem slower than the same sequence when presented monaurally
‐relation to state of arousal, heartbeat and other natural timekeepers
‐vertical vs. horizontal integration
‐competition between integration and segregation
‐this is a typical application of the Gestalt type field rules
‐formants
‐spectral envelopes
‐temporal processing
‐e.g. the genesis of granular textures
‐connections to fusion; relevance of vibrato/dynamic envelopes for fusion
‐e.g. fusion of separately introduced sinusoids upon the introduction of a
common frequency/amplitude modulator, and its converse when the commonality
no longer holds
‐the importance of attacks and transients
‐spectral splashing
‐information carrying capacity of transients (no steady‐state
vibration…)
‐indetermination in periodic timbre
‐ergo, place theory/formant perception et cetera is not very accurate,
whereas volley theory/temporal processing seems to be
‐multidimensionality
‐i.e. it is very difficult to characterize/measure timbre
‐there have been attempts
‐for instance, for steady‐state spectra with origin in orchestral
instruments, we seem to get three dimensions via
PCA/FA
‐most of these attempts do not concern temporal phenomena (the
overemphasis on Fourier, mentioned earlier)
‐this sort of theory is based on extremely simplified sounds and
test setups
‐connection to masking (especially in composite signals)
‐phase has little effect
‐except in higher partials and granular/percussive stuff
‐i.e. steady‐state is again overemphasized in traditional expositions
‐timbre is not well defined (Bregman: “wastebasket”)
‐head turning as a localisation cue
‐we continuously extract spatial information based not only on an open‐loop interpretation of what is heard, but on a closed‐loop one of what happens when we change the acoustic conditions
‐the McGurk effect
‐that is, seeing someone talk can change the interpretation of the same auditory input
‐what can be learned?
‐apparently a lot!
‐lateralization implies invariance/hardwiring?
‐or just that there is a typical dynamic balance arising from the common underlying circuitry?
‐is plasticity the norm?
‐what features in sound prompt specific invariant organizational patterns?
‐evolution and development of audition
‐population variations
‐esp. the Japanese peculiarities in lateralization!
‐hearing under the noise floor
‐the effects of ultrasonic content on directional hearing
‐esp. sursound discussions on transient localization
‐the idea that bandlimitation (and the inherent ringing, and especially
pre‐echos, it produces) fools our time resolution circuitry
‐this is a nasty idea, because we cannot hear ultrasonic content, per se
‐it implies that spatial hearing is inherently non‐linear
‐it does not imply that all ultrasonic content has to be stored
‐instead it would mean that we might have to consider some nonlinear
storage format, which only helps store transients more accurately
‐thoughts on why dither might not help this situation, even if it makes
the average temporal resolution of an audio system approach infinity
‐overcomplete analysis and superresolution of sounds (Michael
Gerzon’s unpublished work?)
‐inherent nonlinearity in hearing (computational models of microcilia!)
‐used to explain difference tones, perception of harmonic/near‐harmonic
spectra, missing fundamentals(, what else?)
‐levels of pattern recognition (learned vs. intrinsic)
‐comodulation masking release and profile perception as signs of cross
frequency band processing at a low level
‐the consequent refutation of strictly tonotopic place theories of pitch
etc.
‐envelopment and externalization through decorrelation
‐frequency ranges?
© 1996–2004 Sampo Syreeni; 2004–10–17; ♻ PD