<body xml:lang="en" class="draft">
<h1>Hearing, physiological and psychological aspects of</h1>
<div class="intro">
Before going further into actual sound system issues and mathematics, it
is important to know the significance of human sound perception. The aim
in this section is to shed some light on the physiological, neuropsychological
and cognitive mechanisms which take part in our hearing of sound. Mostly,
it consists of a brief treatment of the field of psychoacoustics,
although an amount of physics, human anatomy and relevant psychology is
explained as well—all these have an important role in explaining what
sound is to us. This section is a long one, the reason being that not
many things in sound and music are interesting to humans outside the
context of how we hear and interpret what was heard. Were it not for our
sense of hearing and all the cognitive processing that goes along with
it, the physical phenomenon of sound would be only just that—a
physical phenomenon. Not many would have any interest in such a thing.
Knowledge of how we hear and why is thus paramount to understanding the
relevance of the many algorithms, mathematical constructs and the
general discipline of audio signal processing we encounter further on.
This understanding helps explain why some synthesis methods are
preferred over others, what it is that separates pleasant, harmonious
music from horrifying noise, what it is that comprises the pitch, timbre
and loudness of an instrument, what makes some sounds especially
<q>natural</q> or <q>fat</q>, where the characteristic sound of some
particular brand of equipment comes from and what assumptions and
simplifications can be made in storing, producing and modifying sound
signals. Basic knowledge of psychoacoustics can also help avoid some
of the common pitfalls in composition and sound processing and suggest
genuine extensions to one’s palette of musical expression.

<h2>What is psychoacoustics all about?</h2>
To put it shortly, <dfn>psychoacoustics</dfn> is the field of science
which studies how we perceive sound and extract meaningful data from
acoustical signals. It concerns itself primarily with low level functions
of the auditory system and thus doesn’t much overlap with the study of
music or æsthetics. Basic psychoacoustical research is mainly directed
toward such topics as directional hearing, pitch, timbre and loudness
perception, auditory scene analysis (the separation of sound sources
and acoustical parameters from sound signals) and related lower functions,
such as the workings of our ears, neural coding of auditory signals, the
mechanisms of interaction between multiple simultaneously heard sound
sources, neural pathways from ears to the auditory cortex, their
development and the role of evolution in the development of hearing.
Psychoacoustical research has resulted in an enormous amount of data
which can readily be applied to sound compression, representation,
production and processing, musicology, machine hearing, speech
recognition and composition, to give just a few examples. The reason why
such a long part of this document is devoted to psychoacoustics, is that
although one can understand sound synthesis and effects fairly well just
by grasping the relevant mathematics, one cannot truly get a hold of
their underlying principles, shortcomings and precise mechanisms of
action before considering how the resulting sound is heard. Human
auditory perception is a rather quirky and complicated beast—it often
happens that sheer intuition simply doesn’t cut it.

<h2>The structure and function of the ear</h2>
There are three main parts in the human ear: outer, middle and inner ear.
The outer ears include ear lobes (<dfn>pinnæ</dfn>) and the ear canals.
Between outer and middle ear, the <dfn>tympanic membrane</dfn> (or
<dfn>eardrum</dfn>) resides. The middle ear is a cavity in which three
bones (called <dfn>malleus</dfn>, <dfn>incus</dfn> and <dfn>stapes</dfn>)
reside. Malleus is attached to the tympanic membrane, stapes to the
oval window which separates the inner and middle ears. Incus connects
these two. The three bones (collectively called <dfn>ossicles</dfn>)
form a kind of lever which transmits vibrations from the tympanic
membrane to the fluid filled inner ear, providing an impedance match
between the easily compressed air in the outer ear and the noncompressible
fluid in the inner. To these bones the smallest muscles in our body,
<dfn>the middle ear muscles</dfn>, are attached. They serve to dampen
the vibration of the ossicles when high sound pressures are met, thereby
protecting the inner ear from excessive vibration. The inner ear is
composed of the <dfn>cochlea</dfn>, a little, bony, seashell shaped
apparatus which eventually senses sound waves, and the vestibular
apparatus. All these structures are incredibly small—the ear canal
measures about 3 centimeters in length and about one half in diameter,
the middle ear is about 2 cubic centimeters in volume and the cochlea
is, when rolled straight, about 35 millimeters in length and 2
millimeters in diameter.
In the cochlea, we find an even smaller level of detail. The cochlea is
divided into three longitudinal compartments, <dfn>scala vestibuli</dfn>,
<dfn>scala tympani</dfn> and <dfn>scala media</dfn>. The first two are
connected through the <dfn>apex</dfn> in the far end of the cochlea, the
middle one is separate from the others. The vibrations from the middle
ear reach the cochlea through the oval window which resides in the outer
end of scala vestibuli. In the outer end of scala tympani, the <dfn>round
window</dfn> connects the cochlea to the middle ear for the second time.
Vibrations originate from the oval window, set the intermediate membranes
(<dfn>Reissner’s membrane</dfn> between scala vestibuli and scala media,
the <dfn>basilar membrane</dfn> between scala media and scala tympani)
in movement and get damped upon reaching the round window. On the floor
of scala media, under the <dfn>tectorial membrane</dfn>, lies the
<dfn>organ of Corti</dfn>. This is where neural impulses associated with
sound reception get generated. From the bottom of the organ of Corti,
the auditory nerve emanates, headed for the auditory nuclei of the brain.
The organ of Corti is the focal point of attention in many basic
psychoacoustical studies. It is a complex organ, so we will have to
simplify its operation a bit. For a more complete description, see
<a href="dsndr01#refkan01"><cite>Kan01</cite></a>. That book is also a good general reference
on neural structures. On top of the Corti organ, two lines of hair cells,
covered with <dfn>stereocilia</dfn> (small hairs of actin filaments which
sense movement) stand. On the outside of the cochlea is the triple line
of <dfn>outer hair cells</dfn>, on the inside the single line of
<dfn>inner hair cells</dfn>. The other ends of the stereocilia are
embedded in the overlying tectorial membrane. This arrangement means
that whenever the basilar membrane twitches, the stereocilia get bent
between it and the tectorial membrane. Obviously, the pressure changes
in scala vestibuli result in just such action, which means that sound
results in bent stereocilia. This in turn leads to neural impulses being
generated. These are led by the afferent auditory nerves towards the
brain. The inner and outer hair cells are innervated rather
differently: it seems that the inner ones are mainly
associated with louder and the outer with quieter sounds (see <a
href="dsndr01#refgol01"><cite>Gol01</cite></a>). Also, some efferent innervation reaches
the outer hair cells, so it is conceivable that the ear may adapt
under neural control, possibly to aid in selective attention
(<a href="dsndr01#refkan01"><cite>Kan01</cite></a>).

<h2>From air to brain—spectral analysis, tonotopic organization and time domain coding</h2>
By now, the basic function of the ear should be quite clear. However,
nothing has been said about how the ear codes the signals. It is well
known that neurons cannot fire at rates exceeding 500‐1000<acronym title="Hertz" xml:lang="en">Hz</acronym>. Neurons
also primarily operate on binary pulses (action potentials)—there
either is a pulse or there is not. Direct encoding of waveforms does not
come into question, then. And how about amplitude? To answer these
questions, something more has to be said about the structure of the cochlea.
When we look more closely at the large scale structure of the organ of
Corti, we see a few interesting things. First, the width of the basilar
membrane varies over the length of the cochlea. Near the windows, the
membrane is quite narrow whereas near the apex, the membrane is quite
a bit wider. Similarly, the thickness and stiffness vary—near the
windows they are considerable whereas near the apex they are much less.
And the same kind of variation is repeated in the hair cells and their
stereocilia—near the apex, longer, more flexible hairs prevail over
the stiffer, shorter stereocilia of the base of the cochlea. All this
has a serious impact on the vibrations caused in the Corti organ by
sound—vibrations of higher frequency tend to cause response mainly
near the windows where the characteristic vibrational frequency of the
basilar membrane is higher, lower frequencies primarily excite the hair
cells near the apex. This means that the organ of Corti performs physical
frequency separation on the sound. This separation is further amplified
by the varying electrical properties of the hair cells, which seem to
make the cells more prone to excitation at specific frequencies. All in
all, the ear has an impressive apparatus for filter bank analysis. From
the inner ear, this frequency decomposed information is moved by the
auditory nerve. The nerve fibers are also sorted by frequency, a pattern
repeated all over the following neural structures. This is called
<dfn>tonotopic organization</dfn>.
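The place-to-frequency mapping described above is often approximated by Greenwood's function. The sketch below is only an illustration: the constants (A = 165.4, a = 2.1, k = 0.88, with position given as a fraction of the membrane length measured from the apex) are the commonly quoted human values and are an assumption brought in here, not figures taken from this text. The function name is likewise hypothetical.

```python
def greenwood_frequency(x, A=165.4, a=2.1, k=0.88):
    """Approximate characteristic frequency in Hz at relative position x.

    x = 0.0 is the apex (wide, flexible membrane, low frequencies),
    x = 1.0 is the base near the windows (narrow, stiff, high frequencies).
    The default constants are assumed textbook values for the human cochlea.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie between 0 (apex) and 1 (base)")
    return A * (10 ** (a * x) - k)


# Tabulate a few places along the membrane, from apex to base.
for position in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"x = {position:4.2f}  ->  {greenwood_frequency(position):9.1f} Hz")
```

The mapping is close to logarithmic over most of its range, so equal distances along the membrane correspond roughly to equal frequency ratios.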
When sound amplitude varies, the information should be coded somehow.
This is an area of study which is still going strong. This means that
a complete description cannot be given here, but most relevant points
are mentioned, anyhow. One mechanism of coding involves the relative
firing frequencies and amplitudes of the individual neurons—the more
excitation there is, the more there is neural activity in the relevant
auditory neurons. Since frequency information is carried mainly by
tonotopic mapping of the neurons, this doesn’t pose a problem of data
integrity. A second mechanism which seems to augment the transmission
is based on the fact that as louder sounds impinge upon the ear, the
width of resonance on the basilar membrane increases. This may cloud the
perception of nearby frequencies but can also be used to deduce the
amplitude of the dominant component. The efferent innervation to the
outer hair cells and afferent axons from the inner hair cells also seem
to play a part in loudness perception—there is evidence suggesting
that the ear can adapt to loud sounds, keeping the ranges in check.
<div class="sidebar">
Secondary codings are interesting since the dynamic range of neurons is
on the order of 40<acronym title="Decibel" xml:lang="en">dB</acronym>, significantly less than the range of human
hearing, which is often in excess of 120<acronym title="Decibel" xml:lang="en">dB</acronym>.
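To get a feel for the gap between these two figures, the decibel values can be converted to plain amplitude ratios. This is a small arithmetic sketch; it assumes the figures refer to sound pressure (amplitude) levels, so that 20 dB corresponds to a factor of ten.

```python
def db_to_amplitude_ratio(db):
    """Convert a level difference in dB to an amplitude (pressure) ratio."""
    return 10 ** (db / 20)


neuron_range = db_to_amplitude_ratio(40)     # single neuron
hearing_range = db_to_amplitude_ratio(120)   # hearing as a whole

print(f"neuron:  {neuron_range:,.0f} : 1")
print(f"hearing: {hearing_range:,.0f} : 1")
print(f"shortfall factor: {hearing_range / neuron_range:,.0f}")
```

Under that assumption a single neuron covers a ratio of about a hundred to one, while hearing as a whole spans about a million to one.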
So now we have a rough picture of how amplitude and frequency content
are carried over the auditory nerve. How about time information, then?
Considering the high time accuracy of our hearing (in the order of
milliseconds, at best), mere large scale time variation in neural
activity (governed by the many resonating structures on the signal path
and the inherent limitation on neuron firing rate) does not seem to
explain everything. When investigating this mystery, researchers ran
into an interesting phenomenon, namely, <dfn>phase locking</dfn>, which
has also served as a complementary explanation to high frequency pitch
perception. It seems that hair cells, in addition to firing more often
when heavily excited, tend to fire at specific points of the vibratory
motion. This means that the firings of multiple neurons, although mutually
asynchronous, concentrate at a specific point of the vibratory motion.
This phenomenon has been experimentally demonstrated for frequencies as
high as 8<acronym title="kiloHertz" xml:lang="en">kHz</acronym>. It is conceivable, then, that many neurons working in
conjunction could directly carry significantly higher frequencies than
their maximum firing rate would at first sight suggest. This has been
confirmed experimentally, as has its role in conveying accurate
phase information to the brain (this is important in measuring interaural
phase differences and, consequently, plays a big part in directional
hearing). It also serves as a basis to modern theories of pitch perception
through what is called <dfn>periodicity pitch</dfn>, pitch determination
through the period of a sound signal. The idea of concerted action of
phase locked neurons carrying frequency information is called the
<dfn>volley principle</dfn> and augments the frequency analysis
(<dfn>place principle</dfn>) interpretation introduced above. This time
domain coding is extremely important because it seems that accurate
frequency discrimination cannot be explained without it—place codings
display a serious lack of frequency selectivity, even after considerable
neural processing and enhancement.
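The volley principle can be illustrated with a toy simulation. The sketch below is not a physiological model: the population size, the firing probability and the 500 Hz ceiling on a single neuron are all assumed values chosen for illustration. Each simulated neuron fires only at one fixed point of the vibratory cycle and must then stay silent for a while, yet the pooled spikes of the population mark nearly every cycle of a 2 kHz tone.

```python
import random

TONE_HZ = 2000.0       # stimulus frequency
MAX_RATE_HZ = 500.0    # assumed ceiling on one neuron's firing rate
N_NEURONS = 40         # assumed population size
N_CYCLES = 2000        # one second of the 2 kHz tone
FIRE_PROB = 0.3        # assumed chance of firing on an eligible cycle


def simulate_neuron(rng):
    """Return the cycle indices on which one phase-locked neuron fires."""
    min_gap = int(TONE_HZ / MAX_RATE_HZ)   # cycles between spikes (4 here)
    spikes, next_allowed = [], 0
    for cycle in range(N_CYCLES):
        # A spike can only occur at the start of a cycle (phase locking)
        # and only once the refractory gap has passed.
        if cycle >= next_allowed and rng.random() < FIRE_PROB:
            spikes.append(cycle)
            next_allowed = cycle + min_gap
    return spikes


rng = random.Random(0)
population = [simulate_neuron(rng) for _ in range(N_NEURONS)]

duration_s = N_CYCLES / TONE_HZ
max_single_rate = max(len(spikes) for spikes in population) / duration_s
coverage = len(set().union(*population)) / N_CYCLES

print(f"fastest single neuron: {max_single_rate:.0f} spikes/s")
print(f"fraction of tone cycles marked by the population: {coverage:.3f}")
```

No single neuron exceeds its ceiling, but because every spike falls on a cycle boundary, the intervals in the pooled spike train are multiples of the tone's period, which is the information a periodicity detector would need.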
<div class="sidebar">
Furthermore, it is now known that the hair cells of the inner ear only
react to vibrations in one direction, i.e. during one halfwave of a
periodic vibration. This is an intrinsically nonlinear mechanism. When
more than one frequency is present, this mode of response offers a highly
elegant explanation to the phenomenon of <dfn>difference tones</dfn>,
hybridization products which arise during the detection process of complex
sounds. Earlier these were simply explained as resulting from the action
of nonlinearities in the middle ear at loud volumes. Now it seems that
that explanation does not really cover everything. Further, problems with
the so called <dfn>missing fundamental</dfn> type tones (periodic sounds
which take on the pitch of the first harmonic, even if the first (few)
harmonic(s) is (are) not actually present) seem to benefit greatly from
analyses which take into account this type of nonlinearity in the ear.
See <a href="dsndr01#refwal01"><cite>Wal01</cite></a> and section 4.10 on pitch perception.
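The effect of one-directional response can be checked numerically. The sketch below half-wave rectifies the sum of two sinusoids and measures the amplitude at their difference frequency. The 1000 Hz and 1200 Hz components, the sample rate and the helper name are arbitrary choices made for illustration; rectification here is a crude stand-in for the hair cell response, not a model of it.

```python
import cmath
import math

FS = 8000    # sample rate in Hz (assumed)
N = 8000     # one second of signal
F1, F2 = 1000.0, 1200.0   # the two components actually present


def amplitude_at(signal, freq_hz):
    """Amplitude of the component at freq_hz, from a single-bin DFT."""
    acc = sum(x * cmath.exp(-2j * math.pi * freq_hz * n / FS)
              for n, x in enumerate(signal))
    return 2.0 * abs(acc) / len(signal)


tone = [math.sin(2 * math.pi * F1 * n / FS) + math.sin(2 * math.pi * F2 * n / FS)
        for n in range(N)]
# Keep only one halfwave of the vibration.
rectified = [max(x, 0.0) for x in tone]

difference_hz = F2 - F1
before = amplitude_at(tone, difference_hz)
after = amplitude_at(rectified, difference_hz)

print(f"amplitude at {difference_hz:.0f} Hz before rectification: {before:.6f}")
print(f"amplitude at {difference_hz:.0f} Hz after rectification:  {after:.6f}")
```

The original signal has no energy at 200 Hz, while the rectified one does. Since 1000 Hz and 1200 Hz are the fifth and sixth harmonics of 200 Hz, the same component also sits where the missing fundamental would be.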

<h2>The auditory pathway: nerves, nuclei and their roles in auditory analysis</h2>
Now the function of the auditory system has been described up to the
auditory nerve. What about after that? The eighth cranial nerve, most of
which is an extension of the innervation of the inner ear (the rest
being mainly concerned with the sense of balance), carries the auditory
traffic to the brain stem. Here the auditory nerve passes through the
<dfn>cochlear nuclei</dfn>, which start the neural processing and
feature extraction process. Upon entering the cochlear nucleus, the
auditory nerve is divided in two. The upper branch goes to the upper
back quarter of the nucleus while the lower branch innervates the lower
back quarter and the front half. The cochlear nuclei display clear
tonotopic organization, with high frequencies mapped to the centre and
lower frequencies mapped to the surface. The <dfn>ventral</dfn> (front)
side of the nucleus is made up of two kinds of cells, <dfn>bushy</dfn>
and <dfn>stellate</dfn> (starlike). Stellate cells respond to single
neural input pulses by a series of evenly spaced action potentials of a
cell dependent frequency (this is called a <dfn>chopper response</dfn>).
The stellate cells have long, rather simple dendrites. This suggests
that the stellate cells gather pulses from many lower level neurons, and
extract precision frequency information from their asynchronous outputs.
Their presence supports one of the theories of frequency discrimination,
which speculates on the presence of a <q>timing reference</q> in the
brain. The bushy cells, on the other hand, have a fairly compact array
of highly branched dendrites (whence the name) and respond to
depolarization with a single output pulse. This suggests they are
probably more concerned with time‐domain processing. It seems bushy
cells extract and signal the onset time of different frequencies in a
sound stimulus. There are also cells, called <dfn>pausers</dfn>, which
react to stimuli by first chopping a while, then stopping, and after a
while starting again. These may have something to do with estimating
time intervals and/or offset detection.
Following the cochlear nuclei, the auditory pathway is divided in three.
The <dfn>dorsal (back side) acoustic stria</dfn> crosses the medulla,
along with the <dfn>intermediate acoustic stria</dfn>. The most important,
however, is the <dfn>trapezoid body</dfn>, which is destined for the next
important processing centre, the <dfn>superior olivary nucleus</dfn>.
The olives are a prime ingredient in directional hearing. Both nuclei
receive axons from both the <dfn>ipsilateral</dfn> (same side) and
<dfn>contralateral</dfn> (opposite side) cochlear nuclei. The <dfn>medial</dfn>
(closer to the centre of the body) and <dfn>lateral</dfn> (closer to the
sides of the body) portions of the nuclei serve different functions:
the medial part is concerned with measuring interaural time differences
while the lateral half processes interaural intensity information. Time
differences are measured by neurons which integrate the information
arriving from both ears—since propagation in the preceding neurons is
not instantaneous and the signals from the ears tend to travel in
opposite directions along the pathways, this system works as a kind of
correlator. The coincidence detector is arranged so that neurons closer
to the opposite side of the sound source tend to respond to it. Similarly,
the intensities are processed—contralateral signals excite and ipsilateral
signals inhibit the response of the intensity detector. These functions
are carried out separately for different frequency bands and are
duplicated in both superior olivary nuclei, although the dynamics of the
detection process mainly place the response on the opposite side of the
signal source.
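The correlator idea has a direct signal processing analogue: slide one ear's signal against the other's and look for the delay at which they agree best. The sketch below does this with a plain cross-correlation. It is an analogy, not a model of the olivary neurons; the sample rate, the noise source and the 20 sample delay are assumptions made for the example.

```python
import random

FS = 48000          # sample rate in Hz (assumed)
TRUE_DELAY = 20     # samples by which the right ear lags (assumed)
MAX_LAG = 40        # largest delay tested, in samples


def best_lag(left, right, max_lag):
    """Lag of `right` relative to `left` that maximizes their correlation."""
    def correlation(lag):
        return sum(left[n] * right[n + lag]
                   for n in range(max_lag, len(left) - max_lag))
    # Each candidate lag plays the role of one coincidence detector.
    return max(range(-max_lag, max_lag + 1), key=correlation)


rng = random.Random(1)
source = [rng.uniform(-1.0, 1.0) for _ in range(2000)]

left = source
right = [0.0] * TRUE_DELAY + source[:-TRUE_DELAY]   # same sound, arriving later

estimated_lag = best_lag(left, right, MAX_LAG)
print(f"estimated interaural delay: {estimated_lag} samples "
      f"({1000.0 * estimated_lag / FS:.3f} ms)")
```

A positive lag means the sound reached the left ear first, so the source lies to the left. In the neural arrangement described above, the position of the most active detector carries the same information.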
After leaving the olives, the axons rejoin their crossed and uncrossed
friends from the cochlear nuclei. They then progress upwards—this time
the bundle of axons is called the <dfn>lateral lemniscus</dfn>. The
lemniscus ascends first through the <dfn>pons</dfn> where an amount of
crossing between the lateral pathways is observed. This happens through
<dfn>Probst’s commissure</dfn> which mainly contains axons from the
nuclei of the lateral lemniscus. From here, the lane continues upward to
the midbrain (more specifically to the <dfn>inferior colliculus</dfn>)
where all the axons finally synapse. This time there seems not to be any
extensive crossing. It would appear that the inferior colliculus has
something to do with orientation and sound‐sight coordination—the
superior colliculus deals with eyesight and there are some important
connections to be observed. Also, there is good evidence that topographic
organization according to the spatial location of the sound is present
in the inferior colliculi. It is noticeable that while we trace the
afferent auditory pathway through to the lateral lemniscus and the
inferior colliculus, the firing pattern of the neurons changes from
flow‐like excitation to an onset/offset oriented kind. More on this
can be found in the sections on transients and time processing. The
pathway is then extended upwards to the <dfn>medial geniculate
nuclei</dfn> just below the forebrain which then, finally, projects to
the primary auditory cortex on the cerebrum.
One special thing to note about the geniculate nuclei is that they,
too, are divided into parts with apparently different duties. The
ventral portion displays tonotopic organization, whereas the dorsal
and <dfn>medial (magnocellular)</dfn> parts do not. The ventral part
projects to tonotopically organized areas of the cortex, the dorsal
part to nontonotopic ones and the medial part to both. In addition, the
magnocellular medial geniculate nuclei display a certain degree of
lability/plasticity. This means they may play a considerable part in how
learning affects our hearing. A noteworthy fact is that the
nontonotopically organized parts of the geniculate nuclei and the
cortex are considerably less well known than their tonotopic
counterparts—complex, musically relevant mappings might be found
there, in the future. Throughout the journey, connections to and from
the <dfn>reticular formation</dfn> (which deals with sensorimotor
integration, controls motivation and maintains the arousal and
alertness in the rest of the central nervous system) are observed.
Finally, the auditory cortex is located on the surface of the temporal
lobes. And just to add to the fun, there is extensive crossing here,
as well. This time it takes place through <dfn>corpus callosum</dfn>,
the highway between the right and left cerebral hemispheres.
On the way to the auditory cortex, extensive mangling of information has
already taken place. It is seen, for example, that although the tonotopic
organization has survived all the way through the complex pathways, it
has been multiplied, so that there are now not one but several frequency
maps present on the auditory cortex. The structural organization is more
complex, here, also. Like most of the cortex, the auditory cortex is
both organized into six neuronal layers (which mainly contain neuronal
cell bodies) and into columns (which reach through the layers). The
layers show their usual pattern of external connections: layer IV receives
the input, layer V projects back towards the medial geniculate body and
layer VI to the inferior colliculus. The columns, on the other hand,
serve more specialized functions and the different types are largely
interspersed among one another. Binaural columns, for instance, show an
alternating pattern of suppression and addition columns—columns which
differentiate between interaural features and those which do not,
respectively. Zoning of callosally connected and nonconnected areas is
also observed. Further, one must not forget that there exist areas in
the brain which are mainly concerned with speech production and
reception (the areas of Broca and Wernicke, respectively). They are
specific to humans although some similar formations are present in the
brain of other animals, especially if they are highly dependent on
auditory processing (bats and dolphins are examples with their echo
location and communication capabilities).
All in all, the functional apparatus of the brain concerned with auditory
analysis is of considerable size and complexity. One of the distinctive
features of this apparatus is the extensive crossing between the two
processing chains—one of the most peculiar aspects of hearing is that
while the usual rule of <q>processing on the wrong side</q> is generally
observed, the crossing distributes the processing load so that even
quite severe lesions and extensive damage to the cortex need not greatly
disturb auditory functions.

<h2>Steady‐state vs. transient sounds. The attack transient. Vowels and consonants.</h2>
In the last section, it became apparent that the brain has an extensive
apparatus for extracting both time and frequency information from sounds.
In fact, there are two separate pathways for information: one for
frequency domain and the other for time domain data. This has far reaching
consequences for how we hear sound signals. First of all, it means that
any perceptually significant analysis or classification of sound must
include both time and frequency. This is often forgotten in the traditional
Fourier analysis based reasoning on sound characteristics. Second, it
draws a kind of dividing line between sound signals whose main content
to us is in one of the domains. Here, this division is used to give
meaning to the often encountered terms <dfn>transient</dfn> and
<dfn>steady‐state</dfn>; we take the first to mean <q>time oriented</q>,
and the second <q>frequency oriented</q>. Another (rather more rigorous)
definition of <q>steady‐state</q> is based on statistics. In this
context, a signal is called <dfn>steady‐state</dfn> if it is stationary
in the short term and <dfn>transient</dfn> if it is not.
The root of this terminology lies in linear system analysis. There
steady‐state means that a clean, often periodic or almost constant
excitation pattern has been present long enough so that Fourier based
analysis gives proper results. Formally, when exposed to one‐sided inputs
(non‐zero only if time is positive), linear systems exhibit output which
can be decomposed into two additive parts: a sum of exponentially
decaying components which depends on the system and a sustained part which
depends on both the excitation and the system. The former is the transient
part, the latter steady‐state. Intuitively, transients are responses
which arise from changes of state—from one constant input or excitation
function to another. They are problematic, since they often correspond
to unexpected or rare events; it is often desired that the system spend
most of its time in its easiest to predict state, a steady‐state. Because
transients are heavily time‐localized, they defy the usefulness of
traditional Fourier based methods.
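The decomposition into a transient and a steady‐state part can be seen in the simplest possible case. The sketch below drives a one-pole lowpass filter, y[n] = a·y[n−1] + (1−a)·x[n], with a one-sided unit step and compares the simulated output with the sum of a constant part and a decaying exponential. The filter and the coefficient value are assumptions chosen for illustration.

```python
A = 0.9    # filter pole (assumed value); |A| < 1 for a stable system
N = 50     # number of output samples to simulate


def simulate_step_response(a, n_samples):
    """Run y[n] = a*y[n-1] + (1-a)*x[n] with x[n] = 1 for n >= 0, y[-1] = 0."""
    output, previous = [], 0.0
    for _ in range(n_samples):
        previous = a * previous + (1.0 - a) * 1.0
        output.append(previous)
    return output


simulated = simulate_step_response(A, N)

# Sustained part: set by the excitation (a constant) and the system's gain.
steady_state = [1.0] * N
# Transient part: a decaying exponential whose rate is set by the system alone.
transient = [-(A ** (n + 1)) for n in range(N)]

max_error = max(abs(y - (s + t))
                for y, s, t in zip(simulated, steady_state, transient))
print(f"largest mismatch between simulation and decomposition: {max_error:.2e}")
print(f"transient at n = 0: {transient[0]:.3f}, at n = {N - 1}: {transient[-1]:.5f}")
```

The closer the pole is to one, the longer the transient lingers before the output settles into its steady state.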
In acoustics and music, the situation is similar in that frequency
oriented methods tend to fail when transients are present. Moreover, in
music, transients often correspond to excitatory motions on behalf of
the performer (plucking a string, striking a piano key, tonguing the reed
while playing an oboe etc.), and so involve
<ul>
<li>Significant nonlinear interactions (instruments behave exceedingly nonlinearly)</li>
<li>Stochastic or chaotic phenomena (often from turbulence, as when
sibilant sounds are produced in the singing voice)</li>
<li>Unsteady vibratory patterns (the onset of almost any note)</li>
<li>Partials with rapidly changing amplitudes and frequencies (as a
result of the above)</li>
</ul>
All these together mean that pure frequency domain analyses do not explain
complex sounds clearly enough—they do not take into account the
time‐variant, stochastic or nonlinear aspects of the event. From an
analytical point of view, a time‐frequency analysis is needed. Some of
these are mentioned in the math section. The fourth item in the list
above deserves special attention because it is characteristic of vocal
sounds: consonants are primarily recognized by the trajectories (starting
points, relative amplitudes and speed of movement) of the partials present
in the following phoneme
(<a href="dsndr01#refdow01"><cite>Dow01</cite></a>). Usually consonants
consist of a brief noisy period followed by the partials of the next
phoneme sliding into place, beginning from positions characteristic of the
consonant. This happens because consonants are mostly based on restricting
air passage through the vocal tract (this and the following release produce
the noise), because the following phoneme usually exhibits different
formant frequencies (causing a slide from the configuration of the
consonant) and, finally, because consonants are mostly very short compared
to vowels.
What is the perceptual significance of our transient vs. steady classification,
then? To see this, we must consider speech, first. In the spoken language,
two general categories of sounds are recognized: vowels and consonants.
They are characterized by vowels being <q>voiced</q>, often quite long
and often having a more or less clear pitch as opposed to consonants
being short, sometimes noiselike (such as the pronunciation of the letter
<q>s</q>) and mostly unpitched. Vowels arise from nicely defined vibratory
patterns in the vocal tract which are excited by a relatively steady pulse
train from the vocal cords, whereas consonants mostly arise from constrictions
of the vocal tract and the attendant turbulence, impulsive release (like
when pronouncing a <q>p</q>, or one of the other plosives) or nonlinear
vibration (like the letter <q>r</q>). Now, a clear pattern shows here.
Consonants tend to be transient in nature, while vowels are mostly
steady‐state. This is very important because most of the higher audio
processing in humans has been shaped by the need to understand speech.
This connection between vowel/consonant and steady/transient classification
has also been demonstrated in a more formal setting: in listening experiments,
people generally tend to hear periodic and quasi‐periodic sounds as being
vowel‐like while noises, inharmonic waveforms and nonlinear phenomena
tend to be heard as consonants. Some composers have also created convincing
illusions such as <q>speech music</q> by proper orchestration—when
suitable portions of transient and steady‐state material are present in
the music in some semi‐logical order, people tend to hear a faint speech
like quality in the result. The current generation of commercial
synthesizers also demonstrates the point: today, the modulatory possibilities
and time evolution of sounds often outweigh the basic synthesis method in
importance as a buying criterion. The music of the day relies greatly on
evolving, complex sounds instead of the traditional one‐time note event.
It is kind of funny how little attention time information has received
in the classical study, although one of the classic experiments in
psychoacoustics tells us what importance brief, transient behavior of
sound signals has. In the experiment, we record instrumental sounds. We
then cut out the beginning of the sound (the portion before the sound
has stabilized into a quasi‐periodic waveform). In listening experiments,
samples brutalized this way are quite difficult to recognize as being
from the original instrument. Furthermore, if we splice together the end
of one sample and the beginning of another, the compound sound is mostly
recognized as being from the instrument of the beginning part. In a
musical context, the brief transient in the beginning of almost all notes
is called the <dfn>attack</dfn>, then. For a long time, it eluded any
closer inspection and even nowadays, it is exceedingly difficult to
synthesize if anything but a thorough physical model of the instrument
is available.
This kind of high importance of transient characteristics in sound is
best understood through two complementary explanations. First, from an
evolutionary point of view, time information is essential to
survival: if something makes a sudden loud noise, it may be
coming to eat you or falling on you.
<div class="sidebar">
This is where the <dfn>startle</dfn> and <dfn>orientation</dfn>
reflexes come in: sudden noises or movement tend to cause a rapid fight
or flight reaction and even weaker, unexpected stimuli cause one to
locate the sound source by turning the head towards it. Since unexpected,
sudden features in the heard sound tend to cause such effects and generally
arouse the central nervous system, it can be conjectured that <em>notes</em>
may well have some very deep seated physiological meaning to
people—they do tend to start with transients and cause
fixing of attention.
You need rapid classification as to what the source of the sound is and
where it is. From a physical point of view, there may also be
considerably more information in transient sound events than in
steady‐state (and especially periodic) sound—since high
frequency signals are generated in nature by vibrational modes in
bodies which have higher energies, they tend to occur only briefly and
die out quickly. In addition to that, most natural objects tend to
emit quasi‐periodic sound after a while has passed since the initial
excitation. These two facts together mean that, first, upper
frequencies and highly inharmonic content tend to concentrate on the
transient part of a sound and, second, the following steady‐state
portion often becomes rather nondistinctive.
<div class="sidebar">
Often it is close to being periodic, which means that only frequencies
near multiples of some basis frequency are present—factoring out the
basis frequency and observing that higher frequencies (by virtue of
their higher energy) tend to be present in lesser degrees, we see that
only a small number of distinguishable steady‐state sounds exist in
So the steady‐state part is certainly not the best part to look at if
source classification is the issue. The other part of the equation concerns
the neural excitation patterns generated by different kinds of
signals—transients tend to generate excitation in
greater quantities and more unpredictably. Since unpredictability
equals entropy equals information, transients tend to have a
significant role in conveying useful data. This is seen in another way
by observing that periodic sounds leave the timing pathway of the
brain practically dead—only spectral information is
carried and, as is explained in following sections, spectra are not
sensed very precisely by humans. It is a bit like looking at photos versus
watching a movie. In addition to that, such effects as masking and
the inherent normalisation with regard to the surrounding acoustic
space greatly limit the precision of spectral reception.
Aside from their important role in classifying sound sources, transient
features also serve a complementary role in sound localization. This is
most clearly seen in auditory physiology: our brain processes interaural
time differences instead of phase differences and has separate circuitry
for detecting the onset of sonic events. This means that transient sounds
are the easiest to locate. Experiments back this claim: the uncertainty
in sound localization is greatest when steady‐state, periodic sounds
are used as stimuli.

<h2>Critical bands and masking</h2>
Until now, we have tacitly assumed that the ear performs like a measuring
instrument—if some features are present in a sound, we hear them. In
reality, this is hardly the case. As everybody knows, it is often quite
difficult to hear and understand speech in a noisy environment. The main
source of such blurring is <dfn>masking</dfn>, a phenomenon in which
energy present in some range of frequencies lessens or even abolishes
the sensation of energy in some other range. Masking is a complex
phenomenon—it works both ipsilaterally and contralaterally, and
masking effects extend both forwards and backwards in time. It is highly
relevant to both practical applications (e.g. perceptual compression)
and psychoacoustic theory (for instance, in models of consonance
and amplitude perception). This also means that masking has been quite
thoroughly investigated over the years. The bulk of research into
masking involves experiments with sinusoids or narrow‐band noise masking
a single sinusoid. Significant amounts of data are available on forward
and backward masking as well. It seems most forms of masking can be
explained at an extremely low (almost physical) level by considering the
time dynamics of the organ of Corti under sonic excitation. This is not
the case for contralateral masking, though, and it seems this form of
masking stands separately from the others. Currently it is thought that
contralateral masking is mediated through the olivo‐cochlear descending
pathway by means of direct inhibition of the cochlea in the opposite
ear. (Masking like this is called <dfn>central</dfn>, whereas normal
masking by sound conducted through bone across the skull is called
Masking is a rather straightforward mechanism, which can be studied
with relative ease by presenting test signals of different amplitudes
and frequencies to test subjects in the presence of a fixed masking
signal. The standard way to give the result of such an experiment is
to divide the frequency‐amplitude plane into parts according to the
effect produced by a test signal with the respective attributes while
the mask stays constant. The main feature is the <dfn>masking threshold</dfn>
which determines the limit below which the masked signal is not heard at
all. This curve has a characteristic shape with steep roll‐off below the
mask frequency and a much slower, uneven descent above. This means that
masking mostly reaches upwards with only little effect on frequencies
below that of the mask. At each multiple of the mask frequency we see
some dipping because of beating effects with harmonic distortion
components of the mask. Above the threshold we see areas of perfect
separation, roughness, difference tones and beating.
A lot is known about masking when both the mask and the masked are
simple, well‐behaved signals devoid of any time information. But how
about sound in general? First we must consider what happens with
arbitrary static spectra. In this case one proper (and
indeed widely used) way is to take masking to be additive.
That is, the mask contribution of each frequency is added together to
obtain the amount of masking imposed on some fixed frequency.
<div class="sidebar">
This leads to the masking threshold over the whole audio bandwidth being
the convolution of the (frequency variable) masking curve with our mask
spectrum. Further, if we warp the frequency domain representation of the
mask sound to produce a scale (the <dfn>bark</dfn> scale) with equal
critical bandwidth over the frequency axis, we see that masking curves
corresponding to different mask frequencies are made nearly identical in
shape. Performing the convolution in this new domain is the normal
shift‐invariant convolution operation and can be very nearly approximated
by low cost linear shift‐invariant filtering. This is extremely useful in
masking calculations, such as those needed in perceptual compression.
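The convolution view can be sketched in a few lines. This is only an illustration under stated assumptions: the Bark conversion uses the common Zwicker and Terhardt style arctangent approximation, and the triangular masking curve with its slopes and the 10 dB offset are hypothetical values picked for the example, not measured data.

```python
import math

def hz_to_bark(f_hz):
    """Arctangent approximation of the Bark scale (an assumed choice)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def spreading_db(dz, lower_slope=27.0, upper_slope=10.0):
    """Hypothetical triangular masking curve in dB relative to its peak.

    dz is the distance in Bark from the masker to the masked band.
    The slopes are illustrative: steep below the masker, shallow above it.
    """
    if dz < 0:
        return lower_slope * dz
    return -upper_slope * dz

def masking_threshold_db(mask_db, step_bark=0.5, offset_db=-10.0):
    """Additive masking on a uniform Bark grid.

    mask_db[j] is the masker level in band j. For each band i the power
    contributions of all bands are summed, which is a discrete convolution
    of the mask spectrum with the shift-invariant masking curve.
    """
    thresholds = []
    for i in range(len(mask_db)):
        total_power = 0.0
        for j, level in enumerate(mask_db):
            dz = (i - j) * step_bark
            contribution_db = level + offset_db + spreading_db(dz)
            total_power += 10.0 ** (contribution_db / 10.0)
        thresholds.append(10.0 * math.log10(total_power))
    return thresholds

# One 80 dB masker in band 4 of an otherwise silent (-100 dB) spectrum.
spectrum = [-100.0] * 10
spectrum[4] = 80.0
threshold = masking_threshold_db(spectrum)
```

In a real application hz_to_bark would first be used to resample the mask spectrum onto the uniform Bark grid. The double loop could then be replaced by any fast convolution routine, since the masking curve no longer depends on the masker frequency.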
So additivity is nice. But does it hold in general? Not quite. Since the
ear is not exactly linear, some additional frequencies always arise.
These are not included in our masking computation and can sometimes make
a difference. Also, in the areas where our hearing begins to roll off
(very low and very high frequencies), some exceptions to the additivity
must be made. Since masking mainly stretches upwards, this is mostly
relevant in the low end of the audio spectrum—low pitched sounds do
not mask higher ones quite as well as we would expect. Further, beating
between different partials of the masking and the masked can sometimes
cause additivity to be too strict an assumption. This is why practical
calculations sometimes err on the safe side and take maximums instead of
sums. This works because removing all content other than the frequency
(band) whose masking effect was the greatest will still leave the signal
masked. It is proper to expect that putting the rest of the mask back in
will not reduce the total masking effect.
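The difference between the two combination rules is easy to see numerically. The sketch below assumes that the masking contributions reaching one frequency are given in decibels and that additivity means summing powers; both function names are made up for the illustration.

```python
import math

def combine_additive(levels_db):
    """Sum the masking contributions as powers and return the total in dB."""
    total_power = sum(10.0 ** (level / 10.0) for level in levels_db)
    return 10.0 * math.log10(total_power)

def combine_maximum(levels_db):
    """Conservative rule: keep only the strongest contribution."""
    return max(levels_db)

contributions = [60.0, 60.0, 45.0]
additive = combine_additive(contributions)
conservative = combine_maximum(contributions)
```

The maximum never exceeds the power sum, so a codec which uses it will at worst underestimate how much can be hidden under the mask.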
The above discussion concerns steady spectra. In contrast, people hear
time features in sounds as well. So there is still the question of how
the masking effect of a particular sound develops in time. When we
study masking effects with brief tone bursts, we find that masking
extends some tens of milliseconds (often quoted as 50ms) backwards and
one to two hundred milliseconds forward in time. The effect drops
approximately exponentially as the temporal separation of the mask and
the masked increases. These results too can be explained by considering
what happens in the basilar membrane of the ear when sonic excitation is
applied—it seems <dfn>backward</dfn> and <dfn>forward</dfn> masking,
as these are respectively called, are the result of the basilar
membrane’s inherently resonant nature. The damped vibrations set off by
sound waves do not set in or die out abruptly, but instead some temporal
integration is always observed. This same integration is what causes the
loudness of very short sounds to be proportional to their total energy instead
of the absolute amplitude: since it takes some time for the vibration
(and, especially, the characteristic vibrational envelope) to set in,
the ear can only measure the total amount of vibration taking place, and
ends up measuring energy across a wide band of frequencies. Similarly,
any variation in the amplitudes of sound frequencies is smoothed out,
leading to the ear having a kind of <q>time constant</q> which
limits its temporal accuracy.
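The approximately exponential decay of temporal masking can be pictured with a toy model. Everything numeric here is an assumption: the two time constants are hypothetical values chosen only so that the effect has mostly vanished after some 50 ms backwards and 200 ms forwards, and applying the decay to the amount of masking in decibels is itself a simplification.

```python
import math

def temporal_masking_db(peak_db, dt_ms, tau_forward_ms=50.0, tau_backward_ms=10.0):
    """Toy estimate of the masking left at a time offset from the mask.

    dt_ms > 0 means the probe comes after the mask (forward masking),
    dt_ms < 0 means it comes before the mask (backward masking).
    The time constants are hypothetical, not measured values.
    """
    tau = tau_forward_ms if dt_ms >= 0 else tau_backward_ms
    return peak_db * math.exp(-abs(dt_ms) / tau)

forward_100 = temporal_masking_db(40.0, 100.0)
backward_100 = temporal_masking_db(40.0, -100.0)
```

With these assumed constants the masking reaches much further forwards than backwards, which is the asymmetry described above.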
Closely tied to masking (and, indeed, many other aspects of human
hearing) are the concepts of <dfn>critical bandwidth</dfn> and
<dfn>critical bands</dfn>. The critical bandwidth is defined as that
width of a noise band beyond which increasing the bandwidth does not
increase the masking effect imposed by the noise signal upon a sinusoid
placed at the center frequency of the band. The critical bandwidth
varies across the spectrum, being approximately one third of an octave
in size, except below 500<acronym title="Hertz" xml:lang="en">Hz</acronym>, where the width is more or less constant at
100<acronym title="Hertz" xml:lang="en">Hz</acronym>. This concept has many uses and interpretations because in a way,
it measures the spectral accuracy of our ear. Logically enough, a
critical band is a frequency band with the width of one critical
bandwidth. Through some analysis of auditory physiology we find that a
critical band roughly corresponds to a constant number of hair cells in
the organ of Corti. In some expositions, critical bands are thought of
as having a fixed center frequency and bandwidth. Although such a view
is appealing from an application standpoint, no physiological evidence
of direct <q>banking</q> of any kind is found in the inner ear or the
auditory pathway, so this way of thinking seems somewhat
erroneous. Instead, we should think of critical bands as giving the size
and shape of a minimum discernible <q>spectral unit</q> of a kind: in
measuring the loudness of a particular sound, the amplitude for each
frequency is always averaged over the critical band corresponding to the
frequency. (This amounts to lowpass filtering (i.e. smoothing) of the
perceived spectral envelope.) This effect is illustrated by the fact
that people can rarely discern fluctuations in the spectral envelope of
a sound which are less than one critical band in width.
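The rule of thumb for the critical bandwidth given above translates directly into code. This is a minimal sketch of that rule only (100 Hz below 500 Hz, one third of an octave elsewhere), not a fitted psychoacoustic formula. The function names are illustrative, and placing the band edges symmetrically around the center is an assumption.

```python
def critical_bandwidth_hz(center_hz):
    """Approximate critical bandwidth around center_hz, per the rule of thumb."""
    if center_hz < 500.0:
        return 100.0
    # One third of an octave spans center / 2**(1/6) to center * 2**(1/6).
    return center_hz * (2.0 ** (1.0 / 6.0) - 2.0 ** (-1.0 / 6.0))

def critical_band_edges_hz(center_hz):
    """Lower and upper edge of a band one critical bandwidth wide."""
    if center_hz < 500.0:
        return (center_hz - 50.0, center_hz + 50.0)
    ratio = 2.0 ** (1.0 / 6.0)
    return (center_hz / ratio, center_hz * ratio)

width_at_1k = critical_bandwidth_hz(1000.0)
low_1k, high_1k = critical_band_edges_hz(1000.0)
```

Note that the two branches do not meet exactly at 500 Hz (100 Hz against roughly 116 Hz), which is one reason smoother formulas are preferred in practice.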
<div class="sidebar">
One counter‐example is found in the perception of speech: the inherent
instability in the pitch of a vowel makes many static features of the
spectral envelope more easily heard, even if these features are
extremely narrow. More generally, if we pass a periodic sound with a
strong harmonic spectrum (like a pulse train) through a filter, applying
some slow fluctuation to the base frequency makes the harmonics roam
through the peaks and dents in the filter’s response and so turns the
static response of the filter into relatively simple amplitude modulation
of the partials. Our ear is tuned to detecting these and can so go below
the critical bandwidth in resolving the overall spectral envelope.
Naturally the trick won’t work with extremely low base frequencies (the
harmonics are too close to each other), fast fluctuations (the ear
integrates in time), spectra which are not nearly harmonic, or
flat/continuous spectra (like noise).

<h2>Amplitude to loudness</h2>
Considering the complexity of the analysis taking place in the auditory
pathway, it is no wonder that few parameters of sound signals are translated
directly into perceptually significant measures. This is the case with
amplitude too—the nearest perceptual equivalent, <dfn>loudness</dfn>,
consists of much more than a simple translation of signal amplitude.
First, the sensitivity of the human ear is greatly frequency
dependent—pronounced sensitivity is found at the
frequencies which are utilized by speech. This is mainly due to
physiological reasons (i.e. the ear canal has its prominent resonance
on these frequencies and the transduction and detection mechanisms
cause uneven frequency response and limit the total range of frequency
perception). There are also significant psychoacoustic phenomena
involved. In particular, humans tend to normalize sounds. This means that
the parameters of the acoustic environment we listen to a sound in are
separated from the properties of the sound source. As a result, we tend,
for instance, to hear sounds with similar energies as being
of unequal loudness if our brain concludes that they are coming from
differing distances. Further, such phenomena as masking can cause
significant parts of a sound to be <q>shielded</q> from us,
effectively reducing the perceived loudness. We also follow very
subtle clues in sounds to deduce the source parameters. One example is
the fact that a sound with significant high frequency content usually
has a higher perceived loudness than a sound with similar amplitude
and energy but less high end. This is a learned
association—we know that usually objects emit higher
frequencies if they are excited more vigorously. Phenomena such as
these are of great value to synthesists since they allow us to use
simple mathematical constructs (such as low order lowpass filters) to
create perceptually plausible synthesized instruments. On the other
hand, they tend to greatly complicate analysis.
If we take a typical, simple and single sound and look at its loudness,
we can often neglect most of the complicated mechanisms of perception and
look directly at the physical parameters of the sound. Especially, this
is the case with sinusoids since they have no spectral content apart
from their frequency. Thus most of the theory of loudness perception is
formulated in terms of pure sine waves at different frequencies. It is
mostly this theory that I will outline in the remainder of this section.
<div class="figure right half-width">
<img src="pictsnd/robdad1" alt="[Figure 1: Equal loudness curves in a free field experiment]" longdesc="pictsnd/robdad1-desc"/>
<span class="tag">Figure 1</span> Equiphon contours for the range of
human hearing in a free field experiment, according to Robinson and
Dadson. At 1<acronym title="kiloHertz" xml:lang="en">kHz</acronym>, the phon contours correspond to the decibels. All
sinusoids on the same contour (identified by sound pressure level
and frequency) appear to have identical loudness to a human listener.
It is seen that the dynamic range and threshold of hearing are worst
in the low frequency end of the spectrum. Also, it is quite evident
that at high sound pressure levels, less dependency on frequency is
observed (i.e. the upper contours are flatter than the lower ones).
Decibels are nice but they have two problems: they do not take the
frequency of the signal into account and also show poor correspondence
with perceived loudness at low <acronym title="Sound Pressure Level" xml:lang="en">SPL</acronym>s. The puzzle is solved in two steps.
First, we construct a scale where the frequency dependency is taken into
account. This is done by picking a reference frequency (1<acronym title="kiloHertz" xml:lang="en">kHz</acronym>, since this
is where we defined the zero level for <acronym title="Sound Pressure Level" xml:lang="en">SPL</acronym>s) and then examining how
intense sounds at different frequencies need to be to achieve loudness
similar to their 1<acronym title="kiloHertz" xml:lang="en">kHz</acronym> counterparts. After that we connect sounds with
similar loudnesses across frequencies. The resulting curves are called
<dfn>equiphon contours</dfn> and are shown in the graph from Robinson
and Dadson. We get a new unit, the <dfn>phon</dfn>, which tells loudness
in terms of <acronym title="Sound Pressure Level" xml:lang="en">SPL</acronym> at 1<acronym title="kiloHertz" xml:lang="en">kHz</acronym>. That 1<acronym title="kiloHertz" xml:lang="en">kHz</acronym> is the reference point shows in that
there the resulting decibel to phon mapping is an identity. Elsewhere
we see the frequency dependency of hearing: following the 60 phon
contour, we see that to get the same loudness which results from
presenting a 1<acronym title="kiloHertz" xml:lang="en">kHz</acronym>, 60<acronym title="Decibel" xml:lang="en">dB</acronym> <acronym title="Sound Pressure Level" xml:lang="en">SPL</acronym> sine wave, we must use a 90<acronym title="Decibel" xml:lang="en">dB</acronym> <acronym title="Sound Pressure Level" xml:lang="en">SPL</acronym> sine wave
at 30<acronym title="Hertz" xml:lang="en">Hz</acronym> or a 55<acronym title="Decibel" xml:lang="en">dB</acronym> <acronym title="Sound Pressure Level" xml:lang="en">SPL</acronym> sine wave at 4<acronym title="kiloHertz" xml:lang="en">kHz</acronym>. We also see that the higher the sound
pressure level, the less loudness depends on frequency (the equiphon
contours are straighter in the upper portion of the picture).
The phon is not an absolute unit: it presents loudness relative to the
loudness at 1<acronym title="kiloHertz" xml:lang="en">kHz</acronym>. Knowing the phons, we cannot say one sound is twice as
loud as another one—this would be like saying that a five star hotel
is five times better than a one star motel, i.e. senseless. Instead, we
would wish an absolute perceptual unit. All that remains to be done is
to get the phons at some frequency (preferably at 1<acronym title="kiloHertz" xml:lang="en">kHz</acronym> since the
<acronym title="Sound Pressure Level" xml:lang="en">SPL</acronym>‐to‐phon mapping is simplest there) to match our perception. This is
done by defining yet another unit, the <dfn>sone</dfn>. When this is
accomplished, we can first use the equiphon contours to map any <acronym title="Sound Pressure Level" xml:lang="en">SPL</acronym> to
its equivalent loudness in phons at 1<acronym title="kiloHertz" xml:lang="en">kHz</acronym> and then the mapping to sones
to get a measure of absolute loudness. The other way around, if we want
a certain amount of sones, we first get the amount of phons at 1<acronym title="kiloHertz" xml:lang="en">kHz</acronym> and
then move along the equiphon contours to get the amount of decibels
(<acronym title="Sound Pressure Level" xml:lang="en">SPL</acronym>) at the desired frequency. Experimentally we get a power law between
loudness and sound pressure: at 1<acronym title="kiloHertz" xml:lang="en">kHz</acronym>, loudness in sones grows roughly as
sound pressure raised to the power of 0.6, 40 phons being
equal to 1 sone. Because phons are logarithmic, this means that above 40 phons
the sones roughly double for every 10 phon increase. (Near the threshold of hearing,
around 0 phons, loudness falls towards 0 sones, so there the power law no longer holds.)
This way at high <acronym title="Sound Pressure Level" xml:lang="en">SPL</acronym>s the sone scale follows the
phon/decibel one in a simple manner, while at low levels, small changes in sones
correspond to significantly higher differences in phons. In effect, at
low levels a perceptually uniform volume slider works <em>real</em>
fast while at higher levels, it’s <q>just</q> exponential.
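The conversion between phons and sones can be sketched as follows. The sketch takes the exponent of 0.6 to apply to sound pressure, which is the usual form of the power law. A 10 phon step is then a pressure ratio of 10^0.5, and 10^(0.5 × 0.6) is very nearly 2. The doubling rule is only an approximation and is commonly taken to hold from about 40 phons upwards; the function names are made up for the example.

```python
import math

def phons_to_sones(phons):
    """Loudness in sones, assuming a doubling for every 10 phons above 40."""
    return 2.0 ** ((phons - 40.0) / 10.0)

def sones_to_phons(sones):
    """Inverse of phons_to_sones; only meaningful for sones > 0."""
    return 40.0 + 10.0 * math.log2(sones)

# A 10 dB step is a pressure ratio of 10**0.5; raised to 0.6 this is close to 2.
ratio_per_10_phons = (10.0 ** 0.5) ** 0.6
```

To go from a desired number of sones to decibels at some other frequency, one would still have to follow the appropriate equiphon contour, which this sketch leaves out.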
All the previous development assumes that the sounds are steady and of
considerable duration. If we experiment with exceedingly short stimuli,
different results emerge. Namely, we observe considerable clouding of
accuracy in the percepts and signs of <dfn>temporal integration</dfn>.
This means that as we go to very short sounds and finally impulses which
approach or are below the temporal resolution of the organ of Corti, the
total energy in the sound becomes the dominant measure of loudness. At
the same time, loudness resolution degrades so that only a few separate
levels of loudness can be distinguished. Similarly, when dealing with
sound stimuli, the presence of transients becomes an important factor in
determining the lower threshold of hearing—transient content (e.g.
rapid onset of sinusoidal inputs and fluctuation in the amplitude
envelopes) tends to lower the threshold while at the same time clouding
the reception of steady‐state loudness.
Finally, a few words must be said about the loudness of complex sounds.
As was explained in the previous section, sinusoidal sounds close to
each other tend to mask one another. If the sounds are far enough from
one another (more than one critical bandwidth apart) and the higher is
sufficiently loud, they are heard as separate and contribute separately
to loudness. In this case sones are roughly added. Since masking is most
pronounced in the upward direction, a sound affects the perception of
lower frequencies considerably less than higher ones—in a sufficiently
rapidly decaying spectrum, the lower partials dominate loudness
perception. Also, sinusoids closer than the critical bandwidth are
merged by hearing so their contribution to loudness is less than the
sum of their separate contributions. The same applies to narrow‐band
(bandwidth less than the critical bandwidth) noise. If beating is
produced, it may, depending on its frequency, increase, decrease or
blur perceived loudness. Similarly harmonics (whether actually present
or born in the ear) of low frequency tones and the presence of transients
may aid in the perception of the fundamental, thus affecting the audibility
of real life musical tones as compared to the sine waves used in the
construction of the above equiphon graph.
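The two extreme cases for a pair of sinusoids can be compared numerically. This is a rough sketch under several assumptions: sones double for every 10 phons, partials more than a critical band apart simply add in sones, and partials inside one critical band add in intensity before the conversion. Masking, beating and the frequency dependency of hearing are all ignored, so the levels are best thought of as being near 1 kHz.

```python
import math

def phons_to_sones(phons):
    """Assumed rule: loudness doubles for every 10 phons above 40."""
    return 2.0 ** ((phons - 40.0) / 10.0)

def loudness_separated(levels_phon):
    """Partials more than a critical band apart: sones are roughly added."""
    return sum(phons_to_sones(p) for p in levels_phon)

def loudness_merged(levels_phon):
    """Partials within one critical band: add intensities, then convert."""
    total_intensity = sum(10.0 ** (p / 10.0) for p in levels_phon)
    return phons_to_sones(10.0 * math.log10(total_intensity))

pair = [60.0, 60.0]
far_apart = loudness_separated(pair)
close_together = loudness_merged(pair)
```

Under these assumptions two well separated 60 phon tones come out twice as loud as one of them, while the same two tones inside one critical band gain clearly less.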
For signals with continuous spectra (such as wideband noise), models of
loudness perception are almost always heavily computational—they
usually utilize a filterbank analysis followed by conversion into the
bark scale, a masking simulation and averaging. Wideband signals also
have the problem of not exactly following the conversion rules for
decibels, phons and sones—white noise, for instance, tends to be heard
as relatively too loud if its <acronym title="Sound Pressure Level" xml:lang="en">SPL</acronym> is small and too silent if its <acronym title="Sound Pressure Level" xml:lang="en">SPL</acronym> is
<div class="sidebar">
There are also more subtle complications in trying to generalize the
notion of loudness to cover complex sounds—it is quite possible the
same concepts are not completely applicable to both simple and composite
sounds. Fusion and streaming may separate composite sounds so that
asking for a measure of loudness to be assigned to the composite becomes
meaningless. On top of that, <em>loudness</em> may have very different
meanings depending on the context. There is a fascinating discussion on
what makes a movie <q>too loud</q> on the
<a href="http://www.dolby.com/">Dolby Laboratories web site</a>. It
clearly demonstrates the difference between estimating the loudness of
instantaneous/long term, music/noise and expected/unexpected sounds and
the contribution of dynamic range to the perceived strength of a sound.
In other words it struggles with the cognitive aspects of loudness
perception. All in all, we have bumped for the first time into one of
the prime problems in psychometrics—people display modes of behavior
(such as categorical perception) which tend to defy easy measurement.
Not all things in our perceptual world have dimensions or permit proper

<h2>Temporal processing. Amplitude and frequency modulation.</h2>
From looking at the structure of our auditory system, it seems like quite
considerable machinery is assigned to temporal processing. Furthermore,
it seems like time plays an important role in every aspect of auditory
perception—even more so than in the context of the other senses. This
is to be expected, of course: sound as we perceive it has few degrees of
freedom in addition to time.
The importance of time processing shows in the fact that it starts at an
extremely early stage in the auditory pathway, namely, in the cochlear
nuclei. The bushy cells mentioned earlier seem to be responsible for
detecting the onset of sounds at different frequency ranges. Excitation
of the bushy cells elicits a <dfn>phasic</dfn> response (onset produces
a response, continued excitation does not) as opposed to the <dfn>tonic</dfn>
(continued excitation produces a continued response) pattern most often
observed higher up the auditory pathway. This way, the higher stages of
auditory processing receive a more event centric view of sound as
opposed to the flow‐like, tonic patterns of the lower auditory
pathway. The pauser cells may be responsible for detecting sound
offsets. This way sound energy in different frequency bands is
segregated into time‐limited events. This time information is what
drives most of our auditory reflexes, such as startle, orientation and
protective reflexes. As such, it hardly comes as a surprise that heavy
connections to the reticular formation (which controls arousal and
motivation, amongst other things) are observed throughout the auditory
In other animals, and especially those which rely heavily on hearing to
survive (e.g. bats, whales and owls), specialized cells which extract
certain temporal features from sound stimuli have been found. For instance,
in bats certain speed ranges of frequency sweeps are mapped laterally on the
auditory cortex <a href="dsndr01#refkan01"><cite>Kan01</cite></a>. This
makes it possible for the bat to use Doppler shifts to correctly
echolocate approaching obstacles and possible prey. These cells are very
selective—they respond best to sounds which nearly
approximate the squeals sent by the bat, excepting the frequency shift.
This leads to good noise immunity. Cells similarly sensitive to certain
modulation effects have been found in almost all mammals and there is
some evidence people are no exception. <span class="disposition"> For
instance, amplitude modulation in the range ?? to ??<acronym title="Hertz" xml:lang="en">Hz</acronym> displays high
affinity for a group of cells in the ???????. TEMP!!!</span> Also, the
nonlinearity of the organ of Corti makes AM appear in the neural input
to the cochlear nuclei as‐is. Mechanisms like these may be what
makes it possible for us to follow rapid melodies, rhythmic lines and
the prosody of speech without difficulty. It is also probable that
they serve a role in helping separate phonemes from each other when
they follow in rapid succession. Without such detection mechanisms it is
quite difficult to see how consonants are so clearly perceived by the
starting frequencies and the relative motion of the partials present. These
mechanisms may even be lent to the interpretation of formant envelopes (and,
thus, the discrimination of vowels) through the minute amplitude fluctuations
in the partials of a given speech sound. (As was discussed in section 4.6, such
flutter is caused by involuntary random vibrato in the period of the glottal
excitation pulse train.)
The importance of time features has been heavily stressed above.
However, we have yet to discuss quantitatively the sensitivity of our
ears to nonstationary signals. One reason for deferring the issue until
now is that it is not entirely clear what we mean by it. We would like
some objective measure of the time sensitivity of the ear, in a sense,
a time constant. Some of the more important temporal measures are the
time required to detect a gap in some sound signal, the time taken before
two overlapped sonic events can be heard as being separate, the mean
repetition rate at which a recurring sonic event fuses into a coherent,
single whole and the rate at which a masking effect at a certain frequency
builds up when the mask is applied or evaporates after the mask is gone.
The first hints at a discrimination test, the second is clearly a matter
of categorical perception and multidimensional study and from the third on
we are in the regime of continuous temporal integration.
The time required to hear a sonic gap varies somewhat over the audio
bandwidth. For a first order approximation, we might say that to effect
a discontinuous percept, we need some constant number of cycles of
silence. But looking a bit closer, this number also depends on the
amplitude and timbral composition of the sound.
Voice band sinusoids are probably the easiest, complex sounds with lots of
noise content and expected behavior the most complicated. In the context
of rich spectra, temporal smearing of over 50ms can occur. With a 1<acronym title="kiloHertz" xml:lang="en">kHz</acronym>
80<acronym title="Decibel" xml:lang="en">dB</acronym> sinusoid, a gap of 3‐4 cycles is enough to effect a discontinuous
percept. Often sounds overlaid with expectations (such as a continuously
ascending sinusoid) lend themselves to a sort of perceptual
extrapolation—even if the percept is broken by wideband
noise, our auditory system tries to fill the gap and we may well hear
the sound continue through the pause. The effect is even more
pronounced when a sound is masked by another one. This will be
discussed further down, in connection with the pattern recognition
aspects of hearing.
The minimum length of audible gaps is one but only one measure of the
ear’s time resolution. In fact, it is a very simplistic one. Another
common way to describe the resolution is to model our time perception
through a kind of lowpass filtering (integration or blurring)
operation. In this case, we try to determine the <dfn>time
constant</dfn> of the ear. The time constant of this conceptual filter
is then used to predict whether two time adjacent phenomena are heard
as separate or if they are fused into one. The first thing we notice
is that the time constant varies for different frequency ranges.
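The lowpass model of time perception can be illustrated with a one pole smoother running over an amplitude envelope. The 20 ms time constant used below is purely hypothetical and stands in for whatever value applies in a given frequency range; the function name and the 1 kHz envelope rate are likewise chosen only for the example.

```python
import math

def smooth_envelope(envelope, sample_rate_hz, time_constant_s):
    """One pole lowpass (leaky integrator) applied to an amplitude envelope."""
    alpha = 1.0 - math.exp(-1.0 / (sample_rate_hz * time_constant_s))
    state = 0.0
    smoothed = []
    for value in envelope:
        state += alpha * (value - state)
        smoothed.append(state)
    return smoothed

def envelope_with_gap(gap_samples, burst_samples=100):
    """Two bursts of unit amplitude separated by a silent gap."""
    return [1.0] * burst_samples + [0.0] * gap_samples + [1.0] * burst_samples

# Envelope sampled at 1 kHz, with a hypothetical 20 ms time constant.
short_gap = smooth_envelope(envelope_with_gap(2), 1000.0, 0.02)
long_gap = smooth_envelope(envelope_with_gap(50), 1000.0, 0.02)
deepest_short = min(short_gap[100:])
deepest_long = min(long_gap[100:])
```

A gap much shorter than the time constant barely dents the smoothed envelope, so the two bursts would be predicted to fuse, while a gap several time constants long lets the envelope fall almost to zero.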
<pre class="disposition">
‐so what’s the value?
When we look at simple stimuli, we get some nice, consistent measures
of the temporal behavior of our hearing. But as always, when higher
level phenomena are considered as well, things become complicated. It
seems that the so called <dfn>psychological time</dfn> is a very
complicated beast. For instance, it has been shown that dichotic
hearing can precipitate significant difference in perceived time
spans as compared to listening to the same material monophonically. In
an experiment in which a series of equidistant pulse sounds were
presented at different speeds and relative amplitudes via two
headphones, it was possible to fool test subjects into estimating the
tempo of the pulse train to be anywhere between the actual tempo and
twice that, <em>on a continuous scale</em>. This means that the
phenomenon isn’t so much a question of <q>locking</q> onto the sound
in a particular fashion (hearing every other pulse, for instance) but
rather a genuine phenomenon of our time perception. This experiment
has a partial explanation in the theory of auditory perception, which
states that the processing of segregated streams of sound (in this
case, the trains of clicks in the two ears) is mostly independent but
that the degree of separation depends on the strength of separation of
the streams. This disjoint processing can give rise to some rather
unintuitive effects. First of all, time no longer has the easy, linear
structure the Western world attributes to it—segregated streams all
more or less have their own, linear time. The implication is that time
phenomena which are strongly segregated are largely incommensurate.
This is demonstrated by the fact that a short gap within a sentence
played to test subjects is surprisingly difficult to place within the
sentence afterwards: the subjects know that there was a gap (and even
what alternative material was possibly played in the gap) but cannot
place the gap with any certainty (as in <q>it was after the word
is</q>). In effect, the order of time events has gone from total (a
common linear scale on which everything can be compared) to partial
(there are incommensurate events which cannot be placed with respect
to each other).
Further complicating the equation, we know that to some degree our
perception of rhythm and time is relative. The traditional point of
comparison is the individual’s heart beat but the relative state of
arousal (i.e. whether we are just about to go to sleep or hyperaroused
by a fight‐or‐flight reaction) probably has an even more pronounced
effect. This may in part explain why certain genres of music are
mostly listened to at certain times of the day. A fun experiment in
relative time perception is to listen to some pitched music with a
regular time while yawning, dozing off or…getting high. All these
should cause profound distortions in both time and pitch quite like
they do with the general state of arousal of a person.
<div class="sidebar">
There is an interesting concept which has been floating around the
community of music psychology for quite some time. I figure it’s a
noteworthy one when talking about hearing and time. That’s the concept
of the <dfn>perceptual now</dfn>, the psychological equivalent of the
present time. What is noteworthy about it is that it extends over a
variable time span and in a multiresolution manner. Depending on what
sort of sonic events we are looking at, the psychological now varies
from milliseconds <em>to entire seconds</em>, even while the different
sounds overlap. Of course, any two events heard in the <q>now</q> mentally
<pre class="disposition">
‐Vesa Valimaki’s work on time masking etc.

<h2>Pitch perception</h2>
<pre class="disposition">
‐volley theory (especially in the low register)
‐virtual vs. real pitch
‐nonlinearity/missing fundamental problem
‐spectral pitch (place theory interpretation for acute tones)
‐formants/spectral envelope

<h2>Directional hearing, externalization and localisation</h2>
<pre class="disposition">
‐phase difference (grave)
‐amplitude gradient (acute)
‐indetermination in between registers
‐amplitude envelopes important (acute)
‐connection to the concept of group delay
‐relative reverb/early reflections as size/distance cues
‐the poor performance of generic computational models as proof of the
acuity of these processes

<h2>Auditory perception as a pattern recognition task: stream segregation and fusion</h2>
<pre class="disposition">
‐common features
‐the old+new heuristic
‐layers: neurological and cognitive
‐attention: effects on both layers/selection
‐orientation (reflexes+attention)
‐pattern recognition
‐what’s here?
‐perceptual time
‐e.g. dichotic clicks seem slower than the same sequence when presented
‐relation to state of arousal, heartbeat and other natural
‐vertical vs. horizontal integration
‐competition between integration and segregation
‐this is a typical application of the Gestalt type field rules

<pre class="disposition">
‐spectral envelopes
‐temporal processing
‐e.g. the genesis of granular textures
‐connections to fusion; relevance of vibrato/dynamic envelopes for fusion
‐e.g. fusion of separately introduced sinusoids upon the introduction of a
common frequency/amplitude modulator, and its converse when the commonality
no longer holds
‐the importance of attacks and transients
‐spectral splashing
‐information carrying capacity of transients (no steady‐state
‐indetermination in periodic timbre
‐ergo, place theory/formant perception et cetera is not very accurate,
whereas volley theory/temporal processing seems to be
‐i.e. it is very difficult to characterize/measure timbre
‐there have been attempts
‐for instance, for steady‐state spectra with origin in orchestral
instruments, we seem to get three dimensions via
<acronym title="Principal Component Analysis" xml:lang="en">PCA</acronym>/<acronym title="Factor Analysis" xml:lang="en">FA</acronym>
‐most of these attempts do not concern temporal phenomena (the
overemphasis on Fourier, mentioned earlier)
‐this sort of theory is based on <em>extremely</em> simplified sounds and
test setups
‐connection to masking (especially in composite signals)
‐phase has little effect
‐except in higher partials and granular/percussive stuff
‐i.e. steady‐state is again overemphasized in traditional expositions
‐timbre is not well defined (Bregman: <q>wastebasket</q>)

<h2>Sensory integration</h2>
<pre class="disposition">
‐head turning as a localisation cue
‐we continuously extract spatial information based not only on an
open‐loop interpretation of what is heard, but on a closed‐loop one of what
happens when we change the acoustic conditions
‐the McGurk effect
‐that is, seeing someone talk can change the interpretation of the same
auditory input

<h2>Cognitive aspects of hearing. Evolutionary perspectives.</h2>
<pre class="disposition">
‐what can be learned?
‐apparently a lot!
‐lateralization implies invariance/<q>hardwiring</q>?
‐or just that there is a typical dynamic balance arising from the common
underlying circuitry?
‐is plasticity the norm?
‐what features in sound prompt specific invariant organizational patterns?
‐evolution and development of audition
‐population variations
‐esp. the Japanese peculiarities in lateralization!

<pre class="disposition">
‐hearing under the noise floor
‐the effects of ultrasonic content on directional hearing
‐esp. sursound discussions on transient localization
‐the idea that bandlimitation (and the inherent ringing, and especially
pre‐echoes, it produces) fools our time resolution circuitry
‐this is a nasty idea, because we cannot hear ultrasonic content, per se
‐it implies that spatial hearing is inherently non‐linear
‐it does <em>not</em> imply that all ultrasonic content has to be stored
‐instead it would mean that we might have to consider some nonlinear
storage format, which only helps store transients more accurately
‐thoughts on why dither might not help this situation, even if it makes
the <em>average</em> temporal resolution of an audio system approach
‐overcomplete analysis and <q>superresolution</q> of sounds (Michael
Gerzon’s unpublished work?)
‐inherent nonlinearity in hearing (computational models of microcilia!)
‐used to explain difference tones, perception of harmonic/near‐harmonic
spectra, missing fundamentals(, what else?)
‐levels of pattern recognition (learned vs. intrinsic)
‐comodulation masking release and profile perception as signs of cross
frequency band processing at a low level
‐the consequent refutation of strictly tonotopic place theories of pitch
‐envelopment and externalization through decorrelation
‐frequency ranges?

<p class="stamp" xml:lang="en">
Copyright © 1996–2002 Sampo Syreeni; Date: 2002–09–17;
<a href="http://www.helsinki.fi/~ssyreeni/front">⤒</a>; <a href="mailto:decoy@iki.fi">✉</a>;
<a href="http://www.opencontent.org/">♻</a>; <a href="http://www.helsinki.fi/~ssyreeni/decoy/support">☑</a>