author: Dr. Werner Van Belle
e-mail: werner.van.belle@vub.ac.be
keywords: BPM counting, tempo measurement, audio content extraction
Automatically measuring the tempo of a digital soundtrack is crucial for DJ programs such as BpmDJ [Bel03], BeatForce [Beu01], and other similar software. The ability to mix two (or more) pieces of music depends almost entirely on the availability of correct tempo information. All too often this information must be supplied by the DJ. This results in inaccurate tempo descriptions, which are useless from a programming point of view [Cli00].
In this paper, we first elaborate on the required tempo accuracy. Afterwards we explain the use and limitations of a fast Fourier transformation to detect the most prominent tempo and finally we explain our own technique of beat graphs and ray-shooting. The presentation of our technique covers the algorithm, performance estimates and experiments. For developers who want to use the algorithm 'out of the box' we give a list of parameters that makes the algorithm robust and fast on a standard personal computer.
In the upcoming discussion we assume that the fragments that need to be mixed have a fixed but initially unknown tempo. Certainly not all songs fall into this category; nevertheless, this assumption can safely be made because we are only interested in the fragments that need to be mixed. After all, very few DJs will mix two songs when one of the tempos changes too much.
Throughout the text we make use of different units to express frequency and period. Frequency is expressed in Hz (= 1/s) or in BPM (beats per minute). Beats per minute are the number of whole notes played during 1 minute. To convert between BPM and Hz we simply multiply or divide by 60: Hz = BPM/60 and BPM = 60·Hz. A period describes the time between two ticks (or beats). Converting a frequency f to a period p is done by using the standard formula p = 1/f. If we have a frequency specified in BPM, then we can obtain the time between two beats in msecs with p = 60000/BPM. In this paper, one measure is 4 beats. Given these units we can now start investigating how accurately we need to measure the tempo.
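As a quick sanity check, these unit conversions can be written down directly (a minimal sketch; the function names are our own):

```python
def bpm_to_hz(bpm):
    """Convert beats per minute to Hertz: Hz = BPM / 60."""
    return bpm / 60.0

def hz_to_bpm(hz):
    """Convert Hertz to beats per minute: BPM = 60 * Hz."""
    return hz * 60.0

def bpm_to_period_ms(bpm):
    """Time between two beats in milliseconds: p = 60000 / BPM."""
    return 60000.0 / bpm

# 120 BPM is 2 Hz, i.e. one beat every 500 ms.
```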
Accurate tempo information is important because, without it, two songs will drift out of synchronization very quickly. This often results in a mix that degrades to chaos, which is not what we want. Therefore we must ask ourselves how accurately we should measure the tempo of a song. To express the required accuracy, we will calculate the maximum error on the tempo measurement to keep the synchronization drift after a certain timespan below a certain value.
We assume that the song is measured with an error ε: if the exact tempo of the song is T BPM, then the measurement will report T + ε BPM. Such a wrong measurement leads to a drift after a certain timespan (denoted t). The timespan we consider is the time needed by the DJ to mix the songs. Typically this can be 30'', 60'' or even up to 2 minutes if he wants to alternate between different songs. We express the drift in beats (at the given tempo). We chose not to use absolute time values because slow songs don't suffer as much from the same absolute drift as fast songs. E.g., a drift of 25 ms is barely noticeable in slow songs (70 BPM), while such a drift is very disturbing for fast songs (145 BPM). Therefore, we consider a maximum drift of d beats. Given the exact tempo T and the measurement error ε, we can now calculate the drift after timespan t. To do so, we must first know how many beats are contained within t. Based on the standard relation between period and frequency, this is t·T/60. Because the tempo is measured with an accuracy of ε, we get a drift (expressed in beats) of

t·(T + ε)/60 − t·T/60 = t·ε/60

To keep this drift below d beats, we must keep t·ε/60 < d. Solving for ε, we obtain a required accuracy of
ε < 60·d/t    (1)
Table 2: An overview of the required accuracy ε, given the timespan t and the maximum drift d.
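The bound ε < 60·d/t (with d the maximum drift in beats and t the timespan in seconds) is easy to evaluate; a minimal sketch, with names of our own choosing:

```python
def required_accuracy(max_drift_beats, timespan_s):
    """Largest tempo error (in BPM) that keeps the drift after
    `timespan_s` seconds below `max_drift_beats` beats."""
    return 60.0 * max_drift_beats / timespan_s

# Staying within 1/32 of a beat over a 60 second mix
# requires an accuracy of 0.03125 BPM.
```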
In this section we investigate the use of a Fourier transformation to measure the lower end of the signal spectrum. This allows us to detect the most prominent low frequency within a soundtrack. This frequency can be easily converted to the tempo of the song. We will discuss the performance and accuracy of such an analysis.
To avoid confusion we must warn the reader that we are not
referring to peak detectors that spike at certain frequencies. These
programs make use of a Fourier analysis or specially tuned digital
filters to detect bass drums (frequencies between 2 and 100 Hz),
hi-hats
(above 8000 Hz) or other repetitive sounds. By measuring the distance
between different spikes, they obtain the mean tempo. The problem with this technique is that it is inherently highly inaccurate: it not only requires detailed fine-tuning, but afterwards still requires a comprehension phase of the involved rhythm [YKT95].
Its inaccuracy comes from the problem that low frequencies (encoding
a bass drum for instance) have a large period, which makes it difficult
to position the sound accurately. In other words, the Fourier analysis
will position the spike somewhere around the actual position of the
beat. This makes the algorithm inherently inaccurate. The second
problem involves the difficulties of designing a set of peak-responses
that will work for all songs. E.g., some songs are very repetitive
with respect to the bass-drums but not with respect to the bass itself.
Hence, tuning the algorithm to detect only the bass drums can be very
difficult. Finally, the last problem of peak detectors is that they
have no notion of the involved rhythm. E.g., a song that has a drum line with beats at positions 1, 2, 3, 4 and 4½ is notoriously difficult to understand, because the last spike (at position 4½) will seriously lower the mean time between the different beats; hence the algorithm will report a very inaccurate tempo. So, to state it clearly, we are not referring to this kind
of application of Fourier analysis. The use we are aiming at is essentially what a Fourier analysis is supposed to do: to immediately measure the presence of frequencies between 80 BPM (= 1.333 Hz) and 160 BPM (= 2.667 Hz). As we will point out, this application of Fourier analysis also has major limitations.
A digital signal is typically represented as a series of samples (denoted s0, s1, …, s(n−1)). Each sample describes the amplitude of the signal at that given position in time. The number of samples present is denoted n. The length of the audio fragment is n/r seconds, where r is the sampling rate introduced below.
A Fourier transformation takes these
samples as input and transforms
them into a superposition of sine waves. Every sine wave has its own
strength and its own frequency. Essentially, it converts a signal
represented in the amplitude domain to a signal represented in the
frequency domain [HJB84].
A Fourier transformation is characterized by two parameters: the sampling rate and the window size. The sampling rate (denoted r) describes the number of samples encoded in one second. In this paper we assume a standard sampling rate of 44100 Hz. The window size (denoted w) is a parameter of the Fourier transform that specifies the number of samples that are given as input to the algorithm. Due to technical characteristics of the FFT, the window size is typically a power of two. The outputs of a Fourier transform are the strengths of all frequencies between 0 and r Hz. However, of these only the ones between 0 and r/2 Hz are of interest, because a sampling rate of r Hz can only describe signals up to r/2 Hz [Nyq28].
The FFT is in general a fast algorithm: its performance estimate is O(w·log w). Different hardware-specific optimizations might lead to notable improvements, especially in the range of smaller window sizes. For standard desktop computers, there exist efficient implementations such as the FFTW [FJ97] and Intel's high-performance math libraries [Int03]. However, the larger the window size becomes, the closer the different implementation techniques come to a constant factor of the theoretical O(w·log w). Nowadays, window sizes of 2^24 samples (a song of 6'20'' at 44100 Hz) are achievable with standard personal computers. Nevertheless, the reason why we don't use a fast Fourier transform lies in its lack of accuracy.
The accuracy of the spectrum measurement depends on both the sampling rate and the window size. The Fourier transformation puts out w/2 frequencies. Its highest output (output number w/2) reports the strength of frequency r/2, while the lowest output (output number 1) reports the strength of r/w Hz. In general, output i reports the strength of frequency i·r/w. This leads to the observation that the smallest distance between two measurable frequencies is r/w. E.g., if the window size is 1024 samples (= 23.2 msecs) with a sampling rate of 44100 Hz, then we have an accuracy of 43 Hz (that is, we can measure the energy of 43 Hz, 86 Hz, and so on).
Now, if we want to increase the accuracy of a Fourier analysis, then we have the choice of decreasing the sampling rate or increasing the window size. We cannot modify the sampling rate because then we would throw away mostly useful information (hi-hats, claps and so on). Therefore, our only remaining option lies in selecting the largest possible window size. However, here we are limited by the size of the audio fragment. This leads us to specify the accuracy of the Fourier analysis as a function of the length of the song. For the sake of simplicity, we assume that this length is a power of 2. Then the accuracy of an FFT over a song of length l (in seconds), i.e. a window of w = l·r samples, is after conversion to BPM given by

60·r/w = 60/l BPM
Table 3 contains an overview of the BPM measurement accuracy, given a certain song length. Clearly, an accuracy of 0.157 BPM is not even close to the required accuracy of 0.0313 BPM (let alone the accuracy of 0.0156 BPM).
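Assuming the FFT window spans the whole song (w = l·r), the resolution r/w = 1/l Hz corresponds to 60/l BPM. A small sketch illustrating why even a full-song window falls short (names are ours):

```python
def fft_accuracy_bpm(song_length_s):
    """BPM resolution of an FFT whose window covers the whole song.
    With w = l * r the resolution is r / w = 1/l Hz, i.e. 60/l BPM."""
    return 60.0 / song_length_s

# A 2**24-sample song at 44100 Hz lasts ~380 s and still only
# reaches a resolution of ~0.158 BPM.
```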
Because the fast Fourier transform is highly inaccurate and its accuracy is very dependent on the length of the song, we created our own technique, called beat graphs and ray shooting.
Our visualization technique of beat graphs can be used to show the tempo information of music. Horizontally, the measures are enumerated, while vertically the content of one such measure is visualized: the color of the pixel at position (x, y) is given by the absolute strength of the signal at time step x·m·p + y. The value p is the period between two beats. We multiply this period by a multiplication factor m that specifies how many beats are supposed to be in one measure. We further assume that this value equals 4 because most popular songs are played in a 4/4 measure.
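As an illustration, the beat-graph folding described above can be sketched as follows, assuming the samples are available as a NumPy array `s`, with a per-beat period `p` in samples and `m` beats per measure (names and array layout are our own choices):

```python
import numpy as np

def beat_graph(s, p, m=4):
    """Fold the signal into columns of one measure (m * p samples).
    The pixel at (x, y) is the absolute amplitude |s[x*m*p + y]|;
    horizontal lines appear when p matches the real beat period."""
    mp = int(m * p)
    measures = len(s) // mp
    # rows: position y inside a measure, columns: measure number x
    return np.abs(s[:measures * mp]).reshape(measures, mp).T
```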
Picture 1 shows two beat graphs. Reading a beat graph is best done from top to bottom and left to right. For instance, in the song Alien Pump [Tan97] we see 4 distinct horizontal lines (numbers 1, 2, 3 and 4). These are the 'beats' of the song. The top line covers all the beats on the first note in the measure, the second horizontal line covers all the beats at the second note within a measure, and so on. The vertical white strokes that break the horizontal lines (C, D and E) are breaks within the music: passages without bass drum or another notable signal. Because the lines are distinctly horizontal, we know that the period is correct. If the lines in the beat graph are slanted, such as in Lsd [Hal95] (the blue line in the picture), then the period (and thus the tempo) is slightly incorrect. In this case, the line is slanted upwards, which means that the period used is too short (the beat on the next vertical line comes before the beat on the current line), and thus that the assumed tempo is a bit too high. By looking at the direction of the lines in a beat graph we can relate the actual tempo to the visualized tempo.
The beat graph not necessarily displays only straight lines. Two examples of this are given in figure 2. The first example is the song X-Files [CM01]. In the beginning the tempo line bends upward indicating that the song is brought down from a higher tempo. After a break the tempo remains constant. Another example is the song Anniversary Walz [Quo90] in which we clearly see a drummer who drifts around a target tempo.
Our visualization technique of beat
graphs offers some important advantages.
First, the visualization function is very easy to implement and can
be calculated quickly. This is in stark contrast with voice prints,
which require extensive mathematical analysis and offer no tempo
information.
Secondly, the beat graph contains rhythm information. From the
viewpoint
of the DJ, all the necessary information is present. The 'character'
of music can be recognized by looking at the picture. E.g., we can easily see how the song Alien Pump [Tan97] contains beats at notes 1, 2, 3, 4 and 4½ (position A in the picture). After a while the extra beat at the end of a measure is shifted forward in time by half a measure (position B). Similarly,
breaks and tempo changes can be observed (C, D and E). Third, as a DJ tool it offers a visualization that can be used to align the tempo immediately. E.g., by drawing a line on the beat graph, such as done in Lsd [Hal95], it is possible to recalculate the graph with an adjusted period. Not only does this introduce accurate tempo information, it also introduces the information necessary to find the start of a measure. This makes it easy to place cue-points accurately and quickly.
We will now further investigate how we can fully automatically line up the beat graph. If we are able to do so then we are able to calculate the tempo of a song automatically.
Beat graphs give us the possibility to visualize the tempo information within a song. We will now use this to search for the best possible visualization of the beat graph. An optimal visualization will give rise to as much 'horizontality' as possible. 'Horizontality' is measured by comparing every vertical slice (every measure) of the beat graph with the previous slice (the previous measure). Through accumulation of the differences between the consecutive slices we obtain a number that is large when little horizontality is present and small when a lot of horizontality is present. Formally, we measure

E(p) = Σx Σy |s(x·m·p + y) − s((x−1)·m·p + y)|

By expanding the indices we obtain the simpler formula

E(p) = Σ(i = m·p … n−1) |s(i) − s(i − m·p)|
The use of absolute values here is necessary to cancel out too strong oscillations due to phase differences. Normally, when two waves that are in phase are subtracted, the result will be 0. On the other hand, when the phase difference is 180°, the result will instead be 2 times the amplitude. By using an absolute value we have halved the maximum error due to phase differences: the maximum error now occurs at phase shifts of 90° and 270° and is one time the amplitude.
The problem we face now is to find the period p for which E(p) is the smallest. p must be located within a certain range, for instance between the periods corresponding to 160 BPM and 80 BPM. Once this period is found, we can convert it back to its tempo. Converting a period p (expressed in samples) to BPM is done as follows:

BPM = 60·r/p    (2)
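The conversion BPM = 60·r/p for a period of p samples at sampling rate r can be sketched as (names ours):

```python
def period_to_bpm(p, rate=44100):
    """Tempo for a beat period of `p` samples: BPM = 60 * r / p."""
    return 60.0 * rate / p

def bpm_to_period(bpm, rate=44100):
    """Beat period in samples for a given tempo."""
    return 60.0 * rate / bpm

# At 44100 Hz, a beat every 22050 samples is exactly 120 BPM.
```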
A possible optimization to this search process is the introduction of clipping values. Once the calculation of the ray accumulates an error larger than the clipping value it simply stops and returns the clipping value:
E(p, c) = c as soon as the accumulated sum exceeds the clipping value c; otherwise E(p, c) = E(p).    (3)
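A minimal sketch of the clipped ray evaluation of equation (3), assuming `s` holds the (resampled) samples and `mp` is the period of one measure in samples (names ours):

```python
def ray(s, mp, clip=float("inf")):
    """Accumulate |s[i] - s[i - mp]| over the fragment; abort and
    return `clip` once the running sum exceeds the clipping value."""
    total = 0.0
    for i in range(mp, len(s)):
        total += abs(s[i] - s[i - mp])
        if total > clip:
            return clip
    return total
```

A perfectly periodic signal yields an error of 0 at the correct period, and a large error elsewhere.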
The time needed to calculate the value of E(p) for one period is O(n − m·p). Because the period is often negligibly small in comparison to the entire length of a song, we omit it; hence, E(p) takes O(n) time to finish. To find the best match in a certain tempo range by doing a brute-force search, we need to shoot one ray per candidate period. We will denote the number of candidate periods (the size of this window) w_p. The resulting time estimate is accordingly O(w_p·n).
In direct comparison to the fast Fourier transform (recall O(w·log w)), this performance estimate behaves well because every single ray is linear in the length of the song. With respect to the multiplication factors involved, we will not compare them because this is inherently dangerous to do. After all, both algorithms depend highly on the speed of the inner loop, and this is very hardware dependent. Luckily, our algorithm is, in comparison with Fourier analysis, much simpler: the inner loop contains only one instruction, a subtraction with removal of the sign bit. But aside from this appealing performance, the real strength of our approach comes from its extremely high accuracy and its modular nature.
Our algorithm can verify any period above 0 and below n/m. This means that it can distinguish which one of two periods p or p + 1 is the best. Therefore, we will calculate the accuracy by measuring the smallest distinguishable frequency. Given a certain frequency, we convert it to its period, then assume that we have measured the period wrongly with the least possible error: 1 sample. The two resulting periods are converted back to BPMs and the difference between these two equals the theoretical accuracy. Suppose the tempo of the song is T; then the period of one measure (m beats) at this tempo is p = 60·r·m/T samples. The closest distinguishable next period is p + 1. After converting these two periods back to their tempos we get 60·r·m/p and 60·r·m/(p + 1), and thus an accuracy of 60·r·m/p − 60·r·m/(p + 1). Expansion of this expression results in

60·r·m/(p·(p + 1)) = T²/(60·r·m + T)
Table 4 presents an overview of the accuracies we might obtain with this technique. As can be seen, even the least accurately measurable tempo (170 BPM) can be distinguished with an accuracy of 0.00273 BPM, which is still 11 times more accurate than the required accuracy of 0.0313. For lower frequencies this even increases to 67.4 times better than the required accuracy! It should be noted that this is a theoretical accuracy. Later on we will experimentally validate the practical accuracy.
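The theoretical accuracy T²/(60·r·m + T) can be checked numerically (a sketch; r = 44100 and m = 4 as assumed in the paper):

```python
def theoretical_accuracy(bpm, rate=44100, m=4):
    """Smallest distinguishable tempo difference (in BPM) when the
    measure period of m beats is off by a single sample."""
    return bpm * bpm / (60.0 * rate * m + bpm)

# At 170 BPM this gives roughly 0.0027 BPM.
```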
The algorithm as presented allows us to verify the presence of one frequency in O(n) time. This opens the possibility of parallelizing the algorithm easily, but also the possibility of actively searching for matching frequencies while neglecting large parts of the search space. This of course requires a suitable search algorithm. In the following section we shed light on one search approach that has proven to be very fast and still highly accurate.
As explained in the previous section, verifying every possible frequency by doing a brute-force search is not such a good idea because it would simply take too much time. Nevertheless, the algorithm as presented here allows the verification of single frequencies very accurately and quickly. This has led to an algorithm that first does a quick scan of the entire frequency spectrum and then selects the most important peaks to investigate in further detail.
To skip large chunks of the search space, we use an incremental approach. First, the algorithm does a coarse-grained scan that takes into account 'energy' blocks of information (say, with a block size of b). From the obtained results, it decides which areas are the most interesting to investigate in further detail. This is done by calculating the mean error of all measured positions: everything below the mean error will be investigated in the next step. We iterate this process with a smaller block size of b/2 until the block size becomes 1.
The problem with this approach is that it might very well overlook steep valleys in E and consequently find and report a completely wrong period/tempo. To avoid this we flatten the function E by resampling the input data to match the required block size.
To avoid a strongly oscillating function E, which would make a selective descent very difficult, we will, in every iteration step, resample the audio stream to the correct frequency. This resampling, however, is very delicate. Simply taking the mean of the data within a block would throw away valuable information; in fact, it would throw away everything that cannot be represented at the given sampling rate. E.g., if we have a block size of 1024, then the final sampling rate will be 43 Hz. Such a low sampling rate can only describe the absolute low frequencies, up to 21.5 Hz [Nyq28]. Clearly, such a sample-rate conversion is not what we are looking for. Therefore we have invented the following resampling algorithm.
With every step, the algorithm will first rescale the audio stream by shrinking every block of b samples to 1 sample. This is done by taking the sum of the absolute values within the block:

s'(j) = Σ(i = j·b … (j+1)·b − 1) |s(i)|    (4)
The use of the absolute value in this context allows us to encode the presence of high frequencies that are fully contained within the resampling block. Normally, such a sine wave would cancel itself out if we simply took the sum of all values; by taking the absolute value, it is still present in the resulting block. The second reason why the use of absolute values is necessary is to automatically select the most important frequencies: the only sines that remain after resampling are the ones that cross the zero line. Any full sine completely encoded on the positive or negative side of the zero line will cancel itself out. In other words, we select only the waves that are, at the given time, most responsible for the wave not being zero. In fact, we loosely encode the spectrum by means of the absolute value.
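The block-summing resampling of equation (4) can be sketched as (assuming NumPy; names ours):

```python
import numpy as np

def resample(s, block):
    """Shrink every `block` samples to one sample by summing absolute
    values, so energy inside a block is not cancelled out."""
    n = (len(s) // block) * block
    return np.abs(s[:n]).reshape(-1, block).sum(axis=1)
```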
With the knowledge of how to resample the audio fragment and how to search for the most prominent frequency, we can now write down the algorithm in full detail. In the algorithm below, m is the number of beats that are supposed to be in one measure, and min_p and max_p bound the period range under investigation (see equation 2). The window size is given by w_p = max_p − min_p. The resample function is described in equation 4 and the ray function is described in equation 3. i is the number of requested iterations; the maximum block size will then be 2^i.
b := 2^i
resample(audio, b)
for (p := min_p/b; p <= max_p/b; p := p + 1)
    E[p] := ray(p, audio, b)
for (b := b/2; b >= 1; b := b/2)
    resample(audio, b)
    mean := mean of all E[p] from the previous pass
    for (each candidate p of the previous pass)
        if (E[p] < mean)
            for (q := 2*p; q <= 2*p + 1; q := q + 1)
                E[q] := ray(q, audio, b)
report the period p (at block size 1) with the smallest E[p]
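Putting the pieces together, the coarse-to-fine search can be sketched as below. This is our reading of the algorithm described above, not the original C++ code: `ray` omits clipping for brevity, and the mean-error pruning and block halving follow the textual description.

```python
import numpy as np

def resample(s, block):
    """Sum absolute values per block (equation 4)."""
    n = (len(s) // block) * block
    return np.abs(s[:n]).reshape(-1, block).sum(axis=1)

def ray(s, mp):
    """Horizontality error for a measure period of `mp` samples."""
    mp = max(1, int(mp))
    return float(np.abs(s[mp:] - s[:-mp]).sum())

def find_period(s, min_p, max_p, i=7):
    """Best measure period (in samples) in [min_p, max_p]: scan with
    block size 2**i, keep candidates below the mean error, halve the
    block size, and repeat down to block size 1."""
    block = 2 ** i
    candidates = range(min_p // block, max_p // block + 1)
    while True:
        r = resample(s, block)
        errors = {p: ray(r, p) for p in candidates}
        if block == 1:
            return min(errors, key=errors.get)
        mean = sum(errors.values()) / len(errors)
        keep = [p for p, e in errors.items() if e <= mean]
        block //= 2
        # each surviving coarse period refines into two finer ones
        candidates = sorted({q for p in keep for q in (2 * p, 2 * p + 1)})
```

On a strictly periodic test signal of period 8 and a search range of [4, 16], this sketch returns 8.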
Estimating the performance of this algorithm is easy but requires some explanation. Let us assume that we start with a block size of 2^i. This means that we will iterate i + 1 steps. The initial step, with a block size of 2^i, will result in w_p/2^i rays being shot. The cost to shoot one ray depends on the audio size; after resampling this is n/2^i.
The following steps, except for the last, are similar. However, here it becomes difficult to estimate how many rays will be shot, because we take the mean error of the previous step and select only the region that lies below the mean error. How large this area is, is unknown. Therefore we call the number of new rays divided by the number of old rays the rest-factor ρ. After one step, we get ρ·w_p/2^(i−1) rays; after two steps this is ρ²·w_p/2^(i−2), and by induction after j steps the number of rays becomes ρ^j·w_p/2^(i−j).
Every ray, after resampling to a block size of 2^(i−j), takes n/2^(i−j) time; hence the time estimate for step j is ρ^j·(w_p/2^(i−j))·(n/2^(i−j)) = ρ^j·w_p·n/4^(i−j). We continue this process from j = 1 until j = i. The last step (j = i) is similar, except that we can do the ray shooting in half the expected time because we are looking for the minimum and no longer for the mean error. This results in ρ^i·w_p·n/2.
The resampling cost at every step is O(n). After adding all steps together we obtain:

(i + 1)·n + Σ(j = 0 … i−1) ρ^j·w_p·n/4^(i−j) + ρ^i·w_p·n/2

After some reductions, we obtain:

(i + 1)·n + ρ^i·w_p·n·(1/2 + Σ(k = 1 … i) (4ρ)^(−k))
Depending on the values of ρ and i we obtain different performance estimates. We have empirically measured the value of ρ under a window of [80 BPM : 160 BPM]. We started our algorithm at the maximum block size and observed a mean value for ρ of 0.85. This results in an estimate of O(ρ^i·w_p·n). In practice, of the 66150 possible rays to shoot we only needed to shoot 142.35 (on average) to obtain the best match. So we only searched 0.20313% of the search space before obtaining the correct answer. This is a speedup of about 246.18 times in comparison with the brute-force search algorithm.
Setup. We have implemented this algorithm as a BPM counter for the DJ program BpmDj [Bel03]. The program can measure the tempo of .MP3 and .OGG files. It is written in C++ and runs under Linux. The program can be downloaded from http://bpmdj.sourceforge.net/. The parameters the algorithm currently uses are the ones given above; of course these can be modified when necessary. The algorithm starts iterating at the maximum block size and stops before the block size reaches 1, because a higher accuracy is simply not needed.
Figure 3 shows the output the counter produces for two songs. Every color presents the results of one scanning iteration: the first scan is colored green, the last scan red. The horizontal colored lines represent the position of the mean value for each scan. As can be seen, by using the mean value, large parts of the search space are quickly omitted.
The song Tainted Love [Cel96] is interesting because it shows 4 distinct spikes, each a multiple of the others. In fact, the fourth spike is the actual tempo with a measured period of 4 beats, which is a correlation between beat 1 and beat 5. The third spike occurs when the algorithm correlates the 1st and the 6th beat, hence calculates a tempo with measures of 5 beats while it expects to find only 4 beats in a measure; this results in a tempo of 114.5 BPM. The second spike is similar, but now a correlation between the 1st and the 7th beat occurs. The first spike is a correlation between the 1st beat and the 8th beat. We have these distinct spikes because the song is relatively empty and monotone. In fact, the algorithm first chooses the spike at position 95.49 BPM. Luckily, afterwards other clues were found that led the algorithm to choose a tempo of 143.236 BPM.
Experiment. The experiment consisted of a selection of 150 techno songs. Each song's tempo was manually measured by using the beat graphs. All songs had a tempo between 80 BPM and 160 BPM; 7 of them had a drifting tempo. In these songs we manually selected the largest straight part (as done for the song X-Files [CM01]). Afterwards, we ran the BPM counter on all the songs again. For every song the counter reported: a) the normal playing time of the song, b) the time necessary to analyze the song, c) the manually measured tempo and d) the automatically measured tempo. The machine on which we did the tests was a Dell Optiplex GX1, 600 MHz, running Debian GNU/Linux. The counter was compiled with the GNU C Compiler version 3.3.2 with maximum optimization, but without machine-specific instructions.
Results. As expected, the results contained a number of wrongly reported tempos due to harmonics. These have been detected and eliminated by simply multiplying the reported period by the appropriate correction factor. 10 songs had the 5-beat harmonic problem, 10 songs had the 6-beat harmonic problem and 5 songs had the 7-beat harmonic problem. The whole test set had two outliers. One outlier was an error of about 11 BPM (for which we still don't understand the reason). The other outlier was a mismatch of 1.2 BPM. All other measurement errors were below 0.48 BPM. The mean error (without the two outliers) was 0.0340 BPM, from which we can conclude that our algorithm has a practical accuracy of 0.0340 BPM, which is close to the required accuracy of 0.0313. Concerning the analyzing speed, we observed a mean speed of 8.33 times real time. E.g., if a song is 120 seconds long, then the algorithm will report the tempo in 14.4 seconds.
In this paper we have introduced our visualization technique, called beat graphs. A beat graph shows the structure of the music under a certain tempo. It shows rhythmic information, covering where breaks and beats are positioned and tempo information. Beat graphs can be quickly drawn and are a valuable tool for the digital DJ, because they can also be used as a fingerprint of songs.
Based on these beat graphs, we have developed an offline algorithm to determine the tempo of a song fully automatically. The algorithm works by shooting rays through the digital audio and checking the similarities between consecutive slices within the audio. To speed up the process of finding the best matching ray, we presented an optimized search algorithm that must only search 0.2% of the entire search space.
The BPM counter works fully correctly in 82% of the cases. Another 17% are correctly measured but report a tempo that is a harmonic of the actual tempo. The remaining one percent is measured incorrectly. The accuracy of the measurement is 0.0340 BPM, which is enough to keep the drift between two songs below 1/32 beat after 30'' (or below 1/16 beat after 60''). Aside from being very accurate, our algorithm is also insensitive to breaks and gaps in the audio stream. The algorithm runs 8.33 times faster than real time on a 600 MHz Dell Optiplex, without making use of processor-specific features.
Harmonics. The problem of harmonics comes from the fact that we are searching over a very large area (from 80 BPM to 160 BPM). When a measurement reports a harmonic tempo it can be quickly corrected by multiplying the period with the appropriate factor; however, this requires extra user input. How we can eliminate the reporting of harmonics without extra user input remains an important question.
Speed. As we have observed in the BPM analyzing graphs, the descent towards the absolute minimum is often straightforward. After one iteration, the final peak is often already known. This can probably be turned to our advantage by tuning the selection threshold to narrow down the search faster and thus increase the speed of the algorithm.
Another area where a speed improvement is possible is the processor cache. Instead of analyzing ray per ray, we can analyze one block of the song (which then remains in the cache) for all rays at once. This would probably increase the speed of the algorithm even further.
Order of the ray shooter. The ray evaluation itself can probably be improved by taking into account multiple consecutive slices instead of only the previous slice. Whether this would increase the accuracy of the algorithm should be investigated further.
Tempo Line. Detecting tempo modifications over time is an important problem. For humans, the beat graphs are easily understandable; however, deducing the tempo line automatically remains a very difficult problem because silent passages and breaks will disrupt the tempo line completely. Probably some autocorrelation technique together with a high-level beat-graph analysis (where are the breaks positioned?) might help.
Beat graph improvements. The beat graph still has some room for improvement. We are not yet making use of color information. By filtering the audio stream through two complementary filters (a high-pass filter and a low-pass filter) we might be able to create two beat graphs, one presenting the low frequencies and another presenting the high frequencies. By superimposing these two graphs, one colored red, the other yellow, we might be able to offer even more information.
FFT & PCA. By first converting the audio stream to a voice print and then detecting the principal frequency axis, we might be able to encode the spectrum at a given position in time more accurately than we currently do through absolute values.