BPM Measurement of Digital Audio
by means of Beat Graphs & Ray Shooting

author: Dr. Van Belle Werner
e-mail: werner.van.belle@vub.ac.be


Abstract:

To automatically create a beatmix of two digital audio streams, one needs to know the exact tempo of the songs involved. In this paper we present a fully automatic algorithm to measure the mean tempo of a song with very high accuracy. The algorithm itself is an easily implementable offline search algorithm that looks for the tempo that best describes the song. For every investigated tempo, it steps through the song and compares the similarity of consecutive pieces of information (a bass drum, a hi-hat, ...). The practical performance, with respect to time, space and accuracy, is substantially better than Fourier analysis.

keywords: BPM Counting, tempo measurement, audio content extraction

Introduction

Automatically measuring the tempo of a digital soundtrack is crucial for DJ programs such as BpmDj [Bel03], BeatForce [Beu01], and other similar software. The ability to mix two (or more) pieces of music depends almost entirely on the availability of correct tempo information. All too often this information must be supplied by the DJ, which results in inaccurate tempo descriptions that are useless from a programming point of view [Cli00].

In this paper, we first elaborate on the required tempo accuracy. Afterwards we explain the use and limitations of a fast Fourier transformation to detect the most prominent tempo, and finally we explain our own technique of beat graphs and ray shooting. The presentation of our technique covers the algorithm, performance estimates and experiments. For developers who want to use the algorithm 'out of the box', we give a list of parameters that makes the algorithm robust and fast on a standard personal computer.

In the upcoming discussion we assume that the fragments that need to be mixed have a fixed but initially unknown tempo. Certainly not all songs fall into this category; nevertheless, this assumption can safely be made because we are only interested in the fragments that need to be mixed. After all, very few DJs will mix two songs when one of the tempos changes too much.

Throughout the text we make use of different units to express frequency and period. Frequency is expressed in Hz ($=\frac{1}{sec}$) or in BPM (beats per minute), the number of beats played during one minute. To convert between BPM and Hz we simply multiply or divide by 60: $1\,\textrm{BPM}=\frac{1}{60}\,\textrm{Hz}$ and $1\,\textrm{Hz}=60\,\textrm{BPM}$. A period describes the time between two ticks (or beats). Converting a frequency $f$ to a period $t$ is done using the standard formula $f=\frac{1}{t}$. If we have a frequency specified in BPM, then we can obtain the time between two beats in milliseconds with $t=\frac{60000}{f}$. In this paper, one measure is 4 beats. Given these units we can now start investigating how accurately we need to measure the tempo.
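For illustration, these conversions are compact enough to write down directly. The following is a minimal C++ sketch; the helper names are ours, not part of BpmDj:

    #include <cstdio>

    // Unit conversions used throughout the text (illustrative names).
    double bpm_to_hz(double bpm)      { return bpm / 60.0; }   // 1 BPM = 1/60 Hz
    double hz_to_bpm(double hz)       { return hz * 60.0; }    // 1 Hz = 60 BPM
    double bpm_to_period_ms(double f) { return 60000.0 / f; }  // time between beats

    int main() {
        // A 140 BPM song: 2.333 Hz, one beat every ~428.6 ms,
        // one 4-beat measure every ~1714.3 ms.
        std::printf("%.3f Hz\n", bpm_to_hz(140));
        std::printf("%.1f ms per beat\n", bpm_to_period_ms(140));
        std::printf("%.1f ms per measure\n", 4 * bpm_to_period_ms(140));
        return 0;
    }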

Required Accuracy

Accurate tempo information is important because, without it, two songs will drift out of synchronization very quickly. This often results in a mix that degrades to chaos, which is not what we want. Therefore we must ask ourselves how accurately we should measure the tempo of a song. To express the required accuracy, we will calculate the maximum error on the tempo measurement to keep the synchronization drift after a certain timespan below a certain value.

We assume that the song is measured with an error $\epsilon $. If the exact tempo of the song is $f$ BPM, then the measurement will report $f\pm\epsilon$ BPM. Such a wrong measurement leads to a drift $\gamma $ after a certain timespan (denoted $t$). The timespan we consider is the time needed by the DJ to mix the songs; typically this can be 30'', 60'' or even up to 2 minutes if he wants to alternate between different songs. We express the drift in beats (at the given tempo) rather than in absolute time, because slow songs don't suffer as much from the same absolute drift as fast songs. E.g., a drift of 25 ms is barely noticeable in slow songs (70 BPM), while such a drift is very disturbing in fast songs (145 BPM). Therefore, we consider a maximum drift of $\frac{1}{b}$ beat. Given the exact tempo $f$ and the measurement error $\epsilon $, we can now calculate the drift $\gamma $ after timespan $t$.

To do so, we must first know how many beats are contained within $t$. Based on the standard relation between period and frequency, this is $\frac{f.t}{60}$. Because the tempo is measured with an accuracy of $\epsilon $, we get a drift (expressed in beats) of $\gamma=\frac{(f\pm\epsilon).t}{60}-\frac{f.t}{60}=\pm\frac{\epsilon.t}{60}$. To keep this drift below $\frac{1}{b}$ beat, we must keep $\left\vert\gamma\right\vert<\frac{1}{b}$. Because $\left\vert\gamma\right\vert=\frac{\epsilon.t}{60}$, we obtain a required accuracy of

\begin{displaymath}
\epsilon<\frac{60}{b.t}
\end{displaymath} (1)
Equation 1 describes how accurately one song should be measured to have a maximum possible drift of $\frac{1}{b}$ beats after a time of $t$ seconds. In practice, at least two songs are needed to create a mix. If all involved songs are measured with the same accuracy $\epsilon $, then the maximum drift after $t$ seconds will be the sum of the maximum drifts of all involved songs. From this, it becomes clear that we need to divide the required accuracy by the number of songs. Table 2 gives an overview of the required accuracies. As can be seen, an accuracy of 0.0313 BPM is essential, while an accuracy of 0.0156 BPM is comfortable.



# songs   $\gamma(\textrm{beat})$   30''      60''      90''      120''
1         $1/32$                    0.0625    0.0313    0.0208    0.0156
1         $1/16$                    0.125     0.0625    0.0417    0.0313
1         $1/8$                     0.25      0.125     0.0833    0.0625
2         $1/32$                    0.0313    0.0156    0.0104    0.0078
2         $1/16$                    0.0625    0.0313    0.0208    0.0156
2         $1/8$                     0.125     0.0625    0.0417    0.0313
3         $1/32$                    0.0208    0.0104    0.0069    0.0052
3         $1/16$                    0.0417    0.0208    0.0139    0.0104
3         $1/8$                     0.0833    0.0417    0.0278    0.0208
4         $1/32$                    0.0156    0.0078    0.0052    0.0039
4         $1/16$                    0.0313    0.0156    0.0104    0.0078
4         $1/8$                     0.0625    0.0313    0.0208    0.0156

Table 2: This table presents an overview of the required accuracy $\epsilon $ (in BPM) given a number of songs, a maximum allowable drift $\gamma $ (in beats) and a timespan $t$ (in seconds).
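Equation 1, divided by the number of songs involved, makes Table 2 straightforward to reproduce. A small C++ sketch (our own illustration):

    #include <cstdio>

    // Required accuracy from equation (1): epsilon < 60/(b*t),
    // divided by the number of songs in the mix.
    double required_accuracy(int songs, int b /* drift: 1/b beat */,
                             int t /* timespan in seconds */) {
        return 60.0 / (b * t) / songs;
    }

    int main() {
        const int drifts[]    = {32, 16, 8};
        const int timespans[] = {30, 60, 90, 120};
        for (int songs = 1; songs <= 4; ++songs)
            for (int b : drifts) {
                std::printf("%d song(s), 1/%-2d beat:", songs, b);
                for (int t : timespans)
                    std::printf("  %.4f", required_accuracy(songs, b, t));
                std::printf("  BPM\n");
            }
        return 0;
    }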


Fourier Analysis

In this section we investigate the use of a Fourier transformation to measure the lower end of the signal spectrum. This allows us to detect the most prominent low frequency within a soundtrack. This frequency can be easily converted to the tempo of the song. We will discuss the performance and accuracy of such an analysis.

Peak Detectors

To avoid confusion, we must warn the reader that we are not referring to peak detectors that spike at certain frequencies. These programs make use of a Fourier analysis or specially tuned digital filters to detect bass drums (frequencies between 2 and 100 Hz), hi-hats (above 8000 Hz) or other repetitive sounds. By measuring the distance between the spikes, they obtain the mean tempo. The problem with this technique is that it is inherently highly inaccurate: it not only requires detailed fine tuning, but afterwards still requires a comprehension phase of the involved rhythm [YKT95].

Its inaccuracy comes from the fact that low frequencies (encoding a bass drum, for instance) have a large period, which makes it difficult to position the sound accurately. In other words, the Fourier analysis will position the spike somewhere around the actual position of the beat, which makes the algorithm inherently inaccurate. The second problem is the difficulty of designing a set of peak responses that works for all songs. E.g., some songs are very repetitive with respect to the bass drums but not with respect to the bass itself, so tuning the algorithm to detect only the bass drums can be very difficult. The last problem of peak detectors is that they have no notion of the involved rhythm. E.g., a drum line with beats at positions 1, 2, 3, 4 and $4\frac{1}{2}$ is notoriously difficult to understand, because the extra spike at position $4\frac{1}{2}$ seriously lowers the mean time between beats; hence the algorithm reports a very inaccurate tempo.

So, to state it clearly, we are not referring to this kind of application of Fourier analysis. The use we are aiming at is what a Fourier analysis is essentially supposed to do: to directly measure the presence of frequencies between 80 BPM (= 1.333... Hz) and 160 BPM (= 2.666... Hz). As we will point out, this application of Fourier analysis also has major limitations.

The Fourier Transform

A digital signal is typically represented as a series of samples (denoted $a_{i}$). Each sample describes the amplitude of the signal at that given position in time. The number of samples present is denoted $n$. The length of the audio fragment is $t$ (expressed in seconds).

A Fourier transformation takes these samples as input and transforms them into a superposition of sine waves. Every sine wave has its own strength and its own frequency. Essentially, it converts a signal represented in the amplitude domain to a signal represented in the frequency domain [HJB84]. A Fourier transformation is characterized by two parameters, the sampling rate and the window size. The sampling rate (denoted $r$) describes the number of samples encoded in one second. In this paper we assume a standard sampling rate of 44100 Hz. The window size (denoted $w$) is a parameter of the Fourier transform that specifies the number of samples that are given as input to the algorithm. Due to technical characteristics of the FFT, the window size is typically a power of two. The outputs of a Fourier transform are the strengths of all frequencies between $0$ and $r$ Hz. However, from these only the ones between $0$ and $\frac{r}{2}$ Hz are of interest, because a sampling rate of $r$ Hz can only describe signals up to $\frac{r}{2}$ Hz [Nyq28].
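To make this concrete, a direct spectral tempo measurement along these lines could be sketched as follows with the FFTW library [FJ97]. This is our own illustration, not BpmDj code; fft_tempo and its parameters are invented names:

    #include <fftw3.h>
    #include <cmath>
    #include <vector>

    // Sketch: report the strongest spectral component between 80 and 160 BPM
    // of a mono signal sampled at `rate` Hz. Output bin i of a real-to-complex
    // FFT over w samples holds the strength of frequency i*rate/w Hz.
    double fft_tempo(std::vector<double> samples, double rate) {
        int w = 1;                              // largest power of two <= n
        while (2 * w <= (int)samples.size()) w *= 2;
        samples.resize(w);
        fftw_complex* out =
            (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * (w / 2 + 1));
        fftw_plan plan =
            fftw_plan_dft_r2c_1d(w, samples.data(), out, FFTW_ESTIMATE);
        fftw_execute(plan);
        int lo = (int)std::ceil( 80.0 / 60.0 * w / rate);   // bin near 80 BPM
        int hi = (int)std::floor(160.0 / 60.0 * w / rate);  // bin near 160 BPM
        int best = lo;
        for (int i = lo; i <= hi; ++i)
            if (std::hypot(out[i][0], out[i][1]) >
                std::hypot(out[best][0], out[best][1]))
                best = i;
        fftw_destroy_plan(plan);
        fftw_free(out);
        return 60.0 * best * rate / w;          // BPM, accurate only to 60/t
    }

As the next sections show, the resolution of such a measurement is bounded by 60/t BPM, which is exactly the limitation we set out to avoid.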

Performance

The FFT is in general a fast algorithm; its performance estimate is $O(w.log_{2}w)$. Hardware specific optimizations might lead to notable improvements, especially in the range of smaller window sizes (between $2$ and $2^{16}$). For standard desktop computers, efficient implementations exist, such as the FFTW [FJ97] and Intel's high performance math libraries [Int03]. However, the larger the window size becomes, the closer the different implementation techniques come to a constant factor of the theoretical $O(w.log_{2}w)$. Nowadays, window sizes of $2^{24}$ samples (a song of 6'20'') are achievable on standard personal computers. Nevertheless, the reason why we don't use a fast Fourier transform lies in its lack of accuracy.

Accuracy

Table 3: Accuracy of the FFT to measure BPM as a function of the song length (sampling rate = 44100 Hz).

song length   window size (samples)   accuracy (BPM)
1.5''         2^16                    40.37
2.9''         2^17                    20.18
5.9''         2^18                    10.09
11.9''        2^19                    5.04
23.7''        2^20                    2.52
47.5''        2^21                    1.26
1'35''        2^22                    0.63
3'10''        2^23                    0.315
6'20''        2^24                    0.157
12'40''       2^25                    0.078


The accuracy of the spectrum measurement depends on both the sampling rate and the window size. The Fourier transformation puts out $w$ frequencies. Its highest output (output number $w$) reports the strength of frequency $r$ Hz, while the lowest output (output number 1) reports the strength of $\frac{r}{w}$ Hz. In general, output $i$ reports the strength of frequency $\frac{i.r}{w}$ Hz. This leads to the observation that the smallest distance between two measurable frequencies is $\frac{r}{w}$ Hz. E.g., if the window size is 1024 (= 23.2 ms at a sampling rate of 44100 Hz), then we have an accuracy of 43 Hz (that is, we can measure the energy at 43 Hz, 86 Hz, and so on).

Now, if we want to increase the accuracy of a Fourier analysis, we can either decrease the sampling rate or increase the window size. We cannot decrease the sampling rate, because we would throw away useful information (hi-hats, claps and so on). Therefore, our only remaining option is to select the largest possible window size. Here, however, we are limited by the size of the audio fragment. This leads us to specify the accuracy of the Fourier analysis as a function of the length of the song. For the sake of simplicity, we assume the number of samples is a power of 2. The accuracy of an FFT over a song of length $t$ (in seconds) is then, after conversion to BPM, given by

\begin{displaymath}
\epsilon=\frac{60.r}{r.t}=\frac{60}{t}\end{displaymath}

Table 3 contains an overview of the BPM measurement accuracy for a given song length. Clearly, an accuracy of 0.157 BPM (for a song of 6'20'') is not even close to the essential accuracy of 0.0313 BPM, let alone the comfortable accuracy of 0.0156 BPM.
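The entries of Table 3 follow directly from $\epsilon=\frac{60}{t}$; a quick check (our own illustration):

    #include <cstdio>

    int main() {
        const double r = 44100.0;               // sampling rate in Hz
        for (int e = 16; e <= 25; ++e) {
            double w = (double)(1L << e);       // window size in samples
            std::printf("2^%d samples = %7.1f s -> %.3f BPM\n",
                        e, w / r, 60.0 * r / w);
        }
        return 0;
    }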

Because the fast Fourier transform is highly inaccurate and its accuracy is very dependent on the length of the song, we created our own technique, called beat graphs and ray shooting.

Beat Graphs

Figure 1: Beat graphs of two songs. Left: Alien Pump by Tandu [Tan97] at 148.011 BPM. Right: Lsd by Hallucinogen [Hal95] at 136.98 BPM.

Our visualization technique of beat graphs shows the tempo information of music. It visualizes the function

\begin{displaymath}
g(x,y,p)=\vert a_{x.M.p+y}\vert\end{displaymath}

Horizontally, the measures are enumerated, while vertically the content of one such measure is visualized. The color of a pixel at position $(x,y)$ is given by the absolute strength of the signal at time step $x.M.p+y$. The value $p$ is the period between two beats, expressed in samples. We multiply this period by a multiplication factor $M$ that specifies how many beats are supposed to be in one measure. We assume throughout that $M$ equals 4, because most popular songs are played in a 4/4 measure.
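A minimal sketch of how such an image can be computed (the buffer and function names are illustrative; BpmDj's actual drawing code is not shown here):

    #include <cmath>
    #include <vector>

    // Beat graph g(x,y,p) = |a[x*M*p + y]| as a 2D intensity image:
    // one column per measure, one row per sample offset inside the measure.
    std::vector< std::vector<double> >
    beat_graph(const std::vector<double>& a,
               int p /* beat period in samples */, int M = 4) {
        const int measure = M * p;               // samples per measure
        const int columns = (int)a.size() / measure;
        std::vector< std::vector<double> > g(columns,
                                             std::vector<double>(measure));
        for (int x = 0; x < columns; ++x)        // measure number
            for (int y = 0; y < measure; ++y)    // offset within the measure
                g[x][y] = std::fabs(a[x * measure + y]);
        return g;                                // g[x][y] -> pixel intensity
    }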

Figure 1 shows two beat graphs. Reading a beat graph is best done from top to bottom and left to right. For instance, in the song Alien Pump [Tan97] we see 4 distinct horizontal lines (numbers 1, 2, 3 and 4). These are the 'beats' of the song. The top line covers all the beats on the first note of the measure, the second horizontal line covers all the beats on the second note within a measure, and so on. The vertical white strokes that break the horizontal lines (C, D and E) are breaks within the music: passages without bass drum or other notable signal. Because the lines are distinctly horizontal, we know that the period is correct. If the lines in the beat graph are slanted, as in Lsd [Hal95] (the blue line in the picture), then the period (and thus the tempo) is slightly incorrect. In this case the line slants upwards, which means that the period is too short (the beat in the next vertical slice comes before the beat in the current one); hence the assumed tempo is a bit too high. By looking at the direction of the lines in a beat graph we can relate the actual tempo to the visualized tempo.

Figure 2: Beat graphs of two songs. Left: X-Files by Chakra & Edi Mis [CM01] at 139.933 BPM. Right: Anniversary Walz by Status Quo [Quo90] at 172.614 BPM.

A beat graph does not necessarily display only straight lines. Two examples are given in figure 2. The first example is the song X-Files [CM01]: in the beginning the tempo line bends upward, indicating that the song is brought down from a higher tempo; after a break the tempo remains constant. Another example is the song Anniversary Walz [Quo90], in which we clearly see a drummer drifting around a target tempo.

Our visualization technique of beat graphs offers some important advantages. First, the visualization function is very easy to implement and can be calculated quickly. This is in stark contrast with voice prints, which require extensive mathematical analysis and offer no tempo information. Second, the beat graph contains rhythm information. From the viewpoint of the DJ, all the necessary information is present; the 'character' of the music can be recognized by looking at the picture. E.g., we can easily see how the song Alien Pump [Tan97] contains beats at notes 1, 2, 3, 4 and 4$\frac{1}{2}$ (position A in the picture). After a while the extra beat at the end of a measure is shifted forward in time by half a measure (position B). Similarly, breaks and tempo changes can be observed (C, D and E). Third, as a DJ tool it offers a visualization that can be used to align the tempo immediately. E.g., by drawing a line on the beat graph, as done in Lsd [Hal95], it is possible to recalculate the graph with an adjusted period. Not only does this introduce accurate tempo information, it also yields the information necessary to find the start of a measure. This makes it easy to place cue points accurately and quickly.

We will now further investigate how we can fully automatically line up the beat graph. If we are able to do so then we are able to calculate the tempo of a song automatically.

Ray Shooting

Beat graphs give us the possibility to visualize the tempo information within a song. We will now use this to search for the best possible visualization of the beat graph. An optimal visualization will give rise to as much 'horizontality' as possible. 'Horizontality' is measured by comparing every vertical slice (every measure) of the beat graph with the previous slice (the previous measure). Through accumulation of the differences between the consecutive slices we obtain a number that is large when little horizontality is present and small when a lot of horizontality is present. Formally, we measure

\begin{displaymath}
ray(p)=\sum_{x=1}^{\frac{n}{M.p}}\sum_{y=0}^{M.p}\vert g(x,y,p)-g(x-1,y,p)\vert\end{displaymath}

By expanding $g(x,y,p)$ we obtain the simpler formula,

\begin{displaymath}
ray(a,p)=\sum_{i=M.p}^{n}\left\vert\left\vert a_{i}\right\vert-\left\vert a_{i-M.p}\right\vert\right\vert\end{displaymath}

The use of absolute values here is necessary to cancel out overly strong oscillations due to phase differences. When two waves in phase are subtracted, the result is 0. When the phase difference is 180°, however, the result is 2 times the amplitude. By taking absolute values we halve the maximum error due to phase differences: the maximum error now occurs at phase shifts of 90° and 270° and equals one times the amplitude.
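In code, the ray of a single period is one pass over the samples. A minimal C++ sketch under the conventions above, where $p$ is the beat period in samples so the lag of one measure is $M.p$ (names are ours, not BpmDj's):

    #include <cmath>
    #include <vector>

    // ray(a,p): accumulate the difference between every sample and the
    // sample one measure (M*p samples) earlier; small values mean that
    // consecutive measures look alike, i.e. the beat graph is horizontal.
    double ray(const std::vector<double>& a,
               int p /* beat period in samples */, int M = 4) {
        const int lag = M * p;
        double error = 0;
        for (size_t i = lag; i < a.size(); ++i)
            error += std::fabs(std::fabs(a[i]) - std::fabs(a[i - lag]));
        return error;
    }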

The problem we face now is to find the period $p$ for which $ray(p)$ is the smallest. $p$ must be located within a certain range, for instance between the periods corresponding to 160 BPM and 80 BPM. Once this period is found, we can convert it back to a tempo. Converting between periods and tempos is done as follows.

\begin{displaymath}
bpm(p):=\frac{60.M.r}{p}\qquad period(f):=\frac{60.M.r}{f}
\end{displaymath} (2)
In equation 2, $p$ denotes the period of a full measure of $M$ beats, expressed in samples. At 80 BPM one beat lasts 0.75'', which is (at a sampling rate of 44100 Hz) 33075 samples; a measure of $M=4$ beats thus spans 132300 samples. Similarly, 160 BPM corresponds with a beat period of 16537 samples, or 66150 samples per measure. The range in which we are looking for the best match is denoted $[p_{0}:p_{1}]$. $p_{0}$ is the smallest period (thus the highest tempo), while $p_{1}$ is the largest period (thus the lowest tempo). Given these two boundaries, we can calculate the period of a song as follows

\begin{displaymath}
period\,\textrm{is song period}\iff ray(period)=min\{ ray(p)\,\vert\, p\in[p_{0}:p_{1}]\}\end{displaymath}

A possible optimization of this search process is the introduction of clipping values. Once the calculation of a ray accumulates an error larger than the clipping value, it simply stops and returns the clipping value:

\begin{displaymath}
ray(a,p,c)=min(c,\sum_{i=M.p}^{n}\left\vert\left\vert a_{i}\right\vert-\left\vert a_{i-M.p}\right\vert\right\vert)
\end{displaymath} (3)
Because this ray function is bounded by the error of the best match found so far, its calculation time can, as the number of scans increases, only decrease. The search algorithm can then be written down as


for( $min:=+\infty,\, i:=0$; $i<w$; $i:=i+1$ )
    $match:=ray(audio,p_{0}+i,min)$
    if $match<min$
        $min:=match$
        $tempo:=bpm(p_{0}+i)$
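A C++ sketch of this brute force search, extending the earlier ray sketch with the clipping value of equation 3 (again our own illustration; here $p$ denotes the beat period, so the conversion to BPM is $\frac{60.r}{p}$):

    #include <cmath>
    #include <limits>
    #include <vector>

    // Clipped ray of equation (3): stop as soon as the accumulated error
    // exceeds the best match found so far.
    double ray(const std::vector<double>& a, int p, double clip, int M = 4) {
        const int lag = M * p;
        double error = 0;
        for (size_t i = lag; i < a.size(); ++i) {
            error += std::fabs(std::fabs(a[i]) - std::fabs(a[i - lag]));
            if (error >= clip) return clip;      // no longer a candidate
        }
        return error;
    }

    // Brute force search over all beat periods in [p0:p1].
    double brute_force_bpm(const std::vector<double>& a, int p0, int p1,
                           double r = 44100, int M = 4) {
        double min = std::numeric_limits<double>::infinity(), tempo = 0;
        for (int p = p0; p <= p1; ++p) {
            double match = ray(a, p, min, M);
            if (match < min) {
                min = match;
                tempo = 60.0 * r / p;            // beat period -> BPM
            }
        }
        return tempo;
    }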

Performance

The time needed to calculate the value of $ray$ for one period is $n-M.p$. Because the period is usually negligibly small in comparison with the entire length of the song, we omit it; hence $ray$ takes $O(n)$ time to finish. To find the best match in a certain tempo range by doing a brute force search, we need to shoot $p_{1}-p_{0}$ rays. We denote the size of this window $w$. The resulting time estimate is accordingly $O(n.w)$.

In direct comparison to the fast Fourier transform (recall $O(n.log_{2}n)$ for a window covering the whole song), this estimate has the appealing property that verifying a single tempo is linear in $n$. With respect to the multiplication factors involved, we will not compare the two, because doing so is inherently dangerous: both algorithms depend highly on the speed of their inner loop, and this is very hardware dependent. Our algorithm is, however, much simpler than a Fourier analysis: the inner loop contains only one instruction, a subtraction with removal of the sign bit. But aside from this appealing performance, the real strength of our approach comes from its extremely high accuracy and its modular nature.

Accuracy

Our algorithm can verify any period above 0 and below $\frac{n}{2}$. This means that it can distinguish which of two adjacent periods $p$ and $p+1$ describes the song best. Therefore, we calculate the accuracy $\epsilon $ from the smallest distinguishable tempo difference. Given a certain tempo, we convert it to its period, then assume that we have measured the period with the smallest possible error: 1 sample. The two resulting periods are converted back to BPM, and the difference between them equals the theoretical accuracy. Suppose the tempo of the song is $f$; the period of this tempo is $period(f)$, and the closest distinguishable period is $period(f)+1$. Converting both periods back to tempos gives an accuracy of $f-bpm(period(f)+1)$. Expanding this expression results in

\begin{displaymath}
\epsilon=f-\frac{60.M.r.f}{60.M.r+f}=\frac{f^{2}}{60.M.r+f}\end{displaymath}

Table 4: Accuracy of the ray shooting technique. Horizontally the measured tempo is set out, vertically the accuracy expressed in BPM is shown. The full line is the accuracy of the algorithm using measures of 4 beats, the dotted line is the accuracy of the algorithm using measures of 8 beats.
Tempo (BPM)   Accuracy (BPM)
70            0.000463
80            0.000605
90            0.000765
100           0.000945
110           0.00114
120           0.00136
130           0.00160
140           0.00185
150           0.00213
160           0.00242
170           0.00273


Table 4 presents an overview of the accuracies we might obtain with this technique. As can be seen, even the least accurately measurable tempo (170 BPM) can be distinguished with an accuracy of 0.00273 BPM, which is still 11 times more accurate than the required accuracy of 0.0313 BPM. For lower tempos this even increases to 67.4 times better than the required accuracy! It should be noted that this is a theoretical accuracy; later on we will experimentally validate the practical accuracy.
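The entries of Table 4 can be reproduced directly from the accuracy formula above (a quick check, our own illustration):

    #include <cstdio>

    int main() {
        const double r = 44100.0, M = 4.0;
        for (int f = 70; f <= 170; f += 10) {
            double eps = f - 60.0 * M * r * f / (60.0 * M * r + f);
            std::printf("%3d BPM -> %.6f BPM\n", f, eps);  // = f*f/(60*M*r+f)
        }
        return 0;
    }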

Selectivity

The algorithm as presented verifies the presence of one frequency in $O(n)$. This opens the possibility of easily parallelizing the algorithm, but also of actively searching for matching frequencies while neglecting large parts of the search space. This of course requires a suitable search algorithm. In the following section we shed light on one search approach that has proven to be very fast and still highly accurate.

Searching: Selective Descent

As explained in the previous section, verifying every possible frequency by doing a brute force search is not such a good idea because it would simply take too much time. Nevertheless, the algorithm presented here allows single frequencies to be verified very accurately and quickly. This has led to an algorithm that first does a quick scan of the entire frequency spectrum and then selects the most important peaks to investigate in further detail.

To skip large chunks of the search space, we use an incremental approach. First, the algorithm does a coarse grained scan that takes into account 'energy' blocks of information (say, with a block size of $2^{x}$). From the obtained results, it decides which areas are the most interesting to investigate in further detail. This is done by calculating the mean error over all measured positions; everything below the mean error is investigated in the next step. We iterate this process with a smaller block size of $2^{x-1}$ until the block size becomes 1.

The problem with this approach is that it might very well overlook steep valleys and consequently find and report a completely wrong period/tempo. To avoid this, we flatten the $ray(...)$ function by resampling the input data to match the required block size.

Resampling

To avoid a strongly oscillating $ray$ function, which would make a selective descent very difficult, we resample the audio stream, in every iteration step, to the appropriate rate. This resampling, however, is very delicate. By simply taking the mean of the data within a block, we would throw away valuable information; in fact, we would throw away everything that cannot be represented at the resulting sampling rate. E.g., with a block size of 1024 the final sampling rate is 43 Hz, and such a low sampling rate can only describe frequencies up to 21.5 Hz [Nyq28]. Clearly, such a sampling rate conversion is not what we are looking for. Therefore we have invented the following resampling algorithm.

With every step, the algorithm first rescales the audio stream by shrinking every block of $2^{x}$ samples to 1 sample. This is done by taking the sum of the absolute values within the block:

\begin{displaymath}
b=resample(a,s)\quad\iff\quad b_{i}=\sum_{j=0}^{s-1}\vert a_{i.s+j}\vert
\end{displaymath} (4)

The use of the absolute value in this context allows us to encode the presence of high frequencies that are fully contained within a resampling block. Normally such a sine wave would cancel itself out if we simply took the sum of all values; by taking the absolute value, it remains present in the resulting block. The second reason why absolute values are necessary is that they automatically select the most important frequencies: the only sines that survive the resampling are the ones that cross the zero line. Any full sine completely encoded on the positive or negative side of the zero line cancels itself out. In other words, we select only the waves that are, at the given time, most responsible for the signal not being zero. In fact, we loosely encode the spectrum by means of the absolute value.
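A direct C++ sketch of equation 4 (illustrative names):

    #include <cmath>
    #include <vector>

    // Equation (4): shrink every block of s samples to one value, the sum
    // of the absolute sample values inside the block.
    std::vector<double> resample(const std::vector<double>& a, int s) {
        std::vector<double> b(a.size() / s, 0.0);
        for (size_t i = 0; i < b.size(); ++i)
            for (int j = 0; j < s; ++j)
                b[i] += std::fabs(a[i * s + j]);
        return b;
    }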

Algorithm

With the knowledge of how to resample the audio fragment and how to search for the most prominent frequency, we can now write down the algorithm in full detail. In the algorithm below, $M$ is the number of beats that are supposed to be in one measure, $p_{0}=period(stopbpm)$ and $p_{1}=period(startbpm)$ (see equation 2), and the window size $w$ is given by $p_{1}-p_{0}$. The resample function is described in equation 4 and the $ray$ function in equation 3. $m$ is the number of requested iterations; the maximum block size is then $2^{m}$.


$matches:=[+\infty,\ldots\textrm{w times}\ldots,+\infty]$
$a:=$ resample( $\textrm{input audio}$, $2^{m}$ )
for( $j:=0$; $j<\frac{w}{2^{m}}$; $j:=j+1$ )
    $matches[j.2^{m}]:=$ ray( $a$, $\frac{p_{0}}{2^{m}}+j$, $+\infty$ )
for( $i:=m-1$; $i>0$; $i:=i-1$ )
    $a:=$ resample( $\textrm{input audio}$, $2^{i}$ )
    for( $j:=0$; $j<\frac{w}{2^{i}}$; $j:=j+1$ )
        if $j\textrm{ in between matches below mean}$
            $matches[j.2^{i}]:=$ ray( $a$, $\frac{p_{0}}{2^{i}}+j$, $+\infty$ )
for( $minimum:=+\infty$, $j:=0$; $j<w$; $j:=j+1$ )
    if $j\textrm{ in between matches below mean}$
        $match:=$ ray( $\textrm{input audio}$, $p_{0}+j$, $minimum$ )
        if $match<minimum$
            $tempo:=bpm(p_{0}+j)$
            $minimum:=match$
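The following C++ sketch condenses the pseudocode above, reusing the earlier ray and resample sketches. It is our own simplification: periods are expressed in samples per beat, and the bookkeeping of which candidates lie 'below the mean' is kept deliberately naive.

    #include <cmath>
    #include <limits>
    #include <vector>

    // Selective descent: scan coarsely, keep only below-mean candidates,
    // halve the block size and repeat, then finish at the full rate.
    double selective_descent_bpm(const std::vector<double>& audio,
                                 int p0, int p1, int m = 8,
                                 double r = 44100, int M = 4) {
        const int w = p1 - p0;
        const double inf = std::numeric_limits<double>::infinity();
        std::vector<double> matches(w, inf);
        std::vector<bool> candidate(w, true);

        for (int i = m; i >= 1; --i) {           // block sizes 2^m ... 2
            const int block = 1 << i;
            std::vector<double> a = resample(audio, block);
            double sum = 0; int count = 0;
            for (int j = 0; j < w; j += block) {
                if (!candidate[j]) continue;
                // the beat period shrinks together with the audio
                matches[j] = ray(a, (p0 + j) / block, inf, M);
                sum += matches[j]; ++count;
            }
            if (count == 0) break;               // nothing left to refine
            const double mean = sum / count;     // threshold for next pass
            for (int j = 0; j < w; j += block) { // prune above-mean blocks
                bool keep = candidate[j] && matches[j] <= mean;
                for (int k = j; k < j + block && k < w; ++k)
                    candidate[k] = keep;
            }
        }

        double minimum = inf, tempo = 0;         // final, full-rate pass
        for (int j = 0; j < w; ++j) {
            if (!candidate[j]) continue;
            double match = ray(audio, p0 + j, minimum, M);
            if (match < minimum) {
                minimum = match;
                tempo = 60.0 * r / (p0 + j);     // beat period -> BPM
            }
        }
        return tempo;
    }

Note that BpmDj itself stops the descent at a block size of $2^{2}$ rather than descending to the full sampling rate, because a higher accuracy is simply not needed (see the experiments below).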

Performance

Estimating the performance of this algorithm is easy but requires some explanation. Let us assume that we start with a block size of $2^{m}$. This means that we will iterate $m+1$ steps.

The initial step, with a block size of $2^{m}$, results in $\frac{w}{2^{m}}$ rays being shot. The cost of shooting one ray depends on the audio size, which after resampling is $\frac{n}{2^{m}}$.

The following steps, except for the last, are similar. However, here it becomes difficult to estimate how many rays will be shot, because we take the mean error of the previous step and select only the region that lies below it; how large this area is, is unknown. We therefore call the number of new rays divided by the number of old rays the rest-factor $q$. After one step we get $q.\frac{w}{2^{m}}$ rays; after two steps this is $q^{2}\frac{w}{2^{m}}$, and by induction after $i$ steps the number of rays becomes $q^{i}\frac{w}{2^{m}}$. Every ray, after resampling to a block size of $2^{m-i}$, takes $\frac{n}{2^{m-i}}$ time; hence the time estimate for the $i^{th}$ step is $q^{i}\frac{w.n}{2^{m}2^{m-i}}$. We continue this process from $i=0$ until $i=m-1$.

The last step ($i=m$) is similar, except that we can do the ray shooting in half the expected time, because we are now looking for the minimum and no longer for the mean error. This results in $q^{m}\frac{w}{2^{m}}.\frac{n}{2}$.

The resampling cost at every step is $n$. After adding all steps together we obtain:

\begin{displaymath}
m.n+\sum_{i=0}^{m-1}\frac{w.n.q^{i}}{2^{2m-i}}+\frac{w.n.q^{m}}{2^{m+1}}\end{displaymath}

After some reductions, we obtain:

\begin{displaymath}
m.n+\frac{w.n}{2^{m}}\left(\sum_{i=0}^{m-1}\frac{q^{i}}{2^{m-i}}+\frac{q^{m}}{2}\right)\end{displaymath}

Depending on the values of $m$ and $q$ we obtain different performance estimates. We have empirically measured the value of $q$ over a window of [80 BPM:160 BPM]. We started our algorithm at a block size of $2^{8}$ and observed that the mean value of $q$ is 0.85. This results in an estimate of $142.35\, n$, resampling included. In practice, of the possible 66150 rays we only needed the equivalent of about 134 full-length rays to obtain the best match; that is, we searched only 0.20313% of the search space before obtaining the correct answer. This is a speedup of about 246.18 times in comparison with the brute force search algorithm.
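The quoted total follows from evaluating the cost formula at $m=8$ and $q=0.85$ (a quick check, our own illustration):

    #include <cmath>
    #include <cstdio>

    int main() {
        const double q = 0.85, w = 66150.0;           // rest-factor, window size
        const int m = 8;                              // initial block size 2^8
        double sum = 0;
        for (int i = 0; i < m; ++i)
            sum += std::pow(q, i) / (1 << (m - i));   // q^i / 2^(m-i)
        sum += std::pow(q, m) / 2;                    // final, halved step
        double c = m + w / (1 << m) * sum;            // total cost = c * n
        std::printf("total cost ~ %.2f * n\n", c);    // prints ~142.35
        return 0;
    }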

Experiments

Setup  We have implemented this algorithm as a BPM counter for the DJ program BpmDj [Bel03]. The program measures the tempo of .MP3 and .OGG files; it is written in C++, runs under Linux, and can be downloaded from http://bpmdj.sourceforge.net/. The parameters the algorithm currently uses are $M=4$, $p_{0}=period(160)$ and $p_{1}=period(80)$. Of course these can be modified when necessary. The algorithm starts iterating at $m=8$ and stops at $i=2$, because a higher accuracy is simply not needed.

Figure 3: Tempo scanning in two songs. Top: Rolling On Chrome by Aphrodelics, remix by Kruder & Dorfmeister [Aph90]. Bottom: Tainted Love by Soft Cell [Cel96].

Figure 3 shows the output the counter produces for two songs. Every color represents the results of one scanning iteration; the first scan is colored green, the last scan red. The horizontal colored lines represent the position of the mean value for each scan. As can be seen, using the mean value quickly eliminates large parts of the search space.

The song Tainted Love [Cel96] is interesting because it shows 4 distinct spikes, each a harmonic of the actual tempo. The $4^{th}$ spike is the actual tempo, measured with a period of 4 beats, i.e. a correlation between beat 1 and beat 5. The $3^{rd}$ spike occurs when the algorithm correlates beat 1 with beat 6, hence calculates a tempo with measures of 5 beats while it expects to find only 4 beats in a measure; this results in a tempo of 114.5 BPM. The $2^{nd}$ spike is similar, but now beat 1 correlates with beat 7. The first spike is a correlation between beat 1 and beat 8. These distinct spikes occur because the song is relatively empty and monotone. In fact, the algorithm first chose the spike at 95.49 BPM; luckily, other clues were found afterwards that led the algorithm to choose a tempo of 143.236 BPM.

Experiment   The experiment consisted of a selection of 150 techno songs. The tempo of each song was measured manually using the beat graphs. All songs had a tempo between 80 BPM and 160 BPM; 7 of them had a drifting tempo, for which we manually selected the largest straight part (as done for the song X-Files [CM01]). Afterwards, we ran the BPM counter on all songs. For every song the counter reported a) the normal playing time of the song, b) the time necessary to analyze the song, c) the manually measured tempo and d) the automatically measured tempo. The machine on which we did the tests was a Dell Optiplex GX1, 600 MHz, running Debian GNU/Linux. The counter was compiled with the GNU C Compiler version 3.3.2 with maximum optimization, but without machine specific instructions.

Results   As expected, the results contained a number of wrong reports due to harmonics. These were detected and eliminated by simply multiplying the reported period by $\frac{5}{4}$, $\frac{6}{4}$ or $\frac{7}{4}$: 10 songs had the $\frac{5}{4}$-harmonic problem, 10 songs the $\frac{6}{4}$-harmonic problem and 5 songs the $\frac{7}{4}$-harmonic problem. The whole test set had two outliers: one error of about 11 BPM (for which we still don't understand the reason) and one mismatch of 1.2 BPM. All other measurement errors were below 0.48 BPM. The mean error (without the two outliers) was 0.0340 BPM, from which we conclude that our algorithm has a practical accuracy of 0.0340 BPM, close to the required accuracy of 0.0313 BPM. Concerning the analyzing speed, we observed a mean speed of 8.33 times real time. E.g., a song of 120 seconds is analyzed in 14.4 seconds.

Conclusion

In this paper we have introduced our visualization technique, called beat graphs. A beat graph shows the structure of the music under a certain tempo: rhythmic information, covering where breaks and beats are positioned, as well as tempo information. Beat graphs can be drawn quickly and are a valuable tool for the digital DJ, because they can also be used as a fingerprint of songs.

Based on these beat graphs, we have developed an offline algorithm to determine the tempo of a song fully automatically. The algorithm works by shooting rays through the digital audio, checking the similarity between consecutive slices within the audio. To speed up the process of finding the best matching ray, we presented an optimized search algorithm that needs to search only 0.2% of the entire search space.

The BPM counter works fully correctly in 82% of the cases. Another 17% are measured correctly but report a tempo that is a harmonic of the actual tempo. The remaining one percent is measured incorrectly. The accuracy of the measurement is 0.0340 BPM, which is enough to keep the drift between two songs below $\frac{1}{32}$ beat after 30'' (or below $\frac{1}{16}$ beat after 60''). Aside from being very accurate, our algorithm is also insensitive to breaks and gaps in the audio stream. The algorithm runs 8.33 times faster than real time on a 600 MHz Dell Optiplex, without making use of processor specific features.

Future Work

Harmonics  The problem of harmonics comes from the fact that we search over a very large area (from 80 BPM to 160 BPM). When a measurement reports a harmonic tempo, this can be quickly remedied by multiplying the reported period by one of the harmonic ratios. However, this requires extra user input. How to eliminate the reporting of harmonics without extra user input remains an important question.

Speed  As we have observed in the BPM analyzing graphs, the descent towards the absolute minimum is often straightforward: after one iteration, the final peak is often already known. This can probably be turned to our advantage by tuning the $q$ parameter to narrow down the search faster, and thus increase the speed of the algorithm. Another area where a speed improvement is possible is the processor cache: instead of analyzing ray per ray, we can analyze one block of the song (which then remains in the cache) for all rays at once. This would probably increase the speed of the algorithm even further.

Order of the ray shooter  The ray evaluation itself can probably be improved by taking into account multiple consecutive slices instead of only the previous slice. Whether this would increase the accuracy of the algorithm should be investigated further.

Tempo Line  Detecting tempo modifications over time is an important problem. For humans, beat graphs are easily understandable; deducing the tempo line automatically, however, remains a very difficult problem, because silent passages and breaks will disrupt the tempo line completely. Probably some autocorrelation technique, together with a high level beat graph analysis (where are the breaks positioned?), might help.

Beat graph improvements  The beat graph still has some room for improvement. We are not yet making use of color information. By filtering the audio stream through two complementary filters (a high-pass filter and a low-pass filter), we might be able to create two beat graphs, one presenting the low frequencies and the other the high frequencies. By superimposing these two graphs, one colored red and the other yellow, we might be able to offer even more information.

FFT & PCA  By first converting the audio stream to a voice print and then detecting the principal frequency axis, we might be able to encode the spectrum at a given position in time more accurately than we currently do through absolute values.

Bibliography


Aph90
Aphrodelics.
Rolling On Chrome.
Remix by Kruder and Dorfmeister, Label: Studio K7, 1990.

Bel03
Werner Van Belle.
BpmDj: Free DJ Tools For Linux.
http://bpmdj.sourceforge.net/, December 2001 - 2003.

Beu01
John Beuving.
BeatForce computer DJ-ing system.
http://beatforce.berlios.de/, 25 September 2001.

Cel96
Soft Cell.
Tainted Love.
Label: Warner, May 1996.

Cli00
Dave Cliff.
Hang the DJ: Automatic sequencing and seamless mixing of dance-music tracks.
Technical report, Digital Media Systems Department, Hewlett-Packard Laboratories, Bristol BS34 8QZ England, June 2000.

CM01
Chakra and Edi Mis.
X-Files.
Retro 10 Years of Israeli Trance, Label: Phonokol, 2001.

FJ97
M. Frigo and S. G. Johnson.
The Fastest Fourier Transform in the West.
Technical report, MIT-LCS-TR-728, Laboratory for Computer Science, MIT, Cambridge, MA, September 1997.

Hal95
Hallucinogen.
LSD.
Label: Dragonfly Records, 1995.

HJB84
M. T. Heideman, D. H. Johnson, and C. S. Burrus.
Gauss and the history of the FFT.
IEEE Acoustics, Speech and Signal Processing Magazine, 1(3):14-21, October 1984.

Int03
Intel Math Kernel Library Reference Manual.
Technical report, 630813-6004, Intel Corporation, 1994-2003.

Nyq28
Harry Nyquist.
Certain Topics in Telegraph Transmission Theory.
AIEE Trans., 47:617-644, 1928.

Quo90
Status Quo.
Anniversary Walz.
Label: Castle Music, 1990.

Tan97
Tandu.
Alien Pump.
Distance To Goa 6, Label: Distance (Substance) Records, 1997.

YKT95
Yamada, Kimura, Tomohiko, Funada, Takeaki, Inoshita, and Gen.
Apparatus for detecting the number of beats.
US Patent Nr 5,614,687, December 1995.


Copyright (c) Werner Van Belle November 2003
e-mail: werner.van.belle@vub.ac.be
http://bpmdj.sourceforge.net/
