Guitar Chords Classification Using Uncertainty Measurements of Frequency Bins

This paper presents a method to perform chord classification from recorded audio. The signal harmonics are obtained by using the Fast Fourier Transform, and timbral information is suppressed by spectral whitening. A multiple fundamental frequency estimation of whitened data is achieved by adding harmonics attenuated by a weighting function. This paper proposes a method that performs feature selection by thresholding the uncertainty of all frequency bins. Those measurements under the threshold are removed from the signal in the frequency domain. This allows a reduction of 95.53% of the signal characteristics, and the other 4.47% of frequency bins are used as enhanced information for the classifier. An Artificial Neural Network was utilized to classify four types of chords: major, minor, major 7th, and minor 7th. Those, played in the twelve musical notes, give a total of 48 different chords. Two reference methods (based on Hidden Markov Models) were compared with the method proposed in this paper by using the same database for the evaluation test. In most of the performed tests, the proposed method achieved a reasonably high performance, with an accuracy of 93%.


Introduction
A chord, by definition, is a harmonic set of two or more musical notes that are heard as if they were sounding simultaneously [1]. A musical note refers to the pitch class set of C, C♯/D♭, D, D♯/E♭, E, F, F♯/G♭, G, G♯/A♭, A, A♯/B♭, B, and the intervals between consecutive notes are known as half-note or semitone intervals. Thus, chords can be seen as musical features, and they are the principal harmonic content that describes a musical piece [2,3].
A chord has a basic construction known as a triad that includes notes identified as a fundamental (the root), a third, and a fifth [4]. The root can be any note chosen from the pitch class set, and it is used as the first note to construct the chord; besides, this note gives the name to the chord. The third has the function of making the chord minor or major. For a minor chord, the third is located at 3 half-note intervals from the root. On the other hand, a major chord has the third placed at 4 half-note intervals from the root. The perfect fifth, which completes the triad, is located at 7 half-note intervals from the root. If a note is added to the triad at 11 half-note intervals from the root, then the chord will become a 7th chord. For instance, a C major chord (CMaj) will be composed of a root C note, a major third E note, and a perfect fifth G note; the C major with a 7th (CMaj7) is composed of the same triad of C major plus the 7th note B.
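The interval rules above can be illustrated with a short sketch. The names `PITCH_CLASSES`, `INTERVALS`, and `chord_notes` are hypothetical and chosen only for this illustration; sharps are used for all accidentals for brevity.

```python
# Pitch classes in chromatic order (sharps only, for brevity).
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Semitone offsets from the root, taken from the construction described in the text:
# minor third = 3, major third = 4, perfect fifth = 7, added 7th note = 11.
INTERVALS = {
    "Maj": (0, 4, 7),
    "min": (0, 3, 7),
    "Maj7": (0, 4, 7, 11),
}


def chord_notes(root, kind):
    """Return the note names of a chord built on `root`."""
    start = PITCH_CLASSES.index(root)
    # Wrap around the octave with modulo 12.
    return [PITCH_CLASSES[(start + step) % 12] for step in INTERVALS[kind]]


print(chord_notes("C", "Maj"))   # ['C', 'E', 'G']
print(chord_notes("C", "Maj7"))  # ['C', 'E', 'G', 'B']
print(chord_notes("A", "min"))   # ['A', 'C', 'E']
```

The modulo operation reflects the fact that pitch classes ignore the octave in which a note is played.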
Chord arrangements, melody, and lyrics can be grouped in written summaries known as lead sheets [5]. All kinds of musicians, from professionals to amateurs, make use of these sheets because they provide additional information about when and how to play the chords or some other arrangement on a melody.

Mathematical Problems in Engineering
Writing lead sheets of chords by hand is a task known as chord transcription. It can only be performed by an expert, and it is a time-consuming and expensive process. In engineering, the automation of chord transcription has been considered a high-level task and has some applications such as key detection [6,7], cover song identification [8], and audio-to-lyrics alignment [9].
Chord transcription requires recognizing or estimating the chord from an audio file by applying some signal preprocessing. The most common method for chord recognition is based on templates [10,11]; in this case a template is a vector of numbers. This method implies that only the chord definition is necessary to achieve recognition. The simplest chord template has a binary structure; for this kind of template, the notes that belong to the chord have unit amplitude, and the remaining ones have null amplitude. This template is described by a 12-dimensional vector; each number in the vector represents a semitone in the chromatic scale or pitch class set. As an illustration, the C major chord template will be [1 0 0 0 1 0 0 1 0 0 0 0]. The 12-dimensional vectors obtained from an audio frame signal are known as chroma vectors, and they were proposed by Fujishima [12] for chord recognition using templates. In his work, chroma vectors are obtained from the Discrete Fourier Transform (DFT) of the input signal. Fujishima's method (Pitch Class Profile, PCP) is based on an intensity map on the Simple Auditory Model (SAM) of Leman [13]. This allows the chroma vector to be formed by the energy of the twelve semitones of the chromatic scale. In order to perform chord recognition, two matching methods were tested: the Nearest Neighbors [14] (Euclidean distances between the template vectors and the chroma vectors) and the Weighted Sum Method (dot product between chroma vectors and templates).
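The two matching strategies mentioned above can be sketched as follows. This is only an illustration under simplifying assumptions: the chroma vector is invented for the example, only major and minor templates are built, and the function names are hypothetical rather than taken from [12].

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def binary_template(root_index, intervals):
    """12-dimensional binary template: 1 for chord notes, 0 elsewhere."""
    template = [0] * 12
    for step in intervals:
        template[(root_index + step) % 12] = 1
    return template


def build_templates():
    """Major and minor templates for the twelve roots (24 templates)."""
    templates = {}
    for i, name in enumerate(PITCH_CLASSES):
        templates[name + "Maj"] = binary_template(i, (0, 4, 7))
        templates[name + "min"] = binary_template(i, (0, 3, 7))
    return templates


def nearest_neighbor(chroma, templates):
    """Pick the template with the smallest Euclidean distance to the chroma vector."""
    return min(templates, key=lambda name: math.dist(chroma, templates[name]))


def weighted_sum(chroma, templates):
    """Pick the template with the largest dot product with the chroma vector."""
    return max(
        templates,
        key=lambda name: sum(c * t for c, t in zip(chroma, templates[name])),
    )


templates = build_templates()
# Hypothetical chroma vector with most energy on C, E, and G.
chroma = [0.9, 0.0, 0.1, 0.0, 0.8, 0.1, 0.0, 0.7, 0.0, 0.1, 0.0, 0.0]
print(nearest_neighbor(chroma, templates))  # CMaj
print(weighted_sum(chroma, templates))      # CMaj
```

Because every binary triad template contains exactly three ones, both criteria rank the templates identically in this simplified setting.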
Lee [11] applied the Harmonic Product Spectrum (HPS) [15] to propose the Enhanced Pitch Class Profile (EPCP). In his study, chord recognition is performed by maximizing the correlation between chord templates and chroma vectors.
Template matching models have poor recognition performance on real-life songs, because chords change with time, and consequently chroma vectors will contain semitones of two different chords. Therefore, statistical models became popular methods for chord recognition [16][17][18]. Hidden Markov Models (HMMs) [19,20] are probabilistic models for a sequence of observed variables assumed to be independent of each other, in which a sequence of hidden variables is supposed to be related to the observed variables.
Barbancho et al. [21] proposed a method using HMM to perform a transcription of guitar chords. The chord types used in their study are major, minor, major 7th, and minor 7th of each root of the pitch class set. That is a total of 48 chord types. All of them can be played in many different forms; thus, to play the same chord several finger positions can be used. In their work, 330 different forms for 48 chord types are proposed (for details see the reference); in this case every single form is a hidden state. Feature extraction is achieved by the algorithm presented by Klapuri [22], and a model that constrains the transitions between consecutive forms is proposed. Additionally, a cost function that measures the physical difficulty of moving from one chord form to another one is developed. Their method was evaluated using recordings from three musical instruments: an acoustic guitar, an electric guitar, and a Spanish guitar.
Ryynänen and Klapuri [23] proposed a method using HMM to perform melody transcription and classification of bass line and chords in polyphonic music. In this case, fundamental frequencies (F0s) are found using the estimator in [21]; after that, these are passed through a PCP algorithm in order to enhance them. An HMM of 24 states (12 states for major chords and 12 states for minor ones) is defined. The transition probabilities between states are found using the Viterbi algorithm [24]. The method does not detect silent segments; however, it provides chord labeling for each analyzed frame.
Most of the aforementioned methods achieve low accuracies; the most recent one cited, the method from Barbancho et al., achieves high accuracy by combining probabilistic models. However, the use of an HMM together with these probabilistic models makes that method somewhat complex.
In this paper, we propose a method based on Artificial Neural Networks (ANNs) to classify chords from recorded audio. This method classifies chords from any octave for a six-string standard guitar. The chord types are major, minor, major 7th, and minor 7th, that is, the same variants for the chords used by Barbancho et al. [21]. First, time signals are converted to the frequency domain, and timbral information is suppressed by spectral whitening. For feature selection, we propose an algorithm that measures the uncertainty of the frequency bins. This allows reducing the dimensionality of the input signal and enhances the relevant components to improve the accuracy of the classifier. Finally, the extracted information is sent to an ANN to be classified. Our method avoids the calculation of transition probabilities and the combination of several probabilistic models; nevertheless, the accuracy achieved in this study is superior to that of most of the mentioned methods.
The rest of this paper is organized as follows. In Section 2, fundamental concepts related to this study are presented. Section 3 details the theoretical aspects of the proposed method. Section 4 presents experimental results that validate the proposed method, and Section 5 includes our conclusions and directions for future work.

General Concepts
For clarity purposes, this section presents two important concepts widely used in Digital Signal Processing (DSP). These concepts are the Fourier Transform and spectral whitening.

Transformation to Frequency Domain.
The human hearing system is capable of performing a transformation from the time domain to the frequency domain. There is evidence that humans are more sensitive to magnitude than phase information [25]; as a consequence humans can perceive harmonic information. This is the main idea to perform the classification of guitar audio signals in this work. Therefore, a frequency domain representation of the original signal has to be calculated. The time to frequency domain transformation is obtained by applying the Fast Fourier Transform (FFT) to the input signal $x[n]$ and is represented by

$$X(\omega) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n}. \tag{1}$$

Equation (1) describes the transformation of $x[n]$ at all times. However, this is not convenient because songs or signals, in general, are not stationary. For this reason, a window function, $w[n]$, is applied to the time signal as

$$z[n] = x[n]\, w[n], \tag{2}$$

where $w[n]$, for this study, is the Hamming window function according to

$$w[n] = \alpha - (1 - \alpha)\cos\left(\frac{2\pi n}{N - 1}\right), \tag{3}$$

where $\alpha = 0.54$, $n = [0, N - 1]$, and $N$ is the number of samples in the frame analysis. A study about the use of different window types can be found in Harris [26]. Equations (2) and (3) divide the signal into different frames, allowing the analysis of the signal in the frequency domain by

$$Z[k] = \sum_{n=0}^{N-1} z[n]\, e^{-j 2\pi k n / N}. \tag{4}$$

For this work, consecutive windows overlap by 50% to analyze the entire signal and thus obtain a set of frames $z_m[n]$ (for simplicity in the notation, $z_m$ will be used). Those frames can be concatenated to construct a matrix $\mathbf{Z} = [z_1\ z_2\ \cdots\ z_M]$, and, then, the FFT is computed for every column. The result is a representation in the frequency domain as in Figure 1; this representation is known as a spectrogram [27]. This is the format in which the signals will be presented to the classifier for training.
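The framing step described above (Hamming window with 50% overlap) can be sketched as follows. The function names and the toy signal are assumptions made for this illustration only; the FFT of each frame is omitted to keep the example dependency-free.

```python
import math


def hamming(n_samples, alpha=0.54):
    """Hamming window: alpha - (1 - alpha) * cos(2*pi*n / (N - 1))."""
    return [
        alpha - (1 - alpha) * math.cos(2 * math.pi * n / (n_samples - 1))
        for n in range(n_samples)
    ]


def frames_with_overlap(signal, frame_len):
    """Split `signal` into Hamming-windowed frames that overlap by 50%."""
    hop = frame_len // 2
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames


# Toy input: a constant signal of 16 samples, frames of 8 samples.
signal = [1.0] * 16
frames = frames_with_overlap(signal, frame_len=8)
print(len(frames))  # 3 frames, starting at samples 0, 4, and 8
```

Each returned frame corresponds to one column of the frame matrix; in practice an FFT routine would then be applied to every column.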

Spectral Whitening.
This process allows obtaining a uniform spectrum of the input signal, and it is achieved by boosting the frequency bins of the FFT. There exist different methods to perform spectral whitening [28][29][30][31].
Inverse filtering [22] is the whitening method used in our experiments, and it is described next.
First, the original windowed signal is zero-padded to twice its length as

$$y[n] = \begin{cases} z[n], & 0 \le n \le N - 1, \\ 0, & N \le n \le 2N - 1, \end{cases} \tag{5}$$

and its FFT, represented by $\Gamma$, is calculated. The resulting frequency spectrum will have an improved amplitude estimation because of the zero-padding. Next, a filter bank is applied to $\Gamma$; the central frequencies of this bank are given by

$$c_b = 229\left(10^{(b+1)/21.4} - 1\right), \tag{6}$$

where $b = 0, \ldots, 30$. In this case, each filter in the bank has a triangular response $H_b$; in fact, this bank tries to simulate the inner ear basilar membrane. The band-pass frequencies for each filter are from $c_{b-1}$ to $c_{b+1}$. Because there is no relevant information at frequencies higher than 7000 Hz, the maximum value for the parameter $b$ was 30.
Subsequently, the standard deviations $\sigma_b$ are calculated as

$$\sigma_b = \left(\frac{1}{K}\sum_{k} H_b(k)\,|\Gamma(k)|^{2}\right)^{1/2}, \tag{7}$$

where uppercase $K$ is the length of the FFT series. Later on, the compression coefficients for the central frequencies $c_1, c_2, \ldots, c_b$ are calculated as $\gamma_b = \sigma_b^{\nu - 1}$, where $\nu \in [0, 1]$ is the amount of spectral whitening applied to the signal. The coefficients $\gamma_b$ are those that belong to the frequency bin of the "peak" of each triangle response; observe Figure 2. The rest of the coefficients $\gamma(k)$ for the remaining frequency bins are obtained by performing a linear interpolation between the central frequency coefficients $\gamma_b$.
Finally, the white spectrum is obtained with a pointwise multiplication of all compression coefficients $\gamma(k)$ with $\Gamma(k)$, as given in (8).
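The computation of the compression coefficients can be sketched as follows. This is a rough illustration, not the exact implementation used in the experiments: the expression in `center_frequency` is the band centre formula commonly associated with the filter bank of [22] and is treated here as an assumption, and the helper names, the choice to hold the end values constant outside the first and last centre bins, and the toy inputs are all hypothetical.

```python
def center_frequency(b):
    """Assumed centre frequency (Hz) of band b, following the filter bank of [22]."""
    return 229.0 * (10.0 ** ((b + 1) / 21.4) - 1.0)


def compression_coefficients(sigmas, nu):
    """gamma_b = sigma_b ** (nu - 1), with nu in [0, 1] the amount of whitening."""
    return [s ** (nu - 1.0) for s in sigmas]


def interpolate_coefficients(centre_bins, centre_gammas, n_bins):
    """Linearly interpolate gamma between centre bins (end values held constant)."""
    gammas = []
    for k in range(n_bins):
        if k <= centre_bins[0]:
            gammas.append(centre_gammas[0])
        elif k >= centre_bins[-1]:
            gammas.append(centre_gammas[-1])
        else:
            # Index of the last centre bin that is not above k.
            j = max(i for i, c in enumerate(centre_bins) if c <= k)
            left, right = centre_bins[j], centre_bins[j + 1]
            t = (k - left) / (right - left)
            gammas.append((1 - t) * centre_gammas[j] + t * centre_gammas[j + 1])
    return gammas


def whiten(spectrum, gammas):
    """Pointwise multiplication of the spectrum by the compression coefficients."""
    return [g * s for g, s in zip(gammas, spectrum)]


# Toy example: two centre bins in an 8-bin spectrum.
gammas = interpolate_coefficients([2, 6], [1.0, 3.0], 8)
print(gammas)  # [1.0, 1.0, 1.0, 1.5, 2.0, 2.5, 3.0, 3.0]
```

With this assumed formula the centre of band 30 stays below 7000 Hz, which is in line with the upper limit stated above.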

Proposed Method
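Our proposed method is described in the block diagram shown in Figure 3. The method begins by defining the columns of matrix $\mathbf{Z}$ as

$$z_m = \big[\, z_m[0] \;\; z_m[1] \;\; \cdots \;\; z_m[N-1] \,\big]^{T}, \tag{9}$$

where a single column vector represents the $m$th Hamming windowed audio sample. These columns are zero-padded to twice their length $N$ as

$$\mathbf{Y} = \begin{bmatrix} \mathbf{Z} \\ \mathbf{0} \end{bmatrix}, \tag{10}$$

where $\mathbf{0}$ is a zero matrix of the same size as $\mathbf{Z}$. Thus, (10) denotes an augmented matrix.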
After that, the signal spectrum for every column of Y is calculated by applying the FFT, and then these columns are passed through a spectral whitening step and the output matrix is represented as I. Furthermore, by taking advantage of the symmetrical shape of the FFT, only the first half of the frequency spectrum (represented by Λ) is taken in order to perform the analysis.
A multiple fundamental frequency estimation algorithm and a weighting function are applied to the whitened audio signals.These algorithms enhance the fundamental frequencies by adding their harmonics attenuated by the weighting function.The output matrix of this step is denoted as Φ.
The training set is arranged as a matrix whose rows are frequency bins and whose columns are audio samples, so that each row (frequency bin) is an input to the classifier. The number of inputs can be reduced (yielding the matrix T) by applying a method based on the uncertainty of the frequency bins, thus enhancing the pertinent information to perform a classification. Finally, the enhanced data are used to train the classifier and then to validate its performance.

Multiple Fundamental Frequency Estimation.
The fundamental frequencies of the semitones in the guitar are defined by

$$f_k = f_{\min}\, 2^{k/12}, \tag{11}$$

where $k \in \mathbb{Z}$ and $f_{\min}$ is the minimum frequency to be known; for example, in a standard six-string guitar, the lowest note is E, with a frequency of 82 Hz. Signal theory establishes that the harmonic partials (or just harmonics) of a fundamental frequency are defined by

$$f_h = h\, f_k, \tag{12}$$

where $h = 2, 3, 4, \ldots, H + 1$. In this study $H$ represents the number of harmonics to be considered. As an illustration, for a fundamental frequency $f_k = 131$ Hz of a C note, the first three harmonics will be the set {262, 393, 524}.
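The semitone grid and the harmonic series described above can be checked numerically with a short sketch. The function names are hypothetical; the default minimum frequency of 82 Hz is the low E value quoted in the text.

```python
def semitone_frequency(k, f_min=82.0):
    """Frequency of the k-th semitone above f_min (equal temperament)."""
    return f_min * 2.0 ** (k / 12.0)


def harmonics(f0, n_harmonics):
    """Harmonic partials h * f0 for h = 2, ..., n_harmonics + 1."""
    return [h * f0 for h in range(2, n_harmonics + 2)]


print(harmonics(131, 3))       # [262, 393, 524]
print(semitone_frequency(12))  # one octave above 82 Hz: 164.0
```

Twelve semitones double the frequency, so the octave of the 82 Hz string falls at 164 Hz.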
In this work, a frequency located within ±3% of a semitone frequency is considered to be correct. This approach was proposed in [22].
In the $m$th frame under analysis, a fundamental frequency can be raised if its harmonics are added to it [22], by applying (13); then, all harmonics $\Lambda(h f_k, m)$ and their fundamental frequencies $\Lambda(f_k, m)$, described in (13), are removed from the frequency spectrum. When the resulting signal is analyzed again with the described method, a different fundamental frequency will be raised.
A common issue with (13) arises when two or more fundamentals share the same harmonic. For instance, the fundamental frequency of 65.5 Hz of a C note has a harmonic located at 196.5 Hz. When the Euclidean distances [32] between the analysis frequency and the frequencies of the semitones are computed, the minimum distance or nearest frequency will correspond to the G note. This implies that if those two notes are present in the same analysis frame, then the harmonic of C will be summed and eliminated with the harmonics of the G note. This is because the 196 Hz harmonic is located in the range of ±3% of the frequency of a G note.
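The shared-harmonic situation above can be verified with a small numerical sketch. The semitone table is built from the 65.5 Hz reference quoted in the text and assumes equal temperament; the function names are hypothetical, and the absolute difference is used as the one-dimensional Euclidean distance.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def semitone_table(f_c=65.5, n_semitones=48):
    """(name, frequency) pairs for semitones above a C at f_c Hz."""
    return [(NOTE_NAMES[k % 12], f_c * 2 ** (k / 12)) for k in range(n_semitones)]


def nearest_note(freq, table):
    """Semitone with the minimum distance to `freq`."""
    return min(table, key=lambda entry: abs(entry[1] - freq))


def within_tolerance(freq, target, tol=0.03):
    """True if `freq` lies within +/- tol (relative) of `target`."""
    return abs(freq - target) <= tol * target


table = semitone_table()
third_harmonic = 3 * 65.5  # 196.5 Hz, a harmonic of the low C
name, f_note = nearest_note(third_harmonic, table)
print(name)                                      # G
print(within_tolerance(third_harmonic, f_note))  # True
```

The third harmonic of C lands almost exactly on a G, so a frame containing both notes cannot attribute that partial unambiguously.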
There are some methods that deal with this problem. In [33], a technique that makes use of a moving average filter is proposed. In that work, the fundamental frequency takes its original amplitude and a moving average filter modifies the amplitude of its harmonics. Then, only part of their amplitude is removed from the original frequency spectrum.
In [22], a weighting function that modifies the amplitude of the harmonics is proposed. Also, an algorithm to find multiple fundamental frequencies is suggested. The weighting function is given by (14), where $f_s/\tau_{\max}$ represents the low limit frequency (e.g., 82 Hz), $f_s/\tau$ is the fundamental frequency $f_0$ under analysis, and $f_s$ is the sampling frequency. The parameters $\alpha$ and $\beta$ are used to optimize the function and minimize the amplitude estimation error (see [22] for details). In [22], the analyzed $f_0$ in a whitened signal $\Lambda(k, m)$ is used to find its harmonics with (15), which is evaluated over a range of frequency bins in the vicinity of the analyzed $f_0$. The signal spectrum is divided into analysis blocks to find the fundamental frequencies. Thus, $\hat{s}(\tau)$ becomes a linear function of the magnitude spectrum $\Lambda(k, m)$. Then, a residual spectrum is initialized to $\Lambda(k, m)$, and a fundamental period $\tau$ is estimated from it. The harmonics of $\tau$ are found at $h f_s/\tau$, and then they are added, in their corresponding positions of the spectrum, to a vector that accumulates the detected sound. The new residual spectrum is calculated as in (16), where $d \in [0, 1]$ is the amount of subtraction. This process iteratively computes a different fundamental frequency using the methodology described above. The algorithm finishes when there are no more harmonics in the residual spectrum to be analyzed. Equation (15) was adapted to keep the notation of our work; refer to [22] for further analysis.
In this study, we propose a modification of Klapuri's algorithm, in an attempt to achieve a better estimate of the multiple fundamental frequencies. Using (14) and the $m$th whitened signal $\Lambda(k, m)$, the multiple fundamental frequencies can be found by using (17), which is evaluated for $k = 0, 1, \ldots, K/2$ with harmonic indices given by the integers greater than 1 and smaller than $K/2$. Equation (17) analyzes all frequency bins and their harmonics in the signal spectrum. This equation adds to the $k$th frequency bin all of its harmonics found in the entire spectrum. Besides, the weighting function performs an estimation of the harmonic amplitude that must be added to the $k$th frequency bin. Observe that the weighting function does not modify the original amplitude of the harmonics. Finally, when all frequency bins have been analyzed, the resulting signal has all its fundamental frequencies with high amplitude. This will help the classifier to have an accurate performance.
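The idea of adding weighted harmonics to every frequency bin can be sketched as follows. This is only an illustration under stated assumptions, not the exact form of (17): the weighting `(f0 + alpha) / (h * f0 + beta)` is an assumed simplification in the spirit of [22], harmonics are read from the unmodified input spectrum, and the function names and the toy spectrum are hypothetical.

```python
def weight(f0, h, alpha=52.0, beta=320.0):
    """Assumed weighting that attenuates the h-th harmonic of a candidate f0 (Hz)."""
    return (f0 + alpha) / (h * f0 + beta)


def enhance_fundamentals(spectrum, fs, alpha=52.0, beta=320.0):
    """Add weighted harmonics to each bin of a half spectrum (K/2 magnitude bins)."""
    half = len(spectrum)
    bin_hz = fs / (2 * half)        # bin width of the underlying K-point FFT
    enhanced = list(spectrum)       # the input spectrum itself is left untouched
    for k in range(1, half):        # bin 0 (DC) has no harmonics
        f0 = k * bin_hz
        h = 2
        while h * k < half:         # only harmonics inside the half spectrum
            enhanced[k] += weight(f0, h, alpha, beta) * spectrum[h * k]
            h += 1
    return enhanced


# Toy half spectrum with peaks at bins 2, 4, and 6 (a fundamental and two harmonics).
spectrum = [0.0] * 16
spectrum[2] = spectrum[4] = spectrum[6] = 1.0
enhanced = enhance_fundamentals(spectrum, fs=32.0)
print(enhanced[2] > enhanced[4])  # True: bin 2 collects energy from bins 4 and 6
```

In this toy case the bin whose harmonics are present in the spectrum ends up with a larger amplitude than bins whose harmonics are absent.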

Feature Selection.
The objective of this paper is to classify frequencies. Then, the inputs of the classifier are all frequency bins that come from the FFT. However, not all frequency bins will have relevant information. Therefore, a method to remove unnecessary data and enhance the relevant data has to be performed. This will result in a reduction of the number of inputs to the classifier.
We propose a method based on the uncertainty of the frequency bins. This method will discard all those bins that are not relevant for the classifier in order to improve its performance.
In Wei [34], it is stated that, similarly to the entropy, the variance can be considered as a measure of uncertainty of a random variable, if and only if the distribution has one central tendency. The histograms for all frequency bins of the 48 chord types were calculated. These can be used to verify whether the distribution could be approximated by a distribution with only one central tendency. For simplicity, Figure 4 represents one frequency bin distribution of a major chord and a minor chord, respectively; it can be seen that the distribution approximately fits a Gaussian distribution. This same behavior was observed in the other samples of the 48 different chords. This supports the use of the variance as an uncertainty measure of the frequency bins in this study.
In order to perform the feature selection using the uncertainty of the frequency bins, first consider a matrix $\Phi$ defined by (18), where $\vec{\phi}_k$ is a vector formed by the magnitudes of the $k$th frequency bin of all audio samples. The variance $\sigma_k^2$ of each $\vec{\phi}_k$ can be computed with (19) and (20). If $\sigma_k^2 \approx 0$, then for that particular frequency bin the input is quasi-constant; consequently, this frequency bin can be eliminated from all audio samples. This can be achieved by considering (21) and a vector $\vec{\nu}_{\mathrm{ind}}$ formed with the indexes $k$ of $\vec{\sigma}^2$ that are defined by (22). Once feature selection has been performed, the remaining frequency bins will form the input to the classifier.
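The variance-based selection can be sketched as follows. Two details are assumptions made for this illustration and may differ from (19) to (22): the population variance is used, and the variances are normalised by the maximum variance before being compared with the threshold. The function names and the toy matrix are hypothetical.

```python
from statistics import pvariance


def select_bins(phi, threshold):
    """Return indexes of the rows (frequency bins) whose variance exceeds the threshold.

    phi: list of rows, one row per frequency bin, one column per audio sample.
    """
    variances = [pvariance(row) for row in phi]
    max_var = max(variances)
    if max_var == 0:
        return [], variances  # every bin is constant: nothing to keep
    # Assumption: variances are normalised by the largest one before thresholding.
    keep = [k for k, v in enumerate(variances) if v / max_var > threshold]
    return keep, variances


def reduce_features(phi, keep):
    """Keep only the selected frequency bins."""
    return [phi[k] for k in keep]


# Toy matrix: 4 frequency bins by 4 audio samples.
phi = [
    [1.0, 1.0, 1.0, 1.0],   # constant bin, variance 0
    [0.0, 2.0, 0.0, 2.0],   # variance 1
    [0.0, 0.1, 0.0, 0.1],   # quasi-constant bin, variance 0.0025
    [5.0, 1.0, 5.0, 1.0],   # variance 4
]
keep, variances = select_bins(phi, threshold=0.05)
print(keep)  # [1, 3]
```

Only the two bins that actually vary across samples survive, which mirrors the reduction in classifier inputs described above.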

Classifier.
Classification is an important part of chord transcription. In order to perform a good classification, important data must be generated from the original information. Then, a classification algorithm will be able to label the chords. According to Jain et al. [36], Artificial Neural Networks (ANNs) [35] can be considered as "massively parallel computing systems consisting of an extremely large number of simple processors with many interconnections." ANNs have been used in chord recognition as a preprocessing method or as a classification method. Gagnon et al. [37] proposed a method with an ANN to preclassify the number of strings plucked in a chord. Humphrey and Bello [38] used labeled data to train a convolutional neural network. In this study, an Artificial Neural Network was used to perform classification. Figure 5 represents the configuration of the ANN used in this work. The ANN was trained using the Back Propagation algorithm [39].

Experimental Results
Computer simulations were performed to quantitatively evaluate the proposed method. The performance of two state-of-the-art references [21,23] was compared with the present work.
Databases for training and testing containing four chord types (major, minor, major 7th, and minor 7th) with different versions of the same chord are considered. Electric and acoustic guitar recordings were used to construct the training data set. A total of 25 minutes were recorded from an electric guitar, and a total of 30 minutes were recorded from an acoustic guitar. Recordings include sets of chords played consecutively (e.g., with roots ascending by semitones), as well as some parts of songs. The database used for evaluation was provided by Barbancho et al. [21]. This database has 14 recordings: 11 recordings from two different Spanish guitars played by two different guitar players, 2 recordings from an electric guitar, and 1 recording from an acoustic guitar, making a total duration of 21 minutes and 50 seconds. The sampling frequency $f_s$ is 44100 Hz for all audio recordings.
The training data set was divided into frames of 93 ms, leading to an FFT of 4096 frequency bins. In the spectral whitening, the signal was zero-padded to twice its length before applying the frequency domain transform, so an FFT of 8192 points was obtained. For the spectral whitening, the $K$ parameter takes the original length of the FFT, but the length of the whitened signals remains at 4096 frequency bins. For the multiple fundamental frequency estimation, the $\alpha$ and $\beta$ parameters are constant and set to 52 and 320, respectively, as in Klapuri [22], while the parameter $d$ was adjusted to improve performance. An optimum value of 0.99 was found. This parameter differs from the value in [22] because, in our method, the signal is modified in every cycle in which $h$ in (15) increases; on the other hand, Klapuri [22] modifies the signal after $h$ reaches its highest value. These processes were applied to all audio samples to build a training data set. In this case, the data set is a matrix of 4096 rows (frequency bins) by 5000 columns (audio samples). In (21), the maximum variance for all frequency bins in the audio samples is computed. Equation (22) proposes a threshold to remove all those frequency bins that remain quasi-constant. For instance, suppose that a threshold of 0.05 is set, and some frequency bin variances (shown in Table 1) are evaluated. Only those above the threshold will be taken as inputs to the classifier, as shown in Table 2.
Performance tests were made to find the optimal value for the threshold in (22). This parameter was varied; then the ANN was trained and evaluated. The process was repeated until the best result was obtained. The threshold was found to be optimal at 0.01326. This allows a 95.53% reduction of the total number of frequency bins, while keeping the relevant information. Therefore, we concluded that, for a threshold value lower than 0.01326, some information required for a correct classification is lost. Figure 6 shows part of the training data set; only frequency bins in the range [0, 500] and 3000 audio samples are depicted. Figure 7 shows the same data set of Figure 6 after the feature extraction algorithm was applied. It can be observed that the algorithm maintains sufficient information to train the classifier.
An ANN was used as a classification method with 183 inputs and 48 outputs. The applied performance metric was the ratio of the number of correctly classified chords to the total number of frames analyzed.
The validation test had the same structure as the one presented in Figure 3. First, the audio data were loaded. Second, a frequency domain transformation and a spectral whitening were applied to the signal. Finally, the multiple fundamental frequency estimation algorithm was used. At this point, the signal has 4096 frequency bins. To reduce the number of frequency bins, only those that meet (22) are taken from the signal and then passed through the classifier.
The results of the proposed method, denoted VTH (Variance Threshold) in this work, were compared with two state-of-the-art methods. The best results are shown in Table 3; specifically, 48 chord types with different variants of the same chord were evaluated. For the reference method proposed by Barbancho et al. [21], experiments with different algorithms were performed. This method is denoted by PM and includes all models described next. The PHY model describes the probability of the physical transition between chords. These probabilities are computed by measuring the cost of moving the fingers from one position to another. The MUS model is based on musical transition probabilities, that is, the probabilities of switching between chords. These were estimated from the first eight albums of The Beatles. Finally, the PC model is equal to the method proposed in [21] but without the transition probabilities; instead, uniform transition probabilities are used. All models were tested separately; an accuracy of at most 86% was achieved. The best result was obtained by using the combination of all methods; a 95% accuracy was achieved in this case.
For the reference method proposed by Ryynänen and Klapuri [23], the evaluation results were taken from [21]; in this case, three tests were performed. First, MM tests (only major and minor chords) were carried out; of all three tests, this was the one with the highest accuracy (91%). Second, MMC tests were executed, in which all chords were taken into account; however, 7th major/minor chords labeled as major/minor were counted as correctly classified; that is, a Maj7 labeled as Maj was correct. Finally, CC tests were set with the 48 possibilities; that is, 7th major/minor chords labeled as major/minor were incorrect; this results in an accuracy of 70%.
The method proposed in this paper achieves an accuracy of 93% in the evaluation test. This classification performance was achieved with a 95% confidence interval of [91.4, 94.6]. The results are competitive with the two reference methods. Even though Barbancho et al. [21] report an accuracy of 95%, it is only achieved when the PHY, MUS, and PC algorithms are all combined. Besides, an HMM needs the calculation of transition probabilities between the states of the model (48 chord types). This makes their method more complex than the one presented in this work. This paper focuses only on chord recognition, so the comparison with [21] does not take into consideration the finger configuration.

Conclusions
A method to classify chords of an audio signal is proposed in this work. This is based on a frequency domain transformation, where harmonics are the key to finding the fundamental frequencies that compose the input signal. It was found that all remaining frequency bins after feature extraction were in the range from 40 Hz to 800 Hz. This means that the relevant information for the classifier is located at the low frequency end.
The chords considered were major, minor, major 7th, and minor 7th. Two state-of-the-art methods, which used the same chords, were taken for comparison with our study. All computer simulations were performed using the same database. The reference method from Ryynänen and Klapuri [23] had its best performance when only 24 chord types were considered. Our method outperforms the method of Ryynänen and Klapuri by 2 percentage points, even though, in our work, 48 chord types were classified. The reference method of Barbancho et al. [21] had an accuracy of 95%; however, they performed a signal analysis to propose two statistical models and a third one that does not consider transition probabilities between states. Their best performance is achieved with all models working together; if they are tested separately, the performance is at most 86%. Also, their classification method is based on a Hidden Markov Model that needs interconnected states.
The method presented in this work avoids designing statistical models and interconnected states for the HMM. The Artificial Neural Network as a classification method works with high precision when the data presented have been processed with an appropriate algorithm. The proposed method for feature selection achieves high accuracy, because the data presented to the classifier contain the pertinent information for training.
The sampling frequency of 44100 Hz and the windowing of 4096 samples result in a frequency resolution of approximately 10.8 Hz. With this frequency resolution it is not possible to distinguish the low frequencies of the guitar, for example, an E at 82 Hz and an F at 87 Hz. However, the original signal has six sources (strings), where three of them are octaves of the other three (except for 7th chords). Then, because the proposed method for multiple fundamental frequency estimation adds the harmonics for every single frequency bin, the high octaves can be raised. For example, for an E of 82 Hz, the octave at 164 Hz will also be raised. Then, this octave with the other fundamentals gives a correct classification of the chord. In the case of an F, the fundamental at 87 Hz cannot be distinguished from the frequency of 82 Hz. Nevertheless, the octave at 174 Hz will be clearly raised; so, with the other fundamental frequencies of the chord, the ANN performs a correct classification.
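The resolution argument above can be checked with simple arithmetic. The sketch below maps the frequencies quoted in the text to their nearest FFT bin; the helper name `nearest_bin` is hypothetical, and rounding to the nearest bin is an assumption about how a frequency is assigned to a bin.

```python
FS = 44100      # sampling frequency in Hz
N_FFT = 4096    # samples per analysis window

resolution = FS / N_FFT  # about 10.77 Hz per bin


def nearest_bin(freq_hz):
    """Index of the FFT bin closest to freq_hz."""
    return round(freq_hz / resolution)


for label, freq in [("E", 82), ("F", 87), ("E octave", 164), ("F octave", 174)]:
    print(label, freq, "Hz -> bin", nearest_bin(freq))
# E and F share bin 8, whereas their octaves fall in bins 15 and 16.
```

The fundamentals of E and F collapse into the same bin, but their octaves are separated by one bin, which is what allows the two notes to be told apart.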
Owing to its simplicity, the present work can be applied to chord recognition on some devices, for example, a Field-Programmable Gate Array (FPGA) or some microcontrollers. This study leaves the source separation of each string in the guitar for future work. Once a played chord is known, we can make some assumptions about where the hand playing the chord is. Thus, we can apply some methods of blind source separation to obtain the audio of each guitar string. Besides, with the information of separated strings, the classifier can be extended to a wider set of chord families. Because the classification can then be performed on a single string instead of the mixture of six strings, this can lead to the complete transcription of guitar chords and the identification of the strings being played.

Figure 3: Overview of the proposed system for training with the frequency bins and audio samples of the data set.

Figure 4: Central tendency of the fundamental frequency of a chord.

Figure 6: Training set before feature extraction.

Figure 7: Training set after feature extraction.

Table 3: Comparison between methods.