Long Short-Term Memory Projection Recurrent Neural Network Architectures for Piano’s Continuous Note Recognition

1 School of Information Science and Technology, Beijing Forestry University, No. 35 Qinghuadong Road, Haidian District, Beijing 100083, China
2 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, No. 95 Zhongguancundong Road, Haidian District, Beijing 100190, China
3 College of Information Science and Technology, Jinan University, No. 601, West Huangpu Avenue, Guangzhou, Guangdong 510632, China


Introduction
Piano's continuous note recognition is important for robots, whether they are bionic robots, dance robots, or music robots. Several companies are already researching music robots. For example, Vadi, produced by Vatti, is able to identify a voiceprint.
Most existing piano note recognition techniques use the Hidden Markov Model (HMM) and the Radial Basis Function (RBF) to recognize one musical note at a time and are therefore not suitable for continuous note recognition. Fortunately, in the field of pattern recognition, Deep Neural Networks (DNNs) have shown great advantages. DNNs recognize features extracted by a large number of hidden nodes [1]; they compute partial derivatives backward through the chain rule and iteratively update the weight matrices on the training data until they converge, and thereby achieve recognition [2]. RNN adds a time series to DNN [3], which gives the features temporal continuity [4,5]. However, in experiments, we find that RNN's temporal information disappears completely after four iterations [6], and a musical note is generally longer than a frame [7], so RNN is not suitable for piano's continuous note recognition [8]. Fortunately, a variant of RNN named LSTM has been proposed [9-12], in which an input gate, an output gate, and a forget gate are added to maintain a long-term cell state and thus long-term memory [8,9,13-16]. Furthermore, LSTMP adds a projection layer to LSTM to increase its efficiency and effectiveness. This paper studies LSTM and LSTMP for piano's continuous note recognition, and in order to solve the temporal classification problem, we combine LSTM and LSTMP with Connectionist Temporal Classification (CTC).
The rest of this paper is organised as follows. In Section 2, we first introduce the LSTM network architecture and then Deep LSTM. LSTMP is illustrated in Section 3. In Section 4, we discuss CTC. The experimental results are presented in Section 5, and finally, in Section 6, we draw conclusions and outline our future work.

LSTM
2.1. The LSTM Network Architecture. LSTM is a kind of RNN that succeeds in keeping memory for a period of time by adding a "memory cell." The memory cell is mainly controlled by "the input gate," "the forget gate," and "the output gate." The input gate activates the input of information to the memory cell, and the forget gate selectively erases some information in the memory cell and activates the storage for the next input [18]. Finally, the output gate decides what information will be output by the memory cell [19].
The LSTM network structure is illustrated in Figure 1. Each box represents different data, and the lines with arrows mean data flow among these data. From Figure 1, we can understand how LSTM stores memory for a long period of time.
The recognition procedure of LSTM begins with an input sequence $x = (x_1, x_2, \ldots, x_T)$ (each $x_t$ is a vector) and finally outputs a sequence $y = (y_1, y_2, \ldots, y_T)$ (each $y_t$ is also a vector), which is calculated according to the LSTM equations. In these equations, $i$ denotes the input gate, and $o$ and $f$ are the output gate and the forget gate, respectively. $c$ is the information input to the memory cell, comprising the cell activation vectors, and $m$ is the information the memory cell outputs.
$W$ represents weight matrices (e.g., $W_{ix}$ represents the weight matrix from the input $x$ to the input gate $i$). $b$ is the bias ($b_i$ is the input gate bias vector), and $g$ and $h$ are the activation functions of the cell input and the cell output, respectively; $g$ is taken as $\tanh$ in most models and also in this paper. $\odot$ is element-wise multiplication. $\phi$ is the activation function of the neural network output, and we use softmax in this paper. After conducting some experiments, we find that, compared with the standard equation, (3) is simpler and converges more easily: both the training time and the number of iterations are reduced. Therefore, in the neural networks in this paper, we use (3) instead of the standard equation.
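For reference, the widely used LSTM formulation with peephole connections can be written with the symbols defined above as in the sketch below. This is the standard textbook form and is given only as an assumption about the general shape of the equations: the simplified equation (3) used in this paper may differ from the corresponding line here, and writing $\sigma$ for the logistic sigmoid is a notational choice of this sketch.

```latex
\begin{align*}
i_t &= \sigma(W_{ix} x_t + W_{im} m_{t-1} + W_{ic} c_{t-1} + b_i) \\
f_t &= \sigma(W_{fx} x_t + W_{fm} m_{t-1} + W_{fc} c_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cm} m_{t-1} + b_c) \\
o_t &= \sigma(W_{ox} x_t + W_{om} m_{t-1} + W_{oc} c_t + b_o) \\
m_t &= o_t \odot h(c_t) \\
y_t &= \phi(W_{ym} m_t + b_y)
\end{align*}
```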

Deep LSTM.
In piano's continuous note recognition, we also build a multilayer neural network to further increase the recognition rate. Deep LSTM stacks one LSTM on top of another [10]. The added LSTMs have the same structure as the original one, and each layer takes the output of the previous layer as its input. We expect the different LSTM layers to learn different characteristics, so that the network captures the features of musical notes from different aspects and therefore improves the recognition rate.
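The stacking described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in our experiments: it assumes a basic LSTM cell without peephole connections, and the helper names `init_layer`, `lstm_step`, and `deep_lstm_forward`, the layer sizes, and the random input are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_layer(n_in, n_cell, rng):
    """Hypothetical helper: weights of one LSTM layer, uniform in [-0.2, 0.2]."""
    # Rows stack the pre-activations of the input, forget and output gates
    # and of the cell candidate, in that order.
    return {
        "W": rng.uniform(-0.2, 0.2, size=(4 * n_cell, n_in + n_cell)),
        "b": np.zeros(4 * n_cell),
    }

def lstm_step(layer, x, m_prev, c_prev):
    """One time step of a basic LSTM layer (assumption: no peepholes)."""
    n = c_prev.shape[0]
    z = layer["W"] @ np.concatenate([x, m_prev]) + layer["b"]
    i, f, o = sigmoid(z[:n]), sigmoid(z[n:2 * n]), sigmoid(z[2 * n:3 * n])
    c = f * c_prev + i * np.tanh(z[3 * n:])
    m = o * np.tanh(c)
    return m, c

def deep_lstm_forward(layers, xs):
    """Each layer reads the output sequence of the layer below it."""
    seq = xs
    for layer in layers:
        n = layer["b"].shape[0] // 4
        m, c = np.zeros(n), np.zeros(n)
        outputs = []
        for x in seq:
            m, c = lstm_step(layer, x, m, c)
            outputs.append(m)
        seq = outputs
    return np.stack(seq)

rng = np.random.default_rng(0)
layers = [init_layer(9, 20, rng), init_layer(20, 20, rng)]  # two stacked layers
xs = rng.normal(size=(50, 9))  # 50 frames of 9 illustrative input features
top = deep_lstm_forward(layers, xs)
```

Here `top` holds the output of the last layer for every frame; in a full model it would be passed on to the output layer.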

LSTMP-LSTM with a Projection Layer
In LSTM, the various gates involve a large number of calculations, which can be seen by counting the parameters of the neural network. The weight matrices that connect the current input to the input gate, the output gate, and the cell state each have dimension $n_i \times n_c$, the weight matrices applied to the output of the previous time step each have dimension $n_c \times n_c$, and the output matrix connected to the output of the neural network has dimension $n_c \times n_o$, where $n_i$ and $n_o$ are the dimensions of the input and the output, respectively, and $n_c$ is the number of memory cells. We can easily get the following formula: $N_{LSTM} = 3 n_i n_c + 3 n_c n_c + n_c n_o$; that is, $N_{LSTM} = 3 n_i n_c + n_c (3 n_c + n_o)$. As we increase $n_c$, $N_{LSTM}$ grows quadratically. Therefore, increasing the number of memory cells to increase the amount of memory is costly, but a smaller number of cells leads to a lower recognition rate, so we propose an architecture named LSTMP, which can not only improve the accuracy but also effectively reduce the computation.
In the output layer of the neural network, LSTM outputs a matrix $m$ of dimension $n_c$. Then, $m$ is sent into the output matrix to be output and also serves as the input to the neural network at the next time step. We add a layer to the LSTM architecture, and after passing through this layer, $m$ becomes an $n_r$-dimensional matrix called $r$, which replaces $m$ as the input to the neural network at the next time step. With this layer, the number of parameters in the neural network is $N_{LSTMP} = 3 n_i n_c + 3 n_r n_c + n_c n_r + n_r n_o$; that is, $N_{LSTMP} = 3 n_i n_c + n_r (4 n_c + n_o)$.
Therefore, in LSTMP, the factor that dominates the total number of parameters changes from $n_c \times n_c$ to $n_r \times n_c$. We can change the ratio of $n_r$ to $n_c$ to reduce the computational complexity. When $3 n_c > 4 n_r$, LSTMP can speed up the training of the model. Moreover, with the projection layer, LSTMP converges faster, which helps ensure the convergence of the model. In the mathematical formulae of LSTMP, $r$ represents the projection layer, and the other equations are the same as for LSTM. Figure 2 shows the structure of LSTMP, and the part marked with red dashed lines is the projection. By comparing Figure 1 with Figure 2, we can see that LSTMP is LSTM with a projection layer.
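The effect of the projection on the parameter count can be checked with a short calculation. The sketch below assumes the reading of the two formulae in which $n_i$ and $n_o$ are the input and output dimensions, $n_c$ is the number of memory cells, and $n_r$ is the size of the projection layer; the function names are hypothetical, and the output dimension of 9 (8 notes plus one CTC blank) is an assumption made only for this example.

```python
def lstm_params(n_i, n_c, n_o):
    """N_LSTM = 3*n_i*n_c + 3*n_c*n_c + n_c*n_o (as counted in this section)."""
    return 3 * n_i * n_c + 3 * n_c * n_c + n_c * n_o

def lstmp_params(n_i, n_c, n_r, n_o):
    """N_LSTMP = 3*n_i*n_c + n_r*(4*n_c + n_o)."""
    return 3 * n_i * n_c + n_r * (4 * n_c + n_o)

n_i, n_o = 9, 9      # n_o = 9 is an assumption (8 notes + blank)
n_c, n_r = 80, 20    # the "80 to 20" projection

full = lstm_params(n_i, n_c, n_o)
projected = lstmp_params(n_i, n_c, n_r, n_o)
print(full, projected)  # 22080 8740

# The saving is n_c*(3*n_c - 4*n_r) + n_o*(n_c - n_r),
# which is positive whenever 3*n_c > 4*n_r.
saving = full - projected
```

With these illustrative sizes the projected model has well under half the parameters of the plain LSTM, in line with the condition $3 n_c > 4 n_r$.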
Algorithm 1 gives the pseudocode of LSTMP. Its parameters are the input weight matrix, the weight matrix applied to the previous result, the bias, and the projection matrix. We feed the extracted musical note features into the neural network, and the algorithm runs until we obtain an acceptable recognition rate.
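A single LSTMP time step can be sketched as follows. This is a hedged illustration rather than a transcription of Algorithm 1: it assumes a basic cell without peephole connections and with four gate blocks, and the names `lstmp_step`, `W` (input weights), `U` (weights on the previous projected result), `b` (bias), and `P` (projection matrix) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstmp_step(p, x, r_prev, c_prev):
    """One LSTMP step: LSTM cell update followed by a linear projection."""
    n = c_prev.shape[0]
    # The recurrent input is the projected output r, not the full cell output m.
    z = p["W"] @ x + p["U"] @ r_prev + p["b"]
    i, f, o = sigmoid(z[:n]), sigmoid(z[n:2 * n]), sigmoid(z[2 * n:3 * n])
    c = f * c_prev + i * np.tanh(z[3 * n:])
    m = o * np.tanh(c)
    r = p["P"] @ m  # projection from n_c cell outputs to n_r units
    return r, c

n_i, n_c, n_r = 9, 80, 20  # illustrative sizes ("80 to 20" projection)
rng = np.random.default_rng(0)
p = {
    "W": rng.uniform(-0.2, 0.2, size=(4 * n_c, n_i)),
    "U": rng.uniform(-0.2, 0.2, size=(4 * n_c, n_r)),
    "b": np.zeros(4 * n_c),
    "P": rng.uniform(-0.2, 0.2, size=(n_r, n_c)),
}

r, c = np.zeros(n_r), np.zeros(n_c)
for x in rng.normal(size=(30, n_i)):  # 30 frames of random stand-in features
    r, c = lstmp_step(p, x, r, c)
```

Because the recurrent matrix `U` has only `n_r` columns, the recurrent part of the model shrinks from $n_c \times n_c$ to $n_c \times n_r$ entries per gate.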

CTC
The output layer of our LSTM and LSTMP is called CTC [20]. We use CTC because it does not need presegmented training data or external postprocessing to extract label sequences from the network outputs.
Like many recent neural networks, CTC has forward and backward algorithms. For the forward algorithm, the key point is to estimate the distribution over labellings through probabilities. Given an input sequence $x$ of length $T$ and the training set $S$, the activation $y_k^t$ of output unit $k$ at time $t$ is interpreted as the probability of observing label $k$ at time $t$ (the blank corresponds to $k = |L| + 1$). We refer to the elements $\pi \in L'^T$ as paths, where $L'^T$ is the set of length-$T$ sequences over the alphabet $L' = L \cup \{\text{blank}\}$. Then we define a many-to-one map $\mathcal{B}$ that removes first the repeated labels and then the blanks from the paths. The paths are mutually exclusive. Owing to this property, the conditional probability of some labelling $l \in L^{\leq T}$ can be calculated by summing the probabilities of all the paths mapped onto it by $\mathcal{B}$.
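In the standard CTC formulation, which we assume here, the probability of a path is $p(\pi \mid x) = \prod_{t=1}^{T} y_{\pi_t}^{t}$ and the probability of a labelling is $p(l \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(l)} p(\pi \mid x)$. The sketch below evaluates this sum by brute-force enumeration of all paths, which is feasible only for tiny examples; practical implementations use the forward and backward recursions instead. The function names, the choice of index 0 for the blank, and the numbers in `y` are hypothetical.

```python
import itertools
import numpy as np

BLANK = 0  # assumption of this sketch: the blank has index 0

def collapse(path):
    """The map B: merge repeated labels, then remove blanks."""
    merged = [k for k, _ in itertools.groupby(path)]
    return tuple(k for k in merged if k != BLANK)

def labelling_probability(y, labelling):
    """Sum the path probabilities over all paths that B maps onto `labelling`.

    y[t, k] is the probability of observing label k at time t.
    """
    T, K = y.shape
    target = tuple(labelling)
    total = 0.0
    for path in itertools.product(range(K), repeat=T):
        if collapse(path) == target:
            total += float(np.prod(y[np.arange(T), list(path)]))
    return total

# Three frames, blank plus two labels; each row sums to 1.
y = np.array([[0.6, 0.3, 0.1],
              [0.5, 0.4, 0.1],
              [0.7, 0.2, 0.1]])

p_label_1 = labelling_probability(y, (1,))
p_label_12 = labelling_probability(y, (1, 2))
```

Because the paths are mutually exclusive and every path maps onto exactly one labelling, the probabilities of all labellings of length at most $T$ sum to 1.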

After all these procedures, CTC will complete its classification task.

Experiments
We conduct all our experiments on a server with 4 Intel Xeon E5-2620 CPUs and 512 GB of memory. An NVIDIA Tesla M2070-Q graphics card is used to train all the models. The programming language we use is Python 3.5.
We choose the piano as our instrument. We record 445 note sequences as our dataset and the length of each sequence is around 8 seconds.
In the extraction of features, we first apply a Hamming window to the signal and then take the Fast Fourier Transform (FFT) of each window, obtaining its real part and imaginary part. We then combine the two parts by adding the square of the real part and the square of the imaginary part together, and take the logarithm of this sum of squares. Finally, the input data are normalized.
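The feature extraction steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not our exact pipeline: the function name `log_power_features`, the frame length, the hop size, the sampling rate, the small constant `eps`, and the per-bin zero-mean, unit-variance normalization are all hypothetical choices, and the test signal is a synthetic tone rather than a recorded note.

```python
import numpy as np

def log_power_features(signal, frame_len=512, hop=256, eps=1e-10):
    """Hamming window -> FFT -> real^2 + imag^2 -> log -> normalization."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[k * hop:k * hop + frame_len] * window
        for k in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)
    # Sum of the squared real and imaginary parts of each FFT bin.
    power = spectrum.real ** 2 + spectrum.imag ** 2
    log_power = np.log(power + eps)  # eps avoids log(0)
    # Assumed normalization: zero mean and unit variance per frequency bin.
    mean = log_power.mean(axis=0)
    std = log_power.std(axis=0) + eps
    return (log_power - mean) / std

sample_rate = 16000  # assumed sampling rate
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)  # one second of a 440 Hz tone
feats = log_power_features(tone)
```

Each row of `feats` is the feature vector of one frame and can be fed to the network one time step at a time.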
In the experiments, there are 8 kinds of notes and 9 input nodes. We try different numbers of cell units in our models, from 20 to 320. The initial weights of the neural networks are set to random values within [−0.2, 0.2], and the learning rate is 0.001. In terms of structure, all the neural networks are connected to a single-layer CTC. As for the dataset, we choose 80% of the samples as the development set and 20% as the test set. Table 1 shows the recognition rates of LSTM, DLSTM, and LSTMP with different parameters, together with the number of iterations each model needs until its recognition rate is stable; the best results are in bold.
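The initialization and data split described above can be sketched as follows. The helper names are hypothetical, and the use of a uniform distribution and of a random shuffle for the split are assumptions of this sketch, since only the weight range and the 80%/20% proportions are specified.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(shape, low=-0.2, high=0.2):
    """Random initial weights within [-0.2, 0.2] (uniform is an assumption)."""
    return rng.uniform(low, high, size=shape)

def split_indices(n_samples, dev_fraction=0.8):
    """Shuffle sample indices and split them into development and test sets."""
    order = rng.permutation(n_samples)
    n_dev = int(round(dev_fraction * n_samples))
    return order[:n_dev], order[n_dev:]

learning_rate = 0.001
W = init_weights((20, 9))                  # e.g. 20 cell units, 9 input nodes
dev_idx, test_idx = split_indices(445)     # 445 recorded note sequences
```

With 445 sequences this gives 356 sequences for development and 89 for testing.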

Experimental Results.
In Table 1, "LSTMP-80 to 20" means the LSTMP model projecting 80 cell states to 20 cell states. From Table 1, we see that DLSTM and LSTMP perform much better than LSTM, and their best recognition rates are almost the same, at 100% and 99.8%, respectively. As for the number of iterations, LSTMP needs far fewer iterations than LSTM and DLSTM, which makes LSTMP more suitable for piano's continuous note recognition in robotics when efficiency is considered. Figure 3 illustrates LSTMP with different parameters and DLSTM with different numbers of layers. The x-axis shows the number of iterations and the y-axis shows the recognition rate. We see that for LSTMP the model projecting 80 cell states to 20 cell states has the best result, but all LSTMP results are very close.

LSTMP and DLSTM with Different Parameters.
As for DLSTM, we see clearly that Deep LSTM is much better than LSTM with only one layer.

Comparisons of LSTM, LSTMP, and DLSTM.
We compare LSTM, LSTMP, and DLSTM in Figure 4. Given the same parameters, LSTMP performs much better than LSTM. As for LSTMP and DLSTM, we find that when the number of iterations is small, LSTMP has great advantages, but as the number of iterations increases, DLSTM becomes better.

Conclusions and Future Work
In this paper, we have used a neural network structure, LSTM with CTC, to recognize continuous musical notes. On the basis of LSTM, we have also tried LSTMP and DLSTM. Among them, LSTMP worked best when projecting 80 cell states to 20 cell states, and it needed far fewer iterations than LSTM and DLSTM, making it the most suitable for piano's continuous note recognition.
In the future, we will use LSTM, LSTMP, and DLSTM to recognize more complex continuous chord music, such as piano music, violin pieces, or even symphonies, which will greatly advance the development of music robots.

Conflicts of Interest
The authors declare that they have no conflicts of interest.