With the widespread growth in the use of digital computers, there has been an increasing need to communicate with machines in a simple manner. One of the main tasks that simplifies communication with machines is speech recognition, the translation of spoken words into text. However, speech recognition is a very complex problem. This paper addresses the recognition of spoken Arabic digits. Two recognition techniques have been implemented and tested: the Pitch Detection Algorithm (PDA) and the Cepstrum Correlation Algorithm (CCA). In order to analyze the recognition accuracy of the selected techniques, a database of spoken Arabic digits has been created, and the performance of the two techniques has been analyzed against it.
Speech can be classified into two general categories: voiced and unvoiced. Voiced speech is produced when the speaker's vocal cords vibrate as the sound is made, while unvoiced speech is produced without vocal cord vibration [
There are many techniques used in recognizing voiced speech, and in this paper, we consider two techniques: Pitch Detection Algorithm (PDA) [
The PDA is one of the most robust and reliable techniques and is known for its very high accuracy in voiced pitch estimation. It is commonly used to estimate the pitch level and is based on detecting the highest value of the autocorrelation function in the region of interest.
Cepstral analysis is most often used to obtain Mel-Frequency Cepstral Coefficients (MFCCs), whose vectors serve as features in pattern recognition. Here, however, we use a much simpler approach that achieves almost the same recognition results as the MFCC-based technique.
The main objective of this paper is to study and implement the two sound recognition techniques. A second objective is to create a database of spoken Arabic digits from different users and then use it to evaluate the performance of the selected techniques. A third objective is to design a Graphical User Interface (GUI) to be used in the analysis and performance comparison of the recognition techniques.
The frequency of the human voice ranges from 20 Hz to 14,000 Hz (typically from 300 Hz to 4,000 Hz). The frequency of a sound wave determines its perceived tone and pitch. In general, the frequencies that carry the most significant part of speech lie between about 100 Hz and 4,000 Hz.
There are many preprocessing techniques needed to perform speech recognition. The main pre-processing techniques are explained as follows.
Normalization is the process by which the signal is brought into a range consistent with expected values. Normalization has been applied to the audio signals here to prevent clipping; it is good practice to scale the audio being processed to a specific range.
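As a minimal sketch (the paper does not give its normalization code), peak normalization can be implemented as follows; the `target_peak` value of 0.99 is an assumed choice that keeps samples just below full scale:

```python
import numpy as np

def normalize(signal, target_peak=0.99):
    """Scale an audio signal so its peak magnitude equals target_peak,
    preventing clipping while preserving the waveform shape."""
    peak = np.max(np.abs(signal))
    if peak == 0:
        return signal  # silent signal: nothing to scale
    return signal * (target_peak / peak)
```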
Zero padding is a useful process that can be used in many applications for various reasons. In this paper, zero padding was used because the correlation process used later cannot be done unless both the database signal and the test signal are equal in length.
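A simple sketch of the padding step described above, assuming trailing zeros are appended to the shorter signal:

```python
import numpy as np

def pad_to_equal_length(a, b):
    """Zero-pad the shorter of two signals so both have the same length,
    as required before correlating a database signal with a test signal."""
    n = max(len(a), len(b))
    a_padded = np.pad(a, (0, n - len(a)))
    b_padded = np.pad(b, (0, n - len(b)))
    return a_padded, b_padded
```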
Classically, spectral techniques making use of a short-time Fourier transform have been the dominant solution for classification and recognition problems.
The term “cepstrum” was coined by Bogert et al. [
The cepstrum is a common transform used to gain information from a person's speech signal. It can be used to separate the excitation signal (which contains the words and the pitch) from the transfer function (which contains the voice quality).
The cepstrum is defined as the inverse DFT of the log magnitude of the DFT of a signal. Consider
Computation of the Cepstrum of a signal
If we were told that the log spectrum was a waveform, we would describe it as quasiperiodic with some form of amplitude modulation. To separate the two components, we can again employ the DFT: we would expect it to show a large spike around the "period" of the signal and some "low-frequency" components due to the amplitude modulation. Figure
Flow chart of the Cepstrum Algorithm.
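The definition above (inverse DFT of the log magnitude of the DFT) can be sketched in a few lines; the small `eps` constant is our own guard against taking the logarithm of zero, not part of the paper's formulation:

```python
import numpy as np

def real_cepstrum(signal, eps=1e-10):
    """Real cepstrum: inverse DFT of the log magnitude of the DFT of a signal.
    eps guards against log(0) at spectral nulls."""
    spectrum = np.fft.fft(signal)
    log_mag = np.log(np.abs(spectrum) + eps)
    return np.real(np.fft.ifft(log_mag))
```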
PDA is an algorithm designed to estimate the pitch or fundamental frequency of a quasiperiodic or virtually periodic signal, usually a digital recording of speech.
When a segment of a signal is correlated with itself, the distance between the positions of the maximum and the second maximum correlation is defined as the fundamental period (pitch) of the signal.
The modified autocorrelation pitch detector used in our implementation is based on two clipping techniques: center clipping and infinite peak clipping.
Figure
The clipping level is set at 50% of the signal's peak value.
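The two clipping operations can be sketched as follows; the 50% clipping level follows the text, while the exact clipping formulas below are the standard definitions rather than the authors' own code:

```python
import numpy as np

def center_clip(frame, ratio=0.5):
    """Center clipping: samples within +/-(ratio * peak) are zeroed; the
    remaining samples are shifted toward zero by the clipping level."""
    cl = ratio * np.max(np.abs(frame))
    out = np.zeros_like(frame, dtype=float)
    above = frame > cl
    below = frame < -cl
    out[above] = frame[above] - cl
    out[below] = frame[below] + cl
    return out

def infinite_peak_clip(frame, ratio=0.5):
    """Infinite peak clipping: samples beyond the clipping level map to +/-1,
    everything else to 0 (a three-level version of the signal)."""
    cl = ratio * np.max(np.abs(frame))
    return np.sign(frame) * (np.abs(frame) > cl)
```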
The autocorrelation pitch detector is one of the most robust and reliable pitch detectors based on detecting the highest value of the autocorrelation function. The autocorrelation is calculated as follows:
The autocorrelation function is searched for its maximum value. If the maximum exceeds 0.61 of the autocorrelation value at zero delay (an energy-based threshold), the section is classified as voiced, and the location of the maximum gives the pitch period. Otherwise, the section is classified as unvoiced.
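A sketch of this decision rule is given below. The 0.61 threshold follows the text; the 50–500 Hz pitch search range is an assumption of ours, typical for speech but not specified in the paper:

```python
import numpy as np

def detect_pitch(frame, fs, threshold=0.61, fmin=50, fmax=500):
    """Autocorrelation pitch detector: the frame is voiced if the maximum of
    the autocorrelation (within the plausible pitch-lag range) exceeds
    `threshold` times the zero-lag value; the lag of that maximum is the
    pitch period. Returns the fundamental frequency in Hz, or None."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    if r[0] == 0:
        return None  # silent frame
    lag_min = int(fs / fmax)                 # shortest lag (highest pitch)
    lag_max = min(int(fs / fmin), len(r) - 1)  # longest lag (lowest pitch)
    lag = lag_min + np.argmax(r[lag_min:lag_max + 1])
    if r[lag] > threshold * r[0]:
        return fs / lag                      # voiced: pitch in Hz
    return None                              # unvoiced
```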
A formant is a concentration of acoustic energy around a particular frequency in the speech wave. There are several formants, each at a different frequency, roughly one in each 1000 Hz band. Each formant corresponds to a resonance in the vocal tract.
Figure
Flow chart of Pitch Detection Algorithm.
In order to evaluate the performance of the PDA and CCA algorithms, two databases of spoken Arabic digits (0 to 9) were created: one for three male speakers (User 1, User 2, and User 3) and one for three female speakers (User 4, User 5, and User 6).
Each recording session produced a single speech file, approximately 12 s long, containing all the Arabic digits. This process was repeated 13 times, so that 13 speech files were collected for each user.
Every speech file contained both speech and nonspeech segments. Each file was analyzed by a detection program in order to locate and segment each spoken digit accurately. Two measures were used in segmenting the sound signals: the zero-crossing rate and the signal energy.
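The two segmentation measures can be sketched per frame as below; the frame length and hop size are assumed values, as the paper does not state them:

```python
import numpy as np

def frame_features(signal, frame_len=256, hop=128):
    """Short-time energy and zero-crossing rate per frame; taken together,
    the two measures separate speech frames from silence/background."""
    energies, zcrs = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(np.sum(frame ** 2))                       # energy
        zcrs.append(np.mean(np.abs(np.diff(np.sign(frame))) > 0))  # ZCR
    return np.array(energies), np.array(zcrs)
```

Frames with high energy and moderate zero-crossing rate are kept as speech; low-energy frames are discarded as silence.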
The set of recorded files for each user was divided into two groups: ten files were chosen to form the dataset, while the remaining three files were used as a test set. The ground truth (GT), which defines the original signal number for each spoken digit inside the database, was used to compare the recognition output against the correct spoken digit. Based on the recognition result of each technique, the percentage of correct recognition can be calculated.
The recognition starts by taking two signals: one from the created database and the other from the test samples, and the objective is to recognize the spoken Arabic digit based on CCA and PDA techniques.
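A hypothetical sketch of the CCA matching step under the setup just described: the test signal's cepstrum is cross-correlated with the cepstrum of every database signal (zero-padded to a common length), and the digit with the highest correlation peak wins. Dropping the zeroth cepstral coefficient (which only reflects overall signal level) and the digit labels below are our own simplifications, not the authors' implementation:

```python
import numpy as np

def recognize_cca(test_signal, database):
    """Cepstrum Correlation sketch. `database` maps digit labels to
    reference signals; returns the label with the highest cepstral
    cross-correlation peak."""
    def cepstrum(x, n):
        spec = np.fft.fft(x, n)  # FFT zero-pads x to length n
        c = np.real(np.fft.ifft(np.log(np.abs(spec) + 1e-10)))
        return c[1:]             # drop c[0], the overall log-energy term

    n = max(len(test_signal), max(len(s) for s in database.values()))
    c_test = cepstrum(test_signal, n)
    scores = {d: np.max(np.correlate(c_test, cepstrum(s, n), mode='full'))
              for d, s in database.items()}
    return max(scores, key=scores.get)
```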
This procedure is repeated over all ten database signals, and the best five results are retained.
The results were taken in two groups, first one for males and the second for females. Each group has the recognition results with and without normalization.
The males’ recognition results without normalization are shown in Figure
The percentage of all males’ voices without normalization.
The percentage of all males’ voices with normalization.
From the figures, the recognition results of User 2 and User 3 were relatively better than those of the other users because their recordings were made in a quiet environment. The results obtained with normalization were relatively better than those without normalization. Spoken Arabic digit 6 was the hardest number to recognize because of the nature of its syllables in the Arabic language. Spoken Arabic digits 3, 7, and 10 had the best recognition results because they consist of strong Arabic syllables, which makes them easy to recognize. The recordings of User 1 were made with a different microphone than those of User 2 and User 3; the microphone of the latter two users had higher sensitivity to spoken words, which made their speech more recognizable by the recognition techniques.
The females’ recognition results without normalization are shown in Figure
The percentage of all females’ voices without normalization.
The percentage of all females’ voices with normalization.
The recognition results were considered using the center clipping and infinite peak clipping.
Recognition results of the three test signals of User 1.
Here, "zero" has the lowest percentage, whereas "eight" has the best result, scoring 100% twice. The digit "six" is recognized with 100% accuracy for the first user but with 0% for Users 2 and 3. The other digits score roughly the same.
Percentage of all males’ voices for all spoken digits.
As can be noted from the figure, the percentages are not the same but vary according to the users.
The recognition results for all male and female users can be summarized in Tables
Recognition accuracy of the average male users.
| | Whole signal | 1/3 signal | Average |
| --- | --- | --- | --- |
| Center clipping | 34.22% | 37.33% | 35.775% |
| Infinite peak clipping | 25.554% | 28.444% | 25.497% |
Recognition accuracy of the average female users.
| | Whole signal | 1/3 signal | Average |
| --- | --- | --- | --- |
| Center clipping | 31.996% | 37.33% | 31.828% |
| Infinite peak clipping | 33.33% | 28.444% | 31.66% |
PDA recognition using center clipping was found to be better for males (35.8% accuracy) than for females (31.8% accuracy). Infinite peak clipping, on the other hand, gives only about 25.5% accuracy for males. For females, the two clipping methods gave nearly the same recognition accuracy. For this reason, center clipping is the more commonly used method.
In general, the results of the CCA approach were relatively better than those of the PDA approach.
A Graphical User Interface (GUI) [
The operation “Performance evaluation” was used to evaluate the recognition result of the selected technique. This was done by loading a prerecorded signal and then performing the recognition based on the selected technique, as shown in Figure
Performance evaluation using CCA.
As can be seen in Figure
Another example is shown in Figure
Performance evaluation using PDA.
This paper has described the recognition of spoken Arabic digits using two techniques: the PDA and the CCA.
Spoken Arabic digits six and nine were especially difficult to recognize, owing to the complexity of the voiced signals of these two spoken digits.
As future work on speech recognition, we will consider the following tasks: using DSP kits in recognition tasks and using other recognition techniques such as the hidden Markov model (HMM).