Optimized Audio Classification and Segmentation Algorithm by Using Ensemble Methods

Audio segmentation is a basis for multimedia content analysis, which is among the most important and widely used applications today. This paper presents an optimized audio classification and segmentation algorithm that segments a superimposed audio stream on the basis of its content into four main audio types: pure-speech, music, environment sound, and silence. The proposed algorithm preserves important audio content and reduces the misclassification rate without using a large amount of training data; it handles noise and is suitable for real-time applications. Noise in an audio stream is segmented out as environment sound. A hybrid classification approach is used: bagged support vector machines (SVMs) with artificial neural networks (ANNs). The audio stream is first classified into speech and nonspeech segments by using bagged support vector machines; the nonspeech segment is further classified into music and environment sound by using artificial neural networks; and lastly, the speech segment is classified into silence and pure-speech segments by a rule-based classifier. Minimum data is used for training the classifier; ensemble methods are used for minimizing the misclassification rate, and approximately 98% accurate segments are obtained. The resulting fast and efficient algorithm can be used with real-time multimedia applications.


Introduction
The rapid growth of multimedia data over the internet has created a major shift towards online services. In most multimedia applications, audio information is an important part. The most common and popular example of online information is music [1]. Audio analysis, video analysis, and content understanding can be achieved by segmenting and classifying an audio stream on the basis of its content [2]. For this purpose, an efficient and accurate method is required that segments an audio stream. A technique in which an audio stream is divided into homogeneous (similar) regions is called audio segmentation [1]. The advent of multimedia and network technology has produced a rapid increase in digital data, and this has caused a growing interest in multimedia content-based information retrieval. For analyzing and understanding an audio signal, the fundamental step is to discriminate the audio signal on the basis of its content. Audio classification and segmentation is a pattern recognition problem. It comprises two main stages: feature extraction and then classification on the basis of the extracted features (statistical information) [3].
Applications of audio content analysis can be categorized in two parts. One part is to discriminate an audio stream into homogeneous regions, and the other is to discriminate a speech stream into segments of different speakers. Lu et al. [2,4] discriminate an audio stream into different audio types. The classifiers used are support vector machines [5][6][7][8][9] and nearest neighbor integrated with linear spectral pairs-vector quantization, respectively. The training is done on 2-hour data.
Mathematical Problems in Engineering
Coz et al. [10] presented an audio indexing system that characterizes various content levels of a sound track by frequency tracking. The system does not require any prior knowledge. A fuzzy approach is used by Kiranyaz et al. [11], in which a hierarchic audio classification and segmentation algorithm based on automated audio analysis is proposed. An audio signal is divided into homogeneous regions by finding time boundaries, also called change point detection. In audio segmentation, change detection is used to segment a sound signal into homogeneous and continuous temporal regions. The problem arises in defining the criteria of homogeneity. By computing exact generalized likelihood ratio statistics, audio stream segmentation can be done without any prior knowledge of the classes. Mel-frequency cepstral coefficients are used as features [12]. A large amount of training data is required for calculating the statistics.
Tasks like meeting transcription and automatic camera panning require the segmentation of a group meeting into the speech of individual persons. Bayesian information criterion (BIC) is used for segmenting the feature vectors [13][14][15]. BIC requires a large amount of training data. Structured discriminative models use structured support vector machines (SSVMs) for large vocabulary speech recognition tasks. Hidden Markov models (HMMs) [16][17][18][19][20][21] are used to determine the features and a Viterbi-like scheme is used [14].
Traditionally used audio retrieval systems are text based, whereas the human auditory systems principally rely on perception. As the text only elaborates the high level content, this is not sufficient to get any perceptual likeness between two acoustic audio clips. This problem can be solved easily by using Query by example technique. In this technique, only those audio samples are predicted from databases that sound similar to the example. Query by example is quite a different approach from audio classification. For modeling the continuous probability distribution of audio features, Gaussian mixture model (GMM) is used [22]. Janku and Hyniová [23] proposed that MMI-supervised tree-based vector quantizer and feedforward neural network [16,17,24,25] can be used on a sound stream in order to detect environmental sounds and speech. Regularized kernel based method based on kernel Fisher discriminant can be used for unsupervised change detection [26,27].
Speech is not only a mode of transmitting word messages; it also conveys emotions, personality, and so forth. Words contain vowel regions, which are of vital importance in many speech applications, mainly in speech segmentation and speaker verification. A vowel region begins at the vowel onset point and ends at the vowel offset point. Audio segmentation is also possible by dividing an audio stream into segments on the basis of vowel regions [28].
Audio segmentation algorithms can be divided into three general categories. In the first category, classifiers are designed [29]. The features are extracted in the time domain and frequency domain; then a classifier is used to discriminate audio signals on the basis of their content. The second category of audio segmentation extracts statistical features that are used by the classifier for discrimination. These types of features are called posterior probability based features. A large amount of training data is required by the classifier to give accurate results. The third category of audio segmentation algorithms emphasizes setting up effective classifiers. The classifiers used in this category are Bayesian information criterion, Gaussian likelihood ratio, and a hidden Markov model (HMM) classifier. These classifiers also give good results when large training data is provided [29].
Audio segmentation and classification have many applications. Content-based audio classification and retrieval are mostly used in entertainment industry, audio archive management, commercial music usage, surveillance, and so forth. Nowadays, on the World Wide Web, millions of databases are present; for audio searching and indexing audio segmentation and classification are used. In monitoring broadcast news programs, audio classification is used, helping in efficient and accurate navigation through broadcast news archives [30].
The analysis of superimposed speech is a complex problem and improved performance systems are required. In many audio processing applications, audio segmentation plays a vital role in preprocessing step. It also has a significant impact on speech recognition performance. That is why a fast and optimized audio classification and segmentation algorithm is proposed which can be used for real-time applications of multimedia. The audio input is classified and segmented into four basic audio types: pure-speech, music, environment sound, and silence. An algorithm is proposed that requires less training data and from which high accuracy can be achieved; that is, misclassification rate is minimum.
The organization of paper is as follows: Audio classification and segmentation algorithm (proposed), preclassification step, feature extraction step, hybrid classifier approach (bagged SVMs (support vector machines) with ANNs (artificial neural networks)), and steps used for discrimination are discussed. In Results and Discussion the experimental results are discussed.

Audio Classification and Segmentation Step.
A hybrid classification scheme is proposed in order to classify an audio clip into basic data types. Before classification, a preclassification step analyzes each windowed frame of the audio clip separately. Then the feature extraction step is performed, from which a normalized feature vector is obtained. After feature extraction, the hybrid classifier approach is used. The first step classifies audio clips/frames into speech and nonspeech segments by using bagged SVM. As silence frames are mostly present in the speech signal, the speech segment is classified into silence and pure-speech segments by a rule-based classifier. Finally, the ANN classifier is used to further discriminate nonspeech segments into music and environment sound segments. This hybrid scheme is used to achieve high classification accuracy and can be used for different real-time applications of multimedia. Figure 1 illustrates the block diagram of the proposed algorithm. The audio stream is taken as input and downsampled to 8 kHz; the preclassification step is applied to this audio stream; the features {zero-crossing rate, short-time energy, spectrum flux, Mel-frequency cepstral coefficients, and periodicity analysis} are extracted; and the hybrid classifier is used. Bagged SVM uses the features {zero-crossing rate, short-time energy, spectrum flux, and Mel-frequency cepstral coefficients} and classifies the audio clip into speech and nonspeech segments; the features {spectrum flux, periodicity analysis, and Mel-frequency cepstral coefficients} are used by the ANN to classify nonspeech segments into music and environment sound. A rule-based classifier is used to discriminate silence and pure-speech segments. In the preprocessing step for audio segmentation, all input signals are downsampled to an 8 kHz sampling rate. Audio clips are subsequently segmented into 1-s frames. This 1-s frame is taken as the basic classifying unit. For feature extraction, nonoverlapping frames are used. The features signify the characteristic information present within each 1-s audio clip.
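The framing into nonoverlapping 1-s units can be sketched in a few lines of Python. This is a minimal illustration, assuming the audio is already available as a list of samples at 8 kHz; the function name `split_into_frames` and the choice to drop an incomplete trailing frame are assumptions, not details given in the paper.

```python
def split_into_frames(samples, sample_rate=8000, frame_seconds=1.0):
    """Split a sequence of samples into nonoverlapping fixed-length frames.

    A trailing remainder shorter than one frame is dropped (an assumption;
    the handling of a partial last frame is not specified in the text).
    """
    frame_len = int(sample_rate * frame_seconds)
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# 2.5 s of dummy audio at 8 kHz yields two complete 1-s frames.
frames = split_into_frames([0.0] * 20000)
print(len(frames), len(frames[0]))
```

Each returned frame would then be passed to the feature extraction step as one classification unit.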

Preclassification Step.
The speech signal is superimposed (i.e., in mixed form), which means that a conversation is held at a place or party where there is music and a lot of noise. This is also known as the cocktail party effect. Separating the source or the desired segments within the independent component analysis framework is known as blind source separation [31][32][33]. Blind source separation is generally a method used to separate a mixed signal into independent sources when the mixing process is not known [34]. Most blind source separation techniques use higher order statistics, for which iterative calculations are required [35]. The Molgedey and Schuster method is used here for separating the signals on the basis of second order statistics (correlation). It does not need higher order statistics or iterative calculations. The temporal structure of the signals is analyzed and the separation is done on this basis.
The mixed signal is first converted to the time-frequency domain, also called the spectrogram of the signal, by applying the Fourier transform at short-time intervals. A Hamming window is used. In order to avoid mixing of spectrograms, each spectrogram is dealt with separately. Correlation is performed on all these short intervals. The sphering and rotation step is then performed. Orthogonalizing source signals in the observation coordinates is called sphering. An observation is actually a projection of the source signals in a certain direction. The original observations are not orthogonal; by applying sphering, these observations are arranged in such a way that they become orthogonal to each other. An ambiguity of rotation still remains, even after sphering. The correct rotation can be found by removing all the off-diagonal elements present in the correlation matrix. Simultaneous diagonalization [36,37] is applied at several time delays. The reconstruction step is performed on each separated signal's spectrogram. All the decomposed frequency components are then combined. At the end, the permutation step is performed for finding the relation between the separated signals, as shown in Figure 2. The decision is made by using the classifier.
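The first stage of this step, the short-time Fourier transform with a Hamming window, can be illustrated with a small standard-library sketch. It covers only the spectrogram computation for a single channel, not the sphering, rotation, or permutation stages; the window length of 64 samples, the hop of 32 samples, and all function names are assumptions chosen for illustration.

```python
import cmath
import math


def hamming(length):
    """Hamming window: w(i) = 0.54 - 0.46 * cos(2*pi*i / (length - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]


def dft_magnitudes(frame):
    """Magnitudes of the DFT for bins 0 .. len(frame)//2 (naive O(n^2) DFT)."""
    n = len(frame)
    return [abs(sum(frame[m] * cmath.exp(-2j * math.pi * k * m / n)
                    for m in range(n)))
            for k in range(n // 2 + 1)]


def spectrogram(signal, win_len=64, hop=32):
    """One row of DFT magnitudes per Hamming-windowed segment."""
    window = hamming(win_len)
    rows = []
    for start in range(0, len(signal) - win_len + 1, hop):
        segment = [signal[start + i] * window[i] for i in range(win_len)]
        rows.append(dft_magnitudes(segment))
    return rows


# A sinusoid with exactly 8 cycles per 64-sample window peaks in DFT bin 8.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
spec = spectrogram(tone)
peak_bins = [row.index(max(row)) for row in spec]
print(len(spec), peak_bins)
```

A practical implementation would use an FFT routine rather than the naive DFT shown here; the naive form is kept only so that the sketch has no external dependencies.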

Feature Extraction Step.
The process of converting an audio signal into a sequence of feature vectors is called feature extraction. The feature vectors carry temporal as well as spectral characteristic information about the audio signal. Feature vectors are calculated on a window basis. Feature selection has a great impact on the performance of audio segmentation systems. Three types of features are calculated in this work: Mel-frequency cepstral coefficients (MFCCs), time-domain features, and frequency-domain features. These normalized features are combined to form a feature vector.
Initially the audio stream is converted to 16 bit chunks at a sampling rate of 8 kHz. Feature extraction step is performed on the separated signals obtained after preclassification step. These separated signals are divided into nonoverlapping frames. These frames are used as classification unit. On the basis of the classification results segmentation is performed.
As suggested by [38], 12th-order Mel-frequency cepstral coefficients are used. The time-domain features are zero-crossing rate, short-time energy, and periodicity analysis. The frequency-domain feature is spectrum flux.

Zero-Crossing Rate (ZCR).
Zero-crossing is a measure of the signal changes that occur from positive to negative or vice versa, as shown in Figure 3. In general, it is defined as the number of zero-crossings within a frame. Zero-crossing rate discriminates speech and music effectively: because speech contains more silent regions than music, the zero-crossing rate for speech is greater than for music [4,30].
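The expression for the zero-crossing rate is given by
$$\mathrm{ZCR} = \frac{1}{2(N-1)} \sum_{m=1}^{N-1} \left| \operatorname{sgn}[x(m+1)] - \operatorname{sgn}[x(m)] \right| \qquad (1)$$
where $x(m)$ represents the discrete signal for $m = 1, \ldots, N$ and $\operatorname{sgn}[\cdot]$ is the sign function.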
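The zero-crossing rate of a single frame can be computed as in the following minimal sketch. The convention that the sign function returns 1 for a sample equal to zero is an assumption, as are the function names.

```python
def sign(value):
    """sgn[x]: 1 for non-negative samples, -1 for negative ones (assumed convention)."""
    return 1 if value >= 0 else -1


def zero_crossing_rate(frame):
    """ZCR = 1 / (2 (N - 1)) * sum over m of |sgn[x(m+1)] - sgn[x(m)]|."""
    n = len(frame)
    if n < 2:
        return 0.0
    changes = sum(abs(sign(frame[m + 1]) - sign(frame[m])) for m in range(n - 1))
    return changes / (2 * (n - 1))


alternating = [1, -1, 1, -1, 1]   # the sign flips at every step
constant = [0.5, 0.4, 0.3, 0.2]   # the sign never changes
print(zero_crossing_rate(alternating), zero_crossing_rate(constant))
```

With this normalization the value lies between 0 (no sign change) and 1 (a sign change between every pair of neighboring samples).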

Short-Time Energy (STE).
Short-time energy is a measure of the total energy of a frame. STE is a useful feature for distinguishing speech and music segments. The STE of speech signals has larger variations than that of music signals [4,30], because the frequency characteristics of the human voice are very different from those of musical instruments. Figure 4 shows the short-time energy calculated from multiple frames of speech and music.
Figure 5: Spectrum flux plot (calculated for multiple frames of speech, environment sound, and music waveforms; from the plot it can be easily observed that the spectrum flux of speech, from 0 to 280 s, is greater than the spectrum flux of music, from 561 to 850 s, and that the spectrum flux of environment sound, from 281 to 560 s, is the highest).
Mathematically, the short-time energy feature is expressed as
$$E_n = \frac{1}{N} \sum_{m} \left[ x(m)\, w(n-m) \right]^2 \qquad (2)$$
where $x(m)$ is the input discrete signal, $n$ is the frame index, $N$ is the number of samples in the window, and $w(m)$ is the window used for analysis.
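A minimal sketch of the short-time energy of one frame follows. It assumes that the frame has already been cut out of the signal, so the window is applied sample by sample; the rectangular default window and the function name are assumptions.

```python
def short_time_energy(frame, window=None):
    """E = (1/N) * sum over m of [x(m) * w(m)]^2 for one frame of N samples."""
    n = len(frame)
    if n == 0:
        return 0.0
    if window is None:
        window = [1.0] * n  # rectangular window (an assumption)
    return sum((x * w) ** 2 for x, w in zip(frame, window)) / n


loud = [1.0, -1.0, 1.0, -1.0]
quiet = [0.1, -0.1, 0.1, -0.1]
print(short_time_energy(loud), short_time_energy(quiet))
```

Computing this value for consecutive frames gives the energy contour whose variation is larger for speech than for music.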

Spectrum Flux (SF).
Spectrum flux is a measure of how the power spectrum of an audio signal changes over time. It is calculated from the distance between the spectrum of the current frame and that of the previous frame. Precisely, spectrum flux can be obtained by calculating the Euclidean distance between two normalized spectra. Spectrum flux helps in discriminating speech, music, and environment sound. Speech signals have higher spectral variations than music. However, the spectral variation of environment sound is the highest of the three [2,4,30]. Figure 5 shows the spectrum flux plot. Consider
$$\mathrm{SF} = \frac{1}{(N-1)(K-1)} \sum_{n=1}^{N-1} \sum_{k=1}^{K-1} \left[ \log\bigl(A(n,k) + \Delta\bigr) - \log\bigl(A(n-1,k) + \Delta\bigr) \right]^2 \qquad (3)$$
Equation (3) gives the formula that is used to calculate spectrum flux. $x(m)$ is the input discrete audio signal. A window function $w(m)$ of length $L$ is used; the order of the DFT is $K$. The total number of frames is $N$, and a small value $\Delta$ is introduced in order to avoid calculation overflow, whereas $A(n,k)$ is the Fourier transform of $x(m)$ for the $n$th frame.
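The spectrum flux computation can be sketched as the average squared difference of log magnitude spectra between neighboring frames. The sketch assumes the frames are already windowed, uses a naive DFT to stay dependency-free, and picks a DFT order of 32 and an offset of 1e-6 purely for illustration; the names `magnitude_spectrum` and `spectrum_flux` are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame, order):
    """|DFT| of one frame for bins 0 .. order - 1 (naive DFT)."""
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * m / order)
                    for m, x in enumerate(frame)))
            for k in range(order)]


def spectrum_flux(frames, order=32, delta=1e-6):
    """Mean squared log-spectral difference between consecutive frames."""
    if len(frames) < 2:
        raise ValueError("at least two frames are needed")
    spectra = [magnitude_spectrum(frame, order) for frame in frames]
    total = 0.0
    for n in range(1, len(spectra)):
        for k in range(1, order):
            diff = (math.log(spectra[n][k] + delta)
                    - math.log(spectra[n - 1][k] + delta))
            total += diff * diff
    return total / ((len(spectra) - 1) * (order - 1))


tone_a = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
tone_b = [math.sin(2 * math.pi * 10 * t / 32) for t in range(32)]
steady = [tone_a] * 4                        # the spectrum never changes
changing = [tone_a, tone_b, tone_a, tone_b]  # the spectrum changes every frame
print(spectrum_flux(steady), spectrum_flux(changing))
```

A stream whose spectrum stays the same yields zero flux, while a stream that alternates between two tones yields a clearly positive value.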

Mel-Frequency Cepstral Coefficients (MFCCs).
Mel-frequency cepstral coefficients are the logarithmic measure of the Mel magnitude spectrum, which is calculated with triangular band-pass filters. These values are decorrelated using the discrete cosine transform. MFCC is a real-valued implementation of the complex cepstrum; that is why it is calculated by taking the FFT. The steps for calculating MFCCs are as follows:
(1) Divide the audio signal into short frames.
(2) Take the Fourier transform of each frame and calculate the periodogram-based power spectral estimate for each frame.
(3) Apply the triangular Mel filter bank to the power spectrum and sum the energy in each filter.
(4) Take the log of all filter bank energies.
(5) Take the discrete cosine transform of each Mel log power.
(6) The amplitudes of the resulting spectrum are the MFCCs.
MFCCs have good discriminating capability. That is why most speech recognition systems use them as a strong feature [4,27,30,39] (see Figure 6). Consider
$$c_n = \sqrt{\frac{2}{K}} \sum_{k=1}^{K} (\log S_k) \cos\!\left[ \frac{n (k - 0.5) \pi}{K} \right]$$
where $n = 1, 2, \ldots, L$, $K$ represents the total number of band-pass filters, and the order of the cepstrum is $L$. 12th-order MFCCs are used. After passing the $k$th triangular band-pass filter, the resulting Mel-weighted spectrum is $S_k$. $c_n$ is the Mel-weighted spectrum transformed to MFCCs.
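The final log and cosine-transform stage of the MFCC computation can be sketched as follows. The sketch assumes that the Mel filter-bank energies of a frame are already available and strictly positive; it implements the commonly used form c_n = sqrt(2/K) * sum_k log(S_k) * cos(n (k - 0.5) pi / K), and the function name and the example energies are hypothetical.

```python
import math


def mfcc_from_mel_energies(mel_energies, order=12):
    """Cosine transform of the log Mel filter-bank energies.

    c_n = sqrt(2 / K) * sum_{k=1..K} log(S_k) * cos(n * (k - 0.5) * pi / K),
    for n = 1 .. order, where K is the number of band-pass filters.
    """
    k_total = len(mel_energies)
    log_energies = [math.log(s) for s in mel_energies]
    scale = math.sqrt(2.0 / k_total)
    return [scale * sum(log_energies[k - 1]
                        * math.cos(n * (k - 0.5) * math.pi / k_total)
                        for k in range(1, k_total + 1))
            for n in range(1, order + 1)]


flat = [2.0] * 20                               # identical energy in every band
sloped = [math.exp(-k) for k in range(20)]      # energy decaying with band index
flat_coeffs = mfcc_from_mel_energies(flat)
sloped_coeffs = mfcc_from_mel_energies(sloped)
print(len(flat_coeffs), sloped_coeffs[0])
```

A flat spectrum carries no cepstral shape, so its coefficients vanish up to rounding, whereas a spectrum that decays with band index produces a positive first coefficient.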

Periodicity Analysis.
Periodicity analysis estimates the periodicity of each frame; the periodicity is obtained by correlation. The periodicity of music is higher than that of environment sound because music signals are more periodic in nature than environment sound signals. Periodicity is therefore a useful feature for discriminating music and environment sound [2] (see Figure 7). Equation (6) gives the periodicity calculation for each frame in terms of the frame index, the total number of frames, and the normalized correlation function calculated from the current sample and a previous (delayed) sample.
Figure 6: Mel-frequency cepstral plot (calculated for multiple frames of speech, music, and environment sound; from the plot the cepstral behavior for three different audio types can be observed; from 0 to 700 s is speech, from 701 to 1400 s is environment sound, and from 1401 to 2100 s is music).
These features are concatenated to form a feature vector. The features have different characteristics and scales, so combining the raw values directly into a feature vector is not appropriate. That is why each feature component is normalized, which makes their scales comparable. Normalization is computed separately for each feature value. This normalized feature vector is then used for classification.
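The following sketch illustrates one way to obtain a periodicity value by correlation and to normalize a feature component. Both parts rest on assumptions: periodicity is taken as the largest normalized correlation over a range of lags, averaged over frames, and normalization is taken as zero-mean, unit-variance scaling. Neither choice is confirmed by the text, and all function names are hypothetical.

```python
import math
import random


def normalized_correlation(frame, lag):
    """Normalized correlation between a frame and its copy delayed by `lag` samples."""
    current = frame[lag:]
    delayed = frame[:len(frame) - lag]
    numerator = sum(a * b for a, b in zip(current, delayed))
    denominator = math.sqrt(sum(a * a for a in current) * sum(b * b for b in delayed))
    return numerator / denominator if denominator > 0 else 0.0


def frame_periodicity(frame, min_lag=2):
    """Largest normalized correlation over lags min_lag .. len(frame) // 2."""
    return max(normalized_correlation(frame, lag)
               for lag in range(min_lag, len(frame) // 2 + 1))


def periodicity(frames):
    """Average of the per-frame periodicity values."""
    return sum(frame_periodicity(frame) for frame in frames) / len(frames)


def zscore_normalize(values):
    """Scale one feature component to zero mean and unit variance (assumed scheme)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std > 0 else 0.0 for v in values]


rng = random.Random(0)
tone = [math.sin(2 * math.pi * t / 20) for t in range(200)]   # period of 20 samples
noise = [rng.uniform(-1, 1) for _ in range(200)]              # no periodic structure
tone_p = periodicity([tone])
noise_p = periodicity([noise])
normalized = zscore_normalize([1.0, 2.0, 3.0])
print(tone_p > noise_p, normalized)
```

The periodic tone reaches a correlation close to 1 at a lag of one period, while the noise frame stays well below it, which mirrors the music versus environment sound contrast described above.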

Hybrid Approach
Bagged SVM is combined with an ANN classifier to provide a hybrid approach. When only a single classifier is used, most of the environment data is misclassified as speech and music. To avoid this misclassification, a hybrid approach is used in which speech and nonspeech are discriminated first and then nonspeech is further discriminated into music and environment sound. The support vector machine (SVM) [40][41][42] uses a given set containing positive and negative examples in order to learn an optimized separating hyperplane. For fixed data with an unknown probability distribution, SVM minimizes the probability of misclassifying unseen patterns. SVM minimizes the structural risk, which is why it performs well on unseen data and not only on the training data. This property makes SVM preferable to traditional pattern recognition techniques. Support vector machines are of two types: linear and nonlinear (kernel based). In audio data the feature distribution is complicated. Different classes of audio data may have overlapping areas and cannot be separated linearly; such a situation can be handled by a kernel support vector machine. In this section, some concepts of kernel based SVM with bagging (bagged SVM) are introduced.

Kernel Support Vector Machines.
Considering the case in which the vectors are not linearly separable but are nonlinearly separable, SVM uses a kernel function $K(x, y)$. SVM uses the kernel to create an optimal separating hyperplane [43][44][45]. The curse of dimensionality can be addressed because the input vectors are mapped implicitly by the kernel function to a high-dimensional feature space; in this feature space the mapped data is linearly separable. The most commonly used kernel functions are polynomial, Gaussian radial basis function, and multilayer perceptron. It was empirically observed that the Gaussian radial basis kernel performs better than the other two; that is why the Gaussian radial basis kernel is used in the proposed method. Consider
$$K(x, y) = \exp\!\left( -\frac{\lVert x - y \rVert^2}{2\sigma^2} \right)$$
where $\sigma$ is the Gaussian function's width.
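A Gaussian radial basis kernel can be evaluated as in the sketch below. The parameterization exp(-||x - y||^2 / (2 sigma^2)) is one common convention and is assumed here; the function name and the default width of 1.0 are illustrative.

```python
import math


def gaussian_rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) for two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


same = gaussian_rbf_kernel([0.2, 0.7], [0.2, 0.7])
apart = gaussian_rbf_kernel([0.0, 0.0], [1.0, 1.0])
wide = gaussian_rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=10.0)
print(same, apart, wide)
```

The kernel equals 1 for identical vectors and decays with distance; a larger width makes distant vectors look more similar, which controls how smooth the resulting decision boundary is.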

Bagging Approach.
A multiple classifier system, also known as ensemble learning, trains different classifiers and combines their predictions in order to obtain improved classification accuracy. A system of multiple classifiers typically outperforms a single classifier. An ensemble method combines a set of learners, in contrast to ordinary learning approaches that make their predictions on the basis of a single learner [46,47]. Bagging gives the best results for unstable classifiers [48]. The bagging approach is used to improve the accuracy of certain classifiers on artificial and real-world datasets. The general concept of bagging is to generate multiple training subsets through bootstrapping, in which random samples are picked with replacement [47,49]. Training is performed on each subset and the outputs are aggregated via majority voting [50,51]. The presented audio segmentation technique reduces the misclassification rate of the classifier in order to achieve high accuracy and to fully preserve the information inside the audio stream. That is why the support vector machine (SVM) classifier is bagged. SVM gives good results when used separately as compared to artificial neural networks (ANNs) and k-nearest neighbors (KNN) [38]. Different subsets for the bagging approach are randomly selected; training and testing are performed on each subset. The predictions of the SVMs are aggregated through majority voting.
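Bootstrapping and majority voting can be illustrated with the short sketch below. To keep it dependency-free, a nearest-centroid classifier stands in for the SVM base learner; this substitution, the toy data, and all function names are assumptions made for illustration only.

```python
import random


def train_centroid_classifier(samples, labels):
    """Stand-in base learner (not an SVM): stores the mean vector of each class."""
    centroids = {}
    for label in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = [sum(column) / len(rows) for column in zip(*rows)]

    def predict(x):
        return min(centroids,
                   key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))

    return predict


def bagged_train(samples, labels, n_models=5, seed=0):
    """Train one model per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(samples)
    models = []
    for _ in range(n_models):
        picks = [rng.randrange(n) for _ in range(n)]
        models.append(train_centroid_classifier([samples[i] for i in picks],
                                                [labels[i] for i in picks]))
    return models


def majority_vote(models, x):
    """Return the label predicted by most models."""
    votes = [model(x) for model in models]
    return max(set(votes), key=votes.count)


# Toy two-feature data: label 0 clusters near the origin, label 1 near (5, 5).
samples = ([[0.1 * i, 0.1 * i] for i in range(10)]
           + [[5 + 0.1 * i, 5 + 0.1 * i] for i in range(10)])
labels = [0] * 10 + [1] * 10
models = bagged_train(samples, labels)
print(majority_vote(models, [0.2, 0.3]), majority_vote(models, [5.1, 4.9]))
```

An odd number of models avoids ties in a two-class vote, which is one practical reason to aggregate five classifiers.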

Artificial Neural Network (ANN).
A computational or mathematical model that is inspired by the structural and functional characteristics of the human nervous system is called an artificial neural network (ANN) or neural network (NN) [52]. A neural network is composed of multiple interconnected groups of artificial neurons. These interconnected neurons use a connectionist approach to computation. ANNs are adaptive, which means that they change their structure on the basis of the information (either internal or external) that passes through the network. Artificial neural networks are composed of simple elements working in parallel, known as nodes. Neural networks are trained by adjusting the connection values between nodes. Training is performed until a specific output appears for a corresponding input. The network is adjusted on the basis of the difference between target and output, and training stops when this difference is zero or minimal; that is, the output matches the target [53].
This ANN training process is called supervised learning. An input is given to the nodes, the nodes calculate the output, and the predetermined targets are compared with the output. If the targets and output do not match, the error is fed back to the nodes and the weights are readjusted. This process continues until the output matches the targets as closely as possible. Due to their knowledge storing and decision making abilities, ANNs are used extensively in pattern recognition tasks. ANNs are of two types: single layer perceptrons and multiple layer perceptrons (MLPs) [54,55]. A single layer perceptron uses a single layer of weights; that is, the input is directly connected to the output. A single layer perceptron only handles linearly separable problems. A multiple layer perceptron (MLP) uses multiple layers of weights. It consists of an input layer, hidden nodes, and an output layer. The proposed algorithm uses a multiple layer perceptron ANN. The back propagation algorithm is used for training the ANN classifier.
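The supervised training loop described above can be sketched with a very small multiple layer perceptron trained by back propagation. The network size (three hidden nodes), the sigmoid activation, the squared-error loss, the learning rate, and the two-feature toy data are all assumptions chosen for illustration; they are not the configuration used in the paper.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One hidden layer, sigmoid units, trained by back propagation."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight row carries one extra entry for the bias.
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        """Return (hidden activations, network output) for one input list."""
        inputs = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, inputs)))
                  for row in self.w_hidden]
        output = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, output

    def train_step(self, x, target, lr=0.5):
        """One gradient step on the squared error (output - target)^2 / 2."""
        inputs = list(x) + [1.0]
        hidden, output = self.forward(x)
        delta_out = (output - target) * output * (1.0 - output)
        # Propagate the error back through the not yet updated output weights.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, value in enumerate(hidden + [1.0]):
            self.w_out[j] -= lr * delta_out * value
        for row, delta in zip(self.w_hidden, delta_hidden):
            for i, value in enumerate(inputs):
                row[i] -= lr * delta * value

    def mean_squared_error(self, data):
        return sum((self.forward(x)[1] - t) ** 2 for x, t in data) / len(data)


# Hypothetical toy data: target 1 = music, target 0 = environment sound.
data = [([0.9, 0.2], 1), ([0.8, 0.3], 1), ([0.2, 0.8], 0), ([0.1, 0.9], 0)]
net = TinyMLP(n_inputs=2, n_hidden=3)
initial_error = net.mean_squared_error(data)
for _ in range(1000):
    for x, target in data:
        net.train_step(x, target)
final_error = net.mean_squared_error(data)
print(initial_error, final_error)
```

The error between output and target shrinks as the weights are readjusted, which is the stopping criterion described in the text.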

Discrimination Steps.
The steps used for discriminating audio data into different audio types are discussed in detail.

Speech and Nonspeech Discrimination.
Speech and nonspeech frames are discriminated by the bagged SVM classifier. The bagged SVM classifier is applied to the processed audio clip, based on spectrum flux, Mel-frequency cepstral coefficients, zero-crossing rate, and short-time energy. Speech and nonspeech codebooks are generated by training databases.

Silence and Pure-Speech Discrimination.
Silence is detected on the basis of the features {short-time energy, zero-crossing rate} by using a 1-s window. The classification is done by a rule-based classifier for which a threshold value is set. If {short-time energy, zero-crossing rate} are less than the predefined threshold, the frame is a silence frame; otherwise it is classified as a pure-speech frame. This is a simple approach for distinguishing silence and pure-speech frames (see Figure 8). The mixed audio stream is never fully silent; it always contains some kind of sound. That is why silence is only present in the pauses in the speech, where it can be detected.
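The threshold rule can be written as a short function. The threshold values below are placeholders, since the text does not state the values that were used, and the function name is hypothetical.

```python
def classify_speech_frame(ste, zcr, ste_threshold=0.01, zcr_threshold=0.1):
    """Label a frame of the speech segment as 'silence' or 'pure-speech'.

    The frame is silence only if both features fall below their thresholds.
    Both threshold values are illustrative placeholders.
    """
    if ste < ste_threshold and zcr < zcr_threshold:
        return "silence"
    return "pure-speech"


print(classify_speech_frame(ste=0.001, zcr=0.02))
print(classify_speech_frame(ste=0.2, zcr=0.3))
```

In practice the thresholds would be tuned on training data so that pauses between words fall below them.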

Discriminating Music and Environment Sound.
The nonspeech segment is used for discriminating music and environment sound segments. Spectrum flux, Mel-frequency cepstral coefficients, and periodicity analysis are used to discriminate music from environment sound. Music signals are more periodic than environment sound signals. However, discriminating the signals on the basis of periodicity analysis alone is not a good choice. For more accurate and precise results, spectrum flux and Mel-frequency cepstral coefficients are incorporated with the periodicity analysis feature. The spectrum flux of environment sounds is in most cases greater than that of music. The music and environment sound segments are discriminated by using the ANN classifier.

Postprocessing Step.
An audio stream is continuous, so its type does not change frequently or abruptly. A few smoothing steps are therefore performed at the end on the speech segment. Among three consecutive 1-s frames, if the first and last frames are speech segments, then most probably the middle frame is also a speech frame. Similarly, among three consecutive 1-s frames, if the first, second, and last frames are not the same, then the middle frame may be a silence frame and the first and last frames are speech frames. This postprocessing step refines the results and increases the accuracy of pure-speech and silence discrimination.
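The first smoothing rule can be sketched as follows. Only that rule is implemented, applied to per-frame labels such as speech and music; how the second rule interacts with it is not specified precisely enough in the text to code without guessing. The function name and label strings are hypothetical.

```python
def smooth_labels(labels, target="speech"):
    """Relabel a frame as `target` when both of its neighbors are `target`.

    Neighbors are read from the original labels, so one correction
    does not cascade into the next frame.
    """
    smoothed = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i - 1] == target and labels[i + 1] == target:
            smoothed[i] = target
    return smoothed


raw = ["speech", "music", "speech", "music", "music"]
print(smooth_labels(raw))
```

The isolated music label between two speech frames is treated as a misclassification and replaced, while the genuine change to music at the end is kept.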

Results and Discussions
The audio dataset used for evaluating the proposed algorithm is the GTZAN speech/music collection. It consists of personal CDs, microphone recordings, and radio recordings in order to provide a variety of recording conditions. The dataset contains 120 audio files that are 30 s long and are downsampled to 8 kHz before processing. 1000 s of environment sounds are also included in the dataset. The audio stream used is in mixed form; that is, speech is superimposed with music and noise. Half an hour of data is used for training and approximately two hours of data for testing. Noise is segmented out as environment sound. 1/3 of the dataset is used for training the classifier and 2/3 of the dataset is used for testing the classifier. After preprocessing, the audio stream is divided into short segments by applying a Hamming window, which is a moderate window. Each segment is processed independently. The Hamming window is mostly used in narrowband applications (e.g., the spectrum of a telephone signal). The Fourier transform is computed for each segment. The correlation of these short segments is taken, and the sphering and rotation step is performed in order to estimate the direction. Decorrelation is done to reconstruct these short segments. The cross-correlation values are eliminated. Permutation is applied to find the relation between these independent segments, and the segments are combined on the basis of this relation.
The combined audio stream is divided into nonoverlapping 1-s frames with the help of a moving window. On these 1-s frames classification is performed. For classification the features are extracted from each frame. These features are combined to form a feature vector.
The speech and nonspeech frames are discriminated by using bagged SVM. Five different training and testing sets are selected randomly with replacement. The results of these classifiers are majority voted and the final aggregate labels are obtained. The accuracies for these five randomly selected datasets are shown in Table 1. The output labels for each set are compared with each other; if the majority of the output labels for a single frame is 1, then the frame is labeled as 1; otherwise it is labeled as 0. This method is known as majority voting.
The bagged SVM classifier is based on the features {ZCR, STE, SF, and MFCC}. This baseline model gives good results for speech and nonspeech discrimination. Bagged SVM results are compared with the simple SVM classifier. The results obtained for different classifying types are shown in Table 2. Bagged SVM gives 98.2% accuracy with the misclassification rate reduced to 1.8%, whereas simple SVM gives 92% accuracy and 8% misclassification, as evident in Figure 9. The results of the SVMs applied on five different sets, randomly selected with replacement, are improved by the ensemble technique called bagging.
The use of ensemble approaches reduces training complexity drastically while high predictive accuracy is maintained. In bagged SVM, different SVM models are aggregated; in this case five SVM models are aggregated. Each SVM randomly selects its training set with replacement; that is, training is performed on small samples of the training set. Because of this subdivision, the total training time decreases. The computational complexity of kernel based SVM is $\Omega(n^2)$ for $n$ training samples, but when $m$ classifiers are used on subsamples of size $n/m$, the computational complexity is approximately $\Omega(n^2/m)$ [56]. Because of this reduced complexity, larger datasets and nonlinear kernels can be handled easily, as shown in Figure 10.
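The reduction can be checked with a one-line calculation. The notation is assumed here: $n$ is the number of training samples and $m$ is the number of SVMs, each trained on a subsample of size $n/m$; constant factors are ignored.

$$m \cdot \left(\frac{n}{m}\right)^{2} = \frac{n^{2}}{m}, \qquad \text{so for } m = 5:\quad 5 \cdot \left(\frac{n}{5}\right)^{2} = \frac{n^{2}}{5}$$

Under these assumptions, five aggregated SVMs need roughly one fifth of the kernel computations of a single SVM trained on all $n$ samples.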
Nonspeech segments are further discriminated into music and environment sound by using ANN. The features {periodicity analysis, Mel-frequency cepstral coefficients, and spectrum flux} are used. The speech segment can further be segmented into silence and pure-speech segment by using rule-based classifier. The overall classification results are shown in Table 3. 98% classification results are obtained with 1.9% misclassification rate.
The performance of the proposed algorithm is also tested against k-nearest neighbor (KNN) and artificial neural networks (ANNs), as shown in Figure 11. Both of these classifiers are also used for audio classification. The Bayesian information criterion (BIC), Gaussian likelihood ratio, and hidden Markov model (HMM) classifiers need large training data for good results [29]; that is why the proposed algorithm, which uses less training data, is only compared with the conventional classifiers SVM and ANN. ANN performs better than KNN, but there is still some misclassification. With the proposed hybrid approach, minimum data is misclassified and maximum information within the audio stream is preserved.
In addition to the above performance analysis, the proposed algorithm is compared with existing audio segmentation and classification techniques. Audio classification and segmentation was presented in [57], in which the audio stream is segmented into speech, music, and silence. That algorithm uses a Gaussian mixture model (GMM) and k-nearest neighbor (KNN). It achieves 95% accuracy for discrete audio signals.
In [2], the audio stream is segmented into music, speech, environment sound, and silence. That algorithm uses k-nearest neighbor (KNN) and linear spectral pairs-vector quantization (LSP-VQ) and achieves 96% accuracy. In [15], a sports audio stream is segmented and classified into speech and nonspeech. Bayesian information criterion (BIC) is used for segmentation. Clusters are formed and a Gaussian mixture model (GMM) is used for classification. 87.3% accurate segmentation and classification results are achieved.
In [58], the audio stream is classified into music, sports, advertisement, cartoon, and movie. An auto associative neural network model (AANN) is used to capture the feature vectors. GMM is used for training. Table 4 shows the comparison of the proposed audio segmentation technique with the existing audio segmentation techniques.
Table 4 (techniques listed): content analysis for audio classification and segmentation using KNN and LSP-VQ [2]; sports audio segmentation and classification using BIC and GMM [15]; classification of audio signals using AANN [58]; classification of audio signals using GMM.

Conclusion and Future Work
An efficient and fast audio classification and segmentation approach has been discussed that does not require large amount of training data yet gives good discrimination results. In this work, an audio stream is discriminated into homogenous regions and classified into basic audio types such as pure-speech, music, environment sound, and silence. Main goal is to design an audio segmentation algorithm which can be incorporated with multimedia content analysis applications and audio recognition systems.
A hybrid approach has been used for audio classification and segmentation. First, audio clips are discriminated into speech and nonspeech segments by using the bagged SVM classifier. Nonspeech segments are further classified into environment sound and music by using the ANN classifier. The speech segment is discriminated into silence and pure-speech segments by using the rule-based classifier. Experiments have shown that the algorithm is very efficient for real-time multimedia applications.
In future work, this algorithm can be used as a preprocessing step in automatic speech recognition, video conferencing, human-computer interaction systems (for identifying human activities involving speech), and speaker tracking. This algorithm can be used in video content analysis, audio retrieval, and indexing, for attaining useful information.