Content-Based Audio Classification and Retrieval Using a Modified Bacterial Foraging Optimization Algorithm

Audio classification and retrieval has long been recognized as a fascinating field of endeavor because of the challenge of identifying and selecting the most useful audio attributes. The categorization of audio files is significant not only in multimedia applications but also in medicine, sound analysis, intelligent homes and cities, urban informatics, entertainment, and surveillance. This study introduces a new algorithm, the modified bacterial foraging optimization algorithm (MBFOA), as the basis of a method that retrieves and classifies audio data; its goal is to reduce the computational complexity of existing techniques. The enhanced mel-frequency cepstral coefficient (EMFCC) and the enhanced power-normalized cepstral coefficients (EPNCC) are used in combination with the peak estimated signal, and the resulting features are optimized through the fitness function of MBFOA. A probabilistic neural network (PNN) is used to differentiate between a music signal and a speech signal in an audio source. The characteristics that correspond to the class arrived at by the categorization are then extracted and listed. When compared with related approaches, MBFOA demonstrates superior sensitivity, specificity, and accuracy.


Introduction
Retrieving audio information automatically has gained considerable attention as a way of organizing the wide variety of audio files in a database. However, current audio retrieval processes do not characterize the perceptual similarity of audio signals, as they depend primarily on a similarity measure. This article proposes an approach to audio classification and retrieval based on MBFOA to overcome these issues. Using MBFOA together with the fitness function calculation, selecting the optimum feature becomes straightforward. To classify the audio signal as a music or speech signal, PNN is applied. The audio signal category is then extracted from the result of the classification. The Euclidean distance is then calculated, and relevance matching is performed to retrieve the files related to the query. Finally, the proposed approach is shown to be consistent with the existing approach.
Audio signals, such as voice, music, and environmental sounds, are essential examples of many sorts of media. The difficulty of separating audio signals into the many forms of audio being produced is thus becoming an issue of growing significance. Simply listening to a brief portion of an audio signal is sufficient for a human listener to correctly identify a wide variety of audio formats; however, solving this problem with computers has proved quite challenging. Despite this, a large number of systems with a moderate degree of precision can still be constructed. Audio segmentation and classification have a broad range of potential applications. For example, content-based audio categorization and retrieval is used extensively in the entertainment sector as well as in the administration of audio archives, commercial music, surveillance, and other fields. Numerous digital audio databases can now be found on the World Wide Web; in these cases, audio segmentation and classification are required for audio searching and indexing. There has recently been a significant uptick in interest in monitoring broadcast news programmes. In this context, categorizing speech data by speaker can assist in the effective navigation of broadcast news archives.
A signal processing part and a classification section are the two primary components of any audio classification task, as in most other pattern classification endeavors. The signal processing portion focuses on the extraction of characteristics from the audio stream. In many situations, the various time-frequency analysis techniques first established for speech processing are applied; these approaches were designed for processing audio data. The classification stage assigns categories to the collected data based on the statistical information derived from the signals.
The number of digital music files stored on personal computers has expanded as a direct result of the growth of the market for portable digital audio players. When you want to listen to music that fits a certain mood, such as happy, sad, or angry music, it may be challenging to pick and organize the songs accordingly. Consumers are not the only ones required to label their music; online distributors also have to label the thousands of tracks stored in their databases for customers to browse.
In the fields of music psychology and music education, emotion-based musical components have been identified as the aspect of music most closely related to the concept of music expressivity. Music information behaviour research has also highlighted music mood as an essential factor that individuals employ while searching for and storing music; this criterion is used for music indexing and storage. However, because of the great degree of subjectivity involved, determining the mood of a piece of music may be challenging. It is possible to infer aspects of a person's personality from the music to which they listen. While we listen to music, a number of different elements subtly but significantly shape it, such as rhythm, tempo, musical instruments, and musical scales. Lyrics, which may be found in a variety of forms, are a very significant element with a direct impact on our thoughts.
The task of extracting audible words from lyrics and then categorizing the music in accordance with those words is challenging, since it involves intricate digital signal processing problems. A rather straightforward method of understanding the music based on its acoustic patterns was investigated using these processing techniques. This difficulty is similar to the traditional pattern recognition problem, and much effort has gone into isolating these patterns from the audio elements.
Through the use of an artificial neural network, a computer can, in effect, interpret human intentions. Even though a computer is responsible for the dot's gyrations in such demonstrations, the machine is only obeying the directions given to it by the subject of the experiment. The artificial neural network method is much more than a laboratory trick. While computers are capable of solving very complicated problems quickly, standardized assessments of behaviour can be found in the field of psychology. The vast majority of psychological tests fall into two primary groups: mental ability measures and personality scales. Intelligence tests, which assess an individual's overall mental ability, and aptitude tests, which examine specialized mental talents, are both included in the battery of mental ability tests. Since there is no single correct or incorrect response to a personality scale, this kind of assessment is more often referred to as a scale; motives, interests, values, and attitudes may all be measured with personality tests. The findings of the experiment demonstrate that the ANN-MFCC system is effective.
The authentication process that a user goes through is often either unusually lax or very stringent, depending on the situation. Authentication has been an intriguing method throughout the years it has been used. Because of the proliferation of technological tools, it is becoming simpler for "others" to create false identities, steal identities, or crack passwords. As a consequence, many algorithms have been developed, each taking an intriguing approach to the computation of a secret key. The algorithms are designed in such a way that they select a random number somewhere between 10 and 6.
Users in the modern day are given common password stereotypes such as textual passwords, biometric scanning, tokens or cards (such as those used at an ATM), and other similar options. The majority of textual passwords are encrypted using the method discussed before. Your "natural" signature is obtained by biometric scanning, and cards or tokens are used to verify your identity. But some individuals dislike having to carry their cards with them, and others refuse to expose their retinas to intense IR radiation (biometric scanning). The majority of textual passwords used these days are kept very basic, such as a word from the dictionary, the name of a pet, or the name of a lover. Several years ago, Klein carried out experiments in which he was able to break 10-15 passwords in a single day; this is now quite easy to do because of improvements in technology, such as faster processors and an abundance of resources available on the Internet.
As a result, with our current notion, this system protects 3D passwords, which are a particularly intriguing method of authentication because of their increased level of personalization. The number of possible passwords has been increased to 21, considering the limits of human memory. In most cases, basic passwords are used so that the information may be easily recalled. In our system, the human memory is required to go through the processes of recognition and recall, as well as any biometric or token-based authentication. After the system has been installed and the user has logged in to a secure website, the graphical user interface for the 3D password appears. An extra password that the user may simply type in, a textual password, is also required. After the first verification is passed, a 3D virtual area becomes accessible on the screen. In our scenario, let us assume a virtual garage. In a typical garage, one can discover a wide variety of tools, equipment, and other items, each with its own specific qualities. The user then interacts with these attributes appropriately. Each item in the three-dimensional space may be moved freely along any of the three axes x, y, and z; this property is shared by all items located in the space. Suppose a person successfully logs on and navigates into the garage. He notices and picks up a screwdriver at starting position (5, 5, 5) in xyz coordinates, and then moves the screwdriver five spaces to his right on the xy plane, to (10, 5, 5). That is a kind of authentication that can be recognized. Only the actual user can comprehend and identify the object he must choose from among many others. This is where the "recall and recognition" part of a person's memory comes into action.
It is interesting to note that a password may be established by walking up to a radio and changing the frequency to a number that the user alone knows. The fact that cards and biometric scanners may be used as input can help increase the system's level of security. A user may be subjected to a variety of different degrees of authentication.

Related Works
In this part of the article, the various strategies and procedures of prior research on the categorization and retrieval of audio signals are discussed. The categorization and segmentation of audio was accomplished via ensemble techniques [1]. The signals may be broken down into speech, music, environmental sounds, speech isolated from other sounds, and silence. The zero-crossing rate (ZCR), short-time energy (STE), spectral flux, mel-frequency cepstral coefficients (MFCC), and periodicity analysis were all used to extract features. When attempting to categorize music and surrounding noises, discernible inaccuracies may occur. Pitch density-based parameters, such as the average pitch density and relative tonal density, are used in [2]. With the pitch density-based features, a multi-class coarse-to-fine classification that was both highly accurate and more efficient than before was accomplished. A comprehensive and unique approach for extracting audio characteristics was developed to achieve high classification accuracy for ambient audio data; its classification accuracy was increased compared with MFCC features [3]. Gammatone cepstral coefficients (GTCCs) were used to categorize non-speech audio signals. The classification accuracy was much greater compared with other conventional audio characteristics; particularly at low frequencies [4], the GTCC was much more effective than the MFCC.
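As a minimal illustration of two of the simplest features named above, the sketch below computes the zero-crossing rate and short-time energy of a single frame; the frame length, sampling rate, and synthetic test tone are illustrative assumptions, not parameters from the cited works:

```python
import numpy as np

def zero_crossing_rate(frame):
    # Fraction of consecutive sample pairs whose signs differ.
    signs = np.signbit(frame)
    return np.mean(signs[1:] != signs[:-1])

def short_time_energy(frame):
    # Mean squared amplitude of the frame.
    return np.mean(frame ** 2)

# Toy example: a pure tone has a predictable ZCR and energy.
fs = 8000
t = np.arange(fs) / fs
sine = np.sin(2 * np.pi * 100 * t)   # 100 Hz tone, 1 s
print(zero_crossing_rate(sine))      # ~0.025 (two crossings per cycle at fs = 8000)
print(short_time_energy(sine))       # ~0.5 for a unit-amplitude sine
```

In a full system these values would be computed per frame over a sliding window; speech typically shows larger frame-to-frame ZCR variation than music, which is what makes these features discriminative.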
As part of the content-based retrieval (CBR) work of the Muscle Fish Company, the statistical values in the frequency domain were measured repeatedly in order to quantify perceptual characteristics including loudness, brightness, bandwidth, and pitch. This method is only useful for analyzing timbre sounds [5], owing to the ease with which the approach of employing simple statistical data can be carried out. In content-based audio retrieval, query-by-humming is one of the particular techniques employed. The strategy [6] employed string-matching techniques for detecting which songs are similar to one another and described the sequence of relative pitch fluctuations used to depict the contour of the melody. A transition in pitch between 10 and 12 has been noted for it.
A one-by-one comparison was made between the presented signal and each signal in the database. The sample from the database is retrieved [7] if the same requirements are satisfied as before. The retrieval system's fundamental workings are as follows: first, an estimate is made [8] of the feature for each user signal as well as each signal from the database. For audio signals, the temporal domain is the same as their original domain. One thing all temporal features have in common is that they are taken from the received raw audio data in its original form, without undergoing any prior transformations.
Organizations can better filter out anonymous calls with the use of real-time voice and audio recognition.
For content-based music information retrieval, a k-nearest neighbor learning technique based on a support vector machine has been presented [9]. When there is a significant number of audio files of varying sizes, machine-learning approaches are recommended [10]. To manage such large amounts of audio data, a decision tree classifier based on AdaBoost has been constructed [11]. Recognizing and classifying sounds is an extremely important step in developing assistive technology for persons with hearing impairments. When classifying audio events, convolutional neural networks (CNN) combined with pretrained VGG networks are employed to extract feature embeddings [12].
The identification of bird songs may also be accomplished with the help of multi-modal deep CNNs using audio samples and metadata as inputs. Utilizing quadratic time-frequency distributions [13], an audio surveillance system has been developed to prevent automobile collisions and tyre sliding in hazardous situations. The effective music indexing framework (EMIF) has been created [14] with the goals of increasing query efficiency and facilitating the retrieval of scalable and accurate music; it applies semantic-sensitive categorization to the music to determine its categories. The content-based music retrieval tool known as Pandora radio takes advantage of the commonalities between song characteristics including genre, mood, instrumentation, and voice quality [15]. In contrast to text files, audio recordings are stored in an unstructured manner, which necessitates more complicated algorithms during preprocessing and categorization [16]. In this instance, psycholinguistic and stylistic characteristics were used to categorize auditory emotion. The contributions of this article are as follows: (i) a new audio file format is suggested by this work; (ii) a new method for the categorization of audio files and the retrieval of their contents is proposed; (iii) the proposed method attains excellent levels of sensitivity, specificity, and accuracy in comparison with the current method; (iv) experiments are carried out using a standard dataset; (v) the computational complexity of classification methods that use probabilistic neural networks is reduced by this strategy; (vi) this scheme may be used to identify audio signals in a wide variety of applications, including multimedia; (vii) individuals who struggle with their hearing may find a significant improvement in their quality of life with this scheme.

Materials and Methods
This section provides a detailed explanation of the classification and retrieval strategy used for the given audio stream. A mean filter is used to accomplish the smoothing of audio data. To extract the retrieval-stage feature, EMFCC-EPNCC [17] is used in conjunction with a combination of peak estimated signal characteristics. Using the MBFOA method, the resulting feature vector is analyzed so that the best possible feature index may be chosen [18]. The suggested method of categorizing and retrieving audio is shown in the flow diagram in Figure 1.

Segmentation and Feature Extraction.
During this stage of peak feature extraction, R_Loc refers to the features located within the high peak of the positive edge of the audio signal, Q_Loc refers to the features located at the minimum signal difference within the negative edge of the audio signal, and S_Loc represents the features located at the maximum signal difference within the negative edge of the signal [19]. To update the difference in peak extraction, the threshold value retrieved from the input signal is applied. The suggested study [20] includes references to the processes of pitch extraction and peak extraction. The objective function is applied at the beginning of the pitch extraction procedure to determine the weight of the signal based on the difference in the cosine angle of the amplitude of the signal [21]. The pitch of the audio stream may be approximated using a detection approach based on the time domain. Many different approaches are used to estimate pitch, including the zero-crossing method, the autocorrelation method, the maximum likelihood method, adaptive filtering employing the fast Fourier transform (FFT), and the super-resolution pitch detection method [22]. The audio signal characteristics are extracted with the help of EMFCC-EPNCC; these features make up a set that is mostly used to represent the audio signal. When determining the feature values to return, the feature extraction considers EMFCC and EPNCC. These feature values are derived from the audio signal sampled at fs (Hz). The windowing approach is applied to the audio stream to break it up into individual frames and perform spectrum analysis on each frame. Each frame is then passed through the discrete Fourier transform (DFT). Thereafter, the output of the DFT is warped [23].
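Of the pitch estimation methods listed above, the autocorrelation approach is the simplest to sketch. The following is a minimal illustration of time-domain pitch detection; the frame length, search range, and synthetic test tone are assumptions for demonstration, not the paper's parameters:

```python
import numpy as np

def autocorr_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate the pitch (Hz) of a voiced frame by autocorrelation.

    The lag of the strongest autocorrelation peak inside the plausible
    pitch range [fmin, fmax] is taken as the signal period."""
    frame = frame - np.mean(frame)
    # Keep only non-negative lags of the full autocorrelation.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / fmax)
    lag_max = int(fs / fmin)
    lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return fs / lag

fs = 16000
t = np.arange(int(0.04 * fs)) / fs      # one 40 ms frame
frame = np.sin(2 * np.pi * 220 * t)     # 220 Hz tone
print(autocorr_pitch(frame, fs))        # close to 220 Hz
```

The estimate is quantized to integer lags, so the result lands near, not exactly on, the true frequency; production pitch trackers add interpolation around the peak.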
The window distribution is characterized in terms of the power spectrum estimate, denoted P_i(k), where k = 0, 1, ..., (N/2) − 1 and N is the total number of samples taken. The change in pitch angle is determined for each preallocated time sample, using the magnitude of the input signal X_i as the starting point [24].
The mel frequency-warping intensity function is then defined; in this context, the sampling frequency is denoted by fs, and the received frequency function is denoted by t. To fulfil the particular requirements, the mel-warping function must be standardized so that p(π) = π.
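As a point of reference, the conventional mel warping used by MFCC-style features can be sketched as below; the 2595/700 constants are the standard textbook values, assumed here rather than taken from the paper's EMFCC definition:

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-frequency warping used by MFCC-style features.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse warping, mapping mel values back to Hz.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000.0))   # ~1000: 1000 Hz maps to roughly 1000 mel by construction
```

The warping compresses high frequencies and keeps low frequencies roughly linear, which is why mel-domain features track perceived pitch better than raw spectral bins.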

Feature Selection and Classification.
After employing MBFOA feature selection to pull a collection of testing features out of the audio file, the optimal testing features are then chosen. The colony behaviours of Escherichia coli (E. coli) demonstrate that these bacteria are highly organized and capable of forming complex structures [25]. Passino [15] developed a new optimization technique called the bacterial foraging (BF) algorithm [26], inspired by the foraging behaviour of the E. coli bacterium. Four primary phases are used to model the foraging behaviour of bacteria: chemotaxis, swarming, reproduction, and elimination-dispersal. An MBFOA [27] with an improved fitness function computation is employed to select the best possible value for each feature [28]. MBFOA requires fewer user parameters to be specified while producing results comparable to those of other methods, and it reduces the total number of function evaluations performed. The elimination of user-defined parameters and the presence of an internal BFOA loop are the primary benefits of MBFOA [29].
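The chemotaxis phase, the core of the four phases named above, can be sketched as a tumble-and-swim loop. This is a hedged toy illustration only: real BFOA (and the paper's MBFOA) adds swarming, reproduction, and elimination-dispersal phases, and all parameter values here are assumptions for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

def bfoa_chemotaxis(f, dim=2, n_bacteria=20, chem_steps=100, step_size=0.1):
    """Minimize f with a simplified bacterial chemotaxis loop.

    Each bacterium tumbles in a random unit direction and keeps the
    move (swims) only when its fitness (cost) improves."""
    pos = rng.uniform(-5, 5, size=(n_bacteria, dim))
    cost = np.array([f(p) for p in pos])
    for _ in range(chem_steps):
        # Tumble: a random unit direction per bacterium.
        d = rng.normal(size=(n_bacteria, dim))
        d /= np.linalg.norm(d, axis=1, keepdims=True)
        trial = pos + step_size * d
        trial_cost = np.array([f(p) for p in trial])
        better = trial_cost < cost        # swim only on improvement
        pos[better] = trial[better]
        cost[better] = trial_cost[better]
    best = np.argmin(cost)
    return pos[best], cost[best]

sphere = lambda x: float(np.sum(x ** 2))   # toy fitness, optimum at the origin
x_best, c_best = bfoa_chemotaxis(sphere)
print(c_best)   # a small cost near the optimum
```

In feature selection, the fitness function would score a candidate feature subset (for example, by classification accuracy) rather than a continuous test function.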
The bacterial set is initialized using equation (5), and the Pareto optimal set is found as in equation (6), where g_1(x), g_2(x), ..., g_n(x) are n objective functions. The objective function problem is solved using the BFOA algorithm.
In the fitness computation, L represents the current fitness value, M represents the average fitness value of iteration l, and (k) marks the distance value at each angle duration. Optimal features are selected using the best fitness value obtained with the MBFOA algorithm [30] (Algorithm 1).

Input: the feature matrix, Tr
Output: best solution of the feature index, Fs
Step 1: Initialize the bacterial set using (5)
Step 2: Find the Pareto optimal set using (6)
Step 3: Solve the objective function problem by the BFOA algorithm based on (7) and (8)
Step 4: Find the optimized solution
Step 5: Extract the index of the selected features using equation (8)
ALGORITHM 1: MBFOA algorithm.

The probabilistic neural network has a multilayer feedforward design, as shown in Figure 2. PNN is defined by implementing the applied arithmetic rule known as kernel discriminant analysis [31]. The processing is structured into a multilayered feedforward network consisting of four layers: the input layer, the pattern layer, the summation layer, and the output layer [32]. PNN has several benefits, the most important of which are rapid training, an inherently parallel structure, and a guarantee of convergence to the optimal classifier as the size of the representative training set grows. Additionally, training samples can typically be added or removed without requiring extensive retraining [33]. A PNN is consequently superior to many neural network models in terms of learning speed, and it is effective in a wide range of applications [34].
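The four PNN layers can be sketched in a few lines: the pattern layer places one Gaussian kernel on each training sample, the summation layer averages kernel activations per class, and the output layer takes the argmax. The data, labels, and smoothing parameter below are illustrative assumptions:

```python
import numpy as np

def pnn_predict(X_train, y_train, x, sigma=0.5):
    """Minimal probabilistic neural network classifier.

    Pattern layer: a Gaussian kernel per training sample.
    Summation layer: mean kernel activation per class.
    Output layer: class with the highest score."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        d2 = np.sum((Xc - x) ** 2, axis=1)
        scores.append(np.mean(np.exp(-d2 / (2 * sigma ** 2))))
    return classes[int(np.argmax(scores))]

# Two toy clusters standing in for "speech" (0) and "music" (1) features.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.2, 0.9]])
y = np.array([0, 0, 1, 1])
print(pnn_predict(X, y, np.array([0.05, 0.1])))   # 0 (speech cluster)
print(pnn_predict(X, y, np.array([1.1, 1.0])))    # 1 (music cluster)
```

Because training amounts to storing samples, adding or removing a sample needs no retraining, which matches the advantage cited above.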

Retrieval.
A comparable segment of the audio file must be retrieved to complete the audio retrieval process. Feature extraction is the most important step in the audio retrieval operation; because the raw audio file may be difficult to work with, this function provides a numerical representation of each audio file instead [35]. The features are pulled out of each audio file in the database and, once extracted, saved in the feature database. The query audio file has its features extracted, and those characteristics are then compared with the features of every audio file stored in the feature database. If a query audio feature is found to be compatible with the feature database, the corresponding audio file may be obtained. A simple retrieval strategy uses the distance between the query example and the individual samples directly, and provides the retrieval list based on the measured distance [36]. Because the query is checked against the whole database and many relevant audios are obtained, the retrieval pace is sluggish; this is especially true when the growing database is quite vast. To increase the speed of the search and return a variety of files linked to the query, it is recommended to use PNN classification in conjunction with a Euclidean distance measure as the retrieval strategy [37].
Using the probabilities derived from the PNN, the sound in question is assigned to one of the two primary categories, namely speech or music, using the hierarchical retrieval process described above. As the pattern layer is produced, the probability density function of a single sample is calculated [38]. Following that, the distances between the query and the samples are measured within the category rather than across the full database, and an ascending distance list is created as the outcome of the retrieval process. With this method, the processing of a number of irrelevant files can be stopped soon after the search has started [39]. To get the files relevant to the query file, the Euclidean distance computation and relevance matching are both carried out. The steps involved in the retrieval of the audio are shown in Figure 3. The Euclidean distance between two points x and y is determined using equation (9): d(x, y) = sqrt((x_1 − y_1)^2 + ... + (x_n − y_n)^2).
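The distance-ranking step described above can be sketched as follows; the toy feature database and query vector are assumptions for demonstration:

```python
import numpy as np

def retrieve(query_feat, feature_db, top_k=3):
    """Rank feature-database rows by Euclidean distance to the query
    vector and return the indices of the closest matches, nearest first."""
    dists = np.linalg.norm(feature_db - query_feat, axis=1)
    return np.argsort(dists)[:top_k]

# Toy feature database: one row per audio file in the matched category.
db = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [5.0, 5.0]])
print(retrieve(np.array([0.1, 0.1]), db, top_k=2))   # [2 0]: file 2 is nearest
```

In the hierarchical scheme, `feature_db` would hold only the files of the category the PNN assigned to the query, which is what keeps the search fast.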
To define the class label, the category of the audio signal must first be obtained. A comparison is made between the provided dataset and the training set of features in order to get data from the given dataset. Because the structured database is based on the findings of the audio categorization, it is possible to obtain both the recovered audio signal and the category to which the audio belongs. The audio signals within the class are displayed to the user ranked according to how closely they match the query signal [40]. In the first step of the testing process, it is determined whether the provided testing feature is music or voice. If the signal is found to be music, the label is categorized as one of the following instruments: cello, clarinet, flute, guitar, organ, piano, saxophone, trumpet, violin, or band. If the testing characteristic is found to be speech, the voice is further classified as either male or female [41].

Results and Discussion
The suggested method is evaluated with an audio dataset containing a collection of music, voice, and instrumental signals compiled from marsyasweb [42] and fuhrmann [43]. The dataset contains a total of 128 sounds, 64 of which are musical and 64 of which are vocal, together with 1100 instrumental signal audio files organized into 11 categories: cello (100), clarinet (100), flute (100), acoustic guitar (100), electric guitar (100), organ (100), piano (100), saxophone (100), trumpet (100), violin (100), and singing voice (100). A training set consisting of eighty percent of the audio recordings in the dataset is chosen, and the remaining twenty percent of the audio files are assigned to the test set. The suggested task was carried out in MATLAB [44]. The signal characteristics and audio signal used in the proposed work are shown in Tables 1-3. Figure 4 shows the comparative sensitivity, specificity, and accuracy of the prediction rate.
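The 80/20 partition described above can be sketched as a simple shuffled index split; the file count and random seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def train_test_split(n_files, train_frac=0.8):
    """Shuffle file indices and split them into train and test sets."""
    idx = rng.permutation(n_files)
    cut = int(train_frac * n_files)
    return idx[:cut], idx[cut:]

train_idx, test_idx = train_test_split(100)
print(len(train_idx), len(test_idx))   # 80 20
```

For class-balanced datasets like the 11 instrument categories here, splitting per category (a stratified split) keeps the class proportions equal in both sets.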
It is clear from this figure that the audio classification and retrieval approach based on MBFOA achieves high levels of sensitivity, specificity, and accuracy. After a side-by-side comparison of these benchmark performance results with our own, our team was able to confirm the audio classification [45].
The proportion of impostors whose biometric data are accepted by the system is referred to as the false acceptance rate (FAR). In an identification system based on biometrics, users are not required to make any assertions about their identities. As a result, this percentage has to be as low as possible so that the system does not accept an individual who is not already registered. Therefore, there should be as few false acceptances as possible in comparison with false rejections [46].
The false rejection rate (FRR) is the proportion of legitimate users who are unable to log in as a result of the biometric technology being used. Because the user makes assertions about their identity in the biometric verification system, the system must ensure that it does not reject an enrolled user, and the frequency of false rejections must be reduced to the greatest extent feasible [17]. Therefore, the number of false rejections should be kept to a minimum in comparison with the number of false acceptances.
The genuine acceptance rate (GAR) is the proportion of genuine users accepted by the system, given by GAR = 100 − FRR.
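The three rates defined above follow directly from the accept/reject counts; the sketch below computes them for hypothetical counts (the numbers are illustrative, not results from the paper):

```python
def biometric_rates(genuine_accepted, genuine_total,
                    impostor_accepted, impostor_total):
    """Compute FAR, FRR, and GAR (all in percent) from raw counts."""
    frr = 100.0 * (genuine_total - genuine_accepted) / genuine_total
    far = 100.0 * impostor_accepted / impostor_total
    gar = 100.0 - frr          # GAR = 100 - FRR, as defined in the text
    return far, frr, gar

far, frr, gar = biometric_rates(genuine_accepted=95, genuine_total=100,
                                impostor_accepted=2, impostor_total=100)
print(far, frr, gar)   # 2.0 5.0 95.0
```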
It has been determined that the GAR, FAR, FRR, and error rate have been reduced. Based on these findings, it is clear that the characteristics that do not work well for classification have been successfully eliminated by the feature selection model (Figure 4 shows the sensitivity, specificity, and accuracy analysis). Table 4 displays the results of the categorization that was performed. Table 5 contains an examination of the precision as well as the recall. Figure 5 depicts a comparison of the proposed work's average precision with the existing acoustic characteristics [47]. The signal classification approach that may be used for multimedia applications performs much better than the benchmark findings. As a consequence, when categorizing audio signals, we may infer that this result is superior to the current industry standard. Figure 6 depicts the percentage of accuracy achieved by the currently available methods [27].
Precision = (number of relevant audio files retrieved) / (number of retrieved audio files), and recall = (number of relevant audio files retrieved) / (total number of relevant audio files). The confusion matrix for the musical data is shown in Table 5. In the confusion matrix, each row represents the examples that fall into an actual class, while each column represents the occurrences that fall into a predicted class [30]. The feature vector for this experiment was derived from the initial feature set, which consisted of a total of 41 features.
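Per-class precision and recall can be read off such a confusion matrix directly; the 2x2 matrix below is a made-up example, not the paper's Table 5:

```python
import numpy as np

def precision_recall(conf):
    """Per-class precision and recall from a confusion matrix whose rows
    are actual classes and columns are predicted classes."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    precision = tp / conf.sum(axis=0)   # column sum: everything predicted as class c
    recall = tp / conf.sum(axis=1)      # row sum: everything actually in class c
    return precision, recall

conf = [[8, 2],
        [1, 9]]
p, r = precision_recall(conf)
print(p)   # [8/9, 9/11]
print(r)   # [0.8 0.9]
```

Averaging these per-class values over all query files gives the average precision and recall curves reported in Figures 5 and 7.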
The average recall rate has been calculated for all the given query audio files, and the results are shown in Figure 7, indicating that the recall rate increases after a certain number of relevant audio files.

Conclusions
An efficient audio data classification and retrieval approach that can partition the audio stream into similar pieces is explained in the proposed work. These pieces are separated into music and speech; if the signal is a voice, it is further categorized as either a male or female voice, and if the signal is music, it is broken down into the following categories: cello, clarinet, flute, guitar, organ, piano, saxophone, trumpet, violin, and band. A strategy based on MBFOA was used for categorizing and retrieving audio recordings. To begin, the noise present in the collected signal is removed by applying the mean filter to the audio signal. A procedure that utilizes peak estimation and pitch extraction makes it much simpler to exclude silence or other unimportant frequency characteristics from the audio signal input. Thereafter, MBFOA is used to pick the ideal testing feature, and the PNN classifier classifies the data based on the chosen feature. During the retrieval step, the characteristics that correspond to the category acquired from the classification process are retrieved and listed. In future work, the audio signal will be differentiated from the continuous audio stream, and our method for categorizing sounds will be improved so that it can handle a greater number of sound categories [48].

Data Availability
The data that support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest to report regarding the present study.