The Classification of Music and Art Genres under the Visual Threshold of Deep Learning

Wireless networks are commonly employed for ambient assisted living applications, and artificial intelligence-enabled event detection and classification processes have become familiar. However, music is a kind of time-series data, and it is challenging to design an effective music genre classification (MGC) system due to the large quantity of music data. Robust MGC techniques necessitate a massive amount of data, whose collection is time-consuming, laborious, and requires expert knowledge. Few studies have focused on the design of music representations extracted directly from input waveforms. In recent times, deep learning (DL) models have been widely used due to their ability to automatically extract high-level features and contextual representations from actual or processed music data. This paper aims to develop a novel deep learning-enabled music genre classification (DLE-MGC) technique. The proposed DLE-MGC technique effectively classifies music genres into multiple classes by using three subprocesses, namely preprocessing, classification, and hyperparameter optimization. At the initial stage, the Pitch to Vector (Pitch2vec) approach is applied as a preprocessing step, where the pitches in the input musical instrument digital interface (MIDI) files are transformed into vector sequences. Besides, the DLE-MGC technique involves the design of a cat swarm optimization (CSO) algorithm with a bidirectional long short-term memory (BiLSTM) model for the classification process. The performance validation of the DLE-MGC technique was carried out using the Lakh MIDI music dataset, and the comparative results verified the promising performance of the DLE-MGC technique over current methods: the existing DBTMPE technique gained a moderate accuracy of 94.27%, whereas the DLE-MGC technique accomplished a better accuracy of 95.87%.


Introduction
Music in vast collections is becoming increasingly difficult to find, browse, and suggest. It is necessary to maintain all the tags linked with a piece of music to make it easier for others to find what they need. Annotations can be added either manually or automatically [1]. On the other hand, the amount of human labor required for manual annotation is prohibitively expensive. It is possible to categorize songs, albums, and artists based on their shared musical traits by applying genre designations to their work. Music genres have long been used to sort music into various subgenres [2]. This has made automated music genre categorization a hot topic of discussion. Numerous academics categorize music into broad genres (such as Pop or Rock) utilizing handcrafted audio features and assigning a single label to each track [3]. Several issues arise with this proposal. First, not all musical subgenres are mutually exclusive. For example, a song that incorporates elements of Deep House and Reggae may simply be labeled as Pop (songs that are Pop but also have aspects of Deep House and Reggae rhythm might be considered hybrids) [4]. Second, the variability of the data may not be well represented by handcrafted features. Representation learning strategies have consistently outperformed all others when it comes to learning. It is also possible to identify genres using a broad range of data from audio, images, and text [5]. There is a range of ways to deal with the various types of modalities.
This music information retrieval (MIR) problem, which involves both single-label and multilabel categorization, does not yet have a deep learning multimodal technique, as far as we are aware. We offer a system that can accurately predict music genre labels [6]. We employ a two-pronged approach: a neural network is used to learn the categorization task for each modality, and in a multimodal technique, intermediate representations are retrieved from each network and concatenated. Experiments on single-label and multilabel genre categorization are used to test the effects of the newly learned data representations and their combination [7]. It is possible to train convolutional neural networks (CNNs) on spectrogram data, encoding audio signals as time-frequency representations. Album cover photographs are used to train an advanced CNN pretrained with parameters learned in a generic image classification task and then fine-tuned to classify music genre labels [8]. Representations of texts from music-related publications are learned by feedforward networks over a Vector Space Model (VSM) that has been augmented with semantic information using Entity Linking (EL). This is the first time audio and visuals have been utilized to test single-label classification [9]. A multimodal feature space may be learned by matching a given dataset's visual and aural data representations. Using an audio-visual method to categorize items is more successful than using either an audio or a visual strategy alone [10]. Even in the absence of visual data, the multimodal feature space enhances auditory representations. Using human annotator results as a benchmark, we analyze how well our trained algorithms categorize images. These results show how well the combined audio and visual representations work [11].
This study looks at different parts of the input photographs to see how the deep visual model judges each genre. These results were confirmed in an experiment that combined audio, text, and graphics to create a multilabel classification system [12]. Better scores are achieved when multiple modes of data acquisition are combined. The results demonstrate that deep neural networks outperform a traditional audio-based technique that relies on handcrafted characteristics. Deep learning architectures for audio categorization are also compared in a comprehensive manner [13]. An improved dimensionality reduction approach is used for labels in this comparison. The experiment's multilabel categorization is then thoroughly analyzed in terms of its quality [14]. In classical techniques, audio features such as MFCCs are generally used as input to a machine learning classifier, such as a neural network or deep neural network [15]. Spectrograms, which are visual representations of an audio stream, have been used in recent deep learning algorithms [16]. In an approach similar to picture categorization, these visual representations of audio are fed into convolutional neural networks (CNNs).
For this goal, researchers have also explored text-based approaches. An early attempt to categorize music reviews, for example, addressed multi-class genre categorization and star-rating prediction. Using agglomerative clustering as an innovative method for predicting how people would use music, the authors discovered that bigram features were more informative than unigram features in their research [17].
Thus, POS tags and pattern mining techniques are employed to extract descriptive patterns for discriminating between positive and negative reviews. The meaning of a song can be deduced from its lyrics and other textual data, making it possible to predict its theme (e.g., love, war, or drugs). Album reviews are enriched and categorized into 13 different genres using an SVM classifier [18]. Music genre classification with visuals is a relatively new field of study. Audio and song lyrics have frequently been used in studies of multimodal approaches [19]. Aside from text, music and video have also been considered. McKay and Fujinaga use cultural, symbolic, and auditory characteristics to categorize music. Much work on multilabel classification has been done in other fields. In the framework of MIR, machine learning and deep learning have been used to study multilabel tag classification from audio (autotagging). It remains challenging to classify music genres with multiple labels since none of the existing algorithms employ representation learning together with multimodal data [20].
More than 100,000 songs have been made available online since music streaming services were introduced. Genre-based playlists are among the most crucial features of these services. It is impossible to define exactly what goes into making a genre of music, although there are some commonalities among the music that falls under a given category. Musical compositions can be categorized based on these characteristics. Even if music publishers create these labels, they do not serve any purpose in classifying music [21]. As the Internet and multimedia technologies have grown more widely available and popular, the number of musical pieces has increased considerably in recent years. Professionals have become increasingly ineffectual in their role as the primary source of information for assessing and categorizing music. Using software to classify music genres can significantly reduce the workload of human experts while also increasing the accuracy of the results [22].
K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and mixture models are among the most often used machine learning algorithms that can solve the challenge of categorizing music genres. The performance of conventional machine learning algorithms typically does not improve once the amount of input data exceeds a certain threshold. Deep learning algorithms, in particular, have received much attention in the last several years. The complexity of feature engineering in traditional machine learning is more significant than that in deep learning. It is possible to reduce the amount of data needed to train machine learning algorithms by using feature engineering, which involves constructing feature extractors based on domain knowledge [23]. With deep learning, by contrast, one does not need to build a new feature extractor for each labeled task. The accuracy of feature extraction matters more than the input volume for most machine learning algorithms in computer vision and natural language processing. Many different industries are now able to benefit from deep learning. As a result of deep learning methods, significant progress has been made in computer vision and natural language processing. By comparison, MIR still has a long way to go relative to these two fields. Researchers in MIR are therefore increasingly turning to deep learning approaches to tackle their difficulties. A significant reduction in professional workload and an increase in the efficiency of industry-related applications are possible outcomes of this implementation. A solid basis and new ideas are presented to help solve more complex MIR issues [24].
Approaches such as the ones listed above have regularly surpassed the competition for classifying music into various genres. These strategies have yet to address how middle-level learning features influence the classification results of a complete model. Here, middle-level learning, which refers to the features of the layers between the input and the classifier, is especially significant. The following has been simplified for clarity's sake. It is possible to use final learning features derived from a single-input model or a multifeature model to classify data. Since the bottom learning feature is more beneficial than the top learning feature when taken into account, it is assumed that other forms of learning-feature interaction are also involved. Middle-level learning feature interaction (MLFI) has been proposed to investigate this issue. The study focuses on the classification of music and art genres under the visual threshold of deep learning. This paper is organized into four sections. Section 1 presents the introduction and objectives of the study. Section 2 highlights the methods used in the study. Section 3 presents the results and analysis, and Section 4 focuses on the conclusion and future work. The contributions of the study are as follows.
(i) This paper focuses on developing a novel deep learning-enabled music genre classification (DLE-MGC) technique.
(ii) The proposed DLE-MGC technique effectively classifies music genres into multiple classes by using three subprocesses, namely preprocessing, classification, and hyperparameter optimization.
(iii) At the initial stage, the Pitch to Vector (Pitch2vec) approach is applied as a preprocessing step, where the pitches in the input musical instrument digital interface (MIDI) files are transformed into vector sequences.

Materials and Methods
Music genres are classifications of music based on the style of the music played by the players depending on the circumstances or storyline. In this work, experimental analysis is carried out on the Lakh MIDI music dataset, a highly reliable dataset commonly used for music genre classification.
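The Pitch2vec preprocessing step is not detailed in this section. As a minimal sketch, assuming each MIDI pitch number (0–127) is mapped to a one-hot vector so that a note sequence becomes the vector sequence fed to the classifier, the transformation might look as follows (the `pitch2vec` name and the toy melody are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

MIDI_PITCHES = 128  # MIDI note numbers range over 0..127

def pitch2vec(pitches):
    """Map a sequence of MIDI pitch numbers to one-hot vectors,
    yielding the vector sequence used as classifier input."""
    vecs = np.zeros((len(pitches), MIDI_PITCHES))
    vecs[np.arange(len(pitches)), pitches] = 1.0
    return vecs

melody = [60, 62, 64, 65, 67]   # toy C-major fragment as MIDI note numbers
X = pitch2vec(melody)
print(X.shape)                  # (5, 128): one 128-dim vector per note
```

In practice, such one-hot sequences would be extracted from the MIDI files of the Lakh dataset before being passed to the BiLSTM.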
The input gate $i_t$ of the LSTM unit is computed as

$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$, (1)

where $W_{xi}$ and $W_{hi}$ represent weight matrices, $h_{t-1}$ denotes the preceding hidden state of the unit, and $b_i$ shows the bias vector. The function $\sigma(x) \in (0, 1)$ is a sigmoid function utilized for gating. Likewise, the output of the forget gate $f_t$ is calculated by

$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$. (2)

Finally, the output of the output gate $o_t$ and the cell state $c_t$ are

$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$, (3)
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$, (4)
$h_t = o_t \odot \tanh(c_t)$, (5)

in which $\odot$ represents the Hadamard product. The BiLSTM contains two identical LSTM layers, one running in the forward direction and one in the backward direction, as shown in Figure 1. Since the input is processed twice, BiLSTM extracts further information from the input, thereby enriching the contextual data for making effective predictions compared with a unidirectional LSTM. Thus, BiLSTM attains faster convergence and higher accuracy than LSTM. The BiLSTM framework keeps both past and future context at any time step of the sequence. The outputs of the two LSTMs are integrated as

$h_t = \overrightarrow{h}_t \oplus \overleftarrow{h}_t$, (6)

where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ represent the outputs of the forward and backward LSTM, respectively.
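The gate and state updates described above can be sketched as a single NumPy forward pass over a sequence; the hidden size, random weights, and toy input below are illustrative assumptions rather than the paper's actual configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the input-gate, forget-gate,
    output-gate, and candidate parameters along the first axis (4*H rows)."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b       # pre-activations for all gates
    i = sigmoid(z[0:H])                # input gate  i_t
    f = sigmoid(z[H:2*H])              # forget gate f_t
    o = sigmoid(z[2*H:3*H])            # output gate o_t
    g = np.tanh(z[3*H:4*H])            # candidate cell state
    c = f * c_prev + i * g             # cell state (Hadamard products)
    h = o * np.tanh(c)                 # hidden state
    return h, c

def bilstm(seq, H, rng):
    """Run a forward and a backward LSTM over `seq` and concatenate
    the per-step outputs, as in the BiLSTM described above."""
    D = seq[0].shape[0]
    def run(xs):
        W = rng.standard_normal((4 * H, D)) * 0.1
        U = rng.standard_normal((4 * H, H)) * 0.1
        b = np.zeros(4 * H)
        h, c, outs = np.zeros(H), np.zeros(H), []
        for x in xs:
            h, c = lstm_step(x, h, c, W, U, b)
            outs.append(h)
        return outs
    fwd = run(seq)
    bwd = run(seq[::-1])[::-1]         # backward pass, realigned in time
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]

rng = np.random.default_rng(0)
seq = [rng.standard_normal(8) for _ in range(5)]  # toy 5-step input sequence
outs = bilstm(seq, H=16, rng=rng)
print(len(outs), outs[0].shape)        # 5 steps, each a 32-dim concatenation
```

Each output vector concatenates the forward and backward hidden states, so the classifier at every step sees both past and future context.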

CSO-Based Hyperparameter Optimization.
At the final stage, the CSO algorithm is employed to tune the hyperparameter values of the BiLSTM model and thereby enhance the overall classification outcomes. The main advantage of hyperparameter optimization in a machine learning algorithm is that it provides the best performance by finding the hyperparameters, which are measured based on the validation set. The CSO is a population-based, comparatively stochastic, metaheuristic evolutionary approach. CSO imitates two natural behaviors of cats: tracing targets and looking around their environment for the next move. Cats often stay alert even while at rest. They have excellent hunting skills and strong inquisitiveness toward moving objects. One significant feature of cats is that they save their energy for future chasing and spend most of their time in inertia; in normal times, their movement is slow. Once a cat senses prey, it chases very fast, spending massive energy. The CSO models these behaviors as two modes: pursuing with high energy and speed as the tracing mode, and resting with slower movement as the seeking mode. The increase in speed is arithmetically mapped to a significant change in the cat's position.

Seeking Mode.
There are five operators in the seeking mode: Seeking Range of Selected Dimension (SRD), Counts of Dimension to Change (CDC), Seeking Memory Pool (SMP), Mixture Ratio (MR), and Self-Position Consideration (SPC).
SRD is utilized to state the mutation range for a selected dimension: when a candidate dimension is chosen for mutation, the difference between the new and old values may not fall outside this range. CDC corresponds to the number of dimensions to be changed in the seeking method. SMP defines the number of copies of a cat produced, that is, the pool size of the seeking memory; when the value of SMP is set to 10, it is capable of storing ten candidate solution sets. MR is a small fraction of the population, which guarantees that the cats spend most of their time in seeking mode. SPC is a Boolean value; when it is true, one position within the memory stores the existing solution set and remains unchanged. The steps implemented in the seeking process are given below.
Step 1: a population of N cats is formed by initializing the velocity, flag, and position of each cat.
Step 2: according to MR, assign the cats to the seeking and tracing modes.
Step 3: evaluate the fitness value of each cat according to its position.
Step 4: when the end criterion is satisfied, the final solution is the optimal position of the cat in the solution space. Otherwise, return to Step 2.
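The seeking-mode operators described above can be sketched as follows; the objective function, parameter values, and the assumption of a minimization problem are illustrative, not the paper's setup:

```python
import random

def seeking_mode(cat, fitness, SMP=10, CDC=2, SRD=0.2, SPC=True):
    """One seeking-mode step: make SMP copies of the cat, mutate CDC
    randomly chosen dimensions of each copy by up to +/-SRD of the old
    value, and keep the best candidate; with SPC true, the current
    position itself stays in the memory pool as one candidate."""
    candidates = [list(cat)] if SPC else []
    n_copies = SMP - 1 if SPC else SMP
    for _ in range(n_copies):
        copy = list(cat)
        for d in random.sample(range(len(cat)), CDC):
            copy[d] += copy[d] * random.uniform(-SRD, SRD)
        candidates.append(copy)
    return min(candidates, key=fitness)   # minimization is assumed

random.seed(1)
sphere = lambda x: sum(v * v for v in x)  # toy objective (hypothetical)
start = [1.0, -2.0, 0.5]
best = seeking_mode(start, sphere)
print(sphere(best) <= sphere(start))      # True: with SPC, never worse
```

Because SPC keeps the current position in the candidate pool, a seeking step can never worsen the cat's fitness.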

Tracing Mode.
It is exactly like the local search of the swarm in the PSO approach. In this process, the cat traces the target with higher energy by changing its position according to its own velocity. The velocity and position of the $i$th cat in the $D$-dimensional solution space are $v_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$ and $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$, and the global optimal position of the CSO is denoted by $x_{best}$. The position and velocity of the existing cat are updated by utilizing equations (7) and (8):

$v_{id}(t+1) = w \, v_{id}(t) + c_1 r_1 (x_{best,d}(t) - x_{id}(t))$, (7)
$x_{id}(t+1) = x_{id}(t) + v_{id}(t+1)$, (8)

where $w$ represents the inertia weight, $c_1$ denotes the acceleration constant, and $r_1$ indicates a random value uniformly distributed within (0, 1). A small inertia weight facilitates local searching, while a large inertia weight helps global searching. In this work, $w$ is fixed as 0.4. A cat swarm characterizes a set of indices; by utilizing these indices, a reduced feature subset is acquired from the original dataset. In rare cases, the selection may yield indices that fall outside the range of columns present in the dataset. To obtain optimal candidate features with good classification performance, the adapted CSO is applied to each dataset.
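A minimal sketch of the tracing-mode update in equations (7) and (8), with w fixed at 0.4 as stated above; the two-dimensional toy positions and the assumed global-best location are illustrative:

```python
import random

def tracing_mode(pos, vel, best_pos, w=0.4, c1=2.0):
    """Tracing-mode update mirroring equations (7) and (8):
    v_new = w*v + c1*r1*(x_best - x);  x_new = x + v_new."""
    new_vel = [w * v + c1 * random.uniform(0.0, 1.0) * (b - x)
               for v, x, b in zip(vel, pos, best_pos)]
    new_pos = [x + v for x, v in zip(pos, new_vel)]
    return new_pos, new_vel

random.seed(0)
pos, vel = [5.0, -3.0], [0.0, 0.0]
gbest = [0.0, 0.0]                   # assumed swarm-best position (toy value)
pos, vel = tracing_mode(pos, vel, gbest)
print(vel[0] < 0 and vel[1] > 0)     # True: the cat accelerates toward gbest
```

Unlike standard PSO, this CSO tracing update uses only the global best, without a personal-best term.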

Results and Discussion
This section investigates the classification result analysis of the DLE-MGC technique on the benchmark dataset. The classifier results of the DLE-MGC technique are examined under two different epoch counts, namely 1000 and 2000. Table 1 and Figure 2 present the classification results of the DLE-MGC technique under these settings.

Conclusions
Artificial intelligence-enabled event detection and classification techniques are becoming commonplace in ambient assisted living applications. However, designing an effective music genre classification (MGC) system is challenging because of the vast amount of data in the music industry. Robust MGC approaches require large amounts of data that take a long time to collect and analyze. The design of music representations taken directly from input waveforms has received little attention. Deep learning models, by contrast, can automatically extract advanced features and contextual representations from actual or processed music data. The deep learning-enabled music genre classification (DLE-MGC) technique is developed with this motivation. Three subprocesses, namely preprocessing, classification, and hyperparameter optimization, are used in the proposed DLE-MGC technique to classify music genres into numerous classes effectively. Pitch to Vector (Pitch2vec) is used as a preprocessing step to convert the pitches of the input musical instrument digital interface (MIDI) files into vector sequences. The DLE-MGC method utilizes a cat swarm optimization (CSO) model combined with bidirectional long short-term memory (BiLSTM) for the classification process. According to the experimental results, the proposed model provided an accuracy of 95.87%. As a future direction, it is recommended to implement a hybrid model for analyzing music genres.

Data Availability
The data used to support the findings of this study are available from the corresponding author on request.

Conflicts of Interest
There are no competing interests.