Anime Audio Retrieval Based on Audio Separation and Feature Recognition



Introduction
With the rapid development of the anime industry, there is an increasing demand for processing and retrieving anime audio content. Anime viewers often develop an interest in the audio content, such as searching for a particular iconic line, a background music track, or a specific sound effect. Despite the existence of many mature methods, there are still limitations when it comes to anime audio. Firstly, anime audio typically consists of multiple sound sources, such as background music, dialogue, and sound effects, which pose challenges in audio processing and retrieval. Current methods often struggle to accurately separate different sound sources. Additionally, anime audio exhibits complex audio characteristics, including highly varied pitch, speech rate, and timbre, which leads to decreased accuracy in feature extraction. Moreover, traditional audio retrieval methods often rely on manually designed features and similarity measurement approaches and lack the deep modeling and representation capabilities needed for anime audio features.
To address the aforementioned issues, we propose an anime audio retrieval method based on audio separation and audio feature recognition. By separating the sound sources, we can obtain more accurate audio feature representations, which helps improve the accuracy and robustness of audio retrieval, enabling users to search and locate specific audio segments or content more quickly and accurately. The audio separation utilizes a U-Net architecture to effectively separate different sound sources in mixed audio, enhancing the accuracy and efficiency of anime audio processing. To enhance the performance of audio separation, we introduce the Efficient Channel Attention (ECA) mechanism between the downsampling blocks of the encoder, which enhances the focus on important features while effectively handling the complex audio characteristics of anime audio. Additionally, we employ deep learning-based audio feature recognition methods that can extract and represent key features of anime audio. Through deep learning modeling, we can capture the semantic information of anime audio more accurately, thereby improving the accuracy and efficiency of audio retrieval. The improvements made in this study effectively alleviate the problem of poor performance in audio separation for anime audio. Furthermore, the anime audio retrieval method proposed in this paper is capable of performing fast matching while occupying minimal memory.
The contributions of this research can be summarized as follows: (1) We propose an anime audio retrieval framework based on the features of separated anime audio, which relieves the difficulty of recognizing audio content in animations.
(2) The proposed audio separation achieves an average improvement of 0.23 dB in Signal-to-Distortion Ratio (SDR) and 1.08 dB in Signal-to-Interference Ratio (SIR), showing high performance in anime audio separation. (3) With contrastive learning on the Mel spectrograms obtained after the Short-Time Fourier Transform (STFT), the proposed method can extract representative anime audio features and outperforms other existing methods in recognition.
The rest of the paper is organized as follows. In Section 2, we discuss the related works on audio separation and feature recognition methods. In Section 3, we introduce the framework of the proposed anime audio retrieval method and the details of the proposed methods. The experimental results and analysis are given in Section 4. In Section 5, we give a brief conclusion of this work.

Audio Separation Methods.
During the research on model approaches for audio separation, Grais and Erdogan [1] proposed a single-channel speech-music separation method based on Nonnegative Matrix Factorization (NMF) and spectral masking. This method involves decomposing the mixed audio signal into spectrograms of speech and music. Subsequently, spectral masking techniques are applied to mask the speech and music components at each frequency, facilitating the separation of the signals. To enhance the ability of NMF to extract meaningful audio and achieve higher accuracy, Hayashi et al. [2] introduced a method based on periodic NMF. In frequency-domain methods for audio separation, neural networks are commonly used to estimate time-frequency masks to enhance the precision of signal processing. Luo et al. [3] proposed a method that combines deep clustering with traditional neural networks, demonstrating its effectiveness for music separation. Chandna et al. [4] introduced a low-latency single-channel audio separation framework based on Convolutional Neural Networks (CNNs). This framework utilizes CNNs to estimate time-frequency soft masks for source separation, significantly improving the processing speed. The time-domain audio separation network (TasNet) was the first network proposed for audio separation in the time domain [5, 6]. It utilizes an encoder-separation-decoder framework to directly model the audio signal in the time domain. However, this approach may cause distortions, resulting in an imperfect reconstruction of the input signal. To address this issue, Tzinis et al. [7] proposed a two-step separation method: in the first stage, only the encoder and decoder are trained, and in the second stage, the encoder and decoder are fixed while only the separation part is trained. This approach reduces distortions and improves the separation upper limit of the model. Ditter et al. [8] proposed the Multiphase Gammatone Filterbank, which utilizes a deterministic Gammatone filterbank to enhance the performance of TasNet in handling high-frequency speech signals. Compared with randomly initialized frequency responses, it achieves a better distribution of frequency responses. Nugraha et al. [9] introduced a DNN-based multichannel music separation method, and experimental results demonstrated its performance in separating vocals and accompaniments; improvements were also made to the training objectives and overall architecture design. Zeghidour et al. [10] proposed an end-to-end audio separation method called Wavesplit. It calculates a global speaker vector from speaker vectors within short time windows and feeds it into the source separation network to obtain the final source separation results. This approach addresses the fundamental permutation problem in source separation and provides a longer-term and more robust approach to audio separation. Jansson et al. [11, 12] applied the U-Net neural network structure widely used in medical image segmentation to the field of audio separation. They transformed the audio signals into frequency and phase information using the STFT and used the frequency information of the mixed audio as input to the U-Net network; after training the network, they obtained the separated target audio. Stoller et al. [13] proposed Wave-U-Net, which allows information exchange and fusion across multiple scales and different levels of abstraction. By utilizing one-dimensional (1D) convolution operations, Wave-U-Net can directly map waveform to waveform, breaking away from the traditional encoder-separation-decoder structure. Slizovskaia et al. [14] made improvements to the Wave-U-Net network to support a dynamic number of input sources. Cohen-Hadria et al. [15] compared U-Net and Wave-U-Net models and found that pitch shifting is the most effective data augmentation technique for U-Net, while techniques like channel swapping and time stretching show little difference in performance on Wave-U-Net. Meseguer-Brocal et al. [16] proposed the Conditioned-U-Net model, which incorporates a control mechanism that allows training a unique and versatile U-Net network for separating various musical instruments.

Audio Feature Recognition Methods.
The feature recognition in this paper adopts audio fingerprinting technology, which converts audio files into compact and unique feature vectors. By comparing the features of audio files using audio fingerprinting methods, the desired audio can be quickly and accurately identified and located. Shazam and Philips proposed classic approaches in the field of audio fingerprinting. Shazam proposed a method based on spectral analysis, extracting spectral peak points with high energy from the audio signal to generate a sparse set of points known as a "constellation map" [13]. Using the information of an anchor point and its surrounding peak points in the constellation map, an audio fingerprint containing frequency, time difference, and the time position of the previous point is generated. Due to the strong robustness and linearity of spectral energy peaks, the extracted fingerprint exhibits high robustness against audio signal compression, foreground speech, and various types of noise. Philips proposed a method that transforms the audio signal into frequency-domain information, divides it into overlapping frames, maps them into 33 frequency bands, and computes the energy between adjacent audio frames to generate fingerprints for matching and retrieval [17]. Jia et al. [18] proposed a modified fingerprint and matching approach to enhance robustness against noise interference. Zhang et al. [19] introduced a turning point alignment method that improves the robustness of sampling and counting methods against time scaling, enabling Philips and Philips-like fingerprints to be resistant to time scaling while improving retrieval performance under different noise distortions. Building upon this, Yao et al. [20] further improved the robustness of the Philips fingerprint by utilizing a band energy calculation method for peak points. This method not only enables the audio fingerprint to resist time stretching and pitch shifting but also maintains robustness against various types of noise distortion. Chu et al. [21] proposed an audio fingerprint recognition method that is robust against various attacks. They conducted experiments in six different environments, including rhythm, pitch, and speed changes as well as noise addition, and employed a novel hashing method for audio content comparison in the similarity calculation process, leading to a significant improvement in accuracy. With the development of deep learning, there have also been deep learning-based audio fingerprinting methods [22, 23].

The Framework of the Proposed Anime Audio Retrieval
The overall framework of the proposed method is illustrated in Figure 1. The anime audio retrieval method consists of two main components: the audio separation model and the audio fingerprint retrieval model. Firstly, the anime audio is processed by the audio separation model to separate the different audio sources. Subsequently, the separated audio sources undergo audio fingerprint extraction, and the extracted audio fingerprints are used to construct an audio fingerprint database. When a specific anime audio segment needs to be retrieved, the same process is applied to obtain its audio fingerprint, which is then matched against the index entries in the audio fingerprint database. Based on the matching results, the corresponding audio segment and its relevant information are retrieved. Table 1 shows the details of the parameters used in the proposed models.

Anime Audio Separation.
Anime audio typically consists of multiple tracks, such as vocals, music, and sound effects, which are mixed to create a more immersive and vibrant audio experience. Additionally, anime often portrays fictional or supernatural scenarios and scenes, meaning that the sounds in anime audio are often generated by fictional characters or situations. This implies that anime audio features may involve more digital processing and manipulation. These features often require a broader context to capture, and the ECA mechanism can assist the model in better learning long-term dependencies and global contextual information. Therefore, incorporating the ECA mechanism into the audio separation model can enhance the model's performance and enable better learning of the unique characteristics of anime audio. The model in this paper adopts an encoder-decoder structure, where the encoder and decoder consist of five downsampling blocks and five upsampling blocks, with a convolutional layer in between, as illustrated in Figure 2. Skip connections are used to connect the encoder and decoder. To better capture the features of anime audio, this paper employs an ECA module in the encoder part of the audio separation model. This module is inserted between the downsampling blocks and utilizes an adaptive selection of the 1D convolutional kernel size to determine the coverage of local interchannel information interaction. This improves the accuracy of extracting audio signal features.
The ECA mechanism is a lightweight attention mechanism that is an improvement over the Squeeze-and-Excitation attention mechanism [24]. It aims to strike a balance between model performance and complexity. Without reducing dimensions, the ECA mechanism calculates the interdependencies between channels by performing a 1D convolution along the channel dimension. It weights each channel to achieve interchannel interaction [25]. The ECA mechanism first applies Global Average Pooling (GAP) to the input feature map. The GAP is more native to the convolution structure by enforcing correspondences between feature maps and categories [26], and it can also avoid overfitting. After the GAP, a 1D convolution with kernel size k is performed, followed by a Sigmoid activation function to obtain the weights ω for each channel. Finally, the weights are multiplied elementwise with the original input feature map to obtain the final output feature map.
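The paper does not include an implementation of this module; the following PyTorch sketch illustrates the GAP, 1D convolution, and Sigmoid reweighting described above. The adaptive kernel-size rule (with γ = 2 and b = 1) follows the original ECA-Net formulation and is an assumption here, since the paper does not report the exact values.

```python
import math
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    """Efficient Channel Attention: GAP -> 1D conv over channels -> Sigmoid -> reweight.

    The adaptive kernel-size rule (gamma=2, b=1) follows the ECA-Net paper and is
    an assumption; the paper does not state the exact values used.
    """
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                          # kernel size must be odd
        self.avg_pool = nn.AdaptiveAvgPool2d(1)            # Global Average Pooling
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) feature map from a downsampling block
        y = self.avg_pool(x)                               # (B, C, 1, 1)
        y = y.squeeze(-1).transpose(-1, -2)                # (B, 1, C)
        y = self.conv(y)                                   # local cross-channel interaction
        y = self.sigmoid(y).transpose(-1, -2).unsqueeze(-1)  # (B, C, 1, 1) channel weights
        return x * y.expand_as(x)                          # channel-wise reweighting
```

In the separation model, one such layer would sit between the downsampling blocks of the encoder, as described above.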
Moreover, the ECA mechanism enhances the correlation between different channels, thereby improving the expressive power of the network. It enables better capturing of signal features at different frequencies while reducing the number of model parameters, leading to improved training and inference efficiency. In this paper, for the task of anime audio separation, the ECA module precisely captures the unique features of anime character voices, thereby improving the accuracy and efficiency of audio separation.
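For illustration, the sketch below shows how such ECA layers could be interleaved with the five downsampling blocks of the encoder, reusing the ECALayer defined in the previous sketch. The internal composition of each downsampling block (kernel size, stride, normalization, activation, and channel counts) is an assumption, as the paper only specifies the number of blocks and the placement of ECA.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """One encoder downsampling block: strided 2D conv -> GroupNorm -> LeakyReLU.
    The exact composition is an assumption made for this sketch."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2),
            nn.GroupNorm(num_groups=8, num_channels=out_ch),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

class Encoder(nn.Module):
    """Five downsampling blocks with an ECA layer after each one (ECALayer as above)."""
    def __init__(self, channels=(1, 16, 32, 64, 128, 256)):
        super().__init__()
        layers = []
        for in_ch, out_ch in zip(channels[:-1], channels[1:]):
            layers += [DownBlock(in_ch, out_ch), ECALayer(out_ch)]
        self.layers = nn.ModuleList(layers)

    def forward(self, x: torch.Tensor):
        skips = []
        for layer in self.layers:
            x = layer(x)
            if isinstance(layer, ECALayer):
                skips.append(x)        # skip connections passed to the decoder
        return x, skips
```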
Anime Audio Feature Recognition.
In this paper, a contrastive learning-based audio fingerprint retrieval method is adopted for anime audio retrieval. After separating the anime audio through the audio separation model, the separated audio signal is further transformed into a log-Mel spectrogram, which is then fed into the audio fingerprint retrieval model to extract the audio fingerprint of the segmented audio. Finally, the audio fingerprint is searched in the anime audio database to quickly obtain the relevant information. Figure 3 illustrates the structure of the proposed audio fingerprint retrieval model.
For each input audio segment, the model converts it into a 128-dimensional embedding vector, which is considered the audio fingerprint of that segment. These embedding vectors are used for similarity measurement to determine whether they belong to the same audio. To achieve contrastive learning, the model is trained using the embedding vectors of positive and negative samples to maximize the similarity measurement of positive samples and minimize the similarity measurement of negative samples.
From the figure, it can be observed that the model consists of three main parts. In the preprocessing stage, the audio signal is divided into fixed-length audio segments. Each audio segment includes a portion of the previous segment to facilitate the encoder in learning the similarity between segments during contrastive learning. The audio signal is then subjected to the STFT.
The STFT is an analysis method that converts signals from the time domain to the frequency domain and is widely used in audio signal processing and analysis [27]. The STFT is an improvement over the Fourier Transform (FT) [28], which cannot capture the temporal changes of a signal or handle nonstationary signals since it operates on the entire signal. The STFT divides the signal into multiple short-time windows and applies the FT to each window to obtain the frequency-domain information at that particular moment. The STFT can therefore capture the temporal changes of a signal, making it more suitable for processing nonstationary signals. The mathematical expression of the STFT is as follows:

X_f(\omega, u) = \int_{-\infty}^{+\infty} f(t)\, g(t-u)\, e^{-j\omega t}\, dt, \quad (1)

where f(t) represents the original signal, g(t − u) represents the window function, u represents the center of the window function, ω represents the frequency, and X_f(ω, u) represents the STFT result. The window function g(t − u) can take various forms such as the rectangular window, Hann window, and Hamming window. It applies weighting to the signal within the window to better reflect the frequency-domain information of the signal at that particular moment.
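As an illustration of this preprocessing step, the sketch below converts an audio file into the log-Mel spectrogram fed to the fingerprint model using librosa. The STFT window length, hop length, and number of Mel bands are assumptions, since the paper does not report them; only the 44100 Hz sampling rate comes from the dataset description.

```python
import numpy as np
import librosa

def log_mel_spectrogram(path: str,
                        sr: int = 44100,        # dataset sampling rate
                        n_fft: int = 1024,      # assumed STFT window length
                        hop_length: int = 256,  # assumed hop length
                        n_mels: int = 128):     # assumed number of Mel bands
    """Convert an audio file into a log-Mel spectrogram (n_mels x frames)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    # Hann-windowed STFT, as in Eq. (1)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, window="hann")
    power = np.abs(stft) ** 2                    # power spectrogram
    mel = librosa.feature.melspectrogram(S=power, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)              # log-Mel spectrogram
```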

Next, the feature vectors extracted from the convolutional encoder are projected into the contrastive learning space through a projection layer. The projection layer adopts a Linear-Exponential Linear Unit (ELU)-Linear structure. Firstly, a linear transformation is applied to map the feature vectors into a fixed-dimensional space, compressing and reducing the dimensionality of the vectors. Then, the output of the linear layer is passed through the ELU function to introduce nonlinearity, enhancing the model's expressive power and learning efficiency while capturing features at different resolutions and abstraction levels. The transformed features are concatenated and subsequently subjected to L2 normalization. The model parameters are updated through backpropagation by minimizing the contrastive loss function on the training and validation sets, enabling audio retrieval in the test set through similarity calculations.
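A minimal PyTorch sketch of such a projection head is given below. The input and hidden dimensions are assumptions; the paper only fixes the Linear-ELU-Linear structure, the L2 normalization, and the 128-dimensional output fingerprint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Linear -> ELU -> Linear projection into the contrastive space,
    followed by L2 normalization of the 128-D fingerprint."""
    def __init__(self, in_dim: int = 1024, hidden_dim: int = 512, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),  # compress encoder features
            nn.ELU(),                       # nonlinearity
            nn.Linear(hidden_dim, out_dim), # project to 128-D
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        z = self.net(features)
        return F.normalize(z, p=2, dim=-1)  # unit-length audio fingerprint
```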
Due to various interfering factors such as noise and variable speed, anime audio fingerprints may be affected, resulting in the inability to retrieve audio fingerprints accurately or a decrease in the model's recognition ability for certain rare fingerprints caused by the unique characteristics of anime audio. To address this issue, the Normalized Temperature-Scaled Cross-Entropy (NT-Xent) loss function [29] is employed to optimize the model in this study. This loss function enhances the model's robustness by learning the semantic relationships among data samples.
The NT-Xent is a contrastive learning loss function based on cross-entropy, used to train neural networks to learn the similarity between samples. It maximizes the similarity score between positive samples and minimizes the similarity score between negative samples to learn the embedding vectors. In contrastive learning, the dataset is typically divided into two parts: one serving as an "anchor" and the other as a "positive," while a random "negative" sample is selected from the dataset. The NT-Xent loss function uses the softmax function to compute similarity scores, aiming to maximize the similarity score between the "anchor" and "positive" while minimizing the similarity score between the "anchor" and "negative." The NT-Xent loss is defined as follows:

\ell_i = -\log \frac{\exp(a_{i,j}/\tau)}{\sum_{k=1}^{N} \mathbb{1}(k \neq i)\, \exp(a_{i,k}/\tau)},

where N represents the number of samples in a batch, a_{i,j} denotes the similarity score between sample i and its corresponding positive sample j, with higher scores desired, and a_{i,k} represents the similarity score between sample i and the negative samples k, with lower scores desired. The term \mathbb{1}(k ≠ i) outputs 1 when k ≠ i and 0 otherwise. The denominator represents the sum of similarity scores between sample i and all negative samples. τ is the temperature parameter used to control the smoothness of the probability distribution. Additionally, in order to maximize the result of the softmax, the negative logarithm is taken in the loss function. Finally, the total loss L, including the NT-Xent loss, can be calculated as follows:

L = \frac{1}{N} \sum_{i=1}^{N} \ell_i.
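For illustration, the sketch below implements a batched NT-Xent loss under the assumption that the similarity scores a_{i,k} are cosine similarities between L2-normalized fingerprints; the temperature value shown is only a placeholder, not the value used in the paper.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(anchors: torch.Tensor, positives: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """NT-Xent over a batch of N (anchor, positive) fingerprint pairs.

    Assumes a_{i,k} is the dot product of L2-normalized embeddings (cosine similarity);
    tau is the temperature parameter.
    """
    z = torch.cat([anchors, positives], dim=0)          # (2N, D)
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau                               # pairwise similarity scores
    n = anchors.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))          # exclude k == i from the denominator
    # the positive of sample i is its paired segment in the other half of the batch
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                # -log softmax of the positive pair
```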

Experiments
Dataset.
Due to the scarcity of research on anime audio, there is a limited availability of datasets specifically designed for anime audio. Most existing audio datasets are focused on natural language processing or music, and their features differ from those of anime audio, making it difficult to directly apply them to anime audio research. Therefore, this paper has constructed an audio dataset specifically tailored for anime. The construction process is illustrated in Figure 4. A total of 150 anime audio clips were created for this study. The audio files were in WAV and MP3 formats, with a sampling rate of 44100 Hz. Among these, 100 clips were used for training and 50 clips for testing. The anime audio data used in this study were obtained by downloading anime videos from various websites and video platforms. To ensure the performance of the model and the reliability and comprehensiveness of the results, a large and diverse anime audio dataset was necessary for training. In addition to the dataset's scale, the diversity of the dataset also affects the model's results. Therefore, this dataset consists of various anime genres such as science fiction, romance, comedy, and action, as well as anime works in Chinese, English, Korean, and Japanese. The FFmpeg library was used to extract all audio tracks and convert them to WAV and MP3 formats, while also performing resampling to ensure consistent sampling rates. To reduce training time, memory consumption, and the risk of overfitting, long anime audio segments were cropped into multiple shorter segments of approximately 3 minutes each for easier processing. Subsequently, incomplete, noisy, or nonrepresentative parts of the audio segments were removed to ensure the quality and usability of the dataset. Finally, the processed audio segments were fed into a model trained on the MUSDB18 dataset to obtain audio tracks such as character voices and background sounds. The MUSDB18 dataset is divided into a training set (70%) and a test set (30%). The number of epochs is 200 and the batch size is 16. Table 2 shows the detailed information of the experimental platform.
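A minimal sketch of the extraction and segmentation step described above is given below. The FFmpeg invocation, the mono downmix, and the file-naming scheme are illustrative assumptions rather than the exact pipeline used in the paper; only the resampling rate and the ~3-minute segment length come from the dataset description.

```python
import subprocess
from pathlib import Path

def extract_and_segment(video_path: str, out_dir: str,
                        sr: int = 44100, segment_sec: int = 180) -> None:
    """Extract the audio track from an anime video with FFmpeg, resample it,
    and split it into ~3-minute WAV segments."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pattern = str(out / (Path(video_path).stem + "_%03d.wav"))  # illustrative naming
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path,
        "-vn",                            # drop the video stream
        "-ar", str(sr),                   # resample to 44100 Hz
        "-ac", "1",                       # mono downmix (an assumption)
        "-f", "segment",
        "-segment_time", str(segment_sec),
        "-segment_format", "wav",
        pattern,
    ], check=True)
```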

Evaluation Criterion.
We employ two evaluation criteria for audio separation, SDR and SIR, where higher values indicate better performance. The formulas for the SDR and SIR are as follows:

\mathrm{SDR} = 10 \log_{10} \frac{\| s_{\mathrm{target}} \|^2}{\| e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}} \|^2},

\mathrm{SIR} = 10 \log_{10} \frac{\| s_{\mathrm{target}} \|^2}{\| e_{\mathrm{interf}} \|^2},

where s_target represents the true source signal, e_interf denotes the interference error, e_noise represents the noise error, and e_artif corresponds to the artificially added distortion.

The audio fingerprint retrieval model utilizes the Top-1 Hit Rate (HR) as an evaluation criterion to measure its performance in both fragment-level and long audio segment-level retrieval. The Top-1 HR represents the ratio of correctly retrieved results to the sum of correctly retrieved results and erroneously nonretrieved results among the top-ranked results. The specific calculation is given by

\mathrm{Top\text{-}1\ HR} = \frac{n_{\mathrm{hits@Top\text{-}1}}}{n_{\mathrm{hits@Top\text{-}1}} + n_{\mathrm{miss@Top\text{-}1}}},

where n_{hits@Top-1} represents the number of matches in the nearest neighbors of the retrieval vector during the Top-1 retrieval process, while n_{miss@Top-1} represents the number of nonmatches in the nearest neighbors of the retrieval vector during the Top-1 retrieval process.
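A small NumPy sketch of these criteria is given below. It assumes the separated signal has already been decomposed into s_target, e_interf, e_noise, and e_artif (in practice this decomposition is produced by BSS Eval-style tooling such as museval or mir_eval, which the paper does not name).

```python
import numpy as np

def sdr(s_target: np.ndarray, e_interf: np.ndarray,
        e_noise: np.ndarray, e_artif: np.ndarray) -> float:
    """Signal-to-Distortion Ratio in dB (higher is better)."""
    distortion = e_interf + e_noise + e_artif
    return 10 * np.log10(np.sum(s_target ** 2) / np.sum(distortion ** 2))

def sir(s_target: np.ndarray, e_interf: np.ndarray) -> float:
    """Signal-to-Interference Ratio in dB (higher is better)."""
    return 10 * np.log10(np.sum(s_target ** 2) / np.sum(e_interf ** 2))

def top1_hit_rate(n_hits: int, n_misses: int) -> float:
    """Top-1 HR: hits over hits plus misses among the top-ranked retrieval results."""
    return n_hits / (n_hits + n_misses)
```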

Experimental Results and Analysis for Audio Separation.
The original anime audio segment is separated into two audio segments, anime character voices and background audio, using the audio separation model. To analyze the results of audio separation, it is necessary to compare the waveforms and spectrograms of the original mixed audio, the anime character voices, and the background audio. Figure 5 shows the comparison of waveforms and spectrograms for the three audio segments.
Firstly, waveform graphs can be used to observe the temporal characteristics of an audio signal. In the waveform graph of the original mixed audio, it can be seen that the waveform is complex due to the superposition of multiple audio signals. In the waveform graph of the background audio, the amplitude is significantly reduced. In the waveform graph of the anime character voices, the waveform appears relatively simple, indicating that the audio separation model is able to separate them from the mixed audio. Additionally, in the waveform graph of the anime character voices, periodic oscillations of the human voice can be observed, indicating that the separation method preserves the original signal's characteristics reasonably well.
Secondly, spectrograms can be used to observe the frequency-domain characteristics of an audio signal. From the spectrogram of the anime character voices, it can be observed that the frequency components are clear and exhibit distinct resonance peaks, indicating that the separation method preserves the original signal's frequency-domain characteristics reasonably well.

Evaluation of the Audio Separation Performance under Different Loss Functions and Normalizations.
To investigate the practicality of different loss functions in audio separation, demonstrate the advantages of the Mean Absolute Error (MAE) loss function in anime audio separation, and validate the necessity of the Group Normalization (GN) layer for anime audio separation tasks, ablation experiments were conducted to compare the performance of audio separation with and without the GN layer under different loss functions. The experimental results are shown in Table 3. MSE represents the Mean Squared Error loss function and BN denotes Batch Normalization. The evaluation criteria are SDR and SIR. The "✓" symbol indicates the utilization of the corresponding method.
MSE is one of the most commonly used loss functions in audio separation tasks, as it effectively balances the energy differences between different sources, thereby improving the quality and effectiveness of audio separation. However, the anime audio separation addressed in this paper has unique characteristics, such as the presence of many silent regions during character speech, and it is crucial to separate these silent regions accurately. Since the MSE loss function focuses on the sum of squared errors, it is sensitive to outliers and can adversely affect the model's performance. Based on the results, it can be observed that the overall performance with the MSE loss function is not ideal. The MAE loss function, which focuses on the absolute value of errors, is well suited to address the sparsity of anime audio and its impact on the performance of audio separation models. In this paper, the MAE loss function is employed as a replacement for the MSE loss function to better emphasize details, handle the sparsity of anime audio, and improve the effectiveness of audio separation. The results indicate that using MAE yields higher performance than using MSE.
BN is a commonly used normalization technique applied after each convolutional layer in both the encoder and decoder of the model. Its main purpose is to normalize the outputs of convolutional layers, thereby accelerating model training and improving generalization performance. Additionally, BN can mitigate the issue of vanishing gradients, further enhancing the effectiveness of model training. However, the anime audio dataset in this paper is independently constructed, and the sample size may be relatively small. In situations where the sample size is insufficient, BN may lead to larger batchwise sample variances, which can degrade performance.
Due to the presence of multiple audio sources alternating at different time points in anime audio, each batch of audio samples contains different audio sources. GN treats each audio source as a group and performs normalization specifically for each group. This enables more accurate estimation of the mean and variance of each group, resulting in better handling of the alternation of multiple audio sources in anime audio. Additionally, GN's computations do not depend on the batch size, allowing it to achieve desirable performance even with small batch sizes. When combined with the MAE loss function, the combination of MAE and GN achieves a 1.48 dB improvement in the SDR evaluation and also demonstrates improvement in the SIR evaluation, compared with the combination of MAE and BN.
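As a minimal illustration of this design choice, the PyTorch snippet below contrasts the BN + MSE baseline with the GN + MAE configuration adopted here; the channel count and number of groups are assumptions, since the paper does not report them.

```python
import torch.nn as nn

# BN + MSE configuration (the common baseline discussed above)
norm_bn  = nn.BatchNorm2d(64)
loss_mse = nn.MSELoss()

# GN + MAE configuration used in this paper; the number of groups (8) and the
# channel count (64) are assumptions made for this sketch
norm_gn  = nn.GroupNorm(num_groups=8, num_channels=64)
loss_mae = nn.L1Loss()   # Mean Absolute Error
```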

The Performance of Using the ECA.
To validate the contribution of the ECA mechanism to enhancing performance, a comparison was made between the performance of audio separation with and without the incorporation of ECA. The results are presented in Table 4, where the best performance values in each column are highlighted in bold.
The experimental results demonstrate that the ECA enhances the performance of anime audio separation compared with before its inclusion. To evaluate the audio separation performance of the proposed method, we compare it with other methods: extended Open-Unmix (X-UMX) [30], extended Densely Connected Dilated DenseNet (X-D3Net) [30], Hybrid Spectrogram Time-domain Audio Separation Network (HS-TasNet) [31], and Hybrid Transformer Demucs (HT Demucs) [32]. The comparison results are shown in Table 5. The average SDR of the proposed method with ECA is 6.96 dB, higher than those of HS-TasNet and HT Demucs.

Experimental Results and Analysis for Audio Feature Recognition.
In this paper, the L2 index is selected as the indexing method. The L2 index is used for efficient nearest neighbor search in a vector collection. It constructs a data structure in the vector space, partitioning the vector collection into multiple subspaces to accelerate the search process. In the L2 index, the objective of the search is to find the nearest neighbor of the query vector. The L2 index uses the Euclidean distance to measure the distance between vectors and assigns each vector to the corresponding subspace. In this study, the L2 distance is used to calculate the similarity between two vectors. By optimizing the loss function, the L2 distance between two vectors of the same audio is minimized, while the L2 distance between vectors of different audios is maximized. Table 6 presents the HRs of audio fingerprint retrieval using the L2 index.
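The paper does not name the indexing library; the sketch below assumes FAISS, which provides the L2, IVF, IVF-PQ, and HNSW index types compared in the experiments, and uses random placeholder fingerprints in place of the real database.

```python
import numpy as np
import faiss  # assumed library; the paper only names the index types

d = 128                                                            # fingerprint dimensionality
db_fingerprints = np.random.rand(100_000, d).astype("float32")     # placeholder database
query = np.random.rand(1, d).astype("float32")                     # placeholder query fingerprint

index = faiss.IndexFlatL2(d)              # exact L2 (Euclidean) index
index.add(db_fingerprints)                # build the fingerprint database
distances, ids = index.search(query, 10)  # Top-10 nearest fingerprints for the query

# Approximate alternatives compared in Table 7 would be built analogously with
# faiss.IndexIVFFlat, faiss.IndexIVFPQ, or faiss.IndexHNSWFlat.
```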
Audio retrieval experiments were conducted on anime audio segments with lengths of 1 s, 3 s, and 5 s. Top-1 exact indicates whether the returned results contain an audio that is an exact match to the query audio for each retrieval. Top-1 near indicates whether the returned results contain an audio that is very close to the query audio for each retrieval. Top-3 exact and Top-10 exact indicate whether exact matches exist in the top three and top ten most similar audios, respectively, for each retrieval.
The results of comparing different indexing methods are shown in Table 7, including the Inverted-file index (IVF), IVF with Product Quantization (IVF-PQ), and the Hierarchical Navigable Small World graph (HNSW) index. From the experimental results, it can be observed that the performance of the model improves as the length of the query audio increases. When the query length is only 1 s, except for the HNSW indexing method, the top-1 HRs of the other indexing methods reach around 80%. When the query length increases to 3 s, the segment-level HR improves from around 80% to around 90%. The L2 indexing method achieves the best performance, with segment-level HRs of 81.10%, 93.30%, and 97.50% for 1 s, 3 s, and 5 s queries, respectively. The performance difference between approximate matching results and exact matching results is 3.15% for 1 s queries; this difference decreases as the length of the query audio increases.
To further demonstrate the superiority of the proposed method, we compare its retrieval performance with those of other methods: Neural Audio Fingerprint (NAF) [33], Robust and Lightweight Audio Fingerprint (RLAF) [34], Attention-based Audio Embeddings (AAE) [35], and Contrastive Learning-based Audio Fingerprinting (CLAF) [36]. Table 8 shows the comparison of Top-1 HRs with 3 s audio segments using different methods under L2 indexing. The Top-1 HR of the proposed method is higher than those of the other methods by about 2% to 8%. However, as anime audio contains the voices of several animation characters, the performance of separating each character's voice needs further improvement, especially when the characters speak at the same time, because more clearly separated audio leads to more accurate recognition.

Conclusion
To improve the efficiency and accuracy of anime audio retrieval, enhance user experience, and support copyright protection and content management, this paper proposes a novel approach that combines audio separation with audio feature recognition for anime audio retrieval. The proposed method utilizes an ECA-based audio separation technique to separate different audio sources within anime audio. Furthermore, an efficient indexing database is constructed to extract and match fingerprints of the separated anime audio sources. Experimental evaluations conducted on multiple anime segments demonstrate that the proposed method achieves fast and accurate anime audio retrieval, improving retrieval efficiency. Additionally, the proposed framework provides effective methods for copyright protection and content management by enabling the tracking and management of audio resources within anime productions. With its wide range of potential applications, this approach holds significant importance for the anime industry and the field of audio processing. In future work, we will further improve the anime audio separation performance, especially for anime audio that contains the voices of several characters, to further increase the anime audio retrieval accuracy [37, 38].

Table 1 :
The detailed information of the model.

Table 2 :
The detailed information of the experimental environment.

Table 3 :
The results of the ablation experiments.

In the anime audio separation task of this study, the ECA mechanism proves to be effective in reducing the number of model parameters, thereby enhancing the training and inference speed of the model. With the inclusion of the ECA mechanism, both the SDR and SIR criteria exhibit improvement over their respective values before its incorporation. The average SDR improves by 0.23 dB, and the average SIR improves by 1.08 dB.

Table 4 :
Comparison of audio separation performance with and without ECA.

Table 5 :
Comparison results of audio separation with different methods.

Table 7 :
Comparison of Top-1 HRs in different indexing methods.

Table 8 :
Comparison results of Top-1 HRs with 3 s audio segments using different methods.