Research on Audit Opinion Prediction of Listed Companies Based on Sparse Principal Component Analysis and Kernel Fuzzy Clustering Algorithm

The prediction of audit opinions of listed companies plays a significant role in risk prevention in the securities market. By introducing machine learning methods, many innovations can be implemented to improve audit quality, raise audit efficiency, and sharpen the insight of auditors. However, in a realistic environment, prediction models of company audit opinions face the problems of category imbalance and critical feature selection. This paper first combines batched sparse principal component analysis (BSPCA) with the kernel fuzzy clustering algorithm (KFCM) and proposes a sparse-kernel fuzzy clustering undersampling method (S-KFCM) to deal with the imbalance of sample categories. This method adopts the kernel fuzzy clustering algorithm to down-sample the normal samples, using features extracted from the abnormal sample set by the batched sparse principal component method. The down-sampled normal sample set maintains the original distribution space structure and highlights the classification boundary samples. Secondly, considering the company's characteristic attributes and data sources, 448 original variables are grouped, and then BSPCA is used for feature screening. Finally, the support vector machine (SVM) is adopted to complete the classification prediction. According to the empirical results, the SKFCM-SVM model has the highest prediction accuracy.


Introduction
The financial reports and audit reports regularly disclosed by listed companies form a vital basis for the investment decisions of various stakeholders. The audit report is the written document in which the certified public accountant (CPA) issues an audit opinion on the financial statements of the audited entity, based on audit work performed in accordance with the provisions of the audit standards. The audit of financial statements expresses an opinion on whether the financial statements have been prepared in accordance with the applicable accounting standards and whether they fairly reflect, in all material aspects, the financial position, operating results, and cash flow of the auditee. Due to limitations of professional knowledge, time, and other conditions, it is difficult for users of financial statements to effectively review and accurately judge the authenticity and compliance of enterprise financial statements. As a third party independent of the audited entities and stakeholders, the CPA issues audit opinions as assurance on the company's financial situation, operational results, cash flow, and other information, enhancing the credibility of the financial information of listed companies. It has become the industry consensus that machine learning methods can be implemented to improve audit quality, raise audit efficiency, and train auditors. Notably, by combining supervised and unsupervised learning technology, the prediction model of audit opinions can effectively ease the potential contradiction between audit efficiency and audit risk, making the audit work of listed companies more valuable. Meanwhile, it can greatly shorten data processing time, reduce simple repetitive labor, strengthen analysis and monitoring, and allow auditors to focus on problems that require professional judgment, thereby reducing audit risks and producing more credible conclusions to ensure the quality of audit reports.
In addition, machine learning methods are also applied to the early warning of audit risk. They can help stakeholders estimate the type of audit opinion that the CPA will issue from the relevant data of listed companies, optimize resource allocation in the securities market, reduce capital market risk, and maintain market economic order.
Nevertheless, two significant challenges hinder the practical use of artificial intelligence technology in predicting companies' audit opinions. First, most listed companies in the stock market receive a "standard unqualified opinion," and only a few companies are issued other audit opinions, which is a typical category imbalance problem [1][2][3]. Second, audit opinion prediction suffers from the "curse of dimensionality." On the one hand, incorporating all data features into prediction models results in overfitting and degrades prediction results [4,5]. On the other hand, most audit opinion prediction models adopt only financial statement data and few nonfinancial indicators [6][7][8][9][10], although some empirical studies have revealed a significant relationship between abnormal nonfinancial indicators and audit opinions [11,12]. Moreover, features are usually selected based on experience or previous research, which makes the prediction model easily affected by human factors. Therefore, the key to applying machine learning to the audit opinion prediction of listed companies is how to mine and analyze the comprehensive information of the securities market, extract the valuable information, and determine whether a company has audit risk by establishing a prediction model with strong anti-interference ability.
Given this, our research combines supervised learning technology with unsupervised learning technology. First, to deal with the imbalance of the company sample categories, we combine sparse principal component analysis with a kernel fuzzy clustering algorithm and propose a novel undersampling method. Second, for feature screening, to retain the internal structure of the company's characteristics, we employ sparse principal component analysis to select the characteristic data after grouping, aiming to remove redundant information in listed company data and thereby select the optimal audit risk prediction characteristics. Finally, the support vector machine is adopted for classification. The main contributions of this study are as follows:
(1) For the category imbalance problem in company audit opinion prediction, we combine the sparse principal component method with the kernel fuzzy clustering algorithm to propose a sparse kernel fuzzy clustering undersampling method. Compared with the mainstream methods for processing data imbalance, our proposed approach achieves better testing performance.
(2) In the unified kernel function mapping space, the sparse fuzzy clustering and SVM are combined to integrate the undersampling and classification prediction, which improves the classification prediction effect and enhances the model's comprehensibility.
(3) To highlight the practicability of the prediction model in this paper, we take all listed manufacturing companies in China's A-share market from 2012 to 2019 as the research object to reflect the actual structure of the sample space of companies in the securities market. In terms of sample features, we use the most comprehensive feature data in the China Stock Market and Accounting Research (CSMAR) database. In the experiment, sample matching and feature screening are determined by the corresponding algorithms to avoid the influence of human factors.
The remainder of this paper is organized as follows. Section 2 reviews the related research literature. Section 3 introduces our model's specific process and core algorithms, including the BSPCA and kernel fuzzy clustering undersampling methods. Section 4 presents the data, design, evaluation indexes, and results of this study. Section 5 concludes our research work and addresses future research directions.

Literature Review
Early scholars usually used statistical analysis methods (such as logistic and probabilistic models) to study the early warning of companies' audit risk. However, traditional research methods are limited by strict assumptions and have poor fault tolerance. With the wide application of artificial intelligence technology in corporate governance, an increasing number of scholars apply machine learning algorithms to predict company audit risk and financial fraud. Gaganis et al. [13] applied a probabilistic neural network to predict audit opinions of listed companies. Taking companies listed on the London Stock Exchange as the experimental objects, they found that their proposed model was superior to the traditional artificial neural network and logistic regression models. Perols [3] compared the application of six popular classifiers in the field of corporate financial statement fraud and found that logistic regression and SVM performed well relative to artificial neural networks, bagging, C4.5, and stacking. Fernandez-Gamez et al. [14] combined financial variables with corporate governance variables to form a feature set and used a multilayer perceptron and a probabilistic neural network to establish a prediction model of audit opinions. Heng-Shu [15] took the financial indicators of listed companies as variables and introduced a Takagi-Sugeno fuzzy neural network to construct the prediction model of audit opinions. Salehi and Dehnavi [16] applied the grey model to predict audit reports and found that the Nash nonlinear grey Bernoulli model had the best prediction effect. Yao et al. [17] adopted stepwise regression and principal component analysis (PCA) to reduce the dimension of company characteristics and used six machine learning methods to identify fraudulent activities in the financial statements of Chinese listed companies. They found that stepwise regression combined with SVM had the highest classification accuracy.
Omidi et al. [18] analyzed financial statement fraud in China's stock market by combining supervised and unsupervised learning. First, the financial statement data were divided into three groups by cluster analysis. Then, a multilayer feedforward neural network (MLFNN), probabilistic neural network (PNN), support vector machine (SVM), polynomial log-linear model (MLM), and discriminant analysis (DA) were used for classification and prediction, and the research found that the fuzzy neural network had the best classification effect. Bao et al. [19] used ensemble learning to predict accounting fraud of listed companies in the United States, with original accounting figures rather than financial ratios as the input data, which was shown to have a good prediction effect. Sánchez-Serrano et al. [20], taking a group of Spanish companies as research samples, compared the effects of several different neural networks in predicting audit opinions on companies' consolidated financial statements; MLP obtained the best prediction effect, with an accuracy of more than 86%. Chyan-Long Jan [21] forecast the CPA's going-concern audit opinions for listed companies in Taiwan. The results show that the model's prediction accuracy is highest when the important variables are selected by CART and the model is built with an RNN.
Previous studies on the construction of audit risk early warning of listed companies have the following limitations: First, there is little research on selecting the overall characteristics of listed companies. Existing studies have applied factor analysis, rough domain set, principal component analysis, and other techniques to filter or reduce the dimension of company characteristics, but most of their initial characteristics are determined according to previous studies. Considering that artificial intelligence technology can handle collinearity problems well, and the audit of listed companies is for annual financial statements, characteristics in the database should be considered as comprehensively as possible. All financial indicators should be included in the research scope.
Second, research on the imbalance problem in audit risk prediction of listed companies is insufficient. Existing research on the audit risk prediction of listed companies often ignores the category imbalance, even though companies with audit risk make up only a tiny part of the stock market. Most scholars artificially select control samples according to industry and asset size to construct balanced data sets. This method ignores the original sample structure of the stock market and reduces the practicability of the prediction model. Some scholars also use the SMOTE oversampling technique to deal with imbalance problems. However, the new samples synthesized by this method may not provide much helpful information. Therefore, we should compare various techniques for dealing with the imbalance problem and identify the best method for handling the audit risk samples of listed companies. This study uses a larger sample size and more comprehensive sample characteristics to analyze the prediction effect of the model, retaining the real market environment faced by listed companies and thereby enhancing the practicability of the model.

Model Description
We propose a hybrid classification model that combines sparse principal component analysis (SPCA) [22], the kernel fuzzy c-means algorithm (KFCM), the K-nearest neighbor algorithm (KNN), and SVM. Figure 1 shows the process of sample matching, feature screening, and classification prediction throughout the model, and Figure 2 shows the flow of the sparse-kernel fuzzy clustering undersampling method (S-KFCM) in detail. As shown in Figure 1, the entire model is divided into three stages. The first stage is the sample matching stage. To solve the imbalance problem in the prediction of listed companies' audit opinions, we apply the sparse-kernel fuzzy clustering undersampling method (S-KFCM) to choose the most similar and representative matching samples and build a balanced sample data set. As shown in Figure 2, we first divide the data set into several groups according to the year to which each sample belongs, and then conduct batched sparse principal component analysis (BSPCA) on the minority group (nonstandard audit opinion companies) in each group to obtain the feature set that best reflects this category of companies. Then these features are used to cluster the majority group samples of the same year, and the kernel fuzzy clustering algorithm is used to determine the clustering centers of the majority group samples. Finally, the nearest neighbor algorithm is used to find the majority group samples closest to these clustering centers as the control samples for the minority group samples. The second stage is the feature screening stage. After obtaining the category-balanced data set, we again use BSPCA to perform feature screening, eliminate redundant information in the raw data, and build the best feature data set. The final stage is the classification prediction stage. We put the training data into SVM to train the model and then input the test samples for prediction, where SVM employs the same kernel as the kernel fuzzy clustering algorithm in the first stage.
In this procedure, the key algorithms are BSPCA in Stages 1 and 2 and the kernel fuzzy clustering undersampling method in Stage 1. Therefore, we provide a more explicit description of these algorithms in the following two subsections.

Batched Sparse Principal Component Analysis (BSPCA).
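Sparse principal component analysis (SPCA) is proposed because principal component analysis can be transformed into a quadratic penalty regression problem. The objective function of SPCA is as follows:

$$(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \sum_{i=1}^{n} \left\| X_i - \alpha \beta^{T} X_i \right\|^2 + \lambda \|\beta\|^2 \tag{1}$$

where $X_i$ is the $i$th row vector of $X$ and $\lambda > 0$. When $\|\alpha\|^2 = 1$, $\hat{\beta} \propto V_1$, where $V_1$ is the loading vector of the first principal component. As such, regression knowledge is used to obtain the first principal component. By adding the LASSO penalty item to the above equation, sparse principal components can be obtained. Thus, the following optimization problem can be obtained:

$$(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \sum_{i=1}^{n} \left\| X_i - \alpha \beta^{T} X_i \right\|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1 \tag{2}$$

where $\alpha^{T} \alpha = I_k$, $\beta_j$ is the $j$th column of $\beta$, $k$ is the number of components, and $\lambda_{1,j}$ is the LASSO penalty weight of the $j$th component.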

Mathematical Problems in Engineering
As stated above, the solution of sparse principal components can be transformed into a penalty regression problem. The sparse principal components are calculated using the least angle regression algorithm. The steps of grouped SPCA are as follows:
(1) Collect and standardize the characteristic indicators of listed companies.
(2) Divide the characteristic indicators of listed companies into several groups (such as solvency, profitability, and growth ability) according to financial statement analysis methods and data sources.
More details about BSPCA can be found in [23].
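The grouped screening idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: instead of the least angle regression solver mentioned above, it uses a simpler stand-in (power iteration with soft thresholding) to obtain one sparse loading vector per feature group, and it keeps the features whose loadings are nonzero. The function names, the `penalty` value, and the group names in the usage note are hypothetical and do not come from the paper.

```python
import math

def standardize(rows):
    """Column-wise z-score of a list of equal-length rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(p)]
    return [[(r[j] - means[j]) / stds[j] for j in range(p)] for r in rows]

def covariance(rows):
    """Covariance matrix of already-centred columns (divides by n)."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / n for b in range(p)]
            for a in range(p)]

def soft_threshold(x, t):
    """Shrink x towards zero by t (the LASSO proximal step)."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def sparse_loading(cov, penalty, iters=200):
    """First sparse loading vector via soft-thresholded power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [soft_threshold(sum(cov[a][b] * v[b] for b in range(p)), penalty)
             for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:          # penalty too strong: every loading vanished
            return [0.0] * p
        v = [x / norm for x in w]
    return v

def batched_sparse_screen(rows, groups, penalty=0.5):
    """Screen each feature group separately; return the kept column indices."""
    z = standardize(rows)
    kept = {}
    for name, cols in groups.items():
        sub = [[r[c] for c in cols] for r in z]
        loading = sparse_loading(covariance(sub), penalty)
        kept[name] = [c for c, w in zip(cols, loading) if abs(w) > 1e-12]
    return kept
```

With a toy table in which columns 0 and 1 move together and column 2 is uncorrelated with them, calling `batched_sparse_screen(rows, {"profitability": [0, 1, 2], "solvency": [3]})` keeps columns 0 and 1 of the first group and drops column 2, which is the behaviour the grouping step is meant to produce.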

Undersampling Method of Kernel Fuzzy c-Means Clustering.
The objective function Q of KFCM-K is the membership-weighted sum of squared distances between the data and the cluster prototypes. Here, ‖·‖ is the Euclidean distance, $u_{ik}$ is the membership of data $x_k$ in cluster $i$, which is represented by the prototype $v_i$, and $m$ is the fuzzification coefficient.
Given the Euclidean distance, Q is optimized with respect to the prototypes $v_i$ located in the kernel space by setting $\nabla_{v_i} Q = 0$ [24][25][26][27]. This yields the prototype expression for the Gaussian kernel for $i = 1, 2, \ldots, c$, and the learning algorithm of KFCM-K then iteratively updates the memberships $u_{ik}$. More details about the derivation of (6) and (7) can be found in [28].
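To make the alternating updates concrete, the following sketch implements one widely used Gaussian-kernel fuzzy c-means variant in which the prototypes are kept as explicit points in the input space. This choice is an assumption made here so that the resulting centers can be compared with samples by Euclidean distance; it may differ in detail from the KFCM-K equations of the paper. The function names, the default `sigma`, and the use of initial centers instead of an initial partition matrix are likewise illustrative.

```python
import math

def gaussian_kernel(x, v, sigma):
    """Gaussian kernel K(x, v) = exp(-||x - v||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kfcm(points, init_centers, m=2.0, sigma=1.0, tol=1e-6, max_iter=100):
    """Gaussian-kernel fuzzy c-means with input-space prototypes.

    Returns (centers, u), where u[i][k] is the membership of points[k]
    in cluster i computed in the last iteration.
    """
    centers = [list(c) for c in init_centers]
    dim = len(points[0])
    u = []
    for _ in range(max_iter):
        kern = [[gaussian_kernel(x, v, sigma) for x in points] for v in centers]
        # Membership update: u_ik is proportional to (1 - K(x_k, v_i))^(-1/(m-1)).
        u = [[0.0] * len(points) for _ in centers]
        for k in range(len(points)):
            w = [max(1.0 - kern[i][k], 1e-12) ** (-1.0 / (m - 1.0))
                 for i in range(len(centers))]
            total = sum(w)
            for i in range(len(centers)):
                u[i][k] = w[i] / total
        # Prototype update: mean of the points weighted by u_ik^m * K(x_k, v_i).
        new_centers = []
        for i in range(len(centers)):
            weights = [(u[i][k] ** m) * kern[i][k] for k in range(len(points))]
            total = sum(weights)
            new_centers.append(
                [sum(wk * x[d] for wk, x in zip(weights, points)) / total
                 for d in range(dim)])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # stop once the prototypes no longer move
            break
    return centers, u
```

On two well-separated groups of points, for example three points near the origin and three near (10, 10) with one initial center in each group, each prototype settles inside its own group and the memberships of every point sum to one across clusters.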
Use the KNN algorithm to calculate the Euclidean distance between the majority group samples and each cluster center by $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$. Sort the distance values and select the K points with the smallest distance.
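The selection step can be sketched as follows. The function name `select_controls` and the rule of never picking the same majority sample twice are assumptions made for this illustration; the text above only specifies sorting by Euclidean distance and taking the K nearest points.

```python
import math

def euclidean(x, y):
    """Euclidean distance d(x, y) = sqrt(sum_i (x_i - y_i)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def select_controls(majority, centers, k=1):
    """Indices of the k majority samples nearest to each cluster center.

    A sample already chosen for an earlier center is skipped, so the
    returned indices are unique (an assumption of this sketch).
    """
    chosen = []
    for center in centers:
        ranked = sorted(range(len(majority)),
                        key=lambda i: euclidean(majority[i], center))
        picked = [i for i in ranked if i not in chosen][:k]
        chosen.extend(picked)
    return chosen
```

For example, with majority samples `[(0, 0), (5, 5), (0.2, 0.1), (9, 9), (10, 10.5)]` and centers `[(0.3, 0.3), (10, 10)]`, `select_controls(..., k=1)` returns `[2, 4]`, the sample closest to each center.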
Our model is summarized as Algorithm 1.

Experiment Results and Discussion
In this section, we evaluate the performance of the proposed SKFCM-SVM method by using the real information data of manufacturing listed companies in China's A-share market.
We use MATLAB 2016b and Python 3.8 as tools to obtain the calculation results. Table 1 reflects the annual sample distribution of nonstandard audit opinion companies and standard audit opinion companies in China's A-share manufacturing industry. The imbalance ratio (IR) is the number of companies with a standard audit opinion (majority category) divided by the number of companies with a nonstandard audit opinion (minority category). As shown in Table 1, with the gradually tightening supervision of the capital market by the China Securities Regulatory Commission (CSRC), the number of listed companies with nonstandard audit opinions increases year by year, and the imbalance ratio (IR) gradually decreases. However, compared with the companies with standard unqualified audit opinions, the data set of manufacturing listed companies still has a severe category imbalance.
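As a small illustration of the imbalance ratio defined above, the sketch below counts majority and minority samples per year and divides one by the other. The record layout `(year, is_nonstandard)` and the function name are assumptions, and the counts in the usage note are made up for the example rather than taken from Table 1.

```python
from collections import Counter

def imbalance_ratio_by_year(records):
    """IR per year: standard-opinion count / nonstandard-opinion count.

    records is an iterable of (year, is_nonstandard) pairs. Years without
    any nonstandard (minority) sample are skipped to avoid division by zero.
    """
    majority, minority = Counter(), Counter()
    for year, is_nonstandard in records:
        if is_nonstandard:
            minority[year] += 1
        else:
            majority[year] += 1
    return {year: majority[year] / minority[year] for year in sorted(minority)}
```

With nine standard and one nonstandard record for 2012, and eight standard and two nonstandard records for 2013, the function returns `{2012: 9.0, 2013: 4.0}`.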
Considering that the prediction of audit opinions of listed companies has vital timeliness, the training and test samples are divided according to the year of data. We set four data sets, and their specific sample distributions are shown in Table 2. Since the data in the last year of each data set also has the problem of unbalanced sample categories, we carry out kernel fuzzy clustering analysis on majority category samples in the annual sample data from 2016 to 2019. e most representative majority class samples are found and matched with the minority class samples, and the balanced data sets are constructed as the test samples.

Selection of Alternative Indicators.
At present, there is no specific economic theory to guide the selection of indicators for predicting audit opinions. The fundamental reasons for the various audit opinions differ, so they are difficult to describe fully with a few simple ratio indicators. Therefore, from the China Stock Market and Accounting Research (CSMAR) database, we download the financial indicators, corporate governance indicators, and market transaction data of listed manufacturing companies and attempt to obtain comprehensive information reflecting all aspects of each company. In order to retain the internal structure of the company's data and apply it to the batched sparse principal component analysis (BSPCA) introduced in Section 3.1, we follow the feature grouping of the CSMAR database and divide all features into 15 groups. Among them, 390 indicators reflect the financial statement information of listed companies and are divided into ten groups: 35 features of ratio structure, 27 features of solvency, 56 features of development ability, three features of risk level, eight features of dividend capacity, 67 features of operation ability, 88 features of index per share, seven features of financial disclosure index, 32 features of cash flow, and 67 features of profitability. In addition, 51 indicators reflect the governance ability of listed companies and are divided into four groups: five characteristics of ownership structure, ten characteristics of the top ten shareholders, 28 indicators of relative value, and eight indicators of comprehensive governance information. Finally, seven indicators reflect the market trading information of listed companies.
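The grouping above can be written down as a small configuration that also checks the arithmetic (390 + 51 + 7 = 448 features in 15 groups) and turns the group sizes into column ranges for a group-wise screening step. The dictionary and function names, and the assumption that columns are stored group by group in this order, are illustrative only.

```python
# Group sizes as listed in the text; key names are paraphrased labels.
FINANCIAL = {
    "ratio structure": 35, "solvency": 27, "development ability": 56,
    "risk level": 3, "dividend capacity": 8, "operation ability": 67,
    "index per share": 88, "financial disclosure index": 7,
    "cash flow": 32, "profitability": 67,
}
GOVERNANCE = {
    "ownership structure": 5, "top ten shareholders": 10,
    "relative value": 28, "comprehensive governance information": 8,
}
MARKET = {"market trading information": 7}

FEATURE_GROUPS = {**FINANCIAL, **GOVERNANCE, **MARKET}

def group_slices(sizes):
    """Map each group name to the range of column indices it occupies,
    assuming columns are laid out group by group in dictionary order."""
    slices, start = {}, 0
    for name, size in sizes.items():
        slices[name] = range(start, start + size)
        start += size
    return slices

# Consistency checks against the totals stated in the text.
assert sum(FINANCIAL.values()) == 390
assert sum(GOVERNANCE.values()) == 51
assert sum(FEATURE_GROUPS.values()) == 448
assert len(FEATURE_GROUPS) == 15
```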

Performance Evaluation Measures.
The research in this paper is a binary classification problem, and when class imbalance or cost imbalance is present, accuracy alone cannot sufficiently reflect the classification effect of the model. Therefore, we use the confusion matrix to measure model performance. We set the nonstandard audit opinion as positive and the standard unqualified audit opinion as negative.
A correct prediction of a nonstandard audit opinion sample is a true positive (TP), a false prediction of a nonstandard audit opinion sample is a false negative (FN), a correct prediction of a standard unqualified audit opinion sample is a true negative (TN), and a false prediction of a standard unqualified audit opinion sample is a false positive (FP). The specific evaluation indexes include accuracy, F1 value, G-mean, and the Matthews correlation coefficient (MCC), each calculated from TP, FN, TN, and FP.

Algorithm 1.
Input: a set of n data points $X = \{x_i\}_{i=1}^{n}$, a base Gaussian kernel function K, the number of clusters c, the fuzzy index m, the termination criteria ζ and T, and the initial partition matrix $U^{(0)} = \{u_{ij}\}_{i,j=1}^{c,n}$.
Output: the clustering prototypes V.

Comparison of Results and Discussion of Sample Matching Methods.
In order to analyze how well our proposed sparse kernel fuzzy clustering undersampling method deals with the imbalance of sample categories, we take the four data sets in Table 2 as experimental objects, introduce five popular sampling methods, and use the same classifier (SVM) for a comparative study. The five sampling methods are random oversampling (RO), the synthetic minority oversampling technique (SMOTE) [29,30], adaptive synthetic sampling (ADASYN) [31], random undersampling (RU), and NearMiss [32]. The first three methods are oversampling methods, while the last two methods and the method in this paper (SKFCM) are undersampling methods. Random oversampling (RO) randomly draws samples from the minority category and adds them to the data set. SMOTE interpolates between minority category samples to generate additional samples. ADASYN also synthesizes new samples. Its most significant feature is that it automatically determines the number of synthetic samples to generate for each minority sample, rather than synthesizing the same number of samples for each minority sample as SMOTE does. Random undersampling (RU) mirrors random oversampling in that some samples are randomly selected from the majority samples. NearMiss is a prototype selection method that selects the most representative samples from the majority category for training, mainly aiming to alleviate the information loss of random undersampling.
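The three simplest baselines described above can be sketched in plain Python to show what each one does to the sample set. This is a didactic sketch, not the implementation used in the experiments: the function names and parameters are illustrative, and the SMOTE variant shown picks the base sample and its neighbour at random with a seeded generator.

```python
import math
import random

def random_undersample(majority, n_keep, rng):
    """Keep n_keep majority samples chosen at random without replacement."""
    return rng.sample(majority, n_keep)

def random_oversample(minority, n_total, rng):
    """Repeat randomly chosen minority samples until n_total are available."""
    extra = [rng.choice(minority) for _ in range(n_total - len(minority))]
    return list(minority) + extra

def smote(minority, n_new, k, rng):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because every SMOTE sample lies on a segment between two existing minority samples, the synthetic points never leave the region already spanned by the minority class, which is the property behind the limited-information concern raised about this method.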
As shown in Table 3 and Figure 3, our proposed S-KFCM method achieves the best classification effect on the four data sets, indicating that S-KFCM is more suitable for the sample matching problem of listed companies. In contrast, each of the sampling methods used in the comparative experiment has a specific defect. Random oversampling (RO), due to repeated sampling, often leads to severe overfitting. Regarding SMOTE, if the selected minority sample is surrounded by other minority samples, the new synthetic sample provides little helpful information. If the selected minority sample is surrounded by majority category samples, it may be noise, and the new synthetic sample will overlap with most of the surrounding majority samples, making classification difficult. The ADASYN and NearMiss methods are easily affected by outliers. The disadvantage of random undersampling (RU) is that the excluded samples may contain critical information, which weakens the learned model.
Our proposed S-KFCM regards the minority category samples as the core object and uses batched sparse principal component analysis (BSPCA) to dig the vital feature combination of minority samples. Based on this feature set, the most representative samples in the majority category are found by the KFCM and KNN algorithms. Furthermore, all the underlying processes are completed according to the year of sample data, which fully considers the particular environment in the sample year to build the best sample balance dataset.

Comparison Results and Discussion of Feature Dimensionality Reduction Algorithms.
In this section, we test the BSPCA method adopted in this paper for feature dimension reduction. We use three common dimensionality reduction methods for comparative analysis: linear discriminant analysis, principal component analysis, and sparse principal component analysis. The nearest neighbor algorithm is the classifier for linear discriminant analysis, and SVM is the classifier for the other dimensionality reduction methods. As shown in Table 4 and Figure 4, the BSPCA method has apparent advantages on the data sets. The cause is that the BSPCA method differs from the other three methods. BSPCA groups all features according to financial statement analysis methods and data sources and then uses sparse principal component analysis to screen each feature group. This method reduces the redundant data of each feature group, which leaves fewer opportunities to make decisions based on noise and thus reduces overfitting. It can be used to measure the information category and relative

Comparison Results and Discussion of Model Classifier Algorithms.
This section is the third part of the experiment. After the model has completed sample matching and data dimensionality reduction, we test the influence of different classifiers on the prediction of audit opinions of listed companies. In this paper, a multilayer feedforward neural network (MLFNN), K-nearest neighbor (KNN), the naive Bayesian classifier (NB), C4.5, bagging, and the support vector machine (SVM) are introduced for comparative analysis. Table 5 and Figure 5 show the prediction effects of the different classifiers on the four data sets after using sparse kernel fuzzy clustering undersampling for sample matching and batched sparse principal component analysis for feature selection. The best index results of each data set are shown in bold. As shown in Table 5, SVM achieves the best results on Dataset I and Dataset IV and falls behind the C4.5 algorithm only on the evaluation indexes of Dataset II and the F1 value of Dataset III. The difference between SVM and the other five classifier algorithms is that SVM uses a kernel function in the classification process. The KFCM algorithm and the SVM used in our model share the same kernel function (the Gaussian kernel function), so the sample matching and classification prediction of our model are completed in the same kernel space. SVM is therefore well suited to be the classifier for the company audit opinion prediction model.

Significance Test.
This section uses the Friedman test to determine whether there are significant differences between the various methods in the three parts of the experiment. We compared six sample matching algorithms, four feature dimensionality reduction algorithms, and six classifiers. In Friedman's analysis, the test statistic approximately follows the chi-square distribution. We ranked the G-mean values in the above experiments and calculated the rank of each method in the three parts of the experiment. The ranks are shown in Tables 6 to 8. The note to Table 6 reads $\chi^2_{0.05} = 11.071 < 17.57$, and the note to Table 7 reads $\chi^2_{0.05} = 7.815 < 12$.
Under the null hypothesis, there would be no difference between the methods, and therefore, theoretically, the rank sums $R_j$ should be equal. From the data in Tables 6 to 8, the value of the Friedman test statistic can be calculated as follows:

$$\chi_r^2 = \frac{12}{nk(k+1)} \sum_{j=1}^{k} R_j^2 - 3n(k+1) \tag{9}$$

where $n$ is the number of datasets, $k$ is the number of methods, and $R_j$ is the sum of the ranks over all datasets for the $j$th method.
In statistical analysis, to reject the null hypothesis, the calculated value $\chi_r^2$ must be greater than or equal to the critical value of the chi-square distribution. In this set of experiments, we adopted the commonly used significance level of 0.05, with $k - 1$ degrees of freedom.
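The Friedman statistic described above is short enough to compute directly. The sketch below takes a table of ranks (one row per dataset, one column per method) and returns the statistic; the function name and the rank table in the usage note are illustrative and are not the ranks of Tables 6 to 8.

```python
def friedman_statistic(ranks):
    """Friedman chi-square statistic from a rank table.

    ranks[d][j] is the rank of method j on dataset d (1 = best).
    Uses chi2_r = 12 / (n k (k + 1)) * sum_j R_j^2 - 3 n (k + 1),
    where R_j is the rank sum of method j over the n datasets.
    """
    n, k = len(ranks), len(ranks[0])
    rank_sums = [sum(row[j] for row in ranks) for j in range(k)]
    return 12.0 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) - 3.0 * n * (k + 1)

# Four datasets, four methods, identical ordering on every dataset.
stat = friedman_statistic([[1, 2, 3, 4]] * 4)
print(stat)           # 12.0
print(stat >= 7.815)  # True: exceeds the 0.05 critical value for 3 degrees of freedom
```

A perfectly consistent ordering over four datasets and four methods gives 12, the largest value the statistic can take for n = 4 and k = 4, while rankings that cancel out across datasets give 0.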
According to Table 6 and Table 7, the first two parts of the experiment passed the Friedman test, indicating significant differences among the sample matching methods and among the feature screening methods. According to Table 8, the third part of the experiment did not pass the Friedman test, indicating that the classifiers differ only to a limited extent that is not statistically significant.

Conclusion
Predicting audit opinions of listed companies is a research field with excellent application prospects. It can provide scientific and technical support for auditors to issue audit opinions, reduce audit risks, and improve audit efficiency. However, the scarcity of nonstandard audit opinion companies, the relative diversity of characteristic variables used in the existing literature, and the wide use of various classifiers pose challenges to effective prediction. Given this, we combine batched sparse principal component analysis (BSPCA), kernel fuzzy clustering analysis, and SVM, and put forward a complete model for predicting audit opinions of listed companies in a real market environment. Our experiments are divided into three parts. In the first part, to show that the sparse kernel fuzzy clustering undersampling method can effectively deal with the imbalance of sample categories, we introduce random oversampling, SMOTE, ADASYN, random undersampling, and NearMiss for comparative experiments. The results indicate that our method is the most suitable for sample matching in audit opinion prediction of listed companies. In the second part, we studied feature dimensionality reduction methods for listed companies and compared linear discriminant analysis, principal component analysis, sparse principal component analysis, and batched sparse principal component analysis. We found that batched sparse principal component analysis has apparent advantages. In the third part, after determining the sample matching method and the feature dimension reduction method, we compared the influence of different classifiers on the prediction of audit opinions of listed companies, and the results show that SVM has the best classification and prediction effect.
This study simulates real scenarios to show how these technologies can be appropriately used in auditing listed companies. In particular, this paper combines batched sparse principal component analysis (BSPCA) with the kernel fuzzy clustering algorithm to down-sample normal samples (listed companies with standard audit opinions) in order to balance the training samples. In this process, BSPCA processes the abnormal sample set to obtain the abnormal samples' salient features, which are then used as the features of the normal samples for kernel fuzzy clustering to achieve downsampling. This not only maintains the sample distribution space but also accurately extracts the boundary samples. However, there are still shortcomings in this study. First, this study does not cover listed companies in other industries. In future research, all listed companies in China's A-share market should be collected to further test our method. Second, further research can add more variables, such as the text information in the company's annual report and audit report, which seems to be an essential factor. Finally, a comparative empirical study can also be carried out with data sets obtained from other stock markets. In addition, the methods and processes used in this paper can be extended beyond company audit opinion prediction to other fields, such as company financial distress prediction and bond prediction.

Data Availability
The data used to support the findings of this study are currently under embargo while the research findings are commercialized. Requests for data, 12 months after publication of this article, will be considered by the corresponding author.

Disclosure
Sen Zeng and Yanru Li are co-first authors.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Authors' Contributions
Sen Zeng and Yanru Li contributed equally to this work.