Bank Financial Risk Prediction Model Based on Big Data

Financial risk prediction is an important technique for systematically anticipating risks in banking systems. Current risk prediction methods suffer from poor timeliness and low accuracy, which calls for a more effective approach. As in many other domains, big data technology plays a significant role in financial services and can be used to predict risks accurately and in a timely manner. In this paper, a hybrid method is proposed to predict financial risks in banking systems. The method applies the Lasso and linear regression algorithms using big data features and framework technologies. The bank financial risk problem is first formalized, and the risk data are then obtained and processed. The annual report text data are preprocessed, text features for financial risk prediction are extracted with the Bag-of-Words (BoW) model and the term frequency-inverse document frequency (TF-IDF) weighting method, and the information gain method is used to filter the initial text features. The bank financial risk prediction model is constructed based on the weighted fusion adaptive random subspace algorithm, and the prediction results of the base classifiers are integrated to obtain the final bank financial risk prediction. The experimental results show that the proposed method improves prediction accuracy and takes less time than the compared methods.


Introduction
As important financial institutions, banks have strong financial strength and offer diversified financial services. The safe operation of banks is of great significance to a country's economic security and healthy development [1]. On the surface, a bank is only an intermediary for the circulation of money, but in essence a bank manages risk to obtain benefits. The focus of competition among peers is risk management ability, which not only yields high returns but also reduces risk and is a means of attracting more customers. Financial risk prediction is an emerging research area that aims to predict the risks involved in banking accurately and in a timely manner. With the development of the global economy and the deepening of financial liberalization, the likelihood of a financial crisis breaking out is higher. Moreover, financial data is becoming more vulnerable to damage. Banking is a high-risk industry, and high-risk factors are always involved in bank operation and management. These risk factors may in turn lead to a financial crisis with a wide range of influence [2]. Especially against the background of an increasingly complex financial ecological environment, the mechanism by which financial crises occur is more complex and more destructive. Therefore, it is of great significance to study bank financial risk prediction and to establish an effective model that accurately predicts bank financial risk levels. This will help in preventing and controlling the occurrence of financial crises and/or reducing the losses they cause [3].
At present, scholars in related fields have studied financial risk prediction and achieved some theoretical results. Pawiak et al. [4] proposed a credit score prediction method based on a deep genetic hierarchical network of learners. Credit scoring is an effective and key method used by banks and other financial institutions for risk management. It provides appropriate guidance for issuing loans and reduces risk in the financial field. The method uses the deep genetic hierarchical network of learners, combined with support vector machines, probabilistic neural networks, and fuzzy systems, to realize credit scoring risk prediction. The method is effective and achieves the best prediction performance on the credit score data set. Niu et al. [5] proposed a resampling ensemble evaluation method for P2P loan credit risk based on data distribution. The class imbalance problem is addressed by undersampling according to the distribution of the majority class data. To improve the classification performance of the resampling ensemble model, the base classifier with good overall performance on the validation set is used for classification prediction, which realizes the resampling ensemble evaluation of P2P loan credit risk. This method has good prediction performance. However, the above methods still suffer from low prediction accuracy, long prediction time, and poor overall effect.
To solve the above problems, a bank financial risk prediction method based on big data is proposed. The Lasso and linear regression algorithms are studied using big data characteristics and framework-related technologies. The bank financial risk problem is formalized, and the bank financial risk data are obtained and processed. Using the bag-of-words model and the term frequency-inverse document frequency (TF-IDF) weighting method, the text features for financial risk prediction are extracted. An adaptive fusion method is then used to fuse the financial risk features. Based on the weighted fusion adaptive random subspace algorithm, the bank financial risk prediction model is constructed to realize bank financial risk prediction. This method can improve the accuracy of risk prediction while shortening the prediction time. The rest of the paper is arranged into 4 sections. The technology of big data is elaborated in Section 2. Relevant theories about bank financial risks are discussed in Section 3. The financial risk prediction method based on multisource heterogeneous data is presented in Section 4. The last section, Section 5, presents the conclusion and future work.

Big Data Technology
The buzzword big data refers to the use of software utilities to extract information from large and complex data sets through analysis and statistical measures. Big data technology mines structured and/or unstructured data to obtain meaningful information and to build machine learning models.

Big Data Concept.
Big data refers to a data set that cannot be captured, managed, and processed by conventional software tools within a certain time range. It is a massive, fast-growing, and diversified information asset that requires new processing modes in order to provide stronger decision-making power, insight and discovery power, and process optimization ability [6]. The big data industry takes data as its core. It collects, stores, processes, analyzes, and applies the generated data and presents the results to users, with high processing efficiency and a short cycle. The data processing technology contained in big data makes bank financial risk prediction more scientific.

Big Data Features.
Big data is not simply a huge amount of data; it has four distinctive characteristics, known as the 4 Vs. The industry, as represented by IDC, generally holds that big data is characterized by scale (Volume), diversity (Variety), high speed (Velocity), and value (Value). The 4 V characteristics of big data are shown in Figure 1.

Large Data Scale.
A huge order of magnitude is the basic attribute of big data. With the wide use and development of Internet technology, the number of Internet users is increasing rapidly, and the acquisition and sharing of data is becoming simple. At present, people can quickly and easily obtain a large amount of information through a computer or a mobile phone. In addition, the sharing, clicking, browsing, and trading behaviors of network users on the Internet produce a large amount of data. The volume of big data has jumped from the TB level to the PB level. Banks naturally possess big data: their huge volume of financial transaction data is a natural data pool, from which a bank can easily understand the revenue and expenditure, deposits, and capital operation of its customers.

Categories of Big Data.
There are various types of big data and a wide range of sources. For banking systems, the traditional enterprise financial database can no longer meet the needs of banks. In addition to customer service records, audio, network video, and online banking transaction records are retained by banks. A bank can also obtain more data from website logs, enterprise ERP systems, the Global Positioning System (GPS), e-commerce transaction records, information platforms of government management departments, and other channels. Data types include not only traditional relational data, but also unprocessed, semistructured, and unstructured information.

Fast Processing Speed.
The high frequency of data generation and update is also an important feature of big data. There is a saying about data processing in the era of big data, called the one-second law. Take online financial transactions as an example. On a trading platform, a large amount of financial transaction data and logistics and transportation data is generated every second. The data is generated and transmitted continuously; therefore, larger storage and faster data processing tools are required.

Low Data Value Density.
While the amount of data increases exponentially, the useful information hidden behind the data does not show the due growth proportion. Moreover, it is becoming increasingly difficult to obtain useful information. For banks, how to find useful information from a large amount of enterprise information is a problem. Because banks have strong financial strength, they can seek cooperation with professional data providers. At present, data providers represented by professional financial data service providers such as ninth power, IBM, and Intel provide banks with financial big data collection, analysis, and mining services to help banks mine data value.

Big Data Framework-Related Technologies.
Big data frameworks provide a systematic way of organizing and processing datasets so as to overcome the barriers to extracting information from data. Such frameworks become necessary when datasets are so large and unwieldy that meaning and/or information cannot easily be deduced from them. The following are some of the big data frameworks.

The HDFS File System.
Hadoop is a mainstream distributed framework for processing big data. Hadoop can handle data at the PB level and allows programs to perform distributed operations over thousands of nodes [7]. Hadoop has two core modules: (1) the Hadoop Distributed File System (HDFS) and (2) the MapReduce computing framework. HDFS is a distributed file system that can run on commodity hardware, whereas MapReduce is used to realize distributed parallel computing.
The structure of the HDFS distributed file system is shown in Figure 2.
HDFS has a master-slave architecture. An HDFS cluster is composed of one name node and several data nodes. Usually, each machine in the cluster runs one data node, which manages the storage attached to that machine. The name node manages the file system namespace and handles client requests, while the data nodes are mainly used for data storage. HDFS exposes a file system namespace and allows user data to be stored as files. The data nodes also execute block creation, deletion, and replication instructions issued by the name node.

Spark Distributed Computing Framework.
The Spark distributed computing architecture is currently the most popular big data computing framework. Compared with Hadoop's MapReduce framework, Spark performs its calculations in memory, so its computing performance is much better than that of MapReduce. The Spark distributed computing framework is shown in Figure 3.
The main modules of the Spark framework are the Spark SQL data processing module, the Spark Streaming data processing module, the MLlib library module, which encapsulates mainstream machine learning algorithms, and the GraphX graph computing module [8]. Spark SQL is mainly used for data analysis, extraction, and index summarization. Spark Streaming is usually used for log analysis together with the open source Kafka and Flume components of the Hadoop ecosystem. MLlib provides mainstream machine learning algorithms for classification, clustering, and recommendation, which makes it convenient for data scientists to use Spark for data mining.

Machine Learning and Statistics Related Algorithms.
Machine learning algorithms are programs that automatically learn from data and improve their performance with experience. Conventional algorithms take a program and data to produce output, whereas a machine learning algorithm takes data and output and generates the program without human intervention. The following machine learning algorithms are used in the domain of risk prediction.

Lasso Algorithm.
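In statistics and machine learning, the Lasso algorithm is a regression analysis method that performs feature selection and regularization simultaneously. The algorithm aims to improve the prediction accuracy and interpretability of the statistical model [9]. By forcing the sum of the absolute values of the regression coefficients to be less than a fixed threshold, some regression coefficients are forced to become zero. Variable selection is thereby performed on the variables corresponding to these coefficients, so as to build a sparser model:
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t \tag{1}$$
In formula (1), $t$ is the adjustment coefficient, and $t$ and $\lambda$ correspond one-to-one. It is equivalent to
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \tag{2}$$
Let
$$t_0 = \sum_{j=1}^{p} |\hat{\beta}_j^{OLS}| \tag{3}$$
In formula (3), $\hat{\beta}^{OLS}$ is the estimate obtained by the ordinary least squares method. When $t < t_0$, part of the coefficients are compressed to 0, and the dimension of $X$ is reduced, which achieves the purpose of dimensionality reduction.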
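The shrinkage behaviour described above, in which some regression coefficients are forced to exactly zero, can be illustrated with a minimal sketch. It assumes the penalised form of the Lasso (squared error plus `lam` times the sum of absolute coefficients) and solves it by coordinate descent with soft-thresholding; the function names, the solver choice, and the toy data are illustrative assumptions and are not taken from the paper.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; anything within [-gamma, gamma] becomes 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise sum_i (y_i - x_i . beta)^2 + lam * sum_j |beta_j|.

    X is a list of rows and y a list of targets (hypothetical helper;
    every column of X must contain at least one nonzero entry).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0      # correlation of feature j with the partial residual
            norm_sq = 0.0  # squared length of feature j
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
                norm_sq += X[i][j] ** 2
            # Exact minimiser of the objective along coordinate j.
            beta[j] = soft_threshold(rho, lam / 2.0) / norm_sq
    return beta


# Toy data (assumed): y depends strongly on feature 0 and weakly on feature 1.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [3.1, 2.9, -2.9, -3.1]
print(lasso_coordinate_descent(X, y, lam=0.0))  # no penalty: least squares fit
print(lasso_coordinate_descent(X, y, lam=2.0))  # penalty zeroes the weak features
```

With the penalty switched on, the weakly related features drop out of the model, which is the dimensionality reduction effect the text refers to.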

Linear Regression.
The basic idea of the linear regression method is to characterize the input data with a linear model and to estimate the parameters of the model by the least squares method, under the principle of minimizing the mean square error [10]. Suppose the input data set is $D$, where $D$ has $d$ features and $m$ samples, and $x_i$ is the $i$-th sample. The multiple linear regression model is then described as follows:
$$f(x_i) = w^T x_i + b \tag{4}$$
When $X^T X$ is a full-rank or positive definite matrix, the weight parameter of the features can be obtained as
$$\hat{w} = (X^T X)^{-1} X^T y \tag{5}$$
where $X$ is the sample matrix, with a constant column absorbing $b$, and $y$ is the vector of target values. When $X^T X$ is not a full-rank or positive definite matrix, the optimal solution obtained by parameter estimation is not unique, and the variance of the model can be reduced by adding regularization constraints.
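As a rough illustration of the two cases above, the sketch below solves the least squares normal equations directly and optionally adds a small ridge term `alpha` to the diagonal when $X^T X$ is not full rank. The ridge term is only one possible "regular constraint" and is an assumption here, as are the helper names and the toy data.

```python
def solve_linear_system(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; add a regularisation term")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w


def fit_linear_regression(X, y, alpha=0.0):
    """Return w solving (X^T X + alpha * I) w = X^T y.

    alpha = 0 gives ordinary least squares; alpha > 0 is a ridge-style
    regulariser (an assumed choice of constraint, not the paper's).
    """
    d = len(X[0])
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(d)] for a in range(d)]
    for a in range(d):
        xtx[a][a] += alpha
    xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(d)]
    return solve_linear_system(xtx, xty)


# Toy data (assumed): first column is the constant term, so y = 1 + 2 * x.
X_full = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
print(fit_linear_regression(X_full, [1.0, 3.0, 5.0]))

# Two identical columns make X^T X singular; a small alpha restores uniqueness.
X_rank_deficient = [[1.0, 1.0], [2.0, 2.0]]
print(fit_linear_regression(X_rank_deficient, [2.0, 4.0], alpha=0.1))
```

In the rank-deficient case the regularised solution splits the weight evenly between the two identical columns, which is one way of picking a unique answer among the infinitely many least squares solutions.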

Relevant Theories of Bank Financial Risk
Financial risk management is a very important area in banking. Risk management in the realm of banking aims to systematically model the possibility of issues which in the long run may affect financial marketing and/or financial tweets.

Financial Risk Concept.
The general definition of financial risk is the possibility of losses to financiers in the process of financial service transactions. It may also refer to the possibility that the actual income is lower than the expected income, or that the actual cost is higher than the expected cost [11]. From the perspective of the operation of financial institutions, this paper defines financial risk as the likelihood that banks suffer losses under the influence of various uncertain factors in the process of financial activities such as fund-raising and fund utilization. This manifests itself as the actual income being lower than the operating cost.

Financial Risk Characteristics.
The characteristics of financial risk are divided into five categories: objectivity, uncertainty, latency, controllability, and periodicity. Details of the characteristics are given as follows.

Objectivity.
Financial risk accompanies financial activities. As long as there are financial activities, there must be relevant risks. The continuous innovation of derivative financial instruments not only promotes financial development, but also brings new risks. Moreover, the occurrence of financial risks in a financial institution will inevitably affect its creditors and may further affect all aspects of economic operation.

Uncertainty.
Financial institutions conduct business or decision-making activities in an uncertain environment; that is, the operating environment of financial business activities is constantly developing and changing, while it is difficult for the actors to accurately predict the future, and financial risks may arise at any time.

Latency.
Financial risk is often manifested as the outbreak of financial crisis. In fact, financial activities may cover up some uncertain losses due to their own characteristics.

Controllability.
Although uncertain changes in the economic situation may bring risks, the risks can be effectively controlled as long as targeted measures are taken. Scientific Programming

Periodicity.
Each financial institution operates in an established financial ecological environment, and the financial environment is in turn affected by the overall economic environment. Therefore, when periodic fluctuations of the economy and orderly changes of monetary policy appear, cyclical financial risks are easy to identify, which makes the monitoring of financial risks possible.

Financial Risk Classification.
According to the scope of occurrence and influence of financial risk, this paper divides the risks into systematic financial risk and nonsystematic financial risk. Details of the risks are given in the following subsections.

Systemic Financial Risk.
The systematic financial risk refers to the overall risk of the market, including the impact of economic, political, social, and other environmental factors in the financial ecological environment on the whole market. Changes in external environmental factors may lead to financial crises in some banks and chain crises in the whole financial system. Therefore, only through a reasonable evaluation of the macroeconomic situation in a certain period of time can we identify the systemic financial risks faced by a country or region.

Nonsystematic Financial Risk.
Nonsystematic risks refer to the possible losses caused by individual financial institutions in the financial industry. In the process of financial activities, these risks are regarded as diversifiable (decentralized) risks. Nonsystematic financial risks can be reduced or even eliminated by improving bank management and asset allocation.

Bank Financial Risk Prediction Method Integrating Multisource Heterogeneous Data
This research work focuses on bank financial risks and constructs a multisource heterogeneous feature set. It proposes a banking financial risk prediction method that integrates multisource heterogeneous data.

Formal Definition of the Problem.
In order to express the proposed method clearly, a formal definition is given before the specific method is introduced. Assume that a given data set $D$ contains $n$ samples and is defined as $D = \{(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_n, y_n)\}^T$, where $x_i \in R^p$ and the category label is $y_i \in \{-1, 1\}$. Suppose the number of features is $p$; then the feature space is $X = (X^{(1)}, \ldots, X^{(J)})$ with $p = \sum_{j=1}^{J} p_j$, where $J$ represents the number of different data sources and $p_j$ is the number of features extracted from the $j$-th data source. $W = (w_1, w_2, \ldots, w_p)^T \in R^p_+$ is the weight vector, and $|\cdot|$ represents the $L_1$ norm. For the linear regression model, the hypothesis is $y_i = \sum_{j=1}^{J} (x_i^{(j)})^T \beta^{(j)} + e_i$, where $\beta^{(j)} = (\beta^{(j)}_1, \ldots, \beta^{(j)}_{p_j}) \in R^{p_j}$ is the regression coefficient. Let the residual term $e_i$ be an independent and identically distributed random variable that follows a normal distribution with a mean of 0 and a variance of $\sigma^2$. To this end, all feature vectors are normalized and centralized, that is, $\sum_{i=1}^{n} x_{ij} = 0$ and $\|x_j\|_2 = 1$.
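The centralization and normalization condition at the end of the definition (each feature column sums to zero and has unit Euclidean length) can be sketched as follows. The function name and the toy matrix are assumptions made for illustration only.

```python
import math


def standardize_columns(X):
    """Center every column of X to sum 0 and scale it to Euclidean norm 1.

    X is a list of rows. A constant column cannot be scaled to unit length,
    so it raises ValueError (a design choice of this sketch).
    """
    n, p = len(X), len(X[0])
    Z = [[0.0] * p for _ in range(n)]
    for j in range(p):
        column = [X[i][j] for i in range(n)]
        mean = sum(column) / n
        centered = [v - mean for v in column]
        norm = math.sqrt(sum(v * v for v in centered))
        if norm == 0.0:
            raise ValueError("column %d is constant" % j)
        for i in range(n):
            Z[i][j] = centered[i] / norm
    return Z


# Toy feature matrix (assumed): three samples, two features on different scales.
Z = standardize_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
for j in range(2):
    print(sum(row[j] for row in Z), sum(row[j] ** 2 for row in Z))
```

Putting all features on the same scale in this way keeps a single L1 penalty from favouring features merely because of their units.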

Data Acquisition and Processing.
Bank financial risk prediction information can be divided into financial information and nonfinancial information. This information yields quantitative financial features as well as nonfinancial features based on qualitative description. The financial features can be calculated and extracted from the accounting information in the financial statements regularly issued by the bank. The nonfinancial features can be extracted from disclosure data in the form of financial reports, news, and other text related to the bank. Generally speaking, this information is regularly published on network platforms and is easy to obtain. The bank financial risk prediction data set collected in this study is described in detail in the experimental section. In addition, financial data can be transformed into structured data after simple processing and used directly as the input of a learning algorithm. The nonfinancial data in text form can be used for learning only after word segmentation, cleaning, filtering, and other natural language processing steps.

Feature Extraction of Financial Risk Prediction.
First, the collected annual report text data are preprocessed, and then unigrams, bigrams, and trigrams are extracted as text features by using the bag-of-words model and the term frequency-inverse document frequency (TF-IDF) weighting method. Text features are naturally high-dimensional and may contain redundant and irrelevant features [12]. Therefore, the information gain method is further used to filter the extracted initial text features, and the important features are retained to ensure the quality of the features. The information gain IG(Y, F) is calculated as follows:
$$IG(Y, F) = H(Y) - H(Y|F) \tag{6}$$
$$H(Y) = -\sum_{y} p(y) \log p(y) \tag{7}$$
$$H(Y|F) = -\sum_{f} p(f) \sum_{y} p(y|f) \log p(y|f) \tag{8}$$
In formulas (6)-(8), IG(Y, F) represents the decrease in the information entropy of category Y when feature F is added, H(Y) represents the information entropy of the category, p(y) represents the probability of category y, H(Y|F) represents the information entropy of category Y under the condition of feature F, and p(y|f) represents the probability of category y under a single feature condition f. In the process of filtering text features, all unigrams, bigrams, and trigrams with an information gain greater than 0.0025 are retained as important text features. In order to fully explore the role of different characteristics in bank financial risk prediction, the above characteristics are fully combined, and the combined characteristics are expressed by formula (9), in which F1 represents the set of extracted financial features, F2 represents the set of emotional features, and F3 represents the set of text features.
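The information gain filter described above can be sketched in a few lines. The sketch assumes each n-gram is reduced to a presence/absence indicator per document and that entropy is measured in bits (base-2 logarithm); the paper states neither detail, and the function names and toy documents are hypothetical. Only the 0.0025 threshold comes from the text.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(Y, F) = H(Y) - H(Y|F) for one discrete feature."""
    total = len(labels)
    conditional = 0.0
    for f in set(feature_values):
        subset = [y for v, y in zip(feature_values, labels) if v == f]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


def select_ngrams(presence_by_ngram, labels, threshold=0.0025):
    """Keep the n-grams whose information gain exceeds the threshold."""
    return [ngram for ngram, presence in presence_by_ngram.items()
            if information_gain(presence, labels) > threshold]


# Toy example (assumed): four documents, labels 1 = risk, -1 = normal.
labels = [1, 1, -1, -1]
presence = {
    "default risk": [1, 1, 0, 0],   # appears only in risk documents
    "annual report": [1, 0, 1, 0],  # appears equally in both classes
}
print(select_ngrams(presence, labels))
```

An n-gram that occurs equally often in both classes carries no information about the label and is filtered out, while one concentrated in a single class is kept.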

Construction of Financial Risk Prediction Model.
Considering the demand for adaptive fusion of multisource data in bank financial risk prediction, and the respective advantages of the random subspace method, the adaptive Lasso method, and the weighted fusion Lasso method for the prediction problem [13], this study proposes a financial risk prediction method based on weighted fusion adaptive random subspace. This method includes three main modules: first, the constructed adaptive fusion method is used to fuse the features; second, the base classifiers are constructed; and finally, the learning results of the base classifiers are integrated. The flow of the financial risk prediction method based on weighted fusion adaptive random subspace is shown in Figure 4. The goal of the first stage of the method is to perform adaptive feature fusion to obtain the sampling weights $W = (w_1, w_2, \ldots, w_p)^T \in R^p_+$ of the features. To this end, first consider the classic Lasso model, which has the following form:
$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j| \tag{10}$$
In formula (10), $\lambda$ represents the regularization penalty parameter. After the weighted fusion adaptive estimation is performed on the features, a weight vector composed of the regression coefficients corresponding to each feature is obtained. Features with a weight of 0 will not be adopted; conversely, the greater the weight, the greater the probability that the feature is selected. When fusing multisource data, it is necessary to consider the impact of the relationships between different features on the prediction results. Therefore, the weighted fusion Lasso model is introduced on the basis of the Lasso model, and its form is as follows:
$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \frac{\lambda_2}{p} \sum_{i<j} a_{ij} (\beta_i - s_{ij}\beta_j)^2 \tag{11}$$
In formula (11), $\frac{\lambda_2}{p} \sum_{i<j} a_{ij} (\beta_i - s_{ij}\beta_j)^2$ is the penalty term, $a_{ij} = \rho_{ij} / (1 - \rho_{ij})$, $s_{ij} = \mathrm{sgn}(\rho_{ij})$, which equals $+1$ when $\rho_{ij} > 0$ and $-1$ when $\rho_{ij} < 0$, and $\rho_{ij}$ is the correlation coefficient between any two features $x_i$ and $x_j$.
Through the weighted fusion Lasso model, related features can be screened out or retained together, which effectively alleviates the problem of multicollinearity between features and improves the stability of the model. In order to fuse different features adaptively, this research jointly considers the Lasso, the weighted fusion Lasso, the adaptive Lasso, and other methods and proposes a new regularized sparse model, the weighted fusion adaptive Lasso, whose form is as follows:
$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_1 \sum_{j=1}^{p} \hat{\omega}_j |\beta_j| + \frac{\lambda_2}{p} \sum_{i<j} a_{ij} (\beta_i - s_{ij}\beta_j)^2 \tag{12}$$
In formula (12), $\hat{\omega}_j = 1 / |\hat{\beta}_j|$ is the adaptive weight. That is, before the weighted fusion adaptive Lasso estimation is performed, a Lasso estimation is first performed to obtain a regression coefficient vector $\hat{\beta}$, and the inverse of each coefficient is added to the weighted fusion adaptive Lasso as the adaptive weight of the corresponding feature. In this way, different features can be penalized according to their importance, the model becomes an unbiased estimation, and a more accurate feature subset can be obtained [14].
Through the weighted fusion adaptive Lasso estimation, the adaptive feature weights $W = (w_1, w_2, \ldots, w_p)^T \in R^p_+$ based on weighted fusion can be obtained. After using these weights to perform probability sampling on the features, the data subset [...] the higher the characteristic dimension of the sample subset.
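A minimal sketch of the weighted probability sampling step follows. It assumes the subspace is drawn without replacement, with each feature's chance of being picked proportional to its weight, so that zero-weight features are never used; the exact sampling scheme, the subspace size, and all names below are assumptions rather than details given in the paper.

```python
import random


def sample_feature_subspace(weights, subspace_size, rng):
    """Draw feature indices without replacement, proportionally to their weights.

    Features with weight 0 are excluded up front and can never be selected.
    """
    remaining = [j for j, w in enumerate(weights) if w > 0]
    if subspace_size > len(remaining):
        raise ValueError("not enough features with positive weight")
    chosen = []
    for _ in range(subspace_size):
        picked = rng.choices(remaining, weights=[weights[j] for j in remaining], k=1)[0]
        chosen.append(picked)
        remaining.remove(picked)
    return sorted(chosen)


# Hypothetical weights for four features; feature 1 was zeroed by the Lasso step.
weights = [0.5, 0.0, 0.3, 0.2]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
subspaces = [sample_feature_subspace(weights, 2, rng) for _ in range(5)]
print(subspaces)
```

Each sampled index list would then define the columns of one training subset for one base classifier.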
In the second stage, the financial risk prediction method based on weighted fusion adaptive random subspace first determines the base classifier and then uses the data subsets obtained in the first stage to train the base classifiers. When the training samples are linearly separable, the hyperplane in the sample space is represented as follows:
$$w^T x + b = 0 \tag{13}$$
In formula (13), the normal vector $w = [w_1, w_2, \ldots, w_d]$ and the displacement $b$ determine the direction of the hyperplane and its distance from the origin, respectively. The distance from any sample point $x_i$ to the hyperplane is
$$r = \frac{|w^T x_i + b|}{\|w\|} \tag{14}$$
If the hyperplane $(w, b)$ correctly classifies the sample $(x_i, y_i) \in D$, then
$$\begin{cases} w^T x_i + b \ge +1, & y_i = +1 \\ w^T x_i + b \le -1, & y_i = -1 \end{cases} \tag{15}$$
In formula (15), the sample points for which equality holds are the support vectors. From a geometric point of view, the support vectors are the sample points on the two classification boundaries $w^T x + b = \pm 1$. The classification boundary is only related to these support vectors. The sum of the distances from two support vectors of different classes to the hyperplane is
$$\gamma = \frac{2}{\|w\|} \tag{16}$$
SVM can effectively deal with learning tasks with few samples, high feature dimensions, and nonlinear relationships between features [15]. Therefore, in the face of high-dimensional text data, this research chooses SVM as the base classifier of the financial risk prediction method based on weighted fusion adaptive random subspace. In the third stage, the method adopts the majority voting strategy to synthesize the learning results of the base classifiers. Assuming that the set of category labels is $\{c_1, c_2, \ldots, c_N\}$ and that the output of classifier $h_i$ on sample $x$ is the $N$-dimensional vector $(h_i^1(x), \ldots, h_i^N(x))$, where $h_i^j(x)$ is the output of $h_i$ for category $c_j$, the majority voting method over $T$ base classifiers is expressed as follows:
$$H(x) = \begin{cases} c_j, & \text{if } \sum_{i=1}^{T} h_i^j(x) > 0.5 \sum_{k=1}^{N} \sum_{i=1}^{T} h_i^k(x) \\ \text{null}, & \text{otherwise} \end{cases} \tag{17}$$
According to formula (17), when a certain category label gets more than half of the votes, the majority voting method uses it as the final output label. The counterpart of the majority voting method is the plurality (relative majority) voting method, which is calculated as follows:
$$H(x) = c_{\arg\max_{j} \sum_{i=1}^{T} h_i^j(x)} \tag{18}$$
In formula (18), the category with the most votes is used as the final output category to obtain the final integrated prediction result. Through the above steps, bank financial risk prediction is realized.
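The difference between the two voting rules can be sketched as follows, assuming each base classifier outputs a single hard label per sample. The function names and the example votes are illustrative assumptions.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label with more than half of the votes, or None if there is none."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else None


def plurality_vote(predictions):
    """Return the label with the most votes.

    On a tie, the label that reached the top count first in the input wins,
    which is an arbitrary tie-breaking choice of this sketch.
    """
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs of five base classifiers for one bank (1 = risk, -1 = normal).
votes = [1, -1, 1, 1, -1]
print(majority_vote(votes), plurality_vote(votes))

# With three labels and no label above 50%, majority voting abstains.
print(majority_vote(["high", "low", "medium", "medium"]))
print(plurality_vote(["high", "low", "medium", "medium"]))
```

Majority voting can refuse to answer, which is useful when a wrong prediction is costly, whereas plurality voting always returns a label.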

Experimental Analysis
To evaluate the proposed method, experiments were conducted on real data obtained from commercial banks. Details of the evaluation procedure, along with a comparison against some state-of-the-art methods, are presented in the following subsections.

Experimental Environment and Data.
In order to verify the effectiveness of the banking financial risk prediction method based on big data, the experiment used a Spark cluster as the experimental environment and adopted the Spark on YARN operation mode. In this study, 26 commercial banks were selected as experimental samples, and ST markers were used as the sign that a bank was in financial risk; 871 normal samples and 129 risk samples were obtained. From a feature point of view, the experimental data set consists of 39 financial features, 12 emotional features, and qualitative text features. For the extraction of sentiment words, the CNKI sentiment dictionary and the legal-related Sogou sentiment dictionary were used. These vocabularies cover various sentiments, such as positive and negative sentiment, strong and weak modal sentiment, and uncertain sentiment.

Risk Prediction and Evaluation Indicators.
This article uses average accuracy rate, error rate, and prediction time as evaluation indicators. The average accuracy rate refers to the ratio of the number of correctly predicted samples to the total number of predicted samples. The greater the average accuracy rate, the higher the prediction accuracy. The calculation formula is
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{19}$$
In formula (19), TP represents a true positive case, TN represents a true negative case, FP represents a false positive case, and FN represents a false negative case. The error rate refers to the ratio of the number of samples with prediction errors to the total number of samples. The smaller the error rate, the better the prediction effect. The calculation formula is
$$\text{Error rate} = \frac{FP + FN}{TP + TN + FP + FN} \tag{20}$$
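The two ratio-based indicators defined above can be computed directly from confusion-matrix counts, as in the sketch below. The counts in the example are made-up toy values and are not results from the paper.

```python
def accuracy(tp, tn, fp, fn):
    """Share of correctly predicted samples among all predicted samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def error_rate(tp, tn, fp, fn):
    """Share of wrongly predicted samples among all predicted samples."""
    return (fp + fn) / (tp + tn + fp + fn)


# Toy counts (assumed): 20 samples, 18 of them classified correctly.
counts = dict(tp=8, tn=10, fp=1, fn=1)
print(accuracy(**counts), error_rate(**counts))
```

Under these definitions the two indicators always sum to 1 for the same set of predictions.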

Comparison of the Accuracy of Bank Financial Risk Prediction.
In order to verify the prediction accuracy of the proposed method, the methods of [4,5] are compared with the proposed method. The average accuracy of the different methods is depicted in Figure 5. It can be seen from Figure 5 that, across the different total numbers of data samples, the average accuracy of the method in [4] is 75%, the average accuracy of the method in [5] is 73%, and the average accuracy of the proposed method is 92%. Therefore, compared with the methods of Pawiak et al. [4] and Niu et al. [5], the proposed method achieves a higher average accuracy in bank financial risk prediction.

Comparison of Bank Financial Risk Prediction Results.
To further verify the prediction effect of the proposed method, it is compared with the methods of Pawiak et al. [4] and Niu et al. [5]. The bank financial risk prediction error rates of the different methods are shown in Figure 6.
It is clear from Figure 6 that, across the different total numbers of data samples, the average error rate of bank financial risk prediction of the method in [4] is 4.4%, that of the method in [5] is 7.8%, and that of the proposed method is only 1.1%. Compared with the methods of Pawiak et al. [4] and Niu et al. [5], the average error rate of the proposed method is smaller. Hence, the proposed method predicts bank financial risk more reliably.

Comparison of Bank Financial Risk Prediction Time.
On this basis, the prediction time of the proposed method is verified. The methods of [4,5] and the proposed method were compared in terms of risk prediction time. The comparison results of bank financial risk prediction time of different methods are shown in Table 1.
According to the data in Table 1, as the total number of data samples increases, the bank financial risk prediction time of different methods increases. When the total number of data samples is 1000, the bank financial risk prediction time of the method of [4] is 22.9 s, the bank financial risk prediction time of the method of [5] is 31.5 s, and the bank financial risk prediction time of the proposed method is only 13.3 s. It can be seen that, compared with the method of [4] and the method of [5], the bank financial risk prediction time of the proposed method is shorter.

Conclusion
A bank financial risk prediction method based on big data is proposed in this paper. The method is intended to make full use of big data technology. The bank financial risk prediction accuracy of the proposed method is high. Moreover, the method can effectively shorten the bank financial risk prediction time and has a good risk prediction effect. However, due to the limitation of data acquisition channels, this study has not considered the prediction effect of other feasible and useful data sources. Therefore, in future research, we plan to further expand the multisource information and to collect bank financial risk data in real time. This will help to verify the effect of the bank financial risk prediction model. In addition, the model will be augmented to make the prediction results more accurate.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest or personal relationships that could have appeared to influence the work reported in this paper.