Research on Clustering Algorithm Based on Improved SOM Neural Network

A clustering algorithm is a statistical method for studying sample classification. With the rapid development of science and technology, the requirements for data classification keep rising, so research on clustering has grown accordingly, and various mathematical algorithms have been introduced to further improve clustering accuracy. This paper therefore proposes an improved SOM neural network algorithm to evaluate the comprehensive quality of students. A SOM neural network can automatically discover the internal laws and essential attributes of the samples, self-organize and adaptively change the network parameters and structure, and thereby classify the samples. Factor analysis is introduced to reduce the dimension of the input layer in the SOM neural network, handle high-dimensional data better, and improve the speed and accuracy of the algorithm. The improved SOM neural network algorithm is then applied to cluster analysis of the comprehensive quality of college students. The simulation results show that the improved algorithm can evaluate students' comprehensive quality intuitively and reflect the overall characteristics of each type of student.


Introduction
With the advent of the era of big data, the sources of data are becoming richer and richer, and the amount of data also shows a trend of rapid growth. Research on and mining of the important information contained in data have become a specialized field. At present, data mining technology is widely used in various domains, such as economics, finance, transportation, commerce, and education. Cluster analysis is an important task in data mining: it can find the laws in the data and express them in visual form. There are already many applications of data mining in the field of education, such as the evaluation of students' comprehensive quality. These assessments are also an important basis for students to strengthen their learning, for teachers to adjust their teaching, and for schools to arrange courses.
There are many methods for the evaluation of students' comprehensive quality, such as the analytic hierarchy process adopted by Lin [1], the adaptive multiminimum support association algorithm and SOM neural network algorithm of Xie [2], and the SVM method used by Yang et al. [3]. The SOM neural network adopted in this paper is also widely used in practical life. For example, Chen [4] improved the clustering algorithm SOM-K-means to crawl and classify the network water army, which is of great significance to the governance of the network water army. Wu [5] proposed an improved clustering algorithm, SOM-K-medoids-CH, which can effectively and accurately divide a large number of bank customers, mine their potential needs, and sell the right products to the right customers at the right time.
However, the data for evaluating students are multidimensional, the subject scores are diverse, and the correlations between subjects are relatively complex [6]. Students can be divided into different categories by applying a clustering method directly to the data [7]. Yet for researchers, it is difficult to observe the commonalities within each type of student directly from the classification results, because the data are large and complex. Moreover, the result of the SOM neural network algorithm is also greatly affected by the input samples [8]. Therefore, in view of the above problems, this paper introduces factor analysis into the SOM algorithm model to eliminate correlation effects, extract the important indicators in the data, and analyze and verify the classification results.

Basic Theory of Factor Analysis
Factor analysis was first proposed by British psychologist C. E. Spearman. In his research, he found that there was a certain correlation between students' grades in various subjects and then speculated whether there were some potential common factors affecting students' academic performance. Factor analysis can find out the hidden representative factors in many variables and classify the variables with the same essence into one factor, which can reduce the number of variables and test the hypothesis of the relationship between variables [9][10][11].
In factor analysis, the factors are uncorrelated with one another, and every variable can be expressed as a linear combination of the common factors. Suppose there are n samples and p indicators, and X = (X_1, X_2, ..., X_p)^T is a random vector. If the common factors to be found are F = (F_1, F_2, ..., F_m)^T, the factor model is

X = AF + ε. (1)

The matrix A = (a_ij) is called the factor load matrix; a_ij reflects the importance of the variable X_i to the common factor F_j. As a special factor, ε represents the variation of the variables caused by influences other than the common factors, and it can be ignored in practical analysis [12, 13]. The model obtained by factor analysis is not affected by dimension, and its factor load matrix is not unique. When the factor loads are complex and difficult to interpret reasonably, a new factor load matrix can be obtained by factor rotation, and its analytical meaning will then be more obvious.
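As a concrete illustration of the factor model X = AF + ε, the following Python sketch (toy data with an assumed load matrix, not the paper's dataset) simulates observations from known loadings and checks that the covariance of the observed indicators matches the model-implied covariance A·Aᵀ + diag(ψ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Factor model X = A F + eps, with m = 2 common factors and p = 4 indicators.
A = np.array([[0.9, 0.0],
              [0.8, 0.1],
              [0.1, 0.7],
              [0.0, 0.9]])             # factor load matrix (p x m), assumed values
psi = np.array([0.2, 0.3, 0.25, 0.2])  # variances of the special factors eps

n = 200_000
F = rng.standard_normal((n, 2))          # uncorrelated common factors
eps = rng.standard_normal((n, 4)) * np.sqrt(psi)
X = F @ A.T + eps                        # observed indicators

# Under the model, Cov(X) = A A^T + diag(psi); compare with the sample covariance.
model_cov = A @ A.T + np.diag(psi)
sample_cov = np.cov(X, rowvar=False)
print(np.max(np.abs(sample_cov - model_cov)))  # small for large n
```

The check works because the common factors and special factors are uncorrelated, so the covariance decomposes exactly as the model states.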

Self-Organizing Mapping Network

The self-organizing feature mapping network was proposed by Professor T. Kohonen of Helsinki University in Finland in 1981 and is called the SOM network for short. Kohonen held that when a neural network receives external input, each region of the network responds with different characteristics, and this process is completed automatically.
A typical feature of a feature mapping network is that it can be divided into an input layer and a competition layer arranged on a one-dimensional or two-dimensional processing-unit array. After self-organizing training, the neurons in the competition layer are orderly arranged: neurons with similar functions are very close together, and neurons with different functions are far apart.
The SOM network adopts the Kohonen algorithm, in which the influence of a winning neuron on its adjacent neurons decreases from near to far, turning from excitation into inhibition. Therefore, not only does the winning neuron adjust its weights, but the surrounding neurons also adjust theirs accordingly. The steps of the learning algorithm are as follows. (1) Network initialization: set the initial weights between the input layer and the mapping layer to random numbers. (2) Normalization and input: normalize the data and input the vector x = (x_1, x_2, x_3, ..., x_n)^T to the input layer. (3) Distance calculation: compute the distance between each mapping-layer weight vector and the input vector. The distance between the j-th neuron of the mapping layer and the input vector is d_j = ||x − w_j|| = (Σ_{i=1}^{n} (x_i − w_{ij})^2)^{1/2}, where w_{ij} is the weight between neuron i of the input layer and neuron j of the mapping layer; the neuron with the smallest distance is the winning neuron. (4) Define the winning neighborhood: determine the set of neurons around the winning neuron j* whose weights will be updated. (5) Weight learning: the weights of the winning neuron and its adjacent neurons are updated according to w_{ij}(t + 1) = w_{ij}(t) + η h(j, j*)(x_i − w_{ij}(t)), with the neighborhood function h(j, j*) = exp(−||r_j − r_{j*}||^2 / (2σ^2)), where r_j and r_{j*} are the positions of neuron j and the winning neuron j* on the mapping layer, η is the learning-rate constant, and σ^2 decreases as learning progresses.
(6) If the requirements are met, output the results; otherwise, return to step (3) and continue.
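The learning steps above can be sketched in a minimal NumPy implementation. This is an illustrative toy, not the MATLAB code used in the paper; the linear decay schedules for η and σ and the Gaussian neighborhood are assumptions:

```python
import numpy as np

def train_som(data, grid=(2, 2), epochs=100, eta0=0.5, sigma0=1.0, seed=0):
    """Minimal Kohonen SOM following the learning steps in the text."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    n_units, dim = rows * cols, data.shape[1]
    # (1) initialise the input-to-map weights with random numbers
    w = rng.random((n_units, dim))
    # grid coordinates of each map unit, used for the neighbourhood function
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for t in range(epochs):
        eta = eta0 * (1 - t / epochs)              # learning rate decays
        sigma = sigma0 * (1 - t / epochs) + 1e-3   # neighbourhood radius shrinks
        for idx in rng.permutation(data.shape[0]):
            v = data[idx]
            # (3) distance between every unit's weight vector and the input
            d = np.linalg.norm(w - v, axis=1)
            win = int(np.argmin(d))                # winning neuron
            # (4)-(5) Gaussian neighbourhood h and weight update
            g = np.exp(-np.sum((coords - coords[win]) ** 2, axis=1)
                       / (2 * sigma ** 2))
            w += eta * g[:, None] * (v - w)
    return w

def bmu(w, v):
    """Index of the best-matching (winning) unit for input v."""
    return int(np.argmin(np.linalg.norm(w - v, axis=1)))

# toy check: two well-separated groups should map to different winning neurons
rng = np.random.default_rng(1)
a = rng.normal(0.0, 0.05, (50, 2))
b = rng.normal(1.0, 0.05, (50, 2))
w = train_som(np.vstack([a, b]), grid=(2, 2), epochs=50)
print(bmu(w, a.mean(0)), bmu(w, b.mean(0)))
```

Because the neighborhood radius shrinks over time, early epochs order the whole map while late epochs fine-tune individual winners, which is exactly the near-to-far, excitation-to-inhibition behavior described above.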

Improved SOM Learning Algorithm
In the improved SOM algorithm, a factor analysis layer is added before the SOM sample data are input. After the data enter the factor analysis layer, the factor load matrix table is obtained by reducing the dimensionality of the data through factor analysis. By observing the load matrix table, we can see the commonality of each factor after dimensionality reduction, extract the representative factors, and name them according to that commonality. Then, the extracted data are fed into the input layer of the SOM model and transmitted to the neurons of the competition layer [14, 15]. The improved SOM neural network model is shown in Figure 1. The first layer is factor analysis: by inputting n samples and p indicators X = (X_1, X_2, ..., X_p)^T, the dimensionality of the data is reduced and standardized, the factors F = (F_1, F_2, ..., F_m)^T are output, and the factors are named. The second layer is the input layer, which acts as a transfer station: it connects the processed data with the competition layer and is responsible for transmission. The third layer is the competition layer: the normalized data find the winning neuron by calculating the distance between the weight vectors of the mapping layer and the input vector, the weights of the adjacent neurons are updated, and the result is output once the stopping condition is met.
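The three-layer pipeline described above can be sketched as follows. scikit-learn's FactorAnalysis stands in for the factor analysis layer, and, for brevity, a 4-cluster KMeans stands in for the 2 × 2 SOM competition layer; the input is a random placeholder for the 130 × 37 score table, so nothing here reproduces the paper's actual results:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(130, 37))  # placeholder for the 130 x 37 score table

# layer 1: factor analysis reduces the 37 indicators to m factors
m = 9
scores = FactorAnalysis(n_components=m, random_state=0).fit_transform(
    StandardScaler().fit_transform(X))

# layers 2-3: the factor scores are passed on to the competitive layer; here a
# 4-cluster KMeans is used as a stand-in for the 2 x 2 SOM competition layer
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)
print(scores.shape, np.unique(labels))
```

The design point is the hand-off: the competitive layer never sees the raw 37 correlated indicators, only the m uncorrelated factor scores, which is what reduces the input dimension and speeds up clustering.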

Empirical Analysis
The data in this paper come from the academic administration system of a college at a university and consist of the four-year academic performance information tables of 130 students of one major enrolled in 2016.

Factor Analysis Data Processing

First, the students' specific course records in the grade information table are cleaned. After data processing, the practical courses are merged, and the common professional basic courses, professional core courses, and public compulsory courses are retained. Second, elective courses are eliminated; erroneous course names are screened and corrected; missing exams, registration errors, and other noise data are removed; and a few missing grades are filled in with 60 points. The final data include variables such as student number, course name, and course score, yielding 37 course scores. According to factor analysis theory, the experiment therefore has 130 samples and 37 indicators, with the random vector X = (X_1, X_2, ..., X_37)^T, and the common factors to be found are F = (F_1, F_2, ..., F_m)^T.
This section adopts the factor analysis method; the software used is SPSS Statistics 26.
First, the data are imported into the software for factor analysis. After standardizing the data, the KMO value is 0.879, which is greater than 0.5, and the significance level is well below 0.05, indicating that the variables in this study are suitable for factor analysis. The output results are shown in Table 1.
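The KMO statistic can be computed from the correlation matrix and the partial correlations derived from its inverse. The NumPy sketch below is a hand-rolled version on toy data, not the SPSS computation, so it illustrates the formula rather than reproducing the 0.879 value:

```python
import numpy as np

def kmo(X):
    """Kaiser-Meyer-Olkin measure of sampling adequacy (illustrative sketch)."""
    R = np.corrcoef(X, rowvar=False)
    Rinv = np.linalg.inv(R)
    # partial correlations follow from the inverse correlation matrix
    d = np.sqrt(np.outer(np.diag(Rinv), np.diag(Rinv)))
    P = -Rinv / d
    off = ~np.eye(R.shape[0], dtype=bool)
    r2, p2 = np.sum(R[off] ** 2), np.sum(P[off] ** 2)
    # KMO = sum of squared correlations over (that sum + squared partials)
    return r2 / (r2 + p2)

rng = np.random.default_rng(5)
common = rng.normal(size=(500, 3))            # three shared latent drivers
X = common @ rng.normal(size=(3, 10)) + 0.4 * rng.normal(size=(500, 10))
k = kmo(X)
print(round(k, 3))   # data driven by common factors give a high KMO
```

A KMO close to 1 means the variables share much common variance, which is exactly the condition under which factor analysis is appropriate.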
Then, factor analysis was carried out on all variables to obtain the eigenvalues, variance contribution rates, and cumulative variance contribution rates of the 37 variables. Components with eigenvalues greater than 1 are selected as factors, and a total of 9 factors are extracted. As shown in Table 2, the cumulative contribution rate of the nine factors is 67.45%, more than 60%, which meets the requirements of factor analysis, so these nine factors can be extracted.
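Selecting components with eigenvalues greater than 1 (the Kaiser criterion) and computing the cumulative contribution rate can be sketched as follows, on correlated toy data standing in for the 130 × 37 table:

```python
import numpy as np

rng = np.random.default_rng(2)
# correlated toy data: 5 latent drivers behind 37 observed indicators
base = rng.normal(size=(130, 5))
X = base @ rng.normal(size=(5, 37)) + 0.5 * rng.normal(size=(130, 37))

R = np.corrcoef(X, rowvar=False)            # 37 x 37 correlation matrix
eig = np.sort(np.linalg.eigvalsh(R))[::-1]  # eigenvalues, largest first

keep = eig > 1.0                 # Kaiser criterion: eigenvalue greater than 1
contrib = eig / eig.sum()        # variance contribution rate of each component
cum = np.cumsum(contrib)         # cumulative variance contribution rate
print(int(keep.sum()), round(float(cum[keep.sum() - 1]), 3))
```

Since the trace of a correlation matrix equals the number of variables, each eigenvalue above 1 marks a component that explains more variance than a single original variable does.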
The evaluation is based on the notice of the measures for the evaluation of students' comprehensive quality issued by the school, which is also the principle this study follows.
From the component matrix of the factor analysis, it is found that the common factors it displays are not obvious, and interpreting them is somewhat difficult. Therefore, in this study, the maximum variance (varimax) method is used to rotate the component matrix and sort it by size, yielding the rotated component matrix. Through the total variance explained after rotation, 9 factors are obtained: F_1, F_2, ..., F_9. The factors are then named from the rotated component matrix: the variables contained in each factor are sorted, the variables with the largest loadings in the matrix table are identified, the commonality among those variables is observed, and the name of each factor is derived from it. The resulting factor naming table is shown in Table 3.
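A varimax rotation of this kind can be reproduced with scikit-learn's FactorAnalysis and its rotation option. The toy example below (invented data, not the paper's course scores) builds six variables from two latent abilities and then names each factor by its highest-loading variables, mirroring the naming procedure described above:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
# two latent abilities driving six toy "course" variables
F = rng.normal(size=(300, 2))
A = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.0],
              [0.0, 0.9], [0.0, 0.8], [0.0, 0.7]])
X = F @ A.T + 0.3 * rng.normal(size=(300, 6))

fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0).fit(X)
L = fa.components_.T     # 6 x 2 rotated factor load matrix

# name each factor by the variables that load most heavily on it
for j in range(2):
    top = sorted(np.argsort(-np.abs(L[:, j]))[:3].tolist())
    print(f"factor {j + 1}: variables {top}")
```

Varimax drives each variable's loading toward one factor and away from the others, which is why the rotated matrix is easier to name than the raw component matrix.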

SOM Neural Network Model Analysis

This paper uses MATLAB to run the model on the obtained data [16].
From the input samples, the number of input neurons is 37. This study uses a hexagonal topology for the output. There is no authoritative and effective theoretical method for setting the number of output layer neurons, so the trial-and-error method is used. After many attempts, the number of output layer neurons is set to 4, and a two-dimensional 2 × 2 SOM competition layer is used as the clustering capacity. The hexagonal topology is shown in Figure 2.
The number of training epochs can be determined from the stability of the classification. In this paper, the data are trained for 10, 25, 50, 100, 200, 500, and 1000 epochs, respectively, and the classification results after training are compared. When the number of training epochs reaches 100, the classification results are stable. Therefore, 100 training epochs are used in this study. The training classification results are shown in Figure 3.
Among the other initial parameters, the default topology function is "hextop" and the default distance function is "linkdist." After all structures and initial parameters are established, the data are substituted into the model for training, and the results are shown in Table 4.
Through SOM neural network analysis, the student group can be divided into four categories. To observe the proportion of students in each category more intuitively, a pie chart of the proportions is drawn. At this point we only know the number of students in each category; the characteristics of the four categories are not yet known, so the analysis below focuses on exploring the characteristics of the four groups. The number and proportion of each category are shown in Figure 3.
Based on the earlier factor analysis results, each student's subject scores and the rotated loadings are used to calculate a nine-dimensional comprehensive quality score per student, and the results are standardized. Then, according to the SOM neural network analysis, the students are divided into four categories, and the average of the nine-dimensional comprehensive quality scores is computed for each category [17, 18]. The statistical results are shown in Table 5.
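The per-category averaging of the standardized nine-dimensional scores can be sketched with a pandas groupby. The scores and labels below are random placeholders, not the paper's data; column names F1–F9 are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
# hypothetical 130 x 9 factor-score table plus SOM cluster labels
scores = pd.DataFrame(rng.standard_normal((130, 9)),
                      columns=[f"F{j}" for j in range(1, 10)])
scores = (scores - scores.mean()) / scores.std()   # standardise: mean 0 per column
labels = rng.integers(1, 5, size=130)              # SOM categories 1..4

profile = scores.groupby(labels).mean()            # per-category quality profile
print(profile.shape)
```

Because the columns are standardized to mean 0, a positive entry in a category's profile marks an above-average quality dimension and a negative entry a below-average one, which is how the category characteristics are read off below.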
To observe the characteristics of each type of student more intuitively, the average comprehensive quality values of the four types of students in Table 5 are converted into a bar chart. The abscissa represents each type of comprehensive quality, the ordinate represents the comprehensive quality score, and different colors represent the different student groups. Figure 4 shows the results. The data in the table have been standardized, so the average value of each comprehensive quality dimension is 0. The following can therefore be seen from the table and figure.
There are 40 students in the first category; these students have outstanding abilities across the board.
There are 18 students in the second category. These students have obvious deficiencies in innovation and entrepreneurship ability, computer ability, physical quality, and language expression, but their professional core competence is relatively good.
There are 21 students in the third category. Their physical quality and mental health are relatively weak, and, except for mathematical logical thinking ability, their scores in the other aspects are higher than those of the other categories. It can be seen that this kind of student's professional core ability is not strong.

Computational Intelligence and Neuroscience

There are 51 students in the fourth category, which is also the largest category. Apart from physical quality and mathematical logical thinking, these students score relatively low, indicating that they have obvious deficiencies and need to start from the foundation.

Conclusion
Through empirical analysis, the algorithm first classifies students' comprehensive quality into nine dimensions based on their course scores via factor analysis, and individual students can be evaluated with the resulting data. Then, on this basis, SOM neural network cluster analysis is carried out, and the students are divided into four categories. Students in different categories have corresponding characteristics, so the different student groups can be evaluated accordingly.
Aiming at the limitations of evaluating students' quality by total score alone and at the complexity of the various data, this paper puts forward an improved SOM neural network model that adds factor analysis to the model. The model can not only extract the common factors across disciplines and integrate students' various comprehensive abilities, but also improve the accuracy of clustering. The improved SOM model can evaluate the comprehensive quality of each type of student more intuitively and accurately and provide a strong basis for schools, teachers, and student self-management, so as to promote the all-round development of students. The improved SOM neural network algorithm is thus of great significance for the evaluation of students' comprehensive quality.
The algorithm can reduce dimensionality and cluster data. However, when there are too many data dimensions, the computational difficulty of this model increases, which needs further improvement in the future. The algorithm can be applied in many areas: not only to analyze students' comprehensive quality but also, for example, to evaluate and classify patients in hospitals. It is expected that the algorithm can be further improved in the future, so as to make a more complete evaluation of students' comprehensive quality and of each student's development.

Data Availability
All data, models, and code generated or used during the study appear in the submitted article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.