Construction of Alumni Information Analysis Model Based on Big Data

In order to integrate and utilize alumni resources more effectively, big data is used to construct an alumni information analysis model based on an improved hierarchical clustering algorithm, so as to realize the mining and retrieval of alumni information. First, the basic principle of the hierarchical clustering algorithm is analyzed concretely. On this basis, an improvement is made: a method of calculating the distance between class clusters based on ant colony optimization is proposed, which uses the shortest-path mechanism of the ant colony algorithm to optimally solve the distance between hierarchical class clusters, so as to improve the clustering accuracy. Then, the alumni information analysis model based on the improved hierarchical clustering algorithm is constructed; the model is divided into text preprocessing, keyword extraction, text feature vector generation, name disambiguation, and alumni recognition modules. Finally, the improved hierarchical clustering algorithm and the constructed model are verified by experiments. The results show that the accuracy of the improved agglomerative hierarchical clustering algorithm is as high as 86.4% on average, 3.8% and 4.8% higher than that of the two traditional algorithms. Thus, the clustering effect of the algorithm is better, and the proposed alumni analysis model can effectively handle text disambiguation of web pages and identification of alumni information, which demonstrates its effectiveness.


Related Work
For schools, alumni resources are a favorable support and a key to development and construction. The effective integration and tracking of alumni information plays a crucial role for schools. For example, schools can evaluate teaching quality according to alumni information, so as to continuously improve and perfect their teaching methods and concepts. Furthermore, the latest trends of alumni may provide support and help for the development and construction of the school. Therefore, it is necessary for schools to manage alumni resources and information effectively. However, there are so many graduates that it seems impractical to keep track of every alumnus.
In recent years, with the development and application of big data and the Internet, it has become possible to obtain alumni information and update it in real time. At the same time, the volume of Internet information is very large, so accurately identifying alumni-related information is the current challenge. In particular, how to quickly and accurately identify alumni information from many pieces of information and exclude people with the same name to obtain the final target is an urgent problem to be solved. Barnea Avner et al. constructed an emergency intelligence analysis model based on big data and used big data analysis technology to analyze and predict various emergencies in advance, with certain effectiveness [1][2][3]. Li Hong et al. studied information analysis models based on complex networks in depth and then used Internet technology to classify and identify complex information on the network, thus improving the accuracy of information analysis [4][5][6]. Jiho Lee et al. proposed a hierarchical clustering analysis algorithm to construct a learner emotion analysis model for learning experience texts and classified learners through hierarchies, thus realizing the emotion analysis of each learner [7,8]. Agnivesh et al. applied a clustering algorithm to data classification, which showed good performance and effect [9]. Ko applied big data to the parallel improvement of clustering and concluded that clustering can be used in massive data classification [10]. On this basis, taking the alumni information in the school campus network as the basic data, and combining it with the above research results, the currently widely used hierarchical clustering method is improved and applied to analyze alumni resource information; thus, the corresponding analysis model is constructed, which realizes accurate identification and classification of alumni resources and provides a technical reference and research direction for the analysis of alumni information.

Hierarchical Clustering Algorithm.
Hierarchical clustering is one of the common clustering algorithms, and its basic principle is to cluster through the distance between objects [11]. This algorithm can also be called tree clustering, and it is mainly divided into two forms: agglomerative and divisive.
Among them, agglomerative clustering means calculating the distance between class clusters and merging the categories. The clustering form is bottom-up: multiple iterations and merges are carried out on the clusters, and when all the data are concentrated in one cluster and the set criterion is reached, the calculation is complete [12]. The basic principle is shown in Figure 1. The clustering form of divisive clustering is the opposite of agglomerative clustering; its splitting form is top-down, and new clusters are obtained by repeatedly decomposing a data set.
In hierarchical clustering, each cluster is classified by evaluation criteria, usually by means of the distance between points and the distance between class clusters [13]. The calculation formulas of the two methods are as follows.

Calculation Method of the Distance between Points.
If there are two n-dimensional vector data points A = (A_1, A_2, ..., A_n) and B = (B_1, B_2, ..., B_n), then the distance between the two points can be measured by cosine similarity, that is, the cosine of the included angle [14]. The calculation formula is as follows:

cos(A, B) = (Σ_{i=1}^{n} A_i B_i) / (sqrt(Σ_{i=1}^{n} A_i^2) · sqrt(Σ_{i=1}^{n} B_i^2)). (1)

Euclidean distance can measure the absolute distance between two vectors. The solution formula is as follows:

d(A, B) = sqrt(Σ_{i=1}^{n} (A_i − B_i)^2). (2)
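The two point-distance measures can be sketched in a few lines of Python (a minimal illustration of formulas (1) and (2); the function names are our own):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the included angle between two n-dimensional vectors (formula (1))."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Absolute (Euclidean) distance between two vectors (formula (2))."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Note that cosine similarity measures orientation (collinear vectors score 1 regardless of magnitude), while Euclidean distance measures absolute separation.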

Measuring the Distance between Class Clusters.
At present, clustering is mainly achieved by calculating the distance between class clusters. The common measures are the minimum distance, maximum distance, mean (centroid) distance, and average distance between class clusters, shown in formulas (3)-(6):

d_min(c_i, c_j) = min_{p_i ∈ c_i, p_j ∈ c_j} |p_i − p_j|, (3)

d_max(c_i, c_j) = max_{p_i ∈ c_i, p_j ∈ c_j} |p_i − p_j|, (4)

d_mean(c_i, c_j) = |m_i − m_j|, (5)

d_avg(c_i, c_j) = (1 / (n_i n_j)) Σ_{p_i ∈ c_i} Σ_{p_j ∈ c_j} |p_i − p_j|, (6)

where m_i and m_j are the centroids of c_i and c_j, and n_i and n_j are the numbers of samples in classes c_i and c_j, respectively.

Algorithm Improvement.
The traditional hierarchical clustering algorithm is simple, and the accuracy of its distance calculation is not high; it is easily interfered with by outliers in the class cluster. Therefore, to better represent the distance between clusters, avoid interference, and improve the clustering effect, the ant colony optimization algorithm is added to the agglomerative hierarchical algorithm. Based on the pheromone mechanism of the ant colony optimization algorithm, the optimal path is solved, so as to improve the clustering accuracy and achieve global optimization [15].

Ant Colony Optimization Algorithm.
The basic principle of ant colony optimization is global search; it is an intelligent optimization algorithm. By simulating the behavior of real ants, which identify directions through pheromone concentration, the algorithm iteratively finds the optimal path and achieves global optimization [16]. The ant colony optimization algorithm is classically illustrated by the traveling salesman problem (TSP). If there are n cities, the algorithm proceeds as follows. (1) When the ant colony has visited the n cities, the ants update the pheromones on their paths, and the update formulas are as follows:

τ_ij(t + n) = (1 − ρ) · τ_ij(t) + Δτ_ij,

Δτ_ij = Σ_{k=1}^{m} Δτ^k_ij,

Δτ^k_ij = Q / L_k if ant k travels edge (i, j), and 0 otherwise,

where m represents the total number of ants, Q represents a constant, ρ < 1 is the pheromone evaporation coefficient, L_k is the length of the path traveled by ant k, τ_ij(t) represents the pheromone on edge (i, j) at time t, and Δτ^k_ij is the pheromone quantity deposited by ant k on edge (i, j) between time t and t + n [17].
(2) An ant cannot revisit a city it has already passed until it has reached all n cities. (3) The ant colony starts from a certain point, and the probability of ant k selecting the next target j from city i is

p^k_ij = [τ_ij]^α · [η_ij]^β / Σ_{s ∈ T_k} [τ_is]^α · [η_is]^β for j ∈ T_k, and 0 otherwise,

where T_k is the set of cities the ant can still choose; η_ij is the heuristic information; α is the relative importance of the pheromone heuristic factor; and β is the relative importance of the heuristic information.
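The transition rule and pheromone update described above can be sketched as follows (a minimal illustration with hypothetical data structures: `tau` and `eta` map candidate nodes to pheromone and heuristic values, and `ant_paths` is a list of (path, length) tours):

```python
def transition_probabilities(tau, eta, allowed, alpha=1.0, beta=2.0):
    """Roulette rule above: p_ij is proportional to tau_ij^alpha * eta_ij^beta
    over the set of still-allowed nodes T_k."""
    weights = {j: (tau[j] ** alpha) * (eta[j] ** beta) for j in allowed}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

def update_pheromone(tau, ant_paths, rho=0.5, Q=1.0):
    """Evaporate every edge by (1 - rho), then let each ant k deposit
    Q / L_k on the edges of its tour (the update formulas above)."""
    new_tau = {edge: (1 - rho) * t for edge, t in tau.items()}
    for path, length in ant_paths:
        for edge in zip(path, path[1:]):
            new_tau[edge] = new_tau.get(edge, 0.0) + Q / length
    return new_tau
```

The probabilities always sum to 1 over the allowed set, and shorter tours (smaller L_k) deposit more pheromone, which is what biases later ants toward short paths.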

Agglomerative Hierarchical Clustering Algorithm Based on the Ant Colony Optimization
(1) Standard Distance. Euclidean distance is used to compute the distance between two data points, and the solution formula is as follows:

d(x_i, x_j) = sqrt(Σ_{k=1}^{m} (x_{ik} − x_{jk})^2),

where x_i and x_j represent two m-dimensional data points. Similarity can be measured by calculating the distance between two clusters, and in agglomerative hierarchical clustering the distance between clusters can be calculated by the minimum distance formula (3).
(2) Objective Function. The objective function is set as the clustering error square sum (suppose there are c clustering centers after clustering is completed):

J = Σ_{j=1}^{c} Σ_{x_i ∈ c_j} |x_i − C_j|^2,

where C_j represents the centroid of cluster j, computed from the m_j data points in that cluster.
(3) Agglomerative Hierarchical Clustering Based on Ant Colony Optimization. The objective of this algorithm is to find a shortest path through all the data to improve the clustering efficiency and accuracy. In the ant colony optimization, ants are taken as the research object and food as the clustering center. The probability of ants searching for food is put into the clustering algorithm, and data are classified by probability [18]. The improved algorithm consists of six steps, described below.
It can be seen from Figure 2 that the process of the improved algorithm is mainly divided into six steps, as follows: (1) Initialize the parameters, such as the number of ants m, the weight parameters α and β, and the volatile factor ρ. (2) For each ant, calculate the distance and pheromone between data points, evaluate the transition probability, and determine the merger probability between data points and alternative points. The merger probability formula is as follows:

p^m_ij = [τ_ij]^α · [η_ij]^β / Σ_{s} [τ_is]^α · [η_is]^β,

where d_ij is the distance between two data points; η_ij = 1/d_ij is the distance-based heuristic information; and α and β are weight parameters, which have a great influence on the pheromones [19]. If p^m_ij ≥ p_0, x_j merges with x_i; otherwise, there is no merger.
(3) Judge whether the ant counter k has reached the total number of ants; if not, set k = k + 1 and go back to step (2). (4) After the ants complete clustering, the clustering centers and pheromones are updated, where the expression of the clustering center is

C_j = (1 / m_j) Σ_{x_i ∈ c_j} x_i,

where m_j is the total number of data points classified into c_j.
In the process of optimization, the pheromone concentrations on the paths passed by the ants are different [20]. After clustering, the ants' pheromones constantly evaporate, and the evaporation formula can be expressed as follows:

τ_ij ← (1 − ρ) · τ_ij,

where ρ represents the pheromone evaporation rate. When 0 < ρ ≤ 1, the algorithm can discard long paths and avoid pheromone accumulation. After evaporation is completed, all ants deposit pheromones again:

τ_ij ← τ_ij + Σ_{k=1}^{m} Δτ^k_ij,

where Δτ^k_ij represents the amount of pheromone deposited from data point x_i to class cluster c_j by the k-th ant.
(5) Compute the value of the objective function. (6) Repeat steps (2)-(4) until the objective function reaches its minimum; the calculation is then complete.
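The six steps above can be sketched as a compact, deliberately simplified Python routine. The parameter names, the single-best-candidate merge rule, and the fixed iteration count are our own assumptions for illustration, not the paper's exact implementation:

```python
import math

def aco_agglomerative(points, p0=0.6, alpha=1.0, beta=2.0, rho=0.3, Q=1.0, ants=10):
    """Sketch of the six steps: each point starts as its own cluster, and on
    each ant pass a point merges into its best candidate's cluster when the
    normalized merge probability exceeds the threshold p0."""
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    n = len(points)
    tau = [[1.0] * n for _ in range(n)]   # step (1): initialize pheromone
    labels = list(range(n))               # every point is its own cluster
    for _ in range(ants):
        for i in range(n):
            # step (2): merge probability toward each candidate point
            weights = []
            for j in range(n):
                if j == i:
                    weights.append(0.0)
                    continue
                eta = 1.0 / (dist(points[i], points[j]) + 1e-9)  # heuristic info
                weights.append((tau[i][j] ** alpha) * (eta ** beta))
            total = sum(weights)
            best = max(range(n), key=lambda j: weights[j])
            if total and weights[best] / total >= p0:
                labels[i] = labels[best]  # merge x_i into x_best's cluster
        # step (4): evaporation, then deposit within clusters
        for i in range(n):
            for j in range(n):
                tau[i][j] *= (1 - rho)
                if labels[i] == labels[j]:
                    tau[i][j] += Q
    return labels
```

On well-separated data the pheromone trail quickly reinforces within-cluster merges, so the labels stabilize after the first few ant passes.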

Design of the Alumni Recognition Model Based on Improved Algorithm.
There is a great deal of alumni information on the Internet. The structure of web pages is complex, and the data format is not standard, which makes it difficult to extract the features of alumni information and to accurately identify it.
Therefore, an alumni recognition model based on the improved agglomerative hierarchical clustering algorithm is proposed, as shown in Figure 3. Firstly, alumni information is collected and preprocessed. Secondly, text features are extracted by embedding technology, and the text representation model and feature vectors are constructed. Finally, name disambiguation and alumni identification are carried out. Thus, alumni information analysis is realized.
As can be seen, the alumni information identification process is mainly divided into six steps, as follows:
(1) Classify the text data and perform named entity recognition.
(2) Filter stop words: delete useless interjections, modal particles, personal pronouns, and so on [21].
(3) Extract keywords, using the TF-IDF algorithm to select key information.
(4) Text representation: use the word embedding tool to embed the keywords and obtain their vectorized expression.
(5) Text clustering: adopt the cosine similarity calculation method to cluster the document vectors.
(6) Analyze the clustering results.

Data Collection and Preprocessing.
To perform data collection and preprocessing, the first step is to collect the web page documents: the original alumni data are collected with Python scripts using Baidu's search engine API. The second step is to specify a person's name as a search term for web page retrieval; the web documents obtained are stored in a local folder. Then, the web page files are processed in Python and the web page information is extracted; the text content is extracted by regular expressions. The specific work includes tag attribute extraction, tag filtering, and character filtering [22].
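The tag filtering and character filtering steps can be illustrated with a small regular-expression sketch (a simplification of the described preprocessing; real pages usually warrant a proper HTML parser):

```python
import re

def extract_text(html):
    """Minimal sketch of the cleaning steps above: drop script/style blocks,
    strip remaining tags, replace HTML entities, and collapse whitespace."""
    html = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", html)      # tag filtering
    text = re.sub(r"&[a-zA-Z]+;", " ", text)  # entity (character) filtering
    return re.sub(r"\s+", " ", text).strip()
```

For production crawling, a dedicated parser such as the one in Python's standard `html.parser` module handles malformed markup more robustly than regular expressions.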

Text Feature Extraction.
The Python package pynlpir, based on the NLPIR natural language processing system of the Chinese Academy of Sciences, is used to annotate text sequences and complete text classification. The TF-IDF algorithm is used to extract keywords, and the named entities of each piece of information are taken as the final text keywords. The algorithm flow is shown in Figure 4. The TF-IDF algorithm computes the term frequency and inverse document frequency of the text; their product, the TF-IDF value, measures the importance of a word. The solution formula is as follows:

tfidf_{i,j} = tf_{i,j} × idf_i,

where tf (term frequency) represents the occurrence frequency of a word in the document, and its solution formula is

tf_{i,j} = n_{i,j} / Σ_k n_{k,j},

where n_{i,j} is the number of occurrences of word i in document j and Σ_k n_{k,j} is the total number of occurrences of all words in document j. IDF (inverse document frequency) indicates the rarity of the word across documents, namely, the importance of the word. The expression is

idf_i = log(|D| / |{j : c_i ∈ d_j}|),

where |D| is the number of all documents in the corpus and |{j : c_i ∈ d_j}| is the number of documents containing word c_i.
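The TF-IDF formulas above can be computed directly (a minimal sketch using the natural logarithm, since the paper does not specify the log base; `docs` is a list of token lists):

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF per the formulas above: tf = n_ij / sum_k n_kj,
    idf = log(|D| / |{j : word appears in d_j}|)."""
    n_docs = len(docs)
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({w: (c / total) * math.log(n_docs / df[w])
                       for w, c in counts.items()})
    return scores
```

A word that appears in every document gets an IDF of log(1) = 0, so its TF-IDF score vanishes regardless of how frequent it is, which is exactly the stop-word-suppressing behavior the keyword extraction step relies on.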

Construction of Word2Vec Text Representation Model.
Before classifying text, it is necessary to select an appropriate model to represent it, because the quality of the text representation has a great influence on the text clustering result. Many text models are deficient and incomplete in information, which leads to a poor clustering effect at a later stage. Therefore, a word embedding model is used to represent the text content, so as to improve the effect of text clustering. The word embedding model improves text quality by transforming the text dimensions and filling in missing information. On this basis, Skip-gram, which currently has a good application effect, is used to train the model and generate the word vectors.

Construction of Text Feature Vector.
The construction process of the text feature vector is shown in Figure 5. The first step is to use the trained Word2Vec model to generate the word vectors. The second step is to average the keyword vectors; thus, the text feature vector is obtained.
If there are n keywords in the text, the text representation model is d = (c_1, c_2, ..., c_n), and the word vector v(c_i) can be obtained by training. Thus, the feature vector of the document is

v(d) = (1/n) Σ_{i=1}^{n} v(c_i),

where v(d) is the feature vector representation of a document and v(c_i) is the word vector of the i-th feature word c_i; that is, the document vector is the average of all the keyword vectors. The feature vector of the text of the person to be disambiguated and the text vector of the alumni attributes in the knowledge base can both be obtained by the above model.
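The averaging step can be sketched in a few lines (`word_vectors` is a hypothetical dictionary standing in for the trained Word2Vec model, mapping each word to its vector):

```python
def document_vector(keywords, word_vectors):
    """Feature vector of a document as the mean of its keyword vectors,
    per the v(d) formula above; unknown words are skipped."""
    vecs = [word_vectors[w] for w in keywords if w in word_vectors]
    if not vecs:
        return []
    dim = len(vecs[0])
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]
```

With a real Word2Vec model, `word_vectors[w]` would be replaced by the model's vector lookup for word `w`.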

Name Disambiguation and Alumni Identification.
Using the text feature vectors, the texts are clustered with the improved agglomerative hierarchical clustering algorithm.
Through clustering, people with the same name are distinguished, and the texts related to the same person are grouped into the same category, which achieves name disambiguation.
Then, the knowledge base information is used to assist identification, finding the class cluster to which a specific person belongs, so as to realize information identification of that person. The disambiguation process is shown in Figure 6. The improved agglomerative hierarchical clustering algorithm proposed in this paper is used to cluster the text information, and the center point of each cluster is computed. The calculation formula is shown in formula (22) [23][24][25][26][27]:

v(C_i) = (1/m) Σ_{j=1}^{m} v(d)_j, (22)

where v(C_i) represents the feature vector of the center point of the i-th class cluster C_i; m represents the number of data points in class cluster C_i; and v(d)_j represents the feature vector of the j-th data point.
The similarity between the feature vector of the cluster center and the feature vector of the target person's information constructed from the knowledge base is then compared; if the similarity is greater than the threshold, the target person's class cluster is obtained, that is, the web page texts relevant to the target person.
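The threshold test can be sketched as follows (function and parameter names are our own, and the threshold value is an assumption, since the paper does not state it):

```python
import math

def identify_alumni(cluster_centers, target_vector, threshold=0.8):
    """Return indices of clusters whose center vector is cosine-similar to
    the knowledge-base target vector above the threshold."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    return [i for i, c in enumerate(cluster_centers)
            if cos(c, target_vector) > threshold]
```

In the model described above, the matching cluster's member texts are then reported as the target alumnus's web pages.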

Experimental Data.
To verify the effectiveness of the improved method, 8000 texts about sports, government, commerce, culture, and scientific research obtained by a Python web crawler are used as experimental data, and the same-name alumni data sets D1 and D2 are divided into 5000 training texts and 3000 testing texts.

Experimental Environment and Parameter Setting.
To obtain better experimental results, the experimental processor is a 9th-generation Intel Core i5, and the operating system is 64-bit Windows. In addition, the CAS NLPIR word segmentation system and the Python language are adopted in the experiment. The parameter value settings of the algorithm are shown in Table 1.

Evaluation Indicators.
To evaluate the experiment more objectively, accuracy, recall rate, and F1 value are adopted as evaluation indicators. The accuracy is the ratio of the number of correctly identified texts to the total number identified, and the recall rate is the ratio of the number of correctly identified texts to the real number of texts. The expressions of accuracy and recall rate are as follows:

precision = a / (a + b),

recall = a / (a + c),

where precision represents accuracy; recall stands for recall rate; a is the number of texts correctly identified by the algorithm; b is the number of texts incorrectly identified by the algorithm; and c is the number of texts belonging to a certain category that the algorithm does not recognize. The F1 value is the comprehensive analysis standard of the clustering algorithm; the higher the F1 value, the better the clustering effect. The formula of the F1 value is as follows:

F1 = 2 × precision × recall / (precision + recall).
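The three indicators can be computed directly from the counts a, b, and c defined above:

```python
def precision_recall_f1(a, b, c):
    """precision = a/(a+b), recall = a/(a+c), and F1 as their harmonic mean,
    with a, b, c as defined in the text."""
    precision = a / (a + b)
    recall = a / (a + c)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```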

Improved Algorithm.
To test the effectiveness of the proposed improvement, the experiment compares the algorithm before and after improvement with the K-means algorithm. The experimental results are shown in Table 2. It can be seen from Table 2 that the accuracy of the improved algorithm is 84.2% and 85.6% on the D1 and D2 data sets, respectively, which is higher than the accuracy of the algorithm before improvement and of the traditional clustering algorithm, indicating that the improved algorithm proposed in this paper is effective.
To test the performance of the improved algorithm, the comparison curves of accuracy, recall rate, and F1 value of the three algorithms are given below.
As can be seen from Figure 7, the highest accuracy of the improved algorithm is 89% and the average is 86.4%, which are 3.8% and 4.5% higher than those of the other two algorithms, respectively, which means that adding the ant colony optimization algorithm can improve the clustering accuracy.
As can be seen from Figures 8 and 9, the highest recall rate of the improved algorithm is 89% and the average is 86.5%, which are 3% and 3.5% higher than those of the other two algorithms. The F1 value of the improved algorithm is as high as 88% and as low as 86%, obviously higher than that of the other two algorithms. Thus, the improved algorithm has a better effect on text classification and better overall performance.

Alumni Recognition Model Based on the Improved Algorithm.
To verify the validity of the constructed alumni information recognition and analysis model, 5 alumni are selected from the above data set for experimental verification. The Word2Vec tool is adopted to train the word vectors, and the hyperparameter settings of the model during training are shown in Table 3.
After name disambiguation and recognition are performed by the alumni information recognition model, the statistical results obtained are shown in Table 4.
As can be seen from Table 4, when this model is used to identify and classify the 5 alumni, alumni with the same name are accurately distinguished, name disambiguation is realized, and the class cluster of each of the 5 alumni is also accurately identified.

Conclusion
The proposed ant colony optimization clustering algorithm can achieve accurate clustering of alumni information, and the clustering effect is good. Moreover, the constructed model can disambiguate alumni name information and improve the accuracy of alumni information identification. The results show that the improved agglomerative hierarchical clustering algorithm has higher recognition accuracy, recall rate, and F1 value than the traditional agglomerative hierarchical clustering algorithm and the K-means algorithm, with corresponding average rates of 86.4%, 86.5%, and 87%, which shows that adding the ant colony algorithm to the traditional agglomerative hierarchical clustering algorithm can find the optimal solution. Applying the modified algorithm to the alumni analysis model further improves the recognition accuracy and clustering effect, which allows the model to be extended and applied in the field of alumni analysis. The contribution of this study is to improve hierarchical clustering by using the ant colony algorithm, thereby improving the accuracy of massive data classification. This method is applied to the analysis of alumni information, which provides a reference for the extension of informatization in various fields.
However, due to the limits of the experimental conditions and research experience, there are still certain limitations. Future research will focus on improving the execution efficiency of the improved hierarchical clustering algorithm, so as to complete text data clustering and analysis in a shorter time.

Data Availability
The experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The author declares that there are no conflicts of interest regarding this work.