Visualization Research of College Students’ Career Planning Paths Integrating Deep Learning and Big Data

As China's education enters a high-level stage, more and more students graduate from Chinese colleges and universities. In particular, the current employment environment is flexible and diversified, and there are more and more opportunities to choose from. In view of this situation, this article aims to visualize the career planning (CP) paths of college students, so as to help them adapt to this flexible employment environment. Drawing on deep learning and big data (DLBA) technology, this article proposes the LSTM-Canopy algorithm, which adds an LSTM self-learning factor to the traditional Canopy algorithm to enhance its self-learning clustering ability. This study then applies the algorithm to a visualization system of college students' CP paths, which can effectively support experts' analysis and judgment of careers. The experiments in this article show that the system can meet the normal use of 400-500 users, that the system server successfully passed a 40-terminal load test, and that the running time remained below 2.5 s, which demonstrates the reliability of the system.


Introduction
The current era is the era of the Internet. In recent years, computer technology and network technology have developed rapidly, and social informatization has entered a new stage. Some people say that the current era is the era of information, but it is more accurate to say that it is the era of data. The number of students continues to rise, new types of information continue to emerge, and the raw student data on college campuses have grown exponentially. These raw data hold enormous value. In the face of massive data, if we can use scientific methods to process them, "take the essence and discard the dross," and extract the information of interest hidden within, the data can become a real resource that we can use.
However, more and more information accumulates without being put to use. In particular, in the CP of college students there is a great deal of information, but students do not know how to use it or where their direction lies. Therefore, it is necessary to design a visualization system of college students' CP paths to address this problem.
For the analysis of college CP paths, this article makes the following two innovations: (1) For the problem of data classification, based on DLBA technology, this article proposes the LSTM-Canopy algorithm, a data clustering algorithm based on the LSTM algorithm. It adds an LSTM self-learning factor to the traditional Canopy algorithm to improve the clustering effect. (2) This article designs a visualization system of college students' CP paths and an expert system for college students' CP. Experienced teachers and entrepreneurs analyze the career reports submitted by students and present their own analysis reports, which facilitates data visualization.

Related Work
College students are the main reserve force for the construction of the motherland, and there are many studies on their careers. Tavabie and Simms explored the common abilities and characteristics of nonclinical roles. They organized this characteristic information into a generic job description at four key levels, which forms the basis of a career path [1]. Jackson believed that, given intense competition in the graduate labor market and the underemployment of graduates, effective CP for college students is becoming more and more important. His research examined the impact of work-study integration on students' CP. It also deepens our understanding of how work-study integration shapes college students' career goals and improves the currently weak levels of student engagement in CP, discussing the implications for future career counseling [2]. Nordin and Hong aimed to explore the impact of career coaching activities on children's CP. Their study was conducted on 12-year-olds who were having problems with the CP process. The findings showed that four topics were successfully identified, namely, occupational understanding, sources of occupational information disclosure, occupational choice, and employment understanding through parental occupations [3]. Hung and others investigated the motivations that led Vietnamese students to choose to study in Taiwan. Quantitative results showed a significant correlation between student motivation and CP, both of which directly affect decision making [4]. Ying et al. provided a comprehensive overview of research using deep learning methods to process clinical data. They believed that, despite the challenges of applying DL technology to clinical data, the prospects of DL applications in clinical big data for precision medicine remain worth looking forward to [5]. Deep learning (DL) is a branch of machine learning based on algorithms that learn multilevel representations.
Big data analytics (BDA) is the process of examining large-scale data of various types. Hordri et al. identified the existing features of DL methods used in BDA and the key features that affect their effectiveness, and demonstrated experimentally that DL for BDA is an active research area [6]. Duncan et al. aimed to define and highlight some of the "hot" new perspectives in the field of biomedical imaging and analysis. They aimed to shed light on where the field is headed over the next few decades and highlighted areas where electrical engineers are already involved and likely to have the greatest impact. They offered a good discussion of medical imaging as "big data" and believed that the prospects of biomedical imaging and analysis are very good [7]. However, a review of the relevant research shows that CP for college students is reflected mainly in course design and teaching experience, and there is little research on the visualization of such systems.

Neural Network Algorithm.
The overall structure of a long short-term memory network (LSTM) is similar to that of a recurrent neural network (RNN). The "gate" mechanism and the structural concept of the cell state are introduced into the LSTM hidden-layer calculation. The "gate" mechanism determines how much information is input at each time step and how much state information is saved or forgotten, while the cell state records the state information at the current time step. Through these two structures, earlier information in the sequence can be preserved, and long-distance dependencies among sequence features can be learned [8,9]. Also, owing to the cell state, the gradient can be preserved during training, which alleviates the vanishing-gradient problem [10]. The hidden-layer computing structure of LSTM is shown in Figure 1. The calculation of the hidden layer at time t of the LSTM is as follows, including the forget gate f_t, the input gate i_t, the output gate o_t, the cell-state update c_t, and the hidden-layer output h_t:

f_t = σ(W_f · [h_(t−1), x_t] + b_f),
i_t = σ(W_i · [h_(t−1), x_t] + b_i),
o_t = σ(W_o · [h_(t−1), x_t] + b_o),
c̃_t = tanh(W_c · [h_(t−1), x_t] + b_c),
c_t = f_t ∘ c_(t−1) + i_t ∘ c̃_t,
h_t = o_t ∘ tanh(c_t),

where σ is the sigmoid function of Formula (2), σ(x) = 1/(1 + exp(−x)), ∘ is the element-wise multiplication of vectors, and the W and b terms are the weights and biases.
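To make the gate computations concrete, one LSTM time step can be sketched in NumPy as below. This is a minimal illustration of the standard equations, not the paper's implementation; the stacked weight layout and the toy dimensions are our own choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps the concatenated [h_prev, x_t] to the
    stacked pre-activations of the four gates; b is the stacked bias."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])        # forget gate
    i = sigmoid(z[1 * hidden:2 * hidden])        # input gate
    o = sigmoid(z[2 * hidden:3 * hidden])        # output gate
    c_tilde = np.tanh(z[3 * hidden:4 * hidden])  # candidate cell state
    c_t = f * c_prev + i * c_tilde               # cell-state update
    h_t = o * np.tanh(c_t)                       # hidden-layer output
    return h_t, c_t

# Tiny usage example with random weights (illustrative dimensions)
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.standard_normal((4 * n_hid, n_hid + n_in)) * 0.1
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
```

Because the output gate lies in (0, 1) and tanh lies in (−1, 1), every component of the hidden output is bounded by 1 in magnitude.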
Position embedding is a matrix, with the same shape as the input features, that represents time-step information. It can use trainable variables as the position embedding, or it can be a custom matrix that represents the differences between time steps. A common method is shown in Formulas (3) and (4).
PE(pos, 2i) = sin(pos / 10000^(2i/n)),    (3)
PE(pos, 2i+1) = cos(pos / 10000^(2i/n)),    (4)

where PE represents the position-embedding matrix, pos represents the index along the time-step dimension of the PE matrix, i represents the index along the feature dimension of the PE matrix, and n represents the total number of features. Batch normalization is an optimization algorithm that normalizes the outputs of neurons at the same position across a training batch. The algorithm makes the output distribution of each layer of the neural network more stable and reduces the coupling between layers, thereby accelerating the convergence of the neural network algorithm. Assuming that the values of a neuron on a certain mini-batch are {x_1, x_2, ..., x_n}, batch normalization is calculated as follows:

μ = (1/n) Σ_(i=1..n) x_i,
σ² = (1/n) Σ_(i=1..n) (x_i − μ)²,
x̂_i = (x_i − μ) / √(σ² + ε),
y_i = γ x̂_i + β,
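A minimal sketch of the sinusoidal position embedding of Formulas (3) and (4); the base 10000 is the common convention and is an assumption here, since the source does not state it.

```python
import numpy as np

def position_embedding(seq_len, n, base=10000.0):
    """Sinusoidal position embedding:
    PE[pos, 2i]   = sin(pos / base**(2i/n))
    PE[pos, 2i+1] = cos(pos / base**(2i/n))
    """
    pe = np.zeros((seq_len, n))
    pos = np.arange(seq_len)[:, None]    # (seq_len, 1) time-step indices
    i = np.arange(0, n, 2)[None, :]      # even feature indices
    angle = pos / np.power(base, i / n)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle[:, : (n // 2)])
    return pe

pe = position_embedding(seq_len=50, n=8)
```

At pos = 0 the even columns are sin(0) = 0 and the odd columns are cos(0) = 1, so adjacent time steps receive smoothly varying, distinguishable codes.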
where γ and β are trainable variables and ε is a small positive constant. Dropout randomly discards some neurons in the output of a layer and rescales the remaining signal during training, so that the weights do not depend entirely on certain features. It thereby reduces the degree of network overfitting and improves the generalization ability of the algorithm. Dropout is calculated as follows:

r^(l) ~ Bernoulli(p),
ỹ^(l) = r^(l) ∘ y^(l),
z^(l+1) = W^(l+1) ỹ^(l) + b^(l+1),
y^(l+1) = σ(z^(l+1)),

where z^(l) represents the unactivated feature output of the l-th layer of the fully connected neural network and y^(l) represents the activated feature output of the l-th layer. r^(l) is a vector whose elements follow a Bernoulli distribution with probability p, and W^(l) represents the weight matrix of the l-th layer of the fully connected neural network. b^(l) represents the bias vector of the l-th layer, σ(x) = 1/(1 + exp(−x)) is the sigmoid function, and ∘ is the element-wise multiplication of vectors [11]. The Canopy algorithm performs rough clustering without preselecting the number of clusters k, and at the same time it greatly optimizes the method for determining the initial cluster centers.
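The two regularization operations above can be sketched as follows. This is an illustrative sketch: the dropout shown is the common "inverted" variant that rescales by 1/p at training time, which differs slightly in bookkeeping from the formulation above but is equivalent in effect.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch x of shape (batch, features)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # scale and shift

def dropout(y, p, rng):
    """Inverted dropout: keep each unit with probability p, rescale by 1/p."""
    r = rng.binomial(1, p, size=y.shape)   # Bernoulli mask
    return r * y / p

rng = np.random.default_rng(1)
x = rng.standard_normal((32, 5))
out = batch_norm(x, gamma=np.ones(5), beta=np.zeros(5))
dropped = dropout(out, p=0.8, rng=rng)
```

With γ = 1 and β = 0, each feature of the batch-normalized output has (up to floating-point error) zero mean over the batch.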

Canopy Algorithm.
Before introducing the Canopy algorithm, several of its concepts are first defined. The set C_k of data objects not yet chosen as centers is called the candidate center point set; if the distance between a data object d_i and a center C_j is less than T1, then d_i is assigned to the Canopy of C_j, T1 is called the set radius of the Canopy, and C_j is the center point of the Canopy. The basic idea of the Canopy algorithm can be divided into two stages. The first stage uses a roughly calculated distance as the metric to efficiently divide the data set into several subsets; different subsets may intersect but cannot completely overlap, and each divided subset is called a Canopy. In the second stage, a more rigorous clustering algorithm is selected to perform clustering operations on the data within each Canopy obtained in the first stage. The Canopy algorithm is a coarse-then-fine clustering strategy, and it is very suitable for the preanalysis of high-dimensional data [12,13]. The Canopy clustering process is shown in Figure 2. The first stage of the algorithm generates several Canopies, each of which is a collection of sample data, and the algorithm presets a threshold for the measure of difference [14].
Because the first step of the algorithm allows two Canopies to intersect, a data object may belong to more than one Canopy, but each data object must exist in at least one Canopy. The second step of the algorithm can be carried out with conventional clustering algorithms such as k-means, but it is worth noting that the clustering in the second step is performed only on the data within the same Canopy; there is no need to compute distances between data in different Canopies, and the distance between data objects in different Canopies is generally considered infinite. Consider an extreme case: if the first step places all data objects into the same Canopy, then the work of the second step degenerates into a traditional clustering algorithm.
The Canopy algorithm does not need a preset value of k for the number of clusters; instead, it uses two similarity thresholds T1 and T2 to indirectly determine the number of Canopy subsets after clustering and the number of data objects in each Canopy [15,16]. T1 is called the loose threshold (loose distance), T2 is called the tight threshold (tight distance), and T1 > T2. If the distance between a data object and a Canopy center is less than T1, the data object is added to that Canopy; if the distance is also less than T2, the data object can no longer become a Canopy center. The relationship between the assignment of data objects and T1 and T2 is shown in Figure 3.
The Canopy algorithm does not need a preselected number of clusters, which saves algorithm cost. Feeding the results of the Canopy algorithm into the subsequent precise clustering algorithm can greatly reduce the number of iterations of that algorithm, thereby improving the efficiency and accuracy of the clustering; it can also reduce the probability of local optimal solutions to a certain extent [17,18]. The mechanism of the Canopy algorithm also gives it good performance on high-dimensional data [10].
However, the Canopy algorithm is not perfect. The selection of T1 and T2 clearly has a great impact on its performance. If the selection of the k value is the core of the k-means algorithm, then the determination of the T1 and T2 thresholds is the key to the Canopy algorithm. If the selected loose threshold T1 is too large, a data object may be included in many different Canopies, and the computational cost of the second step will increase greatly. If the selected tight threshold T2 is too large, the number of clusters obtained will be very small, the accuracy of the initial Canopy center selection will be greatly reduced, and the algorithm may even fall into the trap of local optimality, thereby reducing the accuracy of the clustering results.

Canopy Algorithm Based on Neural Network.
Through the analysis and summary of the Canopy algorithm, this section proposes an idea for improving it. The Canopy algorithm can be used to obtain relatively rough clusters; the results of the Canopy algorithm are fed into the LSTM algorithm for improvement, and then the k-means iteration is performed to complete the final clustering of the data [19,20]. From the analysis of the algorithm principle, the improved algorithm first avoids the manual selection of the number of clusters k, and the Canopy preprocessing then greatly reduces the burden of the k-means algorithm, thereby improving the accuracy and operating efficiency of the algorithm. The next section tests the performance of the improved algorithm through specific experiments.
In this article, the improved algorithm is named the LSTM-Canopy algorithm, and its specific process is given in Algorithm 1.

Algorithm Performance Analysis.
In this section, the improved LSTM_Canopy algorithm is compared with the traditional clustering algorithm on multiple data sets, focusing on the performance improvement that the LSTM_Canopy algorithm brings over the traditional clustering algorithm.
The University of California maintains a database dedicated to scientific research, the UCI Machine Learning Repository. The data in this repository are widely used in artificial intelligence research. This test selected three data sets from the UCI repository.
The Seeds data set contains 210 pieces of data with dimension 7 and a standard cluster number of 3; it is recorded as data set DS1 in this test. The Glass data set contains 214 pieces of data with dimension 10 and a standard cluster number of 7; it is recorded as data set DS2. The Car Evaluation data set has 1728 pieces of data with dimension 6 and a standard cluster number of 4; it is recorded as data set DS3. The parameters of each data set are summarized in Table 1.
These data sets were downloaded from the UCI website, processed (for example, the Seeds data set needs to be normalized), and then imported into MATLAB for simulation. The simulation of two-dimensional data clustering is shown in Figure 4.
There are two criteria for measuring the accuracy of the algorithms in this experiment: the accuracy rate P and the minimum error sum of squares E_min.
Assume that the data set contains n data objects and that, after the clustering algorithm ends, the data objects are clustered into k classes. Let d_i be the number of data objects correctly assigned to the i-th cluster (correctness is judged against the standard clustering of the UCI data set); then the accuracy is defined as

P = (1/n) Σ_(i=1..k) d_i.

Let s be the number of clusters and n_k the number of data objects in the k-th cluster, let c_k represent the cluster center of the k-th cluster and d_i^k the i-th data object in the k-th cluster; then the error sum of squares is defined as

E = Σ_(k=1..s) Σ_(i=1..n_k) ||d_i^k − c_k||².

The minimum error sum of squares E_min is the smallest of the error sums of squares after the algorithm ends. Three sets of parameters with different values were selected to perform ten clustering simulation experiments on the DS1, DS2, and DS3 data sets using the k-means algorithm and the LSTM_Canopy algorithm. Then the average accuracy and the minimum error sum of squares of the ten clustering runs were calculated; the results are shown in Table 2 and Figures 5 and 6.
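The two criteria can be computed directly; the toy points and centers below are illustrative values, not data from the experiments.

```python
import numpy as np

def accuracy(correct_per_cluster, n):
    """P = (sum over the k clusters of correctly assigned objects d_i) / n."""
    return sum(correct_per_cluster) / n

def sse(clusters, centers):
    """E = sum over clusters k of sum over objects of ||d_i^k - c_k||^2."""
    return sum(
        float(np.sum((np.asarray(pts) - np.asarray(c)) ** 2))
        for pts, c in zip(clusters, centers)
    )

# Toy example: two clusters of 2-D points with known centers
clusters = [np.array([[0.0, 0.0], [2.0, 0.0]]), np.array([[5.0, 5.0]])]
centers = [np.array([1.0, 0.0]), np.array([5.0, 5.0])]
P = accuracy([2, 1], n=3)   # all 3 objects correctly assigned -> 1.0
E = sse(clusters, centers)  # (1 + 1) + 0 = 2.0
```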
It can be seen from Table 2 and Figure 5 that the LSTM_Canopy algorithm improves the accuracy to different degrees compared with the traditional k-means algorithm. When the data set is large, the dimension is high, and there are many classes, the accuracy of the k-means algorithm drops sharply, but the LSTM_Canopy algorithm keeps the clustering accuracy at an acceptable level, which proves that the LSTM_Canopy algorithm can greatly improve the clustering accuracy on high-dimensional data and massive data. The minimum error sum of squares of the LSTM_Canopy algorithm is also much lower than that of the k-means algorithm, and its clustering accuracy and stability are improved, which proves that the data similarity within the clusters obtained by the improved algorithm is higher. It can be seen from Figure 6 that the accuracy of the LSTM_Canopy algorithm is always higher than that of the k-means algorithm and its curve is noticeably smoother, which proves that the improved algorithm has higher clustering accuracy and stability and further verifies the previous results.

(Figure 3 legend: within the T2 radius, a data object belongs to the current Canopy and cannot become a Canopy center; between T2 and T1, it belongs to the current Canopy but may itself become the center of another Canopy; outside the T1 radius, it does not belong to this Canopy.)

Algorithm 1: LSTM-Canopy algorithm.
Input: data set D to be clustered.
Algorithm flow:
(1) Set the thresholds T1 and T2 as region radii (T1 > T2);
(2) Select a data object d from D as a candidate center and calculate the distance l between each remaining object in D and d;
(3) If l < T1, mark the object as weakly correlated and assign it to the Canopy of d; if l < T2, mark the object as strongly correlated and delete it from the data set D;
(4) Repeat steps (2)-(3) until the data set D is empty;
(5) Regard each of the obtained Canopies as a cluster and its Canopy center as the cluster center, and record the number of clusters as k;
(6) Calculate the distances from the remaining data objects to the k centers, and assign each object to the cluster whose center is closest;
(7) Update the center point of each cluster according to the data assigned to the k clusters;
(8) Repeat steps (6) and (7) until the clusters no longer change.
Output: the set of k clusters.
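The two-stage procedure of Algorithm 1 (rough Canopy division followed by k-means refinement) can be sketched as follows. This is a simplified illustration: the Canopy stage here retains only the centers used to seed the k-means refinement, and the LSTM self-learning factor of the full LSTM-Canopy algorithm is omitted.

```python
import numpy as np

def canopy(data, t1, t2):
    """Rough Canopy stage: pick centers and remove strongly
    correlated objects (distance < t2) from the pool. Requires t1 > t2."""
    assert t1 > t2
    remaining = list(range(len(data)))
    centers = []
    while remaining:
        center = data[remaining[0]]          # next candidate center
        centers.append(center)
        dists = np.linalg.norm(data[remaining] - center, axis=1)
        # keep only objects outside the tight threshold t2
        remaining = [idx for idx, d in zip(remaining, dists) if d >= t2]
    return np.array(centers)

def kmeans_refine(data, centers, max_iter=100):
    """Precise stage: k-means iteration seeded with the Canopy centers."""
    for _ in range(max_iter):
        labels = np.argmin(
            np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2),
            axis=1)
        new_centers = np.array([
            data[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))])
        if np.allclose(new_centers, centers):  # clusters no longer change
            break
        centers = new_centers
    return labels, centers

# Demo on two well-separated synthetic blobs
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
centers = canopy(data, t1=3.0, t2=2.0)
labels, centers = kmeans_refine(data, centers)
```

Because the Canopy stage determines k from the thresholds rather than from a preset value, the k-means refinement starts from reasonable centers and converges in few iterations.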

Visualization System Design and Realization of College Students' CP Path
The core task of the college CP expert system based on data mining is to complete the CP work for college students with the idea of an expert system. The college CP of this subject is divided into academic planning and employment guidance for college students, of which employment guidance is the core content of the study. In order to provide correct employment guidance for college students, it is necessary to mine the data of college students and design the reasoning mechanism of the expert system based on the knowledge mined. Therefore, the system should not only have the knowledge-acquisition means and reasoning mechanism of an expert system but also have data processing and management functions, so that it can complete the work of a simple management information system. The functions of the college CP expert system based on data mining mainly include two parts: the foreground functions and the background functions. The foreground functions are the business functions of the system, that is, the service functions that the system provides when a user uses it; the background functions are the management functions, that is, the functions with which system administrators at all levels manage and maintain the system according to their own authority. This section analyzes the requirements of the system from the perspectives of the foreground business functions and the background management functions.

Overall System Design.
This chapter designs the system workflow in a modular way according to the requirement analysis. The workflow takes into account the functional and performance requirements of the system and provides the basis for its realization. The overall framework of the system is shown in Figure 7.

System Function Design.
The core of the expert system is the knowledge base and the inference engine. The knowledge base stores a large amount of professional knowledge and experience in a certain field provided by domain experts. The inference engine simulates the thinking process of human experts based on the knowledge in the system and the demands of the expert system's users, and answers the users' questions on the experts' behalf. The design of the knowledge base and inference engine determines the performance of the expert system; their construction, together with the choice of algorithm, is the key to the design of the expert system. The framework of the expert system for college CP based on data mining is shown in Figure 8. The core functions of the college career planning system based on data mining are the students' academic planning and employment guidance. Clicking the Career Planning tab at the top of the page switches to the career planning page, and selecting the academic planning function shows the professional courses recommended by the expert system, which students can use as a reference when choosing courses. When a student user enters the academic planning page, the front end of the web page obtains the student's id and sends a POST request to the server. The server first obtains the institute and major fields from the student table according to the id, so as to determine the student's college and major information.
Then, the server obtains the course-selection information of seniors in the same major from the system database and selects the courses with relatively high selection rates. It renders the course data into the Career_Plan.html page in templates and returns it, thus realizing the academic planning function for students in school. The realization of this function of the college CP system is shown in Figure 9.
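The query flow described above can be sketched as a single function. This is a hypothetical sketch: the schema names (a `student` table with `id`, `institute`, `major` and a `course_selection` table with `major`, `course`) are illustrative stand-ins for the system's actual MySQL schema, and SQLite is used here only to keep the example self-contained.

```python
import sqlite3

def recommend_courses(con, student_id, top_n=5):
    """Look up a student's college and major, then recommend the courses
    most frequently chosen by students of the same major."""
    row = con.execute(
        "SELECT institute, major FROM student WHERE id = ?", (student_id,)
    ).fetchone()
    if row is None:
        return None  # unknown student id
    institute, major = row
    # Courses of the same major, ranked by how often seniors selected them
    courses = [c for (c, _) in con.execute(
        """SELECT course, COUNT(*) AS n FROM course_selection
           WHERE major = ? GROUP BY course ORDER BY n DESC LIMIT ?""",
        (major, top_n)).fetchall()]
    return {"institute": institute, "major": major, "courses": courses}

# Minimal in-memory demo with illustrative rows
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE student (id TEXT PRIMARY KEY, institute TEXT, major TEXT);
CREATE TABLE course_selection (major TEXT, course TEXT);
INSERT INTO student VALUES ('s1', 'School of CS', 'Software Engineering');
INSERT INTO course_selection VALUES
 ('Software Engineering', 'Data Mining'),
 ('Software Engineering', 'Data Mining'),
 ('Software Engineering', 'Machine Learning');
""")
plan = recommend_courses(con, "s1")
```

In the real system, the returned dictionary would be rendered into the Career_Plan.html template and sent back as the response to the student's POST request.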

System Data Layer Design.
The data layer of the system stores all the data and information in the system. The data layer is invisible to ordinary users of the system and visible only to those who design and maintain it, which ensures the security of the system. The expert system of college CP is based on data mining, and its data layer includes original data, knowledge data, and system data.
Raw data are the original data without preprocessing. System data are the data generated and used by the users of the system and are stored in a MySQL relational database. Knowledge data are the key data for the operation of the expert system, including the knowledge stored in the knowledge base and the reasoning logic of the inference engine. The knowledge data are formed by the preprocessing and data mining of the students' original data; they are large in quantity and of high value, so they are stored in a backup database in the form of text documents. Such data isolation not only ensures the security of the data but also facilitates its study by domain experts and system knowledge engineers. The data types are shown in Table 3.

System Test.
After completing the design and implementation of the system, this chapter tests it. The system test includes tests of each functional module and a performance test to verify whether the functions of the system are complete and whether its performance can meet the needs of many students using it at the same time.

Server Load Test.
Load testing is an important component of performance testing.
The purpose is to test whether the system can operate under load conditions that approach or even exceed its upper limit. Load testing helps testers examine the stability of the system, determine the system's user capacity, and make targeted improvements to the system.
When load testing the server of this system, the number of user terminal computers connected to the server is increased in batches from 10 to 40. The testers use various functions of the system on the terminal computers, and the average response time of the system is recorded as the performance index of the server load test. In this test, the ratio of student testers to manager testers is 4:1. The response times of the system (accurate to 0.1 seconds) when testers use the various front-end functions are shown in Figure 10.
As shown in Figure 10, the system showed relatively good performance in the load test. When the load reached 40 units, the response time of the system remained below 5 s.
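The batched ramp-up described above can be sketched as follows. This is an illustrative harness, not the test procedure actually used: `request_fn` stands in for an HTTP call to the system under test, and the `time.sleep` in the demo is a placeholder for a real request.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(request_fn, n_clients, requests_per_client=10):
    """Run n_clients concurrent simulated terminals, each issuing
    requests_per_client calls to request_fn, and return the average
    response time in seconds."""
    timings = []

    def client():
        for _ in range(requests_per_client):
            start = time.perf_counter()
            request_fn()  # e.g. an HTTP request to the system's front end
            timings.append(time.perf_counter() - start)

    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        for _ in range(n_clients):
            pool.submit(client)
    # the with-block waits for all clients to finish
    return sum(timings) / len(timings)

# Ramp the load from 10 to 40 simulated terminals in batches
results = {}
for n in (10, 20, 30, 40):
    avg = load_test(lambda: time.sleep(0.001), n_clients=n)
    results[n] = avg
```

Plotting `results` against the batch size would reproduce the shape of a response-time curve such as the one in Figure 10.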

Server Stress Test.
Stress testing is a common means of system testing, and it is a destructive performance test.
Through the stress test of the server, the stability of the system under a long period of high load can be tested; applying an excessive load can also crash the system and expose hidden problems early, so that targeted improvements can be made to the system.
Apache JMeter is a very popular testing tool. It is open-source software with a graphical interface and is widely used in the performance testing of simulated network systems, servers, and other systems. This test uses JMeter 5.1. The stress test results are shown in Table 4.
As shown in Table 4, the performance of the system varies with the number of users. Specifically, as the number of users increases, the error rate also increases.
Throughput refers to the total number of requests processed by the server in one test, while the throughput rate is the ratio of throughput to time, reflecting the server's ability to process requests per unit time. As the number of simulated users increases, the server throughput rate continues to decrease. When the number of simulated users reaches 500, the decreasing trend of the throughput rate slows down and finally stabilizes at around 800 req/s. Combined with the preceding error-rate analysis, it can be concluded that the system can meet the normal use of 400-500 users when executing 1000 cycles of server request operations. As far as the network environment used in the current test is concerned, the server of this system has successfully passed the test task.
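The distinction between throughput and throughput rate can be made concrete with a small calculation; the figures below are illustrative values chosen to land at the ~800 req/s level reported above, not numbers from Table 4.

```python
def summarize(total_requests, errors, duration_s):
    """Error rate and throughput rate as defined above:
    throughput = total requests processed in the test,
    throughput rate = throughput / duration (req/s)."""
    return {
        "error_rate": errors / total_requests,
        "throughput_req_per_s": total_requests / duration_s,
    }

# e.g. 500 users x 1000 cycles finishing in 625 s -> 800 req/s (illustrative)
stats = summarize(total_requests=500_000, errors=5_000, duration_s=625.0)
```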

Conclusion
This article studies the related technologies of a college CP expert system based on DLBA and designs and implements the system. The whole system includes the front-end business system used by students, the back-end management system used by managers at all levels, and the expert system that provides students with CP. This article uses a large number of real student data and adopts a data mining algorithm combining cluster analysis and classification analysis to build the knowledge module and reasoning mechanism of the expert system, thus realizing the core functions of the system. Finally, this article tests the system by professional means. The test results show that the system has met the expected functional and performance requirements. However, due to the limitations of the author's time and personal ability, the front-end web pages of the system realize only the basic required functions, and the web page design is not polished. The system is not implemented on mobile terminals, so users can only access it through web pages, which is not convenient enough. Follow-up research will be carried out in these directions.

Data Availability
The data used to support the findings of this study are included in the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.