A College Student Behavior Analysis and Management Method Based on Machine Learning Technology

A digital campus generates a large amount of student-related data, and analyzing and applying these data has become key to improving student management. Analyzing student behavior data can help schools give early warning of dangerous events and strengthen campus safety. It can also describe student behavior with real data, thereby providing quantitative support for scholarship and grant evaluation. This paper takes the students of a university as the research object, collects various data from the digital campus platform, and clusters the data with an adaptive K-means algorithm, a machine learning method. The behavior of college students is then analyzed from the clustering results to provide a basis for education management and for improving students' learning ability. Specifically, students' study, life, and consumption data are selected to describe their behavior at school. These data are input into the adaptive K-means algorithm to obtain different types of consumption habits, living habits, and learning habits. The analysis reveals a group of students with low financial ability, students who spend too much time online, and students who borrow too few books. According to the characteristics of these problems, targeted management suggestions are provided for teachers and schools. The analysis of student behavior based on machine learning technology provides a reference for formulating student management policies and gives teachers information on students' personality characteristics, which helps to improve teaching. In short, managing students on the basis of behavior analysis results gives the school a foundation for reasonable management policies, thereby promoting precision management and scientific decision-making.


Introduction
The establishment of a digital campus has improved the efficiency of university management and brought great convenience to students, faculty, and staff. The digital management system collects a large amount of data, which plays an important role in the management of the school. In the daily management of students, the more we learn about them, the more effective the programs we can implement for different students, so that we can teach students in accordance with their aptitude and improve the education level of the school. The traditional analysis and management of student behavior relies mostly on the personal experience of the manager and lacks an individualized understanding of the learner. It also cannot guide students' learning behaviors in depth, provide personalized learning situations, or promote learning optimization. Analyzing students' life and learning behavior with intelligent technology is of great significance for identifying students with potentially abnormal behavior and for predicting students' future development. The key to understanding students is the data collected in the digital campus on their study, life, and consumption. At present, many schools have established all-in-one card systems, which make students' daily campus life more convenient. Students can use the campus card to pay in canteens and supermarkets or to borrow books in the library. These operations generate a large amount of student behavior data. How to use these data to discover the information they contain is a problem that needs to be solved. Based on machine learning technology, this paper conducts cluster analysis on campus all-in-one card data and analyzes the behavior of students.
In recent years, there has been a lot of research on student behavior analysis. Reference [1] measures student behavior based on entropy. The study defined two behavioral characteristics, orderliness and diligence, and analyzed the correlation between the regularity of campus life and academic performance. Reference [2] proposed a preclass student performance prediction method based on multi-instance multi-label learning. The idea of this method is to use students' behavior in completed courses to predict their difficulties in learning new courses. The results make it easier for teachers to track and understand the learning situation of each student. Reference [3] proposes an education measurement system that characterizes educational behavior by collecting campus Wi-Fi network data. The results show that the system can obtain information about the relationship between punctuality, distraction, and academic performance. Reference [4] uses an improved recurrent neural network to simulate the student's answering process from the student's answer records and the content of each exercise in order to predict the student's future performance. Based on MOOC learner behavior data, reference [5] established a prediction model that combines a clustering algorithm and a neural network to mine the rules of the learning process. The predicted results can provide personalized guidance for each learner. Reference [6] proposes a classification system that analyzes the behavior of students in the teaching system and finds students with poor performance early. Reference [7] predicts students' course performance from the data they generate during online course learning. Reference [8] inputs multimodal data and semester information into a linear mixed effects model to predict the future performance of students. Reference [9] found that an improved random forest method can be used to predict grades for freshmen and for existing courses.
The application of these technologies brings hope to student degree planning, lecturer intervention, and personalized advice. Annapol State University in India has developed a product [10,11], which is used to monitor student activity areas. This product analyzes students' participation in organizing activities based on the records of students swiping their ID cards. Through this software, information about students who participate in activities with low frequency can be detected.
Most of the above studies analyze student behavior with machine learning technologies [12][13][14]. The currently popular machine learning technologies include traditional machine learning methods [15,16] and some advanced machine learning technologies [17][18][19]. These advanced machine learning technologies have been used in many practical applications, such as medicine [20], industry [21], and basic theory research [22][23][24]. In terms of performance, behavior analysis based on deep learning is more accurate, but its computational complexity and hardware requirements are also higher. The time complexity of behavior analysis based on traditional machine learning is relatively low, and the algorithms are simple to implement. Therefore, this article chooses student behavior analysis based on traditional machine learning. Among these methods, the K-means clustering algorithm [25,26] is widely used in student behavior analysis research. However, traditional K-means is very sensitive to outliers, and a small number of outliers can have a great impact on the final clustering results. In addition, the value of K must be fixed in advance and cannot adapt to the data. In response to these problems, this paper uses an adaptive K-means algorithm to improve the efficiency and clustering accuracy of the algorithm.
The main work of this paper is as follows:
(1) Student consumption, life, and learning data are collected through the campus all-in-one card system, and data from different institutions are integrated to form a comprehensive data set for student behavior analysis.
(2) An adaptive K-means clustering algorithm is used for student behavior analysis. The algorithm introduces the elbow rule to optimize the data and find points far away from the cluster, thereby effectively improving the clustering performance. In addition, the algorithm introduces the idea of self-adaptation and automatically adjusts the value of k based on the sum of squared errors. The adaptive K value is better suited to the actual data.
(3) Based on the above model, the analysis results of students' consumption, life, and learning are obtained. The analysis results can be used to improve the management level of the school and to teach students in accordance with their aptitude.

2. Related Work
2.1. Student Behavior Analysis and Management. The complexity of individual college students makes it impossible for school administrators to understand students' dynamics in real time. When some students behave abnormally, the students around them may find it inconvenient or embarrassing to inform the management staff of the specific situation, which causes a certain lag in the work of the student management staff. To understand the behavior and habits of students in real time, student management needs to change from passive to active, and the daily behavior data of students needs to be displayed intuitively in a student behavior analysis system. On the premise of ensuring the privacy and safety of students, the data of the system mainly comes from the school's digital systems, and machine learning technology is used to analyze the data. The goal of student behavior analysis is shown in Figure 1.

Wireless Communications and Mobile Computing
As shown in Figure 1, first, the historical data stored in the school's digital management systems need to be integrated and stored. Second, a data analysis model is established based on the dimensions of student behavior data analysis. Third, a student behavior data analysis system is established to achieve student management goals based on data services. Finally, a real-time monitoring system for student behavior is established to monitor students with abnormal behavior in real time.

Student Behavior Analysis Process Based on Machine Learning. The process of applying machine learning to student behavior analysis is shown in Figure 2. As shown in Figure 2, it is first necessary to collect and preprocess student behavior data. The data are collected by exporting them from systems such as the school's teaching system and campus card system. The format and structure of the exported data vary greatly, so the various data need to be integrated to obtain comprehensive data. Second, because the integrated data contain a lot of noise, the data need to be preprocessed. The preprocessed data often have problems such as high dimensionality, so feature extraction is necessary. Third, the behavior analysis model is trained on the training set. Fourth, the test data are input into the trained analysis model to obtain the analysis result. Finally, related management and application are carried out based on the analysis results.

Student Behavior Analysis Based on Adaptive K-Means
3.1. Behavior Analysis Framework. As shown in Figure 3, first, the data collected by the campus card, the educational administration system, and other systems are integrated. These data are mainly composed of students' life, consumption, and learning data. Second, because the integrated data contain noise, preprocessing such as data cleaning is necessary. Third, feature extraction is performed on the preprocessed data, and the main features are extracted for subsequent processing.
The feature extraction method used in this paper is principal component analysis (PCA) [27,28]. Fourth, divide the feature data set into a training set and a test set. The training set is used to train the behavior analysis model. Fifth, the test set is input to the analysis model to obtain the analysis result. Finally, the analysis results are managed and applied.
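The feature extraction and data split described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation (the paper reports using MATLAB): the function names `pca_reduce` and `split_train_test`, the number of components, the 80/20 split, and the random demo data are all assumptions made for the example.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                       # center each feature
    # SVD of the centered data; the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, components, mean

def split_train_test(Z, test_fraction=0.2, seed=0):
    """Randomly split the rows of Z into a training set and a test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Z))
    n_test = int(round(test_fraction * len(Z)))
    return Z[idx[n_test:]], Z[idx[:n_test]]

# Hypothetical data: 100 students, 6 raw behavior features.
X_demo = np.random.default_rng(0).normal(size=(100, 6))
Z_demo, components, mean = pca_reduce(X_demo, n_components=2)
train, test = split_train_test(Z_demo, test_fraction=0.2)
```

In practice the raw features (amounts, counts, durations) have different units, so standardizing them before PCA would probably be needed. The sketch only centers the data.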

Behavior Analysis Model. The analysis model used in Figure 3 is an adaptive K-means algorithm. The idea of the algorithm is to optimize the sample data set X through the elbow rule to determine outliers. The algorithm is run on the sample data set with the outliers removed, and after it completes, the final outliers are determined according to the similarity between the removed samples and each cluster. Based on the adaptive idea, after each iteration is completed, the value of k is automatically adjusted according to the error of the cluster evaluation index of each cluster until the error range is met.

The Elbow Rule Detects Outliers. The similarity measure of the traditional K-means algorithm is based on Euclidean distance. Outliers affect the estimation of the k value and thereby increase the time complexity of the algorithm. The elbow method is used to detect outliers in the data set and so optimize the algorithm. The specific implementation is as follows.

Let the data set be $X = \{x_i \mid i = 1, 2, \cdots, m\}$, where $m$ is the number of samples and each sample has $n$ features ($n > 0$). The samples are divided into categories $C = \{c_1, c_2, \cdots, c_k\}$, where $k$ is the number of clusters. Initially, all samples in $X$ are regarded as one class, and the initial class center is

$$\mu_j = \frac{1}{N(C_j)} \sum_{x_i \in C_j} x_i, \tag{1}$$

where $C_j$ represents the sample set contained in the j-th cluster and $N(C_j)$ represents the number of samples in the j-th cluster. Calculate the set $D$ of Euclidean distances from each sample in $X$ to the cluster center $\mu_j$:

$$D = \{ d(x_i, \mu_j) \mid i = 1, 2, \cdots, m \}, \tag{2}$$

where

$$d(x_i, \mu_j) = \sqrt{\sum_{t=1}^{n} (x_{it} - \mu_{jt})^2}. \tag{3}$$

According to the elbow rule, sort the data in $D$ from small to large to get an x-d two-dimensional line graph, where $d$ is the distance from a sample data point to the center point and $x$ is the sample corresponding to $d$. As $d$ increases, the position where the distortion improvement effect increases the most is the elbow. Divide at the elbow and temporarily define the data whose distance is greater than the distance at the elbow point as outliers. Assuming there are $w$ samples ($w \le m$) after eliminating outliers, there are $m - w$ outliers. Store the non-outliers in the sample set $X'$ and renumber the samples to get $X' = \{x_1, x_2, \cdots, x_w\}$. The outliers are stored in the data set $Y$ and renumbered to obtain $Y' = \{x_{w+1}, x_{w+2}, \cdots, x_m\}$. Using the sample set $X'$ when running the algorithm eliminates the influence of outliers to a certain extent.
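The elbow-based split can be sketched as below. The exact elbow condition is not recoverable from the text, so this sketch assumes the elbow is the sorted position just before the largest jump between consecutive distances. The function name `elbow_split` and the demo points are hypothetical.

```python
import numpy as np

def elbow_split(X):
    """Split samples into kept samples and temporary outliers at the elbow.

    Assumption: the elbow is the position in the sorted distance list just
    before the largest gap between consecutive distances.
    """
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)                  # single-class center, Equation (1)
    d = np.linalg.norm(X - center, axis=1)   # Euclidean distance to the center
    order = np.argsort(d)                    # sample indices, nearest first
    gaps = np.diff(d[order])                 # jumps between sorted distances
    elbow = int(np.argmax(gaps))             # position of the largest jump
    keep_idx = order[: elbow + 1]            # indices forming X'
    outlier_idx = order[elbow + 1:]          # indices forming Y (temporary)
    return keep_idx, outlier_idx, d[order][elbow]

# Hypothetical data: five points near the origin and two distant points.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0],
                   [0.0, -0.1], [5.0, 5.0], [6.0, 6.0]])
keep_idx, outlier_idx, elbow_distance = elbow_split(X_demo)
```

Here the two distant points end up in the temporary outlier set. Note that the center is computed from all samples, including the outliers, as in the description above.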
After the algorithm is implemented, the clusters $C = \{c_1, c_2, \cdots, c_z\}$ are obtained, where $z$ is the number of clusters, and the maximum distance from a sample in the j-th cluster to the center of that cluster is $\max d_j$.
According to Equation (2), calculate the distances $\{d_{b1}, d_{b2}, \cdots, d_{bz}\}$ from each sample $x_b$ ($b \in [w+1, m]$) in $Y$ to the center of each cluster. Suppose there are clusters for which the distance from $x_b$ to the cluster center is less than the maximum distance from the samples in that cluster to its center, namely, $(d_{b1} < \max d_1) \lor (d_{b2} < \max d_2) \lor \cdots \lor (d_{bz} < \max d_z)$. Then, assign $x_b$ to the nearest of the clusters that satisfy this condition.
If there is no such cluster, the sample is defined as an outlier. Once all the samples in $Y$ have been traversed, the final outliers are determined.
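A minimal sketch of this reassignment step follows, assuming the cluster centers and each cluster's maximum member distance are already available. The function name `reassign_outliers` and the demo values are hypothetical. A result of -1 marks a final outlier.

```python
import numpy as np

def reassign_outliers(Y, centers, max_radius):
    """Return, for each sample in Y, the index of the nearest cluster whose
    maximum member distance exceeds the sample's distance, or -1 if none."""
    Y = np.asarray(Y, dtype=float)
    centers = np.asarray(centers, dtype=float)
    max_radius = np.asarray(max_radius, dtype=float)
    # dist[b, j] is the Euclidean distance from sample b to cluster center j.
    dist = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
    inside = dist < max_radius[None, :]      # d_bj < max d_j
    # Ignore clusters that do not satisfy the condition when picking the nearest.
    masked = np.where(inside, dist, np.inf)
    nearest = masked.argmin(axis=1)
    return np.where(inside.any(axis=1), nearest, -1)

# Hypothetical clusters: centers at (0, 0) and (10, 0) with radii 2 and 3.
assigned = reassign_outliers(
    Y=[[1.5, 0.0], [8.0, 0.0], [5.0, 0.0]],
    centers=[[0.0, 0.0], [10.0, 0.0]],
    max_radius=[2.0, 3.0],
)
```

The first sample falls within the first cluster's radius, the second within the second cluster's radius, and the third is within neither, so it stays an outlier.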

Selection of Adaptive k Value. An appropriate k value must be selected on the basis of a clustering evaluation index. This article uses the within-cluster sum of squared errors (SSE), calculated as

$$E = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2, \tag{6}$$

where $k$ represents the number of clusters, $x$ represents a sample, $\mu_j$ represents the cluster center of the j-th cluster, and $C_j$ represents the set of samples contained in the j-th cluster. $E$ describes the tightness of the samples in each cluster to a certain extent; the smaller $E$ is, the better the clustering effect. From the SSE, the sum of squared errors $Je_j$ within each cluster is obtained:

$$Je_j = \sum_{i=1}^{N(C_j)} \lVert x_i - \mu_j \rVert^2, \tag{7}$$

where $x_i$ is a sample in the j-th cluster, $N(C_j)$ is the number of samples in the j-th cluster, and $\mu_j$ is the cluster center of the j-th cluster. A smaller $Je_j$ means a better clustering effect for the j-th cluster. Initially, set the threshold $b$ of $Je$ and the threshold $N$ of the minimum number of samples in a cluster. After each clustering pass, the numbers of samples $(N_1, N_2, \cdots, N_k)$ in the clusters can be obtained, and $(Je_1, Je_2, \cdots, Je_k)$ can be calculated according to Equation (7). Then, calculate the error $\Delta Je_j$ of the clustering evaluation index of the j-th cluster and the difference $\Delta N_j$ between the number of samples in the cluster and the initial value:

$$\Delta Je_j = Je_j - b, \tag{8}$$

$$\Delta N_j = N_j - N. \tag{9}$$

Combining Equation (8) and Equation (9) gives Equation (10), the change $\Delta k_j$ of the k value in the j-th cluster, where $w$ is the number of samples in the data set $X'$, $\mathrm{sgn}(\cdot)$ is the sign function, $\theta(\cdot)$ is the unit step function, and $\Pi(\cdot)$ denotes rounding up.

If $N_j < N$, then $\Delta N_j$ is negative and $\mathrm{sgn}(\Delta N_j) = -1$, so $(\mathrm{sgn}(\Delta N_j) + 1)/2 = 0$ and $(\mathrm{sgn}(\Delta N_j) - 1)/2 = -1$. $\Delta k_j = -1$ means that when the number of samples in the j-th cluster is less than the initial value, the cluster center of that cluster is deleted.
If $N_j > N$, then $\Delta N_j$ is positive and $\mathrm{sgn}(\Delta N_j) = 1$, so $(\mathrm{sgn}(\Delta N_j) + 1)/2 = 1$ and $(\mathrm{sgn}(\Delta N_j) - 1)/2 = 0$. The discussion is divided into the following two situations:
(1) When $Je_j > b$, $\Delta Je_j$ is positive, $\theta(\Delta Je_j) = 1$, and $\Delta k_j = \Pi(\log_m(\Delta Je_j + 1))$, so $\Delta k_j > 0$. It is necessary to add a new cluster center near the j-th cluster center to reduce the error evaluation index within the cluster. Generally, $0 < \log_m(\Delta Je_j + 1) < 1$, so $\Delta k_j = 1$. Only when the error $\Delta Je_j$ is particularly large will $\log_m(\Delta Je_j + 1)$ be greater than 1.
(2) When $Je_j < b$, $\Delta Je_j$ is negative, $\theta(\Delta Je_j) = 0$, and $\Delta k_j = 0$. When the cluster evaluation index in the j-th cluster is less than the set initial value, the cluster center of the cluster is neither deleted nor added. The sample closest to the j-th cluster center is set as the new cluster center to reduce the clustering evaluation index within the cluster.
After traversing each cluster, the changes in the k value $(\Delta k_1, \Delta k_2, \cdots, \Delta k_k)$ are obtained according to Equation (10). The updated $k'$ is

$$k' = k + \sum_{j=1}^{k} \Delta k_j. \tag{11}$$

If $k' = k$, terminate the loop; if $k' \ne k$, continue the loop.
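The case analysis above can be sketched in code. Equation (10) itself is not legible in the text, so the sketch assumes a form consistent with the three cases, $\Delta k_j = \frac{\mathrm{sgn}(\Delta N_j)+1}{2}\,\theta(\Delta Je_j)\,\Pi(\log_m(\Delta Je_j+1)) + \frac{\mathrm{sgn}(\Delta N_j)-1}{2}$. The function names, the threshold values, and the treatment of $N_j = N$ as "enough samples" are assumptions; the logarithm base is left as a parameter because the text writes $\log_m$ while also mentioning $w$.

```python
import math
import numpy as np

def cluster_sse(X, labels, centers):
    """Je_j: sum of squared errors within each cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    return np.array([np.sum((X[labels == j] - np.asarray(c, dtype=float)) ** 2)
                     for j, c in enumerate(centers)])

def delta_k(je_j, n_j, je_threshold, n_min, log_base):
    """Change in the number of centers contributed by one cluster (assumed form)."""
    d_je = je_j - je_threshold     # error of the evaluation index
    d_n = n_j - n_min              # difference from the minimum sample count
    if d_n < 0:
        return -1                  # too few samples: delete this center
    if d_je > 0:
        # Rounded-up logarithm; guard against a zero result from rounding.
        return max(1, math.ceil(math.log(d_je + 1, log_base)))
    return 0                       # within the threshold: keep k unchanged

# Hypothetical example: two clusters, threshold b = 1.0, minimum size N = 2.
X_demo = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]]
labels_demo = [0, 0, 1]
centers_demo = [[0.0, 1.0], [10.0, 0.0]]
je = cluster_sse(X_demo, labels_demo, centers_demo)
sizes = [2, 1]
deltas = [delta_k(je[j], sizes[j], je_threshold=1.0, n_min=2, log_base=3)
          for j in range(2)]
k_new = 2 + sum(deltas)            # updated k' as the sum of the changes
```

In this example the first cluster exceeds the error threshold and gains a center, while the second has too few samples and loses its center, so the updated value equals the original k.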
The flow of the algorithm is shown in Figure 4. The specific implementation steps of the algorithm are as follows:
Step 1. Data optimization processing. Find outliers according to the elbow rule and store them in sample set Y. The new sample data set after optimization is X′ = {x_1, x_2, ⋯, x_w}.
Step 2. Set the threshold range of the evaluation index Je in a single cluster and the minimum number of samples N in the cluster.

Step 3. Randomly select k samples from the w samples as the initial cluster centers, where 1 < k ≤ w.
Step 4. Calculate the Euclidean distance of each remaining sample to the cluster centers according to Equation (2) and assign each sample to the cluster closest to it.
Step 5. Recalculate the new cluster center of each cluster according to Equation (1).
Step 6. If each new cluster center is the same as the original center, or the change is less than a certain threshold, the iteration is terminated. If the cluster centers change, repeat Steps 4 and 5 until convergence.
Step 7. Calculate the Je_j of each cluster and the number of samples N_j in each cluster. Compare Je_j and N_j with the initial thresholds and calculate Δk_j according to Equation (10). After traversing all the clusters, calculate the new k value k′ according to Equation (11). If any center is deleted or added, return to Step 4 and repeat until no cluster centers are added or deleted.
Step 8. Determine the final divided outliers according to the similarity between the samples in the sample set Y and the cluster centers of each cluster.
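Steps 3 to 7 can be sketched as a short loop. This is an illustrative simplification, not the authors' MATLAB implementation: it omits the outlier handling of Steps 1 and 8, adds at most one center per cluster, and places the new center at the cluster's farthest member, which is an assumption. The names `kmeans` and `adaptive_kmeans`, the `max_rounds` cap, and the demo data are hypothetical.

```python
import numpy as np

def kmeans(X, centers, max_iter=100, tol=1e-6):
    """Plain K-means from given initial centers (Steps 4 to 6)."""
    for _ in range(max_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)                  # Step 4: nearest center
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])                                            # Step 5: recompute centers
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:                               # Step 6: convergence
            break
    return labels, centers

def adaptive_kmeans(X, k, je_threshold, n_min, max_rounds=20, seed=0):
    """Adjust the number of centers after each K-means pass (Step 7)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # Step 3
    for _ in range(max_rounds):
        labels, centers = kmeans(X, centers)
        next_centers, changed = [], False
        for j, c in enumerate(centers):
            members = X[labels == j]
            if len(members) < n_min:                  # too small: delete center
                changed = True
                continue
            next_centers.append(c)
            if np.sum((members - c) ** 2) > je_threshold:
                # Error too large: add one center at the farthest member.
                far = members[np.argmax(np.linalg.norm(members - c, axis=1))]
                next_centers.append(far)
                changed = True
        if not changed or not next_centers:
            break                                     # k' = k: stop
        centers = np.array(next_centers)
    else:
        labels, centers = kmeans(X, centers)          # round cap reached
    return labels, centers

# Hypothetical data: two tight pairs of points far apart.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels, centers = adaptive_kmeans(X_demo, k=2, je_threshold=1.0, n_min=1)
```

Depending on the random initial centers, the loop may stop with a different number of clusters, but on this small example every final cluster has a within-cluster error no greater than the threshold.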

Experiment and Analysis
4.1. Experimental Background. The data used in this article come from a university's digital system database. The consumption data come from the campus all-in-one card system and mainly include canteen consumption and school supermarket consumption. The life data come from the all-in-one card system and mainly include exercise clock-in records and time spent online. The learning data come from the educational administration system and the book borrowing system. A total of 2017 students were selected as a sample. The extracted raw data mainly contain 89,682 pieces of consumption data, 49,860 pieces of life data, and 15,629 pieces of learning data. After the integration and preprocessing of the data, sample data of 3356 students were obtained as experimental data. Table 1 defines students' consumption habits, Table 2 defines students' living habits, and Table 3 defines students' learning habits.
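The hardware configuration used in this experiment is as follows: the CPU is an Intel Core i7, the GPU is a GTX960M with 4 GB of graphics memory, and the memory is 16 GB. The operating system is Windows 10 64-bit, and the development language is MATLAB.

Consumption Data Analysis
The results of cluster analysis of consumption data based on the algorithm used in this paper are shown in Table 4.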

It can be seen from Table 4 that the most suitable k value obtained by the adaptive K-means clustering algorithm used in this article is 5, which shows that 5 types of consumption habits can be obtained from the consumption data of students. Each habit corresponds to a different type of student. The characteristics of each type of student are as follows:
(1) Type 1 students have lower monthly consumption levels and lower single-month peak consumption, but they make more purchases. This type belongs to the group with lower consumption levels. Such students likely have poor family economic conditions and live frugally. It is recommended that school administrators pay attention to the living conditions of such students, and the identification and funding of poor students can consider choosing from this group.
(2) The average monthly consumption of type 2 students is the highest, the number of purchases is high, and the consumption peak is also high, indicating that this type of student is a high consumption group in the cafeteria.
(3) The monthly consumption level of type 3 students is above average, the consumption frequency each month is high, and the consumption amount is stable. This shows that such students often eat in the school cafeteria and that their consumption is stable, which is in line with the normal eating patterns of most students at school.
(4) The monthly consumption level of type 4 students is in the middle, and the average monthly consumption is not high. However, the maximum consumption in a single month is relatively high, and the number of purchases is relatively small. Such students eat irregularly in the cafeteria and probably tend to eat outside school or order takeaways.
(5) The average monthly consumption of type 5 students is relatively low, the number of purchases is relatively small, and the maximum consumption is not high. Such students do not consume frequently in the cafeteria and are more likely to eat and spend outside school. School administrators should pay attention to the food safety and personal safety of such students.

Life Data Analysis
Table 5 shows the results of clustering analysis of life data based on the algorithm used in this paper. The cluster analysis of life data yielded three categories. According to the information in Table 5, the following inferences can be drawn:
(1) Type 1 students have regular schedules and meals every month. They spend a long time online and often participate in physical exercise. Such students have strong self-discipline, good physical fitness, and good living habits.
(2) Type 2 students oversleep more frequently each month. They eat irregularly in the cafeteria, spend a long time online, and exercise less frequently. Such students likely have poor physical fitness. School administrators should pay attention to the learning and class attendance of such students and to whether they often skip classes.
(3) Type 3 students often get up early every month, but they have irregular meals in the cafeteria, spend more time online, and do less physical exercise. Such students probably do not have a healthy habit of eating breakfast and do not like to exercise. School administrators should urge such students to change their unhealthy living habits and pay attention to their health.

Learning Data Analysis
The results of cluster analysis of academic data based on the algorithm used in this paper are shown in Table 6.
Based on the analysis of the learning data with the adaptive K-means algorithm used in this article, students are divided into 4 categories. Based on the information shown in Table 6, the characteristics of each type of student are as follows:
(1) Type 1 students have high classroom attendance, a high number of books borrowed from the library, and a high number of library entrances. Such students study hard and have good study habits.
(2) Type 2 students have a higher class attendance rate and enter the library more often, but they do not borrow many books. Such students are active in class and often read and study in the study room and library. The small number of books they borrow indicates that such students are accustomed to reading books in the library.
(3) Type 3 students have a low class attendance rate, borrow few books, and rarely enter the library. This shows that such students often skip classes and do not study hard enough. School administrators should focus on the learning situation of such students and promptly urge them to form good learning habits.
(4) Type 4 students have an average class attendance rate, borrow fewer books, and make fewer trips to the library. Such students attend class fairly often but spend little time studying after class and do not study very hard. This type of student does not pay much attention to study at ordinary times and crams before exams. It is recommended that such students develop regular study habits and arrange their study time reasonably.

Conclusion
To help the school improve its student management, this paper uses an adaptive K-means clustering algorithm to analyze the behavior characteristics of students. By collecting the digital system data of the campus, students' consumption, life, and learning data are obtained. After the data are preprocessed, PCA is used for feature extraction to obtain the feature data used for model training. The test set is then input into the trained model to get the clustering result. Finally, based on the analysis of the clustering results, the characteristics of each type of student are obtained. According to the results of consumption data analysis, five types of consumer groups were obtained. For groups with low consumption levels, more consideration can be given in poor student evaluation and work-study programs. According to the analysis results of life data, three groups were obtained. Teachers should pay special attention to groups that eat irregularly, spend a long time online, and do not exercise regularly. According to the analysis results of the learning data, 4 groups were obtained. Teachers need to pay more attention to students with an average attendance rate, low book reading volume, and a small number of library entrances. In the management process of the school, students with different characteristics can be taught in accordance with their aptitude, thereby improving the management quality of the school. There are still some shortcomings in this article. For example, in the analysis of student behavior characteristics, the limited data sources in the digital campus platform mean that the behavior characteristics of students in school cannot be fully reflected. Later, as the school's digital campus applications mature, more campus business will move from traditional offline channels to online ones, and more comprehensive student data can be collected.

Data Availability
The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.