Data Mining of Students’ Physical Exercise Based on Cloud Computing

Physical exercise is essential for students: only with good physical fitness can they study well. Mining physical exercise data means analyzing and integrating those data. The purpose of this paper is to study data mining technology for students' physical exercise. This paper proposes a data mining algorithm for students' physical exercise based on Cloud Computing (CdC). Data regression and clustering are added to traditional data mining technology, and the data are processed through variance and covariance, which lets processing steps such as cleaning finish faster and saves a large amount of data processing time. The experiments in this paper show that the proposed algorithm can analyze physical exercise data well. For data samples with sample sizes of 1000 and 2000, the running time is about 20 seconds lower than that of the original algorithm, and the resource factor of the system is stable at about 82%, which shows that the algorithm achieves good load balance.


Introduction
With the rapid development of information and storage technology, computers are used in more and more fields, and a large amount of historical data has accumulated in each of them over time. These skyrocketing volumes of data have reached staggering sizes. Usually, because of the sheer amount of data, traditional statistical methods or simple data analysis methods cannot extract useful information from such massive data. Sometimes, even when the data sets are relatively small, traditional processing methods are powerless because of nontraditional characteristics of the data itself. In addition, some problems cannot be solved with existing data analysis techniques at all. Yet a great deal of information that is difficult for people to perceive is hidden in these data, and it can be very important for real-life applications: this hidden information can often guide decision-making and business conduct in many areas. If these data are not used simply because the means of analyzing them are backward, the data will lose the opportunity to be revisited and utilized, which would be a great waste of data resources. It is therefore necessary to develop methods that can process massive data sets and technologies that can mine and analyze information. Data mining arose in this situation. Data mining methods can discover information and laws that traditional data processing methods cannot find, and because of its effectiveness and efficiency, data mining technology has been widely applied. Physical exercise, as the most important form of mass physical activity, can enhance health and fitness, improve physical appearance, regulate mental and emotional health, improve social relations, and even improve the overall national physique. Physical exercise covers a wide range of people, and its forms, contents, and methods are flexible and diverse.
In terms of form, an individual can choose to exercise alone or with friends. The content includes entertainment, combat, health and corrective sports, fitness, and bodybuilding. Exercise methods are divided into the transformation method, repetition method, competition method, and so on, which are commonly used in daily teaching and training, as well as some exercise methods created in practice, such as morning exercise and interclass exercise. According to age, gender, occupation, and health status, exercisers should choose activities appropriate to their own physical condition and participate in physical exercise in a targeted, step-by-step, and persistent manner. In the process of physical exercise, it is necessary to establish correct exercise values and to avoid venting negative emotions through exercise or loading the body beyond the range of its functions, both of which are detrimental to physical and mental health. In addition, if exercisers can combine movement and stillness, form and spirit, and the internal and external during physical exercise, and apply traditional Chinese body training methods, the benefits will be even greater: as physical fitness improves, the mind is also calmed, which is helpful for meditation.
College students are the pillars of national construction and the driving force for the sustainable development of schools, and they must develop in an all-round way. Within students' comprehensive quality, sports development is also very important: the level of physical quality is an important measure of how fully a college student can participate in future social activities. The school's educational philosophy matters greatly to college students' development, and in university courses the form, content, and innovative teaching methods of physical education are the keys to developing physical education in colleges and universities. The current state of college students' physical fitness is generally not ideal; in some colleges and universities, students perform only the limited physical exercise needed for physical fitness tests. This mode of thinking resembles the traditional Chinese test-taking mode: exercising only to pass the exam rather than making sports a part of life. Therefore, it is necessary to use CdC-based data mining research on students' physical exercise to determine whether the exercise level of the student body has reached the standard. This article makes two main innovations: (1) CdC and data mining technology are applied to research on physical exercise. In today's digital world, everything can be quantified; through the powerful analysis capability of CdC together with data mining technology, students' physical exercise can be analyzed effectively. (2) A data mining algorithm for physical exercise based on CdC is proposed. The algorithm is targeted: it can mine and analyze students' exercise time, time period, and basic data, and the results formed by such data mining are more convincing.

Related Work
We all know that data contain a wealth of useful information. Because the size of data has grown by orders of magnitude since the invention of the computer, data mining research has become very popular. Helma used literature analysis to examine data mining techniques that support machine learning for network intrusion detection, and the findings provided recommendations on when to use each method [2]. Xu et al. believe that the growing popularity and development of data mining technology poses a serious threat to the security of personal sensitive data. To that end, they discuss the privacy issues that each type of user faces, as well as the methods that can be used to protect sensitive data [3]. In a systematic review of the application of machine learning, data mining techniques, and tools in diabetes research, Kavakiotis et al. linked biotechnology and data mining techniques [4]. Chaurasia and Pal also conducted research on female breast cancer using biotechnology and data mining as a foundation. They used Weka software to compare three classification techniques, and the results show that Sequential Minimal Optimization (SMO) has a higher prediction accuracy (96.2 percent) than the IBK and BF Tree methods [5]. Yan and Zheng investigated the use of data mining technology in the stock market. They constructed a "universe" of more than 18,000 fundamental signals from financial statements and used a guided approach to assess the impact of data mining on fundamental-based anomalies. They found that many fundamental signals are significant predictors of cross-sectional stock returns even after accounting for data mining [6]. In physical exercise research, most studies focus on the effects of exercise on human body functions. Grassmann et al. believe that even a single bout of physical exercise can enhance executive function.
Therefore, they conducted a systematic review of the literature to analyze articles evaluating executive function in children with ADHD following acute exercise. They concluded that 30 minutes of physical activity could improve executive function in children with ADHD [7]. Håkansson et al. concluded that the increase in BDNF they found after physical exercise is more likely of peripheral rather than central origin. However, the association between BDNF levels and cognitive function after the intervention may have implications for serum BDNF reactivity as a potential marker of cognitive health [8].
It can be found that in the current research, there is almost no intersection between data mining and physical exercise. Although data mining technology has been used in many aspects, there is not much research on the analysis of physical exercise data.

CdC and Data Mining Technology
3.1. CdC. Cloud computing is the product of the integration of traditional computer technologies and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization, and load balancing. It aims to integrate multiple relatively low-cost computing resources into a resource pool with powerful computing and storage capabilities through the network and push this powerful computing capability to end-users with the help of advanced business models such as SaaS, PaaS, and IaaS.

Wireless Communications and Mobile Computing
As shown in Figure 1, all three service models share the following five characteristics, which are an important criterion for judging whether a new technology is CdC [9]: (1) distribution, (2) virtualization, (3) dynamism and scalability, (4) clusters composed of cheap servers, and (5) massive data storage and processing.

3.2. Data Mining. The emergence of data mining (DM) technology [10] enables people to obtain meaningful information from massive data and to make decisions and plans based on that information. Data mining can extract information that is not known in advance but has potential value from a large amount of noisy data. Data mining is the core of knowledge discovery in databases (KDD): it originates from knowledge discovery and is an important step in it. Therefore, data mining is not only the mining of data but also the discovery and refinement of information, laws, and knowledge that people are interested in and that are useful for social activities. Figure 2 shows the following:

Data Mining General Process.
(1) Determining the business objects: mining carried out with clear goals makes the calculation ideas clearer.
(2) Inputting data: the raw data to be processed are input into the computer system; they may be web data, sensor data, XML documents, or database data.
(3) Data preprocessing: (a) data cleaning: remove the noise data mixed into the original data, accurately identify abnormal points, and delete vacant data; (b) data integration: integrate databases from multiple data sources; (c) data selection: select data according to the data mining requirements and ensure that the selected data let the subsequent analysis proceed smoothly; (d) data transformation: transform the data form appropriately so that the mining results better meet the requirements.
(4) Data mining: obtain useful associations or patterns from the mining results.
(5) Postprocessing, including pattern evaluation and knowledge identification: (a) pattern evaluation: put the discovered patterns into practical applications to test their performance, and use previously trusted knowledge to check for contradictions in the patterns; (b) knowledge identification: represent the data mining results by visual means and display them to customers.
(6) Useful information: the final result of mining, generally useful information that people are interested in.
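The preprocessing sub-steps above (cleaning, integration, selection, and transformation) can be sketched in Python. This is only an illustration of the pipeline shape; the record fields (`student`, `minutes`) are hypothetical and not the paper's actual schema.

```python
# Sketch of the preprocessing steps: cleaning, integration,
# selection, and transformation. Field names are illustrative.

def clean(records):
    """Remove records with vacant values or out-of-range outliers."""
    return [r for r in records
            if r.get("minutes") is not None and 0 < r["minutes"] < 24 * 60]

def integrate(*sources):
    """Merge records from multiple data sources into one list."""
    merged = []
    for src in sources:
        merged.extend(src)
    return merged

def select(records, fields):
    """Keep only the fields needed by the subsequent mining step."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Convert exercise minutes into hours for the mining stage."""
    return [dict(r, hours=round(r["minutes"] / 60, 2)) for r in records]

web_data = [{"student": "A", "minutes": 45}, {"student": "B", "minutes": None}]
db_data = [{"student": "C", "minutes": 90}, {"student": "D", "minutes": -5}]

prepared = transform(select(clean(integrate(web_data, db_data)),
                            ["student", "minutes"]))
print(prepared)  # records B (vacant) and D (abnormal) are removed
```

Records with vacant or abnormal values are dropped in the cleaning step, so only students A and C reach the mining stage.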

Data Mining and Classification in Cloud Environment
(1) Principle of Web Crawler. The purpose of a web crawler is to automatically extract the content of web pages. The workflow of a web crawler is to start from one or several seed URLs, obtain the URLs contained in each page, and recursively read the content of the linked pages, collecting information about the entire site until certain stopping conditions are met. The web crawler is also an important part of a search engine; its overall structure is shown in Figure 3.
As shown in Figure 3, the web crawler crawls the information on the web in sections. The role of the repository is to store the web pages and URL links crawled by the crawler.

There are various web crawler methods, and their principles are relatively complicated [11]. Most follow these steps: (1) obtaining the URLs in the webpage content through a webpage analysis algorithm and putting them into a queue; (2) using a search method to pick a URL from the queue as the next object; and (3) repeating the second step until a stopping condition, such as a depth restriction or an area restriction, is met [12, 13].
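The three steps above can be sketched as a breadth-first crawl. The in-memory `SITE` dictionary below is a stand-in for fetching a page and parsing its HTML for links, not a real HTTP client; the URLs are invented for illustration.

```python
from collections import deque

# A toy "web": each URL maps to the links found on that page.
# Stands in for fetching a page and parsing its HTML for links.
SITE = {
    "/index": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": ["/index"],  # cycle back to the start
}

def crawl(seed, max_depth=2):
    """Breadth-first crawl: take a URL from the queue, collect its
    out-links, and stop when the depth restriction is reached."""
    visited = set()
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        for link in SITE.get(url, []):       # step 1: extract URLs
            queue.append((link, depth + 1))  # step 2: pick next objects
    return visited                           # step 3: stop at depth limit

print(sorted(crawl("/index", max_depth=1)))  # ['/a', '/b', '/index']
```

The `visited` set prevents the cycle through `/d` from looping forever, and the depth counter implements the depth restriction from step (3).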
(2) Python-Based Web Crawler. Commonly used languages such as Python, Java, and PHP can all collect network data. For everyday crawling, Python has always been the most widely used language because of its concise, easy-to-master syntax, powerful HTTP libraries and HTML parsers, the good crawler framework Scrapy, and its advantages in text and character processing. More and more people are starting to use Python web crawlers for web data mining [14]. The Hadoop framework [15] structure diagram is shown in Figure 4.
At the bottom of the distributed framework is Hadoop Distributed File System (HDFS), which corresponds to Google File System [16]. HDFS is a distributed file storage system. It provides features such as high reliability, load balancing, and redundant replication, ensuring the safe and reliable storage of massive data on storage media. At the same time, the throughput of I/O in the distributed file system is improved by means of redundancy, which provides the possibility for the realization of parallel computing of massive data. In massive data computing, the data transmission time is often longer than the actual computing time, so HDFS is the most basic factor to ensure the operating efficiency of the Hadoop framework. On top of HDFS is the MapReduce framework, and massive data parallel computing is implemented on the MapReduce programming framework. The data read and written during the running of the program are placed in HDFS, so that the application can effectively utilize the high I/O throughput of HDFS, thereby accelerating the overall running speed of the application [17].
MapReduce is a new distributed model abstracted from the features of functional programming. In its implementation, details such as parallelization, fault tolerance, data distribution, and balancing are hidden, and the entire distributed process is treated as a function-like process expressed by map/reduce. It logically maps the input data into multiple key/value pairs and converts each key/value into key1/value1 through the corresponding function operation in the map. Then, in the reduce phase, values with the same key are aggregated to achieve the final purpose of the data manipulation. Therefore, in actual programming, developers only need to clarify the meaning of the key and value in each map/reduce stage, and they can then move existing business onto the distributed parallel computing platform [18, 19]. Figure 5 depicts the MapReduce workflow. The Hadoop framework implements data splitting and merging using the two classes Mapper and Reducer. In addition to the map function for data segmentation, Mapper provides setup and cleanup to manage resources over the Mapper's life cycle. Setup is called after the Mapper is instantiated and before any map action is executed, so any shared data that each mapper needs should be preread in setup. After all map actions have been completed, cleanup is called, primarily to clear temporary data. In practice, a map/reduce job divides the input data set into several independent data blocks, which are processed in parallel by the map tasks. After the map processing is finished, the framework sorts the output before feeding it to the reduce task. Because of this presorting in the reduce process, Hadoop results are always ordered within a certain range, giving Hadoop's MapReduce framework an inherent sorting advantage. Reasonable use of this property can greatly improve the efficiency of MapReduce under the Hadoop framework [20].
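The map, sort/shuffle, and reduce phases described above can be illustrated with a single-process Python sketch. This is not the actual Hadoop Java API, and the exercise records (mode, minutes) are hypothetical; it only shows how key/value pairs flow through the three phases.

```python
from itertools import groupby
from operator import itemgetter

# Minimal single-process illustration of the map -> sort/shuffle ->
# reduce flow (not the actual Hadoop API).

def mapper(record):
    """Map an input record to a (key, value) pair."""
    mode, minutes = record
    return (mode, minutes)

def reducer(key, values):
    """Aggregate all values that share the same key."""
    return (key, sum(values))

records = [("running", 30), ("swimming", 45), ("running", 20), ("swimming", 15)]

# Map phase: emit key/value pairs.
pairs = [mapper(r) for r in records]
# Shuffle phase: the framework sorts pairs by key before reducing.
pairs.sort(key=itemgetter(0))
# Reduce phase: each key's values are aggregated together.
result = dict(reducer(k, [v for _, v in grp])
              for k, grp in groupby(pairs, key=itemgetter(0)))
print(result)  # {'running': 50, 'swimming': 60}
```

The explicit `sort` before `groupby` mirrors the presorting that the Hadoop framework performs between the map and reduce tasks.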
The commonly used input format classes of the map stage and their functions are shown in Table 1.

Physical Exercise Data Mining Algorithms.
Calculating the distance from the test point to the other points among the $k$ obtained data points: assuming that any instance $x$ is described by the feature vector $x = \{a_1(x), a_2(x), \cdots, a_n(x)\}$, the distance between two instances can be expressed as

$$D(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left(a_r(x_i) - a_r(x_j)\right)^2}, \quad (1)$$

where $D$ represents the distance between two points and $a_r(x_i)$ represents the value of instance $x_i$ on dimension $r$; this is the Euclidean distance between the two instances. The prediction is then weighted by distance:

$$\hat{f}(x_q) = \frac{\sum_{i=1}^{k} \omega_i f(x_i)}{\sum_{i=1}^{k} \omega_i}, \quad (2)$$

where $\omega_i$ represents the size of the weight calculated from the distance:

$$\omega_i = \frac{1}{d(x_q, x_i)}. \quad (3)$$

Here $x_q$ is the prediction point, $x_i$ is a neighboring point of $x_q$, and the reciprocal of the distance between the two is taken as the weight. Other calculation methods can also be used; for example, the weights can be assigned according to a mixed Gaussian model, with $d(x_q, x_i)$ the Euclidean distance of the local model. If $x_q = x_i$, then $\hat{f}(x_q) = f(x_i)$, as shown in Figure 6.
In locally weighted regression, the weight of each point is calculated according to its distance, and the regression uses the weighted points to compute the cost. There are many weighting methods; the most commonly used is the Gaussian model:

$$\omega_i = \exp\left(-\frac{d(x_q, x_i)^2}{2k^2}\right). \quad (4)$$

In formula (4), $k$ is a smoothing parameter, determined by the following steps.

Formula (5) calculates the mean (mathematical expectation) of the surrounding neighbors:

$$\mu = \frac{1}{k} \sum_{i=1}^{k} y_i. \quad (5)$$

Table 1: Commonly used input format classes and their functions.

Class name               Function
TextInputFormat          Reads a plain text file; the key is the offset of each line, and the value is the content of the line.
KeyValueTextInputFormat  Reads a plain text file; by default the first column is the key, and the rest is the value.
SequenceFileInputFormat  Reads Hadoop's custom binary data-store files.
NLineInputFormat         Similar to TextInputFormat, except that the original data can be divided into fixed numbers of lines.
DBInputFormat            Reads data from the database.

Figure 6: Euclidean distance local model diagram.

Formula (6) calculates the variance of the neighboring points:

$$\sigma^2 = \frac{1}{k} \sum_{i=1}^{k} (y_i - \mu)^2. \quad (6)$$

Formula (7) calculates the covariance between two points in the surrounding neighborhood:

$$\operatorname{Cov}(x, y) = E\left[(x - E[x])(y - E[y])\right], \quad (7)$$

which expands to

$$\operatorname{Cov}(x, y) = E[xy] - E[x]E[y]. \quad (8)$$

The covariance of the data is then used to calculate and describe the expectation $\hat{y}$ and the variance of the data. From the calculated results, the variance-based method achieves the best result, that is, the value of the smoothing coefficient $k$ that minimizes the variance around the reference point. The regression function is then

$$\hat{f}(x) = \beta_0 + \sum_{j=1}^{n} \beta_j x_j, \quad (13)$$

where $\beta_0$ is the constant term of the regression, $\beta_j$ are the regression coefficients, and $\hat{f}(x)$ is the predicted value of the regression. Each point is weighted according to its distance, and the weighted points are used to calculate the regression coefficients $\beta$ in the standard weighted least-squares form:

$$\beta = \left(X^{T} W X\right)^{-1} X^{T} W y,$$

where $W$ is the diagonal matrix of the weights $\omega_i$.
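The distance-weighted nearest-neighbor prediction described above can be sketched in Python. This is a minimal sketch with reciprocal-distance weights as described in the text; the (height, weight) data points and target values are purely illustrative, not the paper's data.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two instances."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def predict(xq, data, k=3):
    """Distance-weighted k-NN estimate of f(xq).
    Weights are the reciprocal of the distance to the neighbor;
    an exact match returns the stored value directly."""
    neighbors = sorted(data, key=lambda p: euclidean(xq, p[0]))[:k]
    num, den = 0.0, 0.0
    for x, y in neighbors:
        d = euclidean(xq, x)
        if d == 0:
            return y          # x_q == x_i  =>  prediction is f(x_i)
        w = 1.0 / d           # reciprocal-distance weight
        num += w * y
        den += w
    return num / den

# (height, weight) -> weekly exercise hours; illustrative values only.
data = [((170, 60), 4.0), ((175, 70), 5.0),
        ((160, 50), 3.0), ((180, 80), 6.0)]
print(predict((172, 65), data, k=3))
```

Closer neighbors receive larger weights, so the estimate for (172, 65) is pulled toward the values of its two nearest points.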

Physical Exercise Data Mining
The physical fitness test should have become a mechanism urging college students to participate in physical exercise and a supervising force strengthening their physical training. However, due to various factors, physical education has become an obstacle in students' minds, and physical exercise has degenerated into something students do only in order to pass the exam. In some provinces and cities, "the failure rate of college students' physical health reaches 10.25%, and some of the students' function and quality indicators have declined." Students do not regard physical activity as a course they must experience in life and do not realize the important role of physical exercise in improving their quality of life. Nor does the school realize that the physical fitness test and the development of physical education courses are an important part of developing sports in colleges and universities, or that guiding students toward correct physical exercise is part of its responsibility to them. Under the current circumstances, the school physical education system still needs improvement in many respects; through continuous reform of physical education and teaching, the development of college sports will be promoted.

Data Sources.
The data content of this experiment includes weight, height, exercise mode, exercise time period, and exercise time length. The data structure of the file is shown in Table 2.

Data Cleaning. Data cleaning refers to solving the problem of dirty data, eliminating inconsistencies in the data, and preliminarily formatting the data according to the requirements of data mining, so as to provide a good data source for the final mining step. Data cleaning therefore involves two tasks: (1) removing invalid data and (2) data formatting. Usually, data processing programs are executed on top of a database. Database providers such as Oracle and MSSQL supply their own data cleaning tools, such as SSIS (SQL Server 2005 Integration Services), which are commonly used. Most such tools provide thread-based parallelization, so the efficiency of the overall cleaning process is directly determined by the hardware. Since most of these systems parallelize via shared memory, their scalability is poor; to obtain better performance with such tools, one must have a powerful hardware configuration, such as an expensive minicomputer. In addition, since this type of tool was designed from the start around relational databases, it essentially rules out cleaning unstructured data. Today's data have developed from structured data to multidimensional unstructured data, such as web data, so such cleaning tools cannot achieve satisfactory results on massive unstructured data. To set up a data cleaning framework on a Hadoop cluster, the original cleaning process must first be converted into a MapReduce model and then coded with the Hadoop framework. We decided to use the Hive tool to complete the data cleaning in order to reduce code-writing time and simplify the MapReduce coding process.
Hive is a MapReduce query implementation that works like SQL and can use any file in HDFS as a source file for fast MapReduce operations. Because Hive's main function is to convert SQL-like statements into a series of MapReduce programs, using Hive for data cleaning is a parallelization process based on the Hadoop framework, and the Hadoop framework determines its overall scalability. As a result, using Hive as a data cleaning tool can fully meet the paper's scalability requirements for data mining.
The purpose of this test is to examine the linear scalability of data cleaning tools based on Hadoop+Hive mode, so 3, 6, and 12 nodes are used for testing.
It can be seen from Table 3 that with the increase of the number of nodes, the test time will gradually decrease, which shows that in the process of data cleaning, nodes can be appropriately added for big data to improve cleaning efficiency.
It can be seen from Figure 7 that when the computing nodes increase to 2, 3, and 4 times the original number, the performance gains obtained are 1.91, 2.80, and 3.60, respectively, and the linear expansion indicators reach 0.955, 0.933, and 0.900. This shows that the Hadoop+Hive data cleaning tool has excellent linear scalability, so in practical applications the overall mining speed can be improved by adding more computing resources. However, when the computing load on each node is already low, the performance gained by adding more nodes is often small; the scaling is then no longer linear, and this state is called the inflection point in this paper. In practical applications, computing resources should be adjusted according to the operating status of the nodes to avoid passing the inflection point through excess resources, which would incur unnecessary cost.
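The linearity indicators quoted above follow directly from the measured speedups: each indicator is the speedup divided by the node multiple, where 1.0 would mean perfectly linear scaling. The sketch below recomputes them from the reported gains.

```python
# Recompute the linear-scalability index from the reported gains:
# index = (measured speedup) / (node multiple). The gains 1.91, 2.80,
# and 3.60 are the values quoted in the text.
gains = {2: 1.91, 3: 2.80, 4: 3.60}

for multiple, gain in gains.items():
    index = gain / multiple  # 1.0 would be perfectly linear scaling
    print(f"{multiple}x nodes: speedup {gain:.2f}, linearity {index:.3f}")
```

The indices decrease slightly as nodes are added (0.955, 0.933, 0.900), which is the gradual approach toward the inflection point described above.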

Data Mining Algorithm Performance Analysis
In order to test the performance of the physical exercise data mining (PEDM) algorithm, this paper uses the CloudSim simulation platform to simulate a data center and compares the algorithm with a traditional data mining algorithm. The experimental tests cover two aspects: the first type of experiment compares the algorithms by task completion time, and the second by equilibrium factor. The experimental results in Figure 8 show that in the early stage of the experiment the PEDM algorithm takes even more time than the traditional algorithm, but in the later stage its total execution time is significantly less than that of the traditional algorithm. It is not difficult to see from the experimental results in Figure 9 that the curve fluctuation range of the PEDM algorithm is the smallest, indicating that the algorithm gives the system a better average load balance, while the traditional algorithm fluctuates greatly: its maximum value is close to 84%, and its minimum value is about 38%.
As can be seen from Figure 10, when n = 1000 and n = 2000, under the PEDM algorithm the utilization of the different computing resources stays above 80% and is relatively stable without major fluctuations, so the system essentially achieves load balance.

Conclusions
The purpose of this paper is to investigate a data mining algorithm for analyzing students' physical exercise. By investigating the physical fitness and status of current college students, conducting data mining analysis, identifying the factors that influence college students' physical fitness, and analyzing those factors in more detail, the algorithm can capture the current level of college students' physical fitness. The article first presents the background, significance, and related work, finding that current research on students' physical activity is more concerned with questionnaire surveys of the students themselves and with the impact of sports on students. Although the follow-up experiments show that the algorithm performs well, other parts have not been optimized due to space constraints; it is hoped that more research can be done in follow-up work.

Data Availability
The data used to support the findings of this study are included within the article.
