Educational Information Refinement with Application Using Massive-Scale Data Mining

In this manuscript, we propose a novel online learning mechanism based on educational data mining. Using a computer-assisted, information-based learning guidance platform, we collect records of students' platform logins and resource browsing. We then preprocess these log data and statistically analyze students' login and resource browsing habits. A decision tree algorithm discovers the underlying factors that influence these behaviors from million-scale real-world instructor/student data. Instructors and teaching staff can thus effectively follow the learning process of students from the analysis results, and targeted integration of teaching content and construction of teaching models can be realized accordingly. This can substantially improve the effectiveness and quality of online learning. In the evaluation stage, we observe that deploying a virtual lab environment gives educational institutions greater flexibility in allocating computing resources. In an ideal sandboxed laboratory context, students can create an internal network and conveniently access all of its computers. The practical skills gained in this way enable them to build architectures based on mined data. In this work, we adopt the so-called EDUCloud to provide a personal cloud network connection that can flexibly deploy the mined laboratory-related data to facilitate online learning between instructors and students.


Introduction
As rapidly growing Internet technology finds a wide range of applications in the big data era, data-driven online learning and big data-guided education have gradually gained recognition. In the past decade, online learning strategies have also undergone significant changes. Data mining and other machine learning techniques can provide strong support for model construction in the education field. Investigating learning behaviors through in-depth analysis of the underlying correlations between educational variables has become a development trend in university education. This can assist education-related decision-making. The effective evaluation of online learning status is one of the key topics of current research. During the learning process, learners generate massive-scale learning behavior data on the online learning platform. According to the rules mined from these data, we can provide a personalized environment and learning guidance to different learners. In our work, we refer to the conversion of education-related raw data (from multiple educational systems) into discovered information as educational data mining. In practice, instructors and students can use this information to further improve instructing quality. This can also guide the work of educational researchers and the updating of educational software systems. Online learning algorithms are gradually being upgraded to meet the standards of real-world applications. In the traditional face-to-face instructing mode inside the classroom, instructors can use on-site questions and classroom evaluations to assess the learning results of the related courses. These traditional course evaluation techniques are no longer applicable to the online learning context, which is distributed both spatially and temporally.
As a new component of the education system, educational data mining requires positive interaction with multiple educational elements to improve instructing. Its objective is to provide instructors with more accurate and objective feedback. This supports the organization and innovation of instructing content, assists the adjustment and optimization of instructing strategies, and allows the instructing process and curriculum to be upgraded according to the situation of learners. This updating process can be continuously optimized and even parallelized. Accordingly, an online instructing mode that meets actual instructing requirements is constructed. As an application of data mining, educational data mining mainly involves discovering informative data in instructing, management, and scientific research. In this work, we take the data mining application of network instructing (E-Learning) as the main research object. We further analyze learning behavior according to the login data of students collected using an information-guided machine learning platform. The analysis of online learning behavior is primarily concerned with the data recorded by the online teaching platform. In practice, it is necessary to combine various mining and visualization techniques to discover the behavior patterns of instructors and students. We also learn when each behavior occurs, how behavior objects are used, and which factors influence the various online learning behaviors. In this work, we conduct in-depth data mining (combined with the characteristic data of teachers and students) through statistics and analysis of the mining results. Based on this, we calculate the basic characteristics and influencing factors of students/instructors. The educational data mining mode integrates and applies a variety of data mining algorithms.
We focus on multiple specific mining tasks. Our proposed framework mainly includes three parts (data, tools and algorithms, and the mined data), as elaborated in Figure 1. Our framework serves as an online self-learning guidance platform for computer-aided courses. It offers rich and diverse instructing resources, strong interactivity, and openness. Meanwhile, it can track learners' learning conditions and give feedback on them. A large set of user log data is also preserved. These data are generated when students conduct self-learning on our constructed platform (including information on login behavior, resource browsing, etc.). In this work, we utilize data mining techniques (including cluster analysis and correlation analysis) and network log analysis tools to conduct a detailed, in-depth evaluation of the internal factors influencing online learning behaviors.
In this work, we focus on computer science students and their basic login data. Meanwhile, we collect the corresponding log data (including login, resource browsing, and learning experience) generated on the information-guided learning platform, together with the evaluation data from the test platform. More specifically, we select four data sets and import them into the database to construct the corresponding data tables. We use the "student number" as the key feature to establish the relationship between the four data tables. On this basis, the four data sets are analyzed. A joint query performs the intersection operation over the tables and yields a total of 1265 students who appear in all four tables. Afterward, the log data corresponding to these students are used as the analysis object, that is, the basic information, formative test data, login data, and resource browsing data. In detail, our proposed framework is built upon the EDUCloud architecture and comprises a lecture and a lab for data analysis as well as distributed systems (including distributed information systems). It follows so-called service-oriented instructing, in particular the microservice architecture, and its novelty lies in the data-oriented technique of the instructing EDUCloud. In the experiment, preliminary data-related training was conducted in the lab, focusing on data analysis and mining with well-known software such as Apache Hadoop and Apache Spark. Giving learners initial hands-on experience with data mining techniques is another contribution of this research. The subsequent sections of the manuscript are structured as follows. Section 2 concisely introduces the previous work closely related to ours. Next, we detail the three key components of our framework. Then, we experimentally evaluate the effectiveness of our method. The last section concludes and presents future work.
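The intersection over the four tables keyed on the student number can be sketched as follows. This is a minimal illustration using an in-memory SQLite database, not the platform's actual schema: the table names (basic_info, formative_test, login_log, resource_log), the payload column, and all rows are hypothetical toy values.

```python
import sqlite3

# In-memory database standing in for the real one (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table names; each table is keyed on the student number.
tables = ["basic_info", "formative_test", "login_log", "resource_log"]
for name in tables:
    cur.execute(f"CREATE TABLE {name} (student_number TEXT, payload TEXT)")

# Toy rows for illustration only.
cur.executemany("INSERT INTO basic_info VALUES (?, ?)",
                [("S01", "undergraduate"), ("S02", "undergraduate"), ("S03", "pilot")])
cur.executemany("INSERT INTO formative_test VALUES (?, ?)",
                [("S01", "76"), ("S02", "78"), ("S04", "93")])
cur.executemany("INSERT INTO login_log VALUES (?, ?)",
                [("S01", "day 1"), ("S01", "day 2"), ("S02", "day 1"), ("S03", "day 1")])
cur.executemany("INSERT INTO resource_log VALUES (?, ?)",
                [("S01", "chapter video"), ("S02", "question bank"), ("S04", "animation")])

# Keep only the students that appear in all four tables.
query = """
SELECT student_number FROM basic_info
INTERSECT SELECT student_number FROM formative_test
INTERSECT SELECT student_number FROM login_log
INTERSECT SELECT student_number FROM resource_log
ORDER BY student_number
"""
common_students = [row[0] for row in cur.execute(query)]

# Fetch the login records of exactly those students as the analysis object.
placeholders = ",".join("?" * len(common_students))
login_rows = cur.execute(
    f"SELECT student_number, payload FROM login_log "
    f"WHERE student_number IN ({placeholders}) "
    f"ORDER BY student_number, payload",
    common_students,
).fetchall()
```

Using INTERSECT keeps the query symmetric across the four tables and removes duplicate student numbers automatically; an equivalent chain of inner joins on the student number would also work.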

Related Work
In the past decade [1][2][3][4][5], the cloud computing scenarios used by top universities in their educational programs have become increasingly diverse. Well-known examples include cooperative Software-as-a-Service (SaaS) platforms such as Google's G Suite for Education [2], as well as platforms used to teach high-performance computation [3], the R platform, and the Scilab platform [5]. The CloudIA platform [1] provides a hybrid of SaaS and PaaS (Platform-as-a-Service). Meanwhile, its IaaS (Infrastructure-as-a-Service) services, hosted on a single system within a hybrid cloud environment, can scale out to Amazon Web Services as needed. In PaaS, the key technique is the on-demand deployment of individual virtual machines through a self-service portal accessed by both learners and instructors. However, it is impractical to use such machines to construct a fully functional microservices lab without connecting them through a customized internal network. Literature [4] applied the association rule algorithm to conduct an association analysis on the knowledge points of remote examination questions.
This played an indispensable role in the development of examinations and instruction. Literature [5] first converted the students' scores into Boolean expressions and subsequently applied association rules. The authors derived the relationships between the courses [1] by converting the transaction-item data into an item-transaction database. They further eliminated the items that do not satisfy the minimal support degree. A new candidate set is then generated through the join step, and the support degree is calculated only from the intersection of the corresponding transaction sets. In this way, the repeated comparisons of the pruning process are avoided. Literature [6] proposed to update the transaction database continuously. Every time the support degree is calculated, the newly generated transaction database is searched, and the new frequent itemsets are finally generated for continued pruning. During this process, the transaction database becomes smaller and smaller, which improves efficiency. Literature [7] proposed to improve efficiency by removing useless data from the transaction database and by reducing the number of comparisons through an ordered arrangement. Moreover, they proposed an alternative discretization method. The method can model data of nonfixed width, so that it can be adjusted to real-world situations to improve the accuracy of the results.
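The support-based filtering and candidate generation described above can be illustrated with a minimal Apriori-style sketch. It is a generic textbook version, not the specific variants of [5], [6], or [7]; the function name, the course names, and the toy transactions (one set of passed courses per student, i.e., Boolean scores) are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # 1-item candidates
    result = {}
    k = 1
    while current:
        # Support = fraction of transactions that contain the candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        kept = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        result.update(kept)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k.
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in kept for s in combinations(c, k - 1))]
    return result


# Toy data: courses each student passed (hypothetical).
passed = [
    {"programming", "data_structures", "databases"},
    {"programming", "data_structures"},
    {"programming", "data_structures"},
    {"databases"},
]
supports = frequent_itemsets(passed, min_support=0.5)
```

In this toy data, the pair of programming and data structures is frequent (three of four students), while pairs involving databases fall below the threshold and are dropped before any larger candidate is built.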
At present, researchers mainly use Citespace, a citation visualization analysis software developed by Zhang et al. [8], for bibliometric research. The software can analyze high-frequency keywords, keyword co-occurrence maps, authors, research institutions, and other aspects of the collected subject literature. However, the bibliographic databases supported by Citespace have certain limitations. Although it covers large literature databases such as WoS (Web of Science) and CNKI (China Knowledge Network), it does not support the Japanese literature database CiNii. CiNii is a bibliographic database maintained by the National Institute of Information from Japan. It includes all Japanese papers published in Japanese academic journals, and the STEM education papers published by Japanese scholars are included in this database. The database therefore provides an important literature source for the study of localized Japanese STEM research. It can also help to fill the gap in domestic scholars' research on Japanese STEM education. Since Citespace cannot analyze Japanese documents, the author uses a text data mining software called KH Coder, developed by the Japanese scholar Koichi Higuchi. An example of these components is shown in Figure 2. The software offers powerful text data mining functions and has been widely used in many research domains in Japan, such as sociology, economics, linguistics, and education. KH Coder supports word frequency statistics, part-of-speech analysis, contextual keywords, keyword retrieval, similarity calculation, automatic classification, automatic clustering, abstract production, visualization (such as histograms, line graphs, network graphs, scatter graphs, bubble graphs, and cluster analysis dendrograms), and other functions proposed by Google.
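Word frequency statistics and keyword co-occurrence counts of the kind such tools report can be sketched in a few lines. This is a simplified illustration and not KH Coder's implementation: it assumes whitespace-tokenized English text (Japanese text would need a morphological analyzer first), and the function name, documents, and keyword set are hypothetical.

```python
from collections import Counter
from itertools import combinations


def keyword_statistics(documents, keywords):
    """Count keyword frequencies and how often keyword pairs share a document."""
    word_freq = Counter()
    pair_freq = Counter()
    for doc in documents:
        tokens = doc.lower().split()  # naive tokenization (assumption)
        word_freq.update(t for t in tokens if t in keywords)
        # Each unordered keyword pair is counted once per document.
        present = sorted(set(tokens) & keywords)
        pair_freq.update(combinations(present, 2))
    return word_freq, pair_freq


# Toy titles for illustration only.
titles = [
    "STEM education in primary school",
    "STEM curriculum and teacher education",
    "teacher training for STEM education",
]
word_freq, pair_freq = keyword_statistics(titles, {"stem", "education", "teacher"})
```

The pair counts are the edge weights of a keyword co-occurrence network: keywords are nodes, and a heavier edge means the two keywords appear together in more documents.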
After multiple version updates, the software currently supports text data mining in languages such as Japanese, Chinese, English, French, German, Spanish, Portuguese, Italian, and Russian. Unlike Citespace, which can only analyze the subject text of papers, KH Coder can exploit all text information in the languages it supports. In particular, as long as the information is available as electronic text, KH Coder can conduct text data mining on it. Second, KH Coder supports more languages than Citespace [9]. The general operation of textual data mining based on KH Coder includes the following steps: (1) collecting and organizing the textual information; (2) loading the text into the software and performing word selection and preprocessing to ensure the accuracy of the extracted data; (3) using the keyword search function to observe the context of the high-frequency keywords; (4) using the software to generate a keyword co-occurrence diagram; and (5) analyzing the distribution and connection of the above key elements.

Mathematical Problems in Engineering

Our Proposed Method
As can be seen from Figure 3, the number of participants in the learning process increased sharply over the last few weeks, and participation in the online learning reached 100% in the final week. At the beginning of the online learning, overall student participation grew slowly, and the instructors had to intervene according to the actual circumstances. In addition, the instructors urged students to log in to the platform to participate in online learning. The time allotted to teaching resources can then be allocated reasonably. To analyze the factors influencing students' login behavior, we use the decision tree, an effective classification and regression algorithm. In practice, the Microsoft decision tree algorithm is suitable for predictive modeling of both discrete and continuous attributes. For discrete attributes, the algorithm makes predictions based on the relationships between the input columns in the data set and the tendency of their values toward specific results. For continuous attributes, it uses linear regression to determine the split positions of the decision tree. The mechanism of the Microsoft decision tree algorithm is as follows. The algorithm generates a data mining model by creating a series of splits in the tree, which are represented as "nodes." Whenever the algorithm observes that an input column is closely related to the predictable column, a node is added to the model. The algorithm determines the split according to the type of the prediction object (continuous column or discrete column). The basic idea of a decision tree is to measure the information gain of each node split, $\Delta H(s, t) = H(t) - H(s, t)$, where $H(t)$ and $H(s, t)$ denote the entropy before and after the split $s$ of node $t$, respectively.
Herein, H(s, t) is given by the phi function, in which P_L and P_R correspond to the left and right nodes after the node split. The decision tree adopts a two-layer structure. Through the construction of data mining middleware between the tree-building algorithm and the database, the analysis efficiency is improved substantially. Using the decision tree algorithm, we conduct a comprehensive analysis of three key factors of the students (i.e., level, major, and gender). Based on this, we construct the mining structure and the mining model. Afterward, the factors affecting the students' login behavior (in days) are analyzed. We take the number of login days as the predicted value. The input values are the level, major, and gender, and the ratio indicates the proportion of students with more than five login days and with fewer than three login days. The resulting decision tree is shown in Figure 4. It shows the prediction of the number of login days based on the decision tree analysis. It can also be seen that the factors influencing the number of login days, ordered from strongest to weakest, are level, major (specialty), and gender.
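The entropy-based split criterion can be illustrated with a small sketch. It assumes the common weighted form, in which the entropy after a split is the average of the child entropies weighted by the fraction of samples sent to each child; this weighting, the function names, and the toy labels are assumptions for illustration and are not taken from the Microsoft decision tree implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(parent)
    p_left, p_right = len(left) / n, len(right) / n  # share of samples per child
    h_after = p_left * entropy(left) + p_right * entropy(right)
    return entropy(parent) - h_after


# Toy labels: login-day class of six students (hypothetical).
parent = ["high", "high", "high", "low", "low", "low"]

# Splitting by a strongly related attribute separates the classes perfectly.
gain_good = information_gain(parent, ["high", "high", "high"], ["low", "low", "low"])

# Splitting by an unrelated attribute leaves both children as mixed as the parent.
gain_poor = information_gain(parent, ["high", "low"], ["high", "high", "low", "low"])
```

A tree builder would evaluate such a gain for every candidate split and choose the largest, which is how an ordering of factors by influence emerges from the data.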
Compared with students at the undergraduate level, the number of login days for students at the pilot and noncommissioned levels is significantly lower. Meanwhile, the number of login days for students at the noncommissioned level is lower than that of the pilots at the same level. The number of login days is also heavily affected by the major. Based on these results, instructors can efficiently and effectively guide and adjust the online learning behaviors of students at different levels and in different majors. Within the same level (such as undergraduates), login behavior differs widely between majors because different majors have different orientations and their student teams are managed differently. This results in large differences in students' learning behaviors. Student managers can organize the student teams according to the actual situation, which can significantly improve the learning effect of students. This can also improve resource browsing behavior, whose influencing factors we analyze further. The information-based learning guidance platform covers the general courses of computer basics, including university computer basics, programming, and hardware basics, together with the arrangement of course teaching resources. The resources are organized by cases, chapters, and knowledge points and include operation video demonstrations, animation interaction, a test question bank, and other resource types. The statistical analysis is based on students' browsing logs, which refer to courses and course resource modules; the specific results are shown in the table. The above educational data mining framework is implemented on the EDUCloud platform. More specifically, the current implementation of EDUCloud is built on VMware vCloud Director (vCD) [10].
It also involves components such as ESXi (hypervisor) and vSphere, since its prior configuration and background comply with this technology stack. Even though this platform imposes certain limitations and requirements, the structural concept of the EDUCloud system is designed to prevent vendor lock-in. It connects the individual virtual labs (blue) and the public services (green), and pfSense nodes (orange) can function as gateways to the Internet or to other internal networks. Vendor independence is achieved by minimizing the reliance on platform-specific features: the lab setup, in its concept and its substantial parts (the virtual machines), remains portable across suppliers. Labs in vCD are captured as vApps, each consisting of one or more VMs (RQ1) and its networks, which comprise internal networks and connections to external networks. In this way, each lab has an internal network that connects its virtual machines (RQ3), without direct remote access to any virtual machine. All EDUCloud labs, without exception, have a pfSense-based [11] gateway virtual machine that acts as a gateway and OpenVPN server. This allows learners to join the internal network and provides Internet access (RQ4). At the same time, connections between separate vApps can be provided by adding a common services vApp to the lesson, which creates a point-to-point VPN connection to all other labs used for the lecture. While CommonServices supplies several additional services, its pfSense instance also allows the joined labs to offer services through network address translation (RQ5).
The key steps of the proposed method, summarizing the above discussion, are presented in Algorithm 1.

Experimental Results and Analysis
In the experiment, the factors influencing students' browsing behavior are obtained. According to the browsing frequency data of students and instructors, the resource modules with the highest degree of learning and usage are the digital libraries of each course chapter, and the least frequently used module is the library of commonly used software tools. These resource modules are listed on the homepage of the course in this order. In this way, the layout of the course modules on the homepage better matches the learning habits of the students. Displaying a course module in a prominent position on the homepage increases students' attention to that resource. In terms of learning level, the Flash animation interactive area ranks highest and the auxiliary database lowest. The Flash animation interactive area ranks highest mainly because it stimulates students' interest in learning and attracts them to learn the corresponding operations through repeated practice. In practice, platform administrators and instructors need to continuously enrich the question bank and place its search area in a prominent position on the homepage to answer students' questions more efficiently. In turn, this helps students to complete the course. Figure 5 depicts the stack architecture.
Both vCD and pfSense require an LDAP infrastructure as the authentication provider (RQ8). Since the directory was accessed as a read-only source, EDUCloud used the permissions system in vCD to handle authorization. The permissions for the individual vApps were subsequently synchronized to the corresponding pfSense instance by a Java program that has access to the vCD API. As an alternative implementation, authorization can be managed entirely through LDAP. If the directory used to manage authorization is read-only, Keycloak can be used as the identity and access management solution [8]. EDUCloud administers the process. All vApps were instantiated from templates stored in the catalog, which can be shared across lessons if required. Each template is a complete snapshot of all VMs in the laboratory. This makes it easy to deploy additional instances and to return existing instances to their original state if a destructive misconfiguration occurs (RQ6). Without such protection, a destructive process could lead to the loss of all work, so creating a snapshot of the deployed lab instance before running risky processes is a feasible alternative.

EDUCloud with the data lab enables participants to gain hands-on experience with big data processing in a distributed system, covering mining, processing, and storage. Building a distributed environment is important because it exposes the issues typical of distributed systems, such as node failures. Apache Hadoop, with its included HDFS file system inspired by GFS, is used as the data storage basis. HDFS is a block-oriented, replicated file system. One instance is deployed in each group vApp, and one instance is deployed in the common services vApp.
This setup gives each group its own distributed file system to experiment with. Each group receives full permissions to store, change, and delete data without affecting any other group. For instance, students can store files with a replication factor of three, explore the locations of the blocks of the file replicas, and even shut down certain nodes. The generic data processing framework forms the next layer. Apache Hadoop ships with Map/Reduce [9], the best-known framework in the field, although it suffers from limitations and performance issues when heavy disk I/O occurs between processing steps. Apache Spark [10], as an alternative to Hadoop Map/Reduce, supports in-memory processing of the stored data in the RAM of all cluster nodes. Installing both frameworks side by side makes it possible to teach students about these issues. YARN [11] is used to manage the resources for both Hadoop and Spark. Table 1 presents the recorded data as an illustrative case.
Both systems are preinstalled on each group vApp, allowing participants to start straight away with the data mining exercises. Participants have full root access to the clusters, so they can experiment with the configurations of both Hadoop and Spark. If they break the installation, a fresh instance can be deployed from the vApp template.
The exercises are as follows. Lab 1 contains warm-up exercises whose objective is to learn how to use the environment, set up the required VPN connections, and become familiar with HDFS through operations such as storing and retrieving files. In addition, the web UI of HDFS is used to locate the physical position of the file blocks and their copies. Lab 2 deals with Map/Reduce. Learners carry out two fundamental Map/Reduce tasks to understand the Map/Reduce mindset. The first is the conventional word counting exercise in its simple form, which includes a preprocessing stage to eliminate duplicates. The second is a more difficult task: a data set with several relationships must be evaluated to determine all correspondences between pairs, which requires two Map/Reduce jobs run in sequence. Lab 3 deals entirely with Spark. The core task is to gain hands-on experience with it and to compare Spark with Map/Reduce by reimplementing the second task of Lab 2 in Spark. Spark allows calculations to be chained naturally and offers a wider set of fundamental operations on the data, so the solution is easier to understand. The other objectives are to determine the differences in runtime and to understand the structural differences between Spark and Map/Reduce. Lab 4 deals with iterative jobs, which are common in data mining but problematic in Map/Reduce. K-means clustering is considered as a representative iterative task. Some metadata of the five laboratory experiments are shown in Table 2.
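The iterative structure that makes K-means a useful test case can be sketched in plain Python. This is a single-machine illustration, not the lab's Map/Reduce or Spark code; the function name, the toy points, and the initial centroids are assumptions. Each pass of the loop corresponds to one assignment phase (map-like) and one centroid update phase (reduce-like), which a Map/Reduce implementation would run as a separate job per iteration.

```python
def kmeans(points, centroids, iterations=10):
    """Plain K-means on tuples of floats; returns final centroids and clusters."""
    clusters = []
    for _ in range(iterations):
        # Assignment phase: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update phase: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters


# Toy data: two well-separated groups (hypothetical).
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
final_centroids, final_clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Because the output of one iteration is the input of the next, a framework that writes intermediate results to disk pays that cost on every pass, whereas one that keeps the points in memory does not.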
Algorithm 1
Input: Massive-scale student/instructor data collected from the Internet; decision tree with initialized values; defined mining patterns
Output: Mined educational patterns
(1) Extract students' and instructors' learning data during the instructing process and model their relationships;
(2) Use a decision tree to discover the complicated relationships (educational patterns) among students and instructors, and learn the inherent parameters of the decision tree;
(3) Use EDUCloud to create their internal network, and further stimulate the entire instructing process.

In each step, the K-means algorithm calculates the centroids of the newly generated clusters and assigns all points to these centroids. The objective is to run the algorithm twice, once with Map/Reduce and once with Spark. The Map/Reduce version causes a lot of I/O because it saves several intermediate results to disk in each step. The Spark version can keep all information in memory, distributed across the cluster. Lab 5 covers fault tolerance. The failure of a node or link is a routine issue in almost any distributed setting, and each distributed task must handle such failures without restarting from the beginning. Fault management (including the ability to resume parts of a task on other nodes and the use of data checkpoints) is therefore a significant part of the framework.
This lab enables learners to explore these capabilities and to compare Hadoop with Spark with respect to fault tolerance. The exercises include storing data in HDFS, removing data nodes by terminating VMs, and exploring what happens when nodes fail, and after the failed nodes are resumed, while Map/Reduce and Spark tasks are running. Lab 6 deals with data mining. In this experiment, we do not implement basic data mining algorithms ourselves. The standard approach is to use an existing implementation provided by a package or library. Learners explore Spark's MLlib [12] and conduct classification exercises using decision trees and logistic regression. At the peak of the project, with approximately 15 students working an average of 1.5 days per week in three groups, the following problems arose. The cloud froze because of bulky database logs, and some learners started with incorrect settings. They had taken the "convenient way" because they did not look closely enough at the setup recommendations [13]. This substantially affected them in mid-project during data mining and also led to incompatible or partially complete (sub)product versions.

Conclusion
This paper studies an online learning mechanism based on educational data mining. Using a computer-assisted, information-based learning guidance platform, it collects data on students' platform logins and resource browsing and subsequently preprocesses the log data. Based on this, the students' login and resource browsing behavior is statistically analyzed, and the influencing factors (level, major, and gender) are mined with the decision tree algorithm. The proposed mining technique is realized on the EDUCloud platform, which provides a reliable, scalable personal cloud solution for hosting virtual labs for teaching microservices and other subjects in educational data mining. It gives students hands-on experience by supplying root access to virtual machines and networks in a sandbox environment. At the same time, it enables instructors to design highly complex tasks with minimal administrative effort, since new laboratories can easily be instantiated from premade templates. A dedicated lab has been presented as an example of a virtual lab running on EDUCloud, with the main objective of teaching data mining and analysis, potentially on large data sets. The laboratory provides learners with preconfigured stacks for data mining. It further uses technologies such as Apache Hadoop and Apache Spark to demonstrate data mining and analysis-specific techniques and patterns. The use of two stacks within one lab and the interconnectivity of the deployed lab instances provide opportunities for comparative illustrations between student and instructor groups.
Probability
XXXX01  76  94  73  75  76  92
XXXX02  78  43  64  83  88  88
XXXX03  87  56  83  75  65  84
XXXX04  93  76  75  84  74  76
XXXX05  77  85  68  83  84  84
XXXX06  65  75  74  75  82  93
XXXX07  75  69  77  85  78  83
XXXX08  64  88  83  92  79  75
XXXX09  59  74  93  78  83  85

Data Availability
Data will be provided by the author upon request.

Conflicts of Interest
The author declares that there are no conflicts of interest.