Data Mining Techniques to Analyze the Impact of Social Media on Academic Performance of High School Students

Department of Computer Science, Government College University Faisalabad, Pakistan
Department of Information Science, Division of Science and Technology, University of Education, Lahore, Pakistan
Department of Computer Science, Islamia University Bahawalpur, Rahim Yar Khan Sub Campus, Pakistan
Department of Computer Science, Federal Urdu University of Arts, Science & Technology, Islamabad, Pakistan
Faculty of Computing and Informatics, University Malaysia Sabah, Jalan UMS, 88400 Kota Kinabalu, Sabah, Malaysia


Introduction
Student performance modeling is one of the most challenging and popular research topics in educational data mining (EDM) [1]. Multiple factors influence academic performance in nonlinear ways. The widespread availability of educational datasets has made educational data mining even more attractive to researchers. EDM is a field in which data mining algorithms are applied to educational data to predict and improve the performance of students in educational institutes [2].
Information Technology (IT) is an important part of the learning process [3]. It greatly influences online student performance and GPA. The study in [4] argues that the use of technology such as the internet is one of the most important factors that can influence the educational performance of students, positively or negatively. Students spend too much time on social media sites like Facebook and do not have enough time to study. This behavior leads students toward poor performance during high school, and they then find it difficult to cope with higher studies. EDM can detect this behavior pattern at the right time to maximize student grades and minimize the failure rate of weak students.
Social media greatly influences school-age students in both their academic and personal lives. Students also use social media for academic purposes to improve their performance. Teachers and students can both use social media as a teaching and learning tool to ease and improve the learning and teaching process [5].
The objective of the study is to predict student performance from the use of technology, weekday-social-media-use, and weekend-social-media-use and, furthermore, to identify in advance the students who desire to pursue higher education and to determine how parents' education influences student performance.
In this paper, data is collected from the Kalboard 360 LMS (learning management system). We used six machine learning techniques (DT, SVM classifier, SGD classifier, RF classifier, AdaBoost, and LR classifier) to determine the patterns inside the student performance data.
The results show that technology greatly influences student performance. Six classifiers are used to identify performance on the basis of features such as romantic status, use of technology, weekday-social-media-use, weekend-social-media-use, parent education, and living area. This helps teachers to identify fair, good, and poor students.
The proposed work analyzes performance and identifies desirable and undesirable student behaviors, which will help to group students into classes based on their performance capabilities and to predict students' social activities.

Related Work
Educational institutes face a range of problems, such as identifying the reasons for drop-out and late graduation, the pass-to-fail ratio, the effect of parents' involvement and of attendance on student performance, and predicting students' performance on the basis of previous marks, among many others. To solve such problems, several studies present machine learning and statistical solutions.
The study in [4] argues that the use of technology such as the internet is one of the factors that can influence the educational performance of students, positively or negatively. Students spend too much time on social media websites like Facebook and do not have enough time to study, which leads to poor performance during high school and, ultimately, to difficulty in coping with higher studies. EDM can detect this behavior pattern at the right time to maximize student grades and minimize the failure rate of students. Another study [5] shows that social media influences school-age students positively. Students use social media for academic purposes to improve their performance, and teachers and students can both use it as a teaching and learning tool to ease and improve the learning and teaching process.
Study [6] used a decision tree, a nonlinear classifier, to generate a tree and rules. The J48 algorithm was used as the analysis tool, and the dataset was collected through surveys of master's and Ph.D. students. Study [7] groups students' academic results into clusters and uses standard statistical algorithms to collect and manage score data according to performance level; the K-means clustering algorithm is used to analyze academic performance. Another study [8] used knowledge discovery and data mining tools to extract useful information from a data repository in order to enhance the quality of education. DM methods are used for decision making in educational systems. A decision tree (DT) algorithm, which searches the data using a divide-and-conquer rule, is applied to students' past academic data to predict and analyze their academic performance. These performance measures help to find students at risk of dropping out and to identify those who need special coaching and the allocation of an instructor for suitable advice and counseling.
Study [9] stated that the decision tree is the most widely used supervised classification algorithm. Its construction is fast and easy, and the DT classifier can be applied in any field. A qualitative student dataset was collected for educational data mining. Different decision tree algorithms (CART, C4.5, and ID3) were applied to the dataset and their performance was compared. The comparison shows that the Gini index of CART outperforms the information gain ratio of ID3 and C4.5. The performance and correctness of the CART algorithm are greater than those of ID3 and C4.5, and the DT results indicate that students' performance is influenced by qualitative features.
Reference [10] stated that data mining techniques are increasingly being merged with the educational field. The combination of data mining and education is called educational data mining, and it helps to identify the features and information of students. This study predicts and analyzes the performance of bachelor's and master's students at university level. The performance is analyzed with two algorithms: decision tree and fuzzy genetic algorithms. The dataset contains the features internal-marks, sessional-marks, and admission-marks, which are used to identify the results. Internal-marks contain attendance-marks, AVG-marks, sessional-marks, and assignment-marks. Weighted marks are obtained from matric and intermediate classes. For the master's degree, examination marks are also included. A systematic model is used to enhance the performance of students early and in time, since finding results and solutions at an early stage leads to good results in the final examination. Students can also view their results and new updates. Many companies connect to educational organizations to find students who match their needs.
Reference [11] stated that large amounts of data are stored in different technological spaces, which produce new data quickly and easily, every day or even every second. Data mining is combined with these technologies so that important information can be extracted from ordinary data, and applying data mining methods provides meaningful knowledge. An educational database contains a huge amount of student-related data to which data mining methods can be applied. This study describes how to use DM algorithms such as KNN, naïve Bayes, and DT: these algorithms are applied to raw student data to find the best result.
Reference [12] noted that college students have easy access to the internet, which supports them in living and learning. This study examines the connection between internet use behavior and the educational performance of undergraduate students by using machine learning algorithms. The dataset of 4000 students has the attributes online-duration, internet-traffic-volume, and connection-frequency, which were extracted, calculated, and normalized from real internet usage. DT, NN, and SVM were used to predict student educational performance from these attributes. The connection-frequency attribute is positively linked, and the internet-traffic-volume attribute is negatively linked, with the academic performance of students. The online-time attributes give differing results across datasets. Accuracy improves as the number of features increases. The results indicate that internet usage data can distinguish and predict students' academic performance.
Wireless Communications and Mobile Computing
Reference [13] noted that data mining approaches are used in higher education and form an attractive part of educational research. These approaches are used to identify meaningful data within large volumes of unstructured data. A supervised data mining method is used to predict student progress, which is helpful for current educational organizations. The basic purpose of the study is to build a model with classification methods that analyzes student performance in Malaysia and finds the most important features in a large dataset. KNN, naïve Bayes, DT, and logistic regression are used to analyze students' academic results, and the approaches are compared on accuracy, precision, recall, and ROC curve. The results show that the naïve Bayes algorithm performs best. NB reveals the important attributes that are used to find excellent students whose grades are A+ and A.
Reference [14] reported that the large number of students dropping out is a major worry for higher education organizations. It greatly affects student fees and wastes public resources. It is necessary to find the students who are in danger of dropping out and the attributes that cause a higher dropout rate, and educational data mining methods are used to address this problem. The study examines computer science undergraduates at Universiti Teknologi MARA after three years of study. DT, logistic regression, random forest, KNN, and NN algorithms are compared to analyze student performance, and several machine learning methods are combined into an efficient model. Logistic regression is the best algorithm for analyzing and predicting dropout students.

Educational Data Mining Model
This study evaluates the impact of technology on students' educational performance. The proposed educational data mining (EDM) model is divided into five major stages: collection of the dataset, preprocessing of the dataset, feature extraction, selection of a classifier, and model evaluation (see Figure 1). Each stage may contain more than one substep. First, the dataset was cleaned and checked for missing values. After cleaning, the required features, such as "use of technology," "weekend-social-media-use," and "weekday-social-media-use," were extracted from the dataset. In the next step, different learning models were used to predict students' final grade performance. After that, the models were compared on accuracy score to select the best learner for the problem. The algorithms used in the study are DT, SVM classifier, SGD classifier, random forest classifier, AdaBoost, and logistic regression classifier.
3.1. Dataset Collection. The data used for the analysis is collected from an electronic-learning system called Kalboard 360 and is publicly available at https://www.kaggle.com/d50stuck/kalboard-360-use-case. The features of the student dataset and their categories are listed in Table 1.

Preprocessing.
In the preprocessing phase, we first make sure that no irrelevant or unacceptable values exist inside the dataset. This process is called cleaning. After cleaning, we analyzed the data and removed fields that are not relevant to our research objective, which makes the data more refined. During preprocessing, we also handled the null values in the dataset.
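The cleaning steps described above can be illustrated with a minimal sketch. It assumes that null values are handled by dropping the affected rows, which is only one possible strategy and is not confirmed by the paper; the function name `clean_records` and the field names are hypothetical and do not reflect the dataset's real columns.

```python
def clean_records(records, keep_fields):
    """Keep only the relevant fields and drop rows with a missing value in them."""
    cleaned = []
    for row in records:
        # Remove fields that are not relevant to the research objective.
        selected = {field: row.get(field) for field in keep_fields}
        # Treat None and empty strings as missing (null) values.
        if any(value is None or value == "" for value in selected.values()):
            continue
        cleaned.append(selected)
    return cleaned


# Hypothetical rows; the field names are illustrative only.
rows = [
    {"technology": "yes", "weekend_social_media_use": 4, "student_id": 1},
    {"technology": None, "weekend_social_media_use": 2, "student_id": 2},
    {"technology": "no", "weekend_social_media_use": 1, "student_id": 3},
]
cleaned_rows = clean_records(rows, ["technology", "weekend_social_media_use"])
```

Imputing a default value instead of dropping the row would be an equally plausible reading of "handled the null values".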

Selection of Classifier.
After obtaining the required features, different classifiers were trained on the dataset. The algorithms used in the study are DT, SVM classifier, SGD classifier, random forest classifier, AdaBoost, and LR classifier.
3.4. Decision Tree. The DT classifier is simple and understandable by analysts and end users. It is a tree-shaped model built from the features (see Figure 2). These are WSM, DSM, living area, romantic status, parent education, technology, and desire-higher-education; all of these features are called nodes and influence the students' final scores.
3.5. SVM Classifier. This classifier is a linear algorithm that is suitable for small datasets. Although support vector machines need little memory, they are not suitable for large datasets because they need more training time. We used this classifier because our dataset is small; it correctly classifies the features that influence student performance. It divides features into two classes: for example, living area is divided into rural and urban, and the use of technology is divided into yes and no. The study divides the use of the internet into two classes, "low use" and "high use". If a person uses it on 1-2 days, it is considered "low use" and a weight of -1 is given to the user. If a person uses it on 4-5 days, it is considered "high use" and a weight of 1 is given to the user. In the case of 3 days, a weight of 0 is given. All categories are shown in Figure 3. Here $x_1, x_2, \dots, x_n$ are the feature vectors and $y$ is the class to which each belongs, giving the training pairs $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$.
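The weighting of internet use described above can be sketched as a small encoding function. The function name `usage_weight` is hypothetical, and the sketch assumes that usage is recorded as a whole number of days from 1 to 5.

```python
def usage_weight(days_used):
    """Map days of internet use to the weight described in the text."""
    if days_used in (1, 2):
        return -1  # "low use"
    if days_used == 3:
        return 0   # neither low nor high use
    if days_used in (4, 5):
        return 1   # "high use"
    raise ValueError("days_used must be an integer from 1 to 5")


weights = [usage_weight(days) for days in [1, 2, 3, 4, 5]]
```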
3.6. Random Forest. The training data is divided into small parts (samples). The random forest classifier builds small trees from features such as weekend-social-media-use, weekday-social-media-use, living area, romantic status, technology, and parent education. It analyzes student performance by building multiple decision trees, each of which splits its own nodes. At the end, a voting process is performed over the trees for every sample to determine the performance of the student. The random forest algorithm provides better results than the decision tree algorithm and the other algorithms used here. The working of the random forest classifier is shown in Figure 4.
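The voting step at the end of the random forest procedure can be sketched as a simple majority vote. This is only an illustration of the voting idea, not a full random forest; the function name `majority_vote` and the example predictions are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    if not tree_predictions:
        raise ValueError("at least one prediction is required")
    # most_common(1) returns a list holding the (label, count) pair with the top count.
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for a single student.
predicted_class = majority_vote(["good", "fair", "good", "poor", "good"])
```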

Logistic Regression Classifier.
The logistic regression classifier is another type of supervised learning algorithm. It is also called the logit model. It is a statistical model used to find the probability of a class, such as pass or fail. It uses the basic logistic function to model a binary dependent variable. It is easy to implement and provides a baseline for binary classification. It also defines the link between the dependent variable and the independent variables. The predicted LR outcome is a discrete class. The equation of logistic regression is

$$y = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}$$

where y is the predictive output, b0 is the intercept term, and b1 is the single value coefficient of input (x).
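As a minimal sketch, the standard logistic function with intercept b0 and coefficient b1 can be evaluated as follows. The 0.5 decision threshold and the pass/fail labels are assumptions for illustration; the paper does not state the threshold it uses.

```python
import math


def logistic(x, b0, b1):
    """Probability from the logistic function with intercept b0 and slope b1."""
    z = b0 + b1 * x
    # Algebraically equal to exp(z) / (1 + exp(z)).
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x, b0, b1, threshold=0.5):
    """Turn the probability into a binary label (threshold is an assumption)."""
    return "pass" if logistic(x, b0, b1) >= threshold else "fail"
```

For example, with b0 = 0 and b1 = 1 an input of 0 gives a probability of exactly 0.5, which sits on the assumed decision boundary.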

AdaBoost.
The AdaBoost classifier is a machine learning meta-algorithm: several low-accuracy (weak) classifiers are merged into a single highly predictive model to increase performance. This classifier is sensitive to error-prone (noisy) data and outliers, but it is less prone to overfitting than other algorithms. By combining weak classifiers, AdaBoost builds a strong, high-performance classifier whose accuracy is high. As shown in Figure 5, the AdaBoost classifier works in steps, beginning as follows: (1) First, AdaBoost selects training samples randomly.

SGD Classifier.
In batch gradient descent, the batch is created from the complete dataset, so the complexity is high when the dataset is big. The SGD classifier instead uses a single sample of data at a time. In the next iteration, the sample is exchanged for another randomly chosen sample, and the process is repeated. This increases the efficiency of the classifier, and it is easy to implement.
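The single-sample update described above can be sketched for a one-feature logistic model. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the function name `sgd_fit`, the learning rate, the number of epochs, and the toy data are all hypothetical.

```python
import math
import random


def sgd_fit(xs, ys, learning_rate=0.1, epochs=100, seed=0):
    """Fit p = 1 / (1 + exp(-(b0 + b1 * x))) with one-sample updates."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    b0, b1 = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order
        for i in order:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xs[i])))
            error = p - ys[i]  # gradient of the log loss w.r.t. the score
            b0 -= learning_rate * error
            b1 -= learning_rate * error * xs[i]
    return b0, b1


# Hypothetical toy data: centred usage level -> 1 (pass) or 0 (fail).
usage = [-2.0, -1.0, 1.0, 2.0]
passed = [1, 1, 0, 0]
b0, b1 = sgd_fit(usage, passed)
```

Each update touches only one sample, so the cost per step does not grow with the size of the dataset.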

Results
The results of the educational data mining model are presented as follows.

Model Evaluation.
The evaluation phase of our model analyzes the outcomes of every classifier on the basis of the following factors. The confusion (or error) matrix is used to evaluate the performance of a classifier (see Figure 6). Accuracy is the basic evaluation metric, measuring the rate of correct predictions. It is computed with the following formula:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives in the confusion matrix.
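Accuracy, understood as the share of correct predictions in a confusion matrix, can be computed with a short sketch. It assumes the matrix is square with true classes as rows and predicted classes as columns, so the diagonal holds the correct predictions; the function name and the example counts are hypothetical.

```python
def accuracy_from_confusion(matrix):
    """Share of correct predictions: diagonal sum divided by the total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("confusion matrix contains no samples")
    return correct / total


# Hypothetical two-class matrix: rows are true classes, columns are predictions.
binary_accuracy = accuracy_from_confusion([[45, 5], [5, 45]])
```

The same function also covers a three-class case such as good, fair, and poor, since it only relies on the diagonal.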

Correlation Heat Map.
A heat map is a simple and useful tool for finding the useful attributes in a dataset. The diagram represents the correlation between different features. A value of 1 shows that two features are positively correlated; when the correlation is close or equal to -1, an increase in one variable is accompanied by a decrease in the other. The main advantage of a heat map is that it shows how useful a feature is for our problem, so the dataset can be cleaned before it is used. The correlation heat map is shown in Figure 7.
We classify students into three categories, "good," "fair," and "poor," according to their final exam performance, and then analyze a few features that have a significant influence on students' final performance, including romantic status, parent education level, frequency of going out, desire for higher education, and living area. The three categories of students are shown in Figure 8.
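Each cell of a correlation heat map holds a correlation coefficient between two features. The sketch below assumes the Pearson coefficient, which is the usual default but is not stated in the paper; the function name `pearson` and the example values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical values: usage level against a score that falls as usage rises.
r = pearson([1, 2, 3, 4, 5], [18, 16, 13, 11, 9])
```

A value of r near -1 here would correspond to a strongly negative cell in the heat map.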

Final Grade by Frequency of Technology Usage.
Figure 9 shows 5 levels of technology understanding and use. The use of technology depends on how well students understand the devices they use. Good students understand the technology and use it for their studies; their performance is high because they understand the technology and do not waste their time. Poor students are not capable of using technology. When they use devices, they do not understand what they are doing, so they waste their time and energy, and their performance is low. Among fair students, some understand the technology and some do not; some use technology but cannot improve their performance because they have no proper guidelines or training.
Figure 10 shows the weekend social media use of good (intelligent), fair (normal), and poor (weak) students. The use of social media is divided into five levels (1, 2, 3, 4, and 5). Low social media usage on the weekend is represented by levels 1 and 2, medium use by level 3, and the highest use by levels 4 and 5. Poor students, who are weak in their studies, use social media, which lowers their performance further. When fair and good students use social media heavily, their performance also decreases. So, high use of social media influences student performance.

Final Grade by Parents' Education Level.
Parents' education level influences student performance.
4.6.1. Father Education Level. A parent's education level has a positive correlation with a student's final score. The father's education level influences the student's grade, and educated fathers also affect their children's education. The father's education affects the student's performance, but the mother's education has a greater effect. Figure 11 shows that the father's education influences the student's grade.
4.6.2. Mother Education Level. Comparatively, the mother's education level has a bigger influence than the father's education level. Because mothers guide their children and support them in their studies more than fathers do, the mother's education strongly influences the student's final score. Most mothers are uneducated, and student performance at an early stage depends on the mother's education. Some mothers who are not educated take their child's studies less seriously, so the child cannot perform well at school level. Figure 12 shows how much the mother's education influences student performance.

4.7. Feature Effect. Figure 13 shows that increasing the number of features increases the prediction accuracy of a classifier, but multiple features also increase its complexity. Sometimes, using multiple features does not increase the prediction accuracy because the features are irrelevant to the problem, and sometimes, a small number of features greatly influences the prediction accuracy.

Final Results of Classifiers.
Supervised learning algorithms were used to predict student academic performance from the use of technology. These algorithms are DT, random forest, SVM, logistic regression, AdaBoost, and SGD. The score of the decision tree is 90%, random forest is 98%, the support vector classifier is 86%, logistic regression is 88%, AdaBoost is 89%, and the SGD classifier is 84%. The scores show that the random forest classifier has the best results of all the classifiers. The DT classifier is second; its score is lower than that of random forest because the decision tree suffers from overfitting. The Stochastic Gradient Descent classifier has the lowest score. The comparison of classifiers is shown in Table 2.
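The ranking of the classifiers follows directly from the reported scores. As a small sketch, the scores stated in the text can be sorted as follows; the variable names are hypothetical and the values are simply the accuracies reported in the paper expressed as fractions.

```python
# Accuracy scores as reported in the text, expressed as fractions.
reported_scores = {
    "Decision tree": 0.90,
    "Random forest": 0.98,
    "SVM": 0.86,
    "Logistic regression": 0.88,
    "AdaBoost": 0.89,
    "SGD": 0.84,
}

# Sort from the highest to the lowest score.
ranking = sorted(reported_scores.items(), key=lambda item: item[1], reverse=True)

for name, score in ranking:
    print(f"{name}: {score:.0%}")
```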

Discussion
In the last few years, educational data mining has received great attention. Many data mining approaches extract knowledge from educational databases, and the extracted information helps educational institutes to improve the teaching and learning process. This improves the output of both students and educational institutes. Student behavioral attributes also influence student performance: the accuracy of classifiers is greater with behavioral features than without them. The decision tree classifier achieves an accuracy of 75% with behavioral features, compared with 55% without them. EDM is a newly emerging field of research that combines data mining and educational data [15][16][17][18]. It helps students to improve their performance and their learning activities. The educational data can come from any education repository, for example, a learning management system, web-based education, or online data. Ensemble methods are also used to obtain higher classifier performance. These techniques divide the data into parts of equal size and use a voting process: the prediction with the most votes is taken as the result. Ensemble algorithms include bagging, boosting, random forest, etc. In traditional approaches, a single model is trained on the data, whereas in ensemble approaches, more than one model is trained and the models' outputs are combined by voting. The advantage of the ensemble method is that its accuracy is higher than that of a single model [19][20][21][22].
Assessment and activity data can affect students' educational performance. Four algorithms (decision tree, random forest, multilayer perceptron, and logistic regression) were used to identify the important features that affect students' academic performance. The results show that the most important features are assessment data, such as final exam and assignment marks. The decision tree performs well, while random forest achieves the highest accuracy [23][24][25][26].
Technology features such as social media usage (weekday and weekend), together with living area, parent education, desire to receive higher education, and romantic status, greatly influence student performance. Using social media negatively and spending more time on it can decrease student performance. With the random forest ensemble method, the accuracy of the classifier (98%) is higher than that of the other classifiers, such as decision tree, AdaBoost, logistic regression, and Stochastic Gradient Descent. The technology feature greatly influences student performance [27]. Behavior, assessment marks, parent involvement, living area, and many more features can influence student education, but in the modern world, technology plays a vital role in every student's life. This paper focuses on technology in the form of social media use. The main benefit of an ensemble algorithm is that it gains higher accuracy than a single-model classifier such as the SVM classifier.

Conclusion
Academic achievement is the biggest concern of every educational institute. This paper describes the importance and impact of technology on student education. This study used six machine learning classifiers (DT, SVM classifier, SGD classifier, random forest classifier, AdaBoost, and logistic regression classifier) to determine the patterns inside the student performance data. Our objective was to evaluate and analyze the impact of technology on student education, so we used attributes including technology features, weekday-social-media-use, and weekend-social-media-use to analyze student data. All six of our classifiers achieved improved performance when technology features were added to the other features. Notably, random forest achieved the highest accuracy, 98%. The scores of decision tree, AdaBoost, logistic regression, SVM, and SGD are 90%, 89%, 88%, 86%, and 84%, respectively. This research shows that technology is now a very important factor in achieving better performance in educational institutes. Social media greatly influences student education, and including this feature increases the accuracy of classifiers. In the current changing world, online and home-based education has become very important. Analysis of the different factors of technology, the negative impact of technology, and the impact of home-based learning could be key directions for future research.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest to report regarding the present study.