A Neuro-Fuzzy Approach in the Classification of Students' Academic Performance

Classifying students' academic performance with high accuracy facilitates admission decisions and enhances educational services at educational institutions. The purpose of this paper is to present a neuro-fuzzy approach for classifying students into different groups. The neuro-fuzzy classifier used previous exam results and other related factors as input variables and labeled students based on their expected academic performance. The results showed that the proposed approach achieved an accuracy of about 90% on the testing dataset. The results were also compared with those obtained from other well-known classification approaches, including support vector machine, Naive Bayes, neural network, and decision tree approaches. The comparative analysis indicated that the neuro-fuzzy approach performed better than the others. It is expected that this work may be used to support student admission procedures and to strengthen the services of educational institutions.


Introduction
Accurately predicting student performance is useful in many contexts in educational environments. When admission officers review applications, accurate predictions help them to distinguish between suitable and unsuitable candidates for an academic program. An inaccurate admission decision may result in an unsuitable candidate being admitted to the university. Since the quality of an educational institution is mainly reflected in its research and training, the quality of admitted candidates affects the quality of the institution. Accurate prediction enables educational managers to improve student academic performance by offering students additional support such as customized assistance and tutoring resources. The results of prediction can also be used by lecturers to specify the most suitable teaching actions for each group of students and provide them with further assistance tailored to their needs. Thus, accurate prediction of student achievement is one way to enhance quality and provide better educational services. As a result, the ability to predict students' academic performance is important for educational institutions. A very promising tool to achieve this objective is data mining, which processes large amounts of data to discover hidden patterns and relationships that support decision-making.
Data mining in higher education is forming a new research field called educational data mining [1,2]. The application of data mining to education allows educators to discover new and useful knowledge about students [3]. Educational data mining develops techniques for exploring the types of data that come from educational institutions. There are several data mining techniques, such as statistics and visualization, clustering, classification, and outlier detection. Among these, classification is one of the most frequently studied techniques. Classification is a supervised learning process that maps data into predefined classes. The goal of a classification model is to predict the target class for each sample in the dataset. There are various approaches for classification of data, including support vector machine (SVM), artificial neural network (ANN), and Bayesian classifier approaches [4]. Based on these approaches, a classification model that describes and distinguishes data classes is constructed. Then, the developed model is used to predict the class label of new data that does not belong to the training dataset. These approaches have been widely applied in educational environments [5-7]. In this study, we present a classification model based on a neuro-fuzzy approach to predict students' academic performance level.
Neural networks and fuzzy set theory, which are termed soft computing techniques, are tools for establishing intelligent systems. A fuzzy inference system (FIS) employing fuzzy if-then rules to acquire knowledge from human experts can deal with imprecise and vague problems [8]. FISs have been widely used in many applications including optimization, control, and system identification. Fuzzy systems do not usually learn and adjust themselves [9], whereas a neural network (NN) has the capacity to learn from its environment, self-organize, and adapt in an interactive way. For these reasons, a neuro-fuzzy system, which is the combination of a fuzzy inference system and a neural network, has been introduced to produce a complete fuzzy-rule-based system [10,11]. The merits of neural networks and fuzzy systems can be integrated in a neuro-fuzzy approach. Fundamentally, a neuro-fuzzy system is a fuzzy network that not only includes a fuzzy inference system but can also overcome some limitations of neural networks, as well as the limits of fuzzy systems [12,13], because it combines learning ability with the capacity to represent knowledge in an interpretable manner. One of the neuro-fuzzy systems, a neuro-fuzzy classifier (NFC), combines the powerful description of FIS with the learning capabilities of NNs to partition a feature space into classes. NFCs have been commonly used for different problems [14,15]. In this paper, we use an NFC with a scaled conjugate gradient (SCG) algorithm improved by Cetişli and Barkana [16] to classify students based on their expected academic performance levels.
The paper is organized into seven sections. After the introduction in Section 1, some of the most commonly used classification approaches are presented in Section 2. Section 3 describes the neuro-fuzzy classifier. Section 4 is dedicated to describing the process of NFC training. The data preparation is in Section 5 and the results are in Section 6. Finally, Section 7 presents the conclusion.

Classification Approaches
Various approaches are used for discovering knowledge from databases. In this section, the most commonly used approaches are briefly discussed.

Support Vector Machine (SVM).
SVM is a supervised learning method influenced by advances in statistical learning theory [17]. SVM has been successfully applied to a number of applications in classification and recognition problems. Using training data, SVM maps the input space into a high-dimensional feature space. In the feature space, the optimal hyper-plane is identified by maximizing the margins or distances of class boundaries. The training points that are closest to the optimal hyper-plane are called support vectors.
When the decision surface is achieved, it can be used for classifying new data.
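Consider a training dataset of feature-label pairs $(x_i, y_i)$ with $i = 1, \ldots, m$. The optimum separating hyper-plane is represented as

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b\right),$$

where $K(x_i, x)$ is the kernel function; $\alpha_i$ is a Lagrange multiplier; and $b$ is the offset of the hyper-plane from the origin. This is subject to constraints $0 \le \alpha_i \le C$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$, where $\alpha_i$ is a Lagrange multiplier for each training point and $C$ is the penalty. Only the support vectors, that is, the training points lying closest to the hyper-plane, have nonzero $\alpha_i$. However, in real-world problems, data are noisy and there may be no linear separation in the feature space. Hence, the optimum hyper-plane can be identified as

$$\min_{w, b, \xi} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad y_i\left(w^{T} \phi(x_i) + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0,$$

where $w$ is the weight vector that determines the orientation of the hyper-plane in the feature space, $\phi(\cdot)$ is the mapping into the feature space, and $\xi_i$ is the $i$th positive slack variable that measures the amount of violation from the constraints.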
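As a minimal sketch of how the kernel decision function above classifies a new point, the following Python fragment evaluates the sign of the weighted kernel sum. The RBF kernel is the one reported later for the experiments; the support vectors, multipliers, offset, and the function names `rbf_kernel` and `svm_decide` are illustrative assumptions, not fitted values or the authors' code.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_decide(x, support_vectors, labels, alphas, b, gamma=0.5):
    """Return +1 or -1 from the sign of sum_i alpha_i * y_i * K(x_i, x) + b."""
    score = sum(a * y * rbf_kernel(sv, x, gamma)
                for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if score >= 0 else -1

# Hand-picked (not trained) values: one support vector per class.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]   # chosen so that sum(alpha_i * y_i) = 0
b = 0.0

print(svm_decide((1.9, 2.1), support_vectors, labels, alphas, b))
print(svm_decide((0.1, -0.2), support_vectors, labels, alphas, b))
```

In practice the multipliers and the offset come from solving the constrained optimization problem; only the evaluation step is shown here.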

Naive Bayes Classifier.
A Naive Bayes classifier is based on Bayes' theorem and the probability that a given data point belongs to a particular class [18]. Assume that we have training samples $(x_i, y_i)$, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$ is a $d$-dimensional vector and $y_i$ is the corresponding class. For a new sample $x_{tst}$, we wish to predict its class $y_{tst}$ using Bayes' theorem:

$$y_{tst} = \arg\max_{y} P(y \mid x_{tst}) = \arg\max_{y} \frac{P(x_{tst} \mid y)\, P(y)}{P(x_{tst})}.$$

However, the above equation requires estimation of the distribution $P(x \mid y)$, which is impossible in some cases. A Naive Bayes classifier makes a strong independence assumption on this probability distribution by the following equation:

$$P(x \mid y) = \prod_{j=1}^{d} P(x_j \mid y).$$

This means that individual components of $x$ are conditionally independent given its label $y$. The task of classification now proceeds by estimating one-dimensional distributions $P(x_j \mid y)$.
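The estimation of one-dimensional distributions can be illustrated with a small sketch. It assumes each feature is Gaussian within a class, which the text does not specify, and it works in log space to avoid numerical underflow. The function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small floor on the variance avoids division by zero.
        variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Pick the class maximizing log P(y) + sum_j log P(x_j | y)."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=lambda label: log_posterior(*model[label]))

# Toy data with two features and two classes (made-up numbers).
X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (4.0, 5.0), (4.2, 4.8), (3.8, 5.2)]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, (1.1, 2.1)), predict_gaussian_nb(model, (4.1, 4.9)))
```

The denominator of Bayes' theorem is omitted because it is the same for every class and does not change the arg max.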

Neural Network (NN).
Neural networks can represent complex relationships between inputs and outputs [19]. The classification procedure based on NNs consists of three steps, namely, data pre-processing, training, and testing. The data pre-processing refers to the feature selection. For the data training, the features from the data preprocessing step are fed to the NN, and a classifier is generated through the NN. Finally, the testing data is used to verify the efficiency of the classifier.

Decision Tree (DT).
A decision tree is a hierarchical model composed of decision rules that recursively split independent inputs into homogeneous sections [20]. The aim of constructing a DT is to find the set of decision rules that can be utilized to predict outcomes from a set of input variables. A DT is called a regression or classification tree if the target variables are continuous or discrete, respectively [21]. The computational complexity of a decision tree may be high, but it can help to identify the most important input variables in a dataset by placing them at the top of the tree.
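To make the idea of splitting inputs into homogeneous sections concrete, the sketch below scores candidate thresholds on a single numeric input. It assumes Gini impurity as the homogeneity measure, a common choice for classification trees that the text does not state explicitly. The function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2  # midpoint between neighbouring distinct values
        left = [l for v, l in zip(values, labels) if v <= thr]
        right = [l for v, l in zip(values, labels) if v > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (thr, score)
    return best

print(best_threshold([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
```

A full tree applies this search to every input variable and recurses on the two resulting subsets.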

Neuro-fuzzy Classifier (NFC) Architecture
A typical fuzzy classification rule $R_i$, which demonstrates the relation between the input feature space and classes, is as follows:

$$R_i:\ \text{if } x_{s1} \text{ is } A_{i1} \text{ and } \ldots \text{ and } x_{sN} \text{ is } A_{iN}, \text{ then the class is } C_k,$$

where $x_{sj}$ represents the $j$th feature or input variable of the $s$th sample; $A_{ij}$ denotes the fuzzy set of the $j$th feature in the $i$th rule; and $C_k$ represents the $k$th label of class. $A_{ij}$ is identified by the appropriate membership function [22].
In the NFC, the feature space is partitioned into multiple fuzzy subspaces by fuzzy if-then rules. These fuzzy rules can be represented by a network structure. An NFC is a multilayer feed-forward network consisting of the following layers: input, fuzzy membership, fuzzification, defuzzification, normalization, and output. The classifier has multiple inputs and multiple outputs. Figure 1 depicts an NFC with two features $\{x_1, x_2\}$ and three classes $\{C_1, C_2, C_3\}$. Every input is defined with three linguistic variables; thus, there are nine fuzzy rules.
Membership layer: the membership function of each input is identified in this layer. Several types of membership functions can be used. In this study, a Gaussian function is utilized, since this function has fewer parameters and smoother partial derivatives for parameters. The Gaussian membership function is defined as

$$\mu_{ij}(x_{sj}) = \exp\left(-\frac{(x_{sj} - c_{ij})^2}{2\sigma_{ij}^2}\right),$$

where $\mu_{ij}(x_{sj})$ is the membership grade of the $i$th rule and $j$th feature; $x_{sj}$ represents the $s$th sample and $j$th feature; $c_{ij}$ and $\sigma_{ij}$ are the center and the width of the Gaussian function, respectively.
Fuzzification layer: each node in this layer generates a signal corresponding to the degree of fulfillment of the fuzzy rule for the sample. It is called the firing strength of a fuzzy rule with respect to an object to be classified. The firing strength of the $i$th rule is as follows:

$$\alpha_{is} = \prod_{j=1}^{N} \mu_{ij}(x_{sj}),$$

where $N$ is the number of features.
Defuzzification layer: in this layer, weighted outputs are calculated; each rule affects each class according to their weights. If a rule controls a specific class region, the weight between that rule output and the specific class will be larger than the other weights. Otherwise, the class weights are small. The weighted output for the $s$th sample that belongs to the $k$th class is calculated as follows:

$$\beta_{sk} = \sum_{i=1}^{U} \alpha_{is} w_{ik},$$

where $w_{ik}$ denotes the degree of belonging to the $k$th class that is controlled by the $i$th rule and $U$ represents the number of rules.

Computational Intelligence and Neuroscience
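Normalization layer: the outputs of the network should be normalized, since the summation of weights may be larger than 1 in some cases:

$$o_{sk} = \frac{\beta_{sk}}{\sum_{l=1}^{K} \beta_{sl}},$$

where $o_{sk}$ denotes the normalized value of the $s$th sample that belongs to the $k$th class, $\beta_{sk}$ is the corresponding weighted output of the defuzzification layer, and $K$ is the number of classes.
Then, the class label for the $s$th sample is obtained by the maximum value as follows:

$$C_s = \arg\max_{k = 1, \ldots, K} o_{sk},$$

where $C_s$ denotes the class label of the $s$th sample.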
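The layer-by-layer computation described above can be sketched as a single forward pass for one sample. This is an illustrative fragment under the stated layer definitions (Gaussian membership, product firing strength, weighted sum, normalization, maximum), not the authors' MATLAB implementation. The function name `nfc_forward` and the parameter values are assumptions, and classes are indexed from 0 here.

```python
import math

def nfc_forward(x, centers, widths, weights):
    """Forward pass of a neuro-fuzzy classifier for one sample x.

    centers[i][j], widths[i][j]: Gaussian center/width of rule i, feature j.
    weights[i][k]: weight linking rule i to class k.
    Returns (normalized class outputs, index of the predicted class).
    """
    # Membership + fuzzification: firing strength is the product of memberships.
    firing = []
    for c_row, s_row in zip(centers, widths):
        strength = 1.0
        for xj, c, s in zip(x, c_row, s_row):
            strength *= math.exp(-((xj - c) ** 2) / (2.0 * s ** 2))
        firing.append(strength)
    # Defuzzification: weighted sum of firing strengths for each class.
    n_classes = len(weights[0])
    beta = [sum(f * w_row[k] for f, w_row in zip(firing, weights))
            for k in range(n_classes)]
    # Normalization: class outputs sum to 1 (assumes at least one rule fires).
    total = sum(beta)
    outputs = [b / total for b in beta]
    return outputs, max(range(n_classes), key=outputs.__getitem__)

# Two rules, two features, two classes (illustrative parameters).
centers = [[0.0, 0.0], [1.0, 1.0]]
widths = [[0.5, 0.5], [0.5, 0.5]]
weights = [[1.0, 0.0], [0.0, 1.0]]
outputs, label = nfc_forward((0.9, 1.1), centers, widths, weights)
print(outputs, label)
```

Because the sample lies near the center of the second rule, that rule dominates the firing strengths and its class receives the largest normalized output.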

Training NFC
In order to determine an optimum fuzzy region, the parameters, $\theta = \{\sigma_{U \times N}, c_{U \times N}, w_{U \times K}\}$, of the fuzzy if-then rules must be optimized [23], where $\sigma$ and $c$ are the matrices containing the sigma and centre values, respectively; $w$ represents the weight matrix of connections from the fuzzification layer to the defuzzification layer; $U$, $N$, and $K$ are the number of rules, features, and classes, respectively. The k-means clustering method is utilized to obtain the initial parameters and to form the fuzzy if-then rules [24]. The k-means clustering method aims to partition the input feature space into a number of clusters in which each data point belongs to the cluster with the nearest mean. This results in a partitioning of the data space. For a given dataset, this method can estimate the number of clusters and the cluster centers. In Figure 2, a feature space with two inputs $\{x_1, x_2\}$ is shown. Suppose that every input is divided into three fuzzy sets by employing the k-means method. Each fuzzy set is characterized by the appropriate membership function; as a result, each input has three membership functions. A fuzzy classification rule describes the relationship between the input feature space and the classes. The formation of the fuzzy if-then rules is illustrated in Figure 2. Each input is represented as three membership functions; thus, we have nine fuzzy rules.
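As a rough illustration of how clustering can supply initial membership-function parameters, the sketch below runs plain k-means on the values of one input and uses each cluster's mean and standard deviation as a Gaussian center and width. Taking the standard deviation as the width, the deterministic initialization, and the name `kmeans_1d` are all assumptions. The text does not describe how the initial widths were derived.

```python
def kmeans_1d(values, k, iterations=50):
    """Plain Lloyd's k-means on scalar values; returns (centers, widths)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers over the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    widths = []
    for c, members in zip(centers, clusters):
        var = sum((v - c) ** 2 for v in members) / max(len(members), 1)
        widths.append(max(var ** 0.5, 1e-3))  # floor keeps the Gaussian well defined
    return centers, widths

# Made-up scores forming three loose groups, giving three fuzzy sets for one input.
scores = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
centers, widths = kmeans_1d(scores, 3)
print(centers, widths)
```

With three fuzzy sets per input and two inputs, combining one set from each input gives the nine rules mentioned in the text.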
Several training algorithms, including the Kalman filter [25] and the Levenberg-Marquardt method [26], have been used to optimize the parameters of the NFC. Application of the SCG algorithm showed that it produced the least error and the highest efficiency [27]. Moreover, the SCG improved by Cetişli and Barkana [16] decreases the training time per iteration without affecting the convergence rate. Hence, the improved SCG method is utilized for optimization in this study.
The cost function is determined from the least mean squares of the difference between target value and calculated class value. The cost function is as follows:

$$E = \frac{1}{S} \sum_{s=1}^{S} E_s, \qquad E_s = \frac{1}{2} \sum_{k=1}^{K} \left(t_{sk} - o_{sk}\right)^2,$$

where $S$ is the number of samples and $t_{sk}$ and $o_{sk}$ represent the target and calculated values of the $s$th sample belonging to the $k$th class, respectively. If the $s$th sample belongs to the $k$th class, the target value is 1; otherwise, it is 0. The aim of the SCG algorithm is to find the optimal or near-optimal parameter $\theta^{*}$ from the cost function $E(\theta)$. In the SCG algorithm, the next closest update vector, $\theta_{n+1}$, to the current vector $\theta_n$ is identified as

$$\theta_{n+1} = \theta_n - H_n^{-1} g_n,$$

where $g_n = \nabla E(\theta_n)$ and $H_n = \nabla^2 E(\theta_n)$ are the gradient vector and the Hessian matrix of $E(\theta_n)$, respectively. The product, $-H_n^{-1} g_n$, is called the Newton step; its Newton direction is indicated by the minus sign. If the Hessian matrix is positive definite and $E(\theta_{n+1})$ is quadratic, Newton's method directly reaches a local minimum in a single step [23]; however, reaching a local minimum commonly requires more iterations. Møller [28] introduced a temporal parameter vector $\theta_{t,n}$, which is between $\theta_{n+1}$ and $\theta_n$ and is defined as

$$\theta_{t,n} = \theta_n + \varepsilon_n d_n,$$

where $\varepsilon_n$ is the short step size and $d_n = -g_n$ is the conjugate direction vector of the temporal parameter vector at the $n$th iteration. The actual parameter update is calculated as

$$\theta_{n+1} = \theta_n + \eta_n d_n,$$

where $\theta_{n+1}$ is the next parameter update vector; $\theta_n$ is the current parameter vector; and $\eta_n$ is the actual parameter updating step size and is calculated as follows:

$$\eta_n = \frac{\mu_n}{\delta_n},$$

where $\delta_n$ is the second-order information and $\mu_n$ denotes the basic long step size. To calculate $\eta_n$, the second-order information should be obtained from the first-order gradients.
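The cost described above can be sketched for a batch of normalized class outputs. The factor of one half inside the per-sample error is an assumption made for convenience when differentiating. The function names are hypothetical, and class labels are indexed from 0 in this fragment.

```python
def one_hot(label, n_classes):
    """Target vector: 1 for the true class, 0 for every other class."""
    return [1.0 if k == label else 0.0 for k in range(n_classes)]

def nfc_cost(outputs, labels):
    """Mean over samples of 0.5 * sum_k (target - output)^2."""
    n_classes = len(outputs[0])
    total = 0.0
    for o_row, label in zip(outputs, labels):
        t_row = one_hot(label, n_classes)
        total += 0.5 * sum((t - o) ** 2 for t, o in zip(t_row, o_row))
    return total / len(outputs)

# A perfect prediction costs nothing; an undecided one is penalized.
print(nfc_cost([[1.0, 0.0, 0.0]], [0]))
print(nfc_cost([[0.5, 0.5, 0.0]], [0]))
```

Training then amounts to adjusting the centers, widths, and rule weights so that this value decreases.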
In the SCG algorithm, two different gradients of the parameter vector are calculated in each iteration. The gradient of the temporal parameter vector, $\theta_{t,n}$, is calculated using the short step size $\varepsilon_n$, and the gradient of the actual parameter update $\theta_{n+1}$ is calculated using the long step size $\eta_n$. Cetişli and Barkana [16] stated that the gradient of $\theta_{t,n}$ in the $n$th iteration is more suitable than the gradient of $\theta_{n+1}$. In the SSCG (speeding up SCG), the second gradient is used to estimate the first gradient for the next iteration. The estimation of only one gradient has the benefit of shortening the iteration time.
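The role of the two gradients can be illustrated with a single simplified step on a quadratic cost. The sketch follows the standard form of Møller's method: the second-order information is approximated by the difference of two first-order gradients along the search direction. It omits the scaling parameter and the conjugate-direction update of the full algorithm, and it does not implement the SSCG gradient reuse. The names `scg_step`, `epsilon`, `delta`, `mu`, and `eta` are choices made for this sketch.

```python
def scg_step(grad, theta, epsilon=1e-4):
    """One simplified SCG-style step along d = -g (assumes positive curvature)."""
    g = grad(theta)
    d = [-gi for gi in g]
    # Gradient at the temporal point theta + epsilon * d (short step).
    temporal = [t + epsilon * di for t, di in zip(theta, d)]
    g_t = grad(temporal)
    # Second-order information from two first-order gradients: s approximates H d.
    s = [(a - b) / epsilon for a, b in zip(g_t, g)]
    delta = sum(di * si for di, si in zip(d, s))
    mu = -sum(di * gi for di, gi in zip(d, g))
    eta = mu / delta  # actual step size
    return [t + eta * di for t, di in zip(theta, d)]

# Illustrative quadratic cost E = 0.5 * (theta_1^2 + 4 * theta_2^2).
def cost(theta):
    return 0.5 * (theta[0] ** 2 + 4.0 * theta[1] ** 2)

def grad(theta):
    return [theta[0], 4.0 * theta[1]]

start = [1.0, 1.0]
after = scg_step(grad, start)
print(cost(start), cost(after))
```

Each such step needs two gradient evaluations, which is the per-iteration cost that the improved method aims to reduce.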

Data Preparation
This section presents the application of the proposed model to the prediction of students' academic performance levels. An application in the context of Vietnam is used as an illustration.

Identifying Input and Output Variables.
Through a literature review and discussion with admission officers and experts, a number of academic, socioeconomic, and other related factors that are considered to influence students' academic performance were determined and chosen as input variables. The input variables were obtained from the admission registration profile and are as follows: the university entrance exam results (normally, in Vietnam, candidates take three exams for the fixed group of subjects they choose), the overall average score from a high school graduation examination, the elapsed time between graduating from high school and obtaining university admission, the location of high school (there are four regions, as defined by the government of Vietnam: Region 1, Region 2, Region 3, and Region 4. Region 1 includes localities with difficult economic and social conditions; Region 2 includes rural areas; Region 3 includes provincial cities; and Region 4 includes central cities), type of high school attended (private or public), and gender (male or female). Nonnumerical factors must be converted into a format suitable for neural networks. The input variables and ranges are presented in Table 1.
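The conversion of nonnumerical factors can be sketched as a simple lookup. The numeric codes, the record field names, and the function `encode_applicant` below are hypothetical and serve only as an illustration. The actual ranges are those defined in Table 1.

```python
# Hypothetical codes for illustration; Table 1 defines the actual ranges.
REGION_CODE = {"Region 1": 1, "Region 2": 2, "Region 3": 3, "Region 4": 4}
SCHOOL_CODE = {"private": 0, "public": 1}
GENDER_CODE = {"male": 0, "female": 1}

def encode_applicant(record):
    """Turn one registration record (a dict) into a numeric feature vector."""
    return [
        float(record["exam1"]),
        float(record["exam2"]),
        float(record["exam3"]),
        float(record["graduation_avg"]),
        float(record["years_since_graduation"]),
        float(REGION_CODE[record["region"]]),
        float(SCHOOL_CODE[record["school_type"]]),
        float(GENDER_CODE[record["gender"]]),
    ]

# A made-up applicant record.
applicant = {
    "exam1": 6.5, "exam2": 7.0, "exam3": 5.5,
    "graduation_avg": 7.2, "years_since_graduation": 0,
    "region": "Region 2", "school_type": "public", "gender": "female",
}
print(encode_applicant(applicant))
```

The resulting vector has eight entries: three entrance exam scores and one value for each of the remaining five factors.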
The preliminary step of all classification approaches is to identify the number of classes into which the dataset is to be classified and to assign class labels. Based on the current grading system used by the university and the scope of this project, three classes were identified as "good," "average," and "poor." As shown in Table 2, class labels were defined as 1, 2, and 3 for "good," "average," and "poor," respectively.

Dataset.
We obtained our data from the University of Transport Technology, which is a public university in Vietnam. For input variables, we used a real dataset from students in the Department of Bridge Construction, and for output variables, we used their achievements for the 2011-2012 academic year. The dataset belongs to the University of Transport Technology and can be requested by contacting the corresponding author by email. The dataset consisted of 653 cases and was divided into two groups. The first group (about 60%) was used for training the model. The second group (about 40%) was employed for testing the model. The training dataset served in model building while the other group was used for the validation of the developed model.
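A split of roughly 60% and 40% can be sketched as follows. The random shuffling, the seed, and the function name `split_dataset` are assumptions, since the text does not state how cases were assigned to the two groups.

```python
import random

def split_dataset(n_cases, train_fraction=0.6, seed=0):
    """Shuffle case indices and split them into training and testing groups."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * n_cases)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_dataset(653)
print(len(train_idx), len(test_idx))  # 392 261
```

With 653 cases this gives 392 training cases and 261 testing cases, which agrees with the 71 + 148 + 42 = 261 testing cases reported in the results.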

Results
The model was coded and implemented in the MATLAB environment (Matlab R2011b) and simulation results were then obtained. The NFC was trained with 100 iterations.
In the study, a 10-fold cross-validation method was utilized to avoid overfitting. The training dataset was divided into 10 subsets. Each classifying structure was trained 10 times. Each time, one of the 10 subsets served as the validation set and the remaining subsets were used as the training sets. The classifying structure that was selected has the highest accuracy on the validation set (averaging over 10 runs).
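The fold construction described above can be sketched with index lists. The `evaluate` callback stands in for training a candidate structure on the training indices and returning its accuracy on the validation indices. It and the function names are hypothetical.

```python
def k_fold_indices(n_samples, n_folds=10):
    """Yield (train_indices, validation_indices) for each fold."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

def cross_validated_accuracy(n_samples, evaluate, n_folds=10):
    """Average evaluate(train, validation) over all folds."""
    scores = [evaluate(tr, va) for tr, va in k_fold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)

# Fold sizes for a training set of 392 cases (about 60% of 653).
print([len(va) for _, va in k_fold_indices(392)])
```

The structure with the highest averaged validation accuracy is then retrained or retained for the final test on the held-out testing dataset.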
After training and validating, the NFC was tested using the testing dataset. Efficiency of the classifier was determined by comparing the predicted and actual class labels for the testing dataset. The comparison is given in Figure 3, in which the confusion matrix is presented. The NFC accurately predicted 60 out of 71 cases for "good," 139 out of 148 for "average," and 36 out of 42 for "poor." This gives accuracies of 84.51%, 93.92%, and 85.71% for the "good," "average," and "poor" classes, respectively. This provides 90.03% accuracy for the NFC, which is satisfactory when compared with results from studies on prediction.
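The per-class and overall accuracies follow directly from the counts of correctly classified and total cases in each class. The short sketch below recomputes them from the counts quoted in the text. The function names are hypothetical.

```python
# Correctly classified and total testing cases per class, as quoted in the text.
correct = {"good": 60, "average": 139, "poor": 36}
total = {"good": 71, "average": 148, "poor": 42}

def per_class_accuracy(correct, total):
    """Percentage of correctly classified cases within each class."""
    return {c: 100.0 * correct[c] / total[c] for c in total}

def overall_accuracy(correct, total):
    """Percentage of correctly classified cases over the whole testing set."""
    return 100.0 * sum(correct.values()) / sum(total.values())

for name, value in per_class_accuracy(correct, total).items():
    print(f"{name}: {value:.2f}%")
print(f"overall: {overall_accuracy(correct, total):.2f}%")
```

Only the diagonal of the confusion matrix and the class totals are needed for these figures; the off-diagonal entries shown in Figure 3 indicate which classes the errors fall into.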
To assess the performance of the NFC, we compared the results obtained by the NFC with those obtained by other classification approaches. The 10-fold cross-validation method was also used to identify the classifier structures. For the SVM, an RBF kernel (often called a Gaussian kernel) was used. The prediction accuracy of the SVM classifier for the testing dataset was 82.76%. For the Naive Bayes classifier, the prediction accuracy was 72.8% for the testing dataset. In order to perform the classification based on a neural network, we investigated different neural network architectures with different numbers of hidden layers and neurons. Performance was measured using the mean squared error function; the Levenberg-Marquardt algorithm was utilized to train the neural networks. The network architecture with the highest efficiency in comparison with other architectures was selected. The selected architecture consisted of a single hidden layer with 10 neurons. Overall, the accuracy of the neural network was 86.2%. Finally, a classifier based on a decision tree was applied to the problem. The classification and regression tree (CART) algorithm was used for constructing the decision tree model. The obtained accuracy for the decision tree was 82.76%. The results of these approaches to the classification of students' academic performance levels are summarized in Figure 4, together with confusion matrices. When these results were compared with those obtained by the NFC model, it was found that the NFC outperformed the SVM, Naive Bayes classifier, neural network, and decision tree in classifying students' academic performance levels.
From the results, it can be concluded that the NFC model can be used to classify students into different groups based on their expected academic performance levels. The model achieved an accuracy of over 90%, which shows that it may be acceptable and good enough to serve as a classifier of students' academic performance levels.

Conclusions
By classifying students into different groups, educational institutions are able to strengthen their admission systems as well as provide better educational services. Thus, a model which could classify students based on their expected academic performance levels is necessary for institutions. There have been various approaches to classification. However, for a specific problem, increasing the accuracy of the classification model remains a subject of great importance. In this study, we applied an NFC model to classify a group of students. We also evaluated the classification accuracy of the model by comparing it with other well-known classifiers, including SVM, Naive Bayes, neural network, and decision tree classifiers. The obtained results demonstrated that the NFC model outperformed the others. The results of the present study also reinforce the view that a comparative analysis of different approaches is helpful in choosing a classification model with high accuracy. It is expected that this study may be used as a reference for decision-making in the admission process and to provide better educational services by offering customized assistance according to students' predicted academic performance levels.