Current machine learning (ML) based automated essay scoring (AES) systems employ a large and varied set of features that have proven useful for improving scoring performance. However, the resulting high-dimensional feature space is poorly represented by the limited amount of training data available, which leads to degraded performance and increased training time. In this paper, we experiment with and analyze the effects of feature optimization, including normalization, discretization, and feature selection techniques, on different ML algorithms, considering both the size of the feature space and the performance of the AES system. We show that appropriate feature optimization techniques can reduce the dimensionality of the feature space, thereby contributing to efficient training and improved AES performance.
Generally, essay scoring is performed manually by skilled assessment experts. However, manual scoring has several limitations. First, it is difficult to obtain consistent results, because of human error and biased preconceptions. Second, scoring requires a considerable amount of time and effort. Third, it is impractical for humans to provide detailed analysis or individual feedback. Consequently, there is growing interest in a computerized system that can automatically assess essays, since such a system could assist or even replace human assessors.
So far, most AES approaches have tried to find an appropriate ML algorithm and have focused on finding useful features for training that algorithm. Using a large number of features, however, raises two problems: (1) the limited amount of training data cannot properly represent the expanded high-dimensional feature space, so optimized training cannot be performed; and (2) the vast number of features must be considered at the same time, so training time increases.
Phenomenon (1) is referred to as the "curse of dimensionality." Most features used for AES take integer or real values, and there are hundreds of them, so a system using these features inevitably falls into the curse of dimensionality. ML based systems must then be trained on high-dimensional data, spending a tremendous amount of time in training. Therefore, it is essential that the feature space be reduced, so that optimal performance can be obtained and training can be made more efficient.
In this paper, we experiment with and analyze the effects of three different techniques for reducing the feature space: normalization, discretization, and feature selection. Normalization transforms the differing ranges of feature values into a fixed range, thereby reducing the overall range of feature values. Discretization groups numeric feature values and converts the values belonging to a specific group into a single integer value, thus reducing the number of distinct feature values. Feature selection reduces the feature space by selecting and using only the features that are relevant to the target scores and that discriminate well between different samples.
Unfortunately, normalization, discretization, and feature selection do not always have positive effects on application performance. An appropriate combination of feature optimization techniques must be selected for each domain and ML algorithm. Our research shows that using the appropriate feature optimization techniques can reduce the dimensionality of the features and thus lead to efficient training and improved AES performance.
The remainder of this paper is organized as follows. The next section reviews related work. We then describe the features and ML algorithms used for AES, followed by the architecture of the proposed system and the feature optimization techniques. After that, we present and analyze the experimental results. Finally, we conclude the paper and discuss future work.
Various studies on AES have taken place. Often, research based on the ML approach focuses on exploring novel features and learning methods to improve essay scoring performance; Project Essay Grade (PEG), one of the earliest AES systems, is a representative example.
Previous AES studies avoided using too many features, both because many kinds of useful features were not widely known and because increasing the number of features increases the training time of ML methods. In order to utilize the large and varied set of useful features efficiently for AES, appropriate feature optimization techniques, such as normalization, discretization, and feature selection, are required.
Several different approaches have been developed to reduce the feature space by using normalization and discretization techniques, thereby improving performance.
Chmielewski and Grzymala-Busse, for example, studied global discretization of continuous attributes as a preprocessing step for machine learning.
There have also been various approaches to reducing the dimensionality of high-dimensional data by selecting appropriate features from the vast number of possible features in many domains, using effective feature selection techniques.
So far, we have introduced studies in various domains that have shown performance improvements from dimensionality reduction. In this paper, we apply feature optimization techniques to a new domain, AES, and show that the feature dimensionality can likewise be reduced.
According to previous studies, many features have been found to be useful for AES. In this section, we describe the diverse features used for AES and briefly introduce ML algorithms suitable for AES.
In this study, we include most features that have been proven to be useful for AES; our newly proposed features, based on advanced NLP techniques, are used as well.
The frequency, average, and ratio of characters, words, and sentences are used as basic features. Under the assumption that the distribution of parts of speech (POS) differs across essay grades, we also use POS-related features. The level of vocabulary usage is evaluated using external resources, including elementary and intermediate dictionaries and a dictionary for the Graduate Record Examination (GRE), and is also used as a feature. Various features relating to language model statistics (such as word and POS n-gram counts and ratios) and advanced NLP analysis (compound nouns, noun phrases, and named entities) are used as well.
The number of features used in our study exceeds 300; they are grouped by category in the table below.
List of features used for learning AES.
Category | Types of features |
---|---|
Basic | (i) Number of characters, words, vocabularies, and sentences |
Dictionary | (i) Ratio of words and vocabularies in each dictionary |
Advanced NLP | (i) Average number, maximum frequency, and ratio of compound nouns, noun phrases, and named entities per sentence |
So far, there have been many attempts to utilize ML algorithms for AES. The ML algorithms for AES can be classified into two categories: regression and classification. Regression is used for predicting or estimating the corresponding target value given the feature values of the specific instance by analyzing the relationships between the feature values and the target value. Classification is used to determine the corresponding category of the specific instance.
The big difference between the two approaches is in the characteristics of the target values. The target values for regression are ordered and continuous, while the target values for classification are unordered and discretized. For AES, the regression approaches try to predict the continuous target score based on feature values that represent the characteristics of essays; the classification approaches try to identify the categorized score under the assumption that each essay belongs to a specific category.
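To make the distinction concrete, the following minimal sketch shows how a continuous regression output could be mapped onto a discrete grade, whereas a classifier returns one of the grades directly. The function name and the rounding rule are illustrative assumptions; only the 0 to 6 grade range is taken from the data description later in the paper.

```python
import numpy as np

def regression_to_grade(predicted_score, low=0, high=6):
    """Map a continuous regression output onto the nearest integer grade.

    The 0-6 grade range follows the essay data used in this paper; the
    rounding/clipping rule is an illustrative assumption, not a detail
    taken from the authors' implementation.
    """
    return int(np.clip(round(predicted_score), low, high))

# A classification model, by contrast, predicts one of the discrete grades directly:
#     predicted_grade = classifier.predict(feature_vector)
print(regression_to_grade(4.38))  # -> 4
```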
In this study, we have experimented with and compared two regression based ML algorithms and two classification based ML algorithms for AES and have tried to identify the most appropriate ML algorithms for AES through a series of experiments.
In this section, we provide a brief description of the four ML algorithms applied in our AES experiments: the multiple regression (MR) model, the maximum entropy (ME) model, the support vector machine (SVM), and support vector regression (SVR). MR is the most widely used algorithm in AES research. ME achieves good results in document classification using many features. SVM is one of the most effective algorithms for a wide range of classification problems. SVR applies regression to the SVM framework, so it can be expected to combine the benefits of both regression and classification.
The MR (multiple regression) model predicts an essay score as a linear combination of the feature values, with regression coefficients estimated from the training data.
Each instance of the training data represents an essay, and an essay is represented by a vector of feature values together with its human-assigned score as the target value.
The ME (maximum entropy) model is a probabilistic classifier that estimates the conditional probability of each grade given the feature values of an essay and has been widely used for document classification with large feature sets.
The fundamental idea of the ME model is to choose, among all probability distributions consistent with the constraints observed in the training data, the one with the maximum entropy. Solving this constrained optimization yields the final form of the ME model, $P(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{i} \lambda_i f_i(x, y)\right)$, where $f_i(x, y)$ are feature functions, $\lambda_i$ are the weights learned during training, and $Z(x)$ is a normalization factor. The grade assigned to an essay is the one with the highest conditional probability.
The SVM (support vector machine) is a classification algorithm that assigns an instance to one of a set of discrete classes using a separating hyperplane learned from the training data.
The concept of the SVM, with its separating hyperplane, margin, and support vectors, is illustrated in the figure below.
Concept of the support vector machine (SVM).
In the training process, the SVM tries to find a hyperplane that separates the training instances of the different classes with the maximum margin; the instances closest to this hyperplane are the support vectors.
The task for AES is to classify an essay into one of six grades by utilizing more than 300 different kinds of features. Finding the hyperplanes that separate the six grades in such a high-dimensional space requires a great deal of optimization time. Therefore, the feature space must be reduced through feature optimization techniques.
SVR (support vector regression) applies the SVM framework to regression: instead of a class-separating hyperplane, it learns a function that predicts a continuous target value.
In the training process, SVR tries to find a function that deviates from the target scores of the training essays by no more than a small tolerance while remaining as flat as possible.
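For concreteness, the sketch below shows how the four types of models could be instantiated with scikit-learn stand-ins. This is only an illustration under assumptions: the experiments in this paper use GSL, the Maximum Entropy Modeling Toolkit, and LIBSVM rather than scikit-learn, and LogisticRegression is used here as a proxy for the ME model because multinomial logistic regression has the same exponential form.

```python
# Illustrative scikit-learn stand-ins for the four models discussed above
# (the actual experiments use GSL for MR, the Maximum Entropy Modeling Toolkit
# for ME, and LIBSVM for SVM/SVR).
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVC, SVR

models = {
    # Regression approaches: predict a continuous essay score.
    "MR":  LinearRegression(),
    "SVR": SVR(kernel="rbf"),
    # Classification approaches: predict one of the discrete grades (0-6).
    "ME":  LogisticRegression(max_iter=1000),  # multinomial logistic regression as an ME proxy
    "SVM": SVC(kernel="rbf"),                  # multiclass grades handled one-vs-one internally
}

def train_all(models, X_train, y_train):
    """Fit every model on the same normalized/discretized/selected feature matrix."""
    for model in models.values():
        model.fit(X_train, y_train)
    return models
```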
The figure below shows the overall architecture of the ML based AES system, which consists of natural language processing modules, a feature extractor, a feature optimizer, a learner, and a predictor.
Automated essay scoring system architecture based on machine learning.
The feature extractor uses all of the natural language processing results to extract features that accurately represent the characteristics of an essay. More than 300 diverse features can be employed for training the AES model. The features can be classified into six categories, as shown in the feature list table above.
The feature optimizer selectively performs normalization, discretization, and feature selection for optimal performance. The labeled essays are input for training into the learner module, and the unlabeled essays are input for prediction into the predictor module. The learner module is used to create the training model, and the training model is used in the predictor module to calculate the final grade (score) of an essay.
In this paper, we focus on ML dependent feature optimization techniques for reducing dimensions of features, because the performance of AES is dependent on ML algorithms and feature optimization.
In ML based research, features with a wide variety of types and value ranges are used. Feature values used in AES, such as the number of words and the ratio of words found in an elementary dictionary, likewise have very different ranges. If these features are used directly in the AES system, learning may not be optimal. Therefore, we must convert all feature values into a fixed range.
Although there are a number of normalization methods, we selected the two most commonly used: min-max normalization and z-score normalization.
Min-max normalization rescales each feature value $x$ into the range $[0, 1]$ using the minimum and maximum values of that feature observed in the training data: $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$. Z-score normalization transforms each feature value using the mean $\mu$ and standard deviation $\sigma$ of the feature: $x' = (x - \mu) / \sigma$.
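As a minimal sketch of the two normalization methods (assuming that, in practice, the minimum, maximum, mean, and standard deviation would be estimated on the training data only and reused at prediction time):

```python
import numpy as np

def min_max_normalize(x):
    """Min-max normalization: rescale a feature column into [0, 1]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def z_score_normalize(x):
    """Z-score normalization: zero mean and unit variance for a feature column."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)

# Features with very different ranges (e.g., a word count and a ratio)
# end up on a comparable scale after normalization.
word_count = np.array([120.0, 240.0, 355.0, 410.0])
elem_dict_ratio = np.array([0.42, 0.55, 0.61, 0.48])
print(min_max_normalize(word_count), z_score_normalize(elem_dict_ratio))
```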
Discretizing feature values simplifies the data representation by converting continuous feature values that fall within a specific range into a single discrete value, which makes the feature values more suitable for some ML algorithms. The type of discretization method, the range of each discretized section, and the number of sections all affect the performance of the ML system.
In our study, we used two simple discretization methods: discretization by instance number (DIN) and discretization by feature value (DFV). DIN assigns the same number of instances to each section, after sorting them by the feature value. DFV converts all feature values that belong to the specific range into one feature value, after setting the range of feature values for each section. DIN is advantageous when the distribution of feature values is uniform. DFV is beneficial when the distribution of feature values is normal.
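Read this way, DIN corresponds to equal-frequency binning and DFV to equal-width binning. The following sketch is an illustration of that reading, not the authors' code; the function names and the default of 10 sections (the base parameter used later) are assumptions.

```python
import numpy as np

def discretize_by_instance_number(x, n_sections=10):
    """DIN: equal-frequency binning. Section boundaries are percentiles, so each
    section receives (roughly) the same number of instances."""
    x = np.asarray(x, dtype=float)
    edges = np.unique(np.percentile(x, np.linspace(0, 100, n_sections + 1)[1:-1]))
    return np.digitize(x, edges)          # integer section index per instance

def discretize_by_feature_value(x, n_sections=10):
    """DFV: equal-width binning. The value range is split into sections of equal
    width, so the original shape of the distribution is preserved."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), n_sections + 1)[1:-1]
    return np.digitize(x, edges)

# For a skewed feature, DIN forces equally populated sections, while DFV keeps
# the original distribution -- the property discussed in the experiments below.
```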
Feature selection filters out noisy features and discovers an optimal feature set for ML. Even if we determine a feature set with appropriate intuition and assumptions, some features in the set may have a negative effect. Furthermore, too many features can hinder or delay learning. In this work, we compare three feature selection methods: correlation (COR), information gain (IG), and minimum redundancy maximum relevance (mRMR).
The correlation method scores each feature by the Pearson correlation coefficient between the feature values $x_i$ and the gold scores $y_i$: $r = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}$. The features with the highest absolute correlation are selected.
To calculate the information gain, we first calculate the entropy. The entropy of a random variable $X$ is defined as $H(X) = -\sum_{x \in X} p(x) \log p(x)$.
The information gain between random variables $X$ and $Y$ is then $IG(X; Y) = H(X) - H(X \mid Y)$, that is, the reduction in the entropy of $X$ obtained by observing $Y$; features with high information gain with respect to the gold score are selected.
Used as another feature selection method, mRMR considers the dependency between features as well as the relevance between the feature values and the gold score. Given mutual information $I(\cdot;\cdot)$, mRMR selects the feature subset $S$ that maximizes $\frac{1}{|S|}\sum_{f_i \in S} I(f_i; c) - \frac{1}{|S|^2}\sum_{f_i, f_j \in S} I(f_i; f_j)$, where $c$ denotes the gold score.
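The sketch below illustrates the three selection criteria on a numeric feature matrix X and integer gold scores y. It is only an approximation under stated assumptions: scikit-learn's mutual_info_classif stands in for the information gain computation, and the greedy mRMR variant uses the absolute Pearson correlation between features as a cheap substitute for the pairwise mutual-information redundancy term.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_by_correlation(X, y, k=80):
    """COR: keep the k features with the highest absolute Pearson correlation
    with the gold score."""
    scores = np.nan_to_num(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

def select_by_information_gain(X, y, k=80):
    """IG: keep the k features with the highest mutual information with the gold
    score (mutual information equals information gain here)."""
    scores = mutual_info_classif(X, y)
    return np.argsort(scores)[::-1][:k]

def select_by_mrmr(X, y, k=80):
    """mRMR (greedy): repeatedly add the feature maximizing relevance minus
    average redundancy with the already selected features."""
    relevance = mutual_info_classif(X, y)
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = np.mean(
                [abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected])
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return np.array(selected)
```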
For our experiments, we used essay practice data covering 13 topics. The gold-standard scores were constructed from scores provided by hired human experts; more than two experts assigned scores ranging from 0 to 6 to each essay. For each topic, the table below lists the number of essays, the average essay length, and the human score correlation.
Descriptions of essay practice data.
Topic | Number of essays | Average length | Human correlation |
---|---|---|---|
Small town or big city | 266 | 362.8 | 0.5202 |
Parents are the best teachers | 774 | 345.7 | 0.5308 |
Qualities of a good neighbor | 242 | 355.8 | 0.5664 |
Positive influence of TV or movies | 227 | 354.1 | 0.5127 |
Reasons for attending college | 241 | 310.5 | 0.5527 |
Dining at a restaurant versus home | 385 | 360.1 | 0.4567 |
Why do some people go to museums | 221 | 347.8 | 0.6096 |
Best ways to reduce stress | 156 | 356.4 | 0.4420 |
Qualities of good parents | 211 | 357.8 | 0.6260 |
Achieving success by working hard | 370 | 378.5 | 0.4408 |
Rejection of the invite | 528 | 72.8 | 0.7287 |
Report on FORBES MEDIA | 528 | 148.8 | 0.6438 |
Hiring a family member or friend | 528 | 207.0 | 0.7319 |
Total | 4677 | 281.9 | 0.6088 |
We conducted a preliminary experiment considering all three feature optimization methods in order to roughly determine the base parameter values before conducting the main experiments investigating the effect of each method. If each optimization method is applied separately, without the others, AES performance deteriorates and the effects of the feature optimization methods disappear; for example, the normalization method shows no benefit when the discretization and feature selection methods are not applied as well. For this reason, we obtained robust base parameter values that gave satisfactory performance in most cases through the preliminary experiments (normalization: min-max; discretization: DFV with 10 sections; feature selection: 80 features selected by correlation). In the following experiments, the feature optimization method under investigation is varied while the other methods keep these base parameter values.
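For reference, the base parameter values found in the preliminary experiment can be written down as a simple configuration. The dictionary layout and key names below are purely illustrative; only the values come from the preliminary experiment.

```python
# Base feature-optimization parameters from the preliminary experiment; each of
# the main experiments below varies one of these while the others stay fixed.
BASE_CONFIG = {
    "normalization": "min-max",
    "discretization": {"method": "DFV", "sections": 10},
    "feature_selection": {"method": "correlation", "num_features": 80},
}
```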
We used the following three ML packages for the experiments: the GNU Scientific Library (GSL) for MR, the Maximum Entropy Modeling Toolkit for Python and C++ for ME, and LIBSVM for SVM and SVR.
For each of the four ML algorithms (MR, ME, SVM, and SVR), we compared the following three normalization methods: none, min-max, and z-score.
Although the difference between the two normalization methods is insignificant, the difference in performance between non-normalization and normalization is noteworthy (see the table below).
Comparison of normalization methods (correlation).
Normalization | MR | ME | SVM | SVR |
---|---|---|---|---|
None | 0.1702 | 0.3166 | 0.2839 | 0.2980 |
min-max | 0.7383 | 0.7160 | 0.7675 | 0.7756 |
z-score | 0.7315 | 0.7289 | 0.7701 | 0.7747 |
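The correlation values reported in the tables and figures of this section measure the agreement between the system's output and the human scores. Assuming this measure is the Pearson correlation coefficient, an evaluation step could look like the following sketch; the function name is illustrative.

```python
from scipy.stats import pearsonr

def evaluate_correlation(model, X_test, human_scores):
    """Correlation between predicted and human scores for one configuration.

    Pearson's r is assumed here as the reported correlation measure."""
    predicted = model.predict(X_test)
    r, _ = pearsonr(predicted, human_scores)
    return r
```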
We compared the following three discretization methods: none, DIN, and DFV. For each discretization method, we performed experiments on different numbers of sections (2–16, in increments of 2) to determine whether the number of discretized sections affected the performance. We applied other optimization techniques and parameters equally. Min-max normalization was performed, and 80 features were selected, using correlation.
We performed the discretization experiments with all four ML algorithms; the results are shown in the figure below.
Comparison of discretization methods.
For SVM and SVR, which show the effects of the discretization methods most clearly, there are large performance differences between DIN and DFV and across different numbers of sections. We found that DFV performed better than DIN, because DFV maintains the original distribution of the feature values, whereas DIN forces the same number of feature values into each section.
We also found that performance decreases as the number of sections increases: too many sections reduce the number of feature values per section, cause sparseness in some cases, and diminish the effect of discretization.
The experimental results show that discretization did not improve the performance of MR or ME, whereas SVM gained a dramatic improvement in performance. SVR also improved, although not as much as SVM.
We compared the following four feature selection methods: none, COR, IG, and mRMR. For each feature selection method, we performed experiments on different numbers of features (20–160, in increments of 20). We applied other optimization techniques and parameters equally (min-max normalization, DFV with 10 sections).
The figure below compares the feature selection methods across different numbers of selected features for each of the four ML algorithms.
Comparison of feature selection methods.
Selecting features based only on their relevance to the gold score proved more advantageous than selecting features based on both relevance and the dependencies between features. We assume that modeling the dependencies between features has negative side effects because of the heterogeneous characteristics of the features used in AES.
The AES system proposed in this paper automatically constructs an optimized feature set by selecting features from the training data. It is difficult to say that a specific feature is always effective, because the set of selected features differs according to the experimental settings, topics, and cross-validation folds. In this section, we try to identify the features that are effective for AES by examining those that are selected in most cases. To do this, we carried out 130 different training and test runs with the base parameters; the table below lists the features selected most often across these runs.
Effective feature list.
Number of times selected | Feature name | Meaning of feature |
---|---|---|
130 | posNumINVoca | Number of vocabularies with IN POS tag |
130 | posNumIN | Number of words with IN POS tag |
130 | lmPosTrigramVoca | The number of different POS trigrams |
130 | lmPosTrigramOccMore3 | The ratio of POS trigrams occurred more than 3 |
130 | lmPosTrigramOccMore2Less5 | The ratio of POS trigrams occurred more than 2 but fewer than 5 |
130 | lmPosTrigramOccMore2Less10 | The ratio of POS trigrams occurred more than 2 but fewer than 10 |
130 | lmPosTrigramOccMore2 | The ratio of POS trigrams occurred more than 2 |
130 | lmNumVoca4Root | Fourth root of the number of vocabularies |
130 | lmNumVoca | The number of vocabularies |
130 | lmLexWordOccMore5 | The number of different words occurred more than 5 |
130 | lmLexWordOccMore4 | The number of different words occurred more than 4 |
130 | lmLexWordOccMore3 | The number of different words occurred more than 3 |
130 | lmLexWordOccMore2Less5 | The number of different words occurred more than 2 but fewer than 5 |
130 | lmLexWordOccMore2Less10 | The number of different words occurred more than 2 but fewer than 10 |
130 | lmLexWordOccMore2 | The number of different words occurred more than 2 |
130 | lmLexWordOccMore1Less5 | The number of different words occurred more than 1 but fewer than 5 |
130 | lmLexWordOccMore1Less10 | The number of different words occurred more than 1 but fewer than 10 |
130 | lmLexWordOccMore1 | The number of different words occurred more than 1 |
130 | lmLexBigramVoca | The number of different lexical bigrams |
130 | lmLexBigramOccMore2Less5 | The ratio of lexical bigrams occurred more than 2 but fewer than 5 |
130 | lmAvgLexWordDistance | The average distance of same words |
130 | lmAvgLemmaWordDistance | The average distance of same lemmas |
130 | cNumWordLen8 | The number of words whose length is more than 8 characters |
130 | cNumWordLen7 | The number of words whose length is more than 7 characters |
130 | cNumWordLen6 | The number of words whose length is more than 6 characters |
130 | cNumWordLen5 | The number of words whose length is more than 5 characters |
130 | cNumWord | The number of all words |
130 | cNumNotStopWord | The number of all words except stop words |
130 | cNumNotStopVoca | The number of all vocabularies except stop words |
130 | cNumMidd | The number of words in the intermediate dictionary |
130 | cNumElem | The number of words in the elementary dictionary |
130 | cNumChar | The number of all characters |
130 | cCharNotStopWord | The number of all characters except stop words |
129 | posNumNN | The number of words with NN POS tag |
129 | lmLexBigramOccMore2Less10 | The ratio of lexical bigrams occurred more than 2 but fewer than 10 |
129 | lmLexBigramOccMore2 | The ratio of lexical bigrams occurred more than 2 |
128 | posNumJJVoca | The number of vocabularies with JJ POS tag |
128 | posNumJJ | The number of words with JJ POS tag |
126 | cNumWordLen10 | The number of words whose length is more than 10 characters |
125 | posNumNNSVoca | The number of vocabularies with NNS POS tag |
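Selection counts such as those in the table above can be obtained by simply tallying, over the repeated training runs, how often each feature survives feature selection. A small sketch of such a tally is shown below; the function and variable names are illustrative.

```python
from collections import Counter

def count_selected_features(runs):
    """Tally how often each feature name appears in the selected sets of the
    individual runs (130 runs in the experiment above); features counted 130
    times were selected in every run."""
    counts = Counter(name for selected in runs for name in selected)
    return counts.most_common()

# Example: count_selected_features([["cNumWord", "posNumIN"], ["cNumWord"]])
# -> [("cNumWord", 2), ("posNumIN", 1)]
```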
In this experiment, we used a server with two AMD Opteron 4180 (6-core) processors, so 12 CPU cores were available. The server had 32 GB of memory and ran a 64-bit Debian operating system. Because the AES system is a complex system comprising various NLP modules implemented in several programming languages, we tried to use the available resources, such as processor cores and memory, as fully as possible to maintain system efficiency. For extracting the various features, we used the maximum number of threads via OpenMP; for training and testing, we used all 12 cores via the multiprocessing module from the Python standard library. To compare the efficiency of the AES system with different numbers of features, we used the same feature optimization techniques (min-max normalization, DFV with 10 sections, and correlation-based feature selection) and measured the time required for training and testing with different numbers of features.
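A minimal sketch of the fold-level parallelization described above, using Pool from the Python standard library's multiprocessing module; the function names and the fold representation are assumptions, and the body of the per-fold worker is left as a placeholder.

```python
from multiprocessing import Pool

def train_and_test_fold(fold):
    """Placeholder worker: extract features, apply feature optimization, train one
    of the four ML back-ends on the fold's training split, and score its test split."""
    train_split, test_split = fold
    ...  # feature extraction, optimization, training, prediction
    return None  # e.g., the correlation obtained on this fold

def run_all_folds(folds, n_workers=12):
    """Distribute the folds over the available CPU cores (12 in the reported setup)."""
    with Pool(processes=n_workers) as pool:
        return pool.map(train_and_test_fold, folds)
```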
We ran all four ML algorithms introduced above and measured the training and testing time as the number of selected features varied.
As shown in the figure below, the time required for training and testing decreases as the number of features is reduced.
Efficiency comparison.
In addition, we discuss the tradeoff between the efficiency and the effectiveness of the AES system when feature optimization is used. The previous experimental results show that the feature optimization methods improve the effectiveness of the AES system. However, they can also decrease its efficiency, because feature optimization itself takes processing time, even though the training and testing times are reduced by the smaller number of features. Although most of the feature optimization methods do not require an excessive amount of time, some, such as mRMR, require a nontrivial amount of processing time. Accordingly, the tradeoff between the reduced training and testing time and the increased feature optimization time must be considered, and a suitable feature optimization method has to be selected for practical purposes.
This paper presented feature optimization techniques, consisting of appropriate normalization, discretization, and feature selection methods, that can be applied to ML based AES systems. We have shown that both the effectiveness and the efficiency of the system can be improved: the feature optimization techniques reduce the high-dimensional feature space, which improves performance and decreases training time. By experimenting with and analyzing the relationship between the ML algorithms and the feature optimization techniques in the AES domain with a large amount of English essay data, we obtained several useful findings.
We can summarize the results of the experiments with four main findings. (1) Different combinations of feature optimization techniques give rise to large variations in performance. (2) A normalization technique is essential for every ML method: between an ML method with and without normalization, the performance differs by a factor of 2.3 at minimum and 4.3 at maximum. (3) The discretization technique is not useful for the MR or ME model; on the contrary, discretization is mandatory for SVM, whose performance improves when discretization with an appropriate number of sections is used. (4) Because a reduced number of features is useful for the MR model, an appropriate feature selection technique is required for MR; the ME model, in contrast, can utilize a large number of features, so feature selection is relatively less important for ME. For SVM and SVR, the number of features causes large variations in performance, so the proper number of features must be determined when a feature selection technique is used.
The experimental results over all combinations of parameter values showed that the best performance was obtained with the SVR algorithm combined with appropriately chosen normalization, discretization, and feature selection settings.
For future work, we plan to apply these feature optimization techniques to other domains and to show that they are also useful for improving both the effectiveness and the efficiency of those systems.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2012M3C4A7033344).