Improved Extreme Learning Machine and Its Application in Image Quality Assessment

Extreme learning machine (ELM) is a new class of single-hidden layer feedforward neural network (SLFN), which is simple in theory and fast in implementation. Zong et al. proposed a weighted extreme learning machine for learning from data with imbalanced class distribution, which maintains the advantages of the original ELM. However, the currently reported ELM and its improved versions are based only on the empirical risk minimization principle and may therefore suffer from overfitting. To address overfitting, in this paper we incorporate the structural risk minimization principle into the (weighted) ELM and propose a modified (weighted) extreme learning machine (M-ELM and M-WELM). Experimental results show that the proposed M-WELM outperforms the currently reported extreme learning machine algorithms in image quality assessment.


Introduction
Extreme learning machine (ELM) was proposed as a new class of single-hidden layer feedforward neural network by Huang et al. [1]. Its basic idea is to set a suitable number of nodes in the hidden layer before training and to randomly assign the values of the input weights and offsets of the hidden layer during implementation. The algorithm completes the whole process in one pass and generates a unique optimal solution without iterations, so it has the advantages of easy parameter selection and fast learning speed. Liang et al. [2] also proposed an online sequential extreme learning machine algorithm (OS-ELM) that can learn data one-by-one or chunk-by-chunk. Although OS-ELM provides better generalization performance, it depends excessively on experimental data. Lan et al. [3] presented an ensemble of online sequential extreme learning machines (EOS-ELM), which is a more stable integrated network structure consisting of multiple OS-ELM networks. Rong et al. [4] developed an OS-Fuzzy-ELM algorithm by combining the TSK fuzzy inference system with the ELM algorithm, which reduces the training time significantly. Feng et al. [5] presented an improved ELM algorithm based on error minimization. In [6], Zong et al. proposed a weighted ELM for dealing with data with imbalanced class distribution, which can be generalized to balanced data and maintains the advantages of the original ELM. However, all these algorithms consider only the empirical risk minimization principle, which can easily lead to overfitting [7].
Support vector machine (SVM), proposed by Cortes and Vapnik [8], is actually also a single-hidden layer feedforward network. In [9][10][11], Suykens et al. proposed the least-squares support vector machine (LS-SVM), which transforms the linear inequality constraints of the support vector machine into linear equality constraints and thus converts the QP problem into a set of linear equations. This reduces the difficulty of training a support vector machine on a large number of samples and also improves efficiency. Both the SVM and the LS-SVM are general algorithms based on the guaranteed risk bounds of statistical learning theory, that is, the so-called structural risk minimization (SRM) principle, which improves their generalization ability.
In this paper, to reduce the overfitting of extreme learning machine algorithms, we follow the LS-SVM algorithm, introduce the structural risk minimization principle into the ELM and WELM algorithms, and propose modified ELM and WELM algorithms, which we call M-ELM and M-WELM. Our experimental results support the validity of the proposed M-ELM and M-WELM algorithms.
The rest of this paper is organized as follows. Section 2 gives a brief introduction to the ELM and the weighted ELM. Section 3 describes the principles of the proposed M-ELM and M-WELM. Experimental results and performance assessment are presented in Section 4. Section 5 concludes the paper.

Brief Introduction to Extreme Learning Machine

Extreme Learning Machine (ELM).
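The extreme learning machine (ELM) is a single-hidden layer feedforward network (SLFN) that randomly selects the input weights and analytically determines the output weights [1,2,12]. One key principle of the ELM is that one may randomly choose and fix the hidden node parameters.
After the hidden node parameters are chosen randomly, the SLFN becomes a linear system in which the output weights of the network can be analytically determined using a simple generalized inverse operation on the hidden layer output matrix [13].
For an observation data set $\{(\mathbf{x}_j, t_j)\}_{j=1}^{N}$, with $L$ nodes in the hidden layer and the excitation function $g(\cdot)$, the extreme learning machine model can be expressed as
$$\sum_{i=1}^{L} \beta_i \, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = t_j, \quad j = 1, \ldots, N,$$
where $\beta_i$ is the output weight between the $i$th hidden layer node and the output neuron, $\mathbf{w}_i$ is the input weight between the input neurons and the $i$th hidden layer node, and $b_i$ is the offset of the $i$th hidden layer node. Consider $\mathbf{h}(\mathbf{x}) = [g(\mathbf{w}_1 \cdot \mathbf{x} + b_1), \ldots, g(\mathbf{w}_L \cdot \mathbf{x} + b_L)]$, $\mathbf{H} = [\mathbf{h}(\mathbf{x}_1)^{T}, \ldots, \mathbf{h}(\mathbf{x}_N)^{T}]^{T}$, $\boldsymbol{\beta} = [\beta_1, \ldots, \beta_L]^{T}$, and $\mathbf{T} = [t_1, \ldots, t_N]^{T}$; the equations above can then be written compactly as
$$\mathbf{H} \boldsymbol{\beta} = \mathbf{T}.$$
The least-squares solution to the equations is
$$\hat{\boldsymbol{\beta}} = \mathbf{H}^{+} \mathbf{T},$$
where $\mathbf{H}^{+}$ is called the Moore-Penrose generalized inverse of the hidden layer output matrix $\mathbf{H}$.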
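The training procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `elm_fit` and `elm_predict`, the uniform range [-1, 1] for the random hidden parameters, and the toy regression data are all assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    """Sigmoid excitation function g."""
    return 1.0 / (1.0 + np.exp(-z))


def elm_fit(X, t, n_hidden, rng):
    """Basic ELM: random hidden layer, least-squares output weights."""
    n_features = X.shape[1]
    # Hidden node parameters are drawn once and never updated (assumed range).
    W_in = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(X @ W_in + b)       # hidden layer output matrix, shape (N, L)
    beta = np.linalg.pinv(H) @ t    # Moore-Penrose solution of H beta = T
    return W_in, b, beta


def elm_predict(X, W_in, b, beta):
    """Evaluate the trained network on new inputs."""
    return sigmoid(X @ W_in + b) @ beta


# Toy regression problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
t = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2

W_in, b, beta = elm_fit(X, t, n_hidden=40, rng=rng)
train_mse = float(np.mean((elm_predict(X, W_in, b, beta) - t) ** 2))
print("training MSE:", train_mse)
```

Because the hidden layer is fixed after the random draw, training reduces to a single linear least-squares solve, which is where the speed advantage of the ELM comes from.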

LS-SVM Regression.
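Assume that an input and output sample data set for regression analysis is $D = \{(\mathbf{x}_1, t_1), \ldots, (\mathbf{x}_N, t_N)\}$, where $\mathbf{x}_i \in \mathbb{R}^{n}$ and $t_i \in \mathbb{R}$, $i = 1, \ldots, N$. The LS-SVM regression algorithm maps the data $\mathbf{x}$ into a high-dimensional feature space $F$ through a nonlinear mapping $\varphi(\cdot)$ and performs linear regression in the space $F$. The regression estimation for the observation data set given above can be formulated as
$$f(\mathbf{x}) = \mathbf{w}^{T} \varphi(\mathbf{x}) + b^{*},$$
where $\mathbf{w}$ and $b^{*}$ are the regression factors. The LS-SVM regression method is used to solve for the weight vector $\mathbf{w}$ and the deviation $b^{*}$. Based on structural risk minimization, the optimization model of the optimal regression function [9][10][11] can be established as
$$\min_{\mathbf{w}, b^{*}, \boldsymbol{\xi}} \; \frac{1}{2} \|\mathbf{w}\|^{2} + \frac{\gamma}{2} \sum_{i=1}^{N} \xi_i^{2} \quad \text{s.t.} \quad t_i = \mathbf{w}^{T} \varphi(\mathbf{x}_i) + b^{*} + \xi_i, \quad i = 1, \ldots, N,$$
where $\gamma$ is the penalty constant, which is a compromise between the complexity and the fitting accuracy of the regression model. A higher value of $\gamma$ means a higher fitting degree. $\xi_i$ is the slack variable. LS-SVM transforms the inequality constraints into equality constraints by defining loss functions different from those in the standard SVM. It constructs the following Lagrange function:
$$L(\mathbf{w}, b^{*}, \boldsymbol{\xi}, \boldsymbol{\alpha}) = \frac{1}{2} \|\mathbf{w}\|^{2} + \frac{\gamma}{2} \sum_{i=1}^{N} \xi_i^{2} - \sum_{i=1}^{N} \alpha_i \left( \mathbf{w}^{T} \varphi(\mathbf{x}_i) + b^{*} + \xi_i - t_i \right), \tag{6}$$
where $\alpha_i$ is the Lagrange multiplier. According to the KKT optimality conditions, the linear equations can be obtained as follows:
$$\begin{bmatrix} 0 & \mathbf{1}^{T} \\ \mathbf{1} & \boldsymbol{\Omega} + \gamma^{-1} \mathbf{I} \end{bmatrix} \begin{bmatrix} b^{*} \\ \boldsymbol{\alpha} \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{T} \end{bmatrix}, \tag{7}$$
where $\mathbf{1} = [1, \ldots, 1]^{T}$, $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_N]^{T}$, $\mathbf{T} = [t_1, \ldots, t_N]^{T}$, and $\Omega_{i,j} = K(\mathbf{x}_i, \mathbf{x}_j)$ is the element in the $i$th row and the $j$th column of $\boldsymbol{\Omega}$, with $K(\cdot, \cdot)$ a kernel function that satisfies the Mercer condition. Solving the linear equations gives the nonlinear mapping equation
$$f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i K(\mathbf{x}, \mathbf{x}_i) + b^{*}. \tag{8}$$
For the weighted ELM [6], the training problem is formulated as an optimization problem. More precisely,
$$\text{Minimize:} \quad \frac{1}{2} \|\boldsymbol{\beta}\|^{2} + \frac{1}{2} C \sum_{i=1}^{N} W_{ii} \, \xi_i^{2} \quad \text{subject to:} \quad \mathbf{h}(\mathbf{x}_i) \boldsymbol{\beta} = t_i - \xi_i, \quad i = 1, \ldots, N, \tag{10}$$
where $\mathbf{h}(\mathbf{x}_i)$ is the feature mapping vector in the hidden layer with respect to $\mathbf{x}_i$, $\boldsymbol{\beta}$ represents the output weight vector connecting the hidden layer and the output layer, and $C$ is the regularization parameter that represents the trade-off between the minimization of training errors and the maximization of the marginal distance. $\xi_i$, the training error of sample $\mathbf{x}_i$, is the difference between the desired output $t_i$ and the actual output $\mathbf{h}(\mathbf{x}_i) \boldsymbol{\beta}$.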
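To make the LS-SVM regression step concrete, the sketch below assembles the standard LS-SVM bordered linear system (a zero, a row and column of ones, and the kernel matrix plus the identity scaled by 1/γ) and solves it for the deviation and the Lagrange multipliers. It is an illustration under stated assumptions: the RBF kernel, its width `sigma`, the value of `gamma`, and the function names `rbf_kernel`, `lssvm_fit`, and `lssvm_predict` are chosen for the example and are not taken from the paper.

```python
import numpy as np


def rbf_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix; one common Mercer kernel (assumed here)."""
    d2 = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def lssvm_fit(X, t, gamma=10.0, sigma=1.0):
    """Solve the LS-SVM linear system for the bias b and multipliers alpha."""
    N = X.shape[0]
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                                     # first row: [0, 1^T]
    A[1:, 0] = 1.0                                     # first column: [0, 1]^T
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], t))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]


def lssvm_predict(X_new, X_train, b, alpha, sigma=1.0):
    """Nonlinear mapping f(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b


# Toy one-dimensional regression problem (hypothetical data).
gamma, sigma = 10.0, 1.0
X = np.linspace(-3.0, 3.0, 50)[:, None]
t = np.sin(X[:, 0])
b, alpha = lssvm_fit(X, t, gamma=gamma, sigma=sigma)
fitted = lssvm_predict(X, X, b, alpha, sigma=sigma)
print("max training residual:", float(np.max(np.abs(t - fitted))))
```

At the solution each training residual equals the corresponding multiplier divided by γ, which is why a larger γ yields a tighter fit to the training data.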

Weighting Schemes.
The key issue of the WELM is to define an appropriate weight matrix $\mathbf{W} = \operatorname{diag}\{W_{ii}\}$, $i = 1, \ldots, N$, which determines the degree of rebalancing that users seek and how far the boundary is pushed towards the majority class [6]. In [6], two weighting schemes are proposed. The simple one generates the weight values automatically from the class information and is in fact a special case of cost sensitive learning:
$$\text{Weighting Scheme W1:} \quad W_{ii} = \frac{1}{\#(t_i)},$$
where $\#(t_i)$ is the number of samples belonging to class $t_i$, $i = 1, \ldots, m$.
The other weighting scheme in [6] adopts the golden ratio, which represents perfection in nature, and reduces the balancing step to a ratio of 0.618 : 1 between the majority classes and the minority classes:
$$\text{Weighting Scheme W2:} \quad W_{ii} = \begin{cases} \dfrac{0.618}{\#(t_i)}, & \#(t_i) > \operatorname{AVG}(\#(t_i)), \\[2mm] \dfrac{1}{\#(t_i)}, & \#(t_i) \le \operatorname{AVG}(\#(t_i)), \end{cases}$$
where $\operatorname{AVG}(\#(t_i))$ is the average number of samples per class. Compared with weighting scheme W1, the boundary obtained with weighting scheme W2 is pushed slightly back towards the minority class, which is intended to alleviate the misclassifications conceded on the majority side, so we adopt weighting scheme W2 in the following experiments.
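The two weighting schemes can be illustrated with a short Python sketch. It assumes the form given in [6]: every sample receives the inverse of its class size under W1, and under W2 samples of classes larger than the average class size are additionally scaled by 0.618. The function names and the toy label list are hypothetical.

```python
from collections import Counter


def weights_w1(labels):
    """Scheme W1: each sample is weighted by 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[c] for c in labels]


def weights_w2(labels):
    """Scheme W2: classes larger than the average size get 0.618 / size."""
    counts = Counter(labels)
    avg = sum(counts.values()) / len(counts)   # average samples per class
    return [(0.618 if counts[c] > avg else 1.0) / counts[c] for c in labels]


# Toy imbalanced label list: 8 majority samples and 2 minority samples.
labels = ["major"] * 8 + ["minor"] * 2
w1 = weights_w1(labels)
w2 = weights_w2(labels)
print(w1[0], w1[-1])
print(w2[0], w2[-1])
```

The returned lists hold the diagonal entries of the weight matrix in sample order. Under W2 the majority samples receive smaller weights than under W1, while the minority weights are unchanged.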

Modified Extreme Learning Machine Algorithm
Traditional extreme learning machines are based on the empirical risk minimization principle, that is, on minimizing the training error. Their drawback is that they are likely to suffer from overfitting, which consequently reduces their generalization capability.
According to statistical learning theory, the actual risk includes both the empirical risk and the structural risk, and a model with good generalization performance should balance the two to obtain the best compromise. We therefore introduce the structural risk minimization principle into the ELM and WELM algorithms and propose modified ELM and WELM models, which we call M-ELM and M-WELM.
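Assume that an input and output sample data set for regression analysis is $D = \{(\mathbf{x}_1, t_1), \ldots, (\mathbf{x}_N, t_N)\}$, where $\mathbf{x}_i \in \mathbb{R}^{n}$ and $t_i \in \mathbb{R}$, $i = 1, \ldots, N$. We introduce the structural risk term and adjust the proportion of the empirical and structural risks by $\gamma$ instead of the $C$ in formula (10), and the optimization model of the optimal regression function can be established as follows:
$$\min_{\boldsymbol{\beta}, \boldsymbol{\xi}} \; \frac{1}{2} \|\boldsymbol{\beta}\|^{2} + \frac{\gamma}{2} \sum_{i=1}^{N} W_{ii} \, \xi_i^{2} \quad \text{s.t.} \quad \mathbf{h}(\mathbf{x}_i) \boldsymbol{\beta} = t_i - \xi_i, \quad i = 1, \ldots, N,$$
where $\sum_{i} \xi_i^{2}$, the sum of the squared errors, represents the empirical risk and $\|\boldsymbol{\beta}\|^{2}$ represents the structural risk, according to the maximal margin principle in statistical theory [2]. For the M-ELM, $\mathbf{W}$ is the identity matrix. Following formula (6), the constrained extremum problem above can be transformed into the Lagrange function
$$L(\boldsymbol{\beta}, \boldsymbol{\xi}, \boldsymbol{\alpha}) = \frac{1}{2} \|\boldsymbol{\beta}\|^{2} + \frac{\gamma}{2} \sum_{i=1}^{N} W_{ii} \, \xi_i^{2} - \sum_{i=1}^{N} \alpha_i \left( \mathbf{h}(\mathbf{x}_i) \boldsymbol{\beta} - t_i + \xi_i \right),$$
where the Lagrange multiplier $\alpha_i$ is the constant factor of sample $\mathbf{x}_i$ in the linear combination that forms the final decision function. Further, by setting the partial derivatives with respect to the variables $(\boldsymbol{\beta}, \xi_i, \alpha_i)$ all equal to zero, the KKT optimality conditions are obtained:
$$\frac{\partial L}{\partial \boldsymbol{\beta}} = 0 \;\Rightarrow\; \boldsymbol{\beta} = \mathbf{H}^{T} \boldsymbol{\alpha}, \qquad \frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = \gamma W_{ii} \, \xi_i, \qquad \frac{\partial L}{\partial \alpha_i} = 0 \;\Rightarrow\; \mathbf{h}(\mathbf{x}_i) \boldsymbol{\beta} - t_i + \xi_i = 0.$$
The solution of $\boldsymbol{\beta}$ can be derived from (17) using the left pseudoinverse. Usually, the left pseudoinverse is more suitable, since it is much easier to invert a matrix of size $L \times L$ when $L$ is much smaller than $N$:
$$\boldsymbol{\beta} = \left( \frac{\mathbf{I}}{\gamma} + \mathbf{H}^{T} \mathbf{W} \mathbf{H} \right)^{-1} \mathbf{H}^{T} \mathbf{W} \mathbf{T}.$$
In the same way as for formula (7) in Section 2.2, we can obtain the following linear equations:
$$\left( \frac{\mathbf{W}^{-1}}{\gamma} + \mathbf{H} \mathbf{H}^{T} \right) \boldsymbol{\alpha} = \mathbf{T},$$
where $\mathbf{H} = [\mathbf{h}(\mathbf{x}_1)^{T}, \ldots, \mathbf{h}(\mathbf{x}_N)^{T}]^{T}$, $\mathbf{h}(\mathbf{x}) = [g(\mathbf{w}_1 \cdot \mathbf{x} + b_1), \ldots, g(\mathbf{w}_L \cdot \mathbf{x} + b_L)]$, $\mathbf{T} = [t_1, \ldots, t_N]^{T}$, and $g$ is the excitation function.
The sigmoid function is used in this paper:
$$g(x) = \frac{1}{1 + e^{-x}}.$$
Solving the linear equations then gives the following nonlinear mapping equation, derived in the same way as (8):
$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x}) \boldsymbol{\beta} = \mathbf{h}(\mathbf{x}) \mathbf{H}^{T} \boldsymbol{\alpha}.$$
The steps of the M-ELM or M-WELM algorithm can be summarized as follows.
Given a training set $D$, activation function $g$, and hidden node number $L$, consider the following.
Step 1. Randomly assign the input weights $\mathbf{w}_i$ and offsets $b_i$ of the hidden nodes, $i = 1, \ldots, L$.
Step 2. Calculate the hidden layer output matrix $\mathbf{H}$.
Step 3. Substitute $\mathbf{H}$ into formula (15) and calculate the output weight $\boldsymbol{\beta}$.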
As can be seen, the M-WELM can be generalized to cost sensitive learning and, like the WELM, can deal with data with imbalanced class distribution. In addition, its overfitting risk is reduced by considering the empirical and structural risks simultaneously.
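The algorithm summarized above can be sketched in Python as follows. The sketch assumes that the output weights are obtained from the regularized, weighted least-squares closed form $\boldsymbol{\beta} = (\mathbf{I}/\gamma + \mathbf{H}^{T} \mathbf{W} \mathbf{H})^{-1} \mathbf{H}^{T} \mathbf{W} \mathbf{T}$, which is what the KKT conditions of a $\gamma$-weighted objective give; the function names, the range of the random parameters, the value of `gamma`, and the toy data are hypothetical choices for illustration and not the authors' code.

```python
import numpy as np


def sigmoid(z):
    """Sigmoid excitation function g."""
    return 1.0 / (1.0 + np.exp(-z))


def mwelm_fit(X, t, n_hidden, gamma, sample_weights, rng):
    """Regularized (weighted) ELM; sample_weights holds the diagonal of W."""
    # Random, fixed hidden node parameters (assumed uniform range).
    W_in = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    # Hidden layer output matrix H, shape (N, L).
    H = sigmoid(X @ W_in + b)
    # Output weights from the L x L system (I/gamma + H^T W H) beta = H^T W T.
    HtW = H.T * sample_weights          # scales column j of H^T by W_jj
    A = np.eye(n_hidden) / gamma + HtW @ H
    beta = np.linalg.solve(A, HtW @ t)
    return W_in, b, beta


def mwelm_predict(X, W_in, b, beta):
    """Evaluate f(x) = h(x) beta."""
    return sigmoid(X @ W_in + b) @ beta


# Toy data (hypothetical). All-ones weights correspond to the unweighted case.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
t = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2
weights = np.ones(X.shape[0])
gamma = 100.0

W_in, b, beta = mwelm_fit(X, t, n_hidden=30, gamma=gamma,
                          sample_weights=weights, rng=rng)
pred = mwelm_predict(X, W_in, b, beta)
print("training RMSE:", float(np.sqrt(np.mean((pred - t) ** 2))))
```

Solving the L x L system with `np.linalg.solve` avoids forming an explicit inverse. Decreasing `gamma` puts more emphasis on the norm of the output weights, and increasing it puts more emphasis on the training error.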

Test on the Benchmark Boston Housing Data Set.
The Boston Housing data, obtained from the UCI database, is a data set commonly used for measuring the performance of regression algorithms. It contains 506 samples of houses in the Boston area, each described by 12 continuous characteristics, one discrete characteristic, and the house price [16]. The purpose of regression estimation is to predict the average house price after training on part of the samples.
In the experiments, the samples are randomly divided into two groups: a random 70% of them for training and the remaining 30% for testing. We repeat the random train-test procedure 100 times and calculate the mean square training and prediction errors of every algorithm; the experimental results are shown in Table 1. We tuned the parameters of every algorithm so that each one achieves a reasonably good result. The number of hidden neurons of M-WELM, M-ELM, W-ELM, OS-ELM, and EOS-ELM is set to 180, and the number of hidden neurons of ELM, B-ELM, and C-ELM is set to 65.
It can be seen from Table 1 that, for the Boston Housing data set from a real-world multi-input single-output system, our proposed M-WELM algorithm shows the best prediction performance among the compared ELM variants and M-ELM ranks second, both of which indicate the robustness of our proposed modification of the ELM algorithm.

Test on the LIVE IQA Database.
Algorithms that automatically assess perceptual image quality are critical for numerous image processing applications. Recently, machine learning based blind image quality assessment has made great progress, with methods such as the BRISQUE [17], the LBIQ [18], the DIIVINE [19], and the BLIINDS [20] using SVR, and the GRNN-based method [21]. Of these indices, the BRISQUE shows the best overall performance, so here we use the same image features adopted by the BRISQUE index [17] to test our proposed M-ELM and M-WELM algorithms for image quality assessment (IQA).
In reference [17], the authors used 36 natural scene statistical features in the spatial domain to predict image quality, as shown in Table 2: the shape and variance from a GGD fit of the MSCN coefficients, and the shape, mean, left variance, and right variance from an asymmetric GGD (AGGD) fit of the H, V, D1, and D2 pairwise products. These features are extracted at two scales, the original image scale and a reduced resolution (low pass filtered and downsampled by a factor of 2). Here we also adopt these 36 image statistical features to predict image quality with different ELM algorithms and then compare our proposed modified ELM algorithms with the reported ELM algorithms. Firstly, we test our proposed algorithm on the LIVE IQA database [22], which consists of 29 reference images with 779 distorted images spanning five different distortion categories: JPEG2000 (JP2K) and JPEG compression, additive white Gaussian noise (WN), Gaussian blur (Blur), and a Rayleigh fast-fading channel simulation (FF). Each of the distorted images has an associated difference mean opinion score (DMOS), which represents the subjective quality of the image.
Three performance metrics are used to evaluate the algorithms. The first is the Spearman rank ordered correlation coefficient (SROCC), which measures the prediction monotonicity of the quality index. The second is the root mean square error (RMSE). The third is the running time.
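For reference, the first two metrics can be computed as in the sketch below. It is written in plain Python and assumes that tied values receive average ranks, which is the usual convention for Spearman correlation; the helper names are illustrative and the numbers are made up for the example.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson linear correlation coefficient of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def srocc(predicted, subjective):
    """Spearman rank ordered correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(subjective))


def rmse(predicted, subjective):
    """Root mean square error between predictions and subjective scores."""
    n = len(predicted)
    return math.sqrt(sum((p - s) ** 2 for p, s in zip(predicted, subjective)) / n)


# Made-up scores: the ordering is preserved, so SROCC is 1 although RMSE is not 0.
predicted = [20.0, 35.0, 50.0, 80.0]
subjective = [25.0, 30.0, 55.0, 70.0]
print(srocc(predicted, subjective), rmse(predicted, subjective))
```

SROCC depends only on the ordering of the predictions, whereas RMSE also penalizes differences in scale, so the two metrics complement each other.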
Because a learning based method requires a training stage to construct the relationship between the extracted statistical features and the DMOS, we split the LIVE dataset into two nonoverlapping sets: a training set and a testing set. The training set consists of 80%, 50%, or 30% of the 29 reference images and their associated distorted versions, while the testing set consists of the remaining 20%, 50%, or 70% of the 29 reference images and their associated distorted versions. The regression models are trained on the training set and then evaluated on the testing set. To ensure that the proposed method is robust across content and is not governed by the specific train-test split, we repeat each random train-test split (80%-20%, 50%-50%, and 30%-70%) 1000 times and report the median SROCC, RMSE, and running time in Tables 3-5.
As can be seen from Tables 3-5, compared with ELM (or WELM), our proposed M-ELM (or M-WELM) agrees better with the subjective judgments regardless of the percentage of samples used for training, which supports the validity of introducing the structural risk minimization principle. Our M-WELM algorithm shows the best performance among the reported ELM algorithms and the SVR algorithm, especially when fewer training samples are used, which further demonstrates the effectiveness of integrating the structural risk minimization principle and the weighting method into the ELM model. In addition, our proposed M-WELM is far faster than the SVR, which provides an effective real-time solution to IQA.
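The content-separated split described above can be sketched as follows: the reference images, not the individual distorted images, are partitioned, so that all distorted versions of a reference image fall on the same side of the split. The function name, the rounding of the number of training references, and the toy assignment of distorted images to references are assumptions made for the example.

```python
import random


def split_by_reference(ref_ids, train_fraction, rng):
    """Split sample indices so that no reference image appears in both sets.

    ref_ids[i] is the reference image that distorted image i was derived from.
    """
    refs = sorted(set(ref_ids))
    rng.shuffle(refs)
    n_train = round(train_fraction * len(refs))   # rounding rule is assumed
    train_refs = set(refs[:n_train])
    train_idx = [i for i, r in enumerate(ref_ids) if r in train_refs]
    test_idx = [i for i, r in enumerate(ref_ids) if r not in train_refs]
    return train_idx, test_idx


# Toy stand-in for the database: 779 distorted images from 29 references.
# The assignment below is artificial and only serves the illustration.
ref_ids = [i % 29 for i in range(779)]
rng = random.Random(0)

train_idx, test_idx = split_by_reference(ref_ids, 0.8, rng)
train_refs = {ref_ids[i] for i in train_idx}
test_refs = {ref_ids[i] for i in test_idx}
print(len(train_refs), len(test_refs), len(train_idx), len(test_idx))
```

Repeating this split with fresh random draws and taking the median of each metric over the repetitions gives the kind of summary reported in the tables.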

Test on the TID 2008 Database.
To further examine how well the proposed M-WELM generalizes, we test it on the same (available) distortions in an alternate database, the TID2008 [24]. It consists of 25 reference images and 17 distortion types with 1700 distorted images. Of these 25 reference images only 24 are natural images, so we test our algorithm only on these 24 images. Here we use all 779 distorted images in the LIVE IQA database as the training set and the images in the TID 2008 as the testing set. We still repeat the random train-test procedure 1000 times and report the median SROCC, RMSE, and running time, as shown in Table 6. The parameter values of every algorithm are the same as those used in Section 4.2.
Table 6 shows that our proposed M-WELM has the highest consistency with the subjective scores among all types of ELM algorithms. Its performance is also competitive with that of the SVR, while it is far faster than the SVR, which provides a real-time solution to IQA.

Conclusion
The currently reported ELM and weighted ELM algorithms are based on the empirical risk minimization principle, which may easily lead to overfitting during the learning process. By introducing the structural risk minimization principle into the ELM and weighted ELM algorithms, we propose an improved (weighted) extreme learning machine algorithm (M-WELM and M-ELM) to address the overfitting problem, which takes both the empirical risk and the structural risk into account and adjusts the proportion of the two risks properly. Our experimental results show that the M-WELM outperforms the currently reported ELM algorithms in IQA and is competitive with the SVR in performance while being far faster, which provides an effective real-time solution to IQA.

Table 1: Experimental results of the several algorithms on the Boston Housing data set.

Table 2: Summary of natural scene statistical features extracted in the spatial domain [14].

Table 3: Median Spearman's rank ordered correlation coefficient (SROCC), RMSE, and running time across 1000 train-test combinations on the LIVE IQA database (80% of the samples are used for training).

Table 4: Median Spearman's rank ordered correlation coefficient (SROCC), RMSE, and running time across 1000 train-test combinations on the LIVE IQA database (50% of the samples are used for training).

Table 5: Median Spearman's rank ordered correlation coefficient (SROCC), RMSE, and running time across 1000 train-test combinations on the LIVE IQA database (30% of the samples are used for training).
whose parameters are estimated using cross validation on the training set. Other ELM algorithms are implemented by us. The number of hidden neurons of M-WELM, M-ELM, W-ELM, OS-ELM, and EOS-ELM is set to 120, and the number of hidden neurons of ELM, B-ELM, and C-ELM is set to 75.

Table 6: Median Spearman's rank ordered correlation coefficient (SROCC), RMSE, and running time across 1000 train-test combinations on the LIVE IQA and TID 2008 databases.