Understanding and predicting the dynamic change of algal populations in freshwater reservoirs is particularly important, because algae-releasing cyanotoxins are carcinogens that can affect public health. However, the highly complex nonlinearity of water variables and their interactions makes it difficult to model the growth of algal species. Recently, the support vector machine (SVM) has been reported to offer the advantages of requiring only a small number of samples, a high degree of prediction accuracy, and a long prediction period when solving nonlinear problems. In this study, SVM-based prediction and forecast models for phytoplankton abundance in the Macau Storage Reservoir (MSR) are proposed, in which the water parameters of pH, SiO_{2}, alkalinity, bicarbonate
Freshwater algal bloom is one of the water pollution problems that occur in eutrophic lakes or reservoirs due to the presence of excessive nutrients. It has been found that most species of algae (also called phytoplankton) can produce various cyanotoxins including
Computational artificial intelligence techniques have been developed in recent years as efficient tools for predicting (without considering the time series effect) or forecasting (considering the time series effect) algal blooms. Previous studies [
Considering the drawbacks of both methods, the support vector machine (SVM) has recently started to be used for predicting chlorophyll concentration. It is a machine-learning technique based on statistical learning theory and derived from structural risk minimization, which can enhance generalization ability and minimize the upper bound of the generalization error. Compared to ANN, SVM has the advantages of requiring only a small number of samples, a high degree of prediction accuracy, and a long prediction period, using kernel functions to solve nonlinear problems. It is believed that SVM will provide a new approach for predicting phytoplankton abundance in reservoirs [
In this study, we attempted to develop an SVM-based predictive model to simulate the dynamic change of phytoplankton abundance in the Macau Reservoir given a variety of water variables. The measured data from 2001 to 2011 were used to train and test the model. The present study will lead to a better understanding of the algal problems in Macau, which will help to develop guidelines for forecasting the onset of algal blooms in raw water resources.
Macau is situated 60 km southwest of Hong Kong and experiences a subtropical seasonal climate that is greatly influenced by the monsoons. The differences in temperature and rainfall between summer and winter are noticeable, though not extreme. Macau Main Storage Reservoir (MSR) (Figure
Location of the MSR.
Macau Water Supply Co. Ltd. is responsible for water quality monitoring and management. A location at the inlet of the reservoir was selected for sampling. Samples were collected in duplicate monthly from May 2001 to February 2011 at 0.5 m below the water surface. A total of 23 water quality parameters, including hydrological, physical, chemical, and biological parameters, were monitored monthly. Precipitation was obtained from Macau Meteorological Center (
In this work, correlation analysis was conducted to identify the water parameters which were significantly correlated with phytoplankton abundance (Table
Correlation analysis of prediction and forecast model.
Parameters               Prediction model   Forecast model, time lagged (month)
                                            …        …        …
Turbidity                …                  0.00     −0.01    −0.06
Temperature              …                  0.21     0.19     0.14
pH                       …                  …        …        …
Conductivity             …                  0.01     0.14     0.21
Cl^{−}                   …                  0.10     0.22     0.28
…                        …                  0.03     0.14     0.22
SiO_{2}                  …                  …        0.16     0.04
Alkalinity               …                  …        −0.21    −0.12
…                        …                  …        …        −0.24
DO                       …                  …        …        …
…                        …                  −0.22    −0.22    −0.15
…                        …                  −0.08    −0.02    0.03
…                        …                  0.10     0.08     0.25
TN                       …                  …        …        …
UV_{254}                 …                  …        …        …
Fe                       …                  −0.06    −0.04    −0.08
…                        …                  0.06     0.06     0.03
TP                       …                  0.05     0.02     0.00
Suspended solid          …                  …        …        0.23
TOC                      …                  …        0.29     …
HRT                      …                  −0.11    −0.13    −0.16
Water level              …                  0.05     0.01     −0.02
Precipitation            …                  0.05     0.11     0.06
Phytoplankton abundance  —                  …        …        …
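The correlation analysis summarized in the table above can be illustrated with a minimal sketch. It assumes a plain Pearson correlation between each water parameter and log_{10} phytoplankton abundance, with the parameter shifted back by a number of months for the forecast model. The function names and the sample series are hypothetical and are not MSR data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def lagged_correlation(parameter, abundance, lag):
    """Correlate the parameter at month t - lag with abundance at month t.

    lag = 0 corresponds to the prediction model (same-month values);
    lag >= 1 corresponds to the forecast model (time-lagged values).
    """
    if lag == 0:
        return pearson_r(parameter, abundance)
    return pearson_r(parameter[:-lag], abundance[lag:])

# Hypothetical monthly series, for illustration only (not MSR data).
ph = [7.9, 8.1, 8.4, 8.8, 8.6, 8.2, 8.0, 8.3]
log_abundance = [3.1, 3.2, 3.4, 3.9, 4.3, 4.1, 3.7, 3.5]

for lag in range(4):
    print(lag, round(lagged_correlation(ph, log_abundance, lag), 2))
```

Parameters whose coefficient passes a chosen significance threshold would then be kept as model inputs; the threshold itself is a modeling choice.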
As a prediction algorithm, SVM was first proposed by Vapnik [
SVM is selected in this work because of its advantages over other “black box” modeling approaches such as ANN, listed as follows [
The architecture of the estimated function does not have to be determined before training. Input data of any arbitrary dimensionality can be treated with only linear costs in the number of input dimensions.
SVM treats the regression as a quadratic programming problem of minimizing the data-fitting error plus regularization, which produces a global (or even unique) solution.
SVM combines the advantages of multivariate nonlinear regression in that only a small amount of data is required to produce a good generalization. In addition, the weakness of the transformational models in multivariate nonlinear regression can be overcome by mapping the data points to a sufficiently high-dimensional feature space.
Results obtained from SVM are easy to interpret.
In SVM, the whole process consists of several layers. The input vectors are put in the first layer. Suppose that the training datasets are
A nonlinear mapping
Then, in this higher-dimensional feature space, the optimal decision function is
In this way, the nonlinear prediction function is transformed into a linear prediction function in the higher-dimensional feature space [
As introduced previously, SVM can provide the global optimum solution because the problem in SVM is transformed into solving a quadratic programming problem. So, the minimization problem shown in (
According to Mercer’s condition, in SVM the inner product
linear:
polynomial:
radial basis function:
sigmoid:
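The expressions for the four kernels listed above are not reproduced here, so the following is only a sketch of how they are commonly written in LIBSVM-style tools. The parameter names `gamma`, `r`, and `d`, and their default values, are assumptions rather than the notation of this study.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    # K(u, v) = u . v
    return dot(u, v)

def polynomial_kernel(u, v, gamma=1.0, r=0.0, d=3):
    # K(u, v) = (gamma * u . v + r) ** d   (assumed parameterization)
    return (gamma * dot(u, v) + r) ** d

def rbf_kernel(u, v, gamma=1.0):
    # K(u, v) = exp(-gamma * ||u - v||^2)   (assumed parameterization)
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(u, v, gamma=1.0, r=0.0):
    # K(u, v) = tanh(gamma * u . v + r)   (assumed parameterization)
    return math.tanh(gamma * dot(u, v) + r)
```

In this form the RBF kernel has a single kernel parameter, whereas the polynomial kernel has three, which is one practical reason the RBF kernel is often tried first.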
For these four kernel functions, in general, the RBF kernel function is a reasonable first choice [
As shown in the kernel function mentioned previously, there are three parameters which need to be specified in the application of SVM: (1) capacity parameter
The performance of the models was evaluated using the following indicators: square of correlation coefficient (
The correlations between log_{10} phytoplankton abundance and the water parameters for the forecast and prediction models are shown in Table
Performance indexes of the prediction and forecast models.
Performance index                  Prediction model                      Forecast model
                                   Accuracy          Generalization      Accuracy          Generalization
                                   performance       performance         performance       performance
                                   (training set)    (testing set)       (training set)    (testing set)
                                   ANN     SVM       ANN     SVM         ANN     SVM       ANN     SVM
Square of correlation coefficient  0.752   0.760     0.749   0.758       0.758   0.863     0.760   0.863
RMSE                               0.307   0.307     0.316   0.351       0.299   0.229     0.306   0.264
MAE                                0.238   0.243     0.243   0.274       0.229   0.127     0.247   0.226
After the correlation analysis, testing of the models involved two parts: accuracy performance and generalization performance. Accuracy performance tests the capability of the model to predict the output for the input set that was originally used to train the model, while generalization performance tests the capability of the model to predict the output for input sets that were not in the training set. To guard against a model that memorizes the inputs instead of learning general patterns, both performance checks need to be considered. In the present research, the performance indexes for the SVM-based models were averaged over 50 runs.
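The three indicators reported in the performance table (square of correlation coefficient, RMSE, and MAE) can be sketched as follows. The standard textbook definitions are assumed here, and the function names are illustrative. Applying the same functions to training-set predictions gives the accuracy performance; applying them to testing-set predictions gives the generalization performance.

```python
import math

def squared_correlation(observed, predicted):
    """Square of the Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)

def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n
```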
In the application of SVM in this work, for the prediction model, after the correlation analysis, 9 parameters such as pH and SiO_{2} are selected as the independent variables, and phytoplankton abundance is selected as the dependent variable (target value). Then, the data from May 2001 to December 2008 are used to train the model, and the data from January 2009 to February 2011 are used to test it. In the training process, the cross-validation approach mentioned previously is adopted to obtain the optimal combination of parameters for testing. Specifically, the training data are divided into 10 groups of about the same size; 9 groups are used for training, and the remaining group is used to test the model trained on those 9 groups. This procedure (9 groups for training and 1 group for testing) is then repeated 9 more times (10 times in total). The parameters of the repeat with the best testing performance among these 10 are used as the optimal parameter combination in the “real” testing process, which uses the data from January 2009 to February 2011. The forecast model basically follows the same steps as the prediction model; the only difference is that the effect of time series is included in the forecast model. So, in the forecast model, only the previous three months’ data are included in the training process.
The performance of the prediction and forecast models is shown in Table
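The 10-group cross-validation described above can be sketched in plain Python. This is a common variant that averages the validation error over the 10 groups for each candidate parameter combination and keeps the combination with the lowest average; it is not necessarily the exact selection rule used in this study. The names `fit`, `error`, `toy_fit`, and the parameter `shrink` are hypothetical stand-ins for an SVM trainer and its parameters.

```python
import itertools

def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous groups of about equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def grid_search_cv(samples, targets, param_grid, fit, error, k=10):
    """Return (best_params, best_score) by mean validation error over k folds.

    fit(train_x, train_y, **params) must return a predict(x) callable;
    error(observed, predicted) must return a float (lower is better).
    Assumes len(samples) >= k so that no group is empty.
    """
    folds = k_fold_indices(len(samples), k)
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_errors = []
        for held_out in folds:
            held = set(held_out)
            train_idx = [i for i in range(len(samples)) if i not in held]
            predict = fit([samples[i] for i in train_idx],
                          [targets[i] for i in train_idx], **params)
            predictions = [predict(samples[i]) for i in held_out]
            fold_errors.append(error([targets[i] for i in held_out], predictions))
        score = sum(fold_errors) / k
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy stand-in model (NOT an SVM): predicts a scaled mean of the training targets.
def toy_fit(train_x, train_y, shrink):
    mean_y = sum(train_y) / len(train_y)
    return lambda x: shrink * mean_y

def mean_abs_error(observed, predicted):
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

best, score = grid_search_cv(list(range(20)), [2.0] * 20,
                             {"shrink": [0.5, 1.0, 1.5]}, toy_fit, mean_abs_error)
print(best, score)
```

With a library such as scikit-learn, `fit` could wrap `sklearn.svm.SVR(kernel="rbf", C=..., gamma=..., epsilon=...)` and the grid could range over those three values; that pairing is an assumption, since the text does not name the software used.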
Observed and predicted phytoplankton level for the training and validation dataset of the prediction models.
Observed and predicted phytoplankton level for the testing dataset of the prediction models.
SVM result for the training and validation (a) and testing (b) data set.
Observed and predicted phytoplankton level for the training and validation dataset of the forecast models.
Observed and predicted phytoplankton level for the testing dataset of the forecast models.
SVM result for the training and validation (a) and testing (b) data set.
These results confirm that SVM can handle the nonlinear relationship between water parameters and phytoplankton abundance well.
SVM-based prediction and forecast models for phytoplankton abundance in MSR are proposed in this study. Fifteen water parameters whose correlation coefficients against phytoplankton abundance were greater than 0.3 were selected, with 8 years of data (2001–2008) for training and cross-validation and the most recent 3 years (2009–2011) for testing. The results showed that the forecast model has better performance with the
The authors thank Macao Water Supply Co. Ltd. for providing historical data of water quality parameters and phytoplankton abundances. The financial support from the Fundo para o Desenvolvimento das Ciências e da Tecnologia (FDCT) (Grant no. FDCT/016/2011/A) and Research Committee at University of Macau is gratefully acknowledged.