Sludge Bulking Prediction Using Principal Component Regression and Artificial Neural Network

Sludge bulking is the most common solids settling problem in wastewater treatment plants. It is caused by the excessive growth of filamentous bacteria extending outside the flocs, which decreases the wastewater treatment efficiency and deteriorates the effluent water quality. Molecular techniques have been widely used in previous studies to investigate bulking from the microbiological perspective, but the mechanisms are not yet understood well enough to form a deterministic cause-effect relationship. In this study, system identification techniques based on the analysis of the inputs and outputs of the activated sludge system are applied to data-driven modeling. Principal component regression (PCR) and artificial neural network (ANN) models were identified using data from the Chongqing wastewater treatment plant (CQWWTP), including temperature, pH, biochemical oxygen demand (BOD), chemical oxygen demand (COD), suspended solids (SS), ammonia (NH4), total nitrogen (TN), total phosphorus (TP), and mixed liquor suspended solids (MLSS). The models were subsequently used to predict the sludge volume index (SVI), the indicator of bulking occurrence. A comparison of the results obtained by both models is also presented. The results showed that ANN has better prediction power (R² = 0.9) than PCR (R² = 0.7) and thus provides a useful guide for practical sludge bulking control.


Introduction
Sludge bulking is the most common solid separation problem in the activated sludge process. It is caused by the excessive growth of filamentous bacteria extending outside the flocs, which interferes with the settling of activated sludge. It has been reported that over 50% [...] the downhill simplex method to minimize the sum of the squared errors between observation and prediction; ANN models were then used to predict the remaining errors of the optimized mechanistic model. This study used over 10 days of 6-9 h measurements from the activated sludge treatment plant located at Norwich, England. Though the study was based on real data, the models were not used for predicting bulking phenomena.
Côté et al. [20] developed and applied ANN models for the prediction of future bulking episodes. Simple prediction models based on online data (flow rates), analytical data (COD, BOD), and qualitative data (presence of foam, filamentous bacteria, microfauna, and appearance) were developed for the effluent TSS, which is an indicator of plant performance. The data came from a WWTP in Catalonia, Spain, over 609 consecutive days. Through the combined use of rough set theory and ANN, reasonable prediction models were found that show the different importance of the variables and provide insight into the process dynamics. However, compared to SVI, TSS is not a good indicator of bulking, although the effluent TSS undoubtedly increases during bulking episodes. Besides, the parameter sets used for selecting the significant variables are incomplete. For example, important variables including temperature, pH, TN, and TP are not included in the selection, so key information for explaining the bulking phenomena is lost.
A recent study [21] utilized a self-organizing radial basis function (SORBF) neural network method to predict the evolution of SVI. The hidden nodes in the SORBF neural network can be grown or pruned based on the node activity and mutuality to achieve the appropriate network complexity and overall computational efficiency. The performance of this method was verified in a real WWTP. The method enhanced the capacity of the RBF model to adapt to nonlinear dynamic systems and thus yielded more accurate predictions than the other methods. However, only limited input parameters (influent flow rate, DO, pH, BOD, COD, and TN) were included in that study, which are not enough to explain sludge bulking mechanisms.
Considering the drawbacks of previous studies using ANN in wastewater treatment systems, the purpose of the present study is to analyze the bulking problems of CQWWTP, which uses the A/A/O treatment process that has not been discussed before, based on a more complete set of daily variables for the whole year, including temperature, pH, BOD, COD, SS, NH4, TN, TP, and MLSS. These variables provide more complete input data to explain the bulking mechanisms, even though only data-driven models are developed in the study. The models can be used to evaluate the relative influence of the operational conditions, influent characteristics, and activated sludge concentrations on the SVI and to predict the SVI values using PCR and ANN. Both models in this study were compared with each other and with the prediction model developed by Han and Qiao [21] to select the best prediction method for wastewater treatment management. The key contribution of this paper lies not only in the mathematical modeling itself but also in taking the main factors that affect bulking into consideration, by integrating all of the potential mechanistic bulking causative variables into both models, though only data-driven models were applied.
The rest of this paper is organized as follows. The study area and data source of CQWWTP are first introduced concisely, followed by the modeling approaches (PCR and ANN formulation) and the performance indicators used for evaluation in Section 2. Section 3 presents and discusses the modeling results obtained by PCR and ANN, respectively, and compares both methods. The conclusions are drawn in Section 4.

Study Area
Chongqing is the biggest city in Western China. Like Beijing, Tianjin, and Shanghai, it is directly under the central government of China. The city has grown very quickly during the last 10 years, with a population of 31 million and an area of 82,400 km². There is now a big effort to collect and treat the wastewater, due to the recent completion of the Three Gorges Dam downstream. CQWWTP (29.601615°N latitude and 106.634133°E longitude, Figure 1), one of the biggest WWTPs in Chongqing, is designed for an average flow rate of 300,000 m³/d and about 750,000 person equivalents (in carbon, nitrogen, and phosphorus). CQWWTP uses conventional A/A/O (anaerobic/anoxic/aerobic) treatment processes (Figure 2) that are susceptible to sludge bulking. It was reported that 36% of the sludge experienced bulking in 2010, and the situation appeared to be worsening in recent years, particularly in the spring.

Data Source
Sludge samples were collected daily in the reaction tank over the year 2010. The monitored parameters included operational conditions (temperature and pH), influent characteristics (BOD, COD, SS, NH4, TN, and TP), and activated sludge concentrations (MLSS). Water samples were preserved, delivered, and analyzed using the standard methods of the American Public Health Association [22].
Figure 3 shows the changes of the water parameters over time, with a simple statistical analysis given in Table 1. The pH remained stable within the range of 7.6-8.2. The water temperatures followed the atmospheric temperatures, which are low in winter and high in summer. The BOD and COD concentrations in the influent fluctuated from time to time, with high standard deviations of 74.8 mg/L for BOD and 113 mg/L for COD. However, the BOD/COD ratios were within 0.53-0.8 for 85% of the data, which is within the normal range for municipal wastewater, indicating that the wastewater is readily biodegradable. Similarly, the nitrogen and phosphorus concentrations in the influent fluctuated to 2-3 times higher or lower than the average values, which was believed to be a highly probable factor affecting the growth of the bulking-causing filamentous bacteria in the reaction tank afterward. Due to the instability of the wastewater characteristics and the occurrence of bulking, the MLSS in the aeration tank could not be kept stable, ranging from 2000 mg/L to more than 6500 mg/L. It was also noted that the near-zero concentrations at the end of July and August were due to measurement errors. Figure 4 shows the change of SVI over time, which clearly indicates that bulking mostly happened in the spring from January to April, with SVIs greater than 150 mL/g. On the other hand, bulking levels were low in the summer from July to September, with the SVI around 50 mL/g. When Figure 4 was compared with Figure 3, a correlation between temperature and SVI was found, showing that bulking in CQWWTP mostly happened in the spring and nonbulking conditions occurred in the summer. This relationship is further investigated in the following statistical studies.

Modeling Approaches
Two different modeling techniques, PCR and ANN, were analyzed and applied to model the SVI data from CQWWTP. The measured SVI values in the reaction tank reflect the bulking levels, which in turn depend on various physical, chemical, and biological water parameters and the interactions among them. They all affect the growth of the filamentous bacteria in the biological WWTP. According to the literature [23, 24], the important parameters include temperature, pH, BOD, COD, SS, NH4, TN, TP, and MLSS. Temperature and pH define the growth environment for microorganisms. Higher temperature increases the growth of floc-forming bacteria and filamentous bacteria and strengthens their interaction and competition. The optimum pH in the reaction tank is 7-7.5, and a pH below 6.0 would favor the growth of fungi that induce filamentous bulking. SS and MLSS are indicators of the amount of activated sludge. The wastewater components BOD/COD, NH4/TN, and TP are the carbon, nitrogen, and phosphorus sources for microorganisms, respectively. High carbohydrate components and low substrate concentrations with low F/M (food/microorganism) ratios appear to be conducive to sludge bulking [1]. Besides, a deficiency of nitrogen and phosphorus results in the production of nutrient-deficient floc particles and loss of settleability in reaction tanks. Thus all these parameters were taken as the inputs of the models.

PCR
PCR is divided into two parts: principal component analysis (PCA) and multiple linear regression (MLR). PCA is a multivariate statistical method that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components (PCs), thus reducing the complexity of a multidimensional system by maximizing the variance of the component loadings and eliminating invalid components. MLR attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data. The eigenvalues of the standardized matrix are calculated from the following equation:
\[ |C - \lambda I| = 0, \]
where C is the correlation matrix of the standardized data, λ denotes the eigenvalues, and I is the identity matrix. Then the weights of the variables in the PC are calculated by
\[ (C - \lambda I)\,W = 0, \]
where W is the matrix of the weights. Varimax rotation was used to obtain values of rotated factor loadings for evaluating the influence of each variable in the PC. These loadings represent the contribution of each variable to a specific principal component.
In this study, the PCA was performed on the water parameters to rank their relative significance and to describe their interrelation patterns as well as on the phytoplankton population levels. The stepwise option was used to choose the principal components, and the principal component scores of the selected parameters were used as independent variables in the MLR to check if the occurrences of phytoplankton could be explained by environmental variables as well as to predict the phytoplankton abundance. Since phytoplankton abundance did not show a normal distribution, a logarithmic transformation was applied to the phytoplankton data to be used in PCA. The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy and Bartlett's test of sphericity were used to verify the applicability of PCA [25]. PCA and MLR were carried out using the PASW 19 software package (SPSS Inc.). The detailed procedure of PCR has been described in our previous study [26].
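The two stages of PCR can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions, not the PASW procedure used in the study: it keeps components with eigenvalues above 1, omits the varimax rotation and the stepwise selection, and runs on synthetic data. The function names `pcr_fit` and `pcr_predict` are hypothetical.

```python
import numpy as np

def pcr_fit(X, y, eig_threshold=1.0):
    """Fit PCA on standardized inputs, then regress y on the retained PC scores."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    Z = (X - mean) / std                        # standardized data
    C = np.corrcoef(Z, rowvar=False)            # correlation matrix C
    eigvals, eigvecs = np.linalg.eigh(C)        # eigen-decomposition of symmetric C
    order = np.argsort(eigvals)[::-1]           # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > eig_threshold              # assumption: eigenvalue > 1 rule
    W = eigvecs[:, keep]                        # weights of the variables in each PC
    scores = Z @ W                              # PC scores used as MLR regressors
    A = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # intercept plus one slope per PC
    explained = eigvals[keep].sum() / eigvals.sum()
    return {"mean": mean, "std": std, "W": W, "coef": coef, "explained": explained}

def pcr_predict(model, X):
    """Predict the response for new rows of X with a fitted PCR model."""
    Z = (X - model["mean"]) / model["std"]
    return model["coef"][0] + (Z @ model["W"]) @ model["coef"][1:]

# Usage on synthetic, strongly collinear data (not the CQWWTP measurements)
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = np.column_stack([
    latent[:, 0], latent[:, 0] + 0.1 * rng.normal(size=200),
    latent[:, 1], latent[:, 1] + 0.1 * rng.normal(size=200),
])
y = 3.0 * latent[:, 0] - 2.0 * latent[:, 1] + 0.1 * rng.normal(size=200)
model = pcr_fit(X, y)
y_hat = pcr_predict(model, X)
```

Because the retained components are mutually uncorrelated, the regression step does not suffer from the collinearity present in the raw inputs, which is the main reason for running PCA first.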

ANN
ANN computing is a new approach to system modeling and identification, with an attractive self-learning capability. Unlike conventional computational methods for processing information, ANN is a system based on the operation of biological neural networks. It has the advantage of being able to assign significance to the input parameters and map the inputs to outputs when the relationships between parameters are unknown.
The ANN model was built as a three-layered feedforward network (Figure 5): an input layer, one or more hidden layers, and an output layer. The nodes in each layer are connected by weights, which are adjusted through the training process to obtain the optimum model. The tan-sigmoid transfer function was used in the hidden layer to give the nonlinear modeling capability. The neural network architecture consists of two or more layers of neurons connected by weights denoted as w_ji. Each neuron calculates its output based on the amount of stimulation it receives from the individual input vector x_i, where x_i is the input of neuron i. The net input of a neuron is calculated as the weighted sum of its inputs, and the output of the neuron is obtained from this net input via the transfer function. The net input u_j of a neuron can be expressed as
\[ u_j = \sum_{i} w_{ji}\, x_i. \]
The sigmoid function was selected as the transfer function in this study, which is represented by the following equation:
\[ y_j = f(u_j) = \frac{1}{1 + e^{-u_j}}, \]
where y_j is the output of the jth neuron in the layers.
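The forward pass described above can be written as a short NumPy sketch. It assumes a single hidden layer with the logistic sigmoid and a linear output neuron; the bias terms and the layer sizes are assumptions added for illustration, and `np.tanh` could be swapped in where a tan-sigmoid hidden layer is intended.

```python
import numpy as np

def sigmoid(u):
    """Logistic transfer function y = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """One forward pass: weighted sum, sigmoid hidden layer, linear output."""
    u = W_hidden @ x + b_hidden      # net input u_j of each hidden neuron
    y = sigmoid(u)                   # hidden outputs y_j
    return float(w_out @ y + b_out)  # linear output neuron (e.g. predicted SVI)

# Usage with hypothetical sizes: 9 inputs, 4 hidden neurons
x = np.ones(9)
prediction = forward(x, np.zeros((4, 9)), np.zeros(4), np.ones(4), 1.0)
```

With all hidden weights set to zero, every hidden neuron outputs sigmoid(0) = 0.5, which makes the result easy to check by hand.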
The ANN first establishes a relationship between a set of input variables and a set of output variables from the historical data sets. This is achieved by repeatedly presenting examples of the desired relationship to the network and adjusting the connection weights (i.e., the model coefficients) to reduce the root-mean-square error (RMSE) between the simulated outputs and the observed outputs. The weights of the network change continually until the total error over the whole training set falls below the acceptable error or another stopping mechanism is triggered.
Backpropagation is the most widely used training method due to its broad applicability to complex nonlinear problems in many domains, such as classification, prediction, and modeling. It works to determine the optimal weights, and the function approximation potential for complex nonlinear data can be improved by increasing the number of hidden layers or the number of neurons in the hidden layers. The new weights are calculated by adding a modification to the old weights. The collected data are divided into two sets, one for training and the other for testing.
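A minimal sketch of batch gradient descent backpropagation for a one-hidden-layer network follows. It is not the MATLAB toolbox implementation used in the study; the learning rate, the initialization, the error tolerance, and the function name `train_bgd` are all assumptions chosen for illustration.

```python
import numpy as np

def train_bgd(X, y, n_hidden=5, lr=0.1, epochs=2000, tol=1e-4, seed=0):
    """Train a sigmoid-hidden, linear-output network by batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.5, (d, n_hidden))   # input-to-hidden weights
    b1 = np.zeros(n_hidden)
    w2 = rng.normal(0.0, 0.5, n_hidden)        # hidden-to-output weights
    b2 = 0.0
    history = []
    for _ in range(epochs):
        H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))  # hidden outputs, shape (n, h)
        err = H @ w2 + b2 - y                     # prediction error per sample
        mse = float(np.mean(err ** 2))
        history.append(mse)
        if mse < tol:                             # stop at the acceptable error
            break
        # Backpropagate the gradient of the mean squared error
        g_pred = 2.0 * err / n
        g_w2 = H.T @ g_pred
        g_b2 = g_pred.sum()
        g_U = np.outer(g_pred, w2) * H * (1.0 - H)  # through the sigmoid
        g_W1 = X.T @ g_U
        g_b1 = g_U.sum(axis=0)
        # New weights = old weights plus a modification (negative gradient step)
        W1 -= lr * g_W1
        b1 -= lr * g_b1
        w2 -= lr * g_w2
        b2 -= lr * g_b2
    return (W1, b1, w2, b2), history

# Usage on synthetic data (not the plant measurements)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = np.tanh(X_demo[:, 0]) - 0.5 * X_demo[:, 1]
params, history = train_bgd(X_demo, y_demo, epochs=500)
```

The whole training set is used for every weight update, which is what distinguishes the batch variant from sample-by-sample updating.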
Determining the size of the hidden layer is a significant task in ANN. Some general rules for selecting the number of hidden nodes N_H in the ANN model suggest that it should be between N_I and 2N_I + 1 [27], where N_I is the number of input nodes. Moreover, in order to prevent overfitting of the training data, Rogers and Dowla [28] also suggest that the condition N_H ≤ N_TR/(N_I + 1) needs to be satisfied, where N_TR is the number of training samples. In this study, a trial-and-error approach was carried out to find the optimum number of hidden nodes in the models. In general, a network structure with fewer hidden nodes is preferable; this usually gives better generalization capabilities and fewer overfitting problems. To avoid the overfitting problem, which commonly occurs with the application of ANN, cross-validation tests were used. The network was selected by considering the minimum value of MSE for the cross-validation data set [29].
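As a rough illustration, the two rules of thumb can be combined into a search range for the trial-and-error step, reading them as N_I ≤ N_H ≤ 2N_I + 1 and N_H ≤ N_TR/(N_I + 1). The helper name and the sample count of 255 (about 70% of one year of daily data) are assumptions, not values reported in the paper.

```python
def hidden_node_bounds(n_inputs, n_train):
    """Return (lower, upper) candidate bounds for the number of hidden nodes."""
    lower = n_inputs                              # N_I
    upper_rule = 2 * n_inputs + 1                 # 2*N_I + 1
    upper_overfit = n_train // (n_inputs + 1)     # N_TR / (N_I + 1), rounded down
    # If upper < lower, the training set is too small for both rules to hold.
    return lower, min(upper_rule, upper_overfit)

# Usage: 9 water parameters as inputs, assumed 255 training samples
bounds = hidden_node_bounds(9, 255)
```

Under these assumptions the search would cover 9 to 19 hidden nodes, with the final choice left to the cross-validation error.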
In this study, ANN development and simulation were conducted using the ANN toolbox of MATLAB 2011a (MathWorks, Natick, MA). The batch gradient descent backpropagation training algorithm was adopted; the training stops when it hits one of the several stopping criteria, including

Performance Indicators
The performance of the models was evaluated using the following indicators. The coefficient of determination R² provides a measure of the variability of the data reproduced by the model. Prediction R² is a good measure both for comparison and for assessing the model's prediction capability. Its calculation is also known as cross-validation: the first observation is excluded, the model is built with the remaining ones and used to predict the excluded observation, and the procedure is repeated for all observations. It is a good measure of out-of-sample accuracy. As this test alone cannot give the accuracy of the model, other statistical parameters should be reported. Mean absolute error (MAE) and root-mean-square error (RMSE) measure residual errors, providing a global idea of the difference between observation and modeling. The indicators are defined as follows:
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}, \]
\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right|, \]
\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}, \]
where n is the number of data; Y_i and \bar{Y} are the observation data and the mean of the observation data, respectively; and \hat{Y}_i is the modeling result.
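The indicators can be computed as in the sketch below, which assumes the conventional definitions of R², MAE, and RMSE. The leave-one-out routine illustrates the prediction R² procedure using an ordinary least-squares model as a stand-in; the function names are hypothetical.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SSE / SST."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def mae(y, y_hat):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y, float) - np.asarray(y_hat, float))))

def rmse(y, y_hat):
    """Root-mean-square error."""
    return float(np.sqrt(np.mean((np.asarray(y, float) - np.asarray(y_hat, float)) ** 2)))

def loo_prediction_r2(X, y):
    """Prediction R^2: refit a linear model n times, each time leaving one row out."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    A = np.column_stack([np.ones(len(y)), X])   # design matrix with intercept
    preds = np.empty(len(y))
    for i in range(len(y)):
        mask = np.arange(len(y)) != i            # exclude observation i
        coef, *_ = np.linalg.lstsq(A[mask], y[mask], rcond=None)
        preds[i] = A[i] @ coef                   # predict the excluded observation
    return r_squared(y, preds)

# Usage with small made-up numbers
obs, mod = [1.0, 2.0, 3.0], [1.0, 2.0, 4.0]
scores = (r_squared(obs, mod), mae(obs, mod), rmse(obs, mod))
```

Because each prediction comes from a model that never saw the excluded observation, the prediction R² is never higher than the ordinary R² of the same least-squares model.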

Results and Discussion
Correlations between SVI and the water parameters were analyzed to evaluate the influence of each parameter on the bulking level; the correlation coefficient provides a measure of the linear relationship between SVI and each parameter. The results (Table 2) showed that all the coefficients were greater than 0.15, indicating that all these parameters were correlated with SVI, and they were thus included in the models as input variables. Notably, a high correlation coefficient (0.82) was found between SVI and temperature, which is consistent with the observation that bulking at CQWWTP mostly occurs in the spring.
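A screening of this kind can be sketched with Pearson correlation coefficients. The example assumes Pearson's r is the coefficient meant here, and the column names and data are made up for illustration.

```python
import numpy as np

def correlations_with_target(X, y, names):
    """Pearson correlation between each column of X and the target y."""
    return {name: float(np.corrcoef(X[:, j], y)[0, 1]) for j, name in enumerate(names)}

# Usage with synthetic columns (hypothetical values, not plant data)
svi = np.array([50.0, 80.0, 120.0, 160.0, 200.0])
X_demo = np.column_stack([2.0 * svi + 1.0, -svi])
corr = correlations_with_target(X_demo, svi, ["param_a", "param_b"])
```

The sign of each coefficient shows the direction of the relationship, so a screening threshold such as 0.15 would normally be applied to the absolute value.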

PCA
The values of KMO for both the prediction and forecast models were above the criterion value of 0.6, indicating that PCA was applicable [13]. PCA demonstrates the relative importance of each standardized variable in the PC calculations.
The PCA for the prediction model was performed using the 9 parameters selected from the correlation analysis. Table 3 shows that the first 3 principal components explain 74.1% of the variation in the data. The scree test suggested that only the 3 components with eigenvalues greater than 1 be retained, in which all 9 parameters were included. The composition of the 3 principal components is shown in Table 4: PC1 represents the influent water characteristics, expressed as a function of COD, BOD, SS, TP, TN, and NH3-N; PC2 represents the activated sludge mixed liquor concentration, expressed as a function of MLSS; and PC3 represents the environmental conditions, expressed as a function of temperature and pH.

MLR
The MLR results for the prediction model are shown in Table 5. A stepwise approach was adopted. A t-test significance level of 0.05 was applied to identify the statistically valid parameters. The MLR result showed that all PCs were significant. Therefore, the prediction model for SVI can be written as
\[ \mathrm{SVI} = 468.935 + 0.025\,\mathrm{PC1} - 0.007\,\mathrm{PC2} - 15.898\,\mathrm{PC3}. \]

ANN
To apply the ANN model, several network structures were tested to find the most appropriate topology. Using the 9 water parameters as inputs, the best architecture consisted of a three-layer network. Sigmoid and linear functions were used as the activation functions in the hidden-layer neurons and the output neuron, respectively. Of the original data, 70% were used for training, among which 10% were randomly selected for cross-validation, and the remaining 30% were used for testing. Training was performed for a maximum of 30000 iterations. The detailed results are presented in Figures 6-9 and are discussed in the next section.
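The data partition can be sketched as an index split. The fractions follow the text; the fully random shuffling of all three subsets, the sample count of 365, and the function name are assumptions, since the paper states only that the cross-validation subset was chosen at random.

```python
import numpy as np

def split_indices(n, train_frac=0.7, val_frac_of_train=0.1, seed=0):
    """Split n sample indices into training, cross-validation, and testing sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                      # random order of all samples
    n_train_total = int(train_frac * n)           # 70% reserved for training
    n_val = int(val_frac_of_train * n_train_total)  # 10% of those for validation
    val = idx[:n_val]
    train = idx[n_val:n_train_total]
    test = idx[n_train_total:]                    # remaining 30% for testing
    return train, val, test

# Usage: one year of daily samples (assumed count)
train_idx, val_idx, test_idx = split_indices(365)
```

Keeping the validation subset out of the weight updates is what allows its error to be used for selecting the network without touching the test data.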

Modeling Results Comparison
Testing of the models involved two parts: the accuracy performance and the generalization performance. Accuracy performance tests the capability of the model to predict the output for the input set that was originally used to train the model, while generalization performance tests the capability of the model to predict the output for input sets that were not in the training set. To guard against overfitting of the model, both performance checks need to be considered. In the present research, the performance indexes for the ANN models were averaged over 50 runs. The performance of the prediction models is shown in Table 6. For the PCR model, the performance indexes for the testing step were generally better than those for the training step, with an R² of 0.689 for training and 0.772 for testing. Compared to the PCR model, the ANN model performed best, with R² (0.901, 0.907), RMSE (21.141, 20.258), and MAE (16.375, 15.899) for accuracy and generalization performance, respectively, indicating that, unlike PCR, ANN can handle the nonlinear relationship between SVI and the water parameters well.
It was noted that the ANN model did not need PCA to obtain good results. The PCA-ANN results (R² of 0.9, not shown here) did not improve the prediction power for the testing and training data sets, confirming that ANN is a powerful tool for dealing with collinearity of data. In the prediction models, no delay was observed for the PCR model in the training set data (Figure 6), but its magnitude fluctuated more than that of the ANN model. The predictions for the testing set in Figure 7 show overestimates in the low SVI region for both models. In general, ANN successfully predicted the SVIs with a reasonable degree of accuracy for the forecast and the prediction model.
The modeled SVIs versus the observed SVIs for PCR and ANN are shown in Figures 8 and 9, respectively. For both training and testing data, both models fitted the measured data well, with slopes equal to 1 for both fitting curves; that is, the modeling results are equal to the measured data. However, compared to PCR, for which the measured data were more scattered along the fitting curve, the ANN models provided a better simulation of the measurements, confirming that ANN fits better than PCR when used in predicting the SVIs of CQWWTP. From the modeling point of view, a disadvantage of ANN is that the mechanisms of the inner signal processing are unknown. However, it provides enough information for CQWWTP to prevent sludge bulking problems: to control the occurrence of sludge bulking, engineers need only keep the predicted SVI below 150 mL/g by adjusting the operational variables, such as MLSS, without understanding the complete mechanisms and the relationships among the variables. ANN was demonstrated to effectively solve problems where response flexibility and constant tuning of the models are required.
When compared with the SORBF model recently developed by Han and Qiao [21], our ANN model showed similar values of RMSE and R² with a simpler algorithm, demonstrating that our ANN model is suitable and advantageous for SVI prediction. This is most probably because more complete bulking causative variables are involved in the ANN model, thus providing more information to explain the sludge bulking phenomena, even though the complete mechanisms causing bulking and the relationships among the variables are still unclear.
In summary, our ANN model can provide accurate predictions of sludge bulking. The fitting accuracy was found to improve with an increasing number of bulking causative variables. The model has been tested at CQWWTP, which uses A/A/O processes that differ from the traditional aerobic process. Thus, empirical studies will also be conducted in the future on additional data sets to demonstrate that the ANN model generalizes to extensive data sets under different circumstances.

Conclusions
The econometric technique PCR and the artificial intelligence technique ANN applied in the study are powerful analysis tools that can be used to solve a problem that is poorly understood or difficult to solve with a traditional deterministic relationship. Current knowledge of sludge bulking is still incomplete, and thus unconventional systematic data-driven modeling approaches can be used to improve the prediction. Prediction models based on PCR and ANN were compared for simulating the SVIs in CQWWTP, using nine water parameters: the environmental conditions of temperature and pH; the wastewater characteristics of BOD, COD, SS, NH4, TN, and TP; and the activated sludge concentration of MLSS. The PCA result indicated that only 3 PCs with eigenvalues greater than 1 were obtained, which explain 74.1% of the variance of the data. The application of PCA in the PCR model was considered better than using the original data, as it eliminates the collinearity problem and reduces the number of inputs, thus decreasing the model complexity.
PCR showed worse prediction performance than ANN, indicating that the complex nonlinear relationships among the variables in the treatment system cannot be simulated using a linear model alone. Besides, with PCR, the highest SVI values were underestimated during the training step. On the other hand, ANN had better prediction power, with an R² of 0.9 for both accuracy performance and generalization performance, implying that ANN can deal with the collinearity problem in the data without data pretreatment using PCA. Compared with the recently developed SORBF model, the ANN model is suitable and advantageous for SVI prediction because it uses a simpler algorithm and includes more bulking causative variables. The ANN models established in this research performed well in addressing the wastewater quality and sludge bulking problem of CQWWTP. The modeling approach described here for analyzing the bulking problem has yielded useful information for effective wastewater treatment management.
Though the ANN presented here was obtained from CQWWTP, the technique can also be applied to other WWTPs where the input parameters and operational conditions are similar. The method can be used for the control of wastewater treatment operation in order to improve the treatment performance.

Figure 3: Change of water parameters over time in 2010. (a) Temperature and pH; (b) MLSS and SS; (c) BOD and COD; (d) TN/NH3-N (lines) and TP (bar chart).

Figure 4: Change of SVIs over time in 2010.

Figure 6: Observed and predicted SVIs for the training data set of the prediction model.

Figure 7: Observed and predicted SVIs for the testing data set of the prediction model.

Table 2: Correlation coefficients between SVIs and water parameters in MSR.

Table 3: Eigenvalue and percentage variance of the 9 principal components for the prediction model.

Table 4: Composition of the principal components for the prediction model.

Table 5: MLR result for prediction model.

Table 6: Performance indexes of the PCR and ANN prediction models.