Machine Learning Framework for Intelligent Detection of Wastewater Pollution by IoT-Based Spectral Technology

Industrial wastewater contains excessive micro insoluble solids (MIS) that can cause environmental pollution. Near-infrared (NIR) spectroscopy is an advanced technology for rapid detection of complex targets in wastewater, and an Internet of Things (IoT) platform supports the intelligent application of NIR technologies. Studies of intelligent chemometric methods mainly aim to improve the NIR calibration model based on the IoT platform. In this work, the backward interval and synergy interval techniques are proposed in combination with the least squares support vector machine (LSSVM) method for adaptive selection of informative spectral wavelength variables. The radial basis function (RBF) kernel is applied for nonlinear mapping, and the regularization parameter and the kernel width are tuned jointly. For automatic waveband fitting, the digital wavelengths in the full scanning range were split into 43 equal-width subintervals, and the back interval LSSVM (biLSSVM) and synergy interval LSSVM (siLSSVM) models were both established to improve the prediction results through adaptive selection of quasidiscrete variable combinations. In comparison with some common linear and nonlinear models, the best training model was acquired with the siLSSVM method, while the best testing model was obtained with biLSSVM. The optimization of model parameters indicated that the proposed biLSSVM and siLSSVM machine learning methodologies are feasible for improving the prediction results in rapid determination of the wastewater MIS content by the IoT-based NIR technology. The machine learning framework is expected to be applicable to fast assessment of the environmental risk of industrial pollution and water safety.


Introduction
The Internet of Things (IoT) has gradually permeated many technological fields, such as monitoring, agriculture, irrigation management, healthcare, and security [1]. An IoT framework with cloud-centric storage and processing can analyze the data received from various sensors along with decision support network nodes [2]. The IoT platform supports intelligent applied technologies, providing solutions that are useful for smart irrigation of the agricultural landscape [3,4]. Combined with machine learning methodologies, intelligent spectral detection based on the IoT platform supports a sustainable eco-environment.
Near-infrared (NIR) spectroscopy has developed into an established technology for rapid detection of complex analytes [5]. This technology illuminates the target sample with light in the near-infrared range. The components can be identified and quantitatively determined by intelligent chemoinformatic methods that serve model training in the IoT cloud environment [6]. NIR technology has the advantages of reagent-free operation, nondestructive measurement, simultaneous multicomponent determination, and online implementation [7,8]. It is widely applied in the fields of precision agriculture, digital food safety, smart environment, and intelligent medicine [9][10][11]. The model prediction effect is expected to improve in combination with IoT cloud analysis. Investigating machine learning approaches in NIR procedures helps construct novel analytical models that reveal the characteristics of complex data, thus enabling more intelligent, more efficient, and more customer-focused data mining based on the IoT application.
Water pollution has long been a global industrial concern [12]. Waste chemicals emitted into the water may be toxic to many organisms and degrade eco-environmental safety, posing uncertain threats to human health and quality of life [13]. Therefore, completing a large-scale detailed survey of water quality is essential for assessing the environmental risk of water pollution. Commonly, the chemical oxygen demand (COD) value measured in wastewater treatment is used to determine the quality level of wastewater. However, if the value is not accurately measured, the samples have to be discarded because the experiment cannot be repeated. Industrial wastewater contains many unexpected chemical pollutants that strongly disturb the balance of plant nutrients, microbial activity, and ecological toxicity [14]. Moreover, the pollutants spontaneously trigger chemical reactions that produce large amounts of micro insoluble solids (MIS) suspended in the water, making the water visibly turbid [15]. Thus, there is a strong need for rapid intelligent detection of water MIS with comparable accuracy, in order to quickly assess the environmental impact of industrially discharged wastewater. The IoT-based NIR technology meets this requirement, and machine learning chemometric methods are in demand for smart analyses [16].
With the development of chemometric studies, NIR technology has become a powerful analytical tool for rapid quantitative determination of chemical components and parametric properties of water samples. NIR statistical models give estimation formulae for predicting water properties such as the contents of citric acid, tartaric acid, malic acid, and oxalic acid [17]. Moreover, NIR combined with visible (VIS) and mid-infrared (MIR) spectroscopy is able to predict analytes (such as fat, dry matter, and total nitrogen) in water solutions [18,19]. NIR technology is also capable of detecting wastewater properties in minutes; for example, the wastewater discharged from biodiesel fuel production plants has been targeted. In addition, NIR contributes to fast prediction of specific chemical components. Solid organic contents were predicted using partial least squares (PLS) regression combined with the standard normal variate (SNV) transform [20]. Oil and urea concentrations were determined by multiple linear regression models with variable selection [21]. After microbial treatment, methanol and glycerol concentrations were further identified. However, little work has been reported on NIR rapid detection of water MIS with intelligent chemometric models.
Algorithmically, the PLS regression method mainly establishes linear models for NIR analysis. However, linear regression models cannot extract comprehensive information when confronted with complex spectral data [22]. Modern machine learning methods have therefore been used in the NIR field to improve prediction ability. Diago et al. [23] showed that automated plant-based methods were feasible for fast assessment of the water status of grapevines. Chen et al. [24] discussed some kernel network methods for NIR prediction of industrial wastewater. Ahn and Park [25] proposed an adaptive atmospheric correction procedure for VIS-NIR analysis of turbid water. Chen et al. [26] applied a convolutional neural network model for fast prediction of wastewater COD, with optimization of multiple parameters. Given the success of these previous works, we hypothesize that similar NIR calibration methodologies would be able to predict wastewater MIS for rapid assessment of the environmental risk of water pollution. If the NIR spectral information is extracted by variable selection techniques such as backward interval selection and synergy interval combination, the nonlinear calibration models can be improved further with kernel transformations. The support vector machine (SVM) is a relatively mature nonlinear machine learning algorithm for qualitative and quantitative analyses [27]. Previous studies showed that the least squares SVM (LSSVM) model outperformed the common PLS model because the relationships between the spectral data and the target contents are nonlinear in practice and are accompanied by unknown components as well as complex noise interference [28]. When dealing with the NIR data of samples with heterogeneous properties, the LSSVM models can be optimized by careful tuning of the kernel functions.
The radial basis function (RBF) is regarded as an appropriate kernel for extracting spectral information in the NIR field, because the kernel function maps the nonlinear data into a space in which the relationship becomes linear [29]. Therefore, to obtain robust and reliable results in NIR calibration and prediction, the LSSVM regularization and the selection of the RBF kernel width are very important, especially when they are optimized together with the variable selection techniques.
This study was conducted with industrial wastewater samples collected in south China, to establish intelligent analytical models for estimating the MIS content by using the IoT-based NIR technology. A machine learning framework is constructed based on the LSSVM algorithm for adaptive detection of the complex wastewater samples. The NIR spectral information was extracted using the backward interval selection mode and the synergy interval combination mode. The RBF-kernel LSSVM algorithm was used to establish the calibration models to achieve fast prediction. The LSSVM parameters and the kernel parameters are automatically optimized in combination with interval selection of the informative NIR wavebands. The goal is to identify the optimal machine learning model for fast NIR estimation of wastewater pollution. The proposed methodological framework is expected to build smart learning models for precise assessment of the pollution level of wastewater, thus providing governance guidelines for environmental sustainability in step with the development of the IoT cloud platform.

Methodologies
The machine learning framework is designed for determining wastewater MIS using the IoT-based NIR technology. The model prediction accuracy for in situ measurement is influenced by the tuning of the chemometric methods. The idea of LSSVM is to transform the original nonlinear spectral data into a high-dimensional space, in which the relationship between the transformed NIR data and the target analyte becomes exactly or approximately linear [27]. The distribution of samples in the high-dimensional space is determined by the mapping function $\varphi(\cdot)$. In the high-dimensional data space, a decision function $Q$ is constructed to minimize the regularized prediction errors, that is,

$$\min_{w,\,b,\,e}\; Q(w, e) = \frac{1}{2} w^{T} w + \frac{\gamma}{2} \sum_{i=1}^{m} e_i^{2}, \quad \text{s.t.}\;\; y_i = w^{T} \varphi(x_i) + b + e_i,\;\; i = 1, 2, \cdots, m,$$

where $\gamma$ is the regularization parameter, $w$ is the weight vector, $b$ is the bias, $e_i$ is the prediction error of the $i$-th calibration sample, and $m$ is the number of calibration samples. This is a convex optimization problem that can be solved with Lagrange multipliers. Then, the prediction of the target analyte is determined by the formula

$$y(x) = \sum_{i=1}^{m} \alpha_i K(x, x_i) + b,$$

where $\alpha_i$ is the Lagrange multiplier, $y$ is the targeted analyte, $x$ represents the NIR spectrum, and $K(x, x_i)$ refers to the kernel function defined as $K(x, x_i) = \varphi(x)^{T} \varphi(x_i)$. Machine learning by the parametric LSSVM has global optimality for nonstationary data, and the parametric regularization limits the possibility of overfitting. The radial basis function (RBF) is taken as the most effective kernel for quantitative regression, and its training process is simple and fast [29], that is,

$$K(x, x_i) = \exp\left( -\frac{\lVert x - x_i \rVert^{2}}{2\sigma^{2}} \right),$$

where $\sigma^{2}$ represents the kernel width. The rapid NIR estimation of wastewater target analytes (i.e., the MIS) requires large-scale optimization of the LSSVM model with the RBF kernel (denoted as RBF-LSSVM for short). To find a suitable calibration-prediction strategy, the RBF-LSSVM model was trained by screening the regularization parameter ($\gamma$) and the RBF kernel width ($\sigma^{2}$) in a grid search. Grid search of these two parameters is necessary to minimize the predictive error. Once the model is well trained, prediction on new samples is simple, fast, and reliable.
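The RBF-LSSVM training and prediction steps can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the common RBF convention exp(-||x - x_i||² / (2σ²)), it solves the standard LSSVM linear system for the bias b and the multipliers α, and the function names (`rbf_kernel`, `lssvm_fit`, `lssvm_predict`) and the synthetic data are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """RBF kernel matrix; the 2*sigma**2 denominator is an assumed convention."""
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def lssvm_fit(X, y, gamma, sigma):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b, alpha]^T = [0, y]^T."""
    n = X.shape[0]
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = 1.0
    system[1:, 0] = 1.0
    system[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    solution = np.linalg.solve(system, np.concatenate(([0.0], y)))
    return solution[0], solution[1:]  # bias b, multipliers alpha

def lssvm_predict(X_train, b, alpha, sigma, X_new):
    """Prediction y(x) = sum_i alpha_i * K(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Synthetic stand-in for spectra (rows) and reference values (not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
b, alpha = lssvm_fit(X, y, gamma=120.0, sigma=11.0)
y_fit = lssvm_predict(X, b, alpha, 11.0, X)
```

Because training reduces to one linear solve, each (γ, σ) candidate in a grid search is cheap to evaluate for sample sets of this size.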

Adaptive Modeling Using the Variable Selection Techniques
The IoT-based NIR calibration model for rapid estimation of wastewater pollutant analytes requires training the parametric RBF-LSSVM model in combination with adaptive variable selection from the full-range spectral data, because the full spectrum contains various kinds of noise and irrelevant information that may lead to low prediction accuracy and high calibration errors [30]. Waveband selection techniques are able to address this problem. Several interval selection techniques have been proposed for linear PLS regression, such as interval PLS (iPLS) [31], backward interval PLS (biPLS) [32], and synergy interval PLS (siPLS) [33]. The core idea of these algorithms is to select the optimal intervals of variables and then use the selected variables to establish the prediction model. On this basis, the backward interval and synergy interval methods are investigated in combined optimization with the parametric RBF-LSSVM models for rapid assessment of the environmental risk of wastewater pollution. As no other kernel is used, the RBF-LSSVM models are referred to simply as LSSVM, for convenience in the abbreviations that combine them with backward interval and synergy interval selection.
2.2.1. Backward Interval Selection. Backward interval selection divides the full-range spectrum into n equal-width intervals. By targeting the i-th interval (i = 1, 2 ⋯ n), the LSSVM model is established using the other n − 1 intervals. Each interval is targeted in turn in a closed loop, so that a prediction error is obtained for each interval. By comparison, the interval that has the worst prediction effect is removed, and the remaining intervals are used for remodeling. Over cycles of modeling and interval removal, the prediction error eventually stops decreasing, and the intervals remaining at that point are taken as the informative wavebands for variable selection. The LSSVM models combined with the backward interval selection technique are run with grid search optimization of the regularization parameter and the RBF kernel width. The combined optimization model is defined as the back interval LSSVM (biLSSVM) model.
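The removal loop can be sketched as follows. This is a hedged illustration that follows the usual biPLS-style reading of the procedure (at each cycle, drop the interval whose omission gives the lowest validation RMSE, then keep the subset with the lowest RMSE overall); the callback `score`, the function name, and the toy data are hypothetical, and ordinary least squares stands in for the tuned LSSVM model only to keep the example short.

```python
import numpy as np

def backward_interval_selection(intervals, score):
    """Greedy backward removal of intervals.

    intervals: list of arrays of column indices, one per interval.
    score(cols): hypothetical callback returning a validation RMSE
                 for a model built on the given columns.
    """
    def columns(ids):
        return np.concatenate([intervals[i] for i in ids])

    remaining = list(range(len(intervals)))
    best_ids, best_rmse = list(remaining), score(columns(remaining))
    history = [(None, best_rmse)]  # (removed interval, RMSE after removal)
    while len(remaining) > 1:
        # Leave out each interval in turn; drop the one whose omission
        # gives the lowest validation RMSE.
        trials = [(score(columns([j for j in remaining if j != i])), i)
                  for i in remaining]
        rmse, dropped = min(trials)
        remaining.remove(dropped)
        history.append((dropped, rmse))
        if rmse < best_rmse:
            best_rmse, best_ids = rmse, list(remaining)
    return best_ids, best_rmse, history

# Toy data: 12 variables in 4 intervals; only interval 2 is informative.
rng = np.random.default_rng(1)
X_cal = rng.normal(size=(60, 12))
X_val = rng.normal(size=(30, 12))
coef = np.zeros(12)
coef[6:9] = [1.0, -2.0, 0.5]
y_cal = X_cal @ coef + 0.01 * rng.normal(size=60)
y_val = X_val @ coef + 0.01 * rng.normal(size=30)

def ols_score(cols):
    """Stand-in model: least squares with intercept, scored on validation."""
    A = np.column_stack([X_cal[:, cols], np.ones(len(X_cal))])
    w, *_ = np.linalg.lstsq(A, y_cal, rcond=None)
    pred = np.column_stack([X_val[:, cols], np.ones(len(X_val))]) @ w
    return float(np.sqrt(np.mean((pred - y_val) ** 2)))

intervals = np.array_split(np.arange(12), 4)
best_ids, best_rmse, history = backward_interval_selection(intervals, ols_score)
```

In a full pipeline, `score` would wrap the grid-searched LSSVM model, so each cycle costs one parameter search per remaining interval.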

2.2.2. Synergy Interval Combination. Synergy interval combination evaluates the modeling effect of several intervals used together. First, the full-range spectrum is divided into n equal-width intervals. Each individual interval is modeled in turn in a closed loop, and then parametric LSSVM models are established for all possible combinations of t intervals (t = 2, 3 ⋯ n). For each t, the most effective interval combination is selected by comparison. The overall optimal combination of intervals is then identified by comparing the results across the different values of t. The synergy interval combination technique is designed for combined optimization with the parametric LSSVM model and is defined as the synergy interval LSSVM (siLSSVM) model.
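A minimal sketch of the synergy search is given below, assuming an exhaustive enumeration of interval combinations for each t. The names `synergy_interval_selection`, `score`, and `toy_score` are hypothetical, and the toy score is a deterministic stand-in for the validation RMSE of a tuned LSSVM model.

```python
from itertools import combinations

import numpy as np

def synergy_interval_selection(intervals, score, t_values):
    """Score every combination of t intervals for each t in t_values.

    score(cols) is a hypothetical callback returning a validation RMSE.
    Returns (optimal t, optimal combination, best (rmse, combo) per t).
    """
    best_per_t = {}
    for t in t_values:
        trials = (
            (score(np.concatenate([intervals[i] for i in combo])), combo)
            for combo in combinations(range(len(intervals)), t)
        )
        best_per_t[t] = min(trials)
    t_opt = min(best_per_t, key=lambda t: best_per_t[t][0])
    return t_opt, best_per_t[t_opt][1], best_per_t

# Toy score: penalize missing useful columns and included useless ones.
USEFUL = {2, 3, 6, 7}

def toy_score(cols):
    cols = {int(c) for c in cols}
    return len(USEFUL - cols) + 0.1 * len(cols - USEFUL)

intervals = np.array_split(np.arange(10), 5)  # 5 intervals of 2 variables
t_opt, combo_opt, best_per_t = synergy_interval_selection(
    intervals, toy_score, t_values=[1, 2, 3]
)
```

The number of candidate models for a given t is the binomial coefficient C(n, t); with n = 43 and t = 2 this is already 903 models, so the cost of an exhaustive search grows quickly with t.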

Model Indicators of the IoT-Based NIR Application
The NIR calibration model predictions of wastewater MIS were compared with the conventional analytical values. As the spectral data were acquired from various sensors along with IoT network nodes, the models are dynamically trained with the data flow. Models should be assessed using indicators including the root mean square error (RMSE), the correlation coefficient (CC), and the relative RMSE (RRMSE) [34]. They are calculated by the following formulae:

$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( y_i' - y_i \right)^{2}},$$

$$\mathrm{CC} = \frac{\sum_{i=1}^{m} \left( y_i - y_{\mathrm{ave}} \right)\left( y_i' - y_{\mathrm{ave}}' \right)}{\sqrt{\sum_{i=1}^{m} \left( y_i - y_{\mathrm{ave}} \right)^{2} \sum_{i=1}^{m} \left( y_i' - y_{\mathrm{ave}}' \right)^{2}}},$$

$$\mathrm{RRMSE} = \frac{\mathrm{RMSE}}{y_{\mathrm{ave}}} \times 100\%,$$

where $y_i$ is the reference value from conventional analysis of the $i$-th sample, $y_i'$ is the predictive value by NIR modeling, $y_{\mathrm{ave}}$ is the average value of the target samples, $y_{\mathrm{ave}}'$ is the average of the NIR predictive values, and $m$ represents the number of target samples.
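The three indicators can be computed as in the sketch below. It assumes that the mean square error is averaged over m (rather than m − 1) and that RRMSE is the RMSE divided by the mean reference value, expressed in percent; the function names and the three-sample example are hypothetical.

```python
import numpy as np

def rmse(y_ref, y_pred):
    """Root mean square error between reference and predicted values."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_ref) ** 2)))

def cc(y_ref, y_pred):
    """Pearson correlation coefficient between reference and prediction."""
    return float(np.corrcoef(y_ref, y_pred)[0, 1])

def rrmse(y_ref, y_pred):
    """RMSE relative to the mean reference value, in percent."""
    return rmse(y_ref, y_pred) / float(np.mean(y_ref)) * 100.0

# Hypothetical MIS contents in wt% (not data from the study).
y_ref = [0.4, 0.6, 0.8]
y_pred = [0.5, 0.6, 0.7]
print(rmse(y_ref, y_pred), cc(y_ref, y_pred), rrmse(y_ref, y_pred))
```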
The IoT-based NIR calibration process requires dividing all of the wastewater samples into a calibration set, a validation set, and a test set. The calibration samples are used to train the LSSVM models; the validation samples are predicted and used to determine the optimal parameters for the LSSVM, biLSSVM, and siLSSVM models. Then, the test samples are used to test the prediction ability of the optimal models. With online detection by different sensors, the number of samples changes dynamically within minutes. Experience shows that a calibration : validation : test ratio of about 2 : 1 : 1 is suitable. Accordingly, the modeling indicators are denoted as RMSE_X, CC_X, and RRMSE_X, where the subscript X is C for calibration, V for validation, and T for test.

Experiments and Discussions
3.1. Data Acquisition. The study was conducted on an industrial production waste discharge line in south China. For the data experiment, a total of 148 samples were collected from 6 chemical plants. The sampling and monitoring operations followed China's national environmental monitoring technical specifications. A volume of 50 mL of wastewater was kept for each sample. The samples were then sealed, stored, and transported to the lab.
As the conventional analysis of MIS is time-consuming and destructive, the NIR spectra were measured in advance. The 148 wastewater samples were scanned one by one using the FOSS NIRSystems 5000 grating spectrometer (produced in Denmark) with an InGaAs accessory. The experiment was operated at a constant temperature of 25 ± 1°C and a constant relative humidity of 46 ± 1%. The spectrum was recorded in the waveband range of 780-2498 nm with the spectral resolution set to 2 nm, which gives 860 digital wavelength variables for each sample spectrum. The NIR spectrum of wastewater shows the comprehensive response of the chemical components, in which the pollution effect is mainly reflected by the MIS indicator. The spectrum is dominated by the water absorption peaks around the wavelengths of 1400 nm and 1950 nm, while the pollutants respond only weakly in the smooth spectral curve.
After the spectrum measurement, the samples were delivered for conventional analysis to determine the MIS content. In the traditional analysis of MIS, the wastewater sample was weighed in advance (denoted as T) and repeatedly filtered using a fine mesh, and the mesh was dried and weighed several times until the weight became constant. Then, the excess weight of the filtered mesh over the clean mesh was taken as the weight of MIS (denoted as t_m). Finally, the MIS content was calculated as t_m/T and expressed as a weight percentage (unit: wt%). The MIS content of the 148 samples falls in the range of 0.38-0.82 wt%, with a mean value of 0.593 wt% and a standard deviation of 0.127 wt%. These data were used as the reference target values in the NIR rapid modeling analysis.
For the experiment, the 148 wastewater samples were divided into 72 samples for calibration, 38 for validation, and 38 for test.
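A partition with these set sizes can be produced as in the sketch below. The text does not state how samples were assigned to the three sets, so the seeded random permutation and the function name `split_indices` are assumptions made only for illustration.

```python
import numpy as np

def split_indices(n_samples, n_cal, n_val, seed=0):
    """Randomly partition sample indices into calibration/validation/test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    return order[:n_cal], order[n_cal:n_cal + n_val], order[n_cal + n_val:]

# 148 samples split 72 / 38 / 38, roughly the 2 : 1 : 1 ratio.
cal_idx, val_idx, test_idx = split_indices(148, n_cal=72, n_val=38)
```

Fixing the seed keeps the partition reproducible, which matters because the validation set drives both parameter tuning and interval selection.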

Determination of MIS Using the Parametric RBF-LSSVM Model
The RBF-LSSVM models were established for the determination of wastewater MIS content. Based on previous machine learning experience [35], it is worthwhile to tune the regularization parameter (γ) and the RBF kernel width (σ²) together, so that the optimal combination of parameters is searched automatically. Careful tuning of (γ, σ) is necessary for finding the minimum predictive RMSE_V in the calibration-validation process. The experiment was designed with γ changing from 10 to 200 in steps of 10 and σ changing from 1 to 15 in steps of 1, giving σ² values of 1², 2² ⋯ 15². On this grid, the RBF-LSSVM model was trained to obtain the predictive RMSE_V value for each combination of (γ, σ). The results are shown in Figure 1.
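The 20 × 15 grid over (γ, σ) can be sketched as below. This is an illustration under assumptions rather than the authors' code: the RBF convention exp(-||x - x_i||² / (2σ²)) is assumed, the helper names are hypothetical, and synthetic data replace the calibration and validation spectra.

```python
from itertools import product

import numpy as np

def rbf_kernel(A, B, sigma):
    """RBF kernel matrix (assumed 2*sigma**2 denominator)."""
    sq = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def lssvm_validation_rmse(X_cal, y_cal, X_val, y_val, gamma, sigma):
    """Train LSSVM on the calibration set; return RMSE on the validation set."""
    n = len(y_cal)
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = 1.0
    system[1:, 0] = 1.0
    system[1:, 1:] = rbf_kernel(X_cal, X_cal, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(system, np.concatenate(([0.0], y_cal)))
    b, alpha = sol[0], sol[1:]
    pred = rbf_kernel(X_val, X_cal, sigma) @ alpha + b
    return float(np.sqrt(np.mean((pred - y_val) ** 2)))

def grid_search(X_cal, y_cal, X_val, y_val, gammas, sigmas):
    """Evaluate every (gamma, sigma) pair and return the best one."""
    results = {
        (g, s): lssvm_validation_rmse(X_cal, y_cal, X_val, y_val, g, s)
        for g, s in product(gammas, sigmas)
    }
    best = min(results, key=results.get)
    return best, results[best], results

# Synthetic stand-in data (not the wastewater spectra).
rng = np.random.default_rng(2)
X_cal, X_val = rng.normal(size=(40, 6)), rng.normal(size=(20, 6))
target = lambda X: np.sin(X[:, 0]) + 0.5 * X[:, 1]
y_cal, y_val = target(X_cal), target(X_val)

gammas = range(10, 201, 10)   # 10, 20, ..., 200
sigmas = range(1, 16)         # 1, 2, ..., 15
best_params, best_rmse, results = grid_search(
    X_cal, y_cal, X_val, y_val, gammas, sigmas
)
```

The `results` dictionary holds one RMSE_V per grid node, which is the kind of surface a plot such as Figure 1 would display.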
We observed from Figure 1 that RMSE_V fluctuates strongly with γ and changes smoothly with σ. Specifically, the optimal RBF-LSSVM model was identified with γ = 120 and σ = 11. The corresponding predictive RMSE_V was minimized at 0.055 wt%, and CC_V was 0.891 (a CC value over 0.85 is regarded as high enough for NIR rapid detection in the fields of agriculture and environment [36]). The validation results indicated that the joint selection of γ and σ is critical for the optimization of the RBF-LSSVM model in NIR quantitative analysis of wastewater MIS content. Nevertheless, variable selection is expected to improve the modeling effect further. Hereafter, LSSVM is used as short for RBF-LSSVM.

Model Improvement with Adaptive Variable Selection.
By subinterval division, the waveband variables were partially selected for testing the LSSVM models. Based on the total of 860 digital wavelengths, the full waveband range was split into 43 subintervals for adaptive machine learning, such that each interval includes 20 wavelength variables. The intervals were of equal width and labeled with serial numbers (see Figure 2). Then, all of the intervals were tested for LSSVM models by adopting the backward interval and synergy interval methods, so as to train the models on different variable combinations. The resulting learning models are denoted as biLSSVM and siLSSVM.
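The interval layout follows from simple arithmetic on the scan settings (780-2498 nm at 2 nm steps), as the sketch below shows. Numbering the intervals from 1 starting at the short-wavelength end is an assumption here, and the helper `interval_range_nm` is hypothetical.

```python
import numpy as np

# Wavelength grid: 780, 782, ..., 2498 nm gives 860 variables.
wavelengths = np.arange(780, 2499, 2)

# 43 equal-width intervals of 20 consecutive variables each.
n_intervals = 43
intervals = np.arange(wavelengths.size).reshape(n_intervals, -1)

def interval_range_nm(number):
    """Wavelength span of an interval, assuming numbering starts at 1."""
    cols = intervals[number - 1]
    return int(wavelengths[cols[0]]), int(wavelengths[cols[-1]])

print(wavelengths.size, intervals.shape, interval_range_nm(1))
```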

Optimization of the biLSSVM Model.
The biLSSVM method combines the grid search of LSSVM parameters with the backward removal of the "worst" waveband intervals. Targeting each of the 43 intervals, the variables in the other 42 intervals were used to establish the LSSVM model, to observe the prediction effect of the target interval. Based on the calibration and validation sample sets, the algorithmic parameters γ and σ were optimized during the LSSVM process. After one closed loop of modeling, each of the 43 intervals has its individual optimal modeling result. In searching for the best combination of intervals, the single interval with the largest RMSE_V was removed from the variable set prepared for the next round, and the remaining intervals were used for remodeling. In our analysis of wastewater MIS, the No. 31 interval was removed first, No. 38 second, No. 6 third, and so on. The interval removal order is shown in Figure 3, together with the corresponding RMSE_V values of the biLSSVM prediction. Figure 3 shows that the biLSSVM model improved at the beginning as intervals were removed. It reached its best after 30 removals, after which RMSE_V increased again. The model was trained with a minimum RMSE_V of 0.044 wt%; the corresponding best combination of variables includes the 13 remaining intervals, which are marked on the NIR spectrum of wastewater (shown in Figure 4). The selected interval combination is taken as the informative quasidiscrete wavelengths for the NIR quantitative determination of wastewater MIS content using the biLSSVM optimization.

Wireless Communications and Mobile Computing
Optimization of the siLSSVM Model.
For the siLSSVM method, the full-range spectrum was likewise divided into 43 equal-width intervals. Each individual interval is predicted with LSSVM parametric optimization. After a closed loop, LSSVM models were further established based on possible combinations of t intervals (t = 2, 3 ⋯ 43). For example, for t = 2, LSSVM models were established for all possible two-interval combinations, and the best case was selected by minimizing the predictive RMSE_V, so as to identify the optimal combination of two intervals. As the number of synergy intervals changes, the optimal siLSSVM model for each t value was identified with its RMSE_V result (see Figure 5). We observe from Figure 5 that the most informative variable combination was obtained with t = 13; namely, 13 specific intervals were selected by siLSSVM as containing informative wavelength variables for the determination of wastewater MIS. They are marked on the full-range spectrum of wastewater (see Figure 6). The selected intervals are combined to establish the LSSVM model, giving an RMSE_V of 0.041 wt% and a CC_V of 0.923. This is slightly better than the result from the biLSSVM optimization model. We also note that some of the intervals selected by siLSSVM coincide with those selected by biLSSVM.
For comparison, the common LSSVM and the conventional PLS models were also evaluated. The optimal results of each method are shown in Table 1. Table 1 shows that the models based on the LSSVM framework generate better prediction results than the PLS-based models. In addition, the models with variable selection by the backward interval and synergy interval techniques perform better in predicting wastewater MIS. Of all the models, siLSSVM has the best training results, with an RMSE_V of 0.041 wt% and a CC_V of 0.923. This correlation coefficient is high enough for NIR spectroscopic detection in the environmental field. In contrast, the best results on the test sample set were obtained with the biLSSVM model (see also Figure 7).
We acknowledge that the test-set correlation coefficients of biLSSVM and siLSSVM did not reach 0.9, although they were very close to it, with an RRMSE_T of approximately 10%. These are encouraging results for real applications. The results demonstrate that the proposed methods, which combine the backward interval and synergy interval techniques with the parametric LSSVM learning model, are feasible for improving the prediction results in rapid determination of the wastewater MIS content using the IoT-based NIR technology.

Conclusions
In this work, the IoT-based NIR spectroscopic technology was used for quantitative detection of wastewater MIS content. The goal was to identify the optimal machine learning model for fast estimation of wastewater pollution. As the modeling innovation, the backward interval and synergy interval techniques were proposed for adaptive variable selection in combination with the parametric LSSVM learning method. The machine learning framework used the RBF kernel for nonlinear mapping, and the regularization parameter and the kernel width were tuned together by a two-dimensional grid search.
The total of 860 digital wavelengths in the full scanning range was split into 43 equal-width subintervals, each containing 20 consecutive wavelength variables. Then, all of the intervals were tested for LSSVM models by backward interval removal and synergy interval combination. By the biLSSVM method, the model was trained with an optimal RMSE_V of 0.044 wt% and a corresponding CC_V of 0.918. The best combination of waveband variables includes the 13 intervals shown in Figure 4. By siLSSVM, the best training model was identified as the combination of the 13 intervals shown in Figure 6. These intervals are combined to form the optimal LSSVM model, giving the minimum RMSE_V of 0.041 wt% and a corresponding CC_V of 0.923. In comparison with the common LSSVM, conventional PLS, biPLS, and siPLS models, the siLSSVM model gave the best prediction on the validation sample set, while the best evaluation model on the test sample set was established by biLSSVM. These prediction results demonstrate that the proposed biLSSVM and siLSSVM frameworks with adaptive variable selection are feasible for improving the model prediction results in rapid determination of the wastewater MIS content by the intelligent IoT-based NIR technology.
These findings indicate that the proposed machine learning framework, which combines the backward interval and synergy interval techniques with the RBF-kernel-based LSSVM algorithm, is a promising improvement for IoT applications that estimate complex targets in wastewater, and that it can contribute to assessing the environmental risk to water safety in industrial fields.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.