Data-Driven Quality Prediction of Batch Processes Based on Minimal-Redundancy-Maximal-Relevance Integrated Convolutional Neural Network

For batch processes, which are extensively applied in modern industry and characterized by nonlinearity and dynamics, quality prediction is essential for obtaining high-quality products and maintaining production safety. However, some quality variables and key performance indicators are difficult to measure online. In addition, mechanism-based models of batch processes are usually hard to obtain because of the strong nonlinearity and dynamics, which makes quality prediction a challenge. With the accumulation of historical process data, data-driven methods for quality prediction have gained increasing attention, among which the convolutional neural network (CNN) is quite successful owing to its automatic extraction of nonlinear features from raw data. Considering that most CNN-based methods mainly take the variety of extracted features into account and ignore the redundancy between them, this paper introduces the minimal-redundancy-maximal-relevance (mRMR) algorithm to select the features obtained by the original CNN and further improves the network with a feature selection layer, forming the proposed method referred to as mRMR-CNN. Then, a quality prediction model is established based on mRMR-CNN, and its effectiveness is verified on the penicillin fermentation process, where the proposed method shows remarkable performance.


Introduction
In modern industry, batch processes are extensively applied to the production of high-value products in areas such as the pharmaceutical, biotechnology, polymer, and semiconductor manufacturing industries [1][2][3][4]. Owing to complex process mechanisms, process uncertainties, and special production requirements, batch processes are characterized by strong nonlinearity and dynamics. For such nonlinear dynamic processes, it is important to ensure high-quality products and a safely running production process, which makes it necessary to monitor the quality variables or key performance indicators of the process. However, constrained by hard sensors or economic efficiency, such variables or indicators are usually difficult to measure [5]. To address this issue, a generic scheme is to establish a prediction model that estimates the values of quality variables from easy-to-measure process variables, which has given rise to numerous quality prediction methods.
The methods for quality prediction are mainly categorized into two types, mechanism-based methods and data-driven methods [6]. Owing to the nonlinearity and complexity of batch processes, it is difficult to establish accurate mathematical models, which greatly inhibits the development of mechanism-based approaches for batch processes [7]. With the accumulation of industrial historical data, which are considered to contain adequate information about batch processes, data-driven methods have become the mainstream in academia; they estimate the values of target quality variables mainly from information in historical data.
Data-driven methods mainly comprise statistical methods and machine learning methods [8], which include partial least squares (PLS) [9], support vector regression (SVR) [10,11], the back-propagation (BP) neural network [12,13], and the autoencoder (AE) [14][15][16]. However, such methods with shallow structures may not fit complex and large-scale industrial data well. Then, with deep learning gaining increasing attention, methods with deep structures have emerged, such as the stacked autoencoder (SAE) [17,18], deep belief network (DBN) [14,19], and convolutional neural network (CNN) [20]. Among all these methods, CNN is well known for its excellent automatic feature extraction capability and has been successfully applied to many fields, such as image recognition, computer vision, and natural language processing [21]. Owing to its outstanding performance and extensive applications, CNN is selected as the basis of the proposed method in this study.
When applying CNN to quality prediction, it is common to form a 2D matrix input by segmenting time-series data composed of different variables along the sampling time. The input fed into the CNN model thus has both a time axis and a variable axis, so the model can consider not only the relationship between locally adjacent variables at the same sampling point but also the relationship between locally adjacent sampling points of the same variable. In other words, CNN can extract local features contained in historical data from both the spatial (or variable) and temporal perspectives. Therefore, CNN can extract various features from industrial historical data and has been widely applied in quality monitoring. Wei et al. [22] developed a soft sensor model based on CNN and the extreme learning machine (ELM) to measure the fill level inside a ball mill, fully exploiting the vibration frequency spectrum. Sun et al. [23] proposed a virtual sensor model using CNN to predict dynamic responses in structural systems and obtained relatively high accuracy. Zhu et al. [24] applied CNN to predicting next-time-step measurements and utilized a moving window to cover time-dependent correlation information, which achieved remarkable performance on an industrial pyrolysis reactor. Shevchik et al. [25] used a spectral CNN to classify data collected by acoustic emission sensors, thus achieving quality monitoring in additive manufacturing. Olivier et al. [26] employed CNN to characterize the feed size of run-of-mine ore and trained the constructed model using transfer learning to reduce the required data. Wang et al. [27] took process dynamics into consideration and integrated the finite impulse response with CNN, which effectively improved the prediction accuracy. Jiang et al. [28] presented a semi-supervised soft sensor to balance the labeled and unlabeled data by constructing 2D data and used CNN to extract the spatial information contained in the 2D data. Yuan et al. [29] comprehensively considered local correlations between variables to form a multichannel CNN for soft sensing, which significantly improved the estimation accuracy. In addition, Yuan et al. [30] also proposed a dynamic CNN to learn hierarchical dynamic nonlinear features, based on which they developed a soft sensor model exploring both the spatial and temporal correlations in industrial data.
However, all these methods try to extract abundant discriminative features while neglecting the redundancy between the extracted features, which may degrade the performance of CNN-based models. This is a common contradiction. On the one hand, it is desired that numerous features be extracted by CNN from historical data, since various representations are needed to form an effective predictor. On the other hand, it cannot be guaranteed that all the obtained features are independent of each other. Furthermore, if certain dependent features cannot provide enough extra useful information for predicting the target variables, they may even add noise [31]. Considering that eliminating irrelevant and redundant features can improve model learning performance [32], it is essential to equip CNN with the capability of selecting features so that it performs better in quality prediction.
Although some works [33][34][35] suggest that L1 regularization can be added to CNN to make the structure sparse, which implicitly performs feature selection, this approach is not straightforward and may not guarantee the required prediction precision. To take feature redundancy into account and select features under guidance, an algorithm named minimal-redundancy-maximal-relevance (mRMR) [36] is introduced into the original CNN model. That is, a CNN model is pretrained to extract plenty of features, and the mRMR algorithm is then applied to the extracted features to form a feature subset that best fits the model. In this way, the original CNN is enriched with a feature selection layer, after which the modified model is retrained to improve the performance.
The main contributions of this paper are summarized in the following three points:
(1) A CNN model, LeNet-5, is employed as the baseline to automatically extract adequate nonlinear discriminative features from given data.
(2) The mRMR algorithm is applied to select the extracted features, and a feature selection layer is then integrated with the original CNN model to form the proposed method, denoted as mRMR-CNN, so that feature redundancy is taken into consideration to improve the performance of the original CNN model.
(3) A quality prediction model based on mRMR-CNN is established, and the effectiveness of the proposed mRMR-CNN method is validated on the penicillin fermentation process, where the proposed method distinguishes itself with a considerable increase in quality prediction precision.
The remainder of this paper is arranged as follows. Section 2 introduces the basic concepts of CNN and mRMR. Then, Section 3 discusses the proposed method in detail, while Section 4 conducts an experiment to validate the proposed method. The final section draws the conclusion of this paper.

Related Works
The basic knowledge of CNN and mRMR is briefly introduced in this section.

CNN.
CNN is a successful deep learning technique proposed by LeCun et al. [20], which simulates the mechanism of the cat's visual cortex [37]. It is well known for its excellent automatic feature extraction ability and is extensively applied in image classification, computer vision, and natural language processing. Recent years have seen the emergence of numerous CNN models, such as AlexNet [38], VGGNet [39], GoogLeNet [40], ResNet [41], and DenseNet [42], all of them exhibiting excellent performance.
However, this subsection discusses the time-tested fundamental CNN structure, namely, LeNet-5, on which the approach proposed in this paper is based. Its basic structure is depicted in Figure 1.
As illustrated in Figure 1, LeNet-5 is composed of two parts, a feature extractor and a classifier (for classification) or regressor (for regression). The former consists of an input layer followed by alternating convolutional layers and subsampling layers (or pooling layers) and is in charge of feature extraction from the input data. The latter contains a couple of fully connected layers, which perform classification or regression based on the features obtained by the extractor [43].
CNN is essentially a type of multilayer feedforward artificial neural network (ANN). However, compared with conventional ANNs, it is unique for its convolutional layers, pooling layers, and usually 2D input data [44]. The convolutional layer, which typically contains several feature mappings, makes CNN special through weight sharing and local receptive fields [20]. More specifically, in the convolutional layer, each feature mapping is connected to the previous layer through a different square convolution kernel; thus, all units of the same feature mapping share the weights contained in the related convolution kernel. Because of such sparse connections, and because the convolution kernel is much smaller than the input, each unit of a feature mapping in the current layer corresponds to only a small zone of the original input image data, which is what the local receptive field means. All these connections rely on the convolution operation, which is expressed below. Assume an input feature mapping of size $w_i \times h_i$ and a convolution kernel of size $f \times f$, where $s$ is the sliding stride of the convolution and no zero padding is used. The dimension $w_o \times h_o$ of the output feature mapping obtained by the convolution operation can be calculated using equations (1) and (2):

$$w_o = \left\lfloor \frac{w_i - f}{s} \right\rfloor + 1, \tag{1}$$

$$h_o = \left\lfloor \frac{h_i - f}{s} \right\rfloor + 1. \tag{2}$$

The basic convolution operation is visually illustrated in Figure 2, and the value $o_{i,j}$ at coordinate $(i, j)$ of the output feature mapping can be expressed as

$$o_{i,j} = \sigma\left(\sum_{k=1}^{f}\sum_{l=1}^{f} w_{k,l}\, x_{(i \times s + k - 1),(j \times s + l - 1)} + b\right), \tag{3}$$

where $w_{k,l}$ represents the weight value at coordinate $(k, l)$ of the convolution kernel, $b$ refers to the bias value, and $x_{(i \times s + k - 1),(j \times s + l - 1)}$ is the unit response at coordinate $(i \times s + k - 1, j \times s + l - 1)$ of the input feature mapping. $\sigma$ denotes the activation function, for which the rectified linear unit (ReLU) is usually chosen in CNN.
The subsampling layer, also referred to as the pooling layer, generally follows the convolutional layer and performs a pooling operation to reduce the spatial resolution and data dimension of the feature mappings derived from the preceding layer [20]. There are mainly two types of pooling operations, average pooling and max pooling, of which the latter is the most widely used. In max pooling, each feature mapping of the previous convolutional layer is implicitly divided into $2 \times 2$ square areas, known as pooling fields, with no overlap and no gap. Then, the maximum response in each area is taken as the corresponding unit of the current pooling layer. Thereby, the number of feature mappings in the pooling layer is identical to that in the preceding convolutional layer. Max pooling is illustrated in Figure 3.
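To make the convolution and max-pooling operations described above concrete, the following is a minimal pure-Python sketch, not the implementation used in the paper. The function names, the 5 x 5 input, and the 2 x 2 kernel are illustrative assumptions, and indices are zero-based here.

```python
def conv_output_size(w_in, h_in, f, s):
    """Output width and height of a convolution with no zero padding."""
    return (w_in - f) // s + 1, (h_in - f) // s + 1


def relu(z):
    """Rectified linear unit activation."""
    return z if z > 0.0 else 0.0


def conv2d(x, kernel, bias=0.0, stride=1):
    """Single-channel convolution (no padding) followed by ReLU.

    x is a list of rows; kernel is an f x f list of rows.
    """
    f = len(kernel)
    h_in, w_in = len(x), len(x[0])
    w_out, h_out = conv_output_size(w_in, h_in, f, stride)
    out = []
    for i in range(h_out):
        row = []
        for j in range(w_out):
            z = bias
            for k in range(f):
                for l in range(f):
                    z += kernel[k][l] * x[i * stride + k][j * stride + l]
            row.append(relu(z))
        out.append(row)
    return out


def max_pool_2x2(x):
    """Non-overlapping 2 x 2 max pooling; a trailing odd row or column is dropped."""
    return [
        [max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
         for j in range(0, len(x[0]) - 1, 2)]
        for i in range(0, len(x) - 1, 2)
    ]


# Illustrative 5 x 5 input and 2 x 2 kernel (made-up values).
x = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 1, 0],
    [3, 0, 1, 2, 1],
    [1, 1, 0, 0, 2],
    [2, 0, 1, 3, 1],
]
kernel = [[1, 0], [0, -1]]
feature_map = conv2d(x, kernel, bias=0.0, stride=1)  # 4 x 4 feature mapping
pooled = max_pool_2x2(feature_map)                   # 2 x 2 after pooling
print(conv_output_size(5, 5, 2, 1), pooled)
```

With a stride of 1 and a 2 x 2 kernel, the 5 x 5 input yields a 4 x 4 feature mapping, which max pooling reduces to 2 x 2.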

mRMR.
mRMR is an effective feature selection method proposed for classification tasks by Peng et al. [36], which is based on the mutual information between variables. Given two random variables $X$ and $Y$ with probability density functions $p(x)$ and $p(y)$ and joint density $p(x, y)$, the mutual information $I(X; Y)$ between the two variables is defined as

$$I(X; Y) = \iint p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}\,dx\,dy. \tag{4}$$

When implementing mRMR, the purpose is to find a feature subset $S$ with $m$ features that solves the following optimization problem:

$$\max_{S}\ \Phi(D, R), \quad \Phi = D - R, \tag{5}$$

in which $D$ and $R$ are given by

$$D = \frac{1}{|S|}\sum_{x_i \in S} I(x_i; c), \tag{6}$$

$$R = \frac{1}{|S|^2}\sum_{x_i, x_j \in S} I(x_i; x_j), \tag{7}$$

where $|S|$ indicates the number of features in set $S$. $D$ quantifies the average relevance between each feature variable $x_i$ and the related class $c$, while $R$ quantifies the redundancy between feature variables within the given feature subset. Therefore, optimizing the problem defined by (5) essentially maximizes the relevance between the variables and the related class while minimizing the redundancy among the variables, under the guidance of which unnecessary features are screened out and a feature subset that fits the target task well is retained. Although mRMR was originally developed for classification tasks, it works for regression tasks as well. The difference lies in calculating the mutual information between each feature variable $x_i$ and the target variable $y$, instead of between $x_i$ and the related class $c$, to quantify the relevance $D$. This makes the expression of $D$ slightly different:

$$D = \frac{1}{|S|}\sum_{x_i \in S} I(x_i; y). \tag{8}$$

Mathematical Problems in Engineering
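As an illustration of how the relevance D and redundancy R interact, the sketch below implements a greedy, incremental mRMR search with the difference criterion D - R in pure Python. It is a stand-in under stated assumptions, not the MATLAB code used in this study: mutual information is estimated from equal-width histograms (`discretize` with a hypothetical `n_bins`), and the toy data are made up so that one feature is an exact duplicate of another.

```python
import math
from collections import Counter


def discretize(values, n_bins=4):
    """Map continuous values to equal-width bin indices 0..n_bins-1 (an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = 0.0
    for (va, vb), c in count_ab.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written with raw counts
        mi += (c / n) * math.log(c * n / (count_a[va] * count_b[vb]))
    return mi


def mrmr_select(features, target, n_select, n_bins=4):
    """Greedy mRMR for regression; features is a list of columns.

    Returns the indices of the selected columns in selection order.
    """
    cols = [discretize(col, n_bins) for col in features]
    y = discretize(target, n_bins)
    relevance = [mutual_information(col, y) for col in cols]
    # Start from the single most relevant feature.
    selected = [max(range(len(cols)), key=lambda i: relevance[i])]
    while len(selected) < n_select:
        best, best_score = None, -math.inf
        for i in range(len(cols)):
            if i in selected:
                continue
            redundancy = sum(
                mutual_information(cols[i], cols[j]) for j in selected
            ) / len(selected)
            score = relevance[i] - redundancy  # relevance minus redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected


# Toy data: feature 1 duplicates feature 0, feature 2 carries new information.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
target = [2 * ai + bi for ai, bi in zip(a, b)]
chosen = mrmr_select([a, list(a), b], target, n_select=2)
print(chosen)
```

In this toy case all three features are equally relevant to the target, so the redundancy term decides the second pick: the duplicate of an already selected feature is skipped in favor of the feature that adds new information.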

Proposed Method
Owing to the nonlinearity and complexity of batch processes, it is necessary to extract abundant features from the historical data so as to cover the most discriminative ones that fit the desired estimation well; as a result, more features are usually obtained than are genuinely required. However, the traditional CNN lacks the ability to perform feature selection directly, so how to enable CNN to decrease feature redundancy is a crucial problem. Although several works report that L1 regularization can make the structure of CNN sparse and implicitly attain feature reduction, it is not intuitive and may not guarantee prediction precision. Hence, this paper proposes an improved CNN model combined with minimal-redundancy-maximal-relevance, referred to as mRMR-CNN, which directly selects the extracted features by adding a feature selection layer to the original CNN so that its performance can be improved.
Specifically, a CNN model is first constructed and pretrained until it slightly overfits the given training samples, which empirically implies that the model achieves relatively acceptable performance and that its last convolutional (or pooling) layer covers numerous discriminative features. Since these features are automatically extracted and not necessarily independent of each other, mRMR is used here to perform feature selection and reduce redundant information. mRMR takes the CNN-extracted features and the expected model outputs as its input. Given a feature preservation proportion, mRMR derives from the extracted features a feature subset that fits the given data well and has the least redundancy. The feature preservation proportion indicates the number of features to be retained and should be determined experimentally. After mRMR is applied, a feature selection layer is added to the CNN, directly following the last convolutional (or pooling) layer, to achieve feature reduction. The feature selection layer mainly contains a constant weight matrix of size $N_e \times N_s$ filled with 0 and 1, where $N_e$ is the number of CNN-extracted features and $N_s$ is the number of mRMR-selected features. Each column of the weight matrix picks out one feature; thus, each column contains exactly one element equal to 1, while the remaining elements are all zeros. Finally, the modified CNN, mRMR-CNN, is retrained to attain better performance. Figure 4 illustrates the main idea of the proposed method and divides it into pretraining and retraining parts.
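The constant 0/1 weight matrix of the feature selection layer can be illustrated with a small pure-Python sketch. The helper names, the feature values, and the selected indices below are hypothetical; in practice the indices would come from the mRMR step.

```python
def selection_matrix(n_extracted, selected_indices):
    """Constant N_e x N_s matrix of 0s and 1s.

    Column c contains a single 1 in the row of the feature it picks out.
    """
    matrix = [[0] * len(selected_indices) for _ in range(n_extracted)]
    for col, feat in enumerate(selected_indices):
        matrix[feat][col] = 1
    return matrix


def apply_selection(features, matrix):
    """Multiply a length-N_e feature vector by the N_e x N_s selection matrix."""
    n_s = len(matrix[0])
    return [
        sum(features[r] * matrix[r][c] for r in range(len(features)))
        for c in range(n_s)
    ]


extracted = [0.7, 0.1, 0.4, 0.9, 0.3]  # made-up CNN-extracted features, N_e = 5
keep = [3, 0]                          # hypothetical mRMR result, N_s = 2
W = selection_matrix(len(extracted), keep)
selected = apply_selection(extracted, W)
print(W, selected)
```

Because the matrix is constant, it has no trainable parameters; it only routes the retained features to the first fully connected layer.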
In addition, the detailed procedures of the proposed mRMR-CNN are described as follows:
(1) Step 1: preprocess raw data samples, scale them to [0, 1], and convert them into 2D matrix data.
(2) Step 2: construct a classical CNN framework inheriting the structure of LeNet-5.
(3) Step 3: keep revising the parameters and structure of the constructed model and pretrain the model to a relatively acceptable performance so that most of the discriminative features can be obtained.
(4) Step 4: recover the features generated in the last convolutional layer (or pooling layer) of the CNN model constructed in step 2 and pretrained in step 3, and perform the mRMR algorithm to determine the features to be retained for the given preservation proportion or percentage.
(5) Step 5: based on the results of step 4, add a feature selection layer between the last convolutional layer and the first fully connected layer. Then, retrain and fine-tune the modified CNN model, mRMR-CNN.
(6) Step 6: use the grid search approach to determine the best feature subset and eventually fix the structure of the mRMR-CNN with the best performance, which is a loop that repeats steps 3-5 until all the given feature preservation proportions are examined.
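Step 1 of the procedure above (scaling to [0, 1] and forming 2D matrices) can be sketched as follows. This is a minimal illustration under assumptions: the window length of 3 and the toy measurements are made up, since the window length is not stated in this section, and in practice the minimum and maximum would be taken from the training data only.

```python
def min_max_scale(column):
    """Scale one variable to [0, 1]; a constant column maps to zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def scale_columns(samples):
    """samples is a list of rows (time steps), each a list of variables."""
    cols = [min_max_scale(list(c)) for c in zip(*samples)]
    return [list(row) for row in zip(*cols)]


def make_windows(samples, window):
    """Slide a window along the time axis.

    Returns one (window x n_variables) matrix per admissible time step.
    """
    return [
        samples[t - window + 1: t + 1]
        for t in range(window - 1, len(samples))
    ]


# Toy data: 5 sampling points of 2 variables (made-up values).
raw = [[10, 1.0], [20, 3.0], [30, 2.0], [40, 5.0], [50, 4.0]]
scaled = scale_columns(raw)
windows = make_windows(scaled, window=3)  # hypothetical window length
print(len(windows), windows[0])
```

Each resulting matrix has a time axis (rows) and a variable axis (columns), which is the 2D input form fed to the CNN.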
To further describe the proposed method, the final structure of mRMR-CNN is briefly depicted in Figure 5, and a more detailed architecture is given in Table 1 of Section 4.2. The main difference between the proposed mRMR-CNN and the original CNN is an additional feature selection layer succeeding the feature extractor, which is determined by the mRMR algorithm and the CNN-extracted features.
Based on the above presentation of the proposed method, a quality prediction model is established for batch processes. Its scheme is depicted in Figure 6 and consists of two phases, a training phase and a testing phase. The training phase mainly follows the procedures of the proposed method to determine the structure of mRMR-CNN, while the testing phase is simply an application of the fixed mRMR-CNN.
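The structure-determination loop of the training phase (the grid search over feature preservation proportions) can be sketched as below. The `validate` callable is hypothetical: it stands for selecting features at proportion p, retraining mRMR-CNN, and returning the validation RMSE. The toy stand-in and the candidate grid used here are made up and do not reproduce any result of this study.

```python
def grid_search(proportions, validate):
    """Evaluate every candidate proportion and keep the one with the lowest RMSE.

    validate(p) is a hypothetical callable returning a validation RMSE.
    """
    results = {p: validate(p) for p in proportions}
    best = min(results, key=results.get)
    return best, results


def toy_validate(p):
    """Toy stand-in for 'select features, retrain, validate' (not real results)."""
    return abs(p - 0.5) + 0.03


candidates = [0.2, 0.35, 0.5, 0.65, 0.8, 1.0]  # hypothetical grid
best_p, scores = grid_search(candidates, toy_validate)
print(best_p, scores[best_p])
```

Keeping the per-proportion scores alongside the best value makes it easy to tabulate the whole sweep afterwards.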

Case Study
The effectiveness of mRMR-CNN for quality prediction modeling is verified on the penicillin fermentation process with the simulation software PenSim v2.0 [45]. Two widely used indices, the root mean square error (RMSE) and the coefficient of determination $R^2$, are introduced to evaluate the experimental results. The two indices are defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N_T}\sum_{i=1}^{N_T}\left(y_i - \hat{y}_i\right)^2}, \tag{9}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N_T}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N_T}\left(y_i - \bar{y}\right)^2}, \tag{10}$$

where $N_T$ stands for the number of samples, $y_i$ represents the real value, $\hat{y}_i$ denotes the estimated value, and $\bar{y}$ is the mean of all the real values.
To implement the experiment, all code is written in Python 3.7 under the deep learning framework TensorFlow 2.1.0, except that the mRMR-related code is written in MATLAB. All programs are run on Windows 10 64-bit Enterprise Edition with an Intel (R) Xeon (R) Silver 4110 CPU @ 2.10 GHz, 32.0 GB RAM, and an NVIDIA Quadro P620 2 GB GPU.
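The two evaluation indices can be computed with a few lines of pure Python. The sketch below follows the standard definitions of RMSE and the coefficient of determination; the function names and the toy values are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between real and estimated values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values (made up): only the last prediction is off, by 2.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
print(rmse(y_true, y_pred), r2_score(y_true, y_pred))
```

A lower RMSE and an $R^2$ closer to 1 both indicate better predictions; $R^2$ can become negative when a model does worse than predicting the mean.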

Penicillin Fermentation Process and Experiment Configuration.
The penicillin fermentation process is a benchmark widely used for validating the performance of batch process quality prediction modeling. Its simulation software can be downloaded from http://www.chee.iit.edu/~control/software.html. Figure 7 displays the flowchart of the penicillin fermentation process. Based on previous works [45][46][47], the difficult-to-measure penicillin concentration is the key quality variable of the penicillin fermentation process, and the remaining sixteen variables are regarded as auxiliary variables: aeration rate (L/h), agitator power (W), substrate feed rate (L/h), substrate feed temperature (K), substrate concentration (g/L), dissolved oxygen concentration (g/L), biomass concentration (g/L), culture volume (L), carbon dioxide concentration (mmol/L), pH, fermenter temperature (K), generated heat (cal), acid flow rate (L/h), base flow rate (L/h), cold water flow rate (L/h), and hot water flow rate (L/h). In total, 25 groups of data comprising 10000 samples are obtained with the parameter settings suggested by related works [46][47][48]. The obtained data are divided by production batch into a training data set of 8000 samples and a testing data set of 2000 samples. Among the training samples, 7200 are randomly selected for training the constructed model, and the remaining 800 are used for validating and fine-tuning the model.
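The data partition described above can be sketched with index bookkeeping alone. The sketch assumes that the 25 groups are batches of equal length (10000 / 25 = 400 samples each) and that the first 20 batches are used for training; neither detail is stated in the text, and the random seed is arbitrary.

```python
import random

# Assumption: 25 batches of equal length, 25 x 400 = 10000 samples.
N_BATCHES, SAMPLES_PER_BATCH = 25, 400

batches = list(range(N_BATCHES))
# Hypothetical choice of which batches go where; the split is by whole batches.
train_batches, test_batches = batches[:20], batches[20:]

# Each sample is identified by (batch index, time index within the batch).
train_idx = [(b, t) for b in train_batches for t in range(SAMPLES_PER_BATCH)]
test_idx = [(b, t) for b in test_batches for t in range(SAMPLES_PER_BATCH)]

# Random 7200 / 800 split of the training samples into fitting and validation.
rng = random.Random(0)  # arbitrary fixed seed for reproducibility
shuffled = train_idx[:]
rng.shuffle(shuffled)
fit_idx, val_idx = shuffled[:7200], shuffled[7200:]

print(len(train_idx), len(test_idx), len(fit_idx), len(val_idx))
```

Splitting by whole batches keeps every testing batch unseen during training, whereas the validation samples are drawn at random from within the training batches.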

Structure Determination of mRMR-CNN.
This subsection determines the structure of the mRMR-CNN with the best performance. The well-known dropout technique is adopted in the fully connected layer to ensure the prediction capability, and it is also applied to the other neural network-based methods used in this paper unless otherwise specified. The Adam algorithm is adopted to achieve error back-propagation and learn the trainable parameters. Specifically, the learnable parameters, including the weights in the convolution kernels and fully connected layers, are iteratively updated using the following equations:

$$m_i = \beta_1 m_{i-1} + (1 - \beta_1)\frac{\partial L}{\partial w_{i-1}}, \tag{11}$$

$$v_i = \beta_2 v_{i-1} + (1 - \beta_2)\left(\frac{\partial L}{\partial w_{i-1}}\right)^2, \tag{12}$$

$$\hat{m}_i = \frac{m_i}{1 - \beta_1^{\,i}}, \quad \hat{v}_i = \frac{v_i}{1 - \beta_2^{\,i}}, \tag{13}$$

$$w_i = w_{i-1} - \alpha\,\frac{\hat{m}_i}{\sqrt{\hat{v}_i} + \varepsilon}, \tag{14}$$

where $i$ is the iteration index, $w$ is the weight to update, $L$ is the loss function, $m$ is the first moment variable, and $v$ is the second raw moment variable. Meanwhile, $\alpha$ is the learning rate, $\varepsilon$ is an extremely small positive number that prevents division by zero, and $\beta_1$ and $\beta_2$ are the exponential decay rates of the moment variables. Here, the hyper-parameters use the default settings, $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\varepsilon = 10^{-7}$. Then, the average validating prediction values on the test data set are adopted as the final prediction values, on which the calculation of RMSE and $R^2$ is based. Unless otherwise specified, all neural network-based models employed in this study take the same strategy. Detailed RMSE and $R^2$ values under varied feature preservation proportions are given in Table 2, where the best results are highlighted in bold font. The results indicate that the validating RMSE is lowest and the coefficient of determination $R^2$ is highest when the feature preservation proportion is set to 0.35, which means that mRMR-CNN achieves its best performance under the given conditions. The specific structure of the finally fixed mRMR-CNN is given in Table 1. It is noticeable that feature selection decreases the number of nodes connected to those in the first fully connected layer. To put it differently, the number of trainable parameters in the first fully connected layer would increase to 38,440 if the feature selection layer were dropped. Therefore, there is a 64.9% reduction in trainable parameters in the first fully connected layer when compared with CNN.
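The quoted parameter reduction can be checked with simple arithmetic. The layer widths used below are assumptions, since Table 1 is not reproduced here: a first fully connected layer of 40 units fed by 960 extracted features is chosen because 960 x 40 + 40 reproduces the quoted 38,440, and a preservation proportion of 0.35 then leaves 336 inputs.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units


# Assumed widths (not given in this section), chosen to match the quoted 38,440.
n_extracted = 960
n_units = 40
proportion = 0.35  # best feature preservation proportion reported above

n_selected = round(n_extracted * proportion)
without_selection = dense_params(n_extracted, n_units)
with_selection = dense_params(n_selected, n_units)
reduction = (without_selection - with_selection) / without_selection

print(n_selected, without_selection, with_selection, round(100 * reduction, 1))
```

Under these assumptions the first fully connected layer shrinks from 38,440 to 13,480 trainable parameters, which agrees with the 64.9% reduction stated above.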
To visually present the validating results, Figures 8 and 9 are drawn to display part of the prediction results and prediction errors under different feature preservation proportions, respectively. In Figure 8, the closer the scattered points (painted in blue) distribute to the reference line (the red solid

Comparison between Different Methods.
To further validate the performance of proposed method, it is compared with SVR, AE, BP, SAE, original CNN, and CNN-L1, which applies L1 regularization to the first fully connected layer. SVR is a traditional machine learning method, and the rest are neural network-based methods.
In this paper, the basic parameter settings for SVR are a sensitivity $\epsilon$ of 0.08, a penalty factor $C$ of 1000, and the radial basis function (RBF) kernel with a kernel coefficient $\gamma$ of 0.02. BP is a three-layer neural network (excluding the input layer) with (336, 40, 1) nodes in each layer. AE has the structure 16-13-16, and SAE stacks two AEs, following the structure 16-13-11-13-16. The features learned by AE and SAE are then fed into a three-layer neural network with a 384-64-1 structure to regress the target variable. Both ReLU and sigmoid functions are employed in AE and SAE. The specific structure of mRMR-CNN is the same as that listed in Table 1. The baseline CNN is the same as mRMR-CNN with the feature selection layer removed. CNN-L1 differs from CNN in that an L1 regularizer is adopted in the first fully connected layer, with the regularization coefficient set to 0.0001. All neural network-based methods here are trained for 300 epochs, with Adam used to optimize the trainable parameters. The other relevant hyper-parameters are $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\varepsilon = 10^{-7}$, where $\alpha$ is the learning rate.
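For readers unfamiliar with Adam, the following is a minimal scalar sketch of the standard Adam update with the hyper-parameters listed above. It is a textbook form with bias correction, not the TensorFlow implementation used in the experiments, and the quadratic toy loss is made up for illustration.

```python
import math


def adam_step(w, grad, m, v, i, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update of a scalar weight; i is the 1-based iteration index."""
    m = beta1 * m + (1.0 - beta1) * grad         # first moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad  # second raw moment estimate
    m_hat = m / (1.0 - beta1 ** i)               # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** i)               # bias-corrected second moment
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Toy loss L(w) = (w - 3)^2 with gradient 2 * (w - 3); starting point is arbitrary.
w, m, v = 0.0, 0.0, 0.0
for i in range(1, 201):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, i)
print(w)
```

With these settings each early step moves the weight by roughly the learning rate regardless of the gradient magnitude, which is why the weight has only travelled a short distance toward the minimum after 200 iterations.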
The RMSE and $R^2$ of training and testing with the different approaches are listed in Tables 3 and 4, respectively. The proposed mRMR-CNN outperforms the other methods on the testing data set, with the lowest RMSE of 0.0249 and the highest $R^2$ of 0.9893, while SVR performs worst on both the training and testing data sets. It is worth noting that CNN reaches almost the same performance as mRMR-CNN on the training data set but is 3.4% worse than mRMR-CNN on the testing data set in terms of $R^2$, and mRMR-CNN is 50.8% better than CNN on the testing samples in terms of RMSE. The performance of CNN-L1 is acceptable on both the training and testing samples but slightly worse than that of mRMR-CNN. To complement the performance results, Table 5 provides the prediction time per sample consumed by CNN, CNN-L1, and mRMR-CNN during testing. It can be seen that mRMR-CNN makes predictions slightly faster than the other two approaches.
Additionally, the prediction results of the different methods are illustrated in Figures 10 and 11. In Figure 10, a shorter average distance from the scattered points to the reference line indicates better prediction results. It can be seen that the scattered points of mRMR-CNN lie densely close to the reference line, while those of the other approaches are either sparser or far away from the reference line. In Figure 11, the tracking results given by SVR, BP, AE, and CNN-L1 exhibit several sharp leaps when the real output grows from zero. Meanwhile, SAE seems to make noisy predictions, resulting in sustained small fluctuations. In contrast, the tracking curves obtained by CNN and mRMR-CNN are smoother and more stable. In particular, mRMR-CNN obtains predictions closer to the real values.
To further investigate the prediction results, the prediction errors and their corresponding distributions under different methods are shown in Figures 12 and 13, [...] CNN-L1), which could be attributed to the excellent feature extraction ability of CNN and the explicit feature selection in the training stage of mRMR-CNN. Other approaches except CNN-L1 fail either to capture valid features or to reduce feature redundancy, resulting in unsatisfactory performance. Although CNN-L1 obtains acceptable performance, it suffers from subtle instability in prediction, probably because its feature selection is implicit. Based on the above analyses, it can be concluded that mRMR-CNN, which explicitly equips CNN with the ability of feature selection, can effectively improve quality prediction precision. This also implies the significance of feature selection in CNN-based models, in consideration of feature redundancy.
Table note: bold font indicates the best result, i.e., the shortest testing prediction time.

Conclusion
In this study, a novel mRMR-CNN approach is proposed to model quality prediction for batch processes. Its key idea is to use mRMR to remove redundant features obtained by the original CNN so that a feature selection layer can be added to the original CNN to enhance the quality prediction precision. The performance of the proposed method is verified on the penicillin fermentation process, where the results indicate that the proposed method achieves a significant improvement over the original CNN. Furthermore, for the penicillin fermentation process, the performance of the proposed method is superior to that of methods such as CNN-L1, SAE, and SVR. Additionally, mRMR-CNN can also greatly decrease the number of trainable parameters compared with the original CNN, although this is not the concern of this study.
However, as mentioned before, the training stage of the proposed method is divided into two phases, a pretraining phase and a retraining phase. Therefore, the proposed method consumes more training time than the traditional CNN and CNN-L1 methods. Future work may focus on how to dynamically integrate mRMR into the training stage of CNN to enhance model performance and shorten the training time.

Data Availability
The training and testing data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest.