Online Semisupervised Learning Approach for Quality Monitoring of Complex Manufacturing Process



Introduction
1.1. Background. Predictive maintenance has attracted increasing interest from both academia and industry because it offers optimization of a machine's life cycle, accurate planning of machine maintenance, and prevention of unnecessary downtime and product wastage [1]. In the realm of tool condition monitoring, replacing a tool too frequently not only leads to expensive maintenance cost but also interrupts the production cycle. On the other hand, blunt tools incur high energy consumption due to the application of high cutting force or undermine the surface finish. time information of product's quality [2]. Compared to the traditional first-principle approach, data-driven quality monitoring cuts down the development time significantly. It relies on a dataset collected from sensors or cameras installed at the end of the production line, which is preprocessed via signal processing and feature extraction techniques to produce meaningful features and then used to build a predictive model.

Related Works.
In-depth study has been devoted to developing reliable quality monitoring approaches. In [3], the tool condition of the metal-turning process is predicted using neural networks. A fuzzy neural network is utilized to predict the tool wear of the ball-nose end-milling process using vibration data [4]. In [5,6], a fault detection approach for the rolling mill process is proposed using an all-coverage data-driven approach that makes it possible to integrate many sensors. The rise of deep learning, whose automatic feature engineering step extracts natural features, simplifies data-driven quality monitoring by bypassing the complex feature extraction step. In [7], convolutional neural networks based on ResNet50 are put forward to perform quality monitoring in laser-based manufacturing processes. A stacked sparse autoencoder (SSAE) combined with a genetic algorithm to tune its parameters is proposed to determine the laser welding quality [8]. Despite their success in various manufacturing applications, such approaches are offline in nature and fixed once deployed, and are thus unable to adapt to rapidly changing process parameters. Their iterative training process is not memory-efficient and does not keep pace with high-speed manufacturing processes. A complete retraining process from scratch is required to handle a process change.
The online quality classification approach has been advanced in [9], where GEN-SMART-EFS is combined with the incremental partial least squares (iPLS) method for feature selection to monitor the quality of microfluidic chips. An extension of this work is presented in [10], where a forgetting strategy is implemented to handle the concept drift and a multiobjective evolutionary computation approach is used for process optimization. Another approach for prediction of tool wear in the metal-turning process is proposed in [11]. It is based on the Parsimonious Ensemble+ (pENsemble+) algorithm, which makes use of online active learning to handle the issue of label scarcity.

Our Approach.
The area of online quality classification still deserves investigation because existing methods are still far from being truly autonomous. They are mostly developed from the fully supervised learning principle, which necessitates considerable labelling effort in streaming environments. They depend substantially on operators to fully annotate data samples for model updates, notably in high-speed production processes. In [12], a semisupervised deep learning approach is proposed for quality monitoring tasks using the stacked autoencoder approach. However, this approach is not designed for streaming environments. Another approach is proposed in [13] for online semisupervised quality monitoring using the notion of weighted principal component regression. This approach is, however, a non-deep learning approach. Another open issue lies in the feature extraction step, which is often application-specific [14] and calls for intensive offline phases. Although deep learning solutions, where the concept of deep features is utilized to bypass the complex feature engineering step, have started to attract research interest, they are built upon an offline training process and thus become outdated quickly under the nonstationary traits of manufacturing processes. Furthermore, they are developed under a fully supervised working principle incurring considerable labelling cost. Another issue lies in the existence of the many-to-one label relationship [15], where a batch of data is associated with a single and constant class label. This problem might lead to overfitting to a particular class, or to a loss of granularity if a batch of data is combined into a single instance. It is frequently found in condition monitoring problems, in which a quality check is only performed after the whole lot is produced. In summary, there exists a strong demand for an online semisupervised deep learning algorithm for quality monitoring.
Such an algorithm is capable of learning from streaming data without retraining from scratch while bypassing a complex feature engineering phase. That is, a new concept arising due to changing environments can be quickly handled without compromising complexity while natural features are extracted on the fly.
An online semisupervised deep neural network, namely, Parsimonious Network++ (ParsNet++), is proposed to undertake real-time learning under scarcity of labelled samples for online quality monitoring in the injection molding process [16] and in the industrial transfer molding process. ParsNet++ forms a significant extension of a recently developed algorithm for semisupervised learning of high-pace data streams, Parsimonious Network (ParsNet) [17]. ParsNet++ is capable of starting its learning process from scratch with no predefined structure, while its hidden nodes are automatically grown and pruned from data streams to overcome the concept drift. It handles partially labelled data streams under two settings: random access of ground truth and infinitely delayed access of ground truth.
The key feature exists in the autoregularization method dealing with the accumulation of mistakes due to noisy pseudolabels. The underlying innovation of ParsNet++ lies in its feature extraction layer coping with raw samples, where 1D convolutional layers are integrated to deal with multivariate time-series data collected from sensors and with the many-to-one label relationship.
This property enables skipping a complex feature engineering step because of its aptitude in extracting natural features. The feature extraction layer is structured as stacked convolutional layers generating deep features to be fed to the fully connected layer. Furthermore, the fully connected layer is structured as a self-evolving single-hidden-layer neural network to handle process change. The structural learning mechanism of ParsNet++ is driven by the network significance (NS) method derived from the bias-variance decomposition. It differs from the original NS method in [18] through the presence of the autonomous clustering mechanism (ACM) estimating the probability density function. ACM addresses the problem of an obsolete probability density function when concept drift occurs, while also relaxing the strict normal distribution assumption, which does not fit real-world cases. Unlike conventional clustering techniques, ACM features a self-evolving property that makes automatic generation and pruning of clusters possible. ACM distinguishes itself from the AGMM of the original ParsNet, which is often unstable in high-input-dimension cases. The parameter learning phase is carried out as a joint optimization problem minimizing both the reconstruction loss and the discriminative loss, coupled with an autoregularization mechanism.
That is, the regularization process is derived from the concept of synaptic intelligence (SI) proposed to prevent the catastrophic forgetting problem [19]. It calculates the parameter importance using the accumulated gradient of network synapses. This technique is generalized here, where it is used to memorize the optimal network parameters induced by the clean labels. The label enrichment method is carried out via the label augmentation mechanism, where originally labelled samples are perturbed by injecting controlled noise while leaving their labels unchanged. By extension, the self-labelling mechanism is carried out to generate pseudolabels for unlabelled samples. A pseudolabel is inferred from the predictive outputs of ACM and the network itself if both of them are confident in their own predictions.
Autonomous quality monitoring with weak supervision is formalised under two settings: random access of ground truth and infinitely delayed access of ground truth. The former case portrays partially labelled data streams where only a fraction of data samples possess true class labels. The latter case goes one step further, where labelled samples are served only during the warm-up phase, leaving the rest unlabelled. Furthermore, the quality monitoring problem consists of two scenarios: current batch prediction and one-step-ahead prediction. The current batch prediction is meant to predict the current product quality, whereas the one-step-ahead prediction aims to forecast the product quality for the next data stream. Both are carried out under the prequential test-then-train protocol, the standard simulation protocol of data streams, and are simulated using real-world use cases of an injection molding machine and an industrial transfer molding machine from our own project. Our rigorous numerical study demonstrates the success of ParsNet++ for online quality classification under weak supervision, where it delivers the most encouraging results even compared to fully supervised competitors.
In summary, this paper delivers four major contributions: (1) This paper presents ParsNet++ to handle online quality classification of the injection molding process and the industrial transfer molding process under semisupervised environments. That is, the semisupervised environments are induced by both random access of ground truth and infinitely delayed access of ground truth. (2) This paper offers an extension of ParsNet [17] where a 1D convolutional layer is introduced to address the issue of feature extraction and the many-to-one label relationship problem. (3) The autonomous clustering mechanism (ACM) is developed as a flexible density estimation approach navigating the structural learning phase. ACM replaces the role of AGMM in the original ParsNet [17], which suffers from execution issues in high-dimensional problems. Furthermore, ACM incurs fewer parameters than AGMM. (4) The codes of ParsNet++, raw numerical results, and the injection molding dataset are made publicly available at https://github.com/ContinualAL/ParsNetPlus to enable further study of the proposed research topic. The remainder of this paper is structured as follows: Section 2 discusses the problem formulation; Section 3 outlines the learning policy of ParsNet++; Section 4 elaborates the injection molding machine; our numerical study is explained in Section 5; and some concluding remarks are drawn in Section 6.

Problem Definition
Learning from data streams is defined as a learning problem over never-ending data batches $B_1, B_2, B_3, \ldots, B_K$, where $K$ is the number of data streams and is unknown in practice. This property demands a one-scan learning scheme where a data stream is discarded once learned to keep the computational and memory complexities at a low level. A data stream comprises data samples having no label, where $X_k$ denotes the input data batch while $x_t \in \mathbb{R}^u$ denotes an input vector. $T$ and $u$ are, respectively, the batch size and the input dimension. In the fully supervised learning setting, the ground truth $y_t \in \{l_1, l_2, \ldots, l_m\}$, where $m$ is the output dimension, can be instantly elicited. This assumption is unrealistic, notably in the context of quality classification. Some delay is expected because the product quality is examined through visual inspection. The semisupervised data stream is formalised here under two settings: random (sporadic) access to ground truth and infinitely delayed access to ground truth.
Random access to ground truth: this case describes a situation where the operator labels data samples sporadically, leading to partially labelled data streams. That is, a true class label $y_t$ arrives in a random fashion. In other words, $B_k$ is only partially labelled with the target label. Infinitely delayed access to ground truth: the second case is more stringent than the first one, where access to the true class label is only given for prerecorded samples fed in the warm-up period before the process runs, leaving the rest unlabelled. In other words, only initial labels are provided. Specifically, only the first data batch $B_1$ is labelled without changing the data order.
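The two label-access settings can be illustrated with a small sketch that only decides which samples of a batch expose their true label. The function names, the `warmup_batches` parameter, and the use of NumPy are assumptions made for illustration; they are not taken from the ParsNet++ code base.

```python
import numpy as np

def random_access_mask(batch_size, labelled_fraction, rng):
    """Random access: a random subset of the batch receives a true label."""
    n_labelled = int(round(labelled_fraction * batch_size))
    mask = np.zeros(batch_size, dtype=bool)
    mask[rng.choice(batch_size, size=n_labelled, replace=False)] = True
    return mask

def infinite_delay_mask(batch_index, batch_size, warmup_batches=1):
    """Infinite delay: only the warm-up batches are labelled (assumed: the first one)."""
    return np.full(batch_size, batch_index < warmup_batches, dtype=bool)

rng = np.random.default_rng(0)
sporadic = random_access_mask(batch_size=10, labelled_fraction=0.5, rng=rng)
warmup = infinite_delay_mask(batch_index=0, batch_size=10)
later = infinite_delay_mask(batch_index=3, batch_size=10)
```

In both cases the data order is left untouched; only the visibility of the label changes.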

Complexity
As with conventional data streams, semisupervised data streams do not follow static and predictable data distributions; they contain concept drifts. That is, the data distribution changes over time, resulting in a change of the joint probability distribution $P(X, Y)_t \neq P(X, Y)_{t+1}$. This requires a model which can adapt to concept drifts with or without the presence of true class labels. The concept drift is induced in our experiment with the injection molding machine by varying the holding pressure and the injection speed to 900, 700, 500, 300, 100 psi and 60, 70, 80, 90, 100 rpm, respectively. The online quality classification problem is presented as a multiclass classification problem with three classes, namely, good, weaving, and short-forming. The numbers of data samples in the three classes are 1008, 1074, and 870, respectively. This problem is guided by 48 input attributes recording different machine parameters.

Learning Policy of ParsNet++
An overview of ParsNet++'s learning policy is depicted in Algorithm 1. It starts from the learning process of ACM estimating the complex probability density function $p(x)$ and determining the addition factor of hidden nodes $M$. Note that ParsNet++ directly injects $M$ hidden nodes if the hidden node growing condition is satisfied. Furthermore, ACM itself is flexible to changing learning environments $p(x)_t \neq p(x)_{t+1}$ since it features an elastic structure that makes it possible for clusters to be added or pruned on the fly. The probability density function $p(x)$ produced by ACM is fed to the structural learning phase of ParsNet++, where the generative learning phase is carried out first to condition the network structure in the absence of true class labels. The structural learning phase involves the hidden node growing and pruning processes adapting to the virtual drift problem.
That is, the structural evolution is navigated by the reconstruction error. The parameter learning phase is devised to minimize the reconstruction loss and to create an ideal discriminative representation of unlabelled samples. Once the generative phase is completed, the network parameters are further evolved in the discriminative phase with access to true class labels. In other words, the generative and discriminative training phases occur in a fully coupled fashion. The label enrichment mechanism is carried out afterward by executing the augmentation of labelled samples and the generation of pseudolabels. Both pseudolabels and augmented labels are learned in the discriminative learning fashion, minimizing the predictive loss, along with the dynamic regularization method. Network parameters are shared between the generative and discriminative learning phases in a closed-loop configuration.
That is, the network parameters of the generative learning phase are passed to the discriminative learning phase, while the network parameters of the discriminative learning phase are fed back to the generative learning phase to cope with the upcoming data stream. In other words, the discriminative phase functions to refine the generative learning phase using the ground truth information. In addition to the generative phase, the structural learning phase also takes place in the discriminative phase to overcome the real concept drift, and it utilizes the same probability density function $p(x)$ as the generative training phase. Table 1 provides a list of notations used in the paper.

Parameter Learning of ParsNet++.
The parameter learning method of ParsNet++ is governed by the following loss function:

$$\mathcal{L} = L_1 + L_2 + L_3 + L_4 + \frac{1}{2}\alpha_1 \rho (\theta - \theta^{*})^{2}, \qquad (1)$$

where $L_1$ stands for the reconstruction loss solved in the generative phase via the convolutional denoising autoencoder (CDAE), $L_2$ denotes the predictive loss of originally labelled samples, whose quantity is much lower than the batch size, and $L_3$ and $L_4$ denote the predictive losses of augmented labels and pseudolabels, respectively. The last term is the autoregularization term. The pseudolabel is assigned to unlabelled samples by the self-labelling mechanism, while the augmented label is produced by injecting a small perturbation into originally labelled samples without changing their labels. Nonetheless, the self-labelling mechanism does not reflect the ground truth and possibly delivers noisy labels, compromising the model's generalization. The autoregularization plays a role in avoiding this situation by preventing the important parameters from moving away from the optimal values obtained from the originally labelled samples. That is, $\theta$ and $\theta^{*}$, respectively, denote the current network parameters and the optimal parameters induced by the ground truth, while $\rho$ is the indicator of parameter importance. The original labels, augmented labels, and pseudolabels are mixed here to enable the autoregularization to be executed seamlessly [17]. Furthermore, the structural learning phase takes place only in $L_1$ and $L_2$ because the augmented label does not reflect the true data distribution, which undermines the drift adaptation mechanism, and the pseudolabel risks being noisy, which misleads the estimation of bias and variance. Equation (1) is formed as an unconstrained optimization problem allowing an alternate optimization strategy via the stochastic gradient descent (SGD) method.
Although pseudolabels might be noisy, the pseudolabel generation mechanism still plays an important role in enhancing the model's generalization because it enriches the label representation, particularly under extreme label scarcity. Moreover, the autoregularization is implemented to address the issue of noisy pseudolabels. The generative and discriminative phases are carried out alternately here. Note that the infinite delay case only relies on the augmented labels and the pseudolabels.
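To make the composition of the loss concrete, the sketch below adds up one reconstruction term, three predictive terms, and the autoregularization penalty. It assumes a mean squared error for the reconstruction loss, a cross-entropy for the predictive losses, and a sum of the penalty over all parameters; these choices and all function names are illustrative assumptions, not the published implementation.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of integer labels (assumed predictive loss)."""
    if len(labels) == 0:
        return 0.0  # an empty label group contributes nothing
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def total_loss(x, x_rec, p_orig, y_orig, p_aug, y_aug, p_pseudo, y_pseudo,
               theta, theta_star, rho, alpha1):
    l1 = float(np.mean((x - x_rec) ** 2))   # reconstruction loss (assumed MSE)
    l2 = cross_entropy(p_orig, y_orig)      # originally labelled samples
    l3 = cross_entropy(p_aug, y_aug)        # augmented labels
    l4 = cross_entropy(p_pseudo, y_pseudo)  # pseudolabels
    # autoregularization: importance-weighted distance to the optimal parameters
    reg = 0.5 * alpha1 * float(np.sum(rho * (theta - theta_star) ** 2))
    return l1 + l2 + l3 + l4 + reg

x = np.ones((2, 4))
perfect = np.array([[1.0, 0.0, 0.0]])
label = np.array([0])
loss = total_loss(x, x, perfect, label, perfect, label, perfect, label,
                  theta=np.array([1.0, 1.0]), theta_star=np.array([0.0, 0.0]),
                  rho=np.array([1.0, 1.0]), alpha1=0.5)
```

With perfect reconstruction and perfect predictions, only the penalty term remains, which isolates the effect of the regularization factor.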

Generation of Augmented Label.
The issue of label scarcity is addressed by the label enrichment strategy, which includes the generation of augmented labels. An augmented label results from the injection of small Gaussian noise into the originally labelled samples without changing their labels, also known as the consistency regularization technique.
That is, small random Gaussian noise with zero mean, i.e., $N(0, 0.001)$, is utilized to produce the corrupted version of originally labelled samples [17]. Since the augmented label is drawn from the true class label, it is not subject to the autoregularization method. Furthermore, only augmented labels and pseudolabels are exploited in the infinite delay problem, whereas the original labels are not retained while the process runs. In other words, the original labels are accessed in the warm-up phase without being carried over to the next data streams.
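The augmentation step can be sketched in a few lines. The text does not state whether 0.001 in $N(0, 0.001)$ is a variance or a standard deviation; the sketch assumes it is a variance, and the function name is hypothetical.

```python
import numpy as np

def augment_labelled(x_labelled, y_labelled, noise_variance=0.001, rng=None):
    """Return noisy copies of labelled inputs with their labels left unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    # assumption: 0.001 is interpreted as the variance of the zero-mean noise
    noise = rng.normal(0.0, np.sqrt(noise_variance), size=x_labelled.shape)
    return x_labelled + noise, y_labelled.copy()

rng = np.random.default_rng(0)
x = np.zeros((5, 4))
y = np.array([0, 1, 2, 1, 0])
x_aug, y_aug = augment_labelled(x, y, rng=rng)
```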

Generation of Pseudolabel.
The label enrichment mechanism involves the generation of pseudolabels produced by the self-labelling of unlabelled samples. The self-labelling mechanism relies on the network prediction as well as the ACM prediction if both return high confidence with respect to two predefined thresholds $\alpha_2, \alpha_3$, which are set to be higher than 0.55. The ACM's output is calculated from the class cardinalities of its clusters, where $Car_m$ stands for the cardinality of the $m$-th cluster while $Car_{o,m}$ denotes the cardinality of the $o$-th class of the $m$-th cluster. Furthermore, the network and ACM predictions are normalized as $P(Y|X)_{net/ACM} = y_1/(y_1 + y_2)$, where $y_1, y_2$ denote the highest and second highest outputs. This normalization is class-invariant, making the confidence measure similar to that of a binary classification problem. As a result, $P(Y|X)_{net/ACM} \approx 0.5$ indicates a low confidence level and a confused prediction. This condition implies that the predicted output falls adjacent to the decision boundary. The pseudolabel is propagated to the model's update only if the predictive outputs of ACM and the network agree. Although the pseudolabel generation mechanism risks producing noisy pseudolabels, it is still integrated in the ParsNet++ learning mechanism because the autoregularization ensures that only clean pseudolabels are learned. On the other hand, $\alpha_2, \alpha_3$ control the self-labelling mechanism: higher values decrease the number of pseudolabels whereas lower values increase it.
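The confidence normalization and the agreement rule can be sketched as follows. Which threshold applies to which predictor is not recoverable from the text, so the sketch assumes $\alpha_2$ gates the network and $\alpha_3$ gates ACM; the function names and the example values are hypothetical.

```python
import numpy as np

def normalized_confidence(outputs):
    """Return (predicted class, y1 / (y1 + y2)) per row of non-negative outputs."""
    order = np.argsort(outputs, axis=1)
    top = np.take_along_axis(outputs, order[:, -1:], axis=1)[:, 0]
    second = np.take_along_axis(outputs, order[:, -2:-1], axis=1)[:, 0]
    return order[:, -1], top / (top + second)

def generate_pseudolabels(net_out, acm_out, alpha2=0.6, alpha3=0.6):
    """Accept a pseudolabel only if both predictors agree and both are confident."""
    net_cls, net_conf = normalized_confidence(net_out)
    acm_cls, acm_conf = normalized_confidence(acm_out)
    accept = (net_cls == acm_cls) & (net_conf >= alpha2) & (acm_conf >= alpha3)
    return net_cls, accept

net_out = np.array([[0.8, 0.1, 0.1], [0.4, 0.35, 0.25], [0.1, 0.7, 0.2]])
acm_out = np.array([[0.9, 0.05, 0.05], [0.5, 0.3, 0.2], [0.6, 0.3, 0.1]])
labels, accept = generate_pseudolabels(net_out, acm_out)
```

In this example the second sample is rejected because the network confidence is close to 0.5, and the third because the two predictors disagree.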

Autoregularization Method.
The autoregularization is developed to cope with noisy pseudolabels leading to the accumulation of mistakes. It prevents a model from forgetting the optimal condition resulting from learning the original labels. Specifically, it prevents important parameters $\theta$ from moving too far from their previous locations $\theta^{*}$, which would result in performance degradation.
This approach was originally proposed in the so-called synaptic intelligence (SI) technique addressing the catastrophic forgetting problem of continual learning [19]. Our main contribution here is to contextualize this approach for the semisupervised learning environment to prevent catastrophic forgetting caused by noisy pseudolabels.
The regularization term $\frac{1}{2}\alpha_1 \rho (\theta - \theta^{*})^2$ still accepts the pseudolabel, provided it is clean, by setting the regularization factor $\alpha_1$ as $(e_{recons} - e^{min}_{recons})/(e^{max}_{recons} - e^{min}_{recons})$, where $e_{recons}$ stands for the reconstruction error of the generative phase. That is, a wrong pseudolabel distracts the direction of the network's gradient, resulting in an increase of the reconstruction error. This min-max scaling maps the reconstruction error into the range [0, 1]. $\rho$ determines the importance of network parameters and is derived from the accumulated network gradient as per (3), where $\theta_T$ stands for the total parameter movement during the training process and $\Delta\theta$ denotes the parameter's movement between two consecutive time steps, $\theta_t - \theta_{t-1}$. $\epsilon$ is a predefined constant to avoid division by zero. $\rho$ is updated only when observing the original labels and the augmented labels because the autoregularization functions to compensate for noisy pseudolabels. Hence, the number of steps in (3) is the number of original and augmented labels. It is worth mentioning that the higher the network gradient is, the more important the network parameter is. The parameter importance indicator (3) is calculated with respect to the accumulation of network loss and network gradients.
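A minimal sketch of the two ingredients follows: the scaled regularization factor $\alpha_1$ and a parameter importance $\rho$ accumulated along the training path. The importance update assumes the standard synaptic intelligence form (path integral of negative gradient times parameter movement, divided by the squared total movement plus a small constant); the class and function names are hypothetical.

```python
import numpy as np

def regularization_factor(e_recons, e_min, e_max):
    """alpha_1: reconstruction error min-max scaled into [0, 1]."""
    if e_max <= e_min:
        return 0.0
    return float(np.clip((e_recons - e_min) / (e_max - e_min), 0.0, 1.0))

class ParameterImportance:
    """Assumed SI-style importance, updated on original and augmented labels only."""
    def __init__(self, theta, eps=1e-3):
        self.start = theta.copy()
        self.path = np.zeros_like(theta)
        self.eps = eps  # avoids division by zero

    def accumulate(self, grad, theta_old, theta_new):
        self.path += -grad * (theta_new - theta_old)

    def rho(self, theta_now):
        total_movement = theta_now - self.start
        return self.path / (total_movement ** 2 + self.eps)

def autoregularization(theta, theta_star, rho, alpha1):
    return 0.5 * alpha1 * float(np.sum(rho * (theta - theta_star) ** 2))

theta0 = np.array([1.0, 1.0])
grad = np.array([2.0, 0.0])
theta1 = theta0 - 0.1 * grad          # one SGD step on a clean label
importance = ParameterImportance(theta0)
importance.accumulate(grad, theta0, theta1)
rho = importance.rho(theta1)
alpha1 = regularization_factor(0.3, e_min=0.1, e_max=0.5)
penalty = autoregularization(theta1, theta0, rho, alpha1)
```

Only the parameter that received a nonzero gradient gains importance, so the penalty later resists moving that parameter when pseudolabels are learned.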

Network Structure of ParsNet++.
ParsNet++ is built upon the convolutional denoising autoencoder structure, where the feature extraction layer utilizes stacked convolutional layers while the fully connected layer is formed as a single-hidden-layer network having a self-organizing property. It receives raw input features collected from sensors $X^{sen}_t \in \mathbb{R}^u$ and maps them to the output space $Y_k \in \mathbb{R}^m$. Specifically, 1D convolutional layers are deployed to process the sensor data. Raw samples are processed by the convolutional layer $F(\cdot)$, which is parameterized by filters $W^{l,i}_{conv}$ denoting the $i$-th filter of the $l$-th convolutional layer, while $Z^i_l$ stands for the feature map of the $l$-th layer produced by the $i$-th filter. A 1D filter $W^{l,i}_{conv} \in \mathbb{R}^g$ is used here.
After stacking $L$ convolutional layers, the output of the last 1D convolutional layer is flattened to produce an input vector $Z_L \in \mathbb{R}^r$, where $r$ denotes the number of natural features extracted by the feature extraction part of ParsNet++. It is passed to a single-hidden-layer neural network functioning to classify data samples into $m$ target classes. ParsNet++ is underpinned by a closed-loop configuration between the generative and discriminative learning phases, where the denoising autoencoder (DAE) [21] is implemented to extract robust input features. The DAE makes use of a noise injecting mechanism that avoids the identity mapping issue while functioning as a regularization mechanism. The DAE takes the natural features $Z_L$ and maps them into the latent space, where $W_{enc} \in \mathbb{R}^{r \times j}$ and $b \in \mathbb{R}^j$ are the connective weights and bias of the encoder while $W_{dec} \in \mathbb{R}^{j \times r}$ and $c \in \mathbb{R}^r$ are the connective weights and bias of the decoder. $j$ denotes the number of hidden nodes. Note that $W_{dec}$ is the inverse mapping of $W_{enc}$, which is known as the tied-weight constraint. $\tilde{Z}_L$ is a partially destroyed version of the input vector obtained with masking noise; that is, a subset of the input vector is set to zero. The ReLU activation function $\max(0, x)$ is used here instead of the sigmoid activation function. The discriminative phase utilizes a softmax function $\mathrm{softmax}(x_o) = \exp(x_o) / \sum_{i=1}^{m} \exp(x_i)$ to produce the output class posterior probability, where $W_{out} \in \mathbb{R}^{j \times m}$ and $d \in \mathbb{R}^m$ are the connective weights and bias of the softmax layer. ParsNet++ utilizes shared network parameters between the generative and discriminative phases, where $W_{enc} = W_{in}$ and $b_{in} = b$. Both phases are carried out in a closed-loop fashion, where a model is first trained during the generative phase in the absence of ground truth. The discriminative phase further refines it in the presence of class labels.
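The forward pass described above can be sketched with NumPy for a single-channel input and one convolutional layer. The sketch assumes a ReLU on the decoder, tied weights implemented as the transpose of the encoder matrix, and a class prediction computed from the clean features; the sizes (48 inputs, 2 filters of length 5, 4 hidden nodes, 3 classes) and all names are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def conv1d_layer(x, filters):
    """Valid 1D cross-correlation of a sequence with each filter, then ReLU."""
    return np.stack([relu(np.correlate(x, w, mode="valid")) for w in filters])

def masking_noise(z, drop_prob, rng):
    """Set a random subset of the input features to zero."""
    return z * (rng.random(z.shape) >= drop_prob)

def forward(x, filters, w_enc, b, c, w_out, d, drop_prob, rng):
    z = conv1d_layer(x, filters).ravel()          # flattened natural features Z_L
    z_noisy = masking_noise(z, drop_prob, rng)    # partially destroyed input
    hidden = relu(z_noisy @ w_enc + b)            # encoder
    z_rec = relu(hidden @ w_enc.T + c)            # decoder with tied weights
    y = softmax(relu(z @ w_enc + b) @ w_out + d)  # shared encoder, softmax layer
    return z, z_rec, y

rng = np.random.default_rng(0)
u, g, n_filters, j, m = 48, 5, 2, 4, 3
r = n_filters * (u - g + 1)
z, z_rec, y = forward(rng.normal(size=u), rng.normal(size=(n_filters, g)),
                      0.1 * rng.normal(size=(r, j)), np.zeros(j), np.zeros(r),
                      0.1 * rng.normal(size=(j, m)), np.zeros(m),
                      drop_prob=0.1, rng=rng)
```

Because the encoder is shared, a gradient step on either the reconstruction or the classification output updates the same `w_enc` and `b`.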

Growing and Pruning of Hidden Nodes.
ParsNet++'s structural evolution is governed by the network significance (NS) method estimating the network bias and variance in a one-pass learning fashion. $M$ new hidden nodes are added if the network experiences a high bias condition, whereas the hidden node pruning mechanism is triggered in the case of high variance. $M$ stands for the number of clusters generated using the autonomous clustering mechanism. It is worth mentioning that both mechanisms are carried out in the generative and discriminative phases: the bias and variance are computed with respect to the predictive error in the discriminative phase, while the reconstruction error is referred to during the generative phase. We only present the structural learning mechanism in the discriminative phase here for the sake of simplicity, but the same steps can be followed for the generative phase. The NS method is expressed in (7). The key to solving (7) lies in the expected output $E[y]$. ACM is applied here to estimate the complex probability density function $p(x)$, which results in expression (8), where $\omega_m$ and $c_m$ denote the $m$-th mixing coefficient and cluster center, respectively. Equation (8) can be derived independently for each cluster while the overall expected output is obtained by applying the mixing coefficient $\omega_m$, taking into account the contribution of each cluster to the overall estimation.
This step leads to an expression in which $\sum_{m=1}^{M} \omega_m = 1$ meets the partition of unity property. The hidden unit growing condition is formulated using the statistical process control (SPC) method [22] in (10), where $\mu^{t}_{bias}, \sigma^{t}_{bias}$ are the empirical mean and standard deviation of the network bias while $\mu^{min}_{bias}, \sigma^{min}_{bias}$ are their minimum values up to the $t$-th time instant. $\mu^{min}_{bias}, \sigma^{min}_{bias}$ are reset once (10) is satisfied, while $\mu^{t}_{bias}, \sigma^{t}_{bias}$ are calculated across all samples because the bias estimate is accurate only when considering all samples. Formula (10) is meant to detect the high bias condition that triggers hidden unit growing. Note that the SPC method in essence functions to detect anomalous points or a drifting concept. The original SPC method is, however, modified here to induce a flexible confidence level with the use of $k_1 \in [1, 2]$, which is equivalent to a confidence degree between 68.2% and 95.2%. It implies that the hidden unit growing process is carried out in the case of high bias whereas it is hindered in the case of low bias.
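The growing test can be sketched with an online tracker of the bias statistics. The exact form of (10) is not reproduced in this text; the sketch assumes it mirrors the pruning rule (11) without the factor 2, and it takes $k_1$ as a caller-supplied value in [1, 2]. The class and function names are hypothetical.

```python
import math

class RunningStats:
    """Welford's online estimate of mean and (population) standard deviation."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / self.n) if self.n > 0 else 0.0

def should_grow(mu_t, sigma_t, mu_min, sigma_min, k1):
    """Assumed SPC test: current bias level exceeds the best level seen so far."""
    return mu_t + sigma_t >= mu_min + k1 * sigma_min

stats = RunningStats()
for bias in (1.0, 2.0, 3.0):
    stats.update(bias)

grow_high_bias = should_grow(0.5, 0.1, 0.2, 0.05, k1=1.5)
grow_low_bias = should_grow(0.2, 0.05, 0.2, 0.05, k1=1.5)
```

A one-pass tracker of this kind fits the one-scan constraint because no past sample needs to be stored.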

As with the hidden unit growing mechanism, the hidden unit pruning strategy is undertaken using the SPC method as follows:

$$\mu^{t}_{var} + \sigma^{t}_{var} \geq \mu^{min}_{var} + 2 k_2 \sigma^{min}_{var}, \qquad k_2 = 1.2\exp(-var^{2}) + 0.8. \qquad (11)$$
The key difference lies in the factor 2, which is meant to avoid a direct-pruning-after-adding situation. This leads to a confidence level between 68.2% and 99.9%.
That is, hidden unit pruning is carried out frequently in the case of high variance while it is prevented in the case of low variance. Once (11) is met, hidden unit pruning is executed as per (12), where $E[s] = \sum_{m=1}^{M} \omega_m (c_m W_{in} + b_{in})$ denotes the statistical approximation of hidden nodes. Equation (12) enables multiple hidden nodes to be discarded at once and results in rapid complexity reduction.
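The pruning test (11) and the expected hidden activation $E[s]$ translate directly into code. The sketch below covers only these two quantities; the node selection rule (12) is not reproduced in this text and is therefore left out. Function names and example values are illustrative assumptions.

```python
import math
import numpy as np

def k2_factor(variance):
    """k_2 = 1.2 * exp(-var^2) + 0.8, shrinking as the variance grows."""
    return 1.2 * math.exp(-variance ** 2) + 0.8

def should_prune(mu_t, sigma_t, mu_min, sigma_min, variance):
    """SPC pruning test of (11), including the extra factor 2."""
    return mu_t + sigma_t >= mu_min + 2.0 * k2_factor(variance) * sigma_min

def expected_hidden(omega, centers, w_in, b_in):
    """E[s] = sum_m omega_m * (c_m W_in + b_in), one value per hidden node."""
    return omega @ (centers @ w_in + b_in)

prune_high_var = should_prune(1.0, 0.2, 0.1, 0.1, variance=0.0)
prune_low_var = should_prune(0.1, 0.1, 0.1, 0.1, variance=0.0)
e_s = expected_hidden(np.array([0.5, 0.5]), np.eye(2), np.eye(2), np.zeros(2))
```

Since `k2_factor` decreases with the variance, the threshold is lowered exactly when the variance is high, which matches the stated behaviour of frequent pruning under high variance.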

Autonomous Clustering Mechanism.
ParsNet++ is guided by the autonomous clustering mechanism (ACM) to generate a complex probability density function $p(x)$ during the hidden node growing and pruning processes. It differs from the original ParsNet [17], where the autonomous Gaussian mixture model (AGMM) is applied. The bottleneck of AGMM is that it is often unstable under high input dimensions. ACM features an open structure where clusters are added or discarded on the fly to cope with concept drifts $p(x)_t \neq p(x)_{t+1}$ and is capable of initiating its learning process from scratch. The cluster growing process is governed by a compatibility measure examining the spatial proximity of a data point to existing clusters, i.e., whether it is within a cluster's coverage. The cluster pruning technique makes use of the cluster's utility, checking the cluster's activity during its lifespan.
Suppose that $D(X, Y)$ is the L2 distance between two data samples; the compatibility test is formulated in (13), where $k_3 = 2\exp(-D(Z_L, C_{win})^2) + 1$ and $\mu_D, \sigma_D$ stand for the mean and standard deviation of the distance $D(Z_L, C_{win})$. As with (10) and (11), (13) is formalised by the statistical process control (SPC) method. The use of $k_3$ controls the cluster growing process in such a way that the growing process is performed frequently if a sample is remote from the existing clusters, $k_3 \approx 2$. This situation portrays a case where a data sample is not covered by existing clusters. On the other hand, this condition is difficult to fulfil if a data sample is adjacent to existing clusters, i.e., low clustering loss, $k_3 \approx 3$. A new cluster is constructed if (13) is satisfied. That is, the cluster center is set as the sample of interest, $C_{M+1} = Z_L$ with $Car_{M+1} = 1$, where $M$ is the number of clusters. If (13) is violated, the winning cluster is fine-tuned as per (14), where $Car_m$ denotes the cluster's cardinality. Note that the adaptation process is localized to the winning cluster only, to avoid overlapping clusters, and associates the data sample of interest with the winning cluster. That is, the cluster's cardinality is incremented here. Equation (14) ensures the cluster's convergence as a function of the cluster's cardinality. The cluster pruning procedure is implemented to prevent cluster explosion due to outliers, that is, outliers wrongly inserted as clusters by (13). It checks whether a cluster plays a major role during its lifespan. A cluster can be pruned without loss of generalization if it contributes little during its lifespan. The cluster's contribution is examined from the average cluster activity, where $\Phi_m = \exp(-(Z_L - C_m)^2)$ measures the spatial proximity of a data sample to the cluster of interest in the latent space while $Life$ denotes the time period since the cluster was added.
Furthermore, unit variance is assumed in calculating $\Phi_m$, i.e., $\sigma^2 = 1$. The cluster pruning mechanism follows the half-sigma rule and enables more than one cluster to be discarded at once, leading to rapid complexity reduction. The number of clusters $M$ is also used as an additional factor in the network growing phase (10) because the clustering mechanism explores the true data distribution. As an implementation note, a monitoring period is applied here. That is, a cluster is not removed during the monitoring period so that it can evolve its shape. On the other hand, the mixing coefficient, $\omega_m$, is formed as the relative cardinality. It features the partition-of-unity property and takes into account both the distance information and the cluster support. A cluster should have high influence on the network bias and variance estimation if it is adjacent to the data sample of interest and has a high population.
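The clustering steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the exact forms of the growth test (13), the centre update (14), and the pruning threshold are assumptions made for illustration (an SPC-style limit $\mu_D + k_3\sigma_D$, a running-mean update, and a mean-minus-half-sigma cutoff), and the mixing coefficient is simplified to plain relative cardinality. The names `ClusteringSketch`, `Cluster`, and `monitoring_period` are hypothetical.

```python
import math
import statistics


def l2(a, b):
    """Euclidean (L2) distance between two latent vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Cluster:
    def __init__(self, center):
        self.center = list(center)
        self.cardinality = 1      # Car_m
        self.life = 0             # samples observed since the cluster was added
        self.activity_sum = 0.0   # accumulated Phi_m over the lifespan


class ClusteringSketch:
    """Illustrative grow / update / prune loop (assumed forms, see lead-in)."""

    def __init__(self, monitoring_period=10):
        self.clusters = []
        self.distances = []       # winning distances, used for mu_D and sigma_D
        self.monitoring_period = monitoring_period

    def update(self, z):
        if not self.clusters:
            self.clusters.append(Cluster(z))
            return
        # Track cluster activity Phi_m = exp(-||z - C_m||^2) (unit variance).
        for c in self.clusters:
            c.life += 1
            c.activity_sum += math.exp(-l2(z, c.center) ** 2)
        win = min(self.clusters, key=lambda c: l2(z, c.center))
        d = l2(z, win.center)
        self.distances.append(d)
        mu = statistics.fmean(self.distances)
        sigma = statistics.pstdev(self.distances)
        k3 = 2.0 * math.exp(-d ** 2) + 1.0
        # ASSUMPTION: test (13) is taken as d > mu_D + k3 * sigma_D.
        if len(self.distances) > 1 and d > mu + k3 * sigma:
            self.clusters.append(Cluster(z))
        else:
            # ASSUMPTION: update (14) is taken as a running mean of the centre,
            # so the step size shrinks as the cardinality grows.
            win.cardinality += 1
            win.center = [c + (x - c) / win.cardinality
                          for c, x in zip(win.center, z)]

    def prune(self):
        """Drop clusters whose average activity is below mean - 0.5 * std."""
        if len(self.clusters) < 2:
            return
        activity = [c.activity_sum / max(c.life, 1) for c in self.clusters]
        limit = statistics.fmean(activity) - 0.5 * statistics.pstdev(activity)
        # Clusters still inside the monitoring period are never removed.
        self.clusters = [c for c, a in zip(self.clusters, activity)
                         if a >= limit or c.life < self.monitoring_period]

    def mixing_coefficients(self):
        """Relative cardinality; the coefficients sum to one."""
        total = sum(c.cardinality for c in self.clusters)
        return [c.cardinality / total for c in self.clusters]
```

Because the threshold is computed from the mean and spread of all cluster activities, several clusters can fall below it in a single call, which matches the remark that more than one cluster may be discarded at once.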

Injection Molding Process
eScentz, as shown in Figure 1, is a scent-emitting USB device made by SIMTech. It is used as the testbed product at the model factory@SIMTech. The injection molding process is used to manufacture the black cartridge, the white cartridge holder, and a transparent part which contains the scent in the cartridge. The injection molding machine (Arburg Allrounder 470 A) is shown in Figure 2.
The focus is on the transparent part because it is critical to the functionality of the device; i.e., defects in the part can lead to leakage of the liquid scent. There are a number of possible defect types, but the most common ones are flow lines, which are marks or lines formed when two melt flow fronts meet during the filling of the injection mold, and short shots, where the mold is only partially filled with plastic melt [23]. Examples of a good part and the different types of defects are shown in Figure 3.

Numerical Study
This section demonstrates the advantage of ParsNet++ in assessing the quality of the transparent mold manufactured by the injection molding machine. ParsNet++ is simulated in two simulation environments: random access of ground truth and infinitely delayed access of ground truth. The former describes a case where each data batch contains partially labelled data points with unknown class distribution, while the latter portrays a semisupervised problem where ground truth is accessed only in the initial phase, leaving the rest unlabelled. The default setting for the random access of ground truth is 50% labelled samples. The infinitely delayed access of ground truth only utilizes the labels of the first data batch. Furthermore, both scenarios are simulated for the prediction of the current batch as well as the future batch. The prediction of the current batch monitors the current quality of transparent molds $Y_k$ based on the sensor data $X_k$. The prediction of the future batch relies on the current data batch to forecast the future product quality $Y_{k+1}$. The contribution of each learning module is studied in the ablation study section, and the effect of label proportions is also elaborated. Our numerical study follows the prequential test-then-train procedure, the standard evaluation protocol of data stream mining. Moreover, the t-test is used to statistically validate the numerical results.
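The evaluation protocol above can be sketched in a few lines of Python. This is only an illustration of the prequential test-then-train loop under the two labelling scenarios; the function names, the scenario strings, and the `MajorityClass` placeholder learner are hypothetical and stand in for any incremental model with `predict` and `partial_fit` methods.

```python
import random


def label_mask(batch_idx, batch_size, scenario, label_ratio=0.5, rng=None):
    """Return one boolean per sample: True where the ground truth is revealed."""
    rng = rng or random.Random(0)
    if scenario == "sporadic":
        # Random access: a fixed proportion of each batch is labelled.
        n_labelled = int(round(label_ratio * batch_size))
        chosen = set(rng.sample(range(batch_size), n_labelled))
        return [i in chosen for i in range(batch_size)]
    if scenario == "infinite_delay":
        # Labels are available for the first batch only.
        return [batch_idx == 0] * batch_size
    raise ValueError(f"unknown scenario: {scenario}")


class MajorityClass:
    """Placeholder learner (hypothetical): predicts the most frequent label seen."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else 0

    def partial_fit(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def prequential(batches, model, scenario, label_ratio=0.5, seed=0):
    """Test on each batch first, then train on its revealed labels."""
    rng = random.Random(seed)
    accuracies = []
    for k, (X, Y) in enumerate(batches):
        predictions = [model.predict(x) for x in X]            # test ...
        correct = sum(p == y for p, y in zip(predictions, Y))
        accuracies.append(correct / len(Y))
        mask = label_mask(k, len(X), scenario, label_ratio, rng)
        for x, y, revealed in zip(X, Y, mask):                 # ... then train
            if revealed:
                model.partial_fit(x, y)
    return accuracies
```

For the future-batch setting, the same loop would pair the inputs of batch $k$ with the labels of batch $k+1$; that pairing is left out of the sketch.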

Baselines.
The numerical results of ParsNet++ are benchmarked against recently published algorithms in the literature:
(i) Online deep learning (ODL) [24] is an online learning algorithm constructed under the vanilla neural network structure. It makes use of the hedging idea, where each hidden layer is directly connected to the output layer.
(ii) Neural networks with dynamically evolved capacity (NADINE) [18] adopts a flexible network structure under the multilayer perceptron (MLP) architecture. That is, both hidden layers and nodes are dynamically grown and reduced with respect to variations of data streams.
(iii) Parsimonious network (ParsNet) [17] is the predecessor of ParsNet++. ParsNet++ distinguishes itself from ParsNet by the presence of a feature extraction layer crafted under the convolutional framework; a 1D CNN is integrated to handle raw input features. In addition, ParsNet++ is underpinned by the ACM rather than the AGMM to perform density estimation on the fly.
(iv) SCARGC [25] is devised for the infinite delay problem and is considered a state-of-the-art algorithm in this domain. It utilizes the pool-based principle.
Since these algorithms are not designed to handle high-dimensional visual data, their predictions are guided only by the sensory data $X_{sen} \in \mathbb{R}^{48}$. The use of image data significantly reduces their performance due to the absence of a feature extraction layer. In addition, comparison is also made against two popular deep learning algorithms, ResNet18 [26] and VGG11 [27], which use only the image data, an RGB image of size $X_{img} \in \mathbb{R}^{150 \times 150 \times 3}$. They do not exploit the sensor data due to the absence of a 1D CNN. All of the algorithms except ParsNet and SCARGC are fully supervised. The simplest structures of ResNet and VGG are adopted here because the small data size raises the issue of overfitting. All algorithms are executed on the same computational platform using their published codes and run under the same simulation protocol as ParsNet++ to ensure a fair comparison. The numerical results are taken from the average of five consecutive runs.

Network Structure and Hyperparameters.
ParsNet++ utilizes a 1D CNN as a feature extractor to predict the mold quality $Y_k$, where the 1D CNN processes the raw sensory data. Extracted features from the CNN are concatenated into a long vector and fed to the fully connected layer, a single-hidden-layer neural network with the self-evolving property.    [8,4], respectively. The hyperparameters of ParsNet++ are fixed throughout our simulation scenario as $\alpha_2 = 0.6$ and $\alpha_3 = 0.8$, while the learning rate and momentum coefficient of the stochastic gradient descent (SGD) optimizer are set to 0.01 and 0.95. Hyperparameters of other algorithms are chosen as those reported in their original papers. We chose 100 as the batch size for all algorithms. Table 2 reports the hyperparameters of the consolidated algorithms. For the injection molding dataset, the number of initialization batches $S$ and epochs $E$, shown in Algorithm 1, are 5 and 10 in the sporadic access experiment and 1 and 15 in the infinite delay experiment. For the transfer molding dataset, $S$ and $E$ are both 1 for the sporadic access and infinite delay experiments, which also signifies that ParsNet++ runs in a single-pass fashion.

Numerical Results.
Table 3 reports our numerical results for the current batch prediction under the setting of sporadic access of ground truth. It is evident that ParsNet++ outperforms ParsNet by a significant gap. This finding clearly supports the 1D CNN of ParsNet++, which automatically extracts deep natural features, and the ACM technique for estimating the probability density function. Moreover, ParsNet++ beats NADINE and ODL, which are fully supervised algorithms, by a significant margin. Note that ParsNet, ODL, and NADINE are guided by sensor data as with ParsNet++. ParsNet++ is also compared with ResNet18 and VGG11, two popular deep learning approaches that make use of image data. Although the two approaches are trained offline and are fully supervised, ParsNet++ exhibits superior performance. That is, ParsNet++ exceeds VGG11 and ResNet18 by a noticeable difference. This result is confirmed by the statistical test in Table 4, where the performance gap between ParsNet++ and all other algorithms is statistically significant.

Table 5 exhibits our consolidated numerical results for the next batch prediction. The same finding as in the current batch prediction is observed here, where ParsNet++ beats ParsNet by a significant performance gap. This substantiates the advantage of the feature extraction module of ParsNet++ in generating deep natural features, while the ACM approximates the true probability distribution better than the AGMM of ParsNet. By extension, ParsNet++ outperforms the fully supervised algorithms, NADINE and ODL, which work under more favourable conditions than ParsNet++. NADINE, ODL, and ParsNet are akin to ParsNet++ in that raw sensor data are exploited as input features, but they suffer from the absence of a feature extraction layer. Our numerical results are statistically validated with the statistical test in Table 6, where ParsNet++'s performance is statistically better than that of its competitors.

In the realm of infinitely delayed access of ground truth, ParsNet++ delivers superior performance with almost 40% improvement over ParsNet and SCARGC. ParsNet++'s accuracy is 85.60% for the current batch prediction and 83.25% for the next batch prediction, whereas its counterparts deliver accuracy below 50%.
This confirms the generalization power of ParsNet++ in dealing with various semisupervised learning situations. These numerical results are presented in Tables 6 and 7. Note that the true class labels are only supplied in the initial batch for the infinite delay case, which is a more challenging condition than the sporadic access case. This is confirmed by the fact that the numerical results of all algorithms worsen. Figure 4 visualizes the predictive quality of ParsNet++, where the precision, recall, and $F_1$ metrics show a similar trend. This observation signifies that ParsNet++ handles all target classes equally well. The detailed numerical results are presented in Table 8. In addition, this figure also depicts the dynamic nature of ParsNet++, in which its hidden nodes are dynamically added and pruned on the fly. It is also observed that ParsNet++ responds in a timely manner to performance decreases that result from concept drifts. That is, new nodes are injected if the network's performance is compromised by a concept drift.
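For reference, the per-class precision, recall, and $F_1$ scores mentioned above follow the standard definitions and can be computed as in the short Python sketch below. The function name `per_class_prf` and the toy labels are illustrative only and are not taken from the paper's code or data.

```python
def per_class_prf(y_true, y_pred):
    """Return {class: (precision, recall, f1)} using the standard definitions."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1)
    return scores


# Toy example with three classes (made-up labels).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
scores = per_class_prf(y_true, y_pred)
```

When the three metrics stay close to each other for every class, no class is being favoured at the expense of another, which is the reading given to Figure 4.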

Ablation Study.
The ablation study is carried out to validate the influence of each learning module of ParsNet++. ParsNet++ is configured into three variations: (A) ParsNet++ is set with only the parameter learning scenario using the stochastic gradient descent method, with the other learning modules absent. That is, the label augmentation mechanism, the dynamic regularization mechanism, and the structural learning mechanism are deactivated; (B) ParsNet++ is equipped with the label enrichment mechanism and the dynamic regularization mechanism but without the structural learning method; (C) the structural learning mechanism of ParsNet++ is switched on but without the pseudolabel generation step and the dynamic regularization mechanism.

It is observed that the worst-performing result comes from Model (A), where all mechanisms are turned off. The label enrichment mechanism and the dynamic regularization mechanism enhance the performance by almost 10%, as reported by Model (B). This clearly demonstrates the advantage of these learning strategies in coping with label scarcity. A noticeable performance improvement is attained using the structural learning mechanism, clearly confirming the advantage of a dynamic structure over a static structure, as shown in Model (C). This case portrays the importance of a drift handling mechanism when dealing with data streams. Note that Model (C) excludes the pseudolabel generation mechanism and the dynamic regularization approach. The numerical result increases further when combining the self-evolving structure, the label enrichment mechanism, and the dynamic regularization mechanism, as exemplified by the full ParsNet++ configuration. This configuration enables label scarcity and concept drift to be overcome simultaneously.

Effect of Label Proportions.
This section examines the learning performance of ParsNet++ under different label proportions. That is, ParsNet++'s performance is evaluated under seven label proportions: 10%, 20%, 30%, 40%, 50%, 60%, 70%. The simulation protocol follows the sporadic access of ground truth, in which two evaluation metrics, accuracy and $F_1$, are applied. Table 10 reports the average numerical results across five independent runs.
Our numerical results show that ParsNet++'s performance is compromised with only 5% of labelled samples. Increasing the label proportion improves the learning performance, but this trend does not continue beyond the 50% label proportion. The best-performing result is achieved with 50% labelled samples, whereas performance deterioration is observed with 60% and 70% labelled samples compared to that of 50% labelled samples. This finding demonstrates that an increase of labelled samples does not guarantee a performance improvement. The performance deterioration with 60% and 70% labels results from sample redundancy due to the consistency regularization step, in which small perturbations are injected into original samples without changing their labels. The consistency regularization method might lead to overfitting if the proportion of labelled samples is high. That is, it produces indistinguishable samples which slightly affect the model's generalization. Note that the 50% and 60% cases are better than the 40% case.

Industrial Transfer Molding Process.
The industrial transfer molding process is a process from the semiconductor industry occurring in the encapsulation stage, where a batch of integrated circuits (ICs) is packaged in a case to avoid corrosion and physical damage [15]. The quality monitoring step in this phase plays a key role because heavy penalties might result if defective products are sent to the customer. The encapsulation process makes use of an industrial transfer molding machine, very similar to the injection molding machine, which is used to form the support of electronic components. That is, transfer molding is a process whereby the casting material is forced into the mold [15].
Each production run is undertaken in lots of 1 to 424 strips, where each strip comprises a number of products. The product quality is examined only after the complete lot has been finished. The goal of this problem is to provide real-time prediction of the product quality while the lot is still in production. The use of artificial intelligence (AI) is urgently required because it enables redundancy in checking such that the product's integrity is ensured. We collected production data over a period of six months. This problem is formulated as a binary classification problem and suffers from class imbalance, where only 4% of the data contain defects while the remainder is of the normal class. The unique property of this problem lies in the many-to-one label relationship, where multiple instances are assigned a single label. That is, the quality is not determined for a single product but for the whole lot. If a lot has over 48 defects, the whole lot is thrown away, and this case constitutes the defect class.
Our numerical study follows the prequential test-then-train protocol as in the injection molding problem, where one-step-ahead prediction is simulated. That is, a model is used to predict the quality of the next lot based on the current machine parameters and process variables. Both the sporadic access of ground truth and the infinitely delayed access of ground truth are simulated here. Important parameters of the molding process include cavity pressure, ram velocity, ram position, and mold temperatures. This is a high-dimensional problem with 608 input features. Tables 11 and 12 report our numerical results for the sporadic access and the infinite delay scenarios, respectively. ParsNet++ is compared with ParsNet and SCARGC. Since this problem suffers from the many-to-one label relationship, where many data samples are associated with a single class label, a simple mean operation is executed as the feature extraction strategy in ParsNet and SCARGC. Note that this is not needed for ParsNet++ because the use of the 1D CNN enables automatic feature engineering, where the data points of each lot/strip are scanned using a 1D filter. It is obvious from Table 11 that ParsNet++ and ParsNet exhibit comparable performance under the sporadic access protocol. This finding is explained by the class imbalance, where only 4% of data samples belong to the positive class. In contrast, ParsNet++ outperforms both SCARGC and ParsNet in the case of infinite delay, as shown in Table 12. This observation substantiates the generalization power of ParsNet++ in coping with different scenarios of semisupervised learning. Note that the infinite delay problem is more challenging than the sporadic access problem because true class labels are supplied only in the warm-up phase. This issue leads to performance degradation of ParsNet and SCARGC, where the automatic feature engineering step is absent; i.e., features are extracted by applying the mean operation. Nonetheless, we acknowledge that the class imbalance problem still deserves in-depth future study, as seen from the low $F_1$ scores of the consolidated algorithms.
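The many-to-one labelling and the mean-based feature extraction used for the baselines can be pictured with the small Python sketch below. The data layout (a lot as a list of per-strip feature vectors) and the function names are assumptions made for illustration; only the 48-defect rule and the mean operation come from the description above.

```python
from statistics import fmean

DEFECT_THRESHOLD = 48  # a lot with more than 48 defects is discarded


def lot_label(n_defects, threshold=DEFECT_THRESHOLD):
    """Single label for the whole lot: 1 = defect class, 0 = normal class."""
    return 1 if n_defects > threshold else 0


def mean_features(lot):
    """Collapse a lot (list of equally long strip vectors) into one vector.

    This mirrors the simple mean operation applied for ParsNet and SCARGC;
    it discards the within-lot ordering that a 1D filter would scan.
    """
    return [fmean(column) for column in zip(*lot)]


# Hypothetical lot with three strips and two features per strip.
lot = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
x_lot = mean_features(lot)
y_lot = lot_label(n_defects=5)
```

Averaging keeps the input dimension fixed regardless of how many strips a lot contains, at the cost of losing any pattern that only appears along the sequence of strips.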

Sensitivity Analysis.
This subsection studies the effect of hyperparameters on the performance of ParsNet++. Specifically, the effect of $\alpha_2$, $\alpha_3$ is analyzed while excluding other parameters. Other hyperparameters, such as the momentum coefficient and learning rate, are set to the same values for all consolidated algorithms. In addition, they are default parameters of the SGD method whose effects are well understood from the literature. $\varepsilon$ is merely a small constant to avoid division by zero. The sensitivity analysis is carried out by varying $\alpha_2 \in \{0.2, 0.4, 0.6, 0.8\}$ and $\alpha_3 \in \{0.2, 0.4, 0.6, 0.8\}$. Table 13 reports the numerical results of all combinations. $\alpha_2$, $\alpha_3$ are required to be set higher than 0.55; therefore, 0.6 and 0.8 are selected for $\alpha_2$, $\alpha_3$, respectively. Note that we do not apply a specific hyperparameter selection procedure in our main experiments. That is, only simple hand-tuning is applied to set the parameters. Our sensitivity analysis is undertaken in the case of sporadic access of ground truth under the next batch prediction with 50% label proportion.
From Table 13, variation of $\alpha_2$, $\alpha_3$ does not lead to significant performance deterioration. That is, the difference between the worst and best results is around 2%. It is worth stressing that $\alpha_2$, $\alpha_3$ should be set higher than 0.55 since lower values reflect confused predictions. This narrows down the choice of hyperparameters, i.e., $\alpha_2, \alpha_3 \in \{0.2, 0.4\}$ are unreasonable values. Excluding them, the performance variation is less than 1%. On the other hand, $\alpha_2$, $\alpha_3$ govern the pseudolabel generation, where the predictions of the network and the ACM are used to generate the pseudolabel. The case of $\alpha_2 = \alpha_3 = 0.2$ produces the worst result because it produces too many noisy pseudolabels. Increasing $\alpha_3$ improves the prediction because it reduces the uncertainty of the ACM's prediction. Note that the ACM's prediction relies on the class posterior probability $P(y_o | N_m)$, which no longer represents the class distribution in the case of extreme label scarcity.
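The role of the two thresholds and the grid used in the sensitivity analysis can be sketched as follows. The acceptance rule is an assumption for illustration: a pseudolabel is kept only when the network and the ACM agree and their confidences exceed $\alpha_2$ and $\alpha_3$, respectively. The exact rule in ParsNet++ may differ, and `accept_pseudolabel`, `sensitivity_grid`, and the `evaluate` callable are hypothetical names.

```python
from itertools import product


def accept_pseudolabel(net_probs, acm_probs, alpha2=0.6, alpha3=0.8):
    """Return the agreed class index, or None if the sample is rejected.

    ASSUMPTION: alpha2 thresholds the network confidence and alpha3 the
    ACM confidence, and both predictors must agree on the class.
    """
    net_class = max(range(len(net_probs)), key=net_probs.__getitem__)
    acm_class = max(range(len(acm_probs)), key=acm_probs.__getitem__)
    confident = net_probs[net_class] > alpha2 and acm_probs[acm_class] > alpha3
    return net_class if confident and net_class == acm_class else None


def sensitivity_grid(evaluate, values=(0.2, 0.4, 0.6, 0.8)):
    """Score every (alpha2, alpha3) pair with a user-supplied evaluate()."""
    results = {(a2, a3): evaluate(a2, a3) for a2, a3 in product(values, repeat=2)}
    best = max(results, key=results.get)
    worst = min(results, key=results.get)
    return results, best, worst, results[best] - results[worst]
```

With a binary problem, any threshold at or below 0.5 accepts every agreed prediction, including near-coin-flip ones, which is consistent with the observation that low thresholds admit many noisy pseudolabels.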

Conclusion
A semisupervised quality classification problem in data stream environments and its deep learning solution, termed Parsimonious Networks++ (ParsNet++), are presented in this paper. ParsNet++ features an open structure that automatically generates and prunes its hidden nodes on the fly, thereby addressing concept drifts of partially labelled data streams. The parameter learning strategy is formulated as a joint optimization problem of the reconstruction loss, the predictive loss of the original label, the predictive loss of the augmented label, and the predictive loss of the pseudolabel. In addition, a regularization strategy is put forward to combat the noisy pseudolabel problem, preventing important parameters from being perturbed by noisy pseudolabels. ParsNet++ extends ParsNet with the integration of a feature extraction layer enabling automatic feature engineering. A 1D CNN is integrated to perform the automatic feature engineering step and to handle the many-to-one label relationship, while the ACM is incorporated as a flexible density estimation approach. Comprehensive experiments with the injection molding machine and the industrial transfer molding machine have been carried out to experimentally validate the advantage of ParsNet++. ParsNet++ is tested in two semisupervised learning scenarios, infinite delay access of ground truth and random access of ground truth, with comparisons against prominent algorithms for both current batch quality monitoring and future batch quality monitoring. ParsNet++ outperforms its counterparts by a noticeable margin and delivers accuracy comparable to that of fully supervised learning algorithms. A few important issues remain unexplored in ParsNet++. The issue of class imbalance still deserves an in-depth study, as it requires a specific strategy in order to reduce the false positive rate of ParsNet++'s predictions. This is seen in ParsNet++'s results for the industrial transfer molding problem, where the $F_1$ score is rather low. Another uncharted area lies in the issue of transferability to different machines. Its solution would make it possible to transfer a single model across different machines of the same or different types with little capital expenditure.

Data Availability
Codes and data of this paper can be found at https://github.com/ContinualAL/ParsNetPlus.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Authors' Contributions
Weng Weiwei and Mahardhika Pratama contributed equally to this study.