A New Multinetwork Mean Distillation Loss Function for Open-World Domain Incremental Object Detection



Introduction
Object detection models, which are currently the most representative models for vision tasks, play a significant role in fields such as intelligent robot tasks [1], autonomous driving [2], and other edge intelligent terminal schemes [3]. However, existing supervised models can only be trained on labeled task data from existing categories in the training dataset. Furthermore, to adapt to a new task, the network parameters of the model need to be adjusted, and it is difficult for existing object detection models to adjust to dynamic real-world environments, causing them to forget old knowledge [4].
In this work, we investigate the problem of class-incremental multinetwork object detection based on the catastrophic forgetting mechanism. In the incremental setting, task queues are introduced sequentially to the object detector, and a high-performance agent should maintain the old task performance during the new task learning process. Therefore, the adaptive parameter update process executed when a new task arrives is constrained by applying knowledge distillation [4] at the model parameter level [5][6][7]. For Faster-RCNN [8], which has a multinetwork structure, it is difficult to alleviate the catastrophic forgetting problem by distilling only a single network [9], and it is more effective to use multinetwork distillation [5,10] to retain old knowledge across the whole network. However, existing incremental object detection methods based on multinetwork knowledge distillation make the model more prone to learning the old task and diminish the new task learning effect.
Based on these considerations, Peng et al. [10] proposed an incremental learning approach with multinetwork adaptive distillation, where distillation is set up in multiple networks and the teacher network is used as a lower bound for adaptive knowledge extraction. However, forcing a comparison of the outputs of the teacher network and the student network as an adaptive extraction condition may lead to significant loss: past knowledge may be identified as more important for new task learning and its distillation loss zeroed out, even though the weights related to this output value may be equally important for the old task. In contrast, Joseph et al. [5] performed meta-learning by specifying an RoI head layer and setting a certain number of iterations to optimize the gradient update direction to better learn the new task. However, because the region proposal network (RPN) is class-agnostic, their method does not include a distillation loss term for it. The accuracy of RPN classification and anchor regression of the objects and background of an old task directly affects the accuracy of RoI head prediction and the degeneration of the old classes in the next stage, because the candidate regions learned by the RoI head are generated from the previous step after pooling. This results in a lack of candidate regions for the old task in the learning process of the RoI head, which affects the ability of the object detection model to recognize new and old classes. In addition, because the network parameters change during training to fit the new task, directly calculating the distillation loss of the network output at each stage will cause the output of the new model to deviate strongly from the output data distribution of the old model. This makes the network output difficult to fit and unstable.
To address the aforementioned challenges of the Faster-RCNN incremental object detector, we propose a new distillation scheme for the Faster-RCNN detector. We improve the distillation output of the ResNet50 backbone at the input level and the RoI head network at the output level, and we use adaptive distillation to maintain the past knowledge of the RPN. Moreover, we adopt the meta-learning strategy in [5] to mitigate the degradation of model learning performance for new tasks caused by knowledge distillation. In addition, to address the problem of bias in the output data distribution of the new and old models, we perform zero averaging on the output data of the ResNet50 backbone and RPN of the new and old models to mitigate bias in the output data distribution of the new model. Consequently, the primary contributions of our work are as follows: (i) We propose a new scalable Faster-RCNN detector-based multinetwork distillation scheme that uses enhanced distillation values for the ResNet50 backbone and RoI head network and adaptive distillation for the RPN to mitigate the catastrophic forgetting problem.
(ii) To alleviate the instability of the network output caused by the differences in the old and new network outputs, we perform zero averaging on the output of the backbone network and RPN of the old and new models and consider the RoI head network averaged over the class to produce a new set of distillation losses. (iii) We extensively evaluate on the PASCAL VOC and COCO benchmark datasets and compare against two advanced baseline methods. The experimental findings demonstrate the superior performance of our approach in various incremental scenarios.

Related Works
Incremental learning is a special machine learning paradigm that simulates the human brain's learning of sequential task streams, where the model can continuously learn new tasks and maintain old task performance. However, to maintain such properties, it is necessary to address the model's forgetting problem caused by new task learning [11]. On this basis, this section reviews incremental learning techniques for knowledge distillation and distillation loss optimization.

Incremental Learning Approach for the Knowledge Distillation Strategy.
Knowledge distillation [4] methods have been extended to mutual distillation learning [12][13][14], assisted distillation learning [15][16][17], spatial location distillation learning [18,19], and dataset distillation learning [20][21][22]. In addition, knowledge distillation can be used in incremental learning because of its ability to transfer knowledge from one model to another [8][9][10][11][12][13]. Knowledge distillation in incremental learning typically transfers old information from the teacher network to the student network to alleviate the forgetting of old knowledge. As a traditional incremental learning method based on knowledge distillation, LwF [23] mitigates forgetting by freezing the old model as the teacher network, using a temperature factor to soften the softmax output of the logits, adding the resulting term to the current task loss as a regularization term, and thereby constraining the model's parameter updates. However, LwF is vulnerable to a significant learning bias when there is an imbalance between the old and new classes. To address this problem, Zhao et al. [24] combined weight aligning (WA) with knowledge distillation, utilizing WA to balance the weights of the old and new class information in the final fully connected layer while using knowledge distillation to maintain the model's discrimination of the old classes. In contrast, Dong et al. [25] used a dual-teacher distillation framework to mitigate the class imbalance problem, using sampled unlabeled data to extract knowledge from the base class teacher and new class teacher models and transfer the knowledge to the student model. Similarly, Abdelsalam et al. [26]

Loss Optimization Method for Knowledge Distillation.
In studies on knowledge distillation loss, existing distillation loss optimization methods [28][29][30][31][32] mainly optimize the deficiencies of the incremental learning processes by combining them with other techniques. Li et al. [30] prevented the features extracted from the intermediate neural network layers from changing drastically by adding feature distillation loss terms and minimizing the feature differences using a smoothed L1 loss function. EEIL [28] combines cross-entropy and distillation loss into an end-to-end learning network, using cross-entropy to learn new classes and distillation to retain knowledge corresponding to old classes. Xiang et al. [29] proposed a dynamic correction vector algorithm that combines representational memory and knowledge distillation loss to optimize cross-entropy and knowledge distillation loss functions to alleviate knowledge distillation bias and model overfitting problems. Douillard et al. [33] combined representation learning with distillation to mitigate the impact of feature extraction network changes by using a multiagent classifier through spatially based distillation loss-constrained representation evolution. To address the old-new data imbalance problem, Wu et al. [31] proposed a BiC algorithm for large-scale data processing based on distillation loss to correct old-new class bias. Similarly, Hou et al. [32] combined cross-entropy loss, feature-based distillation loss, and marginal ranking loss, which separates old and new classes, to mitigate the adverse effects of class imbalance. For the object detection problem, ILOD [9] uses knowledge distillation to regularize the output of the final classification and regression layers to retain the performance of the old task. Chen et al.
[6] used cue learning to maintain the initial model feature information and added it to the distillation loss calculation while setting a confidence loss to extract the confidence information of the initial model to further mitigate forgetting. In the detector feature space, Yang et al. [34] investigated the applicability of both old and new classes and set a distillation loss term for two-stage Faster-RCNN from three perspectives: channel-based, point-based, and instance-based. Notably, the introduction of knowledge distillation aggravates the model's focus on old tasks and reduces its performance on new ones.
Based on this, Peng et al. [10] applied adaptive distillation to multiple networks in the Faster-RCNN detector, using the teacher network as a lower bound and adaptively extracting knowledge to improve new task learning. In contrast, Joseph et al. [5] set the warp loss to optimize the gradient update direction by specifying a layer of the RoI head network for meta-learning, making the model better adapted to learning new tasks. In conclusion, the design of the loss function can both remedy the degradation of new-task learning performance caused by the introduction of knowledge distillation and alleviate forgetting, prompting us to place a greater emphasis on optimizing the distillation loss in our work.

Gradient Meta-Learning.
In contrast to the aforementioned methodologies, contemporary scholars have focused their efforts on investigating the potential of meta-learning to facilitate enhanced computational efficiency in models. The initial work by Andrychowicz et al. [35] established the groundwork for the advancement of gradient meta-learning. Their proposal involved the automatic learning of hyperparameters for model optimizers by specifying particular optimizers. However, this approach poses difficulties in the selection of suitable optimization algorithms and parameter settings. Furthermore, model-agnostic meta-learning (MAML) [36] is a prominent method in gradient meta-learning that aims to enhance the initial model parameters for various tasks by performing gradient meta-updates on multiple tasks. This approach has garnered significant attention in the domain of few-sample learning. Nevertheless, the effectiveness of MAML is contingent upon the availability of high-quality task data, and its sensitivity to hyperparameters imposes certain restrictions. To address these challenges, Franceschi et al. [37] introduced differentiable convex optimization techniques in the field of meta-learning. This approach aimed to enhance the stability of the meta-update strategy and resolve the sensitivity issues associated with previous methods. In addition, Snell et al.
[38] proposed a gradient-based meta-learning method utilizing a prototype network. This method proved effective in scenarios with limited training samples and overcame the challenges posed by sparse data through the concept of category prototyping. Similarly, Kedia and Chinthakindi [39] employed the Reptile algorithm in meta-learning, combined with an inductive bias on pretrained weights, to enhance the generalization performance of the model. Notably, the meta-gradient of the Reptile algorithm incorporates a gradient component that maximizes the inner product between the gradients of different batches from the same task, thereby facilitating greater adaptability to new tasks. Furthermore, Xu et al. [40] and other researchers have integrated reinforcement learning with gradient-based meta-learning techniques to enhance the effectiveness of deep reinforcement learning in large-scale applications. This is achieved by considering the payoff function as a parametric function with adjustable meta-parameters and addressing the optimization of task-specific objectives. Consequently, the integration of meta-learning has expanded the scope of incremental learning applications. Furthermore, Joseph et al.
[5] introduced a meta-learning approach for incremental object detection. This method involves learning from intermittent data inputs. During the meta-learning process, newly acquired data are incorporated into the Faster-RCNN network, and the RoI head modules with unfrozen weights are adjusted to better align with the new task. This approach addresses the issue of performance degradation on new tasks when distilling knowledge. Inspired by the above work, we note that a common strategy for incremental object detection methods to retain acquired knowledge is to simulate the activation of the original model by minimizing a first-order distillation loss. We therefore devise a new Faster-RCNN-based multinetwork distillation strategy for object detectors. We account for the degraded new-task performance caused by knowledge distillation and, in addition, investigate the model output data distribution bias that arises between the teacher and student models during learning.

Proposed Method
Incremental object detectors do not require all data classes to be available in advance. When new data are input, the unique structure of the detectors can prevent catastrophic forgetting. Incremental object detection (iOD) is a commonly used approach for object detection and is characterized by its multinetwork structure. In this approach, a new model, referred to as the student network, is trained to learn a new task while the weights of the previous model, known as the teacher model, are kept fixed. This is achieved by setting the distillation loss, which helps mitigate the performance degradation of the student network on the new task caused by distillation. However, iOD's method of directly calculating the distillation loss on the input parameters can result in an imbalanced data distribution, leading to instability in the model's output. As shown in Figure 1, our method mitigates catastrophic forgetting via multinetwork knowledge distillation and experience replay. Specifically, we reinforce the focus on past tasks at the beginning and end of the model and consider the problem of poor RoI head training in the next phase due to the lack of RPN training on old tasks. We employ adaptive distillation in the intermediate RPN stage to conditionally maintain the model's focus on the old-task proposal regions; moreover, during the knowledge distillation process, we average the input to improve the stability of the model output. Furthermore, to prevent knowledge distillation from overprotecting past tasks and limiting the learning of new tasks, we use the gradient preprocessing meta-learning method in [5]. Specifically, the model learning process can be divided into two distinct phases: the incremental learning phase and the fine-tuning phase. In the incremental learning phase, the model learns the image features of the task by optimizing the specified loss function. On the other hand, the fine-tuning phase
involves further training the model on a small amount of stored data. This process allows the model parameters to be adjusted to effectively adapt to both previous and new tasks. The process of incremental learning can be divided into two stages: the initial stage involves learning a new task (referred to as the new task loss), while the learning process is constrained by the distillation loss to limit changes in the model parameters related to previous tasks (referred to as the multinetwork mean distillation loss). The second stage utilizes a meta-learning gradient matrix to adjust the direction of the model's learning gradient (referred to as the warp loss). In the subsequent exposition of the experimental findings, all reported results are fine-tuned outcomes unless a particular phase of incremental learning is explicitly referenced.

Problem Formulation.
For a continuous task stream T, task T_t (T_t ∈ T) is delivered to the object detector at moment t, where T_t is composed of the incremental subtask set together with the labeled image dataset D at moment t. The images contain several objects from different classes, but the labels are only valid for objects in task T_t. Moreover, we set the update rule of M_OD to be determined by θ. The parameter θ, which defines M_OD, is divided into the task parameter ψ and the warp parameter ϖ; that is, θ = ψ ∪ ϖ and ψ ∩ ϖ = ∅. In terms of the learning process, the model learns the task parameter ψ in the first stage and the warp parameter ϖ in the second stage.
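In code, this partition of θ can be sketched as a simple filter over the model's named parameters; identifying warp layers by a `warp` substring in their names is our illustrative convention, not the authors' implementation:

```python
def split_parameters(named_params, warp_keyword="warp"):
    """Partition the parameters theta into task parameters (psi) and
    warp parameters (varpi). Warp-layer parameters are assumed to be
    identifiable by name -- an illustrative convention only."""
    psi, varpi = {}, {}
    for name, value in named_params.items():
        # a parameter belongs to exactly one side, so the union is theta
        # and the intersection is empty, matching theta = psi U varpi
        (varpi if warp_keyword in name else psi)[name] = value
    return psi, varpi
```

In the first stage, only the `psi` group would be passed to the optimizer; in the second stage, only `varpi`.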
The specific learning process can be described as follows: at time t, given a task T_t, the object detector learns from the input image via M_OD(I), which can be regarded as an aggregate function. For the two-stage Faster-RCNN, M_OD(I) can be formulated as the composition of a backbone network M_Backbone, a region proposal network M_RPN, and an RoI head M_RoIHead; i.e., M_OD(I) = (M_RoIHead ∘ M_RPN ∘ M_Backbone)(I). The input I is subjected to feature extraction by M_Backbone to generate the feature map F. M_RPN uses these features to generate N candidate regions that may contain objects, together with the corresponding scores, and each candidate region is passed through M_RoIHead to calculate the probability of being assigned to one of the classes in tasks T_{i≤t} and to perform regression calculations on its border positions. For incremental object detection, it is challenging to maintain the old task performance in a continuous task stream T without accessing all the data; compared with the ordinary incremental classification problem, the incremental object detector must consider classification and border regression across multiple networks with respect to old-knowledge memory, such as the classification-regression problem of M_Backbone and the old-class feature extraction problem of M_RPN and M_RoIHead in the Faster-RCNN model. In our method, we employ a knowledge distillation strategy by freezing the past network model as the teacher network to guide the current task model as the student model. For the purposes of the subsequent theoretical elaboration, elements labeled with "te" are defined as elements related to the teacher network, such as the teacher object detector M_OD^te, and elements labeled with "st" are defined as elements related to the student network, such as the student object detector M_OD^st.

The RoI head training loss combines a classification term and a localization term:

L(p, p*, l, l*) = L_cls(p, p*) + λ[p* ≥ 1]L_loc(l, l*), (1)

where L_cls(p, p*) = −log p_{p*} is the log loss of the predicted class versus the true class, and L_loc is the
smooth L1 loss; when p* = 0, i.e., the background, the bounding box regression loss does not need to be calculated. Similarly, training M_RPN yields a prediction score o ∈ [0, 1], which indicates whether the selected region contains instances, and the corresponding bounding box prediction r. This loss is defined as follows:

L_RPN(o, o*, r, r*) = L_cls(o, o*) + λ o* L_loc(r, r*), (2)

where o* indicates whether the region contains real labels: o* = 1 if the region contains real labels and o* = 0 otherwise. r* is the real bounding box regression target. The weighting parameter λ is set to 1 in all subsequent experiments.
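As a sketch, the RPN training loss above (binary log loss on the objectness score plus a λ-weighted smooth-L1 regression term that is counted only when o* = 1) might be written as follows for a single anchor; a real implementation would be batched over all anchors and normalized:

```python
import math

def rpn_loss(o, o_star, r, r_star, lam=1.0):
    """Single-anchor sketch of the RPN loss: binary log loss on the
    objectness score o, plus a smooth-L1 box regression term gated by
    o_star (the regression term vanishes for background anchors)."""
    eps = 1e-12  # numerical guard for log
    cls = -(o_star * math.log(o + eps) + (1 - o_star) * math.log(1 - o + eps))
    loc = 0.0
    if o_star == 1:
        for rv, rs in zip(r, r_star):
            d = abs(rv - rs)
            loc += 0.5 * d * d if d < 1 else d - 0.5  # smooth L1
    return cls + lam * loc
```

For a background anchor (o* = 0) only the classification term survives, matching the note above that the regression loss is skipped for the background.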

Multinetwork Mean Distillation Loss.
Similar to the way new tasks are learned, our model retains the performance of previous tasks while new tasks are learned by calculating the mean distillation loss of multiple networks. Similar to Faster ILOD [10] and iOD [5], we use knowledge distillation, softening the softmax output by inserting a temperature factor T into the log output in equation (3) to maintain the model's performance on past tasks in a continuous task stream. However, unlike Faster ILOD, which uses multinetwork adaptive distillation, and iOD, which only distills the feature map and RoI head, we strengthen the distillation output at both ends, M_Backbone and M_RoIHead, to ensure the accuracy of backbone feature extraction at the very first input and RoI head detection at the final output. On the other hand, to ensure that the anchors input to M_RoIHead contain past memory, we use adaptive zero-mean distillation in the middle RPN layer, alleviating the overprotection of past knowledge while preserving the RPN's memory of past knowledge by adaptively increasing the distillation loss.
In addition, we consider that the direct introduction of knowledge distillation will cause the new model to update its network parameters adaptively during training to adapt to the new task, resulting in a large deviation of its output from the output data distribution of the old model, which makes the network output difficult to fit and unstable. Therefore, we first apply zero-mean filtering to the inputs of M_Backbone and M_RPN to obtain the new outputs before calculating the distillation loss. Specifically, the zero mean is obtained by subtracting the mean of all pixels from each pixel f_i, as shown in equation (4). The model output after zero averaging has all pixel points distributed with the origin as the center point, preventing the situation where the data distribution is all negative or all positive at a certain time while keeping the shape of the original data distribution, and the mean value of all pixels after zero averaging is zero. This makes the model output more stable by facilitating the convergence of the model weights during the back-propagation process.
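A minimal sketch of this zero-averaging step (equation (4)) on a 2-D feature map, represented here as nested Python lists:

```python
def zero_mean(features):
    """Zero-average a feature map: subtract the mean of all pixels from
    each pixel, so the output is centred at the origin while the shape
    of the original distribution is preserved (equation (4))."""
    flat = [f for row in features for f in row]
    mu = sum(flat) / len(flat)
    return [[f - mu for f in row] for row in features]
```

After this step the pixel values always straddle zero, so the teacher and student outputs fed into the distillation loss share a common centre even as the student's parameters drift toward the new task.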
Therefore, to retain the model performance on the previous task during the learning of the new task, we perform multinetwork distillation on the backbone network, RPN, and RoI head network and add a mean value strategy to remember past information.

Backbone Distillation Losses.
F is the layer containing the extracted object pixel features from the image, and F contains each feature pixel f_i. To obtain the object features associated with the old and new classes, a distillation loss constraint needs to be applied to M_Backbone. We learn M_Backbone with the current parameters θ_t by freezing the weight parameters of the previous model θ_{t−1} and using the teacher network to teach the student network. For the same input I, the teacher network and the student network obtain outputs F_te and F_st, respectively. Furthermore, the output of M_Backbone serves as input to the subsequent classification and regression steps, and an accurate description of the old and new class features is particularly important for those steps, so we strengthen the distillation of M_Backbone. In addition, for faster convergence, we obtain F̃_te and F̃_st by zero averaging the features via equation (4). The backbone distillation loss is defined as follows:

RPN Distillation Loss.
For M_RPN, we soften the classification output with the temperature factor of equation (3), apply the zero averaging of equation (4), and use the KL divergence loss as the classification loss. We determine the value in each dimension of l for the anchor regression of M_RPN, and we regulate the regression by setting a threshold ξ. We take the empirical value ξ = 1 and use the sum of the o value of M_RPN^te over M_RPN^st and ξ as the activation value in the distillation loss calculation; however, higher values of M_RPN^st may be more important for new task learning and are therefore not involved in the distillation loss calculation. The RPN distillation loss is defined as follows: where õ_st,i^T and õ_te,i^T are the scores output by M_RPN after the zero-averaging treatment of equation (4) and the distillation softening treatment of equation (3), respectively; the empirical value T = 6 is used in our experiments. N is the total number of anchors.
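One possible reading of this adaptive gate, sketched below: an anchor contributes to the distillation loss only while the student score does not exceed the teacher score by more than ξ, since a much higher student score is assumed to matter for the new task. The scores are treated as raw (pre-sigmoid) RPN outputs so that ξ = 1 is a meaningful margin, and the squared-error form of the distillation term is our assumption:

```python
def adaptive_rpn_distill(o_te, o_st, xi=1.0):
    """Sketch of the adaptive RPN distillation gate (an interpretation,
    not the authors' exact code). Anchors where the student score
    exceeds teacher score + xi are excluded from distillation; the
    rest incur a squared-error penalty toward the teacher score."""
    active = [i for i, (t, s) in enumerate(zip(o_te, o_st)) if s <= t + xi]
    if not active:
        return 0.0  # every anchor was deemed important for the new task
    # mean squared error over the anchors that remain gated in
    return sum((o_te[i] - o_st[i]) ** 2 for i in active) / len(active)
```

In the second anchor of the usage below, the student scores far above the teacher, so it is dropped from the loss and only the first anchor is distilled.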

RoI Head Distillation Loss.
M_RPN generates proposals for the old and new classes, which are passed through RoI pooling into M_RoIHead to obtain the final classification probabilities (p_te, p_st) and border regression values (l_te, l_st) for the teacher and student networks. In our approach, we focus more on the classification and regression of the old classes, considering that M_RoIHead is the final stage of the two-stage object detector. In addition to the normal distillation loss calculation, we calculate the mean of each channel with respect to the class by equation (7) to increase the focus of M_RoIHead on the overall trend of the final classification probability and border regression. The temperature factor T is also introduced into the log output of equation (3) to soften the softmax output and obtain more information about past knowledge.
where C denotes the number of channels and i denotes the i-th parameter of the j-th channel. The M_RoIHead distillation loss is therefore defined as follows: where p_te,i^T, p_st,i^T, l_te,j, and l_st,j are all variables processed by the mean value operation of equation (7). The parameters labeled with T are additionally processed through the temperature softening of equation (3).
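The temperature-softened KL term used for the RoI head classification distillation can be sketched as follows; the list-based form is illustrative, and using T = 6 here assumes the empirical value reported for the RPN is shared:

```python
import math

def softened_probs(logits, T=6.0):
    """Temperature-softened softmax (equation (3)): larger T flattens
    the distribution so more of the teacher's low-probability ('dark')
    knowledge survives the softening."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def roi_kl_distill(teacher_logits, student_logits, T=6.0):
    """KL divergence between softened teacher and student class
    distributions -- a sketch of the RoI head classification
    distillation term."""
    p = softened_probs(teacher_logits, T)
    q = softened_probs(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

When the student reproduces the teacher's class logits exactly, the term vanishes; any drift away from the teacher's distribution over old classes is penalized.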

Total Mission Loss in the First Phase.
The first stage of our model's learning process for the task parameters can be characterized as learning the current task through each stage while correcting the model parameters to maintain past task performance through the distillation loss in each stage. This allows the loss of the overall task to be defined as a linear combination of the new task loss and the multinetwork mean distillation loss. To balance the model performance on the past and present tasks, we employ a convex combination similar to that in [5] and set the stability-plasticity trade-off parameter α. The total loss in the first stage is defined as follows: where α is set to 0.1 in our experiments, as determined in Section 4.3.
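The convex combination can be sketched as below; the text does not state which term α scales, so weighting the distillation (stability) term by α = 0.1 and the new-task (plasticity) term by 1 − α is our assumption:

```python
def total_first_stage_loss(new_task_loss, distill_loss, alpha=0.1):
    """Convex combination of the new task loss and the multinetwork
    mean distillation loss. Which term alpha weights is an assumption:
    here alpha scales the distillation (stability) side."""
    return (1 - alpha) * new_task_loss + alpha * distill_loss
```

Because the weights sum to 1, raising α trades plasticity (new-task fit) for stability (old-task retention), and vice versa.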

Gradient Matrix Warp Loss.
For the second phase of learning, the warp parameter ϖ, as depicted in Figure 2, is configured in the network warp layer in M_RoIHead to learn the preprocessing matrix P(θ; ϕ).
In the distillation learning process, a small number of images are stored for each class by setting up an image store I_store, and the images stored in I_store are placed in the set feature store F_store after feature extraction by M_Backbone. Notably, F_store defines a fixed-size queue N_feat for each class to mitigate the class imbalance problem. The stored queue features are incorporated directly into the task learner by utilizing the meta-learned parameterization of P(θ; ϕ), which warps the gradient toward the steepest direction and enables the parameters to be updated in the most suitable direction for different learning tasks.
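The fixed-size per-class queue behind F_store can be sketched with `collections.deque`; the class name and method names are ours:

```python
from collections import deque, defaultdict

class FeatureStore:
    """Sketch of F_store: one fixed-size queue of capacity n_feat per
    class, so no class can dominate the store (mitigating imbalance).
    Names are illustrative, following the text."""
    def __init__(self, n_feat=10):
        self.queues = defaultdict(lambda: deque(maxlen=n_feat))

    def add(self, label, feature):
        # when the queue is full, deque(maxlen=...) silently drops
        # the oldest feature for that class
        self.queues[label].append(feature)

    def sample(self, label):
        return list(self.queues[label])
```

Bounding each class's queue separately is what keeps a frequent new class from evicting the stored features of rarer old classes.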
Each image in I_store is passed through M_Backbone and M_RPN to generate the RoI pooled features and associated labels, which are then queued into F_store. Let f be the RoI pooled features, where f generates the predicted classification value p and the border prediction l through M_RoIHead. The warp loss can then be calculated from the features and labels stored in F_store, and l_warp is calculated as follows: where L_cls is the log classification loss and L_loc is the smooth L1 loss.

For the evaluation metrics, the average precision at a 50% IoU threshold (mAP@50) was used as the main evaluation metric for both datasets. For MS COCO, we set multiple IoU thresholds (AP, AP50, and AP75) and sizes (APs: small, APm: medium, and APl: large) as evaluation metrics.

Incremental Experimental Scenario Setting.
Similar to [5], we simulate incremental scenarios for PASCAL VOC and MS COCO, where the dataset D_t provides a set of selected classes C to be used for task T_t and passed to the learner at moment t. For each image in the dataset D_t, which may contain multiple classes, one or several classes belonging to C will be learned as the task object classes, and instances of classes that do not belong to C will not be labeled for learning.
According to the different difficulty levels of the classification task, we considered the effect of the learning intensity of the initial base class tasks and incremental tasks on the model output results and defined class flow tasks and batch tasks. Class flow tasks can be interpreted as incremental tasks T_i that flow into the model after it learns the base class task T_0, with 1 to 2 additional classes per task, whereas batch tasks contain only one incremental task with 1 to 2 additional classes. As indicated in Table 1, we devised seven incremental scenarios of varying difficulty according to the divided class flow tasks and batch tasks. The dataset used in experiments (a) through (f) comprises the first 20 classes of PASCAL VOC, and the dataset used in experiment (g) comprises the 80 classes of the MS COCO dataset.
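The class-flow splits can be illustrated as below; the exact base/increment sizes per scenario follow Table 1, and this helper function is ours:

```python
def make_task_stream(classes, base_size, step=1):
    """Illustrative split of a class list into a base task T0 plus a
    stream of incremental tasks with `step` new classes each, mirroring
    the class-flow scenarios (exact splits come from Table 1)."""
    tasks = [classes[:base_size]]                 # base task T0
    for i in range(base_size, len(classes), step):
        tasks.append(classes[i:i + step])         # incremental T1, T2, ...
    return tasks
```

For example, a 15 + 1 + 1 + 1 + 1 + 1 class-flow scenario over the 20 VOC classes is `make_task_stream(voc_classes, base_size=15, step=1)`, while a batch scenario is the special case with a single incremental task.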

Stability Analysis of the Zero Mean.
To validate the influence of the zero mean on the stability of the model output, we conducted five consecutive replications of experiments (d) through (f) and created box plots, as depicted in Figure 3, based on the output data. As seen from the figure, in experiment (d), our method has a more obvious advantage; its lowest mAP@50 value is 64.7%, which is higher than the highest value of 63.63% obtained by the iOD method. In experiment (e), our approach's stability is not only comparable to but substantially superior to that of the iOD method, which exhibits an outlier (62.68%). In experiment (f), the gap between our method and iOD in terms of stability is wider, and the stability of our model is still better than that of iOD, although the maximum value of 68.28% obtained by iOD is higher than the lowest value of 68.18% obtained by our method in terms of accuracy. This indicates that in the incremental task containing only one class, the gap between different methods fluctuates slightly, but the results in Figure 3(c) show that our method still outperforms iOD in terms of overall accuracy. The stability experiments reported in Figure 3 show that our method yields more stable outputs than the iOD method, which lacks the zero mean, and the overall accuracy of the five experiments in all three scenarios is higher than that of the iOD method, which fully demonstrates the reliability and stability of our approach.
To search for the optimal stability equilibrium parameter α of equation (9), we conducted several experiments on the value of α based on the incremental task approach of experiment (d).

Ablation Experiment.
To confirm the increase in accuracy from the approaches introduced by our method (the RPN adaptive distillation and the zero mean), we carried out ablation experiments. As shown in Table 3, we again used the task form of incremental experiment (d). The results of the experiment are the mAP values of the base class task T_0 (first 10 classes), the incremental task T_1 (last 10 classes), and the overall 20 classes. As seen from the table, when only zero averaging is introduced, the learning ability on the new task is improved, and the mAP of T_1 reaches 67.33%. When the RPN adaptive distillation loss is introduced, the stability of past knowledge is improved, and the mAP of T_0 reaches 62.20%. When both strategies are implemented, the highest mAP values are attained for the new task T_1 and for all 20 classes (68.28% and 65.00%, respectively), and the experimental results are consistent with the conclusions of our theoretical analysis in Section 3.3. Regarding the remaining two approaches [9, 10], it is worth noting that Faster ILOD [10] and the method by Shmelkov et al.
[9] exhibit a comparative advantage in preserving performance on previous tasks.However, in practical production scenarios, emphasis is placed on the signifcance of new tasks.Consequently, our proposed method not only enhances the model's performance on new tasks but also  Tis improvement results in a more substantial enhancement of overall task performance compared to the aforementioned approaches [10,11], with superior performance demonstrated on all tasks.Furthermore, the results from Experiment 3 and Experiment 4 in Table 3 demonstrate that the inclusion of the zero mean in Experiment 3 leads to a decrease of approximately 0.5% in T 0 accuracy in Experiment 4, while the T 1 accuracy shows an improvement of approximately 2%.Tis improvement can be attributed to the utilization of the RPN adaptive distillation loss, which enhances the model's focus on the old task (T 0 : 62.20% in Experiment 3).Te introduction of zero mean learning causes the model to focus on the new input data, thereby allowing the model weights to be better adapted to the new task during distillation computation, resulting in a 2% enhancement in the performance of the new task.However, importantly, the performance of the old task is still maintained to some extent, albeit with a decrease of 0.5%.In comparison to Faster ILOD [10] and the model by Shmelkov et al. [9], our method achieves superior performance in T 1 , surpassing them by 13.81% and 5.14%, respectively.In addition, our method outperforms both approaches in terms of overall performance, surpassing them by 2.88% and 1.85%, respectively.
To provide additional evidence for the effectiveness of our approach, we conducted a more detailed analysis of the incremental learning phase in the ablation experiments (refer to Table 4). The results indicate that the average precision (AP) values obtained by the model for identifying the old classes during the incremental class learning phase exceed 9.09. This can be attributed to the model's ability to generate probability distributions that are highly similar for these classes while still exhibiting subtle differences that enable partial identification of the old classes. This phenomenon may arise from the inherent limitation that the student model cannot exactly replicate the probability distribution of the teacher model during the knowledge distillation procedure; the student model can only approximate the teacher's output distribution. Hence, as depicted in Table 4, incorporating RPN adaptive distillation (Experiment 3) during the incremental learning phase enables the model to effectively recognize the old classes, resulting in the identification of 9 classes, and the overall accuracy of recognizing the old classes is enhanced by 2.83% compared with the mAP value obtained in Experiment 1. Incorporating the zero mean (Experiment 4) enhances the model's performance on the new task, which reaches 72.27, while the performance on the old task decreases by only a marginal 0.11%. This outcome indicates that our approach achieves a better trade-off between plasticity and stability, effectively addressing both the maintenance of performance on the old task and adaptation to the new task.

Analysis of the Experimental Results of Incremental Object Detection.
We employ stochastic gradient descent (SGD) with a momentum of 0.9. The initial learning rate is set to 0.02 and is then decreased to 0.0002. Each job receives 18,000 iterations of base-class training on the PASCAL VOC dataset, followed by 100 iterations per image and a total of 90,000 iterations for each of the two tasks. The model is trained on a single 2080Ti GPU; since each GPU simultaneously processes two images, the batch size is two. The queue sizes N_feat and N_img of the feature store F_store and the image store I_store are set to 10. The evaluation process considers 100 detections per image, and the NMS threshold is 0.4. The stability coefficient α is 0.1. As seen from the data in Table 5, in the increment scenario of experiment (a), we set the base class T0 = 10, and our mAP value is higher than that of the iOD method throughout the class increment process. However, the overall task mAP value gradually decreases with the input of each Ti at each increment step when Ti = 1. The largest mAP difference reaches 3.6%, and the average mAP value is 2.36% larger than that of iOD as the class increments accumulate. The detailed progression of experiment (a) is depicted in Figure 4(a). As shown, the difference in mAP between our model and iOD for the base class, the old classes, and all classes steadily widens as each Ti is added, demonstrating that our model is more stable under an incremental task flow. In the incremental scenario of experiment (b), we increased the complexity of the incremental task by setting Ti to 2.
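For reference, the hyperparameters listed above can be collected into a single configuration sketch (Python; the key names are our own shorthand, not taken from the authors' code):

```python
# Hypothetical consolidation of the training settings reported in the text.
TRAIN_CFG = {
    "optimizer": "SGD",
    "momentum": 0.9,
    "lr_initial": 0.02,
    "lr_final": 0.0002,
    "base_iterations": 18_000,         # base-class training on PASCAL VOC
    "incremental_iterations": 90_000,  # per incremental task
    "batch_size": 2,                   # two images on a single 2080Ti GPU
    "store_queue_size": 10,            # N_feat and N_img for F_store / I_store
    "detections_per_image": 100,       # considered during evaluation
    "nms_threshold": 0.4,
    "alpha_stability": 0.1,            # stability coefficient alpha
}
```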
Table 6 shows that as the difficulty of the incremental task increases, our model has a clear advantage over experiment (a) in terms of overall accuracy, with an average mAP difference of 4.5%. In the details of experiment (b) depicted in Figure 4(b), the gaps in mAP between our model and experiment (a) for the base class, the old classes, and all classes are more pronounced. Specifically, in learning the T2 task, the all-class gap reaches 5.1%. In comparison to experiment (a), it can be observed that as the incremental task learning difficulty grows, the mAP value of the model for the overall task declines at a faster rate. Experiment (c) increases the difficulty of learning the basic classes by raising T0 to 15 with Ti = 1. When the number of learned base classes increases to 15, the gap between our model's mAP and that of iOD gradually widens with the learning of new classes, reaching a maximum of 3.9% in Table 7. In the details of experiment (c) shown in Figure 4(c), the mAP of our model is better than that of iOD for the base class, the old classes, and all classes. In the class flow task experiments, we observe that relative to Ti = 1 in experiment (a), when the difficulty of learning Ti is increased, as in experiment (b), the model's learning effect decreases significantly for each learned Ti, as depicted in Figure 4(b); however, the final all-class mAP is comparable to that obtained in experiment (a) and remains at 46.9%. Compared with the T0 = 10 learned classes in experiment (a), increasing the base-class task learning difficulty, as in experiment (c), decreases the learning effect on the very first task while keeping the total number of tasks and the Ti learned classes constant. However, with a shorter incremental task stream, the model's final learning effect is better than that of a class stream with many tasks, as in Figure 4(c), and it remains at 54.8%. Table 8 shows our results on the COCO dataset for experiment (g). Specifically, we set up incremental
scenarios with T0 = 40 and T1 = 40 and used the standard COCO evaluation protocol with multiple IoU metrics (AP, AP50, and AP75) and object sizes (APs: small, APm: medium, and APl: large) for a comprehensive evaluation. As seen in Table 8, our model continues to show excellent performance even in the high-volume class learning scenario of the complex COCO dataset. It outperforms the iOD technique by more than 2% across all evaluation scales, and its AP50 is 4.7% higher than that of the iOD method.
On the PASCAL VOC dataset, we report results compared with those of the model by Shmelkov et al. [9], Faster ILOD [10], and iOD [5] in terms of mAP, while on the MS COCO dataset, we compare our results with those of iOD [5] using the standard COCO evaluation method. Tables 9-11 show the results of these comparisons.
In experiment (d), we set up a batch task increment scenario with T0 : T1 = 10 : 10. Table 9 reveals that our model achieved the best learning effect among all compared methods; the mAP score reached 65.0%, and the best learning effect was on the new tasks, where the mAP reached 68.3%. Our method is also superior to iOD in maintaining old-task performance, with the old-class mAP reaching 61.0%.
In experiment (e), we adjusted the class learning ratio of T0 and T1 (T0 : T1 = 15 : 5). In the results of experiment (e) shown in Table 10, our method is slightly inferior to the methods in [9, 10] in terms of overall task mAP, but it is superior to the most recent method, iOD, in retaining old-task performance and in the new-task learning effect. It obtained 64.5% overall task mAP, 66.9% old-task mAP, and 64.5% new-task mAP.
In experiment (f), we increased the number of classes learned in T0 to 19. In the results reported in Table 11, our model is comparable to the other methods in overall task learning and old-class task performance, but it is optimal in terms of mAP. It obtains 68.9% mAP for the overall task and 68.9% mAP for the old-class task.
The class flow task experiments and batch task experiments demonstrate that the smaller the number of classes learned by Ti in the incremental task phase, the smaller the fluctuation of the model's total mAP value during each new task Ti learning process, with the smallest fluctuation occurring when Ti = 1. In contrast, when the number of classes learned by Ti increases, the gap in the overall mAP scores of the various methods increases significantly, and our method significantly outperforms the other methods in all the incremental scenarios in the experiment. Table 12 presents a comparison of the training time between our method and the iOD method during the incremental learning phase. The results indicate that our method requires slightly more training time, primarily due to the additional computational parameters involved in the incremental learning phase. However, the difference in training time between the two methods is relatively small. Notably, our method achieves a higher mean average precision (mAP) on both the old and new classes than the iOD method, with an improvement of 4.4%. In future studies, we will conduct additional investigations to decrease the time needed for model training while maintaining optimal model performance.

Conclusion
The catastrophic forgetting problem is mainly addressed by knowledge distillation in existing object detection models; however, object detection models with multiple network architectures require distinct types of distillation procedures. In this work, we present a novel multinetwork mean distillation method for object detection that uses zero averaging to process the model output parameters. It then adds the parameters to the distillation loss to further improve the stability of the model output while strengthening the distillation loss at the input and output sides of the model network structure and adaptively distilling the intermediate network structure to better obtain accurate outputs. We combine meta-learning with the multinetwork mean distillation method. We set up numerous incremental tests on the two basic datasets, and the outcomes show that our model performs better than the comparison models.

3.2.
New Task Loss. The learning of new tasks by the model can be viewed as the application of a loss function to learn the model parameters. Specifically, the object detector uses the loss function to minimize the classification error and the bounding box localization error. Let p = (p0, ..., pk) denote the predicted probabilities of the k + 1 classes (k real classes and 1 background class), and let l = (lx, ly, lw, lh) denote the predicted bounding box position after pooling the features of each RoI. The true labels are p* (true class) and l* (true-class bounding box position), and L_RoI head is defined as follows:
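The equation itself did not survive extraction; under the standard Faster R-CNN conventions implied by the definitions above, the RoI head loss would take the following form (a reconstruction, so the weighting and notation may differ slightly from the authors' exact equation):

```latex
L_{\mathrm{RoI\,head}}(p, p^{*}, l, l^{*})
  = L_{\mathrm{cls}}(p, p^{*})
  + \lambda\,[p^{*} \geq 1]\, L_{\mathrm{loc}}(l, l^{*}),
\qquad
L_{\mathrm{cls}}(p, p^{*}) = -\log p_{p^{*}},
```

where L_loc is the smooth-L1 loss over the four box coordinates (lx, ly, lw, lh) and the indicator [p* ≥ 1] restricts the regression term to non-background RoIs.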

Figure 1: Implementation principle (the model uses different data inputs at the different stages of the learning process; one path denotes the second stage, the meta-learning process, and the other denotes the first stage, the new task learning process, which is included in the distillation process; the model employs different distillation processes in different network structures).

Figure 2: Gradient meta-learning mechanism (the images stored in I_store were input to the student network. During this process, all network weights except those in the warp layer of the RoI head network were kept fixed. This was done to enhance the model's performance on the new task using the newly acquired image data).
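The freezing step described in the caption can be sketched as follows (plain Python over a name-to-parameter map; the parameter names and the `warp` key are illustrative, since the actual layer naming is framework specific):

```python
def meta_trainable_mask(param_names, warp_key="roi_head.warp"):
    """Return which parameters stay trainable in the meta-learning step.

    Everything is frozen except the warp layer of the RoI head, so the
    gradient update only adapts that layer to the I_store images.
    """
    return {name: (warp_key in name) for name in param_names}
```

For example, given `["backbone.conv1.weight", "rpn.cls.weight", "roi_head.warp.weight"]`, only the warp-layer entry is marked trainable.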

Figure 3: Stable output experiment (the height of the box plot represents the stability of the model output). (a) Stability experiment for experiment (d). (b) Stability experiment for experiment (e). (c) Stability experiment for experiment (f).

4.4.1. Class Flow Tasks.
Incremental simulations, in which the model learns the first 10 or the first 15 classes of the PASCAL VOC dataset as base classes, are performed, and the detector is fed one or two classes at a time. Tables 5-7 show the experimental results for the class flow task. The first row displays the joint learning of 20 classes as the incremental learning upper bound; the second row displays the model learning the base classes, where the base class comprises the first 10 classes in experiments (a) and (b) and the first 15 classes in experiment (c); and the following rows display each class in turn according to its ordinal number in the class flow task. The table shows the change in the mAP values for all classes as well as the AP values of each class added incrementally in each task.
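The class flow protocol above can be made concrete with a small generator (an illustrative sketch; integer class ids simply stand in for the PASCAL VOC categories):

```python
def class_flow(num_classes=20, base=10, step=1):
    """Yield the class-id groups of an incremental task stream:
    first the base task T0, then T1, T2, ... with `step` classes each."""
    yield list(range(base))                       # T0: base classes
    for start in range(base, num_classes, step):  # incremental tasks Ti
        yield list(range(start, min(start + step, num_classes)))
```

Experiment (a) then corresponds to `class_flow(20, 10, 1)` (ten incremental tasks of one class each), experiment (b) to `class_flow(20, 10, 2)`, and experiment (c) to `class_flow(20, 15, 1)`.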

Figure 4 depicts the trend in the effects of each incremental task carried out by the model during the class flow incremental task on the base class, the old classes, and all classes.
RPN Distillation Loss. M_RPN suggests regions r = (r1, r2, ..., rj) for the old- and new-class features of I extracted by M_Backbone in the previous stage and determines whether each corresponding region has a class score o = (o1, o2, ..., oi). As the first stage of the two-stage object detector, whether the regions proposed by M_RPN contain old and new classes is particularly important for the next stage, in which M_RoI Head classifies and regresses the old and new classes; however, an excessive distillation constraint on M_RPN would increase its focus on the old classes and hinder the learning of the new task. We therefore adopt the idea of Peng et al. [10], who suggested using the teacher network as a lower bound to adaptively choose whether to apply distillation constraints to M_RPN. In addition, we subject the output scores from the teacher RPN (M_RPN^te) and the student RPN (M_RPN^st) to distillation softening in the corresponding equation. Table 2 displays the individual experimental outcomes for the stability parameter α. The table demonstrates that as α gradually increases, our model's experimental accuracy gradually declines from 65.0% to 60.7%. Based on these experimental results, we ultimately set α to 0.1 in all tests.
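A minimal sketch of this adaptive scheme (plain Python; the temperature-softened binary cross-entropy is an illustrative choice, and the exact masking and loss form in the paper may differ): each anchor's objectness is distilled only when the softened teacher score exceeds the student's, so the teacher acts as a lower bound and anchors the student already handles are left free to adapt to the new classes.

```python
import math


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_rpn_distill_loss(teacher_logits, student_logits, temperature=2.0):
    """Adaptive RPN distillation with the teacher as a lower bound."""
    eps = 1e-12
    loss, active = 0.0, 0
    for t, s in zip(teacher_logits, student_logits):
        pt = _sigmoid(t / temperature)  # softened teacher objectness
        ps = _sigmoid(s / temperature)  # softened student objectness
        if pt <= ps:
            continue  # student meets the lower bound: no distillation here
        # soft binary cross-entropy pulls the student up toward the teacher
        loss += -(pt * math.log(ps + eps) + (1.0 - pt) * math.log(1.0 - ps + eps))
        active += 1
    return loss / max(active, 1)
```

When the student matches or exceeds the teacher on every anchor, the loss is zero and the RPN is left unconstrained to learn the new task.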

Table 5: Experiment (a): number of base classes 10, number of incremental task learning classes 1 (%).

Table 6: Experiment (b): number of base classes 10, number of incremental task learning classes 2 (%).
The bold values represent the optimal mAP values of iOD compared with our method. In the batch task settings, we considered class batch learning of the model on the PASCAL VOC and MS COCO datasets with different numbers of base classes and incremental classes to validate the accuracy of our method.

Table 7: Experiment (c): number of base classes 15, number of incremental task learning classes 1 (%).

Table 10: Experiment (e): 15-5.

Table 11: Experiment (f): 19-1.