Adaptive In-network Collaborative Caching for Enhanced Ensemble Deep Learning at Edge

To enhance the quality and speed of data processing and to protect data privacy and security, edge computing has been widely applied to support data-intensive intelligent services at the edge. Among these services, ensemble learning can naturally leverage the distributed computation and storage resources of edge devices to achieve efficient data collection, processing, and analysis. Collaborative caching has been applied in edge computing to support services close to the data source, so that the limited resources of edge devices can support high-performance ensemble learning. To this end, we propose an adaptive in-network collaborative caching scheme for ensemble learning at the edge. First, an efficient data representation structure is proposed to record the data cached on different nodes. In addition, we design a collaboration scheme that helps edge nodes cache valuable data for local ensemble learning, by scheduling local caching according to a summary of the data representations from different edge nodes. Extensive simulations demonstrate the high performance of the proposed collaborative caching scheme, which significantly reduces learning latency and transmission overhead.


Introduction
With the breakthrough of Artificial Intelligence (AI), we are witnessing a booming increase in AI-based applications and services. Existing intelligent applications are computation intensive. To process huge amounts of data in time at the edge of the network, edge computing has developed rapidly in recent years. Edge computing [1] moves part of the computing and storage resources from the data center to the edge of the network, closer to end users, which reduces network transmission delay, protects user privacy, and improves the network experience of end users.
The rapid uptake of edge computing applications and services poses considerable challenges on networking resources, which are difficult to meet with conventional networking infrastructure. The extensive deployment of deep neural network models at the edge makes this problem more serious. Neural network models learn relationships from a huge number of training samples. Meanwhile, such complex nonlinear models are sensitive to initial conditions, both in terms of the initial random weights and the statistical noise in the training data. Because of this stochastic nature of the learning algorithm, each time a neural network model is trained it may learn a different mapping from inputs to outputs, and thus exhibit different performance in practice.
Ensemble learning [2,3] provides a feasible way to handle the variance of a single neural network model. Ensemble learning schemes train multiple models and combine their outputs to reduce the variance of any single model.
By combining different neural network models into an ensemble, we can achieve a higher-quality result than any individual model.
Although ensemble learning can enhance the capability of edge deep learning, there is no performance improvement if similar individual sub-models are combined [4]. Tumer et al. [5] analyzed simple soft-voting ensemble methods by decision boundary analysis, revealing the importance of diversity among sub-models; the same conclusion also applies to other ensemble methods. However, it is not easy to generate individual sub-models with high diversity. A significant challenge arises because sub-models are obtained on similar training data, so they are often highly correlated.
To make the sub-models differ from each other, we target an adaptive collaborative caching scheme that guarantees sub-models learn different data and are consequently diverse. Based on a comprehensive study of collaborative caching at the edge, we propose a compact recording structure for cached data that maximizes the difference among the data cached on different nodes. Besides the caching locality considered in conventional caching schemes [6], it is also important to consider the diversity of the data cached to train sub-models. In this way, we improve the performance of ensemble learning. Our main contributions are summarized as follows: 1. To efficiently record cached data among collaborative nodes, we study data recording and exchanging among edge nodes, and introduce a compact representation structure, the Combinable Counting Bloom Filter (CCBF). 2. We design an adaptive collaborative caching scheme based on CCBF, on top of which a high-performance ensemble deep learning scheme at the edge is proposed. 3. We comprehensively evaluate the proposed scheme in NS-3 based simulations with real-world deep learning models and data.
In Section 2, we review related work. A compact data structure for collecting information about cached data is presented in Section 3. Based on this data structure, we propose an adaptive collaborative caching scheme for ensemble learning in Section 4: we exchange records of the cached data, leverage these data to learn local knowledge, and achieve a high-performance ensemble result. The performance of our design is evaluated in Section 5. Finally, we conclude the work in Section 6.

Related Work
In recent years, ensemble learning at the edge has been used in many kinds of applications [7,8]. Meanwhile, due to the storage limitation of each edge node, edge nodes usually collaborate in data collection and model training [9,10,11].

Ensemble Learning
We investigated relevant works from recent years and found that performance can be effectively improved by introducing an ensemble mechanism. To optimize the structure determination of deep classification models and the combination of multi-modal feature abstractions, Yin et al. [12] proposed a multiple-fusion-layer based ensemble classifier of stacked auto-encoders (MESAE) for recognizing emotions, in which deep learning guides the autoencoder ensemble. Based on the assumption that different convolutional neural network (CNN) architectures learn different levels of semantic representation, Kumar et al. [13] developed a new feature extractor by ensembling CNNs initialized on a large dataset of natural images; experiments showed that the ensemble of CNNs extracts higher-quality features than traditional CNNs. Xiao [14] proposed an ensemble learning method to improve robustness in traffic incident detection. Galicia et al. [15] presented ensemble models for forecasting big data time series. Liu et al. [16] applied ensembles of convolutional neural network models with different architectures to visual traffic surveillance systems. Liu et al. [17] designed an ensemble transfer learning framework that uses AdaBoost to adjust the weights of the source and target data; this method achieved good performance on UCI datasets when training data are insufficient. Chen et al. [18] proposed an ensemble network architecture for deep reinforcement learning to address the limitations of existing ensemble algorithms in reinforcement learning.

Collaborative Caching
Collaborative caching has been applied in the ensemble learning field to collect sufficient data for sub-model training and to achieve high-quality ensemble results. Amer et al. [19] studied the role of wireless caching in low-latency wireless networks and characterized the average network delay on a per-request basis from the global network perspective. Li et al. [20] proposed a cache-aware task scheduling method in edge computing: an integrated utility function is derived with respect to the data chunk transmission cost, caching value, and cache replacement penalty; data chunks are cached at the optimal edge servers to maximize the integrated utility value; after placing the caches, a cache locality-based task scheduling method is applied. Chien et al. [21] proposed a collaborative cache mechanism from multiple Remote Radio Heads (RRHs) to multiple Baseband Units (BBUs), using Q-learning to design the cache mechanism and an action selection strategy that finds the appropriate cache state through reinforcement learning. Ndikumana et al. [22] proposed joint collaborative cache allocation and computation offloading, where the MEC servers collaborate in executing computation tasks and caching data. Tang et al. [23] proposed caching mechanisms in a collaborative edge-cloud computing architecture, which implement the caching paradigm in the cloud for frequent n-hop neighbor activity regions. Khan et al. [24] proposed reversing the way node connectivity is used for content placement in caching networks and introduced a Low-Centrality High-Popularity (LoCHiP) caching algorithm that populates poorly connected nodes with popular content. Wei et al. [25] presented a system that automatically parallelizes serial imperative ML programs over distributed caching: a static dependence analysis determines when dependence-preserving parallelization is effective, and the computational process is mapped to a distributed schedule.

Problems and Our Insight
Although existing caching approaches can facilitate ensemble learning, enhanced collaborative caching remains to be studied. The ensemble learning process depends strongly on the diversity of the sub-models. As analyzed by Tumer et al. [5], if the sub-models are independent of each other, the error of ensemble learning is reduced; if each sub-model is correlated with all the others, the error becomes larger. This analysis clearly reveals the importance of diverse sub-models [4].
To differentiate the sub-models as much as possible, we use different data to train the various sub-models and thereby improve the performance of ensemble learning: (1) an efficient way to record the cached data items is required; (2) the records of the cached data are exchanged among different edge nodes; (3) edge caching is scheduled according to these records, and different sub-models are trained for high-quality ensemble learning.

Combinable Counting Bloom Filter
To keep valuable data on edge nodes, we need to exchange compact records of the cached data. However, the most popular compact recording method, the Counting Bloom Filter (CBF), only supports inserting, deleting, and querying data; it cannot support the combination of multiple filters, which is needed to summarize the exchanged compact records of the cached data. Therefore, to meet this requirement, we introduce the Combinable Counting Bloom Filter (CCBF) in this section.

Design Structure
To support dynamic updates of the cached-data record, as well as the combination of multiple compact records, we design a new structure for the proposed CCBF on the basis of the basic Bloom filter. Multiple basic Bloom filters can be combined by performing a bitwise OR on their bit arrays, but CBFs cannot be combined in a similar way, since a CBF aggregates the information of the inserted data into its counters. Based on this observation, we stack several basic Bloom filters to build a counting Bloom filter that supports updating and merging operations simultaneously.
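The observation above can be checked in a few lines. The sketch below (ours, not the paper's code; the double-hashing scheme is an illustrative choice) shows that OR-merging two plain Bloom filters is exactly equivalent to inserting both item sets into one filter, while counters would need addition rather than OR:

```python
# Sketch: why plain Bloom filters merge with bitwise OR while CBF counters
# do not. Hash derivation via double hashing is an illustrative assumption.
import hashlib

M, K = 64, 3  # illustrative filter size and hash count

def positions(item):
    """Derive K cell positions from an item via double hashing."""
    h = hashlib.sha256(item.encode()).digest()
    a = int.from_bytes(h[:8], "big")
    b = int.from_bytes(h[8:16], "big")
    return [(a + i * b) % M for i in range(K)]

def bf_insert(bits, item):
    for p in positions(item):
        bits[p] = 1

# Two plain Bloom filters, one item each.
bf1, bf2 = [0] * M, [0] * M
bf_insert(bf1, "itemA")
bf_insert(bf2, "itemB")

# OR-merge equals inserting both items into a fresh filter.
merged = [x | y for x, y in zip(bf1, bf2)]
both = [0] * M
bf_insert(both, "itemA")
bf_insert(both, "itemB")
assert merged == both

# Counters, by contrast, would need addition: a cell holding 2 OR a cell
# holding 1 is not 3, so OR-merging CBFs would break delete support.
```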
The detailed structure of CCBF is shown in Figure 1. The CCBF has k hash functions (h_1, h_2, ..., h_k) and consists of the following two components: 1. g bit arrays (barr_i, i = 1, 2, ..., g): bit arrays that replace the counter array of a CBF to support counting the inserted data items; each array has size m, and g is set according to the counting requirement. 2. orBarr: the aggregation of the g bit arrays by bitwise OR, which is used to enhance query efficiency and facilitate data caching among edge nodes.
CCBF supports not only the insertion, query, and deletion of items, but also the combination of multiple CCBFs. These operations are described one by one in the next section.

Related Operations
According to the needs of exchanging and updating cached-data records, CCBF implements inserting, querying, deleting, and combination. In this part, we introduce these operations.

(Figure: cached data item insertion. The pseudo-random number generator is used to select the bit array according to the usage of each bit array.)

Inserting
To insert a data item into CCBF, we use a pseudo-random integer generator to build a random matrix that enables efficient, non-repeated insert operations. The inserting operation is shown in Algorithm 1 and proceeds as follows: 1. A random matrix matrix[g][m] of size g × m is constructed using a pseudo-random integer generator with a different seed for each column; within a column, every cell holds a distinct value in the range 1 to g. 2. Hash the cached data item k times to get k hash results {p_j} (line 3). 3. Use the RandChoice function to search the p_j-th column of matrix[g][m] for the next available bit array, according to the number of arrays whose p_j-th cells are already used (line 4). 4. Set the p_j-th cell of the selected barr_i to 1 and update orBarr (lines 5-7).
Note that we check whether the bit arrays used to record the k hash results show that this data item has already been inserted; in that case, the insert operation is abandoned.
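The RandChoice step can be sketched as follows. This is our reading of the design, with our own names: each column of the random matrix fixes a per-cell order in which the g stacked bit arrays are consumed, so repeated selections for the same cell never collide:

```python
# Hedged sketch of RandChoice: each column of matrix[g][m] is a random
# permutation of array indices, fixing the order in which the g stacked
# bit arrays are used for that cell position.
import random

G, M = 4, 16  # illustrative number of bit arrays and array size

def build_matrix(seed=0):
    rng = random.Random(seed)
    cols = []
    for _ in range(M):
        col = list(range(G))
        rng.shuffle(col)      # per-column order of bit-array usage
        cols.append(col)
    return cols               # cols[p][c] = array used for the c-th set at cell p

def rand_choice(cols, barrs, p):
    """Return the index of the next bit array whose cell p is still unset."""
    used = sum(barrs[i][p] for i in range(G))  # arrays already using cell p
    if used >= G:
        raise OverflowError("cell saturated")
    return cols[p][used]

barrs = [[0] * M for _ in range(G)]
cols = build_matrix()
p = 5
first = rand_choice(cols, barrs, p)
barrs[first][p] = 1
second = rand_choice(cols, barrs, p)
assert first != second        # a fresh array is chosen for the same cell
```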

Querying
We include an additional array, orBarr, to support efficient membership queries: it suffices to check whether the corresponding cells of orBarr are set to 1. The query operation is shown in Algorithm 2: 1. Hash the queried data k times to get k hash results {p_j} (line 2). 2. Check whether orBarr[p_j] is 1 for all j. If so, the data item was inserted before; otherwise, the item is not present (lines 3-7).

Deleting
The delete operation mirrors the insert operation. The key steps are: compute the k hash results p_j = Hash_j(d), locate the bit arrays that recorded the item according to the random matrix, clear the corresponding cells, and update orBarr.

Combination
CCBF supports not only inserting, querying, and deleting data items, but also the combination of multiple CCBFs. Combining multiple CCBFs is equivalent to combining the data items inserted into them. The random matrix matrix[g][m] ensures that the bit arrays are selected in a fixed sequence, so repeated insertions of the same data item can be neglected. The combination operation is shown in Algorithm 3 and proceeds as follows: 1. Determine whether the number of items in the combined compact representation exceeds the capacity n of the CCBF (lines 1-3). 2. Combine the bit arrays one by one by bitwise OR (lines 6-13).
CCBF records the cached data of each edge node in a compact way. By inserting data items into CCBFs and exchanging and combining CCBFs among edge nodes, the cache of each edge node can be scheduled to store diverse data, which facilitates obtaining different sub-models from these data and achieving a high-performance ensemble learning result.
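Putting the pieces together, a minimal end-to-end CCBF can be sketched as below. This is our reconstruction, not the authors' implementation: g stacked bit arrays, an aggregated orBarr for fast queries, and a shared per-column order (same seed on all nodes) so that two CCBFs can be merged by bitwise OR. The duplicate-insert check is omitted for brevity:

```python
# Hedged sketch of the CCBF: stacked bit arrays + orBarr + OR-mergeability.
import hashlib
import random

class CCBF:
    def __init__(self, m=64, k=3, g=4, seed=0):
        self.m, self.k, self.g = m, k, g
        self.barrs = [[0] * m for _ in range(g)]
        self.or_barr = [0] * m
        rng = random.Random(seed)          # same seed => mergeable filters
        self.cols = []
        for _ in range(m):
            col = list(range(g))
            rng.shuffle(col)               # per-cell order of array usage
            self.cols.append(col)

    def _positions(self, item):
        h = hashlib.sha256(item.encode()).digest()
        a = int.from_bytes(h[:8], "big")
        b = int.from_bytes(h[8:16], "big")
        return [(a + i * b) % self.m for i in range(self.k)]

    def insert(self, item):
        for p in self._positions(item):
            used = sum(arr[p] for arr in self.barrs)
            if used < self.g:              # next array in fixed column order
                self.barrs[self.cols[p][used]][p] = 1
                self.or_barr[p] = 1

    def query(self, item):
        return all(self.or_barr[p] for p in self._positions(item))

    def delete(self, item):
        if not self.query(item):
            return
        for p in self._positions(item):
            used = sum(arr[p] for arr in self.barrs)
            self.barrs[self.cols[p][used - 1]][p] = 0  # clear last-used array
            if used == 1:
                self.or_barr[p] = 0

    def merge(self, other):
        for i in range(self.g):
            self.barrs[i] = [x | y for x, y in zip(self.barrs[i], other.barrs[i])]
        self.or_barr = [x | y for x, y in zip(self.or_barr, other.or_barr)]

# Two nodes record different data, then merge their records.
a, b = CCBF(), CCBF()
a.insert("node1-data")
b.insert("node2-data")
a.merge(b)
assert a.query("node1-data") and a.query("node2-data")
```

Because both filters draw their column permutations from the same seed, an item inserted on two nodes occupies the same cells in both, so the OR-merge does not double-count it, which matches the combination property claimed above.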

Ensemble Learning Based on Collaborative Caching
In this section, we propose an edge ensemble learning scheme based on adaptive collaborative caching. In this scheme, CCBF is used to record cached-data information, realizing the exchange and collection of cache information among edge nodes. The scheme also schedules the caches according to the data distribution, and supports the sub-model learning on each edge node as well as the final ensemble learning.

Learning Strategy
To improve the performance of ensemble learning at the edge, we first study the ensemble learning process. In the decision boundary analysis of the simple soft-voting ensemble method, it is assumed for simplicity that all sub-models have the same error rate, and θ describes the correlation among the sub-models. The expectation error of ensemble learning is then determined by err_i(h_i), the expectation error rate of a single sub-model, and n, the ensemble size. Formula (2) shows that if the sub-models are independent of each other, i.e., θ = 0, the error of ensemble learning is reduced by a factor of n; if each sub-model is correlated with all the others, i.e., θ = 1, the performance of the ensemble is not effectively improved. This analysis clearly reveals the importance of diverse sub-models in ensemble learning, and the same conclusion applies to other ensemble approaches [4]. In edge ensemble learning scenarios, different edge nodes often deploy similar models, build sub-models by learning the data around them, and eventually form an ensemble model by combining the sub-models at a central node. In this case, it is necessary to provide different data to different edge nodes so that they produce different sub-models and thus a more accurate ensemble model.
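The expectation-error formula elided above can be reconstructed, under our reading of the soft-voting decision-boundary analysis of Tumer et al. [5], as:

```latex
% Hedged reconstruction of Formula (2); theta, n, and err_i as in the text.
\mathbb{E}\bigl[\operatorname{err}(H)\bigr]
  \;=\; \frac{1 + \theta\,(n-1)}{n}\,\operatorname{err}_i(h_i)
```

This form agrees with both limiting cases stated in the text: θ = 0 gives err_i/n (an n-fold reduction), while θ = 1 gives err_i, i.e., no improvement over a single sub-model.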

Leveraging Collaborative Caching to Facilitate Ensemble Learning Process
According to the analysis in Section 4.1, we leverage a collaborative caching scheme to support different sub-models learning of ensemble learning.In detail, the process can be divided into the following five phases.

Efficiently Recording the Cached Data on an Edge Computing Node
Data are collected from neighboring end devices, then cached and recorded compactly by CCBF (see Section 3).

Exchanging Compact Representation of Cached Data with Neighbors
The compact representation (CCBF) of the cached data is exchanged among neighbors within a range that adapts to the performance of the sub-model training in step 4. When a node receives a representation from an interface, the representation is stored as CCBF_l, where l is the id of the corresponding interface. In addition, the representation is combined into an aggregated representation CCBF_g, giving the node a global view of the data cached by its neighbors, which is then used to guide the node to cache diverse data received subsequently. Note that, although data from a larger range of neighbors can improve the training of the sub-models, collaboration over a large range causes larger communication overhead; our design therefore adapts the collaborative range to the practical sub-model training results.

Caching Different Data among Neighbors
When a node wants to cache some data, it first checks whether the data already exists at its neighbors by querying CCBF_g, which represents the global view of the data cached by the neighbors. If a record of the data is found in CCBF_g, the data already exists in another neighbor's cache and need not be stored locally. If the record does not exist in CCBF_g, the data is added to the local cache and a record is added to the corresponding compact representation. This ensures that different data are cached at different neighbors for training different sub-models in ensemble learning, while the communication overhead is reduced by collaborative caching.
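The caching decision above reduces to a simple membership test against the aggregated view. In the sketch below a Python set stands in for CCBF_g (a real node would call CCBF.query and CCBF.insert); the function name and structure are ours:

```python
# Minimal sketch of the caching decision, with a set standing in for CCBF_g.
def try_cache(item, local_cache, local_record, global_view):
    """Cache `item` only if no collaborating neighbor already holds it."""
    if item in global_view:       # CCBF_g hit: a neighbor caches it already
        return False
    local_cache.append(item)      # store locally ...
    local_record.add(item)        # ... record it in the local CCBF ...
    global_view.add(item)         # ... and in the aggregated view
    return True

cache, record, view = [], set(), {"img_001"}   # a neighbor already holds img_001
assert try_cache("img_001", cache, record, view) is False
assert try_cache("img_002", cache, record, view) is True
assert cache == ["img_002"]
```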

Sub-model Training
The data cached on a node is used to train the local sub-model. When the local data is insufficient for the sub-model to converge, we enlarge the collaborative range by requesting differentiated data from other edge nodes. We compare the cached-data records CCBF_l obtained from different neighbors in step 2 with the local cache record by merging the orBarr arrays of the different CCBF_l, obtain the compact representation ôrBarr of the required data, and send it to the corresponding edge node. When that node receives the request, it queries its local cache according to ôrBarr and returns the differentiated data to the requesting node. After receiving the data, the requesting node caches it, updates CCBF_l and CCBF_g, and feeds the data into the sub-model for training. These steps are repeated until the sub-model converges.
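The construction of the requested-data representation can be sketched at the bit level: merge the neighbors' orBarr bitmaps by OR and mask out the cells the local node has already recorded. This bit-level encoding is our simplification of the ôrBarr construction described above:

```python
# Sketch of deriving the request bitmap from neighbor orBarr arrays.
def request_bitmap(neighbor_orbarrs, local_orbarr):
    merged = [0] * len(local_orbarr)
    for ob in neighbor_orbarrs:            # merge neighbor records by OR
        merged = [x | y for x, y in zip(merged, ob)]
    # keep only cells not already recorded locally
    return [m & (1 - l) for m, l in zip(merged, local_orbarr)]

local  = [1, 1, 0, 0, 0, 0]
n1, n2 = [1, 0, 1, 0, 0, 0], [0, 1, 0, 1, 0, 0]
# only positions 2 and 3 represent data the local node still lacks
assert request_bitmap([n1, n2], local) == [0, 0, 1, 1, 0, 0]
```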

Enhanced Ensemble Process
The ensemble method obtains its result by attaching different weights to the output of each sub-model. The ensemble output is H(x) = Σ_{i=1}^{n} ω_i h_i(x), where ω_i denotes the weight of h_i, usually with constraints ω_i ≥ 0 and Σ_{i=1}^{n} ω_i = 1. The parameters of the sub-models are uploaded to the central node, which conducts the ensemble in an enhanced way. Specifically, for n sub-models (h_1, ..., h_n), the following method is adopted.
Suppose the output of each sub-model can be written as the true value plus an error term. The ensemble error can then be expressed in terms of the error correlations [26], where p(x) is the input distribution and ε_i is the error term of sub-model i. Minimizing this error under the weight constraints by means of a Lagrangian multiplier yields the optimal weights ω_i.
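The elided equations presumably follow the standard generalized-ensemble derivation cited as [26]; under that assumption, a hedged reconstruction is:

```latex
% Hedged reconstruction of the elided derivation; symbols as in the text.
h_i(x) = f(x) + \epsilon_i(x), \qquad
C_{ij} = \int p(x)\,\epsilon_i(x)\,\epsilon_j(x)\,dx
\\[4pt]
\operatorname{err}(H) = \sum_{i=1}^{n}\sum_{j=1}^{n}\omega_i\,\omega_j\,C_{ij},
\qquad
\min_{\omega}\ \operatorname{err}(H)\ \ \text{s.t.}\ \sum_{i=1}^{n}\omega_i = 1
\\[4pt]
\omega_i = \frac{\sum_{j=1}^{n}\bigl(C^{-1}\bigr)_{ij}}
                {\sum_{k=1}^{n}\sum_{j=1}^{n}\bigl(C^{-1}\bigr)_{kj}}
```

Here C is the error correlation matrix over the input distribution p(x); the closed form for ω_i follows directly from the Lagrangian condition that the gradient of the weighted error equals a constant across all i.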

PERFORMANCE EVALUATION
In this section, we conduct experimental simulations of two learning models on four different datasets to evaluate the performance of the proposed collaborative caching scheme for ensemble learning at edge.

Implementation
We evaluate the performance of our adaptive collaborative caching scheme on the NS-3 platform [27], a modular, programmable, extensible, open-source, community-supported simulation framework for computer networks. We connect the neural network library OpenNN (Open Neural Networks Library) [28] to NS-3 for the simulations.
OpenNN is an open-source library that facilitates the building of neural networks and has found a wide range of applications, including function regression, pattern recognition, time series prediction, and optimal control. As shown in Figure 3, we use a general topology for edge networks: a remote data center, a gateway node, 4 edge computing nodes, and 8 end devices, connected by Gigabit links. The cache size of the edge computing nodes is 2,000. Edge computing nodes can cache data, efficiently record the cached data, and perform computational tasks on it.
End devices generate the learning data and send it to the edge computing nodes. After receiving the data, the edge computing nodes cache and record it efficiently, train sub-models on the different data obtained through cooperative caching, and finally send the sub-model training results to the data center for ensemble learning.
The data center also generates background traffic and sends it to the edge computing nodes, which cache the data and forward the background traffic to the end devices. Two types of learning models are deployed on the edge nodes and the data center server. Correspondingly, four datasets (D1, D2, D3, and D4) are used to compare the proposed adaptive collaborative caching scheme (C-cache) with two baselines implemented as follows: • Centralized: during model training, the data center requests training data from the edge nodes and trains the ensemble learning models on the server. • P-cache: the caching mechanism of the collaborative edge-cloud computing architecture [23], in which edge computing nodes periodically request cached data for sub-model learning and the data center server performs the ensemble learning.
To evaluate the collaborative caching scheme, we use four datasets: two text datasets (D1, D2) to train the MLP model, and two image datasets, tiger faces (D3) and human faces (D4), to train the VGG model.
• Forest vegetation dataset (D1): 581,012 data items of forest vegetation, in which four soil types correspond to seven vegetation types. The numbers of data items per soil type are uneven: type 4 has fewer than 3,000 items, type 5 has almost 10,000, and every other type has more than 10,000. • Healthy Old People dataset [30] (D2): sequential motion data from 14 healthy older people aged 66 to 86, collected with sensors for the recognition of activities in clinical environments. Participants were allocated to two clinical room settings, S1 (Room 1) and S2 (Room 2), which differ in the number and placement of sensor receivers. The 75,128 data items are evenly categorized into 6 different behaviors.
• ATRW Reid tiger-face dataset [31] (D3): captured tiger face images, clipped and resized to 128 × 128. Each of 500 tigers has 10 photos. According to the areas of activity, the Russian Far East and northern India, the dataset is separated into two scenarios.
• CASIA face dataset [32] (D4): captured human face images, clipped and resized to 128 × 128. Each of 500 persons has 10 face pictures. According to the angle of the photograph, frontal or oblique at 45 degrees, the dataset is separated into two scenarios.
We implement the following two learning models, both trained with the Adam algorithm [33], which adaptively adjusts the learning rate: • Multilayer Perceptron (MLP) model. MLP is a feedforward artificial neural network that maps inputs to outputs through fully connected layers. We implement a six-layer MLP, including an input layer, four hidden layers, and an output layer.
• Visual Geometry Group Network (VGG) model. VGG is a deep convolutional neural network used in computer vision. Our implementation includes 5 convolutional blocks, each consisting of 2 to 4 convolutional layers.

Evaluation Metrics
To evaluate the performance of adaptive collaborative caching scheme for ensemble learning at edge, we use three metrics: hit ratio, latency and accuracy.

Hit Ratio of Collaborative Caching
The hit ratio is defined between any two adjacent levels of a memory hierarchy and measures the performance of a cache. If a data item requested by an edge computing node is found in the cache, it is called a hit. The hit ratio is the number of hits divided by the total number of requested data items; we distinguish the local learning hit ratio, the global learning hit ratio, and the global background hit ratio.
1. Local learning hit ratio (LLR_hit) is the ratio of the learning data items in the local cache to the overall locally cached data items: LLR_hit = N_l / N_c, where N_l is the number of data items usable for training a sub-model in the local cache, and N_c is the total number of locally cached data items.
2. Global learning hit ratio (GLR_hit) represents how many data items across the edge nodes can be used to train sub-models. It is the ratio of the learning data items in the global cache to the overall cached data items: GLR_hit = N_g / N_gc, where N_g is the number of data items usable for training sub-models in the global cache and N_gc is the total number of globally cached data items.
3. Background hit ratio (R_hit) is the ratio of the background traffic data items in the global cache to the overall cached data items: R_hit = N_b / N_gc, where N_b is the number of background traffic data items. Background traffic refers to the packet exchange between applications and the network that occurs periodically or intermittently without specific user interaction. Learning latency is the time it takes for a training model to converge.
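The three hit ratios can be written out explicitly; the numeric example below (30 of 100 locally cached items usable for training, giving 0.3) follows the paper's own illustration:

```python
# The three hit-ratio metrics as simple ratios of item counts.
def llr_hit(n_l, n_c):
    """Local learning hit ratio: usable local items / locally cached items."""
    return n_l / n_c

def glr_hit(n_g, n_gc):
    """Global learning hit ratio: usable global items / globally cached items."""
    return n_g / n_gc

def bg_hit(n_b, n_gc):
    """Background hit ratio: background items / globally cached items."""
    return n_b / n_gc

assert llr_hit(30, 100) == 0.3
assert abs(glr_hit(830, 1000) - 0.83) < 1e-9
```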

Learning Accuracy
The accuracy (Acc) of model training on a dataset is defined as Acc = (TP + TN) / (TP + TN + FP + FN), where True Positive (TP) is the number of positive items correctly classified, False Positive (FP) is the number of negative items incorrectly classified as positive, False Negative (FN) is the number of positive items incorrectly classified as negative, and True Negative (TN) is the number of negative items correctly classified.

Hit Ratio of Collaborative Caching
Since the Centralized scheme trains models in the data center without caching data on edge nodes, we only compare the cache hit ratios of P-cache and the proposed C-cache. The local learning hit ratio is depicted in Figures 4 and 5, and the global learning hit ratio in Figures 6 and 7. The local learning hit ratios of C-cache and P-cache increase to maximum stable values of 0.87 and 0.85, and the global learning hit ratios to 0.83 and 0.81, as the learning data are generated and cached on different edge nodes. Figures 8 and 9 depict the hit ratio of the background traffic data. This hit ratio first increases over time; as the learning data increase, more background traffic data are evicted from the caches of the edge computing nodes, so the background hit ratios of C-cache and P-cache decrease to 0.17 and 0.19. Across the different training models and datasets, the background hit ratio under C-cache declines faster than under P-cache, because C-cache uses the learning data better, leaving less cache space for background traffic data.

Transmission overhead and Learning Latency
To evaluate the communication and time overhead of the proposed scheme, we compare the two baselines, Centralized and P-cache, with our C-cache in terms of transmission overhead and learning latency. The transmission overhead is depicted in Figure 10. Regardless of which models and datasets are considered, C-cache always has the least transmission overhead. A more powerful model like VGG consumes more communication resources, and the transmission overhead of the Centralized scheme is twice that of C-cache: since all learning data must be sent to the data center, the Centralized scheme incurs the largest transmission overhead. In addition, rational data requests and collaborative caching benefit C-cache; valuable data are cached on edge nodes, which reduces redundant data transmission among the edge nodes.

CONCLUSION
In this paper, we propose an adaptive in-network collaborative caching scheme to support efficient ensemble learning at the edge. In this scheme, edge nodes collaborate to cache data items with features as different as possible and train sub-models with large differences, thereby effectively improving the performance of ensemble learning. Extensive simulations demonstrate that the proposed collaborative caching scheme can significantly reduce learning latency and transmission overhead for ensemble learning.

Figure 1 :
Figure 1: Structure of Combinable Counting Bloom Filter

Figure 6 :
Figure 6: Global learning hit ratio of MLP

Figure 7 :
Figure 7: Global learning hit ratio of VGG

Figure 8: Global background hit ratio of MLP
Figure 9: Global background hit ratio of VGG
Figure 10: Transmission overhead (MB) of Centralized, P-cache, and C-cache

Figure 11: Learning latency

The delete operation: 1. Confirm whether the item exists in this CCBF by performing a query operation on it (see Algorithm 2). 2. Locate the bit arrays used in the last inserting operation according to the random matrix matrix[g][m]. 3. Clear the corresponding cells in these bit arrays, and update orBarr.
1. Local learning hit ratio (LLR_hit) represents how many locally cached data items can be used to train a sub-model. For example, if 30 of 100 cached data items can be used for training, then LLR_hit = 30/100 = 0.3. LLR_hit is the ratio of the learning data items in the local cache to the overall cached data items.

Table 1 :
Learning accuracy comparison