Federated Learning: A Distributed Shared Machine Learning Method

Federated learning (FL) is a distributed machine learning (ML) framework. In FL, multiple clients collaborate to solve traditional distributed ML problems under the coordination of a central server without sharing their local private data. This paper surveys FL methods based on machine learning and deep learning. First, it introduces the development, definition, architecture, and classification of FL and explains the concept by comparing FL with traditional distributed learning. Then, it describes typical problems of FL that remain to be solved. Building on classical FL algorithms, it briefly introduces several federated machine learning algorithms, with emphasis on deep learning, and classifies and compares them. Finally, it discusses possible future developments of FL based on deep learning.


Introduction
In the era of big data, people pay increasing attention to data security and user privacy, and data protection has become a focus for enterprises and individuals. In addition, data leakage has attracted the attention of governments and public media in recent years. The world's major powers and unions have strengthened legal supervision of citizens' data security and privacy; the General Data Protection Regulation (GDPR) issued by the European Union [1] came into effect on May 25, 2018. China's Cyber Security Law, which came into force in 2017, requires Internet companies not to disclose or tamper with the personal information they collect from users and, when conducting data transactions with third parties, to ensure that both the Internet company and the third party comply with user data protection obligations [2][3][4]. As data privacy protection becomes stricter across countries, large-scale transfers of private user data between companies will no longer be allowed. These laws and regulations protect the privacy of users on the one hand and, on the other hand, prohibit big data from being mined arbitrarily, which restricts the development of artificial intelligence. Big data is the basis of large-scale distributed ML. Under the restrictions of these laws and regulations, data often exist as isolated islands among different enterprises, and even among different subsidiaries of one group. The term "federated learning" was put forward by McMahan et al. [5] in 2016: "We term our approach Federated Learning, since the learning task is solved by a loose federation of participating devices (which we refer to as clients) which are coordinated by a central server." FL was originally defined as a distributed ML method that uses the data of multiple users to train a central model [6].
The purpose of FL is to carry out efficient distributed ML among multiple participants or computing nodes while ensuring the information security of big data exchange, protecting mobile data and personal privacy, and ensuring legal compliance. FL uses the framework and techniques of classical distributed ML, but the role of the central server differs from that in distributed ML. Researchers can thus mine and utilize data without violating laws and regulations. In a broad sense, FL refers to a method in which the data owner can train models without uploading local data [2]. FL builds its model from the local models uploaded by each participant and then returns the jointly trained model to each participant, obtaining results similar to those of traditional ML without violating laws; this gives FL the advantage of confidentiality.
However, the classical FL algorithm has shortcomings in dealing with non-independent and identically distributed (non-IID) data, communication, and model building, and the proposed solutions are too numerous to enumerate. Therefore, after consulting the relevant literature, this paper introduces classical FL and FL algorithms that improve on it in some respects. Moreover, in the era of big data, FL based on deep learning is more effective. This paper therefore focuses on recent developments in federated deep learning algorithms, which are sorted out and summarized. We hope this article helps readers quickly review the whole FL field, especially the federated deep learning subfield. The rest of this paper is organized as follows: Section 2 introduces the basic knowledge of FL; Section 3 introduces some unsolved problems of FL; Section 4 introduces FL algorithms based on ML; Section 5 introduces FL algorithms based on pan-deep learning; Section 6 introduces attacks on FL; Section 7 describes future challenges of FL; and finally, Section 8 summarizes this paper.

Machine Learning.
With the rapid development of ML, its models are becoming more and more complex and effective [7,8]. The core idea of ML is that the computer learns the mapping between input and output from existing data samples: $f_w: x \to y$, where x is the input, y is the output, f is the corresponding rule, and w is the parameter to be learned. Using this mapping, the model predicts the output value for the next input. The purpose of ML is to make the gap between the predicted value and the real value as small as possible. Mathematically, with $L(x, y, w)$ denoting the loss that measures this gap, this is expressed as
$$\arg\min_w L(x, y, w). \quad (1)$$
In traditional ML, such as backpropagation neural networks (BPNNs) and convolutional neural networks (CNNs), the learning of this parameter is concentrated on one computer, and the commonly used methods are gradient descent and a series of improved algorithms. The core algorithm of FL is very similar to the Stochastic Gradient Descent (SGD) method [7]. In SGD, a sample is randomly selected from all samples to participate in the operation at each iteration.
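The SGD procedure described above can be illustrated with a minimal sketch. It assumes a one-dimensional linear model $f_w(x) = w_0 + w_1 x$ with squared-error loss and noiseless toy data; the function name `sgd_linear`, the learning rate, and the data are illustrative choices, not taken from the cited works.

```python
import random

def sgd_linear(samples, lr=0.05, steps=2000, seed=0):
    """Fit f_w(x) = w0 + w1 * x by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        x, y = rng.choice(samples)   # one randomly selected sample per iteration
        err = (w0 + w1 * x) - y      # predicted value minus real value
        w0 -= lr * err               # gradient of 0.5 * err**2 with respect to w0
        w1 -= lr * err * x           # gradient of 0.5 * err**2 with respect to w1
    return w0, w1

# Toy data (an assumption for illustration): y = 2x + 1 on x in [-1, 1].
samples = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
w0, w1 = sgd_linear(samples)
```

Because the toy data are noiseless, the learned parameters approach the generating values (intercept 1, slope 2); with noisy data SGD would instead fluctuate around the minimizer.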

Distributed Machine Learning.
Distributed machine learning combines multiple computers for computing. Its core goal is to split computing tasks into multiple small tasks and run them on multiple local processors. Its final training requires a central server to process the data uploaded by local clients; as a result, communication efficiency and privacy are difficult to guarantee. Algorithm 1 shows the distributed machine learning algorithm.

Federated Learning.
FL differs from distributed ML: in FL, the information uploaded by each participant to the server is no longer the original data but a trained submodel. FL also allows asynchronous transmission [9], so the communication requirements can be appropriately reduced. On this basis, the formula of federated machine learning can be updated as follows:
$$\arg\min_w L(x, y, w), \qquad L(x, y, w) = \sum_k p_k L_k(x, y, w), \quad (2)$$
where k indexes the clients and $p_k$ is the weight of the kth client. The scenario for FL is a set of decentralized users $F_1, F_2, \ldots, F_k$, each holding its own data set $D_1, D_2, \ldots, D_k$. In conventional deep learning, these data are pooled into a single data set to train an aggregate model $M_{SUM}$; FL aims for a model whose performance differs from that of $M_{SUM}$ by less than δ, where δ is a nonnegative number. However, in actual situations, the aggregation model $M_{SUM}$ cannot be obtained in the end, because the basic requirement of FL is privacy protection. According to Professor Yang's book "Federated Learning" [10], the federated averaging algorithm of FL can be expressed as in Algorithm 2. There are three types of federated transfer learning: instance-based, feature-based, and model-based. It is mainly used in retail e-commerce, financial investment, and medical research [10]. Figure 1 shows the three categories of federated learning. The main differences and problems of horizontal federated learning, vertical federated learning, and federated transfer learning are shown in Table 1.

Algorithm 1: Distributed machine learning.
Server side
(1) Input: data samples X, Y; initial model parameters $w_0$; step size $\mu$;
(2) Divide X and Y record-wise into collections $X_1, X_2, \ldots, X_m$ and $Y_1, Y_2, \ldots, Y_m$, where m is the number of clients;
(3) Send $X_k$, $Y_k$ to client k;
(4) For each iteration t (t ≥ 1): send $w_{t-1}$ to the clients;
(5) Receive the gradient update $g_t^k$ from each client k; execute $w_t \leftarrow w_{t-1} - \mu \sum_{k=1}^{m} g_t^k$;
(6) Check whether the termination condition is met: if so, terminate; otherwise, set $t \leftarrow t + 1$ and continue.
Client side k
(1) Input: $X_k$, $Y_k$, $w_{t-1}$;
(2) Randomly select batches of records from $X_k$, $Y_k$ as training data;

Algorithm 2: Federated averaging.
(4) The coordinator determines $C_t$, that is, the set of $\max(K\rho, 1)$ randomly selected participants;
(5) For each participant $k \in C_t$ do in parallel;
(6) Update the model parameters locally: $w_{t+1}^{(k)} \leftarrow$ participant update $(k, w_t)$ (see line 13);
(7) Send the updated model parameters $w_{t+1}^{(k)}$ to the coordinator;
(8) end for
(9) The coordinator aggregates the received model parameters, that is, takes a weighted average of the received model parameters;
(10) The coordinator checks whether the model parameters have converged. If so, the coordinator signals all participants to stop model training;
(11) The coordinator broadcasts the aggregated model parameters $w_{t+1}$ to all participants;
(12) end for
(13) Participant update $(k, w_t)$ (executed in parallel by participants k, $\forall k = 1, 2, \ldots, K$):
(14) Get the latest model parameters from the server, that is, set $w_{1,1}^{(k)} = w_t$;
(15) For each local iteration i from 1 to the number of iterations S do;
(16) Batches ← randomly divide the data set $D_k$ into batches of size M;
(17) Obtain the local model parameters from the previous iteration, that is, set $w_{1,i}^{(k)} = w_{B,i-1}^{(k)}$;
(18) For each batch b from 1 to the number of batches $B = n_k / M$ do;
(19) Calculate the batch gradient $g_k^{(b)}$;
(20) Update the model parameters locally.
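The coordinator and participant steps of federated averaging can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the reference implementation from [10]: the model is a toy linear model $f_w(x) = w_0 + w_1 x$, the aggregation weights are taken to be $n_k / n$ (the usual FedAvg choice, assumed here), the convergence check is replaced by a fixed number of rounds, and all names (`participant_update`, `fed_avg`, `make_client`) are hypothetical.

```python
import random

def participant_update(w, data, lr=0.1, local_epochs=5, batch_size=4, seed=0):
    """Local mini-batch SGD on one client for the toy model f_w(x) = w[0] + w[1] * x."""
    rng = random.Random(seed)
    w = list(w)                                    # start from the global parameters
    for _ in range(local_epochs):
        shuffled = list(data)
        rng.shuffle(shuffled)                      # random split of D_k into batches
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            errors = [(w[0] + w[1] * x) - y for x, y in batch]
            g0 = sum(errors) / len(batch)          # batch gradient w.r.t. w[0]
            g1 = sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            w[0] -= lr * g0
            w[1] -= lr * g1
    return w

def fed_avg(client_data, rounds=100, fraction=1.0, seed=0):
    """Coordinator loop: select participants, collect local models, average them."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    num_clients = len(client_data)
    for t in range(rounds):
        m = max(int(num_clients * fraction), 1)    # number of participants this round
        chosen = rng.sample(range(num_clients), m)
        results = [
            (len(client_data[k]), participant_update(w, client_data[k], seed=1000 * t + k))
            for k in chosen
        ]
        n = sum(n_k for n_k, _ in results)
        # Weighted average with weights n_k / n (an assumption, see the lead-in).
        w = [sum(n_k * w_k[j] for n_k, w_k in results) / n for j in range(len(w))]
    return w

def make_client(xs):
    """Toy local data set: noiseless samples of y = 3x - 1 (illustrative only)."""
    return [(x, 3.0 * x - 1.0) for x in xs]

# Three clients whose inputs cover different ranges, so the local data differ.
client_data = [
    make_client([i / 10 for i in range(0, 10)]),
    make_client([i / 10 for i in range(10, 30)]),
    make_client([i / 10 for i in range(-20, -5)]),
]
w_global = fed_avg(client_data)
```

Only model parameters travel between `participant_update` and `fed_avg`; the raw samples in `client_data[k]` are read exclusively inside the local update, which is the property that distinguishes this loop from Algorithm 1.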

Unsolved Problems
The definition and classification of FL are described above. This section focuses on five unsolved problems.

The Problem of Non-independent and Identically Distributed Data Samples.
In distributed ML, local data samples are often independent and identically distributed (IID). Although FL is a kind of distributed ML, most of its data are non-IID. Moreover, unlike batch training in traditional distributed ML, the training data obtained by FL differ from round to round. Some scholars have tried local data sharing or model migration to address this, for example in federated semisupervised learning and unsupervised learning, which we discuss in Section 4.
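To make the non-IID setting concrete, the following sketch simulates label skew by sorting samples by label and dealing out contiguous shards, so that each client sees only a few classes. This partitioning scheme is a common way to simulate non-IID clients in experiments and is used here as an assumption for illustration; the function name `partition_by_label` and the toy labels are hypothetical.

```python
import random

def partition_by_label(labels, num_clients, shards_per_client=2, seed=0):
    """Assign sample indices to clients so that each client sees only a few labels."""
    rng = random.Random(seed)
    order = sorted(range(len(labels)), key=lambda i: labels[i])   # group indices by label
    num_shards = num_clients * shards_per_client
    shard_size = len(order) // num_shards
    shards = [order[s * shard_size:(s + 1) * shard_size] for s in range(num_shards)]
    rng.shuffle(shards)                                           # deal shards out at random
    return [
        [i for shard in shards[c * shards_per_client:(c + 1) * shards_per_client] for i in shard]
        for c in range(num_clients)
    ]

# Toy labels (illustrative): 10 classes with 10 samples each.
labels = [i % 10 for i in range(100)]
clients = partition_by_label(labels, num_clients=5)
labels_per_client = [sorted({labels[i] for i in idx}) for idx in clients]
```

With these toy labels every client ends up holding samples from at most two classes, whereas an IID split would give each client roughly all ten.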

The Problem That Different Participants Have Different Amounts of Data.
The amount of data owned by each client differs; it is determined by the participants themselves and cannot be controlled.
A similar problem exists in Business-to-Business (B2B) settings, where some large companies hold a large share of the data resources. How to get such large companies to participate in joint modeling is the first question.
The key is to establish a reasonable incentive mechanism that shares the generated profits fairly and equitably among participants. Federated blockchain technology can address the incentive problem. Paper [13] describes the combination of FL and blockchain technology and how to reward participants.

Robustness of Participants.
In FL, many participants are mobile devices, and different participants have different network conditions for data communication. When they participate in joint modeling, measures are needed to ensure the robustness of the model. In addition, some fake participants, which we call fake local clients, may attack the global model established by FL from within. Some scholars have proposed intrusion detection methods based on federated convolutional neural networks. The deep learning part (Section 5) describes how deep learning is used to improve the robustness of the model.

Communication and Computing Problems.
In FL, large-scale data are trained locally, and most real applications communicate over wireless links, so the exchange process requires stable communication conditions. However, the task model and data distribution change frequently over time, and the structure of the federated network, target data characteristics, feature extractors, business tags, etc. also change, which leads to communication and computing problems. In Section 5, we introduce some deep learning algorithms that mitigate this problem.

Privacy and Security of Federated Learning.
Although the purpose of FL is to protect the privacy and security of users, the privacy of users cannot be 100% guaranteed during joint training, even though the raw information of local users is never requested. When constructing a joint model, participating devices need to upload model parameters or gradient values; these parameters come from local models, and the partially trained local models carry information about the local data. Many attack models exist: part or all of the original data can be deduced from model parameters or gradients [14,15], and attackers may compromise local devices or disguise themselves as local training participants. Therefore, many protection methods have been proposed; common ones are Secure Multiparty Computation (SMC) [16], Homomorphic Encryption (HE) [17], Data Disturbance (DD), and Differential Privacy (DP) [18].
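As a rough illustration of how such protection can hide individual uploads while still allowing aggregation, the sketch below uses pairwise additive masks that cancel in the sum, a simplified idea in the spirit of secure multiparty computation. It is not the protocol of any cited reference: updates are assumed to be already encoded as non-negative integers, the modulus is a hypothetical choice, and a seeded `random.Random` stands in for the pairwise shared secrets that a real system would derive with cryptographic key agreement.

```python
import random

MODULUS = 2 ** 32   # hypothetical ring size for integer-encoded updates

def mask_updates(updates, seed=0):
    """Return masked copies of integer update vectors; the masks cancel in the sum."""
    rng = random.Random(seed)   # stand-in for pairwise shared secrets (not secure)
    n, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            for d in range(dim):
                r = rng.randrange(MODULUS)
                masked[i][d] = (masked[i][d] + r) % MODULUS   # client i adds the mask
                masked[j][d] = (masked[j][d] - r) % MODULUS   # client j subtracts it
    return masked

def aggregate(masked):
    """Server-side sum of masked vectors; equals the sum of the raw updates."""
    dim = len(masked[0])
    return [sum(m[d] for m in masked) % MODULUS for d in range(dim)]

updates = [[3, 10, 7], [5, 1, 2], [4, 4, 4]]   # toy integer updates from three clients
masked = mask_updates(updates)
total = aggregate(masked)
```

The server only ever sees the rows of `masked`, which look random individually, yet `total` equals the element-wise sum of the original updates. Dropped clients, collusion, and real key exchange are outside the scope of this sketch.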
In view of the above problems, researchers have put forward various solutions. In this paper, these methods are divided into two categories: those based on ML and those based on pan-deep learning.

Federated Learning Algorithm Based on Machine Learning
The most classical algorithm, Federated Averaging (FedAvg), was proposed by McMahan et al. [5], who showed that FedAvg achieves the expected results on benchmark image classification data sets (such as MNIST [19] and CIFAR-10 [20]). Since then, many FL algorithms have been proposed. Here are several common FL algorithms based on ML, classified into federated supervised learning, federated semisupervised learning, and federated unsupervised learning. Figure 2 shows the classification of federated ML [21].

Federated Supervised Learning.
Supervised learning is a classic ML method that infers a function from labeled training data. The training data consist of a set of training examples, each made up of an input object and the desired output value.
In the objective function, $L(w; x_n; y_n)$ is the loss function, w is the parameter of the model, $x_n$ is the feature vector, $y_n$ is the label, and $n \in \{1, \ldots, N\}$ indexes the N data points. In this optimized federated algorithm, homomorphic encryption is added to encrypt the data and gradients of both sides. The training process can be described as follows. The unlabeled data holder α computes $d_\alpha = w_\alpha^\tau x$, where $w_\alpha^\tau$ represents the model parameters of the unlabeled data holder in round τ.
α then sends its homomorphically encrypted intermediate values, including $[\Delta d_\alpha]$, to the labeled data holder β, and β calculates the gradient and loss and sends them back after homomorphic encryption. After receiving the encrypted gradients from α and β, the central server assists α and β in updating their models.

Federated Support Vector Machine.
A federated support vector machine was proposed by Hartmann et al. [23] in 2019. The method optimizes and protects the parameters by updating blocks of local modules, feature hashing, and other means. In its objective function, N is the training data, w is the parameters of the model, $L(w, x_i, y_i)$ is the loss at the point $(x_i, y_i)$, $\lambda R(w)$ is the regularization term of the loss function, and λ is the hyperparameter that controls the penalty. In the support vector machine of traditional ML, the loss is the hinge loss and the regularizer is the squared norm of w. The federated Support Vector Machine (SVM) applies dimensionality-reducing hashing to the feature values to hide the actual feature values. It updates the parameters of the model through gradient updates at the central server, which better protects the privacy of the model parameters. In practical application cases, the federated support vector machine does not add computation, so it performs well in practice.
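The feature hashing step mentioned above can be sketched as follows. This is a generic signed hashing trick written with Python's standard `hashlib`, given as an assumption about what such a step might look like rather than as the procedure of [23]; the function name `hash_features`, the bucket count, and the example feature names are hypothetical.

```python
import hashlib

def hash_features(features, num_buckets=8):
    """Map a dict of named feature values into a fixed-length vector (hashing trick)."""
    vec = [0.0] * num_buckets
    for name, value in features.items():
        digest = hashlib.sha256(name.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % num_buckets   # bucket for this feature
        sign = 1.0 if digest[4] % 2 == 0 else -1.0                # signed hashing reduces collision bias
        vec[index] += sign * value
    return vec

# Hypothetical client-side features; only the hashed vector would be used for training.
x = hash_features({"age": 0.3, "income": 1.2, "clicks": 5.0})
```

The output has a fixed length regardless of how many named features a client holds, and the mapping from feature name to bucket is not stored in the vector itself. On its own this obscures feature identities only weakly, so it would normally be combined with other protection methods.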

Federated Decision Tree Algorithm.
Liu et al. [24] proposed a decision tree-oriented vertical federated learning method, a random forest implementation based on a centralized FL framework, named Federated Decision Tree (FDT). Its local participants upload the performance ranking of their model parameters rather than the model parameters themselves, which the original FL uploads constantly. Thus, it greatly reduces the communication frequency and the storage and computing resources consumed by encryption. In the joint modeling, the structure of the whole random forest is scattered and stored: the central server holds the complete structural information, and each participating node holds only its own information [25]. When the federated decision tree model is used, the node information of the local tree is obtained first, and then the node information held by the other parties is called jointly through the central server. Among federated decision tree models, the SecureBoost model [26] is a decentralized vertical FL framework based on gradient boosting decision trees. In the objective function of the common gradient boosting decision tree algorithm, $L_t$ is the minimum loss value of the objective function, t denotes the tth iteration of the regression tree, $j(y_n, y_i)$ is the loss on the leaf node of each tree function, and F(x) is the sum of the first derivative and the second derivative of the prediction residual. To prevent overfitting, a regularization term is usually added to the loss function, where γ and λ are hyperparameters that control the characteristics and the number of trees, w is the weight value, and L is the original loss function. In the original distributed ML, joint modeling is realized by sending F(x) to participants, but F(x) can be used to calculate the data labels backward, resulting in data leakage, which does not meet the basic requirements of FL in principle [27].
The federated tree model is based on the SecureBoost [26] encryption algorithm. The samples that need joint training are used to train the model, and the first sample and the second sample are trained to obtain the prediction model of the decision tree. Because this prediction model is based on the sample labels, it ensures that the data cannot be deduced by reverse calculation.
Li et al. [28] proposed a decentralized horizontal FL framework for multiparty Gradient Boosting Decision Tree (GBDT) modeling, a learning model based on the degree of similarity between data. The protection offered by its hash-table encryption is weaker than that of differential privacy and federated blockchain [29], but it compensates with better communication efficiency during upstream and downstream transmission. This is a new research direction for algorithms under the federated tree model. If data disturbance is added, its confidentiality can be comparable to that of differential privacy protection and federated blockchain technology.

Federated Semisupervised Learning.
Semisupervised learning is a key topic in the field of ML. It uses as much unlabeled data as possible to complete the task [30]. Combining FL with semisupervised learning has two benefits: FL helps ensure that sufficient training data are available, and semisupervised learning alleviates the high cost of labeling data scattered across clients.
Jeong et al. [31] proposed a federated semisupervised learning framework organized by the number of data labels. Its generative model mainly reconstructs the data from a probabilistic perspective, for example through $p(x, y) = p(x \mid y)\,p(y)$, so it can be estimated by a mixture model. Recently, variational autoencoders (VAE) [32] and generative adversarial networks (GAN) [33] have provided more complex generative models for semisupervised learning, which further improve its efficiency.
According to different split positions of sample identification and feature space, federated semisupervised learning can be divided into two categories: horizontal federated semisupervised learning and vertical federated semisupervised learning [31].
In horizontal federated semisupervised learning, the participating parties $\{1, 2, \ldots, N\}$ have the same feature space χ, shared by all parties involved, but different sample ID spaces; that is, $I_j \neq I_k$ for $j \neq k$. Each participant i has its own data. In vertical federated semisupervised learning, all parties share the same sample ID space, but each party has a different feature space. Yang et al. [34] proposed a logistic regression method for decentralized vertical FL in which the labeled-data side replaces the central server. In decentralized vertical FL, data are divided into labeled and unlabeled data, with the labeled data dominant. Assume that the unlabeled data holder α and the labeled data holder β have agreed to cooperate in modeling. α first sends the modeling key to β; α and β initialize the parameters $w_1$ and $w_2$, respectively, and calculate $w_i x_i$, where $i \in \{1, 2\}$. After β finishes its calculation, it sends the result to α. α averages both results and then applies the logistic regression equation to obtain the final output. Finally, both the labeled and unlabeled parties are updated by gradient. Table 2 lists articles on the three types of federated machine learning algorithms.

Federated Unsupervised Learning.
Unsupervised learning is an ML method mainly used to discover latent patterns in data. Its input data have no labels: only the input variable (X) is provided, with no corresponding output variable (Y).
In unsupervised learning, the algorithm needs to find the pattern structure in the data by itself [35]. The data on each participating client of FL are generally collected in a non-IID way, so there is a domain shift between clients. This domain shift makes it difficult to extend the model and its training to new devices. The goal is therefore to transfer knowledge, within the FL framework and without user supervision, from decentralized nodes to new nodes with different data domains. Peng et al. [36] defined an Unsupervised Federated Domain Adaptation (UFDA) method, which aligns the representations learned on the different nodes with the data distribution of the target node. In the domain adaptation system of FL, models on different nodes have different convergence rates. In addition, the domain shift between each source domain and the target domain differs; as a result, some nodes may contribute nothing to the target domain or even contribute negatively [36].

Federated Learning Algorithm Based on Pan-Deep Learning
Federated learning combined with deep learning is one of the mainstream directions of federated learning. This section focuses on this area; Figure 3 shows a classification of federated pan-deep learning.

Federated Neural Network.
McMahan et al. [37] proposed a federated neural network model and tested neural networks on the MNIST data set. Five groups of experiments are introduced in [37]; this section covers only the neural network (NN) part. The model has a four-layer network structure, with one input layer, two hidden layers, and an output layer; each hidden layer has 200 neurons. The MNIST data set is split among the clients so that their data do not intersect. Federated training is then carried out in two groups of experiments: Experiment 1 uses the same random seed to initialize the local model parameters allocated to the two clients, and Experiment 2 uses different random seeds. The local model parameters of the two clients are weighted and combined proportionally to obtain the final federated neural network model, namely,
$$w_{FL} = \varepsilon w + (1 - \varepsilon) w',$$
where $w_{FL}$ is the federated model parameter, w and w′ are the model parameters at the different nodes, and ε is the weight, which varies between 0 and 1. The experiment shows that, when using FL, the federated model with the same random initialization seed performs best, and the optimal loss is achieved when the ratio of model parameters is 1 : 1.
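The proportional combination of two local parameter sets can be written as a short helper. This is a minimal sketch that treats each model as a flat list of parameters and assumes the convex-combination form, with weight ε on the first model and 1 − ε on the second; the function name `interpolate` and the toy values are illustrative.

```python
def interpolate(w, w_prime, eps):
    """Combine two parameter vectors as eps * w + (1 - eps) * w_prime."""
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    if len(w) != len(w_prime):
        raise ValueError("both models must have the same number of parameters")
    return [eps * a + (1.0 - eps) * b for a, b in zip(w, w_prime)]

# Toy parameter vectors (illustrative); eps = 0.5 corresponds to the 1 : 1 ratio.
w = [1.0, 2.0]
w_prime = [3.0, 6.0]
w_fl = interpolate(w, w_prime, 0.5)
```

Sweeping `eps` over a grid between 0 and 1 and evaluating the loss of each combined model is one way to reproduce the kind of comparison described above, where the 1 : 1 ratio gave the lowest loss for models sharing an initialization.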

Federated Convolutional Neural Network.
Zhu et al. [38] proposed a federated CNN that uses a simple CNN for text recognition in unclassified scenarios; the whole model is built on TensorFlow and PySyft to test the impact of the FL infrastructure and local clients [39]. The model built in [38] is a simple CNN with four convolution layers and two fully connected layers, using the ReLU activation function, plus four output layers defined by the authors. The structure of the convolutional neural network is shown in Figure 4. The CNN classifier is used for dictionary-free text recognition on a Chinese character corpus, and the CNN parameters are optimized to minimize the aggregate negative log-likelihood of the character sequence, where N is the training data set, M is the total number of classes, and $p^{(k)}_{ij}$ and $\hat{p}^{(k)}_{ij}$ are the probabilities that the kth character of sample i is labeled j. In their experiments, the authors compared two prevalent federated learning frameworks, namely, TensorFlow Federated and PySyft. The results show that federated text recognition models can achieve similar or even higher accuracy than models trained in a conventional deep learning framework.
Rong et al. [40] proposed an intrusion detection method based on a federated CNN. The paper jointly trains the model on the data of multiple participants, expanding the number of local participants. Based on the original FL framework, an intrusion detection model based on deep learning is designed. First, the data dimension is reconstructed by data padding to form a two-dimensional array. Then, Diffusion-Convolutional Neural Networks (DCNN) are used to extract and learn the feature parameters under the mechanism of FL. Finally, it is combined with a Softmax classifier to train the detection model. This method greatly reduces the training time while maintaining a high detection rate. In addition, compared with general intrusion detection models, the improved model also ensures data security and privacy [40]. Federated convolutional neural networks are generally implemented with a simple CNN model. References [38][39][40] use a CNN model with four convolution layers and two fully connected layers. This model is suitable for horizontal FL. Using the sample ID as the basis, the data set is randomly assigned to different clients to form different subsets that simulate distributed data. During training, each client first carries out gradient calculation and parameter update on its local data set. Three experiments are reported in [40]. In experiment 1, the effectiveness of transforming one-dimensional data into two-dimensional data for the intrusion detection network is verified. This method not only improves the accuracy of the model but also reduces its operation cost. In experiment 2, the depth of the DCNN model is determined. The experimental results show that the two models differ little in training and testing time, but the accuracy of the model with two hidden layers is higher by 1% on average. With three hidden layers, the performance is not significantly improved, so simply increasing the number of hidden layers has little effect on performance. In experiment 3, the intrusion detection model is constructed by the FC algorithm, and a multi-classification experiment is carried out on the NSL-KDD standard data set.
Test-set accuracy shows no obvious change; recall rate and false alarm rate are improved, and the improvement in training time is obvious. Because the FC model only needs to transmit a small number of parameters during training, it has certain advantages over centralized training models in terms of data security. Generally speaking, the federated CNN can not only improve security in deep learning but also improve the computing power of the model by using GPUs.
To reduce the bandwidth used by model parameter transfer between clients and the server, the CNN is generally compressed. Sattler et al. [41] proposed a new framework, Sparse Ternary Compression (STC), which is specially designed for FL. The training process of FL includes downloading the model, training the model locally, and uploading the trained model to the server for aggregation. The number of bits transmitted is
$$b^{up/down} \in O\left(N_{iter} \times \varpi \times |\vartheta| \times \left(H(\Delta w^{up/down}) + \eta\right)\right),$$
where $N_{iter}$ is the total number of training iterations performed by each client, ϖ is the communication frequency, |ϑ| is the size of the model, $H(\Delta w^{up/down})$ is the entropy of the weight update exchanged during upload and download, and η is the inefficiency of coding, that is, the difference between the real update size and the minimum update size (given by the entropy). STC extends existing top-k gradient sparsification through a new mechanism to achieve downstream compression, ternarization, and optimal Golomb coding of weight updates. Existing compression algorithms assume that local data are independent and identically distributed (IID), whereas most training data in FL are non-IID. For IID data, the local gradient is an unbiased estimate of the global gradient; that is, the expected gradient under each client's data distribution equals the gradient of the empirical risk (14), where $p_i$ is the data distribution of client i and R(w) is the empirical risk function over the whole data. This assumption is difficult to satisfy in FL, where we can only expect the mean of the local gradients over all clients to be unbiased, while the gradient of a single client is biased towards its local data set. Experiments show that if each edge device sees a unique data distribution, the quality of model training declines. For neural networks trained with highly skewed non-IID data, the accuracy of FL is reduced by up to about 55%.
It is further shown that the accuracy reduction can be explained by weight divergence, which can be quantified by the earth mover's distance (EMD) between the class distribution on each device and the overall class distribution. To address this, a strategy is proposed that improves training on non-IID data by creating a small subset of data that is globally shared among all edge devices. Experiments show that accuracy on the CIFAR-10 data set can be improved by 30% with only 5% globally shared data.
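The sparsification and ternarization steps of STC described earlier in this section can be sketched as follows. This is a simplified illustration that operates on a flat list of update values; it omits the residual accumulation and Golomb coding that the full method uses, and the function name `sparse_ternarize` and the toy update are hypothetical.

```python
def sparse_ternarize(update, k):
    """Keep the k largest-magnitude entries and replace each by +mu or -mu,
    where mu is the mean magnitude of the kept entries; all others become 0."""
    if not 0 < k <= len(update):
        raise ValueError("k must be between 1 and len(update)")
    # Indices of the k entries with the largest absolute value (top-k sparsification).
    top = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)[:k]
    mu = sum(abs(update[i]) for i in top) / k
    out = [0.0] * len(update)
    for i in top:
        sign = (update[i] > 0) - (update[i] < 0)   # -1, 0, or +1
        out[i] = sign * mu                         # ternary values: -mu, 0, +mu
    return out

# Toy weight update (illustrative): only the two largest entries survive.
delta_w = [0.5, -0.1, 0.02, -0.9, 0.3]
compressed = sparse_ternarize(delta_w, k=2)
```

After this step the update contains only three distinct values, so it can be transmitted as the positions of the nonzero entries, one sign bit per nonzero entry, and the single magnitude `mu`, which is where the bandwidth saving comes from.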

Federated Bayesian Network.
Yurochkin et al. [42] proposed applying Bayesian networks to FL. Under the assumption that both local data and local models are available, a probabilistic FL framework is developed and studied, with special emphasis on training and aggregating neural network models. When the data are available, the method first trains a local model for each data source in parallel.
Then, the estimated local model parameters (in the case of a neural network, a set of weight vectors) are matched between data sources to build a global network [43,44]. Parameter matching is governed by the posterior of the Beta-Bernoulli Process (BBP), a Bayesian Nonparametric (BNP) model that allows local parameters either to match existing global parameters or, if no existing global parameter matches, to create new global parameters [42]. The federated Bayesian structure provides several advantages over existing methods [40]. First, it decouples the learning of local models from their fusion into a global federated model. This decoupling allows the method to remain agnostic to the local learning algorithms, which can be adjusted as needed; each data source may even use a different learning algorithm. Second, given only pretrained models, the BBP-informed matching process can combine them into a joint global model without additional data or knowledge of the learning algorithms used to generate the pretrained models. Last but not least, the federated Bayesian method can effectively learn a compressed federated network from the pretrained local networks and, under a moderate communication budget, can outperform state-of-the-art FL algorithms for neural networks. To apply probabilistic federated neural matching to FL, the feature extractors (neurons) of a set of Multilayer Perceptrons (MLPs) must be grouped and combined when constructing the global feature extractors. The goal of the Bayesian nonparametric mechanism is to identify the subsets of neurons in each of the J local models that match neurons in other local models. Then, the matched neurons are combined to form a global model. Suppose we train J MLPs, $j = 1, 2, \ldots, J$, each with one hidden layer.
Let $V_j^{(0)} \in \mathbb{R}^{D \times L_j}$ and $v_j^{(0)} \in \mathbb{R}^{L_j}$ denote the weights and biases of the hidden layer, and let $V_j^{(1)} \in \mathbb{R}^{L_j \times k}$ and $v_j^{(1)} \in \mathbb{R}^{k}$ denote the weights and biases of the softmax layer. Here $D$ is the data dimension, $L_j$ is the number of neurons in the hidden layer of model $j$, and $k$ is the number of classes. We consider a simple architecture:

$$f_j(x) = \operatorname{softmax}\left(\sigma\left(x V_j^{(0)} + v_j^{(0)}\right) V_j^{(1)} + v_j^{(1)}\right),$$

where $\sigma(\cdot)$ is a nonlinear activation function. Given the set of local weights and biases, the goal is to learn a global neural network with its own weights and biases. Figure 5 shows the Bayesian network diagram of the single-hidden-layer probabilistic federated neural matching algorithm. The nodes in the figure represent neurons, and neurons of the same color are matched. The method uses the corresponding neurons in the output layer to convert the neurons in each of the J batches into weight vectors that reference the output layer.

Figure 5: Bayesian network with hidden layers.

Federated LSTM.
Long Short-Term Memory (LSTM) [45] was proposed in 1997. Because of its unique design, LSTM is suitable for processing and predicting important events with long intervals and delays in time series. Some researchers have applied LSTM to the centralized FL model to predict characters on MNIST [46,47]. LSTM is specially designed to avoid the long-term dependency problem; memorizing long-term information is its default behavior in practice [48]. The LSTM is added to the local model training. Its input gate determines which input parameters are taken in next, the forget gate discards some parameters, and the output gate outputs the required parameters, which improves the iterative training. In the LSTM unit, the first $\sigma$ on the left is the activation function of the forget gate; the second $\sigma$ and the tanh in the middle are the activation functions of the input gate; the rightmost $\sigma$ and the tanh after it are the activation functions of the output gate. $x_t$ is the input, $h_t$ is the output, $h_{t-1}$ is the output at the previous time step, $C_{t-1}$ is the state at the previous time step, and $C_t$ is the state at the current time step.
Figure 6 shows the internal structure of the LSTM network unit. The study in [45] proposed segmenting the data sets of multiple participating clients. When LSTM is placed in the FL framework, the data are non-independent and identically distributed (non-IID); with appropriately selected hyperparameters, the model trained on non-IID data can be brought close to the accuracy obtained in the conventional setting [46,47]. Li et al. [49] trained LSTM classifiers on federated data sets and proposed an FL framework with a proximal term (FedProx) to address statistical heterogeneity in sentiment analysis and character prediction. Compared with traditional FedAvg, FedProx converges faster. Under system heterogeneity, FedAvg does not allow each local client to perform a variable amount of work according to its own constraints. The FedProx framework proposed in [49] introduces a regularization term to improve the stability of the whole framework. In essence, this term limits the difference between the parameters of the local model and those of the global model, which also provides a theoretical basis for accounting for the heterogeneity between global and local information. The traditional FedAvg objective function is

$$\min_{w} f(w) = \sum_{k=1}^{N} p_k F_k(w),$$

where $N$ is the number of devices, $F_k$ is the local objective function of the $k$th device, and $p_k \ge 0$ with $\sum_k p_k = 1$. There are $n_k$ samples on the $k$th device, and generally $p_k = n_k / n$ is used, where $n$ is the sum of all $n_k$. Each device minimizes its local function $F_k$. The number of local epochs $E$ in FedAvg plays an important role in the convergence of the global objective function. A larger $E$ means more local computation and less communication between devices, which can effectively improve the overall convergence speed of the global objective function. On the other hand, for heterogeneous local objectives $F_k$, too large a value of $E$ may cause each device to pursue the optimum of its local objective function rather than that of the global objective function, which harms the convergence of the global objective function and can even lead to divergence. The FedProx framework proposed in [49] is similar to FedAvg in that it selects a subset of devices to participate in each round, performs local updates, and then averages these updates to form the global update. However, FedProx makes some simple but critical modifications to improve convergence. Instead of minimizing $F_k$ alone, each device approximately minimizes the improved FedProx objective function $h_k$:

$$\min_{w} h_k(w; w^t) = F_k(w) + \frac{\mu}{2} \left\| w - w^t \right\|^2,$$

where $w^t$ is the global model at round $t$ and $\mu \ge 0$ controls the strength of the proximal term. A two-layer LSTM classifier with 100 hidden units and an 8-dimensional embedding layer is used in FedProx.
Its task is to predict the next character, with 80 character classes in total. The model takes a sequence of 80 characters as input, embeds each character into an 8-dimensional space, and outputs one character for each training sample after two LSTM layers and a densely connected layer [46]. The experimental results show that FedProx converges faster than FedAvg. In particular, in highly heterogeneous environments, FedProx shows more stable and accurate convergence behavior than FedAvg, improving the absolute test accuracy by 22% on average.
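To make the role of the proximal term concrete, the following is a minimal sketch of a FedProx-style local update on a toy least-squares problem, not the LSTM setup of [49]. The function names, the synthetic data, and the values of `mu`, `lr`, and `steps` are illustrative assumptions; only the form of the local objective (local loss plus a penalty on the distance to the global model) and the weighting $p_k = n_k / n$ come from the text.

```python
import numpy as np

def local_loss_grad(w, X, y):
    """Gradient of a toy local objective F_k(w) = ||Xw - y||^2 / (2 n_k)."""
    return X.T @ (X @ w - y) / len(y)

def fedprox_local_update(w_global, X, y, mu=0.1, lr=0.1, steps=20):
    """Approximately minimize F_k(w) + (mu / 2) * ||w - w_global||^2 by gradient descent."""
    w = w_global.copy()
    for _ in range(steps):
        # The second term is the gradient of the proximal penalty.
        grad = local_loss_grad(w, X, y) + mu * (w - w_global)
        w -= lr * grad
    return w

def aggregate(client_weights, client_sizes):
    """Weighted average of client models with p_k = n_k / n."""
    p = np.asarray(client_sizes, dtype=float)
    p /= p.sum()
    return sum(pk * wk for pk, wk in zip(p, client_weights))

# Hypothetical setup: two clients whose inputs follow different distributions.
rng = np.random.default_rng(0)
clients = []
for shift in (0.0, 1.0):
    X = rng.normal(shift, 1.0, size=(50, 2))
    y = X @ np.array([1.0, -2.0]) + rng.normal(0, 0.1, size=50)
    clients.append((X, y))

w_global = np.zeros(2)
for _ in range(30):  # communication rounds
    updates = [fedprox_local_update(w_global, X, y) for X, y in clients]
    w_global = aggregate(updates, [len(y) for _, y in clients])

print(w_global)
```

Setting `mu=0` in this sketch removes the penalty and leaves plain local gradient steps followed by averaging, which is the FedAvg-like baseline; larger `mu` keeps each local model closer to the global one.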

Federated Reinforcement Learning.
Nadiger et al. [50] first proposed the overall framework of Federated Reinforcement Learning (FRL), which includes grouping strategies, learning strategies, and federated strategies. Reinforcement Learning (RL) and other artificial-intelligence-based technologies have recently been used to achieve personalization. However, reinforcement learning faces the challenge of achieving personalization quickly. In [50], the authors propose a federated reinforcement learning technique whose main goal is to improve personalization time. FL applied to reinforcement learning is an example of hierarchical learning, which enables agents at lower levels to communicate their findings. Local clients with similar environments can be grouped so that they learn together more efficiently [51]. The article proposes using the Deep Q-Network reinforcement learning algorithm in a federated environment to achieve faster personalization. The client model and the shared model can be regarded as one large Q network optimized by the Bellman equation. In the current work, however, Q-learning runs separately on each client, and a federation strategy determines the shared model parameters. The personalization measure used in the article is defined in terms of the following quantities: PM is the personalization measure of a game, $N_{r4}$ is the number of rallies in the game with length greater than or equal to 4, and $N_r$ is the total number of rallies of all lengths in the game. The server sends the global model to all clients, which provides a "hot start" for each client. The global model is built offline. Then, each client updates the weights of the Nonplayer Character (NPC) model according to its local RL algorithm. The server waits until the NPC models have been received from all clients in the group.
The global model is then computed by the update rule (21), where $w_g$ is the global model, $w_c$ is the client model, $\alpha$ is the global model regularization factor, the client weighting uses the percentage of rallies with length greater than or equal to 4 on client $i$, and $k$ is the number of clients. The experimental results show that the proposed method speeds up the personalization of agents by using federated reinforcement learning. The grouping strategy, learning strategy, and federated strategy together make up the whole FRL architecture. The effectiveness of the method is shown by tests with 3, 4, and 5 human players, in which personalization time is reduced by about 17%. Anwar et al. [52] analyzed multitask federated reinforcement learning from an adversarial perspective, evaluated the attack performance of several common attack methods, and proposed an adaptive attack method. General adversarial attacks are not enough to attack the mobile terminal effectively, so a model poisoning attack based on minimizing the information gain during training is proposed. In FL, there are multiple local clients. In addition to preventing data poisoning and policy poisoning, attacks on the model itself must also be considered. Because there is more than one local client, an entire local client can play the role of an attacker.
Attackers can feed in false data and deliberately corrupt the federated model. In a model attack, the attacker tries to directly modify the learned model parameters by providing erroneous information that intentionally damages the global model [53,54]. Because classical FL uses an averaging algorithm to merge the local model parameters learned by individual clients, such an attack can seriously degrade the performance of the global model. In Multitask Federated Reinforcement Learning (MT-FedRL), each client runs in its own environment, which can be characterized by a different Markov Decision Process (MDP). Each agent acts and observes only in its own environment. The goal of MT-FedRL is to learn a unified policy that is jointly optimal across all $n$ environments. Each agent shares its information with a centralized server. The state and action spaces do not need to be the same in each of the $n$ environments. If the state spaces do not intersect across environments, the joint problem decouples into a set of $n$ independent problems. The goal of the MT-FedRL problem is to find a unified policy $\pi^*$ that maximizes the sum of the long-term discounted returns over all environments, namely,

$$\pi^* = \arg\max_{\pi} \sum_{i=1}^{n} \mathbb{E}_{s_i \sim \rho_i}\left[ V_i^{\pi}(s_i) \right],$$

where $V_i^{\pi}$ is the value function of policy $\pi$ in the $i$th environment, $s_i$ is a state of the $i$th environment, and $\rho_i$ is the initial state distribution of the $i$th environment. Solving this problem yields a unified $\pi^*$ that achieves balanced performance across all environments. In [55], it is proved that multitask federated reinforcement learning can converge to a unified policy that achieves the best joint performance across the environments. If the clients' goals are positively correlated, this jointly optimal policy also performs well when evaluated in each individual environment. If the clients' goals are not positively correlated, a unified policy may not be near-optimal for a single environment.
In [52], three common attack models are discussed in detail: the random policy attack, the opposite goal policy attack, and the adversarial attack based on minimizing information gain. Finally, the authors propose a modification of the general federated reinforcement learning algorithm to defend against such attacks, which performs equally well with and without attacks. The federated reinforcement learning process and algorithm are given in [52], in which several cooperating models try to maximize the sum of discounted returns in the presence of adversarial models in different environments. Figure 7 shows the flow chart of federated reinforcement learning.

Federated Meta-Learning.
Chen et al. [56] proposed a Federated Meta-learning (FedMeta) framework, in which a parameterized algorithm (meta-learner) is shared instead of the global model shared in previous approaches.
The article evaluates FedMeta on the LEAF data sets and on a real-world data set and shows that, compared with FedAvg, FedMeta reduces the required communication cost by 2.82 to 4.33 times, converges faster, and increases accuracy by 3.23% to 14.84%. In the field of FL, local models are trained with SGD to achieve high accuracy while balancing computational and communication costs; in the field of meta-learning, the MAML algorithm is used to converge quickly on new tasks and shows good generalization. The federated meta-learning framework was built on this basis.
The FedMeta framework integrates the MAML and Meta-SGD algorithms into FL, which improves the accuracy of the jointly trained model and reduces the communication overhead. A meta-learning algorithm $A_\varphi$ is parameterized by $\varphi$, which is updated using a set of tasks during meta-training. Each task $T$ in meta-training consists of a support set $D_S^T = \{(x_i, y_i)\}$ and a query set $D_Q^T$, both containing labeled data points [57]. The algorithm $A_\varphi$ trains a model $f$ on the support set $D_S^T$ and outputs parameters $w_T$, which is called the inner update. It then evaluates the model $f_{w_T}$ on the query set $D_Q^T$ and calculates the test loss $L_{D_Q^T}(w_T)$, which reflects the training ability of $A_\varphi$ [58]. Finally, $A_\varphi$ is updated to minimize the test loss, which is called the outer update. In each episode, the meta-learning algorithm samples a batch of tasks from a meta-training set, so the optimization goal of meta-training can be expressed as

$$\min_{\varphi} \; \mathbb{E}_{T}\left[ L_{D_Q^T}(w_T) \right], \quad w_T = A_\varphi\left(D_S^T\right).$$

In the MAML case, for each task $T$ the algorithm sets $\varphi = w$, so that the parameters of the algorithm are the initial parameters of the model $f$. Then the parameters of model $f$ are trained on the support set and updated according to the loss function:

$$w_T = w - \alpha \nabla_w L_{D_S^T}(w),$$

where $\alpha$ is the inner learning rate. Finally, the model parameters are tested on the query set, and the test loss is calculated:

$$L_{D_Q^T}(w_T) = L_{D_Q^T}\left( w - \alpha \nabla_w L_{D_S^T}(w) \right).$$

Experiments on the LEAF data sets show that convergence is faster and accuracy is greatly improved compared with traditional FL, while the communication cost is also reduced. The goal of meta-learning is to train an algorithm. Federated meta-learning means that many devices join together to train the same meta-learner. Each device has its own meta-learner, but the parameters are aggregated on the server, where the global meta-learner is trained. The global model trained by conventional FL is the same on every device. Because of the strong data heterogeneity across devices, meta-learning is needed to personalize the model.
Meta-learning generates a metamodel locally, and then metamodels generate personalized models locally, which are suitable for local heterogeneous data. Figure 8 shows the federated meta-learning framework.
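The inner and outer updates described above can be sketched on a toy problem. The example below is a minimal illustration, assuming linear-regression tasks with a squared loss so that the MAML meta-gradient can be written exactly; it is not the FedMeta implementation of [56]. The names `maml_meta_gradient` and `make_task`, the learning rates `alpha` and `beta`, and the synthetic clients are all hypothetical.

```python
import numpy as np

def loss(w, X, y):
    """Squared loss L_D(w) = mean((Xw - y)^2) / 2."""
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad(w, X, y):
    """Gradient of the squared loss with respect to w."""
    return X.T @ (X @ w - y) / len(y)

def maml_meta_gradient(phi, support, query, alpha):
    """Return the gradient of the query loss of the adapted model w.r.t. phi, and that loss."""
    Xs, ys = support
    Xq, yq = query
    w_T = phi - alpha * grad(phi, Xs, ys)       # inner update on the support set
    hessian_s = Xs.T @ Xs / len(ys)              # exact Hessian of the quadratic support loss
    jac = np.eye(len(phi)) - alpha * hessian_s   # Jacobian d w_T / d phi
    return jac.T @ grad(w_T, Xq, yq), loss(w_T, Xq, yq)

rng = np.random.default_rng(0)
base = np.array([2.0, -1.0, 0.5])

def make_task(w_true, n=20):
    """Hypothetical client task: a support set and a query set from one linear model."""
    X = rng.normal(size=(2 * n, 3))
    y = X @ w_true + rng.normal(0, 0.05, size=2 * n)
    return (X[:n], y[:n]), (X[n:], y[n:])

# Each client holds data from a slightly different model (heterogeneous clients).
client_tasks = [make_task(base + rng.normal(0, 0.3, size=3)) for _ in range(5)]

phi = np.zeros(3)
alpha, beta = 0.05, 0.1   # inner and outer learning rates (assumed values)
history = []
for _ in range(100):
    # Clients compute meta-gradients locally; the server averages them (outer update).
    results = [maml_meta_gradient(phi, s, q, alpha) for s, q in client_tasks]
    history.append(np.mean([l for _, l in results]))
    phi -= beta * np.mean([g for g, _ in results], axis=0)

print(history[0], history[-1])
```

Only the meta-learner parameters `phi` and the meta-gradients are exchanged in this sketch; each client's adapted parameters `w_T` stay local, which is where the personalization comes from.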

Federated Residual Network.
Huang et al. [59] proposed a new compression strategy, the Residual Pooling Network (RPN) [60], to improve the communication efficiency of FL. Compared with traditional FL, RPN alleviates the communication and computing overhead by selecting appropriate parameters and maintains the original performance while reducing data transmission. RPN is an end-to-end process, and it can also be applied to CNN-based model training scenarios to improve the communication efficiency of federated models. The total number of bits that must be transmitted during model training can be written as

$$\sum_{t=1}^{T} \sum_{i=1}^{M} \left( F(G^t) + F(G_i^t) \right),$$

where $T$ is the total number of iterations, $M$ is the number of clients that the server selects for updating in round $t$, $G^t$ represents the global model after $t$ aggregations, and $F(G^t)$ is the number of bits of the selected parameters downloaded to a client. Similarly, $F(G_i^t)$ is the number of bits of the selected parameters that client $i$ uploads to the server. The article improves communication efficiency from four aspects: iteration frequency, pruning, importance-based updates, and quantization. $R_i^t$ denotes the residual network of client $i$ in round $t$, where $G_i^t$ is the model of client $i$ and the global model is parameterized by its weights, that is, $G^t = f(w^t)$. The experiments in the article include classification, object detection, and semantic segmentation. They show that RPN not only effectively reduces data transmission but also achieves almost the same performance as traditional FL. Most importantly, RPN is an end-to-end process, which makes it easy to deploy in real-world applications without human intervention. The federated residual network learning workflow includes (1) selecting clients for local model updates, (2) recovering local models, (3) training the local model on local data sets, (4) calculating residual networks, (5) spatial pooling, (6) sending the RPN to the server and aggregating, and (7) sending the RPN back to the selected clients and repeating the cycle.
Figure 9 is a schematic diagram of the federated residual network. Table 3 shows the current federated learning methods based on deep learning.
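The idea of sending a compressed residual instead of a full model can be illustrated with a small sketch. This is only one possible reading of the workflow, under strong assumptions: the model is a flat weight vector, the residual is taken as the local model minus the global model, and pooling is plain average pooling over fixed-size blocks. The function names and the `pool` parameter are hypothetical and are not taken from [59].

```python
import numpy as np

def compress_update(local_w, global_w, pool=4):
    """Compute the residual and average-pool it over blocks of `pool` weights."""
    residual = local_w - global_w
    pad = (-len(residual)) % pool              # zero-pad so the length divides evenly
    padded = np.pad(residual, (0, pad))
    return padded.reshape(-1, pool).mean(axis=1)

def restore_update(pooled, global_w, pool=4):
    """Expand the pooled residual and add it back to the global model."""
    residual = np.repeat(pooled, pool)[: len(global_w)]
    return global_w + residual

rng = np.random.default_rng(0)
global_w = rng.normal(size=1000)
local_w = global_w + 0.01 * rng.normal(size=1000)   # model after local training

pooled = compress_update(local_w, global_w)
approx_local = restore_update(pooled, global_w)
print(len(local_w), "->", len(pooled), "values sent")
```

In this sketch the client transmits one value per block of four weights. The reconstruction is lossy, so the trade-off between the pooling size and the accuracy of the recovered model would need to be checked empirically.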

Privacy and Security Issues of Federated Learning
Although FL ensures that data are trained locally on the client, it still faces privacy and security issues under malicious attack, mainly in the following three respects. First, the data collector may collect user data privately without permission, leading to direct data leakage during data collection; second, there is indirect privacy leakage due to the insufficient generalization ability of the model; finally, the model may be poisoned for lack of safety precautions [61]. This section discusses the defense and attack aspects of FL.

Byzantine Prevention of Federated Learning.
In recent years, security issues in FL have attracted widespread attention. Especially in decentralized environments, some unreliable clients may behave abnormally and even exhibit Byzantine failures, that is, arbitrary and potentially adversarial behavior [62]. Byzantine-robust FL aims to learn an accurate global model on the server side when a bounded number of clients are malicious. The key idea of existing Byzantine-robust FL methods is that the service provider performs statistical analysis on the clients' local model updates and removes suspicious ones before aggregating them to update the global model [63]. At present, a main vulnerability of FL concerns SGD: how to keep distributed SGD robust when hostile Byzantine clients send poisoned updates during the training phase is a hot research topic [64]. In the presence of hostile Byzantine clients, the learned model may be biased due to data corruption, communication failure, or the malicious sending of incorrect information to the server [65]. To defend against the Byzantine problem, Blanchard et al. [53] proposed Krum, the first provably Byzantine-resilient algorithm for distributed SGD, which is an aggregation rule that satisfies a formal resilience property. To cope with potentially abnormal clients, Yin et al. [62] proposed two robust distributed gradient descent algorithms based on coordinate-wise median and trimmed mean operations, gave a sharp analysis of them, and proved that the median-based distributed algorithm is robust and achieves order-optimal statistical error rates. Li et al. [65] proposed the Byzantine-Robust Stochastic Aggregation (RSA) method. RSA regularizes the objective function to enhance the robustness of the learning task. Unlike most algorithms, RSA does not rely on the assumption that the data are independent and identically distributed, so it is suitable for a wider range of applications.

Figure 9: Federated residual network workflow.
Shejwalkar and Houmansadr [66] proposed divide-and-conquer (DnC) and demonstrated that DnC outperforms all existing Byzantine-robust FL algorithms in defeating model poisoning attacks.
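The median and trimmed-mean aggregation rules mentioned above can be sketched in a few lines. The example below is a minimal illustration with made-up client updates; the function names and the `trim_ratio` parameter are assumptions for this sketch and do not reproduce the exact algorithms or guarantees of [62].

```python
import numpy as np

def coordinate_median(updates):
    """Aggregate client updates by taking the median of each coordinate."""
    return np.median(np.stack(updates), axis=0)

def trimmed_mean(updates, trim_ratio=0.2):
    """Drop the k smallest and k largest values per coordinate, then average the rest."""
    stacked = np.sort(np.stack(updates), axis=0)
    k = int(trim_ratio * len(updates))
    return stacked[k: len(updates) - k].mean(axis=0)

# Four honest clients plus one Byzantine client sending an extreme update.
honest = [np.array([1.0, 1.0]), np.array([1.1, 0.9]),
          np.array([0.9, 1.1]), np.array([1.0, 1.0])]
byzantine = [np.array([100.0, -100.0])]
updates = honest + byzantine

plain_mean = np.mean(np.stack(updates), axis=0)
robust_median = coordinate_median(updates)
robust_trimmed = trimmed_mean(updates)
print(plain_mean, robust_median, robust_trimmed)
```

In this toy case the plain average is pulled far from the honest updates by a single malicious client, while both robust rules stay close to them, which is the behavior the Byzantine-robust aggregation literature aims for.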

Local Model Attack of Federated Learning.
In addition, some researchers study the robustness of FL from the attacker's side. Attacks on FL mainly come from internal attackers who participate in the FL process and from the unique model training strategy of FL. Malicious adversaries may interfere with the distributed learning process or plant backdoors in it. Baruch et al. [67] proposed a new attack that applies limited changes to many parameters. In their paper, among the existing defenses, a variant of trimmed mean produced the best results against the convergence attack, apart from naive averaging, which is obviously vulnerable to other, simpler attacks [67].
Bhagoji et al. [68] explored the threat of model poisoning attacks on federated learning initiated by a single, noncolluding malicious agent whose adversarial objective is to cause the model to misclassify a set of chosen inputs with high confidence. They use a suite of interpretability techniques to generate visual explanations of model decisions for both benign and malicious models and show that the explanations are nearly visually indistinguishable. Their results indicate that even a highly constrained adversary can carry out model poisoning attacks while simultaneously maintaining stealth, thus highlighting the vulnerability of the FL setting and the need to develop effective defense strategies [68]. Bagdasaryan et al. [69] exploited the privacy protection mechanism of FL and added abnormal data to carry out malicious attacks on the model, which existing Byzantine anomaly detection fails to recognize. How to design robust FL systems is therefore an important topic for future research. Fang et al. [70] performed the first systematic study of local model poisoning attacks on FL. They assume that an attacker has compromised some client devices and manipulates the local model parameters on those devices during the learning process so that the global model has a large testing error rate. Their experiments indicate that valuable future work includes designing new defenses against local model poisoning attacks, new methods to detect compromised local models, and new adversarially robust aggregation rules [70].

Data Privacy Issues.
Under the FL framework, although a user's local data do not need to be uploaded to the server, they are used directly in local modeling. If noise is not added to these local data to protect them, they may be exposed to attack by a malicious user [71]. There are two modes of attack: active attacks and passive attacks.
When designing an FL protocol, if we assume that an active participant is malicious and undermines the security of the model, we call that participant an active attacker of FL. The server, in turn, can obtain the model update parameters from the various devices and can attack the FL model by analyzing the model parameters of each round of updates.
We call such semihonest but curious servers passive attackers. The main difference between the active attacker and the passive attacker is who initiates the attack: the active attacker is a client, and the passive attacker is the server. Both types of attacks damage the confidentiality, integrity, and availability of the FL model [72,73]. The attacked federated model and the jointly trained model will lose their balance. In the worst case, the jointly established model cannot be returned to the local clients.
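One common way to limit what a curious server can learn is to perturb whatever leaves the device. The sketch below assumes the perturbation is applied to the model update rather than to the raw data, using norm clipping followed by Gaussian noise; the function name and the values of `clip_norm` and `noise_std` are illustrative, and no formal privacy guarantee is claimed for these particular settings.

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip the update to a maximum L2 norm, then add independent Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    return clipped + rng.normal(0.0, noise_std, size=update.shape)

rng = np.random.default_rng(0)
true_update = np.array([0.3, -0.4])   # norm 0.5, below the clipping threshold

# Each of 200 hypothetical clients sends a noisy version of its update.
noisy_updates = [privatize_update(true_update, rng=rng) for _ in range(200)]
server_average = np.mean(noisy_updates, axis=0)
print(server_average)
```

Each individual noisy update reveals less about the client, while the server-side average over many clients remains close to the true update, which is why this kind of noise is usually calibrated against the number of participants.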

Data Communication Issues.
In the FL framework, client-side and server-side devices communicate to transmit model parameters or gradients, and this communication is more frequent than in traditional distributed machine learning. However, the devices participating in joint training do not all have the same computing power or a stable transmission rate, which often causes communication instability. For example, when the input method of a mobile phone uses FL, some phones take part in joint modeling over mobile data and others over Wi-Fi; data transmission over mobile data is usually less stable than over Wi-Fi, so communication is easily interrupted when uploading or downloading model parameters. Even when the same mobile phone stays in the same network state, communication can be unstable because the number of parameters transmitted varies. Therefore, data communication in FL modeling is a problem that deserves researchers' attention. In addition, the communication bandwidth problems raised in [74], the convergence of the jointly trained model, and communication between cloud service providers all need further research.

Data Heterogeneity Issues.
The data in distributed ML are often independent and identically distributed (IID), but FL differs from traditional distributed ML. The data held by devices in an FL network and used in training are generally non-IID. For example, although a bank and an Internet shopping platform share some of the same customers, their data storage structures are heterogeneous. In addition, the uneven distribution of data across cross-device data holders also leads to data heterogeneity. Therefore, many common algorithms designed for IID data cannot be used directly. Developing algorithms that are better suited to heterogeneous FL data is also a very important development direction for FL.
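A simple way to see what non-IID client data look like is to simulate a label-skewed split. The sketch below assumes a labeled data set and assigns each client a few label-sorted shards, so that every client sees only a small number of classes; the function name, the shard scheme, and the toy label array are assumptions made for illustration.

```python
import numpy as np

def partition_by_label(labels, num_clients, shards_per_client=2, seed=0):
    """Split sample indices into label-sorted shards and give each client a few shards."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels, kind="stable")           # group indices by label
    shards = np.array_split(order, num_clients * shards_per_client)
    shard_ids = rng.permutation(len(shards))            # random shard assignment
    return [
        np.concatenate([shards[s] for s in
                        shard_ids[c * shards_per_client:(c + 1) * shards_per_client]])
        for c in range(num_clients)
    ]

# Toy data set: 10 classes with 100 samples each.
labels = np.repeat(np.arange(10), 100)
parts = partition_by_label(labels, num_clients=10)

for client_id, idx in enumerate(parts[:3]):
    print(client_id, np.unique(labels[idx]))
```

With this split each client holds at most two classes, so a model trained only on one client's data is biased toward those classes. This is the kind of heterogeneity that algorithms designed for IID data do not account for.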

Data Overhead Issues.
In FL application scenarios, most of the local models that participate in training need to perform computing and communication tasks on mobile terminals. Because the number of local models involved is very large, this is a challenge not only for communication but also for computing. FL is not only a technical standard but also a business model. Encryption is a very important link in the financial industry, and the original cloud computing model has been challenged with respect to encryption. Adding an encryption algorithm to cloud computing data transmission is a common protection method. Some researchers have proposed secure mixing methods [74,75], but these increase the cost of communication. Once encryption is added, the data must also be decrypted, which further increases the computational cost. Literature [76] raises the problem of balancing communication cost and accuracy and guides this balance by evaluating distributed statistics and learning rates under a given bandwidth. At present, no researchers have applied it to FL, and there is no established method for solving the problem of high computing overhead, so this field still needs to be explored and improved. The problem of data computing overhead urgently needs to be solved.

Lack of a Trusted Central Server.
In the FL process, a trusted central server is needed to ensure the privacy and security of users. Some scholars have put forward decentralized algorithms based on local update schemes for decentralized training on heterogeneous data. FL requires a central server to coordinate the training process and receive the models uploaded by all clients. The server is therefore a central participant, and it may also be a single point of failure. Although large companies or organizations can play this role in certain application scenarios, in more collaborative learning scenarios a reliable and powerful central server may not always be available. Even if centralized differential privacy is adopted to protect the data, the central server must be trusted by users; otherwise, data leakage will occur. Future research can start from how to build a trusted central server and further improve the server structure of FL, making it less vulnerable to attacks and failures. Existing approaches to building trusted servers mainly include ARM's TrustZone architecture and Intel's SGX-enabled CPU architecture [59]. Table 4 summarizes the problems of federated learning.

Conclusion
This paper discusses the classification and development of FL and several existing problems of FL. It takes the perspective of FL algorithms, focusing on federated deep learning after an introduction to federated ML. In the chapter on federated deep learning, existing deep learning algorithms are discussed from the perspectives of communication, data heterogeneity, privacy protection, and trusted servers in FL. At present, FL is still in a stage of rapid development, and there are still many unsolved problems concerning ML and deep learning algorithms under the FL framework. As the amount of data continues to grow, implementing deep learning algorithms under FL is not only a feasible scheme for practice in the field of artificial intelligence but also a more efficient and comprehensive way to use distributed ML and edge data. In the future, FL will develop in coordination with multiple fields, such as edge computing, blockchain, and privacy protection, to improve the performance of FL and, at the same time, increase its commercial value. In order to facilitate readers to  Table 5.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.