Federated Learning Optimization Algorithm for Automatic Weight Optimization

Federated learning (FL), a distributed machine-learning framework, is poised to effectively protect data privacy and security and has been widely applied in a variety of fields in recent years. However, the system heterogeneity and statistical heterogeneity of FL pose serious obstacles to the global model's quality. This study investigates server and client resource allocation in the context of FL system resource efficiency and proposes the FedAwo optimization algorithm. This approach combines adaptive learning with federated learning and makes full use of the computing resources of the server to calculate the optimal weight value corresponding to each client. The approach aggregates the global model according to the optimal weight values, which significantly reduces the detrimental effects of statistical and system heterogeneity. In traditional FL, we found that the local training of a large number of clients converges earlier than the specified epoch; nevertheless, under the provisions of traditional FL, such clients must still train for the specified number of epochs, which renders a large number of client computations meaningless. To further lower the training cost, the enhanced FedAwo* algorithm is proposed. The FedAwo* algorithm takes the heterogeneity of clients into account and sets a criterion for local convergence; when a client's local model reaches the criterion, it is returned to the server immediately, so the number of client epochs can be modified dynamically and adaptively. Extensive experiments on the MNIST and Fashion-MNIST public datasets show that the global model converges faster and reaches higher accuracy with the FedAwo and FedAwo* algorithms than with the FedAvg, FedProx, and FedAdp baselines.


Introduction
Federated learning, a distributed machine-learning framework that can effectively protect the privacy and security of user data, has received extensive attention from academia and industry in recent years. Federated learning involves co-training a machine-learning model by a server and clients. The server sends the global model to the clients, receives the local models trained by the clients, and aggregates them to generate a new global model until training of the global model ends. Clients use local data to train the global model given by the server and return the trained local model to the server [1]. Federated learning effectively protects the privacy and security of data by transmitting only model parameters between the server and the client (data never leave the client) and is used in many fields. The most typical example is Google's keyboard input method, which uses a federated learning platform to train a recurrent neural network (RNN) for next-word prediction. In addition, federated learning is widely used in clinical auxiliary diagnosis, new drug development, and precision medicine in the medical industry, as well as portrait recognition and voiceprint recognition in the security industry. Although federated learning effectively solves the problem of data privacy and security, it differs from traditional distributed machine learning and brings the serious challenges of system heterogeneity and statistical heterogeneity. Traditional distributed machine learning is usually deployed in the same data center or in a network with a good communication environment, and the clients participating in model training have similar hardware conditions. However, the clients of federated learning are often widely distributed geographically; there are great differences among them in network conditions, hardware environment, and computing power, and the times at which clients can participate in model training also differ.
The above phenomenon is called system heterogeneity, which may lead to the problems of stragglers (nodes that cannot complete the specified training rounds within the specified time) and fault tolerance [2]. In addition, the data distribution and data volume of the local data held by different clients also differ, which is the statistical heterogeneity of the data. Both statistical heterogeneity and system heterogeneity have a negative impact on the convergence speed and final accuracy of the global model [3].
At present, most researchers try to reduce the negative impact of heterogeneity by sampling clients or modifying clients' loss functions. In the sampling method, the server filters out the local models that are more conducive to global model convergence and aggregates them. Among sampling algorithms [4, 5], the importance-based method is widely used [4, 6-8]; it selects the "important" clients by comparing client gradient information and aggregates their local gradients. Modifying the loss function of the client is currently the more mainstream approach [2, 9, 10]. Its idea is to modify the client's loss function, such as adding a proximal term [2] or normalizing it with the previous round's global model [11, 12]. However, the above methods ignore a crucial phenomenon: the imbalance of computing power between servers and clients in the federated learning system. In actual application scenarios, the computing power of clients is relatively weak, and modifying the client loss function further increases the client's computing burden, whereas servers often have strong computing power and network conditions yet only undertake the task of aggregating local models and generating the global model.
Obviously, in the federated learning system, the computing power and network environment of clients are poor, yet they are responsible for the heavy work of model training, while the server, with strong computing power and a good network environment, undertakes light work that does not match its ability. In order to make better use of system resources and improve performance, this paper studies how to use server resources to solve the problems of statistical heterogeneity and system heterogeneity without increasing the load on clients. This paper proposes the federated learning algorithm for automatic weight optimization (FedAwo) and its enhancement algorithm (FedAwo*) and verifies the feasibility of these methods from both theoretical and experimental aspects. Our main contributions in this paper are as follows: (1) We design a federated learning algorithm for automatic weight optimization (FedAwo). In this algorithm, the server calculates the optimal weight for each local model through a machine-learning algorithm to address the problems of statistical heterogeneity and system heterogeneity in federated learning. The FedAwo algorithm effectively utilizes server resources and does not increase the burden on clients.
(2) We prove the convergence of FedAwo and propose the enhancement algorithm FedAwo* to further reduce the training cost. The rest of this paper is organized as follows: the second section introduces related work on solving heterogeneity in federated learning; the third section introduces the federated learning algorithm for automatic weight optimization (FedAwo) in detail; in the fourth section, we prove the convergence of the FedAwo algorithm; in the fifth section, we propose the optimization algorithm FedAwo*; in the sixth section, we verify the performance of FedAwo and FedAwo* through experiments; finally, we summarize this paper.

Related Work.
Research on the convergence of federated learning [2, 9, 11, 13] shows that system heterogeneity and statistical heterogeneity in federated learning have a great negative impact on the convergence speed and accuracy of the global model.
The optimization methods for heterogeneity problems mainly focus on modifying the loss function of clients or sampling clients. For modifying the loss function of the client, literature [2] proposed the FedProx algorithm, which adds a proximal term (μ/2)‖x − x^{(t,0)}‖² to help improve the stability of federated learning. At the same time, the FedProx algorithm dynamically adjusts the number of client training epochs to solve the straggler problem caused by system heterogeneity. The effect of this method is more obvious in environments with stronger heterogeneity. However, the original intention of the FedProx algorithm is to solve the straggler problem; due to the introduction of the proximal term, the computing overhead of the client increases instead, and in some cases the straggler problem becomes even more serious. Literature [11] proposed the SCAFFOLD algorithm, which corrects the client-drift phenomenon that occurs in the FedAvg algorithm by introducing the correction term (c − c_i). Literature [10] proposed the FedNova algorithm, which eliminates objective inconsistency and maintains fast convergence by normalizing local models. The SCAFFOLD and FedNova algorithms are similar to the FedProx algorithm in this respect: although communication overhead is further optimized and model quality is improved, they still increase the computing overhead of the client. Literature [14] proposed the FedDyn algorithm, which keeps the local and global model distributions approximately consistent by assigning a dynamic regularization optimizer to each client in each round. All of these methods can reduce the influence of heterogeneity on convergence speed and model accuracy, but they all increase the computational overhead of clients, even though the computing power of the server is better than that of the client; in practice, most clients are always busy while the server is often idle.
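To make the proximal-term idea concrete, a minimal sketch of the FedProx-style local objective is shown below, assuming flat NumPy parameter vectors; the function name and default μ are illustrative, not FedProx's reference implementation:

```python
import numpy as np

def fedprox_local_loss(base_loss, theta, theta_global, mu=0.1):
    """Local objective with a FedProx-style proximal term:
    base loss plus (mu/2) * ||theta - theta_global||^2.

    The proximal term penalizes drift of the local parameters away from
    the current global model, stabilizing local updates under heterogeneity.
    """
    diff = theta - theta_global
    return base_loss + 0.5 * mu * float(diff @ diff)
```

With μ = 0 the objective reduces to the plain local loss, which is one way to see that FedProx generalizes FedAvg's local step.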
For the sampling method, the authors in [4] established a general sampling federated learning system and obtained an unbiased optimal sampling probability to alleviate the influence of heterogeneity on the global model. Literature [15] proposed the FedL algorithm, a graph convolutional network (GCN)-based sampling method that maximizes the accuracy of the global model by learning the relationship between network attributes, sampled nodes, and generated offloads. Literature [16] classified local models according to the importance of clients in each round, aggregated the "important" local models, and proposed an approximately unbiased sampling optimization algorithm. Literature [17] proposed the FOLB algorithm, which estimates the gradient information of the local model to infer client performance and performs weighted sampling based on it; this method copes with system heterogeneity and makes the global model converge quickly. Although sampling methods can make the global model converge quickly, the quality of the final global model is poor.
In addition, literature [18] proposed the FedHQ algorithm to address system heterogeneity by minimizing the upper bound of the convergence speed as a function of the heterogeneous quantization errors of all clients and assigning different aggregation weights to different clients. To address heterogeneity, literature [19] proposed an algorithm with periodic compressed communication, which introduces a local gradient tracking scheme and achieves a fast convergence speed matching the communication complexity. Literature [20] analyzed the convergence bound of gradient-descent-based federated learning from a theoretical perspective and obtained a novel convergence bound. Using this theoretical bound, literature [20] proposed a control algorithm that learns the data distribution, system dynamics, and model characteristics and, based on these, dynamically adapts the frequency of global aggregation in real time to minimize the learning loss under a fixed resource budget. Literature [18-20] addressed the system heterogeneity caused by external factors such as system configuration and hardware conditions but did not pay attention to the statistical heterogeneity caused by local data differences.
Due to the limitations of the above two kinds of methods, this paper aims to solve the heterogeneity problem by introducing adaptive learning. Before this, literature [13, 21, 22] tried to combine adaptive learning with federated learning. Literature [21] proposed a federated learning optimization scheme with an adaptive gradient descent function, which improves the privacy of the local training process through differential privacy and scaling of the update volume. This algorithm can enhance the privacy security of each client during joint learning, but it cannot effectively suppress the negative impact of heterogeneity. Literature [22] proposed the adaptive personalized federated learning (APFL) algorithm, in which each client trains its local model while contributing to the global model. The APFL algorithm adaptively learns the model by leveraging the relatedness between local and global models as learning proceeds, which effectively improves the convergence speed of the global model. Literature [13] proposed federated adaptive weighting (FedAdp), which assigns different weights to nodes for global model aggregation in each communication round. The FedAdp algorithm allocates client weights by calculating the angle between the local model update and the global model update. However, when the performance of the local model is superior to that of the global model, FedAdp still assigns a lower weight to the local model according to this angle value, which is obviously unreasonable. We summarize the limitations of the above methods in Table 1.
Therefore, the methods that modify the client loss function increase the computational overhead of the client, and the sampling methods suffer from low accuracy of the final global model, while current federated learning algorithms combined with adaptive learning do not focus on solving the heterogeneity problem. This paper differs from the above methods: from the perspective of resource allocation in the federated learning system, it makes full use of the server's advantageous resources and combines adaptive learning to reduce the negative impact of heterogeneity. As far as we know, this paper is the first work aimed at using server computing resources to solve for the optimal weight allocation values.

Federated Learning Algorithm for Automatic Weight Optimization (FedAwo)
In this section, we establish the system architecture and propose the automatic weight optimization algorithm FedAwo. Finally, we introduce its specific process in detail.

System Model.
A federated learning system generally includes one server and K clients. The server coordinates the training of each client and aggregates and distributes the global model. Clients hold their own local datasets D_1, D_2, ..., D_K, and the total amount of data across all clients is Σ_{k=1}^K |D_k| [23-27]. Clients perform local learning under the coordination of the server. We first define f(θ) as a loss function, where θ is the model parameter. Thus, the global loss function over the clients is defined as

F(θ) = Σ_{k=1}^K p_k · F_k(θ), (1)

and the local loss function of each client is defined as

F_k(θ) = (1 / |D_k|) Σ_{x ∈ D_k} l_k(θ, x), (2)

where l_k(θ, x) is the loss function evaluated at data sample x and model θ, and p_k represents the training-data weight of the k-th client:

p_k = |D_k| / Σ_{j=1}^K |D_j|. (3)

The global model aggregation rule is defined as

θ_{t+1} = Σ_{k=1}^K p_k · θ_k^t. (4)

The purpose of federated learning is to find the optimal value of (1), and the FedAvg algorithm repeats local training and the aggregation in (4) until the global model converges. The most popular and de facto optimization algorithm for solving (1) is FedAvg [1]. Denoting t as the index of a federated learning round, one round (e.g., the t-th) of the FedAvg algorithm proceeds as follows: (1) The server broadcasts the global model θ_t to each client. (2) Each client performs local SGD on its data to compute the updated model θ_k^t and then sends the updated model back to the server.
(3) The server aggregates the clients' updated models (with weights p_k) and computes a new global model θ_{t+1}.
The above process repeats for many rounds until the global loss converges.
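The FedAvg aggregation step described above can be sketched as follows (a minimal sketch assuming each model is a dict of NumPy parameter arrays; names are illustrative):

```python
import numpy as np

def fedavg_aggregate(local_models, data_sizes):
    """Weighted average of client models with p_k = |D_k| / sum_j |D_j|.

    local_models: list of dicts mapping parameter name -> np.ndarray.
    data_sizes:   list of local dataset sizes |D_k|, one per client.
    """
    total = float(sum(data_sizes))
    new_global = {}
    for name in local_models[0]:
        new_global[name] = sum(
            (n / total) * model[name]
            for model, n in zip(local_models, data_sizes)
        )
    return new_global
```

A client holding three times as much data as another thus contributes three times as much to each aggregated parameter.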
At present, research on the negative effects of heterogeneity mostly uses the sampling method or modifies the loss function of clients. Different from these algorithms, we modify p_k in (1) to reduce the influence of heterogeneity on the global model by finding a correction value q_k, so the global model aggregation rule is rewritten as

θ_{t+1} = Σ_{k=1}^K q_k · θ_k^t. (5)

As shown in Table 2, the aggregation rule of the global model is updated to θ_{t+1} = Σ_{k=1}^K q_k · θ_k^t.

Federated Learning Algorithm for Automatic Weight Optimization.
We design a federated learning algorithm, FedAwo, for automatic weight optimization to obtain q_k.
The FedAwo algorithm aims to reduce the negative impact of statistical heterogeneity and system heterogeneity on federated learning and makes full use of the computing resources of the server. Compared with traditional federated learning, this algorithm requires a certain amount of high-quality data on the server, which is achievable in most federated learning tasks. We use these high-quality data as the server's dataset and use machine learning to calculate the optimal weight correction values q_k*. The specific process of the federated learning algorithm for automatic weight optimization is as follows: (1) The server S establishes a federated learning global model θ_t and a weight allocation model ϑ_t. Then, the server S calculates the initial weight value q_k^0 for each client according to its data quantity, q_k^0 = |D_k| / Σ_{k=1}^K |D_k|, which gives the initial client weight allocation vector c = [q_1, q_2, ..., q_K]. At the same time, the global model θ_t is broadcast to each client k. The server holds a dataset D_s whose samples are independent and identically distributed (IID), high-quality data. The total amount of data is J, and each sample has a unique corresponding label L_j in one-hot form; for example, in the MNIST dataset, the one-hot label of digit zero is [1, 0, ..., 0]. Stacking all labels gives the label matrix ω = [L_1, L_2, ..., L_J]. (2) Each client performs SGD on the received global model θ_t with its own local data until the specified criterion is reached and sends the model θ_k^t to the server S.
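The one-hot label matrix used in step (1) can be built as follows (a minimal NumPy sketch; the function name is illustrative):

```python
import numpy as np

def one_hot_matrix(labels, num_classes):
    """Stack one-hot labels L_j into the (J x C) label matrix omega."""
    labels = np.asarray(labels)
    omega = np.zeros((labels.size, num_classes))
    omega[np.arange(labels.size), labels] = 1.0
    return omega
```

For MNIST, `one_hot_matrix([0], 10)[0]` gives the one-hot label of digit zero, [1, 0, ..., 0].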
(3) Assuming that D_{s,j} is a data sample in the dataset D_s, we input D_{s,j} into the local model θ_k^t, and the output M_k^j is one-hot type data. Inputting all the data in D_s yields the matrix M_k = [M_k^1, M_k^2, ..., M_k^J]. We carry out the above operation on all client models to get the matrix M = [M_1, M_2, ..., M_K]. (4) The server calculates χ, the product of M and c:

χ = M · c. (6)

Note that each element of χ represents the weighted average prediction for the j-th sample in D_s. We then calculate the cross-entropy loss between χ and ω,

H(χ, ω) = −(1/J) Σ_{j=1}^J ω_j · log χ_j, (7)

where H(χ, ω) reflects the prediction loss under the current weight vector c. By minimizing H(χ, ω), we obtain the best weight vector

c* = argmin_c H(M · c, ω). (8)

We take q_1*, q_2*, ..., q_K* in c* as the optimal weights. In this paper, we adopt a machine-learning-based approach on the server to obtain c*; in particular, a neural network model ϑ_t is trained so that H(M · c, ω) is minimized. (5) The server S aggregates the models according to the current round's updated weight correction values q_k* to obtain the global model of the next round, θ_{t+1} = Σ_{k=1}^K q_k* · θ_k^t. (6) The server broadcasts the new global model θ_{t+1} to each client, and the process of (1)-(6) repeats until the global model θ_T converges.

Table 1: Limitations of existing approaches in federated learning.
- Modifying the loss function of clients [2, 10-12, 14]: increases the computational overhead of the client.
- Sampling [4, 15-17]: low accuracy of the final global model.
- Heterogeneous quantization [18]: the actual quantization standard is not specific.
- Local gradient tracking [19, 20]: ignores statistical heterogeneity.
- Combining adaptive learning [13, 21, 22]: previous work was not intended to solve heterogeneity.

Table 2: Adjustment of the aggregation rule.
- Existing: θ_{t+1} = Σ_{k=1}^K p_k · θ_k^t
- Updated: θ_{t+1} = Σ_{k=1}^K q_k · θ_k^t
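The server-side minimization of the cross-entropy H(M·c, ω) can be sketched with plain gradient descent on a softmax parameterization of the weights. This is a minimal NumPy sketch under stated assumptions: the paper trains a neural network ϑ_t for this step, whereas here the weights are optimized directly, and the step count, learning rate, and function name are illustrative:

```python
import numpy as np

def optimal_weights(M, omega, steps=500, lr=0.5):
    """Learn aggregation weights c by minimizing the cross-entropy
    H(M . c, omega) with gradient descent.

    M:     (K, J, C) array, client k's predicted class probabilities
           for the J server samples.
    omega: (J, C) one-hot label matrix of the server dataset D_s.
    A softmax over logits z keeps the weights positive and summing to 1.
    """
    K, J, _ = M.shape
    z = np.zeros(K)
    for _ in range(steps):
        c = np.exp(z) / np.exp(z).sum()
        chi = np.tensordot(c, M, axes=1)          # (J, C) weighted prediction
        chi = np.clip(chi, 1e-9, None)            # avoid log(0)
        grad_chi = -omega / chi / J               # dH / d chi
        grad_c = np.tensordot(grad_chi, M, axes=([0, 1], [1, 2]))  # dH / d c_k
        grad_z = c * (grad_c - c @ grad_c)        # softmax chain rule
        z -= lr * grad_z
    return np.exp(z) / np.exp(z).sum()
```

Intuitively, a client whose predictions on D_s match the labels pulls the cross-entropy down faster, so gradient descent shifts weight toward it.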
For Algorithm 1, we need to define the initial global model θ^0, the initial adaptive learning model ϑ^0, and the initial weight values q_k^0. The server broadcasts the global model θ^0 to all clients within the specified time T of the system. Each client uses local data to train the model for the specified number of epochs I and then returns the model θ_k^{t,I} to the server. This process corresponds to lines 2-5 of Algorithm 1 and represents local model training. Then, on the server, the optimal weight values q_k* are obtained through the adaptive learning model ϑ^0. Model aggregation is carried out according to the optimal weight values q_k*, and the latest global model θ_{t+1} is obtained. This process corresponds to lines 6-9 of Algorithm 1 and represents model aggregation [28-33].
The federated learning algorithm for automatic weight optimization adds an adaptive weight allocation algorithm to the FedAvg algorithm. In the traditional weight allocation method (3), the weight of a client is allocated according to its amount of data, which is fully applicable under the IID condition. However, under the influence of heterogeneity, the amount of data alone cannot fully reflect the quality of client data, because in practice, affected by statistical heterogeneity, the data of most clients tend to shift toward certain features; in other words, most data on one client often share similar features. If such a client holds more data, aggregation according to (3) often leads to a poor result. The correct approach is to adjust the weights to minimize the cross-entropy: when the cross-entropy is smallest, the predicted local distribution is closest to the global distribution, which is the biggest advantage of FedAwo over the traditional weight allocation algorithm. FedAwo converges quickly and improves the accuracy of the global model, and it remains applicable under IID conditions.

Nonconvex Loss Functions.
As is well known, for the convergence of nonconvex loss functions, the expected gradient norm is usually taken as the convergence index to guarantee convergence to a stationary point [15-17, 34]. Therefore, this article takes the norm of the expected gradient as the convergence index, namely,

(1/T) Σ_{t=0}^{T−1} E[‖∇F(θ_t)‖²] ≤ ε.

As is commonly done in the literature [20-22], the following assumptions are adopted in this article.

Assumption 1. Each local loss function F_k is L-smooth; i.e., ‖∇F_k(θ) − ∇F_k(θ′)‖ ≤ L · ‖θ − θ′‖ for all θ, θ′, where L denotes the Lipschitz constant.

Assumption 2. Stochastic gradients in clients are unbiased, and the second raw moment of the stochastic gradient of each f_k is bounded.

Computational Intelligence and Neuroscience
For the sake of simplicity, (23) is rewritten as an inequality. Next, induction is used to derive Theorem 2. Obviously, inequality (11) holds for t = t_0; then, assume that inequality (21) is true for s > t_0. Combining inequalities (24) and (25), we obtain the required inequality; therefore, inequality (21) is true, i.e., Theorem 2 holds.

Federated Learning Enhancement Algorithm for Automatic Weight Optimization (FedAwo*)
System heterogeneity is caused by differences in clients' computing power, storage capacity, load capacity, and network environment; under it, clients whose local models have already converged still need to carry out model training for the specified number of epochs. This phenomenon results in wasted computing resources and energy on clients. Therefore, we further optimize the FedAwo algorithm and propose an enhanced algorithm (FedAwo*). Based on the FedAwo algorithm, the FedAwo* algorithm adds an adaptive training-round optimization algorithm on the client, which can effectively reduce the model-training overhead of clients. The above phenomenon is common in federated learning, but traditional federated learning algorithms pay no attention to this problem, and the phenomenon is aggravated as federated learning progresses, which leads to a large number of invalid computations on clients and adds many meaningless computational overheads. Therefore, it is necessary to add a convergence criterion to local training; this is where the FedAwo* algorithm improves on the FedAwo algorithm. This method returns to the server a local model that satisfies the convergence criterion, even if the specified number of epochs has not been completed. The idea may seem similar to that of the FedProx algorithm [2], but their starting points are completely different: FedProx addresses the straggler problem, while FedAwo* reduces training costs.

ALGORITHM 1: FedAwo (federated learning algorithm for automatic weight optimal allocation).
Require: initialized global model θ^0, initialized weight distribution model ϑ^0, initial weights q_k^0, and server dataset D_s
Ensure: final global model θ^T
(1) for t = 0 to T do
(2)   Broadcast θ^t to clients
(3)   for e = 0 to I do
(4)     θ_k^{t,e+1} = θ_k^{t,e} − η_t · ∇f_k(θ_k^{t,e})
(5)   end for
(6)   θ_k^t ← θ_k^{t,I}
(7)   Pass θ_k^t to server S
(8)   Calculate q_k* through steps (3)-(5)
(9)   Aggregate the new global model: the server updates θ^{t+1} = Σ_{k=1}^K q_k* × θ_k^t
(10) end for

ALGORITHM 2: FedAwo* (federated learning enhancement algorithm for automatic weight optimal allocation).
Require: initialized global model θ^0, initialized weight distribution model ϑ^0, q_k^0, server dataset D_s, and initialized ∇l_0 = 0, ε, δ
Ensure: final global model θ^T
(1) for t = 0 to T do
(2)   Broadcast θ^t to clients
(3)   for e = 0 to I do
(4)     θ_k^{t,e+1} = θ_k^{t,e} − η_t · ∇f_k(θ_k^{t,e})
(5)     Update ∇l_e  // the optimization part added to the FedAwo algorithm
(6)     ∇l_e = ‖l_e − l_{e−1}‖
(7)     Check whether the local model has converged:
(8)     if (∇l_e < ε and ∇l_{e−1} < ε) or |δ − l_e| < ε then
(9)       break
(10)    else
(11)      continue
(12)    end if
(13)    Pass θ_k^{t,e} to the server
(14)  end for
(15)  θ_k^t ← θ_k^{t,I}
(16)  Calculate q_k* through the weight distribution model ϑ^t
(17)  Aggregate the new global model: the server updates θ^{t+1} = Σ_{k=1}^K q_k* × θ_k^t
(18) end for
When the model trained by a client reaches the convergence criterion we set, local training stops automatically even if the number of completed training rounds is less than the epoch count set by the system, and the converged local model is returned to the server, thereby reducing the computational overhead of the clients.
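The local convergence check can be sketched as follows (a minimal sketch: `run_epoch` stands in for one local SGD epoch and returns its training loss, and the function name and defaults are illustrative):

```python
def train_with_early_stop(run_epoch, max_epochs, eps=1e-3, delta=0.0):
    """Run local epochs, stopping early once the loss differences of two
    consecutive epochs both fall below eps, or the loss comes within eps
    of the target value delta (the FedAwo*-style convergence criterion)."""
    losses = []
    for e in range(1, max_epochs + 1):
        losses.append(run_epoch())
        diffs = [abs(losses[i] - losses[i - 1]) for i in range(1, len(losses))]
        two_flat = len(diffs) >= 2 and diffs[-1] < eps and diffs[-2] < eps
        near_target = abs(delta - losses[-1]) < eps
        if two_flat or near_target:
            return e, losses[-1]  # converged early: return the model now
    return max_epochs, losses[-1]
```

The returned epoch count is exactly the quantity FedAwo* saves: any gap between it and `max_epochs` is local computation that a fixed-epoch scheme would have wasted.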
The specific process of FedAwo* is as follows: (1) In each local training epoch, the client records the difference ∇l_e between the loss of the current epoch and that of the previous epoch. (2) When the differences of two consecutive epochs are both below ε, or |δ − l_e| < ε, the local model is considered to have converged and θ_k^t would be returned to the server immediately. Here, ε represents a very small parameter, and δ represents a parameter close to the convergence loss of the global model; their values are adjusted according to the specific situation. In the sixth section, we set δ = 0 and ε = 0.001.
(3) If the conditions in (2) are not met before the specified number of epochs, θ_k^t would be returned to the server after training for the specified epochs.
In Algorithm 2, we need to define the initial global model θ^0, the initial adaptive learning model ϑ^0, the initial weight values q_k^0, the initial loss-function difference ∇l_0 = 0, and the other parameters ε and δ. The server broadcasts the global model θ^0 to all clients within the specified time T of the system. Each client uses local data to train the model for the specified number of epochs I and then returns the model θ_k^{t,I} to the server. At the same time, in each local training epoch, we record the difference ∇l_e between the loss function of this epoch and that of the previous epoch. When the differences ∇l_e of two consecutive epochs are both very small, or the difference between these two epochs is less than ε, we consider that the local model has converged and immediately return this model to the server. This process corresponds to lines 2-14 of Algorithm 2 and represents local model training. Then, on the server, the optimal weight values q_k* are obtained through the adaptive learning model ϑ^0. Model aggregation is carried out according to the optimal weight values q_k*, and the latest global model θ_{t+1} is obtained. This process corresponds to lines 15-18 of Algorithm 2 and represents model aggregation.
The algorithm reduces the computational overhead of clients by dynamically adjusting the number of local training epochs. Through a large number of experiments, we found that in the process of federated learning, some clients converge before performing the specified number of epochs. Under previous federated learning algorithms, these clients still need to train until the specified epoch, which inevitably wastes computing resources [2]. Therefore, the FedAwo* algorithm adaptively judges whether the SGD process has converged during client training. If the convergence conditions are reached before the specified epoch, SGD is stopped and the converged local model is returned to the server; otherwise, SGD continues and stops after reaching the specified epoch.

Experimental Environment.
In order to analyze the performance of the FedAwo and FedAwo* algorithms, we established an experimental environment based on PyTorch 1.10.1 and CUDA 10.2. The software environment is Python 3.8. The hardware environment is a 3.60 GHz AMD Ryzen 7 3700X 8-core CPU, 16.00 GB RAM, Windows 10 64-bit, and an NVIDIA GeForce RTX 2070. The simulation experiments strictly follow the protocols and rules that may be used in distributed federated learning [35]. More details of the experimental environment are shown in Table 3.

Experimental Setup.
In this paper, the MNIST and Fashion-MNIST datasets are selected as experimental datasets to verify the performance and stability of the FedAwo and FedAwo* algorithms. MNIST and Fashion-MNIST are two image datasets, and we normalize each of them in the experiments. For the IID dataset partition, data samples are evenly and randomly distributed to clients. For the non-IID dataset partition, data samples are sorted by their labels and divided into 2K groups, and each client receives two groups (i.e., samples corresponding to two labels).
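The non-IID shard partition described above (sort by label, cut into 2K shards, give each client two) can be sketched as follows (a minimal NumPy sketch; names are illustrative):

```python
import numpy as np

def shard_partition(labels, num_clients, shards_per_client=2, seed=0):
    """Sort sample indices by label, cut them into num_clients * shards_per_client
    shards, and assign each client shards_per_client randomly chosen shards,
    so each client sees samples from only a few labels."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels, kind="stable")  # indices sorted by label
    num_shards = num_clients * shards_per_client
    shards = np.array_split(order, num_shards)
    shard_ids = rng.permutation(num_shards)
    return {
        k: np.concatenate(
            [shards[i] for i in
             shard_ids[k * shards_per_client:(k + 1) * shards_per_client]]
        )
        for k in range(num_clients)
    }
```

Because shards are cut from the label-sorted index list, each shard spans at most a couple of labels, which produces the skewed per-client distributions the experiments rely on.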
For MNIST, the dataset has 60000 training samples and 10000 test samples. It is an image dataset containing the handwritten digits 0-9, and each sample contains 28 × 28 pixels. We set a total of K = 100 clients and allocate 600 training samples to each client. In addition, when using the FedAwo algorithm, we take 2000 samples from the 10000 test samples as the server dataset D_s for adaptive learning of the weight distribution and use the remaining 8000 samples as the test set. For comparison, we configured the same CNN model as proposed in [1]. The model has two 5 × 5 convolutional layers (the first with 32 channels, the second with 64 channels, each followed by 2 × 2 max pooling), a fully connected layer with 512 units and ReLU activation, and a final Softmax output layer.
For Fashion-MNIST, the dataset also has 60000 training samples and 10000 test samples. It is an image dataset containing different commodities, and each sample also contains 28 × 28 pixels. The other experimental settings are consistent with those of the MNIST dataset.

The specific experimental setup details are as follows: we set the learning rate to 0.01, the batch size to 64, and the number of epochs to 5. Since the MNIST and Fashion-MNIST datasets have the same input and output formats and are both image datasets, we use the same CNN model for both. The details of the specific model settings are shown in Table 4.

Results of the Experiment.
We chose the classic and widely used FedAvg, FedProx, and FedAdp algorithms as the baselines of the experiments.
For the MNIST dataset, we first used the IID data distribution to compare the FedAwo, FedAwo*, FedAvg, FedProx, and FedAdp algorithms. As shown in Figures 1 and 2, under IID data distribution all five algorithms converge within 10-15 communication rounds. FedAvg has a slower convergence speed and lower global model accuracy than the other four algorithms, but the difference is small.
The non-IID experiments distribute heavily skewed data to individual clients, and the results are shown in Figures 3 and 4. There we could see that the convergence rates of all five algorithms were affected by statistical heterogeneity. The FedAvg algorithm was the most seriously affected: its convergence speed dropped significantly, and it did not converge until after the 70th communication round; the quality of its global model was also clearly inferior to that under the IID condition. Owing to the proximal term added to its loss function, the FedProx algorithm kept its global model quality, but its convergence was still slowed; the same holds for the FedAdp algorithm. For the FedAwo and FedAwo* algorithms, both the convergence speed and the quality of the global model were only minimally affected by statistical heterogeneity, and they reached convergence around the 30th communication round.
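FedProx's robustness here comes from that proximal term: each client minimizes its local loss plus (μ/2)‖w − w_global‖², which keeps local updates from drifting far from the global model. A minimal one-dimensional sketch (the toy objective and constants are illustrative, not the paper's setup):

```python
def prox_grad_step(w, w_global, grad_local, lr=0.1, mu=0.1):
    """One gradient step on F(w) + (mu/2) * (w - w_global)^2."""
    return w - lr * (grad_local(w) + mu * (w - w_global))

# Toy local objective F(w) = (w - 3)^2 whose minimum (w = 3) is far
# from the global model w_global = 0.
grad = lambda w: 2 * (w - 3)
w_plain, w_prox = 0.0, 0.0
for _ in range(200):
    w_plain = w_plain - 0.1 * grad(w_plain)                    # plain SGD
    w_prox = prox_grad_step(w_prox, 0.0, grad, lr=0.1, mu=1.0)  # FedProx-style
# w_plain converges to the local optimum 3.0;
# w_prox settles at 2.0, pulled back toward the global model at 0.
```

Larger μ pulls the local solution closer to the global model, trading local fit for aggregation stability.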
We also simulated systematic and statistical heterogeneity together. As expected, the influence of heterogeneity on the global model increased further. As seen in Figures 5 and 6, the FedAvg algorithm suffered a large loss in both convergence speed and global model quality, and its model did not fully converge until round 80. The convergence speed of FedProx and FedAdp was not significantly slower than under statistical heterogeneity alone, but the quality of their global models degraded. For FedAwo, both the convergence rate and the quality of the global model remained minimally affected, while FedAwo* showed some fluctuations under the influence of system heterogeneity. Overall, the convergence speed and global model quality of the FedAwo and FedAwo* algorithms were better than those of the FedAvg, FedProx, and FedAdp baselines.
To substantiate the superiority of the FedAwo and FedAwo* algorithms, we conducted the Friedman test on the model accuracy and loss of these five algorithms and obtained stat = 14.68, p = 0.00184 and stat = 10.24, p = 0.02626, respectively. The Friedman test only shows that differences exist among the models' accuracy and loss; it cannot indicate which model is better. Therefore, we conducted the Nemenyi test on the above algorithms to further verify whether there is a significant difference between each pair of models. From the results shown in Table 5, it can be concluded that the FedAwo and FedAwo* algorithms are superior to the other three. In addition to accuracy and loss, we also report the final precision, recall, AUC, and F1 values of the global model as performance indicators to compare the five algorithms under statistical heterogeneity, as shown in Table 6.
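The Friedman statistic used here can be computed directly: within each block (e.g., each run) the k algorithms are ranked, and the statistic measures how far the rank sums deviate from what chance would give. The scores below are made-up illustrative values, not the paper's measurements:

```python
def friedman_stat(scores):
    """scores[i][j]: metric of algorithm j in block i (higher = better).
    Returns the Friedman chi-square statistic (no tie correction)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        # rank 1 = best (highest score) within the block
        order = sorted(range(k), key=lambda j: -row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    return (12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
            - 3.0 * n * (k + 1))

# Hypothetical accuracies of 5 algorithms over 4 runs (illustrative only)
scores = [
    [0.97, 0.96, 0.95, 0.93, 0.92],
    [0.98, 0.97, 0.95, 0.94, 0.91],
    [0.97, 0.95, 0.96, 0.92, 0.93],
    [0.98, 0.96, 0.94, 0.93, 0.92],
]
print(round(friedman_stat(scores), 2))  # 14.8
```

A large statistic rejects the hypothesis that all algorithms perform equally; the Nemenyi post-hoc test then compares rank sums pairwise to identify which algorithms differ.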
For the Fashion-MNIST dataset, we reached conclusions similar to those on MNIST. According to Figures 7 and 8, the FedAvg algorithm converged more slowly under the IID condition, while the other four algorithms differed little.
According to Figures 9 and 10, the experimental results under statistical heterogeneity alone were likewise similar to those on the MNIST dataset.
In Figures 9-12, we can see that although the FedAwo and FedAwo* algorithms show some fluctuations, their convergence speed and model accuracy are better than those of the baseline algorithms.
For the Fashion-MNIST dataset, we also conducted the Friedman test on the model accuracy and loss of the five algorithms, obtaining stat = 13.27, p = 0.00181 and stat = 10.24, p = 0.02626. We then conducted the Nemenyi test to further verify whether there is a significant difference between each pair of models. From the results in Table 7, it can be concluded that the FedAwo and FedAwo* algorithms are superior to the other three.
Similarly, for the Fashion-MNIST dataset, we report the final precision, recall, AUC, and F1 values of the global model as performance indicators to compare the five algorithms under statistical heterogeneity, as shown in Table 8.
In addition, we measured the computational overhead of the algorithms under the IID and non-IID conditions. The IID results are shown in Figures 13 and 14. The other four algorithms (all but FedAwo*) do not check whether the local model has converged, so each of their clients trains 5 epochs in every round, giving a total of 500 epochs of computation across the 100 clients per communication round. Because the dataset is IID, the local and global models converged quickly, and as they approached convergence, clients running FedAwo* saved increasingly more computing resources.
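The accounting above (5 epochs × 100 clients = 500 epochs per round for the baselines, versus FedAwo* stopping once a local convergence criterion is met) can be sketched as follows. The loss-improvement stopping rule and the threshold are illustrative stand-ins, since the paper defines its exact criterion elsewhere:

```python
def local_epochs_used(epoch_losses, max_epochs=5, tol=1e-3):
    """Train up to max_epochs, but stop early once the improvement in
    the local loss between consecutive epochs falls below `tol` (a
    stand-in for the FedAwo* local-convergence criterion)."""
    used, prev = 0, None
    for loss in epoch_losses[:max_epochs]:
        used += 1
        if prev is not None and prev - loss < tol:
            break
        prev = loss
    return used

# Baseline algorithms always spend the full budget: 100 clients * 5 epochs.
baseline_total = 100 * 5  # 500 epochs per communication round

# Under FedAwo*, a client whose loss has flattened returns early.
flat = [0.30, 0.2995, 0.2990, 0.2989, 0.2988]  # nearly converged already
print(local_epochs_used(flat))  # 2
```

Summing `local_epochs_used` over all clients gives the per-round computation that Figures 13-16 compare against the fixed 500-epoch baseline.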
Under the non-IID condition (statistical heterogeneity), FedAwo* can still save the clients' computing resources. However, convergence was slower than under the IID condition, so the computing savings achieved by the FedAwo* algorithm were slightly smaller, as shown in Figures 13-16.

Discussion on Experiment.
According to Figures 1 and 2, on the MNIST dataset we can see that every federated learning algorithm performs similarly in the absence of heterogeneity. When the local datasets are statistically heterogeneous, as shown in Figures 3 and 4, the global model accuracy and convergence speed of the FedAvg algorithm drop significantly. The global model accuracy of the FedProx and FedAdp algorithms is not affected, but their convergence speed decreases markedly, with convergence reached around the 70th round. In contrast, the global model accuracy and convergence speed of the FedAwo and FedAwo* algorithms are almost unaffected by statistical heterogeneity, and they reach convergence roughly 20 rounds earlier. On this basis, we add system heterogeneity. When the two types of heterogeneity exist at the same time, heterogeneity has a more significant negative impact on model aggregation. As shown in Figures 5 and 6, the global accuracy and convergence speed of the FedAvg, FedProx, and FedAdp algorithms decrease significantly. The FedAwo and FedAwo* algorithms are only slightly affected: their global model accuracy still reaches 90%, and they converge within 20 rounds. On the Fashion-MNIST dataset we obtain consistent results, as shown in Figures 7-12. These experiments fully demonstrate that the optimal weight values calculated by the adaptive learning algorithm have a significant advantage over the weights that traditional federated learning assigns in proportion to each client's data volume.

The FedAwo* algorithm optimizes the computational cost of the FedAwo algorithm on the client side. As shown in Figures 13-16, FedAwo* can significantly reduce the computing overhead of the client and is applicable to both the IID and the heterogeneous setting.
Through the above experiments, we can clearly see that the FedAwo and FedAwo* algorithms handle the heterogeneity of federated learning better than the other three baseline algorithms. Even under combined system and statistical heterogeneity, the algorithms in this paper still converge quickly and maintain excellent global model quality. In addition, they remain applicable to the IID setting. Therefore, the FedAwo and FedAwo* algorithms are general and can be applied to most federated learning scenarios. The FedAwo* algorithm adds a convergence criterion for the local model; as shown in Figures 13-16, FedAwo* significantly reduces the computing overhead of the client compared with the other algorithms. FedAwo* is thus an adaptive weight optimization federated learning algorithm that effectively handles heterogeneity while saving computational overhead, giving it a clear advantage over existing algorithms.
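The contrast drawn above, between weights proportional to client data volume (as in FedAvg) and FedAwo's learned per-client weights, amounts to two choices of weight vector in the same aggregation step. A minimal sketch with models as flat parameter lists; the learned weights here are placeholders, since FedAwo derives them via adaptive learning on the server:

```python
def aggregate(local_models, weights):
    """Weighted average of client models (each a flat list of parameters)."""
    total = sum(weights)
    dim = len(local_models[0])
    return [sum(w * m[i] for w, m in zip(weights, local_models)) / total
            for i in range(dim)]

clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# FedAvg-style: weights proportional to local data volume
data_sizes = [600, 600, 600]
fedavg_model = aggregate(clients, data_sizes)  # plain mean in this case

# FedAwo-style: weights learned on the server (placeholder values)
learned = [0.5, 0.3, 0.2]
fedawo_model = aggregate(clients, learned)
```

When client data volumes are equal but data quality is not, the data-size weights degenerate to a plain mean, while learned weights can down-weight clients whose local models hurt the global model.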

Conclusion
We investigate an automatic local model weight optimization strategy to reduce the negative effects of systematic and statistical heterogeneity in federated learning and propose the federated learning algorithms FedAwo and FedAwo*. The FedAwo algorithm improves the convergence speed of the global model and yields a global model with higher accuracy, and the enhanced algorithm FedAwo* further reduces the training overhead. Experimental results verify the superiority of the proposed schemes in convergence speed and global model accuracy, as well as the effectiveness of FedAwo* in saving client-side computation. By combining adaptive learning with federated learning to address the heterogeneity problem, this paper achieves notable results and puts forward a new approach for mitigating the negative impact of heterogeneity in federated learning.

Future Work
However, the FedAwo and FedAwo* algorithms also show some instability. As shown in Figures 9 and 10 in the Experiment section, the global model exhibits a zig-zag spike phenomenon when it is close to convergence. The reason is that the learning rate remains too high as the algorithm approaches convergence. In future work, we hope to mitigate the zig-zag spikes by dynamically adjusting the learning rate. In addition, we will further improve the adaptive learning model $\vartheta^0$ to further raise the performance of the FedAwo algorithm.
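One common way to realize the dynamic learning-rate adjustment proposed here is exponential decay over communication rounds; the schedule below is a generic sketch under that assumption, not the authors' planned method:

```python
def decayed_lr(base_lr, round_idx, decay=0.95):
    """Exponentially shrink the learning rate each communication round,
    so update steps get smaller as the global model nears convergence."""
    return base_lr * (decay ** round_idx)

lr0 = 0.01  # initial learning rate from the experimental setup
lr_round_50 = decayed_lr(lr0, 50)  # far smaller than lr0, damping spikes
```

Smaller steps near convergence directly shrink the amplitude of the oscillations that cause the zig-zag spikes.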
Data Availability

The data used to support the findings of this study are included within the article. The MNIST and Fashion-MNIST datasets used to support the findings of this study are included within the supplementary information file(s). MNIST is available at https://www.kaggle.com/datasets/oddrationale/mnist-in-csv, and Fashion-MNIST is available at https://www.kaggle.com/datasets/zalando-research/fashionmnist. Our experimental code has been open-sourced at https://github.com/amazingyx/FedAwo.

Conflicts of Interest
The authors declare that they have no conflicts of interest.