Genetic CFL: Hyperparameter Optimization in Clustered Federated Learning

Federated learning (FL) is a distributed model for deep learning that integrates client-server architecture, edge computing, and real-time intelligence. FL has the capability of revolutionizing machine learning (ML) but is impractical to implement due to technological limitations, communication overhead, non-independent and identically distributed (non-IID) data, and privacy concerns. Training an ML model over heterogeneous non-IID data severely degrades the convergence rate and performance. The existing traditional and clustered FL algorithms exhibit two main limitations: inefficient client training and static hyperparameter utilization. To overcome these limitations, we propose a novel hybrid algorithm, namely, genetic clustered FL (Genetic CFL), that clusters edge devices based on the training hyperparameters and genetically modifies the parameters clusterwise. Then, we introduce an algorithm that drastically increases the individual cluster accuracy by integrating density-based clustering and genetic hyperparameter optimization. The results are benchmarked using the MNIST handwritten digit dataset and the CIFAR-10 dataset. The proposed genetic CFL shows significant improvements and works well with realistic cases of non-IID and ambiguous data. An accuracy of 99.79% is observed on the MNIST dataset and 76.88% on the CIFAR-10 dataset with only 10 training rounds.


Introduction
Federated learning (FL) [1,2] has risen as a groundbreaking subdomain of machine learning (ML) that enables Internet of Things (IoT) devices to contribute their real-time data and processing to train ML models. FL represents a distributed architecture of a central server and heterogeneous clients, aiming to reduce the empirical loss of model prediction over non-independent and identically distributed (non-IID) data. In contrast to traditional ML algorithms that require large amounts of homogeneous data in a central location, FL utilizes on-device intelligence over distributed data [3,4]. The limited feasibility of ML in industrial and IoT applications is overturned by the introduction of FL. Some [...] and encryption to reduce the model size and protect privacy. Communication load is also determined by the number of edge devices. Sparsification of communication [11] implemented over clients is modeled to increase the convergence rate and reduce network traffic on the server. Many models also utilize hierarchical clustering [12] to generalize similar client models and reduce the aggregation complexity.
Apart from communication, training ML models in a heterogeneous setup presents a huge challenge [13]. Once the server model is broadcast, the clients train on it using hyperparameters such as client ratio (i.e., the fraction of clients chosen out of a total of 100), learning rate, batch size, and epochs per round. Computational power and properties of data (ambiguity, size, and complexity) vary drastically from one edge device to another, and diversely trained client models are hard to aggregate. In a realistic scenario of thousands of edge devices, the updated global model may not converge at all. Existing aggregation algorithms such as FedAvg and FedMA [14] focus more on the integration of the weights of the local models. Convergence rate and learning saturation are common concerns when it comes to training and aggregation. Several novel approaches work around model aggregation either by using feature fusion of global and local models [15] or by grouping similar client models [16] to increase generalization. Some studies also utilize multiple global models to converge better on the data [17].
Research on making FL models adaptive to non-IID data has focused primarily on model aggregation. Local training of the model itself is an underrated step, given its role in the final accuracy. In this paper, we propose three novel contributions to lessen the empirical risk in FL, as shown in Figure 1:
(i) Clustering of clients solely based on model hyperparameters to increase the learning efficiency per unit training of the model
(ii) Implementation of density-based clustering, i.e., DBSCAN, on the hyperparameters for proper analysis of device properties
(iii) Introduction of genetic evolution of hyperparameters per cluster for finer tuning of individual device models and better aggregation
In particular, we introduce a new algorithm, namely, Genetic CFL, that clusters hyperparameters of a model to drastically increase the adaptability of FL in realistic environments. Hyperparameters such as batch size and learning rate are core features of any ML model. In practice, every model is tuned manually depending on its behavior on the data. Therefore, in a realistic heterogeneous setup, the proper selection of these parameters could yield significantly better results. The DBSCAN algorithm is used since it does not require a predetermined number of clusters, does not fix the cluster size, and uses the neighbourhood of model hyperparameters for clustering. We also introduce genetic optimization of those parameters for each cluster. A genetic algorithm is used since it is highly flexible across applications and scalable to higher dimensions. As defined, each cluster of clients has its own unique set of properties (i.e., hyperparameters) that are suitable for the training of the respective models. In each round, we determine the best parameters for each cluster and evolve them to better suit the cluster. The rest of this paper is organized as follows. Section 2 discusses the recent work done in the fields of FL, clustering, and evolutionary optimization algorithms. The proposed algorithm is defined in Section 3, followed by the results in Section 4. Finally, the paper is concluded in Section 5.

Related Work
In this section, we survey the current literature on the topics of FL, density-based clustering, and evolutionary algorithms, respectively, and try to understand their limitations.

Federated Learning.
Recently, FL as a distributed and edge ML architecture has been studied extensively [1,18]. This decentralized nature of FL contrasts directly with traditional ML algorithms, which are genuinely difficult to train in a heterogeneous environment consisting of non-IID data. Novel approaches have tried to overcome this difficulty through various model aggregation algorithms, namely, FedMA [14], feature fusion of global and local models [15], and agnostic FL and grouping of similar client models [16] for better personalization and accuracy. Clustering takes advantage of data similarity across clients and models [19], enables efficient communication, and improves global generalization [20]. In general, much work is yet to be done in terms of efficient model training on non-IID data.

Density-Based Clustering.
Clustering in FL is primarily used for efficient communication and better generalization. In a realistic scenario with thousands of nodes, aggregating everything into a single model may greatly dampen convergence. Several partitioning, hierarchical, and density-based clustering algorithms have been applied to some of the problems existing in FL. Partitioning clustering algorithms such as k-means clustering [21] demand a predetermined number of clusters, which is not feasible in practice. Some examples of methods that do not fix the number of clusters include agglomerative hierarchical clustering [22] and generative adversarial network-based clustering. In this paper, we propose to use DBSCAN (density-based spatial clustering of applications with noise) [23], a density-based clustering algorithm that only groups points if they satisfy a density condition.

Evolutionary Algorithms.
Hyperparameters of a model determine its ability to learn from a certain set of data. Optimization of ML models and their hyperparameters using evolutionary algorithms [24] such as whale optimization [25] and genetic algorithms [26] has been explored by many researchers. In addition, these algorithms have been extensively used over DL frameworks and have become a trend for optimization tasks [27]. The same has not yet been adopted extensively for FL. Also, algorithms such as reinforcement learning (RL) with a focus on Q-Learning are not suitable for highly complex scenarios [28]. The need for hyperparameter tuning increases even more in FL due to the ambiguity in data, and the abovementioned optimization algorithms assist in tuning those parameters beyond manual capacity. Since optimizing the parameters of each client model is not feasible, we propose to do so for each cluster. Through the survey, we observe that FL is greatly limited by the efficiency of individual client training, which includes an apt choice of hyperparameters, increasing the adaptive nature of the models, and optimization of this process.

Genetic CFL Architecture
In this section, we give a detailed mathematical model of our algorithm, genetic CFL. The complete pipeline is divided into two parts: the initial broadcast round, represented by Algorithm 1, which determines the clusters, and the federated training using genetic optimization, represented by Algorithm 2.
The variational behavior of the algorithm with different hyperparameters, including client ratio (n), number of rounds, ϵ, minimum samples, learning rate (η), and batch size, is explained in this section. Table 1 elucidates all the symbols utilized in the algorithm. The purpose of Algorithm 1 is to discreetly determine the data characteristics of an edge device without intruding on its privacy. A server model ($w^0$) is initialized and broadcast to n clients, $C \subseteq \{C_0, C_1, \ldots, C_{tot}\}$. With each distributed model, three different η are broadcast. This sample size is chosen to introduce variance in training; a larger number of samples can also be used for experiments. These learning rates are chosen from an array ($\eta_m$) ranging [...]. Each edge device receives $w^0$, which is cloned for all values of η and trained individually for a single epoch. Data properties unique to an edge device, such as size, complexity, ambiguity, and variance, drastically affect the training, and thus the hyperparameters of a model, η and batch size, are chosen accordingly. Naturally, from the three trained models in an edge device, the one with the least loss, denoted as $w^0_{\min}$, is chosen. Each edge device then returns $w^0_{\min}$, $\eta_{\min}$, and $losses_{\min}$. The significance of these values is their capacity to represent the data of the respective edge devices.
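The client-side probe described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the one-parameter least-squares model, the helper names `train_one_epoch` and `probe_learning_rates`, and the three candidate learning rates are all assumptions made for the example.

```python
import random

def train_one_epoch(w, lr, data):
    """One epoch of SGD on a toy model y = w * x; returns (new w, mean squared loss)."""
    for x, y in data:
        grad = 2.0 * (w * x - y) * x
        w -= lr * grad
    loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
    return w, loss

def probe_learning_rates(w0, candidate_lrs, data):
    """Train one copy of w0 per candidate learning rate and keep the lowest-loss copy."""
    results = []
    for lr in candidate_lrs:
        # w0 is a float here, so every call starts from an untouched copy.
        w, loss = train_one_epoch(w0, lr, data)
        results.append((loss, lr, w))
    loss_min, lr_min, w_min = min(results)
    return w_min, lr_min, loss_min

random.seed(0)
# Hypothetical local dataset of one edge device: y = 3x without noise.
data = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(50))]
w_min, lr_min, loss_min = probe_learning_rates(0.0, [1e-3, 1e-2, 1e-1], data)
print(lr_min, loss_min)
```

Only the triple (w_min, lr_min, loss_min) leaves the device in this sketch, which mirrors the values each edge device returns to the server.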
At the server, the models $w^0_0, w^0_1, \ldots, w^0_n$, the learning rates $\eta_0, \eta_1, \ldots, \eta_n$, and their respective losses are obtained. The model aggregation technique is used to obtain the server model by combining the edge device models. The weights of the models ($w^0_n$) are summed iteratively as follows:

$$w_{\mathrm{sum}} = \sum_{k} w^0_k .$$

After summation, the output of the equation is divided by the number of clients to obtain the aggregated model as

$$w_{\mathrm{agg}} = \frac{w_{\mathrm{sum}}}{n} .$$

  [...]
  return w^0_min, η_min, losses_min
procedure Server
  Initialize DBSCAN Clustering Algorithm
  ϵ ← 1e−6
  points_min ← 2
  model ← DBSCAN(ϵ, points_min)
  clusters = model.fit_predict(η_n)
ALGORITHM 1: Initial broadcast round and clustering.
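The two aggregation steps described above (iterative summation, then division by the number of clients) can be sketched as follows. Representing each client model as a flat list of floats and the function name `aggregate` are assumptions made for illustration; a real model would be averaged layer by layer in the same way.

```python
def aggregate(client_weights):
    """Average client weight vectors elementwise: sum first, then divide by n."""
    n = len(client_weights)
    summed = [0.0] * len(client_weights[0])
    for w in client_weights:              # iterative summation over clients
        for j, value in enumerate(w):
            summed[j] += value
    return [s / n for s in summed]        # divide by the number of clients

# Three hypothetical client models with two weights each.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
server = aggregate(clients)
print(server)  # [3.0, 4.0]
```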
rounds: Number of loops for training the federated model
function Mutate(η)
  factor ← random([−1, 0, 1])
  η ← η + ((η × factor)/10)
  return η
function Crossover(η_n)
  Initialize temporary array η_temp to store η
  [...]
  return η_temp
function Evolve(losses_n, η_n)
  losses_n, order ← sort(losses_n)
  η_n ← sort(η_n, order)
  return Crossover(η_n)
procedure Train
  len ← size(cluster)
  ind ← 0 to len
  Initialize η_global with shape (len, size(clusters[ind]))
  clusters_unique = unique(cluster)
  for i ← 0 to rounds do
    for k ← 0 to size(cluster) do
      ind [...]
    Empty arrays losses, η_n
    for k ← 0 to n do
      w^0 [...]
ALGORITHM 2: Genetic optimization based FL on clustered data.

After server model aggregation, the DBSCAN clustering algorithm is applied. In a realistic scenario, the number of edge devices and their variance cannot always be determined. In deterministic partitioning clustering methods such as K-means clustering, the number of clusters has to be predetermined and is not dynamic. DBSCAN, on the contrary, uses density-based reasoning for the grouping of similar objects. It takes two mandatory inputs, ϵ and min samples. Any point x forms a cluster if a minimum number of samples lie in its ϵ-neighbourhood. This neighbourhood can be calculated by

$$N_\epsilon(x) = \{\, y \in HP_{space} \mid dist(x, y) \le \epsilon \,\} .$$

Here, $HP_{space}$ represents the domain in which the point x must be present. In our case, it is the range of hyperparameters, specifically the learning rate η. Each ϵ-neighbourhood must contain a certain number of points (MinPts) to be called a cluster as follows:

$$|N_\epsilon(x)| \ge MinPts .$$

In the object space of only the learning rate, $|\eta_2 - \eta_1|$ gives the Euclidean distance used for the ϵ-neighbourhood. When the number of dimensions is increased with the addition of batch size (B), the Euclidean distance formula for the two-coordinate system is used, and logarithmic values of the hyperparameters are taken to scale the exponential values to linear ones. The calculation can be observed as

$$dist = \sqrt{(\log \eta_2 - \log \eta_1)^2 + (\log B_2 - \log B_1)^2} .$$

After each edge device is allotted a cluster ID, we implement phase 2, shown by Algorithm 2. This section of the algorithm works under the main control loop, which runs for i rounds. In every ith iteration,
(1) Hyperparameters are optimized per cluster using a genetic algorithm involving evolution followed by crossover and finally mutation
(2) The server model with optimized hyperparameters is broadcast to each client clusterwise
(3) Each client is trained based on said parameters
(4) Client models are aggregated to form the latest server model
Every cluster has a different set of characteristic hyperparameters suitable to the edge devices belonging to it. These clustered parameters are evolved genetically, followed by training, in every ith round. Using genetic optimization for tuning moves the set of hyperparameters toward an optimal set each round. $\eta_{global}[k]$ is initialized to store the learning rates for each cluster, and its contents are modified every round. It is of shape $\langle C, size(C_i) \rangle$, where C is the number of clusters, $C_i$ represents the ith cluster, and $size(C_i)$ represents the number of edge devices in the ith cluster. The hyperparameters of a cluster having shape $m_0$ are sorted by their losses:

$$losses_n,\ order \leftarrow sort(losses_n), \qquad \eta_n \leftarrow sort(\eta_n, order) .$$
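The ϵ-neighbourhood test and the log-scaled distance described above can be sketched as follows. The base of the logarithm is not stated in the text, so base 10 is an assumption here, as are the helper names and the sample points. The neighbourhood count includes the point itself.

```python
import math

def log_distance(p, q):
    """Euclidean distance between two (learning_rate, batch_size) points in log10 space."""
    return math.sqrt(sum((math.log10(a) - math.log10(b)) ** 2 for a, b in zip(p, q)))

def eps_neighbourhood(x, points, eps):
    """All points of the hyperparameter space within eps of x (x itself included)."""
    return [y for y in points if log_distance(x, y) <= eps]

def is_core_point(x, points, eps, min_pts):
    """x can seed a cluster when its eps-neighbourhood holds at least min_pts points."""
    return len(eps_neighbourhood(x, points, eps)) >= min_pts

# Hypothetical (learning rate, batch size) pairs reported by three edge devices.
points = [(1e-3, 32), (1.2e-3, 32), (1e-2, 128)]
print(is_core_point(points[0], points, eps=0.2, min_pts=2))  # True
print(is_core_point(points[2], points, eps=0.2, min_pts=2))  # False
```

Taking logarithms first means that learning rates of 1e-3 and 1.2e-3 are close, while 1e-3 and 1e-2 are a full unit apart, which a distance on the raw values would not reflect.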
Once sorted, we obtain new individuals through crossover and mutation, respectively. The best individuals (hyperparameters in a cluster) retain their genes and are promoted to the next generation (round), while the others are formed by mating of individuals from the last generation [...]. The new learning rates $\eta_{new}$ are chosen either directly or by mating. The number of η taken from the old generation can vary. From (9), we derive the modified parameters [...], where $P_A, P_B \in [0, 9]$ and $f \in [-1, 1]$. After genetic evolution, the server model is again broadcast to all devices with their respective cluster hyperparameters. Each edge device trains for 1 epoch, and the complete process of genetic optimization, training, and model aggregation is repeated for i − 1 rounds.
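One generation of the per-cluster evolution described above can be sketched as follows. The mutation step follows Mutate in Algorithm 2, but the crossover body is not recoverable from the text, so the mating rule used here (averaging two randomly chosen parents) and the `n_elite` parameter are illustrative assumptions rather than the authors' method.

```python
import random

def mutate(lr, rng):
    """Shift lr by -10%, 0%, or +10%, following Mutate in Algorithm 2."""
    factor = rng.choice([-1, 0, 1])
    return lr + (lr * factor) / 10

def crossover(sorted_lrs, rng, n_elite=2):
    """Keep the n_elite best learning rates; fill the rest with mutated children.

    Averaging two parents is an assumed mating rule for illustration only.
    """
    children = list(sorted_lrs[:n_elite])
    while len(children) < len(sorted_lrs):
        parent_a, parent_b = rng.sample(sorted_lrs, 2)
        children.append(mutate((parent_a + parent_b) / 2, rng))
    return children

def evolve(losses, lrs, rng):
    """Sort learning rates by ascending loss, then build the next generation."""
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    sorted_lrs = [lrs[i] for i in order]
    return crossover(sorted_lrs, rng)

rng = random.Random(0)
losses = [0.9, 0.2, 0.5, 0.7]       # hypothetical losses of four clients in one cluster
lrs = [1e-1, 1e-3, 1e-2, 5e-2]      # the learning rates those clients trained with
next_lrs = evolve(losses, lrs, rng)
print(next_lrs)
```

The two lowest-loss learning rates pass through unchanged, so a round can never lose the best setting found so far, while the remaining slots explore nearby values.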

Experiments and Results
This section deals with the experiments that have been conducted to validate and test the proposed genetic CFL architecture. Section 4.1 deals with the clustering of the client edge devices and the clustering behavior under various parameters. Sections 4.2 and 4.3 are concerned with the performance of the genetic CFL architecture after DBSCAN clustering on the MNIST and CIFAR-10 datasets, respectively, and its comparison with the generic FL architecture. The overall performance analysis for the genetic CFL architecture is discussed in Section 4.4.

DBSCAN Clustering of the Client Models.
The DBSCAN algorithm, as discussed in the previous section, uses the Euclidean distance between the observations to calculate the density and clusters the observations based on this density. The models in each edge device are assigned a particular learning rate and batch size for training. These two hyperparameters serve as the two primary dimensions of each observation for the process of clustering. The DBSCAN algorithm takes two main parameters for clustering a set of observations: ϵ and Min Samples. We note that ϵ is the maximum Euclidean distance of an observation from the closest point in the cluster in question. The Min Samples parameter is the minimum number of observations required to form a cluster. Thus, the tuning and selection of these parameters become essential to obtain proper and efficient results. Table 2 summarizes the conditions tested for the quality and effectiveness of clustering with the said parameters. For each value of ϵ, two values of Min Samples are tested to validate the clustering effectiveness and to detect outliers in the data. For the ϵ values 0.2 and 0.175, the number of clusters for both 1 and 2 Min Samples stays constant at 7. This constant number of clusters for both values of Min Samples indicates that there are no outliers in the data and that the observations in each cluster hold a strong relationship with each other. Since the number of clusters is the same for both ϵ values, it is evident that the clusters are locally isolated. For the ϵ values 0.150 and 0.100, the number of clusters changes drastically, indicating weak clustering among the observations. The change in the number of clusters for different Min Samples shows that there are outliers in the data, which can cause issues while performing the genetic optimization due to the lack of population. The parameters can therefore be safely assigned any of the four combinations of ϵ ∈ {0.2, 0.175} and Min Samples ∈ {1, 2} to obtain 7 distinct clusters, as shown in Figure 2.
The number of observations in each cluster is plotted in Figure 3.
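The outlier check described above, comparing the cluster count for Min Samples of 1 and 2, can be sketched without a clustering library. The sketch assumes Min Samples counts the point itself, as scikit-learn's DBSCAN does. Under that convention, and only for Min Samples of 1 or 2, the clusters are exactly the connected components of the graph that links observations within ϵ, with single-point components treated as noise when Min Samples is 2. The sample points and function names are hypothetical.

```python
import math

def log_distance(p, q):
    """Euclidean distance in log10 space (base 10 is an assumption)."""
    return math.sqrt(sum((math.log10(a) - math.log10(b)) ** 2 for a, b in zip(p, q)))

def components(points, eps):
    """Connected components of the graph linking points at distance <= eps."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if log_distance(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def count_clusters(points, eps, min_samples):
    """Cluster count for min_samples in {1, 2}; singletons are noise when it is 2."""
    if min_samples not in (1, 2):
        raise ValueError("shortcut only valid for min_samples of 1 or 2")
    return sum(1 for g in components(points, eps) if len(g) >= min_samples)

# Hypothetical (learning rate, batch size) observations; the last one is isolated.
points = [(1e-3, 32), (1.1e-3, 32), (1e-2, 64), (1.1e-2, 64), (1e-1, 256)]
for eps in (0.2, 0.1):
    print(eps, count_clusters(points, eps, 1), count_clusters(points, eps, 2))
```

A gap between the two counts for the same ϵ means that some observations have no neighbour within ϵ, which is the outlier signal the paragraph above relies on.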

Performance of the Genetic CFL Architecture on MNIST Dataset.
In this section, we discuss the performance and analyse the training curves of the models. The server model is initially trained on a subset of the MNIST handwritten digits dataset [29]. This model is then distributed among the clients based on the client ratio. The total number of clients chosen for this experiment is 100, and the client ratios tested are 0.1, 0.15, and 0.3. In essence, we evaluate the performance of the models on 10, 15, and 30 clients, respectively. Each client device is provided with a random subset of the dataset with a random number of observations. This is to make sure that the data is non-IID and that the characteristics of the real-time scenario are emulated. For the initial round, the hyperparameters (learning rate and batch size) of the client devices are randomized within intervals (4) and (5), respectively. The client devices are trained for two epochs, and the hyperparameters are subjected to genetic evolution as discussed in Section 3. These rounds are tabulated in Table 3, and the best performance is plotted against each round in Figure 4.
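The client selection and data assignment described above can be sketched as follows. The dataset size of 60000 matches the MNIST training set, while the function names and the `min_size` lower bound on a client's subset are assumptions made for illustration.

```python
import random

def select_clients(total_clients, client_ratio, rng):
    """Pick round(total_clients * client_ratio) distinct client ids."""
    n = round(total_clients * client_ratio)
    return rng.sample(range(total_clients), n)

def random_subset(dataset_size, rng, min_size=100):
    """Randomly sized, randomly drawn subset of sample indices for one client."""
    size = rng.randint(min_size, dataset_size)
    return rng.sample(range(dataset_size), size)

rng = random.Random(0)
for ratio in (0.1, 0.15, 0.3):
    clients = select_clients(100, ratio, rng)
    # One independent random subset per selected client.
    shards = {c: random_subset(60000, rng) for c in clients}
    sizes = [len(s) for s in shards.values()]
    print(ratio, len(clients), min(sizes), max(sizes))
```

With 100 clients in total, the three ratios select 10, 15, and 30 clients, and the spread between the smallest and largest subset shows how unevenly the observations are distributed.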
Since the model training hyperparameters are no longer predetermined, the performance of the models and their respective training are optimized locally in the cluster, thus providing a more personalized training for each cluster. The performance of the server model obtains a smooth learning curve and converges faster than the normal training of the model using FL. Table 3 represents this performance of the models for both architectures. The superiority of the performance of genetic CFL over generic FL is evident for each round.
The accuracy of the genetic CFL architecture is consistently higher and the loss is consistently lower as compared to the generic FL architecture. The increase in accuracy and the decrease in loss signify that the models are indeed training and that useful information is aggregated at the server.

Performance of the Genetic CFL Architecture on CIFAR-10 Dataset.
This section deals with the performance and the training of the models on the CIFAR-10 dataset [30] using the genetic CFL architecture and its comparison with the performance of the generic FL architecture. The training process for this dataset is similar to the training on the MNIST handwritten digits dataset. The server initializes the model and distributes the weights of the server model to every client device; the models are trained for two epochs on the assigned random subset of the dataset; the current hyperparameters are subjected to genetic evolution; the trained weights are sent back to the server to be aggregated.
This process is repeated for several rounds. The performance of the server model after each round, at the end of the aggregation phase, is plotted in Figure 5 and tabulated in Table 4. The performance of the models trained with hyperparameters optimized using the genetic algorithm for the respective clusters is higher than that of models trained without it. This holds for any number of client devices. The performance also improves as the client ratio increases. The lowest loss is encountered at the second round for client ratio 0.3. The accuracy, however, peaks at the fourth epoch with a decent amount of loss for prediction. Any further training of the models does not provide better performance and causes overfitting. The training of the models is stopped at round two. The aggregated model therefore provides a significant performance boost in very few rounds. This provides speed and high throughput during deployment in a real-time system.

Performance Analysis of Genetic CFL.
The genetic CFL algorithm performs better with a higher sample size. A higher number of observations per cluster should therefore improve the optimization of the hyperparameters. However, taking into consideration the diversity of datasets in both the data characteristics and the number of data points, proper clustering of similar scenarios should provide higher throughput for the models individually.
This calls for a balance between the number of clusters and the size of each cluster. A proper balance can ensure that the performance of the models in the federated architecture provides the best output in the given scenario. In a real-time application, the number of edge devices expected is higher than in a synthetic environment. Following the progression of the performance, a higher number of total clients increases the performance significantly. The optimization of the hyperparameters using genetic CFL provides higher throughput in comparatively fewer rounds.
Our architecture, genetic CFL, outperforms both algorithms [31,32] in accuracy and rounds. This supports the claim that the genetic CFL architecture performs better while taking fewer rounds. In the case of iterative clustering [16], our architecture outperforms on the MNIST dataset but not on the CIFAR-10 data. This behavior is attributed to the rotation and augmentation of data, which gives an upper hand in feature extraction and representation. Genetic optimization provides an elastic and adaptive framework for optimization of the hyperparameters. This flexibility gives the architecture an edge over other methods by adapting to the dataset and the required environment. Most other types of architectures need to perform hyperparameter tuning beforehand and thus require manual intervention. This causes the system to be reset with a different set of parameters for each type of data and application. This rigidity can cost both time and resources. Moreover, importance is given to every single client, thus affecting not only the server model performance but also the performance of every single client device. A better delivery of service for each client device is ensured while increasing the performance of the server model as a whole. Table 5 shows the comparison between the performance of our architecture, genetic CFL, and other architectures that incorporate clustering in federated learning. The table consists of the best accuracy of the models on the MNIST handwritten digits dataset and the CIFAR-10 dataset for a given number of rounds. It is evident that the number of rounds taken is significantly lower while the accuracy remains higher.

Conclusion
In this work, we have applied the genetic evolutionary algorithm to optimize the hyperparameters, namely learning rate and batch size, during the training of the individual end device models in a cluster for the FL architecture. We have identified and filled the gaps in the existing techniques and contributed the algorithm of the genetic CFL architecture. This architecture has been tested using the MNIST handwritten digits dataset and the CIFAR-10 dataset. An accuracy of 97.99% and 76.88% has been, respectively, achieved on the datasets. We discussed and analysed the observations and the performance of the genetic CFL architecture. We have also covered the favourable conditions and the limitations for the algorithm to provide the best performance in deployment. The overall performance of the models shows a significant rise in efficiency while reducing communication and computation cost.
As part of the future work, the number of clients and the client ratio can be scaled to larger samples that closely mimic the real-time situation, owing to the high scalability of the model. As the population sample increases, the optimization of the hyperparameters becomes more efficient, thus delivering higher throughput in the real-time scenario. The type of data processed is not limited, and this architecture can be used for various scenarios such as natural language processing tasks, image classification tasks, and recommendation systems. Genetic CFL can also be integrated with time-sensitive systems to deliver better performance in very few rounds.
Data Availability
The datasets used in this work are the MNIST handwritten digit dataset and the CIFAR-10 dataset from Kaggle (publicly available platform).
Disclosure
This manuscript is available as a preprint on arXiv at https://arxiv.org/abs/2107.07233. The code for this work is available in the repository at https://github.com/sagnik106/Clustered-FL-GA.
Computational Intelligence and Neuroscience