Towards Fair and Decentralized Federated Learning System for Gradient Boosting Decision Trees

Gradient boosting decision trees (GBDTs) have become a popular machine learning algorithm and have excelled in many data mining competitions and real-world applications thanks to their strong results on classification, ranking, prediction, etc. Federated learning, which aims to mitigate privacy risks and costs, enables many entities to keep data locally and train a model collaboratively under an orchestration service. However, most existing systems fail to strike a good trade-off between accuracy and communication. In addition, they overlook an important aspect: fairness, such as the performance gains contributed by different parties' datasets. In this paper, we propose a novel federated GBDT scheme based on the blockchain which achieves constant communication overhead and good model performance and quantifies the contribution of each party. Specifically, we replace the tree-based communication scheme with a purely gradient-based scheme and compress the intermediate gradient information to a fixed limit, to achieve good model performance and constant communication overhead on skewed datasets. Furthermore, we introduce a novel contribution allocation scheme named split Shapley value, which can quantify the contribution of each party with a limited gradient update and provide a basis for monetary reward. Finally, we combine the quantification mechanism with the blockchain and implement a closed-loop federated GBDT system, FGBDT-Chain, in a permissioned blockchain environment and conduct comprehensive experiments on public datasets. The experimental results show that FGBDT-Chain achieves a good trade-off between accuracy, communication overhead, fairness, and security under large-scale skewed datasets.


Introduction
Machine learning (ML) has achieved extensive success in many practical applications. However, a well-trained ML model depends heavily on massive data. In reality, datasets may contain sensitive information, which raises growing concerns about personal privacy and even national security. Data is also increasingly considered a valuable asset and a critical strategic resource. These constraints motivate federated learning (FL) [1], which enables multiple entities to collaboratively train a model under an orchestration service for immediate aggregation while storing data locally. The data in FL may be generated in different contexts. This may cause the data distribution to be unbalanced or non-IID. The scale and quality of the datasets may also differ.
These differences may lead to different intermediate computation and communication costs for different parties. Because data is an important asset to organizations, a good FL scheme should give parties with high-quality datasets an incentive to join the training and form a better model, and should guarantee rewards that match their contribution, in addition to preserving privacy. In this context, it is necessary to consider factors such as privacy protection, unbalanced/skewed data distribution, and fairness to form a closed-loop federated learning system (FLS) [2]. On the other hand, gradient boosting decision trees (GBDTs) have become a popular machine learning algorithm and have excelled in many machine learning and data mining competitions [3,4] as well as real-world applications thanks to their strong results on classification, ranking, prediction, etc. (especially for tabular data mining tasks) [5]. Several works have studied horizontal federated GBDT systems [6,7]. They focus on training and publishing a single decision tree among multiple federated parties to compose the global ensemble model. However, these systems still face the following challenges: (i) Balance of efficiency, learning accuracy, and privacy preservation. In most existing schemes, each party trains a single decision tree and then shares the tree with the next participating party [6,8]. The global communication cost of building each tree is a multiple of the size of the corresponding trainer's data. Other schemes adopt cryptographic methods or differential privacy [7]. Cryptographic methods may bring prohibitive overhead, and the accuracy of existing federated GBDT schemes with differential privacy is relatively low under skewed data distributions. (ii) Contribution quantification. Many data owners may not actively participate in federated learning, especially when the data owners are enterprises rather than devices [9].
As mentioned previously, a good FL scheme should encourage parties with high-quality datasets to join the training to obtain a better model and should guarantee rewards that match their contribution. It is also essential to prevent participants from inflating their contributions. Most existing schemes overlook this and fail to provide an adequate quantification mechanism. (iii) Accuracy measurement and verification. In the FL setting, there is no guarantee that all parties are honest and trusted. To tackle these issues, [6] proposed to use MAE to measure the accuracy, and [8] adopted the blockchain for verification. However, this leads to additional communication overhead to achieve higher accuracy. Two factors need to be considered in accuracy measurement: (1) whether the feature with the most information gain is correctly selected; (2) whether the samples are in the correct sorting position [10]. To the best of our knowledge, there is no effective solution to measure and verify the accuracy contribution of each party.
In response to the above challenges, we propose a closed-loop federated GBDT system, FGBDT-Chain, which consists of two components: FV-tree and FQ-chain. More specifically, FV-tree is our federated GBDT framework. We combine FV-tree with the blockchain and design FQ-chain, which implements the contribution quantification logic in a smart contract to attain decentralized verification and auditability. Our scheme achieves a better balance of efficiency, learning accuracy, and privacy preservation under skewed data distributions. In particular, it can also quantify parties' contributions to the global model, provide a value-driven incentive mechanism that encourages parties with different datasets to be honest, and scale to large datasets.
Our contributions can be summarized as follows: (1) We propose FV-tree, a federated GBDT framework that can achieve constant communication cost and less precision loss in skewed distributed data.
FV-tree builds on the data-parallel decision tree algorithm to find the global top-2 candidate features, utilizes private spatial decomposition (PSD) to capture other parties' distributions, and refits gradients to vote on the locally most informative feature. We also design a scalable differential privacy mechanism in this process to enhance privacy preservation. (2) We design a contribution quantification mechanism with a metric, namely the split Shapley value, and a decentralized verification and endorsement mechanism, namely FQ-chain, which together yield a relatively fair and auditable federated GBDT. It can encourage organizations with different datasets to train a better model. (3) We implement the system FGBDT-Chain in a permissioned blockchain environment and conduct comprehensive experiments on public datasets. The results show that FGBDT-Chain performs well and meets practical application needs, especially for large-scale datasets.
The rest of the paper is organized as follows. Section 2 reviews the related work on federated GBDT systems. Section 4 introduces the design outline of our system. The technical details of FV-tree and FGBDT-Chain are introduced in Section 5. Section 6 presents the performance evaluation of our system in terms of accuracy and fairness. We give a brief discussion and analysis in Section 7. Section 8 summarizes the paper and puts forward potential research directions.

Related Work
In this section, we review the literature on the federated GBDT and fairness in federated learning.

Federated Gradient Boosting Decision Tree.
Gradient boosting decision tree (GBDT) and its effective implementations such as XGBoost [3] and LightGBM [4] are widely used in both industrial and academic applications [5,11,12]. In distributed GBDT, the training data is located on different machines and is partitioned at the sample level. Generally, the local histograms of features are broadcast to all the parties to obtain the global distribution. Then each party chooses the most informative splitting points [13]. Among these schemes, the parallel voting decision tree (PV-tree) [14] is a representative one. It performs full-granularity histogram communication only for the features selected by each machine and then calculates the global split point. PV-tree achieves a very low communication cost (independent of the total number of features/samples) when the data distribution is uniform and scales well to large datasets.
In recent years, with the growing concerns about data security and privacy, several horizontal federated GBDT systems have been developed. [6] designed a distributed GBDT scheme in which each party trains a differentially private decision tree and uses Mean Absolute Error (MAE) to evaluate the accuracy of each decision tree. [8] took a similar approach and extended this learning process to the blockchain. However, in these tree-based sharing schemes, the quality of the shared tree is low. To solve this problem, [7] proposed Sim-FL, in which each instance gathers the gradients of similar instances held by other parties through locality-sensitive hashing (LSH) to learn the distribution of other parties. This weighted gradient boosting strategy can significantly improve the accuracy of each decision tree and achieves a basic level of privacy protection. Unfortunately, the communication overhead in each iteration is proportional to the number of local instances in the training party, which is not feasible when learning on large-scale datasets. We summarize the existing federated GBDT systems and compare them with our scheme in Table 1.

Fairness in Federated Learning.
Many data owners may not actively participate in federated learning, especially when the data owners are enterprises rather than devices [9]. Therefore, the fairness of the federated learning system needs to be taken into account. In existing federated learning research, fairness is mainly realized through an incentive mechanism. There are two main ideas: (i) all parties enjoy a global model; (ii) parties get different model rewards according to their contribution [15]. The goal of an incentive mechanism is to give each party a reward commensurate with its contribution. A number of studies focus on designing incentive mechanisms based on clients' resources [16] and reputation [17]. In contrast, we concentrate on an incentive mechanism based on the contribution of data quality, because data quality is a key factor that affects the model. Among schemes based on data quality contribution, the Shapley value [18] has a wide range of applications, and [15,19,20] studied the Shapley value of the contribution of data points during ML training. For the training process of federated learning, [21] proposed to record the intermediate results (i.e., gradients and models) and then use them to reconstruct the model in order to approximate the contribution indexes. This approach is efficient and feasible in horizontal federated learning. Unfortunately, there is an essential difference between gradient-based distributed GBDT and gradient-descent-based algorithms: reconstructed models are not always useful, and internal nodes do not affect the prediction score. Therefore, we need a new contribution measurement mechanism for the scenario without an intermediate model.
In addition, some works use blockchain technology to record the training milestones of clients and ensure the security of the incentive mechanism [22][23][24]. However, these works do not achieve the balance of privacy preservation, efficiency, and learning accuracy needed for a practical federated GBDT.

GBDT.
GBDT is an ensemble model in which several decision trees are trained sequentially. In each iteration, an objective function built from the loss $l$ and a regularization term $\Omega(f)$ is minimized to fit the residual of the previous learners [25], where $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ is the first-order gradient. Let $I = I_L \cup I_R$, where $I$ is the instance set of the parent node and $I_L$ and $I_R$ are the instance sets of the left and right child nodes after a split; the gain of a split point is computed from the gradients of $I_L$ and $I_R$. To reduce the computational complexity of traversing all feature values, histogram-based algorithms such as [4,26] use discrete bins to find an approximately optimal split. The details of the histogram-based algorithm are shown in Algorithm 1.
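The histogram-based search described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: it assumes the commonly used gain form $(\sum_{i \in I_L} g_i)^2 / (|I_L| + \lambda) + (\sum_{i \in I_R} g_i)^2 / (|I_R| + \lambda)$, and the function names, the `bin_edges` layout, and the default `lam = 1.0` are hypothetical choices made for the example.

```python
import bisect


def split_gain(g_left, n_left, g_right, n_right, lam=1.0):
    """Assumed gain form: (sum g_L)^2 / (|I_L| + lam) + (sum g_R)^2 / (|I_R| + lam)."""
    return g_left ** 2 / (n_left + lam) + g_right ** 2 / (n_right + lam)


def build_histogram(values, grads, bin_edges):
    """Accumulate per-bin gradient sums and counts for one feature.

    bin_edges are sorted upper bounds; a value v falls into the first bin
    whose edge is >= v, and into the last bin if it exceeds every edge.
    """
    n_bins = len(bin_edges) + 1
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, grads):
        b = bisect.bisect_left(bin_edges, v)
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count


def best_split(grad_sum, count, lam=1.0):
    """Scan bin boundaries left to right and return (best gain, boundary index)."""
    total_g, total_n = sum(grad_sum), sum(count)
    best_gain, best_idx = float("-inf"), -1
    g_left, n_left = 0.0, 0
    for b in range(len(grad_sum) - 1):
        g_left += grad_sum[b]
        n_left += count[b]
        gain = split_gain(g_left, n_left, total_g - g_left, total_n - n_left, lam)
        if gain > best_gain:
            best_gain, best_idx = gain, b
    return best_gain, best_idx
```

Scanning bins instead of raw feature values makes the cost per feature depend on the number of bins rather than on the number of instances, which is the point of the histogram approximation.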

Private Spatial Decompositions (PSD).
Generally, any dataset with ordered attributes or moderate to high cardinality (e.g., numerical features such as salary) can be considered spatial data. In addition, if a dataset can be indexed through a tree structure (such as a B-tree, R-tree, kd-tree, etc.), it can be implicitly treated as spatial [27]. Formally, a spatial decomposition is a hierarchical (tree) decomposition of a geometric space into smaller areas/hyperspaces, with data points partitioned among the leaves. Indexes are usually computed down to a level where the leaves either contain a small number of points, or have a small enough area, or a combination of the two. There have been many approaches to spatial decomposition. Some are data-independent, such as quadtrees, which recursively divide the data space into equal quadrants. Other methods, such as the popular kd-trees, are data-dependent and aim to better capture the data distribution. [27] gives a full framework for privately representing spatial data. We use the PSD to share a coarse distribution summary with other data owners. It is used both in collaborative learning and in calculation verification under statistical heterogeneity.
Blockchain [28] is a kind of chained data structure that combines data blocks in chronological order. Cryptographic primitives ensure that the append-only data are tamper-proof and unforgeable. The main advantages of blockchain are decentralization, security, transparency, and traceability. Hyperledger Fabric [29] is a popular and efficient enterprise-level permissioned blockchain framework. Fabric also modularizes the consensus mechanism, authentication, and other components, which makes it more suitable for business cooperation between enterprise organizations. In summary, Fabric can provide a decentralized trust environment in which a group of organizations carries out the complex business transactions of collaborative GBDT training tasks.

The FGBDT-Chain Framework
This section describes the overall design of FGBDT-Chain, including the design objectives and system overview. We adopt the general assumption of federated learning, in which one model requester publishes a model request and multiple parties participate in the collaborative learning task. The problem description is given in Section 3-A, and the system summary in Section 3-B. The main symbols used in this paper are given in Table 2.

Design Objectives.
We assume that there are $M$ parties, and each party is denoted by $P_m$. We focus on the collaborative training of a GBDT model, in which $M$ parties (data owners), including one requester, cooperate to carry out a federated GBDT training task. For example, as shown in Figure 1, due to the different distributions of their patients, two private hospitals $P_1$ and $P_2$ may prefer accurate test predictions for female and young patients, respectively [15]. Without relying on unrealistic public datasets or third-party central servers, they hope to achieve peer-to-peer collaborative learning and obtain high-quality models in a trusted environment. More importantly, they need a guarantee that they will receive rewards corresponding to their own contributions. Under this assumption, our federated GBDT system tries to meet the following three objectives: (i) Model accuracy and efficiency: Building a high-quality global model over multiple skewed datasets is the basic requirement of all parties. In addition, the parties may be geographically far apart, and the intermediate process may be stored in the blockchain for the sake of fairness and security, so the communication cost must be kept reasonable. For this reason, we propose FV-tree, which reduces the communication to a small range and obtains good model performance under skewed data distributions. (ii) Fairness: As mentioned previously, data is increasingly considered a valuable asset and a critical strategic resource. In addition, participants need to invest a tremendous amount of computation and storage in FL. Without any revenue, data owners may not voluntarily provide data and training resources. To encourage more parties to participate in a collaborative learning program, it is necessary to accurately calculate the cooperative contribution of each participant. We use the split gain generated by the party's updated gradients to calculate the split Shapley value of each party.
In this way, we can fairly quantify the contribution of each party in the whole process and provide a mechanism for monetary rewards with delayed payment. (iii) Security: We assume that parties are curious and that they will not maliciously attack the federated model unless they can obtain a higher income. This means that our system not only needs to avoid leaking the original data in the learning process but also needs to provide a necessary verification mechanism. We also have to eliminate the possibility that greedy participants deliberately exaggerate their contribution through updated information. Therefore, we propose FGBDT-Chain, which provides a differential privacy extension and a decentralized endorsement mechanism to filter distorted update information.

Notes to Table 1: "✕" indicates that a requirement is not met, and "✓" that it is met. (1) The federated GBDT model performs well in accuracy under skewed data distributions. (2) The system has differential privacy extensibility. (3) The system's communication architecture, especially the training information shared in federated GBDT training. (4) The communication cost is independent of the number of samples in the local dataset. (5) In the absence of a third-party server (none of the above systems need it), the blockchain is used as an autonomous platform to coordinate the training process.

Algorithm 1: Histogram-based algorithm. Input: $I$, the instance set of the current node; $F$, the feature set.

The Proposed Architecture.
Our proposed system consists of two modules: a permissioned blockchain module and a federated GBDT module. The permissioned blockchain establishes secure connection channels among all nodes. FGBDT-Chain is based on the FV-tree training framework, which includes three stages: distribution preprocessing, feature voting, and gradient histogram aggregation. The permissioned blockchain module includes four types of transactions: model request transactions, feature voting transactions, gradient histogram upload transactions, and contribution index allocation transactions. The contribution index assignment is implemented by smart contracts according to historical transactions. The information stored in the permissioned blockchain is shown in Figure 2.
Table 2: Main symbols.
(symbol not recoverable): total number of splits in the ensemble model
$h^m$: $P_m$'s histogram of ordered gradients
$bin_g$, $bin_n$: the sum of gradients and the count of each bin in one histogram
$gain_q$: the split gain of the $q$-th split in the GBDT model
$split_q$: the split point of the $q$-th split in the GBDT model, which includes the split feature and the split threshold
$psd_m$: private spatial decomposition structure of $P_m$
$c^m_q$, $C^m$: $P_m$'s contribution index for the $q$-th split and for all splits, respectively
$\phi^q_m$: $P_m$'s split index (split Shapley value) during the $q$-th split
$\kappa^q_m$: $P_m$'s voting contribution index during the $q$-th split
$W^m$: $P_m$'s distribution weight matrix
$w^{m*}$: $P_m$'s global distribution weight vector
$pk_m$, $sk_m$: $P_m$'s key pair, used for verification and signing, respectively

Step 1. In the beginning, a model requester initializes the permissioned blockchain and specifies the requirements of the learning task, such as dataset requirements and model parameters. Parties that wish to join the learning task or that receive a request should be authenticated and then upload a rough distribution summary (i.e., PSD) of their datasets. The model requester has the right to refuse a party as a federation member based on its distribution summary.
Step 2. After a specified number of organizations have joined the federated learning task, each party downloads all PSDs and establishes the distribution matrix and the global distribution vector. This completes the initialization.
Step 3. In the collaborative training stage, each party uses the local dataset $I^m$ and the global distribution vector to calculate the locally most informative feature and uploads the feature index through a voting transaction. At the same time, all parties can determine the top-2 features with the highest number of votes as candidate features from the on-chain transactions.
Step 4. Parties broadcast the local original gradient histograms of the candidate features. After a party has received signatures from a majority of parties for its histogram, the histograms and the signature set are written into a transaction. With the help of the distribution matrix, the verification algorithm can detect malicious updates under skewed data distributions (a malicious update is a gradient histogram stretched by a greedy participant to improve its contribution index).
Step 5. The smart contract calculates the best split point and allocates contribution indexes according to the historical transactions. These two suboperations can be parallelized, and their complexity is low. In addition, since the update records are stored in transactions, for urgent tasks the contribution indexes can be calculated after the training process is completed. Steps 3 to 5 form a loop that is executed until the stopping condition of the training is met. When the learning task is finished, the federated GBDT model and the parties' update/contribution records are stored in the blockchain's transactions. The whole learning process does not depend on any single party. In addition, because all the records created during the training of the decision trees are tamper-proof, the federated members can be audited at any time.
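The voting in Step 3 and the aggregation that precedes the split search in Step 5 can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary layout of the votes, and in particular the tie-breaking rule (smaller feature index first) are assumptions made here so that every party derives the same ranking; the source does not specify how ties are resolved.

```python
from collections import Counter


def top2_candidates(votes):
    """Return the two most-voted feature indexes.

    votes maps a party id to the feature index that party voted for.
    Ties are broken by the smaller feature index (an assumption), so the
    result is deterministic for every party reading the same transactions.
    """
    tally = Counter(votes.values())
    ranked = sorted(tally.items(), key=lambda kv: (-kv[1], kv[0]))
    return [feature for feature, _ in ranked[:2]]


def aggregate_histograms(histograms):
    """Sum per-bin (gradient sum, count) pairs over all parties for one feature."""
    n_bins = len(histograms[0])
    return [
        (sum(h[b][0] for h in histograms), sum(h[b][1] for h in histograms))
        for b in range(n_bins)
    ]
```

Each party uploads one feature index per split plus histograms for two features, so the per-split message size depends on the number of bins and not on the number of local instances.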

The Design Detail of FGBDT-Chain
FGBDT-Chain is a blockchain-based collaborative learning framework for GBDT. We introduce the framework in two parts: FV-tree and FGBDT-Chain. Firstly, we introduce the PSD-based preprocessing phase, which provides the basis for our framework (Section 5-A). Secondly, we describe the GBDT training framework FV-tree in detail, which includes the tree growth process based on feature voting, the publishing of gradient histograms, and the differential privacy extension (Section 5-B). Finally, we introduce FGBDT-Chain's fairness assurance, including the incentive mechanism based on a novel contribution measurement algorithm and the decentralized verification scheme on the blockchain (Section 5-C).

Preprocessing Stage.
When a party receives the model request transaction, it first checks the dataset requirements and selects the local instances that meet the task description; the resulting instance set is denoted by $I^m$. Then it starts the preprocessing operations. The main idea is to capture the data distribution of all other parties by generating a rough distribution matrix $W^m \in \mathbb{R}^{N_m \times M}$ and a global distribution vector $w^{m*} \in \mathbb{R}^{N_m}$, where $W^m_{ij}$ is the distribution weight of $P_m$'s instance $x^m_i$ in party $P_j$'s instance set $I^j$, and $w^{m*}_i$ is the distribution weight of the instance $x^m_i$ in the global instance set $I$. In our scheme, $w^{m*}$ is an optional term. When distributions are badly skewed, it is used in the voting stage to select the most informative local feature (Section 5-B1), and $W^m$ is subsequently used for verification (Section 5-C2).
More specifically, party $P_m$ first calculates $psd_m$ from $I^m$, a procedure that has been well studied in previous research [27]. Let $V^m_l$ be the value of the $l$-th leaf in $psd_m$. Intuitively, $psd_m$ is a tree model that represents a rough summary of the data distribution of $P_m$, where the value $V^m_l$ is the number of instances in the hyperspace represented by leaf node $l$, and the count $V^m_l$ has been perturbed by differential privacy. Party $P_m$ can upload $psd_m$ in a blockchain transaction and download the PSDs of the other parties in the collaborative learning task.
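A PSD with perturbed leaf counts can be sketched as follows. This is a deliberately simplified illustration under stated assumptions, not the construction of [27]: it uses a data-independent quadtree of fixed depth over the unit square (whose leaves coincide with a uniform grid), adds Laplace noise only to the leaf counts, and spends the whole budget `epsilon` there. The names `leaf_index`, `build_psd`, and `laplace` are hypothetical.

```python
import random


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def leaf_index(x, y, depth):
    """Index of the depth-`depth` quadtree leaf containing (x, y) in [0, 1]^2.

    The leaves of a complete quadtree form a 2^depth x 2^depth grid, so the
    leaf can be located directly from the coordinates.
    """
    cells = 1 << depth
    col = min(int(x * cells), cells - 1)
    row = min(int(y * cells), cells - 1)
    return row * cells + col


def build_psd(points, depth, epsilon, rng):
    """Return noisy leaf counts V_l for a fixed-depth quadtree.

    Each point falls into exactly one leaf, so adding or removing a point
    changes one count by 1; Laplace noise with scale 1/epsilon is therefore
    applied to every leaf count.
    """
    cells = 1 << depth
    counts = [0.0] * (cells * cells)
    for x, y in points:
        counts[leaf_index(x, y, depth)] += 1.0
    return [c + laplace(1.0 / epsilon, rng) for c in counts]
```

A party receiving such a summary can look up the leaf that contains one of its own instances with `leaf_index` and read off the noisy count, which is the kind of lookup the weight computation relies on.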
Then $P_m$ maintains the distribution weight matrix $W^m$ and the global distribution weight vector $w^{m*}$. The details are shown in Algorithm 2. After party $P_m$ downloads $psd_j$ from $P_j$, it uses its local instance set to compute the corresponding weights, where $\delta$ is a parameter controlling the degree of distribution fitting, and $N$ and $N_j$ denote the number of instances globally and in party $P_j$, respectively, obtained by accumulating the leaf values of the different PSDs. In addition, $N_j/N$ represents the fitting budget of $P_j$: the more instances a party has, the larger the fitting budget allocated to it. For Algorithm 2, we have the following observations. Firstly, the PSD needs to be computed only once, and the tree structure greatly reduces the communication cost compared with sending a hash for each sample [7]. Secondly, the structures of the PSDs can differ, which means that parties do not need to agree on a unified PSD structure in advance. In other words, parties can choose any tree model or inner nodes, whether a quadtree or a kd-tree, without affecting how other parties generate their weight matrices.

FV-Tree.
Once the local weight matrix $W^m$ and the global weight vector $w^{m*}$ are established, parties can enter the training stage. In the training phase, each party does not train a complete tree; instead, it sends minimal update information. There are two types of update information: (i) parties' split feature votes and (ii) gradient histograms of the candidate features, which are used to calculate global split points. In each node split, parties calculate the split feature with the highest gain locally and vote for it. The top-2 features by number of votes in the global voting become candidate features, and the parties then send the gradient histograms of these features. From these two kinds of update information, each party can update the global GBDT model synchronously.
However, this method may produce errors because the split feature may not be globally optimal, especially with decentralized data owners whose data have different distributions/sizes. We therefore use gradient refit to alleviate this problem. The basic idea of gradient refit is to adjust the gradients according to the global weights of the instances and then calculate the most informative feature from the refitted gradients. When the global candidate features are selected, the two local original histograms are sent. The details of FV-tree are shown below.
At the beginning of an iteration, party $P_m$ has a local instance set $I^m$ and the global distribution weight vector $w^{m*}$. First, $P_m$ updates gradients and synchronizes the split information of each new node. Details are shown in Algorithm 3 and Figure 3. For each new node generated in the decision tree, $P_m$ calculates the local split gain of all the split points. When the local split point with the highest split gain is selected, party $P_m$ publishes the corresponding feature's index as a vote. After receiving all the local votes, every party can sort the features according to the number of votes. At this point, every party obtains the same ranking of features, selects the top-2 features as candidate features, and uploads the corresponding gradient histograms. Note that the uploaded gradient histograms are the original ones, not the refitted ones. After receiving the histograms from other parties, each party traverses all the split points in the aggregated histograms to find the best split with the highest split gain. The gain of each split point is computed from the quantities $\sum_{i \in I^m_L} g^m_i$, $\sum_{i \in I^m_R} g^m_i$, $\sum_{m=1}^{M} |I^m_L|$, and $\sum_{m=1}^{M} |I^m_R|$, which are calculated from the aggregated histograms. When a node reaches the maximum depth, it becomes a leaf node and its value is computed. In the training process of FV-tree, a participant needs update information from other parties to split a non-leaf node, whereas the value of a leaf node is generated directly from the histograms of its parent node. So, we only need to allocate the privacy budget to the non-leaf nodes.

In the communication process of FV-tree, local feature voting and histogram aggregation may lead to privacy leakage. For the local best split point selection, the information gain is used as the utility function, and the exponential mechanism is used to return the split point with the largest gain value. Let $g^*$ be the gradient with the largest absolute value. Following the conclusion of previous work [13], the sensitivity is $\Delta G = \frac{3\lambda + 2}{(\lambda + 1)(\lambda + 2)} g^*$. Before the histograms are uploaded, the value of each bin is perturbed by Laplace noise [14]. The sensitivity of the gradient histogram is $2g^*$, and the sensitivity of the count histogram is 1. To maintain the effectiveness of boosting, we use the two-level boosting structure (EOE) to allocate the privacy budget over multiple decision trees [13], and our method satisfies $\epsilon$-differential privacy.

Proof. Assume that the privacy budget of a tree is $\epsilon_t$ and that the maximum depth of a decision tree is $d$. Since the nodes at the same depth have disjoint inputs, parallel composition applies within a depth, and each instance passes through at most $d - 1$ node splits. Further, each split is regarded as five queries: one vote for the best split feature and, for each of the two candidate features, one gradient histogram update and one count histogram update. The privacy budget for each query is therefore $\epsilon_{split} = \epsilon_t / (5(d - 1))$. Hence, a single decision tree satisfies $\epsilon_t$-differential privacy. In EOE, if there are $E$ ensembles in total, the privacy budget of each tree is $\epsilon_t = \epsilon / E$, and the whole FV-tree training process satisfies $\epsilon$-differential privacy.

Algorithm 2: Establishing the distribution weights.
Input: PSD model set $psd_1, psd_2, \ldots, psd_M$; instance set $I^m$
Output: distribution weight matrix $W$; global distribution vector $w^*$
// establish distribution weight matrix
for j ← 1 to M do
    for i ← 1 to N_m do
        S ← psd_j.getLeafNode((x_i, y_i)); S.push(i);
    // set hyperspace's weight to matrix W
    for l ← 1 to L_j do
        …
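The budget split used in the proof and the Laplace perturbation of the histograms can be sketched as follows. This is a minimal sketch under the assumptions stated in the text (five queries per split, at most $d - 1$ splits per instance, $E$ ensembles, sensitivities $2g^*$ and 1); the function names are hypothetical, and the exponential mechanism used for the feature vote is not shown.

```python
import random


def split_budget(epsilon, num_ensembles, max_depth, queries_per_split=5):
    """Per-query budget: eps_t = eps / E, then eps_t / (5 * (d - 1))."""
    if max_depth < 2:
        raise ValueError("max_depth must be at least 2 so that a split exists")
    eps_tree = epsilon / num_ensembles
    return eps_tree / (queries_per_split * (max_depth - 1))


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_histogram(grad_sum, count, g_star, eps_query, rng):
    """Add Laplace noise to a gradient histogram and its count histogram.

    The gradient histogram has sensitivity 2 * g_star and the count
    histogram has sensitivity 1, so the noise scales are sensitivity / eps.
    """
    noisy_grad = [g + laplace(2.0 * g_star / eps_query, rng) for g in grad_sum]
    noisy_count = [n + laplace(1.0 / eps_query, rng) for n in count]
    return noisy_grad, noisy_count
```

For example, with $\epsilon = 1$, $E = 2$, and $d = 6$, each query receives $0.5 / 25 = 0.02$, which shows how quickly the per-query budget shrinks as trees get deeper.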
In summary, our scheme leverages voting split features and updating gradient histogram to make a tradeoff between accuracy, communication cost and security, and we give a brief discussion in section 7-A.

FGBDT-Chain.
To attract more institutions with high-quality data to the federated learning task, it is necessary to quantify the contribution of each party fairly and to provide incentive mechanisms according to the contribution index. A widely used approach is to quantify the contribution of each participant's local model [9]. However, this is infeasible when no local model exists. For example, in our FV-tree scheme, there is no local model, and split points are decided by all parties. We therefore design a new approach and mechanism to quantify the contribution of federated parties. We first define the fairness of the federated GBDT task.

Definition 1. (Collaborative fairness in GBDT)
In a collaborative GBDT learning task, multiple parties train a global model together. The party that provides more valuable information for the global model gets a higher contribution index. Specifically, fairness can be measured by the parties' split gain.
We define what is valuable information as follows.
Definition 2. (Valuable information in gradient-based collaborative GBDT): Suppose parties P and P′ participate in distributed GBDT learning. Once the global best split point is determined, we can informally say that party P provides more valuable information than P′ if the gradients submitted by P bring more split gain than the gradients submitted by P′ at the global split point. The growing process of a decision tree is to constantly find the split point that brings the maximum split gain. The split gain provided by a party's update information for the global model reflects the corresponding contribution, because split gain represents the reduced uncertainty in the selection of the split point. Formally, let $C \triangleq \{P_1, \ldots, P_M\}$ denote a set of M parties. We call a subset B a coalition of parties if $B \subseteq C$. The histogram vector of $P_m \in B$ is represented by $h_m$, and coalition B's histogram set is denoted by $H_B$. We denote the best splitting point as $split_q$, and the global gain of $split_q$ is $G_q$. Then, we define the utility function $U_B$: The above equation is the histogram form transformed from (5), where $h_m^L$ / $h_m^R$ denote the sets of bins on the left/right parts segmented by $split_q$, and $bin_g$ and $bin_n$ denote the sum of gradients and the count in the corresponding bin, respectively. According to the observation of (7), two properties fulfill the standard assumptions of cooperative game theory:
Property 1. The histogram of the empty coalition has no utility: $U_\emptyset = 0$.
Property 2. The histogram of any coalition $B \subseteq C$ has nonnegative value: $\forall B \subseteq C,\ U_B \ge 0$.
Proof. The above two properties can be proved simply. For Property 1, when $B = \emptyset$, each bin in $H_B$ equals 0, so $G(H_B; split_q)$ equals 0. For Property 2, because $(\sum_{m \in B} \sum_{bin \in h_m^L} bin_g)^2 \ge 0$ and $bin_n$ is a natural number, the minimum value of $U_B$ is $(0/\lambda) + (0/\lambda) = 0$.
To guarantee that the histograms' contribution measurement is fair to all M parties, we use the Shapley value, which is the unique value division scheme that satisfies the symmetry, null player, additivity, and efficiency properties. Next, we define the contribution of a federated party in a single split: For simplicity, we use $\phi_m^q$ to denote the split Shapley value of $P_m$ at the q-th split; it can be called the split contribution index.
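The split Shapley value can be computed by direct enumeration of coalitions. The sketch below is a minimal illustration under stated assumptions: the utility is taken to be $U_B = (\sum g_L)^2 / (\sum n_L + \lambda) + (\sum g_R)^2 / (\sum n_R + \lambda)$, a form inferred from the terms that appear in the proof of Property 2 rather than copied from the paper's equation, and each party's histogram is assumed to be already aggregated into left and right (gradient sum, count) pairs around $split_q$. The Shapley weighting itself is the standard formula; the names `utility` and `split_shapley` and the toy data are hypothetical.

```python
from itertools import combinations
from math import factorial


def utility(coalition, hists, lam):
    """Assumed U_B: (sum g_L)^2 / (sum n_L + lam) + (sum g_R)^2 / (sum n_R + lam)."""
    total = 0.0
    for side in ("L", "R"):
        g = sum(hists[m][side][0] for m in coalition)
        n = sum(hists[m][side][1] for m in coalition)
        denom = n + lam
        if denom > 0:  # an empty side contributes nothing
            total += g * g / denom
    return total


def split_shapley(hists, lam):
    """Standard Shapley value of every party under the assumed utility."""
    parties = list(hists)
    num = len(parties)
    phi = {}
    for p in parties:
        others = [q for q in parties if q != p]
        value = 0.0
        for k in range(num):
            weight = factorial(k) * factorial(num - k - 1) / factorial(num)
            for coalition in combinations(others, k):
                gain = utility(coalition + (p,), hists, lam) - utility(coalition, hists, lam)
                value += weight * gain
        phi[p] = value
    return phi


# Toy data: per party, (gradient sum, count) left and right of the chosen split.
hists = {
    "P1": {"L": (-2.0, 4), "R": (1.0, 2)},
    "P2": {"L": (-1.0, 2), "R": (3.0, 4)},
    "P3": {"L": (-0.5, 3), "R": (0.5, 3)},
}
phi = split_shapley(hists, lam=1.0)
```

The enumeration visits $2^{M-1}$ coalitions per party, which stays cheap when M is small. By the efficiency property, the values in `phi` sum to the utility of the grand coalition.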
In addition to the split contribution, voting contributions are required to encourage parties to choose the most informative features. In the q-th split, the voting contribution κ of $P_m$ is defined by two cases in terms of the global gain $G_q$, according to whether $P_m$'s vote hits the split feature or not. Finally, party $P_m$'s total contribution index for the q-th split of the federated GBDT model is defined as $c_m^q$, where $\kappa_m^q$ is the voting contribution, α ∈ (0, 1] is a tunable parameter that controls the voting contribution, and $\phi_m^q$ is the split contribution, that is, the split Shapley value defined above. When the federated GBDT model training is complete, the contribution of party $P_m$ is $C_m = \sum_{q=1}^{Q} c_m^q$, where Q is the total number of splits (the number of non-leaf nodes).
In the previous section, we described in detail how to quantify the contribution of a party. However, it is a challenge to calculate C when there is no trusted third party because C is directly related to the interests of each participant. To ensure the security of the logic of contribution measurement, we use a smart contract to retrieve historical transactions and record the contribution of each party.
Even though a smart contract can secure the computing process, the split Shapley value is sensitive to the submitted values, so greedy parties can obtain a higher split contribution ϕ by tampering with their local histograms. A concrete example is shown in Table 3. Suppose two parties $P_1$ and $P_2$ submit their local histogram transactions $h_1$ and $h_2$. For simplicity, let λ = 0; then G is 0.45, and the split contributions ϕ of $P_1$ and $P_2$ are 0.375 and 0.075, respectively. However, if $P_2$ tampers with its gradient histogram $h_2$ by doubling its values, the global G increases to 0.85. Accordingly, the split contributions $\phi_1$ and $\phi_2$ change to 0.275 and 0.575. $P_2$ has thus substantially increased its split contribution.
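The effect of stretching a histogram can be reproduced on a toy case. The sketch below does not use the histograms of Table 3; it uses made-up two-bin histograms and assumes a utility of the form $(\sum g_L)^2 / (\sum n_L + \lambda) + (\sum g_R)^2 / (\sum n_R + \lambda)$, inferred from the proof of Property 2. For two parties the Shapley value reduces to the closed form used in `two_party_shapley`. All names are hypothetical.

```python
def utility(parts, lam=0.0):
    """Assumed U_B for a list of per-party ((g_L, n_L), (g_R, n_R)) tuples."""
    total = 0.0
    for side in (0, 1):
        g = sum(p[side][0] for p in parts)
        n = sum(p[side][1] for p in parts)
        if n + lam > 0:
            total += g * g / (n + lam)
    return total


def two_party_shapley(h1, h2, lam=0.0):
    """Shapley values for M = 2: the average of the two marginal contributions."""
    u1, u2, u12 = utility([h1], lam), utility([h2], lam), utility([h1, h2], lam)
    phi1 = 0.5 * u1 + 0.5 * (u12 - u2)
    phi2 = 0.5 * u2 + 0.5 * (u12 - u1)
    return phi1, phi2


def scale_gradients(h, factor):
    """Multiply the gradient sums by `factor` while leaving the counts unchanged."""
    return tuple((g * factor, n) for g, n in h)


# Made-up histograms: (gradient sum, count) on the left and right of the split.
h1 = ((-1.0, 4), (1.0, 4))
h2 = ((-0.5, 4), (0.5, 4))

honest = two_party_shapley(h1, h2)
tampered = two_party_shapley(h1, scale_gradients(h2, 2.0))
```

On this toy input the honest values are (0.46875, 0.09375), and after $P_2$ doubles its gradient sums they become (0.5, 0.5), so the stretched histogram raises $P_2$'s share more than fivefold without any change in its data.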
Based on the above analysis, it is necessary to verify the update information in our system to maintain fairness. In federated GBDT, the only existing verification scheme is to use local datasets to measure the performance of the updated model [6,8]. Because it is difficult to generate public validation datasets, this scheme is considered a minimized method in the federated scenario [30]. We inherit the idea of using a local dataset as the basis of verification. However, we cannot directly use the performance of the model, for the following reasons. First, the update information in FV-tree consists of gradients rather than models, and using gradients to reconstruct a model requires additional calculation. Second, verifying model quality cannot fundamentally solve the above problem: the contribution value of a histogram becomes significantly higher after it is stretched proportionally, yet the quality of the model using the stretched histogram may not differ much from the original. In response to these problems, we take the histogram overlap degree as the verification algorithm, in which the histogram used for verification is constructed from the distribution matrix W and the local histogram h. We integrate this method into the endorsement mechanism of the permissioned blockchain to implement FV-tree's decentralized verification scheme.
Specifically, as shown in Figure 4, before party $P_m$ submits a histogram transaction, it first needs to broadcast the histogram $h_m$ to the other parties for signature. When $P_j \in C \setminus \{P_m\}$ receives the signature request for $h_m$ from $P_m$, $P_j$'s local gradients and the distribution vector $W_m^j$ are used to construct the refitted histogram $h_m^{j*}$, which denotes the histogram constructed by $P_j$ to verify $h_m$. Since $P_j$ holds only its own refitted histogram, we simply denote it as $h_m^*$. The details of this process are similar to Algorithm 1, except that x · gradient, 1 are replaced by gW where $bin_g^m$, $bin_n^m$ denote the cumulative gradients and counts, respectively. The overlapping degree can verify the correlation of bin values and whether they are stretched. When the overlapping degree is less than the threshold, $P_j$ signs the histogram $h_m$ and sends $sig(h_m, sk_j)$ to $P_m$, where $sk_j$ is the private key of $P_j$. When $P_m$ obtains the signatures of most parties, it writes the histogram and signature set into the transaction and sends it to the orderers; the histogram transaction is then packaged into a block. The above design suits the overall architecture of our federated GBDT: it can detect histograms with exaggerated contribution and does not significantly affect the efficiency of the system. First, because the refitted histogram takes the parties' data distributions into account, we can, to a certain extent, avoid misjudging correctly calculated update information as malicious, while a stretched histogram is easily discovered. As for the efficiency of the verification scheme, the whole decentralized verification process is very similar to Fabric's high-level transaction flow [29]. The only difference is that the party uses its local dataset outside the blockchain instead of simulating the execution of the smart contract.
In addition, this process is also different from the processing method of Proof of Quality (PoQ) [8], which suggests checking the quality of all models after block generation. If there is a malicious transaction, the block needs to be repackaged, which means retraining the whole GBDT model. In our scheme, orderers can filter out the transactions that are not recognized by the majority of participants when ordering transactions.
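The endorsement decision of a verifying party can be sketched as follows. The overlapping-degree formula is not reproduced here; as a stand-in, the sketch compares per-bin mean gradients (gradient sum divided by count) of the submitted and the refitted histogram and signs when the average gap is below a threshold. This measure, the threshold value, and the names `mean_gradient_gap` and `should_sign` are assumptions made purely for illustration, chosen because a proportionally stretched histogram changes the per-bin mean while leaving the counts intact.

```python
def bin_means(hist):
    """Per-bin mean gradient for a histogram given as (gradient sum, count) pairs."""
    return [g / n if n > 0 else 0.0 for g, n in hist]


def mean_gradient_gap(submitted, refitted):
    """Average absolute gap between per-bin mean gradients (assumed stand-in measure)."""
    a, b = bin_means(submitted), bin_means(refitted)
    if len(a) != len(b):
        raise ValueError("histograms must have the same number of bins")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def should_sign(submitted, refitted, threshold):
    """Sign only when the submitted histogram is close to the verifier's refitted one."""
    return mean_gradient_gap(submitted, refitted) < threshold


# Toy data: the verifier's refitted histogram and two candidate submissions.
refitted = [(-1.9, 10), (0.6, 5), (2.8, 10)]
honest = [(-2.0, 10), (0.5, 5), (3.0, 10)]
stretched = [(g * 2.0, n) for g, n in honest]  # gradients doubled, counts unchanged

sign_honest = should_sign(honest, refitted, threshold=0.1)
sign_stretched = should_sign(stretched, refitted, threshold=0.1)
```

In this toy case the honest submission is signed and the stretched one is rejected; how tight the threshold should be depends on how well the refitted histogram approximates the submitter's data distribution.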

Implementation and Evaluation
6.1. Experiment Setup. We implement FV-tree based on LightGBM (https://github.com/microsoft/LightGBM). For PSD, we use a data-independent tree model. At each PSD node split, we randomly select a feature from the unused feature set and split it at the average of the global maximum and minimum values (the maximum and minimum values are specified in the task initialization transaction); we also treat the label as a feature. The maximum depth of PSD is 8, and the maximum value of each leaf node is 500. Laplace noise is injected into the leaf nodes, with privacy budget ϵ = 1. For the GBDT model, the maximum depth of each tree is 8, the number of iterations is 500, the regularization parameter λ is set to 0.1, and the maximum number of bins in the feature histogram is 16 (more bins bring higher accuracy, but this small accuracy difference is not significant for the federated GBDT framework).
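For readers who want to reproduce the centralized or standalone baselines with stock LightGBM, the GBDT settings above map onto a parameter dictionary roughly as follows. This is a sketch, not the authors' configuration file: mapping λ to `lambda_l2` and using the `binary` objective are assumptions, and the FV-tree-specific logic (voting, PSD, differential privacy) is not part of standard LightGBM.

```python
# Hypothetical LightGBM parameter dictionary mirroring the GBDT settings in the text.
gbdt_params = {
    "objective": "binary",   # assumption: a9a, HIGGS, and SUSY are binary classification tasks
    "max_depth": 8,          # maximum depth of each tree
    "num_iterations": 500,   # number of boosting iterations
    "lambda_l2": 0.1,        # assumption: the regularization parameter lambda is the L2 term
    "max_bin": 16,           # maximum number of bins in the feature histogram
}
```

Such a dictionary would typically be passed as the `params` argument of `lightgbm.train` together with a `lightgbm.Dataset`.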
We used three public datasets to evaluate our scheme (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), as shown in Table 4. 75% of each dataset is used for training, and the rest is used for testing. To allocate skewed local datasets, as the realistic scenario requires, we used the partition method of previous work [31], which allocates the datasets for each party according to the unbalanced ratio θ ∈ [0, 1]. After allocation, half of the parties get $(\theta \cdot N_{class0})/M$ instances of class 0 and $((1 - \theta) \cdot N_{class0})/M$ instances of class 1; the other parties get just the opposite. This partition method represents the data distribution in the federated scenario well. Specifically, in addition to label skew, there is also feature skew between local datasets [32]. As shown in Figure 5, we use kernel density estimation (KDE) to intuitively show the degree of skew of the feature distribution between local and global datasets. We compare our federated GBDT system with two other frameworks: Standalone framework:
This framework assumes that each party trains the ensemble model using only its local dataset. The standalone setting shows the performance of a party's locally trained model. In addition, there are two types of local dataset distribution in the unbalanced partition. We denote the group of parties with more positive samples as Standalone A and the other group as Standalone B. Centralized framework: This framework assumes that there is a trusted server with access to all parties' data, which uses the global data to train the ensemble model without any privacy concerns. The centralized framework is highly accurate, but it is hard to implement in practice due to various restrictions. In addition, we also compare our scheme with other advanced federated GBDT frameworks under the same settings, such as TFL, which is based on tree model communication, and SimFL, which is based on both tree model and gradient communication.
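The label-skewed partition by the unbalanced ratio θ described in the experiment setup can be sketched as below. This is one possible reading, not the partition code of [31]: it assumes an even number of parties and binary labels, gives the first half of the parties a θ share of class 0 and a (1 − θ) share of class 1, gives the second half the remainder, and divides each pool evenly inside a group so that every instance is used exactly once. That normalization is an assumption and differs by a constant factor from the per-party expression in the text. The function name is hypothetical.

```python
import random


def skewed_partition(labels, num_parties, theta, rng):
    """Split sample indices into label-skewed local datasets (illustrative reading)."""
    if num_parties % 2 != 0:
        raise ValueError("num_parties must be even")
    idx0 = [i for i, y in enumerate(labels) if y == 0]
    idx1 = [i for i, y in enumerate(labels) if y == 1]
    rng.shuffle(idx0)
    rng.shuffle(idx1)
    half = num_parties // 2
    cut0 = round(theta * len(idx0))        # class-0 instances for the first group
    cut1 = round((1 - theta) * len(idx1))  # class-1 instances for the first group
    pools = [(idx0[:cut0], idx1[:cut1]), (idx0[cut0:], idx1[cut1:])]
    parties = []
    for zeros, ones in pools:
        for k in range(half):
            # Deal each pool round-robin to the parties of the group.
            parties.append(zeros[k::half] + ones[k::half])
    return parties


# Toy usage: 100 instances per class, 4 parties, theta = 0.8.
labels = [0] * 100 + [1] * 100
parts = skewed_partition(labels, num_parties=4, theta=0.8, rng=random.Random(0))
```

With these toy inputs, each party in the first group holds 40 class-0 and 10 class-1 instances, and each party in the second group holds the mirror image.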

Voting by Refitted Gradients.
We first show the accuracy of FV-tree without considering differential privacy. To evaluate the effect of gradient refit, we compare FV-tree and PV-tree by convergence speed. Without loss of generality, the number of parties is set to 4, and the ratio θ is set to 80%. The default parameters are used in all frameworks. The experimental results are shown in Figure 6. We can observe the following points. First, FV-tree performs better than PV-tree and the Standalone models on all datasets. Because of the data skew, the accuracy of the standalone mode is greatly reduced, since each party is affected by the data distribution bias in the learning process. FV-tree refits gradients through PSDs, so it has a greater probability of selecting the most informative feature. Second, on the datasets a9a and SUSY, the centralized framework may lead to overfitting, while there is no such problem in the schemes based on FV-tree and PV-tree. Finally, the accuracy of PV-tree is significantly higher than that of the Standalone mode. This means that when considering differential privacy, we can get a tighter sensitivity without using the gradient refit.

The Impact of Unbalanced Ratio θ.
To show the influence of different skew degrees on FV-tree, we simply set the number of parties to 2. The experimental results are compared with SimFL, an advanced work without differential privacy. We observe the influence of different degrees of unbalanced distribution on the prediction accuracy, as shown in Figure 7. First, the accuracy of the standalone model decreases greatly as the distribution becomes more skewed. Second, although the accuracy of both our framework and SimFL can be higher than that of local training when the unbalanced ratio is greater than 70%, FV-tree is much less affected than SimFL.
This may be because the model accuracy is affected only by feature selection in the FV-tree framework, while SimFL is affected by both feature selection and the calculation of leaf weights. This means FV-tree is more suitable for skewed data distributions.

The Impact of the Number of Parties M.
The number of parties also affects the accuracy of the model. We vary the number of parties while the unbalanced ratio θ is set to 80%. The experimental results are shown in Figure 8. First, we can observe that FV-tree outperforms Standalone and SimFL for every number of parties; the test error on dataset SUSY is even lower than that of the overfitted centralized model. Second, increasing the number of parties does not have much impact on FV-tree. This advantage may also come from the fact that FV-tree is not affected by the calculation of leaf weights.

6.2.4. The Impact of Differential Privacy.
Based on the above experimental evaluation, FV-tree can achieve almost the same accuracy in distributed settings as in centralized settings. Then, we test FV-tree with differential privacy. We set the number of parties M to 4, and the unbalanced ratio θ is still set to 80%. To control the consumption of the privacy budget, we set the maximum depth d of a single decision tree to 3. Dataset a9a, which has a small number of instances, is set as two ensembles, each containing 20 trees. Datasets SUSY and HIGGS, which have a large number of instances, are set as one ensemble. To ensure a strict total privacy budget, PSD is not used. We evaluated the test error for different privacy budgets ϵ, as shown in Figure 9. Due to the randomness of differential privacy, we conducted 10 experiments and report the maximum, minimum, and average values. (To be fair, the default parameter settings are still used in the centralized and standalone models. Because they do not need to consider the consumption of the privacy budget, the iterations T and depth d can be increased to achieve higher accuracy.)
We can observe that the accuracy of FV-tree can still be higher than that of local training after applying differential privacy on the large-scale HIGGS and SUSY datasets. However, on the a9a dataset, due to the small amount of data, too much noise is added to the histograms, which reduces the accuracy of the model, although it remains comparable to the best result of local training. This means that our scheme performs well on large-scale datasets and can meet the needs of practical applications. The accuracy loss of FV-tree comes from the selection of the best split features. In the balanced data partition, we assume that the feature values of each dimension are i.i.d. uniform random variables and assign the same number of instances to each party. Then, the probability of selecting the best feature is the same as for PV-tree [33]. In the skewed data partition, the experiments show that FV-tree still has high accuracy. Moreover, when the data distribution is significantly skewed, we can use the weight distribution calculated by PSDs to refit the feature distribution, which can improve the probability of selecting the best feature. However, using the global distribution weight vector may cause high gradient values, which loosens the privacy bound. Under these circumstances, gradient clipping may be a feasible choice [34]. In addition, our scheme is not effective for small datasets with continuous features. This obstacle is mainly due to the large amount of noise added to the histograms, which reduces the effectiveness of the gradient histogram. Therefore, in small-scale dataset scenarios, we still need to use other federated GBDT frameworks.
Figure 7: Comparison of the test errors given different unbalanced ratios θ, where the number of parties is set to 2. (a) a9a (b) HIGGS (c) SUSY.

Communication Overhead.
The communication cost of our federated GBDT system is constant. First, in the pretraining phase, assuming that the depth of a PSD is $d_{psd}$, each party has to send one PSD model and receive M − 1 PSD models, so the cost is $M(2^{d_{psd}} - 1)$. In the training phase, assuming that there are T trees and the depth of each tree is d, $2^{d-1} - 1$ node splits are needed per tree. Each inner node requires three communications: one vote and two histogram uploads, where the vote is a single real number. The cost of a party sending its histogram to the other M − 1 parties is $2(M - 1)n_{bin}$. When 2/3 of the signatures are received, the transaction can be sent. Let $L_{sign}$ be the length of a signature; then the cost of receiving the signatures is $\frac{2}{3} M L_{sign}$. In addition, they need to receive other parties' histograms and sign them, where the cost is  federated GBDT framework [7]. In addition, the storage cost in the permissioned blockchain can reach an acceptable level to ensure fairness and tamper-proofing.
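The individual cost terms that survive in the text above can be evaluated numerically. The sketch below only tabulates those stated terms and deliberately does not combine them into a total, since the full expression is not reproduced here. Reading the histogram and signature terms as costs per histogram transaction is an assumption, as are the function name and the example signature length of 64.

```python
def communication_terms(num_parties, num_trees, depth, psd_depth, n_bin, sign_len):
    """Evaluate the individual cost terms stated in the text (no total is formed)."""
    splits_per_tree = 2 ** (depth - 1) - 1
    return {
        # Pretraining: one PSD sent and M - 1 received, each with 2^d_psd - 1 nodes.
        "pretraining_psd": num_parties * (2 ** psd_depth - 1),
        # Number of node splits over all T trees.
        "splits_total": num_trees * splits_per_tree,
        # Sending the histogram to the other M - 1 parties: 2 (M - 1) n_bin.
        "histogram_send": 2 * (num_parties - 1) * n_bin,
        # Receiving two-thirds of the signatures: (2/3) M L_sign.
        "signature_receive": 2 * num_parties * sign_len / 3,
    }


# Example with the experiment settings (M = 4, T = 500, d = 8, d_psd = 8, 16 bins);
# the signature length of 64 is a made-up value.
terms = communication_terms(num_parties=4, num_trees=500, depth=8,
                            psd_depth=8, n_bin=16, sign_len=64)
```

None of the evaluated terms depends on the number of training instances; they vary only with the number of parties, the tree shape, the bin count, and the signature length.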

Fairness and Efficiency.
We regard the growth process of the decision tree as multiple cooperative games. The Shapley value is used to measure the individual contribution in cooperation, and its fairness is widely recognized. In our design, every node split is fair; the details are given in Section 5-C. In addition, because the benefits obtained by the participants at each split come directly from the gain value, the whole training process is also fair. For example, in the early stage of training, each split produces a large gain, and each party obtains a larger contribution value from it. On the other hand, the computational complexity of the split Shapley value is acceptable. From (8), only M is variable, and in cross-organization federated scenarios, M is usually relatively small. Besides, we do not need to traverse all the split points in the histograms to calculate U, because the global best split has already been determined as $split_q$.

Security.
It is assumed that all parties aim to maximize revenue. They will act honestly in the feature voting stage because, in the absence of any data from other parties, they can only vote for the feature with the highest gain value according to their real data to obtain voting rewards. Similarly, in the gradient histogram communication phase, a modified gradient histogram will be detected by the similarity test, so the histogram transaction cannot be published. Hence, a party can obtain the histogram contribution reward only if it publishes its real histograms.
Further, if there are malicious participants in the alliance, our system is still robust. First, suppose that in the feature voting stage, multiple malicious participants conspire to select a feature f′ with less gain to enter the global candidate features. As long as one honest party selects another feature f, f′ is still likely not to become the split point, because the gain value of f may be greater. Conversely, if the gain value of f is less than that of f′, then f′ is a good segmentation feature, and dividing nodes according to f′ will not cause great harm to the model. Second, in the histogram aggregation stage, because the gradient histogram of a malicious party needs to be verified by two-thirds of the parties, the colluding malicious parties must reach two-thirds of the total number to have a model-damaging histogram accepted by the federation.
Figure 9: Comparison of the test errors given different total privacy budgets ϵ. The unbalanced ratio θ is set to 80%, and the maximum depth d of a single decision tree is set to 3. Dataset a9a is set as two ensembles, each containing 20 trees. Datasets SUSY and HIGGS are set as one ensemble with 50 trees. (a) a9a (b) HIGGS (c) SUSY.
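The two-thirds acceptance rule for histogram transactions amounts to a simple quorum check. The sketch below is illustrative: whether the submitter's own signature counts toward the quorum, and how signatures are validated, are not specified here, so the code simply counts distinct signers that belong to the federation. The function names are hypothetical.

```python
def quorum_size(num_parties):
    """Smallest integer that is at least two-thirds of num_parties."""
    return (2 * num_parties + 2) // 3


def is_endorsed(signers, parties):
    """True when enough distinct federation members have signed the histogram."""
    valid = set(signers) & set(parties)  # ignore signers outside the federation
    return len(valid) >= quorum_size(len(parties))


parties = ["P1", "P2", "P3", "P4"]
accepted = is_endorsed({"P1", "P2", "P3"}, parties)   # 3 of 4 signatures
rejected = is_endorsed({"P1", "P2"}, parties)         # 2 of 4 signatures
```

With four parties the quorum is three, so a colluding minority of two cannot push a stretched histogram through on its own.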

Conclusion
In this paper, we present a closed-loop federated GBDT system. In our scheme, each party can obtain a well-performing model and be allocated a fair contribution index. At the same time, with the help of blockchain and the decentralized verification mechanism, the calculation of the contribution index remains secure, the results cannot be tampered with, and additional functions such as delayed payment or audit can be provided as needed. Besides, the communication overhead is constant, which enables our method to fit federated GBDT tasks with large-scale datasets very well. Due to privacy constraints, this scheme may not be suitable for small-scale datasets, which is the direction we plan to study in future work [35].
Data Availability
The experiment source data used to support the findings of this study have been deposited at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. The experimental result data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.