Application of Multiattention Mechanism in Power System Branch Parameter Identification



Introduction
Accurate identification of branch parameters is essential to the development of modern power systems [1]. Intelligent identification of the steady-state branch parameters of the power grid, together with efficient deployment of the grid regulation and control system, helps guarantee the safe and stable online operation of large power grids [2]. Stable and effective management of power systems depends on accurate prediction of future branch parameters over different time ranges. Most existing power grid branch parameter identification methods are model-driven, with relatively low identification accuracy and poor reliability. Reliable and effective branch parameter identification can be applied to online applications of transmission systems such as state estimation and power flow calculation, which improves the reliability of power transmission and the credibility of dispatching decision support and supports correct power grid analysis and decision-making. This greatly improves the practical level of dispatching automation systems as a whole and is of great significance for promoting sustainable development and a harmonious society.
For many years, researchers have put forward various methods of branch parameter identification. These methods are mainly divided into four categories. (1) Theoretical calculation method: in long-term practical work, line parameters are obtained from design manuals and product catalogs according to empirical values or approximate calculation. However, due to environmental factors and changes in operating conditions, theoretical calculations cannot reflect the changes in the real parameters of transmission lines. PMU-based identification, in turn, has not been popularized, because at present the coverage of PMU devices is not wide enough and their cost is too high.
In the research on parameter identification, many works are combined with machine learning. In order to maintain the stability of the power grid, Eskandarpour and Khodaei [3] proposed using knowledge discovery methods and statistical machine learning to predict the risk of failure of components and systems. Wang et al. [4] chose Random Forest (RF) as the base classifier of AdaBoost for feature construction, which improved the detection accuracy of the model. Although machine learning approaches have advanced power system branch parameter identification, some issues remain to be tackled in ML-based branch parameter identification. First, the traditional least squares method is not robust: when the input data contain too much noise, or noise is introduced in the measurement process, its identification results become very poor. Second, methods like support vector regression (SVR) map the input data into a higher-dimensional space and carry out regression there; however, SVR depends on the selection of parameters and kernel function and is greatly affected by the data. Finally, an ensemble method like RF determines the final prediction by combining the outputs of its trees; when it is used for regression, it cannot make predictions beyond the range of the training data, and it may overfit when modeling data with certain kinds of noise, a problem that has been verified [5].
In recent years, deep learning has developed rapidly. In particular, deep neural networks have made great progress in the fields of computer vision, natural language processing, and speech recognition [6]. The traditional deep neural network in these fields is the Convolutional Neural Network (CNN) [7], whose convolution method is shown in Figure 1(a). The data it processes are Euclidean data such as image data [8] and speech data [9]. In a power transmission system, however, the grid nodes are numerous and irregularly connected, and the data structure is as shown in Figure 1(b). Few deep learning models can be applied to such non-Euclidean data. Researchers have tried to use a fully connected neural network (FCN) for power grid branch parameter identification, incorporating massive historical data to predict the development trend of branch parameters. However, a general FCN model cannot take the topological structure of the transmission system into account, and as the number of layers increases, the model tends to overfit and becomes difficult to train, which limits its performance and makes the prediction results inaccurate.
In this work, we aim to accurately identify the parameters of power grid branches by adopting a recent graph neural network model [10] and the multihead attention mechanism [11]. Instead of stacking multiple hidden layers between input and output, this work adopts the structure of the Graph Transformer [12]. This method takes the graph data and the adjacency matrix of the graph structure as input and depends entirely on the attention mechanism to describe the relationship between input and output. The attention mechanism makes the proposed model pay more attention to global feature information and avoids the repeated convolution of deep networks, so that the proposed model expresses branch information better.
The main contributions of this paper are as follows:
(i) We propose a novel multitask Graph Transformer Network (GTN). The encoding layer of the network is constrained by the grid structure, and the multihead attention mechanism is used to consider the feature information of different branches. By fusing global information, important feature information and node information are fully captured. As far as we know, we are the first to use a Graph Transformer to capture features from power transmission systems and apply them to the task of power grid parameter identification.
(ii) The decoding layer uses fully connected layers as the decoding structure and decodes the branch feature information fused in the encoding layer according to the task information of different branches. The module can decode multiple branches in the power grid loop at the same time. Because it combines topology information and global information, our proposed model achieves higher accuracy and robustness in the experiments.
(iii) Compared with the machine learning and deep learning baselines, the proposed model performs better. In addition, the Graph Transformer structure performs well in the face of noise and data loss.
The rest of this paper is arranged as follows. In Section 2, we introduce the development of branch parameter identification over the past decades. In Section 3, we introduce how to combine a graph neural network with the multihead attention mechanism. In Section 4, we introduce and analyze the experimental results. Finally, in Section 5, we summarize the above work and point out the shortcomings of this work.

Method for Acquiring Transmission Line Parameters.
In the past decades, researchers have put forward various methods to solve the problem of parameter identification of the power grid branches.
These studies can generally be divided into the following four categories: (1) Theoretical calculation method: the theoretical calculation of line parameters is based on Carson's model [13]. The resistance, reactance, and susceptance are calculated by formula from physical parameters of the line such as the self-geometric mean distance, the mutual geometric mean distance, and the wire material, combined with external environmental factors such as soil moisture and air temperature. However, the theoretical calculation method greatly simplifies the electromagnetic model of transmission lines and does not consider the influence of uncertain factors such as temperature and wire sag [14], so the calculation results are inconsistent with the actual situation. In addition, due to environmental factors and changes in operating conditions, theoretical calculations cannot reflect the changes in the real parameters of transmission lines. (2) Parameter measurement method: transmission line parameter measurement methods are a group of techniques that test the transmission line on the spot with additional measuring devices, in either the power-on or the power-off state. They can be divided into instrument methods, digital methods, and injection measurement methods. Instrument methods measure various states of the line with instruments such as a voltmeter, ammeter, power meter, and frequency meter in the power-off state and then calculate the parameters by the corresponding formula after manual reading. Crotti et al. [15] proposed a new measurement framework for traceable measurement of PQ parameters in the power grid in the presence of grid interference. However, due to instrument problems, it is still impossible to identify the parameters accurately.
The principle of the instrument method is simple and it is easy to operate, but it suffers from inaccurate human readings and environmental interference. Digital methods improve on the experimental data of the instrument method by using a single-chip microcomputer and digital signal processing technology, which improves the measurement accuracy, but they do not fundamentally remove the shortcomings of traditional measurement methods in an actual operating voltage environment. Injection measurement methods can be applied when the line is powered off or only partially powered on. Based on the time pulse provided by GPS, they measure manually injected synchronous voltage and current signals and calculate the corresponding parameters through the transmission line model. Nezhadi et al. [16] proposed a new method that uses stationary wavelets to denoise current and voltage signals. In the frequency range where the signal energy is greater than the noise energy, accurate impedance estimation can be realized by signal injection. Ye et al. [17] proposed introducing an asynchronous time into the two-terminal fault record information and solving for it with the electric quantity constraint equation. Based on the corrected synchronized voltage and current phasors at both ends, the steady-state parameters of the transmission line were determined to identify the parameters. The injection measurement method is complicated to operate and requires additional experimental devices, and it is difficult for it to reflect the true line parameters under different working conditions and operating environments. (3) Estimation of line parameters based on SCADA: state estimation is an important part of the Energy Management System (EMS), and inaccurate parameters often lead to unsatisfactory estimation results, so SCADA data are used to estimate line parameters.
This approach mainly includes two categories: augmented state estimation and measurement residual sensitivity analysis. Debs [18] proposed a recursive filtering algorithm, which proved the feasibility of parameter estimation in power systems. Do Coutto Filho et al. [19] put forward an offline processing method for suspicious branch parameters of the power grid, which completes branch parameter identification by temporarily excluding the suspicious parameters from the state estimation process until they are corrected. Stacchini de Souza et al. [20] proposed a method of network parameter estimation and correction based on a genetic algorithm, which combines the genetic algorithm with branch power to complete system state estimation. In Chen et al. [21], a method based on long short-term memory (LSTM) and autoencoder (AE) neural networks is introduced to assess sequential condition monitoring data of wind turbines. Parameter estimation based on SCADA data uses field operation data to identify and estimate the line parameters of the whole network uniformly. Because the dimension of the state vector is increased and parameter estimation relies on equation redundancy, numerical instability may result. In addition, the measurement configuration needs to be fully considered to satisfy observability, and it is difficult to assess how measurement errors at different locations affect the estimation accuracy of a single line parameter.
(4) Estimation of line parameters based on PMU: compared with theoretical calculation, traditional measurement, and state estimation, PMU measurement can decouple a single line from the whole network and identify it independently. Ding et al. [22] proposed a sliding-window total least squares method, in which the PMU data within a sliding window are used for parameter identification and the influence of white noise is effectively overcome by minimizing the sum of squared errors in the window. Zhao et al. [23] developed and implemented an online PMU-based transmission line (TL) parameter identification system (TPIS), which can consider transmission tower geometries, conductor dimensions, estimates of line length, conductor sags, and so on to improve the accuracy of parameter identification. Asprou and Kyriakides [24] proposed a methodology for identifying and estimating erroneous transmission line parameters using measurements provided by PMUs and estimated states provided by a state estimator. However, in practical applications, there are inevitable errors in PMU measurement data, and there is a certain gap between the identification results and the theoretical values, which raises questions about the credibility and availability of the identification results. Therefore, the factors affecting the identification results need to be studied further.

Graph Neural Network.
In recent years, graph neural networks (GNNs) have demonstrated their efficiency in social networking, link prediction, traffic flow prediction, and other fields. To some extent, parameter identification of transmission lines in power transmission systems can also be regarded as a special graph node regression problem. Zhou et al. [25] showed that graph convolutional neural networks have unique advantages when dealing with graph-structured data: they consider both node features and node topology and aggregate the information of adjacent nodes by using graph convolution kernels, which can extract local features through end-to-end training. In other words, through the previously constructed adjacency matrix, the graph convolutional neural network obtains local features by aggregating the feature information of neighboring nodes.
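The neighbor aggregation described above can be illustrated with a minimal sketch of one graph convolution step. It assumes the widely used symmetric normalization with self-loops from Kipf and Welling [10]; the function name `gcn_layer`, the toy path graph, and the identity weights are hypothetical and chosen only to make the aggregation pattern visible.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution step: aggregate self + neighbours, then project."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    deg = A_hat.sum(axis=1)                    # node degrees including self-loop
    D_inv_sqrt = np.diag(deg ** -0.5)
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalisation
    return np.maximum(A_norm @ H @ W, 0.0)     # ReLU activation

# Toy 4-node path graph 0-1-2-3 (hypothetical example, not the paper's grid)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.eye(4)   # one-hot node features
W = np.eye(4)   # identity weights, so the output equals the normalised adjacency
out = gcn_layer(A, H, W)
```

With one-hot features and identity weights, row i of `out` is nonzero only at node i and its direct neighbors, which is the "local feature" behavior the paragraph refers to.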
The graph convolutional neural network was first proposed by Scarselli et al. [26], in which the computation of graph convolution is defined in the Fourier domain, while Kipf and Welling [10] proposed that a first-order Chebyshev polynomial can be used to approximate the graph convolution kernel, which greatly improved the computational efficiency of graph convolutional neural networks. However, the feature information obtained by these methods still depends on the Laplacian of the graph structure. In recent years, GAT [27] (graph attention network), GraphSAGE [28] (graph sample and aggregate), and other graph neural networks have appeared one after another. They have a common feature; that is, they assign different importance to different nodes in the neighborhood by using the attention mechanism, and they have achieved relatively good results. In addition, when an FCN is used to process data, a model with too few layers cannot be trained to fit the desired behavior, while a model with too many layers easily overfits. This inspired us to build a model on the attention mechanism; that is to say, we use the attention mechanism, instead of traditional convolution, to describe the relationship between input and output completely. This avoids the overfitting caused by overly deep models, and the attention mechanism lets the model learn to attend to important nodes and feature information.

Multihead Attention Mechanism.
The structure of the multihead attention mechanism was first proposed by Vaswani et al. [11] and was first applied in natural language processing (NLP) [29,30]. Through the attention mechanism, the network emphasizes the regions of interest by dynamic weighting and at the same time suppresses regions with irrelevant background. As the improvement of CNN metrics in visual detection and classification has slowed in recent years, the multihead attention mechanism, a structure different from CNN convolution, has performed strongly in the field of computer vision. For example, Dosovitskiy et al. [31] put forward the ViT model, which abandons the traditional CNN model, relies fully on the attention mechanism, applies the Transformer to image classification, and achieves good classification results. Carion et al. [32] combined the common CNN and transformer architectures: a CNN backbone learns the 2D representation of the input image, a position encoding supplements this representation before it is passed to the transformer, and the detection results are then predicted directly. This is the DETR (Detection Transformer) model. Zheng et al. [33] proposed a semantic segmentation model named Segmentation Transformer (SETR), which uses the Vision Transformer (ViT) as the image encoder and then adds a CNN decoder to complete the prediction of semantic maps. The above papers show that dividing the model into multiple heads that form multiple subspaces lets the model attend to different aspects of the information. In other words, multihead attention lets the network capture richer feature information and finally combines the outputs by concatenation. In this paper, the multihead attention model obtains different position information from multiple subspaces to obtain more comprehensive information.

Multitask Learning.
Multitask learning is a kind of transfer learning, which aims to use the knowledge learned from other tasks in the target task when doing multiple tasks, so as to improve the effectiveness of the target task [34,35]. Multitask learning lets the model adapt to multiple task scenarios, which can effectively increase the anti-interference ability of the model. There are two modes of multitask learning, as shown in Figures 2(a) and 2(b): hard sharing of hidden layer parameters and soft sharing of hidden layer parameters, respectively.
(i) Hard sharing of parameters: multiple tasks share the same hidden layers of the network but perform different tasks near the output of the network.
(ii) Soft sharing of parameters: different tasks use different networks, but the network parameters of different tasks are constrained by L1 or L2 regularization to encourage parameter similarity.
The model in this paper adopts hard parameter sharing, which helps reduce the risk of overfitting [36]. The more tasks we learn at the same time, the more the model has to find a representation that captures all of the tasks, and the lower the risk of overfitting to any single task. Through multitask learning, we hope to predict the parameters of multiple branches at the same time and to avoid overfitting, so as to improve the robustness of the model.

Proposed Algorithm
In this section, we first define the branch parameter identification of transmission systems. Then, we introduce the technical details of our proposed model.

Problem Statement.
Given the features of the power grid branches, the goal is to predict the true values of the line susceptance b and branch conductance g of each branch. In this paper, a multitask Graph Transformer Network is designed to achieve that. By connecting the transformer nodes in the power system, we construct a graph G(V, E) composed of a vertex set V and an edge set E representing the connectivity between vertices. Assuming that the power transmission network has N transformer nodes, for line k, we express the input features of the distribution system as $x_k = [P_i^k, P_j^k, Q_i^k, Q_j^k, U_i^k, U_j^k, y^k]$, in which i and j represent the nodes at both ends of the k-th branch, $P_i^k$ and $P_j^k$ represent the active power at both ends of the k-th branch, $Q_i^k$ and $Q_j^k$ represent the reactive power at both ends of the branch, $U_i^k$ and $U_j^k$ represent the voltage at both ends of the branch, and $y^k$ represents the susceptance to ground of the k-th branch. According to equations (1) and (2), which are derived from the π-type equivalent circuit, we can calculate the label values of line susceptance b and branch conductance g.
The inputs of our proposed multitask Graph Transformer Network are the feature matrix $X \in \mathbb{R}^{N \times 7}$ and the adjacency matrix A. The input data contain N nodes, and each node contains the above seven features. If a power transmission system topology contains M branches, the true values of the corresponding line susceptance b and branch conductance g need to be calculated for each branch.
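As a rough illustration of these inputs, the sketch below assembles a feature matrix with seven columns and a symmetric adjacency matrix from an edge list. The helper `build_inputs`, the record layout, the numeric values, and the three-row toy topology are all hypothetical; the paper does not specify how its data are stored.

```python
import numpy as np

# Column order of the seven features (an assumed ordering)
FEATURES = ["P_i", "P_j", "Q_i", "Q_j", "U_i", "U_j", "y"]

def build_inputs(records, edges):
    """records: list of dicts holding the seven features, one per graph node.
    edges: list of (a, b) index pairs describing the topology."""
    n = len(records)
    X = np.array([[r[f] for f in FEATURES] for r in records], dtype=float)
    A = np.zeros((n, n))
    for a, b in edges:
        A[a, b] = A[b, a] = 1.0   # undirected connection
    return X, A

# Made-up values for illustration only
records = [
    {"P_i": 1.00, "P_j": -0.98, "Q_i": 0.20, "Q_j": -0.18, "U_i": 1.02, "U_j": 1.00, "y": 0.010},
    {"P_i": 0.50, "P_j": -0.49, "Q_i": 0.10, "Q_j": -0.09, "U_i": 1.00, "U_j": 0.99, "y": 0.008},
    {"P_i": 0.30, "P_j": -0.29, "Q_i": 0.05, "Q_j": -0.04, "U_i": 0.99, "U_j": 0.98, "y": 0.005},
]
edges = [(0, 1), (1, 2)]
X, A = build_inputs(records, edges)
```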

Traditional Machine Learning Model.
Traditional machine learning models can be used for parameter identification of transmission system branches. The most typical one is the linear regression method, which minimizes the sum of squared errors. The data are divided into a training set and a test set, the sum of squares Q of the total error on the training data is minimized to obtain the linear regression model, and the model is then applied to the test set to verify its quality. Without noise or other types of interference, such as node data loss, the results of the linear regression method are very close to the true values. In an actual transmission system, however, noise interference and data loss often occur in the process of collecting data. When this happens, the linear regression model becomes unsuitable because of its poor robustness: even a little noise in the data makes the prediction results deviate greatly. In addition to the linear regression method, we compare our model with some classical machine learning methods, including SVR (support vector regression) and RF (Random Forest), and with the deep learning method FCN, to show the superiority of our proposed model.
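The sensitivity of least squares to noisy inputs can be seen in a small synthetic sketch. The data, the true weights `w_true`, and the noise level below are made up for illustration and are unrelated to the paper's grid data; the fit uses `numpy.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # synthetic features
w_true = np.array([2.0, -1.0, 0.5])    # hypothetical true parameters
y = X @ w_true                         # noise-free targets

# Least squares on clean measurements
w_clean, *_ = np.linalg.lstsq(X, y, rcond=None)

# Least squares when the measured features are corrupted by noise
X_noisy = X + 0.5 * rng.normal(size=X.shape)
w_noisy, *_ = np.linalg.lstsq(X_noisy, y, rcond=None)

err_clean = np.linalg.norm(w_clean - w_true)
err_noisy = np.linalg.norm(w_noisy - w_true)
```

On clean data the parameters are recovered up to floating-point error, while noise in the measured features biases the estimate, which is the lack of robustness the paragraph describes.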

Overall Framework.
Our goal is to learn more fusion information by making full use of local and global structures, so as to make the prediction results more robust and accurate.
As shown in Figure 3, the multitask Graph Transformer consists of two parts: the encoding part and the decoding part. The encoding part in Figure 3 takes the features $h^{(l)}$ and the adjacency matrix A of the graph-structured data of the power grid topology nodes as inputs and attends to different branch information and feature information in different subspaces by using the multihead attention mechanism. Finally, we concatenate these different subspaces, so that the information learned before is fused and input into the decoding part. The structure of the decoding part is shown by the decoder in the figure; it is composed of m parallel two-layer fully connected networks, where m is the number of branches in the distribution system. A separate branch network is used to fit the characteristics of each branch and achieve accurate identification of the branch parameters.

Application of Multiattention Mechanism.
The multihead attention mechanism has played an important role in many fields, including NLP and computer vision. Therefore, we consider applying the multihead attention mechanism, combined with a graph neural network, to the parameter identification of transmission system branches. The specific implementation of the multihead attention mechanism is shown in the encoder part of Figure 3. Firstly, the node features $h^{(l)}$ and the adjacency matrix A are taken as input data. In equations (3)-(6), $h^{(l)}$ denotes the output features of layer l; when $l = 0$, $h^{(0)}$ is the original input data. In order to introduce the input data into different subspaces, the target node features are first divided into source node features $h_i^{(l)}$ and pointing node features $h_j^{(l)}$ according to the adjacency matrix. The source node feature $h_i^{(l)}$ and the pointing node feature $h_j^{(l)}$ are, respectively, converted into the query vector $q_{c,i}^{(l)}$ and the key vector $k_{c,j}^{(l)}$ by linear functions, in which $W_{c,q}^{(l)}$, $W_{c,k}^{(l)}$, $b_{c,q}^{(l)}$, and $b_{c,k}^{(l)}$ are all trainable coefficients. In equation (5), the attention score is the ratio of the dot product of q and k to $\sqrt{d}$, d represents the number of hidden neurons in each subspace (that is, each head), and $\alpha_{c,ij}^{(l)}$ represents the attention coefficient of a branch relative to the central node in the c-th subspace. In addition, as shown in equation (6), the pointing node feature $h_j^{(l)}$ is transformed into a value vector $V_{c,j}^{(l)}$ by a linear function.
In equation (7), || represents the operation of concatenating multiple subspaces. At first, the value vector $V_{c,j}^{(l)}$ is multiplied by the attention coefficient, and then the information of the pointing node j is transmitted to the source node i according to the adjacency matrix A to form the source node feature $h_i^{(l)}$. Then, the source node feature $h_i^{(l)}$ is transformed into the feature $r_i^{(l)}$ by a linear function in equation (8), and the source node feature $h_i^{(l+1)}$ of the next layer is obtained by adding the new source node features $r_i^{(l)}$ and $h_i^{(l)}$ in equation (9). The above is the realization process of the multihead attention mechanism.
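The steps above can be sketched in NumPy as follows. This is one possible reading of the description, not the authors' implementation: it assumes softmax normalization of the scaled dot-product scores over each node's neighbors, and it assumes that the next-layer feature is the sum of the concatenated aggregated values and a linear projection of the input features. The helper `init_params`, the random weights, and the toy graph are hypothetical.

```python
import numpy as np

def init_params(in_dim, head_dim, num_heads, rng):
    """Random weights for illustration only (hypothetical initialisation)."""
    def linear(out_dim):
        return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)
    heads = [(linear(head_dim), linear(head_dim), linear(head_dim))
             for _ in range(num_heads)]
    residual = linear(head_dim * num_heads)
    return heads, residual

def graph_attention_layer(H, A, heads, residual):
    """H: (n, f) node features; A: (n, n) adjacency, A[i, j] > 0 if j points to i.
    Assumes every node has at least one neighbour."""
    alphas, outputs = [], []
    for (Wq, bq), (Wk, bk), (Wv, bv) in heads:
        Q = H @ Wq + bq                              # queries of source nodes i
        K = H @ Wk + bk                              # keys of pointing nodes j
        V = H @ Wv + bv                              # values of pointing nodes j
        scores = Q @ K.T / np.sqrt(Q.shape[1])       # dot product scaled by sqrt(d)
        scores = np.where(A > 0, scores, -np.inf)    # keep only connected pairs
        scores = scores - scores.max(axis=1, keepdims=True)
        weights = np.exp(scores)
        alpha = weights / weights.sum(axis=1, keepdims=True)  # attention coefficients
        alphas.append(alpha)
        outputs.append(alpha @ V)                    # aggregate neighbour values
    H_hat = np.concatenate(outputs, axis=1)          # concatenate the subspaces
    Wr, br = residual
    return H_hat + (H @ Wr + br), alphas             # add the linear residual branch

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 7))                          # 4 nodes, 7 features each
heads, residual = init_params(in_dim=7, head_dim=8, num_heads=2, rng=rng)
H_next, alphas = graph_attention_layer(H, A, heads, residual)
```

Masking with the adjacency matrix is what constrains attention to the grid topology: a node receives information only from nodes it is connected to, and each head learns its own weighting of those neighbors.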

Multitask Regression Model.
In our proposed GTN model, we use a hard parameter sharing mechanism, and the specific implementation of the multitask regression model is shown in the decoder part of Figure 3. According to Figure 3, the decoding part of the multitask Graph Transformer Network proposed in this paper realizes decoding through multiple two-layer fully connected networks. The encoding layer in the figure takes the topology of the power grid and the node feature information as input, captures rich feature and semantic information in different subspaces, and fuses the feature information of the different subspaces by concatenation, which ensures that the encoding layer passes fused global information to the decoding layer. Each branch of the power grid has its own characteristics, so each branch is decoded by its own fully connected layers to complete the task of branch parameter identification. Each branch network fits the characteristics of its own branch, so as to achieve accurate prediction.
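A minimal sketch of such a hard-sharing decoder is given below: one shared encoding vector per sample is fed to m parallel two-layer heads, each producing two outputs (taken here to be b and g for one branch). The flattened encoding `z`, the layer sizes, and the helper names are assumptions for illustration; only the count of 17 branches comes from the data set described in this paper.

```python
import numpy as np

def init_heads(in_dim, hidden_dim, m, rng):
    """One two-layer fully connected head per branch (random illustrative weights)."""
    return [(rng.normal(scale=0.1, size=(in_dim, hidden_dim)), np.zeros(hidden_dim),
             rng.normal(scale=0.1, size=(hidden_dim, 2)), np.zeros(2))
            for _ in range(m)]

def decode(z, heads):
    """z: (batch, in_dim) shared encoding. Returns (batch, m, 2) predictions."""
    outputs = []
    for W1, b1, W2, b2 in heads:
        hidden = np.maximum(z @ W1 + b1, 0.0)   # first layer with ReLU
        outputs.append(hidden @ W2 + b2)        # second layer: two values per branch
    return np.stack(outputs, axis=1)

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 64))                    # hypothetical encoder output, batch of 5
heads = init_heads(in_dim=64, hidden_dim=32, m=17, rng=rng)
pred = decode(z, heads)
```

Because every head reads the same encoding, gradients from all branch tasks update the shared encoder, while each head keeps its own weights for its branch.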

Dataset.
Our data set comes from actual grid line data collected by China Electric Power Research Institute, with a collection frequency of once every minute. The data set contains 8460 sets of data, and there are 17 lines that need to be identified. We selected data of seven days, including 6000 sets of data as our training data, 1000 sets of data as test data, and the remaining 1460 sets of data as validation data. Figure 4 shows the topology information of the collected data, which shows the connection mode between nodes.

Baseline and Noise Settings.
In order to show that our model produces the branch parameters closest to the real results under various error conditions, we added three kinds of noise to the original data and compared the identification results without noise and with noise as follows:
(1) Gaussian noise: according to the method proposed by Brown [37], we added two levels of Gaussian noise to the node features, which made their SNR reach 50 dB and 30 dB, respectively.
(2) Node loss: in an actual distribution system, there are often cases where a node line is damaged and data cannot be collected. In order to simulate this problem in model training, we randomly select one node from each group of data and set its features to 0.
(3) Loss of node features: in the process of collecting circuit data, it is common for a sensor to be damaged, so that a branch current or voltage cannot be collected. In order to simulate this situation, we randomly select one of the seven features of each group of data and set it to 0, so as to reproduce the case in which some data cannot be collected during the actual operation of the power grid.
In order to prove the validity of our proposed model, we adopt the following methods as baselines:
(1) Linear regression: the least squares method is commonly used in engineering because of its simple principle and small amount of calculation. However, because it has few parameters and cannot take global information into account, its parameter identification accuracy drops considerably when a considerable amount of noise appears in the data set, so its robustness is poor and it cannot identify branch parameters accurately.
(2) SVR: support vector regression is a machine learning method for regression tasks based on the support vector machine. As in the support vector machine, a kernel function is used to map the features to a high-dimensional space in which the regression is carried out, but the result partly depends on the integrity of the training data and the choice of the kernel function.
(3) RF: Random Forest is a classical algorithm in machine learning. It is a combination of multiple decision trees, each of which makes a prediction for the target task, and the final value is obtained by averaging the predicted values of all decision trees. Its advantage is that, for unbalanced data, it can balance errors and maintain prediction accuracy when features are lost. However, when the Random Forest is faced with noisy data, it tends to overfit and cannot achieve accurate identification.
(4) FCN: the fully connected neural network is one of the most commonly used neural networks in deep learning. It constantly updates the weights of its neurons through training to learn to identify different branch parameters. For fully connected neural networks, however, overfitting is a fatal weakness, and in the face of missing data or heavy noise, the model cannot reach its full performance.
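The three noise settings described above can be sketched for a single sample as follows. The SNR definition used here (ratio of mean signal power to noise variance over the whole feature matrix) is an assumption and may differ from the procedure in [37]; the function names and the synthetic sample are hypothetical.

```python
import numpy as np

def add_gaussian_noise(X, snr_db, rng):
    """Add zero-mean Gaussian noise so that the sample reaches the target SNR in dB."""
    signal_power = np.mean(X ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return X + rng.normal(scale=np.sqrt(noise_power), size=X.shape)

def drop_random_node(X, rng):
    """Node loss: set all features of one randomly chosen node (row) to 0."""
    X = X.copy()
    X[rng.integers(X.shape[0])] = 0.0
    return X

def drop_random_feature(X, rng):
    """Feature loss: set one randomly chosen feature (column) to 0."""
    X = X.copy()
    X[:, rng.integers(X.shape[1])] = 0.0
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 7))                   # synthetic sample: rows are nodes
X_noisy = add_gaussian_noise(X, snr_db=30, rng=rng)
measured_snr = 10 * np.log10(np.mean(X ** 2) / np.mean((X_noisy - X) ** 2))
X_node_lost = drop_random_node(X, rng)
X_feature_lost = drop_random_feature(X, rng)
```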

Evaluation Indicators and Parameter Settings.
In model evaluation, it is usually necessary to choose evaluation indexes to measure the quality of the model. Considering that our task is a regression task, we use MAE, MSE, and RMSE as the evaluation indexes of the model. MAE, the mean absolute error, is calculated as follows:

$$\mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left| y_{\mathrm{test}}^{(i)} - \hat{y}_{\mathrm{test}}^{(i)} \right|$$

MSE, the mean square error, is calculated as follows:

$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( y_{\mathrm{test}}^{(i)} - \hat{y}_{\mathrm{test}}^{(i)} \right)^{2}$$

RMSE, the root mean square error, is calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( y_{\mathrm{test}}^{(i)} - \hat{y}_{\mathrm{test}}^{(i)} \right)^{2}}$$

where m represents the number of samples in the test set, $y_{\mathrm{test}}^{(i)}$ represents the true value of the i-th branch in the test set, and $\hat{y}_{\mathrm{test}}^{(i)}$ represents the predicted value of the i-th branch in the test set. In the comparison diagram of model training in Figure 5, we choose MAE and RMSE as evaluation indicators.
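For reference, the three evaluation indexes can be computed with a few lines of NumPy. The function names and the toy values are illustrative only.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean square error."""
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return np.sqrt(mse(y_true, y_pred))

# Toy values: two exact predictions and one that is off by 2
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])
scores = {"MAE": mae(y_true, y_pred),
          "MSE": mse(y_true, y_pred),
          "RMSE": rmse(y_true, y_pred)}
```

Because RMSE squares the errors before averaging, a single large deviation raises it more than it raises MAE, which is why the two are reported together.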
The parameters of the models are set as follows: (1) LinearRegression: as a basic algorithm commonly used in parameter identification, it is included as a basic model. (2) SVR: the radial basis function (RBF) kernel of SVR is set with C = 100 and γ = 0.1, and the SVR in the scikit-learn library [38] in Python is used to realize the support vector regression in this paper. (3) RF: the number of trees is set to 300, the minimum number of samples in each leaf is set to 35, and the minimum number of samples required for splitting is set to 3. We determined the hyperparameters of SVR and RF by five-fold cross-validation. (4) FCN: as the most commonly used baseline model in deep learning, and in order to prevent overfitting, our fully connected neural network has two layers, with 512 and 256 hidden neurons, respectively. In the FCN model, we use the rectified linear unit (ReLU) as the activation function.
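The baseline settings listed above map onto scikit-learn estimators roughly as follows. The paper states only that SVR comes from scikit-learn; using scikit-learn for linear regression and Random Forest, and mapping the second SVR parameter to `gamma`, are assumptions of this sketch.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# Baseline estimators configured with the hyperparameters reported in the text
baselines = {
    "LinearRegression": LinearRegression(),
    "SVR": SVR(kernel="rbf", C=100, gamma=0.1),   # gamma = 0.1 is an assumed mapping
    "RF": RandomForestRegressor(
        n_estimators=300,       # number of trees
        min_samples_leaf=35,    # minimum samples in each leaf
        min_samples_split=3,    # minimum samples required to split a node
    ),
}
```

Note that scikit-learn's `SVR` fits a single target, so predicting b and g for several branches would need one model per target, for example through `sklearn.multioutput.MultiOutputRegressor`.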
RMSE is used as the evaluation index in Tables 1 and 2. Comparing the results in the tables with the model training diagram, we can find the following.
When there is little or no noise, the least squares method performs well and is simple to use. However, actual conditions are rarely ideal. The accuracy of the linear regression method drops rapidly once noise is added to the experimental data; in particular, when the signal-to-noise ratio reaches 30 dB, the other models also perform rather poorly. Although some of the other machine learning algorithms perform well on some of the above tasks, their accuracy does not meet our requirements, which is due to the limitations of the machine learning models themselves. In contrast, because the proposed model takes into account the relationship between topological structure and multisource data, its accuracy does not decrease much as the noise increases; it is robust to noise. In addition, a comparison with FCN shows that the error of our proposed model drops rapidly and tends to converge after the 10th epoch, with little change in accuracy thereafter, which shows the superiority of our model among deep learning-based parameter identification algorithms.
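To make the noise levels concrete, the sketch below shows one common way to corrupt a measurement series at a prescribed signal-to-noise ratio. It assumes additive zero-mean white Gaussian noise, which this section does not state explicitly; the function names and the sinusoidal test signal are likewise illustrative.

```python
import math
import random


def add_noise(signal, snr_db, rng):
    """Return signal plus zero-mean Gaussian noise at the target SNR (in dB)."""
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR in dB from a clean series and its noisy copy."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


rng = random.Random(0)  # fixed seed for repeatability (assumption)
clean = [math.sin(0.01 * k) for k in range(20000)]  # illustrative signal
noisy = add_noise(clean, 30.0, rng)

print("target SNR: 30.0 dB, measured SNR:", round(measured_snr_db(clean, noisy), 2), "dB")
```

At 30 dB the noise power is one thousandth of the signal power, so the noise standard deviation is about 3.2% of the signal's RMS value.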

The Practical Application of Our Proposed Method.
In actual power grid transmission operation, parameter identification, as the basis of power grid regulation and control systems, has always been an active research topic. Most existing power grid branch parameter identification methods are model-driven; they have low identification accuracy and poor reliability and perform poorly when there is noise in actual power grid operation. The experimental comparison shows that our proposed model achieves high prediction accuracy and good robustness under the various noise conditions tested, because it considers the topological structure constraints of power grid branches and uses the multihead attention mechanism to focus on key branches and feature information. Compared with the traditional parameter identification methods, the identification performance is improved. If the trained model is deployed to the terminal of the power grid dispatching center, its predictions can help solve the problem of intelligent identification of steady-state branch parameters of the power grid, improve the reliability of the analysis results of the dispatching system, and more effectively guarantee the online safe and stable operation of the large power grid.

Discussion.
Number of heads in Graph Transformer: the number of heads in Graph Transformer corresponds to the number of subspaces used by the model. The larger the number of subspaces, the richer the information fused by the model, but also the more parameters the model has and the slower its training. Choosing an appropriate number of heads is therefore a problem that needs to be solved. At present, an appropriate number of heads can only be selected through repeated experiments. In addition, because Graph Transformer has many parameters, future research can reduce them by model pruning or neural architecture search, making the model lighter while preserving its accuracy. This is more conducive to deploying the model to the terminal of the power grid dispatching center and improves the reliability and real-time performance of power grid branch prediction.
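The trade-off between the number of heads and the model size can be illustrated with a simple parameter count. The sketch below assumes that every head keeps a fixed width and has its own query, key, and value projections; this assumption, the function name, and the dimensions are hypothetical and do not describe the exact layer used in this paper.

```python
def attention_params(d_in, d_head, num_heads, bias=True):
    """Count the weights in the Q, K, V projections of one attention layer.

    Assumes each head has its own d_in x d_head projection for queries,
    keys, and values (fixed per-head width).
    """
    per_projection = d_in * d_head + (d_head if bias else 0)
    return 3 * num_heads * per_projection


# Illustrative dimensions (assumption): 64 input features, 32 per head.
for heads in (1, 2, 4, 8):
    print(f"{heads} head(s): {attention_params(64, 32, heads)} parameters")
```

Under this assumption the projection parameters grow linearly with the number of heads. If instead a fixed total width were divided among the heads, the count would stay constant, so the growth described in the text depends on the fixed per-head width.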
How to identify branch parameters of different magnitudes? In the branch parameter identification task of this paper, the line susceptance b and the branch conductance g differ considerably in order of magnitude. If the two loss functions are simply added, the model will neglect the accuracy of the branch conductance g and focus mainly on the accuracy of the line susceptance b; therefore, we identify the two targets separately to avoid this situation. In future research, we can set dynamic weights that assign different weight values to the line susceptance b and the branch conductance g of the same line. By training an attention-based neural network, we can suppress the targets of large magnitude and promote the targets of small magnitude. This method will be able to identify branch parameters of different magnitudes at the same time.
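The magnitude problem can be seen in a small numerical sketch. The values below are hypothetical, and the weighting shown is a fixed inverse-scale weighting chosen for illustration; it is not the dynamic, learned weighting proposed above as future work.

```python
def mse(y_true, y_pred):
    """Mean square error over paired true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_square(values):
    """Mean of the squared values, used here as a measure of target scale."""
    return sum(v * v for v in values) / len(values)


# Hypothetical targets: both predictions are off by 10% in relative terms,
# but susceptance b is three orders of magnitude larger than conductance g.
b_true, b_pred = [10.0, 20.0, 30.0], [11.0, 22.0, 33.0]
g_true, g_pred = [0.01, 0.02, 0.03], [0.011, 0.022, 0.033]

loss_b = mse(b_true, b_pred)
loss_g = mse(g_true, g_pred)

# Plain summation: the share of g in the total loss is negligible.
share_g = loss_g / (loss_b + loss_g)
print("share of g in the summed loss:", share_g)

# Inverse-scale weights bring both terms to the same relative error level.
weighted_b = loss_b / mean_square(b_true)
weighted_g = loss_g / mean_square(g_true)
print("weighted losses:", weighted_b, weighted_g)
```

With plain summation, gradient updates are driven almost entirely by b even though both targets have the same relative error; after rescaling, the two terms contribute equally.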

Conclusion
In this work, we propose a novel multitask Graph Transformer Network (GTN) to identify the branch parameters of the power grid. GTN uses Graph Transformer to process graph-structured input, abandons the traditional convolution for feature learning, and makes full use of the attention mechanism to aggregate branch features. Specifically, during training, the model can fuse rich global information by setting different subspaces. In addition, the attention mechanism can enhance the extraction of local information, highlight the importance of different neighbor nodes, and increase the influence of relatively important branches and features by assigning them higher weights. GTN aims to complete the task of power grid parameter identification by using the topological constraints and connections of the power grid structure. Experiments on actual data collected by China Electric Power Research Institute show that our proposed GTN model copes well with different noise conditions because it integrates global information. Compared with the traditional models, both the robustness and the identification accuracy are improved, which helps to support power grid operation and dispatching.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.