An Improved Self-Organizing Migration Algorithm for Short-Term Load Forecasting with LSTM Structure Optimization

Establishing an accurate and robust short-term load forecasting (STLF) model is both required and beneficial for the safe operation and rational dispatching of a power system. Although deep long short-term memory (LSTM) networks have been widely used in load forecasting applications, they still have problems to address, such as unstable network performance and long optimization time. This study proposes an adaptive step size self-organizing migration algorithm (AS-SOMA) to improve the predictive performance of LSTM. First, an optimization model for LSTM prediction is developed, which divides the search for the LSTM structure into two stages. The first stage optimizes the number of hidden layers, and the second optimizes the number of neurons, time step, learning rate, epochs, and batch size. Then, a logistic chaotic mapping and an adaptive step size method are proposed to overcome the slow convergence of SOMA and its tendency to become trapped in local optima. Comparison experiments with SOMA, PSO, CPSO, LSOMA, and OSOMA on test function sets show the advantages of the improved algorithm. Finally, the AS-SOMA-LSTM network prediction model is used to solve the STLF problem to verify the effectiveness of the proposed algorithm. Simulation experiments show that AS-SOMA exhibits higher accuracy and convergence speed on the standard test function set and has strong prediction ability in the STLF application with LSTM.


Introduction
Electric load forecasting has a great impact on the dispatching work and production schemes of the power system [1,2]. Accurate STLF is not only necessary for the power grid's steady and safe functioning but also provides significant economic benefits to power corporations [3,4]. Research shows that reducing the prediction error by 1% can save 1.6 million dollars per year for a 10 GW power plant [5]. Moreover, as the key to intelligent power system research, intelligent load forecasting is of great significance for promoting the construction of smart cities in the future [6]. The STLF problem is to predict future power consumption by analyzing the power consumption in a past period [7]. The traditional forecasting method analyses charts, which is greatly affected by the weather and has a low degree of accuracy. With the development of statistical software and artificial intelligence technology, several prediction methods with higher accuracy have appeared. They mainly include the time series method, the autoregressive integrated moving average (ARIMA), and so on [8]. The basic idea of these methods is to use the temporal nature of the historical load data to forecast, so they have a higher prediction accuracy for data with strong temporal regularity. But their predictive ability is limited for data with many nonlinear relationships. With the development of power systems, the amount of data has become so large that its nonlinear relationships become more complex. Intelligent algorithms are mainly machine learning methods represented by the support vector machine (SVM) [9], random forest (RF) [10], and artificial neural network (ANN) [11]. Most of these algorithms require setting the time features manually, and the temporal correlation of the data must be fully considered. For long-term data series, their predictive ability is limited.
The deep LSTM network, a recent time series forecasting model, is widely used in STLF problems [12]. According to their composition, methods for using an LSTM network to solve STLF problems are classified as the mixed model (M-model) or optimization robustness than GA and SOMA in 25 test functions. mNM-SOMA [31] uses the NM crossover operator to find the optimal leader, and it performs best in comparison with SOMA, GA, and PSO. The CCMA-ES-SOMA [28] model uses the CCMA-ES algorithm to detect the feasible region of the optimization problem quickly. SOMA has obvious advantages in solving NP-hard problems. However, there is little literature on using it for the structural optimization of neural networks. Therefore, a SOMA improvement scheme is proposed in this study to solve the LSTM structure optimization problem.

LSTM and Parameter Optimization
2.1.1. LSTM Memory Unit. LSTM addresses the gradient explosion and vanishing problems of the RNN. In an RNN, later nodes perceive earlier nodes more and more weakly as the number of network layers increases, so the network appears to forget earlier information as time passes. In short, LSTM can perform better on longer sequences than a normal RNN. Figure 1 shows the operation of the LSTM memory unit. Compared with the RNN, LSTM adds a memory unit specifically for saving historical information. The input, forget, and output gates control how the history information in the network is updated.
The forward propagation process of LSTM can be expressed as formulas (1)-(5), where W_xf, W_hf, W_xi, W_hi, W_xo, and W_ho are the weight matrices, b_i, b_f, b_c, and b_o are the biases corresponding to the weights, σ is the sigmoid function, and ⊙ is the element-wise product. The input and output vectors of the hidden layer of the LSTM at time step t are x_t and h_t. The memory unit is c_t, and the input gate controls how much of the network's current input data x_t flows into the memory cell, in other words, how much can be saved to c_t.
The forget gate is a key component of the LSTM unit, which controls what information is retained and what is forgotten. It helps to avoid the gradient vanishing and explosion problems triggered when the gradient propagates backwards in time. The forget gate determines what historical information will be discarded, that is, how much the information in the memory cell c_{t-1} of the previous moment affects the current memory cell c_t.
The output gate controls the effect of the memory cell c_t on the current output value h_t, that is, which part of the memory cell is output at time step t.
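For reference, the gate descriptions above correspond to the standard LSTM forward pass, which can be written in the notation of this section as below. This is a reconstruction under assumptions rather than a verbatim copy of formulas (1)-(5): the gate symbols i_t, f_t, and o_t, the candidate-state weights W_xc and W_hc, and the ordering of the equations are assumed here.

```latex
\begin{align}
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + b_f\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + b_o\right) \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{align}
```

In this form, the input gate i_t scales what is written to c_t, the forget gate f_t scales what is kept from c_{t-1}, and the output gate o_t scales what part of c_t reaches h_t.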
Because of its excellent performance, LSTM is used for a large number of sequence learning tasks, such as robot control [32], speech recognition [33], time series prediction [34,35], and market prediction [36,37].

Deep LSTM.
The classical LSTM model is composed of an input layer, a hidden layer, and an output layer. A deep LSTM can be formed by stacking multiple (≥2) hidden layers.
Each layer solves a portion of the task before passing it on to the next layer, until the final layer provides the output. Graves et al. built a deep LSTM by stacking LSTM hidden layers [38] and applied it to the speech recognition problem, where it showed obvious advantages in benchmark tests. The deep LSTM architecture is defined as a model that consists of multiple LSTM layers. The upper LSTM hidden layer provides sequential output to the lower layer instead of outputting a single value. This shows that when building the model, the depth of the network is more important than the number of LSTM cells in a given layer. The structure of the time-expanded deep LSTM recurrent neural network is shown in Figure 2.

Optimization of Deep LSTM Parameters.
The LSTM recurrent neural network is a stable technique for time series processing, and the number of hidden layers in the LSTM changes the level of abstraction of the input observations. However, adding hidden layers increases the training time and memory cost exponentially. Meanwhile, the vanishing of gradients between layers weakens network performance, and this phenomenon becomes more significant when there are many layers. It leads to slow updates in the hidden layers closer to the input layer, so the effectiveness and efficiency of convergence decline sharply and the network falls into local minima more easily. As shown in Figure 3, the input to the LSTM is a three-dimensional tensor consisting of batch size, time step, and features, which are described below.
The current mainstream training method for DNNs is gradient descent, in which the model first makes predictions from its inputs. Then, the predicted values are compared with the actual values to estimate the error, the error under the specified loss function is used to update the model weights, and the process is repeated. The time step is the amount of feature data that is input to the model at a time, and its size determines the overall size of the model's single input. In addition, it also affects how the model's prediction results map to time.
The network model weights are updated based on a subset of the training data, called a batch. Batch size is the number of samples from the training data set used to estimate the error gradient, where a sample is the set of feature values over the time steps described above. Because the error gradient is a statistical estimate, the more training data is used to compute it, the more accurate the estimate and the more likely the weights of the network are to be tuned well, which improves the model's performance. Thus, the more samples are used for each estimate, the higher the cost of estimating the error gradient. Batch size is an important hyperparameter that affects the dynamics of the learning algorithm.
The learning rate is a hyperparameter of the network model. It controls how much the model weights are updated in response to the estimated error and therefore the speed at which the model adapts to the problem. Lower learning rates require more training cycles for adequate training, while higher learning rates lead to rapid changes in the network weights and require fewer training cycles. However, if the learning rate is too high, the model may converge quickly to a suboptimal solution, while if the learning rate is too low, the training process may stagnate.
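One common way to balance these two failure modes, and the one this paper adopts for the learning rate, is to start from a relatively large value and let it decay exponentially during training. The sketch below illustrates the idea only. The function name, the per-epoch decay form, and the decay rate of 0.99 are assumptions; the initial value 0.03414 and the 338 epochs are taken from the final model reported later in the paper.

```python
def exponential_decay(initial_lr: float, decay_rate: float, epoch: int) -> float:
    """Learning rate after `epoch` epochs: initial_lr * decay_rate ** epoch."""
    if not 0.0 < decay_rate <= 1.0:
        raise ValueError("decay_rate must be in (0, 1]")
    return initial_lr * decay_rate ** epoch


# Assumed decay rate of 0.99 per epoch, sampled every 100 epochs.
schedule = [exponential_decay(0.03414, 0.99, epoch) for epoch in range(0, 338, 100)]
```

With such a schedule, only the initial learning rate has to be searched by the optimizer, because the later values follow from it.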
Epochs defines the number of times the learning algorithm is trained over the entire training data set. An iteration, which consists of one or more batches, means that each sample in the training data set has completed one prediction of the model. The number of iterations directly affects the number of features learned by the model. Too many or too few iterations will result in overfitting or underfitting of the trained model, respectively.
Figure 1: Schematic diagram of LSTM memory cell.

AS-SOMA Model
2.2.1. SOMA. The three main phases of SOMA are initialization, the migration cycle, and the end of migration. In the first phase, each particle of the initial population is assigned a fitness value according to the specific problem, and the best particle is chosen as the leader. The key to SOMA is the second phase, the migration cycle, which is composed of multiple cumulative migration updates. The particle repeatedly makes small jumps toward the leader with a specific step size, guided by a randomly initialized perturbation vector PRTVector. PRTVector is a constraint variable that controls the dimensions in which the particle moves. In its defining expression, rand is a random value in [0, 1], and PRTVector is regulated by setting the coefficient prt ∈ [0, 1]. In the expression for particle migration, X^ML_{i,state} and X^ML_L represent, respectively, the particle to be migrated and the leader in the ML-th migration cycle, X^ML_temp is the new position obtained by the particle migration, and t^ML_i is the interval length of the step. The cycle stops when the accumulated migration of a particle reaches the maximum PathLength. The selected leader then leads the particles into the next round of migration, and the migration ends when the stopping condition of the algorithm is met.
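The migration of a single particle toward the leader can be sketched in Python as follows. This is a minimal illustration of the classical SOMA scheme under stated assumptions, not the MATLAB implementation used in this paper: the function names, the default values of step, path_length, and prt, the redrawing of PRTVector at every jump, and the minimization convention are all assumed here.

```python
import random


def prt_vector(dim, prt, rng):
    """PRTVector: 1 lets a dimension move, 0 freezes it (1 when rand < prt)."""
    return [1 if rng.random() < prt else 0 for _ in range(dim)]


def migrate(particle, leader, fitness, step=0.11, path_length=3.0, prt=0.1, rng=None):
    """Jump toward the leader in increments of `step` up to `path_length`.

    Returns the best position visited and its fitness (lower is better).
    """
    rng = rng or random.Random()
    best_pos, best_fit = list(particle), fitness(particle)
    t = step
    while t <= path_length:
        mask = prt_vector(len(particle), prt, rng)
        # x_temp = x_start + (x_leader - x_start) * t * PRTVector
        candidate = [x + (l - x) * t * m for x, l, m in zip(particle, leader, mask)]
        candidate_fit = fitness(candidate)
        if candidate_fit < best_fit:
            best_pos, best_fit = candidate, candidate_fit
        t += step
    return best_pos, best_fit


def sphere(x):
    """Toy fitness function used only for this illustration."""
    return sum(v * v for v in x)


new_pos, new_fit = migrate([4.0, -3.0], [0.5, 0.5], sphere, rng=random.Random(1))
```

Because the particle keeps the best position it visited, a migration never makes its fitness worse. A path_length greater than 1 lets the particle overshoot the leader, which is how the classical scheme explores beyond the current best position.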

Shortcomings of SOMA.
SOMA has significant flaws in the initialization and migration cycle phases due to its design principles. For example, the initialization scheme relies on idealized randomness and the update mode uses a fixed step, which causes slow convergence and makes the optimization prone to falling into local optima. These problems are described in detail below.
To follow the law of biological populations, SOMA adopts a random scheme in the initialization stage. However, this random initialization scheme may lead to problems: the particles may lie only in a locally optimal region, the population density may be too large, and the spacing between individuals may be too small. As a result, slow convergence, local optima, and premature convergence can occur in the optimization process.
In the migration cycle stage, because the migration step has a fixed value, the particles cannot migrate fully and effectively. Steps that are too large or too small affect the execution efficiency and convergence ability of the algorithm to varying degrees. SOMA selects the particle with the best fitness as the leader of each migration, which can effectively guide the population towards a better direction of migration. But this selection lacks diversity of guidance and may even lead to incorrect guidance in the early stage of algorithm execution, when the leader is only locally optimal.

AS-SOMA.
In this paper, the logistic chaotic mapping and the adaptive step size method are used to improve SOMA in the initialization stage and the migration cycle stage, respectively. With the chaotic mapping, the particles can be evenly distributed over the whole search space in the initialization stage, which promotes the balance between exploitation and exploration in the update process. The logistic map is one of the simplest chaotic maps, and it is described by formula (9), where c_i represents the i-th chaotic number, c_i ∈ (0, 1), and μ is an adjustable parameter.
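The logistic map can be sketched in a few lines of Python. The recurrence c_{i+1} = μ c_i (1 − c_i) is the standard form of the map. The function names and the linear scaling of each chaotic number into a range [lower, upper] are assumptions made for this illustration.

```python
def logistic_sequence(c0, n, mu=4.0):
    """Return n chaotic numbers generated by c_{i+1} = mu * c_i * (1 - c_i)."""
    sequence, c = [], c0
    for _ in range(n):
        c = mu * c * (1.0 - c)
        sequence.append(c)
    return sequence


def chaotic_positions(c0, n, lower, upper, mu=4.0):
    """Scale each chaotic number into [lower, upper] to place n particles in one dimension."""
    return [lower + c * (upper - lower) for c in logistic_sequence(c0, n, mu)]


# 200 positions in one dimension with bounds 0 and 120, starting from c0 = 0.6.
positions = chaotic_positions(c0=0.6, n=200, lower=0.0, upper=120.0)
```

Different initial values c0 give different sequences, so one sequence per dimension can be obtained by varying c0. Initial values such as 0.25, 0.5, and 0.75 should be avoided with μ = 4, because the map then reaches a fixed point.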
In the initialization stage of AS-SOMA, multiple chaotic sequences are generated by changing the initial value c_0 of the logistic chaotic map. Each chaotic number obtained is used as a coefficient and multiplied by the extent of the search space in the corresponding dimension, so that the particles are evenly distributed over the whole search space. In the corresponding calculation, p_jmax and p_jmin are the upper and lower bounds of the particles in the j-th dimension, respectively. Assuming that p_jmax = 120 and p_jmin = 0, Figure 4 shows that there is no blind area when c_0 = 0.6, i = 200, and μ = 4. Therefore, the value of μ is 4 in this paper, c_i is evenly distributed in the interval (0, 1), and the particles in the search space are uniformly distributed in (0, 120).
The adaptive step method replaces the fixed step of the original algorithm in the AS-SOMA migration cycle. The method stores the step sizes of successfully migrated particles in an archive of success information S_step, whose accumulated contents form the mean of a normal distribution. The next migration of the particles is guided by random steps drawn from this normal distribution. In this paper, the mean μ_step is initialized to 0.21 for better results [39] and updated at the end of each migration cycle. In the update formula, c is a constant between 0 and 1, and S_step represents the step size archive of successfully migrated particles.
Figure 5. The parameters to be optimized include the number of hidden layers, number of neurons, time step, learning rate, epochs, and batch size. To make the network weights converge more stably during the model's training process, this paper uses exponential decay for the learning rate hyperparameter, which decreases exponentially as the number of LSTM training epochs increases. So, the optimization process only searches for its initial value.
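The adaptive step mechanism described above can be sketched as follows. Since the update formula is only described in words here, the sketch assumes a common archive-based form, μ_step ← (1 − c) μ_step + c · mean(S_step), and clears the archive after each migration cycle. The class name, the standard deviation of 0.1, the value c = 0.1, and the clamping bounds are likewise assumptions and are not taken from this paper.

```python
import random
import statistics


class AdaptiveStep:
    """Draws steps from N(mu_step, sigma) and adapts mu_step from successful steps."""

    def __init__(self, mu_step=0.21, c=0.1, sigma=0.1, rng=None):
        self.mu_step = mu_step   # initial mean, 0.21 as stated in the text
        self.c = c               # constant between 0 and 1 (value assumed)
        self.sigma = sigma       # standard deviation (assumed, not given in the text)
        self.archive = []        # S_step: steps of successfully migrated particles
        self.rng = rng or random.Random()

    def sample(self, low=0.01, high=1.0):
        """Draw one step and clamp it to [low, high] (assumed bounds) so it stays positive."""
        return min(high, max(low, self.rng.gauss(self.mu_step, self.sigma)))

    def record_success(self, step):
        """Store a step that led a particle to a better position."""
        self.archive.append(step)

    def end_of_cycle(self):
        """Blend the archive mean into mu_step, then clear the archive."""
        if self.archive:
            mean_success = statistics.fmean(self.archive)
            self.mu_step = (1 - self.c) * self.mu_step + self.c * mean_success
            self.archive.clear()


stepper = AdaptiveStep(rng=random.Random(0))
step = stepper.sample()
```

In a migration loop, sample would replace the fixed step, record_success would be called whenever a jump improves the particle, and end_of_cycle would be called once per migration cycle.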

STLF Problem Based on the AS-SOMA-LSTM Optimization Model.
As shown in Figure 6, AS-SOMA, based on the above LSTM training model, encodes each particle as x_i = (h_i1, h_i2, ts_i, lr_i, mi_i, bs_i), where i denotes the i-th particle, and h_i1 and h_i2 are the numbers of neurons in the first and second hidden layers, respectively. ts_i is the time step of the model input, and lr_i, mi_i, and bs_i are the initial learning rate, the number of training epochs, and the batch size in model training, respectively. The particles are decoded into parameters of the LSTM network model for training and prediction, and the optimal parameters are obtained through continuous iteration of AS-SOMA. The procedure for solving the STLF problem based on the AS-SOMA-LSTM optimization model is shown in Figure 7. First, AS-SOMA initializes the population to generate N individuals and generates the corresponding vector PRTVector by setting the coefficient prt. Subsequently, each individual in the population is decoded, and the decoded parameters are configured in the LSTM to construct the corresponding prediction model. The model is trained and tested to calculate the root mean square error (RMSE), which is returned to AS-SOMA as the fitness function value. AS-SOMA sorts the fitness function values of the population and selects the individual with the minimum value as the leader to guide the next population migration. The population migration is repeated until the maximum number of iterations set by AS-SOMA is reached. The value of the last-generation leader gives the hyperparameters of the optimal model.
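The decoding step can be illustrated with a short sketch that turns a real-valued particle into LSTM hyperparameters. The search bounds below are hypothetical placeholders, not the values of Table 5, and the names LSTMConfig and decode as well as the clip-and-round rule are assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical (lower, upper) bounds per dimension, in the particle's order.
BOUNDS = {
    "h1": (1, 200),
    "h2": (1, 200),
    "ts": (1, 30),
    "lr": (1e-4, 1e-1),
    "mi": (50, 500),
    "bs": (8, 128),
}


@dataclass
class LSTMConfig:
    h1: int    # neurons in the first hidden layer
    h2: int    # neurons in the second hidden layer
    ts: int    # time step of the model input
    lr: float  # initial learning rate
    mi: int    # training epochs
    bs: int    # batch size


def decode(particle):
    """Clip each coordinate to its bounds and round every dimension except lr."""
    values = {}
    for (name, (low, high)), x in zip(BOUNDS.items(), particle):
        x = min(high, max(low, x))
        values[name] = x if name == "lr" else int(round(x))
    return LSTMConfig(**values)


config = decode([47.2, 86.4, 11.0, 0.03414, 338.3, 24.1])
```

In the full procedure, the decoded configuration would be used to build and train the two-layer LSTM, and the test RMSE of that model would be returned to AS-SOMA as the fitness of the particle.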

Experimental Parameters.
MATLAB is used to conduct comparative experiments on the fifteen test functions of CEC2015 [40] in this paper. According to their basic characteristics, the CEC2015 functions can be divided into the following four categories: F1 and F2 are single-peak functions, F3, F4, and F5 are basic multipeak functions, F6, F7, and F8 are three mixed functions, and F9-F15 are seven synthetic functions. AS-SOMA is compared experimentally with the mainstream swarm intelligence algorithm PSO and its modified algorithm CPSO, and the parameter settings of each algorithm are given in Table 1.

Algorithm Evaluation.
To assess the convergence accuracy of AS-SOMA, the experiments are evaluated by the mean and standard deviation of the error of the best result of each run on the functions, where the error is f(x) − f(x*). AS-SOMA is then compared with the other algorithms over multiple runs. In the corresponding expression, f(x*) is the global optimal value of the function and i is the index of the run on the function. Wilcoxon's rank-sum test is performed on the obtained experimental data to verify the significance of the performance differences between individual algorithms. If the P value is less than 0.05 (H = 1), the results of the current algorithm are significantly different from those of AS-SOMA. The convergence rate is compared by observing how quickly the fitness value on each benchmark function converges.
Table 2 shows the experimental results of each algorithm on the CEC2015 benchmark function set. "+", "=", and "−" indicate that the accuracy of the corresponding algorithm is better than, similar to, and worse than that of AS-SOMA, respectively. Bold type indicates the best mean and the best variance. As shown in Table 2, AS-SOMA achieves the best experimental results on both single-peak functions when compared with the five comparison algorithms. Wilcoxon's rank-sum test results show that the performance of AS-SOMA is significantly better than that of the other comparison algorithms on function F2. On function F1, AS-SOMA has a convergence performance similar to that of SOMA, and it is significantly better than four of the five comparison algorithms. On the basic multimodal functions, AS-SOMA achieves the best result on two of the three functions. Wilcoxon's rank-sum test results show that AS-SOMA is significantly better than PSO, CPSO, SOMA, LSOMA, and OSOMA on the test functions (F4, F5), (F4, F5), (F3), (F3, F4, and F5), and (F3, F5), respectively.
On the mixed functions, the experimental results of AS-SOMA on function F6 are clearly better than those of the other algorithms. It is obvious that the experimental results of AS-

Algorithm
Parameter setting PSO, CPSO [41] w = 0.8, Step max = 0.4, Step max = 0.7 OSOMA [43] prob = 0.5
STLF schematic diagram of AS-SOMA-LSTM optimization model
To sum up, AS-SOMA can solve the synthetic functions effectively. These results are mainly attributed to the step size adaptive mechanism and the population initialization mechanism based on the logistic chaotic mapping. The interaction between the two mechanisms improves the solving accuracy of the algorithm.

The Convergence Speed.
To further verify the optimization effect of the algorithm in the search process, Figure 8 shows the average convergence results of the compared algorithms over 25 independent runs on the 15 benchmark functions of the CEC2015 set. The ordinate is the natural logarithm of the average of the 25 independent run results of each algorithm. The abscissa represents the sampling points, which are taken at FES = 1000 and whenever mod(FES, 10000) = 0.
As can be seen from the experimental results in Figure 8, AS-SOMA achieves the best performance on F1, F2, F5, F6, F8, F10, F11, F13, and F14 while being competitive on F4, F7, F9, F12, and F15. In addition, CPSO has the best convergence speed on F3, OSOMA has the best convergence speed on F4, and SOMA has the best convergence speed on F7. However, the convergence speed of PSO and LSOMA is not the best on any benchmark function.
The experimental results in Figures 8(e) and 8(k) show that AS-SOMA has excellent convergence speed compared with the other algorithms. Together with the experimental data in Table 2, this shows that AS-SOMA performs excellently on F5 and F11 in terms of both solving accuracy and convergence speed. It can be concluded that AS-SOMA has good convergence in solving basic multimodal functions and synthetic functions.

Search Behavior Analysis.
In order to explain the convergence performance of AS-SOMA, this paper analyzes its search behavior by examining the population evolution of AS-SOMA on function F4. Schwefel's function has a global minimum that is far from the other local optima, and it is a typical deceptive problem, so it is hard to escape from a local optimum. Figure 9 shows the search process performed by AS-SOMA on F4 when the population size is 100 and the generation number G is 1, 3, 5, and 10, respectively. It is obvious that Schwefel's function has a very complex landscape. Figure 9(b) shows the result after three generations of evolutionary search when the population of size 100 is initialized uniformly and randomly in the solution space. It can be observed that the population migrates toward the global optimum while maintaining diversity. Figure 10 shows how the migration step of AS-SOMA individuals changes during the search on Schwefel's function, where the abscissa represents the evolution time. It is easy to see that the step decreases as the number of evaluations increases. In the early stages of the search, the population explores the search space tentatively. The archive contains little successful step information at this time, so the population has little to learn from, and the larger step ensures the search speed of the algorithm. After a period of searching, the success information in the archive gradually increases.
The step then begins to decrease steadily through the adaptive regulation mechanism, which reduces the probability of particles falling into a local optimum.
, and air conditioning system power (ASP) (kWh). This paper preprocesses the original total active power data by replacing null values and outliers and by accumulating the data. As shown in Table 3, the statistical unit is expanded from hours to days. The target problem is the prediction of the daily electricity load for the next week. In addition, this paper focuses on the prediction of the total active power (TAP), which is the sum of the power consumption of the appliances [44-46]. The data set is partitioned at a 3:1 ratio, as shown in Figure 11(a): the first three years of data are used to train the model and the fourth year to test it. Figure 11(b) shows the statistical indicators of maximum, minimum, mean, median, range, standard deviation, skewness, and kurtosis, which give a descriptive statistical analysis of the TAP.

Model Evaluation.
Figure 11: The total active power data.
The mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percent error (MAPE) are the metrics used to evaluate the predictive performance of the models. In their definitions, ŷ_i is the predicted value, y_i is the true value, and the value range of RMSE is (0, +∞).
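The four metrics can be computed with the standard definitions, sketched below in plain Python. The function names are chosen for this illustration, and reporting MAPE in percent (multiplied by 100) is an assumption; the same quantity is sometimes reported as a fraction.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, in the same unit as the load itself."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean of the absolute differences between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percent error in percent; undefined if a true value is 0."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values for illustration only, not data from this study.
y_true = [100.0, 200.0]
y_pred = [110.0, 190.0]
scores = (mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred))
```

RMSE and MAE are expressed in the unit of the load, which makes them easy to interpret, while MAPE is scale-free and allows comparison across data sets of different magnitude.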

Prediction Experiment Results and Analysis.
To reduce the effect of randomness in individual runs, each algorithm is run independently ten times, and the maximum, median, minimum, mean, and standard deviation of its results are recorded. Wilcoxon's rank-sum test is used to compare the performance of the algorithms.
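The evaluation protocol above can be sketched with the Python standard library. The rank-sum test below uses the usual normal approximation without tie correction. The function names, the use of the sample standard deviation, and the example values are assumptions made for this illustration; the example values are not results from this study.

```python
import math
import statistics


def summarize(values):
    """Maximum, median, minimum, mean, and sample standard deviation of the runs."""
    return {
        "max": max(values),
        "median": statistics.median(values),
        "min": min(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
    }


def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test (normal approximation). Returns (z, p)."""
    pooled = sorted((value, group) for group, sample in enumerate((a, b)) for value in sample)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        i = j + 1
    n1, n2 = len(a), len(b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    expected = n1 * (n1 + n2 + 1) / 2
    spread = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_a - expected) / spread
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Made-up RMSE values of two algorithms over five runs each.
runs_a = [1.0, 2.0, 3.0, 4.0, 5.0]
runs_b = [6.0, 7.0, 8.0, 9.0, 10.0]
z, p = rank_sum_test(runs_a, runs_b)
significant = p < 0.05
```

A P value below 0.05 is read as a significant difference between the two sets of runs. With only ten runs per algorithm, the normal approximation is coarse, so an exact test may be preferable when P is close to the threshold.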

Hidden Layer Optimization.
The structure optimization of LSTM is divided into two stages. The first stage optimizes the number of hidden layers, which has a small search space and takes only integer values. The optimization of the number of nodes in each hidden layer is the premise of structural optimization. The other hyperparameters are fixed in this stage: the hidden layer neurons, learning rate, training epochs, batch size, and time steps are set to 200, 0.001, 500, 16, and 14, respectively.
The data from the previous fourteen days are used to predict the data for the next 7 days, and the Adam optimizer [47] is used to complete the weight update. This paper compares the prediction results to determine the optimal number of hidden layer nodes.
For the STLF data, Table 4 shows the prediction performance of LSTM with different numbers of layers. It can be seen that the network performs better with two hidden layers, and the difference compared with one layer and four layers is significant. A model with too few hidden layers has insufficient capacity to express the problem. Conversely, when there are too many hidden layers, the vanishing of gradients between layers weakens network performance. In the optimization of the prediction model described below, the number of hidden layers of the LSTM is therefore set to two.
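The fourteen-days-in, seven-days-out setup corresponds to a sliding window over the daily series, which can be sketched as follows. The function name, the stride of one day, and the placeholder series are assumptions for this illustration.

```python
def make_windows(series, time_step=14, horizon=7):
    """Pair each run of `time_step` past days with the following `horizon` days."""
    inputs, targets = [], []
    for start in range(len(series) - time_step - horizon + 1):
        inputs.append(series[start:start + time_step])
        targets.append(series[start + time_step:start + time_step + horizon])
    return inputs, targets


# Placeholder daily series of 30 values, not the data set used in this study.
daily_load = [float(day) for day in range(30)]
X, y = make_windows(daily_load)
```

Each element of X holds 14 consecutive daily values and the matching element of y holds the next 7. Grouping these samples into batches, with the load as the single feature, gives the three-dimensional input of batch size, time step, and features that the LSTM expects.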

Optimization of Other Parameters.
The second stage optimizes the other parameters, including the number of neurons in the hidden layers, time step, learning rate, epochs, and batch size. To make the model converge more stably in the training process, this paper uses exponential decay for the learning rate hyperparameter. Table 5 shows the search space of each parameter when AS-SOMA is used to optimize the network model. The mutation probability of GA in the comparison is 0.001. The parameter settings of the other algorithms are consistent with Table 1. As a control algorithm, AS-SOMA_1 implements the improvement in the initialization phase but not in the migration phase. The population size of all algorithms is 10, and the termination condition is 500 fitness function evaluations.
For the final AS-SOMA-LSTM optimization model, the best network structure has 47 hidden nodes in the first layer and 86 hidden nodes in the second layer, with a time step of 11, a batch size of 24, a learning rate of 0.03414, and 338 epochs. The comparison between the AS-SOMA-LSTM model prediction results and the original data is shown in Figure 12. Table 6 shows that AS-SOMA achieves the best results on each metric and outperforms the state-of-the-art methods by 0.44%, 0.37%, 0.117%, and 0.005% in terms of mean MSE, mean RMSE, mean MAE, and mean MAPE, respectively. Table 7 shows the RMSE statistics of the experimental results of the different algorithms in STLF based on the LSTM optimization model. It can be seen that AS-SOMA achieves the best results in maximum, median, and average among the eight algorithms. AS-SOMA has more advantages than AS-SOMA_1 in terms of average value and standard deviation. Wilcoxon's rank-sum test shows that there is no significant difference in performance between these two algorithms, unlike the comparison with the other five algorithms. This phenomenon shows that the two SOMA improvement schemes contribute equally to the final performance, and the two schemes jointly improve the performance of AS-SOMA.
AS-SOMA ranks fourth out of seven algorithms in the minimum index. One reason is the randomness of the algorithms: each algorithm has a certain probability of obtaining the optimal value of the problem. Another is that the adaptive step mechanism that AS-SOMA adds to SOMA improves the convergence ability and stability of the algorithm but also reduces its exploration ability. OSOMA, which improves SOMA with a reverse learning mechanism, ranks first in the minimum index. The idea of reverse learning is to make the particle migrate, with a certain probability, in the direction opposite to the best particle. This mechanism greatly enhances the exploration ability of the algorithm, but the statistics show that its results over multiple experiments are inconsistent. According to the combination of mean and standard deviation in Table 7, AS-SOMA has obvious advantages in solving the STLF problem with LSTM. Figure 13 shows the average convergence curves of the eight algorithms on the STLF problem with LSTM. Figure 13(a) shows the poor performance of GA at the initial stage of the search. The reason is that the initial fitness function value of GA in one search reaches 1893.852, because random initialization gave the particle a set of poor solutions. When such a particle is decoded into the LSTM, the model is not trained properly, which makes the prediction performance of the network poor. Figure 13(b) is obtained by trimming the original convergence curve. It can be seen that AS-SOMA has a certain advantage over the whole search process. OSOMA, based on reverse learning, has the worst performance among the eight algorithms, which shows that the scheme based on reverse learning is not suitable for LSTM structure optimization. Figure 13(c) shows the eight algorithms in the early stage of the search. It can be seen that CPSO, A-MsPSO, and AS-SOMA based on logistic chaotic mapping initialization converge faster in the early stage of the search. In the initialization stage, the whole population is evenly distributed over the entire search space, which ensures the exploration ability and diversity of the population in the early stage.
Figure 13(d) shows the eight algorithms in the late stage of the search. It can be seen that AS-SOMA converges quickly in the later stage of the search. The reason is that the improved scheme based on the adaptive step size effectively solves the problem of poor convergence caused by the fixed step size. Additionally, dynamically adjusting the step size balances the exploration and exploitation of the algorithm.
In the early stage of the search, PSO is slightly ahead of SOMA, while in the late stage SOMA gradually accelerates its convergence and overtakes PSO. The reason is that the movement of each particle in PSO is guided by the optimal individual of the population. This guidance pushes the particles closer to gbest in every dimension through vector subtraction. Similarly, the movement of each particle in SOMA is influenced by the leader of the population. However, not all dimensions of a particle migrate, because PRTVector intervenes in the migration process. Therefore, in LSTM structure optimization, the full-dimensional migration of PSO lets the population approach better network parameters quickly in the early stage of the search. However, when the population concentrates near the optimal solution in the late stage, this multidimensional motion tends to make the population skip over the optimal parameters. SOMA can fix certain dimensions and migrate the others, so it approaches the optimal solution gradually by controlling the direction.
The convergence process of AS-SOMA is verified by the variation of the LSTM hyperparameters in Figure 14. In the early stage of the search, the time step, epochs, and batch size of the LSTM prediction model fluctuate to some extent. The hyperparameters stabilize quickly after the 100th optimization and gradually approach the optimum. Figure 14(b) shows that the learning rate remains fairly steady during the convergence of AS-SOMA. Figure 15 shows the degree of correlation among time step, learning rate, epochs, and batch size, whose absolute values are 0.37, 0.026, 0.066, and 0.029, respectively. The highest correlation is between time step and batch size, with a value of 0.37. Therefore, it is necessary to analyse the constraints among hyperparameters in the process of combinatorial optimization in order to train the network better.
This paper constructs the AS-SOMA-LSTM optimization model to solve the STLF problem. The original SOMA is improved by a logistic chaotic mapping and an adaptive step size method, and its solution accuracy and convergence speed are verified. The LSTM optimization model based on AS-SOMA is proposed and applied to the practical problem of STLF. First, the optimal number of hidden layers is determined by the exhaustive method. Then, the remaining five hyperparameters are optimized by using AS-SOMA, including the number of neurons, time step, learning rate, epochs, and batch size. Simulation experiments show that the AS-SOMA-LSTM model has significant advantages over other methods.

Data Availability
The dataset used to support the findings of this study can be accessed upon request.

Consent
Not applicable.