
Constructing a probabilistic network from an existing dataset has long been a central research problem for Bayesian networks, particularly when the dataset is large. In this study, we improve the K2 algorithm. First, we relax the K2 algorithm's requirement for a fixed node order: node orders are generated randomly, and the best result over multiple random orders is retained. Second, a genetic incremental K2 learning method is used to learn the Bayesian network structure. The training dataset is divided into two groups, and the standard K2 algorithm is applied to the first group to find the optimal value; simultaneously, three close suboptimal values are recorded. To avoid falling into a local optimum, these four values are mutated into a new genetic optimal value. When the second group of training data is processed, only the best Bayesian network structure among the five aforementioned values is identified. The experimental results indicate that the genetic incremental K2 algorithm based on random attribute order achieves higher computational efficiency and accuracy than the standard algorithm. The new algorithm is especially suitable for building Bayesian network structures when both the dataset and the number of nodes are large.

The effective representation of uncertain knowledge is an important part of intelligent knowledge learning, and in this field the Bayesian network has long been a focus of attention. A Bayesian network is a probabilistic graphical model: it represents the dependency relationships among a group of random variables through a directed acyclic graph, and the conditional probability table (CPT) attached to each variable represents the probabilistic relationships between variables [
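As a minimal illustration of a DAG plus CPTs (the variable names Rain and WetGrass are hypothetical, not from this paper), a two-node network can be encoded as parent lists and dictionaries:

```python
# A minimal Bayesian network: DAG as parent lists, CPTs as dictionaries.
# The variables (Rain, WetGrass) and their probabilities are illustrative.
parents = {"Rain": [], "WetGrass": ["Rain"]}

# CPT entries: P(Rain) and P(WetGrass | Rain); keys are (value, parent value).
cpt = {
    "Rain": {(True,): 0.2, (False,): 0.8},
    "WetGrass": {
        (True, True): 0.9,   # P(WetGrass=T | Rain=T)
        (False, True): 0.1,
        (True, False): 0.2,
        (False, False): 0.8,
    },
}

def joint(rain, wet):
    """P(Rain=rain, WetGrass=wet) via the chain rule over the DAG."""
    return cpt["Rain"][(rain,)] * cpt["WetGrass"][(wet, rain)]

print(joint(True, True))  # 0.2 * 0.9 = 0.18
```

The joint distribution factorises over the graph, which is exactly the property that structure learning tries to recover from data.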

Bayesian network learning is divided into two steps: structure learning and parameter learning. Structure learning obtains a directed acyclic graph that represents the attribute dependencies from the training data and prior knowledge. Parameter learning then obtains the conditional probability of each node given that graph, usually expressed as a conditional probability table. Of the two, structure learning is the more difficult and is a research hotspot; it mainly concerns how to avoid falling into local optima and find the best structure when there are many attributes and few samples [

Methods based on scoring and searching: these use a scoring function to measure the degree of fit between a Bayesian network structure and the training sample set. After the scoring function is defined, a search strategy is applied to find the network structure with the highest score. The K2 algorithm is the most common example [

Methods based on conditional independence tests: these abstract the learning of the Bayesian network structure as the process of discovering, through independence tests, the conditional independence relations hidden among the variables. Spirtes proposed the SGS algorithm in 1989 [

Hybrid algorithms: because score-and-search methods and constraint-based methods each have their own advantages and disadvantages, hybrid optimization algorithms combining the two have gradually become the mainstream of research. For example, an improved whale optimization algorithm has been used to optimize Bayesian network structure; its optimization efficiency and accuracy are good, but its complexity is very high [

For the K2 algorithm, as stated by Cooper, the K2 algorithm can reconstruct a moderately complex belief network rapidly, but it is sensitive to the ordering of the nodes [

This paper presents a new Bayesian network structure learning method based on random node order and a genetic incremental search for an optimal path, and compares this method with the K2 algorithm. Experiments demonstrate that random node ordering can yield a better Bayesian network structure without expert knowledge, and that the genetic incremental structure learning method greatly improves computational efficiency on big datasets, especially when the numbers of samples and nodes are large. Its runtime is consistently shorter than that of the K2 algorithm.

The K2 algorithm [

The node sequence is given in advance; each node greedily searches for its parent set among its predecessors in the sequence according to the Bayesian scoring function, finally yielding the network structure with the best score.

The learning of the network structure can be formulated as follows: given a dataset $D$, find the structure $B_S$ that maximises the joint probability $P(B_S, D)$. Under the standard K2 assumptions (discrete variables, complete data, uniform parameter priors), Cooper and Herskovits showed that

$$P(B_S, D) = P(B_S)\prod_{i=1}^{n}\prod_{j=1}^{q_i}\frac{(r_i-1)!}{(N_{ij}+r_i-1)!}\prod_{k=1}^{r_i}N_{ijk}! \qquad (1)$$

For equation (1), the first multiplicative symbol traverses each random variable $x_i$ ($i = 1, \dots, n$), where $r_i$ is the number of possible values of $x_i$.

The second multiplicative symbol traverses all $q_i$ configurations $j$ of the parent set of $x_i$.

The last multiplicative symbol traverses all possible values $k$ of the current variable; $N_{ijk}$ is the number of samples in which $x_i$ takes its $k$th value while its parents take their $j$th configuration, and $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$.

It can be assumed that the prior probability of each structure obeys a uniform distribution, that is, the probability $P(B_S)$ is a constant that can be ignored during maximisation.

The objective is to obtain $B_S^{*} = \arg\max_{B_S} P(B_S, D)$.

As can be seen from the above equation, the product factorises over the variables, so as long as the local maximum for each variable is found, the overall maximum is obtained. The component for each variable is used as a new scoring function, as follows:

$$g(x_i, \pi_i) = \prod_{j=1}^{q_i}\frac{(r_i-1)!}{(N_{ij}+r_i-1)!}\prod_{k=1}^{r_i}N_{ijk}! \qquad (2)$$

where $\pi_i$ denotes the parent set of $x_i$.
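A sketch of this per-variable score, the Cooper–Herskovits local score $g(x_i, \pi_i) = \prod_j \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_k N_{ijk}!$, computed in log space for numerical stability (the function name and data layout are assumptions for illustration):

```python
from math import lgamma
from collections import Counter
from itertools import product

def k2_local_score(data, i, parents, r):
    """Log of the Cooper-Herskovits local score g(x_i, pa_i).

    data: list of tuples of integer-coded samples; r[v] is the number
    of states of variable v.  Uses lgamma: (m-1)! = Gamma(m), so
    log m! = lgamma(m + 1).
    """
    ri = r[i]
    # N_ijk: count of samples with parent configuration j and child value k.
    counts = Counter((tuple(row[p] for p in parents), row[i]) for row in data)
    score = 0.0
    for j in product(*(range(r[p]) for p in parents)):
        nij = sum(counts[(j, k)] for k in range(ri))          # N_ij
        score += lgamma(ri) - lgamma(nij + ri)                # log (r_i-1)!/(N_ij+r_i-1)!
        score += sum(lgamma(counts[(j, k)] + 1) for k in range(ri))  # log prod_k N_ijk!
    return score
```

For a dataset where one variable deterministically copies another, adding the copied variable as a parent yields a strictly higher local score than the empty parent set, which is the signal the greedy parent search exploits.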

The core of Bayesian network structure optimisation is to narrow the search scope through a search strategy after the scoring function has been determined. The greedy search algorithm is the most commonly used method, but it easily falls into local optima. In 2017, the authors of [

The K2 algorithm must be initialised with an order of the nodes, such that only the nodes appearing before a node in the order may be selected as that node's parents.

The disadvantages are as follows. First, the order of the nodes is hard to obtain for most real network structures, and this form of prior knowledge is not easy for experts to express. Second, the fault tolerance with respect to the node order is poor: if the order input to the K2 algorithm differs greatly from that of the real structure, the accuracy of the K2 algorithm is greatly reduced, which follows from its algorithmic theory. In this study, our first objective is to reduce the dependence on the node order. For $n$ nodes, each iteration randomly generates an array of node indices; the generation procedure (see Algorithm

Procedure randomarray(len, start, end) {
  {Input: the length len of the array and the value range [start, end]}
  {Output: an array that holds a random permutation of the numbers from start to end}
  Define the array order[1, n]
  for1: for i = 1 to n {
    Randomly generate a number between 1 and n
    If (the number is the first number) { insert the number into the array }
    Else {
      for2: for j = 1 to length(array) {
        If (the number already exists in the array) { regenerate a random number; restart for2 }
        Else { insert the number into the array; continue with for1 }
      } End for2
    } End else
  } End for1
} End randomarray
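The rejection-based procedure above produces a uniformly random permutation; a sketch with the same effect (function name assumed for illustration) can use a Fisher–Yates shuffle, which runs in O(n) instead of regenerating duplicates:

```python
import random

def random_node_order(n, seed=None):
    """Return a uniformly random permutation of node indices 0..n-1.

    Equivalent in effect to the rejection-based randomarray procedure,
    but O(n): random.shuffle implements a Fisher-Yates shuffle.
    """
    rng = random.Random(seed)  # seed only for reproducible experiments
    order = list(range(n))
    rng.shuffle(order)
    return order
```

Each call supplies one candidate node order for a K2 run; repeated calls drive the multi-restart search described next.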

The basic idea is to divide the training data into two groups and to learn a basic Bayesian network structure from the first group using the K2 algorithm. During learning, not only the current optimal value, i.e., the decision the algorithm makes at each step, but also several suboptimal values are saved [

In addition, the optimal score function value is preserved in each iteration. After the iteration, the node order of the Bayesian network with the optimal score function value is considered to be the best node order.
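The restart loop over random orders can be sketched as follows; `k2` and `score` are placeholder callables standing in for a K2 implementation and a Bayesian scoring function, not the paper's code:

```python
import random

def best_of_random_orders(data, n_nodes, n_restarts, k2, score):
    """Run K2 under several random node orders; keep the best structure.

    k2(data, order) -> learned network; score(net, data) -> float.
    Both are assumed interfaces for this sketch.
    """
    best_net, best_score = None, float("-inf")
    for _ in range(n_restarts):
        order = list(range(n_nodes))
        random.shuffle(order)        # a fresh random node order per iteration
        net = k2(data, order)        # learn a structure under this order
        s = score(net, data)
        if s > best_score:           # retain the order with the best score
            best_net, best_score = net, s
    return best_net, best_score
```

The node order behind `best_net` is the "best node order" retained by the method.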

The algorithm is divided into two parts. The first part (see Algorithm

Procedure GCDK2 {
  {Input: the parameter initialisation required by the K2 algorithm; mutation rate pm = 0.5}
  {Output: a Bayesian network and a matrix containing the optimal values and their locations}
  For each node in the order {
    While the parent set of the node can still be extended {
      For each candidate parent among the node's predecessors {
        Compute the scoring-function value of adding the candidate
      } End for
      Get the optimal value
      Get the three suboptimal values closest to it
      Get the locations of these four values
      Mutate the optimal values (with rate pm) to obtain a new genetic value
      If (the best of the five values improves the current score) {
        Input this value and its location into the matrix and extend the parent set accordingly
      } End if
    } End while
  } End for
} End GCDK2

The second part (see Algorithm

Procedure GIMK2(part) {
  For each node in the order {
    For each of the five candidate values saved for the node {
      Get the scoring-function value on the second part of the training data and keep the best
    } End for
  } End for
} End GIMK2
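The incremental step's selection idea can be sketched as follows: rather than re-running the full search on the second data batch, only the retained candidates (the optimal value, the three suboptimal values, and the mutated genetic value) are re-scored and the best is kept. `score` is an assumed scoring callable, not the paper's code:

```python
def incremental_select(candidates, second_batch, score):
    """Pick the best of the saved candidate structures on new data.

    candidates: the five structures retained from the first batch;
    score(net, data) -> float is an assumed Bayesian scoring function.
    Evaluating 5 candidates replaces a full K2 search over the batch.
    """
    return max(candidates, key=lambda net: score(net, second_batch))
```

This is where the runtime saving comes from: the cost on the second batch is five score evaluations instead of a full greedy parent search.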

To test the algorithm, the widely used ALARM, Asia, and CANCER benchmark networks were selected for the experiment. Under different sample sizes, the running time and the structural Hamming distance [
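A sketch of one common definition of the structural Hamming distance over edge sets (the cited variant may differ in details such as how reversals are counted):

```python
def shd(edges_a, edges_b):
    """Structural Hamming distance between two DAGs given as edge sets.

    Counts the edge insertions, deletions, and reversals needed to turn
    one graph into the other; a reversed edge counts as one operation.
    """
    a, b = set(edges_a), set(edges_b)
    # Edges present in both graphs but with opposite direction.
    reversed_pairs = {(u, v) for (u, v) in a - b if (v, u) in b - a}
    extra = len(a - b) - len(reversed_pairs)    # edges only in a
    missing = len(b - a) - len(reversed_pairs)  # edges only in b
    return len(reversed_pairs) + extra + missing
```

An SHD of zero, as reached in the large-sample rows of the tables below, means the learned structure matches the reference network exactly.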

The experiment adopted the Bayesian network toolbox in the MATLAB platform. The operating environment was Windows 7, Intel (

Results of learning ALARM with learn_struct_K2 (K2), GAK2, and GIMK2.

| Sample size | K2 time (s) | K2 SHD | GAK2 time (s) | GAK2 SHD | GIMK2 time (s) | GIMK2 SHD |
|---|---|---|---|---|---|---|
| 4000 | 7.928 | 2 | 5.675 | 2 | 0.861 | 2 |
| 5000 | 8.348 | 2 | 6.253 | 2 | 0.831 | 2 |
| 10000 | 11.047 | 2 | 8.754 | 2 | 1.110 | 2 |
| 15000 | 12.740 | 2 | 10.546 | 2 | 1.348 | 2 |
| 20000 | 15.456 | 1 | 11.328 | 1 | 1.547 | 1 |
| 50000 | 29.877 | 1 | 18.429 | 1 | 3.128 | 0 |
| 100000 | 55.659 | 0 | 37.839 | 0 | 5.987 | 0 |

Results of learning Asia with learn_struct_K2 (K2), GAK2, and GIMK2.

| Sample size | K2 time (s) | K2 SHD | GAK2 time (s) | GAK2 SHD | GIMK2 time (s) | GIMK2 SHD |
|---|---|---|---|---|---|---|
| 200 | 0.234 | 3 | 0.205 | 2 | 0.058 | 2 |
| 500 | 0.240 | 1 | 0.218 | 1 | 0.073 | 1 |
| 1000 | 0.255 | 1 | 0.220 | 1 | 0.057 | 1 |
| 3000 | 0.225 | 1 | 0.231 | 1 | 0.062 | 1 |
| 5000 | 0.232 | 1 | 0.236 | 1 | 0.078 | 1 |
| 10000 | 0.264 | 1 | 0.238 | 1 | 0.089 | 1 |
| 15000 | 0.291 | 0 | 0.247 | 0 | 0.141 | 0 |
| 20000 | 0.376 | 0 | 0.275 | 0 | 0.137 | 0 |
| 50000 | 0.593 | 0 | 0.279 | 0 | 0.235 | 0 |
| 100000 | 1.084 | 0 | 0.884 | 0 | 0.423 | 0 |

This algorithm relaxes the strict requirement of the K2 algorithm on node order and improves the efficiency of learning the Bayesian network structure.

In the ALARM network (comprising 37 nodes), the experiment began with a sample size of 4000, at which the running time of K2 was 7.928 s, that of GAK2 was 5.675 s, and that of GIMK2 was 0.861 s; at a sample size of 50000 the times were 29.877 s, 18.429 s, and 3.128 s; at 100000 they were 55.659 s, 37.839 s, and 5.987 s. The SHD is large at first but eventually falls to zero.

In the Asia network (comprising 8 nodes), the experiment began with a sample size of 200; at a sample size of 5000 the running time of K2 was 0.232 s, that of GAK2 was 0.236 s, and that of GIMK2 was 0.078 s; at 50000 the times were 0.593 s, 0.279 s, and 0.235 s; at 100000 they were 1.084 s, 0.884 s, and 0.423 s.

In the CANCER network (comprising 5 nodes), the experiment began with a sample size of 200; at a sample size of 5000 the running time of K2 was 0.158 s, that of GAK2 was 0.139 s, and that of GIMK2 was 0.053 s; at 50000 the times were 0.248 s, 0.218 s, and 0.188 s; at 100000 they were 0.353 s, 0.245 s, and 0.191 s.

The results indicate that, for the same sample size, the genetic incremental K2 algorithm has a shorter running time than K2 and GAK2. For Bayesian networks with different numbers of nodes, particularly when the dataset size grows with the number of nodes, the running times of K2 and GAK2 become extremely long.

As data analysis is now conducted on big data, when big data with uncertain knowledge must be analysed, especially with numerous attributes, the genetic incremental K2 algorithm can reduce the search space and considerably improve efficiency. The improved algorithm in this paper is effective; however, it has the disadvantage that the search space at each step depends on the current optimal path, so it can still fall into a local optimum. Future work should therefore combine particle swarm optimization, ant colony optimization, or another metaheuristic to avoid local optima.

The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

The authors declare that they have no conflicts of interest.

This work was supported by the training object of “Blue Project” in Jiangsu Universities in 2019.