Security Feature Measurement for Frequent Dynamic Execution Paths in Software System

The scale and complexity of software systems are constantly increasing, imposing new challenges for software fault location and daily maintenance. In this paper, the Security Feature measurement algorithm of Frequent dynamic execution Paths in Software, SFFPS, is proposed to provide a basis for improving the security and reliability of software. First, the dynamic execution of a complex software system is mapped onto a complex network model and a sequence model. On this basis, the invocation and dependency relationships between function nodes, the fault cumulative effect, and the spread effect can be analyzed. The function node security features of the software complex network are defined and measured according to the degree distribution and the global step attenuation factor. Finally, frequent software execution paths are mined and weighted, and security metrics of the frequent paths are obtained and sorted. The experimental results show that SFFPS has good time performance and scalability, and the security features of the important paths in the software can be effectively measured. This study provides a guide for research on defect propagation, software reliability, and software integration testing.


Introduction
The growing complexity of software requirements leaves developers unsure of the quality of the software systems they build; in effect, the "software crisis" has still not been completely solved. How to effectively uncover the inherent characteristics of software system structure, and to recognize, measure, manage, and control the complexity of that structure, has become a key problem in overcoming the development bottleneck in the software industry.
Research on the complexity of software network structure can combine the methods of complex system science and statistical physics. Depending on the granularity, software systems can be composed of different types of software entities, such as functions, classes, subroutines, packages, and artifacts. Through the interaction of these entities, software systems achieve specific functional requirements. If the software entities are viewed as nodes and the relationships between them are abstracted as edges, the software execution process presents a nonlinear network structure according to the relationships of the entities [1] and also a linear sequence structure according to the sequential characteristics of the execution order. The software system can then be expressed as an abstracted complex network model and a sequence model, which provides a new way of thinking [2] about the description of the software system.
The root cause of the security danger hidden in software lies in the vulnerability of the entity itself. Vulnerability measures the potential danger that a software entity will be exploited in an attack. It can be discussed from the perspective of computer networks [3,4] or software static code analysis, but these perspectives ignore the integrity (whole structure) and the dynamic execution (behavior characteristics) of the software system. In addition, the degree to which software system security is threatened depends not only on the severity of the fault but also on the fault propagation capacity of the entity. If one or more functions fail, the fault may be propagated to other functions through invocation relationships and may further cause part of or the whole software system to crash, which is known as "cascading failure" [5]. Therefore, software security feature measurement should take into account both the vulnerability and the propagation of software entities.
Quantitatively measuring the security features of nodes in the software complex network is the premise and basis for further analysis of software behavior trajectory paths. At present, there are many methods for discovering important nodes in complex networks. The classic centrality-based methods include degree centrality [6], closeness centrality [7], betweenness centrality [8], eigenvector centrality [9], subgraph centrality [10], and others. The classic methods based on the random walk model include PageRank [11], LeaderRank [12], and their improved algorithm NodeRank [13]. Using an influential node mining method, Wang and Lü [14] show that the defect propagation capacity of a node is stronger when its in-degree and out-degree are larger. Based on the invocation and dependency relationships between functions and the fault probability of nodes, Huang et al. [15] calculate the fault accumulation degree of upper nodes by iterating from the leaf nodes. These methods attempt to relate software node importance to fault generation and propagation but do not yield a measurement of software security.
A sequence or path is the most basic and important way to describe the dynamic software execution process. The full execution path of the whole software reflects the occurrence order and frequency of its internal entities. However, path extraction and mining are restricted by the nested, looping, iterative, and continuous invocation relationships of entities. Most software path mining algorithms extract paths on the basis of complex networks. For example, Tang et al. [16] propose an algorithm for mining the shortest path between any two vertices in a complex network. Zhang et al. [17] minimize the length of the extracted path and reduce unnecessary time overhead by further processing repetitive structures. The GP method proposed by Nguyen et al. [18] can automatically detect and fix software vulnerabilities according to the software execution path. Murtaza et al. [19] predict possible future software defects by analyzing historical vulnerability sequence data with Markov characteristics, in order to provide adequate response time. Zou et al. [20] analyze the reliability of Digital Instrumentation and Control software systems based on the flow network model by finding sensitive paths in complex software. Because these algorithms extract paths from the network, they can suffer from repeated reading and approximate connection; moreover, these software security analyses cannot work without existing vulnerability information or real faults as training data.
In this paper, the Security Feature measurement algorithm of Frequent dynamic execution Paths in Software, SFFPS, is proposed. A complex network model and a sequence model are formed based on software dynamic execution behavior. SFFPS targets early security feature measurement, before real vulnerabilities or faults arise, and thus provides a premise for software quality and reliability evaluation. The main contributions are as follows.
(1) The software system is mapped to a complex network model and a sequence model. The nonlinear perspective effectively expresses the complex correlations between software entities, and the linear perspective captures the sequential characteristics of the dynamic execution.
(2) The behavioral nature of fault accumulation and propagation is analyzed based on the system structure of software dynamic execution, and a standard measurement of security features (vulnerability and propagation) is defined.
(3) Frequent paths in software dynamic execution are mined and weighted by the node security features. The key paths that are worthy of attention are determined by both their frequency and their security features.
The remainder of the paper is organized as follows. Section 2 gives the model construction. Sections 3 and 4 develop the definition of the security features and the SFFPS algorithm. Section 5 provides some examples. Section 6 presents the performance study of SFFPS and shows the rank of the important paths. Section 7 contains the concluding remarks.

Constructions of Complex Network Model and Sequence Model
The dynamic execution tracing of software systems consists of three phases: data collection, tracking data simplification, and data visualization, as shown in Figure 1. The modeling process for simple functions is shown in Figure 2.
Phase 1. Using the function entry and exit hooks provided by the GNU compiler toolchain (gcc), insert an analysis function at the entry and exit of each application function to trace the function execution process. The tracking results are recorded in the file trace.txt.
Phase 2. The letters "E" and "X" before the tracking addresses represent the entry and exit of a function, respectively. The simplification tool Pvtrace is used to analyze the function invocations according to the letters "E" and "X." The address translation tool Addr2line is then used to convert each address to a function name.
Phase 3. Map the function invocation order to the sequence model, and use the visualization tool Graphviz to form the complex network, which defines the global relationships between all the functions.
The correspondence between function addresses and function names follows from Figure 2. Only the addresses marked with the letter "E" are used for sequence model construction.
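The step from raw trace records to the two models can be sketched in a few lines of Python. This is a minimal illustration and not the tooling used in the paper: it assumes that each record in trace.txt is a single line of the form `E<address>` or `X<address>`, the helper name `parse_trace` is hypothetical, and the addresses are illustrative. Entry records yield the sequence model, and the function active at the time of each entry yields the caller side of an edge in the network model.

```python
def parse_trace(lines):
    """Turn entry/exit records into (call edges, entry sequence).

    Assumed record format (not confirmed by the paper): 'E<address>' on
    function entry and 'X<address>' on function exit, one record per line.
    """
    stack = []      # currently active functions, innermost last
    edges = []      # (caller, callee) pairs in order of occurrence
    sequence = []   # functions in the order they are entered
    for raw in lines:
        record = raw.strip()
        if not record:
            continue
        kind, address = record[0], record[1:]
        if kind == "E":
            if stack:
                # The innermost active function is the caller.
                edges.append((stack[-1], address))
            stack.append(address)
            sequence.append(address)   # only "E" records feed the sequence model
        elif kind == "X" and stack and stack[-1] == address:
            stack.pop()
    return edges, sequence


# Illustrative trace with made-up nesting.
trace = [
    "E0x8048690", "E0x804854d", "E0x8048585", "E0x8048656",
    "X0x8048656", "X0x8048585", "E0x80485f0", "X0x80485f0",
    "X0x804854d", "X0x8048690",
]
edges, sequence = parse_trace(trace)
for caller, callee in edges:
    print(f"{caller} calls {callee}")
```

In a real pipeline the addresses would be translated to function names before the models are built; the sketch keeps raw addresses to avoid depending on any external tool.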

The Security Feature Definition and Measurement of Function Nodes
The security feature measurement of a function node is based on the software structure; vulnerability and propagation are analyzed according to the cumulative effect and the spread effect caused by the mechanism of fault production and propagation. Global accessibility and fault tolerance with a step attenuation effect are fully considered, so the node security features are calculated according to the degree distribution and the step attenuation factor.
Definition 4 (software complex network). In a software complex network, functions are defined as the nodes, and the invocation relationships between functions are defined as the edges.
Definition 5 (vulnerability). Vulnerability of a function node is the characteristic that the node may break down because of a faulty node that it invokes, through the invocation relationship.
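Typically, the more nodes a node invokes, the more functional and the more vulnerable it is; that is, it is more likely to be affected and to fail. The vulnerability $V$ is calculated as follows:

$$V(v_i) = \mathrm{OutDegree}(v_i) + \alpha \sum_{v_j \in OS(v_i)} V(v_j),$$

where $v_i$, $v_j$ represent function nodes, $V(v_i)$ represents the vulnerability of node $v_i$, $\mathrm{OutDegree}(v_i)$ represents the out-degree of node $v_i$, $\alpha$ represents the step attenuation factor, which satisfies $\alpha \in (0, 1)$, and $OS(v_i)$ represents the direct out-neighbor set of node $v_i$.
Definition 6 (propagation). Propagation of a function node is the characteristic that a faulty function node may spread its fault to the nodes by which it is invoked.
The propagation $P$ is calculated as follows:

$$P(v_i) = \mathrm{InDegree}(v_i) + \alpha \sum_{v_j \in IS(v_i)} P(v_j),$$

where $P(v_i)$ represents the propagation capacity of node $v_i$, $\mathrm{InDegree}(v_i)$ represents the in-degree of node $v_i$, and $IS(v_i)$ represents the direct in-neighbor set of node $v_i$.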
Algorithm 1 describes the calculation process of vulnerability and propagation.
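As a hedged illustration of this calculation (not a reproduction of Algorithm 1), the sketch below assumes that each feature of a node equals its degree plus the attenuated sum of the same feature over its direct neighbors: out-neighbors for vulnerability, in-neighbors for propagation. The function name `security_features`, the symbol `alpha` for the step attenuation factor, and the small call graph are all hypothetical. The sketch also assumes an acyclic call graph and raises an error on recursion cycles, which would need separate handling.

```python
from collections import defaultdict


def security_features(edges, alpha=0.5):
    """Return (vulnerability, propagation) dicts for a call graph.

    Assumed rule: feature(v) = degree(v) + alpha * sum(feature(u)) over the
    direct neighbours u of v (callees for vulnerability, callers for
    propagation). The graph is assumed to be acyclic.
    """
    callees = defaultdict(set)   # direct out-neighbour sets
    callers = defaultdict(set)   # direct in-neighbour sets
    nodes = set()
    for caller, callee in edges:
        callees[caller].add(callee)
        callers[callee].add(caller)
        nodes.update((caller, callee))

    def accumulate(neighbours):
        memo = {}
        visiting = set()

        def score(v):
            if v in memo:
                return memo[v]
            if v in visiting:
                raise ValueError(f"cycle detected at {v!r}")
            visiting.add(v)
            memo[v] = len(neighbours[v]) + alpha * sum(score(u) for u in neighbours[v])
            visiting.discard(v)
            return memo[v]

        return {v: score(v) for v in nodes}

    return accumulate(callees), accumulate(callers)


# Hypothetical call graph: main calls X and Y; X calls E and F; Y calls E;
# E calls F; F calls G.
graph = [("main", "X"), ("main", "Y"), ("X", "E"), ("X", "F"),
         ("Y", "E"), ("E", "F"), ("F", "G")]
vulnerability, propagation = security_features(graph, alpha=0.5)
print(vulnerability["E"], propagation["E"])
```

Memoizing each node's score means every node is evaluated once, so the cost is proportional to the number of nodes plus edges, which matches the adjacency-table storage mentioned in the experiments.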

Mining Frequent Paths from Dynamic Execution with Security Feature Measurement
The importance of a software dynamic execution path takes two aspects into account: the occurrence frequency of the path and the security features contributed by the nonrepetitive nodes contained in the path. These two aspects are complementary. For example, if the software execution contains many loop bodies, a loop body and its subsets are always frequent, but because most of the nodes it contains are the same, the range of fault influence is small. Conversely, if a path contains many different nodes but occurs rarely, its impact range is large, but its occurrence possibility is small. That is to say, a path is worthy of more attention if its frequency is very high and it contains more nonrepetitive nodes.
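The interplay of the two aspects can be illustrated with a short sketch. It assumes that the security feature of a path is the sum of the node features over its distinct nodes and that only paths meeting the minimal support count are ranked. The helper names `path_security` and `rank_paths`, the support counts, and the node values are hypothetical and chosen only to show the effect.

```python
def path_security(path, vulnerability, propagation):
    """Sum node features over the distinct (nonrepetitive) nodes of a path."""
    distinct = set(path)
    return (sum(vulnerability[n] for n in distinct),
            sum(propagation[n] for n in distinct))


def rank_paths(supports, vulnerability, propagation, mincount, by="vulnerability"):
    """Keep frequent paths and sort them by one security feature, largest first."""
    column = 2 if by == "vulnerability" else 3
    rows = []
    for path, support in supports.items():
        if support >= mincount:
            v, p = path_security(path, vulnerability, propagation)
            rows.append((path, support, v, p))
    return sorted(rows, key=lambda row: row[column], reverse=True)


# Hypothetical data: a loop body (L, L, L) is very frequent but has a single
# distinct node; (A, B, C, D) has many distinct nodes but occurs only once.
V = {"E": 1.5, "F": 1.0, "L": 1.0, "A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}
P = {"E": 3.0, "F": 4.0, "L": 2.0, "A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}
supports = {("L", "L", "L"): 5, ("E", "F"): 2, ("A", "B", "C", "D"): 1}

for row in rank_paths(supports, V, P, mincount=2):
    print(row)
```

With these numbers the rare four-node path is filtered out by the support threshold, and the loop body ranks below the path with two distinct nodes even though it occurs more often.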

An Illustrative Example
The complex network in Figure 2 is a variant of the tree-like structure in Figure 3, which is redrawn for easier understanding.
Without loss of generality, the coordination factor is set to 0.5. The security features of each node are calculated as follows. Because the "main" function is special (its vulnerability is always large and its propagation is 0), it is excluded from the measurement. Frequent 2-paths are mined from the position sets of the frequent 1-paths, using the adjacent position value as an index to find the extended paths. For example, the position set of node E is {4, 8}, and its extended position set is {5, 9}. The function nodes in positions 5 and 9 both correspond to node F. So Pos(EF) = {5, 9} is obtained, sup(EF) = 2, and path EF is a frequent 2-path. The security features of the frequent 1-paths, each of which consists of a single function node, are calculated as before, and the security features of the frequent 2-paths are calculated as follows: V(EF) = V(E) + V(F) = 1.5 + 1 = 2.5; P(EF) = P(E) + P(F) = 3 + 4 = 7. Table 1 shows the security features of all the frequent paths.
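The position-index extension described in this example can be sketched as follows. This is a minimal illustration under stated assumptions, not the SFFPS implementation: the function name `mine_frequent_paths` and the ten-symbol sequence are hypothetical, positions are 1-based, each path is indexed by the positions of its last symbol, and a path is grown by looking at the symbol in the adjacent position. Only frequent paths are extended, so no candidate set is generated.

```python
from collections import defaultdict


def mine_frequent_paths(sequence, mincount):
    """Map each frequent contiguous path to the 1-based positions of its last symbol."""
    position_sets = defaultdict(list)
    for pos, symbol in enumerate(sequence, start=1):
        position_sets[(symbol,)].append(pos)
    # Frequent 1-paths seed the pattern growth.
    frontier = {p: ends for p, ends in position_sets.items() if len(ends) >= mincount}
    frequent = dict(frontier)
    while frontier:
        grown = defaultdict(list)
        for path, ends in frontier.items():
            for end in ends:
                if end < len(sequence):
                    # sequence[end] is the symbol at 1-based position end + 1.
                    grown[path + (sequence[end],)].append(end + 1)
        frontier = {p: ends for p, ends in grown.items() if len(ends) >= mincount}
        frequent.update(frontier)
    return frequent


# Hypothetical entry sequence in which E occurs at positions 4 and 8.
entry_sequence = ["main", "A", "B", "E", "F", "G", "C", "E", "F", "G"]
paths = mine_frequent_paths(entry_sequence, mincount=2)
for path, ends in paths.items():
    print("".join(path), ends, "sup =", len(ends))
```

Because a superpath can only be reached by extending a frequent path, the antimonotone property is applied implicitly: an infrequent path is dropped from the frontier and never extended.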

Experimental Results
Experiments are performed on a PC with an Intel Core processor. Figure 4 shows the runtime test of SFFPS with different support thresholds, and Figure 5 shows the scalability test with different length percentages of the sequence when the support threshold is set to 0.01. From Figure 4, SFFPS performs well in the support threshold range [0.005, 0.010]. This is due to the adjacency table used to store the complex network model, which makes the out-degree and in-degree of the nodes easier to calculate and thus speeds up the calculation of node security features. Furthermore, because the sequence model is based on the start order of each function, the detailed invocation and end time of a node are ignored, and the sequence model is shortened. Also, a position value index is used for the mining and pattern growth of the paths, which avoids candidate generation. Finally, the weight appending process is efficient because the nonrepetition strategy involves fewer nodes.
From Figure 5, SFFPS shows good scalability on Cflow. As the length of the sequence increases, the execution time of SFFPS grows essentially linearly. According to the experimental data, the number of frequent sequences also increases. This indicates that the functions of Cflow are uniformly distributed. However, the time overhead for Tar is high at around 40% of the sequence length; the number of frequent sequences increases rapidly from 194, when the percentage is 20%, to 1123. After that, the time overhead and the number of frequent sequences decrease. This indicates that Tar has more core functions and that these core functions are invoked more often in the early stage of the program.

The Security Features of the Function Nodes.
Tables 2 and 3 show the security feature rank and value of the function nodes in the newest versions of Cflow and Tar.
From Tables 2 and 3, the security features of the same function nodes are relatively stable across different versions of the same software. So, in the process of version evolution, it can be inferred and predicted that the same function should have approximately the same rank in a new software version. Also, the function rank in the old version can be used as a basis for the version upgrade process in which function nodes are removed, merged, or updated. The nodes with larger rank changes should be given more attention. Tables 4 and 5 show the frequent paths of Cflow-1.4 in the top 10 security feature ranks of vulnerability and propagation.
The paths listed in Tables 4 and 5 carry two meanings. First, the paths are frequent, which affirms that their occurrence possibility is relatively large. Second, their security feature values are larger, which evaluates the security risk of the path. Only when both are considered together can a persuasive security measurement be made.

Figure 1: Theory process of model construction.

Figure 2: Modeling process of invocation relationship between simple functions.

Figure 4: Runtime test of SFFPS with different support thresholds.

Figure 5: Scalability test of SFFPS with different percentages of sequence length.
Frequent Path. Let $F = \{f_1, f_2, f_3, \ldots, f_n\}$ be a set of function symbols. $S$ is a software execution path, and it is composed of function symbols in time-ordered occurrence. The minimal support count (mincount) is calculated as mincount = minsup × $|S|$, where minsup is a given threshold and $|S|$ is the number of function symbols in $S$. If there are $k$ symbols in $S$, $S$ is a $k$-path. Definition 8 (support number). Let $p$ be a path; the support number of $p$, denoted as sup($p$), is defined as its number of occurrences in the software execution. Property 9 (frequent path). A path $p$ is frequent if its support number sup($p$) is equal to or greater than mincount. Property 10 (antimonotone). If path $p$ is not a frequent path, any path $q$ containing $p$, that is, any superpath of $p$, cannot be a frequent path.

Table 1: The security features of frequent paths.
Runtime and Scalability Tests of SFFPS. To test the runtime and scalability of SFFPS, the two newest versions of each of Cflow and Tar are selected. The support threshold ranges from 0.005 to 0.01 for the runtime test, and the upper threshold 0.01 is used for the scalability test. The total runtime is composed of three parts: node fault feature calculation, frequent pattern mining, and weight appending.

Table 2: Rank and value of function node security features in Cflow.