A Constraint-Aware Optimization Method for Concurrency Bug Diagnosis Service in a Distributed Cloud Environment

The advent of cloud computing and big data applications has made data access concurrency prevalent in the distributed cloud environment. Meanwhile, security has become a critical problem for researchers to consider. A concurrency bug diagnosis service analyzes concurrent software and reasons about the concurrency bugs in it. However, frequent context switches in the execution traces of concurrent programs inevitably degrade the service performance. To optimize the service performance, this paper presents a static constraint-aware method to simplify the buggy traces of concurrent programs. First, taking the original buggy trace as the operation object, we calculate the maximal sound dependence relations based on the constraint models. Then, we iteratively check the dependence constraints and move the current event forward to extend thread execution intervals. Finally, we obtain a simplified trace that is equivalent to the original buggy trace. To evaluate our approach, we conduct a set of experiments on 12 widely used Java projects. Experimental results show that our approach outperforms other state-of-the-art approaches in terms of execution time.


Introduction
Cloud computing organizes and integrates different computing resources (including software and hardware) and provides end-users with different services in remote locations over the Internet. Testing-as-a-Service (TaaS) based on cloud platforms provides automated software testing services, saving capacity and reducing expense [1,2]. With the increasing popularity of service computing, a large number of service-related business applications have emerged, such as service composition [3,4], service recommendation [5][6][7][8], service evaluation [9][10][11], and service optimization [12][13][14][15]. As an important guarantee of QoS (Quality of Service), such as test effectiveness and efficiency, service optimization has attracted much attention from researchers in software engineering.
The prevalence of multicore architectures and big data applications today accelerates the development of concurrent systems [16,17]. To fully utilize multicore CPUs, multiple execution flows can run simultaneously, i.e., data access concurrency. However, such concurrent programs are more prone to concurrency bugs, which can pose a great threat to security and privacy in the cloud [18][19][20][21][22][23][24]. Furthermore, more scalable and efficient anomaly detection and intrusion detection techniques are needed in big data applications [25][26][27][28].
Previous studies have proposed many approaches to expose and detect all kinds of concurrency bugs [29], such as deadlocks [30,31], data races [32,33], atomicity violations [34,35], and order violations [36], and these studies have obtained many excellent results. In addition, a variety of record-replay systems have been implemented to replay concurrency bugs effectively [37][38][39]. However, little research focuses on concurrency bug diagnosis. Concurrency bugs are difficult to diagnose because frequent context switches hinder developers from understanding concurrent program execution traces. In a concurrent program execution trace, most context switches are fine-grained thread interleavings that conflict on shared memory accesses. The order in which two threads access the same shared memory forms a dependence relation, and the more dependence relations there are, the more difficult it is to reason about concurrency bugs. Additionally, the CPU performs a series of operations for each context switch, including saving the current context and loading the next one. These operations incur significant performance overhead.
Therefore, it is necessary to introduce an optimization technique that reduces shared memory dependences and increases the granularity of thread interleaving while guaranteeing that the same bugs can still be replayed. In this paper, we present a static constraint-aware approach to optimize the process of concurrency bug diagnosis. We analyze the original buggy trace offline and automatically simplify it into an equivalent trace with fewer context switches. Our experiments are conducted on 12 widely used Java concurrent benchmarks. Experimental results show that our approach performs better than or comparably to the compared method (i.e., SimTrace [40]) in both reduction and performance.
In summary, the main contributions of this paper are as follows: (1) We present a static constraint-aware optimization method, CAOM, for obtaining simplified traces that are equivalent to the original buggy traces.
(2) We demonstrate the effectiveness and efficiency of our approach through an extensive evaluation on a suite of popular multithreaded projects.
The remainder of this paper is organized as follows. Section 2 summarizes the related work. Section 3 describes the problem formulation and the research motivation. Section 4 presents our constraint-aware optimization method for concurrency bug diagnosis service. In Section 5, we conduct an empirical study to show its validity, and finally, in Section 6, we conclude the paper.

Related Work
How to optimize cloud users' service invocation cost is a hot research topic in cloud computing. Considering a service's past invocation costs, the FL-FL method was proposed in [41] to evaluate and predict a cloud user's service invocation cost. Unfortunately, it cannot generate an accurate service invocation cost. The work in [42] presented Costplus, a cost-benefit-aware cloud service scheduling method, but it failed to minimize the service invocation cost. Some researchers focus on minimizing the service invocation cost. For example, Li et al. [43] proposed FCFS, which applied the "First Come First Serve" rule to reduce the waiting time of user jobs and thereby optimize the service invocation cost. However, these methods neglected many important factors, such as user job size. Recently, CS-COM was put forward in [12]; it considers multiple factors and significantly optimizes the service invocation cost.
In addition to service invocation cost optimization, researchers have also put forward many methods to optimize the performance of other services, such as the concurrency bug diagnosis service. Concurrency bug diagnosis attempts to reason about concurrency bugs in buggy traces. An effective optimization approach for improving the performance of concurrency bug diagnosis is to simplify the buggy trace.
Trace simplification techniques can be divided into online analysis and offline analysis.
Online trace simplification techniques use vector clocks or lock assignment and then group variables using transitive reduction and thread/spatial locality. The work in [44] proposed recording only the conflicting shared memory dependences with transitive reduction, reducing the time and space overhead of recording. To further reduce the recording overhead, Xu et al. proposed FDR (Flight Data Recorder) [45] and RTR (Regulated Transitive Reduction) [46], which record strict vector dependences with hardware support. In [47], an execution reduction system combined with checkpointing was developed, which removes events irrelevant to errors. Recently, a software-only algorithm (bisectional coordination protocol) was presented in [48] to reduce shared memory dependences. Experimental results indicated that this software-only approach was effective and efficient in trace simplification.
Offline trace simplification techniques first obtain a complete buggy trace and then simplify it offline. SimTrace [40] is a classical offline trace simplification technique, but it consumes too much time on constructing the dependence graph and on random selection. Jalbert and Sen were the first to present a heuristic dynamic trace simplification approach, Tinertia [49]. To reduce the context switches in the buggy trace, they performed three operations (i.e., Remove Last, Two-Stage Consolidate Up, and Consolidate Down) iteratively and repeatedly replayed the intermediate trace to validate equivalence, which seriously increased the runtime overhead. To speed up replay, the authors of [50] simplified the process of replaying concurrency bugs using replay-supported execution reduction. However, repeated replay verification still reduced the simplification performance.
In view of the limitations of the existing approaches, we propose a new constraint-aware static trace simplification approach to optimize concurrency bug diagnosis service, as elaborated in the next section.

Problem Formulation and Motivation
In this section, we first formulate the problem of trace simplification for concurrent programs. Then, we present an example to motivate our research.

Problem Formulation.
Trace simplification attempts to obtain a simplified trace with fewer context switches that is still equivalent to the original buggy trace. Next, we give the relevant definitions in detail.
(1) Event. An event is a minimal, uninterruptible execution unit in a concurrent program execution. If an event accesses shared variables, it is a global event; otherwise, it is a local event.
(3) Context Switch (CS). In a trace, a context switch occurs when two consecutive events are performed by two different threads.
(4) Dependence Relation. A dependence relation is the minimum transitive closure over the events in the trace, denoted as $e_i \rightarrow e_j$.

Security and Communication
The dependence relation can be divided into local dependence and remote dependence according to whether the two events occur in the same thread or in different threads. If $e_i$ immediately precedes $e_j$ in the same thread, $e_i$ and $e_j$ are in a local dependence relation. Otherwise, they are in a remote dependence relation.
Note that two events in a remote dependence relation must access the same shared variable. Therefore, remote dependence relations can be further classified into conflict order and synchronization order according to the event types. Conflict order contains read→write, write→read, and write→write. Synchronization order consists of unlock→lock, fork→start, exit→join, and notify→wait.
(5) Equivalent Trace. The original trace is equivalent to the simplified trace if and only if they arrive at the same final state from the same initial program state. The simplified trace is then called an equivalent trace of the original trace.
(6) Thread Execution Interval (TEI). A maximal set of consecutive events of one thread in the trace is a thread execution interval. Because every context switch ends one TEI and begins the next, the numbers of context switches and thread execution intervals in a nonempty trace satisfy $|TEI| = |CS| + 1$, where $|TEI|$ and $|CS|$ represent the number of TEIs and the number of CSs in the trace, respectively. The goal of trace simplification is to make $|CS|$ as small as possible, that is, to make each TEI as long as possible. Therefore, while ensuring trace equivalence, we group together as many adjacent events of the same thread as possible.
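To make definitions (3) and (6) concrete, the following minimal Python sketch counts context switches and thread execution intervals. It assumes, for illustration only, that a trace is encoded as the list of thread ids executing each event; the function names are hypothetical and are not taken from the paper. For any nonempty trace it shows the number of TEIs exceeding the number of context switches by exactly one.

```python
from itertools import groupby


def context_switches(threads):
    """Count context switches: consecutive events run by different threads."""
    return sum(1 for a, b in zip(threads, threads[1:]) if a != b)


def thread_execution_intervals(threads):
    """Return the TEIs as (thread, length) pairs, i.e., maximal same-thread runs."""
    return [(t, len(list(run))) for t, run in groupby(threads)]


trace = [1, 1, 2, 1, 2, 2, 1]  # thread id of each event, in trace order
cs = context_switches(trace)
teis = thread_execution_intervals(trace)
print(cs, len(teis))  # prints: 4 5
```

Merging adjacent runs of the same thread is therefore the only way to lower the context switch count, which is exactly what the simplification aims at.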

Research Motivation.
We use the example in Figure 1 to illustrate the trace simplification problem. Assume that the variables count and list are initialized to zero and null, respectively. Two threads access the shared variables count and list concurrently under the sequential consistency (SC) memory model, and all statements are executed atomically. A null pointer exception will occur if Thread2 executes "list.clear()" between Thread1's execution of "count++" and "list.get(count)". This is a concurrency bug. Figure 2 shows an execution sequence obtained by running the example program. Following [40], we call this sequence a buggy trace. In Figure 2, there are six context switches.
When debugging concurrent programs, developers may run them many times. Each time the program fails, e.g., by crashing, hanging, or producing inconsistent results, developers have to reason about the concurrency bugs through frequent context switches, which consumes much time and energy. Trace simplification can alleviate this problem effectively. However, trace simplification faces two challenges: (1) the program semantics can easily be changed by mistake, and (2) the efficiency of simplification is greatly reduced by heavy instrumentation and dynamic verification.
In view of these challenges, we propose a new static constraint-aware approach to simplify concurrent program execution traces and optimize the process of concurrency bug diagnosis.The detailed description of our approach will be given in the next section.

A Constraint-Aware Optimization Method for Concurrency Bug Diagnosis Service
In this section, we propose a constraint-aware trace simplification method to optimize the performance of concurrency bug diagnosis. We first give a brief overview of our method and then present the algorithm with a corresponding explanation. Our method consists of three steps. The first step is preprocessing: the test program is compiled into bytecode, the bytecode is instrumented with Soot [51], and the instrumented program is run to collect the buggy execution trace. In the second step, we calculate local dependences, synchronization dependences, and read/write dependences. The final step is scheduling generation: taking the original trace as input, we iteratively check the move-forward condition of each event against the constraint dependence relations. If the condition is satisfied, the event is moved forward to extend the thread execution interval. Finally, we obtain the simplified trace. Note that, in preprocessing, we use existing instrumentation and recording tools to collect the original traces. Therefore, we mainly focus on the last two steps: dependence relation calculation and scheduling generation.

Dependence Relation Calculation.
The root cause of nondeterminism in thread scheduling is shared memory access, since multiple threads can access the same shared variable simultaneously. Therefore, accurately identifying the dependence relations between events in the original buggy trace is a precondition for trace simplification.
To calculate local dependences, we only need to divide the events into per-thread sequences, preserving their order; the number of sequences equals the number of threads. Synchronization dependences can be obtained while the original trace is being collected. Therefore, in this step, we focus on calculating remote read/write dependences. To obtain accurate remote read/write dependence relations, we successively process each pair of adjacent accesses to the same shared variable, ensuring that the value returned by each read access is always the one written by the latest write access.
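The local-dependence step needs nothing beyond program order. As a minimal sketch, assuming each event is encoded as a hypothetical (thread, op, var) tuple, grouping event indices by thread yields one ordered sequence per thread, and each pair of consecutive indices in a sequence is a local dependence:

```python
from collections import defaultdict


def local_sequences(trace):
    """Split a trace into per-thread sequences of event indices, in trace order.

    trace: list of (thread, op, var) tuples (a hypothetical encoding).
    The number of sequences returned equals the number of threads.
    """
    per_thread = defaultdict(list)
    for idx, (thread, _op, _var) in enumerate(trace):
        per_thread[thread].append(idx)
    return dict(per_thread)


def local_dependences(trace):
    """Pairs (i, j) where event i immediately precedes event j in its thread."""
    return [(a, b)
            for seq in local_sequences(trace).values()
            for a, b in zip(seq, seq[1:])]
```

For instance, for the trace [(1, "write", "x"), (2, "read", "x"), (1, "read", "y"), (2, "write", "y")], the sketch yields the sequences {1: [0, 2], 2: [1, 3]} and the local dependences (0, 2) and (1, 3).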
First, we traverse the original trace and divide the events into different lists according to the shared variables they access. Then, for the access sequence of each shared variable, we check every pair of successive events in order, starting from the first event. If the two events belong to different threads, they form a remote dependence relation. Furthermore, if both events are write accesses, they form a remote write-write dependence; if a read event precedes a write event, they form a remote read-write dependence; and if a write event precedes a read event, they form a remote write-read dependence.
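The per-variable scan can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: events are hypothetical (thread, op, var) tuples with op in {"read", "write"}, and synchronization events are left out. The text names three dependence kinds; following its rule that any successive cross-thread pair forms a remote dependence, the sketch also keeps successive read-read pairs, labeled "read-read", as pure ordering edges. This is our assumption, and it ensures that in a pattern such as W(t1) R(t2) R(t3) on one variable, the second read stays ordered after the write transitively.

```python
from collections import defaultdict

# Labels for successive cross-thread accesses, keyed by (first op, second op).
KINDS = {("write", "write"): "write-write",
         ("read", "write"): "read-write",
         ("write", "read"): "write-read"}


def remote_rw_dependences(trace):
    """Remote read/write dependences between successive accesses to a variable.

    trace: list of (thread, op, var) tuples with op in {"read", "write"};
           this encoding is hypothetical and omits synchronization events.
    Returns (i, j, kind) triples with i < j, grouped per variable.
    """
    accesses = defaultdict(list)  # shared variable -> event indices
    for idx, (_thread, _op, var) in enumerate(trace):
        accesses[var].append(idx)

    deps = []
    for indices in accesses.values():
        for i, j in zip(indices, indices[1:]):  # successive accesses only
            (ti, oi, _), (tj, oj, _) = trace[i], trace[j]
            if ti == tj:
                continue  # same thread: covered by local (program) order
            # Read-read pairs are kept as ordering edges (our assumption).
            deps.append((i, j, KINDS.get((oi, oj), "read-read")))
    return deps
```

Because only successive accesses are linked, each variable contributes at most one edge per access, so the scan is linear in the trace length when dictionary operations take constant time.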

Scheduling Generation.
Scheduling generation attempts to reduce the number of context switches through two operations (i.e., check and move forward) without breaking any of the dependence relations in the original trace. A natural idea is to employ a constraint solver that solves three kinds of constraints (i.e., write-write, read-write, and write-read dependences). Although the trace obtained this way satisfies the dependence relations of the original trace, its number of context switches may not be reduced. Therefore, we operate directly on the original trace, checking and moving forward the atomic events in sequence. Checking maintains the constraints, and moving extends thread execution intervals, removing as many context switches as possible. The detailed process of scheduling generation is shown in Algorithm 1.
Algorithm 1 takes the original trace and the dependence relations as input and produces the simplified scheduling sequence as output. The algorithm starts from the second event. If the current event has no dependent event (synchronization dependences or remote read/write dependences), it is moved forward to the position right behind the latest event of the same thread (lines 9-10). If its dependent event lies before the latest event of the same thread as the current event, the current event is likewise moved forward to the position right behind that latest event (lines 14-15). Otherwise, the event is not moved.
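The check-and-move loop can be sketched in Python as below. It is not the authors' implementation: the encoding (a list of thread ids in trace order plus a map from each event index to the earlier events it depends on), the function name simplify, and the choice to leave an event in place when its thread has no earlier event are all assumptions for illustration; Algorithm 1 itself keys its dependence map by shared variable. Local dependences are respected implicitly, because an event is only ever placed behind its own thread's latest event.

```python
def simplify(threads, deps):
    """Sketch of the check-and-move scheduling step (hypothetical encoding).

    threads: thread id of each event, in original trace order.
    deps:    {event index: [indices of earlier events it depends on]},
             covering synchronization and remote read/write dependences.
    Returns the new schedule as a list of original event indices.
    """
    if not threads:
        return []
    schedule = [0]            # the loop starts from the second event
    latest = {threads[0]: 0}  # thread -> its latest scheduled event
    for e in range(1, len(threads)):
        t = threads[e]
        if t in latest:
            # Position of the latest event of the same thread (the anchor).
            anchor = schedule.index(latest[t])
            dep_pos = [schedule.index(d) for d in deps.get(e, [])]
            # Check: move only if no dependence lies after the anchor
            # (this also covers the case of no dependences at all).
            if all(p <= anchor for p in dep_pos):
                schedule.insert(anchor + 1, e)  # move forward behind the anchor
                latest[t] = e
                continue
        schedule.append(e)    # otherwise, keep the event where it is
        latest[t] = e
    return schedule
```

For example, simplify([1, 2, 1, 2], {}) returns [0, 2, 1, 3], cutting the context switches from three to one, whereas simplify([1, 2, 1, 2], {2: [1]}) returns [0, 1, 3, 2] because event 2 must stay behind event 1. Each list.index lookup costs O(n), so the sketch runs in O(n^2) time overall.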
According to Theorem 1 in [40], any rescheduling of events in a trace that respects the dependence relation generates an equivalent trace. In our method, all feasible events are moved without breaking the dependence constraints: we first check the dependence relations of the current event and then move it forward only under the constraint conditions. That is, every event has the same dependence relations in the new schedule as in the original trace. Therefore, the trace simplified by our method is equivalent to the original trace.
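The premise of this argument can be checked mechanically. As an illustration only, the hypothetical helper below takes the thread id of each original event, a map from each event index to the earlier events it depends on (an assumed encoding), and a proposed schedule. It tests whether the schedule is a permutation of the trace that keeps per-thread program order and places every dependence source before its target, which is the condition under which Theorem 1 in [40] guarantees equivalence.

```python
def respects_dependences(threads, deps, schedule):
    """Check that a rescheduled trace keeps every dependence of the original.

    threads:  thread id of each event, in original trace order.
    deps:     {event index: [indices of events it depends on]}.
    schedule: proposed order of the original event indices.
    """
    if sorted(schedule) != list(range(len(threads))):
        return False  # not a permutation of the original events
    pos = {e: p for p, e in enumerate(schedule)}
    # Local dependences: each thread's events keep their original order.
    last_seen = {}
    for e, t in enumerate(threads):
        if t in last_seen and pos[last_seen[t]] > pos[e]:
            return False
        last_seen[t] = e
    # Remote and synchronization dependences: each source precedes its target.
    return all(pos[d] < pos[e] for e, ds in deps.items() for d in ds)
```

With threads [1, 2, 1, 2] and deps {2: [1]}, the schedule [0, 1, 3, 2] passes, while [0, 2, 1, 3] fails because event 2 would run before the event it depends on.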

Experimental Configurations.
In this section, we conduct experiments on 12 widely used Java multithreaded programs, listed in Table 1. For every program, Table 1 summarizes its lines of code (LOC), number of threads (#Thread), number of shared variables (#SV), and origin (Origin). The subjects include large-scale (LOC > 10,000), middle-scale (1,000 < LOC < 10,000), and small-scale (LOC < 1,000) programs, ranging from 73 LOC for Critical to 17,596 LOC for SpecJBB-2005. The number of shared variables is obtained using escape analysis [52]. Note that the numbers of threads and shared variables are not always integers, because (1) the results were averaged over 50 runs for each program and (2) different paths may be taken during program execution, owing to the nature of dynamic analysis and Java's dynamic thread creation. In addition, each subject has at least one concurrency bug. For example, Critical has 16 data races and 14 atomicity violations.
Algorithm 1: scheduling generation. Input: the original trace; deps, a map from each shared variable (sv) to its dependence relations. Output: scheduleSeq, the sequence of thread schedules.
To evaluate the effectiveness and efficiency of our approach, we compare it with the state-of-the-art approach SimTrace. Concretely, we design three groups of experiments to answer the following three questions: (1) Effectiveness: how many context switches can CAOM remove in trace simplification?
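(2) Efficiency: how much time does CAOM consume in trace simplification?
(3) Comparison: does CAOM perform better than SimTrace in terms of context switch reduction and time cost?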
The experiments are conducted on a Samsung notebook running 64-bit Ubuntu 14.04 and JDK 1.7, with a 3.06 GHz Intel Core 4 processor and 4 GB of memory. We use Soot to instrument bytecode programs. To collect original traces and replay concurrency bugs, we employ the existing record-replay tool LEAP [54]. We first use random testing to generate an original buggy trace for each subject. All results are averaged over 50 runs.

Experimental Results and Analysis.
Experimental evaluation is conducted in terms of effectiveness, efficiency, and comparison to answer the above three questions, respectively. Profile 1 (Effectiveness). The effectiveness of a trace simplification technique is reflected by its reduction of context switches. For better understanding, CAOM preserves all the program information; that is, we do not perform any delete operations on the subjects.
Table 2 lists the number of threads (#Thread), the length of the original trace (Size), the number of context switches in the original buggy trace (#CSori), the number of context switches in the simplified trace (#CSsim), and the reduction (Reduction(%)), where the length of a trace is the total number of synchronization operations and memory accesses. Because CAOM does not perform any delete operations on the subjects, the trace length stays the same before and after simplification. However, the context switches are reduced significantly: compared with the original buggy traces, the simplified traces contain 27.36% to 99.97% fewer context switches (54.39% on average). For the large-scale subject SpecJBB-2005, for instance, the context switches are reduced from 124200.3 to 37.6, a reduction of 99.97%. Moreover, the more threads and the more synchronization operations or memory accesses there are in the original trace, the higher the reduction, as for Manager, Tsp, Cache4j, and SpecJBB-2005.
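The text does not spell out how Reduction(%) is computed. Assuming the usual relative reduction, 100 × (#CSori - #CSsim) / #CSori, the short snippet below reproduces the 99.97% reported for SpecJBB-2005; the function name is hypothetical.

```python
def cs_reduction(cs_original, cs_simplified):
    """Percentage of context switches removed by simplification."""
    return 100.0 * (cs_original - cs_simplified) / cs_original


# Averaged SpecJBB-2005 values reported in the text: 124200.3 -> 37.6.
print(f"{cs_reduction(124200.3, 37.6):.2f}%")  # prints: 99.97%
```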

Profile 2 (Efficiency).
The efficiency of a trace simplification technique is reflected by its time consumption, which determines whether it can be applied in practice. The time consumption of CAOM consists of three parts: data loading, dependence relation calculation, and scheduling generation. Because CAOM is an offline approach and the original buggy trace is collected through instrumentation and recording in the preprocessing step, the complete trace information must be loaded before simplification starts. Table 3 lists the experimental results in terms of time consumption. Columns 4-7 represent the time consumed in data loading, dependence calculation, scheduling generation, and in total, respectively. As we can see, for the 12 Java multithreaded programs, the maximum time consumed is no more than 30 min, which indicates the good efficiency of our method. For example, for Tsp, whose trace length is 1001K, the total simplification time is only 2.7 min.
Concretely, for most middle-scale and small programs, the time is mainly consumed in data loading. For example, for ArrayList and Loader, data loading accounts for 84.49% and 84.16% of their total time, respectively. However, for Tsp and Cache4j, the time consumed in dependence relation calculation and scheduling generation far exceeds that of data loading. The reason is that their original traces contain many synchronization operations, memory accesses, and context switches, which makes the dependence relations much more complex, so frequent check and move operations have to be conducted. In particular, for Cache4j, dependence relation calculation takes more time than scheduling generation, because there are many lock dependence relations in which two locks are adjacent and belong to the same thread, which saves many move operations.
For large-scale programs, the time consumed in trace simplification increases because of the large trace size and the vast number of context switches. However, our method still has good efficiency because it does not require multiple iterations or replay validation. For example, for SpecJBB-2005, the whole simplification takes less than 30 min.

Profile 3 (Comparison Analysis).
To evaluate whether our approach performs better than state-of-the-art approaches, we compare CAOM with SimTrace. Both SimTrace and CAOM reduce the trace simplification problem to a combinatorial optimization problem. The difference is that SimTrace treats it as an optimization problem over graph merging, whereas CAOM operates directly on the trace.
Figure 4 presents the comparison results between SimTrace and CAOM on the reduction of context switches. For most programs, CAOM removes more context switches than SimTrace. For example, for BuggyProgram, CAOM increases the reduction by 14.15%. However, there is an exception: for Account, the reduction achieved by our approach is 13.83% lower than that of SimTrace. The reason is that SimTrace mainly pursues performance improvement, which brings a risk of replay failure. This situation cannot arise in our approach, because we calculate strict dependence relations between every two successive events and simplify the original buggy trace while guaranteeing replay.
Figure 5 shows the comparison results between SimTrace and CAOM on time cost. We separate three programs (shown in Figure 5(b)) from the 12 subjects because they take much more time to simplify. As Figure 5 shows, for all middle-scale and small programs, CAOM is more efficient than SimTrace, because our method does not need to construct any abstract models. For example, for Tsp, CAOM improves performance by 58.60%. For SpecJBB-2005, however, CAOM is slower than SimTrace. The reason is that CAOM removes far more context switches for this subject, which requires frequent check and move operations that consume much time.
To sum up, we draw the following conclusions from the above experiments, which also answer the three questions in Section 5.1.
(1) CAOM is effective in trace simplification: it reduces context switches by 27.36% to 99.97% (54.39% on average).
(2) CAOM is efficient in trace simplification. Even for large-scale programs, it finishes the simplification within 30 min.
(3) CAOM performs better than SimTrace on most subjects in terms of both context switch reduction and time cost.

Time Complexity Analysis.
The time complexity of CAOM is $O(n^2)$, where $n$ represents the number of events in the original trace. Given the original buggy trace, we first calculate the dependence relations for each event, which takes $O(n)$ time. Then, in scheduling generation, for each of the $n$ events we search for the positions of its dependent events and of the latest event in the same thread, which takes $O(n)$ time per event and thus $O(n^2)$ time in total. Because dependence relation calculation and scheduling generation are executed in sequence, the overall time complexity of our method is $O(n) + O(n^2) = O(n^2)$.

Discussion.
Our experiments show that CAOM can simplify concurrent program execution traces with lengths on the order of a million events effectively and efficiently, which makes it practical. In theory, our method can inform the design of new optimization algorithms for concurrency bug diagnosis services. In practice, experimental results show that CAOM takes less time while removing more context switches, which implies that developers can save much time when debugging concurrent programs. Thus, our proposed method can speed up concurrent software development.
However, there are still a few directions that may further improve our method. First, for efficiency, we only considered one-way checking and moving; a bidirectional or two-stage simplification [55] may further improve the effectiveness and remove more context switches. Second, in the preprocessing step, we used escape analysis to identify the shared variables. Both Static-TSA and Dynamic-TSA [56] are more precise and efficient than escape analysis, and they scale to real-world large multithreaded applications. In future work, we will employ Static-TSA or Dynamic-TSA to improve the applicability of our method. Third, for completeness, we calculated the dependence relations of all events, which consumed much redundant time. We plan to use program slicing [57] or collaborative filtering [58,59] to extract the critical variables and their corresponding events.

Conclusions
The advance of cloud computing and big data applications accelerates current software development and produces various concurrent cloud services in the distributed cloud environment. However, it is a great challenge to guarantee both service quality and service performance. A concurrency bug diagnosis service reasons about vulnerabilities in concurrent applications. Existing trace simplification techniques either perform online analysis or rely on complex graph structures, which limits the performance of service optimization. In this paper, we present a novel static constraint-aware optimization method for concurrency bug diagnosis services in the distributed cloud environment. We obtain a simplified trace by iteratively checking dependence constraints and moving feasible events forward whenever the condition is satisfied. This constraint-aware design guarantees that the simplified trace is equivalent to the original buggy trace. Furthermore, the effectiveness and efficiency of trace simplification are significantly improved because we optimize the concurrency bug diagnosis service offline without any complex structures. Finally, a set of experiments on 12 widely used Java projects demonstrates that our proposed CAOM outperforms other state-of-the-art approaches.

Figure 4: Comparison results between SimTrace and CAOM on reduction of CSs.

Figure 5: Comparison results between SimTrace and CAOM on time cost.