An Efficient Algorithm for On-the-Fly Data Race Detection Using an Epoch-Based Technique

Data races represent the most notorious class of concurrency bugs in multithreaded programs. To detect data races precisely and efficiently during the execution of multithreaded programs, the epoch-based FastTrack technique has been employed. However, FastTrack has time and space complexities that depend on the maximum parallelism of the program, because it must still partially maintain expensive data structures such as vector clocks. This paper presents an efficient algorithm, called iFT, that uses only the epochs of the access histories. Unlike FastTrack, our algorithm requires $O(1)$ operations to maintain an access history and locate data races, without any switching between epochs and vector clocks. We implement this algorithm on top of the Pin binary instrumentation framework and compare it with other on-the-fly detection algorithms, including FastTrack, which uses a state-of-the-art happens-before analysis algorithm. Empirical results using the PARSEC benchmark show that iFT reduces the average runtime and memory overhead to 84% and 37%, respectively, of those of FastTrack.


Introduction
Synchronization in parallel or multithreaded programs is an enforcing mechanism used to coordinate thread execution and manage shared data in various computational systems, including HPC (High Performance Computing). However, multithreaded programs may contain synchronization defects such as data races, which occur when two concurrent threads access a shared memory location without explicit synchronization and at least one of the accesses is a write. It is well known that data races are the hardest defects to handle in multithreaded programs, because of the nondeterministic interleaving of concurrent threads [1][2][3][4].
The main drawback of dynamic detection techniques is the additional overhead of monitoring program execution and analyzing every conflicting memory operation. A sampling approach was introduced to solve the overhead problem of dynamic data race detection. Sampling-based techniques [23][24][25] can be performed efficiently when testing multithreaded programs via local thread burst-sampling [24] or a global execution time sampling strategy [23]. Although they provide significantly reduced runtime overheads, these techniques are still ineffective in detecting data races when the sampling rates are low.
FastTrack is a state-of-the-art happens-before algorithm and is an improved version of the Djit+ algorithm with vector clocks (VCs) [26,27]. This technique exploits the idea that the full generality of VCs is often unnecessary for data race detection. It replaces heavyweight VCs with a lightweight identifier, called an epoch, that uses only the tuple of the clock value and the thread id.
Epoch-based happens-before analysis decreases the runtime and memory overhead of almost all VC operations from $O(n)$ to $O(1)$ in the detection of data races, where $n$ designates the maximum number of simultaneously active threads during an execution. However, FastTrack requires a time and space overhead of $O(n)$ for the shared read accesses to shared memory locations. Therefore, the overhead problem still exists, because even the small fraction of shared read accesses makes it difficult to dynamically analyze programs with a large number of concurrent threads [13].
This paper presents an efficient algorithm, called iFT, that uses only epochs to detect data races. Thus, iFT represents an improvement over the FastTrack method. Our algorithm maintains only two epochs of earlier read accesses to shared memory locations, instead of the full VCs, using the left-of-relation [11]. Thus, it requires only $O(1)$ runtime and memory overhead to maintain the access history and locate data races, without any switching between epochs and VCs, unlike FastTrack. Furthermore, the technique is guaranteed to report a subset of the data races detected by FastTrack.
We implement the new algorithm on top of the Pin instrumentation framework [28], which uses a just-in-time (JIT) compiler to recompile target program binaries for dynamic instrumentation. To compare the accuracy of iFT for on-the-fly data race detection, we also implement two other detection algorithms, Djit+ and FastTrack, on top of the same framework, and employ the same optimized VC primitives. We compare the efficiency of iFT with Djit+ and FastTrack, which use a happens-before analysis to detect data races. The experimental results on C/C++ benchmarks using Pthreads show that our algorithm reduces the runtime and memory overheads compared with the other algorithms, while soundly detecting similar data races to FastTrack.
In summary, the contributions of our work are as follows:
(i) iFT provides a significant improvement in efficiency, exhibiting an $O(1)$ runtime and memory overhead for each access history, whereas FastTrack requires $O(n)$ VC operations.
(ii) iFT matches the well-established precision of FastTrack, although it uses only two epochs instead of the full VCs for earlier read accesses to shared memory locations.
(iii) iFT reduces the average runtime and memory overhead to 84% and 37%, respectively, of those of FastTrack.
The remainder of this paper is organized as follows. Section 2 discusses important concepts of happens-before analysis with VCs, and Section 3 introduces the FastTrack algorithm and its limitations. We present our improved algorithm in Section 4 and evaluate it empirically in Section 5 by comparing it with existing techniques for data race detection. We introduce some related work in Section 6 and conclude our argument in Section 7.

Background
On-the-fly methods of detecting data races typically use VCs to precisely analyze the happens-before relation. This section presents important rules for allocating VCs to the concurrent thread segments introduced in this paper and describes how VCs represent the happens-before relation during the execution of multithreaded programs.

Execution of Multithreaded Programs.
In this work, we consider multithreaded programs using the POSIX thread standard (Pthread) as a model of concurrent threads. Pthread is widely used not only in C/C++ applications but also on many Unix-like operating systems (Linux, Solaris, Mac OS, FreeBSD, etc.), because it provides various APIs and libraries for creating, manipulating, and synchronizing threads.
In a multithreaded program, a block of thread $t$ that is partially serially executed is represented as a thread segment, denoted by $t_i$. Thus, a thread can be represented as a set of thread segments, denoted by $t = t_1, t_2, \ldots, t_n$ ($n \geq 1$). A thread segment $t_i$ is delimited by thread operations that can take one of the following forms:
(i) init($t$) models the creation of a thread segment $t_1$ and the start of the execution of thread $t$.
(ii) fork($t$, $u$) models the creation of a thread segment $u_1$ from the current thread segment $t_i$ and the start of a new thread segment $t_{i+1}$ on the same thread $t$.
(iii) join($t$, $u$) models the termination of a thread segment $u_j$ and the creation of a new thread segment $t_{i+1}$ on the same thread $t$ from the current thread segment $t_i$.
A thread segment $t_i$ contains a finite sequence of events that consists of at least one event. An event takes one of the following forms:
(i) Access Events read($x$) and write($x$). The former models the reading of a shared memory location $x$, and the latter models the updating of $x$.
(ii) Mutual Exclusion Events acquire($t$, $m$) and release($t$, $m$). The former models the acquisition of a lock $m$ to enter a critical section. The latter models the release of a lock $m$ to leave a critical section and the start of a new thread segment $t_{i+1}$ on the same thread $t$.
(iii) Condition Variable Synchronization Events wait($t$, $cv$) and signal($t$, $cv$). The former models the wait for condition variable $cv$ until another thread wakes $cv$ and the subsequent start of a new thread segment $t_{i+1}$ on the same thread $t$. The latter models the wake-up of a thread waiting on $cv$ and the start of a new thread segment $t_{i+1}$ on the same thread $t$.
(iv) Barrier Event barrier($t$, $k$). This models the waiting of multiple threads until the number of waiting threads is $k$ and the start of a new thread segment on each of the waiting threads.
In this work, we refer to the above thread operations and events, other than the access events, as synchronization primitives.

VC-Based Happens-Before Analysis.
Happens-before analysis uses a representation of Lamport's happens-before relation [27] to determine the logical concurrency between two thread segments. VCs are widely used to analyze the happens-before relation $\xrightarrow{hb}$, because they capture the execution order of thread segments and the synchronization order of thread operations and events. A vector clock $C$: Tid → Nat records a clock value $c$ for each thread while the program is executing. Thus, thread segment $t$ maintains a VC $C_t = \langle c_1, \ldots, c_n \rangle$, which has $n$ entries if the maximum number of active threads in the execution of a multithreaded program is $n$. The VCs of the thread segments are partially ordered ($\sqsubseteq$) pointwise, with a minimum element $\langle 0, \ldots, 0 \rangle$, and the associated synchronization primitives are defined by pointwise maximums. For instance, the entry $C_t[u]$ for any thread segment $u$ stores the latest clock value of $u$ that happened before the current synchronization primitive of $t$.
During program execution, the VCs of the thread segments are maintained according to the following rules: where $L_m$ is a vector clock for each lock $m$. The other synchronization events, wait($t$, $cv$), signal($t$, $cv$), and barrier($t$, $k$), can be modeled with the () operation.
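The update rules for these primitives can be sketched as follows. This is a minimal illustration that assumes the usual Djit+/FastTrack formulation (join on fork, join, and acquire; copy on release; increment when a new segment starts), not necessarily the exact rules used in this paper. The class `VectorClock` and the `on_*` helper names are hypothetical.

```python
from collections import defaultdict


class VectorClock:
    """Sparse vector clock; an absent entry counts as 0 (the minimum)."""

    def __init__(self, entries=None):
        self.c = defaultdict(int, entries or {})

    def join(self, other):
        # Pointwise maximum with another clock.
        for tid, clk in other.c.items():
            self.c[tid] = max(self.c[tid], clk)

    def inc(self, tid):
        # Start a new thread segment on thread `tid`.
        self.c[tid] += 1

    def copy(self):
        return VectorClock(dict(self.c))


def on_fork(C, t, u):
    C[u].join(C[t])   # the child inherits everything the parent has seen
    C[t].inc(t)       # the parent continues in a new segment


def on_join(C, t, u):
    C[t].join(C[u])   # the parent learns everything the child has seen
    C[u].inc(u)


def on_acquire(C, L, t, m):
    C[t].join(L[m])   # synchronize with the last release of lock m


def on_release(C, L, t, m):
    L[m] = C[t].copy()
    C[t].inc(t)       # the releasing thread starts a new segment


# Usage: thread "t" forks "u", then hands lock "m" over to it.
C = {"t": VectorClock({"t": 1}), "u": VectorClock({"u": 1})}
L = defaultdict(VectorClock)
on_fork(C, "t", "u")
on_release(C, L, "t", "m")
on_acquire(C, L, "u", "m")
```

Every operation here touches up to $n$ entries, which is the cost that the epoch representation discussed later tries to avoid.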
Figure 1 represents a multithreaded execution with synchronization primitives as a directed acyclic graph, called a Partial Order Execution Graph (POEG) [10,11]. In the POEG, a vertex is either a thread operation or a synchronization event, and an arc represents a logical thread segment started by the synchronization primitives. The dashed lines indicate the synchronization order in the execution of the program. The events $r$ and $w$, represented by small disks on the arcs, denote read and write events at a shared memory location, respectively. The numbers attached to each thread segment and event name indicate an observed order, and the VCs are allocated for each thread segment by the above rules.
Using the VCs of each thread segment, we simply analyze the happens-before relation between any two thread segments. If the clock value of a thread segment $t$ is less than or equal to the corresponding clock value of another thread segment $u$, we can conclude that $t$ happens before $u$. Otherwise, $t$ is concurrent with $u$. Formally,

$$t \xrightarrow{hb} u \quad \text{if } C_t[t] \le C_u[t]; \qquad t \parallel u \quad \text{otherwise}.$$

Obviously, $C_t[t] \le C_u[t]$ means that thread segment $u$ was synchronized from an earlier thread segment $t$ by one of the synchronization primitives. Then, $C_t$ is partially ordered with $C_u$, denoted by $C_t \sqsubseteq C_u$, and the two segments are never involved in a race with each other. Finally, the happens-before analysis locates a data race during the execution of a multithreaded program whenever any two events on two concurrent thread segments access a shared memory location and at least one of the events is a write.
Definition 1. Given two access events $e_t$ and $e_u$ to a shared memory location from two distinct thread segments $t$ and $u$, respectively, if the two events are not synchronized (i.e., neither $C_t \sqsubseteq C_u$ nor $C_u \sqsubseteq C_t$) and at least one of the events is a write, there exists a data race between $e_t$ and $e_u$.
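For example, in Figure 1, consider two events $r_3$ and $w_4$ on two different thread segments $t_3$ and $u_3$, respectively. The two events constitute a data race, because neither $C_{t_3} \sqsubseteq C_{u_3}$ nor $C_{u_3} \sqsubseteq C_{t_3}$ holds, and therefore $t_3 \parallel u_3$.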
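The check in Definition 1 can be sketched in a few lines. The vector clocks below are hypothetical values chosen only so that the two segments are concurrent; they are not taken from Figure 1, and the function names are illustrative.

```python
def happens_before(c_t, t, c_u):
    """t happens before u when t's own clock entry is already known to u."""
    return c_t.get(t, 0) <= c_u.get(t, 0)


def concurrent(c_t, t, c_u, u):
    # Neither segment is ordered before the other.
    return not happens_before(c_t, t, c_u) and not happens_before(c_u, u, c_t)


def is_data_race(c_t, t, kind_t, c_u, u, kind_u):
    # A race needs two concurrent accesses, at least one of them a write.
    return "write" in (kind_t, kind_u) and concurrent(c_t, t, c_u, u)


# Hypothetical clocks: each segment has advanced past what the other has seen.
c_t3 = {"t": 3, "u": 1}
c_u3 = {"t": 1, "u": 3}
race = is_data_race(c_t3, "t", "read", c_u3, "u", "write")
```

With these assumed clocks, `race` is true for the read/write pair, while the same pair of segments performing two reads would not be reported.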

FastTrack Algorithm
VC-based happens-before techniques, such as Djit+ [3], require $O(n)$ space to maintain the VCs for each thread segment and access history and also require $O(n)$ time for VC operations (e.g., join, copy, and comparison).
FastTrack [6], which improves on Djit+, exploits the insight that the full generality of VCs is often unnecessary for data race detection. The key ideas behind this insight are as follows: (1) all writes to a shared memory location $x$ are totally ordered by a happens-before analysis, which assumes no data races have been detected on $x$ so far, and (2) writing to $x$ could potentially conflict with the last read of $x$ performed by any other thread, although reads are not totally ordered, even in race-free programs. By exploiting these observations, FastTrack replaces heavyweight VCs with a lightweight identifier for a thread segment, called an epoch, using only the tuple of clock value $c$ ($\equiv C_t[t]$) and thread id $t$, denoted by $c@t$. Thus, FastTrack reduces the runtime and space overhead of almost all VC operations from $O(n)$ to $O(1)$ in the detection of data races.
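The cost difference between the two representations can be illustrated with a small sketch. The names `Epoch`, `epoch_of`, `epoch_leq`, and `vc_leq` are hypothetical, and vector clocks are modeled as plain dictionaries for brevity.

```python
from typing import Dict, NamedTuple


class Epoch(NamedTuple):
    clock: int  # c, the thread's own clock entry C_t[t]
    tid: int    # t, the thread id


VC = Dict[int, int]


def epoch_of(tid: int, vc: VC) -> Epoch:
    """Current epoch c@t of thread `tid`."""
    return Epoch(vc.get(tid, 0), tid)


def epoch_leq(e: Epoch, vc: VC) -> bool:
    """Epoch-to-VC comparison: one lookup, independent of the thread count."""
    return e.clock <= vc.get(e.tid, 0)


def vc_leq(a: VC, b: VC) -> bool:
    """Full VC comparison: inspects every entry, so it grows with n."""
    return all(clk <= b.get(tid, 0) for tid, clk in a.items())


current = {0: 1, 1: 3}                  # hypothetical VC of the accessing thread
same_thread = epoch_leq(Epoch(1, 1), current)
other_thread = epoch_leq(Epoch(3, 0), current)
```

Here `same_thread` holds because the earlier access came from the thread's own past, whereas `other_thread` fails because clock 3 of thread 0 has not yet been observed.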
For a shared memory location $x$, the FastTrack algorithm defines an access history using two entries:
(i) $R_x$: it records a VC for all concurrent read events or an epoch for the last read event of $x$.
(ii) $W_x$: it records only an epoch for the last write event to $x$.
When a new event $e_t$ occurs on thread segment $t$, the algorithm for reporting data races and maintaining each entry is as follows.

Upon a Read Event of 𝑥 by Thread 𝑡
(1) If the epoch of the current $e_t$ is the same as that of $R_x$, $R_x = E(t)$, the algorithm takes no action.
(2) If $R_x \neq E(t)$, then the algorithm checks $W_x \preceq C_t$ to report a data race between an earlier write event and $e_t$.
Table 1: Access history states for detecting data races in Figure 1 using the FastTrack algorithm.
Upon a Write Event to 𝑥 by Thread 𝑡
(1) If the epoch of the current $e_t$ is the same as that of $W_x$, $W_x = E(t)$, then the algorithm takes no action.
(2) If $W_x \neq E(t)$, then the algorithm checks $W_x \preceq C_t$ to report a data race between an earlier write event and $e_t$.
(3) If there exists only one epoch in $R_x$, then the algorithm checks $R_x \preceq C_t$ to report a data race between an earlier read event and $e_t$. Otherwise, the algorithm checks $R_x \sqsubseteq C_t$ for a full VC maintained in $R_x$.
(4) The previous epoch or VC is removed from $R_x$, and $E(t)$ is inserted into $W_x$.
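The four write-side steps can be sketched as follows. This is an illustration of the steps as listed above, not the reference FastTrack implementation; `AccessHistory`, `on_write`, and the use of `None` for an empty entry are assumptions made for the example.

```python
from typing import Dict, List, NamedTuple, Optional, Union


class Epoch(NamedTuple):
    clock: int
    tid: int


VC = Dict[int, int]


def epoch_leq(e: Optional[Epoch], vc: VC) -> bool:
    # An empty entry (None) precedes every access.
    return e is None or e.clock <= vc.get(e.tid, 0)


def vc_leq(a: VC, b: VC) -> bool:
    return all(clk <= b.get(tid, 0) for tid, clk in a.items())


class AccessHistory:
    def __init__(self) -> None:
        self.R: Union[None, Epoch, VC] = None  # epoch, or VC when Read Shared
        self.W: Optional[Epoch] = None         # epoch of the last write


def on_write(h: AccessHistory, t: int, c_t: VC) -> List[str]:
    e = Epoch(c_t.get(t, 0), t)
    if h.W == e:                      # step (1): same epoch, nothing to do
        return []
    races = []
    if not epoch_leq(h.W, c_t):       # step (2): earlier write is concurrent
        races.append("write-write")
    if isinstance(h.R, dict):         # step (3): full VC, O(n) comparison
        if not vc_leq(h.R, c_t):
            races.append("read-write")
    elif not epoch_leq(h.R, c_t):     # step (3): single epoch, O(1)
        races.append("read-write")
    h.R, h.W = None, e                # step (4): reset reads, record the write
    return races


# Usage with hypothetical clocks: two shared reads, then a concurrent write.
h = AccessHistory()
h.R = {0: 3, 1: 1}
reported = on_write(h, 1, {0: 1, 1: 3})
```

The only branch whose cost depends on the number of threads is the `isinstance(h.R, dict)` case, which corresponds to the Read Shared state.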
Table 1 explains how the FastTrack algorithm reports data races and manages the access history during the execution of the program shown in Figure 1. Initially, $W_x$ starts from $\bot_e$, indicating that the shared memory location $x$ has not yet been written. When the first read event $r_1$ occurs on thread segment $t_2$, the epoch $2@t_{id}$ is recorded in $R_x$ instead of a full VC, where $t_{id}$ indicates the thread id for $t_2$. When the second read $r_2$ on thread segment $u_1$ accesses $x$, $r_2$ shares $x$ with the first read event $r_1$, because $r_2 \parallel r_1$; we then say that $x$ is in a Read Shared state. In this state, because each read may be involved in one or more data races with a later write event, the clocks of all shared reads of $x$ are kept in $R_x$. Thus, $R_x$ switches to a VC representation $\langle 2, 1, 0, \ldots \rangle$ to record the clocks of the last reads by the two thread segments in Table 1. With this adaptive switching between epochs and VCs in $R_x$, FastTrack greatly reduces the overhead of the $O(n)$ VC operations.
When read event $r_3$ occurs on $t_3$, $R_x[t] = 3$ is directly updated in the corresponding entry of $R_x$, although $R_x$ maintains a VC for the Read Shared state. Thus, the updating takes $O(1)$ time. A data race {$r_3$-$w_4$} is reported because Definition 1 is satisfied (i.e., neither $R_x \sqsubseteq C_{u_3}$ nor $C_{u_3} \sqsubseteq R_x$ is true) when the write event $w_4$ to $x$ occurs. The VC of prior read events in $R_x$ is removed by resetting $R_x$ to $\bot_e$, and the epoch for $w_4$, $3@u_{id}$, is stored in $W_x$. When a read of $x$ occurs on $v_1$, only the epoch of $r_5$ is kept in $R_x$, because the read event is not shared with any others, and a data race {$w_4$-$r_5$} is reported. Finally, three concurrent events, $w_4$, $r_5$, and $w_6$, give rise to two data races, {$w_4$-$w_6$, $r_5$-$w_6$}, because neither $E(u_3)$ nor $E(v_1)$ precedes ($\preceq$) the VC of the thread segment that performs $w_6$.
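The adaptive switching of $R_x$ on the read side, as walked through above, can be sketched like this. The sketch assumes thread ids 0 and 1 for the two reading threads and uses hypothetical vector clocks; `on_read` and `AccessHistory` are illustrative names, and `None` stands in for an empty entry.

```python
from typing import Dict, List, NamedTuple, Optional, Union


class Epoch(NamedTuple):
    clock: int
    tid: int


VC = Dict[int, int]


def epoch_leq(e: Optional[Epoch], vc: VC) -> bool:
    return e is None or e.clock <= vc.get(e.tid, 0)


class AccessHistory:
    def __init__(self) -> None:
        self.R: Union[None, Epoch, VC] = None
        self.W: Optional[Epoch] = None


def on_read(h: AccessHistory, t: int, c_t: VC) -> List[str]:
    e = Epoch(c_t.get(t, 0), t)
    if h.R == e:                       # same epoch: filtered out
        return []
    races = []
    if not epoch_leq(h.W, c_t):        # earlier write is concurrent
        races.append("write-read")
    if isinstance(h.R, dict):          # Read Shared: update one entry, O(1)
        h.R[t] = e.clock
    elif epoch_leq(h.R, c_t):          # Exclusive: the new read replaces the old
        h.R = e
    else:                              # concurrent reads: switch epoch -> VC
        h.R = {h.R.tid: h.R.clock, t: e.clock}
    return races


h = AccessHistory()
on_read(h, 0, {0: 2})   # first read: R is the epoch 2@0
on_read(h, 1, {1: 1})   # concurrent read: R becomes a VC with entries 2 and 1
on_read(h, 0, {0: 3})   # later read by thread 0: its entry is updated to 3
```

After the three reads, `h.R` holds one entry per reading thread, which is why a later write must compare against up to $n$ entries.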
A common problem with using VCs for happens-before analysis is the space and time overhead, which depends on the number of threads in the multithreaded program. The FastTrack algorithm, in contrast, provides a significant performance improvement over the lockset analysis by utilizing the lightweight epoch. Moreover, it suggests the design of a hybrid technique with both precision and efficiency, such as AccuLock [20]. However, there is further room for improvement, because the algorithm requires $O(n)$ VC operations to guarantee no loss of precision when shared data enters the Read Shared state, as with $r_2$ and $r_3$ in Figure 1. Therefore, the overhead problem still exists, because the shared read accesses make it difficult to dynamically analyze programs with a large number of concurrent threads.

Efficient Data Race Detection
FastTrack precisely reports data races with significantly improved performance, because epochs require only constant space and constant time for almost all VC operations. However, the algorithm still needs VC operations whenever a shared memory location has shared read events on concurrent thread segments. As this situation makes it impossible to dynamically analyze programs with a large number of concurrent threads [13], the overhead problem remains, with the space overhead being more critical than the time overhead. Thus, we improve the FastTrack algorithm to reduce this overhead.
Our improved FastTrack (iFT) algorithm reports data races in a constant amount of time and space, even in the worst case, because it maintains only two epochs instead of full VCs for $R_x$ using the left-of-relation. The notion of the left-of-relation was originally suggested by Mellor-Crummey [11]. Mellor-Crummey's technique maintains two concurrent read events in an access history to detect data races with a write event. Techniques based on the left-of-relation can still verify that a program is free of data races, even though they maintain only two read events in each access history, because they locate at least one data race if any exist. However, Mellor-Crummey's technique does not support synchronization primitives other than fork/join operations, such as thread locking and wait/signal operations. Moreover, the left-of-relation does not apply directly to VC-based detectors, because VCs cannot analyze the logical position of thread segments, unlike Mellor-Crummey's OS labeling [11].
We simply define a left-of-relation that is a partial ordering of two concurrent thread segments $t_i$ and $t_j$ for two distinct events $e_i$ on $t_i$ and $e_j$ on $t_j$ in an execution graph, such as the POEG of Figure 1, where the events are not related by $e_i \preceq e_j$. To apply the left-of-relation to the iFT algorithm, we use a breadth value $b$ instead of the thread id $t_{id}$ of the original FastTrack algorithm. The breadth value $b$ is produced by performing a left-to-right preorder numbering or an English Order numbering of the EH labeling scheme [29] and is used to identify the position of a current thread relative to its sibling threads. If a thread segment $t_i$ precedes another thread segment $t_j$ and $t_i \parallel t_j$ in an execution of a multithreaded program, $b_i$ for $t_i$ is less than $b_j$ for $t_j$.
Thus, an epoch $E(t_i)$ of thread segment $t_i$ is redefined as the tuple of clock value $c_i$ and breadth value $b_i$, denoted by $c_i@b_i$. Now, the left-of-relation between any two thread segments is simply analyzed by comparing the breadth values of their epochs.
Definition 2. Given two read events $r_i$ and $r_j$ to a shared memory location on two concurrent thread segments $t_i$ and $t_j$, respectively, if $b_i$ for $E(t_i)$ is less than $b_j$ for $E(t_j)$, one says that $r_i$ is left of $r_j$, denoted by $r_i \curlyeqprec r_j$. Formally,

$$r_i \curlyeqprec r_j \iff b_i < b_j, \quad \text{where } t_i \parallel t_j.$$

We now provide a detailed description of how iFT locates three kinds of data races for concurrent events: read-write races, write-write races, and write-read races.
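Before going through the three cases, the comparison in Definition 2 can be sketched as follows. The epoch layout and the helper names (`left_of`, `leftmost`, `rightmost`) are hypothetical, and the comparison is only meaningful for reads that are already known to be concurrent.

```python
from typing import Iterable, NamedTuple


class Epoch(NamedTuple):
    clock: int    # c_i
    breadth: int  # b_i, used in place of the thread id


def left_of(e_i: Epoch, e_j: Epoch) -> bool:
    """r_i is left of r_j when its breadth value is smaller (O(1))."""
    return e_i.breadth < e_j.breadth


def leftmost(epochs: Iterable[Epoch]) -> Epoch:
    return min(epochs, key=lambda e: e.breadth)


def rightmost(epochs: Iterable[Epoch]) -> Epoch:
    return max(epochs, key=lambda e: e.breadth)


# Three concurrent reads with assumed breadth values 0, 1, and 2.
reads = [Epoch(1, 1), Epoch(2, 0), Epoch(1, 2)]
kept = (leftmost(reads), rightmost(reads))
```

Only the two epochs in `kept` need to be stored, which is what bounds the size of the read history regardless of how many threads read the location.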
Read-Write Races. These races arise because a write event to a shared memory location $x$ can conflict with prior read events of $x$ performed by any other thread. To detect read-write races, we consider two read states: (1) the Exclusive state, where a read event of $x$ is performed exclusively on a thread segment, and (2) the Read Shared state, where $x$ has read events that are shared by two or more concurrent thread segments. In the Exclusive state, because read events of $x$ occur on the same thread, they are totally ordered, and the epoch of the last read event is recorded in $R_x$. Read events of $x$ that are shared by multiple threads are unordered in a read-only manner, and each read event may constitute a data race with a later write event. Thus, if $x$ is in the Read Shared state, two epochs of the two concurrent read events are recorded in $R_x$ by the left-of-relation.
Using $R_x$, which maintains only two epochs instead of a full VC, iFT detects the presence of data races as well as FastTrack does, because it locates one or two of the read-write data races. Figure 2 shows three examples of read-write data races during the execution of a multithreaded program with nondeterministic interleaving of concurrent threads. In Figure 2(a), three shared read events, $r_1$, $r_2$, and $r_3$, happen before the two write events, $w_4$ and $w_5$. The leftmost event $r_1$ and the rightmost event $r_3$ are kept in the two entries of $R_x$, respectively, by the left-of-relation. Thus, iFT can report a data race between $r_3$ and $w_4$, because $r_1 \xrightarrow{hb} w_4$ and $r_3 \parallel w_4$. When $w_5$ occurs on a thread segment that is concurrent with the others, iFT reports two data races {$r_1$-$w_5$, $r_3$-$w_5$}.

Lemma 3. If data races exist between earlier reads and a current write event
In Figure 2
Proof. Suppose that the same fixed program execution order is provided to both analyses. Let $iFT_{race}$ ($FT_{race}$) be the set of races located by iFT (FastTrack), and let $R_{iFT}(x)$ ($R_{FT}(x)$) be the read events recorded in $R_x$ by iFT (FastTrack). Because $R_{iFT}(x) \in R_{FT}(x)$ in the execution order, we guarantee the following:
(1) If $FT_{race} = \emptyset$, then $iFT_{race} = \emptyset$ is satisfied, because it is impossible to satisfy $iFT_{race} \neq \emptyset$.
(2) If $FT_{race} \neq \emptyset$, then $iFT_{race} \neq \emptyset$ is satisfied, because $iFT_{race} = \emptyset$ cannot be satisfied by Lemma 3.
Therefore, $iFT_{race} \subseteq FT_{race}$ is satisfied.
Write-Write Races. These involve two concurrent write events to $x$. All write events to $x$ are totally ordered, with the assumption that no data races have been detected on $x$. Thus, iFT records the epoch of the write event in $W_x$ and locates a write-write race between $W_x$ and a later write event to $x$ by comparing the epoch of $W_x$ with the current VC of the write event, $W_x \preceq C_t$.
Write-Read Races. These involve a write event to $x$ that is concurrent with a later read event of $x$. iFT locates such a data race by analyzing $W_x \preceq C_t$.

Lemma 5. If FastTrack locates a write-write race or a write-read race during the execution of a program, iFT can locate the data race from the same fixed execution.
Proof. Let $W_{iFT}(x)$ ($W_{FT}(x)$) be a write event recorded in $W_x$ by iFT (FastTrack). Then, $iFT_{race} = FT_{race}$ holds, because $W_{iFT}(x) = W_{FT}(x)$ in the execution order, and both analyses employ only the comparison $W_x \preceq C_t$.
Algorithm 1 presents the pseudocode for iFT, which consists of three algorithms: ReadCheck(), WriteCheck(), and Maintain(). ReadCheck() and WriteCheck() mainly focus on filtering events, reporting data races, and maintaining an access history for a shared memory location $x$ whenever an event $e_t$ on thread segment $t$ accesses $x$. To report data races, we use the inversion of $\preceq$, denoted by $\npreceq$, to catch instances where the current event is concurrent with a prior event. In ReadCheck() and WriteCheck(), applying $\npreceq$ to an entry of the access history and $C_t$ denotes that the epochs recorded in that entry do not satisfy $\preceq C_t$. IsOrdered() is used by Maintain() to check the happens-before relation between the current event and prior events in the access history. Maintain() manages access histories for every $x$ and employs IsMostL() and IsMostR() to maintain only two concurrent events in $R_x$ by applying the left-of-relation.
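Since the pseudocode of Algorithm 1 is not reproduced in this text, the following is only a rough sketch of how the three routines might fit together, reconstructed from the prose description. It assumes that the breadth value doubles as the index into the vector clock and that prior reads ordered before the current one are dropped before the leftmost and rightmost reads are kept. All names (`History`, `read_check`, `write_check`, `maintain`, `ordered`) and the vector clocks in the usage trace are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Epoch(NamedTuple):
    clock: int
    breadth: int


VC = Dict[int, int]
Race = Tuple[Epoch, Epoch]


def ordered(p: Optional[Epoch], c_t: VC) -> bool:
    """IsOrdered-style check: does the prior epoch precede the current VC?"""
    return p is None or p.clock <= c_t.get(p.breadth, 0)


class History:
    def __init__(self) -> None:
        self.reads: List[Epoch] = []        # at most two epochs
        self.write: Optional[Epoch] = None


def maintain(h: History, e: Epoch, c_t: VC) -> None:
    # Drop reads that happened before the current one, then keep only the
    # leftmost and rightmost of the remaining concurrent reads.
    kept = [p for p in h.reads if not ordered(p, c_t)] + [e]
    if len(kept) > 2:
        kept = [min(kept, key=lambda p: p.breadth),
                max(kept, key=lambda p: p.breadth)]
    h.reads = sorted(kept, key=lambda p: p.breadth)


def read_check(h: History, b: int, c_t: VC) -> List[Race]:
    e = Epoch(c_t.get(b, 0), b)
    if e in h.reads:                        # same epoch: filtered out
        return []
    races = [] if ordered(h.write, c_t) else [(h.write, e)]
    maintain(h, e, c_t)
    return races


def write_check(h: History, b: int, c_t: VC) -> List[Race]:
    e = Epoch(c_t.get(b, 0), b)
    if h.write == e:
        return []
    races = [] if ordered(h.write, c_t) else [(h.write, e)]
    races += [(p, e) for p in h.reads if not ordered(p, c_t)]
    h.reads, h.write = [], e                # at most two comparisons were made
    return races


# A trace in the spirit of the running example, with assumed clocks.
h = History()
read_check(h, 0, {0: 2})                    # r1: reads = [2@0]
read_check(h, 1, {1: 1})                    # r2: reads = [2@0, 1@1]
read_check(h, 0, {0: 3})                    # r3: 2@0 is replaced by 3@0
w4 = write_check(h, 1, {0: 1, 1: 3})        # reports the pair (3@0, 3@1)
r5 = read_check(h, 2, {2: 1})               # reports the pair (3@1, 1@2)
w6 = write_check(h, 0, {0: 4})              # reports two pairs
```

Every routine performs a bounded number of epoch comparisons, so the cost per access does not depend on the number of threads. The price is that a read which is neither leftmost nor rightmost may be dropped, which matches the subset guarantee stated for the algorithm.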
Table 2 shows the changing state of an access history for detecting the data races appearing in Figure 1 using the iFT algorithm, where we assume that the breadth values are allocated as $t = 0$, $u = 1$, and $v = 2$. In the figure, the epoch of read event $r_1$ on $t_2$, 2@0, is recorded in $R_x$, as the read event of $x$ is performed exclusively. When the rightmost read $r_2$ occurs on $u_1$, $x$ enters the Read Shared state. The epoch of $r_2$ (1@1) is recorded with the epoch of $r_1$, instead of the full VC of FastTrack in Table 1, because $b(r_1) = 0$ is less than $b(r_2) = 1$, and therefore $r_1 \curlyeqprec r_2$. Because $r_3$ is the last read event on thread $t$ when the event occurs, the epoch of the prior leftmost event $r_1$ is updated to the epoch of $r_3$, 3@0. When $w_4$ occurs on $u_3$, the data race {$r_3$-$w_4$} is reported.
Table 2: Access history states for detecting data races using the iFT algorithm.

Evaluation
We empirically evaluated the efficiency and precision of iFT in comparison with other dynamic detection algorithms that use the happens-before analysis. The experimental results show that our technique not only soundly reports data races, but also reduces the time and space overhead of data race detection for programs with a large number of concurrent threads.

Implementation and Experimentation.
We implemented the iFT algorithm and two other dynamic detection algorithms on top of the Pin instrumentation framework [28], which uses a JIT compiler to recompile target program binaries for dynamic instrumentation. Building a lightweight tool for monitoring memory access is easier with Pin than with other dynamic binary instrumentation frameworks, such as Valgrind [30]. The two algorithms used for comparison are Djit+ [3] (a high performance VC-based happens-before analysis algorithm) and FastTrack [6] (a state-of-the-art happens-before analysis algorithm).
Figure 3 depicts the architecture of the detectors. Each detector consists of an Instrumentor and a Race Detector to report data races during program execution. The Instrumentor consists of two modules: ThreadMonitor and EventMonitor. These, respectively, track thread operations and event instances for every shared memory location, considering synchronization primitives. The Race Detector performs the thread identification routines to generate and manage VCs for each active thread segment, as well as the detection routines to report data races.
The thread identification routines employ the VC primitives discussed in Section 2. These are commonly used to analyze the happens-before relation in the detection routines of all algorithms. A lock-free algorithm was used in the detection routines to remove the centralized bottleneck of access histories. Whenever the Instrumentor catches one of the thread operations or events, it calls either the thread identification routines or the detection routines to add instrumentation at each interesting point of the running target binaries. Because the Instrumentor and Race Detector use only the shadow binaries of the target programs, which are generated by the JIT compiler of the Pin framework, no source code annotation is required to monitor memory access events or synchronization primitives.
To supplement the correct identification of concurrent thread segments, we used a special structural table for each thread. The table consists of four important items of information: the system thread id, Pthread id, Pin thread id, and clock value. The system thread id is the thread id allocated by the operating system, and the Pthread id is allocated by Pthread functions such as pthread_create(). The Pin thread id is the logical identifier created in sequence whenever the Pin framework catches a thread start operation. Thus, we employed the Pin thread id as the breadth value $b_i$ of an epoch ($c_i@b_i$) in the iFT algorithm. The clock value is used to form a VC of a thread segment using synchronization primitive operations.
Our experimentation focused on comparing the soundness and the efficiency of on-the-fly data race detection in programs with a large number of concurrent threads. To evaluate the iFT algorithm, we compared the data races reported by each detector and measured the execution time and the memory consumed by the execution instances of a set of C/C++ benchmarks using Pthread. For this purpose, we used 12 applications from the PARSEC 2.1 benchmark suite [31]. These target different areas, including HPC, with applications such as data mining, financial analysis, and computer vision. All applications were executed with the default simulation inputs of the PARSEC benchmark suite to produce proper runtime overheads and memory consumption.
Before conducting the experiments, we investigated the benchmark applications in terms of the frequency of access events and synchronization primitives. The results of this analysis with the FastTrack algorithm are given in Table 3. We used sim-medium simulation inputs in the execution of each application. In the table, "Same Epoch" means that read/write events to a shared memory location $x$ have been filtered out by FastTrack, as they occurred after the first read/write event on the same thread segment. "Exclusive" indicates that only epochs were used to locate data races, because read/write events exclusively accessed $x$. "Shared" indicates the Read Shared state, in which $x$ has shared read events being performed by concurrent thread segments. "VC Scan" indicates that a current write event was compared with $R_x$ when $x$ had entered the Read Shared state. Thus, two memory operations, Shared and VC Scan, involve VC operations that require $O(n)$ time and space overheads in FastTrack.
From this investigation, we can see that 78.3% of all operations and events were read events and 21.6% were write events. Other operations and events accounted for less than 0.1% of the total. These results reaffirm that almost all parts of data race detection involve tracing access events to shared memory locations, because this accounts for more than 99% of operations in the benchmarks. Fortunately, the convergence of memory operations is again removed, as there is a possibility that this will affect the tracing of events for data race detection. For example, in the table, 90.7% of read events and 80.2% of write events occurred in the same epoch. VC operations are rarely needed, accounting for an average of only 1.3% of all read/write events. Thus, the switching approach in FastTrack is quite effective in improving the performance of happens-before analysis.
The implementation and experimentation were carried out on a system with two 2.4 GHz Intel Xeon quad-core processors and 32 GB of memory under Linux Kernel 2.6. We installed the most recent version of the Pin framework (Version 2.12), and the applications were compiled with gcc 4.4.4 for all detectors. We used a programmed logging method to measure the execution time and memory consumption of each application. This method uses system files in the proc directory, which provides real-time information on the system, including meminfo, iomem, and cpuinfo. The average runtime and memory overheads of all applications were measured for ten executions under each detector. Figure 4 shows the resulting analyzed information, such as thread creation, detected data races, execution time, and memory consumption, during an execution of the x264 application using our implemented iFT detector.

Precision.
We acquired the reported data race results to evaluate the precision of iFT. Three detectors were applied on the same Pin framework for fair experimentation. All applications of the PARSEC benchmark were run with sim-medium simulation inputs, and two real applications were run with both a server program and several client programs. The two real applications used for the experimentation are MySQL (an open source DBMS) and Cherokee (an open source web and server application). These applications were repeatedly tested until all warnings reported by each detector had been fixed. The number of data races located by the three detectors is given in Table 4. All of the detectors reported that there were no data races in six of the applications in the PARSEC benchmarks: blackscholes, dedup, facesim, raytrace, swaptions, and vips. This agrees with prior research [32], which considered an implementation of FastTrack on top of the DynamoRIO instrumentation framework. Djit+ and FastTrack reported exactly the same data races for all applications, as found in [6,20], because these two detectors have identical precision. Similarly, iFT reported the same data races as FastTrack, with the exception of the bodytrack and x264 applications.
All the detectors located a data race in canneal and fluidanimate, which run into user-defined synchronization functions, such as atomic() and barrier_wait(). They reported two data races in ferret; these were caused by a shared counter variable and a shared Boolean flag for a queue in the application. The three detectors reported four data races for streamcluster. These were caused by using the same user-defined synchronization, barrier_wait(), and object pointers to a shared structure without explicit synchronization. All of the detectors reported eight data races in MySQL due to object pointers to a shared structure without any proper synchronization and shared flags for thread termination. The three detectors located seven data races in Cherokee. A data race in Cherokee was the result of log corruption similar to a well-known bug in Apache's logging code (Apache bug #25520).
For bodytrack, all detectors found six data races, which were caused by the initialization of objects in shared structures without synchronization and the misuse of condition variables. Djit+ and FastTrack also reported two data races involving two kinds of unprotected counter variables for a user-defined wait-notify operation, whereas iFT reported only one of these data races. iFT located two data races for x264, caused by two pointers in different functions that referred to a shared structure and its members. The pointers allowed the shared memory locations to be accessed concurrently by read/write events from each function without proper synchronization. The other detectors reported three data races, including the two detected by iFT; the third was caused by the same bug via a pointer to the same shared structure.
In bodytrack and x264, shared read events that are neither the leftmost nor the rightmost events can be excluded by our iFT algorithm from the events relevant to data race detection. Hence, iFT reported fewer data races for these two applications, and the reported data races were a subset of those given by FastTrack. For example, in the result of x264, a prior read access to a shared structure in one file (frame.c) was removed from the access history, since a new read access to the same shared structure in another file (analyse.c) occurred on the leftmost thread. iFT reported only the data race between the leftmost read access and a later write access to the same shared structure in a third file (encoder.c), whereas FastTrack reported two data races between these read accesses and the later write. However, iFT located the missed data race after we had fixed the previously reported data race by using a local pointer variable.
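The effect described above, in which a read that is neither leftmost nor rightmost drops out of the access history, can be illustrated with a minimal Python sketch. It is not the iFT implementation: it assumes a hypothetical integer `position` that ranks threads from left to right, it omits all happens-before checks, and the names `ReadEpoch` and `ReadHistory` are invented for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReadEpoch:
    clock: int      # logical clock of the reading thread
    tid: int        # id of the reading thread
    position: int   # hypothetical left-to-right rank (smaller = further left)


@dataclass
class ReadHistory:
    """Two-slot read history: only the leftmost and rightmost reads are kept."""
    leftmost: Optional[ReadEpoch] = None
    rightmost: Optional[ReadEpoch] = None

    def record(self, read: ReadEpoch) -> None:
        # A read survives only if it becomes the new leftmost or rightmost event.
        if self.leftmost is None or read.position <= self.leftmost.position:
            self.leftmost = read
        if self.rightmost is None or read.position >= self.rightmost.position:
            self.rightmost = read


history = ReadHistory()
for read in (ReadEpoch(1, tid=1, position=1),
             ReadEpoch(1, tid=2, position=2),
             ReadEpoch(1, tid=0, position=0)):
    history.record(read)

# The read by thread 1 is no longer retained: thread 0 is now leftmost.
print(history.leftmost.tid, history.rightmost.tid)  # 0 2
```

Under this simplified model, a later write can only be compared against the two retained reads, which is consistent with a race involving a dropped read surfacing only after the reported race has been fixed.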
From this experiment, we can conclude that iFT is sound in practice, because the data races it reports are consistent with those reported by FastTrack, whose precision is well established.

Efficiency.
We measured the runtime and memory consumption of the benchmarks under the three detectors to evaluate the efficiency of iFT. Figure 5 depicts the measured runtime and memory overhead results for 11 applications of PARSEC with the sim-medium simulation inputs. The graph shows the average runtime and memory overheads for each of the detectors as a proportion of the original run. Because facesim is a representative long-running application that uses a small number of concurrent threads and naturally requires quite high runtime and memory overheads for on-the-fly data race detection, it was excluded from the efficiency test.
From Figure 5(a), almost all of the iFT results are lower than those of the other detectors. iFT incurred an average runtime overhead of 8.5x, whereas FastTrack and Djit+ required average runtime overheads of 9.2x and 11.2x, respectively. In particular, iFT required markedly lower runtime overheads for two applications, dedup and ferret, which use more than 20 active threads during program execution. For instance, iFT incurred an average runtime overhead of 23.5x for dedup, whereas FastTrack and Djit+ incurred average runtime overheads of 27.6x and 37.3x, respectively. In the case of ferret, the runtime overhead of iFT was 7.5x, while FastTrack and Djit+ incurred average runtime overheads of 10x and 16x, respectively. Several applications, such as blackscholes, canneal, and raytrace, have lower overheads than the others because of their model of parallelism (e.g., fork-join parallelism).
In Figure 5(b), we see that iFT incurred an average memory overhead of 4.3x, whereas FastTrack incurred an average memory overhead of 6.0x. This means that iFT reduced the average memory overhead to 58% of that of Djit+ and 72% of that of FastTrack for the 11 applications. If we consider the three applications that use several tens of dynamic threads, iFT incurred an average memory overhead of 1.9x, while FastTrack required an average memory overhead of 5.4x. Thus, the proposed iFT reduced the average memory overhead to 37% of that recorded by FastTrack.
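The relative figures quoted in this comparison are simple ratios of the reported overhead factors. A minimal Python sketch of that arithmetic follows; the helper name `relative_percent` is illustrative, and the inputs are the rounded factors quoted in the text, so other ratios recomputed this way may differ slightly from percentages derived from unrounded measurements.

```python
def relative_percent(detector_overhead: float, baseline_overhead: float) -> int:
    """Overhead of one detector as a rounded percentage of a baseline's overhead."""
    return round(100 * detector_overhead / baseline_overhead)


# Average memory overhead over the 11 applications: iFT 4.3x vs FastTrack 6.0x.
print(relative_percent(4.3, 6.0))    # 72

# Runtime overhead for dedup: iFT 23.5x vs FastTrack 27.6x and Djit+ 37.3x.
print(relative_percent(23.5, 27.6))  # 85
print(relative_percent(23.5, 37.3))  # 63
```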
We measured the average memory consumption for the two real applications under our Pin framework. The results of the measurement appear in Figure 6. For the experiments, MySQL used 78 threads during a 60-second execution, and Cherokee used 126 threads. We employed four monitoring steps, Native, Pin-only, Monitoring, and Detecting, to show how much additional overhead was incurred by the instrumentation work under the Pin framework. Native means the original execution without our Pin framework, and Pin-only indicates that the applications were run on the Pin framework without monitoring and instrumentation work. Monitoring means that only the thread executions and memory accesses were traced under the Pin framework. Detecting means that we measured the memory consumption of the applications under the three detectors implemented on top of the Pin framework.
In Figure 6, we see that Pin-only incurred an average memory consumption of 2.2x and Monitoring incurred an average memory consumption of 2.6x. iFT incurred an average memory consumption of 2.8x, whereas FastTrack incurred an average memory consumption of 3.6x. This means that iFT reduced the average memory consumption to 62% of that of Djit+ and 76% of that of FastTrack for the two applications. If we exclude the Pin-only step, which incurred 1,128 MB in the average case, iFT incurred an average memory consumption of 1.7x, while FastTrack required an average memory consumption of 2.3x. For the two real applications, iFT reduced the average memory consumption to 49% of that of FastTrack.
We chose the x264 application from the PARSEC benchmark for an additional comparison, because it employs a different number of concurrent threads to process the virtual pipelined stages for each input frame. In contrast, the other applications use a fixed number of threads, even when they are given different inputs. The comparison used all six simulation inputs provided by the PARSEC suite, because these lead to an increasing number of threads in each input frame.
Figure 7 depicts the measured runtime and memory overhead results for the x264 application. In the experiment, iFT incurred an average runtime overhead of 6.6x, whereas the other detectors averaged more than 8x slowdown. In particular, in the executions with the sim-large input (256 threads), iFT reduced the runtime overhead to 74% of that of the other detectors. iFT also performs well in reducing the memory overhead, averaging just 1.3x, whereas the memory overhead of the other detectors increased by more than 95% relative to that of iFT. Under iFT, the application ran with the native input using 1,024 concurrent threads, but the other detectors ran out of memory with the native input because of the 32 GB memory limit of our system. In this case, iFT required a runtime overhead of 11.5x and a memory overhead of 1.6x to locate two data races. It is noteworthy that the strong performance of iFT results from the elimination of the VC operations used in the FastTrack algorithm.
The results in Figure 7 show that iFT reduced the memory overhead by 11.4x and gave a speedup of 1.3x compared to the other dynamic detectors. The overheads of iFT were similar to those of the other algorithms for small-size inputs, as x264 uses fewer than 20 threads for these inputs. However, with the larger inputs, iFT reduced the runtime and memory overheads compared to the other detectors. For example, iFT required just 82% of the runtime and 8% of the memory overhead of FastTrack for these larger inputs. The results emphasize again that iFT is practically useful for detecting data races on-the-fly in programs with a large number of concurrent threads. The empirical results from Table 4 to Figure 7 show that our iFT algorithm is a sound and practical method for on-the-fly data race detection, because it reduces the average runtime and memory overhead to 84% and 37%, respectively, of those recorded by FastTrack.

Related Work
Most prior dynamic techniques have focused on detecting data races more precisely or efficiently. Since FastTrack was introduced, several detectors have been designed to combine lockset analysis with happens-before analysis by leveraging the lightweight nature of epochs.
AccuLock [20] was the first solution to use this combined approach, achieving performance comparable to FastTrack with limited false positives. This detector applies a new, efficient lockset algorithm to FastTrack to enforce a thread locking discipline. It uses the notion of potential data races, called 0-races, in which any two concurrent read/write events access a shared memory location without a common lock. The detector accounts for the sensitivity to thread interleaving caused by thread locking, as it excludes from the VCs the subset of happens-before relations established by lock acquisitions and releases. However, AccuLock still requires O(n) operations to maintain an access history and locate data races, similar to FastTrack.
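The lockset condition behind such potential data races can be sketched in a few lines of Python. This is only an illustration of the "no common lock" test, not AccuLock's algorithm: the function name `is_potential_race` and its parameters are assumptions, and the concurrency flag is taken as already computed by a happens-before analysis.

```python
from typing import FrozenSet


def is_potential_race(concurrent: bool,
                      first_is_write: bool,
                      second_is_write: bool,
                      first_locks: FrozenSet[str],
                      second_locks: FrozenSet[str]) -> bool:
    """Two accesses to one location form a potential race when they are
    concurrent, at least one is a write, and they hold no common lock."""
    conflicting = first_is_write or second_is_write
    share_lock = bool(first_locks & second_locks)
    return concurrent and conflicting and not share_lock


# A read under lock "m1" and a concurrent write under lock "m2": no common lock.
print(is_potential_race(True, False, True,
                        frozenset({"m1"}), frozenset({"m2"})))  # True

# The same accesses, both protected by "m1": the lockset test rules them out.
print(is_potential_race(True, False, True,
                        frozenset({"m1"}), frozenset({"m1", "m2"})))  # False
```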
ThreadSanitizer [8] is another hybrid detector based on the same combined approach. This detector provides improved precision in the detection of data races by adapting the fastidious aspect of thread synchronizations and race patterns appearing in C/C++ applications. However, unlike AccuLock, it uses VCs to analyze the happens-before relation and multiple locksets for concurrent writes. Thus, the detector incurs the same time and memory overhead as earlier hybrid detectors such as MultiRace [3]. Recently, a new version of ThreadSanitizer was released (but not reported officially). This version uses the FastTrack algorithm and epochs instead of the VCs of the old version.
In our prior work [33], we presented an on-the-fly race detector for OpenMP programs. This detector uses a thread identifying technique to analyze the happens-before relation and a data race detection protocol that utilizes lockset analysis. A significant improvement in efficiency was obtained because the left-of-relation was also applied to the protocol, and the detector is able to precisely report data races for OpenMP programs with a large number of concurrent threads. However, our prior detector may lose its soundness or efficiency when handling general threading models, such as Pthreads, because it only considers structured fork-join parallel programming models, such as OpenMP.

Conclusion
There is a trade-off between efficiency and precision in the detection of data races using happens-before or lockset analysis. FastTrack is the fastest happens-before analysis algorithm, providing performance comparable to that of lockset analysis. However, there is still room for improvement, as the algorithm requires some VC operations. In this paper, we presented an improved FastTrack algorithm, called iFT, that uses only the epochs in each access history by applying the left-of-relation. This algorithm is practically sound, needing only O(1) runtime and memory overhead to maintain an access history, and it provides performance similar to that of the well-established FastTrack algorithm.
We implemented our algorithm as a Pin-tool on top of the Pin instrumentation framework and compared it empirically with other detection algorithms, including FastTrack. Empirical results from a set of C/C++ benchmarks showed that our iFT algorithm is a practical and sound method for on-the-fly data race detection, reducing the average runtime and memory overhead to 84% and 37%, respectively, of those required by FastTrack. This low overhead of the iFT algorithm is significant, because it can be used for on-the-fly detection based on both happens-before analysis and a hybrid technique, as presented here for an empirical comparison of efficiency. Thus, we believe that the lightweight iFT algorithm can be applied to production systems that include fault tolerance techniques and to testing tools for developing dependable software, as well as safety-critical software such as avionics and nuclear power systems. Future work will focus on improving the iFT algorithm via a hybrid detection technique, similar to that of AccuLock but without the false positive problem, and on enhancing its precision to handle a wider variety of synchronization primitives, as in ThreadSanitizer.

Figure 1: An example of multithread execution with synchronization primitives.

$\xrightarrow{hb}$ and simply maintains epochs or VCs by updating the access histories. For the algorithm, some notions are used to analyze $\xrightarrow{hb}$ using the epoch. The function $E(t)$ is shorthand for $c@t$, and $E(t) \preceq V$ denotes that the epoch $E(t)$ happens before a vector clock $V$, where $E(t) \preceq V$ if and only if $c \leq V[t]$.
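The epoch comparison above reduces to a single integer comparison, which is what makes epochs cheaper than full vector clocks. A minimal Python sketch follows, assuming an epoch is a (clock, thread id) pair and a vector clock is a mapping from thread ids to clock values; the names `Epoch` and `epoch_happens_before` are illustrative and are not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Epoch:
    clock: int  # c in c@t
    tid: int    # t in c@t


def epoch_happens_before(epoch: Epoch, vc: Dict[int, int]) -> bool:
    """c@t happens before V exactly when c <= V[t]; missing entries count as 0."""
    return epoch.clock <= vc.get(epoch.tid, 0)


last_write = Epoch(clock=3, tid=1)

# The accessing thread has only seen thread 1 up to clock 2: not ordered.
print(epoch_happens_before(last_write, {0: 5, 1: 2}))  # False

# After synchronizing with thread 1 at clock 3, the write is ordered before it.
print(epoch_happens_before(last_write, {0: 5, 1: 3}))  # True
```

The check touches one vector clock entry regardless of the number of threads, in contrast to an element-wise comparison of two vector clocks.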

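(3) If both shared read events are concurrent with the write event, then iFT reports two data races between the write event and both shared read events. (4) If both shared read events happen before the write event, then iFT reports no data races.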
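The case analysis over the two retained read events and a later write can be summarized as a small Python sketch. It assumes the happens-before results for both reads are already available as Booleans, and the function name `races_reported` is hypothetical; it only counts the read-write races and does not model how the access history is maintained.

```python
def races_reported(first_read_hb_write: bool, second_read_hb_write: bool) -> int:
    """Count read-write races between a write and the two retained reads.

    A retained read races with the write exactly when it is not ordered
    before the write, i.e. when the two events are concurrent.
    """
    return int(not first_read_hb_write) + int(not second_read_hb_write)


print(races_reported(True, False))   # 1: only the second read is concurrent
print(races_reported(False, True))   # 1: only the first read is concurrent
print(races_reported(False, False))  # 2: both reads are concurrent
print(races_reported(True, True))    # 0: both reads happen before the write
```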

Figure 3: Overall architecture of a data race detector.

Figure 4: Execution result of the implemented iFT detector.

Figure 6: Measured memory consumption for two real applications.

Figure 7: Measured runtime and memory overhead results for x264 application.
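According to this relation, if a thread segment must happen at an earlier time than another thread segment, the former happens before the latter, and likewise for their events; this is denoted by $\xrightarrow{hb}$. If neither of two thread segments (or events) happens before the other, we say that they are concurrent, denoted by $\parallel$.
iFT locates one or two of those located by FastTrack. Two distinct shared read events to the same shared location are kept in the access history by the left-of-relation. Since these two read events are concurrent with each other, we guarantee the following: (1) If the first read event happens before the current write event and the second is concurrent with it, then iFT reports a data race between the second read event and the write event, because the first read event is ordered before the write event under $\preceq$, while the second read event and the write event are unrelated under $\sqsubseteq$ in either direction. (2) If the first read event is concurrent with the write event and the second happens before it, then iFT reports a data race between the first read event and the write event, because the second read event is ordered before the write event under $\preceq$, while the first read event and the write event are unrelated under $\sqsubseteq$ in either direction. Proof.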
(b),  1 and  4 are synchronized by a lock variable  1 , and  2 is also synchronized with  4 by a lock variable  2 . For the execution of Figure 2(b), iFT records two read events  1 and  3 in   and   , respectively. It reports only the data race { 3 - 4 } between   and  4 , because   ⪯  4 is satisfied. Therefore,  1 $\xrightarrow{hb}$  4 by the synchronization between  1 and  4 . iFT records  2 from  2 in   instead of  1 if the acquiring lock  2 is reserved, because  1 $\xrightarrow{hb}$  2 by the thread interleaving  1 →  4 →  2 . Finally, iFT reports two read-write data races { 2 - 4 ,  3 - 4 } for the execution. iFT records  3 in   as the leftmost event, and  2 is recorded in   . Thus, iFT locates no data races, because  3 $\xrightarrow{hb}$  4 by the acquiring lock  1 , and  2 $\xrightarrow{hb}$  4 by the signal-wait event. If a pair of wait and signal events does not occur between  2 and  4 , iFT obviously locates the data race { 2 - 4 }, as it determines that the rightmost event  2 is concurrent with  4 . If data races exist between   and a current write event, the races located by iFT are a subset of those located by FastTrack.
In Figure 2(c), there are two kinds of synchronization events, locking and a signal-wait. Because  1 $\xrightarrow{hb}$  3 is satisfied by lock variable  1 ,
(01) ReadCheck(, )
(02) if () = any epoch kept in   then return;
(03) if     then Report a data race;
(04) MaintainAH(  , ());
=   then return;
(03) if     or      then Report a data race;
iFT also reports the data race { 4 - 5 } and two data races { 4 - 6 ,  5 - 6 }, as does the FastTrack algorithm, when  5 and  6 occur. Consequently, the results in Table 2 show that iFT detects apparent data races as well as FastTrack, although the new algorithm maintains only two epochs for concurrent read events in   .
4−  6 ,  5 −  6 reported, as for the FastTrack algorithm. However, iFT only compares two epochs in   without any VC operations.

Table 3: Analysis of PARSEC benchmarks using FastTrack.

Table 4: Number of data races located on the PARSEC benchmark and real applications.