Reproducing a failure is the first and most important step in debugging because it enables us to understand the failure and track down its source. However, many programs are susceptible to nondeterministic failures that are hard to reproduce, which makes debugging extremely difficult. We first address the reproducibility problem by proposing an OS-level replay system for a uniprocessor environment that can capture and replay the nondeterministic events needed to reproduce a failure in Linux interactive and event-based programs. We then present an analysis method, called replay analysis, based on the proposed record and replay system to diagnose concurrency bugs in such programs. The replay analysis method uses a combination of static analysis, dynamic tracing during replay, and delta debugging to identify the failure-inducing memory access patterns that lead to a concurrency failure. The experimental results show that the presented record and replay system has low recording overhead and hence can be safely used in production systems to catch rarely occurring bugs. We also present a few concurrency bug case studies from real-world applications to demonstrate the effectiveness of the proposed bug diagnosis framework.
Debugging is the hardest part of software development. Traditionally, the process of debugging begins by reproducing a failure, then locating its root cause, and finally fixing it. The ability to reproduce a failure is indispensable, as, in most cases, it is the only way to provide developers with clues for tracking down the source of a failure. However, in the case of some nondeterministic failures such as concurrency bugs, it is not always possible to reproduce the failure given a fixed set of inputs and environmental configurations. Without the ability to reproduce, debugging becomes an inefficient and time-consuming process of trial and error. Consequently, some software practitioners report that it takes them weeks to diagnose such hard-to-reproduce failures [
To deal with nondeterministic failures, record and replay tools have been demonstrated to be a promising approach. Such tools record the interactions between a target program and its environment and later replay those interactions deterministically to reproduce a failing scenario. A number of record and replay systems have been proposed in recent years, but many of them incur high overheads [
In this work, we first show that a computer program’s nondeterministic behavior can be fully identified, captured, and reproduced with instruction-accurate fidelity at the operating system level by using existing hardware support in modern processors. From this vantage point, a developer can replay a nondeterministic failure of a program and effectively diagnose it using traditional cyclic debugging methods.
The contributions of this work are twofold. First, it presents an OS-based record and replay system capable of intercepting the nondeterministic events occurring in interactive and event-driven programs. To substantiate this idea, we implemented it in an ARMv7 uniprocessor-based Linux system. The system incurs low overhead during recording and can therefore be used in an always-on mode in testing phases or production systems to catch nondeterministic and rarely occurring bugs. Second, it describes a replay analysis method for diagnosing concurrency bug failures based on the proposed record and replay system. The method specifically targets single-variable data races and atomicity violations, and uses a combination of static analysis, dynamic tracing during replay, and delta debugging to identify the failure-inducing memory access patterns that lead to a concurrency bug.
The remainder of this paper is organized as follows. In Section
A program is considered to be deterministic if, when it starts from the same initial state and executes the same set of instructions, it then reaches the same final state every time. In modern computer systems, however, even sequential programs can show unpredictable behavior because of their interaction with the environment, such as I/O, file systems, other processes, and with humans through UIs. Moreover, the occurrence of interrupts and signals can result in varying control flow during successive runs of the same program.
For debugging, these subsequent runs of a failing program can be made deterministic by recording the nondeterministic factors in the original run and substituting their results during replay. For a user-level program, such factors generally include external inputs, system call return values, scheduling, and signals. There are indeed some other sources of nondeterminism that exist at the microarchitecture level, for example, cache or bus states, blocking of I/O operations, and memory access latencies. Such nondeterminism causes a timing variation which may affect when external data is delivered to a user program or when an asynchronous event-handler is invoked. To handle this type of nondeterminism, we use a logical notion of time by keeping track of the number of instructions executed by a process between two nondeterministic events. The logical time helps in maintaining the relative order of the nondeterministic events during replay and guarantees the replication of the functional behavior of the program. For a debugging usage model where the goal is to find errors in a user program, it is sufficient to reproduce the functional behavior of the program rather than its temporal behavior. Therefore, nondeterministic factors existing at the architecture or circuit level which may cause timing variations are out of scope of this work.
The nondeterministic events exposed to a user program can be captured at different abstraction levels, that is, library-level, OS-level, and hardware-level. In general, the higher the abstraction level, the smaller the performance overhead, but the lower the accuracy. In the current work, we implemented the record and replay framework at the OS-level because we believe that the operating system is the ideal place to intercept nondeterministic events with instruction-level accuracy before they are projected to a user-space program. Moreover, many modern computer architectures such as Intel x86, PowerPC, and ARM include rich hardware resources such as performance monitors, breakpoints, and watchpoints that can be exploited at the OS-level to support deterministic replay. Implementation at the OS-level also has the advantage that it does not require any modifications to the target program or the underlying architecture.
Figure
Record and replay system at OS-level.
When a program is set to run in
The record and replay system is implemented at the system call and signal interface in the operating system since together these interfaces represent most of the nondeterministic events found in sequential and event-based programs. These events include data from external devices and the file system, input from timers, interprocess communication, asynchronous event-notification, and interrupts. Therefore, the proposed record and replay system is suitable for debugging a large class of interactive programs including those that generally come with a Linux distribution and other high throughput programs such as Lynx [
In the following subsections, we discuss how we implemented the proposed record and replay framework for a Linux-based ARMv7 system. The entire record and replay system is transparent to the target program and is implemented in software using support from ARM's debug and performance monitoring unit (PMU) architecture. Although the implementation details given here are specific to one architecture, the presence of hardware debug registers and performance monitoring counters in other architectures, along with the existing support in commodity operating systems for using these resources, makes our approach generally applicable.
System calls in an operating system provide an interface for a user-level process to interact with its external environment by sending requests to the kernel. In a typical UNIX system, there are around 300 system calls. However, the effect of all the system calls may not be nondeterministic with respect to a process. We broadly categorize the system calls as follows:
The I/O system calls allow a process to interact with hardware devices, networks, and file systems, whereas IPC system calls are used to interact with other processes. The results of these system calls cannot be predicted by the user application, so we consider them nondeterministic system calls. Time-related system calls return values local to the processor on which they are executed and differ on every invocation.
The process control system calls manage a process's status. Such calls only change or read the value of a process control block, so the associated events can be considered deterministic. The memory control system calls handle memory allocation, heap management functions, and so on. Even if these system calls are recorded, we must re-execute them during replay to generate their side effects within the operating system kernel. Otherwise, if, for example, memory is not actually allocated during the replay phase, the kernel will throw an exception when the program tries to access the memory region it has supposedly allocated.
Recording of system calls belonging to the latter two categories is therefore redundant, and in the current work, we do not record them. By eliminating these system calls, we significantly reduce the recording overhead, making the entire record and replay system more efficient.
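The categorization above can be sketched as a small lookup table consulted on every system call entry. The subset of calls and the syscall numbers shown here (classic ARM EABI numbers) are illustrative only, not the system's actual table.

```c
#include <assert.h>
#include <stdbool.h>

/* Categories from the text: I/O, IPC, and time-related system calls are
 * nondeterministic and must be recorded; process-control and memory-control
 * calls are deterministic and are simply re-executed during replay. */
enum sys_category { SYS_IO, SYS_IPC, SYS_TIME, SYS_PROC, SYS_MEM };

struct sys_entry { int nr; enum sys_category cat; };

/* Illustrative subset only; a real table covers all ~300 system calls. */
static const struct sys_entry table[] = {
    {  3, SYS_IO   },  /* read */
    {  4, SYS_IO   },  /* write */
    { 78, SYS_TIME },  /* gettimeofday */
    { 20, SYS_PROC },  /* getpid */
    { 45, SYS_MEM  },  /* brk */
};

/* Record only the nondeterministic categories; unknown calls are
 * recorded conservatively. */
static bool must_record(int nr)
{
    for (unsigned i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].nr == nr)
            return table[i].cat == SYS_IO || table[i].cat == SYS_IPC ||
                   table[i].cat == SYS_TIME;
    return true;
}
```

Filtering at this point keeps deterministic calls out of the log entirely, which is where the reduction in recording overhead comes from.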
When a user program invokes a system call, the processor switches from user mode to kernel mode and begins executing the system call handler. If the program is in the recording mode, the system call is allowed to execute normally, but before returning to user mode, we log its return value. In the ARM architecture, this value is returned in register r0.
During the replay mode, when the program tries to execute a nondeterministic system call, its handler is replaced by a default function. This function reads the return value of the current system call from the recorded event log and sends it to the process. Any data associated with the system call which was saved during the recording is also returned to the program. Thus, we simulate the effect of a system call completely, rather than executing it during replay.
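The replay-side substitution can be sketched as a consumer of the recorded event log. The `sc_event` layout and `replay_syscall` helper below are hypothetical simplifications, not the system's actual in-kernel structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One logged nondeterministic event: the system call's return value plus
 * any data it copied into the user buffer at record time. */
struct sc_event {
    long        retval;
    const char *data;   /* captured buffer contents, or NULL */
    size_t      len;
};

/* Replay cursor over the recorded event log. */
struct replay_log {
    const struct sc_event *events;
    size_t count, next;
};

/* Instead of executing the system call, emulate it: copy the recorded
 * data into the user buffer and hand back the recorded return value. */
static long replay_syscall(struct replay_log *rlog, void *user_buf)
{
    assert(rlog->next < rlog->count);
    const struct sc_event *ev = &rlog->events[rlog->next++];
    if (ev->data && user_buf)
        memcpy(user_buf, ev->data, ev->len);
    return ev->retval;
}
```

For example, a read() that returned the 5 bytes "hello" during recording replays those bytes into the caller's buffer without touching the file system at all.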
Signals are a type of software interrupt present in all modern UNIX variants. They are used for many nontrivial purposes, for example, interprocess communication, asynchronous I/O, and error-handling. Signals are sent to a process asynchronously and proactively by the kernel. Therefore, signal handling is a challenging task for any record and replay system. Many existing record and replay frameworks do not support signal replay. Some methods exist to support signal replay [
Before describing our process of signal record and replay, we briefly discuss how signals are handled in Linux. When a signal is sent to a process, the process switches to kernel mode. If it is already in the kernel mode, then after carrying out the necessary tasks and before returning to the user mode, the kernel checks if there are any pending signals to be delivered to the process. If a pending signal is found, it is handled in the do_signal() routine, where the corresponding signal handler’s address is placed into the program counter (PC), and the user mode stack is set up. When the process switches back to user mode, it executes the signal handler immediately.
During the recording mode, whenever a signal is delivered to a process being recorded, we log the signal number and the user process register context in the do_signal() routine before it modifies the current PC. The exact instruction in the process’s address space “where” the signal arrived is indicated by the PC logged in the register context. However, recording only the instruction address is not sufficient because the same instruction may be executed multiple times during the entire program execution, for example, in a loop. We also need to log the exact number of instructions executed by the process to identify “when” the signal arrived. To count the number of instructions, we utilize ARM’s performance monitor unit (PMU) architecture. The Cortex-A15 processor PMU provides six counters. Each counter can be configured to count the available performance events in the processor. We programmed one of the counters to count the number of instructions architecturally executed by a user process. During the recording phase when a nondeterministic event occurs, the current instruction count is also stored in that event’s log. The instruction count is then reset, and it starts counting again until the process encounters another nondeterministic event to be logged. Thus, two sequentially occurring nondeterministic events, for example, a system call and a signal, are separated by the exact number of instructions executed between the two events by the process being recorded.
To replay the signals, we make use of ARM's hardware breakpoint mechanism. During the replay phase, after processing an event, we always check what the upcoming event in the process's recorded log is. If the next event in the log is a signal, then a breakpoint is set at the instruction address of the user program recorded in the signal log. The instruction counter is reset to begin counting the instructions from the last replayed event. When the replayed program reaches the instruction where the breakpoint is set, a prefetch abort exception occurs, and the process switches to kernel mode. In the exception handler, we compare the current number of instructions executed to the count stored in the signal log. If they match, the signal is immediately delivered to the user process. If they do not, the breakpoint is maintained at the current PC using ARM's breakpoint address match and mismatch events, and the process executes until both the PC and the instruction count match the values recorded in the signal log. In this way, signal delivery to the target program with instruction-level accuracy is guaranteed during the replay phase. The replay process for a signal is illustrated in Figure
Signal replay process.
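The decision made in the breakpoint exception handler reduces to a two-part match. The sketch below models that check; the structure and function names are illustrative, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A recorded signal event: the PC at which the signal arrived and the
 * number of instructions executed since the previous logged event. */
struct sig_event { uint32_t pc; uint64_t icount; };

/* The same PC may be reached many times (e.g., inside a loop), so the
 * signal is delivered only when both the PC and the instruction count
 * match the recorded values; otherwise the breakpoint stays armed. */
static bool should_deliver(const struct sig_event *ev,
                           uint32_t cur_pc, uint64_t cur_icount)
{
    return cur_pc == ev->pc && cur_icount == ev->icount;
}
```

For a signal logged at PC 0x8000 after 30 instructions, loop iterations that hit 0x8000 at counts 10 and 20 keep the breakpoint armed; delivery happens only on the hit where the count also reaches 30.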
To capture nondeterministic bugs, recording often has to be done in production systems. Therefore, it is very important that the recording overhead be low and minimally intrusive to avoid any adverse effect on a production application's normal execution. We evaluated the performance of our record and replay system in terms of recording overhead for various real applications. The experiments were performed on a Samsung Exynos Arndale 5250 board based on the Cortex-A15 processor, with the record and replay mechanism implemented in a Linaro 13.09 server distribution with a Linux 3.12.0-rc5 kernel.
We recorded and replayed some Linux applications, which are listed with their workload conditions in Table
Test application scenarios.
Program | Description | Workload |
---|---|---|
lame | A high-quality MPEG Audio Layer III (MP3) encoder | Encode a 3.5-minute (7751 frames) .wav file |
GNU bc | An arbitrary precision numeric processing language calculator | Load the math library and process an input file containing mathematical operations |
vim | Vim 7.3 text editor | Open an existing text file in vim, and append an eight-character string 10,000 times |
bzip2 | A high-quality data compression/decompression utility | Decompress Linux-3.0.1.tar.bz2 of size 73.1 MB |
netperf-TCP_STREAM | Networking performance measuring benchmark | TCP_STREAM test between the local host and a remote host for 60 secs with default window size |
netperf-TCP_RR | Networking performance measuring benchmark | TCP request/response test between the local host and a remote host for 10 secs |
The performance overhead of recording the application workloads is shown in Figure
Recording overhead.
The results are displayed normalized to native execution without recording. The overhead of recording was under 5% for all the experiments except the TCP request/response test, which incurred 15% overhead. The recording overhead is directly related to the size of the logs: the events generated during recording must be moved from kernel memory to permanent storage, causing extra overhead. During the request/response test, a large number of requests are generated, and each request/response produces network data that must be logged, resulting in a relatively larger log than in the other tests and therefore a greater slowdown. The slowdown can be reduced by compressing the logs or by better scheduling of write operations to permanent storage. However, a discussion of these optimization methods is beyond the scope of the current research.
Concurrency bugs are generally associated with multithreaded programs. However, researchers have shown that they also exist in sequential [
The proposed replay analysis works in three phases: static analysis, dynamic tracing, and delta analysis, as shown in Figure
Replay analysis for diagnosing concurrency bugs.
Many concurrency bug detection methods have been proposed for shared memory parallel programs. These methods tend to trace accesses to all the shared memory locations in a program to detect possible concurrency bugs. Unlike concurrency bug detection, our proposed system aims to diagnose the concurrency bug that has caused a given program execution to fail. There may be many shared memory accesses in the program that are unrelated to a given failure, so it is redundant to track all the shared memory locations while debugging. Hence, our first goal is to reduce the scope of shared memory accesses that might be involved in inducing a given failure. To do this, we make use of an important characteristic of bugs, observed by researchers [
Keeping in view the short propagation distance heuristic of a bug, we expect the shared global variable involved in the concurrency bug to be accessed within that function. Therefore, we limit the scope of memory access tracing to only those global variables accessed in the identified function. We use the debugging symbols embedded in the program's binary to disassemble the target function. We extract the addresses of all global memory locations accessed within that function, eliminating all memory accesses to the function's stack and those referring to the read-only data section, as such accesses cannot be involved in concurrency bugs.
The output of the static analysis phase is a set of global variables that are accessed within the failure scope, and at least one of these is involved in the concurrency bug that we validate through dynamic tracing during replay.
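The filtering step of the static analysis can be sketched as an address-range test. The section bounds and stack range below are hypothetical inputs, read in practice from the binary's headers and the process layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Address range of a section in the binary (half-open interval). */
struct section { uint32_t start, end; };

/* Keep an accessed address as a candidate only if it lies in a writable
 * data section (.data/.bss); stack slots and .rodata are filtered out,
 * since such accesses cannot take part in the concurrency bugs targeted
 * here. */
static bool is_candidate(uint32_t addr,
                         const struct section *data,
                         const struct section *rodata,
                         uint32_t sp_low, uint32_t sp_high)
{
    if (addr >= sp_low && addr < sp_high)
        return false;                           /* function's stack */
    if (addr >= rodata->start && addr < rodata->end)
        return false;                           /* read-only data */
    return addr >= data->start && addr < data->end;
}
```

Running this test over every memory operand extracted from the disassembled function yields the candidate set handed to the dynamic tracing phase.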
To trace the access to shared memory locations, existing analysis methods typically rely on the use of heavyweight dynamic binary instrumentation tools such as Valgrind [
Most importantly, since we aim to trace memory access for a limited subset of global variables obtained during the static analysis phase, we want to avoid the inherent expense of instrumenting memory access to every shared memory location as it is redundant to the bug diagnosis process. In the current work, rather than using dynamic binary instrumentation, we employ an alternate approach in which we use the processor’s hardware watchpoint registers to monitor the access to the desired memory addresses. The watchpoint registers are used to stop a target program automatically and temporarily upon read/write operations to a specified memory address. These registers are often used in debuggers as data breakpoints. The main benefit of using hardware watchpoint registers is that they can be used to monitor the access to a memory location without any runtime overhead [
To trace a given memory address, we program a watchpoint value (DBGWVR) and control (DBGWCR) register pair using Linux ptrace ability to read from/write to processor’s registers. The DBGWVR holds the data address value used for watchpoint matching. The load/store access control field in DBGWCR enables the watchpoint matching conditional on the type of access being made [
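The behavior of a DBGWVR/DBGWCR pair can be modeled in software as an address plus an access-type mask. The encoding below is a deliberate simplification for illustration, not the actual register bit layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Software model of one watchpoint register pair: the value register
 * holds the watched address; the control register's load/store field
 * selects which access types trap. */
enum wp_access { WP_LOAD = 1, WP_STORE = 2, WP_BOTH = 3 };

struct watchpoint { uint32_t addr; enum wp_access kinds; };

/* True when an access should raise a watchpoint debug exception. */
static bool wp_hits(const struct watchpoint *wp,
                    uint32_t addr, enum wp_access kind)
{
    return addr == wp->addr && (wp->kinds & kind) != 0;
}
```

Because the matching is done in hardware, every access to the watched variable can be logged during replay with no per-instruction software cost.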
The entire process of dynamic tracing is automatic and can be described by Algorithm
When a failure is encountered in a production run or during testing, the first step usually performed by developers is to try to reproduce it by simply re-executing the program. If the failure in the initial run was caused by a concurrency bug, it is most likely to disappear when the test case is repeated. Thus, in the case of a concurrency bug, developers have at least one passing execution of the same program with the same set of inputs.
In the domain of sequential errors, delta analysis is often used to compare the execution paths and variable values in two executions of a program to isolate faulty code regions and incorrect variable values. In the case of a concurrency bug, the failure is caused by conflicting memory access. Therefore, we can reach the root cause of the failure by comparing the memory access patterns of the global variables in failing and successful executions.
Two types of concurrency bugs are common in the Linux programs considered in this research: data races and atomicity violations. A data race occurs when a global variable is accessed by an event-handler and the main code in an unsynchronized way, and at least one of those accesses is a memory write. An atomicity violation occurs when the intended serializability of consecutive accesses to a shared memory location in the main code is violated by an access to the same location in an asynchronously invoked event-handler. These bugs can be detected through particular combinations of memory access operations that signify a data race or an atomicity violation. Specifically, we treat the standard data race patterns (RW, WR, WW) and the atomicity violation patterns (RWR, RWW, WWR, WRW) known from multithreaded programs as the conflicting memory access patterns for programs that use event-handlers. Many such patterns may appear during the execution of a program without being harmful to it; for example, it has been shown that only about 10% of real data races can result in software failures [
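A minimal detector for the atomicity violation triples can be written over a per-variable access trace, assuming each logged access is tagged with whether it came from the main code or an event-handler. The names below are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

enum ctx { MAIN, HANDLER };  /* where the access came from */
enum op  { RD, WR };         /* read or write */

struct access { enum ctx ctx; enum op op; };

/* Flags a serializability violation on one shared variable: two
 * consecutive main-code accesses with an interleaved handler access,
 * matching one of the RWR, RWW, WWR, or WRW patterns. */
static bool atomicity_violation(const struct access *t, int n)
{
    for (int i = 0; i + 2 < n; i++) {
        if (t[i].ctx != MAIN || t[i+1].ctx != HANDLER || t[i+2].ctx != MAIN)
            continue;
        enum op a = t[i].op, b = t[i+1].op, c = t[i+2].op;
        if ((a == RD && b == WR && c == RD) ||  /* RWR */
            (a == RD && b == WR && c == WR) ||  /* RWW */
            (a == WR && b == WR && c == RD) ||  /* WWR */
            (a == WR && b == RD && c == WR))    /* WRW */
            return true;
    }
    return false;
}
```

During delta analysis, a pattern flagged in the failing trace but absent from the passing trace is the primary suspect for the failure's root cause.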
Concurrency bugs have been well-studied in the domain of thread-based programs, and therefore a number of bug databases are available to validate new algorithms. Unfortunately, no such bug database is available for sequential or event-based programs. We evaluated our proposed model of diagnosing concurrency bugs on a few real bugs caused by data races in concurrent signal-handlers reported in [
Concurrency bug failures are hard to reproduce and debug. In practice, the process of debugging such failures cannot be fully automated, and the involvement of developers remains essential in uncovering the root cause. The case studies presented above support this observation. If the program execution deviates from its intended behavior, we require the developers to identify that deviation and relate it to the specific code section that failed to meet the intended behavior. The rest of the process can be handled automatically and more efficiently by eliminating the debugging effort that is irrelevant to the given failure scope.
We also observed that in the majority of the concurrency bugs, the failure occurs in the same function in which the memory access violation has taken place, and hence it is possible to find the suspicious variable in that function. For the remaining cases, we can find the suspicious variable by moving up one level in the function call tree.
Record/replay systems have been implemented at different abstraction levels of a computer system [
In the current work, we implemented record and replay at the OS-level. In doing so, we can faithfully reproduce low-level events while remaining transparent to user-level programs. Although the recording overhead can be larger than that of higher-abstraction-level methods, it can be considered negligible as long as it does not perturb the natural execution of the application in the production system.
Flashback [
Scribe [
According to a study on the characteristics of bugs by Sahoo et al. [
In [
Ronsse et al. [
In [
In this section, we discuss some implications and limitations of our work, along with some remaining open questions.
Reproducing a nondeterministic failure for bug diagnosis is a key challenge. To address this challenge, we have presented a lightweight and transparent OS-level record and replay system that can be deployed in production systems. It can faithfully reproduce both synchronous and asynchronous events occurring in programs, such as system calls, message passing, nonblocking I/O, and signals. During the replay of a program, a standard debugger can be attached to enable cyclic debugging without any modifications to the program or the debugger.
We also presented a method for diagnosing concurrency bug failures based on the proposed record and replay system. Given a failure, developers can collect memory access logs for a set of global variables during the replay of failing and passing executions and then compare them to identify the memory access violations causing the failure. Our experience with some real programs shows that this usage model can be very beneficial in locating the root causes of failures related to concurrency bugs.
Our current work is so far limited to debugging of sequential and event-based programs. In the future, we aim to extend it for multithreaded programs so that we can capture and reproduce the exact thread interleaving order in a failing execution and then identify the conflicting memory access operations that lead to program failure.
The authors declare that they have no conflicts of interest.
This research was supported by the Ministry of Science and ICT (MSIT), Korea, under the programs [R0114-16-0046, Software Black Box for Highly Dependable Computing] and [2016-0-00023, National Program for Excellence in SW] supervised by the Institute for Information & Communications Technology Promotion (IITP), Korea.