LTTng CLUST: A System-Wide Unified CPU and GPU Tracing Tool for OpenCL Applications

As computation schemes evolve and many new tools become available to programmers to enhance the performance of their applications, many programmers started to look towards highly parallel platforms such as Graphical Processing Unit (GPU). Offloading computations that can take advantage of the architecture of the GPU is a technique that has proven fruitful in recent years. This technology enhances the speed and responsiveness of applications. Also, as a side effect, it reduces the power requirements for those applications and therefore extends portable devices battery life and helps computing clusters to run more power efficiently. Many performance analysis tools such as LTTng, strace and SystemTap already allow Central Processing Unit (CPU) tracing and help programmers to use CPU resources more efficiently. On the GPU side, different tools such as Nvidia’s Nsight, AMD’s CodeXL, and third party TAU and VampirTrace allow tracing Application Programming Interface (API) calls and OpenCL kernel execution. These tools are useful but are completely separate, and none of them allow a unified CPU-GPU tracing experience. We propose an extension to the existing scalable and highly efficient LTTng tracing platform to allow unified tracing of GPU along with CPU’s full tracing capabilities.


Introduction
Tracing programs has been a common technique from the beginning. Tracers such as DTrace [1], strace [2], SystemTap [3], and LTTng [4] have been around for many years. Not only do they allow recording kernel trace events but some also allow recording user space events. This is a very good alternative to debuggers for finding software bugs and especially performance related issues that concern the execution on the CPU side. The rapid changes in computation never stop to bring new challenges. As part of those rapid changes, we have seen GPU acceleration become a more common practice in computing. Indeed, the different highly parallel architectures of the graphic accelerators, compared to the conventional sequential oriented CPU, represent a very attractive tool for most of the graphical work that can be achieved on a computer. This became the standard in recent years. Most operating systems have even added a graphic accelerator as a hardware requirement. Indeed, its processing power is being harnessed in many of the operating system's embedded tools and helps make computations more fluid, therefore enhancing user experience. The evolution of classical graphical acceleration in games and user interfaces brought architecture changes to the GPU. The graphical accelerators evolved from the graphical pipeline to a more programmable architecture. Programmers quickly figured how to offload raw computations to the graphical accelerator and GPU computing, also referred as General Purpose Graphical Processing Unit (GPGPU), was born. APIs such as Nvidia (http://www.nvidia.com/), Compute Unified Device Architecture (CUDA) (https://developer.nvidia.com/cudazone/), and Apple's open standard Open Computing Language (OpenCL) [5] (developed by the Khronos Group) were created and adapted to standardize the usage of GPUs as GPGPUs. As part of this change, many tools were implemented to help programmers better understand how the new architecture handles the workloads on the GPGPU. The two major manufacturers, Advanced Micro Devices

Related Work
Tracing is the act of recording events during the execution of a program at run-time. This is commonly used with not only GNU's Not Unix (GNU)/Linux [3,4,6] but also all major operating systems such as Unix [1] and Windows (https:// msdn.microsoft.com/en-us/library/windows/desktop/bb96-8803.aspx). The main purpose is to provide a tool for the developers to record some logs about the execution of programs and detect complex problems such as race conditions.
In this section, we first present the different available tracing platforms that are available. Secondly, we discuss what kind of information trace analysis provides, followed by current GPU tracing architectures, tracing tools, and OpenCL tracing capabilities. Then, the topic of time keeping is discussed: many different timing metrics are available and they need to be explored. Moreover, the device timing metrics that can be accessed from the OpenCL library on the GPU may not be using the same source as the host: synchronization methods to help match those two separate sources will be discussed (Section 2.4). Finally, we talk about tracing in the device drivers as a complementary measure of OpenCL tracing.

CPU Tracing.
Tracing on GNU/Linux is somewhat easier than doing it on Windows or other proprietary operating systems. One good reason for that is because we have access to its source code. Since our work focuses on the GNU/Linux operating system, we will direct our attention to this environment and explore available tools on this platform. The major tools identified are strace [2], DTrace [1], SystemTap [3], and LTTng [4].
2.1.1. Strace. The main benefit of strace [2] is its ease of use and the fact that it has been around since the early 90s. The strace functionality allows the user to record all or some specific system calls of a given program or a list of processes. The output is either sent to stderr or redirected to a file. The content of the output is the name of the recorded system call with its parameters and the return value of the call. For instance, recording only the file open and close operations of a program can be done by executing the following command: strace -e open, close./myProgram.
Despite the fact that strace is easy to use, its most obvious limitation lies in the fact that it uses a single output file for trace recording and therefore degrades performance when tracing multithreaded applications since it has to use mutexes to write from the multiple threads to the single file. This issue is discussed in [7].
Strace also has a hidden performance weakness: when tracing an application, the API used to get the system call information (ptrace) adds an overhead of two system calls per system call that the traced program performs [8]. This can have a dramatic impact on the system due to the known overhead of system calls. Therefore, the strace approach does not seem to be suitable in the long run.

DTrace.
DTrace [1] originated from the Solaris operating system. Solaris is a Unix variant developed by Sun Microsystems which was later bought by Oracle. After the acquisition of Sun Microsystems, Oracle ported the tracing tool to the GNU/Linux Operating System (OS) [9]. DTrace relies on hooking scripts to probes that are placed in the operating system kernel. Those scripts can then process and record system calls or other information. A script is compiled into bytecode in order to allow fast execution through a small and efficient kernel based virtual machine. DTrace also provides a library called User Statically Defined Tracing (USDT) that programmers can use to instrument the code of their applications and perform user space tracing [1].
One major issue with DTrace is its usage of synchronization mechanisms that produce a large tracing overhead when tracing heavy workloads such as multithreaded applications [7]. Since programs nowadays are mostly multithreaded, this is a serious limitation.

SystemTap.
Like DTrace, SystemTap [3] provides a framework for hooking scripts to probes in the kernel. It also offers user space tracepoint capabilities and may qualify as a system-wide tracing tool. The main difference with DTrace is that it originated on GNU/Linux and the scripts are compiled to native code instead of bytecode.
While SystemTap allows for user space tracing, a feature needed for OpenCL tracing, the compiled tracepoint script is executed in kernel space. This forces the system to go back and forth from user space to kernel space in order to execute the script. This adds significantly to the cost of executing tracepoints and leads to higher overhead.
Again, like DTrace, SystemTap suffers considerably from the synchronization mechanisms used: creating a large overhead when tracing heavy applications [7]. For the same reason as DTrace, heavily multithreaded applications will suffer from a very large overhead with SystemTap.

LTTng.
Kernel space tracing for LTTng [4] is also based on hooking to kprobes and trace points. The approach to get the information is different from DTrace and SystemTap: modules are written in native and compiled to executable code. LTTng was designed with the idea of tracing multithreaded applications with low overhead: each CPU thread has separate output buffers in order to minimize the need for synchronization methods when writing the trace events. As they are generated, before being written to disk, the trace events are stored in a circular buffer that has a specified capacity. In the worst case, if the event data throughput exceeds the disk throughput, events are intentionally discarded, rather than blocking the traced application, in order to minimize the tracing overhead.
User space event tracing is also provided with the LTTng-UST library and, unlike SystemTap, the user space tracepoint is recorded in a different channel entirely in user space. This allows for minimal tracing overhead [10]. As it will be discussed in Section 2.3.3, tracing OpenCL can be achieved at the user space level. The advantages of LTTng over the other available options justifies its choice as a tracing platform for this project.

Trace
Analysis. Getting the trace information is only one part of the tracing procedure. The encoding used to record traces and the potentially huge size of traces impose strict requirements on the analysis tools. For instance, LTTng provides viewing tools such as babeltrace (https://www.efficios.com/babeltrace/) and Trace Compass (http://projects.eclipse.org/projects/tools.tracecompass/) that allow decoding binary traces to human readable format (babeltrace) and interactive visualization and navigation (Trace Compass).
Visualization tools provide a very good understanding of program execution and allow for quick overall understanding of potential execution bottlenecks and synchronization issues. More advanced analysis can also be performed on traces, for example, pattern searching algorithms for uncovering specific execution patterns within traces [11,12], model checking [13], and statistics computation [14,15].

GPU Tracing.
Research teams around the world have spent a great amount of time developing applications to enable programmers to have a better understanding of the GPU kernel execution [16,17]. It is thus relevant to examine GPU architectures. GPUs started as specialized hardware designed to deal with the display monitor. It first evolved into a graphical pipeline of operations and then towards the more programable highly parallel architecture that we have today [18].
Many programming APIs were implemented to support GPGPU but Nvidia's CUDA and the open standard API OpenCL (supported on several different hardware architectures) were the result of need for high level API [18]. Both APIs provide functions to harness the computing power of graphics cards.
The GPU highly parallel architecture allows not only faster but also more energy efficient operations. This characteristic leads to longer battery life for portable devices and better performance per watt (http://top500.org/list/2014/11/).

GPU Architecture.
The reason behind the efficiency of GPUs compared to CPUs in highly parallel situations is that the GPU offers hundreds and even thousands of smaller computation cores built under the Single Instruction Multiple Data (SIMD) architecture. Not all computations can take advantage of such architecture. This is why only parts of programs can be run on GPUs. The SIMD architecture allows multiple different data elements to be processed simultaneously under the same computing instruction [18].
Different form factors are available: integrated graphics are very popular in the portable computing world but dedicated graphics cards still hold a significant computing power advantage. AMD and Nvidia both offer integrated graphics and dedicated graphics devices, while Intel only offers integrated graphics. All architectures differ a lot from one companie to another and also between models. The performance may vary depending on the application: some architectures are better suited for some kind of computations and vice versa.
Computing on GPUs is a technique that derived from a workaround on the existing capabilities of available hardware. The library that makes OpenCL work on GNU/Linux is a library that positions itself between the program and the drivers but still operates in user space. Figure 1 shows the position of Intel's open source OpenCL library for GNU/Linux called Beignet. Computing on the GPU with OpenCL is done asynchronously: commands are sent to the GPU driver command queue and when the device is ready to receive commands, they are transferred from the host driver command queue to the device command queue. The driver is notified when the commands are done executing.
Third party tools such as TAU (https://www.cs.uoregon .edu/research/tau/home.php) and VampirTrace (http://tu-dresden.de/die tu dresden/zentrale einrichtungen/zih/forschung/projekte/vampirtrace/) provide much of the same information as the proprietary ones but also add support for other computing APIs such as Message Passing Interface (MPI). The information that these tools can provide is the arguments and timing of the API calls to the OpenCL library, as well as some instrumentation that [19] presents in order to get timing for the execution of the events on the graphics card, using profiling tools that OpenCL can provide. CodeXL, Nsight, VTune, TAU, and VampirTrace provide a visualization of the API calls, with related OpenCL context and GPU activity (memory transfers and kernel execution on the device) underneath the API calls.
None of these tools provide a system-wide analysis of all activities related to a specific device (GPU). Also, they do not feature system tracing information that could help provide a more accurate picture of the execution context when debugging. Our solution addresses this issue by proposing a system-wide tracing technique that integrates with LTTng for simultaneous GPU and system tracing.

OpenCL Profiling Capabilities.
As described by [19], tracing the execution of a program is fairly straightforward. OpenCL provides a function to gather information about the execution of a kernel or a data transfer: we will refer to those events as asynchronous commands. This information has to be accessed asynchronously after the asynchronous command completion. The easiest way to collect this information is to subscribe to a callback function that is defined for the event associated with the asynchronous command, using the clSetEventCallback function. As Figure 2 shows, the timing information that concerns the start and end of the kernel execution can be accessed after the end of the kernel execution (at point (5) in Figure 2). For the information to be accessed, the programmer shall use the clGetEventProfilingInfo function. Many different timing metrics can be accessed from this function. Here is a list of available information and their description from the open source specification [20]: (1) Command queued: the command has been added to the host queue.
(2) Command submitted: the command has been transferred from the host queue to the device queue. All of those metrics allow access to a 64-bit integer device time counter at the moment specified [20].
Knowing these four time specifications, we are now able to define precisely the content of the three waiting zones where the commands can reside: (1) On the host driver waiting queue.
(2) On the device waiting queue.
Being able to understand the content of the queues allows the programmer to understand his program resource usage. Then, he can modify how his program interacts with the device and enqueues commands in order to maximize computation resources usage. For instance, instead of waiting during data transfers, it is often feasible to overlap computations on one set of data with the transfers for retrieving the previous set and sending the next set.

Time Keeping.
The importance of the source time clock selection is exposed in [4]. Not all clocks have the monotonic property and in order to ensure that all events are recorded sequentially as they happen, the use of a monotonic clock is the key. The real time clock, for instance, can get set to an earlier time by an administrator or by Network Time Protocol (NTP). Using the function clock gettime (CLOCK MONOTONIC, [. . .]); to access the time reference helps preserve the monotonicity of the trace.
It is important to understand that this clock does not have an official start point. It only guarantees that a reading will not have a timestamp that precedes past readings. Its usage is restricted to comparing events timing within the same computer session, since the timer is reset to an unspecified value at reboot time.
The same principle is also valid for GPU time stamps: they rely on on-chip hardware to get the time and as the CPU monotonic clock, this time stamp only allows for the computation of the difference between two time stamps (the time delta) taken from the same device. Again, this time stamp origin does not refer to any specific time and is therefore different from the CPU time stamp. Synchronization methods are required to match the CPU event time stamp to the GPU time stamp. This problem was already discussed and a synchronization solution was suggested by [21] for synchronizing network traces. The same method may be applied to GPU tracing.

i915
Tracing. Another way to trace GPU activity is to record already existing trace points that are present in the Intel i915 GPU drivers. This approach allows monitoring the commands on the host queue and sent to the device, without the need for an interception library such as the solution Advances in Software Engineering  presented in this paper: OpenCL User Space Tracepoint (CLUST). The upside of this technique is that it monitors all the GPU activity, including graphics (Open Graphics Library (OpenGL)).
One disadvantage of this technique is that it does not allow modifying the applications execution and as Table 1 shows, the only data available is the current content of the host side and device side queues, as well as commands that are being executed on the device.
The reason why i915 monitoring does not allow getting command-specific timing metrics is because in CLUST, in order to access a command start and end time, a profiling flag is enabled when a command queue is created. This leads to extra event data being pushed to the GPU and filled with timing data on the GPU side (events that are invisible to the GPU drivers).
Therefore, tracing i915 drivers would be more beneficial as a complement of tracing with CLUST. This would allow getting a trace that monitors all GPU activity on the host side and also getting the more detailed information regarding OpenCL that CLUST provides.

Implementation
In this section, we present the implementation details of our solution that addresses the identified challenges, followed by a system-wide tracing solution and ways to visualize the OpenCL trace. 3.1. CLUST Implementation. As discussed in Section 2.3.3, the OpenCL API can be divided into two separate function types: synchronous and asynchronous: (1) Synchronous API calls: calls that do not enqueue tasks on the GPU.
(2) Asynchronous API calls: calls that enqueue tasks on the GPU Those two types of functions require a different approach for recording relevant tracing metrics.
3.1.1. Synchronous OpenCL Tracing. As seen in Section 2.3.1, the OpenCL library operates in user space. The purpose of CLUST is to trace an application without the need for recompiling it from its source code. Therefore, CLUST is a library that wraps all function calls of the OpenCL library and automatically instruments all the calls to the API. Figure 3 shows how the API calls are intercepted by the CLUST library. The library file, named libCLUST.so, has to be preloaded using LD PRELOAD before launching the program. In the constructor of the library, the OpenCL symbols are loaded from the real OpenCL library and assigned to function pointers that have the same name with the reallib prefix. Then, all OpenCL functions are redefined, and the start and end of the API calls are recorded using Linux Trace Toolkit next generation (LTTng)-User-Space Tracing (UST) (see Listing 1). must be considered. All of the asynchronous commands have an event provided as parameter. This parameter may be NULL, when the caller does not intend to use an event. When being NULL, because CLUST internally needs it, an event has to be allocated and later freed in order to avoid memory leaks. Here is an example with the asynchronous function clEnqueueWriteBuffer (see Listing 2).

Asynchronous OpenCL
Since this function only enqueues a command in the command processor of the driver, we have to wait until the end of the command execution to get access to more valuable timing metrics. The callback function eventCompleted will be called at the end of the command execution. This is where the interesting timing metrics can be accessed. We access the four timing metrics discussed in Section 2.3.3 using the reallib clGetEventProfilingInfo function and since we do not know which queue and command type the callback is referring to, we also access the command type id and the command queue id that are associated with the event, using the reallib clGetEventInfo function. All collected information is then recorded in a LTTng UST trace point and the event is freed, if the callback parameter specifies that it has to be.
Unfortunately, the timing data collected in this way cannot be written to the trace in sequential time order, since this information was accessed after the execution of the command. This leads to problems at trace analysis time since existing tools assume that the events are written to the trace in a sequential time order.
In addition to recording API function call start and end times, the tracepoints can include relevant metrics to allow the analysis of other important information. This relevant data can be important for later analysis and allow for derived analysis, such as calculating the throughput of data transfers by saving the size in bytes of the data being transferred. By associating this measurement with the timing data recorded by the asynchronous method described previously, we can obtain the throughput of each data transfer. Other metrics such as the queue id can allow for a better understanding of the queues content.

System-Wide OpenCL
Profiling. One of the main advantages of CLUST over existing tracing tools is that it can be used to profile not only one process at a time but also all concurrent OpenCL activity simultaneously. This can be achieved by forcing LD PRELOAD of the libCLUST.so library directly from the /etc/environment variable definition file. Every program will inherit the overridden OpenCL symbols and then all OpenCL applications would automatically be traced when LTTng's CLUST tracing is enabled. The results section (Section 4) show that the overhead of CLUST when the library is preloaded but not tracing is very small. Therefore, this could even be used in production, for 24/7 operation, without significant impact on system performance.

Nonmonotonic Tracepoint
Management. Previously, we identified two different types of CLUST events: the synchronous and asynchronous events. As seen in Section 3.1.1, tracing synchronous events is simple and the timestamp linked to the events is added by LTTng-UST using the monotonic clock at the time of the tracepoint execution.
The challenge lies in collecting asynchronous timestamp data; since the device event timing metrics are accessed after the execution of the command, the timestamp linked with the LTTng-UST event shall not be used to display the device event timing. The timestamps reside in the payload of the tracepoint and the payload timing data that has to be used.
The trace analysis tools, by default, use the tracepoint timestamp for display and analysis. Special treatment would thus be required to indicate that in fact an event contains information about four asynchronous events, each with its own timestamp. Furthermore, these embedded asynchronous event timestamps may not appear sequentially in the trace and use a different time source. Analysis tools such as Trace Compass (http://projects.eclipse.org/projects/tools .tracecompass/) are designed to deal with very large traces and process events for display and analysis in a single pass, assuming a monotonic progression of timestamps. Different approaches can be used to solve this problem; their implementation is discussed in the next section.

Writing in Different Trace
Channels. LTTng records traces in different channels, depending on the logical CPU id, in order to record traces with low overhead. Using additional channels for asynchronous OpenCL events would be possible. The delay between the synchronous events and the asynchronous events could be tolerated in this way. However, this would only work if the asynchronous events are always executed in submission order. One channel would be created for each time value type (Kernel queued, Kernel submitted, Kernel execution start, and Kernel execution end). If the submission order is always followed and asynchronous callback events are received in the same order, the events could be written in each channel when the callback is received and would remain in sorted time order in each channel. However, since the command scheduling on the device is not necessarily first-in first-out, this technique cannot be used.

Sorting.
If there is an upper bound on the extent of reordering needed among commands, it is possible to buffer the events in memory and not output events before every preceeding event has been seen. When information about an asynchronous command is obtained through a callback, four events are described. By correlating this information with the clEnqueueNDRangeKernel calls, it would be possible to determine when all data from earlier events has arrived and it is safe to write some of these events to the trace. Figure 2 depicts the execution of a call to enqueue an OpenCL kernel and its execution on the device. Ignoring the calls to clSetEventCallback that are not traced (action performed by CLUST), the order in which the trace is written to the file is the following: (1) clEnqueueNDRangeKernel start, When a command is queued with clEnqueueNDRangeKernel, the corresponding event can be used to mark the appearance of that kernel, . The callback event for is received later with the execution timing metrics. The latest kernel queued with clEnqueueNDRangeKernel at that time, , where > , is noted. The four embedded events in the callback (Kernel queued, Kernel submitted, Kernel execution start, and Kernel execution end) are then buffered until they can be written safely to the trace because it can be verified that all information about earlier events has been received. This happens when all callbacks for events up to have been received. Indeed, any kernel starting after cannot have any of its four embedded events with timestamps earlier than those in .
The number of such events that can be in flight is bounded by the total size of the queues (in the library, in the driver, and on the card). This buffering and proper sorting could be achieved at trace generation time, when receiving callbacks in CLUST. Alternatively, it could be a separate postprocessing step or be performed at trace reading time in Trace Compass as an input filter. Doing this at trace generation time, rather than as a separate step, may be more efficient overall, because it avoids writing and later reading and reprocessing. However, minimizing the traced system perturbation is the first priority, which makes this solution less attractive.
If the upper bound on the number of events to buffer for sorting is too large, one can decide to resort to general sorting, using one of the available algorithms for external sorting, such as fusion sort. The downside is then having to reprocess the whole trace file before being able to visualize it. This also does not comply with live trace analysis requirements. At least, the asynchronous events should be recorded in their own channel, to separate them from the other sorted events and minimize the size of the trace file needing a reordering step.

GPU-CPU Trace Synchronization.
As explored in Section 2.4, the timing sources of the CPU trace and the GPU trace are not the same. For this reason, the time relation between both time sources needs to be evaluated. This will allow aligning properly the CPU trace with the GPU timing metrics collected.
To compute the time relation, we must find the OpenCL API enqueue functions within the CLUST LTTng-UST trace and determine the most closely related device event tracepoint: (1) The queued time (GPU) of the asynchronous command is after the event time of the OpenCL enqueue API call (CPU).
(2) The end time of the asynchronous command (GPU) is before the event time (CPU) for the callback providing the device events timing metrics.
These two inequalities can then be used to obtain the linear relation between the two timing sources by applying the Convex Hull synchronization algorithm, as proposed in [21], or the more efficient improvement, suitable for live streaming, proposed in [22]. In order to perform trace synchronization, we will need to ensure that all relevant information is gathered in the trace. LTTng-UST already records the cpu id and time stamp (CPU) linked to the OpenCL API call start and end tracepoints as well as the callback event tracepoint. The additional information required in the tracepoint event payloads is the callback event pointer in order to properly match the clEnqueueNDRangeKernel start tracepoint with the corresponding callback event tracepoint.

OpenCL Trace Visualization. The main visualization tool for LTTng is Trace Compass (previously known as the
Tracing and Monitoring Framework plugin within the Eclipse Linux Tools project). This tool already provides visualization for LTTng kernel space tracing and user space tracing. As shown in Figure 4, the left tree displays a hierarchy of processes. The principle is to link the OpenCL activity as a child to the associated process. This implies that there should be one child per thread performing API calls to OpenCL and two additional children per context linked to the parent: one for displaying data transfers between the host and the device and the other one for displaying the OpenCL kernel execution. Figure 4 shows how the trace unification enhances the information shown to users.
In addition to the Control Flow view, the Resources view can be used to display GPU usage: two lines per GPU: one for Advances in Software Engineering 9 the data transfers and the second one for OpenCL kernel execution.
It is important to consider that the architecture of modern GPUs allow for concurrent multikernel execution on the same device, via different streams. If the GPU has multiple streams, it then requires more than one row for displaying OpenCL kernel execution.

Results
In this section, we discuss the results and primarily the overhead of using the OpenCL tracing library CLUST. First, we expose the details of our test platform and the different tests configurations that will be exercised. Then, we measure the library's overhead for both types of tracepoints. Secondly, we measure the impact of tracing on a real OpenCL program.
Finally, examples of how the combined CPU and GPU traces can help troubleshoot issues and enhance application performance will be presented and discussed.

Test Setup.
To demonstrate the low impact of CLUST and LTTng-UST tracepoints, we benchmarked different scenarios of execution with the following different factors considered: (1) Sample size: we measured 100 samples of different sizes ranging from 1 call to 10 8 calls. The need for different sample sizes is to test whether the recording of many tracepoints overwhelm the system or if the design of LTTng really allows for maximum performance. (2) Benchmark configuration: in order to establish a baseline and isolate the impact of the different components, we measured three different execution configurations: (a) Base: tracing disabled, without CLUST. This will serve as the reference for our overhead tests. (b) Preload: tracing disabled, with CLUST preloaded.
When replacing the symbols of the OpenCL library, an overhead is added: the two disabled LTTng-UST tracepoints and the indirection of the pointer to the actual OpenCL function. (c) Trace: tracing enabled with CLUST preloaded. This is the same as the preload but with tracing enabled: the overhead of storing and recording to disk the events will be added to the preload overhead.
(3) Traced element type: as discussed earlier, there are two types of functions in the OpenCL API: the synchronous API functions that do not require any GPU interaction and the asynchronous API functions that enqueue commands to interact with the GPU. We suspect that the impact of CLUST for asynchronous functions is higher since a callback is hooked to every asynchronous OpenCL function calls.
We show in Tables 2, 3, and 4 the average time in nanosecond for each benchmark configurations (Base, Preload, and Trace) as well as the standard deviation. The overhead displayed is the difference between the measurement and the Base.
The benchmarks were performed on our test platform running an Intel i7-4770, 32 GB of memory and using the integrated GPU (HD4600) with the Beignet OpenCL development drivers. Tracing was done with LTTng (v2.6.0) and run on Ubuntu Desktop 14.04 LTS with an upgraded kernel (v3.18.4).

Synchronous Function Tracing Overhead.
For the synchronous API calls, the data is displayed in Table 2. The conclusions we can observe from these results are the following: (1) The average base time ( V ) for a call to clGetPlatformID is about 4.5 ns.
(3) Since each "preloaded" call makes two UST tracepoint calls (without recording anything), we can calculate that the average time for a disabled UST tracepoint ( 1 ) is only 1 nanosecond: (4) Looking at the "traced" time ( V ), we can see that no matter how many tracepoints we generate, the performance per tracepoint is not affected and remains steady. (5) The "traced" metrics show that the overhead of tracing synchronous OpenCL API calls is about 361 ns minus the base call time of 4.5 ns, which give us 356.5 ns for recording two UST tracepoints, or 178.2 ns per tracepoint.

Asynchronous Function Tracing
Overhead. As displayed in Figure 2, recording asynchronous event timing metrics on OpenCL requires hooking a callback function to the event.
The call to an enqueuing function will have the API function's enqueue start and end tracepoints recorded and when the command is done processing by the GPU, another custom tracepoint that contains the timing metrics will be created.
The overhead of such event is expected to be larger than any other synchronous API call since it contains a tracepoint with 44 bytes of payload. This payload has to be generated from 4 calls to clGetEventProfilingInfo and 2 calls to clGetEventInfo. This adds up to the overhead of tracing but it is important to keep in mind that the asynchronous calls are generally performed only a limited number of times per iteration in a computation as compared to synchronous calls. Table 3 shows the average time and standard deviation for various executions of data read from the device to the host using the clEnqueueReadBuffer function, configured for blocking read as a benchmark. Since we have previously shown in Table 2 that the performance of tracing is unaffected by the number of tracepoints to record, this table will show the performance for a fixed number of iterations with different buffer sizes in order to compare the overhead of tracing as a function of the size of the data transferred. This approach shows that the weight of tracing such calls is constant as a function of the size of the buffer being transferred, and the tracing overhead impact per tracepoint is lowered when using larger buffers. Unfortunately, the OpenCL read command time is greatly increased and small buffer reads suffer the most in relative terms. On our test bench, with the Beignet OpenCL drivers, the time measurements are more accurate since the PCI-Express communication latency is not present for integrated graphics.
Measurements have shown that this overhead is caused by two different sources: the calls to the clGetEventProfil-ingInfo function for gathering the start and end times on the device (approximately 1600 ns per call) and the overhead of providing an event to the OpenCL enqueue function (again, twice the same 1600 ns ioctl syscall). The Beignet library performs a heavy syscall to access and write the recorded start and end time of asynchronous functions: this results in a cost of four times 1600 ns, more than 90% of the overall tracing overhead. The recording of the events with LTTng-UST has been measured to be insignificant, as shown in Table 2. Therefore, when tracing is disabled, the data gathering is not performed and the call to the UST tracepoint is also not performed. Avoiding the callsoverwhelm to extract information when tracing is disabled keeps the CLUST library overhead as low as possible when not active.
This very high overhead for retrieving two long integers from the GPU may be reduced by modifying the driver and possibly the GPU microcode to automatically retrieve this information at the end of the execution of the asynchronous command, when OpenCL profiling is enabled.

Real OpenCL Program Tracing Overhead.
Knowing the cost of tracing individual synchronous and asynchronous OpenCL calls does not represent the typical overhead of tracing an OpenCL program. For testing purposes, we used a synthetic benchmark that performs 14 synchronous OpenCL calls and 2 asynchronous calls per iteration. The program generates an image of variable width and height, from one simple run of a very light OpenCL kernel.
We measured the performance for CLUST tracing alone and then the overhead of tracing the operating system kernel with LTTng, for a unified trace overhead measurement. Results are displayed in Table 4.
As we can see, the overhead of tracing has a significant impact on small sized computation tasks. However, when the program is working on bigger images, for more typical resolutions such as 720 p and 1080 p, the relative tracing overhead factor is considerably reduced as compared to the complexity of the calculation itself. GPGPU applications are usually more efficient on larger data sets anyway, again because of the fixed startup costs of sending the commands to the GPU. Figure 5 shows the relative overhead ( from (2)) of tracing with CLUST as a function of the workload: We can also verify the expected tracing overhead computed at the beginning of this section (∼19086 ns overhead per call); we can see that the measured tracing overhead for the real OpenCL programs varies from 14727 ns to 25494 ns and averages at 18990 ns, a measurement that is within 0.5% of the expected tracing overhead.

CLUST Tracing + LTTng Kernel
Tracing. The main advantage of our solution, as compared to existing CPU tracing and GPU tracing approaches, is the unified tracing   capabilities. Our solution allows recording the CPU trace as well as the GPU trace in the same trace format. The gathered performance data is displayed in Table 4 under the CLUST + LTTng columns.
As Figure 5 shows, the tracing overhead of the unified CPU + GPU tracing is yet again very low for larger computations. One of the reasons why the unified tracing overhead does not significantly differ from the CLUST tracing overhead is because the GPU execution speed is not affected by CPU tracing. The overhead is therefore a bit higher than the CLUST tracing overhead alone but this difference fades as the workload size gets larger.
It is interesting to point out that the longer the iterations are, the more LTTng kernel events are recorded. This can be observed by the growth of the overhead per iteration in the CLUST + LTTng Overhead column of Table 5.
The addition of LTTng kernel tracing to the GPU trace allows programmers to find links between the system state and the GPU execution speed. This can lead to performance improvements and bug solving via trace analysis.

CLUST Use
Cases. The biggest advantage of CLUST is not only its low overhead characteristics but also the new kind of information it provides to the programmers. Having access to a unified CPU-GPU trace and its analysis tools can help troubleshoot different types of problems or even help enhance application performance through GPU pipeline analysis.
Examples that depict this kind of troubleshooting scenarios and enhancements will be demonstrated in this section. 4.6. Typical Execution. As a reference for the following use cases, Figure 4 shows a close-up view of one iteration of the program that will be analyzed. It consists of a trace of the program that was presented in Section 4.4 and used to gather performance data. This iteration showed no abnormality and will therefore serve as a base for the upcoming use cases.

Abnormal Execution.
There are two primary types of abnormal executions. The first type occurs when the CPU is preempted, the second occurs when the GPU is being used by different processes.

CPU Preemption.
CPU preemption occurs when the scheduler gives CPU time to another process. For GPU intensive applications, this can lead to catastrophic performance because the CPU constantly gets interrupted and put on wait for device (GPU in our case) when executing calculations or communicating with the device. Figure 6 shows an example of how a unified trace visualization tool would help spot this kind of latency. The orange section of the system kernel trace shows that the CPU is in the WAIT FOR CPU state while the call to a read buffer OpenCL function was completed. Also, the sections on the right and the left of the orange section show normal iteration times. A programmer could fix this problem by making sure that the thread that manages the OpenCL calls has a higher priority than normal and making sure that no other process uses all other available resources.

Shared GPU Resources.
A second issue that can be observed is when the GPU device is shared between multiple processes. The GPU's primary task is to deal with rendering the screen. This process can interfere with the OpenCL kernel execution, and data transfers between the host and the device. Figure 7 shows a great example and its effect on the system. In yellow, we can see that the system is in state WAIT BLOCKED for a long time before being preempted (in orange). On the right side of Figure 7, we can see that the process is running normally. This can be caused by the execution of a concurrent application that uses the GPU resources. If the OpenCL program analyzed is critical, making sure that no other process interferes with it can fix this problem.

Nonoptimized Execution.
Another kind of flaw that OpenCL analysis can outline is nonoptimized OpenCL programs. Due to the different architectures of the GPUs, adapting to the device characteristics is a very important feature of OpenCL programs. Pipeline enhancement and kernel occupancy are just two of the main focuses for optimization. In both cases, the goal is to maximize the device and host resources utilization.

OpenCL Pipeline Enhancement.
One of the best ways to optimize the speed of an OpenCL application is to properly use its pipeline. OpenCL has the capability to process simultaneously device communications and OpenCL kernel executions. Using OpenCL's asynchronous command enqueuing functions, it is possible to maximize the GPU resources usage. As we can see in both Figures 6 and 7, the gray area of the GPU Act. represents idle time for the GPU.

13
These views show that most of the computing and communication time is not being used.
Depending on the application type, the idle time can be reduced by dividing the calculations when possible and enqueuing multiple commands in different command queues to ensure that the device always has work to do. As an example, we divided the workload of our application benchmark and noticed a 28% speedup when running 13 command queues instead of one (with one pthread per command queue). We also measured a speedup of up to 72% for 12 command queues on a different dedicated GPU. The bigger speedup on the dedicated GPU can be explained by the larger overhead of interacting with the device: the latency to communicate with a dedicated GPU is higher than for an integrated graphics GPU. Therefore, dedicated GPU configurations are well suited for pipeline enhancement. After 12 and 13 command queues and threads, we started to notice a slow down when we reached the device saturation point.
CLUST records the required information to allow the analysis and visualization of this kind of behaviour.

OpenCL Kernel Occupancy.
Unfortunately, CLUST cannot currently extract and record the required metrics on the device regarding kernel occupancy since this information is not available in the OpenCL standard. However, having a record of the clEnqueueNDRangeKernel parameters and knowing the device architecture, can help evaluate how the computation was divided and if compute blocks were executed while using a maximum of available resources on the GPU.

Conclusion and Future Work
In this paper, we explored the GPU tracing challenges and addressed the need for a unified CPU-GPU tracing tool. This allows programmers to have a better global overview of the execution of OpenCL powered applications within the whole system. We demonstrated the very low overhead of intercepting OpenCL calls to insert LTTng-UST tracepoints, not only when tracing is inactive, but also when tracing is enabled. This minimizes the impact on the system and maintains the spirit of LTTng as a very low overhead tracing tool. To our knowledge, we presented the only unified GPU-CPU tracing tool available in the open literature.
In complement to this, we demonstrated the analysis and visualization capabilities that a unified trace can provide, and the type of problems that it can help uncover.
As this is a valid solution for OpenCL tracing, future enhancements would be to implement OpenGL and CUDA UST tracing in order to have a global understanding of all GPU activity. Kernel space driver tracing could also be a great improvement for understanding the host's queue. Trace Compass's unified views also need to be fine-tuned. This tracing model not only can serve for GPU tracing but also paves the road for other heterogeneous computing analyses.
The code for this project is open source and available on https://github.com/dcouturier/CLUST.