Multicore systems are complex in that multiple processes run concurrently and can interfere with each other. Real-time systems add time constraints on top of that, making results invalid as soon as a deadline is missed. Tracing is often the most reliable and accurate tool available to study and understand those systems. However, tracing requires that users understand the kernel events and their meaning, and is therefore not very accessible. Using modeling to generate source code or to represent an application’s workflow is handy for developers and has emerged as part of the model-driven development methodology. In this paper, we propose a new approach to system analysis using model-based constraints, on top of userspace and kernel traces. We introduce the constraints representation and show how traces can be used to follow the application’s workflow and check the constraints we set on the model. We then present a number of common problems that we encountered in real-time and multicore systems and describe how our model-based constraints could have saved time by automatically identifying the unwanted behavior.
System analysis tools are necessary to allow developers to quickly diagnose problems. Tracers provide a lot of information about what happened in the system at a specific moment of interest, as well as what led to these events, with associated timestamps. They thus allow studying the runtime behavior of a program execution. Each tracer has its own characteristics, including overhead and precision level. Some tracers only allow tracing kernel events, while others also provide userspace tracing, making it possible to correlate the application’s behavior with the system’s background tasks. However, all of these tracers share the requirement of significant human intervention to analyze the information read from the trace. It is also necessary to understand exactly what the events mean, to be able to benefit from this information.
Modeling allows technical and nontechnical users to define the workflow of an application and the logical and quantitative constraints to satisfy. Modeling is also often used in the real-time community to do formal verification [
This paper describes a new approach for application modeling using model-based constraints and kernel and userspace traces to automatically detect unwanted behavior in applications. It also explains how this approach could be used on top of some common real-time and multicore applications to automatically identify the problems that we encountered and when they occur, thus saving analysis time.
Our main contribution is to set constraints over system-side metrics such as resource usage, process preemption, and system calls.
We present the related work in Section
This section presents the related work in the two main areas relevant to this paper: software tracing with a userspace component, and analysis of traces using model-based constraints.
To extend the specification checking of an application, trace data must be available at both the application and system levels. We also need to put emphasis on the precision and low disturbance of the tracer used to acquire these traces. In this section, we present characteristics of currently available software tracers with a userspace component and kernel tracing capabilities.
Basic implementations of tracers exist that rely on blocking system calls and string formatting, such as using
The
In this section, we present different approaches used for model-checking analysis on traces. We also review tools aimed at extracting data from traces.
Other algorithms to automatically generate trace checkers are presented in [
To our knowledge, model analysis is not yet exploiting all the available information. By combining model analysis and trace analysis tools, the gap of unused information can be reduced. This would allow the specified behavior of the system to be verified through its execution trace, during or after running our application. Previous work has also been done on automatic kernel trace analysis using pattern matching, through state machines [
When designing a high performance application, the developers usually know what they expect their application to do. They know the order of the operations to perform and different metrics along with their average values. It is in fact these values that allow the developers to verify that their application is performing well and doing what they want it to do.
In this section, we will present our approach, which uses finite state machine models and constraints over kernel and userspace traces to detect unwanted behaviors in programs. These models require the applications to be instrumented, to delimit where the constraints apply. We will first detail the general representation and then propose some model-based constraints that could be applied to existing applications.
Whether it is to check a limit in terms of time or resources used by an application, metrics are usually taken between two states during the execution. We first have the start state, appearing before the application’s work that we want to check. This state serves as a base to calibrate our metrics. We then have the second state, the end state for that check, at which point we can validate that we are within the limits.
Even if, during debugging, these states are usually identified by a developer who knows the application, they can be fixed in the application workflow using a state machine representation. This representation can then be used to analyze constraints. Events generated from userspace tracepoints can thus be used to identify the state changes in our application.
Our representation is based on four elements: the states, the transitions, the variables, and the constraints. The states represent the different stages of our application’s workflow. The transitions represent the movement from one state to itself or to another. The state changes in the traced application can be identified and replicated in the traced system model through the events received in the trace.
The variables are used to get and store the values of the metrics we need to verify. There are three main categories of variables: the state system free variables (not based on the state system, such as those used to store timestamps or values available directly from the received events), the counter variables (or counters, such as those used to store the number of system calls over time), and the timer variables (or timers, such as those used to store the time spent running a process). The variables are categorized depending on the number of calls needed to get their value from our state system.
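These four elements can be captured with a few simple record types, as in the following simplified sketch (the Python names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Constraint:
    left: str    # variable name or constant value
    op: str      # one of: ==, !=, <, <=, >, >=
    right: str   # variable name or constant value

@dataclass
class Transition:
    event: str                                        # userspace event triggering it
    target: str                                       # destination state
    constraints: list = field(default_factory=list)   # checked on this transition

@dataclass
class State:
    name: str
    initializations: list = field(default_factory=list)  # variables to snapshot here
    transitions: list = field(default_factory=list)
```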
Our state system is based on the
Counters do not need more than one call to our state system, as their value is considered to be the last one encountered: once a counter is incremented, it keeps this value until the next incrementation. On the other hand, the new value of a timer is stored in the state system at the end of the activity, adding up to that timer. This means that when requesting the value of a timer at a given timestamp, we need to verify if the timer is currently running. We thus need to get the last value of the timer and its next value to interpolate the current running value.
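For illustration, this interpolation can be sketched as follows, modeling the stored timer data as (start, end, cumulative value at end) tuples:

```python
def timer_value(intervals, t):
    # intervals: (start, end, cumulative value at end) tuples, sorted by
    # time, one per period during which the timer was running
    total = 0
    for start, end, cumul in intervals:
        if t >= end:
            total = cumul                   # period fully elapsed
        elif start <= t < end:
            before = cumul - (end - start)  # value when the period began
            return before + (t - start)     # interpolate the running value
    return total

# 10 units accumulated in [0, 10), timer running again since t = 25
assert timer_value([(0, 10, 10), (25, 40, 25)], 30) == 15
```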
The constraints express the specified expectations for a run of the application. They are composed of two operands and one operator. The operands are either variables or constant values to be compared. The operator is one of the standard relational operators, equal (
Three validation statuses are available for the constraints: valid, invalid, and uncertain. The valid status means that the constraint was satisfied; the invalid status means that it was not. In both those cases, we were able to read the variable and compare it to the requirement. In some cases, however, when information is missing, a constraint cannot be verified. This is, for instance, the situation of constraints over counters or timers when no kernel trace is available for the analysis and thus no state system was built. In those cases, the constraint validation status is considered uncertain.
The constraints are linked to a transition and will be checked when this transition is reached. The transition will thus have a validation status that is the worst case of its constraints’ statuses. Therefore, having at least one invalid constraint is sufficient to know that the transition did not satisfy the constraints. If there is no invalid constraint, but at least one uncertain constraint, we cannot guarantee that all the requirements were met for that transition, thus making it uncertain. Finally, a transition will be valid if and only if all of its linked constraints are valid.
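This three-valued validation can be sketched as follows, where None stands for a variable whose value could not be read:

```python
from enum import IntEnum
import operator

class Status(IntEnum):
    VALID = 0
    UNCERTAIN = 1
    INVALID = 2  # ordered so that max() yields the worst case

OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def check_constraint(left, op, right):
    # None stands for a value we could not read, e.g., a counter or
    # timer when no kernel trace (and thus no state system) is available
    if left is None or right is None:
        return Status.UNCERTAIN
    return Status.VALID if OPS[op](left, right) else Status.INVALID

def transition_status(constraints):
    # the transition's status is the worst case of its constraints' statuses
    return max((check_constraint(*c) for c in constraints),
               default=Status.VALID)

# deadline met, but preemption counter unavailable: transition is uncertain
print(transition_status([(1.8, "<=", 2.0), (None, "==", 0)]))
```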
All these elements allow building the model used to identify instances of our application in the traces. The instances are identified using their thread ID. The variables are currently local to an instance of the application and thus cannot be shared.
Figure
State machine representation that can be used to check metrics using traces.
The “event (1)” string represents the event that would be used to enter state
The “
Finally, the “
Both initializations and verifications are optional for a state or transition, but events are still needed to follow the application workflow. This allows following a strict order of events to move forward in the application, without necessarily having metrics to check at each point.
This also allows us to initialize variables at one state but to only check them at a later state of our state machine, as shown in Figure
State machine representation with late verification of constraints and a transitional state.
The verifications will use initializations that only appear at state “
The verifications will use both initializations declared at different states
It is also possible for a state to have two (or more) next states, while still being able to validate the related constraints. In such cases, different events would be used for each of the possibilities, as shown in Figure
State machine representation with multiple next states for state “
Finally, in our approach, an event can be used at each junction of the model, but only once per junction. This removes any uncertainty about the flow to follow, in order to verify the constraints. Using this and the possibility to have multiple exits per node, we could, for instance, allow executing the initializations each time we encounter an event of type “event (1),” to only verify metrics between the last event of type “event (1)” and the first event of type “event (2).” Figure
State machine representation using a loop, to go over the initializations when reading an event of type “event (1).”
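For illustration, following such a model over the userspace event stream can be sketched as follows, reducing the model to a map from state and event type to next state:

```python
def follow(model, start, events, on_enter):
    # model: state -> {event type -> next state}; an event type is used
    # at most once per junction, so the flow is unambiguous
    state = start
    for name, ts in events:
        target = model.get(state, {}).get(name)
        if target is not None:
            state = target
            on_enter(state, ts)  # run initializations / check constraints
    return state

# A self-loop on "event (1)" re-runs the initializations until the first
# "event (2)" moves the instance forward, as in the figure.
model = {"s1": {"event (1)": "s1", "event (2)": "s2"}}
trace = [("event (1)", 10), ("event (1)", 20), ("event (2)", 35)]
follow(model, "s1", trace, lambda state, ts: print(state, ts))
```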
This section gives an overview of some constraints that we can specify on different metrics of the applications or the system. This overview extends the deadline constraint, already present in most real-time analysis tools based on constraint verification, to our new system-specific constraints. Our new constraints take advantage of the kernel-level information, about kernel internals and processes, available in our detailed execution traces.
Real-time is as much about logical determinism as it is about temporal determinism. In such applications, we consider that a deadline has to be satisfied for the result to be correct, and we have to take into account the maximum allowed time to get that result.
Figure
State machine representation of a constraint validating whether we spent at most 2 ms between the states “
When entering the “
When entering the “
We would thus only need userspace traces to check a constraint of this type. The deadline constraint uses a state system free variable.
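Such a check reduces to comparing two userspace timestamps; the following simplified sketch assumes nanosecond timestamps and the event names of the figure:

```python
DEADLINE_NS = 2_000_000  # 2 ms

def check_deadlines(events):
    # events: (name, timestamp in ns) pairs from the userspace trace
    start_ts = None
    for name, ts in events:
        if name == "event (1)":       # initialization: store the timestamp
            start_ts = ts
        elif name == "event (2)" and start_ts is not None:
            yield ts - start_ts <= DEADLINE_NS   # constraint verification
            start_ts = None

print(list(check_deadlines([("event (1)", 0), ("event (2)", 1_500_000)])))
# [True]: 1.5 ms elapsed, within the 2 ms deadline
```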
When designing a high performance application, some tasks can be highly sensitive. In such cases, any preemption could disrupt the application’s work. We thus usually design our application to be able to work without being interrupted by another task, for instance, by setting a high priority.
Figure
State machine representation of a constraint validating that our process has not been preempted between the states “
When entering the “
Programmatically, we would use the “sched_switch” kernel events to know when the process is scheduled and unscheduled, using the events of types “event (1)” and “event (2)” to limit the search zone. For each “sched_switch” encountered in which our process is preempted (i.e., in which our process enters a wait-for-CPU state), we can increment our initialized preemption counter. All this work is done directly in our state system. Once we reach the constraint, we thus only need to get the difference between the value of the preemption counter at the timestamp when we entered the “
This constraint is complementary to the deadline constraint. Indeed, an application could meet a given deadline while having been preempted, and conversely an application could miss a deadline without having been preempted. They could thus be used together to verify a high performance condition. In typical cases, the deadline is ultimately the important constraint, but any preemption, even a short one that does not cause a deadline miss, may indicate that a longer preemption could happen that would cause a deadline miss.
The preemption constraint uses a counter variable.
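As an illustration, the counting step can be sketched as follows (LTTng-style “sched_switch” fields are assumed; a prev_state equal to 0 means the outgoing thread was still runnable, hence preempted):

```python
def count_preemptions(kernel_events, tid, t1, t2):
    # count sched_switch events in [t1, t2] where our thread is switched
    # out while still runnable (prev_state == 0), i.e., preempted
    return sum(1 for ev in kernel_events
               if ev["name"] == "sched_switch"
               and t1 <= ev["ts"] <= t2
               and ev["prev_tid"] == tid
               and ev["prev_state"] == 0)
```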
Whether it is a minimum or a maximum, it can be useful to limit the usage of the resources of a system, such as the CPU, random access memory, disk, or network input/output. Taking the example of CPU usage, we could consider, for instance, that our application is doing a really simple job and thus should not use more than 1% of the CPU time during a given period delimited by two states. We could also consider that our application’s work is so important during a given period that it should be using 100% of the CPU time (no preemption or waiting).
Figure
State machine representation of a constraint validating whether our process used at most 1% of the CPU time between states “
When entering the “
Programmatically, we use the events of types “event (1)” and “event (2)” to delimit the time period during which we look at the CPU usage. Using the kernel traces for the same time period, we can know which process was running on which CPU, and for how long. With that information, we can sum the running durations of our process and compare them to the total time period duration. This value is actually computed in our state system, allowing us to get the value at the times of “event (1)” and “event (2)” using only two state system calls for each. Using both those values, we can get the difference and compare it to the limit (1% in our example) to check whether our constraint is validated.
The resource usage constraint uses a timer variable.
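For illustration, the computation amounts to summing the running durations clipped to the studied window (the scheduling intervals are assumed to be already extracted from the kernel trace, as our state system provides):

```python
def cpu_usage(sched_intervals, t1, t2):
    # sched_intervals: (start, end) periods during which our process was
    # running, clipped here to the window delimited by the two events
    running = sum(max(0, min(end, t2) - max(start, t1))
                  for start, end in sched_intervals)
    return running / (t2 - t1)

# e.g., 7 time units running out of 1000: satisfies a 1% usage limit
assert cpu_usage([(100, 104), (900, 903)], 0, 1000) <= 0.01
```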
Following the CPU usage constraint, it can be equally interesting to limit how much time a process spends in “wait-for-CPU” or “wait-blocked” status. These constraints are thus complementary to the previous one.
Figure
State machine representation of constraints validating whether the process spent at most 10% of the time period between states “
In this model, the “wait-for-CPU” status constraint “wc” is initialized using the string “
Programmatically, the events of types “event (1)” and “event (2)” would allow delimiting the working time period. We would then look at the kernel events in that period to check the status of our process and compute the time percentage spent in the status we want to check, in the same way that we computed this information for the CPU usage constraint, but using the new state of the unscheduled process to know if it is now waiting for CPU or blocked. This information is directly computed in our state system; we can thus access it easily for the given interval and verify our constraint.
The wait status constraint uses a timer variable.
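A simplified sketch of the wait-for-CPU case follows (LTTng-style “sched_switch” fields are again assumed; the wait-blocked case would test a nonzero prev_state instead):

```python
def wait_for_cpu_ratio(kernel_events, tid, t1, t2):
    # fraction of [t1, t2] the thread spent waiting for a CPU, rebuilt
    # from sched_switch events
    waiting, since = 0, None
    for ev in sorted(kernel_events, key=lambda e: e["ts"]):
        if ev["name"] != "sched_switch":
            continue
        ts = min(max(ev["ts"], t1), t2)   # clamp to the studied window
        if ev["prev_tid"] == tid and ev["prev_state"] == 0:
            since = ts                    # switched out while runnable
        elif ev["next_tid"] == tid and since is not None:
            waiting += ts - since         # scheduled back on a CPU
            since = None
    if since is not None:
        waiting += t2 - since             # still waiting at the window end
    return waiting / (t2 - t1)
```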
High performance applications are sometimes designed to work only in userspace during their critical inner loop performing the real-time task. This helps remove any latency that could be caused by other processes, the hardware, or other resources in the system. This is, for instance, the case when a user process gets the proper permissions to directly access some I/O addresses, for interacting with external inputs and outputs through an FPGA card connected to the PCIe bus. In that case, these input and output operations completely avoid any interaction with the operating system. Other common cases of communications that bypass the operating system are accesses through shared memory buffers, synchronized by native atomic operations. In such cases, we would want to verify that the process remained in userspace for all its scheduled time. A system calls constraint could be useful here.
Figure
State machine representation of a constraint validating that our process has not done any system call between states “
When entering the “
Programmatically, we count the number of kernel events whose name starts with “syscall_entry,” using the events of types “event (1)” and “event (2)” to limit the search for these events. For each event encountered that matches our search, we can increment our system calls counter. We can then compute the difference for that counter and use it to check against the requirement.
The system calls constraint uses a counter variable.
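For illustration, the counting step can be sketched as follows (LTTng-style event layout assumed):

```python
def count_syscalls(kernel_events, tid, t1, t2):
    # count entries into the kernel by our thread within [t1, t2]
    return sum(1 for ev in kernel_events
               if ev["name"].startswith("syscall_entry")
               and t1 <= ev["ts"] <= t2
               and ev["tid"] == tid)
```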
This section presents different case studies of common problems; each one is extracted from a real industrial problem that we solved using tracing.
In real-time systems, we have to comply with the given deadlines for a task. That task can happen multiple times in a short period of time. In some cases that we encountered, a task happening up to 1 000 times per second was missing its deadline one or two times per second. That task being a hard real-time one, those missed deadlines were not acceptable.
Kernel and userspace traces were used to identify the task execution and see what happened on the kernel side. These traces showed that, for each task that did not meet its deadline, another process of higher priority was scheduled instead. That process was not scheduled the rest of the time, letting the other tasks—having the same system priority as the ones missing their deadlines—reach their deadlines.
Figure
Preemption of a process by another higher priority (
Screenshot of Trace Compass showing the preemption
Trace event “sched_switch” happening to do the preemption
The application could be represented using our model approach, setting at least two states, one for the beginning of each task subject to a deadline and one for its end. We could here use a deadline constraint to be informed each time we have not finished our task in time, limiting the search for problems to precise zones.
Depending on what we expect our application to do, we could also take advantage of other constraints like a preemption constraint or a CPU usage one to get more information as to why we do not follow the expected workflow. These constraints would, however, need kernel traces to be verified. Figure
State machine representation of
Some high performance processes have to be running all the time. In such situations, the system and process are usually configured to favor that permanently running status, for instance by setting a high real-time priority and affinity to an isolated CPU. Still, in some instances, our high performance process is preempted when it should not be.
In previous work [
Using kernel traces, it is possible to see the different processes being scheduled and compare their priority.
In the
Figure
Unexpected kernel work while tracing a userspace-only application.
Screenshot of Trace Compass showing the period between two “npt:loop” events in the application
Kernel events traced between the two “npt:loop” events showing kernel tasks running while the application is waiting to continue its work, causing latency
With our model approach, we can use a state identifying that our application is in a loop; for each event informing us that we are doing another iteration of the loop (“npt:loop” in our example), a constraint would be validated. This constraint could be a CPU usage constraint, for instance, ensuring that our application had a sufficiently high share of its CPU. We could otherwise use a preemption constraint, to limit the number of times that our application has been preempted during that iteration of the loop. Finally, if we consider that our application should only be working in userspace during the given period of time, a syscall constraint can be used. Figure
State machine representation of
Synchronization between the different threads and processes of a multicore application is often the hardest part of the design. In high performance applications, we want to be sure that the thread or process waiting for another will spend only the necessary amount of time waiting and be able to resume its activity as soon as possible. For some programs, however, the
The application
Using kernel traces, we can identify when a process is in wait-blocked status and use a waiting dependency analysis to identify the origin of that wait.
The kernel traces were used to identify what
Kernel traces also identified that the same synchronization strategy was used in
Considering that the application should normally have well-bounded delays for its tasks, we could use our model approach to represent the application’s normal task and use deadline constraints to verify that we do not have unduly long latencies. For the
High performance is sometimes expected from multithreaded tasks on a multicore system. In these cases, cache access and synchronization are usually optimized to achieve good scalability. However, it may happen that some part of the task misses these optimizations and does not scale well, causing regressions when using parallel cores. For highly scalable multithreaded processes, this behavior should be avoided.
An occurrence of that problem was encountered while we were searching for the point at which a highly parallel, I/O-heavy application becomes I/O-bound. That application used the
Using only kernel traces and looking at them with a visualization tool, the regression appearing in that last example was identified. The processes all seemed to be waiting for the last calling thread before unmapping and ending their respective calls to
This behavior is normally associated with the use of a barrier, as seen in Figure
Screenshot of Trace Compass showing the barrier at which threads are waiting before unmapping operations, after their calls to
If we consider the application to be among the highest priority tasks on the system, a CPU usage constraint could be efficient to know whether the process is really taking advantage of the CPU. This constraint would show if the CPU usage is insufficient, compared to our expectations. We could also use a wait-blocked status constraint, stating that our application should not spend more than a given percentage of time in wait-blocked status. Either of those constraints would detect that situation.
External resources are necessary in some cases to perform specific tasks. For instance, for some highly parallel computing tasks, GPUs are becoming more and more interesting, as compared to multicore CPUs. In such cases, computation or data rendering depends on another processing unit, different from the one running our application. In high performance situations, if the CPU work is highly dependent on the GPU work and the GPU work is not optimized, bottlenecks will appear and our process will be in wait-blocked status.
That problem was encountered while we wanted to know if an application running on a CPU and requesting GPU work was optimized.
Kernel and userspace traces can be useful here, together with a visualization tool. In the previous example, we added userspace tracepoints in the API calls to OpenCL to get more information about what happened in the GPU. We were thus able to use the generated events to understand the origin of a latency in a given process.
In some situations, the latency was induced by CPU preemption: the GPU had finished its work but was unable to get back to the process, which was unscheduled or already busy, as shown in Figure
Views showing wait situations while using external resources; these views do not exist in the Trace Compass mainline version yet [
Unified CPU-GPU view showing the process unscheduled from its CPU causing wait on the GPU side
Unified CPU-GPU view showing the process waiting for the GPU while the GPU is still working on another task
The CPU preemption could easily be detected by using our model approach. Before calling the external resource (the GPU in this case), we could enter an “external resource call” state and, once that resource answers, an “external resource answered” state, for instance. We could then use a preemption constraint or a wait-blocked status constraint to ensure that our process does not end up unscheduled from its CPU.
The GPU sharing would however be trickier to detect with the current resources as
Using the case presented in Section
Number of events and sizes of the traces used to benchmark our analysis.
| Name | UST events | Kernel events | Total events | Size (MiB) |
|---|---|---|---|---|
| tk-preempt (1) | 20 015 | 932 778 | 952 793 | 35.40 |
| tk-preempt (2) | 800 015 | 10 961 881 | 11 761 896 | 508.8 |
| modelbench (3) | 102 400 | 13 510 489 | 13 612 889 | 352.6 |
Figure
Results of the analysis using the model-based constraints on userspace and kernel traces.
Result shown following the analysis of both kernel and userspace traces when the constraints are satisfied
Result shown following the analysis of both kernel and userspace traces when the constraints are not satisfied
Result following the analysis of the userspace trace only to simulate a case where we would not have any kernel trace, thus making the state system unavailable for the analysis
Results presented in Figures
Figure
Examples of invalid sections as reported by our tool for other cases discussed in Section
Invalid section for the case presented in Section
Invalid section for the case presented in Section
Invalid section for the case presented in Section
Switching on and off the different constraints put in the model represented in Figure
Average (avg.) and standard deviation (std. dev.) of the time taken in seconds by a run of the model-based constraints analysis, computed using 100 runs of the analysis of the trace tk-preempt (1).
| Constraints | State system build | Model build & constraint verif. | Total |
|---|---|---|---|
|  |  |  |  |
| Avg. | 3.8500 | 0.34352 | 4.1935 |
| Std. dev. | 0.23075 | 0.024853 | 0.23532 |
|  |  |  |  |
| Avg. | 3.8592 | 0.72952 | 4.5888 |
| Std. dev. | 0.20432 | 0.047765 | 0.21079 |
|  |  |  |  |
| Avg. | 3.8271 | 0.46162 | 4.2887 |
| Std. dev. | 0.17792 | 0.043790 | 0.18840 |
|  |  |  |  |
| Avg. | 3.8609 | 1.1251 | 4.9860 |
| Std. dev. | 0.22394 | 0.042746 | 0.22484 |
|  |  |  |  |
| Avg. | 3.8312 | 0.15789 | 3.9891 |
| Std. dev. | 0.17902 | 0.017863 | 0.17954 |
Average (avg.) and standard deviation (std. dev.) of the time taken in seconds by a run of the model-based constraints analysis, computed using 100 runs of the analysis of the trace tk-preempt (2).
| Constraints | State system build | Model build & constraint verif. | Total |
|---|---|---|---|
|  |  |  |  |
| Avg. | 29.663 | 6.9751 | 36.638 |
| Std. dev. | 0.84278 | 1.1635 | 1.5931 |
|  |  |  |  |
| Avg. | 29.599 | 41.223 | 70.821 |
| Std. dev. | 1.1128 | 1.0844 | 1.7550 |
|  |  |  |  |
| Avg. | 29.462 | 22.220 | 51.682 |
| Std. dev. | 1.0688 | 0.59945 | 1.2692 |
|  |  |  |  |
| Avg. | 29.766 | 58.855 | 88.620 |
| Std. dev. | 1.1275 | 1.2559 | 1.8220 |
|  |  |  |  |
| Avg. | 30.049 | 5.2234 | 35.272 |
| Std. dev. | 0.98000 | 0.91995 | 1.3683 |
Given the different numbers of userspace and kernel events in each trace, we can see the different baseline times needed to build the state system and the model and verify the constraints when no constraint is active (
Amongst the constraints, we can see that for both traces the deadline constraint is the fastest to compute, followed by the preemption and finally the CPU usage. This is consistent with the fact that state system free constraints do not need complementary data to be computed, while counters need two state system calls for the interval, and timers four calls.
For each trace, however, we can see in Tables
Average (avg.) and standard deviation (std. dev.) of the time taken in seconds to build the state system during the first run versus to verify if it exists in the subsequent runs, computed using 100 runs of the analysis of the traces.
| Trace | Build | Access |
|---|---|---|
|  |  |  |
| Avg. | 3.8457 | 0.0516 |
| Std. dev. | 0.20404 | 0.00611 |
|  |  |  |
| Avg. | 29.708 | 0.0539 |
| Std. dev. | 1.0463 | 0.00764 |
The last validation step of our approach has been to verify its scalability. As our model-based analysis uses both traces and models, we needed to validate scalability on those two different aspects.
In order to measure the scalability relative to trace length, we generated a number of traces containing events needed to follow the model presented in Figure
Time (in s) to build the instances and check their constraints as a function of the number of userspace events. Lines represent linear regressions of the data.
Time (in s) to build the state system as a function of the number of kernel events. The line represents a linear regression of the data.
Figure
Figure
Also, the linear regressions in both Figures
To analyze the model scalability, we used the trace
Time (in ms) to build the instances as a function of the number of successive states in the model.
Time (in ms) to build the instances as a function of the number of transitions between two states.
Time (in s) to build the instances and check their constraints, as a function of the number and categories of constraints between two states.
We used this trace to first consider a model without constraint and with only one transition per state, to analyze the scalability according to the number of successive states in the model, as shown in Figure
We then analyzed a situation in which the number of states was fixed, but the number of transitions from one state to the other was variable. Figure
Finally, we studied the constraints scalability by fixing the number of states and transitions and by varying the number of constraints on that transition. We repeated that test for the three different categories of constraints and for a case using one constraint of each category. Figure
All those tests allow us to validate that our approach executes in time linearly proportional to the trace length and model size. A bigger trace will thus take more time to analyze, just as a model with more nodes and constraints will take longer to follow and check.
We have presented our approach for application modeling, using model-based constraints, and kernel and userspace traces, to automatically detect unwanted behavior in real-time and multicore applications. We presented how our models use tracepoints to follow the application workflow. We then proposed some constraints, using userspace and kernel traces information, to validate application behavior. We detailed multiple cases where tracing has been helpful to identify an unexpected behavior and explained how our model approach could have saved time by automatically identifying those behaviors. Finally, we presented the results produced by our approach and the associated execution time as well as its scalability relative to trace length and model complexity.
We believe that using model-based constraints on top of userspace and kernel traces has great potential to automate performance analysis and problem detection. We intend to pursue our work to use model-based constraints not only to detect problems but also to identify their root cause. We could also use this information to allow our approach to propose simple solutions to common real-time and multicore problems, such as raising the priority of a process if it was preempted but should not have been.
This work represents the views of the authors and does not necessarily represent the view of Polytechnique Montreal. Linux is a registered trademark of Linus Torvalds. Other company, product, and service names may be trademarks or service marks of others.
The authors declare that they have no competing interests.
The authors are grateful to Mathieu Côté, David Couturier, François Doray, Francis Giraldeau, and Fabien Reumont-Locke for the cases studied in this paper. This research is supported by OPAL-RT, CAE, the Natural Sciences and Engineering Research Council of Canada (NSERC), and the Consortium for Research and Innovation in Aerospace in Québec (CRIAQ).