We present the Brigade renderer: an efficient system that uses the path tracing algorithm to produce images for real-time games. We describe the architecture of the Brigade renderer, and provide implementation details. We describe two games that have been created using Brigade.
Historically, games have been an important driving force in the advance of graphics hardware and rendering algorithms. Effort has evolved from striving for abstract, visually pleasing results to more plausible, realistic rendering. In the former case, a distinct visual style is chosen that does not require realism; instead, over-the-top animation styles and matching graphics are used. Examples of this approach include most early 2D computer games, as well as more recent titles such as Super Mario Galaxy [
Two examples of modern games that use a nonrealistic visual style. (a) Super Mario Galaxy, (b) Okami.
Many modern games strive for realistic graphics, where the goal is to convince the player that the result is (or could be) realistic. Examples are racing games such as the Gran Turismo series [
Two examples of modern games that aim for a high level of realism. (a) Tom Clancy’s H.A.W.X., (b) Gran Turismo 5.
More recently, efforts have been made toward physically correct results. For static scenery and a static light configuration, this can be achieved by precalculating global illumination, or by coarsely calculating radiosity. Examples of this are games based on the Unreal 3 engine [
Precalculated global illumination, calculated using Unreal technology. (a) Mirror's Edge, lit by Beast. (b) Scene lit by Lightmass.
Physically based rendering of virtual worlds has strong advantages. The obvious advantage is image fidelity (Figure
Ray tracing versus path tracing. (a) was rendered using the Arauna ray tracer [
Of the available physically based rendering algorithms, stochastic ray-tracing based methods (path tracing and derived methods) are favored over finite element methods, due to their elegance and efficient handling of large scenes. Unlike rasterization-based approaches, path tracing scales up to photo realism with minimal algorithmic complexity: the only dependencies are compute power and memory bandwidth. Both increase over time. Moore’s law states that the number of transistors that can be placed inexpensively on an integrated circuit rises exponentially over time [
Assuming that all other factors remain constant (e.g., scene complexity, screen resolution), it can thus be assumed that there will be a point where physically based rendering is feasible on consumer hardware.
Recently, Whitted-style ray tracing and distribution ray tracing have been shown to run in real-time, or at least at interactive frame rates, on CPUs (see e.g., [
Interactive path tracing was first mentioned in 1999 by Walter et al. as a possible application of their Render Cache system [
Sadeghi et al. use ray packets for their path tracer [
In their 2009 paper, Aila and Laine evaluate the performance of various ray traversal kernels on the GPU [
More recently, Novák et al. used GPU path tracing with path regeneration to double the performance of the path tracing algorithm on stream processors [
Outside academia, several applications implement interactive path tracing. Examples are octane [
The unbiased path tracing algorithm with Russian roulette is shown in Algorithm
T ← (1, 1, 1); E ← (0, 0, 0)
loop
    find the nearest intersection of ray R with the scene
    if nothing was hit then E ← E + T · sky(R); break
    if a light was hit then E ← E + T · emission; break
    survival probability p ← max component of the material albedo
    if rand() > p then break                      (Russian roulette)
    T ← T · albedo / p
    R ← ray from the hit point, direction sampled from the BRDF
pixel[x, y] ← pixel[x, y] + E
The path tracing algorithm can be efficiently implemented on the GPU, using a single kernel per pixel. The kernel loops over the samples for a pixel and outputs the final color. This limits memory access to read-only scene access and a single write for the final pixel color. CUDA code for this is provided in the Appendix. Example output is shown in Figure
Scene from Kajiya's paper, rendered using our CUDA path tracing algorithm.
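As a CPU-side illustration of this single-kernel-per-pixel structure, the sketch below computes the final color of one pixel by looping over its samples, with Russian roulette after a few bounces. The toy scene (a single diffuse sphere under a uniform sky), the uniform-hemisphere sampling, and all constants are stand-ins for illustration; this is not Brigade's CUDA code.

```cpp
#include <cmath>
#include <random>

struct Vec3 { float x, y, z; };
static Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 normalize(Vec3 a) { return a * (1.0f / std::sqrt(dot(a, a))); }

// Nearest intersection with the single scene sphere; returns t, or -1 on a miss.
static float intersectSphere(Vec3 o, Vec3 d, Vec3 c, float r) {
    Vec3 oc = o - c;
    float b = dot(oc, d), disc = b * b - dot(oc, oc) + r * r;
    if (disc < 0) return -1.0f;
    float t = -b - std::sqrt(disc);
    return t > 1e-3f ? t : -1.0f;
}

// Path-traced radiance along one ray, with Russian roulette after 3 bounces.
static Vec3 radiance(Vec3 o, Vec3 d, std::mt19937& rng) {
    std::uniform_real_distribution<float> uni(0.0f, 1.0f);
    const Vec3 center{0, 0, 5};
    const float albedo = 0.6f;
    Vec3 throughput{1, 1, 1};
    for (int depth = 0; depth < 32; ++depth) {
        float t = intersectSphere(o, d, center, 1.0f);
        if (t < 0) return throughput;            // escaped: uniform sky, radiance 1
        Vec3 p = o + d * t, n = normalize(p - center);
        if (depth >= 3) {                        // Russian roulette, survival 0.5
            if (uni(rng) > 0.5f) break;
            throughput = throughput * 2.0f;      // compensate to stay unbiased
        }
        // Uniform hemisphere sample; weight = brdf * cos / pdf = 2 * albedo * cos.
        Vec3 w;
        do { w = {uni(rng) * 2 - 1, uni(rng) * 2 - 1, uni(rng) * 2 - 1}; } while (dot(w, w) > 1);
        w = normalize(w);
        if (dot(w, n) < 0) w = w * -1.0f;
        throughput = throughput * (2.0f * albedo * dot(w, n));
        o = p + n * 1e-3f;
        d = w;
    }
    return {0, 0, 0};
}

// The "kernel": all samples for one pixel, one final color written.
Vec3 renderPixel(int x, int y, int w, int h, int spp, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> uni(0.0f, 1.0f);
    Vec3 sum{0, 0, 0};
    for (int s = 0; s < spp; ++s) {
        float u = (x + uni(rng)) / w - 0.5f, v = (y + uni(rng)) / h - 0.5f;
        Vec3 dir = normalize({u, v, 1.0f});
        sum = sum + radiance({0, 0, 0}, dir, rng);
    }
    return sum * (1.0f / spp);                   // single write per pixel
}
```

Because the kernel only reads scene data and writes one pixel, the GPU version needs no synchronization between threads, which is what makes the single-kernel design attractive.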
For more general scenes, we can replace the hardcoded ray/scene intersection by the BVH traversal code proposed by Aila and Laine [
A renderer for games has specific requirements, which differ significantly from other applications. Of these, the requirement of real-time performance probably has the greatest overall impact on the design of a renderer. A modern game runs at 60 fps or more. For certain genres, a lower frame rate is acceptable. For the game Doom 4, a fixed frame rate of 30 fps is enforced by the renderer [
Frame rate translates to a strict millisecond budget, which must be divided over all subsystems. Note that if we choose to run the subsystems in order, the budget available to rendering decreases. If, on the other hand, we run the subsystems and rendering in parallel, we introduce input lag: in a worst-case scenario, user input that occurred at the beginning of frame n does not affect the rendered image until the end of frame n + 1.
Apart from real-time performance, rendering for games requires dynamic scenery. Scene elements may undergo complex movement due to physics as well as hand-crafted animations and procedural effects such as explosions. Contrary to popular belief, global changes to scenery are uncommon in games. Typically, large portions of the scenery are static, to avoid game states in which the user cannot progress.
Tightly coupled to the real-time requirement is the fact that games are interactive applications. The renderer must produce correct results for all possible user input and cannot predict any scenery changes that depend on user interaction.
On top of the generic requirements, there are requirements that evolve over time, most notably rendering resolution and scene complexity. At the time of writing, a typical game renders at a resolution of at least
The Brigade rendering system is designed specifically for games and applies and encapsulates the technology of Section
The rendering system achieves high performance by fully utilizing all compute devices in a heterogeneous architecture (Section
Figure
Functional overview of the Brigade renderer, combined with a generic game engine.
The main components of Brigade are as follows.
The scene and hierarchical scene graph contain all data required for rendering. This includes the object hierarchy, mesh data, materials, textures, cameras, and lights. The object decomposition represented by the scene graph is used to steer acceleration structure construction, which makes the scene graph an essential data structure within the system. For convenience, the scene graph object implements keyframe and bone animation.
The core implements the Render() method, initiates acceleration structure updates, synchronizes scene data changes with the compute devices, and divides work over the tracers, if there is more than one.
The acceleration structure updater maintains the BVH, by selectively rebuilding parts of the acceleration structure based on changes in the scene graph.
A tracer is an abstract representation of a compute device or group of similar compute devices. A “compute device” in this context can be a GPU, the set of available CPU cores, or a compute device connected over a network. The tracer holds a copy of the scene data and the acceleration structure and implements the path tracing algorithm with next event estimation and multiple importance sampling. Tracers are assumed to produce identical output for identical input (except for nondeterministic aspects of the rendering algorithm).
The acceleration structure used by the tracers is the only cached data structure that is derived from scene data. All other data can be modified on-the-fly. This includes (all properties of) materials and lights.
In this system, the governing processes run on the CPU, and tracers (which in a typical setup primarily run on the GPUs) function as workers.
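The tracer abstraction described above might be sketched as follows; the class and method names here are hypothetical illustrations, not Brigade's actual API. Each tracer keeps a local scene copy, receives changes through a commit buffer, and renders an assigned span of pixel rows.

```cpp
#include <vector>

struct SceneChange { int id; };  // stand-in for a serialized scene/BVH update

class Tracer {
public:
    virtual ~Tracer() = default;

    // Queue scene-data changes; they are applied at the start of the next frame.
    void commit(const std::vector<SceneChange>& changes) {
        commitBuffer.insert(commitBuffer.end(), changes.begin(), changes.end());
    }

    // Process the commit buffer, then render rows [firstRow, firstRow + rowCount).
    void renderFrame(int firstRow, int rowCount) {
        applyChanges(commitBuffer);
        commitBuffer.clear();
        trace(firstRow, rowCount);
    }

protected:
    virtual void applyChanges(const std::vector<SceneChange>& changes) = 0;
    virtual void trace(int firstRow, int rowCount) = 0;

private:
    std::vector<SceneChange> commitBuffer;
};

// Minimal concrete tracer for demonstration; it only counts the work it receives.
struct CountingTracer : Tracer {
    int applied = 0, rowsTraced = 0;
    void applyChanges(const std::vector<SceneChange>& c) override {
        applied += static_cast<int>(c.size());
    }
    void trace(int, int rowCount) override { rowsTraced += rowCount; }
};
```

A GPU, CPU, or network tracer would each subclass this interface, which is what allows the core to divide work without caring where the pixels are traced.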
A modern PC is a heterogeneous architecture, which typically consists of a CPU with multiple cores, and at least one GPU.
To efficiently use the available compute power, several options are available: (i) implement the rendering algorithm completely on either the CPU or the GPU; (ii) implement the rendering algorithm on both the CPU and the GPU; (iii) divide tasks over the CPU and GPU.
Each of these options has advantages and disadvantages. A renderer that runs entirely on the CPU or GPU may result in underutilization of the other compute device. An algorithm that is implemented on both the CPU and the GPU will use all resources but requires a greater implementation effort. Dividing tasks over CPU and GPU seems the most attractive option. This is, however, only efficient when CPU and GPU spend equal amounts of time on their assigned tasks.
A fourth option is to use a hybrid solution, where the CPU has specific tasks and uses the frame time that remains to assist the GPU. This is the approach implemented in our system. The CPU is responsible for game logic and acceleration structure maintenance, while the tracers perform the actual rendering. Assuming a CPU tracer is available, this system is able to keep compute devices fully occupied. The process is illustrated in Figure
Double buffering the BVH. The CPU updates the BVH and sends changes to the tracers. Each tracer processes the changes in a commit buffer before rendering the next frame.
For each frame, the CPU updates the game state. The resulting changes to the scene graph are then used to update the BVH. The changes to the BVH, as well as any other scene changes, are sent to the tracers, where they are placed in a commit buffer, which the tracers use to iteratively update a local copy of the scene.
Parallel to these activities, the tracers render using the data that was prepared in the previous frame. A tracer starts a frame by processing the changes in the commit buffer and then renders its part of the frame. CPU tracers are handled slightly differently from GPU tracers: their rendering is postponed until the acceleration structure has been updated. This prevents rendering from interfering with acceleration structure maintenance.
When no CPU tracer is available, the CPU can execute game code that does not affect the scene graph after copying scene changes to the commit buffers of the tracers.
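A minimal sketch of this per-frame overlap, using hypothetical names: while a worker thread renders frame N from last frame's data (front), the CPU prepares frame N + 1 (game logic, BVH update) in back, and the buffers swap at the frame boundary.

```cpp
#include <thread>
#include <utility>

struct FrameData { int version = 0; };   // stand-in for a scene + BVH copy

struct Pipeline {
    FrameData front, back;               // front: being rendered; back: being built
    int framesRendered = 0;

    void runFrame() {
        // Tracers render from the data prepared during the previous frame.
        std::thread render([&] { /* trace pixels using front */ ++framesRendered; });
        back.version = front.version + 1; // CPU: game state + BVH update
        render.join();                    // frame boundary
        std::swap(front, back);           // next frame renders the new data
    }
};
```

The swap is the only synchronization point, which is why acceleration structure maintenance only becomes a bottleneck when it exceeds the frame time.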
The tracer flow is shown in Figure
Tracer thread initialization and main loop.
When more than a single tracer is available, the core estimates an optimal workload division prior to rendering each frame. The advantage of this approach is that no communication between the tracers and the core is required once rendering has commenced, which greatly reduces communication overhead for GPU and network tracers. Dividing the work is nontrivial, however: not every compute device has the same rendering capacity, and not every line of pixels has the same rendering cost (see Figure
In a game, a typical camera moves in a somewhat smooth fashion. A good workload division for one frame will thus be at least reasonable for the next frame. We exploit this by adjusting the workload balance in an iterative manner.
We implemented four schemes to divide work over the tracers.
None. In this naive scheme, all workers are assigned an equal share of the screen pixels; no balancing is performed. This scheme is included for reference.
Robin Hood. This scheme starts with an equal distribution of the work over the tracers. After completing each frame, the tracer that finished last passes one work unit (one work unit equals four rows of pixels) to the tracer that finished first. When the work is poorly distributed, it may take a large number of frames to properly balance.
Perfect. This scheme calculates the exact amount of work a tracer can handle based on the previous frame, but without considering differences in cost between lines of pixels. This may result in hiccups when many expensive lines are assigned to a tracer at once. The perfect balancer uses the following formula to determine the workload for worker
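One plausible form of this balancing step (a sketch of the idea; the paper's exact formula is not reproduced here) assigns each worker a share of the rows proportional to its measured speed in the previous frame:

```cpp
#include <cstddef>
#include <vector>

std::vector<int> rebalance(const std::vector<int>& rowsLastFrame,
                           const std::vector<double>& secondsLastFrame,
                           int totalRows) {
    std::vector<double> speed(rowsLastFrame.size());
    double totalSpeed = 0.0;
    for (std::size_t i = 0; i < speed.size(); ++i) {
        speed[i] = rowsLastFrame[i] / secondsLastFrame[i]; // rows per second
        totalSpeed += speed[i];
    }
    std::vector<int> next(speed.size());
    int assigned = 0;
    for (std::size_t i = 0; i < speed.size(); ++i) {
        next[i] = static_cast<int>(totalRows * speed[i] / totalSpeed);
        assigned += next[i];
    }
    next.back() += totalRows - assigned;  // hand the rounding remainder to one worker
    return next;
}
```

Because each worker's speed is estimated from a whole frame, an expensive band of pixels handed over all at once can still overshoot, which is the "hiccup" the smoothed variant addresses.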
Figure
Efficiency of three workload balancing schemes, for two GPUs ((a)-(b)) and three GPUs ((c)-(d)), small camera movements ((a)–(c)) and larger camera movements ((b)–(d)). Values are percentages of rendering time spent idling due to early completion. Measured for the Aztec scene.
Render cost visualized: pixels representing the sky dome or light sources have a significantly lower cost than other pixels. Cost is represented by greyscale values (brighter is higher cost), per 32 pixels (a full warp). Measured using a tracer implemented in NVidia’s CUDA [
Table
Average percentage of summed rendering time for all GPUs spent idling due to early completion, for the four balancing schemes, over 128 frames, for a slow and a faster moving camera. Measured for the Aztec scene.
| Scheme | 2 GPUs, slow | 2 GPUs, fast | 3 GPUs, slow | 3 GPUs, fast |
|---|---|---|---|---|
| None | 46.4 | 30.2 | 45.1 | 47.2 |
| Robin Hood | 2.1 | 8.2 | 4.9 | 20.7 |
| Perfect | 2.8 | 2.4 | 12.2 | 8.0 |
| Perfect Smooth | 1.4 | 3.4 | 2.8 | 6.2 |
For acceleration structure maintenance, we use the following assumptions. A game world may consist of millions of polygons. A small portion of these polygons is dynamic. Several tracers will use the same acceleration structure.
Based on these assumptions, a full rebuild of the BVH for each frame is neither required nor desired, as it would put a cap on maximum scene complexity, even when very few changes occur. We reuse the system described by Bikker [
Brigade uses a double-buffered approach for BVH maintenance. During a single frame, the CPU updates the BVH based on modifications of the scene graph. The resulting changes to the BVH are sent to the tracers, where they are placed in a commit buffer. At the start of the next frame, the commit buffer is processed, which results in an up-to-date BVH for each of the tracers. This process is illustrated in Figure
Each frame is thus rendered using the BVH constructed during the previous frame. Acceleration structure maintenance therefore only becomes a bottleneck when the time it requires exceeds the duration of a frame.
To reduce the noise in the final rendered image, several frames can be blended. Each pixel of the final image is calculated as
Note that even though the camera may be static, objects in the scene may not be. It is therefore important to limit the minimum value of
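One blending scheme with these properties (a sketch; not necessarily the exact weights used by Brigade) is an exponential moving average:

\[
c_i = \alpha\, f_i + (1 - \alpha)\, c_{i-1}, \qquad \alpha \in [\alpha_{\min}, 1],
\]

where \(f_i\) is the freshly rendered frame, \(c_{i-1}\) is the previous blended result, and the lower bound \(\alpha_{\min} > 0\) limits how much history a pixel can accumulate, so that moving objects do not leave persistent trails.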
Brigade exposes a CPU-based synchronous single ray query that uses the BVH from the previous frame, to provide the game engine with a fast single-ray query. This query is useful for a number of typical game situations, such as line-of-sight queries for weapons and AI, collision queries for physics, and general object picking. The single-ray query uses the full detailed scene (rather than, e.g., a coarse collision mesh), including animated objects.
Maintaining a sufficient frame rate is of paramount importance to a game application. In this subsection, we propose several approaches to scale the workload.
The relation between frames per second and samples per pixel is almost linear. Brigade adjusts the rendered number of samples per pixel when the frame rate drops below a specified minimum and increases this value when the frame rate exceeds a specified maximum.
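A minimal controller of this kind might look as follows; the thresholds and the step size of one sample per pixel are illustrative assumptions, not Brigade's actual parameters.

```cpp
// Drop the samples-per-pixel count when the frame rate falls below fpsMin,
// and raise it when the frame rate exceeds fpsMax.
int adjustSpp(int spp, double fps, double fpsMin, double fpsMax) {
    if (fps < fpsMin && spp > 1) return spp - 1; // shed work to restore frame rate
    if (fps > fpsMax) return spp + 1;            // spend headroom on image quality
    return spp;
}
```

Because frame time scales almost linearly with samples per pixel, a one-sample step gives a predictable change in frame rate.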
By balancing the ratio of primary and secondary rays, the quality of antialiasing and depth-of-field blurring can be traded for secondary effects. Primary rays are faster to trace; increasing their ratio therefore also improves the frame rate.
Changing the termination probability of Russian roulette does not introduce bias, although it may increase variance [
Alternatively, the workload can be reduced by reducing rendering resolution, or limiting trace depth. Limiting the maximum recursion depth of the path tracer introduces bias but also improves performance. In practice, due to Russian roulette, deep rays are rare, which limits the effect of a recursion depth cap on performance.
For game development, the scalability of a renderer based on path tracing is an attractive characteristic. A relatively slow system is able to run the path tracer at an acceptable frame rate, albeit perhaps not at an acceptable level of variance. Faster systems benefit from the additional performance by producing more samples per pixel, and thus a smoother image.
The rendering system described in the previous section is relatively simple. To a large extent, this simplicity is the result of the chosen rendering algorithm. The path tracer does not rely on any precalculated data, which greatly reduces data dependencies. There are two exceptions, and these are also the most complex parts of the system. The first is the acceleration structure, which is cached and updated iteratively, in a double-buffered fashion. As a result, games cannot make arbitrary changes to the scene graph. The second is the data synchronization between the renderer core and the tracers, which generally run on the GPU(s). Using a commit buffer system, Brigade makes this virtually invisible to the application, and few restrictions apply.
Apart from the tracers, Brigade is a platform-independent system. The tracers abstract away vendor-specific APIs for GPGPU and allow the implementation of networked tracers and CPU-based tracers. When using a CPU tracer, the system is able to achieve full system utilization, with little overhead.
To validate our results, we have applied the renderer to two student game projects. Both games were produced in approximately 14 working days.
The Reflect game application is a student game that was developed using an early version of the Brigade engine. The game scenery is designed to simulate actual requirements for game development, and purposely mimics the graphical style of a well-known modern game (Portal 2 [
The scenery has the following characteristics: scenery consists of approximately 250 k triangles, divided over multiple, separated rooms; the scene is illuminated by thousands of area light sources, many of which are dynamic; the game world is populated by dozens of dynamic objects.
Art assets for the game were created in Alias Wavefront Maya 2011 and were directly imported into the game.
Like Portal 2, Reflect is a puzzle game, where the player advances by activating triggers that in turn open doors or activate elevators. A “mirror gun” is available to the player to transform flat and curved wall sections into mirrors. These mirrors, as well as glass cube objects, can be used to divert lasers that block the way.
Two views from the “Reflect” game, rendered at
Materials in the levels are deliberately simple. Although specular surfaces are supported by this version of the renderer, specularity significantly increases noise, making this impractical in most situations.
The freedom in lighting setup led to a final level that contains approximately 10 k light-emitting polygons. Direct and indirect illumination simply work, resulting in subtle global illumination for both static and dynamic objects.
The CPU tracer that was implemented for this project proved to be problematic: keeping the CPU and GPU tracers in sync required significant engineering effort, while the overall contribution of the CPU was quite small.
The student game “It’s About Time” was created using a recent version of the Brigade renderer. Four views from the game are shown in Figure
“It’s About Time” is a third-person puzzle game that takes place in deserted Aztec ruins. The player must collect a number of artifacts by solving a number of puzzles, located in several areas in an open outdoor world.
“It’s About Time” is designed to run on a typical high-end system, using a single hexacore CPU and one or more current-generation NVidia or AMD GPUs. The game renders to standard HD resolution. This resolution can be halved to improve frame rate. We developed an updated CUDA tracer that roughly doubles the performance of the first iteration (as used in Reflect), as well as an OpenCL tracer, which produces identical images. A CPU tracer was not developed; the CPU is reserved for acceleration structure maintenance and game logic. The implemented path tracers are unbiased.
One of the puzzles features an animated water surface that responds to the player, consisting of tens of thousands of polygons. For the player character, a detailed set of skinned animations is used. The puzzles make extensive use of rigid animation. As a result, acceleration structure maintenance requires considerable processing. A detailed day-night cycle and an animated cloud system (with shadowing) were implemented to emphasize the strength of the path tracer for dynamic lighting. A standard physics engine was integrated to provide high quality physics simulation. The level is detailed and consists of 1.4 M triangles. The artists used a small set of sand stones to construct most of the buildings and ruins.
The game world is illuminated by a sun (or the moon), and some smaller light sources. To reduce variance, we modified the path tracer to always sample two light sources per diffuse surface interaction. One of these rays always probes the primary light source. This significantly reduces variance in most areas. Adaptive converging is used to improve image quality when the camera is (almost) stationary. These application-specific optimizations were implemented in the GPU tracer code.
Figure
System utilization for the four views shown in Figure
Four views from “It’s About Time.”
For the first two views, the CPU is underutilized, as both acceleration structure maintenance and game logic require little processing time. For the other two views, the camera is near a simulated water surface that consists of 18 k polygons. Both the simulation itself and the resulting acceleration structure maintenance require considerable processing time. This also affects the GPU tracers, which use more time to transfer and process the modified scene data.
The Brigade renderer is an in-core rendering system, which stores multiple copies of the scenery. The host system stores a full copy of the scene and synchronizes this data with each of the tracers.
For the 1.4 M triangle scene of “It’s About Time,” memory use is 737 MB, which consists of 175 MB triangle data, 42 MB for the acceleration structure, and 520 MB texture data. The size of other data structures is negligible, except for the commit buffer, which must be large enough to store per-frame changes to scene data and the acceleration structure. For “It’s About Time,” we used a 2 MB commit buffer.
Figure
Noise level and performance at 2 spp, 4 spp, 8 spp, and 16 spp. Rendering resolution is
We have investigated the feasibility of using physically based rendering in the context of real-time graphics for games. We implemented a renderer based on the path tracing algorithm, and used this to develop two proof-of-concept games. We have shown that real-time path tracing is feasible on current generation hardware, although careful light setup is required to keep variance levels acceptable.
The development of a game using path tracing for rendering simplifies game development. This affects both software engineering and art asset development. Since Brigade does not distinguish static and dynamic light sources and does not impose any limitations on the number or size of light sources, lighting design requires little knowledge beyond discipline-specific skills. The fact that polygon counts and material properties have only a small impact on rendering performance provides level designers and graphical artists with a high level of freedom in the design of the game. This reduces the number of iterations level art goes through, and allows a team to have game assets in the engine early on in the project.
Despite these positive experiences, real-time path tracing in commercial games is not feasible yet on current generation high-end hardware. Acceptable variance at HD resolution and real-time frame rates requires 8x to 16x the performance that can be achieved on our test system. Without further algorithmic improvements, this level may be reached in a few years. We do believe this can be accelerated. Already GPU ray tracing performance is benefiting from architectural improvements, on top of steady performance improvements. Another way to partially solve the rendering performance problem is to use cloud rendering, where dedicated servers are used for rendering images, which are then transferred over the internet to the client. At the time of writing, the Brigade system is being integrated into the OTOY cloud service, specifically for this purpose. The cloud rendering service will be made available to indie game developers in the near future and will allow them to use path tracing without the need of owning sufficiently powerful hardware.
Apart from raw performance, we should address the issue of variance. While low sample rates already result in reasonably converged images in our experiments, this will not be sufficient for more complex materials. Techniques like bidirectional path tracing (BDPT) and energy redistribution path tracing (ERPT) may solve this to some extent. However, not all of these techniques produce acceptable images at low sample rates; therefore, a minimum performance level is required before this can be considered for real-time graphics.
A temporary solution to the rendering performance problem is to use postprocessing on the path traced image. Although some work has been conducted in this area, it typically does not consider all the data that is available in a path tracer, which leaves room for improvement. Note that any form of postprocessing will introduce bias in the rendered image. For the intended purpose, this is, however, not an objection.
Efficient CUDA implementation of the path tracing algorithm, using a single kernel per pixel (see Algorithm
Several scenes used in this paper were modeled by students of the IGAD program of the NHTV Breda University of Applied Sciences. The original of the scene shown in scene 4 was modeled by Jim Kajiya. The author wishes to thank Erik Jansen and the anonymous reviewers for proofreading and useful suggestions.