Tree Traversal with Vector Processing


Path tracing, much like other forms of stochastic, concurrent problem solving, is a very resource intensive task. In an effort to improve the runtime performance, I decided to implement vector processing in the spatial tree traversal.

Babylon utilizes a bounding volume hierarchy (BVH) tree to subdivide the geometry of a scene into spatial partitions. This allows the tracing engine to scale logarithmically with the number of triangles in a scene. Combined with multi-threading, it is a fairly efficient algorithm. However, the spatial partitioning provides an opportunity for further optimization.

Each spatial partition is an axis aligned bounding box (AABB), which are specified by two points on opposite corners of the box. Each point is a 3D vector consisting of 3 floats. On most x86 machines, a float is 4 bytes, so each AABB occupies 2 points * 3 floats * 4 bytes per float = 24 bytes = 192 bits.


Most x86 processors since 2015 (supporting the AVX standard or later) have special SIMD registers of 256 bits or greater. By appending another component to each point to transform the AABB points to standard homogenous coordinate form, the two points can be perfectly packed into a single SIMD register on such systems.

The AABB intersection algorithm performs the same operation on each element of each vector of the box, an ideal case for taking advantage of SIMD processing. The standard algorithm performs the arithmetic and distills the results of the vector arithmetic into two scalars. The scalars are then compared to return the result of the function. By taking advantage of SIMD processing, each thread of the tree traversal can be theoretically improved by a factor of 2.

While the potential improvements are worthwhile, SIMD programming often requires coding machine specific assembly instructions, which goes against my choice to make Babylon architecture agnostic (it’s also a pain). In addition, while I believe that a user of Babylon is likely to be running appropriately powerful (and recent) hardware, I understand that is not always the case.

To resolve this issue, I decided to use the excellent Boost.simd library. This library abstracts the specific hardware instructions for a variety of platforms behind a consistent interface. This allows me to maintain my hardware agnostic philosophy. If Babylon is being compiled on the AVX or later architecture, it will take advantage of SIMD tree traversal. Otherwise, it will default to the previous scalar method.


If Babylon is run on a single threaded machine, the SIMD tree traversal will improve by a theoretical limit of 2x. However, on multithreaded systems, the gains are much larger. On a dual-core system, the improvement is 4x, quad-core 8x, and so on.

While the tree traversal is greatly improved, the impact on overall performance is significantly dependent on the scene complexity, in terms of triangle count and BRDF numerical complexity.  For scenes with particularly complex BRDFs, the tree traversal represents a small fraction of the overall computing time, which dampens the improvement in tree traversal. In my tests, I typically saw an improvement of 7% – 20%, depending on these factors.

Babylon can be found on my GitHub.

Leave a Reply