I chose to implement this technique for two reasons:
To give our artists the ability to use as many lights as they see fit without worrying too much about performance.
An excuse to finally try out compute shaders (which this is perfectly suited for).
I began researching the technique and found a lot of information about it. This made creating a rough first iteration quite easy and the thing I struggled with the most at this point was forgetting to adjust for the differences in how DX11 and OpenGL handles clip space since most of the code that I found was written in OpenGL/GLSL and I was using DX11/HLSL.
The first iteration included a compute shader for the following tasks:
Creation of the clusters (a.k.a. the Axis Aligned Bounding Boxes (AABB) that the frustum is subdivided into).
Light culling
The cluster creation shader was a pretty straight forward to write with all the resources that I found on the topic and has stayed the same for all the subsequent iterations. It dispatches a group of one thread for each cluster and outputs the min and max in view space for each cluster into an array that is indexed using the SV_GroupID multiplied by the number of clusters in each dimension. The dimensions are technically arbitrary and a profiler should be used to find the optimal size, but that is not something I've chosen to spend any time on and so I chose to use the same as a guide that I found: 16x9x24. The X and Y matches our game's target aspect ratio and the Z plays well when doing math.
The light culling shader for the first iteration was very simple and used a single thread per cluster, same as when creating the clusters, but with a different dispatch and group size. The group size was 64x1x1 and the dispatch size was the total amount of clusters divided by the group size: 3456 / 64 = 56. This meant that every thread had to check against every light, utilizing no cooperation between different threads, but it worked and I had an easier time fixing any bugs in the other shaders.
Up until this point I have only implemented it for point lights (Pun unintended), but we have multiple different types of lights in our engine, and so, before I moved on to optimizing anything, I added our other types of lights to make sure that they also worked in the first iteration.
The different types of lights have slightly different data layouts (structs) and most of it isn't needed until the light pass where we consume the data produced by these shaders. I chose to start off using spheres to represent spotlights as well to avoid getting stuck trying to implement intersection tests for Cone vs AABB. This meant that I could put them all in the same array if I made another buffer containing only the data required for performing Sphere vs AABB intersections, which was their position in view space and their range. Since the different lights were inserted after each other; I was able to figure out what type of light it was using the total counts for each light sent through a constant buffer.
Now that I had a working base implementation; it was time to optimize it. I had read about a clever way to avoid invoking the light culling shader on all the clusters, and instead marking the 'active' ones in a array of booleans. This array has the same size as the total number of clusters so that you can easily index into it using the same linear cluster index that we calculate in the other two shaders. The only requirement was that you had to do a pre-Z depth pass, but that was something that I was already doing in our engine!
It worked by dispatching a thread per pixel and calculating the cluster index from a sample of the depth buffer. The resulting array is quite sparse and not very cache efficient, so it's further improved by doing an additional pass where all the active cluster indices are put into a compact list that can then be consumed by the light culling shader much more efficiently. It also adds the ability to dispatch the light culling shader indirectly using the same buffer that is needed to keep track of how many active clusters that the compact list contains.
This time I wanted to try my hand at improving the light culling shader by increasing the number of threads used per cluster. I had seen someone else's version of it, except they had none of the things that I added in the second iteration and was therefore still going through all of the clusters, but I was still able to take some inspiration from it and wrote my own.
The way it works is by looping over all the lights in batches, where each batch is the same size as the number of threads. Every thread retrieves one light each, determines what type it is, and does the intersection test. If the intersection test succeeds, then a group shared index counter is incremented and the light is added to a group shared array of lights at the index of the counter's value before it was incremented.
After all the batches have been processed; the thread with group index 0 is responsible for retrieving a index into the global light list, similar to how the previous index was determined except now the index is not just incremented by 1, but by the total number of visible lights that the group has found.
I profiled my implementations and used the single thread per cluster version as a base line. I also added another implementation made by Ángel Ortiz who has made a guide (link) on clustered shading that was very helpful. His implementation doesn't include the active clusters optimization, so to better compare with my own; I profiled mine without that as well and the final results were quite interesting!
The biggest surprise to me was how much faster I had managed to make my implementation, especially since it was my first attempt. I'm haven't had time to nail down why exactly why my implementation is faster, but my guess would be because Ortiz tries to be more efficient by batching the clusters and the lights at the same time. He makes a group of threads copy a light each into a group shared array, which I thought was smart and is something that I took inspiration from, but then each thread changes to act per cluster instead and has to loop over all of the shared lights on their own, no matter if they are valid lights or not.
The difference between using the active clusters optimization and not using it, was not that surprising. I believe it was more intended for scenes with potentially millions of lights as it inevitably adds a bit of over head and for smaller scenes, like in our projects, it doesn't seem to be worth it. I will probably change it to only use it if the number of lights exceed a certain threshold to get the best of both worlds.
Profiling notes:
I used ID3D11Query's to create timestamps for the dispatch calls and took multiple snapshots to get an average time.
I had to convert Ortiz's implementation from GLSL to HLSL and as such I might have gotten something wrong in the process. I've triple checked it, though, so it should be correct.
The scene I used was not complex and all the lights were stacked on top of each other.
I didn't move around after entering the scene so that I would get comparable results.