My Ray Tracer Journey - Part 2

Eren Dere -- 23 April, 2021

10 min read

Overview

In this two weeks, I was trying to implement a BVH tree with GLSL shaders. And then, I was going to implement recursive ray tracing (I failed :D). I immediately initiated another RayTracer project which runs on CPU. My CPU project runs a lot faster than GPU implementation. This proves that I need to learn a lot more things related to shader programming.

I will mainly work on CPU part from now on. I will consider GPU part for the project part of the course. Learning concepts and implementing them comfortably is more important.

Within 3 late days, I have managed to implement the things related to HW2 and I will share my comments and thoughts on things that I’ve encountered in detail.

Things I Did With GPU

I first want to talk about my experience with compute shaders and the problems that I’ve encountered. I really like playing with shaders and I can implement most of the algorithms that I see with them. I really want to learn how to write clean an efficient shader code. However, it runs so sloooooowly and I cannot understand why. I guess I need to be an expert in order to fully understand what were my problems. While I was searching on the net, I saw people trying to understand what their shader code does by looking at the assembly codes created by shader compilers.

I guess being a newbie shader programmer is not enough for creating optimized shaders. I need to have a deeper understanding about the inner structure of GPUs. Things were working great in the beginning especially when I was working on the first homework. But when recursive algorithms started coming I started having problems.

GPU Code Does Not Have Recursion

I already knew that I could not implement recursive algorithms like BVH traversal and recursive ray tracing just like we normally do in CPU side. But I had solutions in my mind. Since speed was very important for us, I mainly focused on implementing a BVH traversal algorithm in compute shader.

We all have some solutions in our minds for implementing recursive algorithms iteratively. We use stacks. It is one of the most convenient ways of implementing recursive algorithms iteratively. So, How I did that in compute shader?

For BVH traversal, I implemented a stack

BVHNode nodeStack[MAX_STACK_LENGTH]

Here is my pseudo code for traversing the BVH:

IntersectionReport intersectBVH(Ray r, Mesh m)
{
    BVHNode rootNode = BVHNodes[mesh.rootBVHNode];
    nodeStack[0] = rootNode;
    int stackPointer = 1;
    BVHNode current = nodeStack[0];
    while(stackPointer >=1)
    {
        // Pop the stack
        stackPointer--
        current = nodeStack[stackPointer];

        bool aabbIntersect = intersectBox(r, current.aabb, tmin, tmax);
        if(aabbIntersect == true)
        {
            // Current node is a leaf
            if(current.isLeaf == true)
            {
                IntersectionReport rprt = intersectFace(r, current.indices, tmin,tmax)
                // Things related to face intersection
                ...
            }
            // If it is not a leaf, we push children into the stack
            else
            {
                nodeStack[stackPointer] = BVHNodes[current.rightNode];
                stackPointer++;
                nodeStack[stackPointer] = BVHNodes[current.leftNode];
                stackPointer++;
            }
        }
    }

    return report;
}

This algorithm is straightforward. I first flattened the BVH tree that I’ve constructed in CPU and sent it to the shader by using Shader Storage Buffer Objects. I had a giant BVHNode array which contains all the nodes coming from all the meshes in the scene. I just kept their offsets and reached them by using these offsets.

However, this destroyed the speed gain. Shader started running ridiculously slow. Especially when tree heights are higher, shader becomes slower and slower. After searching through the net, I have learnt that checking things with if and else everywhere in the shader is not a proper approach. Shaders do not like branching because GPU is not created for these purposes. I needed to have another approach for BVH traversal. Implementing recursion with stacks is not efficient.

I searched for a solution for a week :D. And finally I implemented this algorithm and it worked better. I flatten the BVH again, but I keep the elements in the array sorted in pre-order form. Just like how we traverse the BVH tree. For each node, I was keeping how many child nodes they have.

So, my new algorithm was just looping that giant BVH node array and passing one index if bounding box intersects. Otherwise it jumps to the index of the other child at that level. This gave better results. But again, it was really really slow.

Here are my rendering time results for some scenes;

Bunny: 14.719 ms
ScienceTree: 35.859 ms
Car: 115.773 ms
Low Poly: 153.762 ms
Tower: 140.304 ms
Windmill: 114.857 ms
Berserker: 51.798 ms
ChineseDragon: 1201.54 ms

chinese dragon output

So as the tree grows larger. My runtime is not like it is O(logn). It is because I am not capable of writing highly optimized GPU code for now 😦

Here is a picture showing how much time I spend in intersection checking. More time is spent in brighter parts.

bvh science tree

So, I dealt with BVH like that. After that, I started implementing recursive ray tracing. But how :D? Branching is a big sin. Stacks are highly inefficient. How should I do that. Again, only solution came into my mind was again implementing stacks. But here I could only make mirrors unfortunately. Dielectrics were really really hard to implement in GPU.

Science Tree with only mirrors enabled. Rendered in 500ms

I could not finish this homework with shaders. Sağlık olsun :D. I will definitely return and clean the mess I made. But I had to finish my homework. In order to keep up with the class, I needed a solution. So I started a CPU based ray tracer. My experience with that was truly Raytracing In One Weekend.

CPU Based RayTracer

I initiated another project immediately after I decided that dealing with GPU optimizations taking a lot of time. I moved my Ray Tracing algorithms in compute shaders to good old C++. Since I am comfortable with C++, I moved faster.

I first rendered our hw1 images starting from simple.xml. Had a some minor errors and I resolved them thanks to GDB. After I saw that I correctly produce the results. I moved on hw2 related topics.

I implemented the BVH tree without flattening. I used the traversal algorithm in our slides. I have used midpoint heuristics for splitting. I searched for some good and efficient ray-box intersection algorithms and found some on the internet. I guess that made my BVH traversal run faster. Furthermore, I also implemented the parallelism with C++ 11 features.

Parallelism

I searched on the internet for efficient parallelism in RayTracing and saw some codes in stackoverflow, github etc. I also read blog posts of the classmates and tried to understand their approaches. After all, I learned that I need to asynchronously process every pixel with N number of threads where N is the number of cores in CPU. Here is the parallel technique I use. (I simplified the code a little bit):

std::atomic<int> count(0);
std::vector<std::future<void> futureVector;
int cores = std::thread::hardware_concurrency();
while(cores--)
{
    futureVector.pushBack(std::async(std::launch::async, [=]()
    {
        while(true)
        {
            int index = count++;
            if(index >= worksize)
                break;

            vec2 coords = GiveCoords(index, _imageWidth);
            Ray primaryRay = ComputePrimaryRay(index, _imageWidth);
            vec3 pixel = RayTrace(primaryRay);
            WritePixelCoord(coords.x, coords.y, pixel);
        }


    }));
}

for(auto& element: futureVector)
{
    element.get();
}

I saw this pattern of code which uses future and async in a lot of places. By using parallelism and BVH trees, I was ready for calculating the times.

Low poly images gave great results. But chinese_dragon frustrated me a little bit. It was rendering it in 2 seconds while rendering others in 70 – 300 ms. Something was wrong. And I saw the problem after thinking for a day. I was just limiting the depth of my tree as 10. It was slow because of that. That tree depth was enough for small meshes but for large meshes, I needed deeper trees.

After I increased the depth limit. I got really fast results which shocked me.

Render Times Of Different Scenes
Scene Name	Time
Berserker	77.247 ms
Bunny	18.162 ms
Chinese Dragon	73.838 ms (I did not expect that fast)
Cornellbox Recursive	38.083 ms
Cornellbox	34.564 ms
Low Poly scene	238.048 ms
ScienceTree Recursive	173.516 ms
ScienceTree	67.676 ms
Tower	260.641 ms
Two Spheres	21.928 ms
Windmill	225.047 ms
Other Dragon	240.724 ms
Water Tap Frames	100 – 300 ms on average
Berserker	77.247 ms

Some Errors

Of course I had a lot of errors. For example;

weird dragon

I was trying to implement multithreaded rendering and getting this result. I was using index counter as std::atomic<int> but it was not syncronized for some reason. But From that image, we can see that threads spend a lot less time when not hitting bounding boxes of dragon’s BVH. Another result like this;

weird science Tree

When handling dielectrics, I was hitting my head to the desk. I was getting close results to correct images but it was a little bit different. You can see that glass science tree is not correct. This panicked me a lot because it was already 20.00 PM and after 23.59, I was not allowed to resubmit my work in Odtuclass.

error corrected

I found my mistake. It was really silly. While checking total reflection, I was calculating cosPhi as usual. But when calculating transmitted rays, I was just taking square root of cosPhi. This was the reason of my wrong results. I corrected it and everything worked smoothly after that.

Rendering Results

I am happy with my BVH implementation since it runs a lot faster than I've expected. Here are some rendering results for images specific to HW2

chinese dragon

cornellbox recursive

sciencetree recursive

spheres recursive

other dragon

Conclusion

I panicked a lot in while implementing this homework because I thought that I was going to be left behind. But I managed to overcome the difficulties and kept up with my classmates I guess. I learned a good lesson from this. I should not take risks like that. Learning is more important. I think I will carry on with CPU implementation because I’m happy with its speed. But it does not mean that I will give up learning shader programming :D.

Soft Shading Added

tap

berserker

bunny soft

Part 3 is here