My Ray Tracer Journey - Part 2

Eren Dere -- 23 April, 2021
 10 min read
headImage

Table Of Contents

  1. Things I Did With GPU
  2. GPU Code Does Not Have Recursion
  3. CPU Based RayTracer
  4. Parallelism
  5. Rendering Results
  6. Conclusion
  7. Soft Shading Added

Overview

In this two weeks, I was trying to implement a BVH tree with GLSL shaders. And then, I was going to implement recursive ray tracing (I failed :D). I immediately initiated another RayTracer project which runs on CPU. My CPU project runs a lot faster than GPU implementation. This proves that I need to learn a lot more things related to shader programming.


I will mainly work on CPU part from now on. I will consider GPU part for the project part of the course. Learning concepts and implementing them comfortably is more important.


Within 3 late days, I have managed to implement the things related to HW2 and I will share my comments and thoughts on things that I’ve encountered in detail.

Things I Did With GPU

I first want to talk about my experience with compute shaders and the problems that I’ve encountered. I really like playing with shaders and I can implement most of the algorithms that I see with them. I really want to learn how to write clean an efficient shader code. However, it runs so sloooooowly and I cannot understand why. I guess I need to be an expert in order to fully understand what were my problems. While I was searching on the net, I saw people trying to understand what their shader code does by looking at the assembly codes created by shader compilers.


I guess being a newbie shader programmer is not enough for creating optimized shaders. I need to have a deeper understanding about the inner structure of GPUs. Things were working great in the beginning especially when I was working on the first homework. But when recursive algorithms started coming I started having problems.

GPU Code Does Not Have Recursion

I already knew that I could not implement recursive algorithms like BVH traversal and recursive ray tracing just like we normally do in CPU side. But I had solutions in my mind. Since speed was very important for us, I mainly focused on implementing a BVH traversal algorithm in compute shader.


We all have some solutions in our minds for implementing recursive algorithms iteratively. We use stacks. It is one of the most convenient ways of implementing recursive algorithms iteratively. So, How I did that in compute shader?


For BVH traversal, I implemented a stack


BVHNode nodeStack[MAX_STACK_LENGTH]

Here is my pseudo code for traversing the BVH:


IntersectionReport intersectBVH(Ray r, Mesh m)
{
    BVHNode rootNode = BVHNodes[mesh.rootBVHNode];
    nodeStack[0] = rootNode;
    int stackPointer = 1;
    BVHNode current = nodeStack[0];
    while(stackPointer >=1)
    {
        // Pop the stack
        stackPointer--
        current = nodeStack[stackPointer];

        bool aabbIntersect = intersectBox(r, current.aabb, tmin, tmax);
        if(aabbIntersect == true)
        {
            // Current node is a leaf
            if(current.isLeaf == true)
            {
                IntersectionReport rprt = intersectFace(r, current.indices, tmin,tmax)
                // Things related to face intersection
                ...
            }
            // If it is not a leaf, we push children into the stack
            else
            {
                nodeStack[stackPointer] = BVHNodes[current.rightNode];
                stackPointer++;
                nodeStack[stackPointer] = BVHNodes[current.leftNode];
                stackPointer++;
            }
        }
    }

    return report;
}

This algorithm is straightforward. I first flattened the BVH tree that I’ve constructed in CPU and sent it to the shader by using Shader Storage Buffer Objects. I had a giant BVHNode array which contains all the nodes coming from all the meshes in the scene. I just kept their offsets and reached them by using these offsets.


However, this destroyed the speed gain. Shader started running ridiculously slow. Especially when tree heights are higher, shader becomes slower and slower. After searching through the net, I have learnt that checking things with if and else everywhere in the shader is not a proper approach. Shaders do not like branching because GPU is not created for these purposes. I needed to have another approach for BVH traversal. Implementing recursion with stacks is not efficient.


I searched for a solution for a week :D. And finally I implemented this algorithm and it worked better. I flatten the BVH again, but I keep the elements in the array sorted in pre-order form. Just like how we traverse the BVH tree. For each node, I was keeping how many child nodes they have.


So, my new algorithm was just looping that giant BVH node array and passing one index if bounding box intersects. Otherwise it jumps to the index of the other child at that level. This gave better results. But again, it was really really slow.


Here are my rendering time results for some scenes;


  • Bunny: 14.719 ms
  • ScienceTree: 35.859 ms
  • Car: 115.773 ms
  • Low Poly: 153.762 ms
  • Tower: 140.304 ms
  • Windmill: 114.857 ms
  • Berserker: 51.798 ms
  • ChineseDragon: 1201.54 ms

chinese_dragon

chinese dragon output


So as the tree grows larger. My runtime is not like it is O(logn). It is because I am not capable of writing highly optimized GPU code for now 😦


Here is a picture showing how much time I spend in intersection checking. More time is spent in brighter parts.


bvh_science_tree

bvh science tree


So, I dealt with BVH like that. After that, I started implementing recursive ray tracing. But how :D? Branching is a big sin. Stacks are highly inefficient. How should I do that. Again, only solution came into my mind was again implementing stacks. But here I could only make mirrors unfortunately. Dielectrics were really really hard to implement in GPU.


sciencetreemirror

Science Tree with only mirrors enabled. Rendered in 500ms


I could not finish this homework with shaders. Sağlık olsun :D. I will definitely return and clean the mess I made. But I had to finish my homework. In order to keep up with the class, I needed a solution. So I started a CPU based ray tracer. My experience with that was truly Raytracing In One Weekend.

CPU Based RayTracer

I initiated another project immediately after I decided that dealing with GPU optimizations taking a lot of time. I moved my Ray Tracing algorithms in compute shaders to good old C++. Since I am comfortable with C++, I moved faster.


I first rendered our hw1 images starting from simple.xml. Had a some minor errors and I resolved them thanks to GDB. After I saw that I correctly produce the results. I moved on hw2 related topics.


I implemented the BVH tree without flattening. I used the traversal algorithm in our slides. I have used midpoint heuristics for splitting. I searched for some good and efficient ray-box intersection algorithms and found some on the internet. I guess that made my BVH traversal run faster. Furthermore, I also implemented the parallelism with C++ 11 features.


Parallelism

I searched on the internet for efficient parallelism in RayTracing and saw some codes in stackoverflow, github etc. I also read blog posts of the classmates and tried to understand their approaches. After all, I learned that I need to asynchronously process every pixel with N number of threads where N is the number of cores in CPU. Here is the parallel technique I use. (I simplified the code a little bit):


std::atomic<int> count(0);
std::vector<std::future<void> futureVector;
int cores = std::thread::hardware_concurrency();
while(cores--)
{
    futureVector.pushBack(std::async(std::launch::async, [=]()
    {
        while(true)
        {
            int index = count++;
            if(index >= worksize)
                break;

            vec2 coords = GiveCoords(index, _imageWidth);
            Ray primaryRay = ComputePrimaryRay(index, _imageWidth);
            vec3 pixel = RayTrace(primaryRay);
            WritePixelCoord(coords.x, coords.y, pixel);
        }


    }));
}

for(auto& element: futureVector)
{
    element.get();
}

I saw this pattern of code which uses future and async in a lot of places. By using parallelism and BVH trees, I was ready for calculating the times.


Low poly images gave great results. But chinese_dragon frustrated me a little bit. It was rendering it in 2 seconds while rendering others in 70 – 300 ms. Something was wrong. And I saw the problem after thinking for a day. I was just limiting the depth of my tree as 10. It was slow because of that. That tree depth was enough for small meshes but for large meshes, I needed deeper trees.


After I increased the depth limit. I got really fast results which shocked me.


Render Times Of Different Scenes
Scene NameTime
Berserker77.247 ms
Bunny18.162 ms
Chinese Dragon73.838 ms (I did not expect that fast)
Cornellbox Recursive38.083 ms
Cornellbox34.564 ms
Low Poly scene238.048 ms
ScienceTree Recursive173.516 ms
ScienceTree67.676 ms
Tower260.641 ms
Two Spheres21.928 ms
Windmill225.047 ms
Other Dragon240.724 ms
Water Tap Frames100 – 300 ms on average
Berserker77.247 ms

Some Errors

Of course I had a lot of errors. For example;


chinese_dragon_error

weird dragon


I was trying to implement multithreaded rendering and getting this result. I was using index counter as std::atomic<int> but it was not syncronized for some reason. But From that image, we can see that threads spend a lot less time when not hitting bounding boxes of dragon’s BVH. Another result like this;


sciencetree_error

weird science Tree


When handling dielectrics, I was hitting my head to the desk. I was getting close results to correct images but it was a little bit different. You can see that glass science tree is not correct. This panicked me a lot because it was already 20.00 PM and after 23.59, I was not allowed to resubmit my work in Odtuclass.


sciencetree_corrected

error corrected


I found my mistake. It was really silly. While checking total reflection, I was calculating cosPhi as usual. But when calculating transmitted rays, I was just taking square root of cosPhi. This was the reason of my wrong results. I corrected it and everything worked smoothly after that.


Rendering Results

I am happy with my BVH implementation since it runs a lot faster than I've expected. Here are some rendering results for images specific to HW2


chinese_dragon

chinese dragon



cornellbox_recursive

cornellbox recursive



sciencetree_recursive

sciencetree recursive



spheres_recursive

spheres recursive



other_dragon

other dragon


Conclusion

I panicked a lot in while implementing this homework because I thought that I was going to be left behind. But I managed to overcome the difficulties and kept up with my classmates I guess. I learned a good lesson from this. I should not take risks like that. Learning is more important. I think I will carry on with CPU implementation because I’m happy with its speed. But it does not mean that I will give up learning shader programming :D.


Soft Shading Added


softtap

tap



berserker_soft

berserker



bunny_soft

bunny soft


Part 3 is here

Copyright © 2023 --- Eren Dere