All times below were measured on the Cornell box scene with a subdivision level of 4 and 2 interreflections, on an i7-3770K.
1. Bounding box.
In a basic ray tracer, every ray is tested against every primitive in the scene (even primitives that are not in the camera's view). This is very wasteful when a ray heading nowhere near an object is still tested against that object's many primitives, whether they come from tessellation or simply from dense geometry data. The easiest remedy is a bounding box. With n objects and m primitives, each ray then costs roughly n box tests plus the primitive tests for only the boxes it actually hits, so the runtime is decreased by roughly a constant factor of 1/n.
The idea:
1. From the vertices of an object, find the smallest and largest x, y, and z values.
2. Create a box just tall, wide, and deep enough to contain every vertex of the object:
width = largest x - smallest x, height = largest y - smallest y, depth = largest z - smallest z, with the box's corner at (smallest x, smallest y, smallest z).
3. For the ray intersection, test the box first. Only if the ray hits the box do we go on to test every primitive in the object (a sketch of such a box follows this list).
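A minimal sketch of what such a bounding box could look like in C++ (the BoundingBox name, the raw double[3] origin/direction parameters, and the expand helper are illustrative assumptions, not the ray tracer's actual classes):

#include <algorithm>  // std::min, std::max
#include <utility>    // std::swap

// Minimal axis-aligned bounding box (a sketch, not the project's actual class).
struct BoundingBox {
    double minX = 1e30, minY = 1e30, minZ = 1e30;      // smallest x, y, z seen so far
    double maxX = -1e30, maxY = -1e30, maxZ = -1e30;   // largest  x, y, z seen so far

    // Steps 1-2: grow the box until it contains every vertex of the object.
    void expand(double x, double y, double z) {
        minX = std::min(minX, x); maxX = std::max(maxX, x);
        minY = std::min(minY, y); maxY = std::max(maxY, y);
        minZ = std::min(minZ, z); maxZ = std::max(maxZ, z);
    }

    // Step 3 (slab test): the ray hits the box only if its x, y and z
    // entry/exit intervals overlap. Division by zero gives +/-infinity,
    // which the comparisons handle for axis-parallel rays.
    bool hit(const double origin[3], const double direction[3]) const {
        const double lo[3] = {minX, minY, minZ};
        const double hi[3] = {maxX, maxY, maxZ};
        double tmin = -1e30, tmax = 1e30;
        for (int i = 0; i < 3; i++) {
            double t1 = (lo[i] - origin[i]) / direction[i];
            double t2 = (hi[i] - origin[i]) / direction[i];
            if (t1 > t2) std::swap(t1, t2);
            tmin = std::max(tmin, t1);
            tmax = std::min(tmax, t2);
        }
        return tmax >= tmin && tmax >= 0.0;  // miss -> skip every primitive inside
    }
};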
With bounding boxes: diffuse collect time: 6.47721s total elapsed time: 14.7384s
2. Multi-threading using C++11 threads
My collect-diffuse-rays function returns void. I create one thread per node, up to 8 threads at a time; each thread handles a single node, which sends diffuse rays to every other node. Whenever the thread limit is reached, the thread at the front of the thread list is joined and popped from the front, i.e., the list behaves as a FIFO queue. Writers are limited to one at a time in the critical section, whereas there is no limit on the number of readers, so a reader and a writer can be active at the same time. On x86, aligned word-sized stores are not torn, so the readers will never read corrupt data, only possibly stale data.
#include <list>
#include <thread>
#include <vector>

std::list<std::thread> threads;   //FIFO queue of worker threads
int numThreadsCurrent = 0;
void GeometryNode::propogateDiffuseRays(std::vector<GeometryNode*>* sceneList);
for (std::vector<GeometryNode*>::iterator node = sceneList.begin(); node != sceneList.end(); node++){
    //place a thread on the threads FIFO queue, one per node
    threads.push_back(std::thread(&GeometryNode::propogateDiffuseRays, (*node), &sceneList));
    numThreadsCurrent++;
    if (numThreadsCurrent == constants::threads){
        //queue full: join and pop the oldest thread before launching another
        threads.front().join();
        threads.pop_front();
        numThreadsCurrent--;
    }
}
//join the threads still left on the queue once the loop is done
while (!threads.empty()){
    threads.front().join();
    threads.pop_front();
}
If constants::threads == 8:
diffuse collect time: 4.58913s total elapsed time: 12.818s
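The snippet above only shows how the worker threads are launched; the single-writer / many-readers access it relies on is not shown. A rough sketch of the same pattern in portable C++11, using std::atomic instead of relying on x86 store atomicity (the DiffuseAccumulator name and its fields are hypothetical, not the actual node data):

#include <atomic>
#include <mutex>

// Hypothetical per-node accumulator: many threads may read it while one writes.
struct DiffuseAccumulator {
    std::atomic<float> brightness;   // readers may see a slightly stale value, never a torn one
    std::mutex writeLock;            // serializes writers: one at a time in the critical section

    DiffuseAccumulator() : brightness(0.0f) {}

    void addContribution(float delta) {
        std::lock_guard<std::mutex> guard(writeLock);   // single writer at a time
        brightness.store(brightness.load() + delta);
    }

    float read() const {
        return brightness.load();    // readers never block on the writer's lock
    }
};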
Next, I made the ray trace function concurrent using futures, again with the same FIFO queue scheme.
//rays, pixelLocations and futures are FIFO queues, following the same scheme as the thread queue above;
//pop_front implies a std::list or std::deque, both of which keep &rays.back() valid while the task runs
for (float y = 0; y < height; y++) {
    for (float x = 0; x < width; x++) {
        Vector3D direction = xDirection * (x + sx / sampling - width / 2) + invertY2 * invertY * yDirection * (y + sy / sampling - height / 2) + d * zDirection;
        direction.normalize();
        Parameters parameter(lights, ambient, backgroundColor);
        //place the ray, its pixel location and its future on the FIFO queues
        rays.push_back(Ray(eye, direction, "air"));
        pixelLocations.push_back(pixelLocation(x, y, sx, sy));
        //std::launch::async forces a real worker thread rather than deferred evaluation
        futures.push_back(std::async(std::launch::async, &GeometryNode::trace, root, 0, &rays.back(), parameter));
        numThreadsCurrent++;
        if (numThreadsCurrent == constants::threads){
            //queue full: wait on the oldest future and write its pixel before launching more
            if (futures.front().get() != constants::noIntersect){
                colors[pixelLocations.front().y][pixelLocations.front().x] = colors[pixelLocations.front().y][pixelLocations.front().x] + rays.front().surfaceBrightness;
                intersections[pixelLocations.front().y][pixelLocations.front().x] = true;
            }
            rays.pop_front();
            pixelLocations.pop_front();
            futures.pop_front();
            numThreadsCurrent--;
        }
    }
}
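One detail worth noting: when the two pixel loops finish, up to constants::threads - 1 futures are still waiting on the queue. A minimal sketch of draining them after the loops, reusing the same containers and the same bookkeeping as inside the loop:

//drain the futures still on the queue once every pixel has been pushed
while (!futures.empty()){
    if (futures.front().get() != constants::noIntersect){
        colors[pixelLocations.front().y][pixelLocations.front().x] = colors[pixelLocations.front().y][pixelLocations.front().x] + rays.front().surfaceBrightness;
        intersections[pixelLocations.front().y][pixelLocations.front().x] = true;
    }
    rays.pop_front();
    pixelLocations.pop_front();
    futures.pop_front();
}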
If constants::threads == 1:
diffuse collect time: 6.84174s total elapsed time: 19.361s
If constants::threads == 8:
diffuse collect time: 4.37595s total elapsed time: 6.70991s
Takeaway:
Introducing the FIFO future queue with only a single thread makes the runtime 51% longer than having no queue at all (19.361s vs. 12.818s total). But once the queue depth is increased to the number of threads the CPU can handle, the runtime drops by roughly a factor of the number of physical cores on the CPU (19.361s down to 6.70991s on the 4-core 3770K).