Voxel terrain renderer

with OpenGL

Introduction to the current rendering system

It took a while, but I finally got the terrain renderer working the way I wanted it to. At least mostly: there are still a few things to fix, but the foundation is complete. The current terrain renderer is built around a double-buffered, pooled buffer. At the end, I will list a few resources that explain the concepts I am using quite well. But I swear, I didn't steal any code :). Also, this is my first time building something like this terrain renderer, so it is rather experimental and mainly serves as a playground for me to learn advanced concepts like these.

All the code for the project is written in Java (or GLSL for the shaders). The project uses OpenGL as its graphics API, but I will use a form of pseudocode for the code sections in this article.

Explanation of the voxel world

Foremost, I have to explain how my voxel world is built.

The world consists of multiple chunks with fixed dimensions (16*16*512). These chunks are placed in a two-dimensional array, as they are not stacked on top of each other. Inside each chunk are multiple chunk sections (with dimensions of 16*16*16), and the chunk sections contain the actual voxels. I will not explain how the voxels are stored in the chunk sections, as that could be a topic of its own. Each voxel has a position and a material. The material determines how the voxel looks: it sets the color of the voxel and tells the mesher whether neighboring faces should be culled or not.
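To make the chunk → chunk-section → voxel hierarchy concrete, here is a small, self-contained Java sketch of how a world coordinate could be mapped onto it. The class and method names are made up for illustration; only the dimensions (16*16*512 chunks, 16*16*16 sections) come from the text above.

```java
// Hypothetical helper mapping world coordinates onto the chunk hierarchy.
public class VoxelCoords {
    public static final int CHUNK_SIZE = 16;    // horizontal chunk footprint
    public static final int CHUNK_HEIGHT = 512; // full chunk column height
    public static final int SECTION_SIZE = 16;  // cubic chunk sections

    // Index of the chunk column containing a world x/z coordinate.
    // floorDiv handles negative coordinates correctly, unlike plain '/'.
    public static int chunkCoord(int world) {
        return Math.floorDiv(world, CHUNK_SIZE);
    }

    // Which of the 512/16 = 32 sections inside a chunk holds this y value.
    public static int sectionIndex(int worldY) {
        return Math.floorDiv(worldY, SECTION_SIZE);
    }

    // Local voxel coordinate inside a section (0..15).
    public static int local(int world) {
        return Math.floorMod(world, SECTION_SIZE);
    }
}
```

The floorDiv/floorMod pair matters once the camera crosses into negative coordinates; integer `/` and `%` would round toward zero and produce wrong chunk indices there.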

With all the initial explanation out of the way, let's dive into the rendering process.

Techniques I tried before

Before the voxel pooling approach, I tried some other techniques for storing voxel mesh data on the GPU and rendering it.

Multiple VAOs and VBOs

The first approach I implemented was the simplest one. Every chunk section gets its own VAO (Vertex Array Object) with some VBOs. You then loop over all chunk meshes and render them individually. To optimize this, you can perform frustum culling on the chunk meshes during the loop, before sending the draw call to the GPU. This way, you don't render chunk sections that would not be on screen anyway.

The rendering process would then look something like this:

forEach(ChunkSection section in chunks){
    if(section.isVisible()) {
        glBindVertexArray(section.getVAO());
        glDrawElements(GL_TRIANGLES, section.drawCount, GL_UNSIGNED_INT, 0);
        glBindVertexArray(0);
    }
}

The method works great. It is easy, well-structured and understandable. You can also separate and sort meshes by their appearance (like the alpha/transparency value) to blend them correctly. The only downside is that a lot of draw calls are needed to render the whole terrain. You might think this is no big deal, as the GPU can render small meshes really fast anyway, but the problem is not that there is too much to draw; it is that there are too many draw calls. This is called driver overhead, as the GPU driver has to do a lot of work to issue each draw call.

Indirect Rendering

Indirect rendering aims to solve this problem of driver overhead. This is achieved by storing all draw calls you want to issue in a buffer on the GPU. When the program wants to draw everything to the screen, it calls glDrawElementsIndirect(mode, type, indirect); with the buffer containing the draw calls bound to GL_DRAW_INDIRECT_BUFFER. This is only available in OpenGL 4.0 or later. The GPU can then read the indirect buffer directly and issue all draw calls without much driver or OpenGL involvement. All vertices you want to render with one indirect call have to be in the same buffer, as the GPU cannot switch between buffers while reading from the indirect buffer and performing the indirect calls.

A pseudocode implementation of an indirect rendering process could look like the following snippet. Note that the indicesOffset and geometryOffset of a chunk section are offsets into the indices and geometry buffers themselves. What values have to go into the indirectBuffer is explained in detail in the official documentation of glMultiDrawElementsIndirect. The draw call has to be glMultiDrawElementsIndirect here, because the indirect buffer contains more than one draw call. And because the draw call data is the only thing in our buffer, we don't have to specify a stride value (the last "0" in the glMultiDrawElementsIndirect call).

// Initialization

chunkVAO = genVertexArrays();
* configure the chunk VAO *

Buffer geometryBuffer;
Buffer indicesBuffer;
Buffer indirectBuffer;

forEach(ChunkSection section in chunks){
    geometryBuffer.push(section.geometry);
    indicesBuffer.push(section.indices);
    indirectBuffer.push(section.drawCount, 1, section.indicesOffset, section.geometryOffset, 0);
}

// Render loop

while(shouldRender){
    glBindVertexArray(chunkVAO);
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer.getID());
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, 0, drawCallCount, 0);
}

So far, so good. The only problem is that every mesh you want to draw with the indirect buffer has to be stored in one big VBO. That is not difficult when you are dealing with static props that don't change over time, as you know exactly how much space they need from the start.

The problem occurs with dynamic meshes that change over time. And that is exactly what we have with voxel terrain: it gets changed all the time. So now the question is not how we should issue draw calls, but how we should store the data in the big terrain buffer.

"Dynamic" buffers

The first idea I had was a buffer, managed by the program, that could shrink and grow as needed, while filling in data wherever there was enough space between other data blocks. The idea wasn't even that bad. The buffer worked and did what it should. It behaved like a linked list containing a chain of what I called buffer nodes, each referencing the next node in the chain.

If the program decides that it needs more space for more terrain, the buffer adds a new buffer node, which contains an offset and a size that can later be used as vertex offsets in the indirect rendering call.
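As a rough illustration of the bookkeeping such a dynamic buffer has to do (not the actual implementation), here is a CPU-side first-fit sketch: it scans the gaps between existing nodes for a free spot and "grows" the tracked capacity when nothing fits. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified CPU-side model of the buffer-node allocation strategy.
public class NodeAllocator {
    public static class Node {
        public final long offset, size;
        Node(long offset, long size) { this.offset = offset; this.size = size; }
    }

    private final List<Node> nodes = new ArrayList<>(); // kept sorted by offset
    private long capacity;

    public NodeAllocator(long capacity) { this.capacity = capacity; }

    // First-fit: walk the node chain, take the first gap big enough;
    // grow the buffer when no gap fits.
    public Node allocate(long size) {
        long cursor = 0;
        int insertAt = 0;
        for (Node n : nodes) {
            if (n.offset - cursor >= size) break; // gap before n is big enough
            cursor = n.offset + n.size;
            insertAt++;
        }
        if (cursor + size > capacity) capacity = cursor + size; // "grow" (real buffer would be copied here)
        Node node = new Node(cursor, size);
        nodes.add(insertAt, node);
        return node;
    }

    public void free(Node n) { nodes.remove(n); }

    public long capacity() { return capacity; }
}
```

The expensive part hidden behind the "grow" comment is exactly the problem described below: on the GPU, growing means allocating a bigger buffer and copying everything over.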

Using dynamic buffers with these buffer nodes, the initialization process would look like this:

// Initialization

* create and configure the chunkVAO *

DynamicBuffer geometryBuffer = new DynamicBuffer();
DynamicBuffer indicesBuffer = new DynamicBuffer();
DynamicBuffer indirectBuffer = new DynamicBuffer();

forEach(ChunkSection section in chunks){
    section.geometryNode = geometryBuffer.load(section.geometry, null);
    section.indicesNode = indicesBuffer.load(section.indices, null);
    section.indirectNode = indirectBuffer.load(section.drawCount,
                                               1,
                                               section.indicesNode.offset,
                                               section.geometryNode.offset,
                                               0,
                                               null);
}

If we want to edit and then update a chunk section, we just delete the contents of each buffer node and load the data into the dynamic buffers again:

// Delete old data

geometryBuffer.delete(section.geometryNode);
indicesBuffer.delete(section.indicesNode);
indirectBuffer.delete(section.indirectNode);

// Add new data

section.geometryNode = geometryBuffer.load(section.geometry, section.geometryNode);
section.indicesNode = indicesBuffer.load(section.indices, section.indicesNode);
section.indirectNode = indirectBuffer.load(section.drawCount,
                                           1,
                                           section.indicesNode.offset,
                                           section.geometryNode.offset,
                                           0,
                                           section.indirectNode);

The load method of the dynamic buffer is quite simple: 

BufferNode load(Buffer data, BufferNode node){

    if(node != null && node.length < sizeof(data)){
        * remove node from buffer *
        node = null;
    }

    if(node == null){
        * find a place for a new node that has the size of data *
        * grow the buffer if there is no space for the node *
    }

    glBindBuffer(GL_COPY_WRITE_BUFFER, id);
    glBufferSubData(GL_COPY_WRITE_BUFFER, node.offset, sizeof(data), data);
    glBindBuffer(GL_COPY_WRITE_BUFFER, 0);

    return node;
}

This method of memory management worked quite well. There were just two big problems. The first was that the buffer had to be copied completely whenever it had to resize to a greater size. And the buffer was already huge to begin with, as terrain takes up a lot of memory by default. If the program then loads a new chunk section, the whole buffer has to be moved around just to add a small piece of mesh. This leads to the second problem: synchronization between the GPU and the application. If the dynamic buffer grows and the whole buffer has to be copied, the GPU first has to finish all the copying before it can edit the data again for the next chunk mesh. This doesn't sound like a big deal, but we run into the same problem as with the many draw calls: we are editing the buffer many times with separate calls, which results in exactly the driver overhead we wanted to reduce. Therefore, I had to come up with something different.

Pooled buffers:

And something different I found. What if the buffer does not shrink and grow, but has a fixed size and is divided into little parts that can be occupied by mesh data? These parts can then be edited individually whenever they are needed. This way, the whole buffer does not need to be copied in order to add data to it. This technique is called memory pooling, as there is a big pool of unoccupied memory in the buffer. Whenever we need memory from the pool, we just take a fixed amount of it from the buffer and fill it with our mesh data. Because we scoop up a bit of memory each time we need it, the little containers in the buffer are called memory buckets. In the case of a voxel engine, you could also call this process voxel pooling.

The initialization and updating of the mesh data is nearly the same as with the dynamic buffer. The only difference is the underlying way of handling the data: we now have a fixed-size buffer containing fragments of memory that we can access separately at any time. The load method for this technique now looks like this:

MemoryBucket load(Buffer data, MemoryBucket bucket){

    if(sizeof(data) > globalBucketSize) return null;

    if(bucket == null){
        bucket = freeBuckets.pop();
    }

    glBindBuffer(GL_COPY_WRITE_BUFFER, id);
    glBufferSubData(GL_COPY_WRITE_BUFFER, bucket.offset, sizeof(data), data);
    glBindBuffer(GL_COPY_WRITE_BUFFER, 0);

    return bucket;
}

You will notice that the load method looks a lot simpler than the one of the dynamic buffer. And it is. Memory pooling does not need an overly complicated method that traverses a node list to find a space big enough for the data to fit in. It does not even care about the size of the data at all (except in the if clause, where the size is checked against the bucket size of the buffer). The buffer just has to keep track of the positions in the buffer that are not used yet; these positions are simply the offsets of the free memory buckets. I store them in a stack structure, so I can just pop an offset from the stack whenever I need one.
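The free-bucket stack can be sketched in a few lines of plain Java (GL calls omitted, all names made up for illustration):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of the fixed-size bucket bookkeeping behind the pool.
public class BucketPool {
    private final long bucketSize;
    private final Deque<Long> freeOffsets = new ArrayDeque<>();

    public BucketPool(int bucketCount, long bucketSize) {
        this.bucketSize = bucketSize;
        // Push offsets in reverse so buckets are handed out front-to-back.
        for (int i = bucketCount - 1; i >= 0; i--) {
            freeOffsets.push((long) i * bucketSize);
        }
    }

    // Pop a free bucket offset; -1 signals "data too big" or "pool exhausted".
    public long acquire(long dataSize) {
        if (dataSize > bucketSize || freeOffsets.isEmpty()) return -1;
        return freeOffsets.pop();
    }

    // Returning a bucket is just pushing its offset back onto the stack.
    public void release(long offset) {
        freeOffsets.push(offset);
    }
}
```

No searching, no resizing: acquiring and releasing a bucket are both O(1), which is the whole appeal over the node-chain approach.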

We have now improved the performance of loading new terrain, because we no longer copy the whole buffer every time we run out of space. But we still have the same problem as before: we are editing the buffer multiple times, which causes a lot of driver overhead. The best case of the pooled approach is currently only as fast as the best case of the old dynamic buffer. How do we fix that?

Buffer mapping:

Buffer mapping is a technique that helps you interface with the buffers you created. It also increases the performance of reading from and writing to a buffer. To map a buffer, you just call glMapBuffer(target, access); where target is the buffer you want to map and access specifies how the mapped buffer is going to be used (GL_READ_ONLY, GL_WRITE_ONLY, GL_READ_WRITE). This method returns a pointer to the buffer that you can use to read from it or write to it. Once you are finished modifying the buffer, you unmap it with glUnmapBuffer(target);. You have to do this because you cannot use a buffer in operations like draw calls while it is mapped; the unmapping process makes sure that the data on the GPU and the mapped buffer in the application are the same.

Just using glMapBuffer(); is fine, but if you want finer control over how the buffer is mapped, you can use glMapBufferRange(target, offset, length, access);. This method gives you precise control over which part of the buffer is mapped: you can specify an offset into the buffer and how much data from this offset should be mapped. But the access part is where the real fun and the optimization begin. There are a lot of flags you can combine and pass to glMapBufferRange(); to tell OpenGL how it should map the buffer. The standard flags you always have to set are GL_MAP_WRITE_BIT or GL_MAP_READ_BIT, which specify how you will use the buffer. You can also use both if you want to read from and write to the buffer at the same time.

If we map the buffer in the beginning, we can then easily use the pointer to load new data into the buffer. The new load function and the allocation function of the pooled buffer class looks like this now:

private Buffer mappedBuffer;

void allocate(long bucketCount, long bucketSize){

    id = glGenBuffers();

    glBindBuffer(GL_COPY_WRITE_BUFFER, id);
    glBufferStorage(GL_COPY_WRITE_BUFFER, bucketCount * bucketSize, 0, GL_MAP_WRITE_BIT);

    mappedBuffer = glMapBufferRange(GL_COPY_WRITE_BUFFER, 0, bucketCount * bucketSize, GL_MAP_WRITE_BIT);

    glBindBuffer(GL_COPY_WRITE_BUFFER, 0);
}

MemoryBucket load(Buffer data, MemoryBucket bucket){

    if(sizeof(data) > globalBucketSize) return null;

    if(bucket == null){
        bucket = freeBuckets.pop();
    }

    mappedBuffer.position(bucket.offset);
    mappedBuffer.put(data);

    return bucket;
}

We have now gained a bit of performance by using mapped buffers. However, if we issue an indirect draw call while the buffer is mapped, OpenGL will throw an error, because as long as the buffer is mapped, the changes you make to it are not guaranteed to be visible to the GPU. You have to unmap it before you use the data inside the buffer. But unmapping the buffer can be quite difficult and expensive at times. This is why we would rather map the buffer persistently, so that we can keep the pointer forever. We can do that by adding GL_MAP_PERSISTENT_BIT to the flags of glBufferStorage and glMapBufferRange.

But now we have another problem: the changes we make to the mapped buffer are still not going to be visible to the GPU. There are two ways to fix this. We could add GL_MAP_FLUSH_EXPLICIT_BIT to the flags of glMapBufferRange to tell OpenGL that we want to flush the data to the GPU manually; to do that, we either call glFlushMappedBufferRange(); with the respective arguments, or set a memory barrier with glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);. That does the job, but we could also just add GL_MAP_COHERENT_BIT to the flags of glBufferStorage and glMapBufferRange, which makes buffer modifications visible to the GPU directly. However, GL_MAP_COHERENT_BIT can be quite expensive performance-wise, so you should use it with care.

Frustum culling and animations using compute shaders

In this video, you can see the chunks playing a little animation once they get loaded in. You may also have asked yourself how frustum culling is realized with indirect rendering. We can't just say that we don't want to draw one specific chunk; we would have to remove the whole draw call from the buffer. Doing this through a mapped buffer is possible, but with a lot of camera movement this method is not practical, as we want to reduce the data traffic as much as possible.

Therefore, and because I wanted to do something with them, I used compute shaders to perform the frustum culling and the fancy animations on the chunk sections. For everyone who does not know what compute shaders are: a compute shader is a kind of shader (a program on the GPU) that can perform parallel computations on data other than meshes and pixels. Usually they are tasked with calculating particle behavior or similar work that can run in parallel while sharing the same program.

In my case, the compute shader checks whether a chunk section is visible and whether it should appear on screen. It achieves this by having a second buffer that serves as the real draw indirect buffer, next to the normal indirect buffer that the application adds draw calls to. The compute shader then processes every draw call in the application's indirect buffer in parallel and decides whether it should be added to the draw indirect buffer. You might notice that the compute shader needs some sort of coordinate to know where the chunk sections are located. So from now on, the draw calls we put into the indirect buffer carry three additional values representing the position of the chunk section. Don't forget that the stride in glMultiDrawElementsIndirect(); is now 3 * sizeof(Float) bigger and cannot be set to 0 anymore. A normal draw call has a stride of 5 * sizeof(Integer); the resulting stride with the position is 3 * sizeof(Float) + 5 * sizeof(Integer). The initialization and binding of both buffers would look like this:

// Initialization

indirectBuffer = new PooledBuffer();
indirectDrawBuffer = glGenBuffers();

indirectBuffer.allocate(bucketAmount, sizeof(Drawcall));

glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectDrawBuffer);
glBufferData(GL_DRAW_INDIRECT_BUFFER,
             bucketAmount * sizeof(Drawcall),
             null,
             GL_STATIC_DRAW
             );
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, 0);

// Binding

glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, indirectBuffer.getID());
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, indirectDrawBuffer);

* dispatch the compute shader *
...
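As a sanity check on the stride arithmetic, here is a small Java sketch of the extended command layout: the five standard indirect-command integers followed by three position floats, packed into raw ints via their bit patterns. The class itself is hypothetical; only the layout comes from the description above.

```java
// Sketch of one extended indirect draw command: 5 ints + a chunk position.
public class DrawCommand {
    public static final int INTS = 5;   // count, instanceCount, firstIndex, baseVertex, baseInstance
    public static final int FLOATS = 3; // chunk section position x, y, z
    public static final int STRIDE = INTS * Integer.BYTES + FLOATS * Float.BYTES;

    // Pack one command into raw ints; floats are stored via their bit pattern
    // so the whole command can live in a single int buffer.
    public static int[] pack(int count, int firstIndex, int baseVertex,
                             int baseInstance, float x, float y, float z) {
        return new int[] {
            count, 1, firstIndex, baseVertex, baseInstance,
            Float.floatToIntBits(x), Float.floatToIntBits(y), Float.floatToIntBits(z)
        };
    }
}
```

With 5 * 4 + 3 * 4 = 32 bytes, STRIDE is the value that would be passed to glMultiDrawElementsIndirect as the stride argument.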

The shader also keeps track of how many meshes are visible and returns this value to the application. The mesh count is then used as drawCallCount in glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, 0, drawCallCount, sizeof(Drawcall));. This counter is implemented using an atomic counter. Atomic counters are special buffers that can keep track of a single number across multiple compute shader invocations. We need them to handle synchronization between all the invocations; without atomic counters, some invocations would interfere with each other, and we would get wrong values as a result. Additionally, the atomic counter buffer needs to be mapped, so that it can be reset before the shader starts running. The atomic counter creation, mapping and binding look like this:

// Initialization

atomicCounter = glGenBuffers();

flags = GL_MAP_WRITE_BIT |
        GL_MAP_READ_BIT |
        GL_MAP_PERSISTENT_BIT |
        GL_MAP_COHERENT_BIT;

glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, atomicCounter);

glBufferStorage(GL_ATOMIC_COUNTER_BUFFER, sizeof(Integer), null, flags);

mappedCounter = glMapBufferRange(GL_ATOMIC_COUNTER_BUFFER,
                                 0,
                                 sizeof(Integer),
                                 flags);

glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, 0);

// Reset and binding

mappedCounter.position(0);
mappedCounter.putInt(0);

glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 3, atomicCounter);

Once all the buffers are configured, we can import and use this shader like any other shader in OpenGL. After binding the shader, we dispatch it using glDispatchCompute(num_x, num_y, num_z);. If you want to know more about how compute shaders work in general and how to set them up, take a look at the Resources section.

But we have to remember to set a memory barrier for all shader storage buffers and atomic counters, essentially flushing their values so that they are visible to both the GPU and the application. We do that by calling glMemoryBarrier(); with a bitwise combination of GL_SHADER_STORAGE_BARRIER_BIT and GL_ATOMIC_COUNTER_BARRIER_BIT.

The compute shader itself is kind of a mess right now, and it is not that spectacular in itself, so I will not show it here. If you still want to take a look at it, I will post it here in the future for everyone to see.

Synchronization problems

The outline of the algorithm we have right now looks kind of like this:

updateMeshes();
runComputeShader();
getDrawCountFromCompute();
glMultiDrawElementsIndirect();

If we ran this, there would not be much on screen. Maybe some chunk meshes flickering, but not much more. Why is that?

Something goes wrong with the retrieval of the draw count. It seems we cannot receive the correct number of draw calls that the atomic counter accumulated across the compute shader invocations by the time we reach the indirect draw call. The cause is that the GPU is simply not fast enough to crunch through all the chunk mesh calculations by the time we want to read the data back into the application. This problem is not obvious at first, as nothing in the application appears to be slowed down by the compute shader. But we have to remember that the GPU and the application do not run in sync with each other; we have to sync them up ourselves. Before we do that, let's visualize what is going on, so we can understand it better.

In the diagram, we can see that the GPU is still processing the compute shader when the application reads the draw count from the atomic counter buffer. By the time this draw count is received, only a small number of chunk sections has been processed. And while the compute shader is still running on the GPU, the application issues the indirect draw call using the wrong draw count.

To fix this synchronization error, we have to make use of sync objects. These are very handy if you want your application to check whether the GPU has finished a certain task. And they are very easy to use too:

fenceObject = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

do{
    result = glClientWaitSync(fenceObject, GL_SYNC_FLUSH_COMMANDS_BIT, 0);
}while(result != GL_ALREADY_SIGNALED && result != GL_CONDITION_SATISFIED);

glDeleteSync(fenceObject);

This code snippet halts the application until all commands that were in the GPU's command queue before glFenceSync(); was called have been processed. We can now add these sync fences to our algorithm outline to make sure the compute shader has finished its calculations before we retrieve the number of meshes that need to be drawn. Additionally, we place a sync fence after updating the meshes, as we want to make sure all geometry data is uploaded to the GPU before we use it.

updateMeshes();
placeFenceAndWait();

runComputeShader();
placeFenceAndWait();

getDrawCountFromCompute();
glMultiDrawElementsIndirect();

The final step: Double buffered voxel pool

The algorithm we have now works. But we can make a few more improvements, for example by using a double buffered voxel pool! So far, our pooled buffer class contains only one buffer, which is mapped and written to directly. This works fine, but we can speed it up with a second buffer that we map and edit instead. The changed parts of this second buffer then get copied over to the main buffer. Copying between buffers might sound like we are back to the dynamic buffer technique, but we don't have to move the whole mapped buffer around; only the little parts that changed need to be copied. And since both buffers already exist on the GPU itself, we don't have to worry about that cost too much.

Rendering from a buffer that is created and mapped with the GL_MAP_COHERENT_BIT flag set is rather expensive. We don't have this problem here, because the buffer containing the actual render data does not need any mapping flags at all; it is never mapped. And the smaller secondary buffer is marked with GL_CLIENT_STORAGE_BIT, which hints that the data in this buffer is mainly accessed by the application. With this combination of a secondary buffer and buffer flag bits, we get a good performance boost.

Other optimizations we can make

Compressing vertex data:

Our terrain mesh consists of vertices, and these vertices need attributes like position, color, normal and maybe extras like the emission of the voxel. We could just upload these values as they are. That would mean the following layout for each vertex:

vertex{
    vec3 position; // 3 * 4 Bytes = 12 Bytes
    vec4 color; // 4 * 4 Bytes = 16 Bytes
    vec3 normal; // 3 * 4 Bytes = 12 Bytes
    float emission; // 1 * 4 Bytes = 4 Bytes
} //       > 44 Bytes per vertex

With this data layout, we end up with 44 bytes per vertex. That doesn't sound like much, but with a lot of vertices, the memory footprint on the GPU becomes huge, especially as we are using a double buffering system which, at worst, doubles the amount of memory occupied by the application. So how do we shrink that down?

The first thing we can do is compress the position. As you may remember, we have the coordinates of each chunk section saved in the indirect draw call buffer for frustum culling. We can now use these coordinates as translation values, by adding an attribute pointer that refers to them (I turned the translation into a vec4 value to store the chunk alpha for the transition in the w component). This attribute pointer needs an attribute divisor, as it applies to a whole mesh. You can use the position of the draw call in the indirect draw buffer as the base instance value of the draw call itself:

// Configuring the pointers:

glBindBuffer(GL_ARRAY_BUFFER, indirectDrawBuffer);

glEnableVertexAttribArray(3);

glVertexAttribPointer(3,
                      4,
                      GL_FLOAT,
                      false,
                      5 * sizeof(Integer) + 4 * sizeof(Float),
                      5 * sizeof(Integer)
                      );

glVertexAttribDivisor(3, 1);

glBindBuffer(GL_ARRAY_BUFFER, 0);

// Loading the draw call in the compute shader

cmd[0] = count; // Amount of indices
cmd[1] = 1; // Amount of instances
cmd[2] = firstIndex; // From indicesBucket.offset
cmd[3] = baseVertex; // From geometryBucket.offset
cmd[4] = currentIndex; // Base instance: the index of this draw call in
                       // the buffer, used to fetch the translation
Once we have taken care of that, we can compress the vertex position down to a single integer. We know the dimensions of the chunk section we are generating the mesh for. If a chunk section contains just 16*16*16 voxels, each vertex coordinate (running from 0 to 16) fits into five bits. One byte is eight bits, and an integer is 32 bits long, so each vertex position can be stored in a single integer with a bit of bit-shifting. We even have some spare room to stuff more information into, like a flag for whether the voxel is a liquid or something similar. In my implementation, I use 9 bits per coordinate; the bits that are still free get used later. The advantage of using 9 bits instead of five is that I can stay flexible with the size of the chunks. Maybe I will reduce the bits per coordinate in the future. If we unpack the position in the shader, we can add it to the translation of the chunk and get the final position of the vertex.

vec3 position = vec3(float((positions_packed >> 23) & 511) + translation.x,
                     float((positions_packed >> 14) & 511) + translation.y,
                     float((positions_packed >> 5) & 511)  + translation.z);
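The CPU-side counterpart of this unpacking could look like the following Java sketch, using the same shift amounts (23, 14, 5) and 9-bit masks as the shader code above. The class itself is illustrative, not from the actual engine.

```java
// Pack/unpack a vertex position into one int: x in bits 23..31, y in
// bits 14..22, z in bits 5..13. Bits 0..4 stay free for extra flags.
public class VertexPacking {
    public static int packPosition(int x, int y, int z) {
        return (x & 511) << 23 | (y & 511) << 14 | (z & 511) << 5;
    }

    // Unsigned shift (>>>) so a set sign bit (large x) does not smear.
    public static int unpackX(int packed) { return (packed >>> 23) & 511; }
    public static int unpackY(int packed) { return (packed >>> 14) & 511; }
    public static int unpackZ(int packed) { return (packed >>> 5) & 511; }
}
```

Note the unsigned right shift: with 9 bits, x can occupy bit 31, and a signed shift would drag the sign bit into the result.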

Secondly, we can reduce the size of the normal to three bits. This is quite easy, because in a voxel environment there are only six possible normal values in total. We can store these six normals in an array and use a single integer as an index into it. And since there are only six viable normals, we only need three bits for the normal index. These three bits can be stored inside the geometry integer.

ivec3[] NORMALS = ivec3[](ivec3(0,0,1), ivec3(0,0,-1), ivec3(1,0,0), ivec3(-1,0,0), ivec3(0,1,0), ivec3(0,-1,0));
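As an illustration, the 3-bit normal index could be stashed in the free low bits of the packed position integer like this. The exact bit positions (0..2) and the ordering of the lookup table are assumptions for this sketch; the text only says the spare bits get used.

```java
// Sketch: a 3-bit normal index stored in the low bits of the geometry int.
public class NormalPacking {
    // The six axis-aligned normals a voxel face can have (order illustrative).
    public static final int[][] NORMALS = {
        { 0, 0,  1}, { 0, 0, -1}, { 1, 0, 0},
        {-1, 0,  0}, { 0, 1,  0}, { 0, -1, 0}
    };

    // Place the index in bits 0..2, which the 9-bit position packing leaves free.
    public static int withNormal(int packedPosition, int normalIndex) {
        return packedPosition | (normalIndex & 7);
    }

    // Recover the normal vector by table lookup.
    public static int[] normalOf(int packed) {
        return NORMALS[packed & 7];
    }
}
```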

The third, and maybe easiest, value to compress is the color. Instead of sending each color component as a float, we can send each component as a single byte, packing all four components into one integer.
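A minimal Java sketch of this packing, assuming four 0..1 float channels packed into one RGBA8 integer (the channel order here is an assumption):

```java
// Pack four float channels (0..1) into one int, one byte per channel.
public class ColorPacking {
    public static int packColor(float r, float g, float b, float a) {
        int ri = Math.round(r * 255f), gi = Math.round(g * 255f);
        int bi = Math.round(b * 255f), ai = Math.round(a * 255f);
        return ri << 24 | gi << 16 | bi << 8 | ai;
    }

    // Recover one channel as a 0..1 float, given its bit offset (24/16/8/0).
    public static float channel(int packed, int shift) {
        return ((packed >>> shift) & 255) / 255f;
    }
}
```

In the shader, the same unpacking can be done with bit shifts, or the attribute can be declared as four normalized unsigned bytes so OpenGL unpacks it automatically.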

And just like that, our value structure for each vertex looks like this now:

vertex{
    uint geometry; // 1 * 4 Bytes = 4 Bytes
    uint color; // 1 * 4 Bytes = 4 Bytes
    float emission; // 1 * 4 Bytes = 4 Bytes
} //       > 12 Bytes per vertex

And just like that, we reduced the size of one vertex from 44 bytes to 12 bytes, a reduction of about 73%. This compression has two major benefits: the memory footprint on the GPU is reduced, and the data doesn't take as long to upload.

Limit the number of mesh updates:

There are a lot of mesh updates happening each frame if you move around the game world. If you upload the data of every mesh in a single frame, the game will freeze for a bit. But if you distribute the upload of the mesh data over multiple frames, the game does not freeze and continues to run smoothly. The player can't see the delay in loading some meshes anyway, as it still happens very fast. By limiting the updates per frame, we can also reduce the size of the secondary buffer in the double buffered voxel pool.
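A minimal sketch of such a per-frame budget, with made-up names: a queue of pending uploads from which only a fixed number is processed each frame.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Spread pending mesh uploads over multiple frames instead of doing them all at once.
public class MeshUpdateQueue {
    private final Deque<Runnable> pending = new ArrayDeque<>();
    private final int updatesPerFrame;

    public MeshUpdateQueue(int updatesPerFrame) {
        this.updatesPerFrame = updatesPerFrame;
    }

    public void enqueue(Runnable upload) {
        pending.add(upload);
    }

    // Called once per frame: run at most 'updatesPerFrame' uploads and
    // report how many were actually processed.
    public int processFrame() {
        int done = 0;
        while (done < updatesPerFrame && !pending.isEmpty()) {
            pending.poll().run();
            done++;
        }
        return done;
    }
}
```

The budget also bounds how much staging memory one frame can touch, which is what allows the secondary buffer of the voxel pool to stay small.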

Add LOD to the terrain:

LOD means Level Of Detail, which is the process of making multiple versions of one mesh with less detail than the original. This is useful when we don't want to display the complex geometry of an object all the time. And we don't have to: if an object is far away, we don't see every detail anyway. You can implement LOD in voxel worlds really easily, as you can just scale the voxels up. If you are using an octree structure in your voxel world, it is even easier. I want to implement LOD in my project at some point, but there are other things that come first.

Occlusion testing:

In OpenGL, you can use occlusion queries to test whether something is visible on screen. You could implement this so that each chunk section first renders itself as one large cube, which in turn tells the renderer whether the chunk section is visible or not. If it is not visible, you don't have to generate the mesh data, and the renderer does not have to render the complex terrain mesh. I have been planning to do occlusion testing for the terrain for a long time now, but I couldn't get myself to do it, as I still have to learn how to use occlusion queries correctly.

Conclusion

This marks the end of my explanation. I hope it made things clearer and helps you with your own voxel related projects. There is still a lot I can do to optimize my project. If you have any questions, or tips, tricks and suggestions for me, you can write them in the comments or use the Google form below.

Thanks for reading!

Resources:

A core explanation of vertex/voxel pooling: https://nickmcd.me/2021/04/04/high-performance-voxel-engine/


An explanation for persistent buffer mapping: https://www.cppstories.com/2015/01/persistent-mapped-buffers-benchmark/


The wiki documentation of syncing OpenGL: https://www.khronos.org/opengl/wiki/Sync_Object


The wiki documentation of indirect rendering: https://www.khronos.org/opengl/wiki/Vertex_Rendering#Indirect_rendering


Compute shader explanation: https://antongerdelan.net/opengl/compute.html 


A second compute shader explanation: https://medium.com/@daniel.coady/compute-shaders-in-opengl-4-3-d1c741998c03 


A good video explaining voxel world optimizations: https://youtu.be/VQuN1RMEr1c


Footnotes and explanations:

VAO: Vertex Array Object. Used in OpenGL to tell the GPU how the bound buffers should be read. https://www.khronos.org/opengl/wiki/Vertex_Specification#Vertex_Array_Object

VBO: Vertex Buffer Object. A buffer on the GPU that contains vertex data (data for rendering a mesh). https://www.khronos.org/opengl/wiki/Vertex_Specification#Vertex_Buffer_Object