Project 3

If You Want to Sing Out, Sing Out

Project Proposal

For project 3, like project 2 will use OpenCL to produce a flocking simulation. This time with more advanced behaviors and the following improvements over Attack of the Saucer:

  • Flocking and combat simulation
    • Similar to 'Fleet Commander', the simulation would control thousands of fighters seeking to chase / attack one another
      • Use multiple Vertex Buffer Objects (VBO) for projectile rendering
      • Use of other flocking behaviors (alignment, cohesion)
    • Allow the user to drop gravity wells in the scene instead of saucers
      • Possibly varying size and strength based on user input
    • Using what I learned from project 2, I intend to optimize the simulation to get more than 18,000 verticies rendered at one time.
    • Using a better data structure than project 2
      • Attack of the Saucer was very similar to Andy's galaxy example - a large array
      • Avoid reading in the entire data array for each kernel read as in 'Saucer'
  • I also want to incorporate ideas that other students did in project 2 that I did not:
    • Use of textures instead of rendering triangles for the boids
    • Multiple VBOs for rendering different objects, projectiles, etc
  • Document the effect of  some optimization techniques from week 7 notes
    • Performance comparison between various machines in EVL?
  • Optional: Simulation occurs in a 3D space instead of a 2D plane
  • Optional: 'Fleet Commander' style localized interaction menu
    • Touch capability?
  • May demo on OmegaDesk for better performance verses using the Cyber-commons wall

Overall the goal is to design an optimized version of project 2 that can handle more active objects in the scene.

Environment Setup

The setup for this project is virtually identical to project 2.

This project was developed and tested using the following:

  • Windows 7 x64
  • Visual Studio 2008
  • OpenCL support via the NVIDIA GPU Computing SDK 4.1

This project assumes OpenCL has been installed using one of the following:

Source code download link (Visual Studio 2008 project):

Unzip the source code and open 'CS 525 Project.sln' using Visual Studio 2008.

  • In Visual Studio, on the menu bar go to Debug/Start Debugging (or press F5).


As proposed above, this project is a combat simulation between two groups of starfighters. The movement, targeting, and gravity calculations are done using OpenCL to take advantage of GPU parallelization. The fighter movement and targeting is done on a single OpenCL kernel while projectile movement and collision detection is done on a second kernel. Also the fighters and projectiles are rendered on different vertex buffer objects (VBO).

During the simulation the two fighter groups will attack each other until there are no more forces on the opposing side. Anytime during the simulation the user can drop black holes in the environment to change the balance of power.

As fighters get closer to black holes, the stronger the gravitational pull toward the event horizon at the center. The color of any object will also red-shift as they get closer to the center. Any object reaching the event horizon is destroyed.

By toggling the show black hole option in the menu, the gravitational range (white) and the event horizon (red) are shown.

User Interaction

  • Hold left mouse: Pan the map
  • Hold middle mouse: Zoom in/out
  • Right click: Open menu
    • Create a black hole (at current mouse position)
      • Only four black holes can be active at a time
    • Remove all black holes
    • Toggle the visibility of the black holes
    • Reset the simulation
      • All previously dead fighters are reset
      • Fighter groups are placed in starting areas
      • Black holes are not automatically removed
  • Minus: Slow down time
  • Equals/Plus: Speed up time

Simulator Control Flags

The following changes in the source code or kernels can be used to easily tweak the simulation:

  • nPoints    (project3.cpp line 38)
    • Change the number of verticies used for the fighters (should be a multiple of 6)
  • initCameraX/Y/Z    (project3.cpp line 40)
    • Set the initial camera x, y, z position
  • useLocalMem    (project3.cpp line 44)
    • Toggle the local memory version of the flocking algorithm for fighters
    • The algorithm is currently not performing as intended.
  • dummyLasers    ( line 26)
    • Default 0
    • If 1, projectiles will not destroy fighters on impact

Video Overview


Phase One: Optimization of basic flocking

For this phase, a stripped down version of project 2 was used. The kernel was cleaned out to do only the following:

  • Determine the position of each boid
  • Calculate the cohesion and separation forces
    • Using a for loop to iterate through all boids (as in project 2)
  • Calculate the resulting movement force
  • Update the position of the boid

A few minor changes to the boids were also done:

  • Boids are rendered using a texture
  • Boids are now defined by six points in order to use triangles for rendering rather than quads (who's use is discouraged in OpenGL 3.0 and newer)
  • Center position is stored only in memory rather than also being included in the VBO

The optimization of the flocking algorithm used the following:

  • Flocking calculations are now done using local memory
  • The boids data is divided into blocks
    • nBlocks = globalSize / localSize
  • Based on the Brown Deer Technology OpenCL N-Body Tutorial.
  • Approximately a 1.6x improvement in kernel performance:
    • Global memory version (36000 points)
      • 22 FPS, 38.5 ms kernel speed
    • Optimized local memory version (36000 points)
      • 24 FPS, 24.4 ms kernel speed

Phase Two: Opposing forces and projectiles

This phase will implement the rendering and movement calculations of opposing forces and projectiles. This will initially be attempted using multiple VBOs and/or additional kernels.

  • Second fighter group uses the same kernel and VBO as the first group
  • The second ship texture uses the same image file as the first, just using different texture coordinates to 'shift' the image

Initially projectiles were done using the following:

  • Second VBO passed into the main kernel
  • Kernel would determine if a target is in range and firing arc
  • A colored line would be rendered between the source and target

Projectiles were then done using the following:

  • Projectile movement is done on a second kernel
  • Projectile rendering in done on a second VBO
  • Projectile data is shared between the two kernels
    • Fighter kernel flags the projectile as active and sets initial position and direction
    • Projectile kernel handles movement after being fired as well as rendering to the projectile VBO
  • Projectiles are rendered as points

The final version of projectiles:

  • Projectiles are drawn using six points and textured mapped using triangles like the fighters
  • Projectile VBO uses three arrays:
    • Vertex and texture coordinate arrays for rendering the texture
    • Color array for tinting the color for black holes

Phase Three: Gravity Wells

This phase will implement the gravity well portion of the project as needed for the black holes. This includes the placement and rendering of the black hole as well as adding the gravitational forces on all other objects (fighters and projectiles)

Optimization Tests

Testing Environment:

  • CPU: Intel Core i5 @ 2.8 GHz
  • System RAM: 4 GB
  • OS: Windows 7 x64
  • NVIDIA GeForce GTX 460 @ 1520 MHz
    • 1024 MB

References used:

General Optimizations

The follow statistics represents project 3 running with all features enabled and with a nearly the maximum number of objects expected given the initial vertex amount:

  • 18000 vertexes
  • 3000 active fighters
  • 672 active projectiles
  • Flocking in global memory
  • Fighters on kernel 1 and VBO 1
  • Projectiles on kernel 2 and VBO 2
  • Fighter kernel speed: 18.4 ms
  • Projectile kernel speed: 29.1 ms
  • FPS: 9 ms

Projectile Memory Access Optimization

Note that during high combat situations, the projectile VBO speed is noticeably slower than the projectile one. By design the programs are relatively similar:

  • Share the same movement code
  • Fighters check for collisions with other fighters for flocking
  • Projectiles check for collisions with enemy fighters

During the for loop where a projectile checks all enemy ships for a collision, it was also checking if it was close to its intended target inside the same loop. Considering the global memory location is used to identify the target, this was unnecessarily comparing the list of possible targets to the known one. So any checks for the intended target was moved outside this loop. This resulted in a projectile kernel speed increase from 29.1 ms to 15 ms (1.9x improvement).

This loop is still used for 'virtual 3D' combat where a stray laser has a higher change of hitting a ship that was close to the intended target.

Unrolling the black hole for loop

Considering there is a cap of four active black holes in the simulation, it was easy to unroll this for loop. While a lot messier than before, this improved the fighter kernel speed from 18.4 ms to ~18.1 ms (1.01x improvement)

Replacing function() with native_function()

During the development of project 2, I had written two functions to calculate the distance and angle between two vectors. This was done due to solve problems I was having with the flocking algorithm. For project 3 I was able to fix a few problems that prevented the use of the native distance() function. Improved fighter kernel speed from 18.1 ms to 15.6 ms (1.16x improvement) with acceptable precision loss.

Improved projectile kernel speed from 15 ms to 6.4 ms

Reducing arithmetic in kernel

There are a few variables which are computed in the kernel that could just be constants:

    float w = 83.0 / 4.0;
    float h = 61.0 / 4.0;

    float r = sqrt( pow(w/2.0, 2) + pow(h/2.0, 2) );
    float theta = atan( (h/2.0) / (w/2.0) );
    float fi = PI - 2.0 * theta;

However, it is apparently after to do these calculations rather than just using the solution as constants. Kernel speed from 15.6 ms to 15.9 ms.

Resulted in the same performance drop if the follow was declared as global constants for just replaced the above code:

    float w = 20.7;
    float h = 15.2;

    float r = 12.8;
    float theta = 0.6;
    float fi = 1.8;

Avoiding automatic conversion from double to float by adding an 'f' suffix also had little effect. These values are used to determine the vertices of objects when they are rotated.

Kernel / VBO Arrangement Tests

This test examines the performance of different kernel/VBO configurations for the two fighter groups. For this test projectile firing was disabled and performance is solely based on the flocking algorithms.

Base conditions:

  • 18000 vertexes
  • 3000 active fighters
  • 0 active projectiles
  • Flocking in global memory
  • Fighters on a single kernel and VBO
  • Fighter kernel speed: 18.4 ms
  • FPS: 21 ms

Experiment 1: Fighter groups on different kernels, same VBO

  • Fighter A kernel speed: ~16.6 ms
  • Fighter B kernel speed: ~16.6 ms
  • FPS: 13

Experiment 2: Fighter groups on same kernel, different VBOs

  • Fighter A kernel speed: ~16.6 ms
  • Fighter B kernel speed: ~16.6 ms
  • FPS: 13

Experiment 3: Fighters and projectiles on same kernel, different VBOs

  • Kernel speed: ~13.7 ms
  • FPS: 28
  • While this yields a performance improvement, projectiles no longer fired correctly, and this version of the kernel file is more complicated and harder to debug than the base case. I was unable to resolve this problem by the due date.
    • This bug may be due to a a lack synchronization between the projectiles and the fighters.
    • Fixing this would require a reconstruction of the kernel and data structures.

Future Optimizations

  • Proper current / next data structure for synchronization
  • Using single precision constants (using an 'f' suffix)
  • #pragma unroll directive (cl_nv_pragma_unroll)
  •  native_divide (x, y)