Tuesday 2/14/17
12:30 PM
I spoke to Prof. Dorland. He has been busy and therefore unable to respond to our email. I sent him a reminder email asking for some times when we might meet with him and for some reading material to familiarize ourselves with the subject.
Friday 2/17/17
3:00 PM
The group met with Prof. Dorland in his office. He discussed 5 research projects with us. We will each do one of these individually, but there is enough similarity between projects for collaboration. He asked us to email him with a description of what we would ideally like to be doing in our research projects.
5:00 PM
I emailed Prof. Dorland, indicating my interests.
Saturday 2/25/17
2:00 PM
The group met with Prof. Dorland in his office to build the basic understanding necessary for his research.
We discussed ways to represent points in 2-dimensional space. We worked our way up to 8-dimensional space. We learned how to take a dot (inner) product in higher dimensions. We discussed how to transform from one coordinate system to another in up to 8 dimensions. This can be done by multiplying a vector by a transformation matrix. This is related to eigenvalues, which we also discussed. Both require linear algebra. I am the only one in the group who has not taken linear algebra, so I will have to learn some background.
Prof. Dorland gave us a problem to do before our next meeting. The problem was to transform the 8-dimensional vector F(x) = sin(pi/4 * x), where x takes the integer values 0 through 7 (one value per component), into a coordinate system such that only the first component has a non-zero value.
Monday 2/27/17
Afternoon
I found a solution to the sine problem.
7:00 PM
The group met to discuss the sine problem.
Tuesday 2/28/17
11:00 AM
We were supposed to meet with Prof. Dorland, but he had to cancel the meeting. We used the allotted time to discuss the sine problem.
Saturday 3/4/17
Morning
I finished reversing the transformation I used to solve the sine problem. To do this, I solved a system of equations in 8 variables. I was using the methods of linear algebra, though I did not know at the time that that was what I was doing.
1:00 PM
We met Prof. Dorland in his office to finish our discussion of the fundamentals of his research. We discussed compression (as in the compression of data), which was the premise of the sine problem. We found and discussed a general way to compress higher-dimensional vectors. We then discussed the Fourier transform, which decomposes a function into sine and cosine constituents. The amplitude of each constituent then serves as a coordinate in the new basis.
Tuesday 3/7/17
11:00 AM
We met Prof. Dorland in the classroom to discuss a specific example of how to approximate the solution to a differential equation.
Friday 3/10/17
5:15 PM
We met Prof. Dorland in his office.
Wednesday 3/15/17
5:00 PM
I met with Prof. Dorland to discuss possible projects in which I might be interested. I decided on one that interested me the most.
Saturday 3/18/17
9:30 AM - 11:45 AM
I conversed with Prof. Dorland through a Gmail chat about the specifics of what I will be doing. He gave me an article to read to get familiar with the language CUDA. The conversation is recorded in the file <DIgital_Conversation_with_ProfDorland.pdf> (attached at the bottom of the page).
Wednesday 3/29/17
5:30 PM - 7:00 PM
I met with Prof. Dorland, and he gave me a list of tasks I will do as part of my project. First I will need to build up my development-environment skills with GNU make, Emacs, and Subversion. I will continue to build these skills throughout the rest of the project. I will then need to use the nvcc compiler for GPU code. One of the long-term goals of this project is to link multiple GPUs doing computations in the same code. He advised me to download a CUDA software development kit to see many examples of CUDA code.
11:15 PM - 1:45 AM (Thursday)
I researched makefiles and successfully made one to build and run a hello world program.
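A minimal makefile of the kind described might look like this (the file and target names are my own stand-ins, not the ones I actually used; recipe lines must begin with a tab):

```make
# Build "hello" from hello.c; "make run" builds it (if needed) and runs it
CC = gcc

hello: hello.c
	$(CC) -o hello hello.c

run: hello
	./hello

clean:
	rm -f hello
```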
Tuesday 4/11/17
9:30 PM - 12:30 AM (Wednesday)
I wrote a CUDA file to take the sum of all integers from 1 to N.
I made a make file to run the code.
The code could not pass ordinary host variables to the device, so I had to convert everything to pointers.
I added a destruction section to clear up the memory on the device and locally.
I fixed all the bugs the compiler could catch and created an executable with make.
I got output but the value of the output was wrong.
Wednesday 4/12/17
11:00 AM - 12:00 PM
I added a shared variable to hold the sum and synchronized the threads before and after each addition to it. It still didn't work. I then abandoned my initial method in favor of putting the terms in an array, adding them in pairs, and writing the results to a new array, repeating until only the sum remains. There are syntax errors I still need to figure out.
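The pairwise scheme described above is a standard tree reduction in shared memory. A sketch of what such a kernel could look like (the block size and names are my own; this sketch assumes the terms fit in one block of 256 threads and the block size is a power of two):

```cuda
__global__ void sumReduce(const int *terms, int *sum, int N) {
    __shared__ int s[256];            // one slot per thread; size is a compile-time constant
    int i = threadIdx.x;
    s[i] = (i < N) ? terms[i] : 0;    // load this thread's term (0 pads unused slots)
    __syncthreads();                  // wait until every term is loaded

    // Add pairs, halving the number of live entries at each level
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (i < stride) s[i] += s[i + stride];
        __syncthreads();              // finish this level before starting the next
    }
    if (i == 0) *sum = s[0];          // only the total remains
}
```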
Tuesday 4/18/17
4:00 PM - 5:30 PM
Cannot progress on the summation problem using the method of arrays, because the dimensions of a statically declared array cannot be set with a variable. They can be set with a constant; however, the constant cannot be computed from a variable; it must be hard-coded. Since the dimensions of the array needed change as a function of N (specifically (1 + log2(N)) by N), they cannot be hard-coded.
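The constraint, in miniature (the kernel is a made-up example):

```cuda
__global__ void example(int N) {
    // __shared__ int a[N];      // illegal: N is not known at compile time
    __shared__ int b[256];       // legal: the size is a hard-coded constant
    b[threadIdx.x % 256] = N;    // (body only to use the array)
}
```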
11:30 PM - 2:30 AM (Wednesday)
I wrote a code to perform scalar multiplication on a vector (aB) on two different GPUs in series. I got nonsensical output and a long stream of runtime errors.
Wednesday 4/19/17
1:00 PM - 2:00 PM
I had allocated the wrong amount of memory for one of my variables, which was causing the runtime errors. I had also used an integer conversion in the print function to express a float, which reinterpreted the underlying bytes and produced the nonsensical output. I fixed these errors, and now I am getting the expected output.
5:00 PM - 6:00 PM
Prof. Dorland and I had our usual meeting.
During that week
I wrote a code with a kernel to return true if two vectors had identical entries. This would have worked except that kernels, being global functions, cannot return non-void values.
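The fix is to pass a device pointer for the result instead of returning it (a sketch; the names are my own):

```cuda
// __global__ kernels must return void, so the comparison result comes
// back through device memory instead of a return value.
__global__ void vectorsEqual(const float *a, const float *b, int n, int *equal) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && a[i] != b[i])
        *equal = 0;   // any mismatching thread clears the flag (host initializes it to 1)
}
```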
Wednesday 4/26/17
5:00 PM - 6:00 PM
Prof. Dorland and I had our usual meeting.
Sunday 4/30/17
6:00 PM - 7:00 PM
I wrote a code to measure the time taken to transfer a variable amount of data in either direction between the CPU (host) and the GPU (device). I began by collecting data for NULL variable transfers. If a pointer is set to NULL, it takes the same amount of time to transfer between host and device regardless of its size. Because of this, I assigned every element of the array to 0 rather than setting the pointer to NULL. Over the following two weeks, whenever I had a spare moment, I measured and recorded times for different sizes of data transfers. I also continued to update the code during this period to make it more general and able to handle large numbers.
*How the code works*: A single transfer of a small amount of data does not take long enough to measure. To get around this problem, the line that transfers the data sits in a for loop that runs 1 million times. That loop is inside another for loop that can make the overall run take longer if necessary. The 1 million count in the inner loop is divided by a factor which can be increased to make the run shorter if necessary. The faster (C) and slower (T) constants can be changed at the beginning of the code, and the location to do this is explained with comments. The size of the data being transferred can also be changed at the top. The code prints "Start" to the screen when the data has started transferring and "End" when the transfer has ended.
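In outline, the timing section looks like this (a sketch reconstructed from the description above, not the file itself; C and T correspond to the faster and slower constants):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define C 1   // faster constant: divides the inner loop count
#define T 1   // slower constant: repeats the inner loop

// dev and host point to already-allocated device and host buffers of `bytes` bytes.
void timeTransfer(float *dev, float *host, size_t bytes) {
    printf("Start\n");                       // start the stopwatch here
    for (long outer = 0; outer < T; outer++)
        for (long i = 0; i < 1000000 / C; i++)
            cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    printf("End\n");                         // stop the stopwatch here
}
```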
*How to run the code*: I also wrote a make file for the code. Prof. Dorland can give you the pathway to the directory which houses the code and the make file. In the directory with the code and the make file do the following:
Type "make". This will create an up-to-date version of the executable "exe", which can then be run.
It is only necessary to do "make" if the file has been updated or "exe" does not exist.
You can check for the executable "exe" by typing "ls".
To run "exe", type "./exe" in the directory that now contains it.
When you have finished measuring times, type "make clean" to remove the executable and any object files also created by "make".
*Timing procedure*: The time in microseconds required to transfer the selected size of data is equal to C/T times the time between "Start" and "End". This time can be measured with a stopwatch. There is an intentional slight delay between starting the code and "Start". Starting and stopping the stopwatch are therefore delayed by the same amount of time, which depends on each person's reaction time; the measurement of interest (the time between "Start" and "End") is thus the same for each person.
7:00 PM - 2:00 AM (Monday)
As a group we wrote our research paper by combining our individual sections.
Wednesday 5/3/17
5:00 PM - 6:00 PM
I showed Prof. Dorland my code and the data I had collected. He asked me to use these to figure out the maximum amount of data he could transfer with 3 GPUs running his current code while still doubling the efficiency.
Tuesday 5/9/17
12:00 AM - 2:00 AM
I calculated the maximum amount of data that could be transferred with 3 GPUs running a code that currently takes 1 second per timestep while still doubling the efficiency.
First, I set the speedup in Amdahl's Law equal to 2 (for double efficiency) with n = 3 GPUs: 1/((1 - p) + p/3) = 2.
Then I solved for p, the proportion of the time that must be spent in parallel, and found that p = 3/4.
When the data is in the latent region, it takes approximately 1.5 times longer to transfer data from the device to the host than from the host to the device; however, for non-latent data, these times are approximately the same. I therefore defined the lower boundary of non-latency as the smallest amount of data D such that
t_dh(D) = t_hd(D),
where t_dh and t_hd are functions that give the time in microseconds required to transfer D bytes of data from the device to the host and from the host to the device, respectively. Since every GPU in the system must transfer its own data to the host only once, and the host must then transfer each GPU's data to all the other GPUs, the time spent transferring data for a system with n GPUs is given by
t_transfer = n * t_dh(D) + n(n-1) * t_hd(D).
When the data is non-latent, t_dh(D) = t_hd(D) = t(D), and this equation reduces to
t_transfer = n^2 * t(D).
For Prof. Dorland's code and GPU cluster, the transfers may take up at most the serial fraction (1 - p) = 1/4 of each timestep:
n^2 * t(D) = T/4,
where T is the original time per timestep in microseconds, and n = 3. t(D) can be obtained from this equation, and D can then be interpolated from the graph of t(D). The maximum amount of data that could be transferred with 3 GPUs running a code that currently takes 1 second per timestep, while still doubling the efficiency, is therefore 100 MB.
I then made substitutions into Amdahl's law to give the predicted speedup for the generalized case of n GPUs. Assuming non-latency and using
1 - p = n^2 * t(D) / T
(the fraction of each timestep spent transferring data), I obtained the equation
S(n) = T*n / (t*n^3 - t*n^2 + T).
Observe that
lim (n -> infinity) S(n) = 0.
This indicates that infinitely increasing the number of GPUs does not infinitely increase the efficiency. In fact, doing so reduces the efficiency to 0, meaning that with infinite GPUs, infinite time is spent transferring data. Solving for n such that
S(n) = 1,
I obtained the number of GPUs for a system where efficiency is neither gained nor lost; increasing the number of GPUs beyond this number only decreases the efficiency. This number is
n_0 = sqrt(T/t).
Note that I have used t without a subscript rather than t_dh or t_hd with a subscript, since t_dh(D) = t_hd(D) for non-latent data. I will use this notation from now on.
The first derivative of S is given by
S'(n) = T(T + t*n^2 - 2t*n^3) / (t*n^3 - t*n^2 + T)^2,
and at any point c where the numerator factor vanishes (that is, S'(c) = 0), the second derivative of S simplifies to
S''(c) = T(2t*c - 6t*c^2) / (t*c^3 - t*c^2 + T)^2.
S satisfies the requirements of Rolle's Theorem on the interval [1, n_0], since S(1) = S(n_0) = 1; therefore, there exists at least one c in (1, n_0) such that
S'(c) = 0,
where c is a critical point of S. c can be obtained by solving the polynomial equation
2t*c^3 - t*c^2 - T = 0.
Substituting a root of this polynomial into the expression for the second derivative reveals that S''(c) < 0, since 2t*c - 6t*c^2 = 2t*c(1 - 3c) is negative for every c > 1/3. It follows that on (1, n_0), there is only one c predicted by Rolle's Theorem, and it is the absolute maximum of S on this interval.
Tuesday 5/16/17
I found a general method for finding the zeros of cubic polynomials, called the cubic formula. Using this method, the zero in (1, n_0) of the polynomial
2t*n^3 - t*n^2 - T = 0,
which is the absolute maximum of S on this interval, is given by
c = A + B + 1/6,
where
A = cbrt(-q/2 + sqrt(q^2/4 + p^3/27)), and B = cbrt(-q/2 - sqrt(q^2/4 + p^3/27)),
with p = -1/12 and q = -(1/108 + T/(2t)) coming from the substitution n = u + 1/6, which removes the quadratic term. It can be shown with a linear Taylor approximation that
B ≈ 0
when T/t is large. The expression therefore reduces to
c ≈ cbrt(T/(2t)) + 1/6.
Note that for convenience, the 1/6 can be omitted since only integer precision is needed.
I have attached a PDF called <GPU_Graphs_Hi_Res.pdf>, which has high resolution graphs of the data I collected and graphs of the relevant equations I derived. It can be found at the bottom of the page.
For future investigation:
It would be very useful to know the exact lower boundary of non-latency. From my data, 400 bytes is latent, 4,000 bytes is intermediate, and 40,000 bytes is non-latent. To find the boundary, one should comb between 4,000 bytes and 40,000 bytes to see when
t_dh(D) = t_hd(D),
which is how I am defining non-latency.
Helpful Links