Project Updates

Will be posting updates about my thesis research under Dr. Vivek Sarin here.

So, I was working with Dr. Jennifer Welch (http://parasol.tamu.edu/people/welch/). This work is a perfect mash-up of my longing to work in theory and logic with something that has a strong bearing on real-world distributed systems. In this field, we essentially try to analyse the computational complexity of crash-prone asynchronous (or partially synchronous, possibly also anonymous) distributed systems. The field is riddled with impossibility results: for example, the consensus problem (http://en.wikipedia.org/wiki/Consensus_%28computer_science%29) cannot be solved in an asynchronous message-passing system if even a single process may crash.

The basic idea behind such impossibility results is to find out what actually is possible, and at what complexity, by weakening the conditions imposed. In the above example, consensus becomes solvable in a partially synchronous system as long as fewer than half the processes crash. This is done with the help of various concepts, if I may say so, like failure detectors. A failure detector is essentially a specification of an oracle for detecting crashes in a system; depending on how strong a condition a given problem requires, there are different classes of failure detectors. The fun thing about failure detectors is this: they give a list of the processes currently believed to be correct (equivalently, a suspect list), but they can be wrong. Depending on how strong or weak the guarantees are, a failure detector may be eventually perfect (diamond P) or eventually strong (diamond S), among many other classes. Basically, it need not be correct now, but you can be sure that at some point, you don't know when, it "eventually" will be. Somehow, I found that funny.
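To make that a little more concrete, here is a toy sketch of what a heartbeat-based, eventually-perfect-flavoured failure detector might look like. The class name, the timeout-doubling rule and the heartbeat plumbing are all my own illustrative assumptions, not a formal diamond-P construction and certainly nothing from the actual literature's algorithms verbatim.

```cpp
#include <chrono>
#include <set>
#include <unordered_map>

// Toy heartbeat-based failure detector, in the spirit of "eventually perfect".
// Every process periodically broadcasts a heartbeat; a peer is suspected if its
// last heartbeat is older than its current timeout. When a suspicion turns out
// to be wrong (a heartbeat arrives from a suspected peer), that peer's timeout
// is doubled, so suspicions can be wrong now but, in a partially synchronous
// run, they eventually stop being wrong.
class ToyFailureDetector {
public:
    using Clock = std::chrono::steady_clock;

    explicit ToyFailureDetector(Clock::duration initial_timeout)
        : initial_timeout_(initial_timeout) {}

    // Called whenever a heartbeat from `pid` is delivered.
    void on_heartbeat(int pid) {
        timeout_.emplace(pid, initial_timeout_);
        if (suspected_.erase(pid)) {
            timeout_[pid] *= 2;          // we were wrong: be more patient next time
        }
        last_seen_[pid] = Clock::now();
    }

    // Current suspect list; may be wrong, gets corrected over time.
    const std::set<int>& suspects() {
        const auto now = Clock::now();
        for (const auto& [pid, seen] : last_seen_) {
            if (now - seen > timeout_[pid]) suspected_.insert(pid);
        }
        return suspected_;
    }

private:
    Clock::duration initial_timeout_;
    std::unordered_map<int, Clock::time_point> last_seen_;
    std::unordered_map<int, Clock::duration> timeout_;
    std::set<int> suspected_;
};
```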

OK. So the basic idea behind the entire field is to develop a strong computational theory that gives logical/theoretical proofs of what can or cannot be achieved, and in how much asymptotic time. It deals heavily with concurrent operations, atomic operations, linearizability, failure detection, etc., and what every researcher does is keep finding new conditions that reveal more about the nature of a distributed system, or new, more generalised frameworks that show known results in a new light. Hope you enjoyed the tour through my brief stint in the field of distributed computing. As for my amazement at the sheer weight of impossibility results, here you go -> http://www.springerlink.com/content/p8f078j3kqcnev0f/ :D. And as one of my Facebook updates said: "You know you are in the right field, when an innocuous discussion on failure detectors and loads of logic equations later, "at least" one of you "eventually" exclaims, "God does not exist"."

PROJECT MAJOR UPDATE #9: All necessary benchmarks done, results compiled, paper submitted to HiPC. Presently working on the Cache and Comparison Optimal Binary Search Tree (CCOBST), a new data structure that we believe improves over the traditional OBST. Simultaneously, I will be working on the CUDA sorting algorithm. Hectic.

PROJECT MAJOR UPDATE #8: Damn patent regulations. I cannot have my name on the PG paper, though I might be given a special mention. As for the SPMV paper, PETSc is giving some MPI-related problems on Eka; I need to figure that out so that I can include a PETSc benchmark too. Time is sure running out. Plus, we have thought of an interesting and "cute" (as Shrirang put it) sorting algorithm to implement on CUDA. Let's see how that turns out. Hopefully I manage to remember and deal with my hypothetical data structures sometime in the future.

PROJECT MAJOR UPDATE #7: All set to wrap up and send in a paper; it should take no more than a few weeks. The optimiser in me is still stuck on tiny tweaks here and there, but I am fairly sure they could give nearly a 10-15% performance boost. Wish me luck. I guess I must start a new thread for a new project ;-). Aah! And I nearly forgot to update about my PG paper. Yeah, that is much more complete than the SPMV one, and people who are completely unaware of what PG is based on may want to read up on a paper by James Singer on Perfect Difference Sets. There is a PDS-based network suggested by Parhami, but ours is quite different and, in our opinion, better. Let us see what the reviewers think of it. :D
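For anyone who wants a quick feel for what a perfect difference set is before reading Singer: modulo n = q^2 + q + 1 there is a set of q + 1 residues whose pairwise differences hit every non-zero residue exactly once. Below is a toy brute-force check for the smallest case (q = 2, n = 7, which corresponds to PG(2,2), the Fano plane); it is purely illustrative and says nothing about the contents of our paper.

```cpp
#include <iostream>
#include <vector>

// Checks whether `s` is a perfect difference set modulo n: every non-zero
// residue mod n must occur exactly once as a difference of two distinct elements.
bool is_perfect_difference_set(const std::vector<int>& s, int n) {
    std::vector<int> count(n, 0);
    for (int a : s)
        for (int b : s)
            if (a != b) ++count[((a - b) % n + n) % n];
    for (int d = 1; d < n; ++d)
        if (count[d] != 1) return false;
    return true;
}

int main() {
    // q = 2, n = q*q + q + 1 = 7: {0, 1, 3} is the classic example.
    std::cout << std::boolalpha
              << is_perfect_difference_set({0, 1, 3}, 7) << "\n"   // true
              << is_perfect_difference_set({0, 1, 2}, 7) << "\n";  // false
}
```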

PROJECT MAJOR UPDATE #6 : After hefty efforts to get results that would support integrating our SPMV kernel into OpenFOAM, we are still at square zero. Intuitively it seems to me that, though it might (and that is a very weak might) give a performance bonus for the same number of cores, that may not be good enough for CRL to market it. In other words, is it worth putting in the effort? Details of this analysis are given in the attached project report.

PROJECT MAJOR UPDATE #5 : Working on a paper on a new network architecture called PG (Projective Geometry), based on Karmarkar's algorithm. The second paper is on SpMV on a single node; here we are claiming a simple and novel algorithm that makes use of variable-size boxes to carry out the SpMV operation. At the same time, I am also ruminating over a cache-oblivious method of searching through a BST that, at least theoretically, should perform better than other solutions out there, because it optimises for cache hits, or rather, reduces cache misses. I am also attempting to come up with a parallelised data structure that makes efficient use of threads for the search operation. I have an idea that has been through peer and senior review, and has actually been aesthetically thrashed... but I am still working on it. I am on a roll, babay!!
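My own structure is still half-baked, so just to show the general flavour of "cache-friendly BST search", here is a standard, well-known trick that has nothing to do with my actual idea: store the tree implicitly in an array in breadth-first (Eytzinger) order, so the top few levels of every search share a handful of cache lines. A genuinely cache-oblivious variant would use the van Emde Boas layout instead, which behaves well at every level of the memory hierarchy without knowing the cache-line size.

```cpp
#include <cstddef>
#include <vector>

// Search over a sorted sequence stored in Eytzinger (BFS) order: the children
// of slot i live at 2i and 2i+1 (1-indexed), so the first few levels of the
// implicit tree are packed together in memory.
struct EytzingerTree {
    std::vector<int> a;  // a[1..n] in BFS order; a[0] unused

    explicit EytzingerTree(const std::vector<int>& sorted)
        : a(sorted.size() + 1) {
        std::size_t src = 0;
        fill(sorted, src, 1);            // in-order walk assigns sorted keys
    }

    bool contains(int key) const {
        std::size_t i = 1;
        while (i < a.size()) {
            if (a[i] == key) return true;
            i = 2 * i + (key > a[i]);    // go left (2i) or right (2i+1)
        }
        return false;
    }

private:
    void fill(const std::vector<int>& sorted, std::size_t& src, std::size_t i) {
        if (i >= a.size()) return;
        fill(sorted, src, 2 * i);        // left subtree gets the smaller keys
        a[i] = sorted[src++];
        fill(sorted, src, 2 * i + 1);    // right subtree gets the larger keys
    }
};
```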

PROJECT MAJOR UPDATE #4 : All bugs fixed. Accurate results. Performance improvement of 5-10% over the previous code, and the previous code was nearly 10% better than HYPRE (from Lawrence Livermore), so the modified code should be nearly 15% better than HYPRE (1.05 × 1.10 ≈ 1.16 at the conservative end). Tada!! Looking now to improve the end-to-end time/performance. Parallelization galore!!!!!

PROJECT MAJOR UPDATE #3 : Fixed the problem of redundant sending of vector values, so I should expect an improvement in performance, except for one teeny-weeny problem (okay, not that teeny-weeny). For some reason, the code is accurate for matrix inputs of size 1k*1k. However, for inputs of 9k*9k or 25k*25k, the results are accurate only for up to 2 processors. The moment I change the number of processors to 4 or above, I start getting wrong values. No computer science theory I have learnt yet explains such a bug. One might say memory issues. Maybe so, but let me inform "one" about a few more details. For the 25k thingie, errors start showing up from the 2500th line onwards, accurate till then. For the 9k thingie, the error shows up right from the 3rd value. Please, please, someone explain this to me. Debugging matrices that huge is so bloody irksome.

PROJECT MAJOR UPDATE #2 : (read *) The issue to deal with was redundant sending of vector values to each rank - once for every thread within the rank that wanted them. Suppose 4 threads at a particular rank had {1,0,3}, {1,4,5}, {5, 4, 0, 6}, {5, 6} as the indices of the partial vector they intend to receive. In every iteration, index 0 will be sent twice, 1 twice, 3 once, 4 twice, 5 thrice, 6 twice. Too many repetitions, isn't it? We'd better have only {0,1,3,4,5,6} sent, each once; this is where a sort of merging is to be done. Next point: even if the indices are received only once, each thread should know which ones it is to take and which to discard - so we adopt a sort of masking scheme.

The first part I have implemented and tested. The second part has the very irritating issue of sending pointers through MPI; trying to fix that.
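For concreteness, here is a minimal serial sketch of the merging and masking idea, using the example indices from the update above. All names are illustrative and this is not the actual Eka code; in the real thing the merged list is what gets shipped over MPI.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // Indices of the partial vector each of the 4 threads at this rank wants.
    std::vector<std::vector<int>> wanted = {
        {1, 0, 3}, {1, 4, 5}, {5, 4, 0, 6}, {5, 6}};

    // Merging: one sorted, duplicate-free request list for the whole rank,
    // so each index is sent only once ({0,1,3,4,5,6} instead of 12 values).
    std::vector<int> merged;
    for (const auto& w : wanted) merged.insert(merged.end(), w.begin(), w.end());
    std::sort(merged.begin(), merged.end());
    merged.erase(std::unique(merged.begin(), merged.end()), merged.end());

    // Masking: per thread, a bitmap over the merged list saying which of the
    // received values that thread should pick up and which it should skip.
    std::vector<std::vector<char>> mask(wanted.size(),
                                        std::vector<char>(merged.size(), 0));
    for (std::size_t t = 0; t < wanted.size(); ++t)
        for (int idx : wanted[t]) {
            auto pos = std::lower_bound(merged.begin(), merged.end(), idx);
            mask[t][pos - merged.begin()] = 1;
        }

    for (std::size_t t = 0; t < wanted.size(); ++t) {
        std::printf("thread %zu takes:", t);
        for (std::size_t j = 0; j < merged.size(); ++j)
            if (mask[t][j]) std::printf(" %d", merged[j]);
        std::printf("\n");
    }
}
```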

* For those uninitiated in the topic, our intention is to multiply a sparse matrix (or, let's say, just a simple matrix) with a vector. Say, for A = { 1,0,2,3 ; 3,4,7,8 ; 0,0,5,6 ; 0,0,8,9 } and x = {10,20,30,40}, we need to find A.x. One simple scheme to speed it up is to get one processor to take the upper half of A and the other to take the lower half. But now processor 1 would require the lower half of x for its right half, and processor 2 would require the upper half of x for its left half (x, too, is partitioned into an upper half and a lower half and divided between the processors). So these unavailable parts need to be sent from one processor to the other, and that takes time. Now think of improvements -> ? ;)
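Here is a toy, serial illustration of that scheme, just mimicking what the two "processors" would compute once the missing halves of x have been exchanged; the real code of course uses MPI and a sparse storage format.

```cpp
#include <cstdio>

int main() {
    // A (4x4, stored dense here for clarity) and x from the example above.
    double A[4][4] = {{1, 0, 2, 3}, {3, 4, 7, 8}, {0, 0, 5, 6}, {0, 0, 8, 9}};
    double x_upper[2] = {10, 20};   // owned by "processor 1"
    double x_lower[2] = {30, 40};   // owned by "processor 2"
    double y[4] = {0, 0, 0, 0};

    // Processor 1 owns rows 0-1 and x_upper; it must first be sent x_lower,
    // which the right half of its rows needs.
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            y[i] += A[i][j] * x_upper[j] + A[i][j + 2] * x_lower[j];

    // Processor 2 owns rows 2-3 and x_lower; it must first be sent x_upper,
    // which the left half of its rows needs.
    for (int i = 2; i < 4; ++i)
        for (int j = 0; j < 2; ++j)
            y[i] += A[i][j] * x_upper[j] + A[i][j + 2] * x_lower[j];

    for (int i = 0; i < 4; ++i) std::printf("y[%d] = %g\n", i, y[i]);
    // Expected: y = {190, 640, 390, 600}
}
```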

PROJECT MAJOR UPDATE #1 : My naive implementation of scatter using binary-tree logic performs within +-10% of the corresponding HP-MPI routine on Eka. For roughly 32 MB arrays run on 32 nodes it performs better, whereas it is significantly worse for very small arrays. But since the objective is to consistently use large arrays, I believe these first-cut results are very promising.
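I can't post the actual code, but here is a rough sketch (my own illustrative version, not the Eka implementation) of the binary-tree idea: the root starts with everything, and at each level whoever currently holds the blocks for a range of ranks hands the upper half of that range over, so the data reaches all p ranks in roughly log2(p) send/receive rounds instead of p-1 separate sends from the root.

```cpp
#include <mpi.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Binary-tree scatter sketch: `sendbuf` (significant at the root only) holds
// `count` doubles per rank; each rank ends up with its own block in `recvbuf`.
void tree_scatter(const double* sendbuf, double* recvbuf, int count,
                  int root, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int rel = (rank - root + size) % size;    // rank relative to the root

    std::vector<double> tmp;                  // blocks this rank currently holds
    int lo = 0, hi = size;                    // relative ranks whose blocks we hold
    if (rel == 0) tmp.assign(sendbuf, sendbuf + (std::size_t)count * size);

    while (hi - lo > 1) {
        int mid = lo + (hi - lo + 1) / 2;     // split the current range in two
        if (rel == lo) {                      // we own the range: ship the upper half
            MPI_Send(tmp.data() + (std::size_t)(mid - lo) * count,
                     (hi - mid) * count, MPI_DOUBLE,
                     (mid + root) % size, 0, comm);
            hi = mid;
            tmp.resize((std::size_t)(hi - lo) * count);
        } else if (rel == mid) {              // we become the owner of the upper half
            tmp.resize((std::size_t)(hi - mid) * count);
            MPI_Recv(tmp.data(), (hi - mid) * count, MPI_DOUBLE,
                     (lo + root) % size, 0, comm, MPI_STATUS_IGNORE);
            lo = mid;
        } else if (rel > mid) {               // our block moved with the upper half
            lo = mid;
        } else {                              // our block stayed with the lower half
            hi = mid;
        }
    }
    std::copy(tmp.begin(), tmp.begin() + count, recvbuf);
}
```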

PROJECT ASSIGNED (w00t!) : Work on a new algorithm implementation for SPMV, with focus on new methods for data partitioning in order to achieve better performance. If I am lucky enough, I just might get a paper published. Simultaneously, I am developing a seemingly good algo for scatter-reduce operations. I shall upload the specifics and possibly source code soon, and I hope to get peers and geniuses alike to criticise it.

 

~!Project Update!~ : Source code cannot be uploaded, to maintain confidentiality (yes, I signed the agreement without a gag on my mouth and definitely without a double-barrelled gun pointed right at my eyeballs). But something fantastic did strike me as the robotic hand did not guide my numb fingertips across the paper in a perfect replication of my signature. Shall discuss that soon, once I figure out the exact mathematics behind it, and the exact physics, electronics and mechanics behind the robotic arm that isn't.

OLD NEWS (ironic but true)

Presently working on personal projects - essentially figuring out which holds the most potential to make something out of itself. The priority is the project I am yet to be assigned at CRL, Pune. Hopefully I get to do something swashbuckling <forgive the usage of such a word, owing to my brain's inability to think of a better one>. Will soon publish my list on this site and seek your comments on the projects' validity and practicability.

Meanwhile, I continue to code my own simple algorithms to benchmark processor/cluster/supercomputer(muhaha..eka) attributes. Working on 16 gig RAMs is fun!

 

Projects done/(not to be restarted any time soon) can be found in the Resume section.