Performance Improvements for the Graph Module of Sagemath

Organization: Sage Mathematical Software System

Abstract: In the context of libraries of graph algorithms, the graph module of Sagemath stands out for its simple and well-commented code; however, our preliminary testing outlined that some of its algorithms are not as efficient as other implementations, like Boost graph library or igraph. This project will try to address this issue: the main goal is to wrap more efficient graph libraries (a ticket about Boost is already present), but we might also improve existing algorithms, if time allows.

Extended abstract

In the last few years, several libraries of graph algorithms, like the graph module of Sagemath, Boost, igraph, NetworkX, webgraph, and so on. Among all these tools, Sagemath stands out for its simple and well-commented code; however, some of its algorithms are not as efficient as other graph libraries, like Boost or igraph, as outlined by our preliminary testing.
This project will try to fill this gap: we will code wrappers allowing Sagemath to use algorithms from more efficient graph libraries, without affecting its simplicity (a ticket dealing with this topic is already present, asking to include algorithms from Boost http://trac.sagemath.org/ticket/17966). If time allows, we will also try to speed-up already-implemented algorithms.
As far as we know, this would create the first user-friendly graph library offering performances that are comparable with the fastest implementations available.

Personal:

Name: Michele Borassi

Contact information (email, instant messaging, ...):
email / Google Hangouts: ***
Skype: ***

Location/Timezone: Italy / UTC+01:00

University: IMT Institute for Advanced Studies Lucca

Personal web page: https://sites.google.com/a/imtlucca.it/borassi/

Curriculum Vitae: downloadable from my web page

Mentor: David Coudert

Background:

What are your technical skills, education, experience, etc. Especially make sure to explain with what level of mathematics you are comfortable with and on what level you would like to program.

I have a Master Degree in Mathematics, and I am a Ph.D. student in Computer Science. I have excellent programming skills in Java and good programming skills in C/C++ (as testified by my project C++ vs Java - A Comparison of Identical Graph Libraries Written in Two Different Languages). I also have skills in Python programming, developed in the Ph.D. courses Computer Programming and Methodologies, and Data Science for Complex Networks (I was graded "A" in both courses).

My general programming skills, my knowledge of complexity theory, of data structures, and graph algorithms are testified by several papers where I improved state-of-the-art algorithms in several graph problems [1-6].

Who are you? What makes you the best person to work on this particular project? Your personal motivation?

Since I am a Ph.D. student in Computer Science, and I have a mathematical background, in my opinion, the efficiency of the graph module of Sagemath can benefit from my programming skills, and from my understanding of real-world graphs from an algorithmic point of view. My reasearch-level experience in graph algorithms will help me understand the main performance bottlenecks and improve them.

My motivations to pursue this work are many. First of all, I think that Sagemath is an excellent tool because its code is well-written and thoroughly commented. An improvement in efficiency would make it one of the best graph analysis tools available, if not the best. Furthermore, I will learn a lot by working in an open-source environment, and by having feedbacks from other people (my mentor, in primis). Finally, the experience I will develop using Sagemath and the final result of the project might become very useful in my academic work.

What platform and operating-system are you using on your computer? (Sage development is done best on Linux and OSX)

I use Linux, Lubuntu 15.04. In the past, I also had experience with Mac OS X and Windows.

Are you or have you been engaged in other open-source projects?

I have never been engaged in open-source projects, but I have developed my own code in several works (see next question).

Do you code on your own pet projects?

In the past few years, I have coded on several projects. In almost all my academic works I had to code algorithms, and code them efficiently. I also have programmed a software playing Othello (a board game) using Edax engine, which is one of the best engines available. In particular, I have developed an opening book and I have built a user interface. After debugging and making the code more readable, this work might be included in the new version of Edax.

Are you a Sage user, how long do you know Sage?

During the last year, I have worked with the hyperbolicity package in the graph module of Sagemath, in order to find possible improvement. The results of this work will be published in [1].

Project:

Title, Project Synopsis: a short description and summary of its aim and scope.

My preliminary experiments have shown that the performances of the graph module of Sagemath can be improved in many tasks (for instance, computing connected components and betweenness centrality, as shown in the appendix). The best implementations available can be more than 50 times faster than the Sagemath implementations.

With this project, I will try to address this problem. I will wrap the fastest graph libraries available (igraph, Boost) within Sagemath, and I will replace some of the current algorithms with their faster counterparts. The importance of this work is also testified by a ticket asking to include Boost algorithms (http://trac.sagemath.org/ticket/17966). This way, I will try to improve the efficiency of Sagemath without affecting its simplicity.

After this main goal has been reached, if time allows, we will pursue other goals, too: coding new algorithms from scratch, in order to extend the features of the graph module of Sagemath and to speed-up current algorithms (for instance, we might code an algorithm for computing clique separators [7], and use it in order to speed-up the computation of the hyperbolicity of a graph).

What is your personal involvement or relationship with your proposed project?

The efficiency of algorithms has always been my field of expertise, as testified by my works and publications. I think that collecting all state-of-the-art algorithms in a single, simple, and efficient library can significantly simplify the work in the field of network analysis, which is becoming more and more important in recent years.

This project will provide such a tool, and I will be very happy to use this tool in my researches. Furthermore, I like Sagemath, and I prefer it to any other available tool: for this reason, I would really like to improve this tool even more.

Details: describe all the details and explain modules or parts of your whole project. Break down the whole project into individual tasks - as good as possible - and describe deliverable and quantifiable results for each of them. It also helps if you have already discussed this with a possible mentor.

I have divided the project in several tasks, each of which contains one or more tickets (small tasks that are submitted to Sage, reviewed, improved, tested, until they get accepted for inclusion). Some tasks do not include tickets because they do not require to write code for Sage: mainly, these tasks deal with testing. I have tried to define as many tickets as possible, so that I can have feedbacks "on the fly", and after a part is reviewed, I can build the others from solid and debugged foundations. The tasks I will try to address in my project follow.

  1. Perform extensive testing of existing graph libraries on a big dataset of real-world networks, in order to understand their features (I already have a big dataset of networks, used in my previous papers). The goal is to provide a benchmark of comparison of their efficiency and generality, in order to decide which libraries are more suited to be wrapped in Sagemath.
  2. Design a new "backend" (graph implementation), or improve an existing one, in order to simplify calls to other libraries, possibly avoiding conversions (one ticket).
  3. Decide with my mentor which graph libraries we should include in Sagemath, and sort them in order of priority. Decide a good compromise between efficiency and generality (possibility to work with directed/undirected, weighted/unweighted, mutable/immutable graphs).
  4. Perform the following operation for all libraries in Task 2, in order of priority (we will probably start with Boost, since a ticket is already present on this topic).
    1. design modules to perform library-dependent adaptations (one ticket for each module);
    2. include the library in Sagemath as optional package (my mentor confirmed that this is the standard procedure);
    3. compare Sagemath algorithms with the algorithms based on other libraries. If the latter is more efficient, and the optional package is present, set the new approach as default.
  5. perform extensive testing to decide if the wrappers coded in the previous task have an impact on the efficiency of algorithms. In case, try to improve them, by refining the backend defined in Task 3 (optional, if time allows).
  6. (optional, if time allows): try to implement new algorithms in the graph module of Sagemath, like decomposition by clique separators (one ticket for each algorithm).

Schedule: A timetable, including special circumstances like exams or holidays, for the individual tasks.

Task Start date Expected time End date
1 Test existing libraries May, 25 2 weeks June, 8
2 Decide which libraries to include June, 9 1 week June, 16
3 Design a new "backend" or improve an existing one June, 17 1 week June, 23
4 Include the libraries June, 24 2-3 weeks for each library Depends on the number of libraries
5 Testing on wrappers When task 4 is complete 1 week  
6 New algorithms When all other tasks are completed Until the end of the project  

 

Risk Management: Try to anticipate potential problems and explain, how to mitigate them. Propose alternative scenarios, if a particular milestone isn't reached, to still successfully complete the project.

It is highly probable that Tasks 1 and 2 will be completed as expected, since they only involve testing and discussing with my mentor. Task 3 might involve some more problems, since it is based on the results in Tasks 1 and 2, which are not known, yet.

The first risk is that the graph module of Sagemath is already as efficient as other libraries: this possibility is ruled out by the tests in the Appendix. These tests show that, even if some algorithms are already efficient (for instance, computing the average distance), other algorithms can be improved, like computing connected components and betweenness centrality.

Another risk is that the number of libraries to wrap is too big or too small with respect to time limits. In the first case, I will have to restrict my attention only to some of them, in order of priority, and leave the others for future work. In the second case, I will devote more time to Tasks 4-5.

The last risk I can foresee during Task 3 is that wrappers have a deep impact on the running time of algorithms (in particular, linear-time algorithms): this possibility is ruled out by the tests in the Appendix, that prove that the wrapping of igraph library in Python is efficient, and I can use the same code.

Finally, Task 6 is fundamental for the availability of the code, but since it is the last task to perform, I might "run out of time". I will try to avoid this issue by perform as much on-the-fly testing as possible, and by commenting the code as soon as I write it, in order to make this task easier, and avoid delays.

Additional Material

Bibliography

[1] MICHELE BORASSI et al., Faster Algorithms for the Computation of the Hyperbolicity of Real-World Graphs. In preparation.

[2] MICHELE BORASSI et al., Fast Top-k Closeness Centrality Enumeration. In preparation.

[3] MICHELE BORASSI et al., Fast Diameter and Radius BFS-based Computation in (Weakly Connected) Real-World Graphs - With an Application to the Six Degrees of Kevin Bacon Game. Theoretical Computer Science, Special Issue - Fun With Algorithms, 2014

[4] MICHELE BORASSI et al., On the Solvability of the Six Degree of Kevin Bacon Game - A Faster Graph Diameter and Radius Computation Method. In Proceedings of the Seventh International Conference on FUN WITH ALGORITHMS, 2014.

[5] PAULO VIEIRA MILREU et al., Metabolic Stories: Exploring all Possible Scenarios for Metabolomics Data Analysis. Bioinformatics, 2014.

[6] MICHELE BORASSI et al., Telling Stories Fast Via Linear-Time Delay Pitch Enumeration. In Proceedings of the 12th International Simposium on Experimental Algorithms, 2013.

[7] ROBERT E. TARJAN, Decomposition by Clique Separators, Discrete Mathematics 55 (2), 1985.

Appendix: comparison of Sage with other graph libraries

In order to prove that the efficiency of the graph module of Sagemath can be improved, we have compared its performances with other libraries. The competitors are the most efficient existing graph libraries: igraph and Boost. We have also tested the igraph library wrapped in Python, in order to see if the wrapping has an impact on running times.

We have chosen three algorithms: computing the number of connected components, computing the betweenness centrality of all vertices, and computing the all-pairs-shortest path. These tests were performed on a small dataset of real-world networks, taken from the well-known SNAP dataset (http://snap.stanford.edu/). All tests considered deal with undirected, unweighted graphs: the directed inputs have been symmetrized.

The test has been performed on a Dell Latitude E6430, Intel Core I7 Quad-Core, 2.7GHz, running Lubuntu 14.10. The C libraries were compiled with gcc 4.9.1, optimization level 3, while Python version was 2.7, and Sagemath was run using the notebook interface (our preliminary tests showed that it does not slow down the execution). The times shown were averaged on 10 runs.

The results for connected components are plotted in the following table.

NetworkSagemathigraphigraph
(Python)
Boost
web-Stanford 0.7 <0.1 <0.1 <0.1
flickEdges 0.6 <0.1 0.1 <0.1
com-youtube.ungraph 2.6 0.2 0.3 0.2
roadNet-CA 2.7 0.3 0.4 0.2

 

The following table shows the same results for the average distance.

NetworkSagemathigraphigraph
(Python)
Boost
p2p-Gnutella08 1.1 1.9 1.9 Not included
oregon2_010526 2.2 4.7 4.6 Not included
ca-HepTh 2.2 5.0 5.3 Not included

 

Finally, the following table shows the results for the betweenness centrality.

NetworkSagemathigraphigraph
(Python)
Boost
p2p-Gnutella08 223.3 4.0 4.1 3.9
oregon2_010526 800.8 10.7 11,3 15.6
ca-HepTh 471.3 18.5 18.1 42.2

 

It is clear that in some cases Sagemath is already very efficient (for instance, in computing the average distance). However, there are other tasks where Sagemath is much less efficient than igraph or Boost, like the computation of betweenness centrality and of the number of connected components: these tasks will be addressed by my project.

Subpages (1): Library Comparison
Comments