This page contains all relevant information about the C++ library for fast, efficient implementation of machine learning algorithms created by the members of the FASTlab at Georgia Tech, led by Prof. Alex Gray.
The FASTlib effort aims to bring the diverse range of machine learning algorithms to a common code base. The core aim is to have an optimal balance of speed, flexibility, and usability, with the goal of rapid development of world-class machine learning algorithm implementations.
Motivation
Why a common library?
In the current state of affairs, the best implementation of each algorithm is typically in MATLAB or in stand-alone C code. Neither situation is optimal in the grand scheme of things: some things are very difficult to do efficiently in MATLAB but easy to do in C, while others are very difficult to do in C.
For example, MATLAB programmers often resort to creating quadratically-sized matrices in order to exploit vector computation, using inordinate amounts of memory, whereas in C a nested for loop with small memory usage would work. Additionally, combining MATLAB and C code, although possible, is not yet common practice.
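As a rough illustration of the nested-loop approach, the sketch below counts how many pairs of one-dimensional points lie within a given radius of each other. It visits every pair directly, so it never materializes an N-by-N distance matrix and needs only constant extra memory. The function name CountClosePairs and its parameters are hypothetical and are not part of FASTlib.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of FASTlib): counts the unordered pairs of
// 1-D points whose distance is at most `radius`.
std::size_t CountClosePairs(const std::vector<double>& points, double radius) {
  assert(radius >= 0.0);
  const double radius_sq = radius * radius;
  std::size_t count = 0;
  // Visit each unordered pair (i, j) exactly once. Time is still quadratic,
  // but the extra memory is a handful of scalars, not an N-by-N matrix.
  for (std::size_t i = 0; i < points.size(); ++i) {
    for (std::size_t j = i + 1; j < points.size(); ++j) {
      const double diff = points[i] - points[j];
      if (diff * diff <= radius_sq) {
        ++count;
      }
    }
  }
  return count;
}
```

The vectorized MATLAB equivalent would typically build the full matrix of pairwise differences first, which is where the quadratic memory cost comes from.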
On the other hand, standalone C can be quite difficult. Using linear algebra in C usually requires writing your own methods, or going through the effort of integrating a large third-party library. Utilities for reading files have to be hand-rolled, and a lot of file readers have various quirks and expectations. With a core library, these processes can be just one line of code.
A common library can provide the basic necessities for machine learning, while still allowing familiar, efficient programming constructs. FASTlib uses the same linear algebra libraries, LAPACK/BLAS/ATLAS, that are used by MATLAB, and provides very easy-to-use wrappers for them (finding singular values in LAPACK requires two function calls of 14 parameters each, whereas our single wrapper needs only two parameters).
Why a new library?
Several libraries exist that attempt to unify machine learning algorithms. However, a lot of these efforts are closely run by a small number of individuals, and aren’t intended to include everybody’s pet project. The entire code base is released, and extensive quality control is required. FASTlib instead attempts to allow distributed development, so that fresh-out-of-the-oven ideas need not be excluded, while still allowing a thoroughly reviewed center.
Development Model
A core tenet of FASTlib is that it is to be open and extensible not just by a few developers, but by a community. Typically, a centralized code base implies a plethora of politics: Who owns the code? Whose style should be used? Will contributions be rejected on pure stylistic grounds? Will the code have to be forked if different requirements are necessary?
To avoid many of these issues, a staging model is used. Contributed code follows a simple migration path, which is described below from beginning to end.
Individual Contributions
Any developer can contribute a new method to FASTlib in their own user directory, and utilize the core API features. When a researcher has a great new idea, the initial concern is rarely “How do I make my code pretty and maintainable?” Typically, the code will be full of ideas that seem good to try. In line with extreme programming philosophy, until a new idea is completely hashed out, the best long-term organization is probably not known. However, these contributions don’t have to be fragmented outside of the library: they can be checked into the source control system under user-specific directories. Because the code lives in a common repository, collaboration is vastly simplified and contributions get immediate peer exposure.
Shared Algorithms
We want to support a vast array of algorithms. These are part of FASTlib’s extended API and must meet a certain standard of quality, but not nearly as high as we would want the core of FASTlib to be, in order to facilitate contribution and avoid too much political debate. For example, this might contain a support vector machine or kernel density estimation implementation. Many of these will have executable targets that can be run from the command line, but may also implement a standard C++ interface. For example, a Classifier interface might have a function that returns a label for a test point. How the support vector machine works internally is not a prime issue.
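A minimal sketch of what such a standard interface could look like is given below. The names Classifier, Classify, LinearClassifier, and CountPositive are assumptions made for illustration and are not taken from FASTlib's actual API. A toy linear classifier stands in for a real support vector machine.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical interface, for illustration only; not FASTlib's actual API.
class Classifier {
 public:
  virtual ~Classifier() = default;
  // Returns the predicted label for a single test point.
  virtual int Classify(const std::vector<double>& point) const = 0;
};

// Toy implementation: labels a point by the sign of a linear score w.x + b.
class LinearClassifier : public Classifier {
 public:
  LinearClassifier(std::vector<double> weights, double bias)
      : weights_(std::move(weights)), bias_(bias) {}

  int Classify(const std::vector<double>& point) const override {
    assert(point.size() == weights_.size());
    double score = bias_;
    for (std::size_t i = 0; i < point.size(); ++i) {
      score += weights_[i] * point[i];
    }
    return score >= 0.0 ? 1 : -1;
  }

 private:
  std::vector<double> weights_;
  double bias_;
};

// Caller code depends only on the interface, not on how labels are computed.
std::size_t CountPositive(const Classifier& classifier,
                          const std::vector<std::vector<double>>& points) {
  std::size_t count = 0;
  for (const std::vector<double>& point : points) {
    if (classifier.Classify(point) == 1) {
      ++count;
    }
  }
  return count;
}
```

Code written against the interface, such as CountPositive here, would keep working if the linear model were swapped for any other implementation.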
Core API
The core API is extremely well organized and thoroughly reviewed, with a consistent programming style, in order to be easy to use, flexible, and very fast. Every person has their own view of how a problem could be solved, and our API should be flexible so that new ideas can be easily built upon.
The core API has many common data structures, useful frameworks and libraries such as parallelization support, and fundamental numerical and data analysis techniques. The core API consists of pure C++ libraries, not executable files.
Language and Style
The most common languages used for machine learning are C, C++, Java, and MATLAB. Of all of these, we chose C++, as it is both fast and flexible. Although C++ can lead to difficult code when used improperly, we intend to stick to simpler features of C++ and augment our library to avoid other pitfalls.
For example, in unmanaged languages like C or C++, especially in machine learning, it is entirely possible to write an algorithm that appears to work but is actually incorrect, because nothing is checked at run time. A neural network that computes weight updates for each node (suppose we number this node N), but applies the update to node N+1, might actually appear to work. FASTlib has a very fast debug mode that checks accesses to data structures, and would immediately catch the neural network error as an out-of-bounds write to an array, saving the developer hours of manually inserting print statements or reading over the code multiple times.
To be fast, we use C++ templates in inner-loop computations, but employ simple inheritance in other situations. The inheritance tree shouldn’t be too tall to navigate, to avoid distributing logic over many layers of hierarchy.
Finally, C++ is an excellent language for developing parallel code. All of the machine’s features are accessible, and C++ templates provide exciting possibilities for simplification: for instance, we use templates to allow painless serialization of data structures. You can find our complete style guide in the attachments.
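The idea of checked accesses can be sketched with the standard assert macro, which is active in debug builds and compiles away when NDEBUG is defined. The CheckedArray class and ApplyUpdates function below are hypothetical stand-ins and do not reflect FASTlib's own debug facilities.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper (not FASTlib's actual class): every element access is
// checked with assert(), which is removed when NDEBUG is defined.
class CheckedArray {
 public:
  explicit CheckedArray(std::size_t size) : data_(size, 0.0) {}

  std::size_t size() const { return data_.size(); }

  double& operator[](std::size_t i) {
    assert(i < data_.size() && "index out of bounds");
    return data_[i];
  }

  const double& operator[](std::size_t i) const {
    assert(i < data_.size() && "index out of bounds");
    return data_[i];
  }

 private:
  std::vector<double> data_;
};

// Applies each update to the node it was computed for.
void ApplyUpdates(CheckedArray& weights, const CheckedArray& updates) {
  assert(weights.size() == updates.size());
  for (std::size_t i = 0; i < weights.size(); ++i) {
    // Writing weights[i + 1] here instead would reproduce the off-by-one bug;
    // in a debug build the last iteration would then fail the bounds check.
    weights[i] += updates[i];
  }
}
```

With a raw array, the off-by-one variant would silently write one element past the end, which is why such a bug can appear to work.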
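One way to read the template guideline is sketched below: the kernel evaluated inside the inner loop is a template parameter, so the compiler can inline it rather than dispatch through a virtual call on every iteration. The names UniformKernel, GaussianKernel, and KernelSum are illustrative assumptions, not FASTlib identifiers.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical kernels, each evaluated on a squared distance.
struct UniformKernel {
  double bandwidth;
  double operator()(double dist_sq) const {
    return dist_sq <= bandwidth * bandwidth ? 1.0 : 0.0;
  }
};

struct GaussianKernel {
  double bandwidth;
  double operator()(double dist_sq) const {
    return std::exp(-dist_sq / (2.0 * bandwidth * bandwidth));
  }
};

// The kernel type is fixed at compile time, so the call inside the loop can
// be inlined; no virtual dispatch happens per reference point.
template <typename Kernel>
double KernelSum(const Kernel& kernel, const std::vector<double>& references,
                 double query) {
  assert(kernel.bandwidth > 0.0);
  double sum = 0.0;
  for (std::size_t i = 0; i < references.size(); ++i) {
    const double diff = references[i] - query;
    sum += kernel(diff * diff);
  }
  return sum;
}
```

Outside such hot loops, an ordinary virtual interface is usually simpler to read and costs little.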
Why do I want to use FASTlib?
Maybe there is some machine learning algorithm you'd like to implement, and you want to do so in a fast programming language while having many common tools at your fingertips. Once FASTlib matures, you will also be able to apply many machine learning algorithms from the same package and integrate them with one another.
We suggest you begin by looking at the FASTlib Tutorial, which helps you get started.