KAU Data Science Center

3.3.2. Apache Spark MLlib and ML

Apache first introduced Mahout, built on top of MapReduce. Mahout was mature and came with many ML algorithms. However, ML algorithms are generally iterative, and because MapReduce writes intermediate results to disk between jobs, Mahout ran very slowly.

Apache then introduced Spark MLlib and Spark ML, built on top of the Spark ecosystem on Hadoop, which makes them much faster than Mahout. Spark MLlib contains the older RDD-based API. An RDD (Resilient Distributed Dataset) is Spark's basic data abstraction: an immutable, partitioned collection of elements that can be operated on in parallel through a low-level API of transformations and actions.

Spark ML contains the newer API, built around DataFrames and ML pipelines, and it is currently the primary ML API for Spark. A DataFrame is a Dataset organised into named columns and is conceptually equivalent to a table in a relational database. Transformations and actions over a DataFrame can be specified as SQL queries, which is convenient for developers with an SQL background. Moreover, Spark SQL gives Spark more information about the structure of both the data and the computation being performed than the RDD API does, and Spark can use this information for optimization.

Spark ML introduces the concept of ML pipelines, which help users create and tune practical ML workflows; it standardises the APIs of ML algorithms so that multiple algorithms can be combined into a single pipeline, or workflow. The RDD-based Spark MLlib API is in maintenance mode and will most likely be removed in a future major release.
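The pipeline idea can be illustrated without a Spark installation. The following is a minimal plain-Python sketch, not the Spark ML API: the class names (Lowercase, Tokenize, KeywordClassifier, KeywordModel, Pipeline, PipelineModel) and the toy data are hypothetical. It only mirrors the general design in which a transformer maps data to data, an estimator is fitted on data and yields a transformer, and a pipeline chains both kinds of stages.

```python
# Plain-Python illustration of the estimator/transformer/pipeline idea.
# All classes here are hypothetical stand-ins, NOT the pyspark.ml API.

class Lowercase:
    """Transformer: rewrites the 'text' column, learns nothing."""
    def transform(self, rows):
        return [dict(r, text=r["text"].lower()) for r in rows]


class Tokenize:
    """Transformer: adds a 'words' column derived from 'text'."""
    def transform(self, rows):
        return [dict(r, words=r["text"].split()) for r in rows]


class KeywordClassifier:
    """Estimator: fit() learns from labelled rows and returns a transformer."""
    def fit(self, rows):
        pos = {w for r in rows if r["label"] == 1 for w in r["words"]}
        neg = {w for r in rows if r["label"] == 0 for w in r["words"]}
        # Keep words seen in positive rows and never in negative rows.
        return KeywordModel(pos - neg)


class KeywordModel:
    """Fitted model: a transformer that adds a 'prediction' column."""
    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            dict(r, prediction=int(any(w in self.keywords for w in r["words"])))
            for r in rows
        ]


class Pipeline:
    """Chains stages; fit() returns a pipeline made only of transformers."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):     # estimator: train it first
                stage = stage.fit(rows)
            rows = stage.transform(rows)  # output feeds the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [
    {"text": "Spark is fast", "label": 1},
    {"text": "Mahout is slow", "label": 0},
]
model = Pipeline([Lowercase(), Tokenize(), KeywordClassifier()]).fit(train)
scored = model.transform([{"text": "SPARK rocks"}, {"text": "slow job"}])
predictions = [r["prediction"] for r in scored]
print(predictions)  # [1, 0]
```

Because the fitted pipeline is a single object, the same preprocessing is applied at training and at prediction time, which is the practical benefit of standardising the stage interfaces.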

Spark MLlib and Spark ML contain ML algorithms for classification, regression, clustering and collaborative filtering; featurization tools for feature extraction, transformation, dimensionality reduction and selection; pipeline tools for constructing, evaluating and tuning ML pipelines; and persistence utilities for saving and loading algorithms, models and pipelines. They also contain tools for linear algebra, statistics and data handling. Besides the distributed data-parallel model, MLlib can also be used with streaming data. For this purpose, MLlib offers a few basic ML algorithms for streaming data, such as streaming linear regression and streaming k-means. For a larger class of ML algorithms, one has to train the model offline and then apply it to streaming data online.
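To give a feel for how a clustering model can keep learning from a stream, here is a minimal plain-Python sketch of a mini-batch k-means update with a forgetting factor. It is written in the spirit of streaming k-means, but it is not Spark's implementation: the function names (nearest, update), the decay parameter and the exact weighting rule are assumptions of this sketch, and the rule Spark applies may differ in detail.

```python
# Illustrative mini-batch k-means update with a forgetting factor.
# Hypothetical helper functions; this is NOT the Spark MLlib API.

def nearest(centers, point):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((c - p) ** 2 for c, p in zip(centers[k], point)),
    )


def update(centers, weights, batch, decay=1.0):
    """Fold one mini-batch into the running centers, in place.

    decay=1.0 weighs all past data equally; decay=0.0 forgets it entirely.
    """
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    # Assign every point of the batch to its nearest current center.
    for point in batch:
        k = nearest(centers, point)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    # Blend each old center (discounted) with the mean of its new points.
    for k in range(len(centers)):
        old = weights[k] * decay
        if counts[k] == 0:
            weights[k] = old  # no new points: only the weight decays
            continue
        total = old + counts[k]
        centers[k] = [(centers[k][d] * old + sums[k][d]) / total for d in range(dim)]
        weights[k] = total
    return centers, weights


centers = [[0.0, 0.0], [10.0, 10.0]]
weights = [1.0, 1.0]
update(centers, weights, [[1.0, 1.0], [9.0, 11.0]])
print(centers)  # [[0.5, 0.5], [9.5, 10.5]]
print(weights)  # [2.0, 2.0]
```

Each call to update consumes one batch and keeps only the centers and their weights, so memory use does not grow with the length of the stream.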

Strong points

ML tools for large-scale data that are already integrated in the Apache Spark ecosystem and convenient to use in development and production.
Selected algorithms with optimized implementations for Hadoop, including preprocessing methods.
Pipeline (workflow) building for Big Data processing, including a set of feature engineering functions for data analytics (classification, regression, clustering, collaborative filtering and featurization), also with streaming data.
Scalable, with SQL support, and very fast because of in-memory processing.

Weak points

Mainly focused on tabular data.
High memory consumption because of in-memory processing.
Spark MLlib and Spark ML are quite young ML libraries, still in an evolving state. They are not very popular, and the number of implemented ML algorithms is not very high.
