KAU Data Science Center

3.3.3. H2O, Sparkling Water and Deep Water

H2O, Sparkling Water and Deep Water are developed by H2O.ai (formerly 0xdata) [H2O]. They are Hadoop-compatible frameworks for DL over Big Data as well as for Big Data predictive analytics.

To access and reference data, models and objects across all nodes and machines, H2O uses a distributed key/value store. H2O's algorithms are implemented on top of a distributed Map/Reduce framework and utilize the Java Fork/Join framework for multithreading. H2O can interact in a stand-alone fashion with HDFS stores, on top of YARN, in MapReduce, or directly in an Amazon EC2 instance. Hadoop mavens can use Java to interact with H2O, but the framework also provides a REST API (JSON over HTTP) and bindings for Python (H2O-Python), R (H2O-R), and Scala, which allows interoperability with the libraries available on those platforms. H2O also provides stacking and boosting methods for combining multiple learning algorithms in order to obtain better predictive performance.
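The Map/Reduce pattern mentioned above can be illustrated with a minimal sketch. This is not H2O code: the function names (split_into_chunks, map_chunk, reduce_pair, distributed_mean_variance) are hypothetical, and a Python thread pool stands in, only loosely, for the Java Fork/Join workers. Each chunk is summarised independently (map), and the partial summaries are then merged (reduce), so no worker ever needs to see the whole dataset.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(values, n_chunks):
    """Split a list into n_chunks nearly equal contiguous pieces."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def map_chunk(chunk):
    """Map step: summarise one chunk as (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(v * v for v in chunk))


def reduce_pair(a, b):
    """Reduce step: merge two partial summaries element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def distributed_mean_variance(values, n_chunks=4):
    """Compute the mean and population variance via map, then reduce."""
    chunks = split_into_chunks(values, n_chunks)
    # Each chunk is processed by its own worker thread (illustration only).
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(map_chunk, chunks))
    count, total, total_sq = reduce(reduce_pair, partials)
    mean = total / count
    variance = total_sq / count - mean * mean
    return mean, variance


if __name__ == "__main__":
    print(distributed_mean_variance([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], n_chunks=3))
```

The summaries are chosen so that merging is associative, which is what lets the reduce step combine partial results in any order.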

H2O: Besides the REST API and bindings for popular programming languages, H2O is also accessible through a CLI, which offers several options to control cluster deployment, such as how many nodes to launch, how much memory to allocate to each node, and what names to assign to the nodes in the cloud. It offers a web-based interactive environment called Flow (similar to Jupyter). Natively supported data sources are the local FS, remote files, HDFS, S3 and JDBC; others are available through the generic HDFS API. Although the ML algorithm coverage is not high, the algorithms are optimised to run over Big Data and cover the needs of the target companies, i.e. the banking and insurance sectors. In detail, H2O is used by 8 of the top 10 banks for pattern-based Anti-Money Laundering (AML), fraudulent behaviour detection and real-time personalised product recommendation; by 7 of the top 10 insurance companies for risk group and claim classification automation, customer churn reduction, customer retention analysis, insurance fraud alert systems and usage-based insurance telematics; and by 4 of the top 10 healthcare companies for real-time preventive care, cancer detection or personalised medicine development.

Regarding DL in H2O, it is based on FFNNs trained with stochastic gradient descent (SGD) using back-propagation. Local models are built on each node with multi-threading, using the global model parameters and local data. The global model is periodically rebuilt from the local models via model averaging.
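The interplay of local SGD and model averaging can be sketched as follows. This is a simplified illustration under stated assumptions, not H2O's implementation: a two-parameter linear model y = w0 + w1 * x replaces the FFNN with back-propagation, the "nodes" are plain Python lists processed one after another, and all names (sgd_pass, average_models, train_with_model_averaging) are hypothetical.

```python
def sgd_pass(weights, data, lr):
    """One SGD pass over a node's local data for the model y = w0 + w1 * x."""
    w0, w1 = weights
    for x, y in data:
        err = (w0 + w1 * x) - y   # derivative of 0.5 * err**2 w.r.t. the prediction
        w0 -= lr * err            # gradient step for the intercept
        w1 -= lr * err * x        # gradient step for the slope
    return (w0, w1)


def average_models(models):
    """Average several parameter tuples element-wise."""
    n = len(models)
    return tuple(sum(m[i] for m in models) / n for i in range(len(models[0])))


def train_with_model_averaging(shards, rounds=300, lr=0.1):
    """Alternate local SGD on each shard with global parameter averaging."""
    global_w = (0.0, 0.0)
    for _ in range(rounds):
        # Every node starts from the current global parameters and
        # trains on its own local shard only.
        local_models = [sgd_pass(global_w, shard, lr) for shard in shards]
        # The new global model is the average of the local models.
        global_w = average_models(local_models)
    return global_w


if __name__ == "__main__":
    # Noise-free toy data from y = 2x + 1, split across three "nodes".
    data = [(i / 29, 2.0 * (i / 29) + 1.0) for i in range(30)]
    shards = [data[k::3] for k in range(3)]
    print(train_with_model_averaging(shards))
```

On this toy data the averaged parameters should approach the generating values w0 = 1 and w1 = 2. In a real cluster the local passes would run in parallel, and the averaging interval becomes a trade-off between communication cost and how far the local models drift apart.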

Sparkling Water contains the same features and functionality as H2O but provides a way to use H2O with Spark. It is ideal for managing large clusters for data processing, especially when it comes to transferring data from Spark to H2O (or vice versa).

Deep Water (see Fig. 9) is H2O DL with a native implementation of DL models for GPU-optimised backends such as TensorFlow, MXNet, and Caffe. These backends are accessible from Deep Water through connectors.

Fig. 9 H2O Deep Water architecture [H2Odeepwater]

Strong points

Industrial use with significant growth and high popularity among financial, insurance and healthcare companies.
Algorithms optimised for Big Data processing and analytics, with infrastructure support.
H2O provides a wider generic set of ML algorithms that leverage Hadoop/Spark engines for large-scale dataset processing. It aims to make the ML/DM process more automatic through a GUI.

Weak points

Flow, the web-based user interface for H2O, does not support direct interaction with Spark.
H2O is more general-purpose and aims at scalable DM in a broader sense than dedicated DL libraries, e.g. TensorFlow or DL4J.
