SNAP is a Spark Native, next-generation analytics platform designed specifically for ad-hoc queries on very large datasets.
What is SNAP?
The Sparkline Data Nextgen Analytics Platform (SNAP) is a Spark Native distributed analytics platform geared towards fast ad-hoc querying over a Logical Cube (Star Schema) in Hadoop or S3 data lakes.
SNAP was designed to efficiently query vast amounts of data stored in HDFS or S3 and make it easy to consume using standard B.I tools like Tableau. If you work with terabytes or petabytes of data, you have likely run into issues with query performance and with managing summary tables and/or cubes. Sparkline SNAP was designed as an alternative to SQL-on-Hadoop tools that goes beyond SQL and optimizes all consumption workloads, whether they are SQL reporting, OLAP analysis, or machine learning.
SNAP is designed to handle data warehousing and analytics, where the Star Schema is the fundamental unit of data analysis. It is well suited to aggregating large amounts of data, slicing and dicing, and drill-down and drill-through analysis. These workloads are often classified as Online Analytical Processing (OLAP).
Since SNAP only needs Apache Spark as a runtime, it is simple to deploy on existing Hadoop clusters from any vendor and on any cloud.
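To make the Star Schema workload concrete, here is a minimal sketch of a roll-up and a slice over one fact table joined to two dimension tables. It uses Python's built-in sqlite3 module purely as a stand-in engine; it is not SNAP or Spark, and the table names, column names, and the `sales_by` helper are all hypothetical.

```python
import sqlite3

# In-memory database standing in for the data lake (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER, store_id INTEGER, amount REAL);

    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO dim_store   VALUES (10, 'East'), (20, 'West');
    INSERT INTO fact_sales  VALUES
        (1, 10, 5.0), (1, 20, 7.0), (2, 10, 11.0), (2, 20, 13.0), (2, 20, 2.0);
""")

def sales_by(dimension_column, where="1 = 1"):
    """Sum the fact measure grouped by one dimension attribute, with an optional slice.

    The SQL is built by string formatting only because the inputs are
    hard-coded below; do not do this with untrusted input.
    """
    sql = f"""
        SELECT {dimension_column}, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON f.product_id = p.product_id
        JOIN dim_store   s ON f.store_id   = s.store_id
        WHERE {where}
        GROUP BY {dimension_column}
        ORDER BY {dimension_column}
    """
    return conn.execute(sql).fetchall()

by_region = sales_by("s.region")                                 # roll up by region
by_category = sales_by("p.category")                             # roll up by category
games_by_region = sales_by("s.region", "p.category = 'Games'")   # slice, then roll up
```

Each call changes only the grouping attribute or the filter, which is the pattern behind slice-and-dice and drill-down analysis.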
Sparkline SNAP vs Traditional B.I Stacks
The current norm to support ad-hoc queries at scale with "think time" response times is to copy the data into a ‘Specialized’ Data Store; part of the maintenance is to pre-aggregate the underlying data into many Materialized Views. Such a solution has several drawbacks:
- Data management costs: The ETL needed to create multiple materialized views and maintain them as the source data changes is both a time and a resource sink.
- The multiple-copy solution only works well when the workload is known beforehand, so that pre-aggregates can be built to optimize it. Pre-aggregates typically break down when the workload is ad-hoc.
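The second drawback can be illustrated with a small sketch. It assumes a hypothetical summary table that pre-aggregates sales by region and month; the rows, column names, and `build_preaggregate` helper are invented for illustration and do not come from any particular product.

```python
from collections import defaultdict

# Hypothetical raw fact rows: (region, product, month, amount).
rows = [
    ("East", "Books", "Jan", 5),
    ("East", "Games", "Jan", 11),
    ("West", "Books", "Feb", 7),
    ("West", "Games", "Feb", 13),
]

def build_preaggregate(rows):
    """Materialize SUM(amount) grouped by (region, month); the product column is dropped."""
    summary = defaultdict(int)
    for region, _product, month, amount in rows:
        summary[(region, month)] += amount
    return dict(summary)

summary = build_preaggregate(rows)

# Planned query: total per region. This is answerable from the summary alone.
per_region = defaultdict(int)
for (region, _month), total in summary.items():
    per_region[region] += total

# Ad-hoc query: total for Games. The summary has no product column, so the
# query must go back to the raw rows (or wait for yet another view to be built).
games_total = sum(amount for _r, product, _m, amount in rows if product == "Games")
```

The summary answers the question it was built for, but any query on a dimension that was aggregated away falls back to a full scan or requires a new materialized view.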
The SNAP platform introduces the concept of indexes, starting with an OLAP index, as well as an optimization layer that can understand query patterns and adapt to the workload by building the required indexes as and when needed.
- Sparkline SNAP provides the ad-hoc query capability by extending the Spark SQL layer, through SQL extensions and an extended Optimizer (both logical and physical optimizations).
- Sparkline SNAP uses OLAP indexing, rather than pre-materialization, as its technique for achieving query performance. OLAP indexing is a well-known technique that is far superior to materialized views for supporting ad-hoc querying.
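As a rough illustration of why indexing dimensions suits ad-hoc queries, the sketch below builds a simple inverted index from each (dimension, value) pair to the set of matching row ids. This is a generic textbook technique shown under simplifying assumptions, not SNAP's actual index format; the rows and the `build_index` and `query_sum` helpers are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows; each dict holds two dimensions and one measure.
rows = [
    {"region": "East", "product": "Books", "amount": 5},
    {"region": "East", "product": "Games", "amount": 11},
    {"region": "West", "product": "Books", "amount": 7},
    {"region": "West", "product": "Games", "amount": 13},
]

def build_index(rows, dimensions):
    """Map every (dimension, value) pair to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        for dim in dimensions:
            index[(dim, row[dim])].add(row_id)
    return index

def query_sum(rows, index, measure, **filters):
    """Sum a measure over the rows matching all filters, via set intersection."""
    matching = set(range(len(rows)))
    for dim, value in filters.items():
        # .get avoids creating empty entries for values that were never indexed.
        matching &= index.get((dim, value), set())
    return sum(rows[i][measure] for i in matching)

index = build_index(rows, ["region", "product"])

# Any combination of filters is served by the same index; nothing is pre-aggregated.
east_total = query_sum(rows, index, "amount", region="East")
west_games = query_sum(rows, index, "amount", region="West", product="Games")
all_games = query_sum(rows, index, "amount", product="Games")
```

Unlike a summary table, the index keeps every dimension available, so a filter combination nobody planned for is still resolved by intersecting row-id sets rather than by scanning all rows.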
What are the key use cases?
The SNAP platform can be used across various B.I and A.I use cases on Hadoop and S3.
- Reporting and querying large datasets with filters and aggregates.
- Slice-and-dice queries using standard tools like Tableau, Qlik, etc.
- Python-based analysis on Spark data.
- Combining data science with data exploration on Spark.
- Forecasting and planning applications that require OLAP modeling.