What data processing tool should I use?

6/14/2024

TL;DR: Horses for courses

This blog post will be updated as I add new tools.  First published: 6/7/2024

Before you ask yourself "What data processing tool should I use?", there are some prior questions worth asking about your team and your workload.

These may even make the question of which data processing tool to use irrelevant. If you are part of a team that uses Databricks, you should probably use Spark, and you may want to explore Photon or Spark RAPIDS once you care about performance (after you've written janky code the first time round to solve the problem).

Here I'll go through some of the pros and cons of data processing tools, discuss their performance and explain how they work. The tools covered (in alphabetical order) are: cuDF, Daft, Dask, Dask-cuDF, DataFrames.jl, DataFusion, DuckDB, FireDucks, Ibis, Modin, pandas, Polars, Spark, spark-rapids and Theseus. There are many data processing tools I excluded, as mentioned in The Composable Codex and the 2024 MAD (Machine Learning/AI/Data) landscape, such as Velox. For a higher-level understanding of data processing tools I recommend reading https://howqueryengineswork.com/ by Andy Grove.

cuDF is NVIDIA's dataframe library that offers GPU-accelerated analysis. It is written in C++ (libcudf) and uses CUDA to interact directly with the GPU, enabling parallel processing and efficient computation. cuDF is the Python binding to libcudf. It mimics the pandas API with a few exceptions. It provides a tool (cudf.pandas) to speed up pandas code on the GPU with zero code changes for users. It does this by calling the cuDF functions where possible and falling back to pandas if the function is not implemented in cuDF (see how-it-works and also this talk at GTC by Ashwin Srinath). It is reported that cuDF offers a speed-up of around 50x compared to pandas. Other functions can be sped up anywhere from 8x to 20,000x, but YMMV (Your Mileage May Vary) and results will depend on things like which GPU you use.

Source: NVIDIA Blog
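The cudf.pandas accelerator is enabled without touching your pandas code; a minimal sketch (the CSV file and column names are hypothetical):

```python
# Run with: python -m cudf.pandas my_script.py
# (or put `%load_ext cudf.pandas` at the top of a Jupyter notebook)
import pandas as pd  # unchanged import; cudf.pandas intercepts it

df = pd.read_csv("transactions.csv")              # hypothetical file
summary = df.groupby("category")["amount"].sum()  # runs on the GPU where supported
print(summary)
```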

Pros: The fastest solution on the database-like ops benchmark. Built on the familiar pandas API. Offers a zero-code-change tool to accelerate your pandas code if you have access to a GPU.

Cons: Requires GPU hardware that may not be easily accessible for some users.

Daft is a distributed query engine for large-scale data processing in Python, implemented in Rust. Unlike Polars, it can scale out in the cloud (see the Daft comparison page here). Daft is also designed to handle data such as images, embeddings and Python objects efficiently with its Arrow-based memory representation. For example, you could have a UDF (User Defined Function) that crops the images stored in the rows of your dataframe. Daft uses Ray as its execution engine and can use GPUs or CPUs. Daft has many integrations, which allow it to read data from cloud storage as well as SQL databases.

Source: Daft GitHub
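As a sketch of the multimodal dataframe idea (the URLs and column names are hypothetical, and the exact expression methods may differ between Daft versions):

```python
import daft

# Two hypothetical image URLs in a dataframe column.
df = daft.from_pydict({"url": ["s3://bucket/img1.jpg", "s3://bucket/img2.jpg"]})

# Download the raw bytes, decode them into an image column, then resize.
df = df.with_column("image", df["url"].url.download().image.decode())
df = df.with_column("thumbnail", df["image"].image.resize(64, 64))
df.show()
```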

Pros: Ability to handle things such as image and text files with a dataframe-like API. Works well with Ray and can help with pre-processing for deep learning.

Cons: Not many examples beyond the tutorials. Not included in many benchmarks so it's hard to comment on its performance.

Dask DataFrame is a Python library that parallelizes pandas code. It is built to handle larger-than-memory datasets and can allow you to, for example, work through a 100 GB dataset with 32 GB of memory. Historically Dask has been slow compared to Spark but has recently (May 2024) seen speed-ups thanks to a new shuffle algorithm (storing data effectively on disk) and a query optimizer (column projection, filter push-down and automatic resizing of partitions). This has led to a 20x speed-up compared to legacy Dask DataFrame. Dask works on your local machine, but it shines when running on a distributed platform such as Kubernetes. It is worth mentioning that Coiled offers a product for running serverless Dask. Dask's design is laid out in a 2015 SciPy conference proceedings paper, but the most up-to-date resource on how it works is the distributed docs.

Source: Coiled blog
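A minimal sketch of the workflow (the parquet path is hypothetical): operations build a lazy task graph over many pandas partitions, and .compute() triggers execution.

```python
import dask.dataframe as dd

df = dd.read_parquet("s3://bucket/events/*.parquet")  # hypothetical path

# This builds a task graph over the partitions; nothing executes yet.
daily_mean = df.groupby("date")["value"].mean()

# Execution happens here, partition by partition, spilling to disk if needed.
result = daily_mean.compute()
print(result)
```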

Pros: Good option for a pandas user who wants to scale out on the cloud and is working with big data as seen here in a benchmark using cloud computing.

Cons: The Cluster, Client, .compute() API can be tricky for new users coming from pandas. Debugging coroutines and concurrency issues in Tornado (the networking library that Dask uses) can be hard.

Dask-cuDF is the cuDF backend (plugin/engine) for Dask. It allows you to scale cuDF across multiple GPUs by using Dask-CUDA. GPUs tend to have less memory than CPUs, so Dask-CUDA allows you to work on larger-than-memory problems with one GPU or many GPUs. It uses UCX (Unified Communication X) to speed up network transfers and the RAPIDS Memory Manager to provide efficient memory management.
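A minimal sketch of the setup, assuming the dask_cuda package is installed and using a hypothetical parquet path:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask
import dask.dataframe as dd

# One worker per local GPU; UCX and RMM can be configured on the cluster.
cluster = LocalCUDACluster()
client = Client(cluster)

# Ask Dask to back its dataframe partitions with cuDF instead of pandas.
dask.config.set({"dataframe.backend": "cudf"})

df = dd.read_parquet("s3://bucket/data/*.parquet")  # hypothetical path
print(df.groupby("key")["value"].sum().compute())
```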

Pros: Fast option for dealing with big data.

Cons: Multi-GPU hardware can be expensive if not used efficiently. As with Dask, the Cluster, Client, .compute() API can be tricky for new users coming from pandas.

Apache Arrow DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format. DataFusion has a comparison page to understand how it distinguishes itself from other tools on this page. DataFusion is targeted at developers rather than end users such as data scientists. It can help build projects such as domain-specific query engines, new database platforms, data pipelines, query languages and more. The architecture of DataFusion can be found in the SIGMOD 2024 paper (Lamb24, hereafter), which I'll summarize here. Lamb24 argue that tightly integrated systems like DuckDB are expensive to develop, and that as new hardware becomes available and new AI pipelines are developed, DataFusion is well positioned to be developed alongside these components for the next decade. DataFusion is currently used in InfluxDB, Coralogix, Synnada and Arroyo. It currently has execution runtimes for Spark (Comet), PostgreSQL (Seafowl), Vega (VegaFusion) and InfluxQL (the time-series query language for InfluxDB). DataFusion is used in SQL analysis tools such as dask-sql and SDF, which use DataFusion's SQL parser, planner and plan representation to analyze SQL queries. It is also used in table formats such as the Rust implementations of Delta Lake, Apache Iceberg and Lance.
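Although DataFusion itself is a Rust library, it ships Python bindings, which give a feel for what embedding the engine looks like; a minimal sketch (the parquet file and columns are hypothetical):

```python
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("events", "events.parquet")  # hypothetical file

# DataFusion parses, plans, optimizes and executes the query on Arrow data.
df = ctx.sql("SELECT user_id, count(*) AS n FROM events GROUP BY user_id")
df.show()
```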

Its performance can be found on the ClickBench page, a benchmark for analytical databases run by ClickHouse. DataFusion sits in the top quartile of results.

Source: ClickBench

Pros: Great foundation tool to help build databases. Vibrant Community.

Cons: Not designed for end users.

DataFrames.jl is the dataframe library for Julia. It offers benefits similar to those of Julia itself (see my old blog post on Julia). These include expressive syntax, e.g. select(df, :x => mean, :y) is the equivalent of df.assign(x_mean = df['x'].mean())[['x_mean', 'y']] in the pandas API, multiple dispatch (dispatching on the types of a function's arguments), built-in parallelism and JIT compilation using the LLVM compiler framework (this allows Julia to achieve performance close to that of statically compiled languages like C or Fortran). Julia is used widely in some scientific groups, and DataFrames.jl is an obvious choice if you and your team use Julia. The DuckDB database-like ops benchmark has DataFrames.jl at 4th out of 15 dataframe libraries.

Pros: Expressive syntax and fast performance on small local datasets.

Cons: If you are not a Julia user I don't think DataFrames.jl will sway you to use Julia.

DuckDB is a fast in-process analytical database. It has become popular with data scientists because of its ease of reading parquet files (the preferred format for storing tabular data in data science) and its focus on analytical query workloads (OLAP: online analytical processing). You can read parquet files from S3 (cloud object storage) and Hugging Face Datasets. It uses SQL syntax, which is familiar to many analysts. It has a rich and growing extension library, including a tool that does spatial analysis.
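A minimal sketch of querying parquet files in place from Python (the path and columns are hypothetical):

```python
import duckdb

con = duckdb.connect()  # in-process database, no server to manage

result = con.sql("""
    SELECT category, avg(price) AS avg_price
    FROM read_parquet('sales/*.parquet')   -- hypothetical files
    GROUP BY category
    ORDER BY avg_price DESC
""").df()  # materialize the result as a pandas DataFrame
print(result)
```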

DuckDB's speed is explained in their 2019 International Conference on Management of Data paper (note: anyone interested in DB research should check out the ACM SIGMOD/PODS conference next week). DuckDB splits up the logic of its different components: the API (C++), SQL parser (libpg_query), optimizer (cost-based), execution engine (vectorized), concurrency control (serializable MVCC (MultiVersion Concurrency Control)) and storage (DataBlocks). The SQL parser, which is derived from Postgres' SQL parser, takes a query string and returns a parse tree of C structures. This parse tree is then transformed into a parse tree of C++ classes. The optimizer performs join-order optimization using dynamic programming (finding the most efficient join order) with a greedy fallback (quick decisions based on locally optimal choices, such as joining the tables with the lowest estimated cost). This includes flattening subqueries and simplifying the expression tree (reducing expressions that appear multiple times), guided by cardinality estimation (estimating the number of rows that will be processed at different stages of the query). The vectorized, interpreted execution engine allows for efficient use of CPU caches and improves performance because operations are performed on batches of data rather than on individual rows. The vectors carry a selection vector (a list of offsets stating which indices are relevant) to avoid moving data within vectors when filtering. The execution engine works on chunks of the physical plan (think DAG). Lastly, DuckDB uses a read-optimized DataBlocks storage layout where logical tables are partitioned into chunks of columns which are compressed. These blocks carry min/max indexes for every column, which help determine whether they are relevant to a query.

Modified version of Figure 1 from the DuckDB paper
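To make the selection-vector idea concrete, here is a toy illustration in Python (this is not DuckDB code, just the concept):

```python
import numpy as np

# A vector of values and a filter over it.
values = np.array([10, -3, 42, 7, -1, 19])

# Instead of compacting the vector after the filter, keep a selection
# vector: the offsets of the rows that passed.
selection = np.flatnonzero(values > 0)

# Downstream operators read through the selection vector, so the
# original data never has to be moved or copied.
doubled = values[selection] * 2
print(selection)  # [0 2 3 5]
print(doubled)    # [20 84 14 38]
```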

Pros: Great option for processing small and big data on a single machine.

Cons: Doesn't scale out to multiple machines.

FireDucks is a very new (June 2024) drop-in pandas replacement for accelerating data processing. FireDucks is developed by the Japanese company NEC. Preliminary benchmarks show it is marginally faster than Polars. The GitHub page currently only has a README.

Source: FireDucks docs and code
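The drop-in pattern looks like the sketch below, assuming the documented fireducks.pandas module; the dataframe itself is a toy example:

```python
import fireducks.pandas as pd  # instead of: import pandas as pd

# The rest of the script is unchanged pandas code.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
print(df.groupby("x")["y"].sum())
```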

Pros: Looks fast. A pandas accelerator with zero code changes.

Cons: Hasn't been benchmarked by an independent third party. Empty GitHub repo is suspicious.

Ibis is an open-source dataframe library that works with any data system. It offers a friendly API if you are connecting to a database and want to query the data using Python. It's a great tool if you have data in a variety of locations: for example, you can use a shared API across your company if one team has data in parquet files and another team has data in Snowflake. Similar to Spark, you can write SQL queries or use the dataframe API.

Source: Ibis docs
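A minimal sketch of the portability idea (the backend, file and column names are hypothetical):

```python
import ibis

# Connect to a local DuckDB backend; swapping this connection for another
# backend (e.g. Snowflake or Postgres) keeps the query expression the same.
con = ibis.duckdb.connect()
t = con.read_parquet("events.parquet")  # hypothetical file

expr = t.group_by("user_id").aggregate(total=t.amount.sum())
print(expr.execute())  # runs on whichever backend `con` points at
```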

Pros: One API to rule them all. Offers SQL and a dataframe API for a variety of users.

Cons: If you are only working with parquet files, you may want to use one of the underlying libraries directly.

Modin is a drop-in replacement to speed up pandas code. There is a detailed write-up on Modin's architecture in the docs. Modin can use Dask, Ray (not me) or MPI as its backend.

Source: Modin docs
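The usage pattern is the same drop-in import swap; a minimal sketch with a hypothetical CSV file:

```python
import modin.pandas as pd  # instead of: import pandas as pd

df = pd.read_csv("data.csv")     # hypothetical file; partitioned behind the scenes
print(df.groupby("key").mean())  # work is distributed across the backend's workers
```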

Pros: Zero code change to accelerate pandas code.

Cons: Unclear on the speed-up of Modin compared to other options.

Jensen Huang recently said: "pandas is the spreadsheet (Excel) of data science". pandas is a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language. pandas has been around for ~15 years and is ubiquitous (~200 million downloads per month). NVIDIA has a good write-up on pandas in their glossary page which I won't repeat here. pandas is a safe option for anyone learning to code and deciding which data processing tool to use. ChatGPT will often respond to data processing questions with pandas code due to its prevalence in training data (Stack Overflow, GitHub, Kaggle, etc.).

Source: pandas docs

Pros: Extensive API, lots of examples, the de facto dataframe API.

Cons: Doesn't scale out. Slow compared to modern tools.

Polars is a Rust dataframe library with bindings to Python. Polars has a multi-threaded query engine for effective parallelism. Similar to DuckDB, Polars works on vectorized columnar processing with cache-coherent algorithms. In a recent TPC-H benchmark using 10 GB of data, Polars came out the fastest when compared to pandas, DuckDB, PySpark, Dask and Modin. Polars has a streaming API to work with data larger than your memory. Polars has its own Domain Specific Language (DSL) for transforming data. This allows complex queries to remain human-readable, but it may take a while to get used to if you are coming from pandas. In the future Polars will have a GPU engine backed by NVIDIA.

The technical details of Polars are published in a blog post which I will summarize here. Polars uses SIMD (Single Instruction Multiple Data), which applies the same instruction to multiple pieces of data at once. Polars creates an optimized logical plan. It does this by parsing the query into a logical plan (what the user wants to do, but not how), then the optimizer traverses the plan (several times) to avoid unnecessary work. The logical plan is a tree, with data sources as the leaves and transformations as the nodes. The optimizer may modify, add or remove nodes. It will change the order of operations, such as filters, to make them occur as early as possible. It will run projection pushdown, selecting only the columns needed in the query, and predicate pushdown, filtering data as close to the source as possible. Polars has other optimizations, given here, such as common subplan elimination (caching subtrees/file scans that are used by multiple subtrees in the query plan) and cardinality estimation (using the number of distinct values in a column to determine the optimal group-by strategy). Currently, I am not aware of technical documentation of the execution engine, e.g. how it chooses optimal join and sort algorithms; the blog states "This post will not go into much detail about the execution model of our engines or how it is able to work so fast. That is left for another time". Polars also knows the schema of the data upfront and can validate whether the transformations are correct. This ensures you don't run into errors halfway through executing a query.

Source: Polars docs
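A minimal sketch of the lazy API whose plan is optimized as described above (the file path and columns are hypothetical):

```python
import polars as pl

# scan_parquet builds a lazy plan; nothing is read yet.
lazy = (
    pl.scan_parquet("events.parquet")                 # hypothetical file
      .filter(pl.col("amount") > 0)                   # pushed toward the scan
      .group_by("user_id")
      .agg(pl.col("amount").sum().alias("total"))
)

print(lazy.explain())  # inspect the optimized plan (pushdowns applied)
df = lazy.collect()    # execute the plan
```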

Pros: One of the fastest options when working with local data.

Cons: The API can take a while to get used to. Doesn't currently have a cloud strategy but is actively working on one.

Apache Spark was created in 2009 by Matei Zaharia at UC Berkeley as a faster alternative to MapReduce. Spark is still used widely in enterprise settings and is the core component of the Databricks platform. It is a huge open-source project with over 2,000 contributors. Spark is a distributed computing system designed for big data processing and analytics. It has bindings for Java, Scala, Python (PySpark) and R. The Spark core engine uses the Resilient Distributed Dataset (RDD) as its basic data type. Spark executes very fast by caching data in memory across multiple parallel operations. Spark also handles real-time stream processing. There are several published resources on what Spark is and how it works, such as this blog post, which I'll summarize here. Spark has a query optimizer called Catalyst which generates the physical execution plan. You can learn more about Catalyst in this Spark + AI Summit 2019 talk. Spark is unusual among the tools here in that it runs on the JVM (Java Virtual Machine). Distributed computing in Spark is curated by a centralized driver node that orchestrates the execution and one or more executor nodes that read and process data. Whilst Spark SQL and the Spark DataFrame API are very popular, Spark also supports the pandas API.
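A minimal sketch of the DataFrame and SQL APIs side by side (the input path and columns are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.read.parquet("s3://bucket/events/")  # hypothetical path

# DataFrame API; Catalyst plans and optimizes the query.
df.groupBy("user_id").sum("amount").show()

# The same query via Spark SQL.
df.createOrReplaceTempView("events")
spark.sql("SELECT user_id, sum(amount) FROM events GROUP BY user_id").show()
```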

Work on speeding up Spark processing has taken place in the Tungsten project, which is broken down into memory management (eliminating the overhead of the JVM object model and garbage collection), keeping intermediate data in CPU registers rather than memory, and loop unrolling with SIMD (compiling and executing simple for loops). Databricks has sped up Spark processing with Photon, which offers up to 12x speed-ups with zero code changes (falling back to the existing engine in some situations). Photon is a vectorized engine (chosen over code generation) written in C++ (not Java, due to performance limitations). The interpreted vectorization uses a dynamic dispatch mechanism to choose the code to execute and processes data in batches, which enables SIMD. Code-generated systems, in contrast, use a compiler at runtime to produce code specialized for the query. The vectorized approach also helps with observability. Photon was shown to double performance on the TPC-DS 1TB benchmark. Photon does well on CPU-bound workloads such as joins and aggregations but will have little impact on queries that are IO- or network-bound. The technical details behind Photon can be found in this paper; I would encourage you to read it if you would like to find out more about the vectorized execution engine, which I did not cover in detail. At the Data + AI Summit, Databricks announced they are partnering with NVIDIA to bring a GPU runtime to Photon.

Pros: Solid choice for big-data querying in enterprise (e.g. some companies will happily pay for databricks). Lots of examples and a huge community. Bindings to many popular programming languages.

Cons: Not designed to work well on a single machine.

spark-rapids is a plugin for Apache Spark that runs Spark workloads on GPUs. It is developed by NVIDIA. To learn more about spark-rapids, check out this talk from the Spark + AI Summit 2020. Benchmarks show that running Spark on the GPU can lead to cost savings of 50%. There are other benchmarks, but the results are not displayed. Other companies have noted a 20x speed-up in their data processing, and it is used at Taboola.

spark-rapids is a zero-code-change solution. All you have to do is launch Spark with the RAPIDS Accelerator for Apache Spark plugin jar and enable a configuration setting. Spark 3.0+ lets users provide a plugin to replace the backend for SQL and DataFrame operations. The RAPIDS Accelerator uses UCX to speed up shuffles by keeping as much data on the GPU as possible. spark-rapids performance is tied to GPU performance, as discussed for cuDF.
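A minimal sketch of enabling the plugin from PySpark; the jar path is hypothetical and the exact settings vary by deployment:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars", "/opt/sparkRapidsPlugin/rapids-4-spark.jar")  # hypothetical path
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  # enable the RAPIDS Accelerator
    .getOrCreate()
)

# Supported SQL/DataFrame operations now run on the GPU; unsupported ones
# fall back to the CPU engine.
```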

Pros: Fast option for enterprise data processing.

Cons: May require adjusting configurations to accelerate queries.

Theseus is a novel distributed execution architecture that takes advantage of full-system GPU hardware acceleration to enable petabyte-scale analytics. It is currently being developed by Voltron Data.

The philosophy behind the architecture is published in The Composable Codex.

Source: Theseus docs

Pros: Exciting results running queries on the biggest of data (petabyte scale).

Cons: Not currently open source and still in active development.