databricks notes


General guidelines for choosing a Databricks worker type

1. Memory optimized

a lot of shuffle and spill to disk, or when Spark caching is needed

this is typical when joining many big tables; the data needs to be shuffled to different nodes for the join

2. Storage optimized

    analytical workloads that read the same data repeatedly

3. Compute optimized

    ETL jobs that don't require wide transformations (e.g. joins and aggregations). A join needs multiple tables held in memory, and an aggregation needs the whole table in memory.

If the job just copies/flows data from one place to another, with computation at the row level or small-batch level (e.g. deriving a column, concatenating names, small window aggregations), then not much memory is required.

If the ETL needs joins, lookups, or anything else that holds big data in memory for long, memory optimized is more suitable.

4. GPU Optimized

   obviously for machine learning

   

5. General purpose / all purpose

   all-purpose clusters should only be used for ad hoc queries and interactive notebooks during development and testing; use job clusters for scheduled/production workloads.
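
   As a reference point, a minimal sketch of how the worker type is expressed in a cluster spec (Python dict in the Clusters/Jobs API style). The spark_version and node_type_id values are illustrative AWS examples, not recommendations:

      # Sketch of a job-cluster spec as a Python dict (Clusters/Jobs API style).
      # node_type_id values are AWS examples; swap in the instance family
      # that fits the workload per the guidelines above.
      cluster_spec = {
          "spark_version": "13.3.x-scala2.12",   # example Databricks runtime
          "node_type_id": "r5.xlarge",           # memory optimized example; c5.xlarge = compute optimized, i3.xlarge = storage optimized
          "driver_node_type_id": "r5.xlarge",
          "num_workers": 4,
      }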

   

Auto scaling

Autoscaling should not be used for Spark Structured Streaming because once the cluster scales up, it is difficult to determine when to scale it down.


Photon

   provides faster queries and scans of big data, including data made up of many small files.

   faster Delta cache access and faster Delta writes using MERGE, UPDATE, INSERT, etc.

   join improvements as well

   Recommended to enable Photon if

       ETL has MERGE operations

       writing large Delta/Parquet data to cloud storage

       scans / joins / aggregations of large datasets

       ad hoc / interactive SQL queries
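
   A minimal sketch of enabling Photon, assuming the same cluster-spec style as above and the runtime_engine field of the Clusters API (values illustrative):

      # Same cluster-spec style as above, with Photon enabled via runtime_engine.
      photon_cluster_spec = {
          "spark_version": "13.3.x-scala2.12",
          "node_type_id": "i3.xlarge",           # storage optimized example
          "num_workers": 8,
          "runtime_engine": "PHOTON",            # default is STANDARD
      }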


Shuffle partition number

   It should be a multiple of the total number of worker cores, so that every core stays busy

   -- in SQL
   SET spark.sql.shuffle.partitions = M*<number of total cores in cluster>

   -- in PySpark
   spark.conf.set("spark.sql.shuffle.partitions", M*sc.defaultParallelism)

   -- M is a multiplier here

   -- In the absence of manual shuffle partition tuning, set M to 2 or 3 as a rule of thumb
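
   A runnable PySpark sketch of the same idea (assumes an existing cluster; M = 2 as the rule-of-thumb multiplier):

      # Set shuffle partitions to a multiple of the total worker cores.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      M = 2  # rule-of-thumb multiplier when no workload-specific tuning has been done
      total_cores = spark.sparkContext.defaultParallelism  # total cores across executors
      spark.conf.set("spark.sql.shuffle.partitions", str(M * total_cores))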

   

Partition

   don't partition a table that is too small (e.g. < 1TB); ingestion-time clustering will already order the data by ingestion time.

   each partition should be big enough (e.g. > 1GB), and the partition column should be low cardinality, e.g. year
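
   A small sketch of writing a Delta table partitioned by a low-cardinality column; the DataFrame, column name, and path are made up for illustration:

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.getOrCreate()

      # Illustrative DataFrame; in practice this is the table being ingested.
      df = spark.range(1000).withColumn("year", (F.col("id") % 4 + 2020).cast("int"))

      (df.write
         .format("delta")
         .mode("overwrite")
         .partitionBy("year")                    # low-cardinality partition column
         .save("/mnt/datalake/events_by_year"))  # illustrative path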


Optimize

   OPTIMIZE compacts small files, and can be combined with ZORDER, which sorts and co-locates data by the selected columns

   use high cardinality columns for z-ordering, e.g. customer id
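
   A minimal sketch of running OPTIMIZE with Z-ordering from PySpark; the table name, column, and partition filter are illustrative:

      # Compact small files and co-locate data by a high-cardinality column.
      spark.sql("OPTIMIZE sales ZORDER BY (customer_id)")

      # Optionally restrict to recent partitions to limit the work:
      spark.sql("OPTIMIZE sales WHERE year = 2023 ZORDER BY (customer_id)")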