databricks notes
General guidelines for choosing a Databricks worker type (a cluster-spec sketch follows the list)
1. Memory optimized
use when there is a lot of shuffle and spill, or when Spark caching is needed
this is typical when joining many big tables, since the data has to be shuffled across nodes for the join
2. Storage optimized
analytical workloads that read the same data repeatedly (the local SSDs serve the disk/Delta cache)
3. Compute optimized
ETL jobs that don't require wide transformations (e.g. joins and aggregations). Joins need multiple tables held in memory, and aggregations need the whole table (or each group) in memory.
If the job just copies/moves data from one place to another, with computation at the row or small-batch level (e.g. deriving a column, concatenating names, a small windowed aggregation), then not much memory is required.
If the ETL needs joins, lookups, or anything else that holds large data in memory for long, memory optimized is more suitable.
4. GPU Optimized
for machine learning / deep learning workloads (model training and inference)
5. General purpose / all purpose
all-purpose clusters should only be used for ad hoc queries and interactive notebooks during development and testing; use job clusters for scheduled production workloads.
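For reference, the worker type is picked via node_type_id when the cluster is defined. Below is a minimal sketch of a Clusters API create payload, assuming an Azure memory-optimized instance; the instance names, runtime version, and cluster name are only illustrative and differ per cloud and region.

# Minimal sketch of a cluster spec for the Databricks Clusters API (clusters/create).
# node_type_id is an assumed Azure memory-optimized instance; swap in the instance
# family that matches the workload (storage / compute / GPU optimized, etc.).
cluster_spec = {
    "cluster_name": "etl-memory-optimized",      # hypothetical name
    "spark_version": "14.3.x-scala2.12",         # example runtime; pick a current LTS
    "node_type_id": "Standard_E8ds_v4",          # memory-optimized workers (assumed)
    "driver_node_type_id": "Standard_E8ds_v4",   # driver may differ from workers
    "num_workers": 4,
}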
Autoscaling
Autoscaling should not be used for Spark Structured Streaming because once the cluster scales up, it is difficult to determine when to scale it down.
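A sketch of the relevant fields, assuming the Clusters API autoscale / num_workers fields: let batch ETL scale within a range, but give a streaming job a fixed size.

# Autoscaling batch/ETL cluster: workers are added/removed within this range.
autoscaling_spec = {
    "autoscale": {"min_workers": 2, "max_workers": 8},
}

# Structured Streaming job: prefer a fixed worker count, since a scaled-up cluster
# rarely scales back down while the stream keeps producing micro-batches.
streaming_spec = {
    "num_workers": 4,
}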
Photon
provides faster queries/scans over large data, including data with many small files
faster disk (Delta) cache access, and faster Delta writes via MERGE, UPDATE, INSERT, etc.
join improvements as well
Recommended to enable Photon (see the sketch after this list) if:
ETL has MERGE operations
writing large Delta/Parquet data to cloud storage
scans / joins / aggregations of large datasets
ad hoc / interactive SQL queries
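Photon is enabled per cluster. A minimal sketch, assuming the runtime_engine field of the Clusters API (on SQL warehouses Photon is typically on by default); instance and runtime values are illustrative, as above.

# Enable Photon on a job/all-purpose cluster by setting the runtime engine.
photon_cluster_spec = {
    "spark_version": "14.3.x-scala2.12",   # example runtime
    "node_type_id": "Standard_E8ds_v4",    # assumed instance type
    "num_workers": 4,
    "runtime_engine": "PHOTON",            # "STANDARD" disables Photon
}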
Shuffle partition number
It should be a multiple of the total number of worker cores, so that every core stays busy
-- in SQL
set spark.sql.shuffle.partitions = M*<number of total cores in cluster>
-- in PySpark
spark.conf.set("spark.sql.shuffle.partitions", M*sc.defaultParallelism)
-- M is a multiplier; in the absence of manual shuffle partition tuning, set M to 2 or 3 as a rule of thumb
Partition
don't partition a table that is too small (e.g. < 1 TB); ingestion-time clustering will already order the data by ingestion time
each partition should be big enough (e.g. > 1 GB), and the partition column should be low cardinality, e.g. year
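As an illustration, writing a Delta table partitioned by a low-cardinality year column; the DataFrame, columns, and path are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sales DataFrame with a low-cardinality "year" column.
sales_df = spark.createDataFrame(
    [(2023, "A", 100.0), (2024, "B", 250.0)],
    ["year", "customer_id", "amount"],
)

# Write as a Delta table partitioned by "year" (path is hypothetical).
(sales_df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("year")
    .save("/mnt/lake/sales"))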
Optimize
OPTIMIZE compacts small files, and can be combined with ZORDER BY, which sorts and co-locates data by the selected columns
use high-cardinality columns that are commonly filtered or joined on for Z-ordering, e.g. customer_id
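A minimal sketch, reusing the hypothetical Delta path and customer_id column from the partitioning example above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact small files and co-locate rows by customer_id (path/column are hypothetical).
spark.sql("OPTIMIZE delta.`/mnt/lake/sales` ZORDER BY (customer_id)")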