databricks notes


General guidelines for choosing a Databricks worker type

1. Memory optimized

a lot of shuffle and spill to disk, or when Spark caching is needed

this is typical when joining many big tables; the data needs to be shuffled to different nodes for the join

2. Storage optimized

    analytical workloads that read the same data repeatedly

3. Compute optimized

    ETL jobs that don't require wide transformations (e.g. joins and aggregations). A join needs multiple tables held in memory, and an aggregation needs the whole table in memory.

If the job just copies/flows data from one place to another, with computation at the row level or small-batch level (e.g. deriving a column, concatenating names, small window aggregations), then not much memory is required.

If the ETL needs joins, lookups, or anything else that holds big data in memory for long, memory optimized is more suitable.

4. GPU Optimized

   obviously for machine learning

   

5. General purpose / all purpose

   all-purpose clusters should only be used for ad hoc queries and interactive notebooks during development and testing; use job clusters for scheduled/production workloads.
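
   As a reference point, a minimal sketch of how the worker type is expressed in a cluster spec (Python dict in the Clusters/Jobs API style). The spark_version and node_type_id values are illustrative AWS examples, not recommendations:

      # Sketch of a job-cluster spec as a Python dict (Clusters/Jobs API style).
      # node_type_id values are AWS examples; swap in the instance family
      # that fits the workload per the guidelines above.
      cluster_spec = {
          "spark_version": "13.3.x-scala2.12",   # example Databricks runtime
          "node_type_id": "r5.xlarge",           # memory optimized example; c5.xlarge = compute optimized, i3.xlarge = storage optimized
          "driver_node_type_id": "r5.xlarge",
          "num_workers": 4,
      }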

   

Auto scaling

Autoscaling should not be used for Spark Structured Streaming because once the cluster scales up, it is difficult to determine when to scale it down.


Photon

   provides faster queries and scans of big data, including data made up of many small files.

   faster Delta cache access and faster Delta writes using MERGE, UPDATE, INSERT, etc.

   join improvements as well

   Recommended to enable Photon if

       ETL has MERGE operations

       writing large Delta/Parquet data to cloud storage

       scans / joins / aggregations of large datasets

       ad hoc / interactive SQL queries
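
   A minimal sketch of enabling Photon, assuming the same cluster-spec style as above and the runtime_engine field of the Clusters API (values illustrative):

      # Same cluster-spec style as above, with Photon enabled via runtime_engine.
      photon_cluster_spec = {
          "spark_version": "13.3.x-scala2.12",
          "node_type_id": "i3.xlarge",           # storage optimized example
          "num_workers": 8,
          "runtime_engine": "PHOTON",            # default is STANDARD
      }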


Shuffle partition number

   It should be a multiple of the total number of worker cores, so that every core stays busy

   -- in SQL
   SET spark.sql.shuffle.partitions = M*<number of total cores in cluster>

   -- in PySpark
   spark.conf.set("spark.sql.shuffle.partitions", M*sc.defaultParallelism)

   -- M is a multiplier here

   -- In the absence of manual shuffle partition tuning, set M to 2 or 3 as a rule of thumb
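
   A runnable PySpark sketch of the same idea (assumes an existing cluster; M = 2 as the rule-of-thumb multiplier):

      # Set shuffle partitions to a multiple of the total worker cores.
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      M = 2  # rule-of-thumb multiplier when no workload-specific tuning has been done
      total_cores = spark.sparkContext.defaultParallelism  # total cores across executors
      spark.conf.set("spark.sql.shuffle.partitions", str(M * total_cores))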

   

Partition

   don't partition a table that is too small (e.g. < 1TB); ingestion-time clustering will already order the data by ingestion time.

   each partition should be big enough (e.g. > 1GB), and the partition column should be low cardinality, e.g. year
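
   A small sketch of writing a Delta table partitioned by a low-cardinality column; the DataFrame, column name, and path are made up for illustration:

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.getOrCreate()

      # Illustrative DataFrame; in practice this is the table being ingested.
      df = spark.range(1000).withColumn("year", (F.col("id") % 4 + 2020).cast("int"))

      (df.write
         .format("delta")
         .mode("overwrite")
         .partitionBy("year")                    # low-cardinality partition column
         .save("/mnt/datalake/events_by_year"))  # illustrative path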


Optimize

   OPTIMIZE compacts small files, and can be combined with ZORDER, which sorts and co-locates data by the selected columns

   use high cardinality columns for z-ordering, e.g. customer id
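
   A minimal sketch of running OPTIMIZE with Z-ordering from PySpark; the table name, column, and partition filter are illustrative:

      # Compact small files and co-locate data by a high-cardinality column.
      spark.sql("OPTIMIZE sales ZORDER BY (customer_id)")

      # Optionally restrict to recent partitions to limit the work:
      spark.sql("OPTIMIZE sales WHERE year = 2023 ZORDER BY (customer_id)")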