Azure Databricks provides Apache Spark-based ingestion, processing, and analysis of large volumes of data in a data lakehouse. Data engineers, data scientists, and data analysts can use interactive notebooks to run code in Python, Scala, Spark SQL, or other languages to cleanse, transform, aggregate, and analyze data.
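For example, a minimal PySpark sketch of that cleanse, transform, and aggregate flow in a notebook cell might look like this (the file path and column names are hypothetical; the spark session is predefined in Databricks notebooks):

```python
from pyspark.sql import functions as F

# Ingest raw sales data from the data lake (hypothetical path and schema)
raw = spark.read.option("header", True).csv("/data/raw/sales.csv")

# Cleanse: drop rows missing the key column and cast amount to a numeric type
clean = (raw.dropna(subset=["order_id"])
            .withColumn("amount", F.col("amount").cast("double")))

# Aggregate: total revenue per product
totals = clean.groupBy("product").agg(F.sum("amount").alias("total_revenue"))

totals.show()
```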
Apache Spark clusters - Spark is a distributed data processing solution that makes use of clusters to scale processing across multiple compute nodes.
Databricks File System (DBFS) - DBFS is a distributed file system abstraction over cloud object storage, enabling clusters to persist and share data files.
Notebooks - Notebooks provide an interactive environment in which users can combine code, notes, and visualizations to explore, process, and analyze data.
Metastore - Azure Databricks supports the use of a Hive metastore or Unity Catalog to define a relational schema of tables over file-based data. The tables can be queried using SQL syntax to access the data in the underlying files (see the first sketch after this list).
Delta Lake - Delta Lake builds on the relational table schema abstraction over files in the data lake to add support for SQL semantics commonly found in relational database systems, such as inserts, updates, deletes, and transactions.
SQL Warehouses - SQL Warehouses are relational compute resources with endpoints that enable client applications to connect to an Azure Databricks workspace and use SQL to work with data in tables (see the second sketch after this list).
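As a minimal sketch of the metastore and Delta Lake concepts above, run from a Python notebook cell (the table and column names are hypothetical):

```python
# Define a Delta table in the metastore over files in the data lake
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id INT,
        product STRING,
        amount DOUBLE
    ) USING DELTA
""")

# Query the table with standard SQL syntax
spark.sql(
    "SELECT product, SUM(amount) AS total_revenue FROM sales GROUP BY product"
).show()

# Delta Lake adds relational semantics such as in-place updates
spark.sql("UPDATE sales SET amount = 0.0 WHERE order_id = 1")
```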
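And as a sketch of a client application connecting to a SQL Warehouse endpoint, using the open-source databricks-sql-connector package for Python (the hostname, HTTP path, and access token below are placeholders, copied in practice from the warehouse's connection details):

```python
from databricks import sql  # pip install databricks-sql-connector

# Placeholder connection details for a SQL Warehouse in the workspace
with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abcdef1234567890",
    access_token="dapi-example-token",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT product, total_revenue FROM sales LIMIT 10")
        for row in cursor.fetchall():
            print(row)
```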
The secret to Spark's high performance is parallelism. Scaling vertically (by adding resources to a single computer) is limited by the finite RAM, CPU cores, and clock speeds of one machine; clusters scale horizontally, adding new nodes as needed.
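A small PySpark sketch of that parallelism, again assuming the predefined spark session in a notebook (the partition count is illustrative):

```python
# Distribute a dataset across the cluster; Spark splits it into partitions
# that worker nodes process in parallel
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)

print(rdd.getNumPartitions())  # 8 partitions, processed concurrently

# Each partition's sum is computed on its own node, then combined at the driver
total = rdd.sum()
print(total)
```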