First, you must have a running cluster to view and create a database. If you shut down the cluster, the data won't be lost, but you won't be able to see it until a cluster is running again.
There are two types of tables: global and local. A global table is registered with the Databricks Hive metastore and can be seen by multiple clusters. A local table, also called a temporary view, is not registered with the metastore and can be seen only by the current cluster.
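As a minimal sketch (a Scala notebook cell; someDataframe, global_table and local_temp_view are placeholder names):
// Global table: persisted to the Hive metastore, visible from other clusters
someDataframe.write.saveAsTable("global_table")
// Local table (temporary view): not registered with the metastore
someDataframe.createOrReplaceTempView("local_temp_view")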
A database can be created through the menu or by running the SQL statement "create database ...". Similarly, a table can be created through the menu, by running the SQL statement "create table ...", or by writing a DataFrame out to a table.
A SQL reference is available at https://docs.databricks.com/sql/language-manual/index.html
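For example, a minimal sketch of the SQL route run from a notebook (my_db and my_table are placeholder names):
spark.sql("CREATE DATABASE IF NOT EXISTS my_db")
spark.sql("CREATE TABLE IF NOT EXISTS my_db.my_table (id STRING, value STRING)")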
A database is just a schema (not reflected in the file system), while a table is a directory. The directory contains the files of the data inserted into the table. If you set a partition on the table, there will be subdirectories as well.
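To see this, you can list the table's directory with dbutils (available in Databricks notebooks). The path below is a sketch assuming the default DBFS warehouse location for a managed table:
// List the files (and partition subdirectories, if any) backing the table
dbutils.fs.ls("dbfs:/user/hive/warehouse/<example-table>").foreach(f => println(f.path))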
Managed and unmanaged tables
A managed table is a Spark SQL table for which Spark manages both the data and the metadata. In the case of a managed table, Databricks stores the metadata and data in DBFS in your account. Since Spark SQL manages the tables, running DROP TABLE example_data deletes both the metadata and the data.
Some common ways of creating a managed table are:
CREATE TABLE <example-table> (id STRING, value STRING)
Or
dataframe.write.saveAsTable("<example-table>")
Another option is to let Spark SQL manage the metadata, while you control the data location. We refer to this as an unmanaged table. Spark SQL manages the relevant metadata, so when you perform DROP TABLE <example-table>, Spark removes only the metadata and not the data itself. The data is still present in the path you provided.
Some common ways of creating an unmanaged table are:
CREATE TABLE <example-table> (id STRING, value STRING) USING org.apache.spark.sql.parquet OPTIONS (PATH "<your-storage-path>")
Or
dataframe.write.option("path", "<your-storage-path>").saveAsTable("<example-table>")
So basically, if you set the path the table is unmanaged; if not, it is managed. The assumption is that if you specify a storage path, you want the data to persist even if the table (the metadata) is dropped.
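A quick way to check which kind you have is DESCRIBE EXTENDED, whose "Type" row reads MANAGED or EXTERNAL. The sketch below (same placeholder names and path as above) also shows that dropping an unmanaged table leaves the data files behind:
// Inspect the table: the "Type" row shows MANAGED or EXTERNAL
spark.sql("DESCRIBE EXTENDED <example-table>").show(false)
// Dropping an unmanaged table removes only the metadata...
spark.sql("DROP TABLE <example-table>")
// ...so the data can still be read directly from the storage path
val stillThere = spark.read.parquet("<your-storage-path>")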
To replace the content of a table, you may drop the table and recreate it.
Or, the better way, use 'overwrite':
dataframe.write.mode("overwrite").saveAsTable("<example-table>")//ManagedOverwritedataframe.write.mode("overwrite").option("path","<your-s3-path>").saveAsTable("<example-table>")//UnmanagedOverwrite
Or
Use INSERT OVERWRITE from SQL.
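For example (a sketch; <source-table> is a placeholder for wherever the new data comes from):
spark.sql("INSERT OVERWRITE TABLE <example-table> SELECT id, value FROM <source-table>")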
Partition
Spark SQL can generate partitions dynamically at the file storage level to provide partition columns for tables, which enhances query performance.
To add a partition to a new table, simply specify it when creating the table.
// Create managed table as select
import org.apache.spark.sql.SaveMode
dataframe.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("<example-table>")
However, if you create a partitioned table from existing data, Spark SQL does not automatically discover the partitions and register them in the Hive metastore. In this case, SELECT * FROM <example-table> does not return results. To register the partitions, run the following: MSCK REPAIR TABLE <example-table>
// Save data to external files
dataframe.write.mode(SaveMode.Overwrite).partitionBy("id").parquet("<file-path>")
// Create unmanaged/external table
spark.sql("CREATE TABLE <example-table>(id STRING, value STRING) USING parquet PARTITIONED BY(id) LOCATION "<file-path>"")spark.sql("MSCK REPAIR TABLE "<example-table>"")