Query a cluster or SQL endpoint

https://docs.databricks.com/dev-tools/python-sql-connector.html

Databricks provides two interfaces for querying your data: (1) a cluster and (2) a SQL endpoint.


A SQL endpoint requires extra CPU cores and runs on its own separate cluster, whereas you can connect to an existing cluster (if it is running) and execute the same SQL queries against it.


To do that, you just need to install the Databricks SQL Connector for Python:

pip install databricks-sql-connector 


A sample script is shown below. The server-hostname and http-path can be found in the compute node's advanced options, under the JDBC/ODBC tab; for a SQL endpoint, they are in the endpoint's Connection details tab.

The access-token is a personal access token, which can be generated from User Settings.


from databricks import sql

# Open a connection and run a quick sanity-check query
with sql.connect(server_hostname="<server-hostname>",
                 http_path="<http-path>",
                 access_token="<access-token>") as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM <database-name>.<table-name> LIMIT 2")
        result = cursor.fetchall()
        for row in result:
            print(row)
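
Rather than hard-coding credentials in the script, you can keep them out of source control by reading them from environment variables. A minimal sketch (the variable names below are just examples, not anything the connector requires):

import os
from databricks import sql

# Assumes these were exported beforehand, e.g.
#   export DATABRICKS_SERVER_HOSTNAME="adb-123.azuredatabricks.net"
connection = sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
)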



For reporting and analytics purposes, it may be easier to load the data directly into a pandas DataFrame:


from databricks import sql
import pandas as pd

server_hostname = 'adb-123.azuredatabricks.net'
http_path = 'sql/protocolv1/o/1234567890/123-mhxgk453'
access_token = '<the personal access token>'

# Keyword arguments make the connection parameters explicit
connection = sql.connect(server_hostname=server_hostname,
                         http_path=http_path,
                         access_token=access_token)

df = pd.read_sql('SELECT * FROM mac_hub.heatmap LIMIT 10', connection)
df.head()

connection.close()
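
Note that pd.read_sql officially supports only SQLAlchemy connectables (and sqlite3); with a raw DBAPI connection like this one, recent pandas versions may emit a UserWarning, although the query still works. An equivalent sketch that sidesteps pandas' SQL layer builds the DataFrame from the cursor itself:

with sql.connect(server_hostname=server_hostname,
                 http_path=http_path,
                 access_token=access_token) as connection:
    with connection.cursor() as cursor:
        cursor.execute('SELECT * FROM mac_hub.heatmap LIMIT 10')
        rows = cursor.fetchall()
        # Per the DB-API spec, each entry in cursor.description is a
        # 7-item sequence whose first item is the column name
        columns = [col[0] for col in cursor.description]

df = pd.DataFrame(rows, columns=columns)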