data factory - load to datalake and databricks

Here is an example of using Data Factory to load data from an ODBC source (e.g. OSI PI) into Data Lake Storage Gen2 and then into a Databricks Delta table.


Linked services


linked service to OSI PI AF Server

Use the ODBC connector with a self-hosted integration runtime, and install the PI SQL Client on the self-hosted server.

The connection string format is:

Driver=PI SQL Client; AF Server=myAfServer; AF Database=myAfDatabase; Integrated Security=SSPI;

The connection string can also include a user ID and password.
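
As a quick sanity check, the same connection string can be tested from the self-hosted integration runtime machine with a small Python script, assuming pyodbc is installed there (a sketch; server and database names are placeholders):

    import pyodbc

    # Same connection string as the ODBC linked service; server/database names are placeholders.
    conn_str = (
        "Driver=PI SQL Client; "
        "AF Server=myAfServer; "
        "AF Database=myAfDatabase; "
        "Integrated Security=SSPI;"
    )

    conn = pyodbc.connect(conn_str)
    cursor = conn.cursor()
    # Pull a small sample from the AF element table just to confirm connectivity.
    cursor.execute("select * from element.element where name = 'ABC'")
    print(cursor.fetchall())
    conn.close()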


linked service to Data Lake Storage Gen2

The URL is https://<storage account name>.dfs.core.windows.net

Use the access key of the data lake storage account for authentication.
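
The URL and access key pair can be verified up front with the azure-storage-file-datalake package, if desired (a sketch; the account name, key and container name are placeholders):

    from azure.storage.filedatalake import DataLakeServiceClient

    # Placeholders: substitute the real storage account name, access key and container name.
    account_url = "https://<storage account name>.dfs.core.windows.net"
    access_key = "<access key>"

    service = DataLakeServiceClient(account_url=account_url, credential=access_key)
    file_system = service.get_file_system_client("<container name>")

    # List whatever is already in the container to prove the key works.
    for path in file_system.get_paths():
        print(path.name)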


linked service to Databricks

Get the workspace URL from the compute node (cluster) and create a personal access token.
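
The workspace URL and token can be checked with a plain REST call before they go into the linked service (a sketch; the URL and token are placeholders):

    import requests

    # Placeholders: the workspace URL (e.g. https://adb-xxxx.azuredatabricks.net) and the PAT.
    workspace_url = "https://<databricks workspace url>"
    token = "<personal access token>"

    # Listing clusters with the token; a successful response confirms the URL and token are valid.
    resp = requests.get(
        f"{workspace_url}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    print(resp.json())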



Datasets


Create a dataset for each of the linked services

For the OSI PI dataset, use the ODBC data source and select the self-hosted integration runtime.

Select the table to be loaded.


Create a CSV dataset that points to the data lake container.

The file path follows the format below:

     <container name>/<directory name>/<File>

To load data into a folder, leave <File> empty.

Be careful with the Escape character and Quote character settings: if the data contains \ then don't use \ as the escape character; a double quote (") is usually a safer choice.
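
To see why, here is a small Python illustration of the safer combination, quoting with " and doubling embedded quotes instead of escaping with \ (the file name is just an example):

    import csv

    rows = [
        ["tag1", 'value with "quotes"', "C:\\path\\with\\backslashes"],
        ["tag2", "plain value", "10.5"],
    ]

    # Quote fields with " and escape embedded quotes by doubling them (doublequote=True),
    # rather than using \ as the escape character, which clashes with data that already
    # contains backslashes.
    with open("sample.csv", "w", newline="") as f:
        writer = csv.writer(f, quotechar='"', doublequote=True, quoting=csv.QUOTE_MINIMAL)
        writer.writerows(rows)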

 

Create an Azure Databricks Delta Lake dataset

Point it to the Databricks Delta table.
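
If the target table does not exist yet, one way to create it is from a Databricks notebook; a minimal sketch, with hypothetical database, table and column names:

    # Run in a Databricks notebook, where `spark` is already defined.
    # The database, table and column names below are hypothetical examples.
    spark.sql("CREATE DATABASE IF NOT EXISTS osipi")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS osipi.element (
            Name STRING,
            Template STRING,
            Path STRING
        )
        USING DELTA
    """)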



Pipeline

The "Pipeline" in Data Factory is like control flow in SSIS.

It has a handy 'Copy Data' task: just select the source and sink datasets, and it copies the data over from source to sink.


Create a Copy Data task to copy data from the OSI PI dataset to the data lake dataset.

The source dataset can point to a table or a query, e.g. select * from element.element where name = 'ABC'


Create another Copy Data task to copy data from the data lake dataset to the Databricks dataset.

NOTE: the source points to the data lake dataset.

Use a wildcard file path if there can be multiple files and you don't want to specify file names.

Put in the data lake container name, leave <directory> empty, and set <file> to "*".


Validate, publish and run the pipeline.
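
Running can be done from the portal (Debug or Trigger now); a published pipeline can also be kicked off programmatically with the azure-mgmt-datafactory package, if that is preferred (a sketch; the subscription, resource group, factory and pipeline names are placeholders):

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    # Placeholders for the subscription, resource group, data factory and pipeline names.
    subscription_id = "<subscription id>"
    resource_group = "<resource group>"
    factory_name = "<data factory name>"
    pipeline_name = "<pipeline name>"

    adf = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

    # Kick off a run of the published pipeline and print the run id.
    run = adf.pipelines.create_run(resource_group, factory_name, pipeline_name)
    print(run.run_id)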


IMPORTANT.

The Data Factory 'Copy Data' task creates a Databricks notebook behind the scenes for copying data to the Databricks Delta Lake sink.

The Databricks environment needs to have access to the data lake (configured on the cluster); otherwise the notebook will fail, and the pipeline with it.

To do that, create a service principal and assign it at least the "Storage Blob Data Contributor" role on the data lake storage account.

In the Databricks config (go to the cluster under Compute, click Edit, then Advanced Options > Spark Config), type in the service principal settings for the storage account name, application id, secret/password and tenant id:

fs.azure.account.auth.type.<storage account name>.dfs.core.windows.net OAuth

fs.azure.account.oauth.provider.type.<storage account name>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider

fs.azure.account.oauth2.client.id.<storage account name>.dfs.core.windows.net <application id>

fs.azure.account.oauth2.client.secret.<storage account name>.dfs.core.windows.net <secret>

fs.azure.account.oauth2.client.endpoint.<storage account name>.dfs.core.windows.net https://login.microsoftonline.com/<tenant id>/oauth2/token

This allows the Databricks environment to access the data lake container.
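
To sanity-check the service principal's access from a notebook, the same settings can be applied per-session and the container listed; note that the ADF copy itself still relies on the cluster-level Spark config above (a sketch; the storage account, container, application id, secret and tenant id are placeholders):

    # Run in a Databricks notebook; `spark` and `dbutils` are already defined there.
    storage_account = "<storage account name>"
    application_id = "<application id>"
    secret = "<secret>"
    tenant_id = "<tenant id>"

    suffix = f"{storage_account}.dfs.core.windows.net"
    spark.conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
    spark.conf.set(
        f"fs.azure.account.oauth.provider.type.{suffix}",
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    )
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{suffix}", application_id)
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}", secret)
    spark.conf.set(
        f"fs.azure.account.oauth2.client.endpoint.{suffix}",
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    )

    # If the role assignment is in place, this lists the container's contents.
    print(dbutils.fs.ls(f"abfss://<container name>@{suffix}/"))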