Data Factory

Learn Data Factory quickly


In Azure, it seems everything is a resource: a SQL server, a blob storage account or a data factory. You can create a resource group to contain all those resources.

A data factory is roughly equivalent to Microsoft SSIS: it moves data from a source to a destination and applies transformations if needed.


Database / Storage as Source / Destination

First it needs a source, e.g. a database table or a file in storage.

For an Azure SQL database, create a SQL server resource, set the username and password (SQL login), choose compute and storage capacity, and select a redundancy strategy.

For storage, create a storage account and upload some files.


Within the SQL server's security settings, there are firewalls and virtual networks to configure.

Or, from a SQL database's property tab, use 'Set server firewall' to allow an IP address, Azure services, the public, etc. to access the database.
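As a rough sketch, such a firewall rule is itself just an ARM resource; the server name, rule name and IP below are made-up placeholders (the 'Allow Azure services' toggle corresponds to a rule spanning 0.0.0.0 to 0.0.0.0):

{
  "type": "Microsoft.Sql/servers/firewallRules",
  "apiVersion": "2021-11-01",
  "name": "my-sql-server/AllowOfficeIp",
  "properties": {
    "startIpAddress": "203.0.113.10",
    "endIpAddress": "203.0.113.10"
  }
}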

Through 'Settings' -> "Compute + Storage", you may change the DTUs, the storage capacity and the backup redundancy strategy. There are 3 redundancy strategies for a database:

1. Locally-redundant (3 copies kept within a single datacenter in the region),

2. Zone-redundant (copies spread across 3 availability zones in the region),

3. Geo-redundant (copies also replicated to the paired region, e.g. Australia East --> Australia Southeast).
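For reference, a hypothetical ARM snippet for the database itself: the DTU tier sits in 'sku' and the backup redundancy in 'requestedBackupStorageRedundancy' (the server/database names and values below are placeholders):

{
  "type": "Microsoft.Sql/servers/databases",
  "apiVersion": "2021-11-01",
  "name": "my-sql-server/my-database",
  "location": "australiaeast",
  "sku": { "name": "S0", "tier": "Standard" },
  "properties": {
    "requestedBackupStorageRedundancy": "Geo"
  }
}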


Within a storage account, there are 4 types of storage:

1. Containers, which keep blob files, e.g. CSV, JSON, etc.

2. File shares, which also keep files and folders. They are mainly for "lift and shift" file sharing; otherwise, use containers, which are better optimized for data access.

3. Queues, which keep small messages (up to 64 KB each), accessible via a URL.

4. Tables, which store NoSQL data in a table format (the same table model is also offered by the Cosmos DB Table API).

A storage account can be locally-redundant (3 copies within a single datacenter) or geo-redundant (copies also replicated to the paired region, e.g. Australia East --> Australia Southeast).
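As a sketch, the redundancy choice is just the SKU in the storage account's ARM definition, e.g. Standard_LRS vs Standard_GRS (the account name and region below are placeholders):

{
  "type": "Microsoft.Storage/storageAccounts",
  "apiVersion": "2021-09-01",
  "name": "mydatastorage001",
  "location": "australiaeast",
  "kind": "StorageV2",
  "sku": { "name": "Standard_GRS" }
}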


Both the database and the storage account can serve as a source or a destination.


Data Factory

Create a data factory resource, then click "Author & Monitor" on the overview page.

This opens the data factory design/maintenance portal.

The Author page has the toolbox for pipeline / dataset / data flow design.

The Manage page has the linked services, integration runtimes, Git configuration, triggers, global parameters, keys, etc.

The Monitor page keeps the run history, alerts, etc.


Data pipeline

Simply create a new pipeline, like a new control flow in SSIS. A bunch of activities are available:

Most importantly, Move & Transform: it can be a Copy Data activity (from source to sink) or a Data Flow, which supports richer transformations (a JSON sketch of a copy pipeline follows this list).

Iteration & conditionals (ForEach, If Condition, Until, etc.) implement flow logic in the control flow.

The General group has Set Variable, Stored Procedure, Web (make a GET/POST call to a URL), Webhook (calls an endpoint and waits for it to call back before the pipeline continues), etc.

Azure Function, for running code hosted in an Azure Functions app.

Batch Service, for running a custom script on a pool of compute nodes; a Batch account and a node pool need to be created first.

Databricks, machine learning, etc.
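To make the Copy Data activity concrete, here is a minimal sketch of a pipeline's JSON with a single Copy activity from a delimited-text source to an Azure SQL sink; the pipeline, activity and dataset names are invented for illustration:

{
  "name": "CopyCsvToSqlPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyCsvToSql",
        "type": "Copy",
        "inputs": [ { "referenceName": "BlobCsvDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SqlTableDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      }
    ]
  }
}

The inputs/outputs are dataset references, which is why the datasets and linked services (covered below) need to exist first.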


Data flow

This is similar to the data flow in SSIS: it moves data from a source to a destination (i.e. a sink).

It doesn't have a toolbox like SSIS does; instead, you add a source first.

Then, from the source, click the + button and it shows the subsequent transformations it can connect to: lookup, sort, pivot, parse, join, union, aggregate, etc., pretty much all the SSIS transformation components. However, there is no Script component within a data flow, and no Execute SQL task either.

The source and the sink each have to be linked to a dataset created from a linked service, as sketched below.
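A rough sketch of the JSON behind such a data flow, with one source and one sink referencing datasets (names are placeholders; the actual transformation logic lives in a 'script'/'scriptLines' property, omitted here):

{
  "name": "TransformCsvDataFlow",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "sources": [
        { "name": "source1", "dataset": { "referenceName": "BlobCsvDataset", "type": "DatasetReference" } }
      ],
      "sinks": [
        { "name": "sink1", "dataset": { "referenceName": "SqlTableDataset", "type": "DatasetReference" } }
      ],
      "transformations": []
    }
  }
}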


Linked Services

The linked services on the "Manage" tab are like the Connection Managers in SSIS: you specify the connection to a SQL server, a storage account, a web URL, etc.
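For example, a linked service to an Azure SQL database looks roughly like the JSON below; the server, database and credentials are placeholders (in practice the secret would usually come from Azure Key Vault). A blob storage linked service has the same shape, with "type": "AzureBlobStorage" and a storage connection string instead.

{
  "name": "AzureSqlLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<db>;User ID=<user>;Password=<password>;"
    }
  }
}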


Dataset

You can create a dataset from a linked service, e.g. a dataset pointing to a table in the database specified by a linked service, or a dataset pointing to a file in the blob storage specified by a linked service.
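For example, a delimited-text dataset pointing at a CSV file in a blob container might look like this (the linked service, container and file names are made up):

{
  "name": "BlobCsvDataset",
  "properties": {
    "linkedServiceName": { "referenceName": "AzureBlobStorageLinkedService", "type": "LinkedServiceReference" },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "input",
        "fileName": "sales.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}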


So the above is pretty much it for Data Factory:

Create linked services to the databases/files and create datasets.

Then create pipelines and data flows to transfer data between datasets.


Scheduling

on the "Manage" tab, create Triggers. Specify frequency, time, etc. just like sql server schedule.

Move back to the pipeline; at the top of the design pane, use 'Add trigger' to link the pipeline to a trigger.

Back in the Triggers list, clicking the 'Related' link shows the pipelines related to a trigger.
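Underneath, a schedule trigger and its pipeline link are also just JSON; a sketch with made-up names and times:

{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2021-06-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "CopyCsvToSqlPipeline", "type": "PipelineReference" } }
    ]
  }
}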

On the "Monitor" tab, you can check trigger runs.


Azure Resource Manager (ARM) template

on the "Manage" tab, ARM template has the option to export the current Data Factory as ARM template in JSON. It outputs a zip file that contains a few files, but probably the arm_template.json file is the actual definition of the Data Factory.

It can also import an existing ARM template to update a Data Factory (the link goes to the Azure portal's Custom Deployment page). On the Custom Deployment page, select "Build your own template in the editor", then "Load file", and choose the arm_template.json file. Fill in the passwords, connection strings, etc., and finally click 'Purchase'; it will create a Data Factory from the JSON file.

On the design pane for a pipeline, data flow, etc., you may also view the JSON code for just that part; the arm_template.json puts them all together.
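For orientation, the exported arm_template.json is roughly a parameters block (factory name, connection strings, passwords) plus a resources array with one entry per linked service, dataset, pipeline and trigger. A heavily trimmed, hypothetical skeleton:

{
  "$schema": "http://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "factoryName": { "type": "string" },
    "AzureSqlLinkedService_connectionString": { "type": "secureString" }
  },
  "resources": [
    {
      "type": "Microsoft.DataFactory/factories/linkedServices",
      "apiVersion": "2018-06-01",
      "name": "[concat(parameters('factoryName'), '/AzureSqlLinkedService')]",
      "properties": { "type": "AzureSqlDatabase" }
    },
    {
      "type": "Microsoft.DataFactory/factories/pipelines",
      "apiVersion": "2018-06-01",
      "name": "[concat(parameters('factoryName'), '/CopyCsvToSqlPipeline')]",
      "properties": { "activities": [] }
    }
  ]
}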


This is all for a quick run-through of Data Factory and how it mirrors SSIS functionality.