synapse quick notes

Synapse

Synapse is a combination of Data Factory, PySpark/Scala notebooks, SQL databases, and Data Lake Storage / Lake databases. Microsoft puts a few good tools together to make data integration easier.


When a Synapse workspace is created, it must have a dedicated data lake storage account for internal storage.

In the Data tab, 

It has the Lake database and the SQL database. By default there is only the Lake database, which points to the data lake storage and keeps the tables created from SQL or PySpark scripts. It also provides linked services, which point to different data sources, and integration datasets. For a pipeline / data flow to work, it must have datasets built on the linked services.
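For example, a table created from a notebook with Spark SQL shows up under the Lake database. A minimal sketch, with made-up database and table names:

    CREATE DATABASE IF NOT EXISTS sales_lake;

    -- the table's parquet files land in the workspace's data lake storage
    CREATE TABLE IF NOT EXISTS sales_lake.orders (
        order_id INT,
        amount   DECIMAL(10, 2)
    )
    USING PARQUET;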

In the Develop tab,

It has SQL scripts, Notebooks, and Data flows. The SQL scripts are for querying the SQL databases / Lake databases. A handy thing is that if you browse to a file in the data lake storage and right-click it, there is an option to generate a SQL script (e.g. an OPENROWSET query for a parquet file, sketched below). The notebook is pretty much a PySpark notebook, but you can choose Scala and .NET as well. Data flow is the same as Data Factory's data flow, which goes from a source to a sink with different transformations in between.
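Roughly what the generated OPENROWSET script looks like when run against the serverless SQL pool (the storage account, container, and file path here are made up):

    SELECT TOP 100 *
    FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/data/sales.parquet',
        FORMAT = 'PARQUET'
    ) AS [result];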

In the Integrate tab,

It has the Pipeline, which is Data Factory's pipeline and runs activities such as Copy Data, Notebook, etc. It is mainly for putting a bunch of notebooks, data flows, copy data tasks, or other tasks together into a control flow.


External table

Another handy thing is that the parquet files/folders in the data lake storage can be exposed as an external table attached to a SQL database, so people can query the parquet files through SQL. Simply right-click the folder / file and choose New SQL script -> Create external table.
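A rough sketch of what the generated script sets up, with hypothetical names and storage URL (the real script is generated from the file or folder you right-clicked):

    CREATE EXTERNAL DATA SOURCE SalesLake
    WITH (LOCATION = 'https://mystorageaccount.dfs.core.windows.net/mycontainer');

    CREATE EXTERNAL FILE FORMAT ParquetFormat
    WITH (FORMAT_TYPE = PARQUET);

    -- the external table just points at the parquet folder, no data is copied
    CREATE EXTERNAL TABLE dbo.Sales (
        order_id INT,
        amount   DECIMAL(10, 2)
    )
    WITH (
        LOCATION = 'data/sales/',
        DATA_SOURCE = SalesLake,
        FILE_FORMAT = ParquetFormat
    );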

SQL Pools

By default it has a built-in (serverless) SQL pool, but you can create new dedicated SQL pools / databases as well.

Spark pools

To be able to run notebooks, a Spark pool needs to be created. Choose the node size, scale settings, etc.

Same as Data Factory,

It has the integration runtimes (the built-in Azure one, Azure-SSIS, or a self-hosted runtime). It also has triggers for running pipelines on a schedule.