az

10/10/20

az: A new protocol to interact with the Azure Blob File System using Python

https://github.com/dask/adlfs

What is the Azure Blob File System?

As taken from the documentation: "Azure Blob storage is Microsoft's object storage solution for the cloud". If you are wondering, the word 'blob' comes from "its ability to optimize storing of massive amounts of unstructured data (e.g. any file type)".

Poster for the film "The Blob" (1958). Taken from https://www.imdb.com/title/tt0051418/

Different ways to work with blobs

In the docs, Microsoft offers a variety of ways to work with blobs. The options range from GUIs (Graphical User Interfaces) to CLIs (Command Line Interfaces) to SDKs for languages such as Python. I've listed the main options below with links to the documentation:


Python

The Python option is the one most likely to be used by data scientists, so it is expanded on further. The solution offered in the docs uses only the azure-storage-blob package but is quite long-winded. For example, to upload a CSV file to blob storage you have to do the following:

import os

from azure.storage.blob import BlobServiceClient

# Connect using the connection string stored in an environment variable
blob_service_client = BlobServiceClient.from_connection_string(os.getenv("AZURE_STORAGE_CONNECTION_STRING"))

# Create a container and point a client at the blob we want to write
container_client = blob_service_client.create_container("my-container")
blob_client = blob_service_client.get_blob_client(container="my-container", blob="my_file.csv")

# Upload the local file
with open("my_file.csv", "rb") as data:
    blob_client.upload_blob(data)

The code above is not easy to remember and requires setting up the various connections. Wouldn't it be great if there was a library that simplified this?!

FSSPEC

FSSPEC (Filesystem interfaces for Python) is a project that simplifies this process. FSSPEC standardizes the way Python interacts with file systems. The suite of file systems it handles can be found here, but to name a few: AWS S3, Google Cloud Storage, Microsoft Azure Storage, Hadoop File System, HTTP(S) and Dropbox. It means you can read and write data to these without having to learn native libraries such as boto3 or azure-storage-blob. You can simply use pandas, Dask or cuDF.
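As a rough sketch of what that uniformity looks like (the paths below are placeholders, and each remote scheme needs its backend package, e.g. s3fs, installed):

import fsspec

# The same call works for local files and for remote object stores;
# only the URL scheme changes.
with fsspec.open('my_file.csv', 'rt') as f:
    print(f.readline())

with fsspec.open('s3://my-bucket/my_file.csv', 'rt') as f:
    print(f.readline())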

FSSPEC isn't well publicized, and that is the beauty of it. You may have seen it being used in a blog or used it yourself without realizing it. To quantify its reach, it is used in nearly 6,000 projects on GitHub, and that only counts public projects.

adlfs

adlfs is the FSSPEC package for accessing Microsoft Azure Storage. It's newer than s3fs and gcsfs, the packages for accessing AWS S3 and Google Cloud Storage, respectively. The implementations in these cloud-specific packages are registered upstream in FSSPEC's known_implementations dictionary. The most commonly used implementation is abfs (shorthand for Azure Blob File System), which accesses data in Azure Data Lake Gen2. There is also adl (shorthand for Azure Data Lake), which accesses Azure Data Lake Gen1.
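You can see which package backs each protocol by inspecting that dictionary (a quick sketch; the exact entries depend on your FSSPEC version):

from fsspec.registry import known_implementations

# Each entry names the class that implements the protocol,
# e.g. adlfs for the abfs and adl protocols
print(known_implementations['abfs'])
print(known_implementations['adl'])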

az

I implemented the protocol az as a replica of abfs (you can see the PR here). FSSPEC allows a filesystem class to be registered under multiple protocol names. For example, you can access Google Cloud Storage using either the gcs protocol or the gs protocol, which are identical.
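A quick way to convince yourself that az and abfs point at the same thing (a sketch, assuming adlfs is installed):

import fsspec

# Both protocols resolve to the same adlfs class
print(fsspec.get_filesystem_class('abfs'))
print(fsspec.get_filesystem_class('az'))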

So how do I use it?

Install

First install the package using pip (or conda):

$ pip install adlfs
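Or, if you prefer conda, the package is available on conda-forge:

$ conda install -c conda-forge adlfs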

Accessing data

You can access data in your blob storage using a storage options dictionary. You can find this information (the account name and an access key) by locating the storage account in the Azure portal.

storage_options = {'account_name': 'ACCOUNT_NAME', 'account_key': 'ACCOUNT_KEY'}
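Hard-coding credentials is best avoided; here's a minimal sketch of building the same dictionary from environment variables (the variable names AZURE_ACCOUNT_NAME and AZURE_ACCOUNT_KEY are just an example, not something adlfs requires):

import os

# Assumes the two environment variables below were exported beforehand;
# the names are illustrative, not an adlfs convention.
storage_options = {'account_name': os.environ['AZURE_ACCOUNT_NAME'],
                   'account_key': os.environ['AZURE_ACCOUNT_KEY']}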

To read in the CSV file that was uploaded to blob storage earlier you can (soon*) use pandas:

import pandas as pd


pd.read_csv('az://my-container/my_file.csv', storage_options=storage_options)

If you want a stable solution to use today you can use dask.dataframe:

import dask.dataframe as dd


dd.read_csv('az://my-container/my_file.csv', storage_options=storage_options).compute()

If you have access to a GPU you can use cudf:

import cudf


cudf.read_csv('az://my-container/my_file.csv', storage_options=storage_options)

If you are doing development work with the file, you may want to copy it locally rather than repeatedly reading the bytes from the remote store, to reduce egress costs. You can do this by importing the class and using FSSPEC methods on it: in this case get.

from adlfs import AzureBlobFileSystem


fs = AzureBlobFileSystem(**storage_options)

fs.get('my-container/my_file.csv', 'my_file.csv')
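Before copying, you can sanity-check what is in the container; any FSSPEC filesystem exposes the usual methods such as ls and exists:

# List the blobs in the container and check the file is there
fs.ls('my-container')
fs.exists('my-container/my_file.csv')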

Note: if you are copying a partitioned parquet file you will have to use a wildcard to grab the data:

fs.get('my-container/my_partitioned_parquet_file.parquet/*', 'my_partitioned_parquet_file.parquet')

Uploading data

To upload a file to blob storage, instantiate the class as above and use put.

fs.put('my_file.csv', 'my-container/my_file.csv')

That was much nicer than using azure-storage-blob!

As above, use a wildcard if uploading a partitioned parquet file to blob storage:

fs.put('my_partitioned_parquet_file.parquet/*', 'my-container/my_partitioned_parquet_file.parquet')
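If you don't want to go via a local file at all, here is a sketch of streaming a pandas DataFrame straight to blob storage through fsspec.open (reusing the storage_options dictionary from earlier):

import fsspec
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# fsspec.open forwards the extra keyword arguments to the az filesystem
with fsspec.open('az://my-container/my_file.csv', 'w', **storage_options) as f:
    df.to_csv(f, index=False)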

Accessing public datasets

Azure hosts a suite of publicly available data. Some examples are the NYC yellow taxi trip dataset (a commonly used big dataset; I used it in a previous blog), data from the COVID Tracking Project and GOES-16 satellite data.

Unlike public data in AWS and GCP, Azure stores each dataset in a different storage account, so the storage account name is needed to access the data. I'll go through two examples below:

COVID Tracking Project

You can find out where the data is stored by looking at the Azure Notebook associated with the data here and finding the storage account name, e.g. the full URL is https://pandemicdatalake.blob.core.windows.net, so the account name is 'pandemicdatalake'. There is no account key for this account, so you can define storage options using just account_name:

import dask.dataframe as dd


storage_options = {'account_name': 'pandemicdatalake'}

df = dd.read_parquet('az://public/curated/covid-19/covid_tracking/latest/covid_tracking.parquet',
                     storage_options=storage_options).compute()

Here's the output:

GOES-16 satellite data

Satellite data is more complicated than tabular data: it consists of different spectral bands and carries a suite of metadata about the properties of the data, such as georeferencing. The data is stored as NetCDF files. You can learn more about GOES-16 here.

I'm going to download data from yesterday afternoon (10/9/2020; when Hurricane Delta made landfall in Louisiana) and show the data using satpy:

import glob

from adlfs import AzureBlobFileSystem
from satpy import Scene

storage_options = {'account_name': 'goes'}
fs = AzureBlobFileSystem(**storage_options)

# Copy the full-disk ABI L1b files for that scan to a local folder
fs.get('noaa-goes16/ABI-L1b-RadF/2020/283/16/*00182*', 'goes16-data')

# Build a satpy Scene from the downloaded files and display the composite
scn = Scene(reader='abi_l1b', filenames=glob.glob('goes16-data/*'))
scn.load(['colorized_ir_clouds'])
scn.show('colorized_ir_clouds')

Here's the output:
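If you want to browse which scans are available before downloading, the same filesystem object can list the container layout (the path pattern is product/year/day-of-year/hour):

# List the ABI full-disk files for that hour
fs.ls('noaa-goes16/ABI-L1b-RadF/2020/283/16/')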

Acknowledgements

My small contribution is built upon the significant work done by @martindurant and @hayesgb.

Footnotes

As of writing this on 10/10/20, the latest pandas version is 1.1.3 and the storage_options keyword argument hasn't been added to read_csv. You can test the az protocol on pandas master though, if you dare:

$ pip install git+https://github.com/pandas-dev/pandas

Otherwise wait until 1.2 is out.