This page is dedicated to documenting some of the tools I've used and hopefully gives a good sense of my experience, in addition to serving as a reminder to myself of what each of these systems does and how they interact with one another. There are a plethora of big data tools out there, and it is easy to get lost trying to remember which tools do what and how to use them. Specifically, the tools I've interacted with are focused on statistical analysis, machine learning, and distributed systems. Many of these interface with each other, such as platforms that provide the infrastructure to host, at scale, the software performing the underlying computation.
Git is a version control system that supports complex versioning with both remote and local repositories, allowing work to continue even when remote repositories are unavailable. Considered the standard for repository management, it is the system behind hosting services such as GitLab and GitHub. Git is especially useful because it allows multiple branches of the same repository to exist, each with its own changes, and provides the ability to merge these branches when the changes are ready.
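As a sketch of the branch-and-merge flow described above, the snippet below drives the git CLI from Python against a throwaway repository (assuming git 2.28+ is on the PATH; the repository path, file name, and commit messages are invented for illustration):

```python
import os
import subprocess
import tempfile

# Throwaway repository for the demo; the path is arbitrary.
repo = tempfile.mkdtemp()

def git(*args):
    """Run a git command inside the demo repository."""
    return subprocess.run(["git", "-C", repo, *args],
                          check=True, capture_output=True, text=True)

git("init", "-b", "main")  # -b requires git 2.28+
git("config", "user.email", "demo@example.com")
git("config", "user.name", "Demo")

notes = os.path.join(repo, "notes.txt")
with open(notes, "w") as f:
    f.write("v1\n")
git("add", "notes.txt")
git("commit", "-m", "initial commit")

# Branch off and make changes in isolation...
git("checkout", "-b", "feature")
with open(notes, "a") as f:
    f.write("feature work\n")
git("commit", "-am", "add feature work")

# ...then merge back into main when the changes are ready.
git("checkout", "main")
git("merge", "feature")
```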
A fully fledged Python Integrated Development Environment (IDE) developed by JetBrains, this tool comes with a wealth of features designed to make large-scale, complex Python project development easier and more efficient. Additionally, it allows for downloading specific plugins to address any developer needs that fall outside the scope of the built-in tools.
Pretty much every aspect of project development is addressed by this IDE, from version control integration to easy code refactoring and connecting to a Python runtime environment for on-the-fly testing and running of code. All these features make PyCharm a hefty beast, and if your project is particularly large, it will be resource-intensive to run compared to a more lightweight editor like VS Code, which is discussed below in this section.
Similar to PyCharm, IntelliJ is a fully fledged Integrated Development Environment (IDE) focused specifically on Java projects. It has many of the same features as PyCharm, with additional support for Java-specific tools such as Maven integration.
An IDE designed specifically for the R programming language. R's focus is statistical analysis, and the IDE enhances the data analysis experience by providing many out-of-the-box tools to help with data visualization, package installation, and code organization.
This IDE differs from the other two in that it was not built by JetBrains. Additionally, R is not geared towards building complex programs, but rather towards working with data. As such, the IDE emphasizes ease of working with data, whereas the two JetBrains IDEs focus on building large-scale, complex programs rather than specifically on data and statistical analysis.
Visual Studio Code (VS Code) is a "jack of all trades" for coding. It is a general text editor with plugins available to download that support a vast number of coding languages and features. With enough plugins, one can transform VS Code into an almost fully fledged IDE!
For cases where a beefy, feature rich IDE is not needed, this is the perfect tool. Additionally, its support for almost any coding language makes it perfect for tackling projects in a language that might not have an easily accessible or usable IDE. With VS Code you can tailor the experience to your liking.
Jenkins is a Continuous Integration/Continuous Delivery (CI/CD) platform that allows for easy building and deployment of code directly from source code repositories such as GitHub or GitLab. To understand the usefulness of Jenkins and how it fits into the tech stack, we must first understand the uses of CI/CD.
CI/CD automates otherwise tedious parts of the development and deployment cycle. Specifically, steps such as code testing, health checks, and anything else needed to deploy code can be handled by a CI/CD pipeline. Once set up, a single Git push is enough to set off a chain of these automated actions that get the code ready for whatever the end goal is, ensuring a well-defined, automated procedure.
Jenkins simply provides a platform to set up CI/CD servers, link them to source code repositories, and visualize pipelines.
Docker is an environment virtualization tool that is especially useful for creating reproducible environments, i.e. making sure that the project will run with necessary software level dependencies fulfilled, no matter the actual physical machine it will be running on.
The point of this tool is to allow easy setup of all project dependencies so that, once specified in a Dockerfile, the project can be run by anyone with that Dockerfile without any further downloading or tinkering, regardless of that person's actual software environment. These specifications range from the OS level all the way down to specific versions of languages and packages/libraries. This ensures a consistent runtime environment for all members of a project throughout its development life cycle, from inception to deployment.
Kubernetes, also known as K8s, is a system for managing and scaling Docker containers for deployment. It essentially manages a cluster of servers, each potentially running multiple pods (a pod being a group of one or more containers with shared storage and network resources, and a specification for how to run those containers).
When a project is ready for deployment, Kubernetes provides the tools for scaling, providing functionality such as load balancing and ensuring that the deployment will be able to scale up or down efficiently, with Docker containerization taking care of any runtime environment dependencies.
A machine learning library which provides a robust framework for the whole ML pipeline, from pre-processing to training via autograd to inference. A large part of the API is integrated from Keras, which was originally a separate package, and provides "pre-built" solutions such as layers and architectures with several elements that can be customized to specific use cases. Additionally, the TensorBoard API provides visualization tools which allow for easy debugging and insight into the training process.
There are two execution modes: eager mode and graph mode. Eager mode is good for model development and does not construct graphs for the underlying operations. Graphs "can provide optimizations that make models run faster with better memory efficiency", and graph mode is good for productionizing an already developed model.
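A minimal sketch of the two modes (the function names here are my own): the same computation can run eagerly, or be traced into a graph via tf.function:

```python
import tensorflow as tf

# Eager mode: operations run immediately, line by line,
# which makes development and debugging straightforward.
def double_eager(x):
    return x * 2

# Graph mode: tf.function traces the Python function into a graph,
# which TensorFlow can optimize for speed and memory in production.
@tf.function
def double_graph(x):
    return x * 2

x = tf.constant(3)
print(int(double_eager(x)))  # 6
print(int(double_graph(x)))  # 6
```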
On the surface, PyTorch is similar to TensorFlow in that it allows for custom, scalable ML architectures. The difference comes in the implementation: PyTorch is a lot more "Pythonic" and, from personal use, was slightly easier to pick up. PyTorch is like the new kid on the block, bringing a fresh approach to ML pipelines in a, some would say, more comfortable way than its predecessors.
Both tools are a good choice for ML pipelines and, at a high level, choosing between the two comes down to preference between the APIs. I find PyTorch's API a little easier to pick up: getting a base model up and running is more intuitive than in TensorFlow, which has several details under the hood that need to be configured before things work.
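As an illustration of the "Pythonic" feel, here is one training step on a toy PyTorch model (the shapes and hyperparameters are arbitrary, chosen just for the example):

```python
import torch
from torch import nn

# Toy model: a single linear layer mapping 4 features to 1 output.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(8, 4)  # batch of 8 samples, 4 features each
y = torch.randn(8, 1)  # random regression targets

# One training step: forward pass, loss, backward pass, update.
pred = model(x)
loss = loss_fn(pred, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(pred.shape)  # torch.Size([8, 1])
```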
Apache Spark is a distributed parallel compute engine, and PySpark is its Python interface. Many large data pipelines that require complex transformations are more easily handled in Python than in native SQL. Additionally, loading data into Python using pandas' read_sql function can take quite a while. This is where PySpark comes in, adding parallelization to querying and data transformations. The Spark session can be run locally, or harness a properly set up computing cluster for maximum performance gains. A local session can still garner performance improvements on a multi-core machine: what once took several minutes on a single thread can be parallelized to reduce execution time significantly.
While Spark is built mainly for large scale ETL pipelines and classical ML algorithms (through the native SparkML library), there have been community-made libraries for integrating Spark with more modern ML APIs such as TensorFlow and PyTorch. However, there are also other tools such as Ray, which is explored below, that have been directly built for distributed modern ML pipelines and take advantage of GPU compute power for parallelization. Spark is typically more focused on classical ETL, which relies more on CPU compute power, although this looks to be changing with the release of libraries such as Horovod, as well as Spark 3, which natively introduced GPU acceleration features.
Apache Airflow is a workflow manager that schedules complex pipelines which may have several disjointed parts depending on each other's successful completion. It is specifically useful for pipelines which depend upon other pipelines. The Directed Acyclic Graph (DAG) structure of Airflow allows immensely complex workflows to be boiled down into an intuitive format which is much easier to debug, visualize, and understand.
Airflow has applications in anything that would benefit from complex scheduling and workflow dependency management, which more often than not is found in Data Engineering, ML Engineering and Data Science type roles. Any type of workflow can be handled from data cleaning pipelines to model training/inference and visualization generation.
Ray is similar to Spark in that it is meant for distributing work across multiple worker machines. However, Spark is more geared towards data transformations and traditional ML methods rather than the complex deep learning methods that are becoming more common today. Ray is geared towards GPU-based operations, which greatly benefit DL applications, and allows the use of native Python code and libraries such as scikit-learn, PyTorch, and TensorFlow without many adjustments. This is in contrast to PySpark, which requires significant setup and modification to get things working properly.
A workflow orchestration tool similar to Airflow, but with a focus on data-centric tasks such as data engineering and ML. Flyte is much more rigorous in its approach to workflows, requiring type annotations on the inputs and outputs of each step of the workflow. This allows Flyte to be type-aware during runs, which comes with its pros and cons.
Debugging is slightly more of a headache, as you have to ensure that your datatypes are supported by Flyte, or convert them to a Flyte-native type. However, this rigor ensures that everything works exactly the way you intend (once all the types are figured out and debugged), and forces the user to understand the type of data being ingested and output at each step of the process.
An enterprise data warehouse (EDW) used to store massive data sets and enable data table manipulation/retrieval via queries. This database has its own SQL dialect with some slightly different keywords, such as TOP (instead of LIMIT). The database can be queried via the native "Teradata Studio" app, or connected to via the teradatasql or teradataml Python packages, the latter of which has some built-in ML functionality.
This data solution is considered somewhat legacy and has become far less prevalent than the more modern GCP/Azure/AWS based alternatives. Large-scale queries often run into spool space errors if not properly set up, and scaling is lackluster at times, slowing down query speeds on especially high-traffic days. The overall experience feels clunky compared to more modern alternatives, with data debt being the main factor keeping users on the platform.
A modern cloud data warehouse offered by Google, capable of handling petabytes of data with minimal configuration. It is slightly more complex to query and interact with from Python compared to Teradata, as programmatic access requires setting up Service Account keys with specific IAM permissions.
The benefits of BigQuery lie in its ease of use. The web querying interface is significantly easier and more efficient than Teradata's "Teradata Studio" Mac app, which has quite a few performance issues and "security features" that prompt you for passwords every 30 minutes. Additionally, BigQuery was designed specifically to handle massive queries for data analysis, so running a query over hundreds of terabytes of data poses no issues.
The Vertex AI Platform is a large-scale, complex, fully managed service offered by Google. The platform provides purpose-built MLOps tools for data scientists and ML engineers to automate, standardize, and manage ML projects.
The platform offers many benefits, notably Vertex AI Workbench, a fully managed, scalable, enterprise-ready compute infrastructure for running Jupyter Notebooks. This service allows customers to run JupyterLab notebooks on Google Cloud servers with the ability to specify hardware environments to match compute needs, from beefy Nvidia A100 GPU environments for training deep neural networks to high-RAM environments for easy in-memory EDA (Exploratory Data Analysis) with Pandas, all without having to set up any hardware or worry about availability (as long as you can afford it).
The platform also comes with a variety of other tools which are all connected to the overall Google Cloud Platform such as Model Endpoints, Datasets, and Pipelines. All of these tools work well with each other, and are usually designed to be used together, along with other aspects of GCP such as BigQuery or GCS.
Similar to Vertex AI, Domino Data Labs is a data science platform with the ability to host Jupyter Notebooks on specific hardware. This differs from Vertex AI in that it is not a fully managed service, host servers have to be set up and managed by the client.
The platform offers tools similar to Vertex AI, including Jupyter Notebook workspaces, Model Endpoints for inference and monitoring, and Scheduled Jobs. Additionally, there are tools that make it easy to set up and host data web apps, such as those built with Streamlit or Shiny.
Overall, the Domino Data Labs platform is slightly more streamlined than Vertex AI, with a clear focus on data science, and less integration/focus on the Google Cloud Platform.