Dev who? Ops what?
DevOps != Dev + Oops. It's a lot more: it is about adopting a methodology, a culture, a mindset, and good people ;-)
Now: Dev Oops if you don't adapt to the new containerization world, tooling and methodology...
Must read list
NOTE: This page was created back in 2012 (when I started to pick up DevOps) because I am interested in DevOps. I don't necessarily know everything on the page. I have hands-on experience with the ones in bold.
MLOps is a set of practices that helps data scientists and engineers to manage the machine learning (ML) life cycle more efficiently. It aims to bridge the gap between development and operations for machine learning. The goal of MLOps is to ensure that ML models are developed, tested, and deployed in a consistent and reliable way.
MLOps is essential for ensuring that machine learning models are reliable, scalable, and maintainable in production environments.
Accommodating increased AI infra needs
Customizing and tuning models
Managing new artifacts
Navigating evaluation and monitoring
Connecting to enterprise data
MLOps to address the gap between development and Ops for ML (models).
DevOps is a set of practices that helps organizations to bridge the gap between software development and operations teams. MLOps is a similar set of practices that specifically addresses the needs of ML models.
There are some key differences between MLOps and DevOps, including:
Scope: DevOps focuses on the software development life cycle, while MLOps focuses on the ML life cycle
Complexity: ML models are often more complex than traditional software applications, requiring specialized tools and techniques for development and deployment
Data: ML models rely on data for training and inference, which introduces additional challenges for managing and processing data
Regulation: ML models may be subject to regulatory requirements, which can impact the development and deployment process
Despite these differences, MLOps and DevOps share some common principles, such as the importance of collaboration, automation, and continuous improvement. Organizations that have adopted DevOps practices can often leverage those practices when implementing MLOps.
In the era of AI, add MLOps.
Working to support the deployment of AI: deploying and managing LLMs at scale, with support for SFT, RL, governance, guardrails, etc.
I prefer Terraform + Ansible to adopt Infrastructure as Code + Configuration Management.
Terraform for provisioning infrastructure (e.g. AWS VPC topology, subnet, Security Groups, EC2 instances, etc.)
e.g. leveraging cloud-init / user data to configure VMs when launching EC2 instances, especially in an orchestrated fashion (creating a k8s cluster, for example), is bad (error prone, hard to troubleshoot).
Ansible for host (VM) configuration (ad-hoc tasks glued together in Shell scripts > playbooks) after Terraform has provisioned the VMs (up & running)
ad-hoc tasks for parallel execution (ideal for command line warriors working at scale; simple (KISS) and predictable)
playbooks for codified structured processes
NOTE: this separation makes sense and has proved useful.
Attitude towards Ansible Playbook (mode): personally I dislike YAML (YAML + Jinja2 templates). It runs counter to the KISS philosophy, and it becomes worse when conditions (like when) and loops are abused. It is also NOT easy to write good Playbooks (flexible and reusable by leveraging Roles, with well-organised variables). If I can deliver code in the form of Ansible ad-hoc tasks glued together in a Shell (Bash) script, I prefer to do so. Fun fact: the latter normally draws higher customer satisfaction (with infra Ops as the target audience, it has lower execution and maintenance overheads).
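As a rough illustration of the "ad-hoc tasks glued together" idea, here is a minimal sketch (in Python rather than Bash, purely for illustration). The host group k8s_nodes, the inventory file hosts.ini, and the chrony package are hypothetical; the helper names adhoc_command and run_steps are made up for this example. It only assumes the standard ansible CLI options -i, -m, -a, -f and --become.

```python
import shlex
import subprocess


def adhoc_command(pattern, module, args="", inventory="hosts.ini",
                  forks=10, become=False):
    """Build the argv for one Ansible ad-hoc task (nothing is executed here)."""
    argv = ["ansible", pattern, "-i", inventory, "-m", module, "-f", str(forks)]
    if args:
        argv += ["-a", args]
    if become:
        argv.append("--become")
    return argv


def run_steps(steps, dry_run=True):
    """Print each step, then run it unless dry_run; stop at the first failure."""
    for argv in steps:
        print("+ " + shlex.join(argv))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # much like `set -e` in a Bash script.
            subprocess.run(argv, check=True)


# Hypothetical host group and package names, for illustration only.
STEPS = [
    adhoc_command("k8s_nodes", "ping"),
    adhoc_command("k8s_nodes", "package", "name=chrony state=present", become=True),
    adhoc_command("k8s_nodes", "service",
                  "name=chronyd state=started enabled=yes", become=True),
]

if __name__ == "__main__":
    run_steps(STEPS)  # dry run: only prints the commands
```

Each step is a plain, predictable command line, so the same list could just as well live in a short Bash script; the -f (forks) option is what gives the parallel execution across hosts.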
IMPORTANT: be careful with terraform apply. Run terraform plan and read the output carefully before confirming, especially where it destroys and recreates resources (e.g. virtual machines).
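One way to make the "read the plan carefully" step harder to skip is to scan the machine-readable plan for destructive actions. The sketch below assumes the plan was saved with terraform plan -out=tfplan and rendered with terraform show -json tfplan, whose output contains a resource_changes list with change.actions per resource. The file name tfplan, the helper names, and the sample resource addresses are assumptions for illustration, not part of any existing workflow.

```python
import json
import subprocess


def destructive_changes(plan):
    """Return (address, actions) for every resource the plan deletes or replaces."""
    risky = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement shows up as a delete plus a create in the same change.
        if "delete" in actions:
            risky.append((rc["address"], actions))
    return risky


def load_plan(plan_file="tfplan"):
    """Render a saved plan file as JSON (requires the terraform CLI)."""
    result = subprocess.run(
        ["terraform", "show", "-json", plan_file],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Hand-written sample in the shape of the plan JSON, for illustration only.
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
        {"address": "aws_security_group.ssh", "change": {"actions": ["update"]}},
        {"address": "aws_instance.node", "change": {"actions": ["delete", "create"]}},
    ]
}

if __name__ == "__main__":
    for address, actions in destructive_changes(SAMPLE_PLAN):
        print(f"WARNING: {address} would be changed with actions {actions}")
```

This does not replace reading the plan; it only flags the entries (such as a VM being destroyed and recreated) that deserve a second look before confirming.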
Git Workflow
GitOps oriented CI/CD pipelines
Container centric workflows
leveraging Fedora Silverblue, Fedora CoreOS
Docker Compose, podman (podman compose) to orchestrate containers on a single Linux host
use podman to generate YAML and migrate workloads to production k8s clusters
kompose to migrate Docker Compose workloads to k8s clusters
etc. during development, CI/CD, release, deployment, testing, validation, etc.
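The two migration paths above (podman to Kubernetes YAML, and Docker Compose to Kubernetes via kompose) could be wired up roughly as follows. This is a sketch under assumptions: the pod name mypod, the output file k8s.yaml, and the helper names are hypothetical, and it relies only on the podman generate kube subcommand and on kompose convert with its -f and -o options.

```python
import shlex


def podman_kube_cmd(pod_or_container):
    """Command that prints Kubernetes YAML for a running pod or container."""
    return ["podman", "generate", "kube", pod_or_container]


def kompose_convert_cmd(compose_file="docker-compose.yml", out="k8s.yaml"):
    """Command that converts a Compose file into Kubernetes manifests."""
    return ["kompose", "convert", "-f", compose_file, "-o", out]


if __name__ == "__main__":
    # Hypothetical pod name; the commands are printed, not executed.
    for argv in (podman_kube_cmd("mypod"), kompose_convert_cmd()):
        print(shlex.join(argv))
```

The generated YAML is a starting point; it usually still needs review (resource limits, secrets, storage) before it goes to a production cluster.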
IaaS agnostic
Many other tools from Hacker News or GitHub, countless and constantly evolving, hard to keep up ;-)
MLOps in Google Cloud, Vertex AI.
...
Contents below this line can be considered out-dated.
(The table is a bit messy and out-dated; many entries overlap, or the edges and borders between them no longer exist - obsolete)
Ansible (with playbooks) is my current choice for parallel execution (ad-hoc tasks), automation and configuration management as it does NOT require an agent (yes, SSH ;-)
Chef / Chef Solo - Infrastructure as Code
NOTE: cookbooks, recipes, berkshelf (manage cookbook dependencies), foodcritic (lint tool for Chef), knife (CLI tool), knife solo (github), kitchen (cookbooks repository)...
Puppet / Puppet (Masterless) + MCollective
Sunzi - Sunzi is the easiest server provisioning utility (shell scripts) designed for mere mortals. If Chef or Puppet is driving you nuts, try Sunzi!
Saltstack - Salt
Vagrant
Vagrant + Chef Solo / Ansible / Shell / Puppet (master-less)
Veewee - Automate the building of Vagrant Base Boxes by creating definitions (shell scripts basically)
Packer - a tool for creating identical machine images for multiple platforms from a single source configuration
LXC - Linux Containers
Docker - LXC (Linux Container) Engine
Kubernetes (production-grade container scheduling and management)
Fission - FaaS for Kubernetes
CoreOS (Linux Kernel + systemd + LXC)
Jenkins
GitHub Actions
Jenkins X
Circle CI
GitLab CI
Argo CD / Argo Rollouts / Argo Workflows
Tekton - k8s native Pipelines
Maven (Java build automation and project management tool)
Deployment
Capistrano (Ruby)
Mina (Ruby)
Fabric (Python)
Rundeck
DRBD - Distributed Replicated Block Device (network based RAID 1)
ZFS - Snapshots and Clones
Btrfs - Snapshots
GlusterFS (clustered NFS and more)
Ceph (object store and file system)
Hadoop / HDFS
CFS Cassandra File System (HDFS replacement and enhancement based on Apache Cassandra)
Lustre
Ansible - generic (python) with Cookbook-like Playbooks
Fabric - Application Deployment (python)
cdist (shell) - generic configuration management
Capistrano - Application Deployment (Ruby)
MCollective - trustworthy parallel remote execution framework for large deployment
Salt Stack (Python)
Parallel remote execution, scale like nothing else (ZeroMQ), Configuration Management
Func (Fedora Unified Network Controller)
Git
Hosted Git => GitHub, BitBucket
Self hosting
Commercial => GitHub Enterprise, Atlassian Stash
Free & Open Source options => gitlab (Ruby on Rails + gitolite replaced by gitlab-shell since v5.0)
SVN / Subversion
CVS (so damn old school)
OpenGrok
LXR
Elasticsearch
Shell (Bash)
Go
Java
Rust
Python
Ruby
...
Amazon Web Services (AWS EC2 - Xen powered)
Azure
Google Cloud
Oracle Cloud
Rackspace (Xen)
Linode (Xen)
Digital Ocean (KVM)
VPS (RAM Host, BuyVM)
Terraform - Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions.
PaaS - Platform as a Service
Heroku
Engine Yard
Cloud Foundry (VMware)
Nebula Overlay Network
WireGuard (based)
OpenStack - RackSpace
CloudStack / CloudPlatform - cloud.com => Citrix => Apache
AWS VPC
CloudStack/CloudPlatform VPC
Cloud Native / k8s - Prometheus (retired: Heapster + InfluxDB + Grafana)
sysstat
Cockpit
Netdata
Nagios
StatsD (node.js)
prometheus
Zabbix
Monit
Monitorix
Cacti
Munin
Ganglia
Graphite (with collectd, Diamond, statsd, gdash, ganglia)
MRTG
New Relic
Server Density
Graylog2
logstash
logster (generate metrics from log files)
Elastic Stack (Elasticsearch + Kibana + Logstash + Beats)
Confluence
gollum
Dokuwiki
JIRA
Redmine
trac
Discourse (Ruby on Rails, discussion forum for DevOps;-)
Vanilla Forums (PHP)
NOTE: This page is NOT complete. It is updated on a regular basis, and comments are welcome.