Oops

What is DevOps / Ops?

Dev who? Ops what?

DevOps != Dev + Oops. It's a lot more, about adopting methodology, culture, mindset and good people ;-)

Now: Dev Oops if you don't adapt to the new containerization world, tooling and methodology...

Must read list

NOTE: This page was created back in 2012 (when I started to pick up DevOps) because I am interested in DevOps. I don't necessarily know everything on the page. I have hands-on experience with the ones in bold.

What is MLOps & LLMOps?

MLOps is a set of practices that helps data scientists and engineers to manage the machine learning (ML) life cycle more efficiently. It aims to bridge the gap between development and operations for machine learning. The goal of MLOps is to ensure that ML models are developed, tested, and deployed in a consistent and reliable way.

MLOps is essential for ensuring that machine learning models are reliable, scalable, and maintainable in production environments.

Accommodating increased AI infra needs
Customizing and tuning models
Managing new artifacts
Navigating evaluation and monitoring
Connecting to enterprise data

LLMOps refers to the practices and processes involved in managing and operating LLMs. It is a specialized framework that extends MLOps to the unique challenges of managing LLMs throughout their lifecycle, from initial development and fine-tuning to deployment, continuous monitoring, and ongoing optimization.

LLMOps involves a comprehensive set of activities:

Model deployment and maintenance
Data management
Model training and fine-tuning
Monitoring and evaluation
Security and compliance

How?

Data collection and preparation
Model development
Model deployment
Model management
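The four stages above can be sketched end to end in a few lines of Python. This is only a toy illustration, not how any particular MLOps platform works: the `Model` class, the in-memory `registry` dict and the `max_mae` threshold are all hypothetical names made up for the example, and the "model" is a one-feature least-squares fit.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Model:
    """Toy model: y = slope * x + intercept (hypothetical, for illustration)."""
    slope: float
    intercept: float
    version: int = 1

    def predict(self, x):
        return self.slope * x + self.intercept


def prepare(raw):
    """Data collection and preparation: drop rows with missing values."""
    return [(x, y) for x, y in raw if x is not None and y is not None]


def develop(rows, version=1):
    """Model development: ordinary least squares on a single feature."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx
    return Model(slope, y_bar - slope * x_bar, version)


def deploy(registry, name, model):
    """Model deployment: publish under a name (a dict stands in for a registry)."""
    registry[name] = model


def monitor(model, fresh_rows, max_mae):
    """Model management: is mean absolute error on fresh data within budget?"""
    errors = [abs(model.predict(x) - y) for x, y in fresh_rows]
    return mean(errors) <= max_mae


registry = {}
rows = prepare([(1.0, 3.0), (2.0, 5.0), (None, 9.0), (3.0, 7.0)])
deploy(registry, "demo", develop(rows))
healthy = monitor(registry["demo"], [(4.0, 9.0)], max_mae=0.5)
```

In a real setup each function would be a separate pipeline step, and a `False` from `monitor` would typically trigger retraining or a rollback.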

Difference between MLOps and DevOps?

MLOps addresses the gap between development and operations for ML models.

DevOps is a set of practices that helps organizations to bridge the gap between software development and operations teams. MLOps is a similar set of practices that specifically addresses the needs of ML models.

There are some key differences between MLOps and DevOps, including:

Scope: DevOps focuses on the software development life cycle, while MLOps focuses on the ML life cycle
Complexity: ML models are often more complex than traditional software applications, requiring specialized tools and techniques for development and deployment
Data: ML models rely on data for training and inference, which introduces additional challenges for managing and processing data
Regulation: ML models may be subject to regulatory requirements, which can impact the development and deployment process

Despite these differences, MLOps and DevOps share some common principles, such as the importance of collaboration, automation, and continuous improvement. Organizations that have adopted DevOps practices can often leverage those practices when implementing MLOps.

TL;DR

In the era of AI, add MLOps.

Working to support the deployment of AI: deploying and managing LLMs at scale, with support for SFT, RL, governance, guardrails, etc.

I prefer Terraform + Ansible to adopt Infrastructure as Code + Configuration Management.

- Terraform for provisioning infrastructure (e.g. AWS VPC topology, subnet, Security Groups, EC2 instances, etc.)
- e.g. leveraging cloud-init / user data when launching EC2 to configure VMs, especially in an orchestrated fashion (creating a k8s cluster, for example), is bad (error-prone, hard to troubleshoot).
- Ansible for host (VM) configuration (ad-hoc tasks glued together in Shell scripts > playbooks) after Terraform has provisioned the VMs (up & running)
- ad-hoc tasks for parallel execution (ideal for command-line warriors working at scale: simple, KISS, predictable)
- playbooks for codified structured processes
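As a rough illustration of the ad-hoc style, the sketch below builds the command line for one Ansible ad-hoc task. The flags used (`-i` inventory, `-m` module, `-a` module arguments, `-f` forks for parallelism) are standard `ansible` CLI options; the host pattern `k8s_nodes`, the inventory file name `hosts.ini` and the helper `adhoc_command` are assumptions made up for the example.

```python
import shlex


def adhoc_command(pattern, module, args, inventory="hosts.ini", forks=10):
    """Build the argv for an Ansible ad-hoc task against the hosts matching `pattern`."""
    return [
        "ansible", pattern,
        "-i", inventory,    # inventory file (hypothetical name)
        "-m", module,       # module to run, e.g. shell, command, copy
        "-a", args,         # module arguments
        "-f", str(forks),   # number of parallel forks
    ]


cmd = adhoc_command("k8s_nodes", "shell", "uptime", forks=20)
print(shlex.join(cmd))

# To actually execute it (requires Ansible and a reachable inventory):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Gluing several such one-liners together in a Bash script gives the same effect without Python; the point is only that each step is a single, predictable command.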

NOTE: the separation makes sense and has proven useful.

Attitude towards Ansible Playbook (mode): personally I dislike YAML (YAML + Jinja2 templates). It runs counter to the KISS philosophy, and it gets worse when conditions (like when) and loops are abused. It is also NOT easy to write good Playbooks (flexible and reusable by leveraging Roles, with well-organised variables). If I can deliver code in the form of Ansible ad-hoc tasks glued together in a Shell (Bash) script, I prefer to do so. Fun fact: the latter normally draws higher customer satisfaction (with infra Ops as the target audience), as it has lower execution and maintenance overheads.

IMPORTANT: be careful with terraform apply. Run terraform plan and read the output carefully before confirming, especially where resources (e.g. virtual machines) are going to be destroyed and recreated.
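Reading the plan can be partly automated. A saved plan (`terraform plan -out=tfplan`) can be dumped as JSON with `terraform show -json tfplan`, and each entry under `resource_changes` carries a `change.actions` list such as `["update"]`, `["delete"]` or `["delete", "create"]` for a replacement. The sketch below flags anything that deletes; the function name and the sample resources are hypothetical, and it is a complement to reading the plan, not a substitute.

```python
import json


def destructive_changes(plan):
    """Return (address, actions) for every resource the plan deletes or replaces."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:  # covers plain deletes and delete/create replacements
            flagged.append((rc["address"], actions))
    return flagged


# Hand-written sample in the shape of `terraform show -json` output (made-up resources).
sample = json.loads("""
{
  "resource_changes": [
    {"address": "aws_instance.web",       "change": {"actions": ["delete", "create"]}},
    {"address": "aws_subnet.app",         "change": {"actions": ["update"]}},
    {"address": "aws_security_group.old", "change": {"actions": ["delete"]}}
  ]
}
""")

flagged = destructive_changes(sample)
for address, actions in flagged:
    print(f"WARNING: {address} -> {'/'.join(actions)}")
```

In a CI pipeline the same check could fail the job when `flagged` is non-empty, forcing a human to confirm before `terraform apply`.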

Git Workflow

GitOps oriented CI/CD pipelines

Container centric workflows

- leveraging Fedora Silverblue, Fedora CoreOS
- Docker Compose, podman (podman compose) to orchestrate containers on a single Linux host
- use podman to generate YAML and migrate workloads to production k8s clusters
- kompose to migrate Docker Compose workloads to k8s clusters
- etc. during development, CI/CD, release, deployment, testing, validation, etc.

IaaS agnostic

Many other tools from Hacker News or GitHub, countless and constantly evolving, hard to keep up ;-)

MLOps in Google Cloud, Vertex AI.

...

Contents below this line can be considered out-dated.

Terminology and Tools

(The table is a bit messy and out-dated: many entries overlap, or the boundaries between categories no longer exist - obsolete)

Configuration Management and Automation

Ansible (with playbooks) is my current choice for parallel execution (ad-hoc tasks), automation and configuration management as it does NOT require an agent (yes, SSH ;-)

Chef / Chef Solo - Infrastructure as Code

NOTE: cookbooks, recipes, berkshelf (manage cookbook dependencies), foodcritic (lint tool for Chef), knife (CLI tool), knife solo (github), kitchen (cookbooks repository)...

Puppet / Puppet (Masterless) + MCollective

Sunzi - Sunzi is the easiest server provisioning utility (shell scripts) designed for mere mortals. If Chef or Puppet is driving you nuts, try Sunzi!

Saltstack - Salt

Development / Production Environment

Vagrant

Vagrant + Chef Solo / Ansible / Shell / Puppet (master-less)
Veewee - Automate the building of Vagrant Base Boxes by creating definitions (shell scripts basically)

Packer - a tool for creating identical machine images for multiple platforms from a single source configuration

LXC - Linux Containers

Docker - LXC (Linux Container) Engine

Kubernetes (production-grade container scheduling and management)

Fission - FaaS for Kubernetes

CoreOS (Linux Kernel + systemd + LXC)

Continuous Integration - CI/CD

Jenkins
GitHub Actions
Jenkins X
Circle CI
GitLab CI
Argo CD / Argo Rollouts / Argo Workflows
Tekton - k8s native Pipelines

Maven (Java build automation and project management tool)

Deployment

Capistrano (Ruby)
Mina (Ruby)
Fabric (Python)

Workflow Automation

Rundeck

Automated Recovery / Replication

DRBD - Distributed Replicated Block Device (network based RAID 1)

ZFS - Snapshots and Clones

Btrfs - Snapshots

Distributed File Systems

- GlusterFS (clustered NFS and more)
- Ceph (object store and file system)

Hadoop / HDFS
CFS Cassandra File System (HDFS replacement and enhancement based on Apache Cassandra)
- Lustre

Configure, Deploy, Ad-hoc Tasks, Parallel Execution

- Ansible - generic (python) with Cookbook like Playbooks

Fabric - Application Deployment (python)
cdist (shell) - generic configuration management
Capistrano - Application Deployment (Ruby)
MCollective - trustworthy parallel remote execution framework for large deployment
Salt Stack (Python)
Parallel remote execution, scale like nothing else (ZeroMQ), Configuration Management
Func (Fedora Unified Network Controller)

Source Code Management - SCM

Git

Hosted Git => GitHub, BitBucket
Self hosting
Commercial => GitHub Enterprise, Atlassian Stash
Free & Open Source options => gitlab (Ruby on Rails + gitolite replaced by gitlab-shell since v5.0)

SVN / Subversion

CVS (so damn old school)

Source Code Browser / Search / Cross-referencer

OpenGrok

LXR

Search / Indexing

Elasticsearch

Scripting Language (polyglot)

Shell (Bash)
Go
Java
Rust
Python
Ruby
...

IaaS - Infrastructure as a Service

Amazon Web Services (AWS EC2 - Xen powered)

Azure

Google Cloud

Oracle Cloud

Rackspace (Xen)

Linode (Xen)

Digital Ocean (KVM)

VPS (RAM Host, BuyVM)

Terraform - Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions.

PaaS - Platform as a Service

Heroku

Engine Yard

Cloud Foundry (VMware)

Overlay Network

Nebula Overlay Network

WireGuard (based)

Private Cloud

OpenStack - Rackspace

CloudStack / CloudPlatform - cloud.com => Citrix => Apache

Virtual Private Cloud (Network Virtualization)

AWS VPC

CloudStack/CloudPlatform VPC

Monitoring

Cloud Native / k8s - Prometheus (retired: Heapster + InfluxDB + Grafana)

sysstat
Cockpit
Netdata

Nagios
StatsD (node.js)
prometheus
Zabbix
Monit
Monitorix
Cacti
Munin
Ganglia
Graphite (with collectd, Diamond, statsd, gdash, ganglia)
MRTG

Application Performance Management - APM

New Relic

Server Density

Log Management / Analysis / Visualisation

Graylog2

logstash

logster (generate metrics from log files)

Elastic Stack (Elasticsearch + Kibana + Logstash + Beats)

Knowledge Management

Confluence

gollum

Dokuwiki

Issue Tracker (Bug Tracker)

JIRA

Redmine

trac

Discussion Forums

Discourse (Ruby on Rails, discussion forum for DevOps;-)

Vanilla Forums (PHP)

NOTE: This page is NOT complete. It is updated on a regular basis, comments are welcomed.
