1. Core SRE Responsibilities for MongoDB
Area                   SRE MongoDB Focus
---------------------  ------------------------------------------------------------
Availability           Implement HA with replica sets, design failover strategies, monitor replica health, and reduce MTTR.
Performance            Monitor query execution (explain()), optimize indexes, detect slow queries, and manage connection pools.
Capacity Planning      Forecast disk, CPU, and memory usage; decide on sharding strategies before reaching scaling limits.
Scalability            Set up and maintain sharded clusters with balanced chunk distribution.
Automation             Automate backup, restore, index maintenance, and scaling using Ansible/Terraform/Helm.
Monitoring & Alerting  Use Prometheus + Grafana, or Cloud Manager/Atlas monitoring to track metrics like replication lag, lock %, and page faults.
Reliability Testing    Run chaos tests (e.g., killing primaries, network partition simulations) to verify failover.
Security               Enforce authentication (SCRAM, x.509), TLS, IP whitelisting, and role-based access control.
Disaster Recovery      Design RPO/RTO-aligned backup and restore processes using Ops Manager/Atlas Snapshots or mongodump/mongorestore.
2. Key MongoDB Metrics SREs Monitor
Cluster Health
  Replica set state (PRIMARY, SECONDARY, ARBITER)
  Replication lag
  Oplog window size
Performance
  Query execution time
  opcounters (insert, query, update, delete rates)
  Page faults / Working set memory fit
Storage
  Disk space usage per database/collection
  WiredTiger cache usage
  Index size vs data size
Sharding
  Chunk distribution across shards
  Migration queue size
  Balancer state
3. Common MongoDB SRE Tasks
Failover Management: Ensure failover completes under SLA (< 15s typical in Atlas).
Schema Evolution Support: Work with teams to ensure changes won’t cause downtime.
Index Lifecycle: Create, rebuild, and drop indexes without impacting peak load.
Backup Strategy: Incremental + periodic full backups with restore drills.
Upgrade Planning: Rolling upgrades of MongoDB versions in a replica set.
Performance Tuning:
  Adjust the WiredTiger cache size (storage.wiredTiger.engineConfig.cacheSizeGB, or --wiredTigerCacheSizeGB on the command line)
  Optimize indexes and compound queries
  Avoid unbounded arrays in documents
Cost Optimization: Right-size instances and storage tiers.
4. Tools & Automation for MongoDB SRE
Provisioning: Terraform, Ansible, Helm (for Kubernetes MongoDB deployments)
Monitoring: Prometheus + Grafana, Datadog, MongoDB Cloud Manager/Atlas
Backups: Percona Backup for MongoDB, Ops Manager, AWS EBS snapshots
CI/CD: GitHub Actions, Jenkins pipelines for DB migrations and config changes
Chaos Testing: Gremlin, Chaos Mesh for MongoDB failover drills
5. SRE Playbooks for MongoDB
Incident Playbooks
  Replica set failover troubleshooting
  Slow query spikes
  Disk space alerts
  High replication lag
Change Management Playbooks
  Rolling restarts
  Index deployment during low load
  Config change rollback
MongoDB SRE Runbook
(Operational Playbook for High Availability & Reliability)
1. Quick Reference
Cluster Type: [Replica Set / Sharded Cluster]
Version: [MongoDB Version]
Deployment Mode: [Atlas / On-Prem / Kubernetes / Cloud VMs]
Primary Contact: [Name / Team]
SLA: [Availability %, RPO, RTO]
2. Monitoring & Alert Thresholds
Metric                  Tool                      Threshold        Action
----------------------  ------------------------  ---------------  ------------------------------------------------
Replication Lag         Prometheus / Atlas        > 10s            Check secondary health, network latency, oplog size.
Oplog Window            Prometheus / Ops Manager  < 1 hour         Increase oplog size or reduce write load.
Disk Usage              CloudWatch / Grafana      > 80%            Add storage or prune old data.
WiredTiger Cache Usage  Atlas / Grafana           > 85%            Increase RAM or tune indexes.
Connections             Grafana                   > 80% of max     Increase connection pool or investigate spikes.
Lock %                  Ops Manager               > 10% sustained  Identify and optimize heavy queries.
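The thresholds in the table above can be encoded as data so that a script or an alert test checks a metrics sample against them. The sketch below is one possible shape, not a prescribed implementation: the field names (replicationLagSeconds, oplogWindowHours, and so on) are hypothetical and would need to be mapped to whatever the monitoring stack actually exports. Lock % is left out because "sustained" needs a time window rather than a single sample.

```javascript
// Thresholds copied from the table above. The metric field names are
// hypothetical; map them to the names your exporter really uses.
const THRESHOLDS = [
  { metric: "replicationLagSeconds", breached: (v) => v > 10 },
  { metric: "oplogWindowHours", breached: (v) => v < 1 },
  { metric: "diskUsedPercent", breached: (v) => v > 80 },
  { metric: "cacheUsedPercent", breached: (v) => v > 85 },
  { metric: "connectionsPercentOfMax", breached: (v) => v > 80 },
];

// Returns the names of the metrics in `sample` that cross their threshold.
// Metrics missing from the sample are skipped instead of being guessed at.
function breachedMetrics(sample) {
  return THRESHOLDS
    .filter(({ metric }) => typeof sample[metric] === "number")
    .filter(({ metric, breached }) => breached(sample[metric]))
    .map(({ metric }) => metric);
}
```

For example, breachedMetrics({ replicationLagSeconds: 12, diskUsedPercent: 80 }) reports only the replication lag, because the disk threshold is strictly greater than 80%.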
3. Incident Response Playbooks
3.1 Replica Set Failover
Trigger: Primary goes down / election occurs.
Steps:
Verify failover completed — rs.status() shows new PRIMARY.
Check application logs for connection retries.
Review rs.printSecondaryReplicationInfo() for lag.
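If failover failed:
Check that a majority of voting members is reachable; no primary can be elected without one.
Restart affected nodes.
If a majority is permanently lost, force a reconfiguration from a surviving member with rs.reconfig(cfg, { force: true }) as a last resort.
Rollback Plan: Once the original primary is healthy and caught up as a secondary, run rs.stepDown() on the current primary during a quiet period to trigger a new election. If the original node must be preferred, raise its priority with rs.reconfig().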
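The first verification step of this playbook can be scripted. The following is a minimal sketch that assumes it is given the document returned by rs.status(); it relies only on the per-member name, stateStr, and health fields that rs.status() reports. The function name summarizeReplicaSet is made up for this example.

```javascript
// Summarizes an rs.status() result: which member is PRIMARY, whether
// exactly one primary exists, and which members report health !== 1.
function summarizeReplicaSet(status) {
  const members = status.members || [];
  const primaries = members.filter((m) => m.stateStr === "PRIMARY");
  const unhealthy = members.filter((m) => m.health !== 1).map((m) => m.name);
  return {
    primary: primaries.length === 1 ? primaries[0].name : null,
    failoverComplete: primaries.length === 1,
    unhealthy,
  };
}
```

In mongosh this would be called as summarizeReplicaSet(rs.status()). A result with failoverComplete set to false means the set currently has no primary from the point of view of the node that was queried.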
3.2 High Replication Lag
Trigger: Lag > 10s.
Steps:
rs.printSecondaryReplicationInfo() — identify lagging node.
Check CPU/disk bottlenecks on secondary.
Review oplog size — db.printReplicationInfo().
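If the oplog window is too small, resize the oplog online on each member (the --oplogSize startup option only applies when the oplog is first created). The size is given in megabytes:
    db.adminCommand({ replSetResizeOplog: 1, size: <MB> })
If network latency is high, check the link between the lagging member and its sync source, and consider placing the member closer to the primary.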
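To compare lag against the 10s trigger in a script rather than by reading printed output, lag can be derived from the optimeDate that rs.status() reports for each member. This is a sketch under that assumption; the function name is invented here, and the result is only as accurate as the status snapshot it is given.

```javascript
// Computes per-secondary replication lag in seconds from an rs.status()
// result by comparing each secondary's optimeDate with the primary's.
// Returns null when no primary is visible, since lag is undefined then.
function replicationLagSeconds(status) {
  const members = status.members || [];
  const primary = members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) return null;
  const lag = {};
  for (const m of members) {
    if (m.stateStr !== "SECONDARY") continue; // skip arbiters and down nodes
    const diffMs = primary.optimeDate.getTime() - m.optimeDate.getTime();
    lag[m.name] = Math.max(0, diffMs / 1000); // clamp small negative skew
  }
  return lag;
}
```

Any member whose value exceeds 10 matches the trigger for this playbook.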
3.3 Disk Space Running Low
Trigger: Disk usage > 80%.
Steps:
Find the largest databases and collections:
    db.stats()
    db.<collection>.stats()
Drop old or unused collections.
Compact storage to release free space (compact is I/O-heavy, so run it on one secondary at a time):
    db.runCommand({ compact: "<collection>" })
Increase storage volume size.
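When many collections are involved, ranking them by on-disk footprint shows where dropping or compacting will free the most space. The sketch below assumes an array of collection stats documents carrying the ns, storageSize, and totalIndexSize fields that collection stats output includes; the function name and the default limit of 5 are choices made for this example.

```javascript
// Ranks collections by storage plus index bytes, largest first.
// `statsDocs` is an array of collection stats documents.
function largestCollections(statsDocs, limit = 5) {
  return statsDocs
    .map((s) => ({ ns: s.ns, bytes: s.storageSize + s.totalIndexSize }))
    .sort((a, b) => b.bytes - a.bytes)
    .slice(0, limit);
}
```

In mongosh the input could be gathered with db.getCollectionNames().map((n) => db.getCollection(n).stats()), though on clusters with many collections that call itself adds load.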
3.4 Slow Query Spike
Trigger: p95 query latency > SLA.
Steps:
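Query the database profiler for operations slower than 100 ms (profiling must be enabled first, for example with db.setProfilingLevel(1, { slowms: 100 })):
    db.system.profile.find({ millis: { $gt: 100 } })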
Review indexes — db.collection.getIndexes().
Use explain("executionStats") to check plan.
Optimize query or add compound index.
Rebuild indexes in low-traffic window.
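Reading explain("executionStats") output usually comes down to two questions: did the plan fall back to a collection scan, and how many documents were examined per document returned. The following sketch answers both for an unsharded explain result. It assumes the standard queryPlanner.winningPlan and executionStats fields, and it also accepts the nested queryPlan form that newer explain versions use; sharded output has a different shape and is not handled. The function names are invented for this example.

```javascript
// Collects stage names from a plan tree. Plans nest through
// `inputStage` (single child) or `inputStages` (several children).
function planStages(plan) {
  if (!plan) return [];
  const children = plan.inputStages || (plan.inputStage ? [plan.inputStage] : []);
  return [plan.stage, ...children.flatMap(planStages)];
}

// Condenses an explain("executionStats") result into a few indicators.
function assessExplain(explain) {
  const winning = explain.queryPlanner.winningPlan;
  const root = winning.queryPlan || winning; // newer output nests the tree
  const stats = explain.executionStats;
  const returned = Math.max(stats.nReturned, 1); // avoid dividing by zero
  return {
    collectionScan: planStages(root).includes("COLLSCAN"),
    docsExaminedPerReturned: stats.totalDocsExamined / returned,
    executionTimeMillis: stats.executionTimeMillis,
  };
}
```

A collectionScan of true, or a docsExaminedPerReturned far above 1, suggests the query would benefit from a better-matching index.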
4. Maintenance Procedures
4.1 Rolling Upgrade
Start with secondaries — systemctl stop mongod, upgrade, restart.
Step down primary — rs.stepDown(60).
Upgrade former primary.
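Validate cluster health, then set the feature compatibility version with db.adminCommand({ setFeatureCompatibilityVersion: "<version>" }) once the new binaries have proven stable.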
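The ordering rule in this procedure (everything else first, the primary last) can be derived from rs.status() so that an automation script does not hard-code host names. This is only a sketch of the ordering step, with an invented function name; the stop, upgrade, and restart actions themselves are outside its scope.

```javascript
// Returns member names in upgrade order: non-primary members first
// (in their reported order), the current primary last.
function upgradeOrder(status) {
  const members = status.members || [];
  const rank = (m) => (m.stateStr === "PRIMARY" ? 1 : 0);
  // Array.prototype.sort is stable, so non-primaries keep their order.
  return [...members].sort((a, b) => rank(a) - rank(b)).map((m) => m.name);
}
```

The primary should still be stepped down with rs.stepDown() before its binary is replaced, and the order should be recomputed if an election happens mid-upgrade.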
4.2 Backup & Restore
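Backup:
Incremental daily + full weekly. mongodump produces full dumps only, so use it for the weekly full backup and take the daily incrementals with a tool that supports them, such as Ops Manager, Atlas, or Percona Backup for MongoDB. The --oplog flag captures writes made during the dump so the result is consistent to a single point in time:
    mongodump --archive=<backup-file> --gzip --oplog
Verify backup checksum.
Restore:
    mongorestore --archive=<backup-file> --gzip --oplogReplay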
Test restore quarterly to validate RPO/RTO.
5. Change Management
Apply changes via PR in mongo-config-repo.
Use apply --dry-run in staging first.
Implement during low-traffic hours.
Maintain rollback scripts.
6. Security Standards
Enforce TLS for all connections.
Enable SCRAM-SHA-256 authentication.
Grant least-privilege roles, such as read or readWrite, scoped to the databases each user needs.
Restrict network access with IP allowlists or security groups.
Enable audit logging (auditing is not available in MongoDB Community Edition):
    auditLog:
      destination: file
      format: BSON
      path: /var/log/mongodb/auditLog.bson
7. Automation
Provisioning: Terraform + Ansible.
Scaling: Kubernetes HPA or Atlas Auto-Scaling.
Monitoring: Prometheus + Grafana dashboard.
Chaos Testing: Gremlin or Chaos Mesh for failover drills.
8. Contacts & Escalation
Level  Contact         Response Time
-----  --------------  -------------
L1     On-call SRE     < 15 mins
L2     DB Architect    < 30 mins
L3     Vendor Support  < 2 hrs