1. Core SRE Responsibilities for MongoDB

Area | SRE MongoDB Focus
----------------------------------------------------------------------------------------------------------------------------------
Availability | Implement HA with replica sets, design failover strategies, monitor replica health, and reduce MTTR.
Performance | Monitor query execution (explain()), optimize indexes, detect slow queries, and manage connection pools.
Capacity Planning | Forecast disk, CPU, and memory usage; decide on sharding strategies before reaching scaling limits.
Scalability | Set up and maintain sharded clusters with balanced chunk distribution.
Automation | Automate backup, restore, index maintenance, and scaling using Ansible/Terraform/Helm.
Monitoring & Alerting | Use Prometheus + Grafana, or Cloud Manager/Atlas monitoring to track metrics like replication lag, lock %, and page faults.
Reliability Testing | Run chaos tests (e.g., killing primaries, network partition simulations) to verify failover.
Security | Enforce authentication (SCRAM, x.509), TLS, IP whitelisting, and role-based access control.
Disaster Recovery | Design RPO/RTO-aligned backup and restore processes using Ops Manager/Atlas Snapshots or mongodump/mongorestore.

2. Key MongoDB Metrics SREs Monitor

Cluster Health

Replica set state (PRIMARY, SECONDARY, ARBITER)
Replication lag
Oplog window size
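
To make the oplog window metric concrete, here is a minimal sketch of how it relates to replication lag. The function names are hypothetical, and the two timestamps are assumed to be the oldest and newest oplog entry times as reported by db.printReplicationInfo().

```python
from datetime import datetime, timedelta


def oplog_window(first_entry: datetime, last_entry: datetime) -> timedelta:
    """Time span covered by the oplog: newest entry minus oldest entry."""
    if last_entry < first_entry:
        raise ValueError("last_entry precedes first_entry")
    return last_entry - first_entry


def can_catch_up(window: timedelta, secondary_lag: timedelta) -> bool:
    """A secondary that lags by more than the window can no longer
    replay the oplog and needs a full resync."""
    return secondary_lag < window
```

A window of 6.5 hours with a secondary 5 minutes behind is healthy; the same window with a secondary 7 hours behind is not.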

Performance

Query execution time
opcounters (insert, query, update, delete rates)
Page faults / Working set memory fit

Storage

Disk space usage per database/collection
WiredTiger cache usage
Index size vs data size

Sharding

Chunk distribution across shards
Migration queue size
Balancer state
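
As an illustration of what "chunk distribution" means as a number, the sketch below summarizes per-shard chunk counts. The function name and input shape are assumptions; the counts would come from your own tooling (for example, output parsed from sh.status()). The same arithmetic works if you feed it bytes per shard instead of chunk counts.

```python
def chunk_imbalance(chunks_per_shard: dict) -> dict:
    """Summarize how evenly chunks are spread across shards."""
    if not chunks_per_shard:
        raise ValueError("no shards given")
    most = max(chunks_per_shard, key=chunks_per_shard.get)
    least = min(chunks_per_shard, key=chunks_per_shard.get)
    total = sum(chunks_per_shard.values())
    return {
        "most_loaded": most,
        "least_loaded": least,
        # Absolute gap between the fullest and emptiest shard.
        "spread": chunks_per_shard[most] - chunks_per_shard[least],
        # Fraction of all chunks held by each shard.
        "share": {
            shard: (count / total if total else 0.0)
            for shard, count in chunks_per_shard.items()
        },
    }
```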

3. Common MongoDB SRE Tasks

Failover Management: Ensure failover completes under SLA (< 15s typical in Atlas).
Schema Evolution Support: Work with teams to ensure changes won’t cause downtime.
Index Lifecycle: Create, rebuild, and drop indexes without impacting peak load.
Backup Strategy: Incremental + periodic full backups with restore drills.
Upgrade Planning: Rolling upgrades of MongoDB versions in a replica set.
Performance Tuning:
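- Adjust the WiredTiger cache size (storage.wiredTiger.engineConfig.cacheSizeGB in the config file, or --wiredTigerCacheSizeGB on the command line)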
- Optimize indexes and compound queries
- Avoid unbounded arrays in documents
Cost Optimization: Right-size instances and storage tiers.
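
When deciding whether to adjust the WiredTiger cache size at all, it helps to know the default. The sketch below encodes the default documented for recent MongoDB releases: the larger of 50% of (RAM minus 1 GB) and 256 MB. The function name is hypothetical; check the documentation for your version before relying on the number.

```python
def default_wiredtiger_cache_gb(ram_gb: float) -> float:
    """Default WiredTiger cache size in GB for a host with ram_gb of RAM,
    assuming the documented rule max(50% of (RAM - 1 GB), 256 MB)."""
    return max(0.5 * (ram_gb - 1.0), 0.25)
```

For a 16 GB host this gives 7.5 GB, which is the baseline to compare against before overriding the setting.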

4. Tools & Automation for MongoDB SRE

Provisioning: Terraform, Ansible, Helm (for Kubernetes MongoDB deployments)
Monitoring: Prometheus + Grafana, Datadog, MongoDB Cloud Manager/Atlas
Backups: Percona Backup for MongoDB, Ops Manager, AWS EBS snapshots
CI/CD: GitHub Actions, Jenkins pipelines for DB migrations and config changes
Chaos Testing: Gremlin, Chaos Mesh for MongoDB failover drills

5. SRE Playbooks for MongoDB

Incident Playbooks
- Replica set failover troubleshooting
- Slow query spikes
- Disk space alerts
- High replication lag
Change Management Playbooks
- Rolling restarts
- Index deployment during low load
- Config change rollback

MongoDB SRE Runbook

(Operational Playbook for High Availability & Reliability)

1. Quick Reference

Cluster Type: [Replica Set / Sharded Cluster]
Version: [MongoDB Version]
Deployment Mode: [Atlas / On-Prem / Kubernetes / Cloud VMs]
Primary Contact: [Name / Team]
SLA: [Availability %, RPO, RTO]

2. Monitoring & Alert Thresholds

Metric | Tool | Threshold | Action
----------------------------------------------------------------------------------------------------------------------------------
Replication Lag | Prometheus / Atlas | > 10s | Check secondary health, network latency, oplog size.
Oplog Window | Prometheus / Ops Manager | < 1 hour | Increase oplog size or reduce write load.
Disk Usage | CloudWatch / Grafana | > 80% | Add storage or prune old data.
WiredTiger Cache Usage | Atlas / Grafana | > 85% | Increase RAM or tune indexes.
Connections | Grafana | > 80% of max | Investigate spikes and connection leaks; cap client pool sizes or raise the server connection limit.
Lock % | Ops Manager | > 10% sustained | Identify and optimize heavy queries.
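
The thresholds in the table above can be expressed as data and checked in one place. This is a minimal sketch, not a replacement for alert rules in Prometheus or Atlas; the metric key names are hypothetical, and Lock % is left out because "sustained" needs a time window rather than a single sample.

```python
# (comparison, limit) per metric, copied from the threshold table.
THRESHOLDS = {
    "replication_lag_s": (">", 10),
    "oplog_window_h": ("<", 1),
    "disk_used_pct": (">", 80),
    "wt_cache_used_pct": (">", 85),
    "connections_pct_of_max": (">", 80),
}


def breached(metrics: dict) -> list:
    """Return one message per metric that crosses its threshold.
    Metrics missing from the input are skipped."""
    alerts = []
    for name, (op, limit) in THRESHOLDS.items():
        if name not in metrics:
            continue
        value = metrics[name]
        if (op == ">" and value > limit) or (op == "<" and value < limit):
            alerts.append(f"{name}={value} breaches {op} {limit}")
    return alerts
```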

3. Incident Response Playbooks

3.1 Replica Set Failover

Trigger: Primary goes down / election occurs.
Steps:

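Verify failover completed: rs.status() shows a new PRIMARY.
Check application logs for connection retries.
Review rs.printSecondaryReplicationInfo() for lag.
If failover failed:
- Confirm that a majority of voting members is reachable; without one, no primary can be elected.
- Restart affected nodes.
- As a last resort, force a reconfiguration from a surviving member: rs.reconfig(cfg, { force: true }).

Rollback Plan: Once the original primary is healthy and caught up, it rejoins as a secondary on its own. To hand the primary role back, run rs.stepDown() on the current primary (give the original node the highest priority first if it must win the election).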
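
The "verify failover completed" step can be scripted against the document returned by replSetGetStatus (the command behind rs.status()). The sketch below is a pure function over that document and relies only on the members[].name, stateStr, and health fields; the function name is hypothetical.

```python
def failover_summary(status: dict) -> dict:
    """Report whether exactly one PRIMARY exists and which members are unhealthy."""
    members = status.get("members", [])
    primaries = [m["name"] for m in members if m.get("stateStr") == "PRIMARY"]
    # health is 1 for a reachable member and 0 otherwise.
    unhealthy = [m["name"] for m in members if m.get("health") != 1]
    return {
        "primary": primaries[0] if len(primaries) == 1 else None,
        "unhealthy": unhealthy,
        "ok": len(primaries) == 1,
    }
```

With pymongo, the input could be obtained via client.admin.command("replSetGetStatus"), assuming a connected MongoClient named client.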

3.2 High Replication Lag

Trigger: Lag > 10s.
Steps:

rs.printSecondaryReplicationInfo() — identify lagging node.
Check CPU/disk bottlenecks on secondary.
Review oplog size — db.printReplicationInfo().

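If the oplog window is too small, resize the oplog on each member (size in megabytes; MongoDB 3.6+ with WiredTiger):

db.adminCommand({ replSetResizeOplog: 1, size: <MB> })

The mongod --oplogSize option (also in megabytes) only takes effect when the oplog is first created.

If network latency is high, check the link between the lagging secondary and its sync source, and consider placing the member closer to the primary.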
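
To identify the lagging node programmatically, the lag can be derived from the same status document that rs.status() prints: each member's optimeDate is compared with the primary's. A minimal sketch, assuming optimeDate values are datetime objects (as pymongo returns them); the function names are hypothetical and the 10-second default mirrors the trigger above.

```python
def replication_lag_seconds(status: dict) -> dict:
    """Map each SECONDARY's name to its lag behind the PRIMARY, in seconds."""
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        raise ValueError("no PRIMARY in replica set status")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def lagging_members(status: dict, threshold_s: float = 10.0) -> list:
    """Names of secondaries whose lag exceeds threshold_s, sorted by name."""
    lags = replication_lag_seconds(status)
    return sorted(name for name, lag in lags.items() if lag > threshold_s)
```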

3.3 Disk Space Running Low

Trigger: Disk usage > 80%.
Steps:

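Check database and collection sizes:
db.stats()
db.<collection>.stats()

Drop old or unused collections.

Compact storage (compact is not replicated, so run it on each member separately, secondaries first):
db.runCommand({ compact: "<collection>" })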

Increase storage volume size.
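
To judge how urgent a disk alert is, a rough forecast helps. The sketch below extrapolates linearly from the first and last usage samples; real growth is rarely linear, so treat the result as an estimate. The function name and sample format are assumptions.

```python
from typing import List, Optional, Tuple


def days_until_full(samples: List[Tuple[float, float]],
                    capacity_gb: float) -> Optional[float]:
    """Estimate days until the volume fills.

    samples: (day, used_gb) pairs ordered by day.
    Returns None when usage is flat or shrinking.
    """
    (d0, u0), (d1, u1) = samples[0], samples[-1]
    if d1 <= d0:
        raise ValueError("need at least two samples on different days")
    growth_per_day = (u1 - u0) / (d1 - d0)
    if growth_per_day <= 0:
        return None
    return (capacity_gb - u1) / growth_per_day
```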

3.4 Slow Query Spike

Trigger: p95 query latency > SLA.
Steps:

Query the profiler collection (profiling must be enabled first, e.g. db.setProfilingLevel(1, { slowms: 100 })):
db.system.profile.find({ millis: { $gt: 100 } })

Review indexes — db.collection.getIndexes().
Use explain("executionStats") to check plan.
Optimize query or add compound index.
Rebuild indexes in low-traffic window.
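
The triage in the steps above can be partly automated by scanning profiler documents for two common signs of a missing index: a COLLSCAN plan and a high ratio of documents examined to documents returned. A minimal sketch over dicts shaped like system.profile entries (fields ns, millis, planSummary, docsExamined, nreturned); the function name and the default ratio of 100 are assumptions.

```python
def flag_slow_ops(profile_docs, min_millis=100, max_scan_ratio=100.0):
    """Return slow operations that look index-starved, slowest first."""
    flagged = []
    for doc in profile_docs:
        millis = doc.get("millis", 0)
        if millis <= min_millis:
            continue
        examined = doc.get("docsExamined", 0)
        returned = max(doc.get("nreturned", 0), 1)  # avoid division by zero
        reasons = []
        if str(doc.get("planSummary", "")).startswith("COLLSCAN"):
            reasons.append("collection scan")
        if examined / returned > max_scan_ratio:
            reasons.append("high docsExamined/nreturned")
        if reasons:
            flagged.append({"ns": doc.get("ns"), "millis": millis, "reasons": reasons})
    return sorted(flagged, key=lambda f: f["millis"], reverse=True)
```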

4. Maintenance Procedures

4.1 Rolling Upgrade

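Upgrade secondaries one at a time: systemctl stop mongod, upgrade the binaries, restart, and wait for the member to return to SECONDARY.
Step down the primary: rs.stepDown(60).
Upgrade the former primary.
Validate cluster health with rs.status().
For a major-version upgrade, raise the feature compatibility version with setFeatureCompatibilityVersion only after the new binaries have proven stable.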
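
The ordering of the rolling upgrade (every non-primary member first, then step down and upgrade the primary) can be derived from the replica set status document. A minimal sketch with a hypothetical function name; it only produces the plan and does not execute anything.

```python
def upgrade_order(status: dict) -> list:
    """Return the rolling-upgrade steps: non-primary members first, primary last."""
    members = status["members"]
    primaries = [m["name"] for m in members if m["stateStr"] == "PRIMARY"]
    if len(primaries) != 1:
        raise ValueError("expected exactly one PRIMARY before upgrading")
    others = sorted(m["name"] for m in members if m["stateStr"] != "PRIMARY")
    steps = [f"upgrade {name}" for name in others]
    steps.append(f"step down {primaries[0]}")
    steps.append(f"upgrade {primaries[0]}")
    return steps
```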

4.2 Backup & Restore

Backup:

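Schedule: incremental daily + full weekly. mongodump only produces full dumps; for the incremental part, use a tool that supports it, such as Ops Manager or Percona Backup for MongoDB.

Full dump, including oplog entries written while the dump runs:
mongodump --archive=<backup-file> --gzip --oplog

Verify backup checksum.

Restore:

mongorestore --archive=<backup-file> --gzip --oplogReplay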

Test restore quarterly to validate RPO/RTO.
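
The "verify backup checksum" step could look like the sketch below, assuming a SHA-256 digest was recorded when the archive was written. The function names are hypothetical; the archive is read in chunks so large backups do not need to fit in memory.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hex SHA-256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_backup(path: str, expected_hex: str) -> bool:
    """True when the archive's digest matches the recorded one."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A matching checksum only shows the archive is intact; the restore drill is still what validates RPO/RTO.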

5. Change Management

Apply changes via PR in mongo-config-repo.
Use apply --dry-run in staging first.
Implement during low-traffic hours.
Maintain rollback scripts.

6. Security Standards

Enforce TLS (SSL) for all connections.
Enable SCRAM-SHA-256 authentication.
Grant least-privilege roles (read or readWrite) as needed.
Restrict network access with IP allowlists / security groups.

Audit log enabled (mongod.conf, YAML):
auditLog:
  destination: file
  format: BSON
  path: /var/log/mongodb/auditLog.bson
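
These standards can be checked against a parsed mongod configuration. The sketch below takes the config as a nested dict (for example, the result of loading mongod.conf with a YAML parser) and uses the standard option names net.tls.mode, security.authorization, and auditLog.destination; the function name is hypothetical, and role and network restrictions are outside what the config file can show.

```python
def security_gaps(config: dict) -> list:
    """List which of the security standards the given mongod config misses."""
    gaps = []
    if config.get("net", {}).get("tls", {}).get("mode") != "requireTLS":
        gaps.append("net.tls.mode is not requireTLS")
    if config.get("security", {}).get("authorization") != "enabled":
        gaps.append("security.authorization is not enabled")
    if "destination" not in config.get("auditLog", {}):
        gaps.append("auditLog.destination is not set")
    return gaps
```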

7. Automation

Provisioning: Terraform + Ansible.
Scaling: Kubernetes HPA or Atlas Auto-Scaling.
Monitoring: Prometheus + Grafana dashboard.
Chaos Testing: Gremlin or Chaos Mesh for failover drills.

8. Contacts & Escalation

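Level | Contact | Response Time
----------------------------------------------------------------------------------------------------------------------------------
L1 | On-call SRE | < 15 mins
L2 | DB Architect | < 30 mins
L3 | Vendor Support | < 2 hrs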