1. Core SRE Responsibilities for MongoDB

Area                   SRE MongoDB Focus
---------------------  ------------------------------------------------------------------------------------------------------------
Availability           Implement HA with replica sets, design failover strategies, monitor replica health, and reduce MTTR.
Performance            Monitor query execution (explain()), optimize indexes, detect slow queries, and manage connection pools.
Capacity Planning      Forecast disk, CPU, and memory usage; decide on sharding strategies before reaching scaling limits.
Scalability            Set up and maintain sharded clusters with balanced chunk distribution.
Automation             Automate backup, restore, index maintenance, and scaling using Ansible/Terraform/Helm.
Monitoring & Alerting  Use Prometheus + Grafana, or Cloud Manager/Atlas monitoring, to track metrics like replication lag, lock %, and page faults.
Reliability Testing    Run chaos tests (e.g., killing primaries, network partition simulations) to verify failover.
Security               Enforce authentication (SCRAM, x.509), TLS, IP whitelisting, and role-based access control.
Disaster Recovery      Design RPO/RTO-aligned backup and restore processes using Ops Manager/Atlas Snapshots or mongodump/mongorestore.


2. Key MongoDB Metrics SREs Monitor

Cluster Health

Performance

Storage

Sharding


3. Common MongoDB SRE Tasks

4. Tools & Automation for MongoDB SRE


5. SRE Playbooks for MongoDB

MongoDB SRE Runbook

(Operational Playbook for High Availability & Reliability)

1. Quick Reference

Cluster Type: [Replica Set / Sharded Cluster]
Version: [MongoDB Version]
Deployment Mode: [Atlas / On-Prem / Kubernetes / Cloud VMs]
Primary Contact: [Name / Team]
SLA: [Availability %, RPO, RTO]


2. Monitoring & Alert Thresholds

Metric                  Tool                       Threshold        Action
----------------------  -------------------------  ---------------  -----------------------------------------------------
Replication Lag         Prometheus / Atlas         > 10s            Check secondary health, network latency, oplog size.
Oplog Window            Prometheus / Ops Manager   < 1 hour         Increase oplog size or reduce write load.
Disk Usage              CloudWatch / Grafana       > 80%            Add storage or prune old data.
WiredTiger Cache Usage  Atlas / Grafana            > 85%            Increase RAM or tune indexes.
Connections             Grafana                    > 80% of max     Increase connection pool or investigate spikes.
Lock %                  Ops Manager                > 10% sustained  Identify and optimize heavy queries.
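
The thresholds above can be expressed as data so that a single check maps a metrics sample to the matching actions. This is a minimal sketch: the metric names (replicationLagSec, diskUsagePct, and so on) and the evaluate() helper are hypothetical, not names exported by Prometheus, Atlas, or Ops Manager, and the "sustained" qualifier on Lock % is ignored here.

```javascript
// Thresholds and actions copied from the table above.
// The metric keys are hypothetical names chosen for this sketch.
const THRESHOLDS = [
  { metric: "replicationLagSec", breached: (v) => v > 10, action: "Check secondary health, network latency, oplog size." },
  { metric: "oplogWindowHours", breached: (v) => v < 1, action: "Increase oplog size or reduce write load." },
  { metric: "diskUsagePct", breached: (v) => v > 80, action: "Add storage or prune old data." },
  { metric: "wtCacheUsagePct", breached: (v) => v > 85, action: "Increase RAM or tune indexes." },
  { metric: "connectionsPctOfMax", breached: (v) => v > 80, action: "Increase connection pool or investigate spikes." },
  { metric: "lockPct", breached: (v) => v > 10, action: "Identify and optimize heavy queries." },
];

// Returns one alert for each metric that is present in the sample
// and on the wrong side of its threshold. Missing metrics are skipped.
function evaluate(sample) {
  return THRESHOLDS
    .filter((t) => typeof sample[t.metric] === "number" && t.breached(sample[t.metric]))
    .map((t) => ({ metric: t.metric, value: sample[t.metric], action: t.action }));
}

// Example sample: lag and disk usage are over their thresholds.
const alerts = evaluate({ replicationLagSec: 14, oplogWindowHours: 6, diskUsagePct: 82, lockPct: 3 });
for (const a of alerts) {
  console.log(`${a.metric}=${a.value}: ${a.action}`);
}
```

Keeping the thresholds in one array means the table and the alerting logic can be reviewed side by side when a threshold changes.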


3. Incident Response Playbooks

3.1 Replica Set Failover

Trigger: Primary goes down / election occurs.
Steps:

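Rollback Plan: Once the original primary is stable, it rejoins the set as a secondary (re-add it with rs.add() only if it was removed from the configuration). To return the primary role to it, wait until it has caught up, then run rs.stepDown() on the current primary; raise the original node's priority if it must win the election.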
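
After an election, a quick way to confirm the set's state is to summarize the members array that rs.status() returns. The sketch below is an illustration only: summarizeReplicaSet() is a hypothetical helper, it relies on the name, stateStr, and health fields of each member, and it assumes every member is a voting member when it checks for a majority.

```javascript
// Summarize an rs.status()-shaped document.
// Assumes each member has `name`, `stateStr`, and `health` (1 = reachable).
function summarizeReplicaSet(status) {
  const members = status.members || [];
  const primaries = members.filter((m) => m.stateStr === "PRIMARY");
  const healthy = members.filter((m) => m.health === 1);
  return {
    // Exactly one primary is the expected steady state.
    primary: primaries.length === 1 ? primaries[0].name : null,
    healthyCount: healthy.length,
    // A strict majority must be reachable to elect a primary
    // (assumption: all listed members vote).
    hasMajority: healthy.length > members.length / 2,
    unhealthy: members.filter((m) => m.health !== 1).map((m) => m.name),
  };
}

// Example: the old primary (db1) is down and db2 has been elected.
const sampleStatus = {
  members: [
    { name: "db1:27017", stateStr: "(not reachable/healthy)", health: 0 },
    { name: "db2:27017", stateStr: "PRIMARY", health: 1 },
    { name: "db3:27017", stateStr: "SECONDARY", health: 1 },
  ],
};
console.log(summarizeReplicaSet(sampleStatus));

// In mongosh, the same helper could be fed live data:
//   summarizeReplicaSet(rs.status())
```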


3.2 High Replication Lag

Trigger: Lag > 10s.
Steps:

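If the oplog window is too small, resize the oplog. The --oplogSize option takes a size in megabytes, not gigabytes, and only applies when the oplog is first created:

bash
mongod --oplogSize <MB>

On a running replica set member (MongoDB 3.6+ with WiredTiger), resize it online instead:

js
db.adminCommand({ replSetResizeOplog: 1, size: <MB> })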
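
To check the 10s trigger for this playbook, lag can be derived from rs.status() by comparing each secondary's optimeDate with the primary's. The following is a rough sketch: replicationLagSeconds() is a hypothetical helper, and it assumes the members array carries stateStr and a Date-valued optimeDate, as rs.status() output does.

```javascript
// Compute per-secondary replication lag (seconds) from an
// rs.status()-shaped document. Returns null when there is no primary,
// because lag is not meaningful while an election is in progress.
function replicationLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) return null;
  return status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => ({
      name: m.name,
      // Clamp at 0: small clock differences can make a secondary look "ahead".
      lagSec: Math.max(0, (primary.optimeDate.getTime() - m.optimeDate.getTime()) / 1000),
    }));
}

// Example with fixed timestamps: db3 is 25 seconds behind the primary.
const lagSample = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY", optimeDate: new Date("2024-01-01T00:00:30Z") },
    { name: "db2:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T00:00:28Z") },
    { name: "db3:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T00:00:05Z") },
  ],
};
const lagging = replicationLagSeconds(lagSample).filter((m) => m.lagSec > 10);
console.log(lagging.map((m) => m.name));
```

In mongosh, rs.printSecondaryReplicationInfo() reports similar information without any custom code; the helper is mainly useful when the result has to feed an alert or a script.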



3.3 Disk Space Running Low

Trigger: Disk usage > 80%.
Steps:

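Check database and collection sizes:

js
db.stats()
db.<collection>.stats()


Compact storage (run on one secondary at a time; compact can block operations on the collection, depending on the MongoDB version):

js
db.runCommand({ compact: "<collection>" })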
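
Before running compact, it helps to estimate how much space it could give back. One hedged approach is to read the WiredTiger block-manager statistic "file bytes available for reuse" from the collection stats and compare it with storageSize. In this sketch, reclaimableBytes() and worthCompacting() are hypothetical helpers, and the 20% cut-off is an arbitrary assumption, not a MongoDB recommendation.

```javascript
// Bytes inside the collection's data file that WiredTiger reports as reusable.
// Reads stats.wiredTiger["block-manager"]["file bytes available for reuse"];
// returns 0 if the statistic is absent.
function reclaimableBytes(stats) {
  const blockManager = (stats.wiredTiger && stats.wiredTiger["block-manager"]) || {};
  const reusable = blockManager["file bytes available for reuse"];
  return typeof reusable === "number" ? reusable : 0;
}

// True when the reusable share of on-disk storage reaches minFraction.
// minFraction = 0.2 is an assumed default for illustration only.
function worthCompacting(stats, minFraction = 0.2) {
  if (!stats.storageSize) return false;
  return reclaimableBytes(stats) / stats.storageSize >= minFraction;
}

// Example: 400 MB of a 1000 MB file is reusable, so compact is likely worthwhile.
const MB = 1024 * 1024;
const sampleStats = {
  storageSize: 1000 * MB,
  wiredTiger: { "block-manager": { "file bytes available for reuse": 400 * MB } },
};
console.log(worthCompacting(sampleStats));

// In mongosh: worthCompacting(db.<collection>.stats())
```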



3.4 Slow Query Spike

Trigger: p95 query latency > SLA.
Steps:

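Query the database profiler for operations slower than 100 ms (profiling must be enabled first, e.g. db.setProfilingLevel(1, { slowms: 100 })):

js
db.system.profile.find({ millis: { $gt: 100 } }).sort({ ts: -1 })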
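
Raw profiler output is easier to act on once it is grouped. The sketch below groups profiler documents by namespace and plan summary so that collection scans on hot namespaces stand out. summarizeSlowOps() is a hypothetical helper; it assumes each document carries the ns, millis, and planSummary fields that system.profile entries normally include.

```javascript
// Group profiler documents by namespace + plan summary and rank the
// groups by total time spent, highest first.
function summarizeSlowOps(profileDocs) {
  const groups = new Map();
  for (const doc of profileDocs) {
    const plan = doc.planSummary || "unknown";
    const key = `${doc.ns} ${plan}`;
    const entry = groups.get(key) || { ns: doc.ns, planSummary: plan, count: 0, totalMillis: 0, maxMillis: 0 };
    entry.count += 1;
    entry.totalMillis += doc.millis;
    entry.maxMillis = Math.max(entry.maxMillis, doc.millis);
    groups.set(key, entry);
  }
  return [...groups.values()].sort((a, b) => b.totalMillis - a.totalMillis);
}

// Example: two collection scans on shop.orders dominate the slow operations.
const sampleProfile = [
  { ns: "shop.orders", planSummary: "COLLSCAN", millis: 900 },
  { ns: "shop.users", planSummary: "IXSCAN { email: 1 }", millis: 150 },
  { ns: "shop.orders", planSummary: "COLLSCAN", millis: 700 },
];
console.log(summarizeSlowOps(sampleProfile));

// In mongosh:
//   summarizeSlowOps(db.system.profile.find({ millis: { $gt: 100 } }).toArray())
```

A group whose planSummary is COLLSCAN is usually the first candidate for a new index; confirming with explain() before creating one is still advisable.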



4. Maintenance Procedures

4.1 Rolling Upgrade


4.2 Backup & Restore

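Backup:

Full weekly plus incremental daily. Note that mongodump always takes a full dump; --oplog only records writes made during the dump so that the result is point-in-time consistent. The daily incrementals therefore need Ops Manager/Atlas snapshots or a separate oplog archive. Without a file name, --archive writes to stdout, so name the archive file explicitly:
mongodump --archive=<file> --gzip --oplog

Restore:


mongorestore --archive=<file> --gzip --oplogReplay


Test restore quarterly to validate RPO/RTO.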
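
Between restore tests, a simple freshness check can flag when the newest backup is already older than the RPO allows. This is a minimal sketch under the assumption that backups are the only recovery source; backupWithinRpo() and its parameters are hypothetical, and the timestamp of the last successful backup has to come from whatever tool produces the archives.

```javascript
// Compare the age of the most recent successful backup with the RPO.
// lastBackupAt and now are Date objects; rpoHours is the agreed RPO in hours.
function backupWithinRpo(lastBackupAt, rpoHours, now = new Date()) {
  const ageHours = (now.getTime() - lastBackupAt.getTime()) / (3600 * 1000);
  return { ageHours, ok: ageHours <= rpoHours };
}

// Example with fixed timestamps: a 20-hour-old backup against a 24-hour RPO.
const check = backupWithinRpo(
  new Date("2024-01-01T00:00:00Z"),
  24,
  new Date("2024-01-01T20:00:00Z")
);
console.log(check.ok ? "backup is within RPO" : "backup is older than RPO");
```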


5. Change Management


6. Security Standards

Audit log enabled:
yaml
auditLog:
  destination: file
  format: BSON
  path: /var/log/mongodb/auditLog.bson


7. Automation


8. Contacts & Escalation

Level  Contact         Response Time
-----  --------------  -------------
L1     On-call SRE     < 15 mins
L2     DB Architect    < 30 mins
L3     Vendor Support  < 2 hrs