Check cluster, replica set, and sharded cluster status in Ops Manager
Review alerts for node down, replication lag, election events, or disk pressure
Verify all agents (Automation, Monitoring, Backup) are running and healthy; in Ops Manager 4.2 and later these are functions of the single MongoDB Agent
Ops Manager use: Real-time dashboards, alerting, and topology view
Review key metrics: CPU, memory, disk I/O, network, connections
Analyze slow queries and query execution plans
Watch cache usage (WiredTiger cache) and eviction rates
Ops Manager use: Performance Advisor, Query Profiler, Metrics Explorer
Ensure scheduled backups completed successfully
Validate snapshot retention and storage usage
Perform periodic restore tests (to staging or test cluster)
Ops Manager use: Snapshot backups, point-in-time recovery, restore workflows
Confirm the actual state matches the desired state (no configuration drift)
Review recent automation changes or deployments
Safely apply config changes (storage, parameters, version upgrades)
Ops Manager use: Automation Agent, versioned configuration management
Review database users and roles
Rotate credentials if required
Verify TLS, authentication, and authorization settings
Ops Manager use: Centralized user management, security configuration tracking
Track data growth trends
Monitor disk utilization and index sizes
Plan scale-up or scale-out (add nodes or shards)
Ops Manager use: Historical metrics, capacity graphs
Review triggered alerts and acknowledgments
Tune alert thresholds to avoid noise
Investigate recurring alerts and apply fixes
Review index usage and unused indexes
Apply recommendations from Performance Advisor
Coordinate index builds (rolling builds for large collections; the foreground vs background distinction applies only to MongoDB versions before 4.2)
Ops Manager use: Index suggestions, impact analysis
Check MongoDB logs for warnings or errors
Correlate logs with performance spikes or failures
Investigate issues like replication lag, step-downs, OOM events
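The log review above can be partly scripted. The following is a minimal sketch, assuming MongoDB 4.4 or later, where each log line is a JSON document with a severity field `s` (`F`, `E`, `W`, `I`, ...), a component field `c`, and a timestamp under `t.$date`. The function name `problemLogEntries` and the sample lines are made up for illustration.

```javascript
// Severity codes of interest in structured logs: F (fatal), E (error), W (warning).
const PROBLEM_SEVERITIES = new Set(["F", "E", "W"]);

// Takes an array of raw log lines and returns only warnings and worse.
function problemLogEntries(rawLines) {
  const entries = [];
  for (const line of rawLines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch (err) {
      continue; // skip lines that are not JSON (for example, older plain-text logs)
    }
    if (entry && PROBLEM_SEVERITIES.has(entry.s)) {
      entries.push({
        time: entry.t && entry.t.$date,
        severity: entry.s,
        component: entry.c,
        msg: entry.msg,
      });
    }
  }
  return entries;
}

// Illustrative input only; these are not real log messages.
const sampleLines = [
  '{"t":{"$date":"2024-01-01T00:00:00.000+00:00"},"s":"I","c":"NETWORK","msg":"sample info"}',
  '{"t":{"$date":"2024-01-01T00:00:05.000+00:00"},"s":"W","c":"REPL","msg":"sample warning"}',
];
console.log(problemLogEntries(sampleLines));
```

The timestamps returned here are what you would line up against Ops Manager metric graphs when correlating logs with a performance spike.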
Review access logs and audit events
Ensure backup policies meet compliance requirements
Document operational changes
Shard: a replica set that stores a subset of the data
Config servers (CSRS): store metadata about chunks and shard keys
mongos: query router that directs client requests to the right shard(s)
Poor shard key
Balancer disabled
Jumbo chunks
Check how data is distributed across shards:
db.collection.getShardDistribution()
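To turn the distribution check into a number you can discuss, here is a minimal sketch that computes each shard's share of documents. The input shape (a plain object of document counts per shard) and the function name `shardShares` are assumptions for illustration; you would copy the counts from the `getShardDistribution()` output.

```javascript
// Hypothetical input: document counts per shard, copied by hand from the
// output of db.collection.getShardDistribution().
function shardShares(docCounts) {
  const total = Object.values(docCounts).reduce((sum, n) => sum + n, 0);
  if (total === 0) return {};
  const shares = {};
  for (const [shard, count] of Object.entries(docCounts)) {
    // Percentage of all documents held by this shard, to one decimal place.
    shares[shard] = Math.round((count / total) * 1000) / 10;
  }
  return shares;
}

// Illustrative numbers: one shard holding most of the data suggests a poor
// shard key, a disabled balancer, or jumbo chunks.
console.log(shardShares({ shard0: 900, shard1: 50, shard2: 50 }));
```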
The following maps MongoDB Ops Manager commands to SRE and DBA role expectations, framed the way interview panels think about responsibility. It is practical and production-oriented so you can confidently explain what you ran and why.
Expectation:
Ensure clusters are always healthy and issues are detected early.
Commands / actions
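systemctl status mongodb-mms-automation-agent   # MongoDB Agent (the service keeps the Automation Agent name)
systemctl status mongodb-mms-monitoring-agent   # legacy standalone Monitoring Agent, before Ops Manager 4.2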
rs.status()
rs.printSecondaryReplicationInfo()
What this proves
You understand monitoring dependencies (agents first)
You know how to detect replication lag and node failures
Interview phrasing
“From an SRE point of view, my priority is cluster availability and replication health, which I validate through Ops Manager alerts and replica set status.”
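Replication lag can also be derived directly from the `rs.status()` document. The sketch below assumes the `members[].stateStr`, `members[].name`, and `members[].optimeDate` fields that `rs.status()` returns, with `optimeDate` as a JavaScript `Date`. The function name `replicationLagSeconds` is made up for illustration.

```javascript
// Computes, for every SECONDARY, how many seconds its last applied
// operation trails the primary's last applied operation.
function replicationLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) return null; // no primary visible: election in progress or majority lost
  const lag = {};
  for (const m of status.members) {
    if (m.stateStr !== "SECONDARY") continue;
    lag[m.name] = (primary.optimeDate.getTime() - m.optimeDate.getTime()) / 1000;
  }
  return lag;
}

// In mongosh this could be called as: replicationLagSeconds(rs.status())
```

A `null` result is itself a signal worth alerting on, since it means the node you queried cannot see a primary.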
Expectation:
Respond quickly, identify root cause, and restore service.
Commands
db.currentOp({ "secs_running": { $gt: 5 } })
tail -f /var/log/mongodb/mongod.log
db.serverStatus().wiredTiger.cache
What this proves
You can correlate performance spikes with queries and system resources
You understand memory pressure and cache eviction issues
Interview phrasing
“During incidents, I correlate Ops Manager metrics with logs and current operations to identify whether the issue is query-related, memory pressure, or infrastructure.”
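To judge memory pressure from the `db.serverStatus().wiredTiger.cache` output, a minimal sketch like the following reduces it to two percentages. It assumes the statistic names `maximum bytes configured`, `bytes currently in the cache`, and `tracked dirty bytes in the cache`, and that the values arrive as plain numbers. The function name `cachePressure` is made up for illustration.

```javascript
// Summarizes WiredTiger cache fill and dirty ratios as whole percentages
// of the configured cache size.
function cachePressure(cache) {
  const max = Number(cache["maximum bytes configured"]);
  const used = Number(cache["bytes currently in the cache"]);
  const dirty = Number(cache["tracked dirty bytes in the cache"]);
  return {
    usedPct: Math.round((used / max) * 100),
    dirtyPct: Math.round((dirty / max) * 100),
  };
}

// In mongosh this could be called as:
//   cachePressure(db.serverStatus().wiredTiger.cache)
```

Which percentages count as a problem depends on the workload. Sustained high fill together with a rising dirty ratio is the usual pattern behind eviction stalls, so it is worth tracking both rather than either one alone.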
Expectation:
Ensure MongoDB runs efficiently at scale.
Commands
db.setProfilingLevel(1, { slowms: 100 })
db.getProfilingStatus()
db.collection.aggregate([{ $indexStats: {} }])
What this proves
You know how to identify slow queries safely
You don’t drop indexes blindly
Interview phrasing
“As a DBA, I rely on Ops Manager Performance Advisor and indexStats before making any schema or index changes.”
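The `$indexStats` output can be filtered for candidates before anyone proposes dropping an index. This is a minimal sketch, assuming each result document has `name` and `accesses.ops` / `accesses.since` as `$indexStats` returns them. The helper names are made up for illustration.

```javascript
// accesses.ops may be a plain number or a Long-like object with toNumber().
function toPlainNumber(value) {
  return value && typeof value.toNumber === "function" ? value.toNumber() : Number(value);
}

// Returns indexes with zero recorded accesses, excluding the mandatory _id index.
function unusedIndexes(indexStats) {
  return indexStats
    .filter((s) => s.name !== "_id_" && toPlainNumber(s.accesses.ops) === 0)
    .map((s) => ({ name: s.name, since: s.accesses.since }));
}

// In mongosh this could be called as:
//   unusedIndexes(db.collection.aggregate([{ $indexStats: {} }]).toArray())
```

The counters are kept per node and reset when `mongod` restarts, so an index that looks unused should be checked on every replica set member and against the `since` date before it is treated as a drop candidate.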
Expectation:
Data must be recoverable at any time.
Ops Manager API
GET /api/public/v1.0/groups/{GROUP-ID}/clusters/{CLUSTER-ID}/snapshots
POST /api/public/v1.0/groups/{GROUP-ID}/clusters/{CLUSTER-ID}/restoreJobs
What this proves
You understand RPO/RTO concepts
You test restores, not just backups
Interview phrasing
“We verify backups daily and periodically restore snapshots to staging to validate recoverability.”
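Daily backup verification can be reduced to one question: how old is the newest complete snapshot? The sketch below works on an already-fetched response from the snapshots endpoint above. It assumes the response has a `results` array whose items carry a `complete` flag and a `created.date` timestamp; treat those field names as assumptions and check them against the API reference for your Ops Manager version. The function name is made up for illustration.

```javascript
// Returns the age in hours of the newest complete snapshot, or null if
// the response contains no complete snapshot at all.
function newestCompleteSnapshotAgeHours(response, now) {
  const times = (response.results || [])
    .filter((s) => s.complete === true)
    .map((s) => new Date(s.created.date).getTime());
  if (times.length === 0) return null;
  return (now.getTime() - Math.max(...times)) / 3600000;
}
```

Comparing the returned age with the snapshot interval gives a simple pass or fail check, and a `null` result should be treated as a failure rather than as "nothing to report".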
Expectation:
Make changes safely with zero or minimal downtime.
Actions
Rolling restarts via Ops Manager
Version upgrades using Automation Agent
Validation
db.version()
What this proves
You follow controlled change processes
You avoid manual restarts in production
Interview phrasing
“All production changes go through Ops Manager automation to avoid configuration drift and ensure safe rollouts.”
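After a rolling upgrade, the `db.version()` check is only meaningful if every node is checked. A minimal sketch of that comparison follows, assuming you have collected the `db.version()` result from each node into a plain object keyed by host. The host names, version strings, and function name are hypothetical.

```javascript
// Returns the nodes whose reported version differs from the expected one.
function versionMismatches(nodeVersions, expected) {
  return Object.entries(nodeVersions)
    .filter(([, version]) => version !== expected)
    .map(([host, version]) => ({ host, version }));
}

// Hypothetical values gathered by running db.version() on each node.
const mismatches = versionMismatches(
  { "node1.example.com": "7.0.5", "node2.example.com": "7.0.5", "node3.example.com": "6.0.12" },
  "7.0.5"
);
console.log(mismatches); // lists node3.example.com as still on the old version
```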
Expectation:
Ensure secure access without breaking applications.
Commands
db.getUsers()
db.runCommand({ connectionStatus: 1 })
What this proves
You understand authentication and authorization
You verify access impact before changes
Interview phrasing
“I regularly audit users and roles through Ops Manager and validate access at the database level.”
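A user audit usually starts by finding accounts with cluster-wide privileges. The sketch below assumes the `{ users: [...] }` shape that `db.getUsers()` returns in mongosh, where each user has `user`, `db`, and a `roles` array of `{ role, db }` pairs. The list of roles treated as broad is a choice made for this example, not an official classification.

```javascript
// Built-in roles considered "broad" for this audit (an assumption to adjust).
const BROAD_ROLES = new Set([
  "root",
  "readWriteAnyDatabase",
  "userAdminAnyDatabase",
  "dbAdminAnyDatabase",
  "clusterAdmin",
]);

// Returns each user that holds at least one broad role, with those roles listed.
function usersWithBroadRoles(usersResult) {
  return usersResult.users
    .map((u) => ({
      user: u.user,
      db: u.db,
      roles: u.roles.map((r) => r.role).filter((role) => BROAD_ROLES.has(role)),
    }))
    .filter((u) => u.roles.length > 0);
}

// In mongosh this could be called as: usersWithBroadRoles(db.getUsers())
```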
Expectation:
Prevent outages caused by resource exhaustion.
Commands
db.stats(1024*1024)
Ops Manager metrics:
Disk growth
Cache utilization
Connections
What this proves
You plan ahead rather than react
You understand growth trends
Interview phrasing
“I use Ops Manager historical metrics to plan storage and scaling well before thresholds are hit.”
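Planning ahead of thresholds comes down to simple arithmetic on historical disk usage. The following is a minimal sketch using a straight-line projection between the first and last sample. The sample shape (`day`, `usedGb`) and the function name are hypothetical, and real growth is rarely perfectly linear, so treat the result as a rough estimate.

```javascript
// Projects how many days remain until the disk is full, assuming growth
// continues at the average rate observed between the first and last sample.
function daysUntilFull(samples, capacityGb) {
  if (samples.length < 2) return null; // not enough history to estimate a trend
  const first = samples[0];
  const last = samples[samples.length - 1];
  if (last.day === first.day) return null;
  const growthPerDay = (last.usedGb - first.usedGb) / (last.day - first.day);
  if (growthPerDay <= 0) return Infinity; // flat or shrinking usage never fills the disk
  return (capacityGb - last.usedGb) / growthPerDay;
}

// Illustrative numbers: 30 GB of growth in 30 days on a 200 GB volume.
console.log(daysUntilFull([{ day: 0, usedGb: 100 }, { day: 30, usedGb: 130 }], 200)); // 70
```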
Expectation:
Reduce manual work and support audits.
Example
GET /api/public/v1.0/groups/{GROUP-ID}/clusters
What this proves
You can integrate Ops Manager into scripts
You support compliance and reporting
Interview phrasing
“We use Ops Manager APIs for inventory, audit reports, and backup verification.”
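Scripts that call these endpoints tend to start with a small URL builder. Here is a minimal sketch that assembles the paths shown in this document. The base URL and the function name `opsManagerUrl` are hypothetical, and the sketch deliberately stops short of sending the request, since authentication (Ops Manager expects HTTP Digest authentication with an API key) depends on your HTTP client.

```javascript
// Builds a URL under /api/public/v1.0/groups/{GROUP-ID}/clusters.
// clusterId and resource are optional path segments.
function opsManagerUrl(baseUrl, groupId, clusterId, resource) {
  let path = `/api/public/v1.0/groups/${encodeURIComponent(groupId)}/clusters`;
  if (clusterId) path += `/${encodeURIComponent(clusterId)}`;
  if (resource) path += `/${resource}`;
  return baseUrl.replace(/\/+$/, "") + path; // drop trailing slashes from the base
}

// Hypothetical base URL and IDs, for illustration only.
console.log(opsManagerUrl("https://opsmanager.example.com:8080/", "GROUP-ID"));
console.log(opsManagerUrl("https://opsmanager.example.com:8080", "GROUP-ID", "CLUSTER-ID", "snapshots"));
```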
Skill Area          | SRE        | DBA
Monitoring & Alerts | ✅ Primary | ✅ Support
Incident Response   | ✅ Primary | ✅ Support
Performance Tuning  | ⚠️ Support | ✅ Primary
Backup & DR         | ✅         | ✅
Security            | ⚠️         | ✅ Primary
Capacity Planning   | ✅         | ✅
Automation          | ✅ Primary | ⚠️