Monitoring and Alerting

Monitoring:

Best Practices for Monitoring with MongoDB Atlas

Leverage Atlas's Built-in Monitoring Tools:
- Use the Atlas metrics dashboards, which display live and historical data for CPU, memory, disk usage, and I/O patterns.
Set Up Alerts:
- Configure alerts for key metrics like CPU utilization, memory usage, disk space, and replica set replication lag to be notified of potential issues.
- Use alert thresholds that align with your application’s performance baselines to minimize noise.
Monitor Key Metrics:
- Throughput and Latency: Track read and write throughput and latency to detect performance bottlenecks.
- Connections: Monitor the number of open connections so that your cluster does not approach the connection limit of its tier.
- Replication Lag: Keep an eye on replication lag in replica sets. Secondaries that fall behind can serve stale reads and take longer to catch up after a failover.
- Disk Usage: Monitor disk space to avoid running out of storage, which could lead to write failures.
Review Query Performance:
- Regularly analyze slow queries using the Performance Advisor and Query Profiler, which can provide index recommendations and optimization insights.
Assess Index Usage:
- Use Atlas’s index analysis tools to ensure that your queries are making efficient use of indexes and detect unused indexes which could be removed.
Enable Performance Advisor:
- Leverage the Performance Advisor in Atlas to get recommendations on how to optimize queries and indexes, and to identify anti-patterns.
Utilize Logs:
- Download or export Atlas logs, and retain them outside Atlas when you need them for troubleshooting or long-term historical analysis.
Capacity Planning:
- Monitor resource utilization trends over time so you can scale your cluster before it runs short of capacity, either horizontally (sharding) or vertically (moving to a larger cluster tier).
Use Custom Dashboards and Reports:
- Customize dashboards to track the metrics most relevant to your application’s performance and operation. If you need more than the built-in views, Atlas integrates with third-party monitoring tools such as Datadog and Prometheus.
Monitor Backup Status:
- Regularly check the status and logs of automatic backups to ensure that data is being backed up according to your business continuity requirements.
Security Monitoring:
- Monitor connection events, authentication failures, and other security-related metrics to ensure that your deployment remains secure.
Stay Informed with Atlas Updates:
- Keep abreast of updates and improvements to Atlas monitoring features and tools, and incorporate them into your monitoring strategy when beneficial.

By following these best practices, you can effectively use MongoDB Atlas's robust monitoring capabilities to maintain the health, performance, and security of your MongoDB deployment. Regularly review these metrics and adjust your monitoring approach as your database and application needs evolve.
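
One way to align an alert threshold with a performance baseline, as suggested under Set Up Alerts, is to derive it from historical samples. The sketch below is only an illustration: the function names, the 95th-percentile choice, the 20% headroom, and the sample data are all assumptions, not Atlas features or Atlas recommendations.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of a non-empty list."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


def baseline_threshold(samples, pct=95, headroom=1.2, ceiling=100.0):
    """Suggest a threshold: the chosen percentile plus headroom, capped at ceiling."""
    return min(percentile(samples, pct) * headroom, ceiling)


# Hypothetical CPU utilization samples (percent) from a period of normal load.
cpu_history = [32, 35, 41, 38, 44, 52, 47, 39, 36, 58, 43, 40]
print(round(baseline_threshold(cpu_history), 1))
```

A threshold set this way sits above normal peaks, so routine load does not trigger it, while still leaving room below saturation. Recompute it as traffic patterns change.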

Hardware Metrics:

Monitoring hardware metrics is essential to ensure that your MongoDB deployments are running efficiently and that you can anticipate potential issues before they affect performance. Below is a list of key hardware metrics to monitor, along with brief descriptions and what to look for in a healthy environment:

CPU Utilization:
- Description: Measures the percentage of CPU time in use, covering both user processes and system (kernel) processes.
- Healthy Environment: Sustained CPU utilization should generally stay below 75%. Brief spikes are normal, but consistently high usage may indicate the need for more CPU resources or for query optimization.
Memory Usage:
- Description: Monitors the amount of RAM used by the system. MongoDB relies heavily on memory to cache frequently accessed data and indexes.
- Healthy Environment: The working set (frequently accessed data and indexes) should fit in available RAM. When it does not, MongoDB must read from disk more often and the operating system may begin swapping, both of which reduce performance.
Disk I/O (Input/Output Operations):
- Description: Tracks the rate (IOPS) and latency of read and write operations on the disk.
- Healthy Environment: Low latency and high throughput are ideal. High disk I/O wait time or frequent spikes can indicate contention or the need for faster or more efficient storage solutions.
Disk Space Utilization:
- Description: Measures the capacity used on disk storage relative to total available disk space.
- Healthy Environment: Maintain ample free space (about 20-25% or more) to accommodate data growth and operations like index creation. Running out of disk space can cause write operations to fail.
Network Throughput:
- Description: Assesses the rate of data transfer across the network, typically measured in Mbps or Gbps.
- Healthy Environment: Sufficient bandwidth to handle incoming and outgoing data traffic without bottlenecks. Monitor for unusual spikes or drops which could indicate network issues or attacks.
Swap Usage:
- Description: Tracks the use of swap space (disk used as memory).
- Healthy Environment: Swap usage should be minimal. High swap usage indicates that the system is running low on RAM, and it can severely degrade performance.
Replication Lag (for Replica Sets):
- Description: The time difference between the primary node processing a write operation and the secondary nodes applying that operation.
- Healthy Environment: Low replication lag (usually under a few seconds) is ideal to ensure that data consistency is maintained. High or increasing lag could indicate performance issues or network delays.
File Descriptor Usage:
- Description: The number of file descriptors being used by the MongoDB process (open files, sockets, etc.).
- Healthy Environment: Usage should stay below system limits to prevent issues. If these limits are approached, you may need to increase available file descriptors.
Cache Hit Ratio:
- Description: The proportion of data requests served from cache versus those requiring disk reads.
- Healthy Environment: A high cache hit ratio is desirable, indicating that most requests are served from memory. A low ratio suggests that the working set is larger than the available cache or that the cache is undersized for the workload.
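
The healthy ranges listed above can be turned into a simple automated check. The following is a minimal sketch, not an Atlas API: the `Snapshot` fields, the function names, and the limits for swap, replication lag, and cache hit ratio are hypothetical values chosen for illustration. Only the 75% CPU limit and the 20% free-disk margin come from the list above.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    cpu_percent_samples: list    # recent CPU utilization samples, in percent
    disk_used_gb: float
    disk_total_gb: float         # must be greater than zero
    swap_used_mb: float
    replication_lag_s: float
    cache_pages_requested: int   # page requests made to the cache
    cache_pages_read: int        # requests that needed a read from disk


def cache_hit_ratio(requested, read_from_disk):
    """Fraction of page requests served without a disk read."""
    if requested == 0:
        return 1.0
    return 1.0 - read_from_disk / requested


def check_health(s, cpu_limit=75.0, min_free_disk=0.20, max_swap_mb=100.0,
                 max_lag_s=5.0, min_hit_ratio=0.95):
    """Return the names of metrics that fall outside their healthy range."""
    warnings = []
    # Sustained high CPU: every recent sample exceeds the limit (a lone spike is ignored).
    if s.cpu_percent_samples and min(s.cpu_percent_samples) > cpu_limit:
        warnings.append("cpu")
    if 1.0 - s.disk_used_gb / s.disk_total_gb < min_free_disk:
        warnings.append("disk_space")
    if s.swap_used_mb > max_swap_mb:
        warnings.append("swap")
    if s.replication_lag_s > max_lag_s:
        warnings.append("replication_lag")
    if cache_hit_ratio(s.cache_pages_requested, s.cache_pages_read) < min_hit_ratio:
        warnings.append("cache_hit_ratio")
    return warnings


# Hypothetical snapshot: CPU is high in every sample and the disk is 90% full.
snapshot = Snapshot([80, 82, 91], 90.0, 100.0, 0.0, 1.5, 1000, 20)
print(check_health(snapshot))
```

Treating CPU as unhealthy only when every recent sample exceeds the limit mirrors the distinction above between normal spikes and consistently high usage.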

Alerting:

Setting up alerts in MongoDB Atlas allows you to proactively monitor and respond to events that could affect the performance or availability of your MongoDB deployments. Alerts can be configured based on various metrics and events, such as disk usage, CPU usage, memory usage, and operation counts. Here’s how you can set up alerts in MongoDB Atlas:

Step-by-Step Guide to Set Up Alerts in MongoDB Atlas

Log in to MongoDB Atlas:
- Go to the MongoDB Atlas website and log in to your account.
Select a Project:
- Navigate to the project where your cluster is located. You can select the project from the dropdown list on the left sidebar.
Navigate to Alert Settings:
- In the left-hand navigation pane, go to “Alerts.”
Create an Alert Configuration:
- Click the “Alert Configuration” button.
- Then click “Add Configuration” to start creating a new alert configuration.
Select the Event Type:
- Choose the type of event you want to set an alert for. Some common event types include:
  - Host Alerts: Related to hardware and usage metrics (e.g., CPU, memory, disk usage).
  - Replica Set Alerts: Related to MongoDB replica set health and state changes.
  - Cluster Alerts: Related to cluster-specific metrics.
  - Network Peering Alerts: Related to issues with network peering connections.
  - Backup Alerts: Related to backup success or failure.
Define the Alert Details:
- Conditions: Specify the conditions under which the alert will be triggered. For example, “Trigger when average CPU usage is greater than 80% for at least 5 minutes.”
- Thresholds: Set the value the metric is compared against (80% in the example above) and, where applicable, how long the condition must hold.
- Notifications: Choose how and where you want to receive notifications. Atlas supports channels such as email, SMS, Slack, PagerDuty, and webhooks.
- Notification Interval: Define how often notifications should be re-sent while the alert condition persists.
Add the Alert:
- Once you have configured the alert with the desired settings, click on “Save” to activate the alert.
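
To make the interaction between condition, threshold, and notification interval concrete, here is a small model of how such a rule might behave. It is a sketch under stated assumptions, not how Atlas evaluates alerts: the `AlertRule` class and its fields are hypothetical, and it simplifies "average above the threshold" to "every sample above the threshold" for the required duration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AlertRule:
    metric: str
    threshold: float         # value the metric is compared against
    sustain_s: int           # how long the breach must last before triggering
    notify_interval_s: int   # minimum gap between repeat notifications
    _breach_started: Optional[int] = field(default=None, init=False)
    _last_notified: Optional[int] = field(default=None, init=False)

    def observe(self, timestamp_s, value):
        """Feed one sample; return True when a notification should be sent."""
        if value <= self.threshold:
            # Condition cleared: a later breach must be sustained again.
            self._breach_started = None
            self._last_notified = None
            return False
        if self._breach_started is None:
            self._breach_started = timestamp_s
        if timestamp_s - self._breach_started < self.sustain_s:
            return False
        if (self._last_notified is None
                or timestamp_s - self._last_notified >= self.notify_interval_s):
            self._last_notified = timestamp_s
            return True
        return False


# CPU above 80% for at least 5 minutes, re-notify every 15 minutes.
cpu_rule = AlertRule("cpu_percent", threshold=80.0, sustain_s=300, notify_interval_s=900)
# Hypothetical feed: one sample per minute, constantly at 85%.
fired_at = [t for t in range(0, 1260, 60) if cpu_rule.observe(t, 85.0)]
print(fired_at)
```

With this feed the rule first notifies after the breach has lasted five minutes and then stays quiet until the notification interval has elapsed, which is the behavior the Notification Interval setting is meant to control.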

Recommended Alerts to Set Up

Here are some recommended alerts you might consider setting up to effectively monitor your MongoDB deployment:

Host-Specific Alerts:
- High CPU Utilization: Alerts when CPU usage exceeds a certain threshold (e.g., 80%) for an extended period.
- Memory Utilization: Triggers when memory usage is too high, indicating a possible memory leak or under-provisioning.
- Disk Utilization: Alerts when disk usage crosses a threshold, useful for preventing out-of-disk situations.
Database Performance Alerts:
- Query Execution Time: Monitors when queries are taking longer than expected to complete.
- Number of Connections: Alerts on an exceptionally high number of database connections, which may indicate a connection leak, a misconfigured connection pool, resource exhaustion, or abusive traffic.
Replica Set and Backup Alerts:
- Replica Lag: Alerts when secondary nodes fall behind the primary by more than a set number of seconds.
- Backup Failures: Notifies you if scheduled backups fail.
Cluster-Level Alerts:
- Cluster Changes: Alerts on significant cluster-level events such as a tier change or a version upgrade.

Configuring these alerts can ensure that you are notified about potential issues before they impact your operations and can help maintain the health and performance of your deployments on MongoDB Atlas.
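
For the Replica Lag alert, it can help to see how lag is derived. The sketch below assumes a status document shaped like the output of MongoDB's `replSetGetStatus` command, with `name`, `stateStr`, and `optimeDate` on each member. The hostnames, the timestamps, and the 10-second threshold are hypothetical, and Atlas computes its own lag metric, so treat this purely as an illustration.

```python
from datetime import datetime, timedelta


def replication_lag_seconds(status):
    """Map each secondary's name to its lag behind the primary, in seconds."""
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        raise ValueError("no primary in replica set status")
    return {
        m["name"]: max((primary["optimeDate"] - m["optimeDate"]).total_seconds(), 0.0)
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def lagging_members(status, threshold_s=10.0):
    """Return the names of secondaries whose lag exceeds the threshold."""
    lags = replication_lag_seconds(status)
    return sorted(name for name, lag in lags.items() if lag > threshold_s)


# Hypothetical status document with one healthy and one lagging secondary.
now = datetime(2024, 1, 1, 12, 0, 0)
sample_status = {
    "members": [
        {"name": "node-0.example.net:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "node-1.example.net:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=2)},
        {"name": "node-2.example.net:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=45)},
    ]
}
print(lagging_members(sample_status))
```

Lag here is measured in seconds, the same unit the alert threshold would use.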
