Wednesday, 15 January, 2020 UTC


Summary

For successful DevOps teams, alerting is an indispensable practice. You can’t possibly watch every service in your application every second of every day, yet you must be ready to take action immediately, should any service hit a snag. With New Relic Alerts, you can ensure that the right members of your team get the alerts they need as quickly as possible. If a monitored application, host, or other entity triggers a predefined alert condition, New Relic Alerts notifies you automatically.
At the same time, though, your team needs to minimize alert fatigue, which too often leads to mistakes and miscommunication in your incident-response process. With New Relic Alerts, you can easily manage alert policies and conditions that focus on key metrics, while filtering out expected behavior.
To help you get started, we created a list of suggestions, based on best practices from the field, for setting alert conditions for apps instrumented with New Relic Browser and New Relic APM, and for hosts monitored with New Relic Infrastructure. These suggestions serve as a great starting point for teams looking to get up and running with New Relic Alerts, or for teams looking to improve their workflows.
Note: This post covers alert conditions only. You should create alert policies based on how your organization is structured and on your incident response workflow. In some cases, you might have an alert policy that contains a condition that spans your entire New Relic account. In other circumstances, you’ll have to scope the condition to one or more apps or hosts. Similarly, if you’re a mature DevOps team, you may be grouping conditions for Browser, APM, and Infrastructure into the same policy, segmented by app or product. More traditionally structured teams may want to separate Infrastructure, APM, and Browser conditions into different policies.
If you’re not already familiar with New Relic Alerts, be sure to review the following before getting started:
  • The Alerts documentation (including NRQL Alert Conditions, Baseline Alerts, and Outlier Detection)
  • Getting Started with New Relic Alerts: Best Practices That Set You Up for Success
You should also be familiar with these two terms:
  • Thresholds: These are alert condition settings that define what is considered a violation. A threshold combines the value that a data source must cross with the time-related settings that determine when crossing it counts as a violation; for example:
    • An application’s average web response time is greater than 5 seconds for 15 minutes.
    • An application’s error rate per minute hits 10% or higher at least once in an hour.
    • An application’s AJAX response time deviates a certain amount from its expected baseline behavior.
    For more information, see the New Relic documentation for setting thresholds for alert conditions.
  • Baselines: You can use baseline alert conditions to define thresholds that adjust to the behavior of your data. Baselines are useful for creating alert conditions that:
    • Notify you only when data is behaving abnormally.
    • Dynamically adjust to changing data and trends, including daily or weekly trends.
    • Work well out-of-the-box for new applications with as-yet-unknown behaviors.
    For more information, see the New Relic documentation for creating baseline alert conditions.
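To make the two time-related settings in the threshold examples concrete, here is a minimal Python sketch of "for at least N minutes" versus "at least once in N minutes." It is an illustration only, not how New Relic Alerts evaluates conditions internally; the function names and the one-sample-per-minute assumption are hypothetical.

```python
from typing import Sequence


def violates_for_at_least(samples: Sequence[float], threshold: float, duration: int) -> bool:
    """True if each of the most recent `duration` samples is above `threshold`.

    Assumes one sample per minute (an assumption made for this sketch).
    """
    if duration <= 0 or len(samples) < duration:
        return False  # not enough data to cover the whole duration
    return all(value > threshold for value in samples[-duration:])


def violates_at_least_once_in(samples: Sequence[float], threshold: float, window: int) -> bool:
    """True if any of the most recent `window` samples is at or above `threshold`."""
    if window <= 0:
        return False
    return any(value >= threshold for value in samples[-window:])


# "Average web response time is greater than 5 seconds for 15 minutes."
response_times = [6.2] * 15
print(violates_for_at_least(response_times, threshold=5.0, duration=15))  # True

# "Error rate per minute hits 10% or higher at least once in an hour."
error_rates = [1.0] * 59 + [10.0]
print(violates_at_least_once_in(error_rates, threshold=10.0, window=60))  # True
```

The "for at least" form suits metrics where only a sustained problem matters, while the "at least once" form catches short spikes that an average would hide.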

Configuring alert conditions for Browser applications

Use the following examples as best practices for getting started with alert conditions for frontend applications you’re monitoring with New Relic Browser.
Condition | Usage
Threshold condition on Pageview load time | Triggers an alert if page load times spike over the accepted threshold.
Baseline condition on Pageview throughput | Triggers an alert for sudden traffic drops or spikes only. (Baseline conditions reduce noise by accounting for expected traffic fluctuations.)
Threshold conditions on JavaScript errors | Triggers alerts when JavaScript errors appear in browser applications.
Baseline throughput conditions on AJAX request response time | Triggers alerts when AJAX requests that contact your backend services affect network latency.
Baseline throughput conditions on key Page actions (e.g., button clicks) | Triggers alerts for user behavior changes that don’t set off other alerts; for example, if a CSS change moves the "checkout" button off the viewable screen, it won’t cause a spike in errors or response times but will affect customer experience.
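As a rough illustration of why a baseline condition catches both drops and spikes in throughput, the sketch below compares the current value against the mean and standard deviation of recent history. This is a simple stand-in, not New Relic's baseline algorithm (which this post doesn't describe), and it ignores daily or weekly trends. The function name and the `tolerance` parameter are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def deviates_from_baseline(history: Sequence[float], current: float, tolerance: float = 3.0) -> bool:
    """True if `current` is more than `tolerance` standard deviations from the mean of `history`.

    A naive stand-in for a baseline: it has no notion of seasonality.
    """
    if len(history) < 2:
        return False  # too little data to estimate a baseline
    expected = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is a deviation.
        return current != expected
    return abs(current - expected) > tolerance * spread


pageviews_per_minute = [100, 102, 98, 101, 99]
print(deviates_from_baseline(pageviews_per_minute, 101))  # False: normal fluctuation
print(deviates_from_baseline(pageviews_per_minute, 40))   # True: sudden drop
print(deviates_from_baseline(pageviews_per_minute, 160))  # True: sudden spike
```

Because the check uses the absolute difference, a single condition covers both directions, which a fixed upper threshold can't do.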

Configuring alert conditions for APM applications

Use the following examples as best practices for getting started with alert conditions for applications you’ve instrumented with New Relic APM.
Condition | Usage
Threshold condition on web transaction time and Apdex | Triggers an alert when your application isn’t meeting web transaction time or Apdex thresholds.
Baseline condition on transaction throughput | Triggers an alert for sudden traffic drops or spikes only. (Baseline conditions reduce noise by accounting for expected traffic fluctuations.)
Threshold condition on error percentage | Triggers an alert for increases in your application’s error percentage; for APM applications, this threshold should be 0.
Baseline response time and throughput alerts for key external requests | Triggers alerts if upstream service providers (e.g., payment gateways) are causing latency.
For load-balanced applications, outlier conditions on throughput, response time, and error rates, faceted by host | Triggers alerts for application problems localized to a single host. Host-specific conditions help with troubleshooting and root cause analysis; for example, if a node in a cluster stops receiving traffic, it won’t affect all other nodes, but you’ll still need to troubleshoot it.
Baseline throughput and response time conditions on individual high-value transactions | Triggers alerts on fluctuations in high-value transactions, like checkouts or logins, which often represent only a small portion of overall transactions.
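The idea behind an outlier condition faceted by host can be sketched in a few lines of Python: compare each host's metric with the group median and flag the ones that stray too far. This is a simplified assumption about how such a check could work, not New Relic's outlier detection; the function name, the `max_relative_gap` parameter, and the host names are hypothetical.

```python
from statistics import median
from typing import Dict, List


def find_outlier_hosts(metric_by_host: Dict[str, float], max_relative_gap: float = 0.5) -> List[str]:
    """Return hosts whose metric differs from the group median by more than `max_relative_gap`.

    The gap is relative to the median, so 0.5 means "more than 50% away".
    """
    if len(metric_by_host) < 3:
        return []  # too few hosts to say which one is the odd one out
    center = median(metric_by_host.values())
    if center == 0:
        # Most hosts report zero, so any non-zero host stands out.
        return sorted(host for host, value in metric_by_host.items() if value != 0)
    return sorted(
        host
        for host, value in metric_by_host.items()
        if abs(value - center) / abs(center) > max_relative_gap
    )


# Requests per minute per host; web-3 has stopped receiving traffic.
throughput = {"web-1": 120, "web-2": 115, "web-3": 0, "web-4": 118}
print(find_outlier_hosts(throughput))  # ['web-3']
```

Overall throughput in this example drops by only about a quarter, which an application-wide condition might tolerate; the per-host comparison is what points at the single node that needs attention.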

Configuring alert conditions for Infrastructure hosts

Use the following examples as best practices for getting started with alert conditions for hosts you’re monitoring with New Relic Infrastructure.
Condition | Usage
Threshold conditions for CPU, memory, I/O, and storage on each host | Triggers alerts when any of these basic metrics for a host’s overall health rises above accepted thresholds.
“Host not reporting” conditions for each host | Triggers an alert if a host suddenly crashes or shuts down unintentionally.
Threshold conditions on key processes to ensure they’re running at the desired capacity (e.g., Java virtual machines or log watchers) | Triggers alerts when processes such as JVMs run above accepted thresholds. These may trigger in conjunction with other alerts, like throughput outliers, and aid in root cause analysis.
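A "host not reporting" check boils down to comparing each host's last report time against a maximum allowed silence. The Python sketch below shows that logic under stated assumptions: the function name, the host names, and the five-minute default are hypothetical, not values taken from New Relic Infrastructure.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def hosts_not_reporting(
    last_seen: Dict[str, datetime],
    now: datetime,
    max_silence: timedelta = timedelta(minutes=5),  # assumed default for this sketch
) -> List[str]:
    """Return hosts whose most recent report is older than `max_silence`."""
    return sorted(host for host, seen in last_seen.items() if now - seen > max_silence)


now = datetime(2020, 1, 15, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "db-1": now - timedelta(minutes=1),   # still reporting
    "db-2": now - timedelta(minutes=12),  # silent for too long
}
print(hosts_not_reporting(last_seen, now))  # ['db-2']
```

Choosing `max_silence` is a trade-off: too short and brief network blips page someone, too long and a crashed host goes unnoticed.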
Next steps: An effectively implemented alerting strategy is one of the most important assets of any successful DevOps team. Check out Effective Alerting In Practice to learn:
  • How shifts in modern technology stacks are leading to changes in alerting strategies
  • Some alerting best practices for dynamic and scaled environments
  • How to design and maintain an alerting system useful to your organization and teams