
Level 1 - Infrastructure alert coverage scorecard rule

Infrastructure alert coverage ensures that your servers, containers, and other infrastructure components have monitoring alerts in place to detect issues before they impact your applications and customers.

About this scorecard rule

This infrastructure alert coverage rule is part of Level 1 (Reactive) in the business uptime maturity model. It verifies that your critical infrastructure components have basic alerting configured to notify you when problems occur.

Why this matters: Infrastructure issues often cascade to application problems. Without proper infrastructure alerting, you might only discover problems when customers start complaining about slow or unavailable services.

How this rule works

This rule examines your infrastructure entities and checks whether they have alert conditions defined. Specifically, it looks for alerts on:

  • INFRA-HOST entities: Physical servers, virtual machines, and cloud instances
  • INFRA-KUBERNETES-POD entities: Kubernetes pods and containers

The rule fails if any monitored infrastructure entity lacks at least one alert condition.

Understanding your score

  • Pass (Green): All infrastructure entities have at least one alert condition defined
  • Fail (Red): One or more infrastructure entities lack alert coverage
  • Target: 100% alert coverage across all critical infrastructure components

What this means:

  • Passing score: Your infrastructure monitoring foundation is in place
  • Failing score: Some infrastructure components could fail without alerting your team

How to improve infrastructure alert coverage

If your score shows missing infrastructure alerts, follow these steps to establish comprehensive coverage:

1. Identify uncovered infrastructure

  1. Review the failing entities: Identify which specific hosts or pods lack alert coverage
  2. Prioritize by criticality: Focus first on production systems and business-critical infrastructure
  3. Assess monitoring gaps: Determine if missing alerts represent actual monitoring gaps or intentional exclusions
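
A quick way to cross-check the failing list is to query which hosts and pods are actually reporting data, then compare that against your alert policies. The queries below are a minimal sketch using NRQL and the standard event types from the infrastructure agent (SystemSample) and the Kubernetes integration (K8sPodSample); attribute names can vary by agent version, so verify them in the data explorer before relying on them.

```sql
// Hosts that reported data in the last day
SELECT uniques(hostname) FROM SystemSample SINCE 1 day ago

// Kubernetes pods that reported data in the last day, grouped by namespace
SELECT uniques(podName) FROM K8sPodSample FACET namespaceName SINCE 1 day ago
```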

2. Set up essential infrastructure alerts

For each infrastructure entity, configure alerts for these critical metrics:

Host monitoring alerts:

  • CPU utilization: Alert when CPU usage exceeds 80% for 5 minutes
  • Memory usage: Alert when memory utilization exceeds 85% for 5 minutes
  • Disk space: Alert when disk usage exceeds 90% or available space drops below 1GB
  • Host availability: Alert when the host stops reporting data for 3 minutes
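
As a starting point, NRQL alert conditions for the first three metrics above might use queries like the sketches below; the thresholds (80%, 85%, 90%) and durations are set on the alert condition itself, not in the query. SystemSample and StorageSample are the standard infrastructure agent event types, but double-check the attribute names against your own data. Host availability is typically handled through a host-not-reporting condition or the condition's loss-of-signal settings rather than a query threshold.

```sql
// CPU utilization per host (pair with a threshold such as "above 80 for 5 minutes")
SELECT average(cpuPercent) FROM SystemSample FACET hostname

// Memory utilization per host, computed from used and total bytes
SELECT average(memoryUsedBytes) / average(memoryTotalBytes) * 100 FROM SystemSample FACET hostname

// Disk utilization per mount point
SELECT average(diskUsedPercent) FROM StorageSample FACET hostname, mountPoint
```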

Kubernetes pod alerts:

  • Pod restart frequency: Alert when pods restart more than 3 times in 10 minutes
  • Container resource limits: Alert when containers approach CPU or memory limits
  • Pod availability: Alert when pods are not in a running state for more than 2 minutes
  • Node resource pressure: Alert when nodes experience memory or disk pressure
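
The Kubernetes integration reports K8sPodSample and K8sContainerSample events that these conditions can be built on. The queries below are sketches: attribute names such as restartCount, memoryLimitBytes, and status reflect the standard integration but should be verified in your account, and the restart and "not Running" thresholds are configured on the condition itself.

```sql
// Container restart count per pod
SELECT max(restartCount) FROM K8sContainerSample FACET podName, containerName

// Container memory usage as a percentage of its limit
SELECT average(memoryUsedBytes) / average(memoryLimitBytes) * 100 FROM K8sContainerSample FACET podName, containerName

// Pods not in the Running phase
SELECT uniqueCount(podName) FROM K8sPodSample WHERE status != 'Running' FACET namespaceName
```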

3. Configure alert conditions effectively

Use appropriate thresholds:

  • Start with conservative thresholds and adjust based on your environment's normal behavior
  • Consider different thresholds for development, staging, and production environments
  • Account for expected usage patterns (e.g., batch processing jobs, traffic spikes)
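
One way to keep separate thresholds per environment is to create one condition per environment and scope each query with a WHERE clause. The example below assumes hosts carry a custom environment attribute (for example, set through the infrastructure agent's custom attributes); substitute whatever tag or naming convention you actually use.

```sql
// Production-only CPU condition; pair it with a stricter threshold than the dev/staging copy
SELECT average(cpuPercent) FROM SystemSample WHERE environment = 'production' FACET hostname
```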

Set proper evaluation windows:

  • Use longer windows (5-10 minutes) for metrics that naturally fluctuate
  • Use shorter windows (1-3 minutes) for availability and critical failure conditions
  • Avoid overly sensitive alerts that trigger on temporary spikes

4. Establish alert routing and escalation

  1. Define notification channels: Set up email, Slack, or PagerDuty integrations
  2. Assign responsible teams: Ensure alerts reach the teams who can respond
  3. Create escalation procedures: Define what happens if initial alerts aren't acknowledged
  4. Test notification delivery: Verify alerts actually reach the intended recipients

Measuring improvement

Track these metrics to verify your infrastructure alert coverage improvements:

  • Coverage percentage: Aim for 100% alert coverage on production infrastructure
  • Alert effectiveness: Monitor how often infrastructure alerts help prevent application issues
  • Response times: Measure how quickly teams respond to infrastructure alerts
  • False positive rate: Ensure alerts are tuned to avoid unnecessary noise
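
If your account records alert activity as NrAiIncident events, a query like the one below can help spot noisy conditions and track false positives; treat the event and attribute names as assumptions to confirm against your own data.

```sql
// Incidents opened per condition over the last week; unusually high counts suggest tuning is needed
SELECT count(*) FROM NrAiIncident WHERE event = 'open' FACET conditionName SINCE 1 week ago
```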

Common scenarios and solutions

Legacy or decommissioned infrastructure:

  • Problem: Old hosts or containers still appear in monitoring but don't need alerts
  • Solution: Remove unused entities from monitoring or tag them as non-production to exclude from coverage requirements

Development and testing environments:

  • Problem: Dev/test infrastructure clutters alert coverage metrics
  • Solution: Use tags or naming conventions to separate environments and focus coverage rules on production systems
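
If environments are separated by a hostname convention rather than an attribute, the same WHERE-clause scoping shown earlier works with pattern matching. The "-dev-" pattern below is a placeholder for whatever convention you use.

```sql
// Exclude dev/test hosts from coverage and alert queries by naming convention
SELECT uniques(hostname) FROM SystemSample WHERE hostname NOT LIKE '%-dev-%' SINCE 1 day ago
```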

Specialized infrastructure:

  • Problem: Some infrastructure requires custom monitoring approaches
  • Solution: Create alert templates tailored to each infrastructure type (databases, load balancers, etc.)

Cloud auto-scaling resources:

  • Problem: Dynamically created instances may not inherit alert configurations
  • Solution: Use infrastructure templates or automation to ensure new instances get proper alert coverage

Advanced considerations

Customizing coverage rules

You may need to adjust the scorecard rule if:

  • Different entity types: Your infrastructure includes other entity types (databases, load balancers, etc.)
  • Environment segregation: You want to focus only on production infrastructure
  • Business criticality: Some infrastructure is more critical than others

Integration with other monitoring tools

If you use multiple monitoring tools:

  • Ensure alert coverage doesn't create duplicate notifications
  • Coordinate with existing monitoring systems to avoid gaps
  • Consider using New Relic as a central aggregation point for infrastructure alerts

Important considerations

  • Start with critical systems: Focus first on production infrastructure that directly impacts customers
  • Balance coverage with noise: Ensure comprehensive coverage doesn't create alert fatigue
  • Regular maintenance: Review and update alert conditions as your infrastructure evolves
  • Team readiness: Ensure teams can actually respond to the alerts you're creating

Next steps

  1. Immediate action: Set up basic alerts for any infrastructure currently lacking coverage
  2. Ongoing monitoring: Review this scorecard rule weekly to maintain coverage as infrastructure changes
  3. Advance to Level 2: Once infrastructure alerting is established, focus on proactive monitoring practices

For detailed guidance on infrastructure monitoring setup, see our Infrastructure monitoring documentation.
