Minimizing Downtime With Health Checks
Introduction
In modern software, downtime translates directly into lost revenue and lost brand trust.
Don’t believe me? In 2014, a Gartner study estimated the average cost of service downtime at $5,600 per minute, and in 2016 research from the Ponemon Institute put the figure at nearly $9,000 per minute.
Additional Evidence:
March 2015 - Apple suffered an outage of its store, with an estimated revenue loss of $25,000,000
August 2016 - Delta had a five-hour outage in its operations center; because of the resulting flight cancellations, the loss was estimated at $150,000,000
March 2019 - Facebook (Meta) had a 14-hour outage with an estimated cost of $90,000,000
There are many ways to minimize software downtime, but this post focuses on one: using application health checks to catch issues early, before they turn into costly outages.
Why Health Checks Matter
As applications modernize and grow, they increasingly depend on cloud services, container orchestration platforms, external APIs, data platforms, and more. One small hiccup in any of these can cascade into failures across the entire ecosystem. Health checks that target these pieces of infrastructure and services surface problems in near real time, which allows immediate escalation and a faster resolution.
The Business Value of Health Checks
Minimizing Downtime and Revenue Loss
Early Detection: Health checks act as an early warning system, and should be used to alert your team to issues before they become full-blown outages.
Cost Implications: As the figures above show, outages of any duration mean lost revenue for the organization.
Enhanced Reliability and User Trust
Reputation: Consistent availability and responsiveness build loyalty. Health checks help you maintain this reliability and keep users engaged.
Proactive Transparency: Some organizations choose to share system status publicly. Health checks can feed into status pages or dashboards, giving users a quick view of any issues.
Operational Efficiency
Automation: Health checks automate much of the monitoring process, reducing the operational cost of ongoing monitoring and freeing staff to work on other features.
Team Productivity: Clear and accurate health checks pinpoint where an issue is occurring, which reduces team churn when resolving outages.
Scalability and Growth
Informed Scaling Decisions: Health checks can include performance metrics like CPU usage and memory consumption, helping you scale services intelligently and avoid over-provisioning.
Ecosystem Fit: Platforms like Azure Monitor and Application Insights integrate seamlessly with health checks, providing end-to-end visibility.
Implementation Strategies and Best Practices
Start Small and Iterate: Rome wasn’t built in a day. Start with the most critical dependencies, such as databases or key external integrations. Decide what it means for each service to be considered healthy, degraded, or unhealthy (the three basic statuses a health check uses), determine the impact on the services that depend on it, and create an action plan.
Leverage Existing Tooling: Ideally, your organization can take advantage of existing health check libraries such as .NET’s Microsoft.Extensions.Diagnostics.HealthChecks to streamline the implementation of health checks.
Monitoring and Alerting: Connect health checks to alerting services like PagerDuty, Slack, or Microsoft Teams for rapid delivery and escalation of unexpected outages.
Security Considerations: Ensure health check endpoints do not expose sensitive data or provide attackers with too much information about your internal systems.
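As a rough illustration of the healthy/degraded/unhealthy decision described above, here is a minimal sketch that maps a measured response time to one of the three statuses. The LatencyClassifier name and both thresholds are assumptions made for this example; only HealthCheckResult comes from Microsoft.Extensions.Diagnostics.HealthChecks, and the right thresholds depend entirely on your service.

```csharp
using System;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper: turns a measured dependency latency into a health status.
public static class LatencyClassifier
{
    public static HealthCheckResult Classify(
        TimeSpan latency,
        TimeSpan degradedAfter,   // assumed threshold, e.g. 200 ms
        TimeSpan unhealthyAfter)  // assumed threshold, e.g. 1 s
    {
        if (latency >= unhealthyAfter)
        {
            return HealthCheckResult.Unhealthy("Dependency is too slow to be usable.");
        }

        if (latency >= degradedAfter)
        {
            return HealthCheckResult.Degraded("Dependency is responding slowly.");
        }

        return HealthCheckResult.Healthy("Dependency is responding normally.");
    }
}
```

A real health check would measure the latency itself (for example, by timing a lightweight query) and pass it to a method like this one.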
Implementation Example (.NET 8.0)
Step 1 & Step 2. Register health checks in Program.cs, then map the health check endpoint
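A minimal sketch of these two steps in an ASP.NET Core project using the .NET 8 minimal hosting model might look like the following. The "/healthz" path is an assumption chosen for this example; any route works.

```csharp
// Program.cs (relies on the implicit usings of the ASP.NET Core Web SDK)
var builder = WebApplication.CreateBuilder(args);

// Step 1: register the health check services with dependency injection.
builder.Services.AddHealthChecks();

var app = builder.Build();

// Step 2: expose the aggregated health status over HTTP.
// "/healthz" is an assumed path for this sketch.
app.MapHealthChecks("/healthz");

app.Run();
```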
Step 3. Implement a health check
Below is a sample health check that always returns healthy.
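One way to write such a check is to implement the IHealthCheck interface. The class name SampleHealthCheck is an assumption for this sketch; a real check would probe a dependency (a database, an API, a queue) instead of returning a fixed result.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sample check: it performs no real probing and always reports healthy.
public sealed class SampleHealthCheck : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Replace this with a real probe and return Degraded or Unhealthy as appropriate.
        return Task.FromResult(
            HealthCheckResult.Healthy("The sample check always reports healthy."));
    }
}
```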
Step 4. Register the health check (back in Program.cs)
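Registration could look like the sketch below. It assumes the project contains a class named SampleHealthCheck that implements IHealthCheck (a hypothetical name for the sample from the previous step); the registration name "sample" and the "/healthz" path are likewise assumptions.

```csharp
// Program.cs (relies on the implicit usings of the ASP.NET Core Web SDK)
var builder = WebApplication.CreateBuilder(args);

// Chain the check onto the health check builder.
// SampleHealthCheck is assumed to be defined elsewhere in the project.
builder.Services.AddHealthChecks()
    .AddCheck<SampleHealthCheck>("sample");

var app = builder.Build();

app.MapHealthChecks("/healthz"); // assumed path

app.Run();
```

Additional checks can be chained with further AddCheck calls, each under its own name.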
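Step 5. Run the project and validate by requesting the mapped health check endpoint. By default, it responds with HTTP 200 and the plain-text body Healthy when all checks pass, and with HTTP 503 and the body Unhealthy when a check fails.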
One final implementation note: by default, the returned health status is a roll-up of the worst outcome across all configured health checks. If just one health check returns unhealthy, the overall result is unhealthy. A custom response writer can be added when a per-check breakdown is needed for more granular use cases.
Example:
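The following is a sketch of a custom response writer that returns a JSON breakdown per check. The two inline checks, the "/healthz" path, the WriteBreakdownAsync name, and the shape of the JSON payload are all assumptions made for illustration; ResponseWriter on HealthCheckOptions is the extension point provided by ASP.NET Core.

```csharp
using System.Linq;
using System.Text.Json;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Two inline checks (hypothetical) so the breakdown has something to show.
builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy("Application is running."))
    .AddCheck("sample-dependency", () => HealthCheckResult.Degraded("Responding slowly."));

var app = builder.Build();

app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    // Replace the default plain-text response with a per-check JSON breakdown.
    ResponseWriter = WriteBreakdownAsync
});

app.Run();

// Serializes the overall status plus one entry per registered check.
static Task WriteBreakdownAsync(HttpContext context, HealthReport report)
{
    context.Response.ContentType = "application/json";

    var payload = new
    {
        status = report.Status.ToString(),
        checks = report.Entries.Select(entry => new
        {
            name = entry.Key,
            status = entry.Value.Status.ToString(),
            description = entry.Value.Description,
            durationMs = entry.Value.Duration.TotalMilliseconds
        })
    };

    return context.Response.WriteAsync(JsonSerializer.Serialize(payload));
}
```

Because a breakdown like this reveals the names and descriptions of internal dependencies, keep the descriptions terse or restrict access to the endpoint, in line with the security considerations above.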
Considerations/Drawbacks
While health checks offer significant advantages, some operations teams have reservations. Here are a few drawbacks to consider:
Resource Overhead: Defining, developing, and maintaining health checks takes time and resources, which can be challenging to budget for.
False Sense of Security: Poorly implemented health checks may report misleading results, making troubleshooting more complex and allowing errors to go unnoticed.
Investment vs. Actual ROI: For small organizations or user bases with non-critical services, the cost of implementation and maintenance may not be justifiable.
Maintenance Complexity: As the organization’s ecosystem evolves, the health checks must evolve with it, adding operational overhead.
Conclusion
Well-designed and well-implemented health checks can drastically reduce downtime, protect your platform’s reputation, and give your teams space to focus on other tasks and features without constantly worrying about hidden system failures. While they aren't always a perfect fit, for most businesses the benefits significantly outweigh the costs. Take a closer look at your environment and consider implementing health checks as an essential guardrail.
If you need guidance, our consulting team specializes in architecting reliable, scalable systems on .NET, Azure, and modern JavaScript stacks.
We're here to help you design a health check strategy tailored to your unique needs!