HashiCorp Incident Management

This blog was written by Sabitha Muppuri (Sr Staff Site Reliability Engineer)

 

The Critical Need for Vendor Tool Health Monitoring in Orchestration Environments

 

In today's highly orchestrated, autoscaling cloud environments, the health of vendor tools directly affects application stability and performance. This post explains why monitoring vendor tool health status matters, particularly in environments that rely on tools like Terraform for dynamic deployment. We discuss how vendor incidents or maintenance can affect dependent applications, and how Palo Alto Networks uses automated alerting and timely notifications to keep vendor events from becoming an operational problem. This proactive approach lets on-call teams quickly assess issues, update status pages, or disregard non-critical events with minimal operational impact.

 

Weekly Terraform Deployment Insights:

 

JayGolf_2-1764186798077.png

 

Potential Outage Scenarios from Terraform Problems:

 

  • Module and Registry Access: Terraform retrieves providers and modules during initialization. Initialization can fail if the Terraform Registry or the VCS host (GitHub/GitLab) is unavailable, or if a network firewall blocks access.

 

  • Provider Availability (Cloud/API Layer): Providers (e.g., AWS, Azure, GCP) call their respective cloud API endpoints. If those endpoints are degraded or down, plans and applies fail or hang. We must therefore watch provider status dashboards (e.g., status.aws.amazon.com, status.cloud.google.com). A provider that is healthy globally may still have a regional outage (e.g., AWS us-east-1) that affects our operations.

 

  • Provider Plugin Caching: If a provider binary is unavailable or is the wrong version, Terraform initialization fails.

 

Current Monitoring and Gaps:

 

The current solution is to monitor the Terraform Cloud Status Page (status.hashicorp.com).

 

  • Existing Capabilities:
    • Subscriptions to events via Email, RSS, and Slack.
    • Email subscriptions include an option to filter by particular components.
    • RSS and Slack do not include component-level filtering.
  • Events Requiring Attention:
    • Ongoing incidents
    • Scheduled maintenance activities
  • Missing Functionality:
    • Email notifications get lost among other emails and are easily overlooked.
    • Slack messages arrive promptly but cannot be filtered to the components we need.
    • Maintenance notifications are triggered only on creation, start, and completion; no reminders are sent to teams 24 hours or 60 minutes before the activity.
    • Current Incidents:
    • Scheduled Maintenance Activities:

Suggested Solution: Automated Alert Notification

 

On-call teams are not reliably notified of either real-time outages or planned maintenance. The suggested solution is to automate alert notifications to PagerDuty, which on-call personnel actively monitor and where they can respond immediately.
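
As a rough illustration of what such a notification could look like, the sketch below builds a trigger event body in the shape accepted by the PagerDuty Events API v2 (sent as a JSON POST to https://events.pagerduty.com/v2/enqueue). The function name, the `hashicorp-status-` dedup prefix, the default severity, and the sample incident values are all assumptions made for this example; the blog does not describe the actual payload used.

```python
import json

# Severity values accepted by the PagerDuty Events API v2.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_pagerduty_event(routing_key, event_id, title, component,
                          severity="warning"):
    """Build a trigger event body for the PagerDuty Events API v2.

    The dedup_key ties repeated polls of the same vendor event to one
    PagerDuty alert, so on-call staff are not paged every 5 minutes.
    """
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    return {
        "routing_key": routing_key,          # integration key of the service
        "event_action": "trigger",
        "dedup_key": f"hashicorp-status-{event_id}",  # hypothetical prefix
        "payload": {
            "summary": f"[{component}] {title}",
            "source": "status.hashicorp.com",
            "severity": severity,
        },
    }


# Hypothetical incident values, for illustration only.
event = build_pagerduty_event(
    routing_key="YOUR_INTEGRATION_KEY",
    event_id="inc-1",
    title="Elevated run queue times",
    component="Terraform Cloud",
)
print(json.dumps(event, indent=2))
```

The body is only constructed here, not sent, so the example runs without credentials or network access.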

 

Implementation Strategy:

 

1. Poll Status Page: We will poll the [status.hashicorp.com/api/v1/summary](https://status.hashicorp.com/api/v1/summary) API every 5 minutes for maintenance and incident details in JSON format.

 

2. Database Storage: A database will store data on ongoing maintenance and incidents.


3. PagerDuty Integration: Whenever the automation detects an incident or maintenance event affecting a component we use, it will send a notification to PagerDuty.

 

JayGolf_3-1764186843652.png

 

4. Scheduled Activity Reminders: For scheduled activities, a reminder job will notify teams 24 hours and 60 minutes before the activity begins.

 

  1. JayGolf_6-1764194324927.png

     

  2. JayGolf_7-1764194361331.png

     

5. Customizable Component Selection: Users can choose the specific components for which they need alerts.

 

6. Extended Alerting: Alerts can also be routed to other platforms, such as Slack in addition to PagerDuty, wherever incident managers actively follow status.
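
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the field names in `SAMPLE_SUMMARY` (`incidents`, `maintenances`, `components`, `starts_at`) are hypothetical and may not match the real summary response, the in-memory sets stand in for the database from step 2, and all function names are invented for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical payload: the real summary response may use different fields.
SAMPLE_SUMMARY = {
    "incidents": [
        {"id": "inc-1", "name": "Elevated run queue times",
         "components": ["Terraform Cloud"]},
        {"id": "inc-2", "name": "Login errors", "components": ["Vault"]},
    ],
    "maintenances": [
        {"id": "mnt-1", "name": "Database upgrade",
         "components": ["Terraform Cloud"],
         "starts_at": "2025-01-02T12:00:00+00:00"},
    ],
}

# Step 5: components the team has chosen to be alerted on (assumed value).
WATCHED_COMPONENTS = {"Terraform Cloud"}

# Step 4: reminder lead times before a maintenance window starts.
REMINDER_OFFSETS = {"24h": timedelta(hours=24), "60m": timedelta(minutes=60)}


def affects_watched(event, watched=WATCHED_COMPONENTS):
    """True if the event touches at least one watched component."""
    return bool(watched.intersection(event.get("components", [])))


def new_incident_alerts(summary, seen_ids):
    """Return watched incidents not alerted on yet and record their ids.

    seen_ids stands in for the database of step 2, so that polling every
    5 minutes does not raise the same incident twice.
    """
    alerts = []
    for incident in summary.get("incidents", []):
        if affects_watched(incident) and incident["id"] not in seen_ids:
            seen_ids.add(incident["id"])
            alerts.append(incident)
    return alerts


def due_reminders(summary, now, sent):
    """Return (maintenance_id, label) pairs whose reminder time has arrived."""
    due = []
    for maintenance in summary.get("maintenances", []):
        if not affects_watched(maintenance):
            continue
        start = datetime.fromisoformat(maintenance["starts_at"])
        for label, offset in REMINDER_OFFSETS.items():
            key = (maintenance["id"], label)
            # Due once the lead time is reached, until the window starts.
            if start - offset <= now < start and key not in sent:
                sent.add(key)
                due.append(key)
    return due


seen, sent = set(), set()
print(new_incident_alerts(SAMPLE_SUMMARY, seen))
now = datetime(2025, 1, 1, 12, 2, tzinfo=timezone.utc)
print(due_reminders(SAMPLE_SUMMARY, now, sent))
```

Tracking sent reminders by `(maintenance_id, label)` rather than matching an exact timestamp means a reminder is still delivered if one poll is late or skipped. One consequence of this simple version is that a maintenance announced less than 60 minutes ahead would produce both reminders in the same poll.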
