Tips To Never Miss An Incident Notification With Squadcast Escalation Policies

Problem At Hand

Companies implement an Incident Response process to promptly resolve critical issues. Setting up escalation policies to notify engineers is a key step in this process. Even with traditional escalation policies in place, alert notifications still get missed, which results in higher response times and failure to meet SLAs.

So, how can one ensure incident notifications are never missed?

Solution

To address this, organizations need to ensure that incidents get acknowledged and resolved within the specified timeframes. Additional measures help avoid missed incidents: for instance, regular reminders, advanced escalation policies, and tracking of incident notifications.

This can be done by implementing the following:

1. Escalation Layer Repetition

If nobody acknowledges an incident within 5 minutes of the first set of notifications, the escalation layer can be repeated.

This repetition involves sending notifications again to ensure that the incident receives attention.

In some cases, when incidents remain unacknowledged, the L2 team or managers may need to manually review and call the primary team responsible for handling the incident. Repeating the escalation layer multiple times can decrease the likelihood of L2/P2 personnel having to pick up the incident.

This helps the On-Call team avoid missed notifications and potential delays in resolution.
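The repetition described above can be sketched in a few lines of Python. This is only an illustration of the idea: the `Layer` class, its field names, and the `notification_minutes` helper are hypothetical and are not Squadcast's configuration schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """One escalation layer (hypothetical model, not Squadcast's schema)."""
    target: str            # who gets notified, e.g. "primary on-call"
    repeat_after_min: int  # minutes to wait for an acknowledgement before re-notifying
    repetitions: int       # how many extra times the layer is sent again


def notification_minutes(layer: Layer, start_min: int = 0) -> list[int]:
    """Return the minutes (counted from the incident trigger) at which the layer notifies."""
    return [start_min + i * layer.repeat_after_min for i in range(layer.repetitions + 1)]


# First notification immediately, then repeated twice at 5-minute intervals.
primary = Layer(target="primary on-call", repeat_after_min=5, repetitions=2)
print(notification_minutes(primary))  # [0, 5, 10]
```

With two repetitions, the primary team is notified three times before anyone at the next level has to step in.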

2. Medium Of Notifications

Define how your On-Call engineers should be notified when an alert is triggered. This can be done in 2 ways:

  1. Custom: As the Admin, you can define the medium of notification, e.g. SMS, Phone call, Email, or Push (mobile app).

  2. Personal: You can let your teammates specify their preferred medium of notification under their Profile settings.

This flexibility allows you to ensure the On-Call engineer is definitely notified of an actionable alert.

An example Escalation Policy could be:

  • Notify the On-Call engineer as soon as the alert comes in via SMS and Email.

  • If the On-Call engineer does not acknowledge within 5 minutes, notify via Phone call and Push notification.

  • If the alert is not acknowledged in the next 5 minutes, notify the Squad via Phone call and Push notification.

And so on: you can add multiple layers, each with its preferred medium of notification.
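The example policy above can be written down as plain data to make the timing explicit. The sketch below is a minimal model in Python; `Rule`, `EXAMPLE_POLICY`, and `rules_fired` are hypothetical names chosen for illustration and do not come from Squadcast.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """One rule of the example policy (hypothetical structure)."""
    after_min: int             # minutes without acknowledgement before this rule fires
    target: str                # who is notified
    channels: tuple[str, ...]  # notification media used by this rule


# The three rules from the example policy, in order.
EXAMPLE_POLICY = [
    Rule(0, "on-call engineer", ("sms", "email")),
    Rule(5, "on-call engineer", ("phone", "push")),
    Rule(10, "squad", ("phone", "push")),
]


def rules_fired(policy: list[Rule], minutes_unacknowledged: int) -> list[Rule]:
    """Return every rule that has fired once an alert has gone unacknowledged this long."""
    return [rule for rule in policy if rule.after_min <= minutes_unacknowledged]


# After 7 minutes without acknowledgement, the first two rules have fired.
for rule in rules_fired(EXAMPLE_POLICY, 7):
    print(rule.after_min, rule.target, ", ".join(rule.channels))
```

Writing the policy as data like this makes it easy to review who is contacted, when, and through which medium.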

3. Adding Multiple Layers

Rules can be configured for incident notifications in a specific order to ensure efficient escalation:

  • For example, initially, within 0 minutes (i.e. immediately), the incident notification can be sent to the On-Call schedule via SMS.

  • If there is no acknowledgement within 5 minutes, it can be escalated to the same on-call schedule using different notification methods.

  • Multiple layers of escalation can be included within a single Escalation Policy.

  • The Escalation Policy can be designed to include numerous steps, with shorter time intervals between each rule.

  • This allows for a comprehensive and efficient escalation process that ensures timely attention to incidents.
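Because the rules run in a specific order, it can help to sanity-check the delays before relying on a policy. The following Python sketch assumes each rule is described only by its delay in minutes; `validate_delays` is a hypothetical helper, not part of any Squadcast tooling.

```python
def validate_delays(delays_min: list[int]) -> list[int]:
    """Check that rule delays are ordered and return the gap between consecutive rules.

    Raises ValueError if the list is empty, contains a negative delay,
    or is not strictly increasing.
    """
    if not delays_min:
        raise ValueError("an escalation policy needs at least one rule")
    if any(delay < 0 for delay in delays_min):
        raise ValueError("delays cannot be negative")
    pairs = list(zip(delays_min, delays_min[1:]))
    for earlier, later in pairs:
        if later <= earlier:
            raise ValueError(f"rule at {later} min must come after rule at {earlier} min")
    return [later - earlier for earlier, later in pairs]


# Four rules: immediately, then after 5, 8, and 10 minutes.
print(validate_delays([0, 5, 8, 10]))  # [5, 3, 2]
```

The returned gaps show at a glance whether the intervals between rules shrink as the incident stays unacknowledged.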

4. Repeat the Entire Escalation Policy Multiple Times

  • The policy can be repeated numerous times, ensuring comprehensive coverage.

  • In the event of an incident, logs are presented to enhance accountability. Managers can access these logs and determine the time and status of received notifications.

  • This makes it easy to track incident notifications and offers more transparency, so missed notifications do not go unnoticed.

(Please Note: You can repeat any Escalation Policy a maximum of 3 times.)
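To see what repeating a whole policy looks like on a timeline, here is a small Python sketch. It enforces the limit of 3 repeats from the note above; everything else is an assumption made for illustration, in particular the `restart_gap_min` parameter (how long after the last rule the policy starts over) and the `full_timeline` function itself.

```python
MAX_REPEATS = 3  # limit stated in the note above


def full_timeline(rule_delays: list[int], repeats: int,
                  restart_gap_min: int = 5) -> list[tuple[int, int]]:
    """Return (pass_number, minute) pairs for every notification.

    rule_delays     -- delay of each rule within one pass, in minutes
    repeats         -- how many times the whole policy runs again (0 to MAX_REPEATS)
    restart_gap_min -- assumed wait after the last rule before the next pass starts
    """
    if not 0 <= repeats <= MAX_REPEATS:
        raise ValueError(f"repeats must be between 0 and {MAX_REPEATS}")
    pass_length = rule_delays[-1] + restart_gap_min
    timeline = []
    for pass_index in range(repeats + 1):
        offset = pass_index * pass_length
        for delay in rule_delays:
            timeline.append((pass_index + 1, offset + delay))
    return timeline


# A three-rule policy (0, 5, 10 minutes) repeated once.
for pass_number, minute in full_timeline([0, 5, 10], repeats=1):
    print(f"pass {pass_number}: notify at minute {minute}")
```

A list like this is also the kind of record a manager would want to compare against the notification logs when reviewing an incident.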

For more information on escalation policies, take a moment to dive into Squadcast escalation policies documentation.

Round Robin and Advanced Escalations can also ensure equitable distribution of escalations among team members, promoting fairness and balanced workload management. Check out this video to learn more.
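For readers unfamiliar with the term, round robin simply means taking turns in a fixed order. The Python sketch below shows the general technique only; it is not Squadcast's implementation, and the function name, incident IDs, and team member names are made up for the example.

```python
from itertools import cycle


def round_robin_assign(incident_ids: list[str], team: list[str]) -> dict[str, str]:
    """Assign each incident to the next team member in rotation (generic round robin)."""
    if not team:
        raise ValueError("team must have at least one member")
    rotation = cycle(team)  # endlessly repeats the team in order
    return {incident: next(rotation) for incident in incident_ids}


assignments = round_robin_assign(
    ["INC-1", "INC-2", "INC-3", "INC-4"],
    ["asha", "ben", "chen"],
)
print(assignments)
```

Each person receives an incident in turn, and the fourth incident goes back to the first person, so no single engineer absorbs every escalation.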

Common Use Cases

  1. Critical Server Downtime in a Web Hosting Company

In a web hosting company, when a critical server goes down, an Escalation Policy can be configured to notify the primary on-call engineer. If there's no response within a certain time frame (e.g., 5 minutes), the policy escalates the incident to a secondary engineer or the team lead. This ensures a swift response and minimizes downtime, which is crucial for maintaining SLAs.

  2. Handling Outage in a Cloud Service Provider

For a cloud service provider, when there's an outage affecting multiple customers, the first layer of the Escalation Policy can alert first-line support. If the issue continues or impacts a significant number of clients, it can escalate to the incident management team. This helps the provider respond promptly, minimize service disruption, and meet SLAs.

Safeguarding Operations With Efficient Escalation Policy

By implementing proactive measures such as custom notifications, optimized escalation policies, and escalation layer repetition, the risk of missing important alerts can be reduced significantly.

Squadcast can help you achieve all of the above, effectively navigate incident response challenges, minimize their impact, and deliver a superior customer experience.

Read More: Squadcast helped GrowthPlug improve clarity & accountability

Squadcast is a Reliability Workflow platform that integrates On-Call alerting and Incident Management along with SRE workflows in one offering. Designed for a zero-friction setup, ease of use and clean UI, it helps developers, SREs and On-Call teams proactively respond to outages and create a culture of learning and continuous improvement.