Incident management for high-velocity teams
Responding to an incident
The following sections describe Atlassian's process for responding to incidents. The incident manager (IM) goes through this series of steps to drive the incident from detection to resolution.
Jira field | Type | Help text |
Summary | Text | What's the emergency? |
Description | Text | What's the impact on customers? Include your contact details so responders can reach you |
Severity | Single-select | (Hyperlink to a Confluence page with our severity scale on it) Choosing Sev 2 or 1 means you believe this must be resolved right now - people will be paged. |
Faulty service | Single-select | The service that has the fault that's causing the incident. Take your best guess if unsure. Select "Unknown" if you have no idea. |
Affected products | Checkboxes | Which products are affected by the incident? Select any that apply |
Once the incident is created, its issue key is used in all internal communications about the incident.
Customers will often open support cases about an incident that affects them. Once our customer support teams determine that these cases all relate to an incident, they label those cases with the incident's issue key in order to track the customer impact and to more easily follow up with affected customers when the incident is resolved.
Severity | Description | Examples |
1 | A critical incident with very high impact |
|
2 | A major incident with significant impact |
|
3 | A minor incident with low impact |
|
Once you establish the impact of the incident, adjust or confirm the severity of the incident issue and communicate that severity to the team. We've found numbering the level to be very beneficial in clearly communicating severity.
At Atlassian, severity 3 incidents are passed to the delivery teams for resolution during business hours, whereas severity 1 & 2 require paging team members for an immediate fix. The difference in response between severity 1 and 2 is more nuanced and dependent on the affected service.
Your severity matrix should be documented and agreed among all your teams to have a consistent response to incidents based on customer impact.
Internal Statuspage | External Status page | |
Incident name | | Investigating issues with |
Message | We are investigating an incident affecting | We are investigating issues with |
In addition to creating a Statuspage incident, we send an email to an incident communications distribution list that includes our engineering leadership, major incident managers, and other interested staff. This email has the same content as the internal Statuspage incident. Email allows staff to reply and ask questions, whereas Statuspage is more like one-way broadcast communication.
Note that we always include the incident's Jira issue key on all internal communications about the incident, so staff knows what chatroom to pop into for more questions.
Setting up an on-call schedule with Opsgenie
In this tutorial, you’ll learn how to set up an on-call schedule, apply override rules, configure on-call notifications, and more, all within Opsgenie.
Read this tutorialHow we run incident postmortems
Blameless postmortems help us understand and remediate the root cause of incidents. Learn how we run incident postmortems at Atlassian, from our handbook.
Read this article