Incident management for high-velocity teams
Disaster recovery plans for IT ops and DevOps pros
As IT services move from a back-of-the-house cost center to driving core value for the business, effective IT disaster recovery practices are more important than ever.
Whether it’s application downtime, data loss, or even an on-premises fire, responding during a disaster is rarely simple.
For small businesses, the consequences can be devastating. According to FEMA, about 40 to 60 percent of small businesses never reopen their doors following a disaster.
What is a disaster recovery plan?
A disaster recovery plan is a documented set of practices and procedures set up to protect an organization and its IT assets in the event of a disaster. Typically the plan encompasses scenarios, runbooks, backups, and instructions for getting the business and IT services operational. This is especially relevant in events like system failure, downtime, security breach, or data loss.
"Before the 1970s, most organizations only had to concern themselves with making copies of their paper-based records. Disaster recovery planning gained prominence during the 1970s as businesses began to rely more heavily on computer-based operations. At that time, most systems were batch-oriented mainframes. Another offsite mainframe could be loaded from backup tapes, pending recovery of the primary site."
Disaster recovery planning vs. business continuity planning
Disaster recovery planning is a subset of business continuity planning. Where disaster recovery planning focuses on getting the impacted services running again as fast as possible, business continuity planning focuses on ensuring the business can operate uninterrupted in the event of a disaster.
IT plays a central role in both disaster recovery and business continuity.
It’s easy to confuse disaster recovery and business continuity, or to treat them as interchangeable. Disaster recovery planning is aimed at restoring service after an incident, and it is one piece of the overall business continuity plan, which is designed to keep the organization functioning before, during, and after an incident. If disaster recovery is “how do we end this incident,” business continuity is “how do we continue operating as a business even during an incident.”
Disaster recovery planning vs. incident management
For DevOps and IT Operations teams, incident management is the process used to respond to an unplanned event or service interruption and restore the service to its operational state.
Depending on the team and organization, the terms incident management and disaster recovery are often used interchangeably. Like disaster recovery, incident management focuses on addressing incidents in real time and getting services up and running again.
At Atlassian, we define an incident as an event that causes disruption to or a reduction in the quality of a service which requires an emergency response.
Or according to Google’s book on Site Reliability Engineering:
"Effective incident management is key to limiting the disruption caused by an incident and restoring normal business operations as quickly as possible. If you haven’t gamed out your response to potential incidents in advance, principled incident management can go out the window in real-life situations."
Google also recommends including incident management as part of an organization’s disaster recovery testing process. Ideally, responders' actions and communications are recorded throughout the incident response process to create a rich incident timeline that can serve as a resource for future related incidents or outages. This is helpful for organizations running disaster recovery testing, as teams have the full context of operations.
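As a rough illustration of what a recorded incident timeline might look like, here is a minimal Python sketch. The class names, fields, and sample entries are hypothetical and chosen only for this example; they are not taken from Google's book or from any Atlassian product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TimelineEntry:
    at: datetime   # when the action or message happened
    author: str    # responder who did or said it
    kind: str      # "action" or "communication" (hypothetical labels)
    note: str      # free-text description


@dataclass
class IncidentTimeline:
    title: str
    entries: list = field(default_factory=list)

    def record(self, author, kind, note, at=None):
        """Append an entry, defaulting the timestamp to the current UTC time."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), author, kind, note)
        self.entries.append(entry)
        return entry

    def render(self):
        """Return the entries as text lines in chronological order."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return [f"{e.at:%H:%M} [{e.kind}] {e.author}: {e.note}" for e in ordered]


# Made-up entries, recorded out of order on purpose.
timeline = IncidentTimeline("Checkout outage")
timeline.record("alice", "action", "Rolled back latest deploy",
                at=datetime(2024, 5, 2, 20, 12, tzinfo=timezone.utc))
timeline.record("bob", "communication", "Posted status update",
                at=datetime(2024, 5, 2, 20, 5, tzinfo=timezone.utc))

for line in timeline.render():
    print(line)
```

Sorting at render time rather than at insert time means entries added after the fact, such as a chat message pasted in during the review, still land in the right place.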
What is recovery time objective?
Recovery time objective (RTO) is the maximum acceptable length of time for a business function to resume normal service after an outage. It is closely related to mean time to recovery (MTTR), one of the common DevOps metrics.
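To make the relationship between recovery time objective and mean time to recovery concrete, the sketch below compares a set of past outages against a target. The four-hour objective and the outage timestamps are invented for illustration; a real objective would come from the business function's own requirements.

```python
from datetime import datetime, timedelta

# Hypothetical recovery time objective for one business function.
RTO = timedelta(hours=4)

# Hypothetical (outage_start, service_restored) pairs, made up for this example.
outages = [
    (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 11, 30)),
    (datetime(2024, 3, 2, 22, 15), datetime(2024, 3, 3, 3, 45)),
    (datetime(2024, 6, 18, 14, 0), datetime(2024, 6, 18, 15, 0)),
]


def recovery_times(outages):
    """Time taken to restore service for each outage."""
    return [restored - started for started, restored in outages]


def mean_time_to_recovery(outages):
    """Average recovery time across all outages."""
    durations = recovery_times(outages)
    return sum(durations, timedelta()) / len(durations)


def rto_breaches(outages, rto):
    """Outages whose recovery took longer than the objective."""
    return [(s, r) for s, r in outages if r - s > rto]


mttr = mean_time_to_recovery(outages)
breaches = rto_breaches(outages, RTO)
print(f"Mean time to recovery: {mttr}")
print(f"Outages exceeding the objective: {len(breaches)}")
```

With this sample data the average recovery time is three hours, inside the four-hour objective, yet one outage still took longer than the objective allows. The average is a measurement of past performance, while the objective is a limit that each individual recovery is expected to meet.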
Disaster recovery planning in a DevOps world
How do disaster recovery plans stay relevant in a world of continuous delivery, automated testing, and multiple deploys per day?
In other words, what role do disaster recovery plans play in organizations practicing DevOps?
Thankfully, the two practices can live together and benefit from each other. The same tools and processes you use to push code from development, to testing, to production can also play a role in disaster recovery. For example, backups of production environments used to test deploys can also be used to run disaster simulations. And the tracked code commits from your CI/CD pipeline can be a useful tool for surfacing recent changes in a disaster recovery scenario.
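One way to picture surfacing recent changes from a CI/CD pipeline is a simple lookback over deploy records. The sketch below assumes the pipeline can export a list of deploys with a commit ID, service name, and timestamp; the `Deploy` record, the `recent_changes` helper, the 24-hour window, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deploy:
    commit: str            # commit identifier from the pipeline
    service: str           # service the commit was deployed to
    deployed_at: datetime  # when the deploy reached production


def recent_changes(deploys, incident_start, lookback=timedelta(hours=24)):
    """Deploys that landed in the lookback window before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [d for d in deploys if window_start <= d.deployed_at <= incident_start]
    return sorted(hits, key=lambda d: d.deployed_at, reverse=True)


# Made-up deploy history for illustration.
deploys = [
    Deploy("a1b2c3d", "checkout", datetime(2024, 5, 1, 8, 0)),
    Deploy("e4f5a6b", "search", datetime(2024, 5, 2, 16, 30)),
    Deploy("c7d8e9f", "checkout", datetime(2024, 5, 2, 19, 45)),
]

incident_start = datetime(2024, 5, 2, 20, 0)
suspects = recent_changes(deploys, incident_start)

for d in suspects:
    print(f"{d.deployed_at:%Y-%m-%d %H:%M} {d.service} {d.commit}")
```

Listing the newest deploy first puts the change closest to the start of the incident at the top, which is usually where responders begin looking.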
It’s no secret that DevOps is increasingly setting the pace for all IT decisions in the company. But this doesn’t have to mean that the hard work put into the recovery plan and resources is wasted, or that your disaster recovery plan will sit on the shelf collecting dust.
Learn more about Atlassian’s incident management solution, Jira Service Management, and discover how it gives Dev and Ops teams the flexibility to work together — whether they’re resolving incidents or in disaster recovery mode.