So, you’re having an incident
This document is the GOV.UK PaaS team playbook for managing a technical incident. It covers tasks for the engineering lead and the communications (comms) lead.
Engineering lead tasks
As the engineer on support, you become the engineering lead in an incident, and you’re responsible for declaring an incident. However, other engineers should help as necessary, especially in incidents lasting several hours or more.
As the engineering lead:
- you are responsible for investigating and resolving the issue, and reporting to the comms lead so they can communicate severity and timelines to stakeholders
- you are not expected to fix any underlying problems with our technology or processes – these will be discussed and addressed in an incident review
- you should be cautious if you’re unsure about what to do or what the impact of any action would be – the goal is to get the platform stable enough to be fixed properly during office hours
If you need help during office hours, other engineers can help you as needed. Out of hours, you can escalate to the rest of the team using PagerDuty if you consider the incident high priority. There is no guarantee that anyone will answer.
If an incident is ongoing across the boundary between in and out of office hours (for example, an in-hours incident continuing past 5pm or an out-of-hours incident continuing past 9am), you should perform a handover with the incoming engineering lead.
If you’ve been involved in an out-of-hours incident, you are not required to work again until 11 hours after the end of the incident.
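The rest rule above is simple date arithmetic. A minimal sketch, assuming you record the time the incident ended; the function name `earliest_return` is hypothetical and is not part of any team tooling:

```python
from datetime import datetime, timedelta

# From the playbook: 11 hours of rest after an out-of-hours incident ends.
REST_PERIOD = timedelta(hours=11)


def earliest_return(incident_ended_at: datetime) -> datetime:
    """Return the earliest time you are required to work again."""
    return incident_ended_at + REST_PERIOD


# An incident that ended at 03:30 means no required work before 14:30 that day.
print(earliest_return(datetime(2024, 1, 9, 3, 30)))  # 2024-01-09 14:30:00
```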
Starting an incident
- Acknowledge the incident on PagerDuty and decide if the alerts you have received and their impact constitute an incident or not. Incidents generally have a negative impact on the availability of tenant services in some way or constitute a cyber security incident. Problems such as our billing smoke tests failing may indicate a tenant-impacting problem but do not in themselves constitute an incident.
- Document briefly which steps you are taking to resolve the incident in the #paas-incident Slack channel. If the situation impacts tenants, escalate to the person on communication (comms) support using PagerDuty so they can communicate with tenants.
- The #paas-incident channel has a bookmarked hangout link. Join this video call to communicate with the comms lead and talk through what you’re doing and what’s happening.
- If you decide it’s not an incident after investigating further, you must resolve the incident in PagerDuty. If you’re working out of hours, decide whether the issue needs a resolution immediately or whether an engineer can resolve it in hours. If you are sure it is an incident, agree on a priority for the incident with the comms lead. You can change this priority level later as more information emerges.
Communications lead tasks
You are not expected to be involved every time an alert goes off. PagerDuty will call the engineering lead in the event of an alert, and it is the engineering lead’s responsibility to triage and escalate to you as necessary.
As the comms lead, you communicate the status and impact of an incident to tenants and document what’s happening during the incident process.
- The #paas-incident Slack channel has a bookmarked hangout link. Join this video chat to communicate with the engineering lead, who will talk through what they’re doing and what’s happening. Record the actions the engineering lead takes and the times they happen.
- Create a new incident report using the incident report template, and fill it in with the information you currently have. Save the incident report in the incident reports folder on the PaaS shared drive.
- Begin populating the timeline with the notes the engineer has left in #paas-incident, if there are any.
- Work with the engineering lead to agree a priority for the incident.
- Ask the engineer(s) handling the technical side of the incident for an update every 20-30 minutes. Draft and issue comms at the update interval set for the incident's severity level, as described in our documentation on response times for services in production. Our primary communication channel with tenants during an incident is StatusPage. If a tenant contacts you on Slack or through another channel, politely ask them to wait for updates through StatusPage.
You should write incident comms in plain English and focus on what impact tenants can expect rather than what is wrong. For example, choose “end users are likely to experience intermittent interruption” rather than “one of the availability zones is down”.
If an incident is ongoing across the boundary between in and out of office hours (for example, an in-hours incident continuing past 5pm or an out-of-hours incident continuing past 9am), you should perform a handover with the incoming comms lead.
If you have been involved in an out-of-hours incident, you are not required to work again until 11 hours after the end of the incident.
Cyber security incidents
If the engineering and comms leads suspect the incident may need to involve the Cyber Security team, they can escalate to that team. To do so, call the phone number listed under Cyber Security on the PagerDuty Live Call Routing addons page. You can also email the Cyber Security team reporting address, or use the #cyber-security-help Slack channel.
Examples of cyber security incidents include:
- unauthorised access to tenant apps and backing services
- unauthorised access to platform infrastructure
- exploitation of vulnerabilities in platform APIs
See the appendix on what qualifies as a cyber security incident for more information.
As per our shared responsibility model, we are not responsible for the code tenants deploy. As a result, we are not responsible for spotting, preventing, or mitigating cyber security incidents within their services. However, we may spot exploitation of a vulnerability in a tenant service in our logs, at which point we should inform the tenant and give them the information we have.
After an incident is resolved
Once an incident is resolved:
- the comms lead should make sure they have communicated the resolution in all the same channels that they first communicated the incident in – this will primarily be StatusPage
- the delivery manager or one of the incident leads should schedule an incident review as soon as it is reasonably possible for as many concerned parties as possible to attend, ideally within 1-2 weeks
Handing over an incident
An incident may run across the boundary of office hours – for example, an in-hours incident might not be resolved before the end of the working day or an out-of-hours incident might run for a long time. In these cases, you may need to hand over the incident to fresh engineering and comms leads.
When to consider a handover
You should start the process to hand over an incident if:
- you are working on an incident in office hours and the time has reached 5pm – decide whether the incident is sufficiently high priority to be worked on out of hours; if it is low priority, you or another engineer can resume the incident work at the start of the next working day
- you are working on an incident out-of-hours and office hours have begun (9am on working days)
- you have been working on an incident for a long stretch of time in or out of hours (6 hours should be the maximum)
If you cannot find someone to hand over to after 6 hours of working on an incident out of hours, you should take a break and return either:
- during office hours to hand over to the in-hours support people, or
- after you have rested for an hour or two, and then attempt to reach other people again
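The handover triggers above can be sketched as a small check. This is an illustration only, not team tooling: the function names are hypothetical, `started_at` is assumed to be the time the current lead began working the incident, and working days are assumed to be Monday to Friday with bank holidays ignored. The 9am, 5pm and 6-hour values come from this playbook.

```python
from datetime import datetime, time, timedelta

OFFICE_START = time(9, 0)         # playbook: office hours begin at 9am
OFFICE_END = time(17, 0)          # playbook: in-hours work ends at 5pm
MAX_STRETCH = timedelta(hours=6)  # playbook: 6 hours should be the maximum


def in_office_hours(moment: datetime) -> bool:
    """Assumption: working days are Monday to Friday; bank holidays are ignored."""
    return moment.weekday() < 5 and OFFICE_START <= moment.time() < OFFICE_END


def handover_reasons(started_at: datetime, now: datetime) -> list[str]:
    """List the playbook triggers that currently apply; an empty list means none."""
    reasons = []
    if in_office_hours(started_at) and not in_office_hours(now):
        reasons.append("in-hours incident has passed 5pm")
    if not in_office_hours(started_at) and in_office_hours(now):
        reasons.append("out-of-hours incident has reached office hours")
    if now - started_at >= MAX_STRETCH:
        reasons.append("lead has worked 6 hours or more")
    return reasons
```

For example, a lead who started at 2am on a Monday and is still working at 9am gets two reasons back: office hours have begun, and they have passed the 6-hour maximum. The check only flags when to consider a handover; whether a 5pm incident continues out of hours still depends on its priority.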
How to hand over
- The comms lead should check the incident report has a summary of the incident and is up-to-date.
- The comms lead should share the incident report with the new engineering and comms leads so they can gain context.
- The comms lead should set up a short meeting to talk through:
  - the current incident status
  - any useful contextual information
  - the status of any communications that need to be updated further
Use the #paas-internal Slack channel to contact other team members.
Out of hours
You can use PagerDuty to escalate an issue, create a new issue for people on call, or look up the phone numbers of people on call.
There is no response play or out-of-hours support to escalate an incident to the Senior Management Team (SMT). If you need to contact SMT, wait until office hours and talk to the person on the Product and Technology (P&T) SMT escalation rota in PagerDuty.
Service provider support details
Log into AWS and raise a support ticket.
We have denial-of-service (DoS) protection as part of our AWS contract. You can contact the team through the standard support channel.
Raise a ticket through the Aiven console, or email firstname.lastname@example.org. We do not have a support package with Aiven, so they cannot guarantee response times.
What qualifies as a cyber security incident?
We follow the NCSC’s definition of what constitutes a cyber incident. For our team, this means:
- unauthorised access to non-public data of GOV.UK services
- exploitation of a vulnerability or lack of access control to use or alter GOV.UK services and systems
- denial-of-service (DoS) or similar attacks on GOV.UK services
For an incident to be considered a cyber security incident, it should be an active breach and/or involve exposure of data rather than these issues just being a hypothetical possibility. For example, if we discover that some GOV.UK PaaS software has a vulnerability with a low likelihood of exploitation, we would not consider it a cyber security incident. However, if we learned that an attacker had access to GOV.UK PaaS tenant data, we would consider it a cyber security incident.
If you’re in doubt about whether to declare a cyber security incident, you can seek help by escalating.
Defining an incident priority
Our incident priorities are publicly documented on our product pages.