This section covers incidents and outages, where the priority is to maintain a highly available (HA) service. It gives an overview of what you should be aware of before you are faced with an incident, and outlines how to manage incident comms.
If you suspect a security breach, alert Information Assurance (IA) immediately.
If we’re having an incident
If you are on support when an incident happens, you should:
- nominate an incident lead (this may be you)
- nominate an incident comms person (out of hours, this can be the person on the PaaS escalation rota)
- join #paas-incident on Slack
- get on with understanding and fixing the issue
The incident lead, comms person and anyone else needed to work on the incident will form the incident team.
The incident team can request support from any other members of the PaaS team. Fixing the incident is usually more important than routine meetings (1 to 1s, retrospectives, planning, etc).
If you’re the incident lead:
- Start making notes of what you’re doing, so that the incident comms person can start putting them in the incident report. The #paas-incident Slack channel is the best place for this. Bear in mind that tenants may know there’s a problem and may join this channel. Also note that Slack messages can start to disappear after a few days.
- Decide if you need people to help, and ask for them to come over and sit with you. Many people can investigate at the same time, but only the incident lead should be making changes to production.
- Consult the product manager and delivery manager to decide when the matter is no longer impacting the service, and is therefore resolved or can be downgraded.
- Create a Pivotal story to track our response to the incident. This should be used to keep a record of what we do to resolve the problem.
- Ensure that we schedule the post mortem and publish our incident report.
- Draft the incident report using the information that has been noted on Slack. There is a template for this which also contains guidance, and there are examples here.
If you’re incident comms:
- Let the PaaS team know about the incident on #paas Slack channel
- Notify the tenants about the investigation by creating a new incident in Statuspage using the template “Possible issue being investigated”. Important: when creating or updating an incident, make sure to tick the boxes to say which components are affected, otherwise notifications will not be sent.
- Tell internal stakeholders: send a summary of the incident as soon as possible to the GaaP incidents email list. This reaches the GaaP team and a few others internal to GDS, including IA team members.
- Update tenants hourly using the saved templates in our Statuspage account.
- Ensure that all decisions/comms are entered into the timeline section of the incident report.
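The two Statuspage rules above (always select the affected components, and update tenants hourly) can be illustrated with a minimal Python sketch. It makes no network calls. The function names, the dictionary field names and the "investigating" status value are assumptions made for illustration, not our actual tooling or a confirmed Statuspage schema; check the Statuspage documentation before relying on any of them.

```python
from datetime import datetime, timedelta

# Tenants should be updated hourly while the incident is open.
UPDATE_INTERVAL = timedelta(hours=1)


def build_incident(name, body, component_ids):
    """Build a draft incident record (hypothetical field names).

    Refuses an empty component list, because notifications are not
    sent unless the affected components are selected.
    """
    if not component_ids:
        raise ValueError(
            "select at least one affected component, "
            "otherwise notifications will not be sent"
        )
    return {
        "name": name,
        "status": "investigating",  # assumed status label
        "body": body,
        "component_ids": list(component_ids),
    }


def next_update_due(last_update):
    """Return the time by which the next tenant update should be posted."""
    return last_update + UPDATE_INTERVAL


if __name__ == "__main__":
    draft = build_incident(
        "Possible issue being investigated",
        "We are investigating a possible issue.",
        ["example-component-id"],  # placeholder, not a real component ID
    )
    print(draft)
    print("Next update due:", next_update_due(datetime.now()))
```

The point of the check in `build_incident` is to fail early: forgetting to select components is easy to miss under pressure, and the result is that tenants are silently not notified.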
The PaaS Emergency contacts and escalations document (restricted access) provides contact information for escalating to senior GaaP management out of hours.
When the incident is over
When the problem has been fixed, check that our status page is showing that the PaaS is operational.
The incident report template gives some guidance about how to complete it.
The incident lead and incident comms should ensure that the report is completed and that all relevant details are in the timeline.
Incident Review meeting
This is a no-blame retro of the incident. See blameless postmortems for some background.
The purpose of the meeting is to agree on what happened, to ensure the record fully reflects this, and to agree all follow-up actions. It should be held within a few days of the incident being resolved.
Invite the people from the team who were involved (the incident lead, incident comms and anyone else who worked on it) and, if they are not on the list already, add the delivery manager, product manager and technical architect.
Publishing the Incident Report
Tell the GaaP comms team (Nettie Williams) as soon as possible that there may be a report to publish.
Deciding to publish
This decision should be made by two of the delivery manager (DM), product manager (PM), tech lead (TL) and technical architect (TA). By default we publish incident reports on the GaaP Blog unless there is a good reason not to. This approach is consistent across Data Group and is similar to how GOV.UK publishes its reports. It sets a good example and demonstrates openness. We just need to make sure we consider any negative ramifications.
The only exception is security incidents, which need to be carefully considered to ensure that publishing them could not cause further harm.
Editing for publication
Create a copy of our factual incident report which can be edited for publication and send it to Nettie in the GaaP comms team.
The GaaP comms team will edit the report to ensure it is suitable for the audience. This will include our users, who are often developers, and will also include non-technical people. The comms team will want to pair with a member of the PaaS team on this rewrite, and to have it fact checked. This could be the PM, DM, TA or TL, or someone that they suggest. The comms team will agree publication with the Cabinet Office press office.
If there is a P1 incident, the GaaP programme team will have been informed via the GaaP incidents email list, and will be kept updated via the PaaS Announce email list.
If an incident needs to be escalated beyond the PaaS team, the incident comms person will contact people in the following order:
- GaaP Programme Director
- GaaP Programme Team
The person contacted above will decide if they need to alert a member of the GDS executive group. If none of the above are available, the incident comms person will try the people below in the following order:
- David Lewis - Director for GDS Portfolio Group
The contact details for the above people, as well as other useful contacts, can be found in the PaaS Emergency contacts and escalations document (restricted access).