Table of contents

This is for internal use by the PaaS team. Public-facing documentation is located at docs.cloud.service.gov.uk.

Incident Process

This section summarises :

  • how to manage incidents and outages to ensure a highly available service
  • how to manage incident comms

If you suspect a security breach, alert Information Assurance (IA) immediately

Set up the incident team

When an incident happens, the person on support must:

  • nominate an incident lead; this person must have production access, and may be the person on support
  • nominate an incident comms person; during out-of-hours this can be the person on the PaaS escalation rota
  • join #paas-incident on Slack if required

The incident lead, incident comms person and anyone else needed to work on the incident will form the incident team.

Incident team leads can request support from any other members of the PaaS team. Incidents take priority over BAU, for example retrospectives or planning sessions.

Investigate and resolve the incident

Your incident responsibilities depend on your role within the incident team.

Incident lead

  1. Investigate the incident, requesting help from PaaS team if required.

  2. Make necessary changes to the production environment (only the incident lead can do this).

  3. Record actions taken and changes made in the #paas-incident slack channel.

  4. Discuss the incident with the PaaS product and delivery managers to decide when the incident is resolved, or can be downgraded as it is no longer impacting the service.

As a lower priority, create a pivotal story to record all actions taken and identify any ongoing work.

Incident comms

  1. Let the PaaS team know about the incident on the #paas Slack channel.

  2. Notify the tenants about the investigation by creating a new incident in Statuspage using the “Possible issue being investigated” template.

    Important: When creating or updating an incident, you must tick the boxes to say which components are affected, otherwise notifications will not be sent.

  3. Update tenants hourly using the saved templates in our Statuspage account.

Escalations

You can escalate incidents within the GOV.UK PaaS team.

  1. Go to PagerDuty.

  2. Go to Configuration and select Schedules.

  3. Check who is currently on the escalation rota.

  4. Go back to Configuration and select Users.

  5. Select the relevant person to find their phone number.

If you need to escalate an incident beyond the GOV.UK PaaS team, contact these people in the following order:

  • Head of Reliability Engineering
  • TechOps Programme Director

The person contacted will decide if they need to alert a member of the GDS executive group.

If neither the TechOps Programme Director nor the Head of Reliability Engineering are available, you must contact the Director for GDS Portfolio Group.

The contact details for the above people, as well as useful contacts, can be found in PaaS Emergency contacts and escalations (restricted access)

When the incident is over

When the problem has been fixed, check that our status page is showing that the PaaS is operational.

Write the incident report

The incident lead and incident comms must write the incident report, ensuring that all relevant details, decisions and comms are in the timeline section of the report.

The incident report template provides guidance about how to complete it.

Hold an incident review meeting

Conduct a no-blame retro of the incident within one week of resolving the incident to:

  • agree on what happened
  • ensure the record fully reflects this
  • agree all follow-up actions

Invite the following people:

  • Incident Lead
  • Incident Comms
  • Incident team
  • Delivery manager
  • Product Manager
  • Technical Architect

Publish the incident report

Incident comms creates a version of the incident report for publication. Refer to the template for guidance, and also see previous examples.

By default we publish Incident Reports on Statuspage unless there is a good reason not to. It sets a good example and demonstrates openness.

The only incidents for which this is not automatically true are for security incidents which need to be carefully considered to ensure that no further harm could be caused by publishing these.

Suppliers

AWS

We have a full support contract with AWS. Open a support case through the AWS console or at https://aws.amazon.com/support. If the incident is ‘critical’ or ‘urgent’ severity, use click to chat or click to call for immediate contact.

Aiven Elasticsearch

Aiven monitors its services 24 hours a day 365 days a year, and provides free email support regarding problems using and accessing its services. Aiven personnel:

  • are automatically alerted on any service anomalies
  • rapidly address any issues in system operations requiring manual intervention

Responses are provided on a best-effort basis during the same or next business day. Email support@aiven.io for all support requests.