This is for internal use by the PaaS team. Public-facing documentation is located at docs.cloud.service.gov.uk.

Incident Process

This section summarises:

  • how to manage incidents and outages to ensure a highly available service
  • how to manage incident comms

You may refer to Support roles and responsibilities for information on the support rota, and the roles and responsibilities of in hours and out of hours engineers.

Warning

Throughout an incident, re-evaluate the information you have in a security context.

If you suspect a security breach, alert Information Assurance (IA) immediately.

In hours incident process

For incidents which occur during the working day, the in hours support engineer should begin by investigating the incident and assessing the severity. The comms lead will also be notified and will generate an incident report. Depending on the severity of the incident, the support engineer should also request more support from other members of the PaaS team. Incidents take priority over BAU work.

Investigate and resolve the incident

Your responsibility depends on your role within the ‘incident team’:

In hours support engineer (incident lead)

  1. Investigate the incident and request additional help if required.
  2. Decide on the incident severity level.
  3. Make necessary changes to the production environment (only the incident lead should do this).
  4. Record actions taken and changes made in the #paas-incident Slack channel.
  5. Discuss the incident with the incident comms lead to decide when the incident is resolved, or can be downgraded once it is no longer impacting the service.

Incident comms lead

  1. Let the PaaS team know about the incident on #paas-incident Slack channel.
  2. Open a hangout for you and the support engineer (share a link in the #paas-incident Slack channel, but suggest that only PaaS team members working on the incident join to avoid distractions).
  3. Set up a new incident report document (make a copy of the blank incident report template). Record a timeline of events. Make sure the incident report can be viewed by anyone in GDS, and share a link in the #paas-incident Slack channel.
  4. Notify the tenants about the investigation by creating a new incident in Statuspage using the ‘Possible issue being investigated’ template. Important: when creating or updating an incident you must tick the boxes to say which components are affected, otherwise notifications will not be sent (see the example after this list).
  5. Update tenants in the #govuk-paas cross-government Slack channel.
  6. Update tenants regularly using the saved templates in our Statuspage account. The severity level page shows how frequently you should update tenants.
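
The Statuspage steps above are normally done in the web UI with the saved templates, but they can also be scripted. The following is a minimal, illustrative sketch only, assuming the Statuspage REST API v1 and placeholder values for the page ID, API key and component ID; the point it demonstrates is that the affected components (the tick boxes in the UI) must be included in the request, otherwise component subscribers are not notified.

    import requests

    # Placeholder values - substitute the real page ID, API key and component IDs.
    PAGE_ID = "your-page-id"
    API_KEY = "your-statuspage-api-key"
    COMPONENT_ID = "affected-component-id"

    resp = requests.post(
        f"https://api.statuspage.io/v1/pages/{PAGE_ID}/incidents",
        headers={"Authorization": f"OAuth {API_KEY}"},
        json={
            "incident": {
                "name": "Possible issue being investigated",
                "status": "investigating",
                "body": "We are investigating a possible issue affecting the platform.",
                # The equivalent of ticking the component boxes in the UI -
                # without these, component subscribers are not notified.
                "component_ids": [COMPONENT_ID],
                "components": {COMPONENT_ID: "degraded_performance"},
            }
        },
        timeout=10,
    )
    resp.raise_for_status()

Updating an existing incident works the same way, with a PATCH to that incident carrying the new status (for example ‘monitoring’ or ‘resolved’) and updated component statuses.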

Between them, the incident lead and comms lead should make a decision as to whether the incident needs to be escalated further.

When the incident is over

When the problem has been fixed, check that our Statuspage shows the PaaS is operational. Note in the #paas-incident channel that the issue is resolved. Update tenants in the #govuk-paas cross-government Slack channel.
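
A quick way to confirm this, rather than only eyeballing the page, is to list the component statuses through the Statuspage API. This is a minimal sketch, assuming the Statuspage REST API v1 and placeholder page ID and API key values:

    import requests

    PAGE_ID = "your-page-id"             # placeholder - our Statuspage page ID
    API_KEY = "your-statuspage-api-key"  # placeholder - a Statuspage API key

    resp = requests.get(
        f"https://api.statuspage.io/v1/pages/{PAGE_ID}/components",
        headers={"Authorization": f"OAuth {API_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()

    # Every component should report "operational" once the incident is over.
    for component in resp.json():
        if component["status"] != "operational":
            print(f"{component['name']} is still {component['status']}")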

Write the incident report

The incident lead and comms lead must write the incident report, ensuring that all relevant details, decisions and comms are in the timeline section of the report.

The incident report template provides guidance about how to complete it.

Hold an incident review meeting

Conduct a no-blame retro within one week of resolving the incident in order to:

  • Agree on what happened
  • Ensure the record fully reflects this
  • Agree all follow-up actions

Invite the following people to the retro:

  • Incident lead
  • Incident comms lead
  • Delivery Manager
  • Product Manager
  • Technical Architect
  • Anyone else who had any involvement in the incident

Write Pivotal stories for any actions which have come out of the incident.

Publish the incident report

The incident comms lead should create a version of the incident report for publication. Refer to the template for guidance, and also see previous examples.

By default we publish incident reports on Statuspage unless there is a good reason not to. It sets a good example and demonstrates openness.

The exception is security incidents, which must be considered carefully to ensure that no further harm could be caused by publishing them.

Out of hours incident process

Step one - Preliminary investigation (max. 15 minutes)

Incident lead (on-call engineer)

  1. Carry out enough investigation to confirm that there really is an issue.
  2. Decide what the priority is (according to the severity levels).

Incident comms (on-call comms lead)

  1. Open a hangout for you and the support engineer (share a link in the #paas-incident Slack channel). If other on-call support people from other GDS teams join, suggest that only PaaS team members working on the incident are on the call to avoid distractions.
  2. Set up a new incident report document (make a copy of the blank incident report template). Record a timeline of events. Make sure the incident report can be viewed by anyone in GDS, and share a link in the #paas-incident Slack channel.
  3. For P1 incidents, contact the person on the GaaP SCS Escalation rota if you need help with communication so that the incident lead can focus on investigating the incident. You can also add them as a responder to the incident PagerDuty has created, and they will be alerted automatically (see the sketch after this list).
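
Adding a responder is normally done from the PagerDuty incident page, but it can also be done through the API. This is an illustrative sketch only, assuming the PagerDuty REST API v2 responder request endpoint, a REST API key, and hypothetical incident, user and email values:

    import requests

    API_KEY = "your-pagerduty-api-key"  # placeholder - a PagerDuty REST API key
    INCIDENT_ID = "PXXXXXX"             # placeholder - the incident PagerDuty created
    REQUESTER_ID = "PXXXXXX"            # placeholder - your PagerDuty user ID
    ESCALATION_CONTACT_ID = "PXXXXXX"   # placeholder - the escalation contact's user ID

    resp = requests.post(
        f"https://api.pagerduty.com/incidents/{INCIDENT_ID}/responder_requests",
        headers={
            "Authorization": f"Token token={API_KEY}",
            "Accept": "application/vnd.pagerduty+json;version=2",
            "From": "oncall@example.gov.uk",  # placeholder - your PagerDuty login email
        },
        json={
            "requester_id": REQUESTER_ID,
            "message": "P1 PaaS incident - please help with comms and escalation.",
            "responder_request_targets": [
                {
                    "responder_request_target": {
                        "id": ESCALATION_CONTACT_ID,
                        "type": "user_reference",
                    }
                }
            ],
        },
        timeout=10,
    )
    resp.raise_for_status()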

For all other incidents, the comms lead and the support engineer will update Statuspage and #paas-incident, and then wait until working hours to investigate further.

Step two - Investigate and resolve the incident

If you are paged out of hours as the on-call engineer, you will be the incident lead. There will also be an on-call comms lead paged. The comms lead will be responsible for incident comms. You may decide to escalate to GaaP SCS Escalation so they can engage with senior stakeholders and advise on communication with tenants.

Incident lead (on-call engineer)

  1. Investigate the incident.
  2. Make necessary changes to the production environment (only the incident lead can do this).
  3. Record actions taken and changes made in the #paas-incident Slack channel.
  4. Discuss the incident with the comms lead to decide when the incident is resolved, or can be downgraded once it is no longer a P1 incident.

Incident comms (on-call comms lead)

  1. Notify tenants about the investigation by creating a new incident in Statuspage using the ‘Possible issue being investigated’ template. Important: when creating or updating an incident you must tick the boxes to say which components are affected, otherwise notifications will not be sent.
  2. Let the PaaS team know about the incident on the #paas-incident Slack channel so team members are updated the next day.
  3. Update tenants in the #govuk-paas cross-government Slack channel.
  4. Update tenants hourly using the saved templates in our Statuspage account.
  5. When the incident is resolved or downgraded, update the #paas-incident and the #govuk-paas cross-government Slack channels to state it is no longer a P1 incident.
  6. Before the incident lead and comms lead stand down, they should write an ‘incident handover’ on the #paas-incident Slack channel so that the in hours support team can pick up the incident. This should include the status of the incident and what time it finished, so the team knows when to expect them back at work, if relevant.
  7. The GaaP SCS Escalation rota is the final decision point.

Useful contact details can be found in PaaS Emergency contacts and escalations (restricted access).

When the incident is over

When the problem has been fixed, check that our Statuspage is showing that the PaaS is operational.

Ensure that #paas-incident and #govuk-paas cross-government Slack channels are updated to show the incident is over.

The next working day

Using the information documented in the #paas-incident channel, create a Pivotal story to record all actions taken and identify any ongoing work.

Continue as per the in hours process.

If neither the out of hours support engineer nor the out of hours comms lead responds the first time

PagerDuty will attempt to alert both the support engineer and the comms lead again, 15 minutes after the first alert.

In the event that you are contacted but aren’t scheduled to be on the rota, your first port of call should be to check the #paas-incident Slack channel to see if another engineer has already picked it up. If no one has, leave a message to state that you are beginning to investigate.

Suppliers

AWS

We have a full support contract with AWS. Open a support case through the AWS console or at https://aws.amazon.com/support. If the incident is ‘critical’ or ‘urgent’ severity, use click to chat or click to call for immediate contact.
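
A case can also be opened programmatically through the AWS Support API. This is a hedged sketch only, assuming boto3 is available and using placeholder text; note that the Support API is only served from the us-east-1 endpoint and requires a Business or Enterprise support plan, which our full support contract provides:

    import boto3

    # The AWS Support API is only available via the us-east-1 endpoint.
    support = boto3.client("support", region_name="us-east-1")

    case = support.create_case(
        subject="GOV.UK PaaS production incident",  # placeholder subject
        severityCode="critical",                    # or "urgent", "high", "normal", "low"
        communicationBody="Describe the impact, affected region and resources here.",
        issueType="technical",
        language="en",
    )
    print(case["caseId"])

    # Valid serviceCode/categoryCode values, if you want to set them,
    # can be listed with support.describe_services().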

Aiven Elasticsearch

Aiven monitors its services 24 hours a day, 365 days a year, and provides free email support for problems using and accessing its services. Aiven personnel:

  • are automatically alerted on any service anomalies
  • rapidly address any issues in system operations requiring manual intervention

Responses are provided on a best-effort basis during the same or next business day. Email support@aiven.io for all support requests.

Other important contacts in Reliability Engineering

You can refer to RE On-call escalation for other important contacts that may be relevant, including Cyber Security.