This section summarises:
- how to manage incidents and outages to ensure a highly available service
- how to manage incident comms
Refer to Support roles and responsibilities for information on the support rota, and on the roles and responsibilities of in hours engineers, out of hours engineers and TechOps RE Incident Escalation.
Reevaluate the information you have throughout an incident in a security context.
If you suspect a security breach, alert Information Assurance (IA) immediately.
In hours incident process
For incidents which occur during the working day, the in hours support engineer should begin by investigating the incident and assessing the severity. The comms lead will also be notified and will generate an incident report. Depending on the severity of the incident, the support engineer should also request more support from other members of the PaaS team. Incidents take priority over BAU work.
Investigate and resolve the incident
Your responsibility depends on your role within the ‘incident team’:
In hours support engineer (incident lead)
- Investigate the incident and request additional help if required.
- Decide on the incident severity level.
- Make necessary changes to the production environment (only the incident lead should do this).
- Record actions taken and changes made in the #paas-incident Slack channel.
- Discuss the incident with the incident comms lead to decide when the incident is resolved or can be downgraded as it is no longer impacting the service.
Incident comms lead
- Let the PaaS team know about the incident on the #paas-incident Slack channel.
- Notify the tenants about the investigation by creating a new incident in Statuspage using the ‘Possible issue being investigated’ template. Important: when creating or updating an incident you must tick the boxes to say which components are affected, otherwise notifications will not be sent.
- Update tenants regularly using the saved templates in our Statuspage account. The severity level page shows how frequently you should update tenants.
Between them, the incident lead and comms lead should make a decision as to whether the incident needs to be escalated further.
When the incident is over
When the problem has been fixed, check that our Statuspage is showing that the PaaS is operational. Note in the #paas-incident channel that the issue is resolved.
Write the incident report
The incident lead and comms lead must write the incident report, ensuring that all relevant details, decisions and comms are in the timeline section of the report.
The incident report template provides guidance about how to complete it.
Hold an incident review meeting
Conduct a no-blame retro of the incident within one week of resolving it, in order to:
- Agree on what happened
- Ensure the record fully reflects this
- Agree all follow-up actions
Invite the following people to the retro:
- Incident lead
- Incident comms lead
- Delivery Manager
- Product Manager
- Technical Architect
- Anyone else who had any involvement in the incident
Write Pivotal stories for any actions which have come out of the incident.
Publish the incident report
The incident comms lead should create a version of the incident report for publication. Refer to the template for guidance, and also see previous examples.
By default we publish incident reports on Statuspage unless there is a good reason not to. It sets a good example and demonstrates openness.
The only exception is security incidents, which must be carefully considered to ensure that publishing the report could not cause further harm.
Out of hours incident process
Step one - Preliminary investigation (max. 15 minutes)
Carry out enough investigation to confirm that there really is an issue and what the priority is (according to the severity levels). For P1 incidents, if help with communication is needed, the comms lead should contact the TechOps RE Incident Escalation person so that the incident lead can focus on investigating the incident.
The comms lead will:
1. Create an open hangout and share it with the incident team to enable ongoing discussion.
If the on-call TechOps RE Incident Escalation person needs to know about the incident:
1. Go to PagerDuty.
2. Under teams, select GOV.UK PaaS.
3. Go to Configuration and select Schedules.
4. Check who is currently on the TechOps RE Incident Escalation rota.
5. Go back to Configuration and select Users.
6. Select the relevant person to find the phone number.
For all other incidents the communication and support lead will update Statuspage and #paas-incident, and then wait until working hours to investigate further.
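The decision at the end of step one can be sketched as follows. This is a minimal illustration only, not part of any PaaS tooling: the `OutOfHoursTriage` type and the `triage_out_of_hours` function are hypothetical names, and the only priority label taken from this page is P1. Any other label is simply treated as "not P1".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutOfHoursTriage:
    """Hypothetical summary of what happens after the preliminary investigation."""
    investigate_now: bool       # carry on to step two straight away
    open_hangout: bool          # comms lead opens a hangout for the incident team
    update_statuspage: bool     # tenants are told via Statuspage
    update_slack_channel: bool  # #paas-incident is updated


def triage_out_of_hours(priority: str) -> OutOfHoursTriage:
    """Map a confirmed incident priority to the out of hours actions.

    Assumption: priorities are written as labels such as "P1"; anything
    that is not P1 waits until working hours for further investigation.
    """
    is_p1 = priority.strip().upper() == "P1"
    return OutOfHoursTriage(
        investigate_now=is_p1,
        open_hangout=is_p1,
        # Statuspage and #paas-incident are updated for every confirmed incident.
        update_statuspage=True,
        update_slack_channel=True,
    )
```

For example, `triage_out_of_hours("P1").investigate_now` is `True`, while a lower priority incident still gets Statuspage and Slack updates but is investigated in working hours.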
Step two - Investigate and resolve the incident
If you are paged out of hours as the on-call engineer, you will be the incident lead. An on-call comms lead will also be paged and will be responsible for incident comms. You may decide to escalate to TechOps RE Incident Escalation so they can engage with senior stakeholders and advise on communication with tenants.
Incident lead (on-call engineer)
- Investigate the incident.
- Make necessary changes to the production environment (only the incident lead can do this).
- Record actions taken and changes made in the #paas-incident Slack channel.
- Discuss the incident with the comms lead to decide when the incident is resolved or can be downgraded as it is no longer a P1 incident.
Incident comms (on-call comms lead)
- Create an open hangout and share it with the incident team to enable ongoing discussion.
- If deemed necessary, contact the TechOps RE Incident Escalation person to get comms support.
- Notify tenants about the investigation by creating a new incident in Statuspage using the ‘Possible issue being investigated’ template.
- Let the PaaS team know about the incident on the #paas-incident Slack channel.
- Update tenants hourly using the saved templates in our Statuspage account.
- When the incident is resolved or downgraded, update the #paas-incident Slack channel to state it is no longer a P1 incident.
- The TechOps RE Incident Escalation rota is the final decision point.
Useful contact details can be found in PaaS Emergency contacts and escalations (restricted access).
When the incident is over
When the problem has been fixed, check that our Statuspage is showing that the PaaS is operational.
Ensure that #paas-incident is updated to show the incident is over.
The next working day
Using the information documented in the #paas-incident channel, create a Pivotal story to record all actions taken and identify any ongoing work.
Continue as per the in hours process.
If neither the out of hours support engineer nor the out of hours comms lead responds
- PagerDuty will automatically contact the TechOps RE Incident Escalation rota.
- The escalation person will then need to reach out to another engineer to try to resolve the issue.
- If the TechOps RE Incident Escalation person doesn’t respond to the alert within 30 minutes, PagerDuty will automatically alert all members of the out of hours support rota by email.
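The escalation chain above can be modelled roughly as below. This is a sketch under stated assumptions, not the team's PagerDuty configuration and not a use of the PagerDuty API: `AlertTarget` and `next_alert_target` are hypothetical names, and treating exactly 30 minutes as "timed out" is an assumption.

```python
from enum import Enum

# From the process text: the escalation person has 30 minutes to respond.
ESCALATION_TIMEOUT_MINUTES = 30


class AlertTarget(Enum):
    """Hypothetical labels for who is alerted next."""
    NONE = "no further alert"
    ESCALATION_ROTA = "TechOps RE Incident Escalation rota"
    ALL_OUT_OF_HOURS_BY_EMAIL = "all out of hours support rota members (email)"


def next_alert_target(
    on_call_responded: bool,
    escalation_responded: bool,
    minutes_since_escalation_alert: int,
) -> AlertTarget:
    """Return who should be alerted next when the out of hours pages go unanswered.

    on_call_responded is True if either the on-call engineer or the
    on-call comms lead has responded.
    """
    if on_call_responded or escalation_responded:
        return AlertTarget.NONE
    # Assumption: the timeout is inclusive, so 30 minutes counts as no response.
    if minutes_since_escalation_alert >= ESCALATION_TIMEOUT_MINUTES:
        return AlertTarget.ALL_OUT_OF_HOURS_BY_EMAIL
    return AlertTarget.ESCALATION_ROTA
```

Keeping the timeout in a single named constant makes it easy to check against the real PagerDuty escalation policy if that ever changes.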
If you are contacted but aren’t scheduled to be on the rota, your first port of call should be the #paas-incident Slack channel, to see whether another engineer has already picked up the incident. If no one has, leave a message to state that you are beginning to investigate.
We have a full support contract with AWS. Open a support case through the AWS console or at https://aws.amazon.com/support. If the incident is ‘critical’ or ‘urgent’ severity, use click to chat or click to call for immediate contact.
Aiven monitors its services 24 hours a day, 365 days a year, and provides free email support for problems using and accessing its services. Aiven personnel:
- are automatically alerted on any service anomalies
- rapidly address any issues in system operations requiring manual intervention
Responses are provided on a best-effort basis during the same or next business day. Email firstname.lastname@example.org for all support requests.
Other important contacts in Reliability Engineering
You can refer to RE On-call escalation for other important contacts that may be relevant, including Cyber Security.