Responding to alerts
Tips for responding to alerts when you are on support.
Failed logins
In case of a security incident, for example when Prometheus monitors alert on a brute-force attack, it may help to look at the UAA logs.
The access logs are on the UAA VMs in /var/vcap/sys/log/uaa/localhost_access.log.
All other logs are sent to logit. You must log into logit via Google before you can access kibana.
- Scope to component: UAA
- Filter by @source.deployment: "prod" or @source.deployment: "prod-lon" (depending on which deployment you are looking for)
- Look for strings like UserNotFound, PrincipalAuthenticationFailure or UserAuthenticationFailure (a sketch for counting these follows this list)
- principal=XXX will give you the username
- origin=[1.2.3.4] will give you the origin IP
- UserAuthenticationSuccess will give you the successful attempts
- SSO authentication origin will show sessionId=XXX
- UAA authentication origin will show clientId=cf
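Once kibana has pointed you at a suspicious username or IP, a quick grep on a UAA VM itself can confirm the volume of failures. This is a minimal sketch, assuming the audit strings above are written to uaa.log; adjust the file and patterns to what you actually see:

```bash
# Count failed logins by principal and origin IP.
# Run on a UAA VM (e.g. after `bosh ssh uaa/0`); log path as described above.
grep -E 'UserNotFound|PrincipalAuthenticationFailure|UserAuthenticationFailure' \
  /var/vcap/sys/log/uaa/uaa.log \
  | grep -oE 'principal=[^, ]+|origin=\[[^]]+\]' \
  | sort | uniq -c | sort -rn | head -20
```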
CPU credits
These alerts tell us when an AWS instance that has burstable CPU performance has exhausted its credits and is performing badly. This is an informational alert that should help you understand why other alerts may be triggered at the same time. For more information, see the AWS documentation on CPU credits for burstable performance instances.
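If you want to confirm how depleted the credits are, the CPUCreditBalance CloudWatch metric shows the remaining balance. A hedged sketch, assuming you have AWS CLI access to the relevant account and the instance ID from the alert (the ID and time range below are placeholders):

```bash
# Show the CPU credit balance for the last three hours (GNU date syntax; adjust on macOS).
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time "$(date -u -d '3 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average
```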
CDN broker healthy
The CDN broker has a healthcheck endpoint which confirms whether all of its dependencies are working. If any are not working then the healthcheck endpoint returns a 500 response code and response body containing the errors.
Prometheus doesn’t report the response body, which can make it hard to determine the actual problem. To get this information you will need to SSH to any Cloud Foundry VM and query the endpoint yourself:
```bash
# bosh ssh api/0
…
api/00673a4a-b11f-410d-a037-34d56a3e2e71:~$ curl -i https://cdn-broker.<SYSTEM_DOMAIN>/healthcheck
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
Date: Fri, 02 Jun 2017 10:10:49 GMT
Content-Length: 52
Connection: keep-alive

cloudfoundry error: Could not get api /v2/info: EOF
```
RDS Failures
Unfortunately, many types of failure event are not resolvable without getting in touch with AWS. In some circumstances you may be able to restore an instance using our point-in-time-restore instructions.
When the failure monitor triggers, before starting any further investigation and work, we would like to notify our users of the odd behaviour and let them know we’re working to solve the issue. Please see How to obtain organisation managers to learn who to contact.
For more information on the 8 types of failure event, see the Amazon RDS Event Categories and Event Messages documentation.
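Before contacting AWS it is worth pulling the recent events for the instance, which usually include the failure message itself. A sketch, assuming AWS CLI access and that you have set RDS_INSTANCE_ID to the affected instance’s identifier (as used by the script further down this page):

```bash
# List the RDS events from the last 24 hours (1440 minutes) for the affected instance.
aws rds describe-events \
  --source-type db-instance \
  --source-identifier "${RDS_INSTANCE_ID}" \
  --duration 1440
```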
RDS Disk utilisation
Our tenants are likely to run low on storage on their RDS instances over time. This is no reason to panic, but we would like to notify our users when this happens.
Ideally, you should retrieve the GUID of the instance from the tags associated with the metrics on the alert. Having that will help you establish which users to contact, as they may wish to scale up their instance in response. Please see How to obtain organisation managers to learn who to contact.
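To see how much headroom is actually left before contacting the tenant, you can query the FreeStorageSpace metric for the instance. A hedged sketch, assuming AWS CLI access and that RDS_INSTANCE_ID is set to the instance identifier from the alert:

```bash
# Show the minimum free storage (in bytes) over the last hour for the affected instance.
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name FreeStorageSpace \
  --dimensions Name=DBInstanceIdentifier,Value="${RDS_INSTANCE_ID}" \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Minimum
```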
How to obtain organisation managers
We’ve prepared a script to obtain the organisation, space, instance name and the list of managers connected to the instance you’re after.
From the paas-cf repository, run the following command:
```bash
./scripts/get-instance-details.sh "${RDS_INSTANCE_ID}"
```
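If you already know which organisation the instance belongs to, the cf CLI can also list its managers directly; the org name here is a placeholder:

```bash
# List the users and their roles (including org managers) for an organisation.
cf org-users my-org-name
```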
DDoS mitigation
When AWS detect a possible denial-of-service attack against PaaS, our in-hours support gets an email. If you receive such an email, log into AWS Shield and look at the details of what AWS have detected.
AWS automatically mitigate layer 3 and 4 attacks (i.e. junk network traffic). You should monitor the PaaS’s performance; unless it is degraded there is likely no need to do more than keep watching.
AWS do not automatically mitigate layer 7 attacks (i.e. mass HTTP requests). These are not always malicious: large submissions to GOV.UK Notify can run at 900rps, and AWS detect that as a request flood. Reviewing the details in AWS Shield can help identify whether there is a real attack. You can create new AWS WAF rules to combat this traffic.
If facing a large, complex or uncertain attack, contact the AWS DDoS Response Team (DRT): https://docs.aws.amazon.com/waf/latest/developerguide/ddos-responding.html.
API Latency
The last time we received an alert for API latency, it was caused by a UAA bug that made the instances leak memory. It also resulted in high CPU usage, and the quickest way to restore the platform was to restart both instances one after the other.
```bash
cd ${PAAS_BOOTSTRAP_DIR}
DEPLOY_ENV=prod make prod bosh-cli
bosh restart uaa # You may want/need to target an instance, e.g. uaa/23888ebf-2dd3-4afe-b370-51705403d423
```
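If you need to restart all of the UAA instances rather than a single one, a rolling restart keeps UAA available while each instance comes back. A rough sketch, assuming the bosh-cli environment set up by the commands above:

```bash
# Restart the UAA instances one at a time so at least one stays up throughout.
for instance in $(bosh instances | awk '$1 ~ /^uaa\//{print $1}'); do
  bosh -n restart "${instance}"
done
```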
There may be different reasons for latency to degrade on any of the VMs. We tracked down the issue by logging into the VM itself and digging through the logs.
```bash
cd ${PAAS_BOOTSTRAP_DIR}
DEPLOY_ENV=prod make prod bosh-cli
bosh ssh uaa/23888ebf-2dd3-4afe-b370-51705403d423
tail -1000 /var/vcap/sys/log/uaa/uaa.log
```
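What to look for will differ between incidents; for a JVM memory problem, entries such as OutOfMemoryError are a likely signal. A hedged sketch of narrowing the log down once you are on the VM (the exact log format may differ):

```bash
# Count out-of-memory entries, then show the most recent error lines.
grep -c 'OutOfMemoryError' /var/vcap/sys/log/uaa/uaa.log
grep 'ERROR' /var/vcap/sys/log/uaa/uaa.log | tail -50
```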
In the case of the UAA latency incident, it turned out that the CF upgrade we had done previously contained a UAA release which introduced the memory leak. The solution was to upgrade the UAA release on its own.
Read more in the UAA Downtime investigation story.
Intermittent ELB failures
Users have previously experienced intermittent request timeouts and errors when accessing their apps. During a particular incident, we believe that requests timed out due to ELB node failures.
We publish two metrics:
- aws.elb.unhealthy_node_count - number of IPs that failed to respond to an HTTP request
- aws.elb.healthy_node_count - number of IPs that responded as expected
Details of the monitoring approach can be found in the README for our monitoring application.
We have been advised by AWS that if we believe we are experiencing an issue with one or more ELB nodes that we should raise a “high priority” support ticket with them.
To obtain the IP address of the failing node(s), you should inspect the logs of the paas-metrics app. This is deployed to the ‘monitoring’ space of the ‘admin’ org. You should see logs such as:
```
2017-12-01T14:14:09.68+0000 [APP/PROC/WEB/0] OUT
{
  "timestamp":"1512137649.686135292",
  "source":"metrics",
  "message":"metrics.elb-node-failure",
  "log_level":1,
  "data":{
    "addr":"52.31.169.122:443", <-- this is the ELB node IP
    "err":"bad error happened!",
    "start":"2017-12-01T14:14:09.679602987Z",
    "stop":"2017-12-01T14:14:09.679602987Z"
  }
}
```
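To check whether the failing address is still one of the ELB’s nodes, compare it with the IPs the ELB is currently publishing in DNS. A sketch, assuming the alert relates to the main router ELB and using the same <SYSTEM_DOMAIN> placeholder as earlier; substitute the relevant hostname otherwise:

```bash
# Resolve the router ELB and compare the returned IPs with the "addr" field above.
dig +short api.<SYSTEM_DOMAIN>
```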
Read more in the Incident Report.
Invalid Certificates
paas-metrics exposes two metrics relating to certificate validity:
- tls.certificates.validity - number of days before expiry of the platform’s public-facing certificates
- cdn.tls.certificates.validity - number of days before expiry of certificates for CloudFront aliases (tenants’ custom domain names). This is for all CDNs, not just those created by the broker.
If the endpoints are misconfigured or the certificate is considered invalid for some other reason, the value will fall to 0 and alert as expired/invalid.
If the platform certificates are due to expire, check AWS Certificate Manager for the validation settings, as they should validate automatically via DNS.
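You can also confirm which certificate the platform is actually serving and when it expires. A quick sketch, using the same <SYSTEM_DOMAIN> placeholder as earlier:

```bash
# Print the expiry date of the certificate currently served on the API endpoint.
echo | openssl s_client -connect api.<SYSTEM_DOMAIN>:443 -servername api.<SYSTEM_DOMAIN> 2>/dev/null \
  | openssl x509 -noout -enddate
```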
For CDN certificates, start with AWS Certificate Manager in the us-east-1 region to see why the certificate didn’t get renewed. The most common cause is tenants removing the DNS records required for validation. Use dig -t CNAME to check if they’re still set.
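A hedged sketch of that check: pull the expected validation record from ACM in us-east-1, then see whether the tenant’s DNS still answers for it. The certificate ARN and record name below are placeholders:

```bash
# Show the DNS validation records ACM expects for the certificate.
aws acm describe-certificate \
  --region us-east-1 \
  --certificate-arn "arn:aws:acm:us-east-1:123456789012:certificate/example" \
  --query 'Certificate.DomainValidationOptions[].ResourceRecord'

# Check whether the tenant's DNS still serves the expected CNAME.
dig -t CNAME _example-validation.tenant-domain.example.com
```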
If you are seeing NO_DATA errors for this monitor then there may be a more fundamental connectivity issue to the reported endpoint.
The cf logs output of the paas-metrics app may contain additional information. This is deployed to the ‘monitoring’ space of the ‘admin’ org.
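A sketch of pulling those logs, assuming you are logged in to the platform with a role that can see the admin org:

```bash
# Target the monitoring space and fetch recent logs from the paas-metrics app.
cf target -o admin -s monitoring
cf logs paas-metrics --recent
```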
Cloud Foundry internal certificates
We generate several certificates for Cloud Foundry which need to be rotated regularly (most of them have a 1 year expiry).
We have a Concourse job which checks if we have certificates expiring in less than 30 days.
If you need to rotate a certificate before it would otherwise be rotated automatically, you can perform step 1 of the Credhub CA cert rotation, and then allow the pipeline to take over.
Trusted Advisor Warnings
If you see a warning from Trusted Advisor it should be addressed.
- IAM: Access keys should be rotated every 90 days. Chase the owner of the offending key, delete the key, or use this script to help with rotating keys (a sketch for checking key age follows below).
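A hedged sketch for checking how old a user’s access keys are before chasing the owner; the user name is a placeholder:

```bash
# Show the ID, status and creation date of each access key for a user.
aws iam list-access-keys --user-name offending-user \
  --query 'AccessKeyMetadata[].[AccessKeyId,Status,CreateDate]' \
  --output table
```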
The following Trusted Advisor checks cause expected warnings:
- S3 Bucket Permissions for gds-paas-*-cdn-broker-challenge are required to be publicly accessible, so it is safe to mark these as excluded.
- S3 Bucket Permissions for gds-paas-*-ci-releases are required to be publicly accessible, so it is safe to mark these as excluded.
- S3 Bucket Permissions for gds-paas-*-ci-releases-blobs are required to be publicly accessible, so it is safe to mark these as excluded.
- ELB Listener Security for *-ssh-proxy: this ELB is in TCP mode, so this warning is expected.
- Security Group - Specified Ports Unrestricted for *-sshproxy: we intentionally allow port 2222 for cf ssh access, so it is safe to mark this as excluded.
Gorouter high latency alerts
If you see an alert for gorouter latency being high:
- Check with the team whether there is any known load testing occurring, either by the team or by a tenant
- Log into logit via Google
- Visit kibana and filter by @source.component: gorouter and @source.deployment: "prod" or @source.deployment: "prod-lon" (depending on which deployment you are looking for), then add gorouter.host to the table; this will show you which hosts are currently receiving the most traffic
- Contact the team who own the service to see if they are aware of anything happening
As this is the first monitor of this type, please investigate the gorouters to discover the issue they are encountering. We have previously seen high resource usage (CPU and memory), so these should be checked first.
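A quick sketch for that first check, assuming the bosh-cli environment described in the API latency section above:

```bash
# Show CPU, memory and disk vitals for the router instances.
bosh instances --vitals | grep router
```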
Diego cell ephemeral disk usage near full
To prevent a tenant-visible issue in the short term, it’s probably a good idea to recreate the cell in bosh:
```bash
cd ${PAAS_BOOTSTRAP_DIR}
DEPLOY_ENV=prod make prod bosh-cli
bosh recreate diego-cell/123 # recreating the VM gives the cell a fresh ephemeral disk
```
In the long term, this should be investigated because it Shouldn’t Happen. The grootfs garbage collector should keep disk space under control, so perhaps this is a sign that it isn’t working properly.
The last time this happened we weren’t quick enough investigating it and the logs fell out of logit, preventing us from diagnosing the true cause. So don’t make the same mistake.
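When you do investigate, it is worth capturing what is using the ephemeral disk before the cell is recreated. A hedged sketch; the grootfs store path is an assumption and may differ between stemcells and releases:

```bash
# Check ephemeral disk usage and the size of the grootfs store on the affected cell.
bosh ssh diego-cell/123 -c 'df -h /var/vcap/data && sudo du -sh /var/vcap/data/grootfs/* 2>/dev/null'
```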