This is for internal use by the PaaS team. Public-facing documentation is located at docs.cloud.service.gov.uk.

Prometheus

High availability

All components run on at least two VMs, each in a different availability zone (AZ).

All Prometheus instances are independent and scrape the metrics endpoints separately. Because each instance scrapes at slightly different times, their data will differ slightly. An instance may also be missing data that the others have, for example after downtime or a restart.

Every Prometheus instance sends its alerts to every Alertmanager. The Alertmanagers form a separate cluster and deduplicate identical alerts.
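
To illustrate why the same alert arriving from two Prometheus instances produces one notification, here is a minimal sketch of deduplication by label set. It is a simplified model, not Alertmanager's actual implementation. It assumes both instances attach identical labels to the alert, and the alert names and labels are made up for the example.

```python
from typing import Dict, Iterable, List, Tuple

Alert = Dict[str, str]  # an alert reduced to its label set (simplification)


def fingerprint(labels: Alert) -> Tuple[Tuple[str, str], ...]:
    """Order-independent identity for an alert: its sorted label pairs."""
    return tuple(sorted(labels.items()))


def deduplicate(alerts: Iterable[Alert]) -> List[Alert]:
    """Keep the first alert seen for each distinct label set."""
    seen = set()
    unique = []
    for alert in alerts:
        key = fingerprint(alert)
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


# Hypothetical input: two Prometheus instances fire the same alert,
# and a third alert differs in its "instance" label.
incoming = [
    {"alertname": "DiskFull", "instance": "vm-1", "severity": "critical"},
    {"severity": "critical", "alertname": "DiskFull", "instance": "vm-1"},
    {"alertname": "DiskFull", "instance": "vm-2", "severity": "critical"},
]
unique_alerts = deduplicate(incoming)
print(len(unique_alerts))  # 2
```

If the instances added different labels to the same alert (for example a per-instance replica label), the label sets would no longer match and the alerts would not be treated as identical.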

Each Grafana instance connects only to the Prometheus instance in its own availability zone.

Every instance has its own unique URL, so if an AZ fails you can use the monitoring URLs from the working AZ.
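
Choosing a working URL during an AZ failure could be scripted along these lines. This is only a sketch: the URLs are placeholders, not the real monitoring URLs, and it assumes the instances expose Prometheus's standard `/-/healthy` endpoint without authentication.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def is_healthy(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if the instance answers HTTP 200 on /-/healthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/-/healthy", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error or bad URL.
        return False


def first_healthy(
    urls: Sequence[str],
    check: Callable[[str], bool] = is_healthy,
) -> Optional[str]:
    """Return the first URL that passes the health check, or None."""
    for url in urls:
        if check(url):
            return url
    return None


# Placeholder URLs (assumptions), listed in order of preference.
PROMETHEUS_URLS = [
    "https://prometheus-az1.example.invalid",
    "https://prometheus-az2.example.invalid",
]

# Simulate an AZ1 outage with a stub check instead of real HTTP calls.
chosen = first_healthy(PROMETHEUS_URLS, check=lambda url: "az2" in url)
print(chosen)  # https://prometheus-az2.example.invalid
```

Passing the health check in as a parameter keeps the selection logic testable without network access. Calling `first_healthy(PROMETHEUS_URLS)` with no `check` argument would probe the real endpoints.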

Alerting

Prometheus is configured to send alerts to the teams' mailing lists.

You can find these mailing lists here.

URLs

Primary

Secondary