ADR012: Haproxy healthcheck
We were investigating how to avoid downtime to deployed applications when we made changes to the platform. We had discovered that short outages occurred when the gorouter was taken out of service.
gorouter has drain functionality to allow upstream load balancers to gracefully take instances of the gorouter out of service before any requests start to fail.
When a USR1 signal is sent to instruct the gorouter to start draining,
healthchecking requests, identified by the header
start failing with HTTP 503. User requests are allowed to continue.
Two things prevented us from using drain mode:
- we had the ELB configured to use TCP healthchecks, not HTTP
- the ELB sends HTTP healthcheck requests with
User-Agent: ELB-HealthChecker/1.0, which means they were not recognised as healthcheck requests by gorouter, they returned HTTP 200 and the ELB did not take the draining gorouter out of service.
The result was a very small amount of downtime for deployed apps that received requests during a short window of < 1 second.
We decided to:
- submit a change upstream to allow the gorouter to recognise ELB healthchecks
- implement a healhchecking port 82, in the HAProxy we introduced in ADR008,
which appends the
User-Agent: HTTP-Monitor/1.1that gorouter expects
- enable HTTP healthchecks on the ELB
- There was less downtime for deployed applications during a deploy.
- We have an additional reason to keep the intermediate HAProxy we introduced as a temporary measure.