
This is for internal use by the PaaS team. Public-facing documentation is located at docs.cloud.service.gov.uk.

Upgrading CF, BOSH and stemcells

 Before Upgrading

  • Check your environment isn’t resource starved, because this can cause unexpected test failures. Typically, resource starvation in dev happens when there are a number of organisations left over from smoke and acceptance tests. These are prefixed with SMOKE- and CATS- respectively. Provided tests aren’t running at the time, they can be safely deleted with cf delete-org.
  • Separate the upgrade of Cloud Foundry and stemcells from the upgrade of BOSH. Upgrades can cause problems, and in our experience it is difficult to be certain about the cause when multiple things have changed.
  • Establish the correct version to upgrade to:

    • Check cf-deployment releases documentation.
    • Prefer the latest stable release if the release notes look like they won’t introduce bugs.
    • Discuss the planned upgrade version in kick-off.
  • If you encounter issues in the cf-deployment releases, consider forking and patching them or overriding the release version with an ops file.

  • Update the cf-deployment submodule at /manifests/cf-deployment in paas-cf to the chosen version.

  • Use git diff or GitHub compare in the cf-deployment submodule repo to see and review changes to the manifest. For example, to see differences between v1.0.0 and v1.14.0:

  git diff v1.0.0...v1.14.0 cf-deployment.yml

We also use a number of upstream ops files, so you will want to diff them too. See the manifest generation script for which ones get used.
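As a rough sketch of diffing the manifest and the ops files in one go, the helper below wraps the git diff command shown above. The function name and the GIT override are illustrative assumptions, and the ops file path in the example is a placeholder: take the real list from the manifest generation script.

```shell
# Diff cf-deployment.yml plus any extra paths between two cf-deployment tags.
# Run it from inside the cf-deployment submodule checkout.
# GIT can be overridden (for example GIT=echo) to preview the command
# without running it.
diff_cf_deployment() {
  old="$1"
  new="$2"
  shift 2
  "${GIT:-git}" diff "${old}...${new}" -- cf-deployment.yml "$@"
}

# Example (the ops file path is a placeholder):
#   diff_cf_deployment v1.0.0 v1.14.0 operations/aws.yml
```

Passing the paths after `--` keeps git from confusing a file name with a revision.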

Special differences to take into account:

  • New secrets and certificates in variables:. There may be new passwords that must be rotated or blacklisted from rotation. New CA certs need to be adapted to support CA rotation.
  • Release version changes.
  • New instance_groups added.

  • Read the documentation for every version of every release changed. It will save you time and pain in the long run.
  • Run the unit tests for the manifest with the new version of cf-deployment with make test or (cd manifests/cf-manifest && bundle exec rspec --fail-fast). Fix issues as you find them.
  • Update the cf-smoke-tests-release resource in the pipeline to pin the version used in cf-deployment.
  • Update the cf-acceptance-tests resource in the pipeline to use an upstream cfX.Y branch matching the cf-deployment version.
  • Note: If we are using a forked version of cf-smoke-tests or cf-acceptance-tests, create a new branch and rebase our fork accordingly.
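Returning to the first check in this section, the clean-up of leftover SMOKE- and CATS- organisations could be sketched as below. The function names are illustrative, and the sketch assumes that cf orgs prints one organisation name per line after its header. It prints the delete commands rather than running them, so the list can be reviewed first.

```shell
# Keep only organisation names left over from smoke and acceptance tests.
# Header lines and other organisations in the `cf orgs` output are dropped.
leftover_test_orgs() {
  awk '/^(SMOKE|CATS)-/ { print $1 }'
}

# Print one delete command per leftover organisation instead of running it.
print_delete_commands() {
  leftover_test_orgs | while read -r org; do
    printf 'cf delete-org -f %s\n' "$org"
  done
}

# Usage, assuming you are logged in and no tests are running:
#   cf orgs | print_delete_commands        # review the list
#   cf orgs | print_delete_commands | sh   # actually delete
```

The -f flag skips the interactive confirmation, which is why the review step comes first.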

Credhub

(section added October 2018)

Our upgrade to cf-deployment v4.5.0 caused us to diverge from upstream, as documented in this story comment, by not including their suggested credhub instance group.

This divergence isn’t set in stone, but until a CF-level credhub is introduced, you should take care to check that no BOSH releases are relying on a CF-level credhub instance or service. It’s possible that a stronger dependency on credhub will be introduced in the future, in which case we’ll need to do the work to re-align with upstream before upgrading. It’s worth checking for this early in the upgrade story, so that any requisite work can be flagged as a blocker ASAP.

NB: Here, we’re talking about a CF-level credhub, not a BOSH-level credhub. We anticipate that we’ll have one at the BOSH layer, to replace vars-store files, sometime soon (spike). Be clear about the different consumers of any credhubs in our environments, and which one is being excluded at the time of writing.
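A crude first pass at the credhub check is to search the generated manifest for any mention of credhub. The sketch below is only a heuristic: the function name is illustrative, and a hit tells you where to start reading release notes, not that a real dependency exists.

```shell
# List manifest lines that mention credhub, prefixed with line numbers.
# Reads a rendered manifest on stdin and prints nothing when it is clean.
credhub_references() {
  grep -n -i 'credhub' || true
}

# Usage, assuming the rendered manifest has been written to manifest.yml:
#   credhub_references < manifest.yml
```

An empty result does not prove that no release depends on credhub, since a dependency may only be described in the release notes.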

Doing the upgrade

You should test the upgrade changeset in each of these ways:

  • Upgrade from a fully deployed master with SLIM_DEV_DEPLOYMENT=false set, which is equivalent to the change that will happen in production.
  • Deploy a fresh CF, which is something we frequently do in our development environments after the autodelete-cloudfoundry pipeline runs overnight.
  • Confirm that rotating credentials still works and doesn’t cause additional downtime during deployments.

Buildpacks

We have to give tenants at least a week’s notice of buildpack upgrades, so prepare the version changes separately from the CF component upgrades.

Buildpack upgrades are typically done as a separate story. You should upgrade the buildpacks to at least the versions included in the cf-deployment version currently deployed.
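To illustrate the "at least the versions included in cf-deployment" rule, here is a minimal sketch that flags buildpacks which are behind. It assumes you have already collected lines of the form "name current required" by hand or with a script of your own. The function names and version numbers are made up for the example, and sort -V (version sort) is assumed to be available, as it is in GNU coreutils.

```shell
# Succeed when version $1 is greater than or equal to version $2.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Read "name current required" lines on stdin and print the buildpacks
# whose current version is older than the required one.
buildpacks_behind() {
  while read -r name current required; do
    [ -n "$name" ] || continue
    version_at_least "$current" "$required" ||
      printf '%s %s -> %s\n' "$name" "$current" "$required"
  done
}

# Usage (illustrative version numbers):
#   printf 'ruby_buildpack 1.7.20 1.7.21\n' | buildpacks_behind
```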

Notify the tenants

Send an email to users following the upgrade template.

Problems encountered previously

DNS name resolution

We encountered a wide variety of acceptance and smoke test failures which were intermittent. This was due to DNS health check failures. Consul uses these health checks to validate the consistency of the DNS records it serves, and will expire DNS records where the health check has failed.

The root cause was our failure to remove the consul.agent.services property for Diego components in the Cloud Foundry manifest. The Diego release notes for version 1452 did state that these had to be removed, as Diego components now use the Locket library to configure their DNS health checks.

The solution is to read the documentation for every release you pass through, including the intermediate versions, for every component you are upgrading. This is a lot of documentation, but worth it in the long run.

CF CLI

The upgrade to cf-release v233 led to acceptance test failures, as it requires CF CLI version 6.16+. We upgraded to this version in our cf-cli and cf-acceptance-tests Docker containers, which disrupted other developers’ tests once merged. Upgrades to CF CLI versions can be tested using your own private containers while in development. Perhaps we should think about versioning our Docker images so that we can pin particular versions in the pipeline to match the CF and BOSH versions being deployed.

Acceptance Test Failures

The upgrade to v233 introduced some new tests in the acceptance test suite that do not appear to be quite ready for prime time yet.

We experienced failures in:

  • routing suite: The multiple_app_ports test fails if users_can_select_backend is not set to true. The test receives a response of CF-BackendSelectionNotAuthorized, which does not match the expected response of CF-MultipleAppPortsMappedDiegoToDea. This test should not be run unless the user is permitted to switch backends. We raised an issue with Pivotal for this test.

  • v3 suite: The task_test is run even though we have the cf-feature-flag for task_creation set to false. The suite attempts to create a task and receives a response saying Feature Disabled: task_creation, which does not cause any kind of failure. It then attempts to delete the task, which fails because it receives an empty response to its delete attempt when it expects a response indicating that the task is in a FAILED state.