Upgrading CF, BOSH and stemcells
- Check your environment isn’t resource starved, because this can cause unexpected test failures. Typically, resource starvation in dev happens when there are a number of organisations left over from smoke and acceptance tests. These are prefixed with
CATS- respectively. Provided tests aren’t running at the time, they can be safely deleted with
- Separate the upgrade of Cloud Foundry and stemcells from the upgrade of BOSH. Upgrades can cause problems, and in our experience it is difficult to pin down the cause when several things have changed at once.
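The leftover-organisation check in the first item above can be sketched as a filter over org names. This is an illustration only, not the team's actual clean-up command: the function name filter_test_orgs is hypothetical, and only the CATS- prefix comes from the text above, so pass whichever other prefix your smoke tests use. It prints matching names so they can be reviewed before anything is deleted.

```shell
# Hypothetical helper: read org names (one per line) on stdin and print
# those that start with any of the prefixes given as arguments.
filter_test_orgs() {
  while IFS= read -r org; do
    for prefix in "$@"; do
      case "$org" in
        "$prefix"*) printf '%s\n' "$org"; break ;;
      esac
    done
  done
}

# Example (assumes the cf CLI is installed and logged in; the header lines
# that `cf orgs` prints do not start with the prefix, so they are dropped):
#   cf orgs | filter_test_orgs CATS-
```

Keeping the helper read-only is deliberate: deletion stays a separate, manual step taken once you have confirmed no tests are running.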
Establish the correct version to upgrade to:
- Check cf-deployment releases documentation.
- Prefer the latest stable release if the release notes look like they won’t introduce bugs.
- Discuss the planned upgrade version in kick-off.
If you encounter issues in the releases of
cf-deployment, consider forking and patching them, or overriding the release version with an ops file.
Update the cf-deployment submodule in
/manifests/cf-deployment of paas-cf to the chosen version.
Use git diff or GitHub compare in the
cf-deployment submodule repo to review changes to the manifest. For example, to see differences between v1.0.0 and v1.14.0:
git diff v1.0.0...v1.14.0 cf-deployment.yml
We also use a number of upstream ops files, so you will want to
diff them too. See the manifest generation script for which ones get used.
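The same comparison can be applied to several paths at once, which is handy for ops files. A minimal sketch, assuming it is run from inside the cf-deployment checkout: the function name diff_between_tags is hypothetical, and the paths in the example are placeholders for whichever files the manifest generation script actually uses.

```shell
# Hypothetical helper: diff the given paths between two tags of the git
# repository in the current directory.
diff_between_tags() {
  from="$1"
  to="$2"
  shift 2
  # "--" separates revisions from paths so a path is never read as a ref.
  git diff "${from}...${to}" -- "$@"
}

# Example (paths are placeholders):
#   diff_between_tags v1.0.0 v1.14.0 cf-deployment.yml operations/
```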
Special differences to take into account:
- New secrets and certificates in
variables:. There may be new passwords that must be rotated or blacklisted from rotation. New CA certs need to be adapted to support CA rotation.
- Release version changes.
- Read the documentation for every version of every release changed. It will save you time and pain in the long run.
- Run the unit tests for the manifest with the new version of cf-deployment with
(cd manifests/cf-manifest && bundle exec rspec --fail-fast). Fix issues as you find them.
- Update the cf-smoke-tests-release resource in the pipeline to pin the version used in cf-deployment.
- Update the cf-acceptance-tests resource in the pipeline to use an upstream
cfX.Y branch matching the cf-deployment version.
- Note: If we are using a forked version of the smoke tests or cf-acceptance-tests, create a new branch and rebase our forked version accordingly.
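Deriving the acceptance-tests branch name from the cf-deployment version can be sketched as below. This assumes that X.Y in cfX.Y is the major and minor part of the cf-deployment version; the function name cats_branch_for is hypothetical, and you should still confirm that the resulting branch exists upstream before pinning it.

```shell
# Hypothetical helper: turn a cf-deployment version such as v4.5.0 into
# the matching cfX.Y branch name (assumes X.Y = major.minor).
cats_branch_for() {
  version="${1#v}"        # drop a leading "v" if present
  major="${version%%.*}"  # text before the first dot
  rest="${version#*.}"    # text after the first dot
  minor="${rest%%.*}"     # text before the next dot
  printf 'cf%s.%s\n' "$major" "$minor"
}

# Example:
#   cats_branch_for v4.5.0
```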
(section added October 2018)
Our upgrade to cf-deployment v4.5.0 caused us to diverge from upstream
by not including their suggested
credhub instance group,
as documented in this story.
This divergence isn’t set in stone, but until a CF-level credhub is introduced, you should take care to check that no BOSH releases are relying on a CF-level credhub instance or service. It’s possible that a stronger dependency on credhub will be introduced in the future, in which case we’ll need to do the work to re-align with upstream before upgrading. It’s worth checking for this early in the upgrade story, so that any requisite work can be flagged as a blocker ASAP.
NB: Here, we’re talking about a CF-level credhub - not a BOSH-level credhub. We anticipate that we’ll have one at the BOSH layer, to replace vars-store files, Sometime Soon (spike). Be clear about the different consumers of any credhubs in our environments, and which one is being excluded at the time of writing.
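A first pass at the dependency check described above is a case-insensitive search for credhub in the manifest and ops files under review. This is only a rough sketch: the function name find_credhub_references is hypothetical, the files to search are up to you, and a match still needs reading to tell a CF-level dependency from a BOSH-level one.

```shell
# Hypothetical helper: print every line mentioning credhub in the given
# files, prefixed with file name and line number. The exit status is
# grep's: 0 if anything matched, 1 if nothing did.
find_credhub_references() {
  grep -n -i -H 'credhub' "$@"
}

# Example (file names are placeholders):
#   find_credhub_references cf-deployment.yml operations/*.yml
```

An empty result is not proof that no release depends on credhub, so the release notes still need reading.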
Doing the upgrade
You should test the upgrade changeset:
- From a fully deployed master with
SLIM_DEV_DEPLOYMENT=false set, which is equivalent to the change that will happen in production.
- Deploying a fresh CF, which is something we frequently do in our development environments after the autodelete-cloudfoundry pipeline runs overnight.
- Confirm that rotating credentials still works and doesn’t cause additional downtime during deployments.
We have to give tenants at least a week’s notice of buildpack upgrades, so prepare the buildpack version changes separately from the CF component upgrades.
Buildpack upgrades are typically done as a separate story. You should upgrade the buildpacks to at least the versions included in the cf-deployment version currently deployed.
Notify the tenants
Send an email to users following the upgrade template.
Problems encountered previously
DNS name resolution.
We encountered a wide variety of acceptance and smoke test failures which were intermittent. This was due to DNS health check failures. Consul uses these health checks to validate the consistency of the DNS records it serves, and will expire DNS records where the health check has failed.
The root cause was failure to remove the
consul.agent.services property for Diego components in the Cloud Foundry manifest. The Diego release notes for version 1452 did state that these had to be removed, as Diego components now use the Locket library to configure their DNS health checks.
The solution is to read the documentation on every release you are passing, for every component you are upgrading. This is a lot of documentation, but worth it in the long run.
The upgrade to cf-release v233 led to acceptance test failures, as it requires CF CLI version 6.16+. We upgraded to this version in our cf-cli and cf-acceptance-tests Docker containers, which confounded the tests of other developers when merged. Upgrades to CF CLI versions can be tested using your own private containers while in development. Perhaps we should think about versioning our Docker images so that we can pin particular versions in the pipeline to match the CF and BOSH versions being deployed.
Acceptance Test Failures
The upgrade to v233 introduced some new tests in the acceptance-test suite which do not appear to be ready for prime time yet.
We experienced failures in:
- routing suite
  - The multiple_app_ports test fails if users_can_select_backend is not set to true.
  - The test receives a response of CF-BackendSelectionNotAuthorized, which does not match the expected response of CF-MultipleAppPortsMappedDiegoToDea and causes the test to fail.
  - This test should not be run unless the user is permitted to switch backends. We raised an issue with Pivotal for this test :
- v3 suite
  - The task_test is being run even though we have the
    false
  - The suite attempts to create a task and receives a response saying Feature Disabled: task_creation, which does not cause any kind of failure. It then goes on to attempt to delete the task, which fails as it receives an empty response to its delete attempt, when it is expecting to receive a response indicating that the task is in a