
Recoverable System Upgrades

System upgrades can introduce problems, and when an upgrade goes bad, someone has to step in manually to get the machines back up. The problem is compounded on public cloud infrastructure, where you often have large numbers of machines and getting to a recovery console can take a while.

We believe in tight release cycles. CoreOS is designed to keep your servers secure against the latest exploits and to give you access to the latest features and fixes in weeks or months, not years. So, how do we balance tight release cycles with a need for stability? We built some systems and technology into CoreOS to help balance it all out.

Enabling consistency and rollback

Upgrading CoreOS is a bit different from upgrading the usual distros. Our update system is based on ChromeOS. The big difference is that we have two root partitions, which we'll call root A and root B. Initially, your system is booted into the root A partition, and CoreOS begins talking to the update service to find out about new updates. If an update is available, it is downloaded and installed to root B. To ensure we don't disrupt your application, we use Linux cgroups to rate limit the disk and network I/O this process is allowed to use.

This dual-root scheme is an improvement on the existing workflow of yum or apt-get. When upgrading with these tools, the package manager will often restart daemons so they pick up new libraries, or move configuration files around. Because of that, you should treat a system upgrade like a deploy: take machines out of load balancers and clusters before upgrading them.

On CoreOS, however, the root partition your application is running on top of isn't modified in place, so your server is never in an unstable or partially upgraded state. To complete the upgrade, simply reboot the machine, and in a few seconds you will be running on root B with a freshly updated system.
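The flow described above can be sketched as a small Python model. This is only an illustration of the dual-root idea, not CoreOS code: the `Machine` and `Root` classes, the method names, and the version strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Root:
    name: str
    version: str


@dataclass
class Machine:
    # Hypothetical model: two root partitions, only one of which is booted.
    roots: dict = field(default_factory=lambda: {
        "A": Root("A", "1.0"),
        "B": Root("B", "1.0"),
    })
    active: str = "A"      # root the system is currently running from
    next_boot: str = "A"   # root that will be used after the next reboot

    def passive(self) -> str:
        """Return the name of the root that is not currently booted."""
        return "B" if self.active == "A" else "A"

    def install_update(self, version: str) -> None:
        """Write the update to the passive root; the active root is untouched."""
        target = self.passive()
        self.roots[target].version = version
        self.next_boot = target

    def reboot(self) -> None:
        """Switch to whichever root was selected for the next boot."""
        self.active = self.next_boot

    def running_version(self) -> str:
        return self.roots[self.active].version


machine = Machine()
machine.install_update("1.1")
before_reboot = machine.running_version()  # root A is still running, unchanged
machine.reboot()
after_reboot = machine.running_version()   # now running from the updated root B
```

The point of the model is that `install_update` never writes to the active root, so the running system only changes at the moment of reboot.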

What happens when things go wrong?

To further protect against a bad upgrade, CoreOS has an automated system for rollback. If a machine fails to come up after an upgrade, simply reboot the machine and it will revert to the previous root partition.

This works by setting some metadata on the partition table. When root B receives its upgrade, its "tries" counter is set to 1. This tells our boot system to attempt to boot this root once; if it fails to come up, the next boot reverts to root A.
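As a rough sketch of how a "tries" counter can drive rollback, the Python model below picks a root for each boot. It is a simplification under stated assumptions: the `Slot` class, the `select_root` function, and the `successful` flag (used here to mark a root that has booted cleanly) are hypothetical and not taken from the actual boot code.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    name: str
    tries: int = 0            # remaining boot attempts for an unverified root
    successful: bool = False  # assumption: set once the root has booted cleanly


def select_root(known_good: Slot, updated: Slot) -> Slot:
    """Choose the root for the next boot, spending one try if needed."""
    if updated.successful:
        return updated
    if updated.tries > 0:
        updated.tries -= 1    # the attempt is consumed before booting
        return updated
    return known_good         # no tries left: fall back to the previous root


# Bad upgrade: root B never comes up, so it is never marked successful.
root_a = Slot("A", successful=True)
root_b = Slot("B", tries=1)
first_boot = select_root(root_a, root_b).name   # one attempt on root B
second_boot = select_root(root_a, root_b).name  # reverts to root A

# Good upgrade: root B comes up and is marked successful.
good_b = Slot("B", tries=1)
select_root(root_a, good_b)
good_b.successful = True
later_boot = select_root(root_a, good_b).name   # stays on root B
```

Decrementing the counter before the boot attempt is what makes the rollback automatic: a root that hangs or crashes has already used its only try, so a plain reboot is enough to land back on the previous root.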

The end result is that CoreOS gives you frequent, consistent updates to the core system that your applications rely on, while giving you automated tools to reduce the risks involved.