On May 15, CoreOS was informed of a vulnerability in the alpha version of CoreOS Linux. Within 8 hours of this notification, over 99% of affected systems had been automatically patched. Though this issue was limited to an alpha version, we hold all of our releases to the same security standards, and we immediately responded, reported, and corrected the issue. This post describes the nature of the vulnerability, our response, and our plans to avoid similar issues in the future.
We received a notification from a user running version 1045.0.0 of CoreOS Linux who had identified a system compromise. An attacker had logged in as the operator user and was using the compromised system to send spam. This was initially confusing, as the operator user is disabled, but we were able to duplicate the issue internally: it was possible to log into the operator and core accounts with any password, even if the accounts had not been provisioned with passwords.
Further investigation revealed that the issue was limited to versions 104x.0.0 of CoreOS Linux Alpha, and that earlier versions were not affected. We examined the set of changes between the last known good version and the first affected version and narrowed the problem down to this commit enabling the PAM System Security Services module. Reverting this commit would resolve the vulnerability, but we still needed to understand why such a change had introduced the problem.
Linux supports a range of different authentication mechanisms, mediated through the Pluggable Authentication Module (PAM) library. PAM can be configured to authenticate against different backends, each provided by a separate module. A typical PAM configuration will look something like:
    auth required pam_env.so
    auth required pam_unix.so nullok try_first_pass
    auth optional pam_permit.so
Modules marked as required must return a success code; if they fail to do so, the authentication attempt will fail. This configuration requires that the pam_env and pam_unix modules both succeed in order for authentication to succeed. The pam_env module configures user environment variables, and the pam_unix module validates the user password against the standard system password database. The pam_permit.so module serves no direct purpose here, but it is relevant for some complex configurations, so it is included in the default Gentoo PAM configuration and was inherited by CoreOS from there.
Enterprise deployments frequently use shared authentication systems, where password data is stored in a centralized system, rather than on individual clients. Red Hat developed a suite of tools for this purpose called System Security Services (SSS). SSS was integrated into CoreOS to enable user authentication via centralized authentication services requiring SSS. As part of this, the SSS module was added to the CoreOS PAM configuration:
    auth required pam_env.so
    auth sufficient pam_unix.so nullok try_first_pass
    auth sufficient pam_sss.so try_first_pass
    auth optional pam_permit.so
Sufficient indicates to PAM that if a user successfully authenticates against this module, the authentication attempt will succeed; there is no need to proceed with the remaining configured modules. If a user provided a password that was present in the system password database, pam_unix would succeed, and the user would be logged in. If a user provided a password that was present in the configured SSS backends (such as LDAP or Active Directory), pam_sss would succeed, and the user would be logged in.
This configuration was based on the SSS documentation. Unfortunately, we ran into a difference between Gentoo-based systems and Red Hat-based systems. Gentoo defaults to ending the PAM configuration with an optional pam_permit. Red Hat defaults to ending the PAM configuration with a required pam_deny. This difference went unnoticed while incorporating the pam_sss configuration, resulting in the setup above.
The difference between these configurations turned out to be critical. If neither pam_unix nor pam_sss succeeded, PAM fell through to pam_permit in the CoreOS configuration, rather than pam_deny as it would have in the Red Hat configuration. This meant that failing both pam_unix and pam_sss on CoreOS systems would, surprisingly, result in authentication succeeding and access being granted.
This matched the observed behavior. SSS was not configured on the affected systems, so pam_sss could not succeed. The users had disabled passwords, and so pam_unix could not succeed. PAM then fell through to pam_permit and allowed the user to log in.
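The fall-through can be illustrated with a small toy model. This is a simplified sketch of the required, sufficient, and optional control flags as described above, not the real PAM library: each module is reduced to a function returning True or False, and the names `authenticate`, `COREOS_STACK`, `REDHAT_STYLE_STACK`, and the user `alice` with her password are hypothetical, introduced only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A module is modeled as a function (user, password) -> bool.
Module = Callable[[str, str], bool]
Stack = List[Tuple[str, Module]]  # (control flag, module)

# Hypothetical local password database; "operator" has no usable password.
LOCAL_PASSWORDS: Dict[str, str] = {"alice": "correct horse"}


def pam_env(user: str, password: str) -> bool:
    return True  # only sets up the environment in this model


def pam_unix(user: str, password: str) -> bool:
    return LOCAL_PASSWORDS.get(user) == password


def pam_sss(user: str, password: str) -> bool:
    return False  # SSS is not configured, so it can never succeed


def pam_permit(user: str, password: str) -> bool:
    return True  # always succeeds


def pam_deny(user: str, password: str) -> bool:
    return False  # always fails


def authenticate(stack: Stack, user: str, password: str) -> bool:
    """Evaluate a stack using simplified control-flag semantics."""
    failed = False     # a required module has failed
    succeeded = False  # some module has reported success
    for control, module in stack:
        ok = module(user, password)
        if control == "required":
            if ok:
                succeeded = True
            else:
                failed = True
        elif control == "sufficient":
            if ok and not failed:
                return True  # stop here; remaining modules are skipped
            # a failing sufficient module is ignored
        elif control == "optional":
            if ok:
                succeeded = True
        else:
            raise ValueError(f"unsupported control flag: {control}")
    return succeeded and not failed


COREOS_STACK: Stack = [
    ("required", pam_env),
    ("sufficient", pam_unix),
    ("sufficient", pam_sss),
    ("optional", pam_permit),
]

REDHAT_STYLE_STACK: Stack = [
    ("required", pam_env),
    ("sufficient", pam_unix),
    ("sufficient", pam_sss),
    ("required", pam_deny),
]

if __name__ == "__main__":
    print(authenticate(COREOS_STACK, "operator", "anything"))        # True
    print(authenticate(REDHAT_STYLE_STACK, "operator", "anything"))  # False
```

In this model, a stack ending in a required pam_deny rejects any login that no sufficient module has already accepted, while the stack ending in an optional pam_permit accepts it.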
The operator user was not used by CoreOS, but existed because it exists in the Gentoo Portage system from which CoreOS is derived. The operator user exists on several UNIX-like systems and is present in many automated SSH attack scripts, so the presence of an operator user that could be accessed without a valid password left systems vulnerable to these automated attacks. This meant that the vulnerability was rapidly exploited on CoreOS systems with publicly accessible SSH services.
CoreOS immediately paused distribution of the vulnerable versions of CoreOS Linux after receiving the security notification. After identifying the issue, we determined the fastest way to deploy a secure version of CoreOS to the affected systems was to roll back to the last known good version of the operating system. This required us to reconfigure our system upgrade service to force rollbacks for the affected versions, but did not require any modification or administrator intervention on the client systems.
CoreOS Linux is designed to perform unattended system upgrades. Systems can be subscribed to the stable, beta, or alpha channel. New releases are published to alpha; once we are confident that they do not have any serious bugs, they are promoted to the beta channel, and from there, once we believe they are sufficiently well-tested, to stable.
In order to avoid new releases breaking a large number of systems, our upgrade system gradually rolls out new updates. Even after a new alpha image has been published, most systems subscribed to the alpha channel will not install it immediately. If serious bugs are found, we pause the update.
This meant that most systems running CoreOS Alpha never upgraded to the vulnerable versions of CoreOS. We reconfigured our upgrade system so that on receiving upgrade requests from vulnerable systems, it would "upgrade" them (in this special case, actually a downgrade) to the last good version. We disabled rate limiting on the update service to allow this critical process to happen as quickly as possible. As a result, almost all vulnerable machines were restored to a secure version of CoreOS within 2 hours of this special fix being available.
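The decision logic described above might be sketched roughly as follows. This is not the actual CoreOS update service: the function names, the hash-based rollout bucket, and the `LAST_GOOD` version string are all assumptions made for illustration (the post does not state the real last known good version).

```python
import hashlib
from typing import Optional

# Placeholder only; NOT the actual last known good version.
LAST_GOOD = "1000.0.0"


def is_affected(version: str) -> bool:
    """True for the 104x.0.0 alpha releases named in the post."""
    parts = version.split(".")
    if len(parts) != 3:
        return False
    major, minor, patch = parts
    return len(major) == 4 and major.startswith("104") and (minor, patch) == ("0", "0")


def rollout_bucket(machine_id: str) -> float:
    """Map a machine ID to a stable value in [0, 1) for gradual rollouts."""
    digest = hashlib.sha256(machine_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def choose_update(
    current: str,
    machine_id: str,
    latest: str,
    rollout_fraction: float,
    paused: bool,
    last_good: str = LAST_GOOD,
) -> Optional[str]:
    """Return the version to offer this machine, or None for no update."""
    if is_affected(current):
        # Forced rollback: bypasses both the pause and the rate limiting.
        return last_good
    if paused or current == latest:
        return None
    if rollout_bucket(machine_id) < rollout_fraction:
        return latest
    return None
```

Under these assumptions, a machine reporting version 1045.0.0 is always offered the rollback target, even while the normal rollout is paused, and unaffected machines receive nothing until the pause is lifted.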
This misconfiguration was abetted by confirmation bias. The expected outcome of the change to the CoreOS PAM configuration was for users who presented a password present in an authentication database to be successfully authenticated. Because of the pam_permit failure case explained above, this was the observed behavior in testing, so the change was assumed to be correct. No attempt was made to determine whether the observed behavior could be explained in some other way, such as the system allowing any presented password.
The issue went undetected during pre-merge review. To avoid situations like this in the future, we are concentrating on development of more comprehensive automated testing. Our verification tests now perform a series of additional security checks, including ensuring that remote login requires valid credentials. We have also taken the opportunity to introduce stronger image validation during the system image build process, automatically flagging packages with reported security issues. We will also ensure that security-related changes are accompanied by appropriate tests. For good measure, the operator account on CoreOS systems has been further restricted, with its login shell removed to go with its already disabled password.
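One kind of automated check that could catch this class of mistake is a static lint over the PAM auth stack. The sketch below is a hypothetical heuristic, not the verification test CoreOS actually runs: it flags any auth stack that uses sufficient modules without a required (or requisite) pam_deny.so after the last of them. The function names and the two sample configurations are illustrative.

```python
from typing import List, Tuple


def parse_auth_stack(config: str) -> List[Tuple[str, str]]:
    """Extract (control, module) pairs from the auth lines of a PAM config."""
    entries = []
    for raw in config.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "auth":
            entries.append((fields[1], fields[2]))
    return entries


def lacks_terminal_deny(config: str) -> bool:
    """True if sufficient modules are not followed by a required pam_deny.so."""
    stack = parse_auth_stack(config)
    sufficient = [i for i, (control, _) in enumerate(stack) if control == "sufficient"]
    if not sufficient:
        return False  # heuristic only applies to stacks with sufficient modules
    tail = stack[sufficient[-1] + 1:]
    return not any(
        control in ("required", "requisite") and module == "pam_deny.so"
        for control, module in tail
    )


COREOS_AUTH = """
auth required pam_env.so
auth sufficient pam_unix.so nullok try_first_pass
auth sufficient pam_sss.so try_first_pass
auth optional pam_permit.so
"""

REDHAT_STYLE_AUTH = """
auth required pam_env.so
auth sufficient pam_unix.so nullok try_first_pass
auth sufficient pam_sss.so try_first_pass
auth required pam_deny.so
"""
```

A lint like this complements, rather than replaces, an end-to-end test that attempts a real login with invalid credentials and expects it to be rejected.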
This issue affected only CoreOS Alpha, and was detected quickly. There's an argument that this is exactly what alpha releases are for. We disagree.
One of the biggest problems in our industry is providing appropriate security support to older software versions. Supporting multiple older versions means more work in backporting security fixes, and makes it more likely for issues to be overlooked in some of those older versions. Reducing the number of supported versions reduces security overhead, allowing more time to be spent on the development of new security features, and increasing overall security.
However, users are understandably reluctant to deploy new versions of operating systems without performing significant testing beforehand. This traditionally requires the creation of a full test environment with synthetic use-cases that attempt to duplicate real-world production environments. This is difficult, expensive, and time consuming, and it still doesn't guarantee conditions sufficiently identical to prevent all unexpected results when new versions are deployed to production.
Distributed computing changes this. When services run across a cluster of machines, each individual machine has less impact on the performance or availability of the service. A CoreOS cluster makes a staggered deployment possible, with most machines running stable, a few running beta, and an even smaller number running alpha. If no issues happen on the alpha or beta machines, new stable releases can be rolled out to the rest of the cluster with a greater degree of confidence. Such integral test environments can be a subset of the production compute resources, ensuring that the tests face real-world conditions.
This tiered versioning makes it easier to deploy security fixes and reduces the overhead involved in supporting multiple releases over an extended period of time. It only works if users have confidence that alpha versions of CoreOS don't introduce security issues. It is never acceptable for released versions of CoreOS to contain vulnerable software, even alpha versions. We treat security issues in alpha exactly the same way we treat security issues in stable, in keeping with our mission to secure the infrastructure that powers the Internet.
This issue was a critical vulnerability, its seriousness somewhat mitigated by the CoreOS staged upgrade process and the ability to rapidly push fixed versions to vulnerable systems. The issue nevertheless resulted in systems left vulnerable for a period of time, and required significant emergency effort from the CoreOS team. We do not consider this acceptable, and we apologize to all users affected by this issue. We have introduced additional security testing to reduce the probability of similar events in the future, and we continue to examine our development processes to identify other fields for automated security and release testing.
We wish to thank Tim Dettrick from the University of Queensland ITEE e-Research Group and Jake Yip from the University of Melbourne NeCTAR Research Cloud for alerting us about this issue.