Throughout this document
$var is used to refer to the directory
$uuid refers to a pod's UUID e.g. "076292e6-54c4-4cc8-9fa7-679c5f7dcfd3".
Due to rkt's architecture - and specifically its lack of any management daemon process - a combination of advisory file locking and atomic directory renames (via rename(2)) is used to represent and transition the basic pod states.
At times where a state must be reliably coupled to an executing process, that process is executed with an open file descriptor possessing an exclusive advisory lock on the respective pod's directory. Should that process exit for any reason, its open file descriptors will automatically be closed by the kernel, implicitly unlocking the pod's directory. By attempting to acquire a shared non-blocking advisory lock on a pod directory we're able to poll for these process-bound states, additionally by employing a blocking acquisition mode we may reliably synchronize indirectly with the exit of such processes, effectively providing us with a wake-up event the moment such a state transitions. For more information on advisory locks see the flock(2) man page.
At this time there are four distinct phases of a pod's life which involve process-bound states:
Each of these phases involves an exclusive lock on a given pod's directory. As an exclusive lock by itself cannot express both the phase and process-bound activity within that phase, we combine the lock with the pod's directory location to represent the whole picture:
To prevent the period between first creating a pod's directory and acquiring its lock from appearing as prepare-failed in the Prepare phase, and to provide a phase for prepared pods where they may dwell and the lock may be acquired prior to entering the Run phase, two additional directories are employed where locks have no meaning:
These phases, their function, and how they proceed through their respective states is explained in more detail below.
rkt run and
rkt prepare instantiate a new pod by creating an empty directory at
An exclusive lock is immediately acquired on the created directory which is then renamed to
$var/prepare/$uuid, transitioning to the
rkt run and
rkt prepare enter this phase identically; holding an exclusive lock on the pod directory
After preparation completes, while still holding the exclusive lock (the lock is held for the duration):
rkt prepare transitions to
Prepared by renaming
rkt run transitions directly from
Run by renaming
$var/run/$uuid, entirely skipping the
Prepare fail or be interrupted,
$var/prepare/$uuid will be left in an unlocked state.
Any directory in
$var/prepare in an unlocked state is considered a failed prepare.
rkt gc identifies failed prepares in need of clean up by trying to acquire a shared lock on all directories in
$var/prepare, renaming successfully locked directories to
$var/garbage where they are then deleted.
rkt prepare concludes successfully by leaving the pod directory at
$var/prepared/$uuid in an unlocked state before returning
$uuid to the user.
rkt run-prepared resumes where
rkt prepare concluded by exclusively locking the pod at
$var/prepared/$uuid before renaming it to
$var/run/$uuid, specifically acquiring the lock prior to entering the
rkt run never enters this phase, skipping directly from
Run with the lock held.
rkt run and
rkt run-prepared both arrive here with the pod at
$var/run/$uuid while holding the exclusive lock.
The pod is then executed while holding this lock.
It is required that the stage 1
coreos.com/rkt/stage1/run entrypoint keep the file descriptor representing the exclusive lock open for the lifetime of the pod's process.
All this requires is that the stage 1 implementation not close the inherited file descriptor.
This is facilitated by supplying stage 1 its number in the RKT_LOCK_FD environment variable.
What follows applies equally to
rkt run and
A pod is considered exited if a shared lock can be acquired on
Upon exit of a pod's process, the exclusive lock acquired before entering the
Run phase becomes released by the kernel.
Exited pods are discarded using a common mark-and-sweep style of garbage collection by invoking the
rkt gc command.
This relatively simple approach lends itself well to a minimal file-system based implementation utilizing no additional daemons or record keeping with good efficiency.
The process is performed in two distinct passes explained in detail below.
All directories found in
$var/run are tested for exited status by trying to acquire a shared advisory lock on each directory.
When a directory's lock cannot be acquired, the directory is skipped as it indicates the pod is currently executing.
When the lock is successfully acquired, the directory is renamed from
This renaming effectively implements the "mark" operation.
Since the locks are immediately released, operations like
rkt status may safely execute concurrently with
Marked exited pods dwell in the
$var/exited-garbage directory for a grace period during which their status may continue to be queried by
The rename from
$var/exited-garbage/$uuid serves in part to keep marked pods from cluttering the
$var/run directory during their respective dwell periods.
A side-effect of the rename operation responsible for moving a pod from
$var/exited-garbage is an update to the pod directory's change time.
The sweep operation takes this updated file change time as the beginning of the "dwell" grace period, and discards exited pods at the expiration of that period.
This grace period currently defaults to 30 minutes, and may be explicitly specified using the
--grace-period=duration flag with
Note that this grace period begins from the time a pod was marked by
rkt gc, not when the pod exited.
A pod becomes eligible for marking when it exits, but will not actually be marked for collection until a subsequent
The change times of all directories found in
$var/exited-garbage are compared against the current time.
Directories having sufficiently old change times are locked exclusively and cleaned up.
If a lock acquisition fails, the directory is skipped.
rkt gc may fail to acquire an exclusive lock if the pod to be collected is currently being accessed, by
rkt status or another
rkt gc, for example.
The skipped pods will be revisited on a subsequent
rkt gc invocation's sweep pass.
During the cleanup, the pod's stage1 gc entry point is first executed.
This gives the stage1 a chance to clean up anything related to the environment shared between containers.
The default stage1 uses the gc entrypoint to clean up the private networking artifacts.
After the completion of the gc entrypoint, the pod directory is recursively deleted.
To answer the questions "Has this pod exited?" and "Is this pod being deleted?" the pod's UUID is looked for in
Pods found in the
$var/exited-garbage directory must already be exited, and a shared lock acquisition may be used to determine if the garbage pod is actively being deleted.
Those found in the
$var/run directory may be exited or running, and a failed shared lock acquisition indicates a pod in
$var/run is alive at the time of the failed acquisition.
Care must be taken when acting on what is effectively always going to be stale knowledge of pod state; though a pod's status may be found to be "running" by the mechanisms documented here, this was an instantaneously sampled state that was true at the time sampled (failed lock attempt at
$var/run/$uuid), and may cease to be true by the time code execution progressed to acting on that sample.
Pod exit is totally asynchronous and cannot be prevented, relevant code must take this into consideration (e.g.
rkt enter) and be tolerant of states progressing.
For example, two
rkt run-prepared invocations for the same UUID may occur simultaneously.
Only one of these will successfully transition the pod from
Run due to rename's atomicity, which is exactly what we want.
The loser of this race needs to simply inform the user of the inability to transition the pod to the run state, perhaps with a check to see if the pod transitioned independently and a useful message mentioning it.
Another example would be two
rkt gc commands finding the same exited pods and attempting to transition them to the
Garbage phase concurrently.
They can't both perform the transitions, one will lose the race at each pod.
This needs to be considered in the error handling of the transition callers as perfectly normal.
Simply ignoring ENOENT errors propagated from the loser's rename calls can suffice.