In this second post of our series on making systemd's journald more efficient, we take a close look at journald's mmap() usage. Check out Part 1 of the journald performance series if you missed it or need a refresher.
There are two cases where journald commits journals to durable storage with a call to its journal_file_set_offline() function:
When closing a journal file, such as during a journal rotation or journald shutdown
When periodically syncing to storage, on a default interval of four minutes
In the first case, since the current journal is being retired during rotate, it would be trivial to perform the offline and close in a new thread, without waiting for completion before continuing with a new journal. This would remove the potential delay on rotate. The relative infrequency of rotate events limits the returns of this technique.
The second case is trickier, since the journal actively being written to is to be set offline, yet we want to continue writing to it without waiting for the offlining process to complete. In essence, we want the journal to be simultaneously offline and online. How is this possible?
Assuming the journal mustn't be touched during its offlining until it's set
online again, couldn't we divert all journal writes to newly-established
MAP_PRIVATE mappings for the duration of the offline? Once the journal is
online again, the contents of the
MAP_PRIVATE pages may be copied to the
MAP_SHARED pages for the same journal file.
This would cache all the writes for the duration of the offlining process, while transparently fulfilling the read expectations by being backed by the offline (for writes) journal file.
Determining where to begin with such a modification can be a bit daunting. We can safely assume that journal_file_set_offline() is amenable to simple changes: creating a thread to perform the work of offlining, and setting a per-journal flag indicating that offlining is in progress.
journal_file_set_online() must establish an "online-while-offlining" mode for
the journal, routing memory accesses for online journal I/O to MAP_PRIVATE
page mappings. This modification has a high potential to be an invasive change,
depending on journald's implementation. It makes sense to start here and
understand how journald goes about locating and mapping the pages used for
journal I/O. How many references into the
MAP_SHARED mappings does journald
keep around, and where do they reside in the code? These references must all
be replaced or otherwise invalidated when switching to online-while-offlining
mode, and again on return to normal online mode.
Fortunately, it turns out there are just four such members within struct JournalFile:
// journal-file.h
Header *header;
HashItem *data_hash_table;
HashItem *field_hash_table;
...
MMapCache *mmap;
The first three members are pointers directly into journal file mappings. They can simply be switched between pointers into the MAP_SHARED and MAP_PRIVATE mappings, as appropriate for the journal's current mode.
The fourth member of the JournalFile struct requires going a bit further down the rabbit hole.
All journal file objects are located via the function journal_file_move_to()
in journal-file.c, which takes an offset, size, and object type as
parameters, storing a pointer to the mapped object as the result.
journal_file_move_to() relies on mmap_cache_get() to establish a new
mapping if necessary, and to return a pointer referencing the desired
object. The MMapCache *mmap member from struct JournalFile is handed to
the cache to operate on.
Upon further study, it can be seen that
mmap_cache_get() maintains a cache of
mmap() windows, by doing three basic things:
Checks to see if the desired object is already present in a window-per-type cache of windows, and if so returns it
Scans a linked list of cached windows looking for a match, and if one is found, returns it
Creates a new mapping for the desired object as an 8MiB (or larger, if the
object exceeds 8MiB) window, adds it to the list of cached windows, and
returns it. If the mmap() fails due to ENOMEM, an unused window is
reclaimed from the cache and the mmap() is retried.
These steps all share a common window fit function called window_matches(),
which takes a pointer to the window being checked and the criteria for
fitment: fd number, offset, size, and read/write protection.
A convenient place to integrate the MAP_PRIVATE change would be to add a
private boolean flag to mmap_cache_get(), propagate it down to
window_matches() as an additional criterion to match, and add it to the
struct Window type representing cached windows. It can then be set
accordingly when new windows are created via mmap() with either MAP_SHARED
or MAP_PRIVATE.
There are some interesting asides to note while we're here, since the general spirit of this series is journald performance and stability:
In step #2 above, the search is linear. With 8MiB windows and a 4GiB maximum journal size, the number of cached windows can exceed 500.
There is an opportunity for optimization here, perhaps something like an interval tree would be more appropriate than just a linked list. If this is an area of interest for you, note that mmap cache windows may overlap.
Windows are only reclaimed in response to ENOMEM errors, until
the associated journal is closed via rotate or shutdown. This is the main
reason journald draws complaints about excessive memory use and apparent leaks.
The rationale is that these are file-backed pages, which are shared with the file's backing store, and reclaimable by the kernel. Such reclamation could potentially require page write-back if the pages are dirty, adding more latency to the pending allocation. So why not avoid the fairly expensive task of re-establishing these mappings, by filling journald's address space with them?
The problem with this approach is that the rest of the journald code is
making allocations from the same address space, completely unaware of the
mmap-cache. Outside of the mmap-cache itself, allocations don't know that
the cache may consume an entire 32-bit address space, nor that windows may
be reclaimed and freed from that space. Only mmap-cache knows to handle
ENOMEM returned from its
mmap() calls by reclaiming windows.
It may make sense to teach the rest of the systemd code, some of which
journald shares, about hooks for ENOMEM handling. Then journald could
install an ENOMEM handler tying into the mmap-cache's reclaimer.
The astute reader may have already noticed a glaring problem: if mmap cache
windows may overlap, discrete MAP_PRIVATE mappings of overlapping windows may
become incoherent due to private writes in the overlapping region. Because
MAP_SHARED mappings must propagate their writes to the backing store,
discrete overlapping windows are naturally kept coherent. There is no flavor of
MAP_PRIVATE capable of creating mappings that isolate writes from the backing
store, while propagating them to other private overlapping mappings of the same
file.
The consequence of this, short of eliminating the potential for overlapping
windows, is that a single MAP_PRIVATE mapping for the entire journal file
must be established when mmap_cache_get() first receives a request for a
private mapping. Pointers into this single private mapping may then be
returned for all private requests.
This wouldn't be too big a deal if we only cared about 64-bit architectures. But since 32-bit is supported, and journald's maximum journal size is 4GiB, a single private mapping of the whole file may not even fit in the address space, and we're at an impasse with this strategy.
It may prove interesting to explore adding a sort of semi-shared mapping
mode to Linux's mmap() that takes an additional parameter identifying
which isolated mappings of a given file descriptor within the calling process
should be shared and kept coherent, while being isolated from the underlying
file. Simply assuming all such mappings are kept coherent per calling process
may also suffice, which could be expressed with the existing mmap() prototype.
It sounds a bit like a memfd, but with the faulted pages sourced from a file.
While we're fantasizing about new Linux interfaces, another convenience would
be a way to convert those mappings to MAP_SHARED, committing their dirtied
pages to the backing store in a single system call, so user-space doesn't
have to do the memcpy() back to the MAP_SHARED mappings itself.
In any case, for the sake of measurement, discussion, and validation of the
concept (and my understanding of journald), the MAP_PRIVATE strategy was
implemented and tested on x86_64, and the code is available.
Parts of this post may have sounded vaguely familiar to MongoDB users, due to
its mmap()-based storage engine. MongoDB has a 2GiB size limit on 32-bit
architectures, a consequence of using a combination of shared and private
mappings of the same data files.
With any luck, this post has been illuminating even if the proposed solution isn't generally acceptable. Check back for our next installment, when we'll examine the discussion this proposal led to with the upstream systemd community, and the changes we've since landed there that make journal offlining fully asynchronous.
See engineers from CoreOS detail their frequent work with systemd and projects leveraging its features at systemd.conf in Berlin this week. On Thursday, September 29, Alban Crequy and Luca Bruno will discuss the intimate relationship between systemd and the rkt container engine. On the following day, Friday, September 30, Luca will dive even deeper with a talk on using and improving systemd bindings from recent programming languages like Go and Rust.