Continued: Journal durability for the impatient
In this second post of our series on making systemd's
journald more efficient, we take a close look at journald's
mmap() usage. Check out Part 1 of the journald performance post if you missed it, or need a refresher.
There are two cases where journald commits journals to durable storage with a call to its journal_file_set_offline() function:
When closing a journal file, such as during a journal rotation or journald shutdown
When periodically syncing to storage, on a default interval of four minutes
In the first case, since the current journal is being retired during rotate, it would be trivial to perform the offline and close in a new thread, without waiting for completion before continuing with a new journal. This would remove the potential delay on rotate. The relative infrequency of rotate events limits the returns of this technique.
The second case is trickier, since the journal actively being written to is to be set offline, yet we want to continue writing to it without waiting for the offlining process to complete. In essence, we want the journal to be simultaneously offline and online. How is this possible?
MAP_PRIVATE: Copy-on-write mappings to the rescue?
Assuming the journal mustn't be touched during its offlining until it's set online again, couldn't we divert all journal writes to newly-established
MAP_PRIVATE mappings for the duration of the offline? Once the journal is online again, the contents of the
MAP_PRIVATE pages may be copied to the usual online
MAP_SHARED pages for the same journal file.
This would cache all the writes for the duration of the offlining process, while transparently fulfilling the read expectations by being backed by the offline (for writes) journal file.
Determining where to begin with such a modification can be a bit daunting. We can safely assume that
journal_file_set_offline() is amenable to the simple changes consisting of creating a thread for performing the work of offlining, and setting a per-journal flag indicating offlining is in progress.
journal_file_set_online() must establish an "online-while-offlining" mode for the journal, routing memory accesses for online journal I/O to
MAP_PRIVATE page mappings. This modification has a high potential to be an invasive change, depending on journald's implementation. It makes sense to start here and understand how journald goes about locating and mapping the pages used for journal I/O. How many references into the
MAP_SHARED mappings does journald keep around, and where do they reside in the code? These references must all be replaced or otherwise invalidated when switching to
MAP_PRIVATE (online-while-offlining) mode, and again on return to
MAP_SHARED (online) mode.
Fortunately it turns out there are just four members within
struct JournalFile of concern:
```c
// journal-file.h
Header *header;
HashItem *data_hash_table;
HashItem *field_hash_table;
...
MMapCache *mmap;
```
The first three members are pointers directly into journal file mappings. They can simply be replaced with pointers into either the MAP_SHARED or MAP_PRIVATE mappings, depending on the journal's current mode.
The fourth member of the JournalFile struct requires going a bit further down the rabbit hole.
mmap-cache.c: Why journald's RSS appears to only grow
All journal file objects are located via the function journal_file_move_to() in journal-file.c, which takes an offset, size, and object type as parameters, storing a pointer to the mapped object as the result.
journal_file_move_to() relies on
mmap_cache_get() to establish a new mapping if necessary, and return a pointer to it referencing the desired object.
MMapCache *mmap from
struct JournalFile is handed to
mmap_cache_get() as the cache to operate on.
Upon further study, it can be seen that
mmap_cache_get() maintains a cache of
mmap() windows, by doing three basic things:
Checks to see if the desired object is already present in a window-per-type cache of windows, and if so returns it
Scans a linked list of cached windows looking for a match, and if one is found, returns it
Creates a new mapping for the desired object as an 8MiB (or larger, if the object exceeds 8MiB) window, adds it to the list of cached windows, and returns it. If the mmap() fails due to ENOMEM, an unused window is reclaimed from the cache and the mapping is attempted again.
These steps all share a common window fit function called
window_matches() which takes a pointer to the window being checked, and the criteria for fitment: fd number, offset, size, and read/write protection.
A convenient place to integrate the
MAP_PRIVATE change would be to add a
private boolean flag to
mmap_cache_get(), propagate it down to
window_matches() as an additional criteria to match, and add it to the
struct Window type representing cached windows. It can then be set accordingly when new windows are created via
mmap() with either MAP_SHARED or MAP_PRIVATE.
There are some interesting asides to note while we're here, since the general spirit of this series is journald performance and stability:
In step #2 above, the search is linear. With 8MiB windows and a 4GiB maximum journal size, the number of cached windows can exceed 500.
There is an opportunity for optimization here, perhaps something like an interval tree would be more appropriate than just a linked list. If this is an area of interest for you, note that mmap cache windows may overlap.
Windows are only reclaimed in response to
ENOMEM errors, until the associated journal is closed via rotate or shutdown. This is the main reason journald draws complaints about excessive memory use and apparent leaks.
The rationale is that these are file-backed pages, which are shared with the file's backing store, and reclaimable by the kernel. Such reclamation could potentially require page write-back if the pages are dirty, adding more latency to the pending allocation. So why not avoid the fairly expensive task of re-establishing these mappings, by filling journald's address space with them?
The problem with this approach is that the rest of the journald code is making allocations from the same address space, completely unaware of the mmap-cache. Outside of the mmap-cache itself, allocations don't know that the cache may consume an entire 32-bit address space, nor that windows may be reclaimed and freed from that space. Only mmap-cache knows to handle
ENOMEM returned from its mmap() calls by reclaiming windows.
It may make sense to teach the rest of the systemd code, some of which journald shares, about hooks for
ENOMEM handling. Then journald could install an ENOMEM handler tying into the mmap-cache's reclaimer.
MAP_PRIVATE vs. MAP_SHARED: Two extremes on the coherence spectrum
The astute reader may have already noticed a glaring problem: since mmap cache windows may overlap, discrete
MAP_PRIVATE mappings of overlapping windows may become incoherent due to private writes in the overlapping region. Because
MAP_SHARED mappings must propagate their writes to the backing store, discrete overlapping windows are naturally kept coherent. There is no flavor of
MAP_PRIVATE capable of creating mappings that isolate writes from the backing store, while propagating them to other private overlapping mappings on the same file.
The consequence of this, in lieu of eliminating the potential for overlapping windows, is that a single
MAP_PRIVATE mapping for the entire journal file must be established when
mmap_cache_get() first receives a request for a private mapping. Pointers into this single private mapping may then be returned for the relevant private requests.
This wouldn't be too big a deal if we only cared about 64-bit architectures. But since 32-bit is supported, and journald's maximum journal size is 4GiB, we're at an impasse with this strategy.
It may prove interesting to explore adding a sort of semi-shared
MAP_ISOLATED mapping to Linux's
mmap() that takes an additional parameter identifying which isolated mappings of a given file descriptor within the calling process should be shared and kept coherent, while being isolated from the underlying file. Assuming all such mappings are kept coherent per calling process may also suffice, which could easily be implemented with the existing mmap() flags argument alone.
It sounds a bit like a memfd but with the faulted pages sourced from a file. While we're fantasizing about new Linux interfaces, another convenience would be a way to convert those mappings to
MAP_SHARED, committing their dirtied pages to the backing store in a single system call, so user-space doesn't have to do the
memcpy() back to the MAP_SHARED mappings itself.
In any case, for the sake of measurement, discussion, and validation of the concept (and my understanding of journald), the
MAP_PRIVATE strategy was implemented and tested on x86_64; you can find the code here.
Parts of this post may have sounded vaguely familiar to MongoDB users due to MongoDB's
mmap()-based storage engine. MongoDB has a 2GiB size limitation on 32-bit architectures, a consequence of using a combination of
MAP_PRIVATE and MAP_SHARED whole-file mappings.
With any luck, this post has been illuminating even if the proposed solution isn't generally acceptable. Check back for our next installment, when we'll examine the discussion this proposal led to with the upstream systemd community, and the changes we've since landed there to make journal offlining fully asynchronous.
CoreOS at systemd.conf 2016
See engineers from CoreOS detail their frequent work with systemd and projects leveraging its features at systemd.conf in Berlin this week. On Thursday, September 29, Alban Crequy and Luca Bruno will discuss the intimate relationship between systemd and the rkt container engine. On the following day, Friday, September 30, Luca will dive even deeper with a talk on using and improving systemd bindings from recent programming languages like Go and Rust.