
Eliminating Delays From systemd-journald, Part 2

Continued: Journal durability for the impatient

In this second post of our series on making systemd's journald more efficient, we take a close look at journald's mmap() usage. Check out Part 1 of the journald performance post if you missed it, or need a refresher.

There are two cases where journald commits journals to durable storage with a call to its journal_file_set_offline() function:

  1. When closing a journal file, such as during a journal rotation or journald shutdown

  2. When periodically syncing to storage, on a default interval of five minutes

In the first case, since the current journal is being retired during rotate, it would be trivial to perform the offline and close in a new thread, without waiting for completion before continuing with a new journal. This would remove the potential delay on rotate. However, rotate events are relatively infrequent, which limits the returns of this technique.

The second case is trickier, since the journal actively being written to is the one being set offline, yet we want to continue writing to it without waiting for the offlining process to complete. In essence, we want the journal to be simultaneously offline and online. How is this possible?

MAP_PRIVATE: Copy-on-write mappings to the rescue?

Assuming the journal mustn't be touched during its offlining until it's set online again, couldn't we divert all journal writes to newly-established MAP_PRIVATE mappings for the duration of the offline? Once the journal is online again, the contents of the MAP_PRIVATE pages may be copied to the usual online MAP_SHARED pages for the same journal file.

This would cache all the writes for the duration of the offlining process, while transparently fulfilling the read expectations by being backed by the offline (for writes) journal file.
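The idea can be tried in isolation. Below is a minimal sketch, not journald code: demo_private_then_shared() is a hypothetical name, and a temporary file under /tmp stands in for the journal. It writes through a MAP_PRIVATE mapping, checks what the file itself contains, then copies the private page into a MAP_SHARED mapping of the same file.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo, not journald code.
 * On success returns 0 and stores the file's first byte as read with pread():
 *   *before_copy: after writing 'B' through a MAP_PRIVATE mapping only
 *   *after_copy:  after copying the private page into a MAP_SHARED mapping
 * Returns -1 on any error. */
int demo_private_then_shared(char *before_copy, char *after_copy) {
    char path[] = "/tmp/journal-demo-XXXXXX";
    size_t len = (size_t) sysconf(_SC_PAGESIZE);
    char *priv = MAP_FAILED, *shared = MAP_FAILED;
    int r = -1;

    assert(before_copy && after_copy);

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the file lives on until fd is closed */

    /* One page of file, starting with 'A'. */
    if (ftruncate(fd, (off_t) len) < 0 || pwrite(fd, "A", 1, 0) != 1)
        goto finish;

    /* "Online-while-offlining": writes land in private copy-on-write pages. */
    priv = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (priv == MAP_FAILED)
        goto finish;
    priv[0] = 'B';
    if (pread(fd, before_copy, 1, 0) != 1)
        goto finish;

    /* Back online: copy the cached writes into the usual shared mapping. */
    shared = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (shared == MAP_FAILED)
        goto finish;
    memcpy(shared, priv, len);
    if (msync(shared, len, MS_SYNC) < 0 || pread(fd, after_copy, 1, 0) != 1)
        goto finish;
    r = 0;

finish:
    if (priv != MAP_FAILED)
        munmap(priv, len);
    if (shared != MAP_FAILED)
        munmap(shared, len);
    close(fd);
    return r;
}
```

On Linux, the first pread() should still see the original byte, since the private write never reaches the file, while the second should see the new byte once the copy into the shared mapping has been made.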

Determining where to begin with such a modification can be a bit daunting. We can safely assume that journal_file_set_offline() is amenable to the simple changes consisting of creating a thread for performing the work of offlining, and setting a per-journal flag indicating offlining is in progress.
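A rough sketch of that shape, assuming POSIX threads and a single-threaded caller. DemoJournal and the demo_* functions are hypothetical names rather than journald's, and fsync() stands in for the real offlining work:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-in for the relevant parts of struct JournalFile. */
typedef struct DemoJournal {
    int fd;
    atomic_bool offlining;    /* true while the offline thread is working */
    pthread_t offline_thread;
    bool thread_started;      /* true if offline_thread still needs a join */
} DemoJournal;

/* Thread body: do the slow sync, then clear the flag. */
static void *demo_offline_worker(void *arg) {
    DemoJournal *j = arg;

    (void) fsync(j->fd);      /* errors ignored in this sketch */
    atomic_store(&j->offlining, false);
    return NULL;
}

/* Wait for a previously started offline thread, if any.
 * Returns 0 on success or a negative errno-style value. */
int demo_offline_join(DemoJournal *j) {
    assert(j);

    if (!j->thread_started)
        return 0;

    int r = pthread_join(j->offline_thread, NULL);
    j->thread_started = false;
    return -r;
}

/* Start offlining in the background.
 * Returns 1 if a thread was started, 0 if offlining is already in
 * progress, or a negative errno-style value on failure. */
int demo_set_offline_async(DemoJournal *j) {
    assert(j);

    if (atomic_load(&j->offlining))
        return 0;

    /* Reap a finished thread from an earlier call before reusing the slot. */
    int r = demo_offline_join(j);
    if (r < 0)
        return r;

    atomic_store(&j->offlining, true);
    r = pthread_create(&j->offline_thread, NULL, demo_offline_worker, j);
    if (r != 0) {
        atomic_store(&j->offlining, false);
        return -r;
    }

    j->thread_started = true;
    return 1;
}
```

The caller returns immediately from demo_set_offline_async() and can consult the offlining flag to decide how subsequent writes should be routed.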

journal_file_set_online() must establish an "online-while-offlining" mode for the journal, routing memory accesses for online journal I/O to MAP_PRIVATE page mappings. This modification has a high potential to be an invasive change, depending on journald's implementation. It makes sense to start here and understand how journald goes about locating and mapping the pages used for journal I/O. How many references into the MAP_SHARED mappings does journald keep around, and where do they reside in the code? These references must all be replaced or otherwise invalidated when switching to MAP_PRIVATE (online-while-offlining) mode, and again on return to MAP_SHARED (online) mode.

Fortunately it turns out there are just four members within struct JournalFile of concern:

// journal-file.h

        Header *header;
        HashItem *data_hash_table;
        HashItem *field_hash_table;
...
        MMapCache *mmap;

The first three members are pointers directly to journal file mappings. They can simply be replaced with either MAP_PRIVATE or MAP_SHARED mappings.

The fourth member of the JournalFile struct requires going a bit further down the rabbit hole.

mmap-cache.c: Why journald's RSS appears to only grow

All journal file objects are located via the function journal_file_move_to() in journal-file.c, which takes an offset, size, and object type as parameters, storing a pointer to the mapped object as the result. journal_file_move_to() relies on mmap_cache_get() to establish a new mapping if necessary, and return a pointer to it referencing the desired object.

MMapCache *mmap from struct JournalFile is handed to mmap_cache_get() as the cache to operate on.

Upon further study, it can be seen that mmap_cache_get() maintains a cache of mmap() windows, by doing three basic things:

  1. Checks to see if the desired object is already present in a window-per-type cache of windows, and if so returns it

  2. Scans a linked list of cached windows looking for a match, and if one is found, returns it

  3. Creates a new mapping for the desired object as an 8MiB (or larger, if the object exceeds 8MiB) window, adds it to the list of cached windows, and returns it. If the mmap() fails due to ENOMEM, then an unused window is reclaimed from the cache, and the mmap() retried.

These steps all share a common window fit function called window_matches() which takes a pointer to the window being checked, and the criteria for fitment: fd number, offset, size, and read/write protection.

A convenient place to integrate the MAP_PRIVATE change would be to add a private boolean flag to mmap_cache_get(), propagate it down to window_matches() as an additional criterion to match, and add it to the struct Window type representing cached windows. It can then be set accordingly when new windows are created via mmap() with either MAP_PRIVATE or MAP_SHARED.
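To make the shape of that change concrete, here is a simplified sketch. It is not the actual mmap-cache.c code: DemoWindow, demo_window_matches(), demo_find_window() and demo_mmap_flags() are hypothetical stand-ins that keep only the fields needed to show the extra match criterion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical, trimmed-down cached window. */
typedef struct DemoWindow {
    int fd;
    int prot;
    bool is_private;          /* the new flag: MAP_PRIVATE vs. MAP_SHARED */
    uint64_t offset;          /* file offset where the window starts */
    size_t size;              /* length of the window in bytes */
    struct DemoWindow *next;  /* linked list of cached windows */
} DemoWindow;

/* Fit check: same fd, protection and privacy, and the requested
 * range lies entirely inside the window. */
static bool demo_window_matches(const DemoWindow *w, int fd, int prot,
                                bool is_private, uint64_t offset, size_t size) {
    assert(w);
    assert(fd >= 0);
    assert(size > 0);

    return fd == w->fd &&
           prot == w->prot &&
           is_private == w->is_private &&
           offset >= w->offset &&
           offset + size <= w->offset + w->size;
}

/* Linear scan over the cached windows; returns NULL when nothing fits. */
DemoWindow *demo_find_window(DemoWindow *head, int fd, int prot,
                             bool is_private, uint64_t offset, size_t size) {
    for (DemoWindow *w = head; w; w = w->next)
        if (demo_window_matches(w, fd, prot, is_private, offset, size))
            return w;
    return NULL;
}

/* Flag to pass to mmap() when a new window has to be created. */
int demo_mmap_flags(bool is_private) {
    return is_private ? MAP_PRIVATE : MAP_SHARED;
}
```

With the flag included in the match, a private request can never be satisfied by a shared window covering the same range, and vice versa.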

There are some interesting asides to note while we're here, since the general spirit of this series is journald performance and stability:

  1. In step #2 above, the search is linear. With 8MiB windows and a 4GiB maximum journal size, the number of cached windows can exceed 500.

    There is an opportunity for optimization here; something like an interval tree may be more appropriate than a plain linked list. If this is an area of interest for you, note that mmap cache windows may overlap.

  2. Windows are only reclaimed in response to mmap() ENOMEM errors, until the associated journal is closed via rotate or shutdown. This is the main reason journald causes complaints surrounding excessive memory use/leaks.

    The rationale is that these are file-backed pages, which are shared with the file's backing store, and reclaimable by the kernel. Such reclamation could potentially require page write-back if the pages are dirty, adding more latency to the pending allocation. So why not avoid the fairly expensive task of re-establishing these mappings, by filling journald's address space with them?

    The problem with this approach is that the rest of the journald code is making allocations from the same address space, completely unaware of the mmap-cache. Outside of the mmap-cache itself, allocations don't know that the cache may consume an entire 32-bit address space, nor that windows may be reclaimed and freed from that space. Only mmap-cache knows to handle ENOMEM returned from its mmap() calls by reclaiming windows.

    It may make sense to teach the rest of the systemd code, some of which journald shares, about hooks for ENOMEM handling. Then journald could install an ENOMEM handler tying into the mmap-cache's reclaimer.
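One way such a hook could look, sketched against a simulated address-space budget so it can be exercised without actually exhausting memory. Nothing here exists in systemd: DemoSpace, demo_allocate() and demo_cache_reclaim() are hypothetical, and a real version would wrap actual allocation calls and the mmap-cache's own reclaimer.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_WINDOW_SIZE (8u * 1024u * 1024u)

/* Hypothetical hook type: returns true if it managed to free something. */
typedef bool (*demo_enomem_handler)(void *userdata);

/* Simulated address space shared by ordinary allocations and the cache.
 * Invariant: allocated + cached <= capacity. */
typedef struct DemoSpace {
    size_t capacity;           /* total simulated address space */
    size_t allocated;          /* bytes used by ordinary allocations */
    size_t cached;             /* bytes pinned by unused cache windows */
    demo_enomem_handler handler;
    void *handler_userdata;
} DemoSpace;

/* Handler the cache would install: drop one unused window, if any. */
bool demo_cache_reclaim(void *userdata) {
    DemoSpace *s = userdata;

    if (s->cached < DEMO_WINDOW_SIZE)
        return false;
    s->cached -= DEMO_WINDOW_SIZE;
    return true;
}

/* Generic allocation path: on failure, ask the installed handler to
 * free something and retry; give up with -ENOMEM when it cannot. */
int demo_allocate(DemoSpace *s, size_t size) {
    assert(s);

    for (;;) {
        if (size <= s->capacity - s->allocated - s->cached) {
            s->allocated += size;
            return 0;
        }
        if (!s->handler || !s->handler(s->handler_userdata))
            return -ENOMEM;
    }
}
```

Without a handler installed, an allocation fails as soon as the cache has filled the space; with demo_cache_reclaim installed, the same request succeeds after one window is dropped.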

MAP_PRIVATE vs. MAP_SHARED: Two extremes on the coherence spectrum

The astute reader may have already noticed a glaring problem: if mmap cache windows may overlap, discrete MAP_PRIVATE mappings of overlapping windows may become incoherent due to private writes in the overlapping region.

Because MAP_SHARED mappings must propagate their writes to the backing store, discrete overlapping windows are naturally kept coherent. There is no flavor of MAP_PRIVATE capable of creating mappings that isolate writes from the backing store, while propagating them to other private overlapping mappings on the same file.
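The difference is easy to observe. The following is a small sketch, with the hypothetical helper demo_overlap_is_coherent() and a temporary file under /tmp as assumptions: it maps two overlapping two-page windows of a three-page file, writes through the first, and reads the same file offset through the second.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo. map_flag is MAP_SHARED or MAP_PRIVATE.
 * Returns 1 if a write made through one window is visible through an
 * overlapping window of the same file, 0 if it is not, -1 on error. */
int demo_overlap_is_coherent(int map_flag) {
    char path[] = "/tmp/overlap-demo-XXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    int result = -1;

    assert(map_flag == MAP_SHARED || map_flag == MAP_PRIVATE);

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the file lives on until fd is closed */

    /* Three zero-filled pages. */
    if (ftruncate(fd, (off_t) (3 * page)) < 0) {
        close(fd);
        return -1;
    }

    /* Window a covers file pages 0-1, window b covers pages 1-2,
     * so both include file page 1. */
    size_t len = (size_t) (2 * page);
    char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, map_flag, fd, 0);
    char *b = mmap(NULL, len, PROT_READ | PROT_WRITE, map_flag, fd, (off_t) page);

    if (a != MAP_FAILED && b != MAP_FAILED) {
        a[page] = 'X';                  /* first byte of file page 1, via a */
        result = (b[0] == 'X') ? 1 : 0; /* same file offset, via b */
    }

    if (a != MAP_FAILED)
        munmap(a, len);
    if (b != MAP_FAILED)
        munmap(b, len);
    close(fd);
    return result;
}
```

Called with MAP_SHARED, the write should be visible through the second window; called with MAP_PRIVATE, the second window should still see the file's original zero byte.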

The consequence of this, in lieu of eliminating the potential for overlapping windows, is that a single MAP_PRIVATE mapping for the entire journal file must be established when mmap_cache_get() first receives a request for a private mapping. Pointers into this single private mapping may be returned for the relevant private mmap_cache_get() requests.

This wouldn't be too big a deal if we only cared about 64-bit architectures. But since 32-bit is supported, and journald's maximum journal size is 4GiB, we're at an impasse with this strategy.

It may prove interesting to explore adding a sort of semi-shared MAP_ISOLATED mapping to Linux's mmap() that takes an additional parameter identifying which isolated mappings of a given file descriptor within the calling process should be shared and kept coherent, while being isolated from the underlying file. Assuming that all such mappings are kept coherent per calling process may also suffice, and that variant could easily be implemented within the existing mmap() API.

It sounds a bit like a memfd, but with the faulted pages sourced from a file. While we're fantasizing about new Linux interfaces, another convenience would be a way to convert those mappings to MAP_SHARED, committing their dirtied pages to the backing store in a single system call, so user-space doesn't have to do the memcpy() back to the MAP_SHARED mappings.

In any case, for the sake of measurement, discussion, and validation of the concept (and my understanding of journald), the MAP_PRIVATE strategy was implemented and tested on x86_64. You can find the code here.

Intermission

Parts of this post may have sounded vaguely familiar to MongoDB users due to MongoDB's mmap()-based storage engine. MongoDB has a 2GiB size limitation on 32-bit architectures, a consequence of using a combination of MAP_PRIVATE and MAP_SHARED whole-file mappings.

With any luck, this post has been illuminating even if the proposed solution isn't generally acceptable. Check back for our next installment, when we'll examine the discussion this proposal led to with the upstream systemd community, and the changes we've landed there since that make journal offlining fully asynchronous.

CoreOS at systemd.conf 2016

See engineers from CoreOS detail their frequent work with systemd and projects leveraging its features at systemd.conf in Berlin this week. On Thursday, September 29, Alban Crequy and Luca Bruno will discuss the intimate relationship between systemd and the rkt container engine. On the following day, Friday, September 30, Luca will dive even deeper with a talk on using and improving systemd bindings from recent programming languages like Go and Rust.