
Eliminating Delays From systemd-journald, Part 1

Overview: Why this is about systemd-journald

systemd-journald is the logging daemon in the systemd project, referred to as simply journald for the remainder of this article. The name systemd is often used ambiguously. Throughout this article whenever systemd is mentioned, it's in reference to the project, not the init daemon of the same name.

Like most systemd daemons, journald is implemented as a single-process, non-threaded, IO-multiplexing, unix socket server. If you're familiar with such unix daemons and their use of select()/poll() with non-blocking sockets, you can probably imagine approximately how journald is implemented — just replace select()/poll() with the newer epoll_wait() equivalent.

Unlike most systemd daemons however, journald's primary function is to reliably write data (log data, metadata, and associated indices) to the filesystem. Log-producing processes send log data and metadata over a unix socket to journald. Additionally, journald itself manufactures "trusted" metadata by inspecting log-producing processes when their log data is consumed. All this data needs to be written to the journal in a structured and indexed fashion.

CoreOS uses systemd to manage services, and nearly all services and processes on a stock CoreOS system log their output to journald. This makes journald itself a potential bottleneck, where delays and even crashes impacting the entire system can occur.

For example, if journald can't service its event loop frequently enough, backlogs may form in the interconnecting sockets. These backlogs may block the log-producing processes from making their own forward progress on subsequent logging attempts, depending on how their logging has been implemented.

In more extreme scenarios, if journald is blocked long enough for its watchdog timer to expire, it will be killed with SIGABRT, dump core, and abruptly close all connected sockets, losing whatever log data was in flight. In some cases this can even cause services to crash due to the unexpected hangup of their stdout and/or stderr.

It cannot be overstated: journald is a critical, central component of modern, systemd-based GNU/Linux operating systems like CoreOS.

When we began receiving reports of journald slowing down the execution of log-heavy services and in some cases even dumping core due to expired watchdog timeouts, we started taking a closer look at journald's implementation. This article is an attempt to share some of what was found and the progress we've made working upstream to improve journald.

As this is an ongoing effort, and in the interests of keeping the posts easier to digest, this article is being published in multiple parts. Expect to see a follow-up to pick up where this post ends in the near future.

Speculation: The expected implementation of systemd-journald

In general terms, before diving down into the weeds of reality, what should journald need to do for every log message it receives and what opportunities does this create for substantial delays to emerge?

  1. Read log data from a non-blocking socket into journald's address space

    This should only occur when the socket has data ready to be copied, and should not experience delays from blocking due to the use of non-blocking sockets.

  2. Interrogate the process on the other end of the socket, acquiring "trusted" metadata

    This involves accesses within the /proc filesystem, opening and reading files, and any necessary parsing of what's found. There are probably opportunities for optimization but we don't expect significant delays in accessing the proc virtual filesystem.

  3. Parse the received log data into individual messages

    This will vary depending on the logging source, and probably has some opportunities for optimization. Ideally this mostly consists of pure user-space execution looking for message boundaries in populated buffers; there shouldn't be significant delays here.

  4. Write the individual messages with their associated metadata and indices into the journal

    This is where we'd expect to find potential sources of delays. In this style of unix program, it's critical never to significantly delay the event loop, and there are limited options for writing files to stable storage without potentially blocking.

    For example, if journald were simply writing using buffered but blocking file IO calls such as writev(), should blocking occur, all forward progress is momentarily stalled until writev() returns. Buffered file IO is mostly non-blocking as writes get buffered in the page cache, but things like read-modify-write patterns and file metadata updates can still cause blocking delays depending on the filesystem and mount options used.

    Databases often resort to Linux-native AIO with O_DIRECT on pre-allocated files via io_submit() for performing overlapping, asynchronous file IO without potentially blocking the event loop. Journald lies somewhere between a database and a plain old O_APPEND logging daemon, so it's entirely reasonable to consider this as a possibility.

    Other options for performing these writes asynchronously include emulated AIO via threads (this is how POSIX AIO is implemented in glibc on GNU/Linux), or simply copying to mapped (via mmap()) regions of the journal and letting the kernel asynchronously perform write-back of the dirty pages. These options are less commonly used by databases for a variety of reasons.

Investigation: The actual implementation of systemd-journald

Armed with some speculative ideas of what to expect, let's look at strace output capturing journald handling a very simple and controlled workload on an otherwise idle system to see what it actually does at the system call level.

I'll be using the systemd-cat utility, which logs stdin via the "stream" journald source, analogous to how a systemd-managed service's output may be logged. The specific command being used is echo hi | systemd-cat.

epoll_wait(8, {{EPOLLIN, {u32=1837720528, u64=94074506404816}}}, 11, -1) = 1
clock_gettime(CLOCK_REALTIME, {1452541347, 345412183}) = 0
clock_gettime(CLOCK_MONOTONIC, {181385, 489315464}) = 0
clock_gettime(CLOCK_BOOTTIME, {181385, 490035599}) = 0
socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 14
sendmsg(14, {msg_name(21)={sa_family=AF_LOCAL, sun_path="/run/systemd/notify"}, msg_iov(1)=[{"WATCHDOG=1", 10}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 10
close(14)                               = 0
timerfd_settime(9, TFD_TIMER_ABSTIME, {it_interval={0, 0}, it_value={181421, 951465000}}, NULL) = 0
accept4(3, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 14
getsockopt(14, SOL_SOCKET, SO_PEERCRED, {pid=1021, uid=500, gid=500}, [12]) = 0
getsockopt(14, SOL_SOCKET, SO_PEERSEC, "system_u:system_r:kernel_t:s0\0", [30]) = 0
shutdown(14, SHUT_WR)                   = 0
epoll_ctl(8, EPOLL_CTL_ADD, 14, {EPOLLIN, {u32=1837750592, u64=94074506434880}}) = 0
clock_gettime(CLOCK_REALTIME, {1452541347, 352813406}) = 0
clock_gettime(CLOCK_REALTIME, {1452541347, 353310404}) = 0
epoll_wait(8, {{EPOLLIN|EPOLLHUP, {u32=1837750592, u64=94074506434880}}}, 12, -1) = 1
clock_gettime(CLOCK_REALTIME, {1452541347, 354486481}) = 0
clock_gettime(CLOCK_MONOTONIC, {181385, 498328204}) = 0
clock_gettime(CLOCK_BOOTTIME, {181385, 498821426}) = 0
read(14, "\n\n6\n1\n0\n0\n0\nhi\n", 2048) = 15
fstat(14, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
stat("/run/systemd/journal", {st_mode=S_IFDIR|0755, st_size=160, ...}) = 0
mkdir("/run/systemd/journal/streams", 0755) = -1 EEXIST (File exists)
stat("/run/systemd/journal/streams", {st_mode=S_IFDIR|0755, st_size=40, ...}) = 0
umask(077)                              = 022
open("/run/systemd/journal/streams/.#8:26217nJ7kNH", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0600) = 16
umask(022)                              = 077
fcntl(16, F_GETFL)                      = 0x8002 (flags O_RDWR|O_LARGEFILE)
fstat(16, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f22b4e41000
write(16, "# This is private data. Do not p"..., 122) = 122
rename("/run/systemd/journal/streams/.#8:26217nJ7kNH", "/run/systemd/journal/streams/8:26217") = 0
socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 17
sendmsg(17, {msg_name(21)={sa_family=AF_LOCAL, sun_path="/run/systemd/notify"}, msg_iov(1)=[{"FDSTORE=1", 9}], msg_controllen=40, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {14}}, msg_flags=0}, MSG_NOSIGNAL) = -1 EINVAL (Invalid argument)
close(17)                               = 0
close(16)                               = 0
munmap(0x7f22b4e41000, 4096)            = 0
open("/proc/1021/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/1021/comm", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
readlinkat(AT_FDCWD, "/proc/1021/exe", 0x558f6d89da30, 99) = -1 ENOENT (No such file or directory)
open("/proc/1021/cmdline", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/1021/status", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/1021/sessionid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/1021/loginuid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/1021/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
clock_gettime(CLOCK_REALTIME, {1452541347, 373292336}) = 0
clock_gettime(CLOCK_REALTIME, {1452541347, 373798961}) = 0
clock_gettime(CLOCK_MONOTONIC, {181385, 517584016}) = 0
clock_gettime(CLOCK_MONOTONIC, {181385, 518106069}) = 0
fstat(13, {st_mode=S_IFREG|0640, st_size=41943040, ...}) = 0
clock_gettime(CLOCK_MONOTONIC, {181385, 519134143}) = 0
clock_gettime(CLOCK_MONOTONIC, {181385, 519645092}) = 0
clock_gettime(CLOCK_MONOTONIC, {181385, 520142521}) = 0
clock_gettime(CLOCK_MONOTONIC, {181385, 520635209}) = 0
mmap(NULL, 8388608, PROT_READ|PROT_WRITE, MAP_SHARED, 13, 0x1287000) = 0x7f22affb6000
ftruncate(13, 41943040)                 = 0
clock_gettime(CLOCK_REALTIME, {1452541347, 379074827}) = 0
clock_gettime(CLOCK_REALTIME, {1452541347, 379588987}) = 0
epoll_wait(8, {{EPOLLIN|EPOLLHUP, {u32=1837750592, u64=94074506434880}}}, 12, -1) = 1
clock_gettime(CLOCK_REALTIME, {1452541347, 380814545}) = 0
clock_gettime(CLOCK_MONOTONIC, {181385, 524597032}) = 0
clock_gettime(CLOCK_BOOTTIME, {181385, 525073972}) = 0
read(14, "", 2048)                      = 0
unlink("/run/systemd/journal/streams/8:26217") = 0
epoll_ctl(8, EPOLL_CTL_DEL, 14, NULL)   = 0
close(14)                               = 0
clock_gettime(CLOCK_REALTIME, {1452541347, 383979000}) = 0
clock_gettime(CLOCK_REALTIME, {1452541347, 384485833}) = 0
epoll_wait(8,

That's a lot of noise. Here's the same trace stripped down to the bare essentials with some annotations interspersed:

 Accepting the new stream connection:

  epoll_wait(8, {{EPOLLIN, {u32=1837720528, u64=94074506404816}}}, 11, -1) = 1
  accept4(3, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 14
  getsockopt(14, SOL_SOCKET, SO_PEERCRED, {pid=1021, uid=500, gid=500}, [12]) = 0
  getsockopt(14, SOL_SOCKET, SO_PEERSEC, "system_u:system_r:kernel_t:s0\0", [30]) = 0
  epoll_ctl(8, EPOLL_CTL_ADD, 14, {EPOLLIN, {u32=1837750592, u64=94074506434880}}) = 0

 Reading the log message:

  epoll_wait(8, {{EPOLLIN|EPOLLHUP, {u32=1837750592, u64=94074506434880}}}, 12, -1) = 1
  read(14, "\n\n6\n1\n0\n0\n0\nhi\n", 2048) = 15

 Inspecting the process for "trusted" metadata (fails since process has exited):

  open("/proc/1021/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
  open("/proc/1021/comm", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
  readlinkat(AT_FDCWD, "/proc/1021/exe", 0x558f6d89da30, 99) = -1 ENOENT (No such file or directory)
  open("/proc/1021/cmdline", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
  open("/proc/1021/status", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
  open("/proc/1021/sessionid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
  open("/proc/1021/loginuid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
  open("/proc/1021/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)

 Writing to the journal file:

  fstat(13, {st_mode=S_IFREG|0640, st_size=41943040, ...}) = 0
  mmap(NULL, 8388608, PROT_READ|PROT_WRITE, MAP_SHARED, 13, 0x1287000) = 0x7f22affb6000
  ftruncate(13, 41943040)                 = 0

 Concluding the stream connection:

  epoll_wait(8, {{EPOLLIN|EPOLLHUP, {u32=1837750592, u64=94074506434880}}}, 12, -1) = 1
  read(14, "", 2048)                      = 0
  unlink("/run/systemd/journal/streams/8:26217") = 0
  epoll_ctl(8, EPOLL_CTL_DEL, 14, NULL)   = 0
  close(14)                               = 0

 Back to the event loop:

  epoll_wait(8,

Nothing looks particularly surprising. We can conclude the journal writes are being done via mmap(). Using mmap() to achieve asynchronous writes is convenient, but it does have the potential to block should page faults need servicing, or, on some combinations of storage subsystem and kernel version, when write-back of the mapped pages is simultaneously in progress.

mmap() does have some appealing advantages, like the ability to treat the on-disk file as an array of bytes in memory, simplifying the implementation of complex, disk-resident data structures. mmap() also enjoys relaxed file/inode metadata synchronization semantics, requiring the use of ftruncate() to explicitly trigger things like inotify on writes, which we see clearly happening in the trace above. There may also be performance advantages when using mmap() due to fewer copies and system calls, but this depends heavily on the access patterns, since there's significant kernel overhead in establishing and servicing the mapped pages.

It can be tricky, however, to achieve crash-safe, durable results when using mmap() for complex data structures. Techniques such as two-phase object commits in the mapped region may be effective for achieving atomicity with regard to concurrent programs accessing the same file through the page cache. But that ordering isn't preserved when write-back occurs, unless the two phases are explicitly synchronized down to the backing store — which would be too costly to do on a granular, per-object level.

Having examined the above trace, though we see the use of mmap(), it remains a mystery how journald makes the data copied to the mapping durable.

Copying log data into a file-backed mapping only places the data in the page cache in the form of dirty pages, to be written out at some later time when write-back of those pages occurs. This decoupling of copying data into the page cache from write-back to disk is what enables the asynchronous file writes journald requires to continue servicing its event loop in a timely fashion.

So when and how exactly does this data become written, or in database terms, made durable? Is it left entirely up to the kernel's discretion, or does journald explicitly force write-back at certain times and/or events? If journald is involved, does it have to wait? If it has to wait, does it do so in a stop-the-world fashion, during which the event loop is blocked?

There are limited ways for dirty file-backed pages to be explicitly written out on Linux: the system-wide sync(), per-filesystem syncfs(), per-file fsync() and fdatasync(), per-mapping msync(), and per-file-range sync_file_range(). None of these were seen in the trace above, so it's time to go a bit deeper and examine the code to see what journald is doing, and when, to ensure the durability of log data.

A quick grep of the journald source for the above function names reveals a number of fsync() calls in journal-file.c, from functions included below:

// journal-file.c

static int journal_file_set_online(JournalFile *f) {
        assert(f);

        if (!f->writable)
                return -EPERM;

        if (!(f->fd >= 0 && f->header))
                return -EINVAL;

        if (mmap_cache_got_sigbus(f->mmap, f->fd))
                return -EIO;

        switch (f->header->state) {
                case STATE_ONLINE:
                        return 0;

                case STATE_OFFLINE:
                        f->header->state = STATE_ONLINE;
                        fsync(f->fd);
                        return 0;

                default:
                        return -EINVAL;
        }
}
// journal-file.c

int journal_file_set_offline(JournalFile *f) {
        assert(f);

        if (!f->writable)
                return -EPERM;

        if (!(f->fd >= 0 && f->header))
                return -EINVAL;

        if (f->header->state != STATE_ONLINE)
                return 0;

        fsync(f->fd);

        if (mmap_cache_got_sigbus(f->mmap, f->fd))
                return -EIO;

        f->header->state = STATE_OFFLINE;

        if (mmap_cache_got_sigbus(f->mmap, f->fd))
                return -EIO;

        fsync(f->fd);

        return 0;
}
// journal-file.c

static int journal_file_refresh_header(JournalFile *f) {
        sd_id128_t boot_id;
        int r;

        assert(f);

        r = sd_id128_get_machine(&f->header->machine_id);
        if (r < 0)
                return r;

        r = sd_id128_get_boot(&boot_id);
        if (r < 0)
                return r;

        if (sd_id128_equal(boot_id, f->header->boot_id))
                f->tail_entry_monotonic_valid = true;

        f->header->boot_id = boot_id;

        r = journal_file_set_online(f);

        /* Sync the online state to disk */
        fsync(f->fd);

        return r;
}

This looks pretty straightforward; fsync() is being used to achieve durability.

Also shown are f->header->state manipulations, which are clearly being done carefully in an explicitly ordered fashion down to the backing store via fsync(). This is how journald is making mmap() use safe enough for complex data structures. When journals are first opened by journald, if f->header->state is STATE_ONLINE, a crash is detected and the journal is simply archived.

In terms of our delays-oriented investigation, the journal_file_set_offline() function is of particular interest. We know the journal must be online if being written to by journald, and the first fsync() in journal_file_set_offline() would expedite and block on write-back of all accumulated dirty journal file pages. This is where the potentially big delays would be expected.

So when does journal_file_set_offline() occur?

// journal-file.c

void journal_file_close(JournalFile *f) {
        assert(f);

#ifdef HAVE_GCRYPT
        /* Write the final tag */
        if (f->seal && f->writable)
                journal_file_append_tag(f);
#endif

        journal_file_set_offline(f);

...
// journald-server.c

void server_sync(Server *s) {
        JournalFile *f;
        void *k;
        Iterator i;
        int r;

        if (s->system_journal) {
                r = journal_file_set_offline(s->system_journal);
                if (r < 0)
                        log_error_errno(r, "Failed to sync system journal: %m");
        }

        ORDERED_HASHMAP_FOREACH_KEY(f, k, s->user_journals, i) {
                r = journal_file_set_offline(f);
                if (r < 0)
                        log_error_errno(r, "Failed to sync user journal: %m");
        }

...

We can see that when a journal is closed it's set offline, which may block in fsync() if the journal was online with accumulated dirty pages.

Judging from the name, server_sync() is some kind of journald-wide sync mechanism, but let's focus on journal_file_close() first.

Studying the code reveals journal_file_close() calls in journal_file_rotate(), server_done(), and server_flush().

Blocking during shutdown isn't too concerning. The server flush is wired to a SIGUSR1 handler, so we'll ignore it. Rotate, however, is interesting, and I've included its implementation below.

// journal-file.c

int journal_file_rotate(JournalFile **f, bool compress, bool seal) {
        _cleanup_free_ char *p = NULL;
        size_t l;
        JournalFile *old_file, *new_file = NULL;
        int r;

        assert(f);
        assert(*f);

        old_file = *f;

        if (!old_file->writable)
                return -EINVAL;

        if (!endswith(old_file->path, ".journal"))
                return -EINVAL;

        l = strlen(old_file->path);
        r = asprintf(&p, "%.*s@" SD_ID128_FORMAT_STR "-%016"PRIx64"-%016"PRIx64".journal",
                     (int) l - 8, old_file->path,
                     SD_ID128_FORMAT_VAL(old_file->header->seqnum_id),
                     le64toh((*f)->header->head_entry_seqnum),
                     le64toh((*f)->header->head_entry_realtime));
        if (r < 0)
                return -ENOMEM;

        /* Try to rename the file to the archived version. If the file
         * already was deleted, we'll get ENOENT, let's ignore that
         * case. */
        r = rename(old_file->path, p);
        if (r < 0 && errno != ENOENT)
                return -errno;

        old_file->header->state = STATE_ARCHIVED;

        /* Currently, btrfs is not very good with our write patterns
         * and fragments heavily. Let's defrag our journal files when
         * we archive them */
        old_file->defrag_on_close = true;

        r = journal_file_open(old_file->path, old_file->flags, old_file->mode, compress, seal, NULL, old_file->mmap, old_file, &new_file);
        journal_file_close(old_file);

        *f = new_file;
        return r;
}

A rotate normally occurs when the currently open journal is full and more log data needs to be written. Depending on the maximum size of your journals, and the logging rate, rotate may occur often or infrequently.

So journal_file_close() potentially blocking on fsync() is problematic for journal rotate. What about server_sync(), where is server_sync() called?

Studying the code, you'd find server_sync() is called from a timer which defaults to a five minute interval. This timer is configurable via the SyncIntervalSec directive in journald.conf. So by default, every five minutes, an fsync() will occur in journal_file_set_offline() if the journal is online and has been written to.
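The directive lives in the [Journal] section of journald.conf. A hypothetical fragment tightening the interval (shorter intervals mean less log data at risk per crash, but more frequent fsync() stalls):

```ini
# /etc/systemd/journald.conf
[Journal]
SyncIntervalSec=30s
```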

There's another specific case where server_sync() is triggered, and that's when a log message of LOG_{CRIT,ALERT,EMERG} priority has been processed. When such a message arrives, server_sync() is called immediately, outside of the normal timer.

So we've identified the following cases where journal_file_set_offline() occurs, where we'd be sensitive to blocking substantially in fsync():

  1. Journal rotate
  2. Periodic sync governed by SyncIntervalSec as set in journald.conf
  3. Processing of critical or higher priority messages

When any of these occur, fsync() will block the event loop from being serviced for a duration proportional to the amount of dirty pages requiring write-back, and the available bandwidth of the journal file's backing store.

Instrumentation: Measuring latencies affecting systemd-journald

Before we make any modifications to the code, we need a way to measure their impact. We could use strace to see how much time is spent in fsync() under sustained heavy logging when SyncInterval expires, or rotate occurs, but strace would slow the execution of journald substantially. In slowing the execution, we would prevent journald from producing dirty pages in quantities representative of normal execution, while allowing fsync() to still occur at normal speed within the kernel. This combination would result in superficially short fsync() durations.

While fsync() is the current focus, what we're really interested in is how much time is spent between returning from and reentering calls to epoll_wait().

Systemd implements a generalized event loop API called sd-event. This API is used by all the systemd daemons and journald is no exception. It's in the sd-event implementation where we find the epoll_wait() integration.

It makes sense to add some general instrumentation to sd-event for measuring event loop latencies like those affecting journald. A general solution would have potential utility in profiling the rest of the project's sd-event consumers.

If we simply took note of the current time after returning from epoll_wait(), and the current time before entering epoll_wait(), we would know how much time passed since we last returned from epoll_wait(). This time delta is the event loop latency for one iteration, and the overhead of retrieving these times is insignificant.

We can then take the base-2 logarithm of the time delta, and use the result as an index into a small array of counters. By incrementing the counter located by this latency-derived index, we account for its occurrence. This is done repeatedly as the event loop executes, accumulating occurrences in their respective counters.

All that's left is to periodically report the contents of the array of counters while resetting them to zero, and we now have precise reports of the distribution of latencies affecting the event loop over time.

Gate this functionality behind an environment variable and now every sd-event program has this feature available without even requiring a recompile.

In the course of this effort we implemented the mechanism described above, which has since been merged upstream, and we've been using it extensively to inform and measure improvements to journald. The environment variable used is SD_EVENT_PROFILE_DELAYS, and here's the relevant commit for those interested.

Observation: What the latency measurements reveal

Now that we have a shiny new tool to play with, it's time for some numbers.

We used the following one-liner to simulate sustained heavy logging:

  while true; do dd if=/dev/urandom bs=8k count=1 | od -x | systemd-cat; done

This is what's seen from unmodified journald:

                log2(us)   1 2 3  4 5  6   7   8  9   10 11   12   13 14 15 16 17 18 19
               -----------------------------------------------------------------------
[10700.416297] iterations: 0 0 0  0 0  659 50  4  318 38 2111 2152 6  0  1  0  0  1  1
[10705.417136] iterations: 0 0 0  0 2  660 48  4  320 38 2129 2146 12 1  1  0  0  1  1
[10710.489114] iterations: 0 0 0  0 0  673 38  3  321 37 1925 2339 7  0  0  0  0  1  1
[10715.489613] iterations: 0 0 0  0 3  656 64  8  317 48 2365 2007 7  0  0  0  0  0  1
[10720.490467] iterations: 0 0 0  0 5  661 41  6  310 49 2187 2087 12 0  0  0  0  1  1
[10725.490973] iterations: 0 0 0  0 1  662 38  3  318 38 2072 2156 2  2  0  0  0  1  1
[10730.491324] iterations: 0 0 0  0 1  669 29  8  302 51 1983 2251 4  0  4  0  1  0  1
[10735.492206] iterations: 0 0 0  0 1  679 51  6  334 34 2282 2109 14 0  2  0  0  1  0
[10740.492360] iterations: 0 0 0  0 0  664 45  8  320 38 2228 2060 11 0  1  0  0  1  1
[10745.492483] iterations: 0 0 0  0 2  639 56  4  316 34 1960 2220 11 1  0  0  0  1  1
[10750.493694] iterations: 0 0 0  0 1  649 58  6  329 29 1969 2299 11 0  1  0  0  1  1
[10755.494967] iterations: 0 0 0  0 2  647 64  6  322 36 1963 2338 6  1  0  0  0  0  1
[10760.495093] iterations: 0 0 0  0 1  666 45  5  329 32 2403 1900 9  0  0  0  0  1  1
[10765.495603] iterations: 0 0 0  0 1  679 39  3  325 36 2232 2073 20 1  0  1  0  0  1

The majority of event loop iterations are experiencing delays of the 11th or 12th order (2-4ms). Also visible are regular outliers in the far-right column: order 19, which is upwards of ~.5 seconds.

Here's the same workload run with fsync() commented out entirely:

                log2(us)   1 2 3  4 5  6   7   8  9   10 11   12   13 14 15 16
               --------------------------------------------------------------
[12434.222216] iterations: 0 0 0  0 0  683 49  10 330 40 2287 2152 4  5  7  1
[12439.222690] iterations: 0 0 0  0 1  626 66  8  315 35 2177 2001 12 7  11 6
[12444.224353] iterations: 0 0 0  0 2  662 62  3  330 34 2450 1882 16 4  13 3
[12449.224789] iterations: 0 0 0  0 1  659 61  10 335 27 2284 2065 22 4  9  1
[12454.224894] iterations: 0 0 0  0 3  677 52  8  329 42 2434 1979 22 4  8  0
[12459.225477] iterations: 0 0 0  0 1  669 54  5  328 36 2260 2072 15 8  10 1
[12464.225893] iterations: 0 0 0  0 2  645 60  8  319 41 2351 1917 12 5  7  4
[12469.226193] iterations: 0 0 0  0 1  680 52  5  334 34 2552 1823 40 4  6  1
[12474.226788] iterations: 0 0 0  0 1  681 63  6  339 36 2553 1922 15 6  8  1
[12479.227711] iterations: 0 0 0  0 2  660 73  4  338 31 2456 1935 13 10 8  2
[12484.228380] iterations: 0 0 0  0 1  659 57  4  316 44 2116 2172 14 4  6  3
[12489.228518] iterations: 0 0 0  0 2  662 56  5  321 40 2179 2148 17 1  2  1
[12494.229577] iterations: 0 0 0  0 2  678 59  8  329 44 2638 1808 18 3  5  2
[12499.230199] iterations: 0 0 0  0 0  632 87  1  327 34 2223 2065 21 6  10 1

You can clearly see the highest-order delays, 17-19, are completely gone. By comparing these two profiles with only fsync() changed, we've confirmed with hard data that fsync() is the principal source of the large latencies in journald.

These reports print every 5 seconds, so we were seeing a delay of between ~.1 and ~.5 seconds (orders 18-19) roughly twice every 5 seconds.

Intermission

What began as some user reports of slow journald performance and watchdog timer expirations quickly became an expedition into the depths of journald. It became clear journald employs fsync() for durability, and these synchronous, blocking fsync() calls introduce significant latency in what is an otherwise non-blocking, latency-sensitive architecture.

Of course we can't simply remove fsync() from journald, since durability is important. There are however a few options available to us for running the event loop simultaneous to the explicit write-back required by journal_file_set_offline().

In the next installment, we'll explore some of those options, why some worked, some didn't, and what we've arrived at so far upstream in journald.