May 2, 2014
Participants: Andy Lutomirski, Christoph Lameter, Dave Jones, David Woodhouse, Guenter Roeck, H. Peter Anvin, James Bottomley, Jan Kara, Jiri Kosina, Josh Boyer, Josh Triplett, Julia Lawall, Mark Brown, Matthew Wilcox, Michael Kerrisk, Steven Rostedt, Ted Ts'o, Tim Bird, and Tony Luck.
People tagged: Paul E. McKenney, Darren Hart, Greg KH, Andrew Morton, Sarah Sharp, Julia Lawall, Dan Carpenter, Tom Zanussi, and Michael Kerrisk.
Josh Triplett suggested a renewed focus on size requirements, both for main memory and mass storage. Josh listed a number of tools that can help spot size regressions and shrink size, and also suggested discussions on why size matters as well as how to avoid size regressions.
Dave Jones added that reducing size can also reduce
attack surface, thus improving security.
Dave would like to see Kconfig options that allow the more obscure
(and thus perhaps more buggy) system calls to be removed, in particular
sys_remap_file_pages().
Dave would like to see discussion of which syscalls could be configured
out without too much userspace damage and what the optimal degree of
configurability would be.
He also notes that a number of syscalls are already configurable in this way.
Josh Triplett
would like to see most syscalls be optional, which would allow specialized
devices to reduce both memory footprint and attack surface.
However, Josh notes that seccomp also decreases attack surface, and does
so without the need to build a separate kernel, but that seccomp does
not free us from the obligation of securing kernel APIs from hostile
userspace.
Josh included a list of syscalls that do not appear in
kernel/sys_ni.c, and thus always exist, and
also included a list of related classes of system calls
(for example, legacy syscalls could be excluded by devices running only
non-legacy userspace code).
Tim Bird
described a mechanism leveraging SYSCALL_DEFINE that
allowed individual syscalls to be excluded.
Josh
countered with a suggestion to make syscall functions garbage-collectible
[presumably via LTO or something similar],
and to keep only those that are referenced from the actual syscall table.
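Josh's garbage-collection idea amounts to a reachability pass rooted at the syscall table. The following is a minimal sketch of that idea, not the mechanism a linker or LTO actually uses. The call graph is invented for illustration, and remap_helper is a hypothetical function name.

```python
from collections import deque

def reachable_functions(syscall_table, call_graph):
    """Return every function reachable from the syscall table entries."""
    seen = set(syscall_table)
    queue = deque(syscall_table)
    while queue:
        func = queue.popleft()
        for callee in call_graph.get(func, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen

# Illustrative call graph only; it does not reflect real kernel call chains.
call_graph = {
    "sys_read": ["vfs_read"],
    "sys_write": ["vfs_write"],
    "sys_remap_file_pages": ["remap_helper"],  # remap_helper is hypothetical
    "vfs_read": ["rw_verify_area"],
    "vfs_write": ["rw_verify_area"],
    "remap_helper": [],
    "rw_verify_area": [],
}

# A table that omits sys_remap_file_pages, as a specialized device might.
syscall_table = ["sys_read", "sys_write"]

kept = reachable_functions(syscall_table, call_graph)
discarded = sorted(set(call_graph) - kept)
print(discarded)
```

Under these assumptions, dropping a syscall from the table also drops helpers that only it used, while shared helpers such as rw_verify_area stay.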
Christoph Lameter
noted that kernel size matters for performance, with smaller size leaving
more of the processor caches for the application.
Christoph therefore calls for the ability to remove unwanted functionality
(e.g., cgroups), and for userspace tools (e.g., systemd) to tolerate a
kernel with reduced functionality.
James Bottomley
contrasted memory footprint with cache footprint, arguing that in some cases,
unused kernel code does not take any of the processor cache away from the
user application.
That said, James agreed that cgroups does add to the execution path of
a number of system calls, but asked what the measured performance impact
actually is.
James also suggested using static branching to out-of-line areas to
reduce that impact, if needed.
Christoph
responded that instruction layout matters, so that just focusing on
instruction count will miss optimization opportunities.
In particular, although static branching can reduce the number of instructions
speculated and executed, it still puts pressure on TLBs.
Christoph suggested sorting functions so as to put the most frequently used
set in one place, where they could be covered by a single huge page, preferably
using automated tools for this purpose.
Christoph also noted that there are older kernels in production use in the
financial industry because these older kernels have better performance and
latency.
A smaller memory footprint is required to get these sites to move to newer
kernels.
Steven
believes that most of the core kernel code (excluding modules) is already
covered by huge pages.
Steven also noted that his experiments moving tracepoint code out of line
did not produce measurable benefits, and asked whether the use of
older kernels was due more to their having fewer features than to raw size.
Christoph
agreed that the kernel is covered by huge pages, but noted that there are
a limited number of huge-page TLB entries, and a bloated kernel would
consume them at the expense of application code, which also wants to use
huge pages.
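Christoph's point about scarce huge-page TLB entries can be put in rough numbers. This sketch assumes 2 MiB huge pages and a hypothetical 32-entry huge-page TLB; neither figure comes from the discussion. It also assumes all of the kernel text in question is hot, so it gives an upper bound on the entries the kernel would occupy.

```python
HUGE_PAGE_BYTES = 2 * 1024 * 1024  # assumption: 2 MiB huge pages
MIB = 1024 * 1024

def huge_pages_needed(text_bytes, page_bytes=HUGE_PAGE_BYTES):
    """Number of huge pages needed to map text_bytes (rounded up)."""
    return -(-text_bytes // page_bytes)

def entries_left_for_apps(tlb_entries, kernel_text_bytes):
    """Huge-page TLB entries remaining once the kernel text is mapped."""
    return max(tlb_entries - huge_pages_needed(kernel_text_bytes), 0)

# Hypothetical sizes: a trimmed kernel versus a larger one.
TLB_ENTRIES = 32  # assumption, not a measured value
for size_mib in (6, 12):
    left = entries_left_for_apps(TLB_ENTRIES, size_mib * MIB)
    print(f"{size_mib} MiB of hot kernel text leaves {left} entries")
```

Whether the difference matters in practice is exactly the measurement question that Steven and James raised.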
Christoph agreed that features were also important, and noted that
folding of small functions into larger ones (and vice versa) can help,
but that it can be difficult to determine which way to go.
Josh
argued that factoring out helper functions, when done properly, should
improve the cache hit rate of the code making up the helper functions.
James
agreed that link-time optimizations can group functions, but reiterated
Steven's call for actual measurements of the benefit.
James also pointed out that the compiler often inlines functions, undoing
the careful by-hand refactoring.
Christoph
objected that providing proof requires doing all the work up front.
Steven
replied “Hello Chicken, Meet Egg!”
Julia Lawall
asked what sorts of functions are to be refactored, pointing out that
similar drivers often have similar code, but that only the code for
drivers used by a particular OS instance will be executed.
Julia then asked if all of the similar functions need to actually be
executed for there to be any benefit.
Steven
said that he was thinking more in terms of core kernel code than of drivers.
Matthew Wilcox
believes that any benefit will be workload dependent, with scientific
workloads typically being more sensitive to cache issues than memory-intensive
commercial workloads.
Mark Brown
would like a way of auditing which system calls are actually in use on
a given system as a tinification aid, which prompted
Tony Luck
to suggest strace -c,
which in turn prompted Mark to point out that he needs a system-wide view
of system calls, whereas strace -c only tracks a single process.
Dave Jones
suggested tracepoints or kprobes,
Andy Lutomirski
suggested programming seccomp to send SIGSYS
and then watching the kernel logs, and
David Woodhouse
suggested setting up per-syscall audit rules for each system call believed
to be unused.
Mark Brown
raised concerns about tracepoint buffer overflow, but agreed that it
could work in a suitably constrained setup.
He also agreed that kprobes could work, at least given a reasonably
canned setup.
Mark
is also concerned that the userspace tools required for per-syscall audit might
be too heavyweight for many target systems, but nevertheless believes that
this approach would work in many cases.
Tony Luck
pointed out that the worst-case syscall usage is needed, and that any monitoring
tool will only list out typical syscall usage.
For example, trivial testing might show that bash does
not use the pipe() system call, resulting in fatal disappointment
the first time some user typed dmesg | grep ixgbe.
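Tony's objection can be illustrated with a small sketch that treats a syscall audit as a set difference between the syscalls a kernel offers and those seen in test runs. The syscall lists and trace contents below are made up for illustration and are not real measurements.

```python
def never_observed(all_syscalls, traces):
    """Return syscalls that appear in no recorded trace, sorted by name."""
    observed = set()
    for syscalls in traces.values():
        observed.update(syscalls)
    return sorted(set(all_syscalls) - observed)

# Hypothetical data: a tiny "kernel" and the syscalls seen in each test run.
all_syscalls = ["read", "write", "execve", "clone", "pipe", "remap_file_pages"]

trivial_traces = {
    "bash running single commands": {"read", "write", "execve", "clone"},
}
print(never_observed(all_syscalls, trivial_traces))

# Adding one run that exercises a pipeline changes the answer.
fuller_traces = dict(trivial_traces)
fuller_traces["bash running a pipeline"] = {"read", "write", "execve",
                                            "clone", "pipe"}
print(never_observed(all_syscalls, fuller_traces))
```

The first result wrongly flags pipe as removable. The audit is only as good as the coverage of the workloads that were traced.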
H. Peter Anvin
suggested using seccomp to sandbox processes, preventing them
from using functionality not required for a given super-low-end
embedded system.
Jan Kara
believes that security modules and audit subsystems are to be used for
this purpose, but then asked whether he was dreaming too much.
James Bottomley liked the idea of reducing attack surface, but is concerned about having a huge number of per-syscall Kconfig options and about userspace binary incompatibilities induced by kernels with different sets of supported syscalls. He prefers an approach where there is a Kconfig option for each use case, such as secure routers and reduced-attack-surface distributions. This was seconded by Guenter Roeck and by Steven Rostedt, who recalls Linus asking for config profiles. Dave Jones agreed that it will be tricky to draw a precise line between core and optional syscalls, and that having workload-specific “profiles” could be helpful. However, Dave was skeptical that a reduced-attack-surface option would help, given the tendency of people to want the reduced attack surface, but also to want one or more of the normally excluded system calls. Dave also pointed out that the large distros are guaranteed to have a critical mass of users for each and every system call, which led him to suggest a runtime option to disable unneeded system calls. David Woodhouse questioned the utility of use-case-based configurations, asking if anyone had seen the list of things that OpenWRT packages. James uses OpenWRT, and likes its kitchen-sink approach. However, James doubts that his use case is typical, and thus expects that a secure-router profile would include OpenWRT.
Josh Boyer noted that new system calls could be disabled by default, which would prevent users from growing attached to them, and would also allow the distros to gauge demand for a given new system call. H. Peter Anvin argued that disabling a system call by default was equivalent to not providing it at all, seconded by Michael Kerrisk. Josh countered that the system call could be enabled as soon as some package requiring it was added to that distro, but noted that this does not help the “one binary doesn't work on multiple distros” problem. In fact, Josh believes that any large general-purpose distro would simply enable the widest range of system calls.
Ted Ts'o
pointed out that system calls are a small fraction of the total attack
surface, and that new system calls are added fairly infrequently in
any case
(though
Michael Kerrisk
noted that sched_getattr() and sched_setattr()
were added just this past March),
and sometimes (as in renameat()) require very little code.
Ted is more concerned with the attack surface provided by things like
pluggable security LSMs, control groups, namespaces, and systemd.
Dave Jones
agreed that the rate of addition of system calls has slowed down,
but noted that the rate at which bugs were exposed via system calls
has accelerated.
Dave is not all that concerned about system calls like renameat(),
which simply enhance other system calls,
instead calling out system calls that enable significant quantities of code,
especially those system calls that are used only by a few very well-written
applications, which tend to avoid exposing buggy corner cases by design.
Steven Rostedt
argued that the acceleration in bug-finding is due more to advances in
testing (specifically, Dave Jones's trinity) than to added system calls.
Michael Kerrisk
agreed with Dave, arguing that the APIs delivered to userspace
“continue to be infested with bugs
and design infelicities, many of which go undetected for a long time.”
Michael gave the addition of the recvmmsg() function's timeout
argument as an example of a poorly done feature addition
(for more information, see the bugzilla or discussion thread).
Ted called out the attack surface exposed via non-syscall mechanisms such as pseudo filesystems, new ioctls, fallocate code points, and so on, but with special concern for code that can be exercised by non-root users. Ted notes that root-only code tends to be used by a few well-behaved programs, which makes it easier to change root-only code. In contrast, code used by non-root programs might be used by any code anywhere, making it almost impossible to change the user-visible API, which in turn suggests maximal paranoia in design, coding, review, and testing. Dave Jones pointed out that secure boot means that some root-only code might be used by large amounts of code of dubious provenance. Michael Kerrisk is tracking kernel API changes here.
Ben Hutchings
suggested restricting large-code-size features to root,
for example, using perf_event_paranoid=3 to restrict
sys_perf_event_open() to programs running as root.
Ben also pointed out that Michael Kerrisk's documentation efforts
seemed to find odd corner cases, which re-raises the old question
of whether code should not be accepted into the kernel until the
documentation is done.
Dave Jones
suggested that lack of test cases also block acceptance of new features.
Michael Kerrisk
agreed, arguing that test cases and documentation should go hand in glove,
further suggesting this as a separate LKS topic.
Mention of secure boot prompted Josh Boyer to ask if last year's “What to do about the secure_modules/trusted_kernel/whatever patch set that distros are carrying to support Secure Boot?” topic should be reprised, given that progress along these lines appeared to have been derailed again. The resulting subthread summary may be found here.
Jiri Kosina notes that systemd is a mandatory feature of a number of distros that depends on optional kernel features. In short, pruning the kernel might require pruning userspace utilities and daemons. Ted Ts'o suggested that anyone interested in working this problem feel free to start a separate ksummit-discuss thread, but preferably only after coming up with at least one proposed solution.
Tim Bird called out some of his work on deeply embedded systems (here and here). Tim said that these techniques eliminated 161 syscalls from a default-configured kernel (saving 95KB) and 120 syscalls from a minimal-configured kernel (saving 48KB). H. Peter Anvin called out the irony of one of Tim's techniques being to prevent LTO from preventing unreferenced code from being optimized out.
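Tim's figures imply a rough average saving per removed syscall. The short calculation below uses only the numbers quoted above. It treats the savings as evenly spread across syscalls, which is an assumption, since individual syscalls differ widely in size.

```python
def kb_per_syscall(saved_kb, syscalls_removed):
    """Average size saved per removed syscall, in KB."""
    return saved_kb / syscalls_removed

# Figures quoted from Tim Bird's results.
default_config = kb_per_syscall(95, 161)
minimal_config = kb_per_syscall(48, 120)

print(f"default config: {default_config:.2f} KB per syscall")
print(f"minimal config: {minimal_config:.2f} KB per syscall")
```

Both averages come out well under 1 KB per syscall, consistent with Ted's remark that some system calls require very little code.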