Neohapsis is currently accepting applications for employment. For more information, please visit our website www.neohapsis.com or email hr@neohapsis.com
[Full-Disclosure] Linux kernel i386 SMP page fault handler privilege escalation

From: Paul Starzetz (ihaquerisec.pl)
Date: Wed Jan 12 2005 - 05:58:18 CST

Hash: SHA1

Synopsis: Linux kernel i386 SMP page fault handler privilege escalation
Product: Linux kernel
Version: 2.2 up to and including 2.2.27-rc1, 2.4 up to and including
           2.4.29-rc1, 2.6 up to and including 2.6.10
Vendor: http://www.kernel.org/
URL: http://isec.pl/vulnerabilities/isec-0022-pagefault.txt
CVE: CAN-2005-0001
Author: Paul Starzetz <ihaquerisec.pl>
Date: Jan 12, 2005


Locally exploitable flaw has been found in the Linux page fault handler
code that allows users to gain root privileges if running on
multiprocessor machine.


The Linux kernel is the core software component of a Linux environment
and is responsible for handling of machine resources. One of the
functions of an operating system kernel is handling of virtual memory.
On Linux virtual memory is provided on demand if an application accesses
virtual memory areas.

One of the core components of the Linux VM subsystem is the page fault
handler that is called if applications try to access virtual memory
currently not physically mapped or not available in their address space.

The page fault handler has the function to properly identify the type of
the requested virtual memory access and take the appropriate action to
allow or deny application's VM request. Actions taken may also include a
stack expansion if the access goes just below application's actual stack

An exploitable race condition exists in the page fault handler if two
concurrent threads sharing the same virtual memory space request stack
expansion at the same time. It is only exploitable on multiprocessor
machines (that also includes systems with hyperthreading).


The vulnerable code resides for the i386 architecture in
arch/i386/mm/fault.c in your kernel source code tree:

[186] down_read(&mm->mmap_sem);

       vma = find_vma(mm, address);
       if (!vma)
              goto bad_area;
       if (vma->vm_start <= address)
              goto good_area;
       if (!(vma->vm_flags & VM_GROWSDOWN))
              goto bad_area;
       if (error_code & 4) {
               * accessing the stack below %esp is always a bug.
               * The "+ 32" is there due to some instructions (like
               * pusha) doing post-decrement on the stack and that
               * doesn't show up until later..
[*] if (address + 32 < regs->esp)
                     goto bad_area;
       if (expand_stack(vma, address))
              goto bad_area;

where the line number has been given for the kernel 2.4.28 version.

Since the page fault handler is executed with the mmap_sem semaphore
held for reading only, two concurrent threads may enter the section
after the line 186.

The checks following line 186 ensure that the VM request is valid and in
case it goes just below the actual stack limit [*], that the stack is
expanded accordingly. On Linux the notion of stack includes any
VM_GROWSDOWN virtual memory area, that is, it need not to be the actual
process's stack.

The exploitable race condition scenario looks as follows:

A. thread_1 accesses a VM_GROWSDOWN area just below its actual starting
address, lets call it fault_1,

B. thread_2 accesses the same area at address fault_2 where fault_2 +
PAGE_SIZE <= fault_1, that is:

[ NOPAGE ] [fault_1 ] [ VMA ] ---> higher addresses
[fault_2 ] [ NOPAGE ] [ VMA ]

where one [] bracket pair stands for a page frame in the application's
page table.

C. if thread_2 is slightly faster than thread_1 following happens:

[ PAGE2 ] [PAGE1 VMA ]

that is, the stack is first expanded inside the expand_stack() function
to cover fault_2, however it is right after 'expanded' to cover only
fault_1 since the necessary checks have already been passed. In other
words, the process's page table includes now two page references (PTEs)
but only one is covered by the virtual memory area descriptor (namely
only page1). The race window is very small but it is exploitable.

Once the reference to page2 is available in the page table, it can be
freely read or written by both threads. It will also not be released to
the virtual memory management on process termination. Similar techniques
like in


may be further used to inject these lost page frames into a setuid
application in order to gain elevated privileges (due to kmod this is
also possible without any executable setuid binaries).


Unprivileged local users can gain elevated (root) privileges on SMP


Paul Starzetz <ihaquerisec.pl> has identified the vulnerability and
performed further research. RedHat reported that a customer also pointed
out some problems with the page fault handler on SMP about 20.12.2004
and they already included a patch for this vulnerability in the
kernel-2.4.21-27.EL release, however the bug did not make it to the
security division.



This document and all the information it contains are provided "as is",
for educational purposes only, without warranty of any kind, whether
express or implied.

The authors reserve the right not to be responsible for the topicality,
correctness, completeness or quality of the information provided in
this document. Liability claims regarding damage caused by the use of
any information provided, including any kind of information which is
incomplete or incorrect, will therefore be rejected.


A proof of concept code won't be disclosed now. Special thanks goes to
OSDL and Marcelo Tosatti for providing a SMP testbed.

- --
Paul Starzetz
iSEC Security Research

Version: GnuPG v1.0.7 (GNU/Linux)


Full-Disclosure - We believe in it.
Charter: http://lists.netsys.com/full-disclosure-charter.html