DPC STACK GROWER
DPC Stack Grower is a program I wrote some time ago that allows you to modify the size of the kernel stack the system uses for servicing "Deferred Procedure Calls" in Windows NT and later. For the reader, it can serve as an example of a fairly complex hook application that works on SMP systems.
INTRODUCTION
Anyone who knows me knows that I'm kinda obsessive about my primary Windows 2000 system: for my everyday development work I have been using the same DELL Pentium 3 500 MHz notebook since 2000. I installed Windows 2000 Advanced Server on that machine only once, in 2000: since then I have never needed to reinstall the operating system or upgrade the machine or the system itself. One further reason why I am so conservative is that I have installed on it 40 GBytes of applications and programs that I use every day for my work-related activities: SQL Server 2000, Exchange Server, all the relevant versions of Visual Studio (from 6.0 to Whidbey), etc. Furthermore, the installation of Terminal Services on that machine has disabled the standby and hibernate APM features: therefore this computer is powered on almost 24 hours a day and is always connected to the internet, almost always downloading something. As you can imagine, its first and most important requirement is that it must never get a blue screen: even when I am not using it, I always leave open a lot of windows and applications that I love to find still there when I return to the computer.
Well, my problems began when my ISP doubled the speed of my ADSL internet connection. For some unknown reason, from time to time, after leaving the machine locked with some internet application open in the background transferring data at the higher rate, I would return to the computer and discover with horror that the machine had rebooted itself. Fortunately the system was able to take a full memory dump at the time of the crash, so at least some light could be shed on the problem.
THE EXCEPTION_DOUBLE_FAULT
After opening the crash file with WinDbg, I discovered that the reason for the reboot was an UNEXPECTED_KERNEL_MODE_TRAP error and, more precisely, that the EXCEPTION_DOUBLE_FAULT bugcheck argument was specified. The documentation says that these kinds of errors happen when there are hardware problems on the machine: however, in my experience, in 99% of cases an EXCEPTION_DOUBLE_FAULT error happens in Windows simply because the kernel stack has overflowed due to a poorly written third party driver.
The stack of kernel threads is a very scarce resource in Windows NT: in Windows 2000, by design, only 12KB of NonPaged pool memory is reserved for the stack of each kernel thread. NonPaged memory is always resident and cannot be paged out to disk by the system: in fact the code that runs in kernel space may be executed at any IRQL. Hardware interrupts, for example, are serviced by the system in the context of the same KTHREAD that was executing at the time of the interruption: because the CPU privilege level of both the ISR and the interrupted code is the same, no thread context switch occurs and so the stack memory is shared among all kernel code. Incidentally, when the exception or interrupt happens at a CPL of 3, while the processor is executing user code mapped below the MmSystemRangeStart limit, a stack switch does occur and the stack pointer and segment details are retrieved by the CPU from the NT common TSS. The 3-page stack memory limit is hardcoded and cannot be changed in any way. The layered nature of the driver model in NT may increase the incidence of this kind of problem: drivers with large stack frames that repeatedly call other drivers of the same type can consume a lot of stack space.
Specifically, the fancy EXCEPTION_DOUBLE_FAULT name can be explained by the fact that the initial page fault, which happens when the processor hits the stack guard page at the bottom of the 12KB allocation, raises another exception because the ESP register points to an unmapped memory area. In fact, because the CPL is 0 and no stack switch is expected, the CPU is unable to save on the stack the processor state and the error code to pass to the exception handler (writing at the invalid ESP memory position results in that second page fault). The reason the system at this point doesn't reboot itself spontaneously is that the processor double fault is handled in the IDT by a Task Gate whose TSS points to the KiTrap08 private kernel function (whose purpose is to crash the system "gracefully", showing the Blue Screen and possibly taking a snapshot of the memory in the MEMORY.DMP file). The KiTrap08 function is executed on its own private stack space, whose pointers are specified in its special TSS. The stack memory reserved for double fault and NMI exceptions is allocated at system startup for each initialized processor in the system; in the case of the bootstrap processor it is statically mapped in the NTOSKRNL module (as is the stack memory of the idle thread) and is identified by the "KiDoubleFaultStack" symbol.
The first thing that usually is done in these cases is to examine the stack back trace at the time of the system death:
kd> Kffff
Memory ChildEBP RetAddr
00000000 8040f54c nt!KiTrap08+0x3e
ed42100c ed42100c 8041f505 nt!ExAllocateFromPPNPagedLookasideList+0x20
1c ed421028 ed33045a nt!IoAllocateMdl+0x5e
30 ed421058 ed3319eb USBD!USBD_ProcessURB+0x122
34 ed42108c ed330c38 USBD!USBD_FdoDispatch+0x221
28 ed4210b4 ed3d0409 USBD!USBD_Dispatch+0x76
34 ed4210e8 8041fbbb uhcd!UHCD_Dispatch+0x23
14 ed4210fc ed100539 nt!IopfCallDriver+0x35
4 ed421100 ed101fd2 usbhub!USBH_PassIrp+0x15
1c ed42111c ed10228a usbhub!USBH_PdoUrbFilter+0x64
1c ed421138 ed10069a usbhub!USBH_PdoDispatch+0xd8
10 ed421148 8041fbbb usbhub!USBH_HubDispatch+0x46
14 ed42115c bd7eff12 nt!IopfCallDriver+0x35
WARNING: Stack unwind information not available. Following frames may be wrong.
2c ed421188 8041fd6f adiusbaw+0x11f12
2c ed4211b4 ed331c92 nt!IopfCompleteRequest+0xab
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
24 ed423f58 ed3d2d96 USBD!USBD_CompleteRequest+0x4e
4c ed423fa4 ed3d2769 uhcd!UHCD_CompleteTransferDPC+0x28e
3c ed423fe0 804650d4 uhcd!UHCD_IsrDpc+0x9d
14 ed423ff4 804041b6 nt!KiRetireDpcList+0x30
By looking at this stack trace, the cause of the system crash becomes clear: subtracting the frame pointers of the first and the last function in the trace (the KiRetireDpcList and ExAllocateFromPPNPagedLookasideList functions) gives 0x2FE8: this value is extraordinarily close to the aforementioned 0x3000 hardcoded stack limit. Evidently the "ExAllocateFromPPNPagedLookasideList" function was called when not enough stack space was available for its automatic variables. This triggered the EXCEPTION_DOUBLE_FAULT phenomenon exactly as described in the first part of this section.
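The arithmetic above can be reproduced with a few lines of C. This is only a minimal sketch: the helper names (stack_bytes_used, stack_bytes_left) are made up for illustration, the two frame pointers are the ChildEBP values of the KiRetireDpcList and ExAllocateFromPPNPagedLookasideList frames in the trace, and the distance between two frame pointers is just an approximation of the real stack consumption.

```c
#include <stdint.h>

#define PAGE_SIZE_X86      0x1000u  /* 4 KB pages on x86 */
#define KERNEL_STACK_PAGES 3u       /* Windows 2000 kernel stack, as stated in the text */
#define KERNEL_STACK_SIZE  (KERNEL_STACK_PAGES * PAGE_SIZE_X86)  /* 0x3000 bytes */

/* Hypothetical helper: bytes of stack between the outermost and the
   innermost frame pointer. The x86 stack grows downward, so the
   outermost frame has the higher address. */
static uint32_t stack_bytes_used(uint32_t outermost_ebp, uint32_t innermost_ebp)
{
    return outermost_ebp - innermost_ebp;
}

/* Hypothetical helper: bytes left before the 3-page limit is reached. */
static uint32_t stack_bytes_left(uint32_t used)
{
    return used >= KERNEL_STACK_SIZE ? 0u : KERNEL_STACK_SIZE - used;
}
```

Calling stack_bytes_used(0xED423FF4, 0xED42100C) yields 0x2FE8, and stack_bytes_left(0x2FE8) yields 0x18: only 24 bytes separate the two frame pointers from the full 0x3000 bytes.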
POORLY WRITTEN THIRD PARTY DRIVERS
Poorly written third party drivers are by far the major cause of deadly crashes in Windows NT systems. Sadly, it is amazing how many poorly trained kernel programmers are out there writing drivers and kernel-related software. For example, if I try to boot my machine with Driver Verifier attached to all the non-Microsoft drivers installed in the system and I activate only the standard tests (i.e. I leave the low resources simulation test disabled, for example), the OS is not even able to get past the "Starting Windows" boot-up message.
As you can guess looking at the stack trace above, the cause of the malfunction is the badly written driver of my U.S. Robotics modem (adiusbaw.sys). For an unknown reason that I decided not to investigate further, the increased line speed my ISP assigned to me was causing the modem driver to consume too much stack space when executing one of its DPC functions. You can conclude that the problem originates in the context of a DPC function execution because of the KiRetireDpcList kernel function, which is always present at the head of every stack trace I collected from the memory dumps I examined. This function is called by the OS to process the Deferred Procedure Calls queue of each initialized processor in the system.
Before deciding to pursue the extreme solution of writing an application for increasing the size of the kernel stack memory available when DPC functions are serviced, I tried the following:
|
Checked the U.S. Robotics site for updates of my driver. None found. |
|
|
Tried to modify the "MaxDataRate" property of the modem device. Sadly, whoever wrote the driver did not even think that it would be wise to read the value of this setting in order to enforce it when connecting to and/or communicating with my ISP server. The modem always transmits and receives data at the fastest rate possible (that is logical; what is illogical is the presence of a parameter that is not read)... |
Incidentally, it must be noted that (among the other troubles it has caused) this driver is strangely incompatible with SoftICE: if you pop into the debugger by pressing the hotkey, the system inevitably bugchecks when normal execution resumes. Moreover, if the modem is left connected to the USB port before powering on the computer, from time to time Windows refuses to boot, showing an infamous blue screen...
THE KIRETIREDPCLIST FUNCTION
The KiRetireDpcList function is called for draining the DPC queue of the current processor. It is called in a different manner and in a different context according to the version of Windows being considered (2000 or XP) and even to which processor queue is being serviced (whether that of the bootstrap processor or not).
In the case of Windows 2000, my research revealed that this function is called, whenever there are DPC requests to service, either from the KiIdleLoop function or from the KiDispatchInterrupt implementation. KiIdleLoop is the function that is executed when your system is in the idle state: specifically, it waits for hardware interrupts, checks whether there is a thread to schedule immediately for execution and, above all for our purposes, processes its DPC queue if it is not empty. The implementation of KiDispatchInterrupt is very similar to that of KiIdleLoop, in that it is called for draining the DPC queue of the current processor and for thread scheduling. The most obvious difference between the two is that KiDispatchInterrupt is the convenient way the system uses whenever it needs to drain the DPC queue explicitly, as when the current IRQL drops below DISPATCH_LEVEL. When playing with SoftICE or WinDbg it is very easy to see in a stack trace a KiDispatchInterrupt call below a KfLowerIrql invocation in the call order... A more intriguing difference is the way the ESP register is managed just before KiRetireDpcList is called. In the case of KiIdleLoop, the DPC invocations are serviced on the stack of the idle thread. The memory used for the idle stack is mapped directly from a section of the NTOSKRNL image. This is valid only for uniprocessor machines and for the bootstrap processor in the case of multiprocessor systems. We may call this stack space "the bootstrap stack", because it is referenced and used in the early phases of system initialization and is later inherited by the KiIdleLoop implementation.
One very interesting characteristic of the bootstrap stack is that it has no guard page: in fact the memory of the bootstrap stack is adjacent to some memory that is reserved by NT for the NMI and the double fault exception itself, as you can verify with SoftICE:
This means that the DPC functions queued on the bootstrap processor and serviced in the context of the KiIdleLoop function actually have a bit more memory available on their stacks when their stack frames and memory requirements become more substantial. This however has a drawback: in this execution context, if the stack space required when executing DPC functions or servicing hardware interrupts grows (very) excessively, consuming also the double fault stack space that happens to sit next to the idle stack memory, the system will begin to write to other areas of the ".data" section of the NTOSKRNL image. As you can imagine, this can have disastrous effects on system stability, and in the case of a stack overflow it is unlikely that you'll get an EXCEPTION_DOUBLE_FAULT bluescreen: the absence of the guard page will simply prevent the triggering of the double fault mechanism described in the first part of the article. The resulting (fatal) data corruption (several vital operating system structures are hosted in the ".data" section of the kernel, as you can guess) will simply hang your system (in the best case) or will cause a sneaky fatal system error whose causes will not be immediately clear. However, as stated, this can happen only for DPC functions executed from the KiIdleLoop implementation on the bootstrap processor. The situation is different for the stack memory allocated for the idle threads of non-bootstrap processors on SMP machines: in this case the stack memory is allocated by calling the standard "MmCreateKernelStack" function, which is also used when creating normal user or kernel threads.
It has to be noted that the DPC queue drained from the KiDispatchInterrupt function (not from the KiIdleLoop function) is processed (even on the bootstrap processor) using stack memory allocated in the standard way by calling MmCreateKernelStack. In this case, the KiDispatchInterrupt function switches to the DPC stack prior to calling the KiRetireDpcList implementation. A pointer to the corresponding DPC stack, allocated for each initialized processor at system startup (even for the bootstrap one, as stated above), is retrieved from the Processor Control Region (PCR) data structure and then simply MOVed into the ESP register, as you can see here:
It is very important to note for our purposes that this stack switch is minimal in form: it consists only of an x86 MOV instruction. No actual thread context switch occurs: the current thread remains the same, and not even the stack base and limit pointers are updated in the KTHREAD and PCR structures... I have implemented this same mechanism in my DpcStackGrower example application for enlarging the kernel stack that is made available when draining a DPC queue.
THE DPCSTACKGROWER APPLICATION
For implementing the stack enlarging capabilities of DpcStackGrower, I considered various solutions:
|
Queuing a number of targeted DPCs to all the processors in the system and calling the MmGrowKernelStack kernel private function from the DPC implementation. Unfortunately this function cannot work on the special KiIdleLoop stack of the bootstrap CPU (which is mapped in a section of the kernel image) and cannot be used on the transitional stack memory of the KiDispatchInterrupt function. |
|
|
Replacing the pointer to the DPC stack memory in the PCR structure. Unfortunately this solution works only for the DPCs that are serviced in the context of the KiDispatchInterrupt function. The KiIdleLoop DPCs continue to use their original stack space. |
|
|
Hooking the KiRetireDpcList function for replacing in the ESP register the pointer to the original stack with a pointer to a previously allocated NonPaged memory area of an arbitrary size. This is the solution I chose for the DpcStackGrower application and it seemed to work very well for reducing drastically the number of bluescreens caused by my faulty modem driver. |
In order to use the application, the user must specify the new size of the kernel stack used when draining the DPC queue. A screenshot of the frontend MFC application is provided for reference at the beginning of this article.
Then the MFC application loads the kernel driver and sends to it the IOCTL_DPCSTKGRNT_INITIALIZE command in order to start the hook procedure. Once the KiRetireDpcList is under the control of the hook, the kernel driver has to stay resident in memory until the system is turned off. The driver responds to the initialization IOCTL returning to the user mode counterpart a success flag and the found virtual address of the KiRetireDpcList function.
The first step in the hook process is to check whether writing to the code sections of kernel images is possible. This can be done in a number of ways:
|
Call the "IsEnforceWriteProtectionSetTo0" function defined in the kernel driver dpcstkgr.c source file in order to check out whether the "read-only write protection" is set to 0 in the registry before any memory write attempt. Consider that by default the "EnforceWriteProtection" value in the "\REGISTRY\MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" key is not present and by design is defaulted to 1 in newly installed Windows 2000/XP systems. Before using DpcStackGrower, you must create a DWORD value with that name and you have to set it to 0. Then you have to reboot your machine for allowing the system to enforce the new setting. |
|
|
This method is more a hack than a solution and was NOT implemented in DpcStackGrower for simplicity. You have to save the contents of the CR0 register and then clear its WP bit (bit 16; refer to the Intel Pentium 4 manuals, Volume 3, Section 2.5, for further information and a complete description). Then you can write to the contents of read-only image sections. When finished, you should restore the contents of the CR0 register. It is a good idea to disable the hardware interrupts (CLI instruction) before doing anything. |
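The bit manipulation involved in the CR0 method can be sketched as follows. This is not code from DpcStackGrower (which does not implement this method): the helper names are hypothetical, and the functions only model the arithmetic on a saved CR0 value. Actually reading and writing CR0 requires privileged MOV instructions executed at CPL 0, which are deliberately left out here.

```c
#include <stdint.h>

#define CR0_WP_BIT  16u
#define CR0_WP_MASK ((uint32_t)1u << CR0_WP_BIT)  /* 0x00010000 */

/* Hypothetical helper: nonzero when supervisor-mode write protection is on. */
static int cr0_wp_is_set(uint32_t cr0)
{
    return (cr0 & CR0_WP_MASK) != 0;
}

/* Hypothetical helper: the value to load into CR0 in order to allow
   ring 0 writes to read-only pages. The caller keeps the original value
   and writes it back when the patch is complete. */
static uint32_t cr0_clear_wp(uint32_t cr0)
{
    return cr0 & ~CR0_WP_MASK;
}
```

For example, assuming a saved CR0 value of 0x8001003B (an illustrative value with PG, WP and PE set), cr0_clear_wp returns 0x8000003B; restoring protection simply means writing the saved 0x8001003B back.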
The next step is to determine the address of the KiRetireDpcList function in memory. To achieve this, I have used a method that I studied and implemented in BugChecker and that I have also seen used in the Compuware SoftICE debugger itself. In fact, for a third party kernel debugger to intercept and control special system events and actions (such as a physical page allocation in response to a copy-on-write request, needed among other things to manage user-mode breakpoints correctly), one or more public or private kernel APIs have to be hooked. Typically, in the case of exported public functions, the problem is relatively easy to solve: the hooker can simply import the API of interest statically in its own image and then assign the address of the function to a variable, or it can iterate through the export table of the target kernel module searching for the required function. Generally, if the system API is somehow guaranteed to be supported and exported in all the present and future versions of the operating system, the former approach is preferred; in the other cases, the latter approach of export table iteration has to be chosen. In fact, importing in a driver module a private and/or undocumented function exported by a kernel image (NTOSKRNL, HAL etc.) can cause the final driver module to fail to be started by the SCM on other versions of the platform. As happens for regular user mode DLLs and executables, if the loader is unable to resolve the address of an imported function, the image cannot be mapped in memory and an error is returned (LoadLibrary returns NULL and GetLastError returns ERROR_PROC_NOT_FOUND).
Similarly, an unresolved reference in a kernel driver leads to the same results (for example the "ExReleaseResourceForThread" function is exported from the Windows 2000 NTOSKRNL but not from the Windows XP kernel...): in any case, you could move the search functions out to secondary driver modules that could be loaded and unloaded as needed according to the current platform version, without incurring such problems with the image loader.
However, typically, in real implementations, the virtual address of a private function (such as our KiRetireDpcList) that is not exported by any kernel module (thus preventing the use of the two methods described above) is tracked down in memory as follows:
|
In the case of BugChecker and SoftICE, if the kernel symbols are present and the image characteristics match (size of image, timestamp etc.), the function virtual address is obtained from symbol information. |
|
|
If symbol info is not present or is mismatched, another approach is followed. Specifically, the hooker module statically imports a kernel function whose virtual address is reasonably near the address of the private function we intend to track down. Then, starting from this address, a byte search is performed, using as the search string a short sequence of bytes taken from the entry point of the function we are searching for. In the case of DpcStackGrower, we search for the first 9 bytes of the KiRetireDpcList entry point starting from the virtual address of several well-known exported APIs, as explained above. The correct approach needs to check the version of the current platform in order to determine which search string to use and which kernel API to take as the starting point of the search. As you can imagine, both the "signature" of the searched function and its position in a kernel module can change due to actual changes in the operating system source files or to compiler and/or linker optimizations. In the recent versions of SoftICE, for example, to simplify the whole procedure, the search strings and the exported function references needed for the search are stored in an external file (OSINFO.DAT) for easier deployment and distribution. When a new build of an operating system or (under some circumstances) a completely new platform is released, it is enough to update the OSINFO.DAT file in order to provide the debugger with all the information required for tracking down the modified and/or relocated private functions. |
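The byte search described in the second point can be sketched in plain C as follows. This is a minimal illustration and not the actual DpcStackGrower code: the function names, the forward-only search direction and the explicit window length are my assumptions. In a real driver the start pointer would be the address of the imported kernel API, and the window would have to stay inside the mapped kernel image.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scan 'window' bytes starting at 'start' for the
   first occurrence of the 'sig_len'-byte signature 'sig'.
   Returns a pointer to the match, or NULL if there is none. */
static const unsigned char *find_signature(const unsigned char *start, size_t window,
                                           const unsigned char *sig, size_t sig_len)
{
    size_t i;

    if (sig_len == 0 || window < sig_len)
        return NULL;

    for (i = 0; i + sig_len <= window; i++) {
        if (memcmp(start + i, sig, sig_len) == 0)
            return start + i;
    }
    return NULL;
}

/* Hypothetical helper: count all (possibly overlapping) matches in the
   window. A signature that matches more than once is too short or too
   generic to identify a function safely. */
static size_t count_signature(const unsigned char *start, size_t window,
                              const unsigned char *sig, size_t sig_len)
{
    const unsigned char *end = start + window;
    const unsigned char *p = start;
    const unsigned char *hit;
    size_t count = 0;

    while ((hit = find_signature(p, (size_t)(end - p), sig, sig_len)) != NULL) {
        count++;
        p = hit + 1;  /* resume one byte past the previous match */
    }
    return count;
}
```

A cautious hooker would refuse to patch anything unless count_signature returns exactly 1 for the chosen window, since patching the wrong address in kernel code is fatal.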
As you can imagine, the best approach surely remains to get the symbol information and then obtain the function virtual address from it. The availability of the Microsoft symbol server makes the adoption of the symbol info solution even more logical. However, when no internet connection is available, or in the rare circumstances where the symbol info is not enough, here are a few guidelines that must be followed in order to implement the function search procedure correctly:
|
Sample the signatures of the private functions of interest for each service pack and for both the free and the checked versions of the platform. |
|
|
Call the RtlGetVersion function in order to obtain the exact version of the operating system. According to the returned version info, decide which search string and which OS exported function to use in the search. |
|
|
Determine whether the Uniprocessor or the Multiprocessor kernel image has been mapped in memory: a function's signature and position may change from a UP version to an SMP version of the kernel. This is not an easy task: reading, for example, the KeNumberProcessors variable is not enough. In fact, specifying the /ONECPU option in the boot.ini file may mess up the whole thing because, in this case, the MP kernel is still loaded at boot time. One solution I have studied is to check the entry point of a function like "KeAcquireSpinLockAtDpcLevel": as you may know, spin locks are implemented only in MP versions of the kernel; in UP kernels, non-interrupt critical sections (like executive spin locks) can be implemented simply by raising the IRQL to DISPATCH_LEVEL. In UP kernels the aforementioned "KeAcquireSpinLockAtDpcLevel" is implemented only as an x86 RET instruction:
BOOLEAN IsMpKernel ()
{
    // In UP kernels the whole function body is a single return instruction.
    BYTE* pbFN = (BYTE*) & KeAcquireSpinLockAtDpcLevel;
    return * pbFN != 0xC2; // 0xC2 = opcode of the x86 "RET imm16" instr.
}
|
|
|
Determine whether the PAE kernel has been mapped at boot time, because, as in the case of MP kernels, a function signature and position may change from a PAE version to a NON-PAE version of the kernel image. For determining this, you can check it out in the registry:
NTSTATUS IsPhysicalAddressExtensionSetTo0( OUT BOOLEAN* pbResult )
{
NTSTATUS nsRetVal = STATUS_UNSUCCESSFUL;
UNICODE_STRING usMemoryManagementKeyName;
OBJECT_ATTRIBUTES oaMemoryManagementKeyAttr;
NTSTATUS nsOpenMemoryManagementKeyRes;
HANDLE hMemoryManagementKeyHandle;
PKEY_VALUE_FULL_INFORMATION pkvfiPhysicalAddressExtensionValueBuffer;
ULONG ulPhysicalAddressExtensionValueBufferLen = 8 * 1024;
UNICODE_STRING usPhysicalAddressExtensionValueName;
NTSTATUS nsReadPhysicalAddressExtensionValueRes;
* pbResult = FALSE;
// Read from the Registry.
RtlInitUnicodeString( & usMemoryManagementKeyName, L"\\REGISTRY\\MACHINE\\SYSTEM\\CurrentControlSet\\Control\\Session Manager\\Memory Management" );
InitializeObjectAttributes( & oaMemoryManagementKeyAttr,
& usMemoryManagementKeyName, 0, NULL, NULL );
nsOpenMemoryManagementKeyRes = ZwOpenKey( & hMemoryManagementKeyHandle,
KEY_READ, & oaMemoryManagementKeyAttr );
if ( nsOpenMemoryManagementKeyRes == STATUS_SUCCESS && hMemoryManagementKeyHandle )
{
pkvfiPhysicalAddressExtensionValueBuffer = (PKEY_VALUE_FULL_INFORMATION) ExAllocatePool( PagedPool, ulPhysicalAddressExtensionValueBufferLen );
if ( pkvfiPhysicalAddressExtensionValueBuffer )
{
RtlInitUnicodeString( & usPhysicalAddressExtensionValueName, L"PhysicalAddressExtension" );
nsReadPhysicalAddressExtensionValueRes = ZwQueryValueKey( hMemoryManagementKeyHandle,
& usPhysicalAddressExtensionValueName, KeyValueFullInformation,
pkvfiPhysicalAddressExtensionValueBuffer, ulPhysicalAddressExtensionValueBufferLen, & ulPhysicalAddressExtensionValueBufferLen );
if ( nsReadPhysicalAddressExtensionValueRes == STATUS_SUCCESS &&
| |