By Scott M. Fulton, III, Betanews
The key to a huge plurality, if not a majority, of exploits that have plagued Microsoft Windows over the past two decades has been tricking the system into executing data as though it were code. A malicious process can place data into its own heap -- the pile of memory reserved for its use -- that bears the pattern of executable instructions. Then once that process intentionally crashes, it can leave behind a state where the data in that heap is pointed to and then executed, usually without privilege attached.
Yet it doesn't take a malicious user to craft a heap corruption. Multithreaded applications that share a common heap become like multiple users of a single, distributed database. Without rigorous discipline to make sure one thread doesn't corrupt the application's heap for all the other threads, the app collapses into something more closely resembling the colloquial meaning of the word "heap." Microsoft would like to present its development environments and runtime frameworks as providing these vigilance services on behalf of the developer, so she can concentrate on her application. But in recent years, what developers don't know about what's going on under the hood has come back to bite them.
Bolt-on patches to the heap corruptibility problem have historically been, in a word, pathetic. The best that security software can generally do is wait for certain patterns of corruption to appear, and then act when the corruption happens -- which is a bit like waiting for the underpants bomber to catch fire before tightening airport security. No longer wasting time crafting ingenious hacks, malicious users have devolved into an industry of slackers, waiting for Microsoft to release the latest bulletin on heap corruption patterns before working to build exploits around them.
So the malicious user's worst nightmare is Mark Russinovich, the SysInternals engineer hired by Microsoft to improve system reliability. Last November, Russinovich triumphantly introduced developers at the company's annual PDC conference in Los Angeles to a multitude of measures implemented in Windows 7 and Windows Server 2008 R2 not only to improve reliability and harden security, but to overcome the deficiencies he openly admits characterized the brief era of Windows Vista. Collectively, just the introductions to these new features by Russinovich and his partners consumed 11 hours over the first two days, all of that time with a standing-room-only crowd.
Though they were just bullet points in Russinovich's long programs, history may record that the most critically important improvements made to Windows 7 -- more important than any round of Patch Tuesday fixes we've ever seen or may ever see -- are those that address the heap corruption dilemma. To the degree that the heap is less corruptible, the operating system is less exploitable. And Patch Tuesday ceases to be a headline.
You start by improving the OS
On its surface, the addition of something called the Unified Background Process Manager might not appear likely to lead to Windows' most important security and reliability improvement to date. That's because it's not a security bolt-on. Rather, it begins with the simple goal of reducing the number of concurrently running processes in Windows.
As Russinovich explained to his audience, much of what made Vista and its predecessors slow was services hanging around in memory, waiting for an excuse to do something useful. Because they had to be ready to react when the time was right, they had to be running, even if they were doing nothing.
"One of the overheads of just running the operating system, are these background services and scheduled tasks, where some of these services are not really providing anything useful until certain things happen," he said. "In the previous design of the Windows service model, if you were writing a Windows service, you said, 'My service really needs to be running if a Bluetooth device gets added to the system.' What you had to do is have your service be an auto-start service, and then watch for the arrival of the Bluetooth device. There's lots of examples like that where a service is really doing nothing other than waiting around for something to do."
Ironically, Windows 2000 had introduced something called Event Tracing for Windows [ETW], ostensibly to drive the Task Scheduler. That enabled systems to trigger backups at certain times, for example. So a triggering mechanism had actually been in place in Windows since before the turn of the decade. Now there was a new excuse to actually use it.
"The Unified Background Process Manager is a component that's implemented in the Service Control Manager, SERVICES.EXE -- we didn't want to introduce another process for this -- [that] serves as the central registrar for what we call triggers. A trigger is an ETW event [that] can be registered to start something or to stop something. So statically, in the definition for your service, you would say [for example], 'My service only needs to be running if a Bluetooth device is ready,' [and] you would specify that ETW event as your trigger. UBPM then registers as an ETW consumer for that event, turning on the Bluetooth device arrival ETW provider, and watching for that ETW event to come through. When it sees that event come through, it's going to go start your service."
Handset owners: What this means is the potential for drivers that don't have to waste time and energy in Windows waiting for your phone to ping the Bluetooth receiver.
"You can also specify events that will stop your service, and it will watch for those as well," Russinovich added. "The nice thing about having UBPM, besides your service not having to run -- it being a central registrar -- is that if multiple services or components are registered for the same trigger, UBPM is the only one that is watching for that event, and multiple things don't need to do it."
If you're wondering when this article stopped being about preventing heap corruption, it didn't -- keep following. One of the processes that had typically been hanging around Vista, waiting for something to happen and consuming CPU cycles, was the Windows Error Reporting (WER) service. In Win7 and WS2K8 R2, that service now uses UBPM. Therefore, the crash is the ETW trigger that wakes up the new WER. The new WER doesn't have to spend its time analyzing the system in wait for a crash to happen -- it knows a crash is happening, otherwise it wouldn't be running.
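The trigger model Russinovich describes, including WER waking only when a crash occurs, can be sketched as a small central registrar. This is a toy illustration in Python, not the actual UBPM or any Windows API: the class TriggerRegistrar, its methods, and the event and service names are all hypothetical. It shows the two properties he highlights: services do not run until their start event arrives, and each event is watched once no matter how many services registered for it.

```python
from collections import defaultdict


class TriggerRegistrar:
    """Toy stand-in for a central trigger registrar (hypothetical, not UBPM)."""

    def __init__(self):
        self.start_triggers = defaultdict(list)  # event name -> service names
        self.stop_triggers = defaultdict(list)   # event name -> service names
        self.running = set()                     # services currently started
        self.watched_events = set()              # each event is watched once

    def register(self, service, start_event, stop_event=None):
        """Declare which event starts a service and, optionally, which stops it."""
        self.start_triggers[start_event].append(service)
        self.watched_events.add(start_event)
        if stop_event is not None:
            self.stop_triggers[stop_event].append(service)
            self.watched_events.add(stop_event)

    def deliver(self, event):
        """Handle one event: start or stop every service registered for it."""
        if event not in self.watched_events:
            return  # nobody asked for this event, so nothing wakes up
        for service in self.start_triggers.get(event, []):
            self.running.add(service)
        for service in self.stop_triggers.get(event, []):
            self.running.discard(service)


# Hypothetical service and event names, for illustration only.
registrar = TriggerRegistrar()
registrar.register("bt_sync", "bluetooth_arrival", "bluetooth_removal")
registrar.register("bt_audio", "bluetooth_arrival")
registrar.register("error_reporting", "process_crash")

registrar.deliver("bluetooth_arrival")  # both Bluetooth services start
registrar.deliver("process_crash")      # error reporting starts only now
```

Note that two services share the bluetooth_arrival trigger, yet the registrar tracks that event only once; that is the "central registrar" benefit described in the quote above.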
With the overhead of crash detection removed, the new WER now has more opportunity to use its valuable time analyzing the causes of crashes as they happen. And with that data, Russinovich and the reliability engineers discovered some extremely valuable information.
"When we look at the root causes of crashes that come into Windows from applications to the online Windows Error Reporting...we saw that roughly 15% of all user-mode crashes are caused by heap corruption," he reported. "And if you look at the shutdown, 30%, roughly one-third, of crashes during the shutdown path -- when your program is shutting down and cleaning up -- are caused by heap corruption."
"The Big Fix"
With more data on hand thanks to the vastly improved and partly repurposed Windows Error Reporting service, engineers could craft more effective ways to address the root problem of as much as one-third of key categories of crashes. That's what enabled them to implement what may go down as the Big Fix.
"We introduced this thing called fault-tolerant heap," stated Mark Russinovich. "The basic idea is, it's monitoring for heap corruptions. When it sees a heap corruption in a process, it enables heap mitigations, then it monitors the effectiveness of that heap mitigation, and if it's effective, it keeps it on. At the process shutdown, [FTH] is keeping a record of mitigations it's detected it had to apply with stack traces, and then capturing that and sending it up to Windows Error Reporting, so that you've got a record of exactly where your application was causing a heap problem. And then we can look at it, see if it's our code or your code [that's to blame] (it's probably your code) and fix it."
The core system library in Windows is NTDLL. There are multiple reasons why NTDLL may cause a crash, but there's only one place for a thread to crash on account of heap corruption, and that's NTDLL.
Once again, an existing Windows architectural element is used for a new purpose: Vista introduced the idea of the shim to accompany older programs loaded into memory, to help ease their transition to the new system and reduce incompatibility problems. When you run an old program in "compatibility mode," you're introducing a shim. Now, with the new event triggers in place, the FTH system can install a new kind of shim atop a process whenever it faults. The purpose of this new shim is to capture information about the state of the process's heap.
As Russinovich explained, faulting processes have a tendency to reference their heaps after they're supposed to have freed them. With the shim in place, such a call to an officially freed heap will be redirected to a new 4 MB buffer that will, as far as the crashing app is concerned, look like the heap. This is a separate area of memory managed by FTH. What this means is, during a heap corruption event, instead of less control in Windows, there's now more control. What's more, each page of allocated memory in the heap is now buffered with an extra eight bytes of otherwise superfluous non-data, simply to mitigate the event of a likely buffer overrun -- another extremely exploitable event.
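To make the two mitigations above concrete, here is a toy model in Python. It is only a sketch under loud assumptions, not how FTH is implemented: the class ToyMitigatedHeap and its methods are hypothetical, the "4 MB buffer" is modeled as a holding area where freed blocks are parked instead of being released at once, and the "extra eight bytes" is modeled as padding added to every allocation.

```python
from collections import OrderedDict

PAD_BYTES = 8                        # extra bytes of padding (figure from the article)
QUARANTINE_LIMIT = 4 * 1024 * 1024   # 4 MB holding area (figure from the article)


class ToyMitigatedHeap:
    """Hypothetical toy heap: pads allocations and delays the effect of free()."""

    def __init__(self, quarantine_limit=QUARANTINE_LIMIT):
        self.live = {}                   # handle -> bytearray
        self.quarantine = OrderedDict()  # freed blocks, oldest first
        self.quarantine_bytes = 0
        self.quarantine_limit = quarantine_limit
        self._next_handle = 1

    def alloc(self, size):
        """Allocate size bytes plus padding that absorbs small overruns."""
        handle = self._next_handle
        self._next_handle += 1
        self.live[handle] = bytearray(size + PAD_BYTES)
        return handle

    def free(self, handle):
        """Park the block in the holding area rather than releasing it."""
        block = self.live.pop(handle)
        self.quarantine[handle] = block
        self.quarantine_bytes += len(block)
        # Release the oldest parked blocks once the holding area is over budget.
        while self.quarantine_bytes > self.quarantine_limit:
            _, old = self.quarantine.popitem(last=False)
            self.quarantine_bytes -= len(old)

    def write(self, handle, offset, data):
        """Write bytes; a stale handle is served from the holding area."""
        block = self.live.get(handle)
        if block is None:
            block = self.quarantine[handle]  # KeyError once truly released
        end = offset + len(data)
        if end > len(block):
            raise IndexError("write past the padded block")
        block[offset:end] = data
```

In this model, a program that writes up to eight bytes past its requested size, or that touches a block shortly after freeing it, keeps running instead of corrupting a neighbor. The cost is extra memory per allocation and per delayed free, which is consistent with Russinovich's remark that the shim's cost depends on how heavily the application uses the heap.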
"When a process faults and we've detected that it's crashed in the heap, and FTH has been applied to it, what we do is watch for further crashes of the process that look like heap crashes," Russinovich explained. "We start out with...our starting value. If the process exits without a heap corruption, then we leave it where it is. If it crashes with a heap corruption, then we say we weren't effective at capturing that, and decrement that count. If that count goes to zero, then we end up removing the shim. The shim's cost depends on the application's use of the heap, and for things like Internet Explorer -- which runs a lot of dodgy stuff inside of it, and might get shims applied to it -- we want to be very careful about only applying the shims if it's really having a beneficial effect."
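The bookkeeping Russinovich describes can be sketched as a simple counter. This is a minimal illustration in Python, not Microsoft's code: the class ShimEffectivenessTracker and its method names are hypothetical, and because the quote elides the actual starting value, it is left as a parameter (the 3 used below is an arbitrary example, not a figure from the talk).

```python
class ShimEffectivenessTracker:
    """Hypothetical model of the per-process count described in the quote."""

    def __init__(self, starting_value):
        if starting_value < 1:
            raise ValueError("starting_value must be at least 1")
        self.count = starting_value
        self.shim_applied = True

    def record_exit(self, heap_corruption):
        """Update the count after one run of the shimmed process."""
        if not self.shim_applied:
            return  # the shim was already removed; nothing left to track
        if heap_corruption:
            # The mitigation did not prevent this crash, so count against it.
            self.count -= 1
            if self.count == 0:
                self.shim_applied = False
        # A clean exit leaves the count where it is.


tracker = ShimEffectivenessTracker(starting_value=3)  # 3 is an arbitrary example
tracker.record_exit(heap_corruption=False)  # clean exit: count unchanged
tracker.record_exit(heap_corruption=True)   # heap crash: count drops by one
```

The design choice worth noticing is that clean exits never raise the count. As described in the quote, the shim only has to fail the configured number of times before it is removed, which keeps its overhead off processes it is not helping.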
After years of semi-fruitful brain-wracking over how the operating system should best respond to heap corruption events, a side benefit of an effort to simply reduce the number of running services, so the OS could respond better to things like Bluetooth, led to perhaps the most important reliability and security enhancement to Windows since address space layout randomization. When someone asks, "What does improving performance an iota or two really matter to users?" here is your response.
Having fewer running processes is partly responsible for the measurable size reduction in Windows 7 relative to Windows Vista.
11:00 am EST December 31, 2009 · If I were to give a pop quiz to several of the various media aggregators who read and "reported" on this story, posted yesterday, I would be handing out F's until I had to borrow some from my own last name. So in the interest of correcting the News from the Alternate Universe:
- No, there is no new zero-day Windows flaw or bug discovered by Mark Russinovich. This is a story of a chronic Windows problem addressed by improvements to the operating system architecture that were rolled out in Windows 7 and Windows Server 2008 R2.
- Russinovich did not build these new features, such as Unified Background Process Manager, alone. There is a large company called Microsoft which is his employer, and which has hired a battalion of developers.
- This is not some future feature. These improvements are things that are in Windows 7 now. If you've installed it on your computer, you're using them now.
- Heap corruption is not the only cause of crashes. Likewise, not every process crash leaves the heap corrupted.
As always, Betanews is not responsible for misinterpreted facts on the part of aggregators beyond our control.
Copyright Betanews, Inc. 2009