By Scott M. Fulton, III, Betanews
The reason Windows Vista seemed slow -- and, strangely, seemed even slower over time -- is now abundantly clear to Microsoft's architects: The evolution of computer hardware, particularly the CPU, exceeded anyone's expectations at the time of Vista's premiere in early 2007. The surge in virtualization, coupled with the rise of the multicore era, produced a new reality in which Vista suddenly found itself managing systems with more than 64 total cores.
Architects had simply not anticipated that the operating system would be managing this many cores, this soon -- at least, that appears to be the underlying message we're receiving here at PDC 2009 in Los Angeles. While independent scientists were still speculating about possible performance drop-offs beyond 8 cores, server administrators were already seeing them. Windows Vista's design made tradeoffs: it gave up efficiencies that could have been obtained through complex methods, in exchange for simplicity.
Those tradeoffs were fair enough for the dual-core era, but that era lasted only a short while. Quad-core processors are quickly becoming commonplace, even in laptops, and with Vista's architecture, users could actually feel the lack of scalability. They were investing in quad-core systems earlier in Vista's lifecycle than originally anticipated; when four cores did not deliver close to double the performance of two, and later when Vista's lag times slowed their computers over time, some critical elements of Vista's architecture became not an advantage but a burden.
Microsoft performance expert Mark Russinovich is one of the more popular presenters every year at PDC, mainly because he demonstrates from the very beginning of his talks that he absolutely understands what his audience is going through. It's difficult for a performance expert to put a good face on Vista...and Russinovich, to his credit, didn't even try.
After quizzing the audience as to how many used Windows 7 on a daily basis (virtually all of the crowd of about 400 people), Russinovich asked, "How many people are sticking with Windows Vista because that's so awesome?" He pretended to wait for an answer, and just before everyone's hands had descended, he answered his own question: "Yeah, that's what I thought.
"One of the things we had decided to do with Windows 7 was, we got a message loud and clear, especially with the trend of netbooks, on top of [other] things," he went on. "People wanted small, efficient, fast, battery-efficient operating systems. So we made a tremendous effort from the start to the finish, from the design to the implementation, measurements, tuning, all the way through the process to make sure that Windows 7 was fast and nimble, even though it provided more features. So this is actually the first release of Windows that has a smaller memory footprint than a previous release of Windows, and that's despite adding all [these] features."
To overcome the Vista burden, Windows 7 had to present scalability that everyday users could see and appreciate.
As kernel engineer Arun Kishan explained, "When we initially decided to be able to support 256 logical processors, we set the scalability goal to be about 1.3 - 1.4x, up at the high end. And our preliminary TPC-C number was about 1.4x scalability on 128 LPs [logical processors], when compared to a 64 LP system. So that's not bad; but when we dug into that, we saw that about 15% of the CPU time was spent waiting for a contended kernel spinlock." What Kishan means by that term is, while one thread is executing a portion of the kernel, other threads have to wait their turn. About the only way they can do that and remain non-idle is by spinning their wheels -- a kind of "running in place" called a spinlock.
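Kishan's "running in place" can be sketched in a few lines. The following is a hypothetical Python model of a test-and-set spinlock, not the Windows kernel's implementation: a thread that fails to acquire the flag simply retries in a tight loop, consuming CPU the entire time it waits.

```python
import threading

class SpinLock:
    """Toy spinlock: a waiting thread keeps retrying ("runs in place")
    instead of blocking, which is exactly what burns CPU time under
    contention. Illustrative only -- not the kernel's implementation."""
    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        # Spin: the CPU does no useful work on any failed attempt.
        while not self._flag.acquire(blocking=False):
            pass

    def release(self):
        self._flag.release()

counter = 0
lock = SpinLock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        lock.acquire()
        counter += 1   # critical section guarded by the spinlock
        lock.release()

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 40000: every increment survived the contention
```

The more threads that contend for the one lock, the more total CPU time is spent in that empty `while` loop -- the 15% figure Kishan cites is precisely this wasted spinning, measured across 128 logical processors.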
"If you think about it, 15% of the time on a 128-processor system is, more than 15 of these CPUs are pretty much full-time just waiting to acquire contended locks. So we're not getting the most out of this hardware."
The part of the older Windows kernel that had responsibility for managing scheduling was the dispatcher, and it was protected by a global lock. "The dispatcher database lock originally protected the integrity of all the scheduler-related data structures," said Kishan. "This includes things like thread priorities, ready queues, any object that you might be able to wait on, like an event, semaphore, mutex, I/O completion port, timers, asynchronous procedure calls -- all of it was protected by the scheduler, which guarded everything with the dispatcher lock.
"Over time, we moved some paths out of the dispatcher lock by introducing additional locks, such as thread locks, timer table locks, processor control block locks, etc.," Kishan continued. "But still, the key thing that the dispatcher lock was used for was to synchronize thread state transitions. So if a thread's running, and it waits on a set of objects and goes into a wait state, that transition was synchronized by the dispatcher lock. The reason that needed a global lock was because the OS provides pretty rich semantics on what applications can do, and an application can wait on a single object, it can wait on a single object with a timeout, it can wait on multiple objects and say, 'I just want to wait on any of these,' or it can say, 'I just want to wait on all of these. It can mix and match types of objects that it's using in any given wait call. So in order to provide this kind of flexibility, the back end had to employ this global dispatcher lock to manage the complexity. But the downside of that, of course, was that it ended up being the most contended lock in most of our workloads, by an order of magnitude or more as you went to these high-end systems."
In the new kernel for Win7 and Windows Server 2008 R2, the dispatcher lock is completely gone -- a critical element of Windows architecture up through Vista, absolutely erased. Its replacement is fine-grained locking, with eleven types of locks for the new scheduler -- for threads, processors, timers, objects -- plus rules for the order in which locks may be obtained, to avoid what engineers still call, and rightly so, deadlock. Synchronization is no longer enforced at a global level, Kishan explained, so many operations are now lock-free. In the old lock's place is a kind of parallel wait path made possible by transactional semantics -- a more complex way for threads, and the LPs that execute them, to coordinate their state transitions.
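The classic way to make fine-grained locking safe from deadlock is exactly such a rule: always acquire multiple locks in one fixed, global order. A hypothetical Python sketch (the object and function names are invented for illustration):

```python
import threading

class KObject:
    """Stand-in for a kernel object that now carries its own lock,
    instead of relying on one global dispatcher lock."""
    def __init__(self, value):
        self.lock = threading.Lock()
        self.value = value

def transfer(src, dst, amount):
    # Deadlock-avoidance rule: acquire the two per-object locks in a
    # fixed (here: id-based) order, regardless of the caller's order.
    first, second = sorted((src, dst), key=id)
    with first.lock, second.lock:
        src.value -= amount
        dst.value += amount

a, b = KObject(100), KObject(100)
t1 = threading.Thread(target=lambda: [transfer(a, b, 1) for _ in range(5000)])
t2 = threading.Thread(target=lambda: [transfer(b, a, 1) for _ in range(5000)])
t1.start(); t2.start(); t1.join(); t2.join()
print(a.value + b.value)  # 200: total conserved, and no deadlock occurred
```

Without the ordering rule, the two opposing threads could each grab one lock and wait forever on the other; with it, operations on disjoint objects proceed fully in parallel.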
But the threads themselves won't really "know" about the change. "Everything works exactly as it did before," Kishan said. "This is a totally under-the-covers, transparent change to applications, except for the fact that things scale better now."
Speeding processes up by putting processors to sleep
Some of the design changes Windows 7 architects made may seem counter-intuitive on the surface -- explained too simply, you might think they went the wrong direction.
For example, Mark Russinovich told the audience this morning, the new system is designed to increase the idle time for processors (both logical and physical), keeping them dormant for longer stretches. Sending processors fewer clock ticks is one way to bring this about. Why? "Timer coalescing means we minimize the number of timer interrupts that come into the system," Russinovich explained, "so that the processors stay idle for longer, and then go into sleep states. And then tick skipping means that we don't send timer interrupts to processors that are sleeping, so we don't wake them up needlessly."
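The idea behind timer coalescing can be shown with a simple greedy grouping. This is an illustrative Python sketch under an assumed model -- each timer tolerates firing up to some delay late -- and is not Windows' actual algorithm:

```python
def coalesce_timers(deadlines, tolerance):
    """Greedy sketch of timer coalescing: each timer may fire up to
    `tolerance` ticks after its deadline, so timers whose windows overlap
    share one wakeup -- fewer interrupts, longer processor sleep states.
    (Illustrative only; not Windows' actual algorithm.)"""
    wakeups = []
    for d in sorted(deadlines):
        # If this deadline fits under the current shared wakeup, it rides
        # along for free; otherwise open a new wakeup as late as allowed.
        if not wakeups or d > wakeups[-1]:
            wakeups.append(d + tolerance)
    return wakeups

# Five timers collapse into two interrupts with a 10-tick tolerance:
print(coalesce_timers([100, 103, 105, 130, 135], 10))  # [110, 140]
```

No timer fires early, none fires more than `tolerance` late, and the processor takes two interrupts instead of five -- so it can sleep through the gaps.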
In other words, keeping processors busy to reduce latency -- one of the methodologies that we were told years ago would help Vista -- actually reduces overall efficiency. Multicore processors work better when their logical processors (LPs) can be put to sleep, or "parked," and their active threads shipped to another LP. Keeping LPs awake put more of a load on the scheduler, and all that scheduling chatter was a burden on Core 0, where all the scheduling activity used to take place -- serially, one call at a time.
Here's another counter-intuitive notion: It's more efficient for a system to use as much memory as possible -- not to fill it with data, necessarily, but to populate memory pages with something. In an illustration for his part of this morning's workshop, Microsoft Distinguished Engineer Landy Wang showed a Windows 7 Task Manager panel where a machine with 8 GB of DRAM, running just a handful of regular processes, ended up with 97 MB free, or completely "zeroed." And that was a good thing.
"A lot of people might think, 'Wow, 97 megabytes doesn't seem like a lot of free memory on a machine of that size,' said Wang. "And we like to see this row actually be very small, 97 MB, because we figure that free and zero pages won't generally have a lot of use in this system. What we would rather do, if we have free and zero pages, is populate them with speculated disk or network reads, such that if you need the data later, you won't have to wait for a very slow disk or a very slow network to respond. So we will typically take these free and zero pages as we come across them, and pre-populate them with any files you might have read before, or executables we think you might run in the future -- we will get that in advance, so you don't have to wait when you click on something. It's already in memory, but you shouldn't take this low counter as implying that we're using a lot of memory."
Then Wang paused before adding, "We really are using a lot of memory, but we think we're using it in a smart way that you really want us to be using it in."
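Wang's point -- use all the memory speculatively, but give it back the instant a real workload asks -- can be modeled as a cache that distinguishes hard allocations from prefetched contents. All names in this Python sketch are hypothetical:

```python
from collections import OrderedDict

class StandbyCache:
    """Toy model of the standby-list idea: otherwise-free page slots are
    filled with speculative reads, but surrendered instantly when a real
    allocation needs the memory. All names here are hypothetical."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.prefetched = OrderedDict()  # speculative contents, oldest first
        self.allocated = 0               # pages owned by real workloads

    def free_slots(self):
        return self.capacity - self.allocated - len(self.prefetched)

    def prefetch(self, name, data):
        if self.free_slots() > 0:
            self.prefetched[name] = data  # cheap: the memory was idle anyway

    def read(self, name):
        return self.prefetched.get(name)  # a hit avoids a slow disk read

    def allocate(self, pages):
        # Real demand wins: evict speculation, oldest first, until it fits.
        while self.free_slots() < pages and self.prefetched:
            self.prefetched.popitem(last=False)
        assert self.free_slots() >= pages, "out of memory"
        self.allocated += pages

cache = StandbyCache(capacity=4)
for name in ("a.dll", "b.dll", "c.dll", "d.dll"):
    cache.prefetch(name, b"...")
cache.allocate(2)              # real demand evicts two speculative entries
print(len(cache.prefetched))   # 2
print(cache.read("d.dll") is not None)  # True: later prefetches survived
```

The "free" counter in Task Manager corresponds to `free_slots()` here: it stays near zero by design, even though most of that memory can be reclaimed instantly.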
Microsoft's Arun Kishan had explained the dispatcher lock and its abolition in Windows 7, replaced by a more complex system of transactional semantics that, in the end, lets threads execute in a more parallel, efficient fashion. Locks that never seemed to be a problem in the Windows XP era became a serious obstacle for Vista in more than one respect, as Landy Wang explained: "As we go into higher and higher numbers of cores, the page frame number [PFN] lock was something that we had historically used for nearly 20 years to manage the page frame database array -- a virtually contiguous, although it can be physically sparse, array."
A page frame number entry describes the physical state of the page of memory, Wang reminded attendees -- Is it zeroed out or free, is it on standby, is it active, is it shared, how many processes are communicating with it concurrently? "Basically all the data that we need about the page, such that we can manipulate it into a state transition into any other state that's needed at any point in time. The size of this array is critical, as well as how to best manage the information."
On a 32-bit system with 64 GB of RAM (the maximum addressable with PAE), at 4K per page, that's about 16 million pages. Each PFN entry is 28 bytes, fitting into a 32-byte segment, for a total database size of about 450 MB of virtual address space. "You would think that's a fairly cheap price, a cheap tax to pay. It's definitely below 1% of the physical memory in your machine, so you would think this is pretty good. But for us it wasn't enough, because we realized that while the physical cost is cheap, the virtual cost is high."
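Wang's figures check out with a little arithmetic (assuming the 64 GB PAE maximum for 32-bit systems):

```python
PAGE_SIZE = 4 * 1024       # 4 KB pages
PHYS_RAM  = 64 * 1024**3   # 64 GB, the 32-bit PAE maximum
PFN_ENTRY = 28             # bytes per PFN entry (the ~450 MB figure
                           # corresponds to 28 bytes, before 32-byte padding)

pages   = PHYS_RAM // PAGE_SIZE   # number of physical page frames
db_size = pages * PFN_ENTRY       # virtual space for the PFN database

print(pages)             # 16777216 -- "about 16 million pages"
print(db_size // 2**20)  # 448 -- "about 450 MB", under 1% of physical RAM
```

The catch Wang alludes to is that this 450 MB comes out of a 32-bit kernel's scarce virtual address space, which is a far tighter budget than physical memory.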
As was the case with the dispatcher lock, Windows 7 architects had to do away with certain other methodologies that were implemented for simplicity in Vista, but which failed as workloads increased and cores multiplied.
"The problem with the PFN lock is that the huge majority of all virtual memory operations were synchronized by a single, system-wide PFN lock," remarked Microsoft's Landy Wang. "We had one lock that covered this entire array, and this worked...okay 20 years ago, where a four-processor system was a big system, 64 MB was almost unheard of in a single machine, and so your PFN database was fairly small -- several thousand entries at most -- and you didn't have very many cores contending for it."
But more operations and data structures were tacked onto the PFN lock; at the same time, the number of cores and memory in systems ballooned to proportions that engineers had originally planned for something closer to 2016. That increased the pressure on global locks...and it was in Vista where these old architectures began to fail. It was here where Wang presented an astounding statistic that surprised no one in the room who dealt with this subject personally -- it confirmed what they already knew:
While spinlocks comprised 15% of CPU time on systems with about 16 cores, that number rose steeply, especially with SQL Server. "As you went to 128 processors, SQL Server itself had an 88% PFN lock contention rate. Meaning, nearly nine out of every ten times it tried to get a lock, it had to spin to wait for it...which is pretty high, and would only get worse as time went on."
So this global lock, too, is gone in Windows 7, replaced with a more complex, fine-grained system where each page is given its own lock. As a result, Wang reported, some operations in SQL Server and other applications ran as much as 15 times faster on 32-processor configurations under Windows Server 2008 R2 than under its WS2K8 predecessor -- all by means of a new lock methodology that is binary-compatible with the old system. The applications' code does not have to change.
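"Each page is given its own lock" can be sketched by mapping a page frame number to a lock of its own. This hypothetical Python model uses lock striping to keep the lock table small -- an illustration of the principle, not the kernel's actual data structure:

```python
import threading

NUM_STRIPES = 64  # hypothetical: trades lock-table memory for contention
_stripes = [threading.Lock() for _ in range(NUM_STRIPES)]

def pfn_lock(pfn):
    """Sketch of per-page locking: each page frame number maps to its own
    lock (striped here so the table stays small), so operations on
    different pages no longer contend on one global PFN lock."""
    return _stripes[pfn % NUM_STRIPES]

def set_page_state(page_states, pfn, state):
    with pfn_lock(pfn):  # contends only with this page's stripe
        page_states[pfn] = state

states = {}
set_page_state(states, 5, "active")
set_page_state(states, 6, "standby")
print(pfn_lock(5) is pfn_lock(6))   # False: different pages, no contention
print(pfn_lock(5) is pfn_lock(69))  # True: same stripe (5 % 64 == 69 % 64)
```

Under the old design, both `set_page_state` calls would have queued on one system-wide lock; here they proceed in parallel, which is the essence of the 15x speedup Wang reported.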
We've said before that, for the end user's intents and purposes, Windows 7 "is 'Vista Service Pack 3.'" But these critical departments of architectural change -- where concepts dating back as much as two decades, having faltered in Vista, were scuttled for seemingly complex but more efficient replacements, ideas that two years ago might have been penciled in for 2015 -- make the new operating system more like Windows 9.
However, we know that processor power and virtualization will only continue to explode, each magnifying the other. So the huge changes under the hood for Win7 may actually end up being stopgap measures, before the onset of a time when more drastic sacrifices will be considered.
Copyright Betanews, Inc. 2009