The Betanews Comprehensive Relative Performance Index: How it works and why

By Scott M. Fulton, III, Betanews

After several months of intense research, helped along by literally hundreds of reader suggestions, Betanews has revised and updated its testing suite for Windows-based Web browser performance. The result is the Comprehensive Relative Performance Index (CRPI). If it's "creepy" to you, that's fine.

We've kept one very important element of our testing from the very beginning: We take a slow Web browser that you might not be using much anymore, and we pick on its sorry self as our test subject. We base our index on the assessed speed of Microsoft Internet Explorer 7 on Windows Vista SP2 -- the slowest browser still in common use. For every test in the suite, we give IE7 a 1.0 score. Then we combine the test scores to derive a CRPI index number that, in our estimate, best represents the relative performance of each browser compared to IE7. So for example, if a browser gets a score of 6.5, we believe that once you take every important factor into account, that browser provides 650% the performance of IE7.

As you'll see, we believe that "performance" means doing the complete job of providing rendering and functionality the way you expect, and the way Web developers expect. So we combine speed, computational efficiency, and standards compliance tests. This way, a browser with a 6.5 score can be thought of as doing the job roughly six and a half times as fast and as well as IE7. Here now are the eight batteries we use for our suite, and how we've modified them where necessary to suit our purposes:

The Nontroppo CSS rendering test. Up to now, we've been using a modified version of a rendering test used by HowToCreate.co.uk, whose two purposes have been to time how long it takes to re-render the contents of multiple arrays of <DIV> elements and to time the loading of the page that includes those elements. We modified this page because the JavaScript onLoad event fires at different times in different browsers -- despite its documented purpose, it doesn't necessarily mean the page is "loaded." There's a real-world reason for these variations: In Apple Safari, for instance, some page contents can be styled the moment they're available, before the complete page is rendered, so firing the event early enables the browser to do its job faster -- in other words, Apple doesn't just do this to cheat. But the actual creators of the test, at nontroppo.org, did a better job of compensating for the variations than we did: Specifically, the new version now tests to see when the browser is capable of accessing that first <DIV> element, even if (and especially when) the page is still loading.
Here's how we developed our new score for this test: There are three loading events: one for Document Object Model (DOM) availability, one for first element access, and the third being the conventional onLoad event. We counted DOM load as one sixth, first access as two sixths, and onLoad as three sixths of the rendering score. Then we adjusted the re-rendering part of the test so that it iterates 50 times instead of just five. This is because some browsers do not count milliseconds properly on some platforms -- this is the reason why Opera mysteriously mis-reported its own speed in Windows XP as slower than it was. (Opera users everywhere...you were right, and we thank you for your persistence.) By running the test as five loops of 10 iterations each (50 iterations in all), we can get a more accurate estimate of the average time for each iteration because the millisecond timer will have updated correctly. The element loading and re-rendering scores are averaged together for a new and revised cumulative score -- one which readers will discover is much fairer to both Opera and Safari than our previous version.
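As a rough illustration of the weighting just described, here is a minimal Python sketch. The function names and sample figures are hypothetical and are not part of the Betanews tooling; the inputs are assumed to be relative scores already normalized so that IE7 equals 1.0.

```python
def loading_score(dom_rel, first_access_rel, onload_rel):
    """Weight the three loading events as 1/6, 2/6 and 3/6 (per the article)."""
    return (1 * dom_rel + 2 * first_access_rel + 3 * onload_rel) / 6


def per_iteration_ms(total_ms, iterations=50):
    """Average time per re-render; 50 iterations smooths out coarse timers."""
    return total_ms / iterations


def css_rendering_score(loading_rel, rerender_rel):
    """Average the loading score and the re-rendering score together."""
    return (loading_rel + rerender_rel) / 2


# Hypothetical relative scores for one browser (IE7 = 1.0 on every measure)
loading = loading_score(dom_rel=6.0, first_access_rel=3.0, onload_rel=2.0)
print(loading, css_rendering_score(loading, rerender_rel=5.0))
```

By construction, IE7 itself scores 1.0 on every input and therefore 1.0 overall.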

Celtic Kane's JavaScript suite. The independent developer who calls himself Celtic Kane is noted for developing a battery of simplified tests, which first became noteworthy for having demonstrated the rendering ability of Opera over its competition at the time, Mozilla Firefox and IE7. What impresses us about CK is its ability to render a "signature" of eight integer scores that distinguish just about each version of each browser we test -- whereas many other tests are susceptible to variations in the environment, CK is relatively stable. As before, each of the eight tests in the CK battery (array handling, date and timing object manipulation, error handling, math objects, regular expressions, string objects, DOM manipulation, and AJAX declarations) is judged for relative performance against IE7, and the results are averaged for a cumulative score.

SunSpider JavaScript benchmark. Maybe the most respected general benchmark suite in the field focuses on computational JavaScript performance rather than rendering -- the raw ability of the browser's underlying JavaScript engine. Although it comes from the folks who produce WebKit, the open source rendering engine that is most closely tied to Safari but is also used elsewhere, we've found SunSpider's results to appear fair and realistic, and not weighted toward WebKit-based browsers. There are nine categories of real-world computational tests (3D geometry, memory access, bitwise operations, complex program control flow, cryptography, date objects, math objects, regular expressions, and string manipulation), some of which overlap with Celtic Kane, although we feel those that do overlap should be weighted more heavily anyway. All nine categories are scored and averaged relative to IE7 in Vista SP2.
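To make "scored and averaged relative to IE7" concrete, here is a small Python sketch. It rests on an assumption the article does not spell out: that each category's relative score is IE7's time divided by the browser's time, so that a faster browser scores above 1.0. The category names and timings below are invented for illustration.

```python
from statistics import mean


def relative_score(ie7_ms, browser_ms):
    """Assumed ratio: IE7 time over browser time, so IE7 itself scores 1.0."""
    return ie7_ms / browser_ms


def battery_score(ie7_times, browser_times):
    """Arithmetic mean of the per-category relative scores."""
    return mean(
        relative_score(ie7_times[name], browser_times[name])
        for name in ie7_times
    )


# Hypothetical timings in milliseconds for two of the nine categories
ie7 = {"3d": 1000.0, "string": 2000.0}
other = {"3d": 100.0, "string": 500.0}
print(battery_score(ie7, other))
```

The same pattern would apply to the eight Celtic Kane tests, with one relative score per test.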

The additions and changes we've made to our performance index

The Mozilla 3D cube by Simon Speich (Testcube 3D) is an unusual discovery from an unusual source: an independent Swiss developer who devised a simple and quick test of DHTML 3D rendering while researching the origins of a bug in Firefox. That bug has been addressed already, but the test fulfills a useful function for us: It fills precisely the gap in our test suite that we've been needing to fill, testing only graphical dynamic HTML rendering -- which is finally becoming more important thanks to more capable JavaScript engines. And it's not weighted toward Mozilla -- it's a fair test of anyone's DHTML capabilities. There are two simple heats whose purpose is to draw an ordinary wireframe cube and rotate it in space, accounting for forward-facing surfaces.
Each heat produces a set of five results: total elapsed time, the amount of that time spent actually rendering the cube, the average time each loop takes during rendering, and the elapsed times in milliseconds of the fastest and slowest loops. We combine those last two into a single average, which is then compared, along with the other three times, against IE7's results to yield a comparative index score.
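One way to read that procedure is sketched below in Python. Two points are assumptions rather than anything the article states: that each comparison is IE7's time divided by the browser's time, and that the four resulting ratios are averaged to form the heat's score. All names and numbers are hypothetical.

```python
def heat_metrics(total_ms, render_ms, avg_loop_ms, fastest_ms, slowest_ms):
    """Reduce the five raw results to four by averaging fastest and slowest."""
    extremes_avg = (fastest_ms + slowest_ms) / 2
    return [total_ms, render_ms, avg_loop_ms, extremes_avg]


def heat_score(ie7_results, browser_results):
    """Assumed scoring: mean of the IE7-to-browser ratios of the four metrics."""
    ie7 = heat_metrics(*ie7_results)
    browser = heat_metrics(*browser_results)
    ratios = [i / b for i, b in zip(ie7, browser)]
    return sum(ratios) / len(ratios)


# Hypothetical results: (total, render, average loop, fastest, slowest) in ms
ie7_heat = (4000.0, 2000.0, 40.0, 30.0, 50.0)
other_heat = (1000.0, 500.0, 10.0, 5.0, 15.0)
print(heat_score(ie7_heat, other_heat))
```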

The SlickSpeed CSS selectors test suite is a new and probably controversial addition to our suite, but as you'll see, we addressed the reason for the controversy and compensated for it. As JavaScript developers know, there is a multitude of third-party libraries, in addition to the browser's native JS library, that enable browsers to access elements of a very detailed and intricate page (among other things). For our purposes, we've chosen a modified version of SlickSpeed by Llama Lab, which covers many more third-party libraries including Llama's own. This version tests no fewer than 56 shorthand methods for accessing certain page elements, which are supposed to be commonly supported by all JavaScript libraries. These methods are called CSS selectors (one of the tested libraries, called Spry, is supported by Adobe).
So Llama's version of the SlickSpeed battery tests 56 selectors from 10 libraries, including each browser's native JavaScript (which should follow prescribed Web standards). Multiple iterations of each selector are tested, and the final elapsed times are rendered. Here's the controversial part: Some have said the final times are meaningless because not every selector is supported by each browser; although SlickSpeed marks each selector that generates an error in bold black, the elapsed time for an error is usually only 1 ms, while a non-error is as high as 1000. We compensate for this by creating a scoring system that penalizes each error for 1/56 of the total, so only the good selectors are scored and the rest "get zeroes."
Here's where things get hairy: As some developers already know, IE7 got all zeroes for native JavaScript selectors. It's impossible to compare a good score against no score, so to fill the hole, we use the geometric mean of IE7's positive scores with all the other libraries as the base number against which to compare the native JavaScript scores of the other browsers, including IE8. The times for each library are compared against IE7, with penalties assessed for each error (Firefox, for example, can generate 42 errors out of 560, for a penalty of 7.5%). Then we assess the geometric mean, not the arithmetic average, of each battery. We do this because we're comparing the same functions for each library, not different categories of functions as with the other suites. Geometric means will account better for fluctuations and anomalies.
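The arithmetic in this section can be sketched in a few lines of Python. The penalty fraction and the use of geometric means come from the text; the exact order in which the penalty and the mean are applied is an assumption, as are the function names and the sample values.

```python
from statistics import geometric_mean  # available in Python 3.8+


def error_penalty(errors, total_selectors):
    """Share of selector runs that errored, e.g. 42 of 560 is 7.5%."""
    return errors / total_selectors


def ie7_native_baseline(ie7_library_scores):
    """Stand-in for IE7's missing native result: the geometric mean of its
    positive results with the other libraries."""
    return geometric_mean(ie7_library_scores)


def slickspeed_score(relative_scores, errors, total_selectors):
    """Assumed combination: geometric mean of the per-library relative
    scores, reduced by the error penalty."""
    penalty = error_penalty(errors, total_selectors)
    return geometric_mean(relative_scores) * (1 - penalty)


print(error_penalty(42, 560))                   # the Firefox example from the text
print(slickspeed_score([2.0, 8.0], 42, 560))    # hypothetical relative scores
```

A geometric mean damps the effect of one unusually large ratio: for the hypothetical scores 2.0 and 8.0 it gives 4.0, where the arithmetic mean would give 5.0.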

Nontroppo table rendering test. As has already been proven in the field, CSS is the better platform for rendering complex pages using magazine-style layout. Still, a great many of the world's Web pages continue to use HTML's old <TABLE> element (created to render data in formal tables) for dividing pages into grids. We heard from you that if IE7 is still important (it is our index browser after all), old-style table rendering should still be tested. And we've decided to concur.
The creator of our CSS rendering test has created a similar platform for testing not only how long it takes a browser to render a huge table, but how soon the individual cells (<TD> elements) of that table are available for manipulation. When the test starts, it times the duration until the browser starts rendering the table and then ends that rendering, from the same mark, for two index scores. It also times the loading of the page, for a third index score. Then we have it re-render the contents of the table five times, and average the time elapsed for each one, for a fourth score. The four items are then averaged together for a cumulative score.

Nontroppo standard browser load test. (That Nontroppo gets around, eh?) This may very well be the most generally boring test of the suite: It's an extremely ordinary page with ordinary illustrations, followed by a block full of nested <DIV> elements. But it allows us to take away all the variable elements and concentrate on straight rendering and relative load times, especially when we launch the page locally. It produces document load time, document plus image load times, DOM load times, and first access times, all of which are compared to IE7 and averaged.

Acid3 standards compliance test. The function of the Acid3 test has changed dramatically, especially as most of our browsers become fully compliant. IE7 scored only 12% on Acid3; but today, most of the alternative browsers are at 100% compliance, with Firefox at 93% and flirting with 94%. So it means less now than it did in earlier months to have Acid3 yield an index score of 8.33, which is what any browser scoring 100% earns, because IE7's 12% is the baseline (100 / 12 = 8.33). Now that cumulative index scores are closer to 20, having an eight-and-a-third in the mix has become a deadweight rather than a reward. So for the first time, we're making Acid3 count in a different way: For the other batteries that have to do with rendering (all three Nontroppos and TestCube 3D), plus the native JavaScript library portion of the SlickSpeed test, we're multiplying the index score by the Acid3 percentage. As a result, the amount of any non-compliance with the Web Standards Project's assessment is applied as a penalty against those rendering scores. Today, only Mozilla and Microsoft browsers are affected by this penalty, and Firefox only slightly -- all the others are unaffected.
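Both ways of counting Acid3 reduce to simple arithmetic, sketched here in Python. The 12% and 93% figures come from the text; the function names and the sample rendering score of 10.0 are hypothetical.

```python
IE7_ACID3_PERCENT = 12.0  # IE7's Acid3 result, per the article


def old_acid3_index(browser_percent, ie7_percent=IE7_ACID3_PERCENT):
    """Earlier approach: Acid3 as its own index score relative to IE7."""
    return browser_percent / ie7_percent


def apply_acid3_penalty(rendering_score, acid3_percent):
    """Revised approach: scale a rendering-related score by Acid3 compliance."""
    return rendering_score * (acid3_percent / 100.0)


print(round(old_acid3_index(100.0), 2))            # a fully compliant browser
print(round(apply_acid3_penalty(10.0, 93.0), 2))   # Firefox-like compliance
print(apply_acid3_penalty(10.0, 100.0))            # fully compliant: unchanged
```

Under the revised approach a fully compliant browser keeps its rendering scores intact, while the baseline's own rendering scores would be scaled by the same rule.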

The physical test platform we've chosen for our tests is a triple-boot system, which enables us to boot different Windows versions from the same large hard drive. Our platforms are Windows XP Professional SP3, Windows Vista Ultimate SP2, and Windows 7 Release Candidate.

All platforms are always brought up to date using the latest Windows updates from Microsoft, prior to testing. We realize, as some have told us, that this could alter the speed of the underlying platform. However, we expect real-world users to be making the same changes, rather than continuing to use unpatched and outdated software. Certainly the whole point of testing Web browsers on a continual basis is that folks want to know how Web browsers are evolving, and to what degree, on as close to real-time a scale as possible. When we update Vista, we re-test IE7 on that platform to ensure that all index scores are relative to the most recent available performance, even of that aging browser on that old platform.

The physical test unit is an Intel Core 2 Quad Q6600-based computer using a Gigabyte GA-965P-DS3 motherboard, an Nvidia 8600 GTS-series video card, 3 GB of DDR2 DRAM, and a 640 GB Seagate Barracuda 7200.11 hard drive (among others). The three partitions (Windows XP SP3, Vista SP2, and Windows 7 RC) are all on this drive. Since May 2009, we've been using a physical platform for browser testing, replacing the virtual test platforms we had been using up to that time. Although there are a few more steps required to manage testing on a physical platform, you've told us you believe the results of physical tests will be more reliable and accurate.

Published: September 21, 2009, 5:34pm CEST by Scott M. Fulton, III