The Joel on Software Discussion Group (CLOSED)

A place to discuss Joel on Software. Now closed.

Microsoft server crash nearly causes 800-plane pile-up

Spot the FUD in this article, boys and girls:

http://www.techworld.com/opsys/news/index.cfm?NewsID=2275
muppet Enterprise v 4.56
Wednesday, September 29, 2004
 
 
I looked everywhere but I didn't see any FUD.

They replaced the air traffic control systems in California with off the shelf DELL boxes running Windows. The system is very buggy and goes down all the time without warning, leading to many near-crashes. If the situation persists, there will be some mid air collisions and hundreds will die.
Tony Chang
Wednesday, September 29, 2004
 
 
Sounds like someone at Harris Corporation didn't RTFM for GetTickCount() and the scheduled maintenance reboot in place as a workaround didn't get done.

I don't see MS being at fault for either.
Chris Altmann
Wednesday, September 29, 2004
 
 
Cool title for one of the "related" stories though - "Corrupted porn pics expose Microsoft hole". Ewww!

Wednesday, September 29, 2004
 
 
Deja vu. Didn't we discuss this yesterday ?
Nemesis
Wednesday, September 29, 2004
 
 
It has been discussed before.

The fact:
The bug only exists on Windows 95/98, not in Windows 2000 as quoted in the report.

Therefore
either
  The report writer ...
or
  The report writer ...

Conclusion
  FUD
Richard Sunarto
Wednesday, September 29, 2004
 
 
We did, but the message takes a while to get through to people at the slashbot level.
el
Wednesday, September 29, 2004
 
 
> The failure was ultimately down to a combination of human error and a design glitch in the Windows servers

If this statement is true, that there is a design problem with the Windows servers themselves and not the custom software on them, that means there is either a problem with the hardware or with the OS software. Thus, either Dell or MS are to blame. But there is also the possibility that the author is lying and the problem is with the custom software itself.

I don't buy that there is human error on site. Since the only reason they are 'required' to restart the servers every 30 days is because of a flaw in the system, it is that flaw that is to blame and not the fact they forgot to do it.
Tony Chang
Wednesday, September 29, 2004
 
 
What is a FUD?
ignorant
Wednesday, September 29, 2004
 
 
FUD = Fear Uncertainty Doubt
Richard Sunarto
Wednesday, September 29, 2004
 
 
Richard: The bug does not exist in the Win2K kernel (as it does in Win95), but it does exist in Win2K - e.g. the issue described in [http://support.microsoft.com/default.aspx?scid=kb;en-us;823273] forces you to reboot every 50 days or suffer high CPU load in some cases.

Half FUD. Microsoft are not to blame for this specific air traffic control breakdown. But their server software is still far from being up to mission critical standards.
Ori Berger
Wednesday, September 29, 2004
 
 
Current patch level on Win2K server should eliminate this problem anyhow.
Simon Lucy
Wednesday, September 29, 2004
 
 
I think it's funny that they're "working on a permanent fix". Patch, bitches! Patch!
Brad Wilson
Wednesday, September 29, 2004
 
 
I still wonder what the domain semantic relevance could be of the OS start-up time in such a system, assuming it is at all related to that.
Just me (Sir to you)
Wednesday, September 29, 2004
 
 
Hmm, memory leaks and counter overflows.... I remember a previous place had to use MSMQ everywhere, but due to a counter overflow in it, the server had to be rebooted weekly, seems that the counter would overflow every 11 1/2 days (every 2^31 milliseconds). Sure, we ended up getting a hot fix for it, and I've been told it was fixed in a real patch, but still, that was a "mission critical" server, dealing with healthcare information (prescriptions and lab reports were the high priority items, eligibility and insurance claims were the low priority items) and it had to be rebooted to keep it from locking up.

And it was also a race to see if the memory leakage would kill the box before the counter overflowed. I seem to remember about 100 bytes of every message leaked, but after a week and a half, MSMQ was chewing up several gig of ram, just for its own process.
Peter
Wednesday, September 29, 2004
 
 
I don't see this as an MS problem. The article says that the servers are timed to shut down after 49.7 days. Surely this is a problem in some of the server's software, because no Windows server needs that done (I have one internal W2K server that has been running for over two years now; no, I haven't patched it, because it's in internal use only, so why bother). I bet there is a tick counter that nobody thought to check the range of. Now, instead of a shutdown, why not simply reset the counter in the software every day, for example, and keep another counter for days? Easy stuff really.
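A minimal sketch of the "keep another counter" idea above, assuming the application polls a raw 32-bit millisecond counter such as the value GetTickCount returns. The class name `ExtendedTicks` and its method are hypothetical, and the raw readings are passed in as plain integers so the sketch runs anywhere. It is an illustration only, not the code of the system discussed in the article.

```python
MASK32 = 0xFFFFFFFF  # a 32-bit tick counter wraps modulo 2**32


class ExtendedTicks:
    """Extends a wrapping 32-bit millisecond counter to an unbounded count.

    Hypothetical helper for illustration; raw readings are supplied by the caller.
    """

    def __init__(self, first_raw):
        self.last_raw = first_raw & MASK32
        self.total_ms = 0  # milliseconds accumulated since construction

    def update(self, raw):
        """Feed the latest raw reading and return the accumulated milliseconds."""
        raw &= MASK32
        # Modular subtraction yields the true step even if raw wrapped past zero.
        self.total_ms += (raw - self.last_raw) & MASK32
        self.last_raw = raw
        return self.total_ms
```

The accumulated total stays correct only if `update` is called at least once per wrap period (roughly every 49.7 days for a millisecond counter); a step longer than that cannot be distinguished from a shorter one.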

They also say that an improperly trained employee failed to reset the system, which is again not Microsoft's problem.

Then, the article says that Harris Corporation completed testing of the system in 2001. Well, obviously they didn't test it very well, and again I fail to see exactly how that is Microsoft's fault.

And later in the article it says that the problem is already eliminated in Seattle, so why haven't they done the same in California? Microsoft's fault - not.

I don't exactly love Windows, but I don't have a big problem with it, either. I do, however, have a problem with people as careless as this bunch running air traffic control. I hope someone got fired (the testing people, maybe a bunch of the original developers, whoever improperly trained someone, and then the idiot who left that improperly trained person in charge of such a vital function).

Anyhoop, it's fun to bash MS, as usual.
Antti Kurenniemi
Wednesday, September 29, 2004
 
 
You cats that say this is not a MS problem are so delusional. Even Microsoft admits as much: they have an article on it, that article is linked to in this thread, and you tards still say "This is not a MS problem! This is fearmongering by the anti-MS camp!"

From microsoft.com:

> The Rpcss.exe process consumes 60 percent or more of CPU time, and system performance and network performance are affected. This symptom typically occurs 49.7 days after the server is started.

CAUSE
This problem occurs because a call to the GetTickCount timer function causes the function to overflow 49.7 days after the server is started.

This problem may occur if one of the following conditions is true:

• You are using Microsoft Windows 2000 Server.
• You are using Microsoft Windows NT Server 4.0, and you installed the hotfixes that are described in the following Knowledge Base article
Tony Chang
Wednesday, September 29, 2004
 
 
"Current patch level on Win2K server should eliminate this problem anyhow."

Not true! Read the freaking article!
Tony Chang
Wednesday, September 29, 2004
 
 
First of all, you conveniently leave out the next paragraph:

"A supported hotfix is now available from Microsoft, but it is only intended to correct the problem that is described in this article. Only apply it to systems that are experiencing this specific problem. This hotfix may receive additional testing. Therefore, if you are not severely affected by this problem, we recommend that you wait for the next Windows 2000 service pack that contains this hotfix."

The hotfix files are dated Jan-07-2004.

Furthermore, unless you have information we have not had access to, nothing in the article seems to indicate this was the issue they ran into. What is more, it seems to suggest that this is >not< the issue they ran into:

"But they said the quirk in the system, known as Voice Switching and Control System, is a "design anomaly" that should have been corrected after it was discovered last year in Atlanta."

I assume that these are not the people that designed Windows 2000, and that the referenced "design anomaly" was in their application. Who knows, it might even have been an inappropriate use of GetTickCount().
Just me (Sir to you)
Wednesday, September 29, 2004
 
 
"Not true! Read the freaking article!"

Relax Tony.

Here's the facts as I can see them-

a) They have a custom application where the developer relies upon GetTickCount and doesn't accommodate the fact that the value rolls over (i.e. their delta logic does now - then, rather than first checking if now < then at which point rollover logic needs to be taken into account). There is nothing wrong with GetTickCount, but throughout time many developers have ignored the warnings (and the obvious limits of a 32-bit value holding milliseconds) and built faulty logic around it.

b) The customer accepted that their application has this fault and built a standard operating procedure of rebooting the server to set the tick count back to zero because of the fault in the custom application mentioned in point a.
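The rollover handling described in point a can be sketched as follows. This is an illustration under stated assumptions: tick readings are plain integers in the range of a 32-bit unsigned value, and the function names are made up for the example. The second function shows that the explicit `now < then` check is equivalent to doing the subtraction in unsigned 32-bit arithmetic, which Python has to emulate with a modulo.

```python
TICK_MODULUS = 2**32  # a 32-bit tick value wraps modulo 2**32


def elapsed_with_check(then, now):
    """Rollover handling as described in the post: test for now < then first."""
    if now < then:
        # The counter wrapped: ticks left before the wrap plus ticks after it.
        return (TICK_MODULUS - then) + now
    return now - then


def elapsed_modular(then, now):
    """Equivalent form: unsigned 32-bit subtraction, emulated with a modulo."""
    return (now - then) % TICK_MODULUS
```

Both forms assume the counter wrapped at most once between the two readings, so they cannot measure an interval longer than one wrap period.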

Yes some Microsoft programmers throughout time have fallen for this same problem (for instance some guy programming a certain scenario in RPCSS), but the case in question clearly is not a problem with Microsoft software, and they aren't rebooting because of 823273. 823273, for the record, only affected certain servers in certain scenarios.

This, like the navy boat issue, is a standard user application problem and has nothing whatsoever to do with Microsoft - it would occur on any platform just the same. Microsoft has lots of real faults in their software that negatively impact uptime, but the fact that the raving hordes keep choosing the wrong cases to use as sample cases (like the navy boat which again was a user application problem) just makes them look like misguided zealots.
Dennis Forbes
Wednesday, September 29, 2004
 
 
Joel on Software, my ass.  More like Microsoft Apologists on Software.
I know whereof I speak
Wednesday, September 29, 2004
 
 
Further, if you want a RTOS then use one, don't put a time sharing OS under the same constraints as you would a RTOS.
Simon Lucy
Wednesday, September 29, 2004
 
 
"makes them look like misguided zealots."

You just wait until the guided zealots show up.
/.
Wednesday, September 29, 2004
 
 
RTFM is not a defense.

It is very common for software companies to make
all sorts of horrible decisions and then say
if we just document it everything is ok, we
are free of responsibility.

That's just not acceptable. Like the tax code,
there could be anything stuck in a FM.

Common sense should be sufficient. Does anyone
at any time ever expect their server to reboot
because of a time limit?

No. It is a Microsoft issue.
son of parnas
Wednesday, September 29, 2004
 
 
"They replaced the air traffic control systems in California with off the shelf DELL boxes running Windows. The system is very buggy and goes down all the time without warning, leading to many near-crashes."

WRONG, VSCS is just the interphone and radio selection system. Radar and flight data processing is still on the IBM HOST and display system running Unix. It's been up and running 24-7 since the late 60's.

VSCS has been deployed 24-7 since 1996 without "crashes", but this situation was a safety shutdown when it switched to backup and found it was misconfigured. Obviously it shouldn't have happened and nobody's defending it, but having a misconfigured VSCS is a far cry from RDP and FDP being down and "leading to many near-crashes".
Lockheed Larry
Wednesday, September 29, 2004
 
 
I should say the HOST system has been around in various forms since the 60's, but the Unix display system (DSR) was deployed to LA Center I think in '99.
Lockheed Larry
Wednesday, September 29, 2004
 
 
son of parnas,

You miss the point.  The person who wrote this software (which just happens to work on Windows), used a function that resets its return value every 49.7 days.  This is by design.  The function was made to measure time intervals of milliseconds.  The limitations of this function are clearly documented in MSDN.

The developer didn't take this into account when he wrote the code for the system.  Because of this, the FAA has problems.

The developer should have used a different timekeeping method.  There are plenty of other methods built into Windows.  Some of them are even mentioned on the MSDN page for the very same function where the problem happens.

It's not that the function is lousy or buggy.  It was just designed for something different than he used it for.

If you want to blame anyone, blame the authors of the C programming language for making variables loop around on an overflow.
Myron A. Semack
Wednesday, September 29, 2004
 
 
> If you want to blame anyone, blame the
> authors of the C programming language
> for making variables loop around on an overflow.

It seems I did miss the point. Thanks for
the clarification.

Perhaps I am oversensitive because on one
project we had all of our VxWorks targets
reboot out in the field for a "known"
issue we didn't know about. Grr.
son of parnas
Wednesday, September 29, 2004
 
 
I posted this in the last thread.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp

Now, it does seem that there is also a bug in Win2K and NT4 that can cause problems with RPC if the system is up for >49.7d (an MS developer used GetTickCount incorrectly).  Apparently, it's not a commonly-occurring bug.  MS has a fix available for systems that see this problem, but you have to contact them.

However, this RPC bug isn't the cause of the FAA problem.  They had to reboot because their application broke, not the Windows server.

Windows 95 and 98 did have a bug where they would lock up after being on for >49.7d.  Similar mistake.  Again, this is totally unrelated to the FAA problem.  The FAA system uses Windows 2000.

I can sort of understand the mistake.  Once upon a time, I accidentally used a signed value to track a time interval.  It started out just measuring a short amount of time.  As the app developed, it was measuring a larger and larger interval.  By my third version, the interval was long enough to overflow the int.  My program didn't respond too well to negative time intervals.  It was embarrassing.  Luckily it wasn't a mission critical thing.  It did teach me to be religious about using unsigned values, considering overflow conditions, etc.
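To illustrate how a growing interval turns negative, here is a small sketch. It assumes a 32-bit signed int holding milliseconds, which the post does not actually state, and the helper name `as_int32` is made up for the example. Python integers do not overflow, so the helper emulates two's-complement storage.

```python
def as_int32(value):
    """Interpret the low 32 bits of value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


# Largest interval a signed 32-bit int can hold, in milliseconds.
largest_ok = as_int32(2**31 - 1)   # stays positive
one_more = as_int32(2**31)         # wraps to the most negative value
```

Under this assumption the interval goes negative once it passes 2**31 - 1 milliseconds, which is a little under 25 days.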
Myron A. Semack
Wednesday, September 29, 2004
 
 
And you neglected the significance of "This hotfix may receive additional testing."

That means it is an UNTESTED HACK.

You CAN NOT put an untested hack on a freaking air traffic control system as a LIVE patch! Are you INSANE?
Tony Chang
Wednesday, September 29, 2004
 
 
I love how the MS sphincter-lickers are saying two things at once:

1. This is the fault of the programmer's bug not microsoft!
2. Microsoft has an untested patch available that fixes the bug in their software.

So. Which is it?

Are you saying that microsoft has a patch to fix a bug in the custom software developed  for the air traffic system?

How can I sign up for this program where Microsoft will issue patches to Windows to compensate for each bug in programs I have written? That sounds like a dandy program! What service!
Tony Chang
Wednesday, September 29, 2004
 
 
> a far cry from RDP and FDP being down and "leading to many near-crashes"

If you read the article, the specific number of near-air collisions they had during this incident was FIVE near-collisions involving TEN separate aircraft. And that's just this one incident on one day.

I don't know about you, but most people would consider the burning wreckage of ten airliners to be excessive. Look how absolutely freaking ape-shit the US has gone over just a mere four airplanes going down in September 2001 -- we're still bombing people over that little incident.
Tony Chang
Wednesday, September 29, 2004
 
 
Actually Tony, MS does bend over backwards to help buggy third-party programs. If you read Raymond Chen's blog http://weblogs.asp.net/oldnewthing/ at all you will find out about the hoops MS goes through making their OS compatible with all kinds of programs.

Although in this case I believe that the third-party programmers f*cked up and are trying to blame Microsoft.
DJ
Wednesday, September 29, 2004
 
 
I program MS products for a living and often tell people that it's not as bad as they think and that stuff is a lot better and more stable nowadays. But I would never recommend MS products off the shelf for use in life-critical applications. Never. They are not designed to that level of reliability.

Downplaying the importance of the communication subsystem is just stupid. If the pilots can't get information about where they are and what approach they need to take, it's the same as not having any air traffic control system at all. It's extremely dangerous. You Lockheed Larry, it's no wonder the wings on your crap airplanes fall off with the eye for detail their employees like yourself have.
 
This is futile since you guys can't read but maybe you can get your friends to read this for you:

>The Rpcss.exe process consumes 60 percent or more of CPU time, and system performance and network performance are affected. This symptom typically occurs 49.7 days after the server is started.

>CAUSE
>This problem occurs because a call to the GetTickCount timer function causes the function to overflow 49.7 days after the server is started.

Please read this explanation from Microsoft. The program that is improperly using GetTickCount is WRITTEN BY MICROSOFT and is PART OF Windows NT and Windows 2000 Server.

Would you get your heads out of your asses?

Retards.
Tony Chang
Wednesday, September 29, 2004
 
 
>Representatives of the National Air Traffic Controllers Association called three of the five incidents near misses, and said all were serious violations of standards designed to keep aircraft safely separated.

>"Three pairs (of planes) were so close that on-board collision avoidance systems were activated,'' said Mark Sherry, regional vice president of the union and a controller in the San Francisco tower. "We had three controllers who couldn't do anything but watch their screens as two dots merged into one, then wait five or six seconds and hope that two came back out.''

But hey it's no big deal right since the radar systems on the ground were working OK, even if the communication systems went down, those radars kept working.

And the elevators in the Empire State Building continued to function even after the World Trade Center towers blew up, so that's no big deal either, right? Cause some thing else was working somewhere so all is fine.
Tony Chang
Wednesday, September 29, 2004
 
 
Tony,

You're not understanding.  There is a standard function in Windows, called GetTickCount.  It returns the number of msec since the system booted.  It is a DWORD return value, so it overflows eventually (49.7 days).

Normally, this isn't a problem, unless you're trying to measure a time interval larger than 49.7 days.  If you use this function in your code, you need to be aware of how it works.
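The 49.7-day figure follows directly from the size of a DWORD, which is a 32-bit unsigned value. A quick check of the arithmetic (the function name is just for this example):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day


def wrap_period_days(bits):
    """Days until a millisecond counter of the given bit width wraps to zero."""
    return 2**bits / MS_PER_DAY


days_32 = wrap_period_days(32)  # about 49.71 days for a 32-bit counter
```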

The developer of this FAA system used GetTickCount in his application, and had problems because he (ignored / forgot about / wasn't aware of) its limits.  This caused the issues.

Now, as a somewhat unrelated issue, there is a block of code in Microsoft's RPC service that uses this GetTickCount function, and uses it improperly.  On SOME systems (but not this particular FAA one), it causes problems.  MS has a patch for this.  Note: THE PATCH DOES NOT CHANGE THE BEHAVIOR OF GETTICKCOUNT.  It wouldn't make the FAA program start working.  It changes the RPC code in Windows to not use GetTickCount (or use it differently, I'm not sure which).
Myron A. Semack
Wednesday, September 29, 2004
 
 
"But hey it's no big deal right since the radar systems on the ground were working OK, even if the communication systems went down, those radars kept working."

That's right, Tony, and if you knew the business you'd know two-way radio communication is just one of around seven layers of concurrent, redundant levels of safety. Controller instructions are always issued under the assumption that radios (controllers' or pilots') may fail at any time, a concept called "positive separation". In this case, two-way radios did fail, so the other layers such as TCAS (onboard collision avoidance) kicked in as it was supposed to, as did the transfer of control to adjacent facilities.

I'm sure you know there are *dozens* of TCAS collision avoidance maneuvers that happen every *day* in the U.S.? And a complete loss of minimum separation occurs *daily*? That's with everything functioning 100%.

The system worked as designed. Everything is redundant and there is no dependence on Windows, or Unix, or IBM, or VSCS. Human error happens as it did here, but the system was designed to recover without loss of life, and it did.
Lockheed Larry
Wednesday, September 29, 2004
 
 
> if you knew the business you'd know two-way radio communication is just one of around seven layers of concurrent, redundant levels of safety

Hi Larry, if YOU knew the business like you say you do, you would know that the FAA ordered the removal of that system (EARS) sometime back and THAT is why there were multiple near misses.

You'd also know that these are serious problems and not to be taken lightly as you are implying.

I don't think you really are who you say you are.

MS is going bonkers over their culpability in this incident. Very high level contacts have been made from MS to the Bush administration, which is why you see the FAA trying to sweep this under the rug. But the controllers are trying to make the truth known about what really happened. They are being threatened for their jobs over this. You, and your other pseudonyms in this thread are part of MS damage control. You don't really know all the details but you spread the typical MS BS line to cover up the facts. Not everyone is fooled.
Tony Chang
Wednesday, September 29, 2004
 
 
Tony:

What was the actual issue here that is Microsoft's fault?

So far, the following have been mentioned:

* Maybe a developer used GetTickCount without handling the fact that it wraps. (Note that the interval length is irrelevant, a 1 sec interval over the wrap time will still have a very wrong delta between two calls to GetTickCount). This is NOT Microsoft's fault at all, they provide other techniques to handle this.

* Perhaps the buggy RPC process brought down the system. This WOULD be Microsoft's fault, though they have a patch.

* Perhaps some other system overflows. I read somewhere that some temporary table would overflow, so they programmed a shutdown before that was likely to happen, after a manual reset to prevent it from happening. This WOULD NOT be Microsoft's fault.
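As an illustration of the first bullet, here is a sketch of how even a one-second timeout can misfire when it spans the wrap point. It assumes 32-bit millisecond tick values passed in as plain integers; both function names are hypothetical, and nothing here is taken from the actual application.

```python
MASK32 = 0xFFFFFFFF  # tick values wrap modulo 2**32


def expired_naive(start, timeout_ms, now):
    """Compares against an absolute deadline, which itself may wrap past zero."""
    deadline = (start + timeout_ms) & MASK32
    return now >= deadline


def expired_wrap_safe(start, timeout_ms, now):
    """Compares the elapsed time, computed with modular subtraction."""
    return ((now - start) & MASK32) >= timeout_ms


start = 0xFFFFFF00            # 256 ms before the counter wraps
just_after_start = start + 1  # one millisecond later, still before the wrap
```

With these values the naive check reports the one-second timeout as expired after a single millisecond, because the wrapped deadline is a small number, while the wrap-safe check does not.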

So which is it? Or is it something else? Do you have actual knowledge of what it is?
mb
Wednesday, September 29, 2004
 
 
The company that delivered this system for 1.3 billion dollars today announced that the FAA has awarded them a new $265 million contract to expand the system:

http://www.harris.com/view_pressrelease.asp?act=lookup&pr_id=1433

Only two weeks after the system failed and hundreds nearly died.

Important to keep down the chatter so these contracts and their kickbacks aren't affected.
Tony Chang
Wednesday, September 29, 2004
 
 
The custom software does not use GetTickCount inappropriately if at all. The entire problem is the flaw, admitted by Microsoft, that the server can thrash if it's been up 49 days. There are three proposed solutions for this bug:

1. Reboot the server before 49 days during a scheduled period when there are no flights.
2. Install a patch, available by request, which Microsoft says has not been fully tested and they don't recommend in general and which they specifically don't recommend for this system because it is life-critical. This is not an acceptable solution to use this patch because of the risk.
3. Wait until a tested patch is integrated into a future service pack, which has not yet happened, partly because MS is scared to death of this whole situation and of making a mistake and seeing their name flashed over images of hundreds of charred remains.

The FAA chose solution #1 because it is the only safe way to deal with this serious flaw in Windows 2000 Server, which is the system being used. The FAA is hoping that someday they will be able to move to #3. The untested temporary patch, which many of you are recommending, is not a solution that anyone with knowledge of life critical applications would advocate. It would be irresponsible to do so and expose them to enormous liability, except that the FAA can not actually be sued by anyone for their mistakes due to their special legal status.

Moving to #3 doesn't happen overnight. It happens in the course of a hundreds-of-millions upgrade to the system. With these sort of applications, you would be mentally deranged to go applying service packs to live running air traffic control systems.
Tony Chang
Wednesday, September 29, 2004
 
 
"later in the article it says that the problem is already elimiated in Seattle"

I wanted to clarify that this is not actually true. The reason that Seattle stayed up during the incident is because a manager name of Cox refused the orders of the FAA to destroy the VEARS radio backup system. The FAA said that the new Windows based Harris system and its single backup were infallible and thus the old analog system was not needed any more. The controllers here have been maintaining the old equipment OUT OF THEIR OWN PERSONAL MONEY and so when disaster struck and the Microsoft system and its backup failed, they were able to grab the old system and maintain radio contact and there was never a single risk. This was the only airport in which radio contact was maintained. At least that's what my air traffic controller friends are telling me and they are PISSED and they know for a fact that there is a coverup that extends into the highest levels of administration both federally and in Redmond.
Tony Chang
Wednesday, September 29, 2004
 
 
Tony, you have absolutely no information validating your supposition that it was a fault in Windows (a low level KB article about an edge condition in a Windows service is hardly proof. Yes, maybe they did have a particular scenario that exploited the fault, but your absolute conviction of this seems unfounded), yet you speak with such absolute conviction. In fact the LA Times article indicates otherwise when it uses phrasing such as

"But they said the quirk in the system, known as Voice Switching and Control System, is a "design anomaly" that should have been corrected after it was discovered last year in Atlanta."

"Richard Riggs, an advisor to the technicians union, said the FAA had been planning to fix the program for some time."

Both of these, among others, imply that the fault is in the specific application. Having incorrectly used GetTickCount several times myself (usually adding a "// Make sure to add code to deal with rollovers" comment) I can certainly see how it would happen.

Stop using /. as your source of facts.
Dennis Forbes
Wednesday, September 29, 2004
 
 
Tony, what crack are you smoking, man? EARS has nothing to do with the hardware systems, it's a software reporting tool.

And I don't know who these controller "friends" of yours are that are maintaining radio systems with their "own money", because nothing in the FAA works that way. The controllers and technicians have completely different unions, and do not overlap their jobs.

Coverups indeed... don't give the FAA too much credit.
Lockheed Larry
Wednesday, September 29, 2004
 
 
"The custom software does not use GetTickCount inappropriately if at all. The entire problem is the flaw, admitted by Microsoft, that the server can thrash if it's been up 49 days."

Multiple news articles I've found online say otherwise.  Do you have some kind of insider information?

You are making a lot of bold claims here (coverups and such).  Do you have any hard evidence of this?  Honestly, do you?

I don't mean to be an ass, but you're really sounding like a tinfoil-hat conspiracy theorist.  If you have some real evidence, please lay it out.  Otherwise you really don't have any credibility (as much as anyone can have on an anonymous forum, anyway).  I'm willing to believe you, but only if you can back it up.
Myron A. Semack
Wednesday, September 29, 2004
 
 
> usually adding a "// Make sure to add code to deal with rollovers" comment

It's a good idea (or habit, or coding standard) to make that "//TODO Make sure to add code to deal with rollovers", or to use "#pragma message" or similar, so that all such places in the code can be found by a Find in Files.
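As a rough sketch of why a consistent marker helps, the snippet below scans source text for `//TODO` comments the way a Find in Files would. The function name and the sample source are made up for illustration; any editor's search or a plain grep does the same job.

```python
import re

# Matches a C/C++ line comment that starts with the TODO marker.
TODO_PATTERN = re.compile(r"//\s*TODO\b(.*)")


def find_todos(source_text):
    """Return (line_number, note) pairs for every //TODO comment in the text."""
    hits = []
    for line_number, line in enumerate(source_text.splitlines(), start=1):
        match = TODO_PATTERN.search(line)
        if match:
            hits.append((line_number, match.group(1).strip()))
    return hits


sample = (
    "DWORD now = GetTickCount();\n"
    "//TODO Make sure to add code to deal with rollovers\n"
    "// plain comment, not picked up\n"
)
```

A comment without the marker, like the third line of the sample, is exactly the kind that gets lost, which is the point of the suggestion above.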
Christopher Wells
Thursday, September 30, 2004
 
 
See http://catless.ncl.ac.uk/Risks/23.54.html#subj9 (RISKS Digest), where an ATC from Seattle talks about the various systems involved, including EARS.
Chris Hoess
Thursday, September 30, 2004
 
 
Well, that controller is mistaken (they're not trained or taught anything technical except how to control traffic). VEARS was used elsewhere besides Seattle (maybe Tony had a typo and that's what he meant?). EARS is the software package that forwards traffic data to central flow control (Enroute Aviation Reporting System I believe it stands for).

It's funny though how everybody pops out AFTER the problem and says "this has always been an issue." How about raising the objection BEFORE?
Lockheed Larry
Thursday, September 30, 2004
 
 
Tony, that's standard Microsoft PSS policy on ALL hotfixes. They don't want to have to support untested COMBINATIONS of patches, because it would cause their support matrix to explode, making it physically impossible to test the system. Microsoft do make the effort for security patches, and it's a vast effort. The IE team posted a blog entry about their support matrix at http://blogs.msdn.com/ie/archive/2004/08/17/216080.aspx. As we know from some of the kernel patches this year (MS04-011 is an example), sometimes there's an incompatibility between a released security patch and an available, but not public, hotfix.

As a slashdotter, I don't expect you to understand a support matrix. That's where you test your software under lab conditions to verify that the combination of software is stable, that the bug has been fixed, and no regressions have been introduced. Since the open source poster child, Linux, has no formal testing before release to speak of (there are minor post-release efforts) I won't trust it with anything. If asked to recommend a *nix I'll suggest one of the BSDs, but never Linux.

If you are experiencing the problem documented in the article, call MS support. They'll charge you for an incident (or deduct one from the allowance if you have a support contract with a limited number of incidents) but if the problem is proven to be a genuine bug and a hotfix is issued - whether an existing one or one that had to be written for you - the incident will be refunded. The fee will be refunded if you paid up front, or the contract will be credited. (If anyone from Microsoft PSS is reading: you need to make this deal more prominent; too many people are scared of calling support)
Mike Dimmick
Friday, October 01, 2004
 
 

This topic is archived. No further replies will be accepted.
