The Joel on Software Discussion Group (CLOSED)

A place to discuss Joel on Software. Now closed.

The Windows 49.7 days shutdown bug

Microsoft server crash nearly causes 800-plane pile-up
http://www.techworld.com/opsys/news/index.cfm?NewsID=2275
LAX
Tuesday, September 28, 2004
 
 
Did you even read the article you linked to?
Just me (Sir to you)
Tuesday, September 28, 2004
 
 
Strange... My Windows server has never shut down by itself after 49.7 days.  What version are they running?

Wait... Oh! The software *installed* on a Windows server shuts the server down automatically.

Maybe the article should have been titled "Harris Corporation software nearly causes 800-plane pile-up."  But, I guess that doesn't bring in the readers.
Caffeinated
Tuesday, September 28, 2004
 
 
I loved the fact that the article was covered in Windows 2003 Advertisements, but apart from that the article appears wrong in so many ways.

As far as I knew, that 49 day restart bug was only in 95/98, and the article states that the software was running on Windows 2000 Advanced Server, which I have personally had running for over 100 days and only had to restart for software installations.

And how many different ways does the author want to point out that they had switched from Unix to Windows?

I am not trying to defend Windows; in fact, we ran into a drama with it today that required a restart (of IIS only, though, apparently). It is just that I have noticed so many of these biased articles lately that it is quite ridiculous and is getting pretty boring.
Chris Ormerod
Tuesday, September 28, 2004
 
 
To quote the documentation:

"If the difference between the two calls to GetTickCount is more than 49.7 days, the return value could wrap more than one time and this code will not work; use the system time instead."

How clear does it need to be?

It seems the new stupidity defence in law must be to blame Microsoft; everyone else does.
el
Tuesday, September 28, 2004
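
The 49.7-day figure and the wrap-around in the quoted documentation can be illustrated with a minimal sketch. It assumes a 32-bit millisecond counter like the one GetTickCount returns, simulated here in Python rather than read from Windows. The function names are hypothetical, and nothing here reflects the actual Harris code, which the thread does not show.

```python
# Simulated 32-bit millisecond tick counter (assumption: same width and
# units as the value GetTickCount returns).
TICK_MODULUS = 2**32
MS_PER_DAY = 1000 * 60 * 60 * 24


def wrap_period_days() -> float:
    """Days until a 32-bit millisecond counter wraps back to zero."""
    return TICK_MODULUS / MS_PER_DAY


def naive_elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Plain subtraction: goes negative once the counter has wrapped."""
    return now_tick - start_tick


def elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Modular subtraction: correct as long as at most one wrap occurred."""
    return (now_tick - start_tick) % TICK_MODULUS


if __name__ == "__main__":
    print(f"wrap period: {wrap_period_days():.1f} days")
    start, now = 0xFFFFFF00, 0x00000100  # readings on either side of a wrap
    print("naive:", naive_elapsed_ms(start, now))
    print("modular:", elapsed_ms(start, now))
```

As the documentation quote says, this only holds while the two readings are less than one wrap period apart; beyond that, the system time is the safer source.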
 
 
I actually experienced a bug in Solaris 8 which caused it to crash after 300 or something days of uptime.  Sun did have a patch that wasn't applied at the time.  I just remember the UNIX admins bragging about how they could keep the system up long enough to encounter the bug.  Never mind that it showed they never patched.
Bill Rushmore
Tuesday, September 28, 2004
 
 
Yeah, it does seem to be a bit silly that the system shuts down radio comms after the wrap-around.

It looks like someone tested it for a day or two and it "worked", but without anyone actually understanding the implications of the platform they were using.
Nemesis
Tuesday, September 28, 2004
 
 
It is a Windows bug.

http://support.microsoft.com/default.aspx?scid=http://support.microsoft.com:80/support/kb/articles/q216/6/41.asp&NoWebContent=1

Gee, I guess Redmond doesn't always put out flawless software.  So instead of blaming the OP for not reading the article, you should place the blame at the feet of the company for thinking software from Microsoft could handle enterprise applications.
I know whereof I speak
Tuesday, September 28, 2004
 
 
Thanks,

That's the article I was looking for. It affects Windows 95 and 98.... Hmm, quite a bit before Windows 2000 Advanced Server was released.

As the other poster pointed out, the Windows time functions will wrap around after 49 days, but that is documented, and the software designers should have read that documentation before relying on that functionality.
Chris Ormerod
Tuesday, September 28, 2004
 
 
At least read the stuff you link to, slashbots.

" The information in this article applies to:
Microsoft Windows 95
Microsoft Windows 98 "

The patches for these bugs, also in the article, date from 1999.

Can't you ABM trolls at least point to the current GDI+ misery or something, or was that too close to the libpng mess for comfort?
Just me (Sir to you)
Tuesday, September 28, 2004
 
 
What I don't get is, if this server was so important, why wasn't it clustered?  They're using a version of Windows that supports clustering.

That alone would have prevented it.  It would also give them the ability to reboot the server, but keep the system up and running.

Redundancy for mission critical systems is common sense.

Of course, they should have written their code better too.
Myron A. Semack
Tuesday, September 28, 2004
 
 
Why would having a cluster have prevented the problem? If a common software fault exists in all of the software in a cluster, all with a common environment (i.e. started at the same time), then you would have a cluster of failed applications rather than just one. Doesn't really help much.

Of course it's surprising that any sort of critical system is running atop Windows (or Linux for that matter) - I would have expected QNX or some other ultra-reliable system for such a use.
Dennis Forbes
Tuesday, September 28, 2004
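
The common-fault argument above can be sketched with a toy model. It assumes, hypothetically, that every node stops working once its uptime reaches the 32-bit millisecond wrap period and is never restarted; the names and the failure model are illustrative only and are not taken from the article.

```python
# Toy model (assumption): a node fails for good once its uptime reaches the
# wrap period of a 32-bit millisecond counter, and nobody restarts it.
UPTIME_LIMIT_DAYS = 2**32 / 86_400_000  # roughly 49.7 days


def nodes_alive(start_days: list[float], day: float) -> int:
    """Count nodes that have started and not yet hit the uptime limit."""
    return sum(1 for s in start_days if s <= day < s + UPTIME_LIMIT_DAYS)


if __name__ == "__main__":
    same_start = [0.0, 0.0]   # both nodes booted together
    staggered = [0.0, 25.0]   # second node booted 25 days later
    for day in (49, 50, 75):
        print(day, nodes_alive(same_start, day), nodes_alive(staggered, day))
```

In this model, nodes booted together all fail on the same day, which is the objection being raised. Staggered boots only delay the outage; without restarts, every node still reaches the limit eventually.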
 
 
I remember when people found out about the Win 98 49.7 day bug, various magazines offered a prize for anybody who could prove their Windows 98 box had been up long enough to be affected.

I don't think anybody claimed it.

The article referred to sounds suspicious. I really can't believe anybody would program a server to shut down every 49.7 days (and the coincidence just seems too great).
Stephen Jones
Tuesday, September 28, 2004
 
 
From what I've been reading, every 30 days a technician would manually reboot the server.  The tech forgot, and the crash happened.  If they had a cluster, one system could've been rebooted while another one took over.

One of the other things I read is that the server wasn't set up for an auto-reboot because they were concerned about the system not coming back to life (say someone left a floppy in the disk drive).  If they had a cluster, it could be automatically rebooted, because the other server would always be online.

Also, with a cluster you could have the servers look out for each other.  "Hey, my app isn't working.  You take over while I reboot."

I don't mean to say a cluster would magically solve their problems (I didn't word that too well earlier), but it is another layer of protection.

This system caused a huge problem because it wasn't rebooted.  Imagine how bad it would've been if something really tricky happened that couldn't be solved quickly.  Think multiple RAID drives crapping out, major DRAM failure (not the single bit ECC stuff), CPU failure, cooling fan fails.  The downtime from something like that probably would've been much worse.

There is no "silver bullet" to system reliability (hardware or software), but there are lots of things you should do in combination.

- High quality, fault-tolerant hardware (RAID, ECC, Hot-swap, etc).
- Locally Redundant systems.
- Off-site failover.
- Software watchdog timer.
- Hardware watchdog timer.
- Outside monitoring.  Automatically page someone when it fails.
- Major source code audit.  Someone might have caught the misused function.
- Lint.
- Backups.
Myron A. Semack
Tuesday, September 28, 2004
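
The "software watchdog timer" item in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any real ATC system: `Watchdog` and `FakeClock` are hypothetical names, and the sketch only reports expiry, leaving the failover or paging action to the caller.

```python
import time


class Watchdog:
    """Reports expiry if the monitored app stops calling kick() in time."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        # time.monotonic is used because it does not jump with wall-clock
        # changes; a custom clock can be injected for testing.
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_kick = clock()

    def kick(self) -> None:
        """Called by the monitored application to signal it is alive."""
        self._last_kick = self._clock()

    def expired(self) -> bool:
        """True when no kick has arrived within the timeout."""
        return self._clock() - self._last_kick > self._timeout_s


class FakeClock:
    """Hypothetical test clock so the example runs without sleeping."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


if __name__ == "__main__":
    clock = FakeClock()
    dog = Watchdog(timeout_s=5.0, clock=clock)
    clock.advance(3.0)
    print("expired after 3 s:", dog.expired())
    clock.advance(3.0)
    print("expired after 6 s:", dog.expired())
```

A supervisor process (or the standby node in a cluster) would poll `expired()` and trigger the "you take over while I reboot" hand-off or page someone.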
 
 
> To avoid this automatic shutdown, technicians are required to restart the system manually every 30 days.

Whoever approved this solution should lose their job, be taken out back and shot. Allowing manual scheduled reboots every 30 days? Hello? It's not like ATC would qualify as mission critical?

Amazing.
no talent ass clowns united
Tuesday, September 28, 2004
 
 
"slashbots"

ROFLMAO!  I haven't heard that before.
Caffeinated
Tuesday, September 28, 2004
 
 

This topic is archived. No further replies will be accepted.
