Tuesday, February 10, 2004

One Of My Sacred Cows Is Dying


A new Manager In A Strange Land is up, after some technical difficulties.

My wife is back from Long Island. Thank God. I've gotten really used to having her around.

Put in an 80-hour week last week. Been a while since I crunched that hard - over a year, at least. I don't recommend it.

And now for the main course:

Once, I was very into "brittle" systems - Chuck Tolman's term, I'm not sure if that's the official software engineering name for it - the idea being that if somebody does something wrong, the system breaks. Have fatal asserts, and don't bother checking whether a pointer is null and failing gracefully. "I say let 'em crash!" It helps Catch Bugs Early, because when somebody can't work he goes and drags a programmer to his desk to fix his problem NOW. And catching bugs early is a good thing. I indoctrinated the team with this attitude, and it seems to have paid off - we shipped our last few products on time, and our current, most ambitious project seems pretty solid, generally soaking for an hour or two before it crashes.

(Side note: I've noticed this happy thing about fixing crash bugs - as you fix them, the soak time of your game increases exponentially. Your game crashes after a minute. You fix the bug. Your game crashes after two minutes. You fix the bug. Your game crashes after four minutes. Et cetera. Then it's people - one out of two people who play the game all the way through see a crash. Then one out of four. Then one out of eight. At this point, testing is too slow to find the remaining crash bugs. You ship a product which crashes at least once for one in eight people. You live with it.)

Still, others have shipped games without this attitude. David Cook of Triple Play and Kelly Slater likes skippable, nonfatal asserts. Chris Carollo of Deus Ex likes them too. And I have to admit, this attitude has its warts. Just yesterday I was fixing a null pointer dereference. The programmer had put in a fatal error before the dereference, but did not handle the null case. The underlying error was that the wrong kind of icon was being displayed over the head of a certain entity. This wrong icon was displayed for a single frame while the screen was black. Our code was incorrect, and the fatal assert caught us. But is it a bug if the end user never sees it? Which ties in with my bitching about test-driven development. What percentage of the time do the tests catch non-bugs? Although it's cheaper to fix a bug if you catch it early in the lifecycle, is it really so much cheaper that it's worth the cost of false alarms going off in the code? (I'm sorry, but I think those numbers about it being 100 times cheaper to catch a bug at design time than in beta are fallacious. I'll accept several times.)
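To make the icon story concrete, here is a minimal sketch of a nonfatal assert paired with actual handling of the null case. Everything in it is hypothetical and made up for illustration (`SOFT_ASSERT`, `Icon`, `Entity`, `iconNameFor`, the `"default_icon"` placeholder); it is not the code from our project, just one way the idea could look.

```cpp
#include <cstdio>
#include <string>

// Hypothetical types for illustration; not from any real codebase.
struct Icon { std::string name; };
struct Entity { const Icon* icon = nullptr; };

// Counts how many soft asserts have fired, so a build can report them later.
static int g_softAssertFailures = 0;

// Nonfatal assert: evaluates to true if the condition holds; otherwise logs
// the failure, bumps the counter, and evaluates to false. It never aborts.
#define SOFT_ASSERT(cond)                                               \
    ((cond) ? true                                                      \
            : (std::fprintf(stderr, "soft assert failed: %s (%s:%d)\n", \
                            #cond, __FILE__, __LINE__),                 \
               ++g_softAssertFailures, false))

// Returns the name of the icon to draw over the entity's head.
// A null icon is reported, then handled with a placeholder instead of
// being dereferenced.
std::string iconNameFor(const Entity& e) {
    if (!SOFT_ASSERT(e.icon != nullptr)) {
        return "default_icon";
    }
    return e.icon->name;
}
```

The point of the sketch is that the report and the fallback live together: the programmer still hears about the bad state, but a one-frame cosmetic glitch does not take the whole game down.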

Thing Two: a few days ago I demanded that we make our sound system robust to missing sound files, just like we made our graphics system robust to missing textures. There was a painful chicken-and-egg problem with sounds - here's the lifecycle of a sound asset on our project:

- mission designer requests sound
- sound engineer adds sound to the database, submits
- mission designer syncs, adds sound to mission script, submits
- sound engineer rebuilds all sounds in the game (this takes about an hour), submits (this involves over a gig of data and brings Perforce to its knees)
- mission designer syncs, tests to make sure the sound is okay

If these steps were done out of order, the game would crash because of the missing sound asset.

And as fast as Perforce is compared to SourceSafe, whenever we do a sync we usually have to recompile and rebuild a ton of data, which can take anywhere from a few minutes to half an hour.

Are you feeling our pain yet?

Now, once we slaughter my sacred cow, and force the system to be robust to missing sounds:

- mission designer requests sound
- sound engineer gives mission designer a filename in e-mail
- mission designer adds sound to mission script (but doesn't get to hear it yet), submits
- sound engineer adds sound, rebuilds sounds, submits
- mission designer syncs, tests

We've gone from four handoffs to two.
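For the curious, "robust to missing sounds" could look something like the sketch below: a lookup that warns and hands back a silent placeholder instead of crashing. The `SoundBank` class, the `Sound` struct, and the `"<silence>"` placeholder are all assumptions invented for this example, not our actual sound system.

```cpp
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a built sound asset.
struct Sound {
    std::string file;
    bool placeholder;  // true when the real asset was not found
};

// Hypothetical sound bank, standing in for the rebuilt sound data.
class SoundBank {
public:
    void add(const std::string& name) { sounds_[name] = Sound{name, false}; }

    // Looks up a sound by name. A missing asset is a warning, not a crash:
    // the caller gets a silent placeholder and the name is recorded so
    // somebody can chase it down later.
    Sound find(const std::string& name) {
        auto it = sounds_.find(name);
        if (it != sounds_.end()) {
            return it->second;
        }
        std::fprintf(stderr, "warning: missing sound '%s'\n", name.c_str());
        missing_.push_back(name);
        return Sound{"<silence>", true};
    }

    // Names of every sound that was requested but not found.
    const std::vector<std::string>& missing() const { return missing_; }

private:
    std::unordered_map<std::string, Sound> sounds_;
    std::vector<std::string> missing_;
};
```

With something like this, the mission designer can submit a script that references a sound the sound engineer has not built yet, and the list from `missing()` shows what is still outstanding.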

So, after all this, I would revise my philosophy. First - don't settle for brittle. Go for Steve Maguire's "Brittle & robust." Whenever you add an assertion, also take the steps to make sure your system will survive when that assertion fails. Second - some things should be warnings, not errors. Missing textures and sounds should be warnings. Third - at the beginning of the project, fatal asserts are good: they keep your development on a leash, they keep you slow and steady, they make sure you take the time to do things right. But at the *end* of a project, when you need to fix the high-priority bugs, and triage dictates that you should ship the low-priority ones - let the asserts be skipped. Because those asserts may just be confusing your priorities - you have real crash bugs to fix, and these asserts are often firing on things that are cosmetic bugs...or not even bugs at all.
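One way to picture the second and third points together is a check function whose behavior depends on severity and on where the project is in its schedule. This is only a sketch under my own assumptions: `Severity`, `Phase`, and `CheckPolicy` are hypothetical names, and a real engine would probably pop up a skip/break dialog rather than just counting.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical severity levels and project phases; names are illustrative.
enum class Severity { Warning, Error };
enum class Phase { EarlyDevelopment, ShipCrunch };

struct CheckPolicy {
    Phase phase = Phase::EarlyDevelopment;
    int warnings = 0;
    int skippedErrors = 0;

    // Returns true if the condition held, false otherwise.
    // Warnings (missing textures, missing sounds) never stop the game.
    // Errors are fatal early in the project and skippable near the end.
    bool check(bool cond, Severity sev, const char* what) {
        if (cond) {
            return true;
        }
        std::fprintf(stderr, "%s: %s\n",
                     sev == Severity::Warning ? "warning" : "error", what);
        if (sev == Severity::Warning) {
            ++warnings;
            return false;
        }
        if (phase == Phase::EarlyDevelopment) {
            std::abort();  // brittle on purpose: drag a programmer over NOW
        }
        ++skippedErrors;  // late in the project: log it and keep going
        return false;
    }
};
```

Because `check` returns false on failure, the caller is still expected to handle the bad case itself, which is the "brittle and robust" half of the deal.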