Eliminate and Test

Windsor in 1992 was an unlikely place to be doing battle with the future of computing. Windsor is famous for its castle, its park, and for being the kind of place where tourists arrive clutching guidebooks and leave clutching fudge. It is not, on the face of it, where you would expect to find a small team of engineers trying to work out why Windows kept collapsing whenever someone attempted to scan a document.

But there we were, in Hawes Hill Court, tucked behind Windsor Great Park, doing exactly that. I was working for Logitech UK. The castle was technically visible from the right angle on a clear day, though we were rarely afforded the luxury of looking.

 

The Greatest Troubleshooter I Have Ever Met

My boss was a man called John Simpson. I have worked alongside a great many technically capable people over the years, and a handful of genuinely brilliant ones, but John was something else. He was not merely good at finding problems. He understood, at a level that felt almost philosophical, what it meant to look for one correctly.

His approach could be stated in a single sentence: eliminate and test, then eliminate and test again. It sounds obvious, said like that. It sounds like something you might embroider on a cushion and sell in a gift shop alongside the fudge. The catch is that it is only obvious until you watch someone actually do it with real rigour, at which point it becomes apparent that almost nobody does it at all. Most people, presented with a broken system, make a guess, try one thing, decide it has not worked, and make a different guess. John did not guess. John built a logical space, ruled out possibilities methodically, and followed the evidence wherever it chose to lead, regardless of how inconvenient the destination turned out to be.

It was the scientific method applied to a malfunctioning scanner driver on a beige box running DOS 5, and it was genuinely instructive to watch.

 

The ScanMan, DMA 3, and Windows Having a Moment

The first problem John taught me to think about properly was the ScanMan 256. The ScanMan was Logitech's handheld scanner, a device you dragged across a document in as straight a line as you could manage while trying not to breathe. We were getting reports that machines with the ScanMan board installed were throwing General Protection Faults in Windows. Not occasionally. Consistently. Under specific conditions that nobody had yet identified.

A General Protection Fault, for those who did not have the pleasure, was Windows 3.x's way of informing you that something had gone sufficiently wrong that it considered the conversation over. It appeared as a dialog box, usually at the worst possible moment, and it had a quality of finality that no amount of clicking could improve.

The ScanMan board used DMA channel 3 for its data transfers. DMA, Direct Memory Access, is the mechanism by which hardware components move data directly into system memory without troubling the processor with the details. An AT-class PC provided eight DMA channels, one of them reserved for cascading the two controllers, and received wisdom at the time held that DMA 3 was free for use. Received wisdom was incorrect. Certain parallel port implementations, particularly those configured for ECP mode, had quietly appropriated DMA 3 for themselves, and when both the ScanMan board and the parallel port attempted to use it at the same moment, Windows noticed. Windows expressed its displeasure in the only vocabulary it had available at the time, which was the GPF.
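Stripped of the drama, the conflict is two devices claiming one channel. The following is a minimal sketch of that idea in Python, not a real configuration tool: the function name, the device names and the channel assignments are all assumptions chosen to mirror the story.

```python
from collections import defaultdict

def find_dma_conflicts(assignments):
    """Return {channel: [devices]} for every DMA channel claimed more than once."""
    claimed = defaultdict(list)
    for device, channel in assignments.items():
        claimed[channel].append(device)
    return {ch: sorted(devs) for ch, devs in claimed.items() if len(devs) > 1}

# Illustrative assignments only: these names and numbers are assumptions
# that mirror the story, not a dump from a real machine.
machine = {
    "ScanMan board": 3,
    "Parallel port (ECP)": 3,
    "Floppy controller": 2,
}

conflicts = find_dma_conflicts(machine)
print(conflicts)  # {3: ['Parallel port (ECP)', 'ScanMan board']}
```

Moving the scanner board to another free channel in this toy model empties the result, which is the same move that cleared the fault on the bench.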

The elegant part was how John found it. He did not ring Microsoft. He did not stare at the code until it confessed. He built a matrix of variables and started removing them one at a time, watching what changed with each iteration. Remove the printer. Does the fault persist? Reconfigure the parallel port mode in the BIOS. Does the fault persist? Reassign the ScanMan board to DMA 1 instead. Does the fault persist? Each step was a question with a binary answer, and each answer narrowed the space until one possibility remained. It took most of an afternoon. Most people would have taken a week, complained about Microsoft for the second half of it, and still not found the answer.
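The shape of that afternoon can be sketched in a few lines of Python. Everything here is hypothetical: `fault_occurs` is a stand-in for "reboot, scan a page, watch for the GPF", and the rule inside it simply encodes the answer John eventually reached. The point is the loop, which changes exactly one variable from the failing baseline and records whether the fault persists.

```python
def fault_occurs(config):
    """Stand-in for a real test run: True if the simulated machine faults.

    Assumption: the fault needs the parallel port in ECP mode AND the
    scanner board on DMA 3, as in the story.
    """
    return config["port_mode"] == "ECP" and config["scanner_dma"] == 3

def eliminate_and_test(baseline, changes, test):
    """Change one variable at a time from the failing baseline.

    Returns a list of (variable, new_value, fault_persists) tuples.
    """
    results = []
    for variable, new_value in changes:
        trial = dict(baseline, **{variable: new_value})  # copy with one change
        results.append((variable, new_value, test(trial)))
    return results

# Hypothetical failing configuration and the three changes from the story.
baseline = {"printer_attached": True, "port_mode": "ECP", "scanner_dma": 3}
changes = [
    ("printer_attached", False),
    ("port_mode", "SPP"),
    ("scanner_dma", 1),
]

results = eliminate_and_test(baseline, changes, fault_occurs)
for variable, value, persists in results:
    print(f"{variable} -> {value}: fault {'persists' if persists else 'gone'}")
```

In this model, removing the printer changes nothing, while changing either the port mode or the DMA channel clears the fault. That pattern is what points at a shared resource rather than at any single device.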

 

The Mouse That Hated Sunlight

The second lesson arrived via a mouse. The Logitech Pilot was a tidy little two-button serial mouse, and a customer had reported that theirs would only move in a single direction. Not the familiar "the ball is dirty and moving through what appears to be cheese" problem. The cursor moved vertically with complete responsiveness. Horizontally, it declined to move at all.

The Pilot used two opto-encoders, one for each axis. An opto-encoder works by shining an infrared LED through a slotted disc and counting how often the beam is interrupted as the ball turns the disc. The X-axis encoder reports horizontal movement; the Y-axis encoder reports vertical. If one encoder stops working, you lose that axis entirely and your cursor becomes a philosophical statement about the limits of freedom.

John's first question was not "is the encoder broken?" A broken encoder is a conclusion, not a diagnosis. His question was: under what conditions does this behaviour occur? The customer was using the mouse on a desk beside a south-facing window. John's hypothesis, formed after approximately forty-five seconds of thought, was that direct sunlight at a low angle was flooding the photodetector in the X-axis encoder with enough ambient light that it could no longer distinguish the infrared pulses from the background. The encoder, effectively blinded, reported no movement. The Y-axis encoder sat at a slightly different orientation, shielded enough by the mouse body to function normally.
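John's hypothesis is easy to picture with a toy model. The sketch below is Python with made-up, unitless numbers. It assumes the detector reading is LED light plus ambient light, clipped at a saturation level, and that movement is counted as rising edges across a fixed threshold. None of these values come from the Pilot's actual design.

```python
def count_pulses(ambient, samples=40, led=0.6, saturation=1.0, threshold=0.5):
    """Count rising edges seen by a thresholded photodetector.

    The slotted disc alternately passes and blocks the LED (two samples
    open, two blocked). All values are illustrative assumptions.
    """
    levels = []
    for i in range(samples):
        slot_open = (i // 2) % 2 == 0
        reading = min(saturation, ambient + (led if slot_open else 0.0))
        levels.append(reading > threshold)
    # A pulse is a transition from "dark" to "lit" between adjacent samples.
    return sum(1 for before, after in zip(levels, levels[1:]) if after and not before)

print(count_pulses(ambient=0.0))  # shaded encoder: 9
print(count_pulses(ambient=0.9))  # encoder flooded with light: 0
```

With enough ambient light the reading never drops below the threshold, so the disc can spin as much as it likes and the count stays at zero.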

The test was straightforward. Reproduce the reported conditions. Angle a desk lamp at the mouse from low and to one side. Watch what the cursor does. The cursor duly refused to move horizontally. Remove the lamp. Movement restored. The fix was to advise the customer to move the mouse away from direct sunlight, which is the kind of resolution that feels unsatisfying until you consider how long it would have taken to reach it without the method.

Eliminate and test. Then eliminate and test again.

 

The Screwdriver, the ISA Bus, and the Fundamental Difference Between DOS and Windows

The most memorable lesson John gave me was not about fixing anything. It was about understanding what you were working with at a level that no amount of documentation quite manages to convey. It also involved deliberately breaking things, which is an underrated educational approach.

The ISA bus ran along the bottom of every expansion slot in the machine. The edge connector pins were accessible to anyone with a standard flat-bladed screwdriver and a willingness to do something that the manual would not have endorsed. Pins A1 and B1 on the ISA bus are positioned such that you can bridge them directly with the blade of a screwdriver. When you do this, you apply five volts directly to the IRQ0 line.

IRQ0 is the hardware interrupt line connected to the system timer. Under normal operation, it fires 18.2 times per second. It is, in a genuine sense, the heartbeat of the PC: the regular pulse from which the operating system derives its sense of time and, in the case of Windows, its decisions about which application should be doing what.
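The odd-looking 18.2 figure falls out of simple division. As a quick check in Python, using the standard PC timer input clock and the default full 16-bit divisor (both are general PC facts rather than anything specific to this story, and the constant names are my own):

```python
PIT_CLOCK_HZ = 1_193_182   # input clock of the PC's interval timer
DEFAULT_DIVISOR = 65_536   # the timer counts through the full 16-bit range

ticks_per_second = PIT_CLOCK_HZ / DEFAULT_DIVISOR
ms_per_tick = 1000 / ticks_per_second

print(f"{ticks_per_second:.4f} ticks per second")
print(f"{ms_per_tick:.2f} ms between ticks")
```

That works out to a tick roughly every 55 milliseconds.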

The experiment was this: boot into DOS, short the pins briefly with the screwdriver, and observe. Then boot into Windows 3.x, short the pins again, and compare.

In DOS, the result was almost disappointing. The system hiccupped slightly, perhaps displayed a momentary oddity in whatever was running, and then continued without complaint. DOS handled timer interrupts in a direct and uncomplicated way: the interrupt handler ran, incremented the internal tick counter, and returned control to whatever had been interrupted. A spurious IRQ0 was absorbed and forgotten. DOS had no particular opinion about the matter.

In Windows, the result was considerably more theatrical. Windows 3.x used the timer interrupt as a central mechanism for its cooperative multitasking model, mediating between applications and managing the timing of its message queue. An unexpected IRQ0 at the wrong moment in the wrong context was enough to corrupt sufficient internal state that Windows responded by ceasing to function in any organised fashion. The system crashed, immediately, decisively, and with what one could only describe as commitment.

The same hardware, the same interrupt, the same five volts. Two completely different responses, because the two systems had fundamentally different architectural assumptions about what the world was allowed to do to them.

John's point was not that you should go around shorting ISA bus pins with screwdrivers, though the entertainment value was clearly not lost on him. His point was that you cannot understand why something fails under unexpected conditions unless you understand what it depends on. DOS and Windows both consumed IRQ0, but their internal architectures made them react to the same unexpected input in completely different ways. If you understood the architecture, you could predict the failure mode. If you could predict the failure mode, you could find it. If you could find it, you could fix it.

Understanding the system is not optional. It is the entire job.

 

The Lesson That Has Lasted

I have been troubleshooting things professionally for over thirty years now. Not only hardware, though there has been plenty of that. Software systems, organisational failures, strategic decisions that had gone sideways in ways nobody could quite articulate, projects that were clearly wrong but where nobody had yet done the work of identifying precisely how. The vocabulary changes considerably depending on what is broken. The methodology does not change at all.

Formulate a hypothesis. Design the simplest test that distinguishes between that hypothesis and the alternatives. Run the test. Record the result. Update your model. Repeat until only one explanation remains consistent with the evidence, because whatever is left, however improbable it seemed at the start, is the answer.
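Written down as code, the loop is almost embarrassingly small. This Python sketch reuses the mouse problem. The hypothesis names and their predictions are illustrative assumptions, and each prediction answers one question: does the cursor move horizontally under this condition?

```python
def surviving(hypotheses, observations):
    """Keep only hypotheses whose predictions match every recorded result."""
    return [
        name for name, predicts in hypotheses.items()
        if all(predicts[test] == result for test, result in observations.items())
    ]

# Illustrative hypotheses: True means "cursor moves horizontally".
hypotheses = {
    "dead X encoder":   {"in shade": False, "lamp at low angle": False},
    "ambient light":    {"in shade": True,  "lamp at low angle": False},
    "nothing is wrong": {"in shade": True,  "lamp at low angle": True},
}

observations = {}
for test, result in [("lamp at low angle", False), ("in shade", True)]:
    observations[test] = result  # run the test, record the result
    print(test, "->", surviving(hypotheses, observations))
```

The first test removes one candidate and the second removes another. The discipline lies in writing the predictions down before running the test, not after.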

What John understood, and communicated without ever quite stating it as a formal principle, is that troubleshooting is not primarily about cleverness. A clever person makes an intuitive leap and is sometimes right and sometimes spectacularly, expensively wrong. A methodical person follows the evidence, rules out the impossible, and arrives at the answer because no other answer fits. The clever approach is faster on the occasions when it works. The methodical approach is faster overall, and it does not require you to be particularly clever. It requires you to be disciplined, which is available to everyone.

It is also, as it happens, an excellent way to approach most problems that are not hardware faults in handheld scanners. Organisational dysfunction yields to elimination and testing as readily as DMA conflicts do, if you are willing to do the work of constructing the hypotheses properly and running the tests honestly rather than selecting the evidence that confirms what you already suspected.

I still occasionally reach for a screwdriver when the situation demands it. I now try to be more selective about which pins I choose to connect.