Friday, June 22, 2012

A SIGNIFICANT SYMPTOM


Well. After taking a somewhat lengthy break, and with the help of Dragon 11.5, I'm going to try my hand at this blogging stuff again.

This time I want to talk about symptoms. In my prior blog, The Microsoft Calculator Challenge, I uncovered and described a series of symptoms of failure of the Microsoft calculator function. Each of the symptoms is indicative of a defect in the calculator code. The specific defects are not mentioned. But then, my function as a tester is not to determine or diagnose the specific defect causing the symptoms I have uncovered.

Having spent a good part of my career as a diagnostic engineer, I am familiar with the techniques and skills needed to trace from a symptom that my test discovered to the specific defect causing that symptom.

I recall an instance when General Electric was installing a new computer system at Mobil Beaumont in Texas, in competition with the simultaneous installation of a new system by IBM. When I arrived at the site there must have been a dozen engineers, a couple of managers, and some technicians working on the problems. We were serious about getting that system up and running. I was brought in to diagnose the problems and report them; someone else would then fix them.

There were many issues. One of those issues was that the printer would not print. Hopping from issue to issue, I took one look at the printer and saw that the "online" light was out. I pointed this out to the nearest technician who said, "It's just that the light is burned out. The problem is that the printer won't print!" And I said, "Just replace the light, we'll deal with the printing later." So he replaced the light.

Sometime later I returned to the printer problem, loaded up the diagnostic, and the printer ran fine! The printer was working! That was most unsatisfying. Why would the printer start working when all we did was replace the stupid lightbulb?

With some effort at root cause analysis, I found that the printer design had a circuit for returning an error to the printer interface when one of the lightbulbs was burned out. Our engineers had logically ORed the printer's error lines into our single printer error line. When that error line was asserted by the burned-out lightbulb, the printer would not print. Minor design change; problem solved.
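To make that wiring concrete, here is a minimal Python sketch of the behavior as I have described it. The signal names (`paper_out`, `mechanical_fault`, `lamp_burned_out`) are hypothetical stand-ins, since I have not named the actual printer lines, and this models only the symptom chain, not the design change that fixed it.

```python
from dataclasses import dataclass


@dataclass
class PrinterStatus:
    # Hypothetical status lines; the real printer's signals are not named here.
    paper_out: bool = False
    mechanical_fault: bool = False
    lamp_burned_out: bool = False  # the printer reports a dead indicator bulb as an error


def combined_error(status: PrinterStatus) -> bool:
    # All of the printer's error lines are logically ORed into one line,
    # so the interface cannot tell which condition raised it.
    return status.paper_out or status.mechanical_fault or status.lamp_burned_out


def can_print(status: PrinterStatus) -> bool:
    # The interface refuses to print whenever the single error line is asserted.
    return not combined_error(status)


healthy = PrinterStatus()
dead_bulb = PrinterStatus(lamp_burned_out=True)
print(can_print(healthy))    # True
print(can_print(dead_bulb))  # False
```

Once everything is folded into one line, a burned-out bulb looks exactly like a serious fault from the interface's side, which is why replacing the bulb "fixed" the printer.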

But was the problem really solved? What about the technician who assumed that the burned-out lightbulb was not a significant symptom of a system defect?

Would that Microsoft would heed the complex and "insignificant" symptom cited in my challenge. Indeed, in the past, Microsoft has paid some attention to the trivial and "insignificant" symptoms that I identified, because they put in some kind of "fix." Was this because Microsoft might have been embarrassed by the malfunction of such a "simple" function? Were these symptoms an indication of a lack of attention to detail? Is that a pandemic problem?

I have mentioned in other forums that the significance of a defect is not necessarily related to the significance of a single particular symptom of that defect. The only way to determine the significance of a defect is to know the defect. To know the defect you must diagnose the defect. Often the most significant portion of the repair cost of a defect is the diagnosis of that defect. It is no wonder that so frequently we repair the symptom and not the defect.

The technician wanted to ignore the burned-out lightbulb. Microsoft repaired the symptom of the 2+2 = 8 failure but not the defect. All of this is symptomatic of something. These assumptions about symptom importance are significant symptoms of a broken process. I am wondering, "Just what IS the defect?"