Well. After taking a somewhat lengthy break, and with the
help of Dragon 11.5, I'm going to try my hand at this blogging stuff again.
This time I want to talk about symptoms. In my prior blog, The
Microsoft Calculator Challenge, I uncovered and described a series of
symptoms of failure of the Microsoft calculator function. Each of the symptoms
is indicative of a defect in the calculator code. The specific defects are not
mentioned. But then, my function as a tester is not to determine or diagnose
the specific defect causing the symptoms I have uncovered.
Having spent a good part of my career as a diagnostic
engineer, I am familiar with the techniques and skills needed to trace from a
symptom that my test discovered to the specific defect causing that symptom.
I recall an instance while General Electric was installing a
new computer system at Mobil Beaumont in Texas in competition with the
simultaneous installation of a new system by IBM. When I arrived at the site
there must have been a dozen engineers, a couple of managers, and some
technicians working on the problems. We were serious about getting that system
up and running. I was brought in to diagnose the problems and report them;
someone else would then fix them.
There were many issues. One of those issues was that the
printer would not print. Hopping from issue to issue, I took one look at the
printer and saw that the "online" light was out. I pointed this out
to the nearest technician, who said, "It's just that the light is burned
out. The problem is that the printer won't print!" And I said, "Just
replace the light, we'll deal with the printing later." So he replaced the
light.
Sometime later I returned to the printer problem, loaded up
the diagnostic, and the printer ran fine! The printer was working! That was
most unsatisfying. Why would the printer start working when all we did was
replace the stupid lightbulb?
With some effort at root cause analysis, I found that the
printer design had a circuit for returning an error to the printer interface
when one of the lightbulbs was burned out. Our engineers had logically ORed the
printer's error lines into our single printer error line. When that error line
was asserted by the burned-out lightbulb, the printer would not print. Minor
design change; problem solved.
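
For readers who think in code, the wiring described above can be sketched in a
few lines of Python. This is only an illustration: the signal names
(paper_out, hammer_fault, lamp_burned_out) are hypothetical, and the "revised"
function is one plausible form of the design change, which is not spelled out
here.

```python
from dataclasses import dataclass


@dataclass
class PrinterStatus:
    # Hypothetical status lines; the real printer's signals are not named.
    paper_out: bool = False
    hammer_fault: bool = False
    lamp_burned_out: bool = False


def error_line_original(status: PrinterStatus) -> bool:
    # Every printer error line is ORed into the single interface error line,
    # so a burned-out indicator lamp looks the same as a real fault.
    return status.paper_out or status.hammer_fault or status.lamp_burned_out


def error_line_revised(status: PrinterStatus) -> bool:
    # Assumed fix: the lamp line no longer feeds the print-blocking error line.
    return status.paper_out or status.hammer_fault


def can_print(error_line: bool) -> bool:
    # The interface refuses to print whenever its error line is asserted.
    return not error_line


burned_out = PrinterStatus(lamp_burned_out=True)
print(can_print(error_line_original(burned_out)))  # False
print(can_print(error_line_revised(burned_out)))   # True
```

With the original wiring, the only visible symptom of the asserted error line
is the dark "online" light, which is exactly the symptom the technician
dismissed.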
But was the problem really solved? What about the technician
who assumed that the burned-out lightbulb was not a significant symptom of a
system defect?
Would that Microsoft would heed the seemingly
insignificant symptoms cited in my challenge. Indeed, Microsoft has paid some
attention in the past to the trivial and insignificant symptoms that I
identified: they put in some kind of "fix." Was this because
Microsoft might have been embarrassed by the malfunction of such a "simple"
function? Were these symptoms an indication of a lack of attention to detail? Is
that a pandemic problem?
I have mentioned in other forums that the significance of a
defect is not necessarily related to the significance of a single particular
symptom of that defect. The only way to determine the significance of a defect
is to know the defect. To know the defect you must diagnose the defect. Often
the most significant portion of the repair cost of a defect is the diagnosis of
that defect. It is no wonder that so frequently we repair the symptom and not
the defect.
The technician wanted to ignore the burned-out lightbulb.
Microsoft repaired the symptom of the 2+2 = 8 failure but not the defect. All of
this is symptomatic of something. These assumptions about symptom importance
are significant symptoms of a broken process. I am wondering, "Just what
IS the defect?"