Thursday, August 20, 2009

Testing Untestable

Robert Martin of Object Mentor recently published a blog post describing how, after 10 years of practicing Test-Driven Development (TDD), he encountered a bug for which he was unable to write a meaningful test. I have real respect for Uncle Bob; I have learned a lot from him and continue to learn. I also did not manage to disprove his claim, because some technical details were missing and I lacked the time to discover them myself. Still, I want to reflect on some premises made in (or at least perceived from) that post. These premises look suspiciously like the identification disorder described by A. Korzybski in his "Science and Sanity": treating closely related or similar things as identical.

The reason why this is so important is quite simple. Today thousands of software developers, if not more, practice TDD on the assumption that everything is testable, at least in principle. When I present TDD to novices, a typical response (especially from the embedded software world), delivered with a kind of skeptical grin, is: "Yeah. It all looks good in theory, but in the real world the most interesting bugs are NOT testable this way, and you have to spend most of your time on the real equipment in the real environment". If not everything is testable, then this excuse becomes a slippery slope at the end of which nothing is ever testable. Therefore every particular case requires very careful study in order to understand what happened and what could be done.

Now to the premises exposed (or at least perceived by me) in Uncle Bob's post:
  1. If I, Uncle Bob, who has been teaching the whole world how to do TDD, cannot test it, this bug is non-testable.
  2. My class unit test is the minimal unit test possible, because see above.
  3. If I cannot test it using my favorite IDE, it's untestable.
  4. If I cannot test it using my unit test library (JUnit in this case), it's untestable.
  5. If I cannot test it in batch mode using my favorite build system (say, ant), it's untestable.
Just to reiterate, I have no idea whether this was Robert Martin's real intent; I only claim that this was my impression. Here are my fixes to these identifications (first things first):
  1. We all have blind spots. Gurus and experts are especially prone to them, since the rest of the world has convinced them that they know best. If such a thing as an untestable bug does exist, it should be verified and analyzed by a larger community. What if some of us, mere morons, find a way to test it?
  2. Agile test automation needs a very accurate definition of terms and conditions (see below). What one developer considers a minimal unit test could still be an integration test from a certain point of view (see below).
  3. Tools are very important and useful things, but they are by no means identical to the unit test practice. If something cannot be tested using JUnit, it does not mean a unit test is not possible. It might require some more imagination and effort, but it is still possible.
Now I want to describe the basic premises of Agile Test Automation (specifically unit and acceptance testing) as I understand them. I basically learned them from gurus like Robert Martin and Kent Beck, but they have probably never formulated them in exactly this way:
  1. The unit and acceptance test automation suite specifies unequivocally that if the system passes all these tests, it behaves according to the requirements under the assumptions reflected in the tests. There is no claim that the system contains no bugs in some sense. Moreover, this test automation suite IS the system requirements. Anything else is just wishes or speculation. If, for example, it is essential for our system that a Java HashSet keeps a fixed order of elements when converted to a List, even when duplicates were added (R. Martin's case), we have to specify an automated test which validates this assumption (in practice it's a bit more complicated, see below).
  2. For every branch or every method of every class in whatever we decide to be our system core, it is possible to write a unit test which validates that this particular branch is developed according to the specification. All assumptions about the class surroundings are reflected in the unit test using mocks.
  3. At the system boundary it is possible to introduce simple adapters, which make unit testing of the core more convenient. Unit testing these adapters might be impractical; therefore they should not contain any essential functionality, but merely raise the level of the interfaces.
  4. For every assumption about the system software behavior it is possible to write a simple unit test that disproves this assumption. The opposite, that is, writing an automated test which proves that all assumptions about the underlying system are correct in the general case, is not possible, or at least not practical.
  5. Passing all unit tests does not allow us to conclude that our system behaves correctly as a whole. For that purpose an acceptance test suite is required. As stated above, the unit and acceptance test suites collectively specify which functionality the system has to provide under which assumptions.
  6. It is not possible to prove that the system will never fail, will not do things not reflected in the automated test suite, or will not have unpredictable defects emerging from putting multiple features together. The latter can be spotted only with manual exploratory testing.
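
Premises 2 to 4 can be illustrated with a small sketch. It is not code from Fitnesse: the `RowSource` adapter, the `firstSetUp` core function and the page names are hypothetical, chosen only to show how a fake at the system boundary lets a test disprove an ordering assumption by feeding the core the same rows in two different orders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public final class OrderAssumptionSketch {
    /** Hypothetical boundary adapter: the core only ever sees this interface. */
    interface RowSource {
        List<String> rows();
    }

    /** Hypothetical core logic that silently assumes the first row is the set-up row. */
    static String firstSetUp(RowSource source) {
        return source.rows().get(0);
    }

    /** Fake adapter that returns the rows in exactly the order the test dictates. */
    static RowSource fake(List<String> rows) {
        List<String> copy = new ArrayList<>(rows);
        return () -> copy;
    }

    /** True if the core gives the same answer for the original and the reversed order. */
    static boolean orderIndependent(String... rows) {
        List<String> original = Arrays.asList(rows);
        List<String> reversed = new ArrayList<>(original);
        Collections.reverse(reversed);
        return firstSetUp(fake(original)).equals(firstSetUp(fake(reversed)));
    }

    public static void main(String[] args) {
        boolean ok = orderIndependent("SuiteSetUp", "TestOneOne", "SuiteTearDown");
        System.out.println(ok ? "order independent" : "core depends on row order");
    }
}
```

Run as written, the program reports that the core depends on row order, because the hypothetical `firstSetUp` simply takes element 0. The point is only that the assumption becomes visible without involving the real collection class at all.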

To keep things simple, in this post I do not address additional types of tests such as stress, endurance, etc. See my separate post on the subject. Now to the specific point mentioned in Robert Martin's post. My interpretation is as follows. Somewhere in the new Fitnesse design there was an implicit assumption that a Java HashSet would preserve the order of elements on conversion to a List, even when there are duplicates. There was a suspicion that this assumption is incorrect, and Robert came to the conclusion that this kind of bug is not unit testable. A quotation from his blog:

"Unfortunately, the order of the rows in the list that was copied from the set is indeterminate.

Now, maybe you missed the importance of that last paragraph. It described the bug. So let me re-emphasize. In the case of duplicate rows I was depending on the order of elements in a list; but I built the list from a HashSet. HashSets don’t order their elements. So the order of the list was indeterminate.

The fix was simple. I changed the HashSet into an ArrayList. That fixed it. I think…

The problem is that I had no way to reliably create the symptom. I could not write a test that failed because I was using a HashSet instead of an ArrayList. I could not mock out the HashSet because the bug was in the use of the HashSet. I could not run the application 1000 times to statistically see the fault, because the behavior did not change from run to run. The only thing that seemed able to sometimes expose the fault was a recompile! And I wasn’t about to put my system into a recompile-til-fail loop."

As I (Asher Sterkin) mentioned, I was unable to get a failing test because some details were missing, but I still claim that it is always possible to create a simple unit test which disproves ANY specific assumption about the underlying system. Here is a Java class I wrote specifically for this purpose:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class TestSetDuplicates {
    public static void main(String[] args) {
        Set<String> rawSet = new HashSet<String>();
        // Page names as in the Fitnesse case; set-up and tear-down appear twice.
        String[] values = new String[] {
            "SuiteChildOne.SuiteSetUp",
            "SuiteChildOne.TestOneOne",
            "SuiteChildOne.TestOneTwo",
            "SuiteChildOne.SuiteTearDown",
            "SuiteChildOne.SuiteSetUp",
            "SuiteChildOne.TestOneThree",
            "SuiteChildOne.SuiteTearDown"
        };
        for (String value : values) rawSet.add(value);
        // Copy the set into a list and print the order the list ends up in.
        List<String> list = new ArrayList<String>(rawSet);
        for (String value : list) System.out.format("%s \n", value);
        System.out.println("");
    }
}

This is obviously not a JUnit test and, strictly speaking, is not a test at all. It is part of what is going to become a unit test for the HashSet to List conversion functionality. It also doesn't do much, and this is a very important point: whatever test code we write must do as little as possible, in order to avoid side effects we may not be able to predict.
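
As a sketch of how the snippet above might grow into a self-checking test, the variant below compares the list obtained through a set with the order in which each value first appeared. The class name `SetOrderCheck` and its helper methods are my own illustration, not part of Fitnesse or JUnit, and the assumption being checked ("conversion through a HashSet keeps first-occurrence order") is the one discussed in this post, not a documented guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class SetOrderCheck {
    static final List<String> VALUES = Arrays.asList(
        "SuiteChildOne.SuiteSetUp",
        "SuiteChildOne.TestOneOne",
        "SuiteChildOne.TestOneTwo",
        "SuiteChildOne.SuiteTearDown",
        "SuiteChildOne.SuiteSetUp",
        "SuiteChildOne.TestOneThree",
        "SuiteChildOne.SuiteTearDown");

    /** Adds the values to the given empty set, then copies the set into a list. */
    static List<String> viaSet(Set<String> emptySet, List<String> values) {
        emptySet.addAll(values);
        return new ArrayList<>(emptySet);
    }

    /** The order the assumption expects: each value once, at its first occurrence. */
    static List<String> firstOccurrenceOrder(List<String> values) {
        List<String> result = new ArrayList<>();
        for (String value : values) {
            if (!result.contains(value)) result.add(value);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> expected = firstOccurrenceOrder(VALUES);
        List<String> actual = viaSet(new HashSet<>(), VALUES);
        if (!expected.equals(actual)) {
            System.out.println("Assumption disproved, HashSet gave: " + actual);
            System.exit(1);
        }
        System.out.println("Assumption held on this run: " + actual);
    }
}
```

Passing a `LinkedHashSet` to `viaSet` instead gives a control case, since that class documents insertion-order iteration; for `HashSet` the program only reports what it observed on this particular run.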

Now, what was Robert Martin's point? "The only thing that seemed able to sometimes expose the fault was a recompile!" In other words, for some reason recompilation presumably has an impact on the order in which HashSet handles duplicates (to stress again, I am missing many technical details, so this is just speculation, but hopefully plausible enough). OK, if recompilation has a potential impact, let's test it. Here you go:
# File.delete and File.exist? are core methods, so no require is needed.

# Recompile the Java snippet from scratch and return what it prints.
def compile_run
  File.delete('TestSetDuplicates.class') if File.exist?('TestSetDuplicates.class')
  `javac TestSetDuplicates.java`
  `java TestSetDuplicates`
end

first = compile_run
puts first
1000.times do |i|
  print "\r#{i}"
  current = compile_run
  raise "Inconsistent HashSet Behavior(#{i}): #{current}" unless first == current
end
This silly Ruby script does just that: it recompiles, runs the Java test snippet, and compares the output with the first run to see if there are any differences. It does so as many times as required (1000 in this case). If the number is an issue, I could put this script on a nightly run for 1,000,000 iterations or more. Frankly, I do not understand Robert's statement "And I wasn’t about to put my system into a recompile-til-fail loop". Lack of confidence in the JVM's behavior is a serious enough problem to justify spending some time on a proper investigation.
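
For readers who prefer to stay in one language, the same loop could be sketched in Java. This is an assumption-laden illustration rather than the script I actually ran: the class name `RepeatUntilDifferent` and its methods are hypothetical, it expects `javac` and `java` on the PATH and `TestSetDuplicates.java` in the working directory, and `readAllBytes` needs Java 9 or later.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

public final class RepeatUntilDifferent {
    /** Runs an external command and returns its combined stdout and stderr as text. */
    static String runCommand(String... command) {
        try {
            Process process = new ProcessBuilder(command).redirectErrorStream(true).start();
            String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            process.waitFor();
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Returns the index of the first repetition whose output differs from the baseline, or -1. */
    static int firstDivergence(Supplier<String> run, int repetitions) {
        String first = run.get(); // baseline run
        for (int i = 0; i < repetitions; i++) {
            if (!first.equals(run.get())) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        Supplier<String> compileAndRun = () -> {
            new File("TestSetDuplicates.class").delete(); // force a real recompile
            runCommand("javac", "TestSetDuplicates.java");
            return runCommand("java", "TestSetDuplicates");
        };
        int failedAt = firstDivergence(compileAndRun, 1000);
        System.out.println(failedAt < 0 ? "no difference observed" : "output differed at run " + failedAt);
    }
}
```

Keeping the comparison loop separate from the command runner means the loop itself can be checked with plain lambdas, without a compiler in the picture.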

Together, the Java code and the Ruby script constitute a unit test for this particular aspect of HashSet to List conversion. On my computer it did not produce any interesting results; in other words, there were no differences in order. I ran it 1000 times, and my gut feeling is that if the problem did exist it would have shown up at least once. Why did it not fail? There could be a number of reasons:
  1. I misunderstood Robert's problem. This is the most probable cause. Perhaps I just need to free up some time, grab the Fitnesse code from GitHub, and investigate it first hand.
  2. The problem is real, but it shows up only on Robert's computer, on his operating system, and/or on his version of the JVM and JDK.
  3. HashSet to List conversion is deterministic and the problem is elsewhere.
For now we just do not know, but the current premise is that, whatever the case, we will ALWAYS be able to construct a minimal convincing test which disproves one assumption or another. There is more to be said about our hierarchy of assumptions. We assume that the equations of quantum mechanics correctly model the behavior of electrons. We assume that the basic electronic elements of our computer (transistors, integrated circuits) are correct from the engineering perspective. The same goes for the CPU chip, memory, motherboard, etc. We assume that the drivers of every device behave correctly. We assume that our operating system does not contain bugs which will affect our program. We believe that our virtual machine and framework libraries are correct. The basic premise is that whenever there is a suspicion at any level, we will be able to test this suspicion at THAT level without needing the whole system. Roughly speaking, the whole of modern engineering and science is built on top of this premise.

The problem is obviously not with this particular functionality or that particular bug. Considering the Fitnesse release pressure, just finding a workaround was probably the optimal resolution. The problem is of a more philosophical nature. One should not confuse goals with methods, or methods with tools. Our goal is to gain enough confidence that our software fulfills well-defined requirements under certain conditions (expressed in the form of implicit and explicit assumptions). And we want to do it automatically every time there is a change. This is just a reasonable risk management strategy, which helps us avoid unpleasant last-minute surprises. We normally do not bother to check ALL our assumptions about the underlying infrastructure: computer hardware, operating system, virtual machine, SDK. We just rely on the documentation. This is a perfectly reasonable approach, since checking all these assumptions might be impractical. However, if we suspect some particular assumption, we can and should check it. TDD is a method for achieving this goal. It prescribes certain rituals which help to achieve the goal in a cost-effective way. More specifically, it relies on the premise that adding tests after the code has been written is not practical. This premise is fully supported by current experience. Tools just support the rituals prescribed by the method in the most convenient way possible. This is not to undermine their importance: without strong tool support, many important methods, including unit testing, would remain theory. Still, when using a tool conflicts with the method we have to choose the method, and when applying the method conflicts with the goal we have to stick with the goal and adjust the method. Hopefully this kind of "faith crisis" does not happen too often.

The more fundamental problem is with our understanding of how we think and how we solve problems. "The map is not the territory", claimed Korzybski. Whatever picture we hold in our brains is just a model, a map, an abstraction of the outside world. There is always a possibility that it contains a MISTAKE. We cannot avoid this completely, but we can make it less probable. The biggest problem is with experts, gurus, Masters. Sometimes we acquire overconfidence in our ability to grasp the correct picture. "It's obvious", we say. Let me put it bluntly: nothing is obvious; there must always be room for doubt. The less we suspect something, the higher the probability that the mistake is exactly there, and the more catastrophic the consequences could be. Experts like Robert Martin are supposed to know this best. As we can see, this does not always happen. A good lesson for us, mere mortal morons: the expert opinion is just an input, no less, no more.
