[TIP] test results tracking, etc...

Sat Apr 4 21:33:05 PDT 2009

Just watched Titus' PyCon 2009 talk
(http://us.pycon.org/2009/conference/schedule/event/30/). (I wasn't in
the multiprocessing talk. Best I can figure from my notes, I was at an
Open Spaces event.)

A little bit of background on how I'm coming "at" testing:
My work involves testing hardware attached to Windows systems using  
Python 2.4.4.
We have extended unittest and use our own regression-running
wrapper around the 2.4.4 framework.
The test runs are automated with a rack of PCs controlled by buildbot.

To touch on some of the issues that Titus mentioned in his talk...

Knowing that all tests have been run is definitely an issue once you  
get to more than a few hundred.
- We currently have almost 500.
- While we use buildbot to orchestrate running tests, and while its  
waterfall is nice for getting snapshots of status, we don't have a  
good/easy way to track results over time. Day to day test work changes  
are filtered through an automated (buildbot) gateway before being  
released into the main racks. Logs from the gateway are committed to  
version control. Logs from the rack in buildbot are just in the  
waterfall.

Our regression test harness always loads all the tests it can find.
It logs where they were loaded from and how many there were. If that
number ever goes down, something's wrong. (That is just part of what  
it logs, and these are the same logs I was referring to in the  
previous paragraph.)
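
A minimal sketch of that "did the count go down?" check, assuming a
current Python unittest rather than the 2.4.4 harness described here.
SampleTests and check_test_count are hypothetical names, and the
previous count of 3 is made up for illustration:

```python
import unittest


class SampleTests(unittest.TestCase):
    """Stand-in for the real tests the harness would find."""

    def test_one(self):
        pass

    def test_two(self):
        pass


def check_test_count(loaded, previous):
    """Return a warning string if fewer tests loaded than last run, else None."""
    if previous is not None and loaded < previous:
        return "WARNING: loaded %d tests, previous run loaded %d" % (loaded, previous)
    return None


# Load everything, log how many tests there were, compare with the last run.
suite = unittest.TestLoader().loadTestsFromTestCase(SampleTests)
loaded = suite.countTestCases()
print("loaded %d tests" % loaded)
print(check_test_count(loaded, previous=3))
```

In a real run the previous count would come from the last committed
log rather than a hard-coded number.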

Because we load all the tests all the time, we have made some  
enhancements to unittest and unittest.TestCase:
- We're testing hardware, so when the regression framework starts up,  
it does some inquiries to see what device it is testing, and sets  
feature flags based on the results.
- We also have separate device configuration files which are processed  
and they also define features that should be present in the device  
under test.
- Currently the regression tests are run from the device configuration  
directory, and so the framework is able to do some simple checks to  
make sure that the device being tested corresponds to the  
configuration being used.
- The tests themselves declare, in various ways:
	which features they require
	which features they are incompatible with
	approximately how long they should take to run
	which functional areas (via keywords) they test (1)

To support this, one of the changes to unittest was to add an  
eligibility function that is called before the test is run. If a test  
declares that it requires some feature, and that feature is not  
present on the device for this test run, the test is marked as  
ineligible including an explanation as to why. In the regression log,  
the test is flagged as ineligible (not pass, not fail, ineligible),  
counted in the "ineligible tests count", and the explanation is  
included.
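
The eligibility check can be sketched roughly as below. This is an
illustration under assumptions, not the actual harness code: the
function name check_eligibility, the set-based feature flags, and the
feature names ("usb", "dma", "wifi") are all hypothetical.

```python
def check_eligibility(requires, incompatible, device_features):
    """Decide whether a test may run on the device under test.

    Returns (eligible, reason); reason is empty when the test is eligible.
    """
    missing = sorted(set(requires) - set(device_features))
    if missing:
        return False, "missing required feature(s): " + ", ".join(missing)
    conflicts = sorted(set(incompatible) & set(device_features))
    if conflicts:
        return False, "incompatible feature(s) present: " + ", ".join(conflicts)
    return True, ""


# Feature flags as they might be set after probing the device.
device_features = {"usb", "dma"}

print(check_eligibility({"usb"}, set(), device_features))
print(check_eligibility({"usb", "wifi"}, set(), device_features))
print(check_eligibility({"usb"}, {"dma"}, device_features))
```

The returned reason string is what would end up next to the
INELIGIBLE flag in the regression log.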

We also use the notion of skipping tests for when the above
mechanisms are not sufficient: a test may be eligible for a
particular device, but when run, some sub-feature of the device
indicates that the test must be skipped. This is a bit clunky. We
thought skip and ineligible would be more distinct categories, but in
practice there hasn't been much of a useful distinction.

Our biggest itch to scratch now is processing, posting, aggregating,  
etc. test results. We need, to use the agile term, an information  
radiator.
I don't know if raw TAP will work since we have more than just
PASS/FAIL for our results. We have PASS, FAIL, SKIP, INELIGIBLE and  
INCONCLUSIVE. Inconclusive means that while the test didn't fail, it  
also did not run long enough to produce a statistically valid result.  
As I mentioned above, SKIP and INELIGIBLE we're probably going to  
collapse into one state, so that would leave us with four possible  
test results.
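
To make the mismatch concrete, here is a small sketch of the four
collapsed states, a per-run tally, and one possible (lossy) rendering
as TAP lines. The names Outcome, tap_line and summarize are
hypothetical, and reporting INCONCLUSIVE as "ok" plus a "#" diagnostic
line is just an assumption for illustration, not an established
convention:

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    SKIP = "SKIP"  # covers both SKIP and INELIGIBLE once collapsed
    INCONCLUSIVE = "INCONCLUSIVE"


def tap_line(number, name, outcome, reason=""):
    """Render one result as TAP text (lossy: TAP only has ok / not ok)."""
    if outcome is Outcome.FAIL:
        return "not ok %d - %s" % (number, name)
    if outcome is Outcome.SKIP:
        return "ok %d - %s # SKIP %s" % (number, name, reason)
    if outcome is Outcome.INCONCLUSIVE:
        # No TAP equivalent: report "ok" and keep the real state in a
        # diagnostic comment line (an assumption of this sketch).
        return "ok %d - %s\n# INCONCLUSIVE %s" % (number, name, reason)
    return "ok %d - %s" % (number, name)


def summarize(outcomes):
    """Return a one-line tally covering every state, including zero counts."""
    counts = Counter(outcomes)
    return ", ".join("%s=%d" % (o.value, counts[o]) for o in Outcome)


results = [Outcome.PASS, Outcome.PASS, Outcome.FAIL, Outcome.INCONCLUSIVE]
print(summarize(results))
print(tap_line(3, "test_throughput", Outcome.SKIP, "feature not present"))
```

A TAP consumer that only counts ok / not ok would report the
INCONCLUSIVE test as a plain pass, which is exactly the information
loss in question.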

I'm not sure if this is of interest on this list, since a lot of what  
I saw and heard at PyCon (and on this list since I've joined) was  
either plain unit-testing or "functional" testing, which seemed to be  
a code-word for GUI/Web testing.

I am hoping that a common test reporting system would still be  
something we could use and/or contribute to, but regardless we are  
going to have to pursue a solution. I'd personally prefer to not have to  
re-invent the wheel. :)

--Doug

(1) - keywords are to support internal customers who are debugging new  
devices and may only want to run certain subsets of the tests