[TIP] why you should distribute tests with your application / module

Pete pfein at pobox.com
Wed Sep 17 08:24:05 PDT 2008


On Sep 16, 2008, at 7:02 PM, Jesse Noller wrote:

> On Tue, Sep 16, 2008 at 4:34 PM, Pete <pfein at pobox.com> wrote:
>> On Sep 16, 2008, at 3:07 PM, Jesse Noller wrote:
>>>>
>>>> What about fixture data though?  That can easily get larger than the
>>>> size of the rest of your distribution...
>>>>
>>>
>>> Why not generate the fixture data on the fly though? For example, you
>>> can easily generate file data on the fly (that will always be the
>>> same) each time a test is run - I do this with file sizes ranging from
>>> 1 mb to 100s of gigabytes. This way I don't need to check in test
>>> data, or store it. I just generate it from the ether. The same applies
>>> to database/fixture data - why not generate it from some seed/ID on
>>> the fly?
>>
>> Because I need to know what's in the data so that I can verify that full
>> text queries against it return the correct results. How do you make sure
>> your code is giving you the right output if you're feeding it random
>> input? Makes no sense to me...
>>
>> --Pete
>>
>
> Here is the nominal example. Normally, instead of using __file__ for
> the data, I'd use a lorem ipsum file. Note that doing it this way

> A use case is simple - you need to generate large strings or data to
> put in a file on a http server (stream from memory to the pycurl
> object) or stream it to a file to test a filesystem.
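
A minimal sketch of the seeded, on-the-fly generation described in the quoted
text above. It assumes a SHA-256 counter scheme; the function name
deterministic_chunks and the seed format are illustrative and are not taken
from Jesse's code.

```python
import hashlib


def deterministic_chunks(seed, total_size):
    """Yield exactly total_size bytes, identical for a given seed.

    Each chunk is the SHA-256 digest of "<seed>:<counter>", so nothing
    has to be stored on disk or held in memory all at once.
    """
    counter = 0
    remaining = total_size
    while remaining > 0:
        digest = hashlib.sha256(f"{seed}:{counter}".encode("utf-8")).digest()
        chunk = digest[:remaining]  # trim the final chunk to fit
        yield chunk
        remaining -= len(chunk)
        counter += 1


def write_fixture(path, seed, total_size):
    """Stream the generated bytes into a file, one chunk at a time."""
    with open(path, "wb") as handle:
        for chunk in deterministic_chunks(seed, total_size):
            handle.write(chunk)
```

Because the chunks are produced lazily, the same generator could feed a file
or an upload without building the whole payload first.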

A random stream of bytes/words is not going to work for me.  Remember, I'm
doing tests on full text searches[0] - a large variety of words is
essential.  Representative frequencies (i.e. 'the' appears a lot) are
also somewhat important.  Sensible ordering is nice. And so forth.
Permuting lorem ipsum a few thousand ways to Sunday is not going to
give me the kind of data I need to test effectively (let alone
benchmark).
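
To make the frequency requirement concrete, here is a rough sketch of a seeded
generator that draws words with Zipf-like weights, so common words such as
'the' dominate. The vocabulary, the 1/rank weighting, and the name
generate_text are assumptions for illustration. It addresses word frequency
only, not variety or sensible ordering, so it does not settle the objection
above.

```python
import random

# Hypothetical vocabulary, ordered from most to least frequent.
VOCABULARY = [
    "the", "of", "and", "to", "in",
    "search", "index", "query", "token", "corpus",
]


def generate_text(seed, word_count, vocabulary=VOCABULARY):
    """Return word_count space-separated words drawn with 1/rank weights.

    A private random.Random instance keeps the output repeatable for a
    given seed without touching the global random state.
    """
    rng = random.Random(seed)
    weights = [1.0 / rank for rank in range(1, len(vocabulary) + 1)]
    return " ".join(rng.choices(vocabulary, weights=weights, k=word_count))
```

A test can then regenerate the same text from the seed and compute the expected
hit count for a query directly from it.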

The only text generator I've seen that looks feasible is this thing,  
and it'd take some hacking to make it work for my purposes: http://code.google.com/p/lorem-ipsum-generator/

I rather like the idea of generating fixture data on the fly, I just  
don't think it's going to work for me here; I can imagine other  
situations where it wouldn't either (if you were testing numerical  
algorithms by aiming for a known good result).

We still don't have a nice solution for situations where generating  
fixture data isn't an option. One approach would be to require users  
to download fixture data & just skip the tests if they're not  
available.  Hmm, maybe a nose plugin to automate that?
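
A sketch of the skip-if-missing idea using the standard library's
unittest.skipUnless decorator rather than a nose plugin. The environment
variable FULLTEXT_FIXTURE_DIR, the default directory name, and the test class
are hypothetical.

```python
import os
import unittest

# Hypothetical setting: where the separately downloaded fixtures live.
FIXTURE_DIR = os.environ.get("FULLTEXT_FIXTURE_DIR", "fixtures")


def fixtures_available(path=None):
    """Return True if the fixture directory exists and is not empty."""
    path = FIXTURE_DIR if path is None else path
    return os.path.isdir(path) and bool(os.listdir(path))


@unittest.skipUnless(fixtures_available(), "fixture data not downloaded")
class FullTextFixtureTest(unittest.TestCase):
    def test_fixture_files_are_not_empty(self):
        # Runs only when the fixtures are present; otherwise the whole
        # class is reported as skipped rather than failed.
        for name in os.listdir(FIXTURE_DIR):
            path = os.path.join(FIXTURE_DIR, name)
            if os.path.isfile(path):
                self.assertGreater(os.path.getsize(path), 0)
```

Users who never download the data see skipped tests in the report instead of
failures.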

--Pete

[0] http://en.wikipedia.org/wiki/Full_text_search


