[TIP] why you should distribute tests with your application / module

Jesse Noller jnoller at gmail.com
Wed Sep 17 08:40:05 PDT 2008

On Wed, Sep 17, 2008 at 11:24 AM, Pete <pfein at pobox.com> wrote:
> On Sep 16, 2008, at 7:02 PM, Jesse Noller wrote:
>> On Tue, Sep 16, 2008 at 4:34 PM, Pete <pfein at pobox.com> wrote:
>>> On Sep 16, 2008, at 3:07 PM, Jesse Noller wrote:
>>>>> What about fixture data though?  That can easily get larger than the
>>>>> size of the rest of your distribution...
>>>> Why not generate the fixture data on the fly though? For example, you
>>>> can easily generate file data on the fly (that will always be the
>>>> same) each time a test is run - I do this with file sizes ranging from
>>>> 1 mb to 100s of gigabytes. This way I don't need to check in test
>>>> data, or store it. I just generate it from the ether. The same applies
>>>> to database/fixture data - why not generate it from some seed/ID on
>>>> the fly?
>>> Because I need to know what's in the data so that I can verify that full
>>> text queries against it return the correct results. How do you make sure
>>> your code is giving you the right output if you're feeding it random
>>> input?
>>> Makes no sense to me...
>>> --Pete
>> Here is the nominal example. Normally, instead of using __file__ for
>> the data, I'd use a lorem ipsum file.
>> The use case is simple - you need to generate large strings or data to
>> put in a file on an http server (stream from memory to the pycurl
>> object) or stream it to a file to test a filesystem.
> A random stream of bytes/words is not going to work for me.  Remember, I'm
> doing tests on full text searches[0] - a large variety of words is essential.
>  Representative frequencies (i.e. 'the' appears a lot) are also somewhat
> important.  Sensible ordering is nice.  And so forth.  Permuting lorem ipsum
> a few thousand ways to Sunday is not going to give me the kind of data I
> need to test effectively (let alone benchmark).
> The only text generator I've seen that looks feasible is this thing, and
> it'd take some hacking to make it work for my purposes:
> http://code.google.com/p/lorem-ipsum-generator/
> I rather like the idea of generating fixture data on the fly, I just don't
> think it's going to work for me here; I can imagine other situations where
> it wouldn't either (if you were testing numerical algorithms by aiming for a
> known good result).
> We still don't have a nice solution for situations where generating fixture
> data isn't an option. One approach would be to require users to download
> fixture data & just skip the tests if it's not available.  Hmm, maybe a
> nose plugin to automate that?
> --Pete
> [0] http://en.wikipedia.org/wiki/Full_text_search
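
[A sketch of Pete's download-or-skip idea: it doesn't strictly need a nose
plugin - stdlib unittest can already express it. The fixture path and test
body here are hypothetical, purely for illustration.]

```python
import os
import unittest

# Hypothetical location where users would download the fixture corpus to.
FIXTURE = os.path.join("test-data", "corpus.txt")

@unittest.skipUnless(os.path.exists(FIXTURE),
                     "fixture corpus not downloaded - skipping search tests")
class FullTextSearchTest(unittest.TestCase):
    def test_query_finds_term(self):
        with open(FIXTURE) as f:
            corpus = f.read()
        self.assertIn("lorem", corpus)
```

If the fixture file is absent, the whole TestCase is reported as skipped
(with the reason string) rather than failing.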

Then don't make it completely random - weight the selections instead.
My example was focused on random-ish file data that you could *always*
reproduce with a given key. For a full text search, you could use the
same seed concept, but break the words from your source (a static
lorem ipsum file[0]) into groups and assign them
popularity/frequencies and so on - or use the generator class from the
lorem-ipsum-generator you linked to[1]. Heck, use /usr/share/dict/words
to generate the data :)

You should be able to generate relatively random (but reproducible)
test set data with the word frequencies you need.
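
[A minimal stdlib sketch of that seed-plus-weights idea; the vocabulary
and weights below are invented for illustration, not from the thread.]

```python
import random

def generate_text(seed, num_words, vocab, weights):
    """Reproducible weighted text: the same seed always yields the
    same word sequence, so no fixture file needs to be shipped."""
    rng = random.Random(seed)  # private RNG, independent of global state
    return " ".join(rng.choices(vocab, weights=weights, k=num_words))

# Toy vocabulary with rough frequency weights - common words like
# 'the' are heavily over-weighted, as in real English text.
vocab = ["the", "of", "and", "search", "index", "query", "lorem", "ipsum"]
weights = [50, 30, 25, 5, 4, 3, 1, 1]

text = generate_text(seed=42, num_words=1000, vocab=vocab, weights=weights)
assert text == generate_text(seed=42, num_words=1000, vocab=vocab, weights=weights)
```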

[0] http://desktoppub.about.com/library/weekly/lorem.txt
[1] http://code.google.com/p/lorem-ipsum-generator/source/browse/trunk/lipsum.py
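
[The file-data side of the argument - regenerating arbitrarily large test
files from nothing but a seed, as described earlier in the thread - could
be sketched like this; the function name and hashing scheme are mine, not
from the original mail.]

```python
import hashlib

def data_stream(seed, size):
    """Yield `size` bytes of pseudo-random data fully determined by
    `seed`: the same seed regenerates byte-identical data every run,
    so multi-megabyte fixtures never need to be stored or checked in."""
    produced, counter = 0, 0
    while produced < size:
        # Hash seed + block counter: each 32-byte block differs, but
        # the whole stream is a pure function of the seed.
        block = hashlib.sha256(f"{seed}:{counter}".encode()).digest()
        piece = block[: size - produced]
        yield piece
        produced += len(piece)
        counter += 1

# Materialize a reproducible 1 MB "fixture" without shipping any data.
payload = b"".join(data_stream("test-file-1", 1024 * 1024))
assert payload == b"".join(data_stream("test-file-1", 1024 * 1024))
```

Because it is a generator, the data can also be streamed straight into a
file or an HTTP upload without ever holding the whole payload in memory.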

More information about the testing-in-python mailing list