[TIP] why you should distribute tests with your application / module

Jesse Noller jnoller at gmail.com
Tue Sep 16 17:02:31 PDT 2008


On Tue, Sep 16, 2008 at 4:34 PM, Pete <pfein at pobox.com> wrote:
> On Sep 16, 2008, at 3:07 PM, Jesse Noller wrote:
>>>
>>> What about fixture data though?  That can easily get larger than the
>>> size of the rest of your distribution...
>>>
>>
>> Why not generate the fixture data on the fly though? For example, you
>> can easily generate file data on the fly (that will always be the
>> same) each time a test is run - I do this with file sizes ranging from
>> 1 mb to 100s of gigabytes. This way I don't need to check in test
>> data, or store it. I just generate it from the ether. The same applies
>> to database/fixture data - why not generate it from some seed/ID on
>> the fly?
>
> Because I need to know what's in the data so that I can verify that full
> text queries against it return the correct results. How do you make sure
> your code is giving you the right output if you're feeding it random input?
>  Makes no sense to me...
>
> --Pete
>

Here is a minimal example. Normally, instead of using __file__ for
the data, I'd use a lorem ipsum file. Note that doing it this way
instead of using os.urandom()/reading from /dev/urandom is much, much
faster - and it doesn't require large amounts of entropy, so you can
use it in threads to generate lots of data files fast. Also, deque is
fast, and cool to say.

The use case is simple - you need to generate large strings or blobs of
data to put in a file on an HTTP server (streamed from memory to the
pycurl object - see the sketch after the script below) or streamed to a
file to test a filesystem.

import collections

def data_generator(unique_id, maxbytes=None):
    """ Use the local file to built a word list that allows us to cycle
    the deque in a repeatable fashion based off of a name/id - assumes unique
    id is in fact, a unique id, comprised of integers where the file is named
    file.size.randint, e.g. file.1024.3487109378 would make a 1024 byte file.
    """
    unique_id = unique_id.split('.')[1:]
    file_size = int(unique_id[1])
    seed = unique_id[-1]
    words = open(__file__, "r").read().split()
    chunk = 1048576 # You could make this smaller or pass in maxbytes.
    alloc = -1
    if maxbytes:
        chunk = maxbytes

    word_q = collections.deque(words)
    seed_q = collections.deque(seed)
    current_size = file_size
    while current_size > 0:
        mychunk = chunk
        mystr = []
        while mychunk > 0:
            data = ' '.join(list(word_q)[0:alloc])
            if len(data) > mychunk:
                # only take what we still need to fill out this chunk
                data = data[0:mychunk]
            mystr.append(data)
            mychunk -= len(data)
        data = ' '.join(mystr)
        if len(data) > current_size:
            data = data[0:current_size]
        yield data
        current_size -= len(data)
        # Rotate the word list by the next seed digit so successive chunks
        # differ, but in a way that repeats exactly for the same name/seed.
        word_q.rotate(int(seed_q[0]))
        seed_q.rotate(1)

if __name__ == "__main__":
    import os, hashlib
    # Make a 10 MB file in 1024 byte chunks, twice, and compare the hashes
    fnames = ['f.1', 'f.2']
    hashes = []
    for f in fnames:
        gen = data_generator('file.10485760.3487109378', 1024)
        fh = open(f, 'w')
        for data in gen:
            fh.write(data)
        fh.close()
        assert os.path.getsize(f) == 10485760
        h = hashlib.md5()
        h.update(open(f, 'r').read())
        hashes.append(h.hexdigest())
    print hashes
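
For the HTTP server case I mentioned above, the same generator can feed a
pycurl upload without ever touching disk. This is just a rough, untested
sketch - the URL is made up, and you should double-check the READFUNCTION
callback details against the pycurl docs - but the idea is to buffer the
generator's output and hand libcurl however many bytes it asks for:

import pycurl

def upload_generated_file(url, name, chunk=1024):
    """ Stream data_generator() output straight into an HTTP upload.
    We buffer because libcurl asks for an arbitrary number of bytes per
    read callback, not our chunk size. """
    gen = data_generator(name, chunk)
    buf = ['']

    def read_cb(size):
        # libcurl tells us how many bytes it wants; '' signals EOF
        while len(buf[0]) < size:
            try:
                buf[0] += gen.next()
            except StopIteration:
                break
        data, buf[0] = buf[0][:size], buf[0][size:]
        return data

    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.UPLOAD, 1)
    c.setopt(pycurl.READFUNCTION, read_cb)
    c.setopt(pycurl.INFILESIZE, int(name.split('.')[1]))
    c.perform()
    c.close()

# e.g. upload_generated_file('http://localhost:8080/put/f.1',
#                            'file.10485760.3487109378')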


-jesse


