[TIP] why you should distribute tests with your application / module
Jesse Noller
jnoller at gmail.com
Tue Sep 16 17:02:31 PDT 2008
On Tue, Sep 16, 2008 at 4:34 PM, Pete <pfein at pobox.com> wrote:
> On Sep 16, 2008, at 3:07 PM, Jesse Noller wrote:
>>>
>>> What about fixture data though? That can easily get larger than the
>>> size of the rest of your distribution...
>>>
>>
>> Why not generate the fixture data on the fly though? For example, you
>> can easily generate file data on the fly (that will always be the
>> same) each time a test is run - I do this with file sizes ranging from
>> 1 MB to 100s of gigabytes. This way I don't need to check in test
>> data, or store it. I just generate it from the ether. The same applies
>> to database/fixture data - why not generate it from some seed/ID on
>> the fly?
>
> Because I need to know what's in the data so that I can verify that full
> text queries against it return the correct results. How do you make sure
> your code is giving you the right output if you're feeding it random input?
> Makes no sense to me...
>
> --Pete
>
Here is the nominal example. Normally, instead of using __file__ for
the data, I'd use a lorem ipsum file. Note that doing it this way
instead of calling os.urandom()/reading from /dev/urandom is much, much
faster - and it doesn't drain the system's entropy pool, so you can
use it in threads to generate mucho data files fast (a threaded sketch
follows the example below). Also, deque is fast, and cool to say.
The use case is simple - you need to generate large strings or data to
put in a file on an HTTP server (stream from memory to the pycurl
object - see the sketch after the example) or stream it to a file to
test a filesystem.
import collections

def data_generator(unique_id, maxbytes=None):
    """Use the local file to build a word list that lets us cycle the
    deque in a repeatable fashion based on a name/id - assumes unique_id
    is a dotted name of the form file.size.randint, e.g.
    file.1024.3487109378 would make a 1024 byte file.
    """
    unique_id = unique_id.split('.')
    file_size = int(unique_id[1])
    seed = unique_id[-1]
    # The words in this source file are the raw material for the stream.
    words = open(__file__, "r").read().split()
    chunk = 1048576  # You could make this smaller or pass in maxbytes.
    alloc = -1       # Drop the last word so each rotation changes the text.
    if maxbytes:
        chunk = maxbytes
    word_q = collections.deque(words)
    seed_q = collections.deque(seed)
    current_size = file_size
    while current_size > 0:
        mychunk = chunk
        mystr = []
        # Build up roughly one chunk's worth of words.
        while mychunk > 0:
            data = ' '.join(list(word_q)[0:alloc])
            if len(data) > mychunk:
                mystr.append(data[0:mychunk])
            else:
                mystr.append(data)
            mychunk -= len(data)
        data = ' '.join(mystr)
        # Never emit more than the bytes still owed.
        if len(data) > current_size:
            data = data[0:current_size]
        yield data
        current_size -= len(data)
        # Rotate by the next seed digit so successive chunks differ,
        # but deterministically for a given seed.
        word_q.rotate(int(seed_q[0]))
        seed_q.rotate(1)
if __name__ == "__main__":
    import hashlib
    # Make a 10 MB file in 1024-byte chunks, twice, and compare the hashes.
    fnames = ['f.1', 'f.2']
    hashes = []
    for f in fnames:
        gen = data_generator('file.10485760.3487109378', 1024)
        fh = open(f, 'w')
        written = 0
        # Track the byte count ourselves - os.path.getsize() lags behind
        # buffered writes, so polling it can miss the target and blow up
        # with StopIteration.
        while written < 10485760:
            data = gen.next()
            fh.write(data)
            written += len(data)
        fh.close()
        h = hashlib.md5()
        h.update(open(f, 'r').read())
        hashes.append(h.hexdigest())
    print hashes
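For the HTTP case, here's a rough sketch of wiring the generator to
pycurl - the function name, buffering, and `url` are mine (it assumes a
server at `url` that accepts PUT uploads); pycurl calls READFUNCTION
with the most bytes it will accept, so we buffer between the two:

import pycurl

def stream_upload(url, unique_id, total_size):
    gen = data_generator(unique_id, 16384)
    leftover = ['']  # mutable cell, since py2 has no nonlocal
    def read_cb(size):
        data = leftover[0]
        while len(data) < size:
            try:
                data += gen.next()
            except StopIteration:
                break
        leftover[0] = data[size:]
        return data[:size]  # returning '' signals end-of-body to libcurl
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.UPLOAD, 1)
    c.setopt(pycurl.READFUNCTION, read_cb)
    c.setopt(pycurl.INFILESIZE, total_size)
    c.perform()
    c.close()

# e.g. stream_upload('http://example.test/up', 'file.10485760.42', 10485760)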
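And since no entropy source is involved, you can run several of these
in threads to mint big files in parallel - a sketch with made-up file
names:

import threading

def make_file(name):
    # The file name doubles as the unique id: file.size.seed
    gen = data_generator(name, 1048576)
    out = open(name, 'w')
    for data in gen:
        out.write(data)
    out.close()

names = ['file.10485760.%d' % n for n in (1, 2, 3, 4)]
threads = [threading.Thread(target=make_file, args=(n,)) for n in names]
for t in threads:
    t.start()
for t in threads:
    t.join()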
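On Pete's point about knowing what's in the data: the seed trick works
for database fixtures too. A minimal sketch (the word list and row shape
here are made up, not from the thread) - random.Random(seed) gives you a
private, repeatable stream, so the test can recompute exactly which rows
it inserted and assert that a full text query returns exactly those rows:

import random

WORDS = ['apple', 'banana', 'cherry', 'durian', 'elderberry']

def fixture_rows(seed, count):
    # random.Random(seed) is isolated from the global random state, so
    # two runs with the same seed produce identical rows.
    rng = random.Random(seed)
    for pk in xrange(count):
        yield (pk, ' '.join(rng.choice(WORDS) for _ in xrange(10)))

rows = list(fixture_rows(3487109378, 100))
expected = [pk for (pk, text) in rows if 'banana' in text]
# Load `rows` into the database under test, run the full text query
# for 'banana', and assert the result set equals `expected`.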
-jesse