[pygr-notify] Issue 45 in pygr: BlastDB has gotten slow due to cache

codesite-noreply at google.com
Mon Oct 20 19:40:49 PDT 2008


Issue 45: BlastDB has gotten slow due to cache
http://code.google.com/p/pygr/issues/detail?id=45

Comment #3 by cjlee112:
Hmm, in my initial test, the time for indexing a file is the same in the
current version and the August 8 version (git commit 11e3814).  In both
cases, it took 30 sec (on my MacBook Pro) to index a file of 1 million
sequences.

One difference that I see between the older version and the new version is
that at the very end of the indexing process, memory usage expands rapidly
(from around 5 MB to at least 35 MB), then quickly drops back to baseline
(5 MB).  In the older version I saw no such memory surge.  Extrapolating
from 30 MB for 1 million sequences, your case of 50 million sequences might
need 1.5 GB, which could easily send the machine into swap hell and make
the process take far longer than it should.  So this seems to fit with what
you reported...

OK.  I now understand the problem.  The bsddb module's btree index is
screwing us over: when you simply ask for an iterator, it apparently loads
the entire index into memory.  Just doing the following causes the 30 MB
jump in memory usage I mentioned above:

>>> s2 = classutil.open_shelve('R1.seqlen', 'r')  # bsddb-backed shelve of sequence lengths
>>> it = iter(s2)                                 # no memory change yet
>>> seqID = it.next()                             # memory jumps ~30 MB right here

The memory increase happens when you ask the iterator for the first item,
and the memory isn't released until the iterator is garbage collected.
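Continuing the session above: once the last reference to the iterator goes
away, the memory drops back to baseline, consistent with the
garbage-collection behavior just described:

>>> del it   # iterator garbage-collected; the ~30 MB is released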

The reason this problem was NOT present in earlier versions of Pygr is that
we used to have a function read_fasta_one_line() that just read the first
sequence line of the FASTA file.  BlastDB.set_seqtype() used that function
to read a line of sequence and then infer whether the sequence is protein
or nucleotide.
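For reference, read_fasta_one_line() amounted to something like the
following.  This is a from-memory sketch based on the description above,
not the exact old code (the real helper may have differed in signature and
error handling):

def read_fasta_one_line(filepath):
    'return the first sequence line of a FASTA file, skipping ">" headers'
    ifile = open(filepath)
    try:
        for line in ifile:
            line = line.strip()
            if line and not line.startswith('>'):
                return line  # first actual sequence line
    finally:
        ifile.close()

Since it only ever reads a line or two from the top of the file, it runs in
constant time and memory no matter how big the FASTA file is, which is why
the old indexing path never showed the surge.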

When we made seqdb more modular (created the SequenceDB class), I got rid
of read_fasta_one_line() as being too limited (it only works on FASTA
format) and switched to getting the first sequence via an iterator on the
sequence database.  Now we discover that bsddb iterators act more like
keys() (i.e. they read the entire index into memory) than like a real
iterator...  They are NOT scalable!!!!
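One possible way out, at least for the "give me one sequence ID" case: the
legacy bsddb btree interface has a first() method that positions a cursor
at the first record rather than pulling in the whole index.  A sketch,
assuming the shelve's underlying btree file can be opened directly (the
'R1.seqlen' file name is carried over from the session above; the value
comes back as the raw pickled string, but set_seqtype() only needs the
key):

import bsddb

db = bsddb.btopen('R1.seqlen', 'r')  # read-only btree, legacy interface
seqID, raw_value = db.first()        # cursor-based: no full index load
db.close()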

You claim that the older version of Pygr can index a file of 50 million
sequences in 1 sec.  I guess that might be possible, but it seems much
faster than I'd expect.  Are you sure that you tested indexing the file, as
opposed to just opening an index that had already been constructed?



Issue attribute updates:
	Labels: -Type-Enhancement Type-Defect Milestone-Release0.8
