[pygr-notify] Issue 89 in pygr: Need for solexa seqdb with integer (64bit) ID which requires no hashing/indexing

Thu May 14 07:45:03 PDT 2009

Updates:
	Summary: Need for solexa seqdb with integer (64bit) ID which requires no  
hashing/indexing

Comment #2 on issue 89 by deepreds: Need for solexa seqdb with integer  
(64bit) ID which requires no hashing/indexing
http://code.google.com/p/pygr/issues/detail?id=89

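Extremely crude idea: for each read, write one character holding the size of
sequence + score (as an ordinal), followed by the sequence + score itself.
test.ifa holds these records; test.pfa maps each original accession to its
integer ID, which is the byte offset of the record in test.ifa.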

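outfile1 = open('test.ifa', 'wb')  # records; binary so the size byte is written as-is
outfile2 = open('test.pfa', 'w')   # accession <TAB> byte offset into test.ifa
infile = open('test.fq', 'r')
iCount = 0  # byte offset of the next record, used as the integer ID
while True:
    # A FASTQ record is four lines: @accession, sequence, +, score.
    line1 = infile.readline()
    line2 = infile.readline()
    line3 = infile.readline()
    line4 = infile.readline()
    if line1 == '' or line2 == '' or line3 == '' or line4 == '': break
    myacc = line1[1:].strip()
    myseq = '%s%s' % (line2.strip(), line4.strip())
    seqsize = chr(len(myseq))  # one byte, so sequence + score must be <= 255 chars
    outfile1.write('%s%s' % (seqsize, myseq))
    outfile2.write('%s\t%d\n' % (myacc, iCount))
    iCount += 1 + len(myseq)  # size byte + sequence + score
infile.close()
outfile1.close()
outfile2.close()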
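
To make the record layout concrete, here is a minimal in-memory sketch of the
same packing step. The names pack_record and build_offsets are made up for
illustration and are not part of pygr. It is written for Python 3 (bytes
rather than str), and it assumes we want an error when sequence + score
exceeds 255 bytes, since a single size byte cannot hold more.

```python
import io

def pack_record(seq, qual):
    """Return one record: a single size byte, then sequence + score."""
    payload = (seq + qual).encode('ascii')
    if len(payload) > 255:
        # One byte can only store sizes 0..255.
        raise ValueError('record too long for a 1-byte size field')
    return bytes([len(payload)]) + payload

def build_offsets(reads, out):
    """Write (acc, seq, qual) reads to out; return [(acc, offset), ...]."""
    offsets = []
    pos = 0
    for acc, seq, qual in reads:
        record = pack_record(seq, qual)
        out.write(record)
        offsets.append((acc, pos))
        pos += len(record)  # 1 size byte + payload
    return offsets

# Hypothetical reads, standing in for the contents of test.fq.
reads = [('read1', 'ACGT', 'IIII'), ('read2', 'GGCCA', 'IIHHG')]
buf = io.BytesIO()
offsets = build_offsets(reads, buf)
print(offsets)  # [('read1', 0), ('read2', 9)]
```

The second read starts at offset 9 because the first record takes 1 size byte
plus 8 bytes of sequence + score.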

Reading a record back takes a .seek followed by two short .read calls: one
for the size byte and one for the sequence + score. If we know the integer
sequence ID (the record's byte offset in the file, which we can jump to with
.seek), we can read the sequence and score directly.

infile = open('test.ifa', 'rb')  # binary, so byte offsets are exact
for lines in open('test.pfa', 'r'):
    oldacc, intacc = lines.strip().split('\t')
    intacc = int(intacc)
    infile.seek(intacc)
    seqsize = ord(infile.read(1))   # first read: the size byte
    readseq = infile.read(seqsize)  # second read: file is already at intacc + 1
    # Sequence and score have equal length, so split the record in half.
    myseq, myscore = readseq[:seqsize//2], readseq[seqsize//2:]
    print oldacc, intacc, myseq, myscore
infile.close()
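
The lookup can also be wrapped in a small helper. The sketch below is
hypothetical (fetch is not a pygr function). It assumes Python 3 and a file
object opened in binary mode, and uses an in-memory buffer in place of
test.ifa.

```python
import io

def fetch(fh, offset):
    """Return (sequence, score) for the record starting at byte offset."""
    fh.seek(offset)
    size_byte = fh.read(1)
    if len(size_byte) != 1:
        raise EOFError('no record at offset %d' % offset)
    size = size_byte[0]      # indexing bytes gives an int in Python 3
    payload = fh.read(size)  # file position is already offset + 1
    if len(payload) != size:
        raise EOFError('truncated record at offset %d' % offset)
    half = size // 2         # sequence and score have equal length
    return payload[:half].decode('ascii'), payload[half:].decode('ascii')

# Two records laid out as: size byte, sequence, score.
data = io.BytesIO(b'\x08ACGTIIII' + b'\x0aGGCCAIIHHG')
first = fetch(data, 0)
second = fetch(data, 9)
print(first, second)
```

Because the integer ID is the byte offset itself, no hash table or index has
to be loaded before a record can be read.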

Let me know what you think.
