[pygr-notify] Issue 89 in pygr: Need for solexa seqdb with integer (64bit) ID which requires no hashing/indexing
codesite-noreply at google.com
Thu May 14 07:45:03 PDT 2009
Updates:
Summary: Need for solexa seqdb with integer (64bit) ID which requires no
hashing/indexing
Comment #2 on issue 89 by deepreds: Need for solexa seqdb with integer
(64bit) ID which requires no hashing/indexing
http://code.google.com/p/pygr/issues/detail?id=89
An extremely crude idea: store each record as one character that encodes the
combined length of sequence + score (as an ordinal), followed by the sequence
and the score themselves.
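outfile1 = open('test.ifa', 'wb')  # packed records: 1 length byte + sequence + score
outfile2 = open('test.pfa', 'w')   # index: accession <TAB> byte offset into test.ifa
infile = open('test.fq', 'r')
iCount = 0                         # byte offset of the next record in test.ifa
while 1:
    line1 = infile.readline()      # '@' + accession
    line2 = infile.readline()      # sequence
    line3 = infile.readline()      # '+' separator (unused)
    line4 = infile.readline()      # quality scores
    if line1 == '' or line2 == '' or line3 == '' or line4 == '': break
    myacc = line1[1:].strip()
    myseq = '%s%s' % (line2.strip(), line4.strip())
    seqsize = chr(len(myseq))      # one byte, so sequence + score must fit in 255 characters
    outfile1.write('%s%s' % (seqsize, myseq))
    outfile2.write('%s\t%d\n' % (myacc, iCount))
    iCount += 1 + len(myseq)
infile.close()
outfile1.close()
outfile2.close()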
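
The record layout above can also be sketched in Python 3, where the length
prefix has to be written as an explicit byte. This is only an illustration
under assumptions: the function name write_records and the in-memory streams
are hypothetical and not part of pygr, and the two sample reads are made up.

```python
import io

def write_records(fastq, seq_out, index_out):
    """Pack FASTQ records as <1 length byte><sequence><score>.

    Writes 'accession<TAB>offset' lines to index_out and returns the
    number of records written. (Hypothetical helper, not a pygr API.)
    """
    offset = 0  # byte offset of the next record in seq_out
    count = 0
    while True:
        header = fastq.readline()
        seq = fastq.readline()
        fastq.readline()            # '+' separator line, ignored
        qual = fastq.readline()
        if not qual:                # EOF or truncated record
            break
        payload = (seq.strip() + qual.strip()).encode("ascii")
        if len(payload) > 255:      # a single length byte cannot hold more
            raise ValueError("sequence + score longer than 255 bytes")
        seq_out.write(bytes([len(payload)]) + payload)
        index_out.write("%s\t%d\n" % (header[1:].strip(), offset))
        offset += 1 + len(payload)
        count += 1
    return count

# Two made-up reads, kept in memory instead of test.fq / test.ifa / test.pfa.
fastq = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCCA\n+\nHHHHH\n")
seq_out = io.BytesIO()
index_out = io.StringIO()
n = write_records(fastq, seq_out, index_out)
```

Here the first record occupies bytes 0 to 8 (one length byte plus eight
payload bytes), so the second record gets offset 9 as its integer ID.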
Reading a record back takes two short .seek and .read operations: one to read
the size of the record and the other to read the sequence + score. If we know
the integer sequence ID (the record's position in the file, which we can jump
to with .seek), we can read the sequence and score directly.
infile = open('test.ifa', 'rb')
for lines in open('test.pfa', 'r'):
    oldacc, intacc = lines.strip().split('\t')
    intacc = int(intacc)
    infile.seek(intacc)
    seqsize = ord(infile.read(1))   # length byte: size of sequence + score
    infile.seek(intacc + 1)
    readseq = infile.read(seqsize)
    # first half is the sequence, second half is the score
    myseq, myscore = readseq[:seqsize//2], readseq[seqsize//2:]
    print oldacc, intacc, myseq, myscore
infile.close()
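
The lookup by integer ID can be sketched in Python 3 as well. Again this is
a sketch under assumptions rather than pygr code: read_record is a
hypothetical name, and the packed bytes are a made-up two-record file in the
layout described above. Because read(1) already advances the file position
past the length byte, a single seek per record is enough here.

```python
import io

def read_record(seq_file, offset):
    """Return (sequence, score) for the record starting at byte `offset`.

    Assumes the <1 length byte><sequence><score> layout, with sequence
    and score of equal length. (Hypothetical helper, not a pygr API.)
    """
    seq_file.seek(offset)
    size = seq_file.read(1)[0]               # length byte as an int
    payload = seq_file.read(size).decode("ascii")
    half = size // 2                         # sequence and score are the same length
    return payload[:half], payload[half:]

# Made-up packed file holding two records at offsets 0 and 9.
packed = io.BytesIO(b"\x08ACGTIIII\x0aGGCCAHHHHH")
first = read_record(packed, 0)
second = read_record(packed, 9)
```

No hashing or index lookup is involved: the integer ID is the seek position
itself.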
Let me know what you think.