[khmer] Extracting the original reads after diginorm + partitioning

Eric McDonald emcd.msu at gmail.com
Fri Mar 15 15:09:49 PDT 2013


Yes.

Default k = 32.
Full sequence length is 75. 75 - 32 = 43.
Trimmed sequence length is 37. 37 -32 = 5.
To get a median of >0, you need a trimmed sequence length of at least 53,
because 43 / 2 = 21, 21 - 5 = 16, 37 + 16 = 53.

For example, try this as your sequence in test1.fa:
  AAGACACCCTCACCCCTAGCTGCGCGAGGCCCTCTCCCCTGGGTAGAGGGTCA

Nothing wrong with the code - just your math skills, Adina. :-)

Hope that helps,
  Eric



On Wed, Mar 13, 2013 at 12:03 AM, Adina Chuang Howe
<adina.chuang at gmail.com>wrote:

> It is definitely reproducible.  And I have at least one sequence that can
> be identified that is causing this.
>
> Here's what I'm seeing:  (these are on the same HPC scratch space as
> before)
>
> test1.fa contains only (37 bp):
>
> >SRR172902.316476 42391
> AAGACACCCTCACCCCTAGCTGCGCGAGGCCCTCTCC
>
> SRR172902.316476.single.fq contains the untrimmed same:
>
> @SRR172902.316476 USI-EAS376:1:5:311:233 length=75
> AAGACACCCTCACCCCTAGCTGCGCGAGGCCCTCTCCCCTGGGTAGAGGGTCAAACAGCGCAAGGCAACAGATCG
> +SRR172902.316476 USI-EAS376:1:5:311:233 length=75
> BBBBBABBB at BBBBBB
> >B>B=AA>B?A=4ABA;B>>AAA79;;068799;2;====>3>9>7922=;739#####
>
> python sweep-reads3.py test1.fa SRR172902.316476.single.fq
>
> results in an empty test1.fa.sweep
>
> But...
>
> python sweep-reads3.py SRR172902.316476.single.fq
> SRR172902.316476.single.fq
>
> Results in
>
> ==> SRR172902.316476.single.fq.sweep3 <==
> >SRR172902.316476
> AAGACACCCTCACCCCTAGCTGCGCGAGGCCCTCTCCCCTGGGTAGAGGGTCAAACAGCGCAAGGCAACAGATCG
>
> Any clue?
>
>
> On Sun, Mar 10, 2013 at 11:33 PM, C. Titus Brown <ctb at msu.edu> wrote:
>
>> On Thu, Mar 07, 2013 at 02:10:17PM -0500, Adina Chuang Howe wrote:
>> > Possible bug in sweep-reads...I'm not recovering the partitioned reads
>> > from the original dataset.
>> >
>> > First observed this when I looked at lotsa partitions and trying to
>> > recover swept reads - some swept files would show up empty:
>> >
>> > command:
>> > python sweep-reads3.py -N 4 -k 32 -x 1e9
>> > /mnt/research/gpgc/hmp-mock-partitions/001264-files/no-sweep-pids/pid*fa
>> > /mnt/research/gpgc/hmp-mock-partitions/SRR-combined.fastq
>> >
>> > troubleshooting:
>> > Then I looked at just one partition:
>> > on HPC:  /mnt/scratch/howead/test
>> > python sweep-reads3.py pid-42391.fa SRR-combined.fastq
>> >
>> > And resulting sweepfile is empty.
>> >
>> > If I run:
>> > python sweep-reads3.py pid-42391.fa pid-42391.fa
>> >
>> > Behavior is correct.
>>
>> Very weird.  I don't see anything obviously wrong in the script, which
>> just means it's a subtle and deep bug.
>>
>> Two questions --
>>
>>  - is it reproducible, i.e. do you get the same results every time you
>>    run it? (please yes)
>>
>>  - can you break it down to a smaller failure point than with
>> SRR-combined,
>>    e.g. maybe a few hundred k reads?
>>
>> thanks,
>> --titus
>> --
>> C. Titus Brown, ctb at msu.edu
>>
>
>


-- 
Eric McDonald
HPC/Cloud Software Engineer
  for the Institute for Cyber-Enabled Research (iCER)
  and the Laboratory for Genomics, Evolution, and Development (GED)
Michigan State University
P: 517-355-8733
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20130315/f5869bd9/attachment-0002.htm>


More information about the khmer mailing list