[khmer] Extracting the original reads after diginorm + partitioning

Tue Mar 12 21:03:01 PDT 2013

It is definitely reproducible, and I have identified at least one sequence
that triggers it.

Here's what I'm seeing (these files are on the same HPC scratch space as before):

test1.fa contains only (37 bp):

>SRR172902.316476 42391
AAGACACCCTCACCCCTAGCTGCGCGAGGCCCTCTCC

SRR172902.316476.single.fq contains the same read, untrimmed (75 bp):

@SRR172902.316476 USI-EAS376:1:5:311:233 length=75
AAGACACCCTCACCCCTAGCTGCGCGAGGCCCTCTCCCCTGGGTAGAGGGTCAAACAGCGCAAGGCAACAGATCG
+SRR172902.316476 USI-EAS376:1:5:311:233 length=75
BBBBBABBB@BBBBBB>B>B=AA>B?A=4ABA;B>>AAA79;;068799;2;====>3>9>7922=;739#####

python sweep-reads3.py test1.fa SRR172902.316476.single.fq

results in an empty test1.fa.sweep

But...

python sweep-reads3.py SRR172902.316476.single.fq SRR172902.316476.single.fq

Results in

==> SRR172902.316476.single.fq.sweep3 <==
>SRR172902.316476
AAGACACCCTCACCCCTAGCTGCGCGAGGCCCTCTCCCCTGGGTAGAGGGTCAAACAGCGCAAGGCAACAGATCG
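
A quick way to confirm that these two records should overlap at the k-mer
level is sketched below. It assumes k=32 (taken from the earlier "-k 32"
invocation; the default used by the test runs above is not stated), and the
kmers() helper is hypothetical, not part of khmer or sweep-reads3.py. It only
compares forward-strand substrings and says nothing about how khmer hashes
k-mers internally.

```python
def kmers(seq, k):
    """Return the set of all forward-strand k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


K = 32  # assumption: taken from the earlier "-k 32" command line

# Sequences copied from test1.fa and SRR172902.316476.single.fq above.
trimmed = "AAGACACCCTCACCCCTAGCTGCGCGAGGCCCTCTCC"
untrimmed = (
    "AAGACACCCTCACCCCTAGCTGCGCGAGGCCCTCTCCCCTGGGTAGAGGGTCAAACAGCGCAAGGCAACAGATCG"
)

# The trimmed read should be a prefix of the untrimmed read, so every
# k-mer of the trimmed read should also occur in the untrimmed read.
is_prefix = untrimmed.startswith(trimmed)
shared = kmers(trimmed, K) & kmers(untrimmed, K)

print("lengths:", len(trimmed), len(untrimmed))
print("trimmed is a prefix of untrimmed:", is_prefix)
print("shared %d-mers: %d" % (K, len(shared)))
```

A 37 bp read has only 37 - 32 + 1 = 6 k-mers at k=32, so if all of them are
shared, the read would be expected to be recovered by any sweep that matches
on k-mers of that size.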

Any clue?

On Sun, Mar 10, 2013 at 11:33 PM, C. Titus Brown <ctb at msu.edu> wrote:

> On Thu, Mar 07, 2013 at 02:10:17PM -0500, Adina Chuang Howe wrote:
> > Possible bug in sweep-reads...I'm not recovering the partitioned reads
> > from the original dataset.
> >
> > First observed this when I looked at lots of partitions and tried to
> > recover swept reads - some swept files would show up empty:
> >
> > command:
> > python sweep-reads3.py -N 4 -k 32 -x 1e9
> > /mnt/research/gpgc/hmp-mock-partitions/001264-files/no-sweep-pids/pid*fa
> > /mnt/research/gpgc/hmp-mock-partitions/SRR-combined.fastq
> >
> > troubleshooting:
> > Then I looked at just one partition:
> > on HPC:  /mnt/scratch/howead/test
> > python sweep-reads3.py pid-42391.fa SRR-combined.fastq
> >
> > And resulting sweepfile is empty.
> >
> > If I run:
> > python sweep-reads3.py pid-42391.fa pid-42391.fa
> >
> > Behavior is correct.
>
> Very weird.  I don't see anything obviously wrong in the script, which
> just means it's a subtle and deep bug.
>
> Two questions --
>
>  - is it reproducible, i.e. do you get the same results every time you
>    run it? (please yes)
>
>  - can you break it down to a smaller failure point than with SRR-combined,
>    e.g. maybe a few hundred k reads?
>
> thanks,
> --titus
> --
> C. Titus Brown, ctb at msu.edu
>