[khmer] Extracting the original reads after diginorm + partitioning

Eric McDonald emcd.msu at gmail.com
Fri Mar 15 15:17:43 PDT 2013


Sorry, in plain English, my previous reply amounts to this:

Your trimmed sequence only had 5 k-mers of length 32.
Your full sequence had 43 k-mers of length 32.
Since the median samples from the middle of the sorted counts, you will not
see a count of 1 unless at least half of the counts are 1.
To achieve this, you need a trimmed sequence of at least length 53.

Make sense?



On Fri, Mar 15, 2013 at 6:09 PM, Eric McDonald <emcd.msu at gmail.com> wrote:

> Yes.
>
> Default k = 32.
> Full sequence length is 75. 75 - 32 = 43.
> Trimmed sequence length is 37. 37 -32 = 5.
> To get a median of >0, you need a trimmed sequence length of at least 53,
> because 43 / 2 = 21, 21 - 5 = 16, 37 + 16 = 53.
>
> For example, try this as your sequence in test1.fa:
>   AAGACACCCTCACCCCTAGCTGCGCGAGGCCCTCTCCCCTGGGTAGAGGGTCA
>
> Nothing wrong with the code - just your math skills, Adina. :-)
>
> Hope that helps,
>   Eric
>
>
>
> On Wed, Mar 13, 2013 at 12:03 AM, Adina Chuang Howe <
> adina.chuang at gmail.com> wrote:
>
>> It is definitely reproducible.  And I have at least one sequence that can
>> be identified that is causing this.
>>
>> Here's what I'm seeing:  (these are on the same HPC scratch space as
>> before)
>>
>> test1.fa contains only (37 bp):
>>
>> >SRR172902.316476 42391
>> AAGACACCCTCACCCCTAGCTGCGCGAGGCCCTCTCC
>>
>> SRR172902.316476.single.fq contains the untrimmed same:
>>
>> @SRR172902.316476 USI-EAS376:1:5:311:233 length=75
>>
>> AAGACACCCTCACCCCTAGCTGCGCGAGGCCCTCTCCCCTGGGTAGAGGGTCAAACAGCGCAAGGCAACAGATCG
>> +SRR172902.316476 USI-EAS376:1:5:311:233 length=75
>> BBBBBABBB at BBBBBB
>> >B>B=AA>B?A=4ABA;B>>AAA79;;068799;2;====>3>9>7922=;739#####
>>
>> python sweep-reads3.py test1.fa SRR172902.316476.single.fq
>>
>> results in an empty test1.fa.sweep
>>
>> But...
>>
>> python sweep-reads3.py SRR172902.316476.single.fq
>> SRR172902.316476.single.fq
>>
>> Results in
>>
>> ==> SRR172902.316476.single.fq.sweep3 <==
>> >SRR172902.316476
>>
>> AAGACACCCTCACCCCTAGCTGCGCGAGGCCCTCTCCCCTGGGTAGAGGGTCAAACAGCGCAAGGCAACAGATCG
>>
>> Any clue?
>>
>>
>> On Sun, Mar 10, 2013 at 11:33 PM, C. Titus Brown <ctb at msu.edu> wrote:
>>
>>> On Thu, Mar 07, 2013 at 02:10:17PM -0500, Adina Chuang Howe wrote:
>>> > Possible bug in sweep-reads...I'm not recovering the partitioned reads
>>> > from the original dataset.
>>> >
>>> > First observed this when I looked at lotsa partitions and trying to
>>> > recover swept reads - some swept files would show up empty:
>>> >
>>> > command:
>>> > python sweep-reads3.py -N 4 -k 32 -x 1e9
>>> >
>>> /mnt/research/gpgc/hmp-mock-partitions/001264-files/no-sweep-pids/pid*fa
>>> > /mnt/research/gpgc/hmp-mock-partitions/SRR-combined.fastq
>>> >
>>> > troubleshooting:
>>> > Then I looked at just one partition:
>>> > on HPC:  /mnt/scratch/howead/test
>>> > python sweep-reads3.py pid-42391.fa SRR-combined.fastq
>>> >
>>> > And resulting sweepfile is empty.
>>> >
>>> > If I run:
>>> > python sweep-reads3.py pid-42391.fa pid-42391.fa
>>> >
>>> > Behavior is correct.
>>>
>>> Very weird.  I don't see anything obviously wrong in the script, which
>>> just means it's a subtle and deep bug.
>>>
>>> Two questions --
>>>
>>>  - is it reproducible, i.e. do you get the same results every time you
>>>    run it? (please yes)
>>>
>>>  - can you break it down to a smaller failure point than with
>>> SRR-combined,
>>>    e.g. maybe a few hundred k reads?
>>>
>>> thanks,
>>> --titus
>>> --
>>> C. Titus Brown, ctb at msu.edu
>>>
>>
>>
>
>
> --
> Eric McDonald
> HPC/Cloud Software Engineer
>   for the Institute for Cyber-Enabled Research (iCER)
>   and the Laboratory for Genomics, Evolution, and Development (GED)
> Michigan State University
> P: 517-355-8733
>



-- 
Eric McDonald
HPC/Cloud Software Engineer
  for the Institute for Cyber-Enabled Research (iCER)
  and the Laboratory for Genomics, Evolution, and Development (GED)
Michigan State University
P: 517-355-8733
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.idyll.org/pipermail/khmer/attachments/20130315/5f0fd99c/attachment-0002.htm>


More information about the khmer mailing list