[khmer] partitioning pipeline output, fastq
C. Titus Brown
ctb at msu.edu
Fri May 24 07:50:57 PDT 2013
On Fri, May 24, 2013 at 05:47:13PM +0300, Jens-Konrad Preem wrote:
> On 05/24/2013 05:40 PM, Jordan Fish wrote:
>> One thing I'd recommend is to do your mate pair merging -before- diginorm
>> and partitioning. Feed the reads that merge successfully into diginorm
>> first, in line with the "put your best data in first" principle.
>> On Fri, May 24, 2013 at 9:06 AM, C. Titus Brown <ctb at msu.edu> wrote:
>>> C. Titus Brown, ctb at msu.edu
>>> On May 24, 2013, at 8:59, Jens-Konrad Preem <jpreem at ut.ee> wrote:
>>>> Similar to my question about the filter-below-abund.py output, which has already been solved. Thanks!
>>>> The input and output of the partitioning pipeline, as described in your Guide and in the "partitioning large data" example on your website, are FASTA-formatted files. The next step for the partitioned data would be assembly. I am thinking of pre-assembling the mate pairs with FLASH* before the full assembly with SOAPdenovo2 or Velvet. The input files for FLASH are FASTQ.
>>>> Do I understand correctly that nothing happens to the sequences themselves during partitioning, and that they are just binned/sorted into groups/partitions?
>>>> In that case, it should be no problem for me to take the quality scores from the filter-below-abund.py FASTQ output (the brother of the filter-below-abund.py FASTA output :D) and just attach those to the partitioned sequences?
>>>> * They seem to imply that genome assembly further down the line is remarkably improved, at least in the case of SOAPdenovo; maybe that is not the case for Velvet, the assembler you have suggested?
>>>> Magoč, T., & Salzberg, S. L. (2011). FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics (Oxford, England), 27(21), 2957-2963. doi:10.1093/bioinformatics/btr507
>>>> Jens-Konrad Preem, MSc., University of Tartu
>>>> khmer mailing list
>>>> khmer at lists.idyll.org
> That is a good idea. So what do you think about the following pipeline:
> quality control (I was thinking Musket*), merging pairs (FLASH), diginorm
> and partitioning as per "partitioning large datasets" (feed in both the
> merged reads and the single ones), assembly (considering SOAPdenovo2 or
> Velvet)?
Hmmm. Let us know how it goes? :)
Musket might be a bottleneck. What about:
diginorm -p -C 20, Musket, merge pairs, diginorm -C 5, filter-below-abund, ...
Also, I don't know how well Musket will work on metagenomic data.
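The proposed ordering could be sketched as below. The khmer script names are real (filter-below-abund.py lives in khmer's sandbox/), but the flags, k-mer sizes, and the Musket invocation in particular are assumptions to check against each tool's documentation and tune for your data:

```shell
# Illustrative sketch only -- adjust k-mer size, cutoffs, and table
# parameters; verify musket's invocation against its own docs.
normalize-by-median.py -p -k 20 -C 20 interleaved.fastq    # 1st diginorm pass, paired, C=20
musket -o corrected.fastq interleaved.fastq.keep           # error correction
split-paired-reads.py corrected.fastq                      # back to R1/R2 for FLASH
flash corrected.fastq.1 corrected.fastq.2 -o merged        # merge mate pairs
normalize-by-median.py -k 20 -C 5 merged.extendedFrags.fastq   # 2nd diginorm pass, C=5
load-into-counting.py counts.ct merged.extendedFrags.fastq.keep
filter-below-abund.py counts.ct merged.extendedFrags.fastq.keep
# ...then partition and assemble as per "partitioning large datasets"
```

The first, gentler diginorm pass (C=20) cuts the data volume so Musket is less of a bottleneck, while the stricter second pass (C=5) runs on already-corrected, merged reads.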
C. Titus Brown, ctb at msu.edu
More information about the khmer mailing list