[khmer] filter-below-abund.py fastq scores from previous file

Adina Chuang Howe adina.chuang at gmail.com
Sat Apr 13 10:13:05 PDT 2013


Hi,

Apologies for any delayed response - there are many projects in a "bit
of a hurry" right now.  ;)

Regarding assembly advice, I do not have any experience using the
quality scores to guide assembly.  If anything, I use quality scores
to trim reads prior to assembly rather than during assembly.  My
intuition is that the quality scores are among the less important
variables needed for a good assembly; much more important are the
coverage and abundance of overlaps.  There are a number of assemblers
which put different priorities on different metrics, and
understanding them all and which is best to use is quite an effort
(even without considering the effectiveness of Phred quality scores).
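
To make "trim reads prior to assembly" concrete, here is a minimal
sketch of 3'-end quality trimming in Python.  It is only an
illustration, not part of khmer: the function name trim_by_quality,
the threshold of Q20, and the Phred+33 encoding are all assumptions
to adjust for your own data.

```python
def trim_by_quality(seq, qual, min_q=20, offset=33):
    """Trim the 3' end of a read back to the last base with Phred >= min_q.

    Hypothetical helper: min_q=20 and offset=33 (Phred+33) are assumed
    defaults, not values taken from khmer or from this thread.
    """
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    print(trim_by_quality("ACGTACGT", "IIIII###"))
```

Dedicated trimming tools handle sliding windows and adapters as well;
this sketch only shows the basic idea of cutting low-quality tails.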

In my experience, I've defaulted to assemblers and approaches which
I've methodically tested with datasets with known references.  That
being said, the pipelines we suggest have been tested mostly with
Velvet (and also SOAPdenovo and Meta-IDBA) to produce similar if not
slightly better assemblies with the khmer diginorm/high abundance
filtering/partitioning (http://arxiv.org/abs/1212.2832).  Currently,
my pipeline for metagenomics involves removing adapters, merging
paired reads that overlap, diginorm, high abundance filtering, and
partitioning followed by multi-k assembly with Velvet.  These steps
often get me to an assembly I can work with, though I'm sure there are
incremental improvements that could be made in parameter choices and
assembler.  The advantage of partitioning is that the results can be
used for multiple assemblers rather easily.

Hope this helps, Adina


On Fri, Apr 12, 2013 at 10:08 PM, Eric McDonald <emcd.msu at gmail.com> wrote:
> My reading of the code is that they're trimmed from the end and so I think
> you should be able to write the script without too much difficulty. (Sorry
> for the inconvenience - we'll hopefully have the necessary changes in khmer
> in the next month or two.)
>
> For suggestions about assemblers, I know that there are people on this list
> who are much better qualified than I am to answer that question. I _hope_
> that one or more of them will speak up. I just CC'd two of them in an
> attempt to prod them into participating in this thread. (My background is in
> physics and high performance computing, not in biology. I've been teaching
> myself some things about bioinformatics during the 1.25 years I've worked at
> this job, but I still don't have much exposure to bioinformatics tools
> outside of khmer and screed. :-)
>
>
> On Thu, Apr 11, 2013 at 3:05 AM, Jens-Konrad Preem <jpreem at ut.ee> wrote:
>>
>> I eagerly wait for such functionality, right now I'm in a bit of a hurry.
>> Maybe if I have some more free time I can decipher the "below-abund" and
>> screed code and enact my own changes in it; if so, I'll post it. Right now
>> I've made a Perl script (slowish though) to gather the scores from keep and
>> give those to names/seqs of keep.below. If they're trimmed then I'm of
>> course in trouble again. Are they trimmed in some specific way, from the
>> ends or something? Then it might not be too hard to add this trimming
>> function to my script to make the scores correspond to the sequences.
>>
>> In case I fail in my task, have you any suggestions for paired end
>> assemblers that don't take quality scores? I've tried Cope but for some
>> reason it produced very few alignments (compared to Flash, for example).
>> I'd like to merge the normalized (and hopefully partitioned) pairs before
>> assembly with the likes of Velvet or SOAPdenovo. If all goes sideways, what
>> are your thoughts on a) assembly without merging pairs, b) merging pairs
>> (with Flash etc.) before any modification by khmer
>> (normalisation, partitioning)?
>>
>> Jens-Konrad
>>
>>
>> On 04/11/2013 03:08 AM, Eric McDonald wrote:
>>
>> I believe that the 'filter-below-abund.py' script trims sequences. So,
>> your Perl script may need to also truncate the quality scores line down to
>> the length of the trimmed sequences.
>>
>> By the way, we are working on getting the scripts to output FASTQ if they
>> receive FASTQ inputs, but that functionality is not ready yet. You're
>> definitely not the only person interested in that functionality. ;-)
>>
>> Hope that helps,
>>   Eric
>>
>>
>>
>> On Wed, Apr 10, 2013 at 8:13 AM, Jens-Konrad Preem <jpreem at ut.ee> wrote:
>>>
>>> Hi,
>>> I have just a quick question. Filter-below-abund takes a fastq file and
>>> outputs a fasta file.
>>> Can I make use of a Perl script that would take the names from the
>>> resulting file and add the quality scores from the previous file? As I
>>> understand it, nothing happens to the names or sequences; some of them
>>> just get culled.
>>>
>>> I want to try out the normalized data with some paired end reads
>>> assemblers that use quality scores/fastq files.
>>> I think it's easier for me to write such a Perl script than to modify
>>> filter-below-abund.py to output fastq.
>>>
>>> Not much of a python guy - though it seems that there shouldn't be too
>>> much work on replacing screed.fasta with screed.fastq etc., but I find it is
>>> quite often easier to write a few lines than to parse what someone else
>>> wrote and why and then try to modify it :D.
>>>
>>> Jens-Konrad Preem, MSc., University of Tartu
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> khmer mailing list
>>> khmer at lists.idyll.org
>>> http://lists.idyll.org/listinfo/khmer
>>
>>
>>
>>
>> --
>> Eric McDonald
>> HPC/Cloud Software Engineer
>>   for the Institute for Cyber-Enabled Research (iCER)
>>   and the Laboratory for Genomics, Evolution, and Development (GED)
>> Michigan State University
>> P: 517-355-8733
>>
>>
>> --
>> Jens-Konrad Preem, MSc, University of Tartu
>>
>>
>>
>
>
>
> --
> Eric McDonald
> HPC/Cloud Software Engineer
>   for the Institute for Cyber-Enabled Research (iCER)
>   and the Laboratory for Genomics, Evolution, and Development (GED)
> Michigan State University
> P: 517-355-8733
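
For the question that started this thread, here is a minimal sketch in
Python of putting quality scores back onto the trimmed FASTA output.
It rests on assumptions that should be checked against your own files:
that read names are unchanged (only the first whitespace-delimited
token is compared), that sequences are trimmed from the 3' end only
(as Eric's reading of the code suggests), and that the original FASTQ
has exactly four lines per record.  The function names are
hypothetical and nothing here uses the khmer or screed APIs.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        yield header[1:].split()[0], seq, qual


def read_fasta(handle):
    """Yield (name, sequence) from a FASTA stream (multi-line records allowed)."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def restore_qualities(fastq_handle, fasta_handle, out_handle):
    """Write FASTQ records for the trimmed FASTA reads.

    Assumes 3'-end trimming only: the quality string is truncated to the
    length of the trimmed sequence, after checking that the trimmed
    sequence really is a prefix of the original read.
    """
    # Holds every original read in memory; fine for a sketch, costly for big runs.
    originals = {name: (seq, qual) for name, seq, qual in read_fastq(fastq_handle)}
    for name, seq in read_fasta(fasta_handle):
        orig_seq, orig_qual = originals[name]
        if not orig_seq.upper().startswith(seq.upper()):
            raise ValueError("read %s was not trimmed from the 3' end" % name)
        out_handle.write("@%s\n%s\n+\n%s\n" % (name, seq, orig_qual[:len(seq)]))


if __name__ == "__main__":
    fastq = io.StringIO("@r1 extra\nACGTAC\n+\nIIIIII\n@r2\nGGGG\n+\n####\n")
    fasta = io.StringIO(">r1\nACGT\n")
    out = io.StringIO()
    restore_qualities(fastq, fasta, out)
    print(out.getvalue(), end="")
```

The prefix check is there so the script fails loudly if the trimming
turns out not to be from the 3' end, rather than silently pairing bases
with the wrong scores.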



