[khmer] Dealing with paired-End Data (Alexis Groppi)

Mon Mar 25 08:18:23 PDT 2013

Hi Alexis,

See below for comments.

> Date: Mon, 25 Mar 2013 15:29:19 +0100
> From: Alexis Groppi <alexis.groppi at u-bordeaux2.fr>
> Subject: [khmer] Dealing with paired-End Data
> To: "khmer at lists.idyll.org" <khmer at lists.idyll.org>
> Cc: "C. Titus Brown" <ctb at msu.edu>
> Message-ID: <51505F3F.3020906 at u-bordeaux2.fr>
> Content-Type: text/plain; charset="iso-8859-1"; Format="flowed"
>
> Hi Titus,
>
> This may be a very dumb question:
> How should I deal with paired-end data (Illumina reads of 75 nt)?
> For some samples, I have paired-end data, meaning 2 .fastq files
> (SampleN_R1.fastq and SampleN_R2.fastq).
> What is the best strategy:
> a/ Treat each file (R1 and R2) separately (normalization, filtering,
> partitioning), but then how do I deal with the resulting .part files
> from each R1 and R2 for assembly?
>

khmer gives you a couple of options for handling paired-end data. They
take two forms:

Keep paired ends always:

There is an option within khmer to retain paired-end information
(--paired), i.e., if digital normalization retains one read of a pair, its
mate will also be retained regardless of its coverage within the dataset.
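
To make the idea concrete, here is a toy sketch of pair-aware
normalization. It is not khmer's code: it uses an exact in-memory counter,
ignores reverse complements, and the function names, k and cutoff are
made up for the illustration. It only shows the rule described above: a
pair is kept as soon as either mate is still below the coverage cutoff.

```python
from collections import Counter
from statistics import median


def median_count(seq, counts, k):
    """Return the k-mers of seq and their median count seen so far.

    Assumes len(seq) >= k (a shorter read has no k-mers).
    """
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return words, median(counts[w] for w in words)


def toy_paired_diginorm(pairs, k=4, cutoff=1):
    """Toy model: keep a pair if either mate is below the cutoff."""
    counts = Counter()
    kept = []
    for name, seq1, seq2 in pairs:
        keep_pair = False
        for seq in (seq1, seq2):
            words, med = median_count(seq, counts, k)
            if med < cutoff:
                # This mate adds new coverage, so the whole pair is kept.
                keep_pair = True
                counts.update(words)
        if keep_pair:
            kept.append(name)
    return kept
```

In this toy, a pair whose first mate is already well covered is still
kept whole if its second mate is new, which is the trade-off between data
reduction and pairing.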

Currently, the only implementation we have for this (as far as I know)
requires that the two reads of each pair be adjacent to each other within
your dataset.  Depending on how your sequencing facility delivers the data,
you may have to interleave the R1 and R2 files into one file with a script
like
https://github.com/ged-lab/khmer/blob/master/sandbox/interleave.py
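
For illustration only, interleaving can be sketched in a few lines of
Python. This is not the script linked above; the function names are made
up, and it assumes plain four-line FASTQ records with R1 and R2 listing
the reads in the same order.

```python
from itertools import zip_longest


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from 4-line FASTQ records."""
    while True:
        record = [handle.readline().rstrip("\n") for _ in range(4)]
        if not record[0]:
            return  # end of file
        yield tuple(record)


def interleave(r1_handle, r2_handle, out_handle):
    """Write R1 and R2 records alternately so that mates are adjacent.

    Returns the number of pairs written.
    """
    pairs = 0
    for rec1, rec2 in zip_longest(read_fastq(r1_handle),
                                  read_fastq(r2_handle)):
        if rec1 is None or rec2 is None:
            raise ValueError("R1 and R2 contain different numbers of reads")
        for rec in (rec1, rec2):
            out_handle.write("\n".join(rec) + "\n")
        pairs += 1
    return pairs
```

You would call it with three open file handles, e.g. SampleN_R1.fastq and
SampleN_R2.fastq for reading and a new file for writing.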

If you turn this option off, keep in mind that diginorm's decision to
retain a read depends on the order in which reads are given as input.
Among reads that contain the same information, diginorm keeps the first
ones it sees and discards the later ones once the coverage threshold has
been reached.  The take-home message is to feed in your best reads first.
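
The order dependence is easy to see in a toy version of the algorithm.
The sketch below is a simplification, not khmer's implementation: it uses
an exact in-memory counter, ignores reverse complements, and the function
names, k and cutoff are chosen just for the example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def toy_diginorm(reads, k=4, cutoff=2):
    """Toy model: keep a read only if the median count of its k-mers,
    counted over the reads kept so far, is still below the cutoff."""
    counts = Counter()
    kept = []
    for name, seq in reads:
        words = kmers(seq, k)
        if not words:
            continue  # read is shorter than k
        if median(counts[w] for w in words) < cutoff:
            kept.append(name)
            counts.update(words)
    return kept


# Three reads carrying identical information, in two different orders.
reads = [("first", "ACGTTGCA"), ("second", "ACGTTGCA"), ("third", "ACGTTGCA")]
print(toy_diginorm(reads))        # keeps the two reads seen earliest
print(toy_diginorm(reads[::-1]))  # same reads reversed: a different pair is kept
```

Whichever copies arrive first are the ones that survive, so if some of
those copies are of better quality than others, you want them at the
front of the input.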

Use any paired end information for assembly:

Assemblies can still be run with paired ends even if you turn off the
paired-end retention parameter in diginorm: the strip and split for
assembly script separates the reads that remain after diginorm into
paired-end reads and single-end reads.
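
The separation step can be pictured roughly as follows. This is only a
sketch of the idea, not the khmer script itself; it assumes (hypothetically)
that mates are named "X/1" and "X/2" and that when both mates survive they
are still adjacent, with /1 first.

```python
def split_pairs(records):
    """Split a list of (name, sequence) records into mate pairs and orphans.

    Two consecutive records form a pair when the first is named 'X/1' and
    the second 'X/2'. Everything else is treated as a single-end read.
    """
    pairs, singles = [], []
    i = 0
    while i < len(records):
        name = records[i][0]
        has_mate = (
            i + 1 < len(records)
            and name.endswith("/1")
            and records[i + 1][0] == name[:-2] + "/2"
        )
        if has_mate:
            pairs.append((records[i], records[i + 1]))
            i += 2
        else:
            singles.append(records[i])
            i += 1
    return pairs, singles
```

The pairs can then go to the assembler as paired-end input and the
orphans as single-end input.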

Which to choose:
What you choose really depends on your question and the type of coverage
you think you have for your dataset.  For complex metagenomes, I have to
balance data reduction against paired-end information in order to be able
to complete my assemblies efficiently.  It's difficult to provide advice on
this without knowing what your questions are.

If you're focused on scaffolding and longer assemblies in general, maybe
you want to prioritize the retention of your paired ends. If you're having
trouble completing assemblies at all, you might try discarding more data at
the cost of paired ends.

I've found that assembly involves much trial and error, with a result that
you can always improve upon and that can constantly change.  Given this,
there's no clear workflow that I can recommend for every user, except to
get your data to a point where rapid exploration can occur.  I've started
to work with aggressively quality-trimmed data, in which I lose paired-end
information all the time, so nowadays I tend not to worry about retaining
paired ends in my workflow.

Hope this helps and good luck,
Adina