[khmer] Dealing with paired-End Data (Alexis Groppi)

Mon Mar 25 10:46:28 PDT 2013

Hi Adina,

Thanks for your very clear and very useful comments.
They match my own thinking ;)
I'm now starting again from scratch, following your advice.
The question is which species (human, bacterial, other?)
were present in the ancient DNA (~25,000 years old) that was sequenced.

Cheers from Bordeaux

Alexis

On 25/03/2013 16:18, Adina Chuang Howe wrote:
> Hi Alexis,
>
> See below for comments.
>
>     Message: 1
>     Date: Mon, 25 Mar 2013 15:29:19 +0100
>     From: Alexis Groppi <alexis.groppi at u-bordeaux2.fr>
>     Subject: [khmer] Dealing with paired-End Data
>     To: "khmer at lists.idyll.org" <khmer at lists.idyll.org>
>     Cc: "C. Titus Brown" <ctb at msu.edu>
>     Message-ID: <51505F3F.3020906 at u-bordeaux2.fr>
>     Content-Type: text/plain; charset="iso-8859-1"; Format="flowed"
>
>     Hi Titus,
>
>     Maybe a very dumb question:
>     How should I deal with paired-end data (Illumina reads of 75 nt)?
>     For some samples, I have paired-end data, meaning two .fastq files
>     (SampleN_R1.fastq and SampleN_R2.fastq).
>     What is the best strategy:
>     a/ Treat each file (R1 and R2) separately (normalization, filtering,
>     partition), but then how do I deal with the resulting .part files
>     from each of R1 and R2 for assembly?
>
>
> khmer offers a couple of options for handling paired ends, which 
> take two forms:
>
> Keep paired ends always:
>
> There is an option within khmer to retain paired-end information, 
> i.e., if digital normalization retains one read of a pair, its mate 
> will also be retained regardless of its coverage within the dataset 
> (--paired).
>
> Currently, the only implementation we have for this (as far as I know) 
> requires that you have the paired ends adjacent to each other within 
> your dataset.  Depending on the sequencing facility, you may have to 
> convert R1 and R2 files to one file with a script like 
> https://github.com/ged-lab/khmer/blob/master/sandbox/interleave.py
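To make "paired ends adjacent to each other" concrete, here is a minimal sketch of interleaving in plain Python. It is not the khmer script linked above; the function names `read_fastq` and `interleave` are hypothetical, and it assumes strict four-line FASTQ records with R1 and R2 listing the same reads in the same order.

```python
from itertools import zip_longest


def read_fastq(handle):
    """Yield (header, sequence, separator, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            # End of file (or a blank line, which this sketch treats as the end).
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        yield header, sequence, separator, quality


def interleave(r1_handle, r2_handle, out_handle):
    """Write R1 and R2 records alternately so that mates are adjacent.

    Returns the number of pairs written. Assumes both inputs hold the
    same reads in the same order; raises ValueError if the lengths differ.
    """
    pairs = 0
    for rec1, rec2 in zip_longest(read_fastq(r1_handle), read_fastq(r2_handle)):
        if rec1 is None or rec2 is None:
            raise ValueError("R1 and R2 contain different numbers of reads")
        for record in (rec1, rec2):
            out_handle.write("\n".join(record) + "\n")
        pairs += 1
    return pairs
```

With real data this would be called on three open file handles, for example SampleN_R1.fastq, SampleN_R2.fastq and an output file of your choosing.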
>
> If you do turn this option off, keep in mind that diginorm decides 
> whether to retain each read in the order the reads are given as 
> input.  Among reads which contain the same information and are above 
> the coverage threshold, diginorm will keep the first ones it sees.  
> The take-home here is to feed in your best reads first.
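The order dependence can be illustrated with a toy normalizer. This is only a sketch of the idea, not khmer's implementation: it uses an exact in-memory counter and tiny parameters, and the names `kmers` and `normalize` as well as the defaults `k=4` and `cutoff=2` are assumptions chosen for the example. A read is kept when the median count of its k-mers, among reads kept so far, is below the cutoff.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=4, cutoff=2):
    """Toy digital normalization over (name, sequence) pairs, in input order.

    Keeps a read if the median abundance of its k-mers (counted over the
    reads kept so far) is below cutoff. Returns the names of kept reads.
    """
    counts = Counter()
    kept = []
    for name, seq in reads:
        words = kmers(seq, k)
        if not words:
            continue  # read shorter than k, nothing to count
        if median(counts[w] for w in words) < cutoff:
            counts.update(words)
            kept.append(name)
    return kept


# Three reads carrying identical information: only the first two seen are kept.
reads = [("r1", "ACGTTGCA"), ("r2", "ACGTTGCA"), ("r3", "ACGTTGCA")]
first_order = normalize(reads)         # ['r1', 'r2']
other_order = normalize(reads[::-1])   # ['r3', 'r2']
```

Which copies survive depends only on input order, which is why putting the best reads first matters.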
>
> Use any paired end information for assembly:
>
> Assemblies can still be run with paired ends even if you turn off the 
> paired end retention parameter in diginorm: the strip and split for 
> assembly script separates the paired end reads from the single end 
> reads that remain after diginorm.
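As a rough illustration of separating surviving pairs from orphaned reads, here is a minimal sketch. It is not the khmer script mentioned above; the function name `split_pairs` is hypothetical, and it assumes an interleaved input in which mates carry the same name with a `/1` or `/2` suffix.

```python
def split_pairs(records):
    """Split interleaved (name, sequence) records into pairs and orphans.

    Two consecutive records form a pair when the first name ends in '/1',
    the second ends in '/2', and the names are otherwise identical.
    Anything else is treated as a single-end (orphaned) read.
    """
    pairs, orphans = [], []
    i = 0
    while i < len(records):
        name = records[i][0]
        next_name = records[i + 1][0] if i + 1 < len(records) else None
        is_pair = (
            next_name is not None
            and name.endswith("/1")
            and next_name.endswith("/2")
            and name[:-2] == next_name[:-2]
        )
        if is_pair:
            pairs.append((records[i], records[i + 1]))
            i += 2
        else:
            orphans.append(records[i])
            i += 1
    return pairs, orphans
```

The pairs can then be passed to an assembler as paired-end input and the orphans as single-end input.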
>
> Which to choose:
> Which one you choose really depends on your question and the type of 
> coverage you think you have for your dataset.  For complex 
> metagenomes, I have to balance data reduction with paired end 
> information in order to be able to complete my assemblies efficiently. 
> It's difficult to provide advice on this without knowing what your 
> questions are.
>
> If you're focused on scaffolding and longer assemblies in general, 
> maybe you want to prioritize the retention of your paired ends. If 
> you're having trouble completing assemblies at all, you might try 
> discarding more data at the cost of paired ends.
>
> I've found that assembly involves much trial and error, with a result 
> that you can always improve upon and that can constantly change.  
> Given this, there's no clear workflow that I can recommend for every 
> user, except to get your data to a point where rapid exploration can 
> occur.  I've started to work with aggressively quality trimmed data, 
> in which I lose paired end information all the time, so nowadays I 
> tend not to worry about retaining paired ends in my workflow.
>
> Hope this helps and good luck,
> Adina
>
>
>
>
> _______________________________________________
> khmer mailing list
> khmer at lists.idyll.org
> http://lists.idyll.org/listinfo/khmer
