[khmer] Partitioning: are resulting lumps that different from each other? (YiJin Liew)

Tue Jan 14 06:13:18 PST 2014

Hi YiJin,

The partitions are different from the groups, so maybe I can help clarify
the difference.
The groups are sets of partitions which have been somewhat arbitrarily put
together based on your max size argument.  Basically, what happens after
partitioning is that you have millions to billions of partitions, and it's
just impractical to work with each of these individually, though it might
be desired in some cases.  When I want to assemble partitions, for example,
it's much easier to work with 100 groups of partitions than billions of
partitions.  To group them, the partitions are pretty much rank ordered by
the number of sequences they have within them and then added to a group
until the maximum size is hit.
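
As a rough illustration of that grouping step, here is a minimal sketch. It
is not the actual extract-partitions.py code: the function name, the
smallest-first ordering, and the exact rule for when a group is closed are
all assumptions made for the example.

```python
def group_partitions(partition_sizes, max_size):
    """Assign partitions to groups, smallest partitions first.

    partition_sizes: dict mapping a partition id to its number of sequences.
    max_size: approximate cap on sequences per group (in the spirit of -X).
    Returns a dict mapping partition id to a group number.

    Hypothetical helper for illustration only, not khmer's implementation.
    """
    groups = {}
    group_n = 0
    total = 0  # sequences placed in the current group so far
    # Rank partitions by size; sorted() is stable, so ties keep input order.
    for pid, size in sorted(partition_sizes.items(), key=lambda kv: kv[1]):
        # Assumed rule: start a new group if this partition would overflow
        # the current one. A partition larger than max_size gets its own group.
        if total > 0 and total + size > max_size:
            group_n += 1
            total = 0
        groups[pid] = group_n
        total += size
    return groups
```

With sizes {"a": 1, "b": 1, "c": 2, "d": 3, "e": 10} and max_size=4, this
puts a, b and c in group 0, d in group 1, and e alone in group 2.  The point
is that group boundaries fall wherever the running total happens to cross
the cap, not where the biology changes.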

What happens after I group them is that I want to know the distribution of
sequences within partitions, and this is where the *dist file comes in.  It
describes the partitions, not the groups; I see how this is confusing and
apologize.  The columns in the dist file are as follows:

  1. number of sequences in a partition (the partition size)
  2. number of partitions with that number of sequences
  3. cumulative number of partitions
  4. cumulative number of sequences (reads)
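
To make the four columns concrete, here is a minimal sketch that rebuilds
them from a list of partition sizes.  The function names are hypothetical
and this is not how khmer itself writes the file; it only reproduces the
arithmetic behind the columns.

```python
from collections import Counter


def cumulative_rows(size_counts):
    """Build dist-style rows from a mapping of partition size -> count.

    Each row is (size, n_partitions, cumulative_partitions, cumulative_reads).
    Hypothetical helper for illustration only.
    """
    rows = []
    cum_partitions = 0
    cum_reads = 0
    for size in sorted(size_counts):
        n = size_counts[size]
        cum_partitions += n
        cum_reads += size * n  # n partitions, each holding `size` reads
        rows.append((size, n, cum_partitions, cum_reads))
    return rows


def dist_rows(partition_sizes):
    """Same rows, starting from one size value per partition."""
    return cumulative_rows(Counter(partition_sizes))
```

Feeding in the first three (size, count) pairs from the iowa-corn-50m.dist
excerpt quoted below, i.e. {1: 19750012, 2: 2905935, 3: 745747}, gives back
the same cumulative values shown in that file: 22655947 partitions and
25561882 reads after the second row, 23401694 and 27799123 after the third.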

BLASTing two partitions against each other doesn't make sense to me unless
there is a reason to think they are related (based on sequence
homology).  That being said, it's quite possible and even likely that
partitions with similar numbers of reads are related, but you want to be
careful with this assumption.  Partitions with similar numbers of reads
reflect biology which has been sampled with similar coverage, which means
that they *could* be related, arguably more so than partitions which have
very different abundances.  In my hands, this seems to hold true, but I'd
want to validate it a lot more before committing to anything.  And it would
vary by what you are actually sampling, of course.

Hope this helps,
Adina

On Mon, Jan 13, 2014 at 3:00 PM, <khmer-request at lists.idyll.org> wrote:

> Today's Topics:
>
>    1. Re: Partitioning: are resulting lumps that different from
>       each other? (YiJin Liew)
>    2. Re: Partitioning: are resulting lumps that different from
>       each other? (C. Titus Brown)
>    3. Re: Partitioning: are resulting lumps that different from
>       each other? (YiJin Liew)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 12 Jan 2014 23:28:35 +0000
> From: YiJin Liew <YiJin.Liew at KAUST.EDU.SA>
> Subject: Re: [khmer] Partitioning: are resulting lumps that different
>         from each other?
> To: "khmer at lists.idyll.org" <khmer at lists.idyll.org>
> Message-ID: <52D32525.1020007 at kaust.edu.sa>
> Content-Type: text/plain; charset="utf-8"
>
> Thanks for the reply! Could I ask a few follow-up questions regarding
> the format of the .dist file then, as I can't seem to find a full
> description of how the file is structured?
>
> Take for example
>
> --- iowa-corn-50m.dist ---
> 1 19750012 19750012 19750012
> 2 2905935 22655947 25561882
> 3 745747 23401694 27799123
> 4 324017 23725711 29095191
> 5 167228 23892939 29931331
> <snip>
> 2312 1 24356713 37268397
> 2359 1 24356714 37270756
> 2714 1 24356715 37273470
> 3008 1 24356716 37276478
> 3296530 1 24356717 40573008   <-- is this the most interesting group?
>
> I can sort of guess what the numbers mean, but let me double-check: does
> this indicate that there's 19.8 million clusters that are "singlets";
> followed by 2.9 million "doublets" etc.? Also, are columns 3 and 4
> cumulative figures for clusters and sequences respectively?
>
> If you don't mind, could you elaborate briefly on how groups are created
> based on the dist file? Judging from the line counts, I suspect that the
> script fills the first group with singlets till --max-size is hit, if
> not, continue filling with doublets, then move on to the next group once
> --max-size is crossed?
>
> On my data, I've tried blastn-ing the groups0000 and 0001 produced from
> the partitioning process, but from the results I'd wager that they're
> roughly the same - which was what prompted me to seek advice on how the
> script functioned.
>
> Apologies for the wall-of-text, thanks again for your help!
>
> Yours
> Yi Jin
>
>
>
> On 12/01/2014 20:51, C. Titus Brown wrote:
> > On Thu, Jan 09, 2014 at 02:23:29PM +0000, YiJin Liew wrote:
> >> Dear Dr Brown,
> >>
> >> Before I delve into my sob story, I'd like to thank you (and your lab)
> >> for writing khmer. I must say that the digital normalisation pipeline
> >> proved to be an elegant method of reducing the amount of errors in
> >> sequencing data, and our resulting assemblies have improved (and sped up
> >> considerably) because of your programs. Thanks.
> >>
> >> After the digital normalisation pipeline, I tried out the partitioning
> >> pipeline as described in
> >> http://khmer.readthedocs.org/en/latest/partitioning-big-data.html. I'm
> >> having some trouble wrapping my head around the results produced by
> >> extract-partitions.py - the resulting lumps (in group000x files) seem to
> >> be strongly influenced by the -X (--max-size) parameter that one uses.
> >>
> >> Take for example the 1.1G Iowa corn dataset you made available online,
> >> specifically the
> >
> > Hi YiJin,
> >
> > apologies for taking so long to reply.  The 'group' files output by
> > extract-partitions contain multiple partitions; the -X parameter controls
> > how many sequences, roughly, go into each group.  So this is entirely
> > expected.
> >
> > Partitions are connected sequences; groups are merely collections of
> > similarly sized partitions.
> >
> > The file to take a look at is the '.dist' file; that's the distribution
> > of partition sizes.
> >
> > best,
> > --titus
> >
>
> ________________________________
>
> This message and its contents including attachments are intended solely
> for the original recipient. If you are not the intended recipient or have
> received this message in error, please notify me immediately and delete
> this message from your computer system. Any unauthorized use or
> distribution is prohibited. Please consider the environment before printing
> this email.
>
> ------------------------------
>
> Message: 2
> Date: Sun, 12 Jan 2014 17:46:50 -0800
> From: "C. Titus Brown" <ctb at msu.edu>
> Subject: Re: [khmer] Partitioning: are resulting lumps that different
>         from each other?
> To: YiJin Liew <YiJin.Liew at KAUST.EDU.SA>
> Cc: khmer at lists.idyll.org
> Message-ID: <20140113014650.GB30578 at idyll.org>
> Content-Type: text/plain; charset=us-ascii
>
> On Sun, Jan 12, 2014 at 11:26:50PM +0000, YiJin Liew wrote:
> > Thanks for the reply! Could I ask a few follow-up questions regarding
> > the format of the .dist file then, as I can't seem to find a full
> > description of how the file is structured?
> >
> > take for example
> >
> > --- iowa-corn-50m.dist ---
> > 1 19750012 19750012 19750012
> > 2 2905935 22655947 25561882
> > 3 745747 23401694 27799123
> > 4 324017 23725711 29095191
> > 5 167228 23892939 29931331
> > <snip>
> > 2312 1 24356713 37268397
> > 2359 1 24356714 37270756
> > 2714 1 24356715 37273470
> > 3008 1 24356716 37276478
> > 3296530 1 24356717 40573008   <-- is this the most interesting group?
> >
> > I can sort of guess what the numbers mean, but let me double-check: does
> > this indicate that there's 19.8 million clusters that are "singlets";
> > followed by 2.9 million "doublets" etc.? Also, are columns 3 and 4
> > cumulative figures for clusters and sequences respectively?
>
> Exactly!
>
> > If you don't mind, could you elaborate briefly on how groups are created
> > based on the dist file? Judging from the line counts, I suspect that the
> > script fills the first group with singlets till --max-size is hit, if
> > not, continue filling with doublets, then move on to the next group once
> > --max-size is crossed?
>
> Yep.
>
> > On my data, I've tried blastn-ing the groups0000 and 0001 produced from
> > the partitioning process, but from the results I'd wager that they're
> > roughly the same - which was what prompted me to seek advice on how the
> > script functioned.
>
> Roughly the same... no, shouldn't be.  Those are probably spurious
> BLAST matches of some sort.  If partitioning worked (and at least from
> the examples above you got a lot of partitions) then those reads are
> from different components of the overall de Bruijn graph.
>
> cheers,
> --titus
>
>
>
> ------------------------------
>
> Message: 3
> Date: Mon, 13 Jan 2014 14:21:30 +0000
> From: YiJin Liew <YiJin.Liew at KAUST.EDU.SA>
> Subject: Re: [khmer] Partitioning: are resulting lumps that different
>         from each other?
> To: "titus at idyll.org" <titus at idyll.org>
> Cc: "khmer at lists.idyll.org" <khmer at lists.idyll.org>
> Message-ID: <52D3F66B.6030200 at kaust.edu.sa>
> Content-Type: text/plain; charset="utf-8"
>
> Thanks for the confirmations!
>
> To back my intuition, I repeated the rough blastn searches on
> groups0000, 0001 and 0002 produced from your Iowa corn dataset.
>
> What I did was:
> 1. blastn the 1 mil reads of each dataset against nt
> 2. for each read, pick the top hit with e-value of less than 1e-5
> 3. and do a bit of BASH-fu to record what species the read comes from,
> and show the top 20.
>
> Results!
>
> --- iowa-corn-50m.group0000_vs_nt.blastn.tsv ---
>     3173 Rhodanobacter sp.
>     2122 Rhodopseudomonas palustris
>     1971 Bradyrhizobium japonicum
>     1878 Streptomyces sp.
>     1822 Bradyrhizobium sp.
>     1202 Ramlibacter tataouinensis
>     1024 Variovorax paradoxus
>      890 Streptomyces fulvissimus
>      801 Uncultured bacterium
>      774 Intrasporangium calvum
>      687 Kribbella flavida
>      616 Agromonas oligotrophica
>      568 Actinoplanes sp.
>      539 Conexibacter woesei
>      518 Streptomyces griseus
>      508 Nocardioides sp.
>      502 Streptosporangium roseum
>      466 Micromonospora sp.
>      432 Verrucosispora maris
>      390 Clavibacter michiganensis
>
> --- iowa-corn-50m.group0001_vs_nt.blastn.tsv ---
>     5072 Rhodanobacter sp.
>     3322 Streptomyces sp.
>     2287 Bradyrhizobium japonicum
>     2164 Rhodopseudomonas palustris
>     2129 Bradyrhizobium sp.
>     1658 Ramlibacter tataouinensis
>     1486 Streptomyces fulvissimus
>     1325 Intrasporangium calvum
>     1271 Variovorax paradoxus
>      889 Uncultured bacterium
>      841 Streptomyces griseus
>      728 Agromonas oligotrophica
>      627 Nocardioides sp.
>      626 Actinoplanes sp.
>      555 Streptosporangium roseum
>      552 Kribbella flavida
>      546 Conexibacter woesei
>      517 Micromonospora sp.
>      474 Verrucosispora maris
>      434 Nitrobacter hamburgensis
>
> --- iowa-corn-50m.group0002_vs_nt.blastn.tsv ---
>     7099 Rhodanobacter sp.
>     4219 Streptomyces sp.
>     2559 Bradyrhizobium japonicum
>     2399 Bradyrhizobium sp.
>     2336 Rhodopseudomonas palustris
>     1931 Intrasporangium calvum
>     1852 Ramlibacter tataouinensis
>     1803 Streptomyces fulvissimus
>     1350 Variovorax paradoxus
>     1170 Streptomyces griseus
>      849 Agromonas oligotrophica
>      848 Uncultured bacterium
>      668 Conexibacter woesei
>      646 Actinoplanes sp.
>      615 Streptosporangium roseum
>      585 Nocardioides sp.
>      559 Micromonospora sp.
>      491 Clavibacter michiganensis
>      473 Kribbella flavida
>      454 Rubrivivax gelatinosus
>
>
> ... come to think of it, I should've repeated the same analysis on the
> biggest lump (group0007). However, from what I've observed from my own
> datasets, the biggest lump should be different from the smaller lumps.
> It's just that I feel that the small lumps are actually not very
> different from each other, and they're split mainly because of the -X
> setting. Hope this (very rough) analysis of mine illustrates what I'm
> trying to say!
>
> Yours
> Yi Jin
>
>
> ------------------------------
>
> _______________________________________________
> khmer mailing list
> khmer at lists.idyll.org
> http://lists.idyll.org/listinfo/khmer
>
>
> End of khmer Digest, Vol 12, Issue 7
> ************************************
>