[khmer] Partitioning: are resulting lumps that different from each other?

Wed Jan 15 20:35:51 PST 2014

On Wed, Jan 15, 2014 at 03:37:40PM +0000, YiJin Liew wrote:
> Hi Adina,
> 
> Thanks for the reply, and for clarifying the structure of the dist file!
> I apologise for using the terms "partition" and "group" loosely in the past.
> 
> Unless I have misinterpreted your reply, it seems that my blast analysis
> has been misunderstood. I DID NOT blast the Iowa corn groups against
> each other. I blasted them, individually, against *nt*, with the hopes
> of (crudely and quickly) checking the composition of reads in each group
> (i.e. 10% came from organism A, 5% came from organism B, 3% from C,
> ...). I concur with you that blasting them against each other makes
> little sense.
> 
> If partitioning worked well, a reasonable hypothesis would be that
> different groups would have markedly different compositions of reads -
> reads from organism A might be a majority in group0000 for instance,
> while reads from B predominate in group0001. What I see from my tests,
> however, seems to refute this: groups0000, 0001 and 0002 have very
> similar read compositions. You get lots of Rhodobacter stuff, followed
> by Streptomyces, then Bradyrhizobium, etc.
> 
> My issue with this is that khmer suggests that the individual group
> files should be assembled individually. What I suspect is that you'd
> probably get very similar assembly outputs for the groups that hit
> --max-size (I haven't tried that out, though). I have no doubt that the
> biggest group would be different from the others, though. My dataset
> produced a small group and a big group post-partitioning, and I obtained
> very different sequence proportions when I carried out a similar
> analysis on them.
> 
> Let me know if you'd like more details on how my blast test worked. Thanks!

Hi, Yi Jin,

I see a few possibilities; I'll give them in reverse order of likelihood,
just to switch it up ;)

First, if you have a bunch of reasonably polymorphic strain variants OR
extremely high error rate sequencing, you might see reads from the same
"species" broken up into many partitions.

Second, if you have an organism sequenced to only low coverage, then its
reads may well end up broken into many, many partitions, due to breaks in
the assembly graph from low coverage.  This could also give you the results
you're seeing.

Third, you could have largely unknown organisms in your sample and BLAST
might be reporting spurious matches to organisms in the database based on
common sequences (repeats of some kind, would be my guess).  To put it another
way, systematic bias in BLAST matching might be causing what you see. This
would be exacerbated by BLAST's insistence on finding matches and a failure on
its part to take into account the number of queries.
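
As a rough illustration of the number-of-queries point: a BLAST e-value is the
expected number of chance hits for a single query, so across N queries the
expected number of chance hits at a fixed cutoff is roughly N times that
cutoff.  The sketch below just does that arithmetic; the function names and
the numbers are hypothetical, not taken from this data set.

```python
def expected_chance_hits(num_queries: int, evalue_cutoff: float) -> float:
    """Rough expected number of chance hits across all queries.

    Assumes each query is searched independently against the same
    database with the same e-value cutoff.
    """
    return num_queries * evalue_cutoff


def per_query_cutoff(num_queries: int, target_total: float) -> float:
    """E-value cutoff per query that keeps the expected total number of
    chance hits near target_total (a Bonferroni-style correction)."""
    return target_total / num_queries


if __name__ == "__main__":
    n_reads = 1_000_000  # hypothetical number of reads BLASTed from one group
    print(expected_chance_hits(n_reads, 1e-3))
    print(per_query_cutoff(n_reads, 1.0))
```

So with a million reads and a cutoff of 1e-3, on the order of a thousand hits
could be chance alone, which may be enough to make groups look alike.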

I'm pretty sure it's #2 or #3 for this data set, with a really strong
likelihood of just being bad BLAST matches.  I seem to recall that we
used bowtie1 to map everything against known genomes and found < 1% mapping,
which generally indicates low-to-no coverage.

If you want to follow this up to check on my intuition, I would suggest one or
two things.

a) use bowtie or bwa [0] to map everything to one of the reference genomes,
   and then use 'samtools tview' or Tablet to look at where the reads are
   mapping in the reference genome.  If you find reads stacked up at one or
   two locations rather than spread evenly across the genome, then the reads
   are mapping to either a repeat or a highly conserved DNA element.

b) assemble two groups, and then compare the longer (assembled) sequences to
   each other with a stricter BLAST e-value cutoff.  If there is indeed shared
   underlying genome sequence, it should assemble and you will see matches at
   a much lower (more significant) e-value.
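
For (a), if eyeballing tview gets tedious, one way to put a number on "stacked
up" is to bin the mapping positions and ask what fraction of mapped reads land
in the busiest few bins.  This is only a sketch: it assumes uncompressed SAM
text mapped against a single reference sequence, and the function names, bin
size and top_k are arbitrary illustrative choices.

```python
from collections import Counter


def mapped_positions(sam_lines):
    """Yield the 1-based leftmost mapping position of each mapped read."""
    for line in sam_lines:
        if line.startswith("@"):      # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, pos = int(fields[1]), int(fields[3])
        if flag & 0x4:                # FLAG bit 0x4: read is unmapped
            continue
        yield pos


def fraction_in_top_bins(positions, bin_size=1000, top_k=2):
    """Fraction of mapped reads that fall in the top_k most populated bins."""
    counts = Counter((pos - 1) // bin_size for pos in positions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(n for _, n in counts.most_common(top_k))
    return top / total


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python pileup_check.py < mapped.sam
    print(fraction_in_top_bins(mapped_positions(sys.stdin)))
```

A value near 1 means nearly all mapped reads sit in one or two windows, which
is what you'd expect from a repeat or conserved element; reads spread evenly
over a genome much longer than top_k * bin_size should give a small fraction.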

More generally:

We're reasonably confident that partitioning works in the way that Adina
described, because we have two papers that explored it in detail -- see

http://www.pnas.org/content/109/33/13272.full

and

http://arxiv.org/abs/1212.2832

The former shows that the partitions do not, in fact, assemble together --
you get identical results from the abyss assembler when you assemble the
partitions separately.

The latter shows that partitions correlate quite well to species in a
situation where we know what the species are.

That having been said, we could have broken our software or could easily
be missing something, so if you do track this down further with either (a)
or (b) then I'd love to hear about it.

cheers,
--titus

[0] ged.msu.edu/angus/tutorials-2013/bwa-tutorial.html
> 
> Yours
> Yi Jin
> _______________________________________________
> khmer mailing list
> khmer at lists.idyll.org
> http://lists.idyll.org/listinfo/khmer

-- 
C. Titus Brown, ctb at msu.edu