[metagenomics-jclub] Metagenomics Journal Club, Wed, Nov 3, 930, PSB 271

Adina Chuang Howe adina.chuang at gmail.com
Thu Oct 28 08:32:04 PDT 2010


Hi all:

Jim Cole will present the attached papers discussing methods to classify
metagenomic data.  Many of us are dealing with the challenge of how to get
useful information from unassembled short reads in our projects.

Hope to see you there (9:30!),
Adina



Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated
Markov models

Metagenomics projects collect DNA from uncharacterized
environments that may contain thousands of species per
sample. One main challenge facing metagenomic analysis is
phylogenetic classification of raw sequence reads into groups
representing the same or similar taxa, a prerequisite for
genome assembly and for analyzing the biological diversity of a
sample. New sequencing technologies have made metagenomics
easier, by making sequencing faster, and more difficult, by
producing shorter reads than previous technologies. Classifying
sequences from reads as short as 100 base pairs has until now
been relatively inaccurate, requiring researchers to use older,
long-read technologies. We present Phymm, a classifier for
metagenomic data, that has been trained on 539 complete,
curated genomes and can accurately classify reads as short
as 100 base pairs, a substantial improvement over previous
composition-based classification methods. We also describe
how combining Phymm with sequence alignment algorithms
improves accuracy.

Metagenome Fragment Classification Using N-Mer Frequency Profiles

A vast amount of microbial sequencing data is being generated through
large-scale projects in ecology, agriculture, and human health. Efficient
high-throughput methods are needed to analyze the mass amounts of
metagenomic data, all DNA present in an environmental sample. A major
obstacle in metagenomics is the inability to obtain accuracy using
technology that yields short reads. We construct the unique N-mer frequency
profiles of 635 microbial genomes publicly available as of February 2008.
These profiles are used to
train a naive Bayes classifier (NBC) that can be used to identify the genome
of any fragment. We show that our method is comparable to BLAST for small 25
bp fragments but does not have the ambiguity of BLAST’s tied top scores.
We demonstrate that this approach is scalable to identify any fragment from
hundreds of genomes. It also performs quite well at the strain, species, and
genera levels and achieves strain resolution despite classifying ubiquitous
genomic fragments (gene and nongene regions). Cross-validation analysis
demonstrates that species-accuracy achieves 90% for highly-represented
species containing an average of 8 strains. We demonstrate that such a tool
can be used on the Sargasso Sea dataset, and our analysis shows that NBC can
be further enhanced.
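
For anyone who wants a concrete picture before the meeting, here is a rough
Python sketch of the idea in the second abstract: build an N-mer frequency
profile per genome, then assign a fragment to the genome whose profile gives
it the highest naive Bayes score. This is only an illustration, not the
authors' implementation. The function names, the toy genomes, the add-one
smoothing, and the uniform prior over genomes are all assumptions made here.

```python
import math
from collections import Counter


def nmer_counts(seq, n):
    """Count every overlapping N-mer in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def train_profiles(genomes, n):
    """Build one N-mer profile per genome: name -> (counts, total N-mers)."""
    profiles = {}
    for name, seq in genomes.items():
        counts = nmer_counts(seq, n)
        profiles[name] = (counts, sum(counts.values()))
    return profiles


def log_score(fragment, profile, n):
    """Sum of log N-mer probabilities of the fragment under one profile.

    Assumption: add-one smoothing over the 4**n possible N-mers, so that
    an N-mer never seen in a genome does not produce log(0).
    """
    counts, total = profile
    vocab = 4 ** n
    fragment = fragment.upper()
    score = 0.0
    for i in range(len(fragment) - n + 1):
        nmer = fragment[i:i + n]
        score += math.log((counts[nmer] + 1) / (total + vocab))
    return score


def classify(fragment, profiles, n):
    """Return the best-scoring genome name (uniform prior assumed)."""
    return max(profiles, key=lambda name: log_score(fragment, profiles[name], n))


# Toy "genomes", invented purely for illustration.
genomes = {
    "at_rich": "ATATTAATATTTAAATATAT" * 5,
    "gc_rich": "GCGGCCGCGCCGGGCGCGCC" * 5,
}
profiles = train_profiles(genomes, n=3)

print(classify("ATATTTAAT", profiles, n=3))
print(classify("GCCGCGG", profiles, n=3))
```

With real data the profiles would be built from full reference genomes and a
larger N, and the scoring step would be repeated for every read.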
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Nat Meth 2009 Salzberg.pdf
Type: application/pdf
Size: 297394 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/metagenomics-jclub/attachments/20101028/91eac716/attachment-0002.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Adv Bioinformatics 2008 Sokhansanj.pdf
Type: application/pdf
Size: 1765274 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/metagenomics-jclub/attachments/20101028/91eac716/attachment-0003.pdf>

