[ged] Fwd: journal club this week

Adina Chuang Howe adina.chuang at gmail.com
Mon Feb 21 12:09:59 PST 2011


I changed my mind... I'll be presenting this paper instead.  Actually,
it will be an overall presentation on how 16S rRNA gene sequences have
been used to estimate species richness, and how sample size (amount of
sequencing) figures into estimates of total species diversity.  Much
like how QP has been trying to estimate sequencing coverage based on
k-mers...

I will be showing work from these three papers -- no need to read
them all; they are just here for reference:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1324986/ (attached)
http://www.ncbi.nlm.nih.gov/pubmed/18585806?ordinalpos=1&itool=PPMCLayout.PPMCAppController.PPMCArticlePage.PPMCPubmedRA&linkpos=2
http://www.ncbi.nlm.nih.gov/pubmed/19561178

Adina

On Mon, Feb 21, 2011 at 1:15 PM, Likit Preeyanon <preeyano at msu.edu> wrote:
> Meet with Adina at noon this Wednesday.
>
> Begin forwarded message:
>
> From: Adina Chuang Howe <adina.chuang at gmail.com>
> Date: February 20, 2011 9:58:02 PM EST
> To: Likit Preeyanon <preeyano at msu.edu>
> Subject: journal club this week
>
> Hey,
>
> I'm going to attempt to explain this -- or find another paper in the
> meantime.  Can you please forward this off to the gang?
>
> -adina
>
> ==================
> Abstract:
>
> Efficient algorithms for accurate hierarchical clustering of huge
> datasets: tackling the entire protein space
>
> Motivation: UPGMA (average linking) is probably the most popular
> algorithm for hierarchical data clustering, especially in
> computational biology. However, UPGMA requires the entire
> dissimilarity matrix in memory. Due to this prohibitive requirement,
> UPGMA is not scalable to very large datasets.
>
> Application: We present a novel class of memory-constrained UPGMA
> (MC-UPGMA) algorithms. Given any practical memory size constraint,
> this framework guarantees the correct clustering solution without
> explicitly requiring all dissimilarities in memory. The algorithms are
> general and are applicable to any dataset. We present a data-dependent
> characterization of hardness and clustering efficiency. The presented
> concepts are applicable to any agglomerative clustering formulation.
>
> Results: We apply our algorithm to the entire collection of protein
> sequences, to automatically build a comprehensive evolutionary-driven
> hierarchy of proteins from sequence alone. The newly created tree
> captures protein families better than state-of-the-art large-scale
> methods such as CluSTr, ProtoNet4 or single-linkage clustering. We
> demonstrate that leveraging the entire mass embodied in all sequence
> similarities allows to significantly improve on current protein family
> clusterings which are unable to directly tackle the sheer mass of this
> data. Furthermore, we argue that non-metric constraints are an
> inherent complexity of the sequence space and should not be
> overlooked. The robustness of UPGMA allows significant improvement,
> especially for multidomain proteins, and for large or divergent
> families.
>
> Availability: A comprehensive tree built from all UniProt sequence
> similarities, together with navigation and classification tools will
> be made available as part of the ProtoNet service. A C++
> implementation of the algorithm is available on request.
>
>
>
> _______________________________________________
> ged-jclub mailing list
> ged-jclub at lists.idyll.org
> http://lists.idyll.org/listinfo/ged-jclub
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hong 2006 Predicting microbial species richness.pdf
Type: application/pdf
Size: 291339 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/ged-jclub/attachments/20110221/7255f3a0/attachment-0001.pdf>

