[ged] Fwd: journal club this week

Mon Feb 21 10:15:39 PST 2011

Meet with Adina at noon this Wednesday.

Begin forwarded message:

> From: Adina Chuang Howe <adina.chuang at gmail.com>
> Date: February 20, 2011 9:58:02 PM EST
> To: Likit Preeyanon <preeyano at msu.edu>
> Subject: journal club this week
> 
> Hey,
> 
> I'm going to attempt to explain this -- or find another paper in the
> meantime.  Can you please forward this off to the gang?
> 
> -adina
> 
> ==================
> Abstract:
> 
> Efficient algorithms for accurate hierarchical clustering of huge
> datasets: tackling the entire protein space
> 
> Motivation: UPGMA (average linking) is probably the most popular
> algorithm for hierarchical data clustering, especially in
> computational biology. However, UPGMA requires the entire
> dissimilarity matrix in memory. Due to this prohibitive requirement,
> UPGMA is not scalable to very large datasets.
> 
> Application: We present a novel class of memory-constrained UPGMA
> (MC-UPGMA) algorithms. Given any practical memory size constraint,
> this framework guarantees the correct clustering solution without
> explicitly requiring all dissimilarities in memory. The algorithms are
> general and are applicable to any dataset. We present a data-dependent
> characterization of hardness and clustering efficiency. The presented
> concepts are applicable to any agglomerative clustering formulation.
> 
> Results: We apply our algorithm to the entire collection of protein
> sequences, to automatically build a comprehensive evolutionary-driven
> hierarchy of proteins from sequence alone. The newly created tree
> captures protein families better than state-of-the-art large-scale
> methods such as CluSTr, ProtoNet4 or single-linkage clustering. We
> demonstrate that leveraging the entire mass embodied in all sequence
> similarities allows to significantly improve on current protein family
> clusterings which are unable to directly tackle the sheer mass of this
> data. Furthermore, we argue that non-metric constraints are an
> inherent complexity of the sequence space and should not be
> overlooked. The robustness of UPGMA allows significant improvement,
> especially for multidomain proteins, and for large or divergent
> families.
> 
> Availability: A comprehensive tree built from all UniProt sequence
> similarities, together with navigation and classification tools will
> be made available as part of the ProtoNet service. A C++
> implementation of the algorithm is available on request.
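
For anyone who wants a refresher before the discussion, here is a minimal sketch of plain (naive) UPGMA, not the paper's MC-UPGMA. It is only meant to show why the abstract calls the memory requirement prohibitive: every pairwise dissimilarity sits in memory at once. The function name `upgma`, the merge-list return format, and the toy matrix are assumptions made for illustration, not anything taken from the paper or its C++ implementation.

```python
from itertools import combinations


def upgma(dist):
    """Naive UPGMA on a full symmetric dissimilarity matrix (list of lists).

    Leaves are numbered 0..n-1; each merge creates a new cluster id n, n+1, ...
    Returns a list of (cluster_a, cluster_b, dissimilarity) in merge order.
    """
    n = len(dist)
    # All n*(n-1)/2 dissimilarities are held in memory at once; this is the
    # requirement that stops plain UPGMA from scaling to huge datasets.
    d = {(i, j): dist[i][j] for i, j in combinations(range(n), 2)}
    size = {i: 1 for i in range(n)}  # number of leaves in each live cluster
    merges = []
    next_id = n
    while len(size) > 1:
        # Pick the closest pair of live clusters.
        (a, b), h = min(d.items(), key=lambda kv: kv[1])
        merges.append((a, b, h))
        new_size = size[a] + size[b]
        for c in size:
            if c in (a, b):
                continue
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            # Average linkage: size-weighted mean of the two old distances.
            d[(c, next_id)] = (size[a] * dac + size[b] * dbc) / new_size
        del d[(a, b)]
        del size[a], size[b]
        size[next_id] = new_size
        next_id += 1
    return merges


# Toy example with four items (made-up numbers).
toy = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
for a, b, h in upgma(toy):
    print(a, b, h)
```

For a few hundred thousand sequences the dictionary `d` above would already hold tens of billions of entries, which is the scaling problem the paper sets out to avoid.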

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Bioinformatics-2008-Loewenstein-i41-9.pdf
Type: application/pdf
Size: 489868 bytes
Desc: not available
URL: <http://lists.idyll.org/pipermail/ged-jclub/attachments/20110221/08c07b1a/attachment-0001.pdf>