<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Meet with Adina at noon this Wednesday.<br><div><br><div>Begin forwarded message:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px;"><span style="font-family:'Helvetica'; font-size:medium; color:rgba(0, 0, 0, 1);"><b>From: </b></span><span style="font-family:'Helvetica'; font-size:medium;">Adina Chuang Howe &lt;<a href="mailto:adina.chuang@gmail.com">adina.chuang@gmail.com</a>&gt;<br></span></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px;"><span style="font-family:'Helvetica'; font-size:medium; color:rgba(0, 0, 0, 1);"><b>Date: </b></span><span style="font-family:'Helvetica'; font-size:medium;">February 20, 2011 9:58:02 PM EST<br></span></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px;"><span style="font-family:'Helvetica'; font-size:medium; color:rgba(0, 0, 0, 1);"><b>To: </b></span><span style="font-family:'Helvetica'; font-size:medium;">Likit Preeyanon &lt;<a href="mailto:preeyano@msu.edu">preeyano@msu.edu</a>&gt;<br></span></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px;"><span style="font-family:'Helvetica'; font-size:medium; color:rgba(0, 0, 0, 1);"><b>Subject: </b></span><span style="font-family:'Helvetica'; font-size:medium;"><b>journal club this week</b><br></span></div><br><div>Hey,<br><br>I'm going to attempt to explain this -- or find another paper in the<br>meantime. 
&nbsp;Can you please forward this off to the gang?<br><br>-adina<br><br>==================<br>Abstract:<br><br>Efficient algorithms for accurate hierarchical clustering of huge<br>datasets: tackling the entire protein space<br><br>Motivation: UPGMA (average linking) is probably the most popular<br>algorithm for hierarchical data clustering, especially in<br>computational biology. However, UPGMA requires the entire<br>dissimilarity matrix in memory. Due to this prohibitive requirement,<br>UPGMA is not scalable to very large datasets.<br><br>Application: We present a novel class of memory-constrained UPGMA<br>(MC-UPGMA) algorithms. Given any practical memory size constraint,<br>this framework guarantees the correct clustering solution without<br>explicitly requiring all dissimilarities in memory. The algorithms are<br>general and are applicable to any dataset. We present a data-dependent<br>characterization of hardness and clustering efficiency. The presented<br>concepts are applicable to any agglomerative clustering formulation.<br><br>Results: We apply our algorithm to the entire collection of protein<br>sequences, to automatically build a comprehensive evolutionary-driven<br>hierarchy of proteins from sequence alone. The newly created tree<br>captures protein families better than state-of-the-art large-scale<br>methods such as CluSTr, ProtoNet4 or single-linkage clustering. We<br>demonstrate that leveraging the entire mass embodied in all sequence<br>similarities allows to significantly improve on current protein family<br>clusterings which are unable to directly tackle the sheer mass of this<br>data. Furthermore, we argue that non-metric constraints are an<br>inherent complexity of the sequence space and should not be<br>overlooked. 
The robustness of UPGMA allows significant improvement,<br>especially for multidomain proteins, and for large or divergent<br>families.<br><br>Availability: A comprehensive tree built from all UniProt sequence<br>similarities, together with navigation and classification tools will<br>be made available as part of the ProtoNet service. A C++<br>implementation of the algorithm is available on request.<br></div></blockquote></div></body></html>