[ged] Journal Club paper for next week

Jason Pell jason.pell at gmail.com
Tue Sep 6 09:59:44 PDT 2011


Hi everyone,

Next week, I will be presenting the paper "Efficient construction of
an assembly string graph using the FM-index" by Jared Simpson (author
of ABySS assembler) and Richard Durbin (well-known bioinformatician).
Abstract and link below.

Motivation: Sequence assembly is a difficult problem whose importance
has grown again recently as the cost of sequencing has dramatically
dropped. Most new sequence assembly software has started by building a
de Bruijn graph, avoiding the overlap-based methods used previously
because of the computational cost and complexity of these with very
large numbers of short reads. Here, we show how to use suffix
array-based methods that have formed the basis of recent very fast
sequence mapping algorithms to find overlaps and generate assembly
string graphs asymptotically faster than previously described
algorithms.

Results: Standard overlap assembly methods have time complexity O(N2),
where N is the sum of the lengths of the reads. We use the
Ferragina-Manzini index (FM-index) derived from the Burrows-Wheeler
transform to find overlaps of length at least τ among a set of reads.
As well as an approach that finds all overlaps then implements
transitive reduction to produce a string graph, we show how to output
directly only the irreducible overlaps, significantly shrinking memory
requirements and reducing compute time to O(N), independent of depth.
Overlap-based assembly methods naturally handle mixed length read
sets, including capillary reads or long reads promised by the third
generation sequencing technologies. The algorithms we present here
pave the way for overlap-based assembly approaches to be developed
that scale to whole vertebrate genome de novo assembly.

http://bioinformatics.oxfordjournals.org/content/26/12/i367.abstract

Jason



More information about the ged-jclub mailing list