[khmer-announce] khmer v2.0 released

Michael R. Crusoe crusoe at ucdavis.edu
Tue Sep 8 08:31:21 PDT 2015


This is the v2.0 release of khmer and the first from our new lab at the
University of California, Davis. It features Python 3 compatibility,
streaming I/O from Unix Pipes, mixed-pair sequence file format support, and
a new parameter to simplify memory usage. We also have a software paper in
press describing the project, and the citation reminders have been updated
to reflect that.

Overall there are an additional 2,380 lines of Python code (mostly tests)
and 283 fewer lines of C++ (despite adding features). This release is the
product of over 1,000 commits to the codebase since v1.4.

Documentation is at https://khmer.readthedocs.org/en/v2.0/

New items of note:

New behavior

Streaming I/O from Unix Pipes

All scripts now accept input from named pipes (like /dev/stdin, or those
created using <( list ) process substitution) and unnamed pipes (like output
piped in from another program with |). The STDIN stream can also be
specified using a single dash: -. #1186
<https://github.com/dib-lab/khmer/pull/1186> @mr-c
<https://github.com/mr-c> #1042 <https://github.com/dib-lab/khmer/pull/1042>
#763 <https://github.com/dib-lab/khmer/issues/763> @SherineAwad
<https://github.com/SherineAwad> #1085
<https://github.com/dib-lab/khmer/pull/1085> @ctb <https://github.com/ctb>
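
As a minimal sketch of the single-dash convention described above (this is
not khmer's actual implementation, and the helper name open_input is
hypothetical), a script can map - to standard input like this:

```python
import io
import sys


def open_input(filename, stdin=None):
    """Return a readable text stream for filename.

    A single dash selects standard input; anything else (regular files,
    /dev/stdin, FIFOs created by process substitution) is opened by path.
    """
    if filename == "-":
        return stdin if stdin is not None else sys.stdin
    return open(filename, "r")


# Example: pretend that two FASTA lines arrived on standard input.
fake_stdin = io.StringIO(">read1\nACGT\n")
lines = [line.rstrip("\n") for line in open_input("-", stdin=fake_stdin)]
```

The optional stdin argument exists only so the example can be exercised
without a real pipe.
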
New parameter for memory usage, and/or tablesize/number of table parameters.

There is now a -M/--max-memory-usage parameter that sets the number of
tables (-N/--n_tables) and tablesize (-x/--max-tablesize) parameters
automatically to match the desired memory usage. #1106
<https://github.com/dib-lab/khmer/pull/1106> #621
<https://github.com/dib-lab/khmer/pull/621> #1126
<https://github.com/dib-lab/khmer/pull/1126> #390
<https://github.com/dib-lab/khmer/issues/390> #1117
<https://github.com/dib-lab/khmer/issues/1117> #1055
<https://github.com/dib-lab/khmer/issues/1055> #1050
<https://github.com/dib-lab/khmer/pull/1050> #1214
<https://github.com/dib-lab/khmer/pull/1214> #1179
<https://github.com/dib-lab/khmer/issues/1179> #1133
<https://github.com/dib-lab/khmer/pull/1133> #1145
<https://github.com/dib-lab/khmer/pull/1145> @ctb <https://github.com/ctb>
@qingpeng <https://github.com/qingpeng> @bocajnotnef
<https://github.com/bocajnotnef>
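
To illustrate how a single memory budget can determine the table size once
the number of tables is known, here is a rough sketch. It assumes one byte
per countgraph entry and one bit per nodegraph entry; the function name
table_size_for_memory is hypothetical and the real calculation in khmer may
differ in detail.

```python
def table_size_for_memory(max_memory_bytes, n_tables, graph_type):
    """Split a memory budget evenly across n_tables tables.

    Assumption: a countgraph entry costs 1 byte, a nodegraph entry 1 bit.
    """
    if graph_type == "countgraph":
        entries_per_byte = 1
    elif graph_type == "nodegraph":
        entries_per_byte = 8
    else:
        raise ValueError("unknown graph type: %r" % (graph_type,))
    return (max_memory_bytes * entries_per_byte) // n_tables


# A 4 GB budget spread over 4 tables.
count_size = table_size_for_memory(4 * 10**9, 4, "countgraph")
node_size = table_size_for_memory(4 * 10**9, 4, "nodegraph")
```
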
Digital normalization script now supports mixed paired and unpaired read
input

normalize-by-median.py now supports mixed paired and unpaired (or
"broken-paired") input. Behavior can be forced to either treat all reads as
singletons or to require all reads be properly paired using --force_single
 or --paired, respectively. If --paired is set, --unpaired-reads can be
used to include a file of unpaired reads. The unpaired reads will be
examined after all of the other sequence files. normalize-by-median.py now
has a --quiet option to reduce the amount of output. #1200
<https://github.com/dib-lab/khmer/pull/1200> @bocajnotnef
<https://github.com/bocajnotnef>
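
The idea of "broken-paired" input can be sketched as follows. This is an
illustration only: it assumes mates are adjacent in the stream and are named
x/1 and x/2, and the function name broken_paired_reader is used here as a
hypothetical helper, not as a reference to khmer's own code.

```python
def broken_paired_reader(names):
    """Yield (first, second_or_None) tuples from an interleaved stream.

    Assumption: mates are adjacent and named 'x/1' followed by 'x/2'.
    """
    pending = None
    for name in names:
        if pending is not None:
            is_pair = (pending.endswith("/1") and name.endswith("/2")
                       and pending[:-2] == name[:-2])
            if is_pair:
                yield (pending, name)
                pending = None
                continue
            # The held read has no mate: emit it as a singleton.
            yield (pending, None)
        pending = name
    if pending is not None:
        yield (pending, None)


stream = ["a/1", "a/2", "b/1", "c/1", "c/2", "d/2"]
records = list(broken_paired_reader(stream))
```

In this picture, a strict paired mode would reject any record whose second
element is None, while a forced-single mode would emit every read on its own.
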
Mixed-pair sequence file format support

split-paired-reads.py --output-orphaned/-0 has been added to allow for
orphaned reads and give them a file to be sorted into. #847
<https://github.com/dib-lab/khmer/issues/847> #1164
<https://github.com/dib-lab/khmer/pull/1164> @ctb <https://github.com/ctb>
Scripts now output columnar data in CSV format by default

All scripts that output any kind of columnar data now do so in CSV format,
with headers. Previously this had to be enabled with --csv. (Affects
abundance-dist-single.py, abundance-dist.py, count-median.py,
and count-overlap.py.) normalize-by-median.py --report also now outputs in
CSV format. #1011 <https://github.com/dib-lab/khmer/issues/1011> #1180
<https://github.com/dib-lab/khmer/pull/1180> @ctb <https://github.com/ctb>
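
Because the output now carries a header row, it can be read by column name
with a standard CSV parser. The sketch below uses made-up data; the column
names and numbers are illustrative assumptions, so check the header of your
own output file before relying on them.

```python
import csv
import io

# Hypothetical output in the spirit of abundance-dist.py; the column
# names and values are assumptions made for this example.
example = io.StringIO(
    "abundance,count,cumulative,cumulative_fraction\n"
    "1,96,96,0.98\n"
    "2,2,98,1.0\n"
)

rows = list(csv.DictReader(example))
total_kmers = sum(int(row["count"]) for row in rows)
```
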
Reservoir sampling script extracts paired reads by default

sample-reads-randomly.py now retains pairs in the output, by default. This
can be overridden to match previous behavior with --force_single.
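
For readers unfamiliar with reservoir sampling, here is a generic sketch of
the textbook algorithm. It is not taken from sample-reads-randomly.py; it
only shows why sampling whole pairs (here, tuples) keeps mates together.

```python
import random


def reservoir_sample(items, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(items):
        if n < k:
            reservoir.append(item)
        else:
            # Item n replaces a random slot with probability k / (n + 1).
            j = rng.randint(0, n)
            if j < k:
                reservoir[j] = item
    return reservoir


# Each element is a whole pair, so mates are kept or dropped together.
pairs = [("r%d/1" % i, "r%d/2" % i) for i in range(1000)]
sample = reservoir_sample(pairs, 10, rng=random.Random(42))
```
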
Most input and output files can be compressed

We support gzip and bzip2 input and output file compression everywhere that
it makes sense. #505 <https://github.com/dib-lab/khmer/issues/505> #747
<https://github.com/dib-lab/khmer/pull/747> @bocajnotnef
<https://github.com/bocajnotnef>
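
One common way to recognize compressed input without trusting the file
extension is to look at the leading "magic" bytes. The sketch below shows
that general technique; it is an assumption that this resembles what khmer
does internally, and the helper name sniff_compression is hypothetical.

```python
import bz2
import gzip

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of every gzip stream
BZIP2_MAGIC = b"BZh"       # first three bytes of every bzip2 stream


def sniff_compression(first_bytes):
    """Guess the compression type from the leading bytes of a file."""
    if first_bytes.startswith(GZIP_MAGIC):
        return "gzip"
    if first_bytes.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "none"


payload = b">read1\nACGT\n"
kinds = [
    sniff_compression(gzip.compress(payload)),
    sniff_compression(bz2.compress(payload)),
    sniff_compression(payload),
]
```
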
New scripts

Estimate number of unique k-mers

unique-kmers.py estimates the k-mer cardinality of a dataset using the
HyperLogLog probabilistic data structure. This allows very low memory
consumption, which can be configured through an expected error rate. Even
with low error rate (and higher memory consumption), it is still much more
efficient than exact counting and alternative methods. It supports
multicore processing (using OpenMP) and streaming, and so can be used in
conjunction with other scripts (like normalize-by-median.py and
filter-abund.py). This script is the work of @luizirber
<https://github.com/luizirber> and the subject of a paper in draft. #390
<https://github.com/dib-lab/khmer/issues/390> #1239
<https://github.com/dib-lab/khmer/issues/1239> #1252
<https://github.com/dib-lab/khmer/pull/1252> #1053
<https://github.com/dib-lab/khmer/pull/1053> #1072
<https://github.com/dib-lab/khmer/pull/1072> #1145
<https://github.com/dib-lab/khmer/pull/1145> #1176
<https://github.com/dib-lab/khmer/pull/1176> #1207
<https://github.com/dib-lab/khmer/pull/1207> #1204
<https://github.com/dib-lab/khmer/pull/1204> #1245
<https://github.com/dib-lab/khmer/pull/1245>
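
To give a feel for how HyperLogLog trades memory for accuracy, here is a
small pure-Python sketch of the basic estimator. It is not the code behind
unique-kmers.py: the SHA-1 based hash, the default of 2**12 registers, and
the function names are all choices made for this example, and unlike khmer
it does not merge a k-mer with its reverse complement. With m registers the
expected relative error is roughly 1.04 / sqrt(m), which is how an error
rate translates into a memory requirement.

```python
import hashlib
import math
import random


def hll_estimate(items, p=12):
    """Estimate the number of distinct strings using 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        digest = hashlib.sha1(item.encode("ascii")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        index = h >> (64 - p)                      # top p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)           # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1    # position of leftmost 1 bit
        registers[index] = max(registers[index], rank)
    alpha = 0.7213 / (1 + 1.079 / m)               # bias constant for m >= 128
    estimate = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros:
        estimate = m * math.log(m / zeros)         # small-range correction
    return estimate


def kmers(sequence, k):
    """Yield every k-mer of sequence, in order."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))


rng = random.Random(1)
sequence = "".join(rng.choice("ACGT") for _ in range(50000))
exact = len(set(kmers(sequence, 21)))
approx = hll_estimate(kmers(sequence, 21))
```

The exact count needs a set holding every k-mer, while the estimate needs
only the fixed list of registers regardless of input size.
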
Incompatible changes

New datastructure and script names

For clarity, the Count-Min Sketch based data structure previously known as
"counting_hash" or "counting_table" and variations of these is now known as
countgraph. Likewise, the Bloom Filter based data structure previously
known as "hashbits", "presence_table" and variations of these is now known
as nodegraph. Many options relating to table have been changed to graph.
#1112 <https://github.com/dib-lab/khmer/issues/1112> #1209
<https://github.com/dib-lab/khmer/pull/1209> @mr-c <https://github.com/mr-c>
Binary file formats have changed

All binary khmer formats (presence tables, counting tables, tag sets, stop
tags, and partition subsets) have changed. Files are now prepended with
the string OXLI to indicate that they are from this project.
#519 <https://github.com/dib-lab/khmer/issues/519> #1031
<https://github.com/dib-lab/khmer/pull/1031> @mr-c <https://github.com/mr-c>
 #1159 <https://github.com/dib-lab/khmer/pull/1159> @luizirber
<https://github.com/luizirber>

Files of the above types made in previous versions of khmer are not
compatible with v2.0; the reverse is also true.

In addition to the OXLI string, the Nodegraph and Countgraph file format
now includes the number of occupied bins. See
http://khmer.readthedocs.org/en/v2.0/dev/binary-file-formats for details.
#1093 <https://github.com/dib-lab/khmer/issues/1093> @ctb
<https://github.com/ctb> @mr-c <https://github.com/mr-c> #1101
<https://github.com/dib-lab/khmer/pull/1101> #1103
<https://github.com/dib-lab/khmer/issues/1103> @kdmurray91
<https://github.com/kdmurray91>
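
A quick way to tell whether a binary file carries the new signature is to
look at its first four bytes. The sketch below checks only the OXLI string;
it does not parse the rest of the header, the bytes after the signature are
made up, and the helper name has_oxli_signature is hypothetical.

```python
import io

SIGNATURE = b"OXLI"


def has_oxli_signature(stream):
    """Return True if the binary stream starts with the OXLI signature."""
    return stream.read(len(SIGNATURE)) == SIGNATURE


# In-memory stand-ins for files; the trailing bytes are placeholders.
new_style = io.BytesIO(b"OXLI" + b"\x00" * 8)
old_style = io.BytesIO(b"\x04\x01" + b"\x00" * 8)
results = (has_oxli_signature(new_style), has_oxli_signature(old_style))
```
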
load-graph.py no longer appends .pt to the specified filename

Previously, load-graph.py appended a .pt extension to the specified output
filename and partition-graph.py appended a .pt to the given input filename.
Now, load-graph.py writes to the specified output filename and
partition-graph.py does not append a .pt to the given input filename. #1189
<https://github.com/dib-lab/khmer/issues/1189> #747
<https://github.com/dib-lab/khmer/pull/747> @bocajnotnef
<https://github.com/bocajnotnef>
Some reporting options have been turned always on

The total number of unique k-mers will always be reported every time a new
countgraph is made. The --report-total-kmers option has been removed from
abundance-dist-single.py, filter-abund-single.py, and normalize-by-median.py to
reflect this. Likewise with --write-fp-rate for load-into-counting.py and
load-graph.py; the false positive rate will always be written to the .info
 files. #1097 <https://github.com/dib-lab/khmer/issues/1097> #1180
<https://github.com/dib-lab/khmer/pull/1180> @ctb <https://github.com/ctb>
An uncommon error recovery routine was removed

To simplify the codebase --save-on-failure and its helper option
--dump-frequency have been removed from normalize-by-median.py.
Single file output option names have been normalized

--out is now --output for both normalize-by-median.py and
trim-low-abund.py. #1188 <https://github.com/dib-lab/khmer/issues/1188> #1164
<https://github.com/dib-lab/khmer/pull/1164> @ctb <https://github.com/ctb>
Miscellaneous changes

The common option --min-tablesize was renamed to --max-tablesize to reflect
this more desirable behavior.

In conjunction with the new split-paired-reads.py --output-orphaned option,
the option --force-paired/-p has been eliminated.

As CSV format is now the default, the --csv option has been removed.
Removed script

count-overlap.py
<http://khmer.readthedocs.org/en/v1.4.1/user/scripts.html#count-overlap-py> has
been removed.
Notable bugs fixed/issues closed:

When normalize-by-median.py decided to keep both parts of a pair of reads,
it was only adding the k-mers & counts from one to the countgraph. #1000
<https://github.com/dib-lab/khmer/issues/1000> #1010
<https://github.com/dib-lab/khmer/pull/1010> @drtamermansour
<https://github.com/drtamermansour> @bocajnotnef
<https://github.com/bocajnotnef>

The partition map file format was not robust to truncation and would hang
waiting for more data. #437 <https://github.com/dib-lab/khmer/issues/437>
#1037 <https://github.com/dib-lab/khmer/issues/1037> #1048
<https://github.com/dib-lab/khmer/pull/1048> @ctb <https://github.com/ctb>

extract-paired-reads.py and split-paired-reads.py no longer create default
files when the user supplies filename(s). #1005
<https://github.com/dib-lab/khmer/issues/1005> #1132
<https://github.com/dib-lab/khmer/pull/1132> @kdmurray91
<https://github.com/kdmurray91>
Additional fixes/features

find-knots.py was missing a --force option and unit tests. #358
<https://github.com/dib-lab/khmer/issues/358> #1078
<https://github.com/dib-lab/khmer/pull/1078> @ctb <https://github.com/ctb>
The check for an excessively high false-positive rate has also received a
--force option. #1168 <https://github.com/dib-lab/khmer/pull/1168>
@bocajnotnef <https://github.com/bocajnotnef>

A bug leading to an infinite loop with large gzipped countgraphs was found
and fixed. #1038 <https://github.com/dib-lab/khmer/issues/1038> #1043
<https://github.com/dib-lab/khmer/pull/1043> @kdmurray91
<https://github.com/kdmurray91>

All scripts that create nodegraphs or countgraphs report the total number
of unique k-mers. #491 <https://github.com/dib-lab/khmer/pull/491> #609
<https://github.com/dib-lab/khmer/issues/609> #429
<https://github.com/dib-lab/khmer/issues/429> @mr-c
<https://github.com/mr-c>

Read pairs from SRA are fully supported. Reported by @macmanes
<https://github.com/macmanes> in #1027
<https://github.com/dib-lab/khmer/issues/1027>, fixed by @kdmurray91
<https://github.com/kdmurray91> @SherineAwad
<https://github.com/SherineAwad> in #1173
<https://github.com/dib-lab/khmer/pull/1173> #1088
<https://github.com/dib-lab/khmer/pull/1088>
Of interest to users:

Added Hashtable::get_kmers(), get_kmer_hashes(), and get_kmer_counts() with
corresponding CPython functions. #1047
<https://github.com/dib-lab/khmer/issues/1047> #1049
<https://github.com/dib-lab/khmer/pull/1049> @ctb <https://github.com/ctb>

The DEFAULT_DESIRED_COVERAGE for normalize-by-median.py is now 20. #1073
<https://github.com/dib-lab/khmer/issues/1073> #1081
<https://github.com/dib-lab/khmer/pull/1081> @ctb <https://github.com/ctb>

FIFOs are no longer seen as empty. #1147
<https://github.com/dib-lab/khmer/issues/1147> #1163
<https://github.com/dib-lab/khmer/pull/1163> @bocajnotnef
<https://github.com/bocajnotnef>

When the k-size is requested to be larger than 32 (which is unsupported) a
helpful error message is reported. #1094
<https://github.com/dib-lab/khmer/issues/1094> #1050
<https://github.com/dib-lab/khmer/pull/1050> @ctb <https://github.com/ctb>

We try to report more helpfully during errors, such as suggesting the
--force option when outputs files already exist. #1162
<https://github.com/dib-lab/khmer/issues/1162> #1170
<https://github.com/dib-lab/khmer/pull/1170> @bocajnotnef
<https://github.com/bocajnotnef>

There is a paper related to trim-low-abund.py: "Crossing the streams: a
framework for streaming analysis of short DNA sequencing reads" and it has
been added to the CITATION file and program output. #1180
<https://github.com/dib-lab/khmer/pull/1180> #1130
<https://github.com/dib-lab/khmer/issues/1130> @ctb <https://github.com/ctb>

We have dropped support for Python 2.6. #1009
<https://github.com/dib-lab/khmer/issues/1009> #1180
<https://github.com/dib-lab/khmer/pull/1180> @ctb <https://github.com/ctb>

Our user documentation got a bit out of date and has been updated. #1156
<https://github.com/dib-lab/khmer/issues/1156> #1247
<https://github.com/dib-lab/khmer/pull/1247> @bocajnotnef
<https://github.com/bocajnotnef> @mr-c <https://github.com/mr-c> #1104
<https://github.com/dib-lab/khmer/pull/1104> @kdmurray91
<https://github.com/kdmurray91> #1267
<https://github.com/dib-lab/khmer/pull/1267> @ctb
<https://github.com/ctb> Links to lists of publications that use khmer have
been added. #1063 <https://github.com/dib-lab/khmer/issues/1063> #1222
<https://github.com/dib-lab/khmer/pull/1222> @mr-c
<https://github.com/mr-c> The help text from the scripts has also had a
thorough cleanup for formatting. #1268
<https://github.com/dib-lab/khmer/pull/1268> @mr-c
<https://github.com/mr-c>

fastq-to-fasta.py's --n_keep option had incorrect help text. We now point
out that all reads with Ns will be dropped by default unless this option is
supplied. #657 <https://github.com/dib-lab/khmer/issues/657> #814
<https://github.com/dib-lab/khmer/pull/814> #1208
<https://github.com/dib-lab/khmer/pull/1208> @ACharbonneau
<https://github.com/ACharbonneau> @bocajnotnef
<https://github.com/bocajnotnef>

We've updated the URL to the '88m-reads.fa.gz' file. #1242
<https://github.com/dib-lab/khmer/issues/1242> #1269
<https://github.com/dib-lab/khmer/pull/1269> @mr-c <https://github.com/mr-c>

@camillescott <https://github.com/camillescott> designed and implemented an
optimization for normalize-by-median.py #862
<https://github.com/dib-lab/khmer/pull/862>

abundance-dist.py can now be used without counts over 255 with
--no-bigcount. #1067 <https://github.com/dib-lab/khmer/pull/1067> #909
<https://github.com/dib-lab/khmer/pull/909> @drtamermansour
<https://github.com/drtamermansour> @bocajnotnef
<https://github.com/bocajnotnef> Its input file requirement can no longer
be overridden. #1201 <https://github.com/dib-lab/khmer/issues/1201> #1202
<https://github.com/dib-lab/khmer/pull/1202> @bocajnotnef
<https://github.com/bocajnotnef>

khmer v2.0 will be released as a package for the Debian GNU/Linux operating
system. Big thanks to @kdmurray91 <https://github.com/kdmurray91> for his
assistance. #1148 <https://github.com/dib-lab/khmer/pull/1148> #1240
<https://github.com/dib-lab/khmer/pull/1240> The C++ library, now named
liboxli, will have its own package as well.

sandbox/multi-rename.py now wraps long FASTA sequences at 80 columns. #450
<https://github.com/dib-lab/khmer/issues/450> #1136
<https://github.com/dib-lab/khmer/pull/1136> @SherineAwad
<https://github.com/SherineAwad>
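
Wrapping a FASTA sequence at a fixed column width is simple to sketch. The
helper below is a generic illustration with a hypothetical name, not code
taken from sandbox/multi-rename.py:

```python
def format_fasta(name, sequence, width=80):
    """Return a FASTA record with the sequence wrapped at width columns."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">%s\n%s\n" % (name, "\n".join(chunks))


record = format_fasta("seq1", "A" * 200)
record_lines = record.splitlines()
```
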
Of interest to developers:

The khmer project is now a Python 3 codebase with backwards compatibility
to Python 2.7. Huge credit to @luizirber <https://github.com/luizirber> #978
<https://github.com/dib-lab/khmer/pull/978> #922
<https://github.com/dib-lab/khmer/pull/922> #1045
<https://github.com/dib-lab/khmer/pull/1045> #1066
<https://github.com/dib-lab/khmer/pull/1066> #1089
<https://github.com/dib-lab/khmer/pull/1089> #1157
<https://github.com/dib-lab/khmer/pull/1157> #1191
<https://github.com/dib-lab/khmer/pull/1191> #1108
<https://github.com/dib-lab/khmer/pull/1108> There are many
developer-impacting changes, including that the file khmer/_khmermodule.cc
is now khmer/_khmer.cc. #169 <https://github.com/dib-lab/khmer/issues/169>
#904 <https://github.com/dib-lab/khmer/issues/904>

@camillescott <https://github.com/camillescott> did an extensive refactor
of the C++ graph traversal code which removed a considerable amount of
redundant code and will be very useful for future work. #1231
<https://github.com/dib-lab/khmer/pull/1231> #1080
<https://github.com/dib-lab/khmer/pull/1080>

We now use some and allow all C++11 features in the codebase. #598
<https://github.com/dib-lab/khmer/issues/598> #1122
<https://github.com/dib-lab/khmer/pull/1122> @mr-c <https://github.com/mr-c>

normalize-by-median.py was extensively refactored. #1006
<https://github.com/dib-lab/khmer/issues/1006> #1010
<https://github.com/dib-lab/khmer/pull/1010> #1057
<https://github.com/dib-lab/khmer/pull/1057> #1039
<https://github.com/dib-lab/khmer/pull/1039> #1135
<https://github.com/dib-lab/khmer/pull/1135> #1182
<https://github.com/dib-lab/khmer/pull/1182> @bocajnotnef
<https://github.com/bocajnotnef> @ctb <https://github.com/ctb> @camillescott
<https://github.com/camillescott>

The CPython glue was refactored so that CountingHash and Hashbits inherit
from Hashtable. #1044 <https://github.com/dib-lab/khmer/pull/1044> @ctb
<https://github.com/ctb>

The tests no longer stop on the first failed test. #1124
<https://github.com/dib-lab/khmer/issues/1124> #1134
<https://github.com/dib-lab/khmer/pull/1134> @ctb <https://github.com/ctb> and
some noisy tests were silenced #1125
<https://github.com/dib-lab/khmer/issues/1125> #1137
<https://github.com/dib-lab/khmer/pull/1137> @bocajnotnef
<https://github.com/bocajnotnef>

The check_space() calls were cleaned up. #1167
<https://github.com/dib-lab/khmer/issues/1167> #1166
<https://github.com/dib-lab/khmer/issues/1166> #1170
<https://github.com/dib-lab/khmer/pull/1170> #993
<https://github.com/dib-lab/khmer/issues/993>

Developer docs have been expanded #737
<https://github.com/dib-lab/khmer/issues/737> #1184
<https://github.com/dib-lab/khmer/pull/1184> @bocajnotnef
<https://github.com/bocajnotnef> #1083
<https://github.com/dib-lab/khmer/pull/1083> #1282
<https://github.com/dib-lab/khmer/pull/1282> @ctb <https://github.com/ctb>
@mr-c <https://github.com/mr-c> #1269
<https://github.com/dib-lab/khmer/pull/1269>

A lot of code was deleted: TRACE related code in #274
<https://github.com/dib-lab/khmer/issues/274> #1180
<https://github.com/dib-lab/khmer/pull/1180> @ctb <https://github.com/ctb>;
hashtable_collect_high_abundance_kmers in #1142
<https://github.com/dib-lab/khmer/issues/1142> #1044
<https://github.com/dib-lab/khmer/pull/1044> @ctb <https://github.com/ctb>;
lib/ht-diff.cc, lib/test-HashTables.cc, lib/test-Parser.cc in #1144
<https://github.com/dib-lab/khmer/pull/1144> @mr-c
<https://github.com/mr-c>; and bink.ipynb, lib/graphtest.cc, lib/primes.hh
in #1289 <https://github.com/dib-lab/khmer/pull/1289> @mr-c
<https://github.com/mr-c>

@bocajnotnef <https://github.com/bocajnotnef> deleted more unused code and
added new tests elsewhere to increase testing coverage in #1236
<https://github.com/dib-lab/khmer/pull/1236>. @mr-c
<https://github.com/mr-c> had his own go in #1279
<https://github.com/dib-lab/khmer/pull/1279>

cppcheck installation for OSX has been documented #777
<https://github.com/dib-lab/khmer/issues/777> #952
<https://github.com/dib-lab/khmer/pull/952> #945
<https://github.com/dib-lab/khmer/issues/945> @elmbeech
<https://github.com/elmbeech>

ccache and git-merge-changelog have been documented for Linux users #610
<https://github.com/dib-lab/khmer/issues/610> #1122
<https://github.com/dib-lab/khmer/pull/1122> #614
<https://github.com/dib-lab/khmer/issues/614> @mr-c
<https://github.com/mr-c>

The graphalign parameters can be saved/loaded from disk. In addition the
align_forward method has been introduced. #755
<https://github.com/dib-lab/khmer/issues/755> #750
<https://github.com/dib-lab/khmer/pull/750> @mr-c <https://github.com/mr-c>
@ctb <https://github.com/ctb>

labelhash is now known as graphlabels #1032
<https://github.com/dib-lab/khmer/issues/1032> #1209
<https://github.com/dib-lab/khmer/pull/1209> @mr-c <https://github.com/mr-c> It
is also now a 'friend' of Hashtable and one can make either a nodegraph or
countgraph version. These graphlabels can now be saved & loaded from disk.
#1021 <https://github.com/dib-lab/khmer/pull/1021> @ctb
<https://github.com/ctb>

Spelling is hard; we've added instructions on how to run codespell to the
developer docs. #890 <https://github.com/dib-lab/khmer/issues/890> #1203
<https://github.com/dib-lab/khmer/pull/1203> @bocajnotnef
<https://github.com/bocajnotnef>

A redundant and contradictorily named test has been removed. Reported by
@jgluck <https://github.com/jgluck> in #662
<https://github.com/dib-lab/khmer/issues/662> fixed by @bocajnotnef
<https://github.com/bocajnotnef> in #1220
<https://github.com/dib-lab/khmer/pull/1220> @SherineAwad
<https://github.com/SherineAwad> contributed some additional tests #809
<https://github.com/dib-lab/khmer/pull/809> #615
<https://github.com/dib-lab/khmer/issues/615>.

The new oxli command, while disabled in the v2.0 release, has been added to
all the QA makefile targets as we continue to refactor the codebase. #1199
<https://github.com/dib-lab/khmer/issues/1199> #1218
<https://github.com/dib-lab/khmer/pull/1218> @bocajnotnef
<https://github.com/bocajnotnef>

The CPython code was audited to ensure that all possible C++ exceptions
were caught and dealt with. The exception hierarchy was also simplified.
#1016 <https://github.com/dib-lab/khmer/issues/1016> #1015
<https://github.com/dib-lab/khmer/pull/1015> #1017
<https://github.com/dib-lab/khmer/pull/1017> #1151
<https://github.com/dib-lab/khmer/pull/1151> @kdmurray91
<https://github.com/kdmurray91> @mr-c <https://github.com/mr-c>

get_kadian_count has been removed. #1034
<https://github.com/dib-lab/khmer/issues/1034> #1194
<https://github.com/dib-lab/khmer/pull/1194> @ctb <https://github.com/ctb>

We use argparse's metavars to aid with autogenerated documentation for the
scripts. This has been documented in the dev docs. #620
<https://github.com/dib-lab/khmer/issues/620>
<https://github.com/dib-lab/khmer/pull/1222>