[khmer] khmer v1.4 released

Michael R. Crusoe mcrusoe at msu.edu
Wed May 13 15:25:54 PDT 2015


This is the v1.4 release of khmer featuring the results of our March and
April (PyCon) coding sprints and the 16 new contributors; the use of the
new v0.8 release of screed (the library we use for pure Python reading of
nucleotide sequence files); and the addition of @luizirber
<https://github.com/luizirber>'s HyperLogLog counter for quick cardinality
estimation.

Documentation is at https://khmer.readthedocs.org/en/v1.4/

New items of note:

Casava 1.8 read naming is now fully supported and in general the scripts no
longer mangle read names. Side benefits: split-paired-reads.py will no
longer drop reads with 'bad' names; count-median.py can generate output in
CSV format. #759 <https://github.com/ged-lab/khmer/pull/759> #818
<https://github.com/ged-lab/khmer/pull/818> @ctb <https://github.com/ctb>
#873 <https://github.com/ged-lab/khmer/issues/873> @ahaerpfer
<https://github.com/ahaerpfer>
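
For readers unfamiliar with the two naming schemes: older Illumina pipelines
mark the mates with a /1 or /2 suffix on the read name, while Casava 1.8 puts
the mate number at the start of a space-separated second field (for example
"1:N:0:ACGT"). The following is only a minimal sketch of telling the two
apart; the function names split_mate and is_pair are hypothetical and this is
not khmer's own parsing code.

```python
def split_mate(name):
    """Return (base_name, mate) where mate is 1, 2, or None if unpaired.

    Illustrative only; not khmer's actual implementation.
    """
    # Old style: "readname/1" or "readname/2"
    if name.endswith("/1") or name.endswith("/2"):
        return name[:-2], int(name[-1])
    # Casava 1.8 style: "readname 1:N:0:INDEX" (mate number leads field two)
    base, _, annotation = name.partition(" ")
    if annotation[:2] in ("1:", "2:"):
        return base, int(annotation[0])
    return name, None


def is_pair(name_a, name_b):
    """True if the names are mate 1 and mate 2 of the same fragment."""
    base_a, mate_a = split_mate(name_a)
    base_b, mate_b = split_mate(name_b)
    return base_a == base_b and mate_a == 1 and mate_b == 2
```

Keeping the base name untouched and returning the mate number separately is
one way to avoid rewriting ("mangling") the name when reads are written back
out.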

Most scripts now support a "broken" interleaved paired-read format for
FASTA/FASTQ nucleotide sequence files. trim-low-abund.py
<http://khmer.readthedocs.org/en/v1.4/user/scripts.html#trim-low-abund-py> has
been promoted from the sandbox as well (with streaming support). #759
<https://github.com/ged-lab/khmer/pull/759> @ctb <https://github.com/ctb>
#963 <https://github.com/ged-lab/khmer/pull/963> @sguermond
<https://github.com/sguermond> #933
<https://github.com/ged-lab/khmer/pull/933> @standage
<https://github.com/standage>
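
A "broken" interleaved file is taken here to mean an interleaved file in
which some reads have lost their mate, so pairs and orphans are mixed. The
generator below is a rough sketch of how such a stream can be walked; it
works on read names only, assumes the /1 and /2 suffix convention, and the
name broken_pairs is made up for this example rather than taken from khmer.

```python
def broken_pairs(names):
    """Yield (first, second) tuples from a possibly broken interleaved stream.

    `names` is any iterable of read names using "/1" and "/2" suffixes.
    Orphaned reads are yielded as (name, None). Illustrative only.
    """
    pending = None  # a "/1" read still waiting for its mate
    for name in names:
        if pending is not None:
            if name.endswith("/2") and name[:-2] == pending[:-2]:
                yield pending, name  # properly paired
                pending = None
                continue
            yield pending, None  # the previous read was an orphan
            pending = None
        if name.endswith("/1"):
            pending = name
        else:
            yield name, None
    if pending is not None:
        yield pending, None  # trailing "/1" read without a mate
```

Because it only ever holds one pending read, this approach works on a stream
and never needs the whole file in memory.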

The script to transform an interleaved paired-read nucleotide sequence file
into two files now allows one to name the output files, which can be useful
in combination with named pipes for streaming processing. #762
<https://github.com/ged-lab/khmer/pull/762> @ctb <https://github.com/ctb>

Streaming everywhere: thanks to screed v0.8 we now support streaming of
almost all inputs and outputs. #830
<https://github.com/ged-lab/khmer/pull/830> @aditi9783
<https://github.com/aditi9783> #812
<https://github.com/ged-lab/khmer/pull/812> @mr-c <https://github.com/mr-c>
#917 <https://github.com/ged-lab/khmer/pull/917> @bocajnotnef
<https://github.com/bocajnotnef> #882
<https://github.com/ged-lab/khmer/pull/882> @standage
<https://github.com/standage>

Need a quick way to count the total number of unique k-mers in very low
memory? The unique-kmers.py script in the sandbox uses a HyperLogLog counter
to quickly (and with little memory) provide an estimate with a controllable
error rate. #257 <https://github.com/ged-lab/khmer/pull/257> #738
<https://github.com/ged-lab/khmer/pull/738> #895
<https://github.com/ged-lab/khmer/pull/895> #902
<https://github.com/ged-lab/khmer/pull/902> @luizirber
<https://github.com/luizirber>
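
As a rough illustration of why a HyperLogLog counter needs so little memory:
each item is hashed, the leading bits of the hash pick a register, and the
register remembers the longest run of leading zeros seen in the remaining
bits. With m registers the relative standard error is about 1.04/sqrt(m),
which is what makes the error rate controllable. The class below is a minimal
pure-Python sketch under those assumptions; the class name, the use of SHA-1,
the parameter p, and the kmers helper are all illustrative and are not taken
from khmer's implementation or from unique-kmers.py.

```python
import hashlib
import math


class HyperLogLog:
    """Tiny HyperLogLog sketch; illustrative, not khmer's counter."""

    def __init__(self, p=12):
        self.p = p                     # number of index bits
        self.m = 1 << p                # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant; this formula assumes m >= 128 (p >= 7).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Take 64 bits of a SHA-1 digest as the hash value.
        digest = hashlib.sha1(item.encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


hll = HyperLogLog(p=12)
for kmer in kmers("ACGTACGTTTGACC", 4):
    hll.add(kmer)
print(round(hll.estimate()))
```

With p=12 the sketch keeps 4096 small registers regardless of how many
k-mers are added, and raising p trades memory for a lower error rate.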

normalize-by-median.py can now process both a paired interleaved sequence
file and a file of unpaired reads in the same invocation, thus removing the
need to write the counting table to disk as required in the workaround. #957
<https://github.com/ged-lab/khmer/pull/957> @susinmotion
<https://github.com/susinmotion>

Notable bugs fixed/issues closed:

Paired-end reads from Casava 1.8 no longer require renaming for use in
normalize-by-median.py and abund-filter.py when used in paired mode #818
<https://github.com/ged-lab/khmer/pull/818> @ctb <https://github.com/ctb>

Python version support clarified. We do not (yet) support Python 3.x #741
<https://github.com/ged-lab/khmer/pull/741> @mr-c <https://github.com/mr-c>

If a single output file mode is chosen for normalize-by-median.py we now
default to overwriting the output. Appending the output is available by
using the append redirection operator from the shell. #843
<https://github.com/ged-lab/khmer/pull/843> @drtamermansour
<https://github.com/drtamermansour>

Scripts that consume sequence data using C++ will now properly throw an
error on truncated files. #897 <https://github.com/ged-lab/khmer/pull/897>
@kdmurray91 <https://github.com/kdmurray91>
While writing to disk we now also properly check for errors. #856
<https://github.com/ged-lab/khmer/pull/856> #962
<https://github.com/ged-lab/khmer/pull/962> @mr-c <https://github.com/mr-c>

abundance-dist-single.py no longer fails with small files and many threads.
#900 <https://github.com/ged-lab/khmer/pull/900> @mr-c
<https://github.com/mr-c>

Additional fixes/features

Of interest to users:

Many documentation updates #753 <https://github.com/ged-lab/khmer/pull/753>
@PamelaM <https://github.com/PamelaM>, #782
<https://github.com/ged-lab/khmer/pull/782> @bocajnotnef
<https://github.com/bocajnotnef>, #845
<https://github.com/ged-lab/khmer/pull/845> @alameldin
<https://github.com/alameldin>, #804
<https://github.com/ged-lab/khmer/pull/804> @ctb <https://github.com/ctb>,
#870 <https://github.com/ged-lab/khmer/pull/870> @SchwarzEM
<https://github.com/SchwarzEM>, #953
<https://github.com/ged-lab/khmer/pull/953> #942
<https://github.com/ged-lab/khmer/pull/942> @safay
<https://github.com/safay>, #929 <https://github.com/ged-lab/khmer/pull/929>
@davelin1 <https://github.com/davelin1>, #687
<https://github.com/ged-lab/khmer/pull/687> #912
<https://github.com/ged-lab/khmer/pull/912> #926
<https://github.com/ged-lab/khmer/pull/926> @mr-c <https://github.com/mr-c>

Installation instructions for Conda, Arch Linux, and Mac Ports have been
added #723 <https://github.com/ged-lab/khmer/pull/723> @reedacartwright
<https://github.com/reedacartwright> #952
<https://github.com/ged-lab/khmer/pull/952> @elmbeech
<https://github.com/elmbeech> #930
<https://github.com/ged-lab/khmer/pull/930> @ahaerpfer
<https://github.com/ahaerpfer>

The example script for the STAMPS database has been fixed to run correctly
#781 <https://github.com/ged-lab/khmer/pull/781> @drtamermansour
<https://github.com/drtamermansour>

split-paired-reads.py: added -o option to allow specification of an output
directory #752 <https://github.com/ged-lab/khmer/pull/752> @bede
<https://github.com/bede>

Fixed a string formatting error and a boundary error in
sample-reads-randomly.py #773 <https://github.com/ged-lab/khmer/pull/773>
@qingpeng <https://github.com/qingpeng> #995
<https://github.com/ged-lab/khmer/pull/995> @ctb <https://github.com/ctb>

CSV output added to abundance-dist.py, abundance-dist-single.py,
count-overlap.py, and readstats.py #831
<https://github.com/ged-lab/khmer/pull/831> #854
<https://github.com/ged-lab/khmer/pull/854> #855
<https://github.com/ged-lab/khmer/pull/855> @drtamermansour
<https://github.com/drtamermansour> #959
<https://github.com/ged-lab/khmer/pull/959> @anotherthomas
<https://github.com/anotherthomas>

TSV/JSON output of load-into-counting.py enhanced with the total number of
reads processed #996 <https://github.com/ged-lab/khmer/pull/996> @kdmurray91
<https://github.com/kdmurray91>

Output files are now also checked to be writable *before* loading the input
files #672 <https://github.com/ged-lab/khmer/pull/672> @pgarland
<https://github.com/pgarland> @bocajnotnef <https://github.com/bocajnotnef>
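
The point of checking first is to fail fast instead of discovering an
unwritable output path after a long load. One possible way to express such a
check is sketched below; the function name output_is_writable is hypothetical
and this is not the code khmer uses.

```python
import os


def output_is_writable(path):
    """Return True if `path` can plausibly be written to.

    Illustrative only. An existing file must itself be writable; otherwise
    the containing directory must exist and be writable.
    """
    if os.path.exists(path):
        return os.access(path, os.W_OK)
    directory = os.path.dirname(os.path.abspath(path))
    return os.path.isdir(directory) and os.access(directory, os.W_OK)
```

A script would call this on every output path before opening any input, and
exit with an error message if it returns False.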

interleave-reads.py now prints the output filename nicely #827
<https://github.com/ged-lab/khmer/pull/827> @kdmurray91
<https://github.com/kdmurray91>

Cleaned up error for input file not existing #772
<https://github.com/ged-lab/khmer/pull/772> @jessicamizzi
<https://github.com/jessicamizzi> #851
<https://github.com/ged-lab/khmer/pull/851> @ctb <https://github.com/ctb>

Fixed error in find-knots.py #860
<https://github.com/ged-lab/khmer/pull/860> @TheOneHyer
<https://github.com/TheOneHyer>

The help text for load-into-counting.py for the --no-bigcounts/-b flag has
been clarified #857 <https://github.com/ged-lab/khmer/pull/857> @kdmurray91
<https://github.com/kdmurray91>

@lexnederbragt <https://github.com/lexnederbragt> confirmed an old bug has
been fixed with his test for whitespace in sequence identifiers interacting
with the extract-partitions.py script #979
<https://github.com/ged-lab/khmer/pull/979>

Now safe to copy-and-paste from the user documentation as the smart quotes
have been turned off. #967 <https://github.com/ged-lab/khmer/pull/967>
@ahaerpfer <https://github.com/ahaerpfer>

The script make-coverage.py has been restored to the sandbox. #920
<https://github.com/ged-lab/khmer/pull/920> @SherineAwad
<https://github.com/SherineAwad>

normalize-by-median.py will warn if two of the input files have the same
name #932 <https://github.com/ged-lab/khmer/pull/932> @elmbeech
<https://github.com/elmbeech>

Of interest to developers:

Switched away from using --user install for developers #740
<https://github.com/ged-lab/khmer/pull/740> @mr-c <https://github.com/mr-c>
@drtamermansour <https://github.com/drtamermansour> #883
<https://github.com/ged-lab/khmer/issues/883> @standage
<https://github.com/standage>

Developers can now see a summary of important Makefile targets via make help
#783 <https://github.com/ged-lab/khmer/pull/783> @standage
<https://github.com/standage>

The unused khmer.load_pe module has been removed #828
<https://github.com/ged-lab/khmer/pull/828> @kdmurray91
<https://github.com/kdmurray91>

Versioneer bug due to new screed release was squashed #835
<https://github.com/ged-lab/khmer/pull/835> @mr-c <https://github.com/mr-c>

A Python 2.6 and 2.7.2 specific bug was worked around #869
<https://github.com/ged-lab/khmer/pull/869> @kdmurray91
<https://github.com/kdmurray91> @ctb <https://github.com/ctb>

Added functions hash_find_all_tags_list and hash_get_tags_and_positions to
CountingHash objects #749 <https://github.com/ged-lab/khmer/pull/749> #765
<https://github.com/ged-lab/khmer/pull/765> @ctb <https://github.com/ctb>

The make diff-cover and ChangeLog formatting requirements have been added
to the checklist #766 <https://github.com/ged-lab/khmer/pull/766> @mr-c
<https://github.com/mr-c>

A useful message is now presented if large tables fail to allocate enough
memory #704 <https://github.com/ged-lab/khmer/pull/704> @mr-c
<https://github.com/mr-c>

A checklist for developers adding new CPython types was added #727
<https://github.com/ged-lab/khmer/pull/727> @mr-c <https://github.com/mr-c>

The sandbox graduation checklist has been updated to include streaming
support #951 <https://github.com/ged-lab/khmer/pull/951> @sguermond
<https://github.com/sguermond>

Specific policies for sandbox/ and scripts/ content, and a process for
adding new command line scripts into scripts/ have been added to the
developer documentation #799 <https://github.com/ged-lab/khmer/pull/799>
@ctb <https://github.com/ctb>

Sandbox scripts update: corrected #! Python invocation #815
<https://github.com/ged-lab/khmer/pull/815> @Echelon9
<https://github.com/Echelon9>, executable bits, copyright headers, no
underscores in filenames #823 <https://github.com/ged-lab/khmer/pull/823>
#826 <https://github.com/ged-lab/khmer/pull/826> #850
<https://github.com/ged-lab/khmer/pull/850> @alameldin
<https://github.com/alameldin>; several scripts deleted, docs + requirements
updated #852 <https://github.com/ged-lab/khmer/pull/852> @ctb
<https://github.com/ctb>

Avoid running big-memory tests on OS X #819
<https://github.com/ged-lab/khmer/pull/819> @ctb <https://github.com/ctb>

Unused callback code was removed #698
<https://github.com/ged-lab/khmer/pull/698> @mr-c <https://github.com/mr-c>

The CPython code was updated to use the new checklist and follow additional
best practices #785 <https://github.com/ged-lab/khmer/pull/785> #842
<https://github.com/ged-lab/khmer/pull/842> @luizirber
<https://github.com/luizirber>

Added a read-only view of the raw counting tables #671
<https://github.com/ged-lab/khmer/pull/671> @camillescott
<https://github.com/camillescott> #869
<https://github.com/ged-lab/khmer/pull/869> @kdmurray91
<https://github.com/kdmurray91>

Added a Python method for quickly getting the number of underlying tables
in a counting or presence table #879
<https://github.com/ged-lab/khmer/pull/879> #880
<https://github.com/ged-lab/khmer/pull/880> @kdmurray91
<https://github.com/kdmurray91>

The C++ library can now be built separately for the brave and curious
developer #788 <https://github.com/ged-lab/khmer/pull/788> @kdmurray91
<https://github.com/kdmurray91>

The ReadParser object now keeps track of the number of reads processed #877
<https://github.com/ged-lab/khmer/pull/877> @kdmurray91
<https://github.com/kdmurray91>

Documentation is now reproducible #886
<https://github.com/ged-lab/khmer/pull/886> @mr-c <https://github.com/mr-c>

Python future proofing: specify floor division #863
<https://github.com/ged-lab/khmer/pull/863> @mr-c <https://github.com/mr-c>

Miscellaneous spelling fixes; thanks codespell! #867
<https://github.com/ged-lab/khmer/pull/867> @mr-c <https://github.com/mr-c>

Debian package list update #984 <https://github.com/ged-lab/khmer/pull/984>
@mr-c <https://github.com/mr-c>

khmer.kfile.check_file_status() has been renamed to check_input_files() #941
<https://github.com/ged-lab/khmer/pull/941> @proteasome
<https://github.com/proteasome>
filter-abund.py now uses it to check the input counting table #931
<https://github.com/ged-lab/khmer/pull/931> @safay
<https://github.com/safay>

normalize-by-median.py was refactored to not pass the ArgParse object
around #965 <https://github.com/ged-lab/khmer/pull/965> @susinmotion
<https://github.com/susinmotion>

Developer communication has been clarified #969
<https://github.com/ged-lab/khmer/pull/969> @sguermond
<https://github.com/sguermond>

Tests using the 'fail_okay=true' parameter to runscript have been updated
to confirm the correct error occurred. 3 faulty tests were fixed and the
docs were clarified #968 <https://github.com/ged-lab/khmer/pull/968> #971
<https://github.com/ged-lab/khmer/pull/971> @susinmotion
<https://github.com/susinmotion>

FASTA test added for extract-long-sequences.py #901
<https://github.com/ged-lab/khmer/pull/901> @jessicamizzi
<https://github.com/jessicamizzi>

'added silly test for empty file warning' #557
<https://github.com/ged-lab/khmer/pull/557> @wltrimbl
<https://github.com/wltrimbl> @bocajnotnef <https://github.com/bocajnotnef>

A couple tests were made more resilient and some extra error checking added
in CPython land #889 <https://github.com/ged-lab/khmer/pull/889> @mr-c
<https://github.com/mr-c>

Copyright added to pull request checklist #940
<https://github.com/ged-lab/khmer/pull/940> @sguermond
<https://github.com/sguermond>

khmer_exceptions are now based on std::strings which plugs a memory leak
#938 <https://github.com/ged-lab/khmer/pull/938> @anotherthomas
<https://github.com/anotherthomas>

Python docstrings were made PEP257 compliant #936
<https://github.com/ged-lab/khmer/pull/936> @ahaerpfer
<https://github.com/ahaerpfer>

Some C++ comments were converted to be Doxygen compliant #950
<https://github.com/ged-lab/khmer/pull/950> @josiahseaman
<https://github.com/josiahseaman>

The counting and presence table warning logic was refactored and
centralized #944 <https://github.com/ged-lab/khmer/pull/944> @susinmotion
<https://github.com/susinmotion>

The release checklist was updated to better run the post-install tests #911
<https://github.com/ged-lab/khmer/pull/911> @mr-c <https://github.com/mr-c>

The unused method find_all_tags_truncate_on_abundance was removed from the
CPython API #924 <https://github.com/ged-lab/khmer/pull/924> @anotherthomas
<https://github.com/anotherthomas>

OS X warnings quieted #887 <https://github.com/ged-lab/khmer/pull/887> @mr-c
<https://github.com/mr-c>

Known issues:

All of these are pre-existing.

Some users have reported that normalize-by-median.py will utilize more
memory than it was configured for. This is being investigated in #266
<https://github.com/ged-lab/khmer/issues/266>

Some scripts only output FASTA even if given a FASTQ file. This issue is
being tracked in #46 <https://github.com/ged-lab/khmer/issues/46>

Contributors

@ctb <https://github.com/ctb>, @kdmurray91 <https://github.com/kdmurray91>,
@mr-c <https://github.com/mr-c>, @drtamermansour
<https://github.com/drtamermansour>, @luizirber
<https://github.com/luizirber>, @standage <https://github.com/standage>,
@bocajnotnef <https://github.com/bocajnotnef>, *@susinmotion
<https://github.com/susinmotion>, @jessicamizzi
<https://github.com/jessicamizzi>, *@elmbeech <https://github.com/elmbeech>,
*@anotherthomas <https://github.com/anotherthomas>, *@sguermond
<https://github.com/sguermond>, *@ahaerpfer <https://github.com/ahaerpfer>,
*@alameldin <https://github.com/alameldin>, *@TheOneHyer
<https://github.com/TheOneHyer>, *@aditi9783 <https://github.com/aditi9783>,
*@proteasome <https://github.com/proteasome>, *@bede
<https://github.com/bede>, *@davelin1 <https://github.com/davelin1>,
@Echelon9 <https://github.com/Echelon9>,
*@reedacartwright <https://github.com/reedacartwright>, @qingpeng
<https://github.com/qingpeng>, *@SchwarzEM <https://github.com/SchwarzEM>,
*@scottsievert <https://github.com/scottsievert>, @PamelaM
<https://github.com/PamelaM>, @SherineAwad <https://github.com/SherineAwad>,
*@josiahseaman <https://github.com/josiahseaman>, *@lexnederbragt
<https://github.com/lexnederbragt>

* Indicates new contributors

Issue reporters

@moorepants <https://github.com/moorepants>, @teshomem
<https://github.com/teshomem>, @macmanes <https://github.com/macmanes>,
@lexnederbragt <https://github.com/lexnederbragt>, @r-gaia-cs
<https://github.com/r-gaia-cs>, @magentashades
<https://github.com/magentashades>