[bip] HTSeq: A Python framework to work with high-throughput sequencing data

Tue Apr 27 08:52:02 PDT 2010

Hi

I would like to advertise the release of HTSeq, a Python framework to
process and analyse high-throughput sequencing (HTS) data.

While many Perl and Python tools exist for classical bioinformatics
tasks as occuring in sequence analysis and phylogenics,
bioinformaticians working with HTS data still often have to start from
scratch when writing a script.

And scripts are still needed, even though new alignment software are
quite flexible now, because, in practice,  data often needs to be
converted, tweaked, filtered, or otherwise pre-processed before they can
be given to the aligner, and the results require similar processing to
do the statistical analysis one needs.

HTSeq is meant to render such tasks easy and convenient, and so act as a
"glue" between aligners and other existing tools.

Some examples of typical use cases for HTSeq:

* Quality assessment of reads: Check the dependence of the proportions
of base calls and quality scores on the position in the reads, stratify
by alignment status.

* Counting: How many reads fall onto each exon, or each gene? For such
tasks, you may want to design and implement rules on how to deal with
overlapping features or ambiguous assignments.

* Calculating coverage: HTSeq helps you not only to produce a Wiggle
file for visualization in a genome browser, but also to do customized
statistics on this.

* Multiple alignments: Many aligners can output multiple alignments for
each read, but what to do with this? HTSeq makes it easy to implement
post-processing to choose the right alignment according to your criteria.

* Adapter trimming: In miRNA-Seq, you often sequence into the adapter at
the other end and need to cut this off before aligning. In multiplexed
sequencing, you may need to cut off and sort by the mutiplex tag.

Have a look and give it a try: http://www-huber.embl.de/users/anders/HTSeq/

For programmers, HTSeq has been designed to keep thing simple:

* All classes have extensive reference documentation, and a tutorial
demonstrates their use.

* All supported file formats (Fasta, Fastq, SAM, SolexaPipeline files,
GFF, GTF, etc.) can be read in a loop, providing an object describing
one record at a time to the loop body. This object describes the data in
a convenient and consistent way.

* The 'GenomicArray' class is the Swiss army knife of HTSeq. It is a
container that can efficiently store anything that has a position on the
genome: integer number to represent coverage, objects with feature data
to represent exons, sets of objects to handle overlapping features, etc.

For users without programming knowledge, stand-alone scripts for common
tasks are provided: htseq-count to count the overlap of reads with
features (such as exons), htseq-qa to get a quick overview of the
quality of your sequencing run, and htseq-bedgraph (coming soon) to
convert an alignment file into a Bedgraph Wiggle file for visualization
with a genome browser. In a way, these rather serve as examples though.
There are much more tasks for which HTSeq should be useful.

If you try it, please let me know what you think.

Best regards
   Simon

+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: sanders at fs.tum.de