[khmer] Using khmer for producing k-mer frequency distribution

Rajat Shuvro Roy rajatroy at cs.rutgers.edu
Wed Aug 28 07:36:31 PDT 2013


Thanks. I fixed the PYTHONPATH problem, and Python now imports khmer from
the new location.

python -c "import khmer; print khmer"
<module 'khmer' from '/home/rajatroy/khmer/python/khmer/__init__.pyc'>
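
The same check can be written as a small script. This is only a sketch: `module_location` is a hypothetical helper, and the standard-library module `json` stands in for `khmer` so that the example runs on a machine without khmer installed.

```python
import importlib
import os


def module_location(name):
    """Return the absolute path of the file a module was imported from."""
    module = importlib.import_module(name)
    return os.path.abspath(module.__file__)


# "json" is a stand-in; replace it with "khmer" where khmer is installed.
location = module_location("json")
print(location)
```

If the printed path is not under the new checkout's 'python' subdirectory, PYTHONPATH is probably still pointing at the old copy.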

I tried to invoke the default mode with:

python load-into-counting.py  -k 31  out.kh 1Mreads.fa

But it does not seem to be invoking the default mode, in which memory use
should expand indefinitely. It says:

PARAMETERS:
 - kmer size =    31            (-k)
 - n hashes =     4             (-N)
 - min hashsize = 1e+06         (-x)

Estimated memory usage is 4e+06 bytes (n_hashes x min_hashsize)
--------
** WARNING: hashsize is default!  You absodefly want to increase this!
** Please read the docs!
Saving hashtable to out.kh
Loading kmers from sequences in ['/projects/Genomes/drosophila/1Mreads.fa']
making hashtable
consuming input /projects/Genomes/drosophila/1Mreads.fa
saving out.kh
fp rate estimated to be 1.000
**
** ERROR: the counting hash is too small for
** this data set.  Increase hashsize/num ht.
**
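
The memory figure in this output is the product the script names, n_hashes x min_hashsize. The sketch below reproduces that arithmetic and adds a rough guess at why the false-positive rate saturates at 1.000. The occupancy formula is a standard Bloom-filter-style approximation, not khmer's actual code, and the figure of 7e7 distinct k-mers (1M reads at roughly 70 k-mers each) is an assumption about this data set, not something reported in the output.

```python
import math


def estimated_memory_bytes(n_hashes, min_hashsize):
    """Memory estimate as printed by the script: n_hashes x min_hashsize."""
    return n_hashes * min_hashsize


def approx_fp_rate(n_distinct_kmers, n_hashes, hashsize):
    """Approximate false-positive rate (assumed Bloom-filter-style model).

    A slot is occupied with probability 1 - exp(-n/m); a false positive
    needs an occupied slot in every one of the n_hashes tables.
    """
    occupancy = 1.0 - math.exp(-float(n_distinct_kmers) / hashsize)
    return occupancy ** n_hashes


mem = estimated_memory_bytes(4, 1e6)
print("%.0e bytes" % mem)

# ASSUMPTION: about 7e7 distinct 31-mers in the input (not measured).
fp_default = approx_fp_rate(7e7, 4, 1e6)
fp_larger = approx_fp_rate(7e7, 4, 1e9)
print("fp rate with -x 1e6: %.3f" % fp_default)
print("fp rate with -x 1e9: %.6f" % fp_larger)
```

Under these assumptions the default table is essentially full, which is consistent with the error asking for a larger hashsize.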

I could not find any example of running the default mode in the khmer
documentation (khmer/doc/scripts.txt). Could you please give me a sample
command that invokes the default mode?
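
For reference, the two counting modes described in the quoted reply below can be sketched in a few lines. This is a toy model, not khmer's implementation: `ToyCounter`, its dict-based overflow store, and the use of Python's built-in `hash` are all assumptions made for illustration. It shows only the behavioural difference: 8-bit counters that stop at 255 in a fixed-size table, versus an overflow store that keeps larger counts and grows with the data.

```python
MAX_SMALL_COUNT = 255  # largest value an 8-bit counter can hold


class ToyCounter(object):
    """Hypothetical single-table k-mer counter with optional big counts."""

    def __init__(self, size, bigcount=True):
        self.size = size
        self.table = bytearray(size)  # fixed memory: one byte per slot
        self.bigcount = bigcount
        self.overflow = {}            # grows only when bigcount is on

    def add(self, kmer):
        slot = hash(kmer) % self.size
        if self.table[slot] < MAX_SMALL_COUNT:
            self.table[slot] += 1
        elif self.bigcount:
            # Counter is saturated: continue counting in the overflow store.
            previous = self.overflow.get(kmer, MAX_SMALL_COUNT)
            self.overflow[kmer] = previous + 1

    def get(self, kmer):
        if self.bigcount and kmer in self.overflow:
            return self.overflow[kmer]
        return self.table[hash(kmer) % self.size]


capped = ToyCounter(1000, bigcount=False)
exact = ToyCounter(1000, bigcount=True)
for _ in range(300):
    capped.add("ACGTACGTAC")
    exact.add("ACGTACGTAC")

print(capped.get("ACGTACGTAC"))
print(exact.get("ACGTACGTAC"))
```

In both modes of this toy the base table keeps the size it was created with, which is consistent with the error above asking for a larger hashsize/num ht.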

Thanks

Rajat

On Tue, Aug 27, 2013 at 9:51 PM, C. Titus Brown <ctb at msu.edu> wrote:

> On Tue, Aug 27, 2013 at 05:35:49PM -0400, Rajat Shuvro Roy wrote:
> > The new version is in a complete new directory. make test gives:
>
> OK, all the tests pass, including the ones that run normalize-by-median. In
> that case it's almost certainly a problem with your PYTHONPATH -- make sure
> it points to the new directory's 'python' subdirectory.
>
> Do:
>
> % python -c "import khmer; print khmer"
>
> to see where khmer is being imported from -- it should be the new location.
>
> And yes, fixing installation is in the near future :)
>
> cheers,
> --titus
>
> >
> > make test
> > cd lib && \
> > make
> > make[1]: Entering directory `/u2/home/rajatroy/khmer/lib'
> > make[1]: Nothing to be done for `all'.
> > make[1]: Leaving directory `/u2/home/rajatroy/khmer/lib'
> > cd python && \
> > make    DEFINE_KHMER_EXTRA_SANITY_CHECKS="" \
> >         CXX_DEBUG_FLAGS=""
> > make[1]: Entering directory `/u2/home/rajatroy/khmer/python'
> > python setup.py build_ext -i
> > running build_ext
> > copying build/lib.linux-x86_64-2.7/khmer/_khmermodule.so -> khmer
> > make[1]: Leaving directory `/u2/home/rajatroy/khmer/python'
> > nosetests -v -x -a \!known_failing
> > tests.test_align.test_alignnocov ... ok
> > tests.test_align.test_readalign ... ok
> > tests.test_align.test_alignerrorregion ... ok
> > tests.test_c_wrapper.test_raise_in_consume_fasta ... ok
> > tests.test_c_wrapper.test_raise_in_fasta_file_to_minmax ... ok
> > tests.test_counting_hash.Test_CountingHash.test_collision_1 ... ok
> > tests.test_counting_hash.Test_CountingHash.test_collision_2 ... ok
> > tests.test_counting_hash.Test_CountingHash.test_collision_3 ... ok
> > tests.test_counting_hash.test_3_tables ... ok
> > tests.test_counting_hash.test_simple_median ... ok
> > tests.test_counting_hash.test_simple_kadian ... ok
> > tests.test_counting_hash.test_simple_kadian_2 ... ok
> > tests.test_counting_hash.test_2_kadian ... ok
> > tests.test_counting_hash.test_save_load ... ok
> > tests.test_counting_hash.test_load_gz ... ok
> > tests.test_counting_hash.test_save_load_gz ... ok
> > tests.test_counting_hash.test_trim_full ... ok
> > tests.test_counting_hash.test_trim_short ... ok
> > tests.test_counting_hash.test_maxcount ... ok
> > tests.test_counting_hash.test_maxcount_with_bigcount ... ok
> > tests.test_counting_hash.test_maxcount_with_bigcount_save ... ok
> > tests.test_counting_hash.test_bigcount_save ... ok
> > tests.test_counting_hash.test_nobigcount_save ... ok
> > tests.test_counting_hash.test_bigcount_abund_dist ... ok
> > tests.test_counting_hash.test_bigcount_abund_dist_2 ... ok
> > tests.test_counting_hash.test_bigcount_overflow ... ok
> > tests.test_counting_hash.test_get_ksize ... ok
> > tests.test_counting_hash.test_get_hashsizes ... ok
> > tests.test_counting_single.Test_AbundanceDistribution.test_count_A ... ok
> > tests.test_counting_single.Test_ConsumeString.test_abundance_by_pos ... ok
> > tests.test_counting_single.Test_ConsumeString.test_abundance_by_pos_bigcount ... ok
> > tests.test_counting_single.Test_ConsumeString.test_bounded ... ok
> > tests.test_counting_single.Test_ConsumeString.test_bounded_2 ... ok
> > tests.test_counting_single.Test_ConsumeString.test_bounded_2_rc ... ok
> > tests.test_counting_single.Test_ConsumeString.test_bounded_rc ... ok
> > tests.test_counting_single.Test_ConsumeString.test_max_count ... ok
> > tests.test_counting_single.Test_ConsumeString.test_max_count_in_bound ... ok
> > tests.test_counting_single.Test_ConsumeString.test_max_count_out_bound ... ok
> > tests.test_counting_single.Test_ConsumeString.test_min_count ... ok
> > tests.test_counting_single.Test_ConsumeString.test_min_count_in_bound ... ok
> > tests.test_counting_single.Test_ConsumeString.test_min_count_out_bound ... ok
> > tests.test_counting_single.Test_ConsumeString.test_n_occupied ... ok
> > tests.test_counting_single.Test_ConsumeString.test_n_occupied_args ... ok
> > tests.test_counting_single.Test_ConsumeString.test_simple ... ok
> > tests.test_counting_single.Test_ConsumeString.test_simple_2 ... ok
> > tests.test_counting_single.Test_ConsumeString.test_simple_rc ... ok
> > tests.test_counting_single.test_no_collision ... ok
> > tests.test_counting_single.test_collision ... ok
> > tests.test_counting_single.test_complete_no_collision ... ok
> > tests.test_counting_single.test_complete_2_collision ... ok
> > tests.test_counting_single.test_complete_4_collision ... ok
> > tests.test_counting_single.test_maxcount ... ok
> > tests.test_counting_single.test_maxcount_with_bigcount ... ok
> > tests.test_counting_single.test_consume_uniqify_first ... ok
> > tests.test_counting_single.test_maxcount_consume ... ok
> > tests.test_counting_single.test_maxcount_consume_with_bigcount ... ok
> > tests.test_counting_single.test_get_mincount ... ok
> > tests.test_counting_single.test_get_maxcount ... ok
> > tests.test_counting_single.test_get_maxcount_rc ... ok
> > tests.test_counting_single.test_get_mincount_rc ... ok
> > tests.test_counting_single.test_64bitshift ... ok
> > tests.test_counting_single.test_64bitshift_2 ... ok
> > tests.test_counting_single.test_very_short_read ... ok
> > tests.test_filter.Test_Filter.test_abund ... ok
> > tests.test_filter.test_filter_sodd ... ok
> > tests.test_functions.test_forward_hash ... ok
> > tests.test_functions.test_forward_hash_no_rc ... ok
> > tests.test_functions.test_reverse_hash ... ok
> > tests.test_functions.test_get_primes ... ok
> > tests.test_graph.Test_ExactGraphFu.test_counts ... ok
> > tests.test_graph.Test_ExactGraphFu.test_graph_links_next_a ... ok
> > tests.test_graph.Test_ExactGraphFu.test_graph_links_next_c ... ok
> > tests.test_graph.Test_ExactGraphFu.test_graph_links_next_g ... ok
> > tests.test_graph.Test_ExactGraphFu.test_graph_links_next_t ... ok
> > tests.test_graph.Test_ExactGraphFu.test_graph_links_prev_a ... ok
> > tests.test_graph.Test_ExactGraphFu.test_graph_links_prev_c ... ok
> > tests.test_graph.Test_ExactGraphFu.test_graph_links_prev_g ... ok
> > tests.test_graph.Test_ExactGraphFu.test_graph_links_prev_t ... ok
> > tests.test_graph.Test_InexactGraphFu.test_graph_links_next_a ... ok
> > tests.test_graph.Test_InexactGraphFu.test_graph_links_next_c ... ok
> > tests.test_graph.Test_InexactGraphFu.test_graph_links_next_g ... ok
> > tests.test_graph.Test_InexactGraphFu.test_graph_links_next_t ... ok
> > tests.test_graph.Test_InexactGraphFu.test_graph_links_prev_a ... ok
> > tests.test_graph.Test_InexactGraphFu.test_graph_links_prev_c ... ok
> > tests.test_graph.Test_InexactGraphFu.test_graph_links_prev_g ... ok
> > tests.test_graph.Test_InexactGraphFu.test_graph_links_prev_t ... ok
> > tests.test_graph.Test_Partitioning.test_connected_20_a ... ok
> > tests.test_graph.Test_Partitioning.test_connected_20_b ... ok
> > tests.test_graph.Test_Partitioning.test_connected_31_c ... ok
> > tests.test_graph.Test_Partitioning.test_disconnected_20_a ... ok
> > tests.test_graph.Test_Partitioning.test_disconnected_20_b ... ok
> > tests.test_graph.Test_Partitioning.test_disconnected_31_c ... ok
> > tests.test_graph.Test_Partitioning.test_not_output_unassigned ... ok
> > tests.test_graph.Test_Partitioning.test_output_unassigned ... ok
> > tests.test_graph.Test_PythonAPI.test_ordered_connect ... ok
> > tests.test_hashbits.test__get_set_tag_density ... ok
> > tests.test_hashbits.test_n_occupied_1 ... ok
> > tests.test_hashbits.test_bloom_python_1 ... ok
> > tests.test_hashbits.test_bloom_c_1 ... ok
> > tests.test_hashbits.test_n_occupied_2 ... ok
> > tests.test_hashbits.test_bloom_c_2 ... ok
> > tests.test_hashbits.test_filter_if_present ... ok
> > tests.test_hashbits.test_combine_pe ... ok
> > tests.test_hashbits.test_load_partitioned ... ok
> > tests.test_hashbits.test_count_within_radius_simple ... ok
> > tests.test_hashbits.test_count_within_radius_big ... ok
> > tests.test_hashbits.test_count_kmer_degree ... ok
> > tests.test_hashbits.test_find_radius_for_volume ... ok
> > tests.test_hashbits.test_circumference ... ok
> > tests.test_hashbits.test_save_load_tagset ... ok
> > tests.test_hashbits.test_save_load_tagset_noclear ... ok
> > tests.test_hashbits.test_stop_traverse ... ok
> > tests.test_hashbits.test_tag_across_stoptraverse ... ok
> > tests.test_hashbits.test_notag_across_stoptraverse ... ok
> > tests.test_hashbits.test_find_stoptags ... ok
> > tests.test_hashbits.test_find_stoptags2 ... ok
> > tests.test_hashbits.test_get_ksize ... ok
> > tests.test_hashbits.test_get_hashsizes ... ok
> > tests.test_hashbits.test_extract_unique_paths_0 ... ok
> > tests.test_hashbits.test_extract_unique_paths_1 ... ok
> > tests.test_hashbits.test_extract_unique_paths_2 ... ok
> > tests.test_hashbits.test_extract_unique_paths_3 ... ok
> > tests.test_hashbits.test_extract_unique_paths_4 ... ok
> > tests.test_hashbits.test_find_unpart ... ok
> > tests.test_hashbits.test_find_unpart_notraverse ... ok
> > tests.test_hashbits.test_find_unpart_fail ... ok
> > tests.test_hashbits.test_simple_median ... ok
> > Verify that 'has_extra_sanity_checks' exists. ... ok
> > Verify that all of the various attributes exist. ... ok
> > Verify that all of the various attributes exist. ... ok
> > Verify that all of the various attributes exist. ... ok
> > Verify that all of the various attributes exist. ... ok
> > Verify that the number of threads set is what is reported. ... ok
> > Verify that the reads file chunk size is what is reported. ... ok
> > tests.test_ktable.Test_KTable.test_basic ... ok
> > tests.test_ktable.Test_KTable.test_clear ... ok
> > tests.test_ktable.Test_KTable.test_consume ... ok
> > tests.test_ktable.Test_KTable.test_hash ... ok
> > tests.test_ktable.Test_KTable.test_intersection ... ok
> > tests.test_ktable.Test_KTable.test_operator_in ... ok
> > tests.test_ktable.Test_KTable.test_populate ... ok
> > tests.test_ktable.Test_KTable.test_update ... ok
> > tests.test_ktable.test_rc ... ok
> > tests.test_ktable.test_KmerCount ... ok
> > tests.test_lump.test_fakelump_together ... ok
> > tests.test_lump.test_fakelump_stop ... ok
> > tests.test_lump.test_fakelump_stop2 ... ok
> > tests.test_lump.test_fakelump_repartitioning ... ok
> > tests.test_minmax.Test_Basic.test_max_1 ... ok
> > tests.test_minmax.Test_Basic.test_max_2 ... ok
> > tests.test_minmax.Test_Basic.test_merge_1 ... ok
> > tests.test_minmax.Test_Basic.test_merge_2 ... ok
> > tests.test_minmax.Test_Basic.test_merge_3 ... ok
> > tests.test_minmax.Test_Basic.test_merge_4 ... ok
> > tests.test_minmax.Test_Basic.test_min_1 ... ok
> > tests.test_minmax.Test_Basic.test_min_2 ... ok
> > tests.test_minmax.Test_Basic.test_tablesize ... ok
> > tests.test_minmax.Test_Filestuff.test_save_no_load ... ok
> > tests.test_minmax.Test_Filestuff.test_saveload ... ok
> > tests.test_read_parsers.test_read_properties ... ok
> > tests.test_read_parsers.test_with_default_arguments ... ok
> > tests.test_read_parsers.test_gzip_decompression ... ok
> > tests.test_read_parsers.test_bzip2_decompression ... ok
> > tests.test_read_parsers.test_with_multiple_threads ... ok
> > tests.test_read_parsers.test_old_illumina_pair_mating ... ok
> > tests.test_read_parsers.test_casava_1_8_pair_mating ... ok
> > tests.test_read_parsers.test_iterator_identities ... ok
> > tests.test_read_parsers.test_read_pair_iterator_in_error_mode_xfail ... ok
> > tests.test_scripts.test_load_into_counting ... ok
> > tests.test_scripts.test_load_into_counting_fail ... ok
> > tests.test_scripts.test_filter_abund_1 ... ok
> > tests.test_scripts.test_filter_abund_2 ... ok
> > tests.test_scripts.test_filter_abund_3_fq_retained ... ok
> > tests.test_scripts.test_filter_abund_1_singlefile ... ok
> > tests.test_scripts.test_filter_abund_4_retain_low_abund ... ok
> > tests.test_scripts.test_filter_abund_5_trim_high_abund ... ok
> > tests.test_scripts.test_filter_abund_6_trim_high_abund_Z ... ok
> > tests.test_scripts.test_filter_stoptags ... ok
> > tests.test_scripts.test_normalize_by_median ... ok
> > tests.test_scripts.test_normalize_by_median_2 ... ok
> > tests.test_scripts.test_normalize_by_median_paired ... ok
> > tests.test_scripts.test_normalize_by_median_impaired ... ok
> > tests.test_scripts.test_normalize_by_median_force ... ok
> > tests.test_scripts.test_normalize_by_median_dumpfrequency ... ok
> > tests.test_scripts.test_normalize_by_median_empty ... ok
> > tests.test_scripts.test_count_median ... ok
> > tests.test_scripts.test_load_graph ... ok
> > tests.test_scripts.test_load_graph_no_tags ... ok
> > tests.test_scripts.test_load_graph_fail ... ok
> > tests.test_scripts.test_partition_graph_1 ... ok
> > tests.test_scripts.test_partition_graph_nojoin_k21 ... ok
> > tests.test_scripts.test_partition_graph_nojoin_stoptags ... ok
> > tests.test_scripts.test_partition_graph_big_traverse ... ok
> > tests.test_scripts.test_partition_graph_no_big_traverse ... ok
> > tests.test_scripts.test_annotate_partitions ... ok
> > tests.test_scripts.test_annotate_partitions_2 ... ok
> > tests.test_scripts.test_extract_partitions ... ok
> > tests.test_scripts.test_abundance_dist ... ok
> > tests.test_scripts.test_abundance_dist_nobigcount ... ok
> > tests.test_scripts.test_abundance_dist_single ... ok
> > tests.test_scripts.test_abundance_dist_single_nobigcount ... ok
> > tests.test_scripts.test_do_partition ... ok
> > tests.test_scripts.test_do_partition_2 ... ok
> > tests.test_scripts.test_interleave_reads_1_fq ... ok
> > tests.test_scripts.test_interleave_reads_2_fa ... ok
> > tests.test_scripts.test_extract_paired_reads_1_fa ... ok
> > tests.test_scripts.test_extract_paired_reads_2_fq ... ok
> > tests.test_scripts.test_split_paired_reads_1_fa ... ok
> > tests.test_scripts.test_split_paired_reads_2_fq ... ok
> > tests.test_split.test_2_split ... ok
> > tests.test_split.test_n_split ... ok
> > tests.test_split.test_n3_split ... ok
> > tests.test_subset_graph.Test_RandomData.test_3_merge_013 ... ok
> > tests.test_subset_graph.Test_RandomData.test_3_merge_023 ... ok
> > tests.test_subset_graph.Test_RandomData.test_5_merge_046 ... ok
> > tests.test_subset_graph.Test_RandomData.test_random_20_a_succ ... ok
> > tests.test_subset_graph.Test_RandomData.test_random_20_a_succ_II ... ok
> > tests.test_subset_graph.Test_RandomData.test_random_20_a_succ_III ... ok
> > tests.test_subset_graph.Test_RandomData.test_random_20_a_succ_IV ... ok
> > tests.test_subset_graph.Test_RandomData.test_random_20_a_succ_IV_save ... ok
> > tests.test_subset_graph.Test_SaveLoadPmap.test_save_load_merge ... ok
> > tests.test_subset_graph.Test_SaveLoadPmap.test_save_load_merge_2 ... ok
> > tests.test_subset_graph.Test_SaveLoadPmap.test_save_merge_from_disk ... ok
> > tests.test_subset_graph.Test_SaveLoadPmap.test_save_merge_from_disk_2 ... ok
> > tests.test_subset_graph.test_output_partitions ... ok
> > tests.test_subset_graph.test_tiny_real_partitions ... ok
> > tests.test_subset_graph.test_small_real_partitions ... ok
> > tests.test_threaded_sequence_processor.test_basic ... ok
> > tests.test_threaded_sequence_processor.test_basic_fastq_like ... ok
> > tests.test_threaded_sequence_processor.test_odd ... ok
> > tests.test_threaded_sequence_processor.test_basic_2thread ... ok
> > tests.test_threaded_sequence_processor.test_paired_2thread ... ok
> > tests.test_threaded_sequence_processor.test_paired_2thread_more_seq ... ok
> >
> > ----------------------------------------------------------------------
> > Ran 233 tests in 20.632s
> >
> > OK
> >
> >
> >
> > On Tue, Aug 27, 2013 at 5:29 PM, C. Titus Brown <ctb at msu.edu> wrote:
> >
> > > Hmm, make sure you've deleted old versions of Khmer. What does
> > > 'make test' report in the top Khmer directory?
> > >
> > > ---
> > > C. Titus Brown, ctb at msu.edu
> > >
> > > On Aug 27, 2013, at 17:27, Rajat Shuvro Roy <rajatroy at cs.rutgers.edu>
> > > wrote:
> > >
> > > Thanks so much. I downloaded and compiled the latest version. make test
> > > resulted in 'ok' for everything. However, when I tried to run it, I
> > > get the following message:
> > >
> > > python load-into-counting.py -k 31 -x 5e10 out.kh 1Mreads.fa
> > > Traceback (most recent call last):
> > >   File "load-into-counting.py", line 13, in <module>
> > >     from khmer.counting_args import build_construct_args, report_on_config
> > > ImportError: cannot import name report_on_config
> > >
> > >
> > >
> > > On Tue, Aug 27, 2013 at 4:41 PM, C. Titus Brown <ctb at msu.edu> wrote:
> > >
> > >> Hi Rajat,
> > >>
> > >> sorry for long delay in response!
> > >>
> > >> On Thu, Jul 18, 2013 at 03:32:39PM -0400, Rajat Shuvro Roy wrote:
> > >> > Hello Prof Brown,
> > >> > I was attempting to produce a k-mer frequency distribution using khmer
> > >> > and followed the instructions in
> > >> > (http://khmer.readthedocs.org/en/latest/scripts.html). I have a Zea mays
> > >> > library (SRR404240, 95.8 Gbp) and I executed the following command.
> > >> >
> > >> > python load-into-counting.py -k 31 -x 5e10 out.kh SRR404240.fasta
> > >> >
> > >> > I believe this counts k-mer frequencies, and the script
> > >> > abundance-dist.py produces the distribution.
> > >> >
> > >> > We stopped it after it had run for 2464 mins (41 hrs) using 187 GB of
> > >> > space. I tried smaller values for -x but failed to complete the
> > >> > computation in less than 3 days. Could you please let us know if this
> > >> > is expected and we should allow more time? And is there a more
> > >> > efficient way of using khmer?
> > >>
> > >> Your e-mail actually triggered some doc changes and updates ;).
> > >>
> > >> Briefly, khmer can count k-mers in either constant-memory mode or in
> > >> accurate-large-counts mode.  In the former, counts above 255 will
> > >> stop being counted, but the memory specified with the -N and -x
> > >> parameters will be the total amount used; in the latter mode (which is
> > >> the default), counts above 255 will be kept and memory use will expand
> > >> indefinitely.
> > >>
> > >> You can use these modes easily in the latest khmer, the bleeding-edge
> > >> branch; you can get that like so:
> > >>
> > >>         git clone https://github.com/ged-lab/khmer.git -b bleeding-edge
> > >>
> > >> Then use 'load-into-counting.py -b' to build the tables, and
> > >> 'abundance-dist' to generate the output.
> > >>
> > >> I'd suggest running it on a small test data set (data/25k.fq.gz, in
> > >> the khmer repo) just to make sure it all works for you, but it should -
> > >> we use this regularly.
> > >>
> > >> Please let me know if you have any questions, and again, apologies for
> > >> the delay!
> > >>
> > >> cheers,
> > >> --titus
> > >> --
> > >> C. Titus Brown, ctb at msu.edu
> > >>
> > >
> > >
>
> --
> C. Titus Brown, ctb at msu.edu
>
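
The k-mer frequency distribution that this thread is about can be illustrated on a toy input. The sketch below is an exact, in-memory version using `collections.Counter`; it is not how khmer counts (khmer uses fixed-size probabilistic tables), it ignores reverse complements, and the helper names and the two example reads are made up for illustration.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer exactly (feasible only for small inputs)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def abundance_distribution(counts):
    """Return sorted (abundance, number of distinct k-mers) pairs."""
    return sorted(Counter(counts.values()).items())


# Hypothetical reads, chosen so the result is easy to verify by hand.
reads = ["ACGTACGT", "ACGTTT"]
dist = abundance_distribution(kmer_counts(reads, 4))
for abundance, n_kmers in dist:
    print("%d %d" % (abundance, n_kmers))
```

Here five distinct 4-mers occur once and one (ACGT) occurs three times, so the distribution has two rows.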