[khmer] Partitioning: are resulting lumps that different from each other?

Mon Jan 13 06:21:30 PST 2014

Thanks for the confirmations!

To back up my intuition, I repeated the rough blastn searches on
group0000, group0001 and group0002 produced from your Iowa corn dataset.

What I did was:
1. blastn the 1 million reads of each group against nt
2. for each read, keep the top hit if its e-value is below 1e-5
3. do a bit of BASH-fu to record which species each read's top hit comes
from, and show the top 20 species.
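
Steps 2 and 3 above could be sketched in Python roughly as follows. This is a minimal sketch, not the BASH commands actually used. It assumes tab-separated BLAST output with the twelve standard tabular columns plus a hypothetical 13th column holding the subject title, and it approximates the species as the first two words of that title. The function name `top_species` is made up for illustration.

```python
from collections import Counter


def top_species(lines, max_evalue=1e-5, n=20):
    """Tally the species of each read's best hit and return the n most common.

    Assumed column layout (0-based): 0 = read id, 10 = e-value,
    12 = subject title. The 13th column is an assumption, not a default.
    """
    best = {}  # read id -> (e-value, species) of the best hit seen so far
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 13:
            continue  # skip blank or malformed lines
        read_id, evalue, title = fields[0], float(fields[10]), fields[12]
        if evalue >= max_evalue:
            continue  # step 2: only hits with e-value below the cutoff
        if read_id not in best or evalue < best[read_id][0]:
            # Assumption: the first two words of the title name the species.
            species = " ".join(title.split()[:2])
            best[read_id] = (evalue, species)
    # Step 3: count reads per species, most frequent first.
    counts = Counter(species for _, species in best.values())
    return counts.most_common(n)
```

Called with an open file handle, e.g. `top_species(open(path))`, it returns a list of `(species, read_count)` pairs. Taking the first two words of the title is crude and will mislabel entries whose titles do not start with a binomial name.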

Results!

--- iowa-corn-50m.group0000_vs_nt.blastn.tsv ---
    3173 Rhodanobacter sp.
    2122 Rhodopseudomonas palustris
    1971 Bradyrhizobium japonicum
    1878 Streptomyces sp.
    1822 Bradyrhizobium sp.
    1202 Ramlibacter tataouinensis
    1024 Variovorax paradoxus
     890 Streptomyces fulvissimus
     801 Uncultured bacterium
     774 Intrasporangium calvum
     687 Kribbella flavida
     616 Agromonas oligotrophica
     568 Actinoplanes sp.
     539 Conexibacter woesei
     518 Streptomyces griseus
     508 Nocardioides sp.
     502 Streptosporangium roseum
     466 Micromonospora sp.
     432 Verrucosispora maris
     390 Clavibacter michiganensis

--- iowa-corn-50m.group0001_vs_nt.blastn.tsv ---
    5072 Rhodanobacter sp.
    3322 Streptomyces sp.
    2287 Bradyrhizobium japonicum
    2164 Rhodopseudomonas palustris
    2129 Bradyrhizobium sp.
    1658 Ramlibacter tataouinensis
    1486 Streptomyces fulvissimus
    1325 Intrasporangium calvum
    1271 Variovorax paradoxus
     889 Uncultured bacterium
     841 Streptomyces griseus
     728 Agromonas oligotrophica
     627 Nocardioides sp.
     626 Actinoplanes sp.
     555 Streptosporangium roseum
     552 Kribbella flavida
     546 Conexibacter woesei
     517 Micromonospora sp.
     474 Verrucosispora maris
     434 Nitrobacter hamburgensis

--- iowa-corn-50m.group0002_vs_nt.blastn.tsv ---
    7099 Rhodanobacter sp.
    4219 Streptomyces sp.
    2559 Bradyrhizobium japonicum
    2399 Bradyrhizobium sp.
    2336 Rhodopseudomonas palustris
    1931 Intrasporangium calvum
    1852 Ramlibacter tataouinensis
    1803 Streptomyces fulvissimus
    1350 Variovorax paradoxus
    1170 Streptomyces griseus
     849 Agromonas oligotrophica
     848 Uncultured bacterium
     668 Conexibacter woesei
     646 Actinoplanes sp.
     615 Streptosporangium roseum
     585 Nocardioides sp.
     559 Micromonospora sp.
     491 Clavibacter michiganensis
     473 Kribbella flavida
     454 Rubrivivax gelatinosus

... come to think of it, I should have repeated the same analysis on the
biggest lump (group0007). From what I've observed in my own datasets,
though, the biggest lump should be different from the smaller lumps.
My feeling is just that the small lumps are not very different from
each other, and that they're split mainly because of the -X setting.
I hope this (very rough) analysis illustrates what I'm trying to say!
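
One rough way to put a number on "not very different" is the overlap between the species lists of two groups. The sketch below computes a Jaccard index on species names; the helper name `jaccard` is made up, and the two lists are only the top five entries of group0000 and group0001 copied from the tables above, so this illustrates the calculation rather than giving a result for the full lists.

```python
def jaccard(a, b):
    """Jaccard index of two collections of species names (1.0 = same set)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)


# Top five species per group, copied from the tables above.
top5_group0000 = [
    "Rhodanobacter sp.",
    "Rhodopseudomonas palustris",
    "Bradyrhizobium japonicum",
    "Streptomyces sp.",
    "Bradyrhizobium sp.",
]
top5_group0001 = [
    "Rhodanobacter sp.",
    "Streptomyces sp.",
    "Bradyrhizobium japonicum",
    "Rhodopseudomonas palustris",
    "Bradyrhizobium sp.",
]

print(jaccard(top5_group0000, top5_group0001))  # 1.0
```

The index ignores both ranking and read counts, so it only says whether the same names turn up, not whether their abundances match.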

Yours
Yi Jin

On 13/01/2014 04:46, C. Titus Brown wrote:
> On Sun, Jan 12, 2014 at 11:26:50PM +0000, YiJin Liew wrote:
>> Thanks for the reply! Could I ask a few follow-up questions regarding
>> the format of the .dist file then, as I can't seem to find a full
>> description of how the file is structured?
>>
>> take for example
>>
>> --- iowa-corn-50m.dist ---
>> 1 19750012 19750012 19750012
>> 2 2905935 22655947 25561882
>> 3 745747 23401694 27799123
>> 4 324017 23725711 29095191
>> 5 167228 23892939 29931331
>> <snip>
>> 2312 1 24356713 37268397
>> 2359 1 24356714 37270756
>> 2714 1 24356715 37273470
>> 3008 1 24356716 37276478
>> 3296530 1 24356717 40573008   <-- is this the most interesting group?
>>
>> I can sort of guess what the numbers mean, but let me double-check: does
>> this indicate that there are 19.8 million clusters that are "singlets",
>> followed by 2.9 million "doublets", etc.? Also, are columns 3 and 4
>> cumulative figures for clusters and sequences respectively?
>
> Exactly!
>
>> If you don't mind, could you elaborate briefly on how groups are created
>> based on the dist file? Judging from the line counts, I suspect that the
>> script fills the first group with singlets till --max-size is hit, if
>> not, continue filling with doublets, then move on to the next group once
>> --max-size is crossed?
>
> Yep.
>
>> On my data, I've tried blastn-ing the groups0000 and 0001 produced from
>> the partitioning process, but from the results I'd wager that they're
>> roughly the same - which was what prompted me to seek advice on how the
>> script functioned.
>
> Roughly the same... no, shouldn't be.  Those are probably spurious
> BLAST matches of some sort.  If partitioning worked (and at least from
> the examples above you got a lot of partitions) then those reads are
> from different components of the overall de Bruijn graph.
>
> cheers,
> --titus
>
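
The .dist layout confirmed in the quoted exchange above (cluster size, number of clusters of that size, cumulative clusters, cumulative sequences) can be checked against the first five rows of iowa-corn-50m.dist. The sketch below does that, and adds a literal reading of the group-filling description that was answered with "Yep". It is not khmer's actual code: the function names are made up, `--max-size` is assumed to count sequences, and the grouping is run on toy numbers only.

```python
DIST_HEAD = """\
1 19750012 19750012 19750012
2 2905935 22655947 25561882
3 745747 23401694 27799123
4 324017 23725711 29095191
5 167228 23892939 29931331
"""


def parse_dist(text):
    """Return (size, count, cum_clusters, cum_seqs) tuples from .dist text."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4 or not all(p.isdigit() for p in parts[:4]):
            continue  # skip blank lines and markers such as <snip>
        rows.append(tuple(int(p) for p in parts[:4]))
    return rows


def check_cumulative(rows):
    """True if columns 3 and 4 are running totals of clusters and sequences."""
    clusters = seqs = 0
    for size, count, cum_clusters, cum_seqs in rows:
        clusters += count
        seqs += size * count
        if (clusters, seqs) != (cum_clusters, cum_seqs):
            return False
    return True


def fill_groups(rows, max_size):
    """Fill groups smallest clusters first; start a new group once the
    current one holds max_size sequences or more (assumed semantics)."""
    groups = [0]  # number of sequences in each group
    for size, count, _, _ in rows:
        for _ in range(count):
            groups[-1] += size
            if groups[-1] >= max_size:
                groups.append(0)
    if groups[-1] == 0:
        groups.pop()  # drop the empty trailing group
    return groups


print(check_cumulative(parse_dist(DIST_HEAD)))  # True

# Toy input: five singlets, two doublets, one cluster of ten sequences.
toy = [(1, 5, 5, 5), (2, 2, 7, 9), (10, 1, 8, 19)]
print(fill_groups(toy, max_size=4))  # [4, 5, 10]
```

The cumulative check only works on an unbroken run of rows from the top of the file, so the rows after the snip are left out. In the toy run the singlets and doublets end up sharing the second group, which is the kind of mixing the question was about.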
