[khmer] Partitioning: are resulting lumps that different from each other?
YiJin Liew
YiJin.Liew at KAUST.EDU.SA
Mon Jan 13 06:21:30 PST 2014
Thanks for the confirmations!
To back up my intuition, I repeated the rough blastn searches on
group0000, group0001 and group0002 produced from your Iowa corn dataset.
What I did was:
1. blastn the 1 million reads of each group against nt
2. for each read, pick the top hit with an e-value below 1e-5
3. do a bit of BASH-fu to record which species each read's top hit comes
from, and show the top 20.
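The three steps above can be sketched in Python roughly as follows. This is not the BASH that was actually run. It assumes the blastn output is tab-separated with the e-value in column 11 and a species name in a 13th column (for example, something like -outfmt "6 std sscinames"), and that hits for each read are listed best-first. The function name top_species and the column layout are assumptions made for illustration.

```python
from collections import Counter

EVALUE_COL = 10   # 0-based; column 11 of "std" tabular output (assumed layout)
SPECIES_COL = 12  # 0-based; hypothetical 13th column holding the species name


def top_species(lines, max_evalue=1e-5, n=20):
    """Tally the species of each read's first qualifying hit.

    Returns the n most common (species, count) pairs.
    """
    seen_reads = set()
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        read_id = fields[0]
        if read_id in seen_reads:
            continue  # this read has already been counted once
        if float(fields[EVALUE_COL]) >= max_evalue:
            continue  # hit is not below the e-value cutoff
        seen_reads.add(read_id)
        counts[fields[SPECIES_COL]] += 1
    return counts.most_common(n)
```

Passing an open file handle for one of the .blastn.tsv files and printing each count next to its species would give a listing shaped like the ones below, provided the file really has the assumed columns.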
Results!
--- iowa-corn-50m.group0000_vs_nt.blastn.tsv ---
3173 Rhodanobacter sp.
2122 Rhodopseudomonas palustris
1971 Bradyrhizobium japonicum
1878 Streptomyces sp.
1822 Bradyrhizobium sp.
1202 Ramlibacter tataouinensis
1024 Variovorax paradoxus
890 Streptomyces fulvissimus
801 Uncultured bacterium
774 Intrasporangium calvum
687 Kribbella flavida
616 Agromonas oligotrophica
568 Actinoplanes sp.
539 Conexibacter woesei
518 Streptomyces griseus
508 Nocardioides sp.
502 Streptosporangium roseum
466 Micromonospora sp.
432 Verrucosispora maris
390 Clavibacter michiganensis
--- iowa-corn-50m.group0001_vs_nt.blastn.tsv ---
5072 Rhodanobacter sp.
3322 Streptomyces sp.
2287 Bradyrhizobium japonicum
2164 Rhodopseudomonas palustris
2129 Bradyrhizobium sp.
1658 Ramlibacter tataouinensis
1486 Streptomyces fulvissimus
1325 Intrasporangium calvum
1271 Variovorax paradoxus
889 Uncultured bacterium
841 Streptomyces griseus
728 Agromonas oligotrophica
627 Nocardioides sp.
626 Actinoplanes sp.
555 Streptosporangium roseum
552 Kribbella flavida
546 Conexibacter woesei
517 Micromonospora sp.
474 Verrucosispora maris
434 Nitrobacter hamburgensis
--- iowa-corn-50m.group0002_vs_nt.blastn.tsv ---
7099 Rhodanobacter sp.
4219 Streptomyces sp.
2559 Bradyrhizobium japonicum
2399 Bradyrhizobium sp.
2336 Rhodopseudomonas palustris
1931 Intrasporangium calvum
1852 Ramlibacter tataouinensis
1803 Streptomyces fulvissimus
1350 Variovorax paradoxus
1170 Streptomyces griseus
849 Agromonas oligotrophica
848 Uncultured bacterium
668 Conexibacter woesei
646 Actinoplanes sp.
615 Streptosporangium roseum
585 Nocardioides sp.
559 Micromonospora sp.
491 Clavibacter michiganensis
473 Kribbella flavida
454 Rubrivivax gelatinosus
... come to think of it, I should've repeated the same analysis on the
biggest lump (group0007). However, from what I've observed from my own
datasets, the biggest lump should be different from the smaller lumps.
It's just that I feel that the small lumps are actually not very
different from each other, and they're split mainly because of the -X
setting. Hope this (very rough) analysis of mine illustrates what I'm
trying to say!
Yours
Yi Jin
On 13/01/2014 04:46, C. Titus Brown wrote:
> On Sun, Jan 12, 2014 at 11:26:50PM +0000, YiJin Liew wrote:
>> Thanks for the reply! Could I ask a few follow-up questions regarding
>> the format of the .dist file then, as I can't seem to find a full
>> description of how the file is structured?
>>
>> take for example
>>
>> --- iowa-corn-50m.dist ---
>> 1 19750012 19750012 19750012
>> 2 2905935 22655947 25561882
>> 3 745747 23401694 27799123
>> 4 324017 23725711 29095191
>> 5 167228 23892939 29931331
>> <snip>
>> 2312 1 24356713 37268397
>> 2359 1 24356714 37270756
>> 2714 1 24356715 37273470
>> 3008 1 24356716 37276478
>> 3296530 1 24356717 40573008 <-- is this the most interesting group?
>>
>> I can sort of guess what the numbers mean, but let me double-check: does
>> this indicate that there are 19.8 million clusters that are "singlets",
>> followed by 2.9 million "doublets", etc.? Also, are columns 3 and 4
>> cumulative figures for clusters and sequences respectively?
>
> Exactly!
>
>> If you don't mind, could you elaborate briefly on how groups are created
>> based on the dist file? Judging from the line counts, I suspect that the
>> script fills the first group with singlets till --max-size is hit, if
>> not, continue filling with doublets, then move on to the next group once
>> --max-size is crossed?
>
> Yep.
>
>> On my data, I've tried blastn-ing the groups0000 and 0001 produced from
>> the partitioning process, but from the results I'd wager that they're
>> roughly the same - which was what prompted me to seek advice on how the
>> script functioned.
>
> Roughly the same... no, shouldn't be. Those are probably spurious
> BLAST matches of some sort. If partitioning worked (and at least from
> the examples above you got a lot of partitions) then those reads are
> from different components of the overall de Bruijn graph.
>
> cheers,
> --titus
>
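The reading of the .dist columns confirmed in the quoted exchange can be checked with a short Python sketch, which also illustrates the group-filling behaviour as it was described there. This is only an illustration of that description, not khmer's own code. The names cumulative_dist and fill_groups are hypothetical, and the smallest-first ordering is taken from the question that was answered with "Yep".

```python
def cumulative_dist(rows):
    """rows: iterable of (partition_size, n_partitions), ascending by size.

    Yields (size, n_partitions, cumulative_partitions, cumulative_sequences),
    the four columns as interpreted in the quoted exchange.
    """
    total_partitions = 0
    total_sequences = 0
    for size, n_partitions in rows:
        total_partitions += n_partitions
        total_sequences += size * n_partitions
        yield size, n_partitions, total_partitions, total_sequences


def fill_groups(partition_sizes, max_size):
    """Greedy grouping as described in the thread (an assumption, not khmer code).

    Partitions are taken smallest-first; a group is closed once its
    sequence count crosses max_size. Returns the sequence count per group.
    """
    groups = []
    current = 0
    for size in sorted(partition_sizes):
        current += size
        if current > max_size:
            groups.append(current)
            current = 0
    if current:
        groups.append(current)  # leftover partitions form the last group
    return groups
```

Feeding the first five (size, count) pairs from the iowa-corn-50m.dist excerpt into cumulative_dist reproduces columns 3 and 4 of those rows, which is consistent with the "cumulative clusters, cumulative sequences" reading.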