[bip] testing for clustered-ness on non-random background

Wed Feb 18 09:18:30 PST 2009

Brent Pedersen wrote:
> On Tue, Feb 17, 2009 at 11:50 AM, Brent Pedersen <bpederse at gmail.com> wrote:
>   
>> hi, this isn't a python question per se, but it seems like it might be
>> a good place to ask.
>> so i'd like to take a class of genes on a chromosome and see if they
>> are "clustered".
>> is there a good way to do this given that the genes are _already_
>> clustered/non-randomly distributed
>> along the chromosome due to the centromere, local duplications, etc?
>> i've thought of:
>> + encoding a chromosome as binary with 1 if it's a gene of interest
>> and 0 for any other gene
>> and then taking a moving average and finding peaks that fall outside
>> of 95% limits generated
>> by monte-carlo. this has the problem (or perhaps benefit) that it
>> doesn't account for base pair
>> position, just relative gene position.
>>
>> + using geospatial measures like moran's I or geary's C--though those
>> are generally 2 dimensional,
>> i think they could be modified to handle distribution along the 1d
>> chromosome. then i could take something
>> like the global geary's C for the genome and compare it to the geary's
>> C for the genes in question.
>>
>> any literature on this?
>> thanks for any pointers.
>> -brent
>>
>>     
>
> hi, thanks for all the ideas.
> as i said, i'd like to keep it simple--using only spatial measures for
> this test, that's why i mentioned the autocorrelation-based
> stats above. my plan is to make a distance matrix for all genes on a
> chromosome as bruce suggested.
> then, if there are 10 genes of interest in my gene family, i can
> randomly sample 10 genes and calculate the sum of their
> distances--and repeat that 100 times and sort the sums. then if the
> sum of distances of the 10 genes of interest is less than
> the 5th in my sorted sums, the genes of interest are more "clustered"
> than random gene families. this is what i want to show.
> i think this will be pretty flexible:
> + works for any number of genes, even a local block of unrelated
> genes, see if it is
> clustered or has tighter spacing.
> + can replace sum-of-distances with sum-of-inverse distances.
> + can sample other gene families for the basis, instead of randomly
> generated gene sets.
> + can use the sum of only the N closest neighbors for each gene,
> rather than all genes.
>
> if any of that's not sane, let me know.
>
> -brent
>
>   
If you have ten genes such that five are clustered on one end and five 
are clustered on the other end, then the sum of distances will be 
greater than if you have ten equally spaced genes, even though the 
first arrangement is clearly the more clustered one.  I guess that is 
why you mention taking only the N closest neighbors of each gene.
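
To make that concrete, here is a rough sketch of the sampling test with 
both statistics. It is only an illustration under assumptions: the 
function names, the use of gene start positions as plain integers, and 
the +1 correction in the empirical p-value are my own choices, not 
anything from the earlier messages.

```python
import random
from itertools import combinations


def sum_pairwise(positions):
    """Sum of distances over all pairs of genes."""
    return sum(abs(a - b) for a, b in combinations(positions, 2))


def sum_nearest(positions, n_neighbors=1):
    """Sum, over genes, of the distances to each gene's N closest neighbors."""
    total = 0
    for i, p in enumerate(positions):
        dists = sorted(abs(p - q) for j, q in enumerate(positions) if j != i)
        total += sum(dists[:n_neighbors])
    return total


def empirical_p(all_positions, family_positions, stat=sum_pairwise,
                n_iter=1000, seed=0):
    """Fraction of random gene sets at least as tightly spaced as the family.

    all_positions: positions of every gene on the chromosome (the background).
    family_positions: positions of the genes of interest.
    """
    rng = random.Random(seed)
    observed = stat(family_positions)
    k = len(family_positions)
    hits = sum(stat(rng.sample(all_positions, k)) <= observed
               for _ in range(n_iter))
    # +1 in numerator and denominator so the estimate is never exactly zero
    return (hits + 1) / (n_iter + 1)


# The two arrangements discussed above, on a stretch of length 90.
even = list(range(0, 100, 10))            # ten equally spaced genes
two_ends = [0, 1, 2, 3, 4, 86, 87, 88, 89, 90]

print(sum_pairwise(even), sum_pairwise(two_ends))   # 1650 2190
print(sum_nearest(even), sum_nearest(two_ends))     # 100 10
```

The all-pairs sum ranks the two-cluster arrangement as less clustered 
than the evenly spaced one, while the nearest-neighbor sum ranks it as 
more clustered. Because the random sets are drawn from the real gene 
positions, the existing non-random gene density is built into the 
background.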

Here is another suggestion if you don't want to do the HMM.  You could 
chop the genome into sections of some fixed size and count the number of 
genes of interest that fall into each section.  If the sections are big 
enough and genes of interest occur 'at random', then I think these 
counts should follow a Poisson distribution, for which the variance 
equals the mean.  If the variance is much less than the mean, then the 
genes are more spatially 'spread out' than you would expect, and if the 
variance is much greater than the mean, then the genes are more 
spatially 'clustered' than you would expect.  So with Titus's 
suggestion in mind, you could compare the sample variance/mean ratio 
for your gene set vs. those of other comparable sets or of randomly 
chosen sets.  I've seen this kind of thing done before 
(comparing variance/mean ratios) but I'm not sure how good an idea it is.
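
A minimal sketch of that variance/mean comparison, in case it helps. 
The function names, the integer base-pair positions, the fixed bin size 
and the way the last bin absorbs any remainder are all assumptions made 
for the example.

```python
import random
from statistics import mean, variance


def bin_counts(positions, chrom_length, bin_size):
    """Count genes in consecutive sections of bin_size along the chromosome."""
    n_bins = -(-chrom_length // bin_size)   # ceiling division
    counts = [0] * n_bins
    for p in positions:
        # a gene sitting exactly at chrom_length goes into the last bin
        counts[min(int(p // bin_size), n_bins - 1)] += 1
    return counts


def variance_mean_ratio(counts):
    """Sample variance over mean; about 1 is expected for Poisson counts."""
    return variance(counts) / mean(counts)


def ratio_vs_random(all_positions, family_positions, chrom_length, bin_size,
                    n_iter=1000, seed=0):
    """Observed ratio, plus the fraction of random gene sets with a ratio
    at least as large (random sets are drawn from the real gene positions)."""
    rng = random.Random(seed)
    observed = variance_mean_ratio(
        bin_counts(family_positions, chrom_length, bin_size))
    k = len(family_positions)
    hits = 0
    for _ in range(n_iter):
        sample = rng.sample(all_positions, k)
        ratio = variance_mean_ratio(bin_counts(sample, chrom_length, bin_size))
        if ratio >= observed:
            hits += 1
    return observed, (hits + 1) / (n_iter + 1)
```

For instance, ten genes that all fall in the first of ten sections give 
counts of [10, 0, ..., 0] and a ratio of 10, while one gene per section 
gives a ratio of 0. The result will depend on the section size, so it 
is probably worth trying a few.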

Alex