[bip] testing for clustered-ness on non-random background

Wed Feb 18 09:42:59 PST 2009

Brent Pedersen wrote:
> On Tue, Feb 17, 2009 at 11:50 AM, Brent Pedersen <bpederse at gmail.com> wrote:
>   
>> hi, this isn't a python question per se, but it seems like it might be
>> a good place to ask.
>> so i'd like to take a class of genes on a chromosome and see if they
>> are "clustered".
>> is there a good way to do this given that the genes are _already_
>> clustered/non-randomly distributed
>> along the chromosome due to the centromere, local duplications, etc?
>> i've thought of:
>> + encoding a chromosome as binary with 1 if it's a gene of interest
>> and 0 for any other gene
>> and then taking a moving average and finding peaks that fall outside
>> of 95% limits generated
>> by monte-carlo. this has the problem (or perhaps benefit) that it
>> doesn't account for base pair
>> position, just relative gene position.
>>
>> + using geospatial measures like moran's I or geary's C--though those
>> are generally 2 dimensional,
>> i think they could be modified to handle distribution along the 1d
>> chromosome. then i could take something
>> like the global geary's C for the genome and compare it to the geary's
>> C for the genes in question.
>>
>> any literature on this?
>> thanks for any pointers.
>> -brent
>>
>>     
>
> hi, thanks for all the ideas.
> as i said, i'd like to keep it simple--using only spatial measures for
> this test, that's why i mentioned the autocorrelation-based
> stats above. my plan is to make a distance matrix for all genes on a
> chromosome as bruce suggested.
> then, if there are 10 genes of interest in my gene family, i can
> randomly sample 10 genes and calculate the sum of their
> distances--and repeat that 100 times and sort the sums. then if the
> sum of distances of the 10 genes of interest is less than
> the 5th in my sorted sums, the genes of interest are more "clustered"
> than random gene families. this is what i want to show.
> i think this will be pretty flexible:
> + works for any number of genes, even a local block of unrelated
> genes, see if it is
> clustered or has tighter spacing.
> + can replace sum-of-distances with sum-of-inverse distances.
> + can sample other gene families for the basis, instead of randomly
> generated gene sets.
> + can use the sum of only the N closest neighbors for each gene,
> rather than all genes.
>
> if any of that's not sane, let me know.
>
> -brent
>   
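The sampling test described in the quoted message could be sketched roughly as below. This is only an illustration, not Brent's actual code: the function names, the use of a single midpoint coordinate per gene, and the default of 1000 resamples (the message suggests 100) are all assumptions of mine. It uses the sum of pairwise distances as the statistic and reports the fraction of random gene sets that are at least as tightly packed as the family of interest.

```python
import random


def sum_pairwise_distances(positions):
    """Sum of |a - b| over all unordered pairs of positions."""
    pts = sorted(positions)
    n = len(pts)
    # In sorted order, the point at index i is the larger member of i pairs
    # and the smaller member of (n - 1 - i) pairs.
    return sum((2 * i - n + 1) * p for i, p in enumerate(pts))


def clustering_p_value(all_positions, family_positions, n_samples=1000, seed=0):
    """Monte Carlo p-value for 'the family is more clustered than random'.

    all_positions: list with one coordinate per gene on the chromosome
                   (assumed here to be gene midpoints).
    family_positions: coordinates of the genes of interest.
    """
    rng = random.Random(seed)
    observed = sum_pairwise_distances(family_positions)
    k = len(family_positions)
    hits = 0
    for _ in range(n_samples):
        # Draw a random "gene family" of the same size from the real genes,
        # so the background clustering of genes is preserved.
        sample = rng.sample(all_positions, k)
        if sum_pairwise_distances(sample) <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_samples + 1)
```

Because the random sets are drawn from the real gene positions rather than from uniform coordinates, the existing non-random background is built into the null. Swapping in sum-of-inverse distances or nearest-neighbour sums only means replacing `sum_pairwise_distances`.
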
Here is an extension of Bruce's idea:
"""
Or perhaps the null hypothesis says that distance between adjacent genes 
follows a Poisson distribution so if these genes don't follow that then 
it is non-random.
"""

I think he meant an exponential distribution here rather than a Poisson distribution: if genes are placed at random along the chromosome (a Poisson process), the gene counts per interval are Poisson, but the distances between adjacent genes are exponential.  Anyway, you could model the inter-gene distances within a group as coming from one exponential distribution, and the inter-group distances as coming from another exponential distribution.  You could then define a pair of nested models, where the simpler model has one free parameter (the expected distance between genes) and the alternative model has three: an expected distance between genes in the same cluster, an expected distance between clusters, and an expected number of genes per cluster.  You could compare the likelihood ratio for your gene set to the likelihood ratios for other gene sets, to those for random gene sets, or to the asymptotic chi-squared distribution.
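
A minimal sketch of that comparison, under assumptions of my own: the three-parameter model is written as a two-component exponential mixture over the gaps between adjacent genes, fitted by EM, with a mixing weight `w` (the fraction of gaps that fall within a cluster) standing in for the expected number of genes per cluster, which is roughly 1 / (1 - w). All function names are made up for illustration, and the gaps are assumed to be strictly positive.

```python
import math


def single_exp_loglik(gaps):
    """Maximised log-likelihood of one exponential (MLE mean = sample mean)."""
    mean = sum(gaps) / len(gaps)
    return -len(gaps) * (math.log(mean) + 1.0)


def mixture_loglik(gaps, n_iter=200):
    """Fit a two-component exponential mixture by EM.

    Returns (log-likelihood, (mu_in, mu_out, w)), where mu_in and mu_out are
    the mean within-cluster and between-cluster gaps and w is the fraction
    of gaps assigned to the within-cluster component.
    """
    srt = sorted(gaps)
    half = len(srt) // 2
    # Crude starting point: short gaps in one component, long gaps in the other.
    mu_in = sum(srt[:half]) / half
    mu_out = sum(srt[half:]) / (len(srt) - half)
    w = 0.5

    def density(g):
        a = w / mu_in * math.exp(-g / mu_in)
        b = (1 - w) / mu_out * math.exp(-g / mu_out)
        return a, b

    for _ in range(n_iter):
        # E-step: probability that each gap is a within-cluster gap.
        resp = []
        for g in gaps:
            a, b = density(g)
            resp.append(a / (a + b))
        # M-step: responsibility-weighted means and mixing weight.
        r_sum = sum(resp)
        w = r_sum / len(gaps)
        mu_in = sum(r * g for r, g in zip(resp, gaps)) / r_sum
        mu_out = sum((1 - r) * g for r, g in zip(resp, gaps)) / (len(gaps) - r_sum)

    loglik = sum(math.log(sum(density(g))) for g in gaps)
    return loglik, (mu_in, mu_out, w)


def likelihood_ratio_statistic(positions):
    """2 * (loglik of mixture - loglik of single exponential) for one gene set."""
    pts = sorted(positions)
    gaps = [b - a for a, b in zip(pts, pts[1:])]
    ll_mix, params = mixture_loglik(gaps)
    return 2.0 * (ll_mix - single_exp_loglik(gaps)), params
```

The statistic from `likelihood_ratio_statistic` for the gene set of interest can then be ranked against the same statistic computed for other gene families or for randomly sampled gene sets. That empirical comparison is probably the safer choice, since I would not rely on the chi-squared approximation for a mixture fitted this way.
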

Alex