[bip] testing for clustered-ness on non-random background

Wed Feb 18 08:18:54 PST 2009

On Tue, Feb 17, 2009 at 11:50 AM, Brent Pedersen <bpederse at gmail.com> wrote:
> hi, this isn't a python question per se, but it seems like it might be
> a good place to ask.
> so i'd like to take a class of genes on a chromosome and see if they
> are "clustered".
> is there a good way to do this given that the genes are _already_
> clustered/non-randomly distributed
> along the chromosome due to the centromere, local duplications, etc?
> i've thought of:
> + encoding a chromosome as binary with 1 if it's a gene of interest
> and 0 for any other gene
> and then taking a moving average and finding peaks that fall outside
> of 95% limits generated
> by monte-carlo. this has the problem (or perhaps benefit) that it
> doesn't account for base pair
> position, just relative gene position.
>
> + using geospatial measures like moran's I or geary's C--though those
> are generally 2 dimensional,
> i think they could be modified to handle distribution along the 1d
> chromosome. then i could take something
> like the global geary's C for the genome and compare it to the geary's
> C for the genes in question.
>
> any literature on this?
> thanks for any pointers.
> -brent
>

hi, thanks for all the ideas.
as i said, i'd like to keep it simple and use only spatial measures for
this test; that's why i mentioned the autocorrelation-based
stats above. my plan is to make a distance matrix for all genes on a
chromosome, as bruce suggested.
then, if there are 10 genes of interest in my gene family, i can
randomly sample 10 genes from the chromosome, calculate the sum of their
pairwise distances, repeat that 100 times, and sort the sums. then if the
sum of distances for the 10 genes of interest is less than
the 5th-smallest of the sorted sums, the genes of interest are more "clustered"
than random gene sets (roughly p < 0.05, one-sided). this is what i want to show.
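
a rough sketch of that resampling test, in case it helps to see it written out.
the function names are made up for illustration, and it assumes each gene is
reduced to a single base-pair position (e.g. its midpoint) on one chromosome.
it defaults to 1000 resamples rather than the 100 mentioned above, and it
returns an empirical p-value instead of comparing against the 5th sorted sum;
both are just choices made for the sketch.

```python
import random
from itertools import combinations


def sum_pairwise_distances(positions):
    """Sum of absolute distances over all unordered pairs of positions."""
    return sum(abs(a - b) for a, b in combinations(positions, 2))


def clustering_pvalue(all_positions, family_positions, n_iter=1000, seed=0):
    """One-sided empirical p-value that the family is at least as tightly
    spaced as random gene sets of the same size from the same chromosome.

    all_positions    -- positions of every gene on the chromosome (assumed input)
    family_positions -- positions of the genes of interest
    """
    rng = random.Random(seed)
    observed = sum_pairwise_distances(family_positions)
    k = len(family_positions)
    hits = 0
    for _ in range(n_iter):
        # sample k genes without replacement from the real gene positions,
        # so the non-random background layout is preserved
        sample = rng.sample(all_positions, k)
        if sum_pairwise_distances(sample) <= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_iter + 1)


if __name__ == "__main__":
    # toy chromosome: 100 genes, one every 1000 bp (made-up data)
    genes = list(range(0, 100000, 1000))
    family = genes[40:50]  # 10 adjacent genes
    print(clustering_pvalue(genes, family))
```

because the random sets are drawn from the actual gene positions, gene-poor
regions like the centromere are already built into the null distribution.
for the toy data, the 10 adjacent genes should come out with a small p-value.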
i think this will be pretty flexible:
+ works for any number of genes, even a local block of unrelated
genes, to see if it is
clustered or has tighter spacing than expected.
+ can replace sum-of-distances with sum-of-inverse distances.
+ can sample other gene families for the basis, instead of randomly
generated gene sets.
+ can use the sum of distances to only the N closest neighbors of each gene,
rather than to all genes.

if any of that's not sane, let me know.

-brent