[bip] testing for clustered-ness on non-random background

Tue Feb 17 14:28:23 PST 2009

Brent Pedersen wrote:
> hi, this isn't a python question per se, but it seems like it might be
> a good place to ask.
> so i'd like to take a class of genes on a chromosome and see if they
> are "clustered".
> is there a good way to do this given that the genes are _already_
> clustered/non-randomly distributed
> along the chromosome due to the centromere, local duplications, etc?
>   
What do you mean by 'non-random' and, therefore, 'random'?
For that matter, what do you mean by 'distributed' and how do you 
measure it?

Do you assume that the members of this gene family are, say, uniformly 
(that is, evenly) spaced across the chromosome? If so, then a 
corresponding null hypothesis says the distance between adjacent genes 
should be the same.  Or perhaps the null hypothesis says that the genes 
are placed by a Poisson process, so that the number of genes per window 
follows a Poisson distribution and the distances between adjacent genes 
are roughly exponential; if these genes don't follow that, then the 
arrangement is non-random.  Obviously the assumption of one distribution 
tends to preclude another: if all genes are placed at random by a 
Poisson process, then they cannot also be evenly spaced.
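
To make the choice of null concrete, here is a minimal sketch (not a 
definitive test) that compares the spread of the gaps between adjacent 
genes with what independent random placement would give. Under random 
placement the gaps are roughly exponential, so their coefficient of 
variation (CV) is near 1; even spacing gives a CV near 0 and clumping 
gives a CV well above 1. The function names, the chromosome length and 
the positions are all made up for illustration.

```python
import random
import statistics

def gap_cv(positions):
    """Coefficient of variation of the distances between adjacent positions."""
    pos = sorted(positions)
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)

def cv_null_pvalue(positions, chrom_length, n_sim=2000, seed=0):
    """Monte Carlo p-value: share of random placements with gap CV >= observed."""
    rng = random.Random(seed)
    observed = gap_cv(positions)
    n = len(positions)
    hits = 0
    for _ in range(n_sim):
        # Assumed null: each gene lands independently, uniformly at random.
        sim = [rng.uniform(0, chrom_length) for _ in range(n)]
        if gap_cv(sim) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_sim + 1)

# Hypothetical positions in bp on a hypothetical 1 Mb chromosome.
clustered = [1000, 1500, 2100, 2600, 900000, 900400, 901000, 901700]
even = [100000 * i for i in range(1, 9)]

cv_clustered, p_clustered = cv_null_pvalue(clustered, 1_000_000)
cv_even, p_even = cv_null_pvalue(even, 1_000_000)
print("clustered:", cv_clustered, p_clustered)
print("even:     ", cv_even, p_even)
```

Note this null ignores the background entirely, which is exactly the 
weakness raised in the original question.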

Or do you think these genes are closer together than other genes? If so, 
perhaps the null hypothesis is that the distance between members of the 
gene family is no different from the distance between genes in general 
on the same chromosome. Or you may use some 'M' or 'W' shaped 
distribution to address cases of fewer or more genes near the centromere 
and telomeres. (I would be tempted to treat the chromosome arms as two 
different entities because of the duplication process needed to put 
genes from the same family on different arms.)
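
A null of the "no different from other genes on the same chromosome" 
kind can be sketched by resampling from the real gene positions, so the 
background clumping is built into the null. This is only an 
illustration: the statistic (median gap between adjacent family 
members), the function names and the toy chromosome are my assumptions.

```python
import random
import statistics

def median_gap(positions):
    """Median distance between adjacent positions."""
    pos = sorted(positions)
    return statistics.median([b - a for a, b in zip(pos, pos[1:])])

def background_pvalue(family, all_genes, n_sim=2000, seed=0):
    """Share of same-sized random gene sets that are at least as tightly spaced."""
    rng = random.Random(seed)
    observed = median_gap(family)
    k = len(family)
    # Null: the family is a random draw of k genes from all annotated genes,
    # so any clumping of the background is preserved.
    hits = sum(
        median_gap(rng.sample(all_genes, k)) <= observed for _ in range(n_sim)
    )
    return observed, (hits + 1) / (n_sim + 1)

# Hypothetical chromosome: a gene-dense region plus a gene-poor region.
dense = [2000 * i for i in range(100)]
sparse = [1_000_000 + 50_000 * j for j in range(100)]
all_genes = dense + sparse

tight_family = dense[50:58]       # 8 neighbouring genes in the dense region
spread_family = all_genes[::25]   # 8 genes spread over the whole chromosome

obs_tight, p_tight = background_pvalue(tight_family, all_genes)
obs_spread, p_spread = background_pvalue(spread_family, all_genes)
print("tight: ", obs_tight, p_tight)
print("spread:", obs_spread, p_spread)
```

Treating each chromosome arm as its own entity would just mean calling 
this once per arm with that arm's genes.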

> i've thought of:
> + encoding a chromosome as binary with 1 if it's a gene of interest
> and 0 for any other gene
> and then taking a moving average and finding peaks that fall outside
> of 95% limits generated
> by monte-carlo. this has the problem (or perhaps benefit) that it
> doesn't account for base pair
> position, just relative gene position.
>   
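The moving-average idea quoted above could be sketched roughly as 
follows. One choice here is mine rather than the poster's: the 95% 
limit is taken from the maximum window value in each shuffle, which 
guards against picking up peaks just because many windows are tested. 
The names and the toy data are hypothetical.

```python
import random

def moving_average(bits, window):
    """Mean of every run of `window` consecutive values."""
    total = sum(bits[:window])
    out = [total / window]
    for i in range(window, len(bits)):
        total += bits[i] - bits[i - window]
        out.append(total / window)
    return out

def peak_windows(bits, window, n_sim=1000, alpha=0.05, seed=0):
    """Window start indices whose average exceeds a Monte Carlo cutoff."""
    rng = random.Random(seed)
    observed = moving_average(bits, window)
    shuffled = list(bits)
    null_max = []
    for _ in range(n_sim):
        # Null: same number of 1s, in random gene order.
        rng.shuffle(shuffled)
        null_max.append(max(moving_average(shuffled, window)))
    null_max.sort()
    cutoff = null_max[min(n_sim - 1, int((1 - alpha) * n_sim))]
    return cutoff, [i for i, v in enumerate(observed) if v > cutoff]

# Hypothetical chromosome of 200 genes in order; 1 = gene of interest.
bits = [0] * 200
for i in list(range(50, 60)) + [5, 120, 180]:
    bits[i] = 1

cutoff, peaks = peak_windows(bits, window=10)
print("cutoff:", cutoff)
print("peak window starts:", peaks)
```

As noted in the question, this works on gene order only and ignores 
base pair position.
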
If you have a sequenced chromosome of sufficient quality then you have 
the distance between genes in base pairs. Therefore you can create a 
distance matrix that can be used for clustering. Then you can perhaps 
examine the composition of the clusters, or bicluster with some other 
factor of interest (like homology). Instead of distance between genes, 
distance to the centromere or telomeres could be more relevant. 
Alternatively, a multipoint feasible mapping function is probably more 
appropriate, because recombination becomes more likely as the distance 
increases.
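
For positions along a single chromosome the clustering step can be 
very simple. In one dimension, single-linkage clustering cut at a 
distance threshold is the same as splitting the sorted positions 
wherever the gap to the next gene exceeds that threshold, so no full 
distance matrix is needed. A minimal sketch, with a made-up function 
name, threshold and positions:

```python
def gap_clusters(positions, max_gap):
    """Group sorted positions; start a new cluster when a gap exceeds max_gap."""
    pos = sorted(positions)
    clusters = [[pos[0]]]
    for p in pos[1:]:
        if p - clusters[-1][-1] <= max_gap:
            clusters[-1].append(p)   # close enough to the previous gene
        else:
            clusters.append([p])     # gap too large: new cluster
    return clusters

# Hypothetical gene start positions in bp.
positions = [5200, 100, 90000, 5000, 300, 5350]
clusters = gap_clusters(positions, max_gap=1000)
print(clusters)  # [[100, 300], [5000, 5200, 5350], [90000]]
```

The choice of max_gap is arbitrary here; in practice it would have to 
come from the background spacing of genes in that region.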

> + using geospatial measures like moran's I or geary's C--though those
> are generally 2 dimensional,
> i think they could be modified to handle distribution along the 1d
> chromosome. then i could take something
> like the global geary's C for the genome and compare it to the geary's
> C for the genes in question.
>
> any literature on this?
>   
Obviously things like double strands, introns, repeats and non-coding 
regions will mess up this nice simple view. So I would suggest looking 
at gene duplication theory and the distribution of repetitive DNA. 
Really you have to incorporate some evolutionary model of gene 
duplication, such as gene conversion, into this.
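
On adapting Moran's I to one dimension, as asked above: a minimal 
sketch, assuming the simplest possible weights (each gene is a 
neighbour only of the genes immediately before and after it in 
chromosome order) and a 0/1 label per gene. The weighting scheme and 
the names are my assumptions, not an established recipe for 
chromosomes, and none of the biological complications above are 
modelled.

```python
import random

def morans_i_1d(x):
    """Moran's I along a line, with weight 1 between adjacent items only."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)  # zero if all values are identical
    # Each adjacent pair is counted in both directions (i->j and j->i).
    num = 2 * sum(a * b for a, b in zip(dev, dev[1:]))
    total_weight = 2 * (n - 1)
    return (n / total_weight) * num / denom

def morans_pvalue(x, n_sim=2000, seed=0):
    """Share of label shuffles with Moran's I >= the observed value."""
    rng = random.Random(seed)
    observed = morans_i_1d(x)
    shuffled = list(x)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        if morans_i_1d(shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_sim + 1)

# Hypothetical labels for 20 genes in chromosome order.
block = [1] * 10 + [0] * 10       # all genes of interest side by side
alternating = [0, 1] * 10         # genes of interest perfectly interleaved

i_block, p_block = morans_pvalue(block)
i_alt = morans_i_1d(alternating)
print("block:      ", round(i_block, 3), p_block)   # I is about 0.895
print("alternating:", round(i_alt, 3))              # I is close to -1
```

Weights that fall off with base pair distance could replace the 
adjacent-only weights if position matters more than order.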

Bruce