[bip] testing for clustered-ness on non-random background
Bruce Southey
bsouthey at gmail.com
Tue Feb 17 14:28:23 PST 2009
Brent Pedersen wrote:
> hi, this isn't a python question per se, but it seems like it might be
> a good place to ask.
> so i'd like to take a class of genes on a chromosome and see if they
> are "clustered".
> is there a good way to do this given that the genes are _already_
> clustered/non-randomly distributed
> along the chromosome due to the centromere, local duplications, etc?
>
What do you mean by 'non-random' and, therefore, 'random'?
For that matter, what do you mean by 'distributed' and how do you
measure it?
Do you assume that the members of this gene family are, say, evenly
spaced across the chromosome? If so, then a corresponding null
hypothesis says the distance between adjacent genes should be the same.
Or perhaps the null hypothesis says that the genes are placed by a
Poisson process (Poisson counts per interval, exponentially distributed
distances between adjacent genes), so if these genes don't follow that
then they are non-random. Obviously the assumption of one distribution
tends to preclude another: if all genes are placed by a Poisson process,
then they cannot also be evenly spaced.
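To make one such null hypothesis concrete, here is a rough sketch of a simulation test against random placement. It assumes you have the gene start positions and the chromosome length in base pairs; the function names, the window count and the choice of the variance-to-mean ratio as the statistic are all mine, for illustration only. Under random placement the ratio should be near 1; values well above 1 suggest clumping, and values near 0 suggest even spacing.

```python
import random
from statistics import mean, pvariance


def dispersion_index(positions, chrom_length, n_windows):
    """Variance-to-mean ratio of gene counts in equal-width windows."""
    counts = [0] * n_windows
    width = chrom_length / n_windows
    for p in positions:
        # Clamp so a gene at the very end falls in the last window.
        idx = min(int(p / width), n_windows - 1)
        counts[idx] += 1
    m = mean(counts)
    return pvariance(counts) / m if m > 0 else 0.0


def uniform_null_pvalue(positions, chrom_length, n_windows=20,
                        n_sim=2000, seed=0):
    """One-sided Monte Carlo p-value for 'more clumped than random'.

    The null (an assumption of this sketch) places the same number of
    genes independently and uniformly along the chromosome.
    """
    rng = random.Random(seed)
    observed = dispersion_index(positions, chrom_length, n_windows)
    n = len(positions)
    hits = 0
    for _ in range(n_sim):
        sim = [rng.uniform(0, chrom_length) for _ in range(n)]
        if dispersion_index(sim, chrom_length, n_windows) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_sim + 1)
```

Note that this null ignores the centromere and local duplications entirely, so on a real chromosome it will tend to reject for almost any gene class.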
Or do you think these genes are closer together than other genes? If so,
perhaps the null hypothesis is that the distance between members of the
gene family is no different from the distance between genes in general
on the same chromosome. Or you may use some 'M' or 'W' shaped
distributions to address cases of fewer or more genes near the
centromere and telomeres. (I would be tempted to treat the chromosome
arms as two different entities because of the duplication process
involved to have genes from the same family on different arms.)
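A sketch of that second null hypothesis, comparing the family against the other genes on the same chromosome (or arm): draw random gene sets of the same size from all annotated gene positions and see how often they are as tightly spaced as the family. The names are hypothetical, and the median adjacent gap is my choice of statistic; the mean adjacent gap would only reflect the first and last gene.

```python
import random
from statistics import median


def median_adjacent_gap(positions):
    """Median distance (bp) between neighbouring genes after sorting."""
    s = sorted(positions)
    return median(b - a for a, b in zip(s, s[1:]))


def background_gap_test(family_pos, all_gene_pos, n_perm=2000, seed=0):
    """Permutation p-value that the family is at least this tightly spaced.

    family_pos: positions of the genes of interest (at least two).
    all_gene_pos: positions of every gene on the same chromosome or arm,
        including the family, so the existing non-random background is
        built into the null.
    """
    rng = random.Random(seed)
    observed = median_adjacent_gap(family_pos)
    k = len(family_pos)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(all_gene_pos, k)
        if median_adjacent_gap(sample) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Because the random sets are drawn from real gene positions, gene-poor regions such as the centromere are avoided by the null in the same way they are avoided by the family.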
> i've thought of:
> + encoding a chromosome as binary with 1 if it's a gene of interest
> and 0 for any other gene
> and then taking a moving average and finding peaks that fall outside
> of 95% limits generated
> by monte-carlo. this has the problem (or perhaps benefit) that it
> doesn't account for base pair
> position, just relative gene position.
>
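For what it is worth, the moving-average idea can be sketched roughly as follows. The function names are hypothetical, and one detail is my assumption rather than part of the proposal: the 95% limit is taken from the maximum window count in each shuffle, so that a flagged window is unusual for the whole chromosome and not just for one position.

```python
import random


def moving_sum(flags, window):
    """Sliding-window count of 1s; element i covers flags[i:i + window]."""
    total = sum(flags[:window])
    out = [total]
    for i in range(window, len(flags)):
        total += flags[i] - flags[i - window]
        out.append(total)
    return out


def windows_above_null(flags, window=10, n_sim=1000, alpha=0.05, seed=0):
    """Return (threshold, start indices of windows exceeding it).

    flags: list of 0/1 in gene order, 1 marking a gene of interest.
    The null shuffles the labels among genes, keeping their number fixed.
    Assumes window <= len(flags).
    """
    rng = random.Random(seed)
    observed = moving_sum(flags, window)
    shuffled = list(flags)
    null_max = []
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        null_max.append(max(moving_sum(shuffled, window)))
    null_max.sort()
    # Upper (1 - alpha) quantile of the per-shuffle maximum.
    threshold = null_max[min(n_sim - 1, int((1 - alpha) * n_sim))]
    peaks = [i for i, c in enumerate(observed) if c > threshold]
    return threshold, peaks
```

As noted in the proposal, this works in gene order only and ignores base pair distances.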
If you have a sequenced chromosome of sufficient quality then you have
the distance between genes in base pairs. Therefore you can create a
distance matrix that can be used for clustering. Then you can perhaps
examine the composition of the clusters or bicluster with some other
factor of interest (like homology). Instead of distance between genes,
distance to centromere or telomeres could be more relevant.
Alternatively, a multipoint feasible mapping function is probably more
appropriate, because recombination becomes more likely as the distance
increases.
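As a minimal sketch of the distance-based clustering: on a single chromosome the positions are one-dimensional, so single-linkage clustering cut at a given distance is the same as splitting the sorted positions wherever the gap to the next gene exceeds that distance. The function name and the cutoff are hypothetical; choosing a sensible cutoff is the real problem.

```python
def gap_clusters(positions, max_gap):
    """Group positions (bp) into clusters of genes chained by gaps <= max_gap.

    Equivalent to single-linkage clustering of 1-D points cut at max_gap.
    """
    ordered = sorted(positions)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for p in ordered[1:]:
        if p - clusters[-1][-1] <= max_gap:
            clusters[-1].append(p)   # close enough: extend current cluster
        else:
            clusters.append([p])     # gap too large: start a new cluster
    return clusters
```

For example, `gap_clusters([100, 5000, 150, 5200, 90000], 1000)` gives three clusters, whose membership can then be compared with homology or another factor of interest.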
> + using geospatial measures like moran's I or geary's C--though those
> are generally 2 dimensional,
> i think they could be modified to handle distribution along the 1d
> chromosome. then i could take something
> like the global geary's C for the genome and compare it to the geary's
> C for the genes in question.
>
> any literature on this?
>
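On the Moran's I idea, a one-dimensional version is easy to sketch if you take the weights to be 1 for genes that are immediate neighbours in gene order and 0 otherwise. That weighting, the function names and the shuffle-based p-value are my assumptions; a distance-based weight in base pairs would be another reasonable choice. With these weights Moran's I reduces to N/(N-1) times the sum of products of adjacent deviations divided by the sum of squared deviations.

```python
import random


def morans_i_1d(values):
    """Moran's I for values in gene order with nearest-neighbour weights."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("all values identical; Moran's I is undefined")
    # Each adjacent pair appears twice in the weight matrix, and the
    # total weight is 2 * (n - 1), so the factors of 2 cancel.
    cross = sum(a * b for a, b in zip(dev, dev[1:]))
    return (n / (n - 1)) * cross / denom


def morans_i_permutation_p(values, n_perm=2000, seed=0):
    """One-sided p-value for positive autocorrelation by label shuffling."""
    rng = random.Random(seed)
    observed = morans_i_1d(values)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i_1d(shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Here `values` would be the 0/1 encoding from the first idea; values near +1 mean the genes of interest sit next to each other, values near -1 mean they alternate with other genes.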
Obviously things like double strands, introns, repeats and non-coding
regions will mess up this nice simple view. So I would suggest looking
at gene duplication theory and the distribution of repetitive DNA.
Really you have to incorporate some evolutionary model of gene
duplication such as gene conversion into this.
Bruce