[bip] testing for clustered-ness on non-random background

Tue Feb 17 14:58:20 PST 2009

oops. forwarding to list.

---------- Forwarded message ----------
From: Brent Pedersen <bpederse at gmail.com>
Date: Tue, Feb 17, 2009 at 2:57 PM
Subject: Re: [bip] testing for clustered-ness on non-random background
To: Bruce Southey <bsouthey at gmail.com>

On Tue, Feb 17, 2009 at 2:28 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> Brent Pedersen wrote:
>>
>> hi, this isn't a python question per se, but it seems like it might be
>> a good place to ask.
>> so i'd like to take a class of genes on a chromosome and see if they
>> are "clustered".
>> is there a good way to do this given that the genes are _already_
>> clustered/non-randomly distributed
>> along the chromosome due to the centromere, local duplications, etc?
>>
>
> What do you mean by 'non-random' and, therefore, 'random'?
> For that matter, what do you mean by 'distributed' and how do you measure
> it?
>

well, that's pretty much what i am asking how to ask. i guess i could
phrase it like the proverbial coin toss where knowing you just flipped
heads tells you nothing about future flips. so: does knowing the
location of a given member of a gene family tell me /anything/ about
the probable locations of the other members of the gene family? and
from there, take into account the non-randomness of the genome.
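
one way to make that question concrete, sketched here under assumptions rather than as a settled method: keep every gene where it is, shuffle only the "family member" labels among the genes, and ask how often a shuffle produces a window as dense as the real one. because the real gene order is kept, uneven gene density along the chromosome is held fixed; this does not model local duplication. the function names, the window size of 10 genes and the toy data are all made up for illustration.

```python
import random

def max_window_count(labels, window):
    """Largest number of family genes in any run of `window` consecutive genes."""
    count = sum(labels[:window])
    best = count
    for i in range(window, len(labels)):
        # slide the window one gene to the right
        count += labels[i] - labels[i - window]
        best = max(best, count)
    return best

def permutation_pvalue(labels, window=10, n_perm=1000, seed=0):
    """Share of label shuffles whose densest window is at least as dense as observed."""
    rng = random.Random(seed)
    observed = max_window_count(labels, window)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_window_count(shuffled, window) >= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_perm + 1)

# hypothetical toy chromosome: 200 genes in order, 8 family members in one stretch
labels = [0] * 200
for i in range(50, 58):
    labels[i] = 1

p = permutation_pvalue(labels)
print("densest window:", max_window_count(labels, 10), "p ~", p)
```

a small p here only says the labels are more bunched than a random relabelling of the same genes would give; it works in gene order, not base pairs.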

> Do you assume that the members of this gene family are, say, evenly
> spaced across the chromosome? If so, then a corresponding null
> hypothesis says the distance between adjacent genes should be the same.  Or
> perhaps the null hypothesis says that genes are placed by a Poisson
> process, so the distance between adjacent genes follows an exponential
> distribution, and if these genes don't follow that then the placement
> is non-random.  Obviously assuming one model tends to preclude
> another: if all genes are placed at random by a Poisson process then
> the genes cannot also be evenly spaced.
>
> Or do you think these genes are closer together than other genes? If so,
> perhaps the null hypothesis is that the distance between members of the gene
> family is not any different from the distance between other genes present on
> the same chromosome. Or you may use some 'M' or 'W' shaped distributions to
> address cases of fewer or more genes near the centromere and telomeres. (I
> would be tempted to treat the chromosome arms as two different entities
> because of the duplication process involved to have genes from the same
> family on different arms.)
>
>> i've thought of:
>> + encoding a chromosome as binary with 1 if it's a gene of interest
>> and 0 for any other gene
>> and then taking a moving average and finding peaks that fall outside
>> of 95% limits generated
>> by monte-carlo. this has the problem (or perhaps benefit) that it
>> doesn't account for base pair
>> position, just relative gene position.
>>
>
> If you have a sequenced chromosome of sufficient quality then you have the
> distance between genes in base pairs. Therefore you can create a distance
> matrix that can be used for clustering. Then you can perhaps examine the
> composition of the clusters or bicluster with some other factor of interest
> (like homology). Instead of distance between genes, distance to centromere
> or telomeres could be more relevant. Alternatively a multipoint feasible
> mapping function is probably more appropriate because recombination
> becomes more likely as the distance increases.

ah, this distance matrix makes sense. i'll have to read up on the
multipoint mapping stuff.
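
a minimal sketch of the distance-matrix idea, assuming each gene is reduced to a single start coordinate in base pairs. as one possible way to use the matrix (my assumption, not something suggested in the thread), it compares the family's mean nearest-neighbour distance against same-sized random draws from all genes on the chromosome. the function names and the toy coordinates are hypothetical.

```python
import random
from statistics import mean

def distance_matrix(positions):
    """Pairwise absolute distance in base pairs between gene positions."""
    return [[abs(a - b) for b in positions] for a in positions]

def mean_nearest_neighbor(positions):
    """Mean distance from each gene to its closest other gene in the set."""
    d = distance_matrix(positions)
    return mean(
        min(row[j] for j in range(len(row)) if j != i)
        for i, row in enumerate(d)
    )

def nn_pvalue(all_positions, family_positions, n_draws=1000, seed=0):
    """Share of random gene subsets that are at least as tightly packed as the family."""
    rng = random.Random(seed)
    observed = mean_nearest_neighbor(family_positions)
    k = len(family_positions)
    hits = sum(
        mean_nearest_neighbor(rng.sample(all_positions, k)) <= observed
        for _ in range(n_draws)
    )
    # add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_draws + 1)

# hypothetical toy data: 200 genes every 10 kb, 5 adjacent family members
all_genes = list(range(0, 2_000_000, 10_000))
family = [500_000, 510_000, 520_000, 530_000, 540_000]

p_nn = nn_pvalue(all_genes, family)
print("mean nearest-neighbour distance:", mean_nearest_neighbor(family), "p ~", p_nn)
```

drawing the random subsets from the real gene positions, rather than from uniform coordinates, is what keeps the background gene density in the null.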

>
>> + using geospatial measures like moran's I or geary's C--though those
>> are generally 2 dimensional,
>> i think they could be modified to handle distribution along the 1d
>> chromosome. then i could take something
>> like the global geary's C for the genome and compare it to the geary's
>> C for the genes in question.
>>
>> any literature on this?
>>
>
> Obviously things like double strands, introns, repeats and non-coding regions
> will mess up this nice simple view. So I would suggest looking at gene
> duplication theory and the distribution of repetitive DNA. Really you have
> to incorporate some evolutionary model of gene duplication such as gene
> conversion into this.
>

thanks for the thoughtful ideas. i'm going to try to keep things
simple to start.
someone also suggested using an HMM which i'll look into.
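
in the spirit of keeping things simple, the moran's I idea from the original post can be reduced to one dimension by treating each gene's immediate neighbours in chromosome order as its only "adjacent" cells. this is a sketch under that assumption (binary weights, adjacent genes only); the function name is made up and other weighting schemes, e.g. by base-pair distance, are equally plausible.

```python
def morans_i_1d(values):
    """Moran's I along a line, with weight 1 between adjacent genes and 0 otherwise.

    values: 1 for a gene of interest, 0 for any other gene, in chromosome order.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two genes")
    avg = sum(values) / n
    z = [v - avg for v in values]
    denom = sum(v * v for v in z)
    if denom == 0:
        raise ValueError("all values are identical, Moran's I is undefined")
    # each adjacent pair appears twice in the symmetric weight matrix, and the
    # weights sum to 2 * (n - 1), so the factors of 2 cancel
    cross = sum(z[i] * z[i + 1] for i in range(n - 1))
    return (n / (n - 1)) * cross / denom

clustered = morans_i_1d([1, 1, 1, 0, 0, 0])
alternating = morans_i_1d([0, 1, 0, 1, 0, 1])
print(clustered, alternating)
```

values well above the null expectation of -1/(n-1) point toward clustering and values well below it toward dispersion; significance would still need a permutation of the labels.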

> Bruce
>
>