[bip] testing for clustered-ness on non-random background

Tue Feb 17 15:21:40 PST 2009

I don't, but maybe someone else reading the list does.  I think I 
remember that the author of Epydoc has a library for HMMs in the context 
of natural language research.

http://www.nltk.org/

That is where I would start.  Actually, I would probably try to implement 
it myself, but NLTK is probably the better choice.
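
For anyone who wants to see the moving parts before reaching for a
library, here is a rough sketch of the likelihood ratio test described in
the quoted message below, written in plain Python rather than against
NLTK's API.  The function names (fit_hmm, clustering_lrt, and so on), the
starting guesses, and the choice to fix the initial state distribution at
50/50 (so that exactly four parameters are estimated) are my own
assumptions, not something from the thread.  The chi-squared reference
with 3 degrees of freedom is only approximate here, because the switching
parameters are not identifiable under the smaller model; simulating the
statistic on shuffled sequences would be a safer calibration.

```python
import math

def emission(p_one, x):
    """Probability of observing x (0 or 1) when P(emit 1) = p_one."""
    return p_one if x == 1 else 1.0 - p_one

def forward_backward(obs, start, trans, emit):
    """Scaled forward-backward pass for a 2-state HMM with 0/1 emissions."""
    n = len(obs)
    alpha, scale = [], []
    for t, x in enumerate(obs):
        if t == 0:
            prior = start
        else:
            prior = [alpha[-1][0] * trans[0][s] + alpha[-1][1] * trans[1][s]
                     for s in (0, 1)]
        row = [prior[s] * emission(emit[s], x) for s in (0, 1)]
        c = row[0] + row[1]                 # per-position scaling factor
        scale.append(c)
        alpha.append([row[0] / c, row[1] / c])
    beta = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for s in (0, 1):
            beta[t][s] = sum(trans[s][r] * emission(emit[r], obs[t + 1])
                             * beta[t + 1][r] for r in (0, 1)) / scale[t + 1]
    loglik = sum(math.log(c) for c in scale)
    return loglik, alpha, beta, scale

def fit_hmm(obs, n_iter=100, eps=1e-6):
    """Baum-Welch (EM) fit; returns (loglik, emit, trans, posterior)."""
    n = len(obs)
    start = [0.5, 0.5]                      # assumption: fixed, not estimated
    trans = [[0.9, 0.1], [0.1, 0.9]]        # arbitrary starting guess
    emit = [0.1, 0.6]                       # arbitrary starting guess
    for _ in range(n_iter):
        loglik, alpha, beta, scale = forward_backward(obs, start, trans, emit)
        gamma = [[alpha[t][s] * beta[t][s] for s in (0, 1)] for t in range(n)]
        new_trans = [[0.0, 0.0], [0.0, 0.0]]
        for t in range(n - 1):              # expected transition counts
            for i in (0, 1):
                for j in (0, 1):
                    new_trans[i][j] += (alpha[t][i] * trans[i][j]
                                        * emission(emit[j], obs[t + 1])
                                        * beta[t + 1][j] / scale[t + 1])
        for i in (0, 1):
            total = new_trans[i][0] + new_trans[i][1]
            if total > 0:
                trans[i] = [new_trans[i][0] / total, new_trans[i][1] / total]
            weight = sum(g[i] for g in gamma)
            ones = sum(g[i] for g, x in zip(gamma, obs) if x == 1)
            if weight > 0:                  # keep emissions away from 0 and 1
                emit[i] = min(max(ones / weight, eps), 1.0 - eps)
    loglik, alpha, beta, scale = forward_backward(obs, start, trans, emit)
    enriched = 0 if emit[0] > emit[1] else 1
    # Posterior decoding: P(enriched state at position t | whole sequence).
    posterior = [alpha[t][enriched] * beta[t][enriched] for t in range(n)]
    return loglik, emit, trans, posterior

def null_loglik(obs):
    """Maximum log-likelihood of the one-parameter (no clustering) model."""
    n, k = len(obs), sum(obs)
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def chi2_sf_3df(x):
    """Upper tail probability of a chi-squared variable with 3 df."""
    if x <= 0:
        return 1.0
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

def clustering_lrt(obs):
    """Return (-2*log(L), approximate p-value, posterior of enriched state)."""
    ll_hmm, emit, trans, posterior = fit_hmm(obs)
    # EM can stop at a local optimum, so clip the statistic at zero.
    stat = max(0.0, 2.0 * (ll_hmm - null_loglik(obs)))
    return stat, chi2_sf_3df(stat), posterior

# Made-up example: a dense run of genes of interest between two empty stretches.
obs = [0] * 100 + [1, 1, 0, 1, 1, 1, 0, 1, 1, 1] * 5 + [0] * 100
stat, p_value, posterior = clustering_lrt(obs)
```

Positions where posterior is above 0.5 would be the candidate clusters.
In practice it is probably worth restarting the EM fit from a few
different initial guesses and keeping the best likelihood.
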

Brent Pedersen wrote:
> thanks, i don't know much about HMMs, but i'll look into this. do you
> have a python HMM lib you prefer?
>
> On Tue, Feb 17, 2009 at 12:41 PM, alex <argriffi at ncsu.edu> wrote:
>   
>> After you do the zero one encoding, you could consider a hidden markov model
>> with a hidden 'enriched' state and a hidden 'non-enriched' state (each with
>> a different probability of emitting a 1), and two more degrees of freedom
>> defining the probability of being in the hidden enriched state and how fast
>> you switch between enriched and non-enriched states.  So four degrees of
>> freedom.  Then you can use various known HMM algorithms to estimate
>> quantities of interest.  You could do a likelihood ratio test between nested
>> models.  The smaller model would have only one degree of freedom (the
>> probability of emitting a 1) and the larger model would have the four
>> degrees of freedom explained above.  So if L is the ratio of the maximum
>> likelihood under the smaller model to that under the larger model, then
>> -2*log(L) should be approximately chi-squared distributed with 4-1=3
>> degrees of freedom when the extra degrees of freedom are not particularly
>> helpful.  If the observed statistic is extreme relative to this
>> distribution, then some combination of the "clustering" degrees of freedom
>> was probably useful, so you can say that there is clustering.  You can use
>> posterior decoding to find the clusters.
>>
>> Alex
>>
>>
>> Brent Pedersen wrote:
>>     
>>> hi, this isn't a python question per se, but it seems like it might be
>>> a good place to ask.
>>> so i'd like to take a class of genes on a chromosome and see if they
>>> are "clustered".
>>> is there a good way to do this given that the genes are _already_
>>> clustered/non-randomly distributed along the chromosome due to the
>>> centromere, local duplications, etc?
>>> i've thought of:
>>> + encoding a chromosome as binary with 1 if it's a gene of interest
>>> and 0 for any other gene, and then taking a moving average and finding
>>> peaks that fall outside of 95% limits generated by monte-carlo. this
>>> has the problem (or perhaps benefit) that it doesn't account for base
>>> pair position, just relative gene position.
>>>
>>> + using geospatial measures like moran's I or geary's C--though those
>>> are generally 2 dimensional, i think they could be modified to handle
>>> distribution along the 1d chromosome. then i could take something
>>> like the global geary's C for the genome and compare it to the geary's
>>> C for the genes in question.
>>>
>>> any literature on this?
>>> thanks for any pointers.
>>> -brent
>>>
>>> _______________________________________________
>>> biology-in-python mailing list - bip at lists.idyll.org.
>>>
>>> See http://bio.scipy.org/ for our Wiki.
>>>       
>>
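
The moving-average idea from the original question can be sketched in a
few lines as well.  This is only an illustration under my own
assumptions: the function names are made up, the window is measured in
genes rather than base pairs (as the question notes), and the threshold
is taken from the largest window count in each shuffle, which is one way
to allow for the many windows being tested at once.  Shuffling all labels
freely treats every gene position as interchangeable, so it does not by
itself account for background structure such as local duplications.

```python
import random

def window_counts(labels, width):
    """Number of 1s in each sliding window of `width` consecutive genes."""
    count = sum(labels[:width])
    counts = [count]
    for i in range(width, len(labels)):
        count += labels[i] - labels[i - width]   # slide the window by one gene
        counts.append(count)
    return counts

def max_window_threshold(labels, width, n_shuffles=1000, level=0.95, seed=0):
    """Monte Carlo quantile of the largest window count under shuffled labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    maxima = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        maxima.append(max(window_counts(shuffled, width)))
    maxima.sort()
    return maxima[min(int(level * n_shuffles), n_shuffles - 1)]

def enriched_windows(labels, width, **kwargs):
    """Return ([(window start, count), ...] above the threshold, threshold)."""
    threshold = max_window_threshold(labels, width, **kwargs)
    counts = window_counts(labels, width)
    hits = [(i, c) for i, c in enumerate(counts) if c > threshold]
    return hits, threshold

# Made-up example: ten genes of interest in a row among 210 genes.
labels = [0] * 100 + [1] * 10 + [0] * 100
hits, threshold = enriched_windows(labels, 10, n_shuffles=200)
```

Each entry in hits is the index of the first gene in a window together
with the number of genes of interest in that window.
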