[TIP] testing and hash values

Mon Sep 30 03:19:31 PDT 2013

On 9/30/13 1:09 AM, Chris Jerdonek wrote:
> On Sun, Sep 29, 2013 at 6:14 PM, Ned Batchelder <ned at nedbatchelder.com> wrote:
>> On 9/29/13 4:19 PM, Chris Jerdonek wrote:
>>> On Sun, Sep 29, 2013 at 12:20 PM, Ned Batchelder <ned at nedbatchelder.com>
>>> wrote:
>>>> On 9/29/13 12:56 PM, Chris Jerdonek wrote:
>>>>> I have a question about the behavior of hashing prior to Python 3.3
>>>>> (when hash randomization was turned on by default [1]).
>>>>>
>>>>> I know that in earlier versions Python never made any guarantees about
>>>>> hash values and their effect on dictionary key ordering, etc [2].  But
>>>>> for testing purposes, in practice, to what extent does hashing behave
>>>>> the same across systems and Python versions prior to Python 3.3?  For
>>>>> example, the note at [2] says that "it typically varies between 32-bit
>>>>> and 64-bit builds."
>>>>>
>>>>> I'm asking because I'm curious about the extent to which tests that
>>>>> unknowingly depend on hash values are reproducible across systems and
>>>>> versions.
>>>>
>>>> Tests like that are not reproducible across systems and versions. They
>>>> may
>>>> not be reproducible as the product code changes.  Two equal dicts may not
>>>> iterate in the same order, even within a single process:
>>> What explains the following then?  For quite a while, the unit tests
>>> for a project I maintained always passed for all versions, but only
>>> when I added 3.3 (when hash randomization was enabled) did I get
>>> intermittent failures on such a test.  Do some such tests tend to
>>> behave the same *in practice* -- even to a limited extent?  Otherwise,
>>> I would have expected to see test failures in the earlier versions, at
>>> least sometime.
>>
>> Many dictionaries will behave the same across versions and systems. In the
>> example I gave below, if the keys were integers instead of strings, the two
>> dicts would iterate the same.  It all comes down to the hash values of your
>> actual keys.
>>
>> When I said "tests like that are not reproducible," I didn't mean that they
>> would actually behave differently.  I meant that you couldn't count on them
>> always behaving the same.
>>
>> Your test dictionaries happened to fall into a reproducible scenario.  The
>> fact that they always behaved the same doesn't change the fact: you were
>> relying on undefined behavior (the iteration sequence of dict keys).  Python
>> 3.3 shook things up enough for it to actually change the outcome of your
>> program.
> Hmm, there's a subtle distinction here.  There's a difference between
> the questions of whether equal dictionaries iterate to the same order
> and whether the same code yields the same values on different systems
> and versions.
>
> For example, I ran your sample code on Python 2.7, 3.2, 3.3 (on 3.3
> setting PYTHONHASHSEED=0), and PyPy 1.9, and all yielded the same
> representations of d1 and d2.  So if a test depended on that ordering
> of d2, it seems like it would be reproducible across versions at least
> for those versions.
>
> To rephrase my question, I'm asking to what extent code can be
> expected or guaranteed to behave the same if the hash seed is the same
> (as it was by default prior to 3.3).  Can you provide an example of
> code that behaves differently on some systems or versions?

Using my same code example, but printing the keys more compactly, I get 
this:

    $ python2.4 dictorder.py
    d1: 1,0,3,2,5,4,7,6,9,8
    d2: 9,2,1,5,0,3,4,6,7,8
    $ python2.5 dictorder.py
    d1: 1,0,3,2,5,4,7,6,9,8
    d2: 9,1,5,2,0,3,4,6,7,8
    $ jython dictorder.py
    d1: 8,1,7,5,0,2,6,3,9,4
    d2: 0,8,5,1,7,6,4,2,9,3

I don't know what changed between 2.4 and 2.5, I just know that I wasn't 
allowed to complain about it.  Every version since 2.5 has agreed with 
2.5.  Jython of course is a completely different beast. Dict iteration 
order isn't guaranteed.

--Ned.
>
> I'm asking because if a test is flaky because of its dependence on
> hash values, it would be useful to know that knowing the hash seed is
> enough to reliably reproduce the test failure (e.g. if some user or CI
> server encountered a sporadic failure).  Otherwise, reproducing the
> failure would be harder.
>
> --Chris
>
>
>
>> --Ned.
>>
>>
>>> --Chris
>>>
>>>
>>>>>>> d1 = dict.fromkeys(str(i) for i in range(10))
>>>>>>> d2 = dict.fromkeys(str(i) for i in range(1000000))
>>>>>>> for i in range(10, 1000000):
>>>> ...   del d2[str(i)]
>>>> ...
>>>>>>> d1 == d2
>>>> True
>>>>>>> d1
>>>> {'1': None, '0': None, '3': None, '2': None, '5': None, '4': None, '7':
>>>> None, '6': None, '9': None, '8': None}
>>>>>>> d2
>>>> {'9': None, '1': None, '5': None, '2': None, '0': None, '3': None, '4':
>>>> None, '6': None, '7': None, '8': None}
>>>>
>>>> Be careful out there...
>>>>
>>>> --Ned.
>>>>> --Chris
>>>>>
>>>>> [1]
>>>>> http://docs.python.org/3/whatsnew/3.3.html#builtin-functions-and-types
>>>>> [2] http://docs.python.org/2/using/cmdline.html#cmdoption-R
>>>>>
>>>>> _______________________________________________
>>>>> testing-in-python mailing list
>>>>> testing-in-python at lists.idyll.org
>>>>> http://lists.idyll.org/listinfo/testing-in-python
>>>>