[bip] Reproducible research

Leighton Pritchard lpritc at scri.ac.uk
Mon Mar 9 08:03:37 PDT 2009


Howdo,

This is a stimulating discussion.

On 09/03/2009 13:39, "C. Titus Brown" <ctb at msu.edu> wrote:

>> On 07/03/2009 04:39, "C. Titus Brown" <ctb at msu.edu> wrote:

> And yet... presumably you agree science Has Been Done, in bioinformatics
> and elsewhere?  Despite the generally rather abysmal quality of Methods
> sections and the lack of open source software using version-locked
> databases?  Methinks you have a contradiction ;)

I certainly do agree that Science Has Been Done in many areas, including
bioinformatics, and I hope that I was involved in some of it ;).  The bits
that count as Science were all reproducible ;)

I'm not, incidentally, arguing for any particular unique implementation of
'reproducibility', whether it involves version-locked databases or not.  I
happen to think that the precise criteria for what constitutes
reproducibility are case-dependent (e.g. the Metabolomics Standards
Initiative and MIAME have similar aims for reproducibility, but do not
record identical data), and that plenty of poor science is also reproducible
- even if only reproducibly poor ;)  So I don't think I've contradicted
myself anywhere, yet... though it's probably inevitable at some point ;)
 
> While I have respect for the argument that some form of reproducibility
> is important, I think our discussion on this list is taking it a bit
> far.  

In terms of exploring what degree of reproducibility we might consider
satisfactory, we need to discuss cases that go beyond a desirable limit in
order to establish where boundaries should be drawn.  Suggesting that
reproducibility has no role in the scientific process goes too far in the
opposite direction.

> I personally don't care too much about the exact version of nr you
> are using, unless it's somehow critically important for the analysis
> (which then suggests to me that you're doing the wrong analysis ;)

For a BLAST search, the database is a source of raw data, in that it
completely defines the set of comparison sequences against which your query
will be compared.  How is that not critically important to the analysis?  It
may well be the case that obtained results are robust to changes in the
version of nr, but they're never so robust that a sequence which is not in
nr can be found by querying that database.

All that really needs to be provided (for NCBI databases) is the database
name, and the date/range of dates it was accessed/downloaded.  That is not a
particularly onerous task for a methods section, and you'd think that it
would be included more often...
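To make the point concrete, that provenance record is small enough to generate programmatically; a minimal sketch in Python (`database_provenance` is an illustrative helper of my own, not part of any established tool):

```python
from datetime import date


def database_provenance(name, accessed, source="NCBI"):
    """Format a one-line provenance record for a methods section.

    The two essentials argued for above are the database name and
    the date it was accessed/downloaded; `source` defaults to NCBI.
    """
    return f"{source} {name} database, accessed {accessed.isoformat()}"


# A record suitable for pasting straight into a methods section:
print(database_provenance("nr", date(2009, 3, 9)))
# -> NCBI nr database, accessed 2009-03-09
```

One line of output, and the BLAST comparison set is pinned down for anyone trying to reproduce the search.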
 
> In a previous e-mail, I said that I'd like to have access to your
> source, so I can run, modify, and grok your code.

That's mostly fair, if you're interested in verifying that the claimed
algorithms are working as described, though it runs into practical problems.

For example, the chain of trust that leads through from assembled executable
code to high-level source may be broken at a number of potentially
inaccessible (from the point of view of source inspection) stages, including
in the CPU, at the compiler/interpreter and any proprietary libraries
(operating system or otherwise) that may be linked.  The question there is
again one of extremes: do you need to demand that *all* the source, the
compiler *and* the compiled executable are available for reproducibility, or
can elements be treated as modular black boxes?  Clearly, the latter is true
- so long as you have access to the same, or equivalent, closed black boxes.

If you're not testing the implementation of the method, but only its ability
to reproduce results, you don't need the source code any more than you need
the exact processor it was running on, but the executable has to be
available.  Again, there is not one single criterion for reproducibility -
it's case dependent.

> I'd also like to have
> access to the important parts of your raw data, so I can run it through
> my own tools.  

Hence the critical importance of the database against which you query...

> I think these are important to doing Science because they
> speak directly to the question of whether or not your research is
> reproducible.  Being able to reproduce every jot and tittle of your
> publication, however, is not so important to me; I can only think of
> five or six papers over the last 10 years where I would have even wanted
> to try to reproduce their results.

Your choice of those five or six papers will be different from mine, which
will again differ from most other people's.  Those choices add up across the
community, and help keep Science progressing.

Whatever the reasons for wanting or needing to reproduce the work, a
(consistent) inability of others to reproduce the work would render it at
least in need of repair, and at worst pseudoscience.

> I think it's a bit of a distraction.  It's easy to get sidetracked by
> questions of whether or not a particular analysis is reproducible.
> [...]
> someone (in the original group or
> elsewhere; eventually, someone else, if it's interesting enough) will
> follow up on the research, and we will eventually find out if the
> results truly were reproducible.

Yes, exactly: someone will follow up to find out whether the results were
reproducible, because if they aren't, the results aren't much use, however
interesting they, or the analysis, might be.

For example, N-rays (http://en.wikipedia.org/wiki/N_rays) were interesting,
but ultimately non-reproducible.  We don't return their phone calls these
days ;)
 
> Anecdote: I attended a talk the other day on next-gen sequencing, and it
> will soon be cheaper for the sequencing center to rerun a particular
> Illumina GA run than it is to store the resulting image data (~1 tb?)
> for 6 months.  So primary data is simply getting tossed.  [...]

The assumption you're already making here is that the technique *is*
reproducible, and that between-run variation in the data is acceptable.
That's been demonstrated elsewhere (the Science Has Been Done, as you say ;)).

As it happens, I throw away intermediate data like that all the time.  I
keep sequenced bacterial genomes, and the data from their comparisons (and
instructions on how to repeat the analysis to get the same results), but
throw away the hundreds of Gb of output comparison files I get: they're not
worth storing when the method is repeatable.  I get away with it, but only
because the method *is* repeatable.

> [...] Arguing that we should
> be individually responsible for retaining every bit of data that anyone
> might find relevant -- as I think some have on this list -- is,
> ultimately, silly.  Or at least distracting. ;)

I agree, but I'm not claiming that that is the case.

Cheers,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405





More information about the biology-in-python mailing list