ask a scatterbrain: sharing code and data

This from an anonymous correspondent:

Making data and code available for other researchers is an important tool for promoting replication, discovering errors, and advancing research agendas. However, how to handle the ownership situation? In this case, a junior scholar who worked for a long time on a tricky piece of code, using publicly available data, was asked to provide a copy of the code. A while later, an original piece of research appeared using the code, with appropriate citation to the original work. However, if the two researchers had been colleagues surely the programmer would have been a co-author on the work. What is the best way to handle this in general? Is there some way to communicate reasonable expectations at the point the code is shared, or should one reserve the right to review and join the work later, or what?

Author: andrewperrin

University of North Carolina, Chapel Hill

15 thoughts on “ask a scatterbrain: sharing code and data”

  1. At least as you describe it, your correspondent has a weak case for feeling wronged. The correspondent wrote the code for the original project, which resulted in a publication. The code was part of the contribution of that work. The second person did not use that contribution to “scoop” a related project the correspondent was working on, and the second person gave your correspondent credit for the code.

    If your correspondent is someone who uses R, or Firefox, or Stata add-on files, they might consider how they have themselves benefited from open-source contributions of others.

    So, unless they are an elite coder and had done something very clever, my advice for them would be to suck it up and be happy that their work has contributed also to other work. If they did do something truly ingenious, they might consider whether there is a way they can turn their coding skills more directly into a credited contribution in its own right.

    (Note also: lots of projects involve coding by professional programmers or graduate students that have no expectations of even being credited as authors on the original work. Occasionally these folks will make a cameo appearance on stage if problems are revealed in the work and the credited author needs someone to blame.)


  2. I am all for sharing in this way. But I am worried about the disincentive to share that this might create, especially for junior people. I wonder if there’s some way to specify terms up front?


  3. There are a number of boilerplate contracts out there that co-authors/collaborators can adapt to their purposes. I’m working remotely, so the one I’ve used (which I remember was from NSF although I now can’t find it using that in a search) is inaccessible. But a search yields plenty of information on said contracts.

    And review pieces. E.g., this one, which also has related articles listed in the upper right corner:

    I strongly suggest that sociologists consider using these contracts even when (especially when?) collaborating with friends, close colleagues, etc. It is too often the case that verbal agreements break down when participants’ work lives change, or simply as a result of an undetected difference of opinion or interpretation.

    Unfortunately, I find our ASA ethics statement is woefully unhelpful w/r/t issues like the above.


    1. Even in the absence of a formal contract, a conversation at the very outset of a project is key. Often it doesn’t happen because it is uncomfortable. But I can guarantee it will be more uncomfortable later!


  4. I sympathize with not wanting to be scooped on one’s own work. The requirement that data and code be public aids replication but can let someone who is faster and less picky publish ahead of the person who did the data work.

    Still, these data become public property. For example, many people use a measure of citizen ideology constructed by political scientists Berry et al. Calculation of it is very complicated. They post the data on their web site, and zillions of social scientists use the scale in their publications. Berry et al. are not co-authors on publications that use their scale, and I don’t think anyone thinks they ought to be. There’s another political scientist who regularly produces the most accurate data on partisan control of state legislatures. Ditto.

    As Jeremy said, the norm in sociology is not to include programmers and research assistants as coauthors unless they participated in the conceptual development and writing of the paper. But there is variability around this central tendency.


  5. I suppose it depends on the code in question. If it is a couple of days of piecing together some complicated variables, I am not so sure I’d be ready to share it until I got my return on the investment. It seems all too likely that, as OW stated, someone faster (or less picky) might get there first. Particularly if the categorization is novel; the idea is mine and I want to use it first.

    That said, when a paper is more or less centered around these new variables, I think something like Stata code should be made available as an electronic supplement to the article (when published) and hosted by the journal. People will cite the original article in that case anyway. It is sort of ridiculous to be forced to reverse engineer someone’s variables (sometimes the exact means of creation aren’t even clear in the text), and that sort of thing only fosters the no-replications environment in sociology.


  6. I am unclear from Andrew’s original post whether the junior scholar published the variables in a paper and the senior authors cited that paper or whether the senior authors used the code and acknowledged the junior scholar’s efforts on the code.

    If the senior authors cited the original publication of the junior scholar, I agree with those above provided that the senior authors published the paper in sociology. Rules and conventions regarding authorship and contributions vary dramatically by field. In my experience in public health, sharing that code would constitute enough of a substantial contribution to be attributed authorship (see the article that Jenn posted for citations of conventions in medical/public health journals).

    If, however, the senior authors only acknowledged unpublished work by the junior scholar then I think that the case is far less clear than other comments suggest. The idea that code becomes community property as soon as it is created is very problematic. This might seem ridiculous, but I have seen senior colleagues (especially advisors) suggest to another senior (or more senior) colleague the work of junior colleagues that is not yet published. The advisors rightfully believed their effort would benefit the junior colleagues through the exposure. But it creates a situation in which the junior colleague, who will likely be slower getting publications out, shares the code (or data) before they publish the work.

    Given the comments above, it appears that the former case is more likely, but it is important to remember that in our effort to share code and be more open we must also consider power relationships.


    1. Good point. The examples I cited are political scientists whose publications with the methodological details for their scales are both published and widely cited. Their citation counts have to go through the roof. AND their data are posted on a web site accessible to all.

      I agree that “borrowing” the code from a junior scholar whose work is not yet published is a different kettle of fish. Junior scholar perhaps ought not to have shared, but that is a different matter.

      The thing is, if you got federal funding to do your work, you are required to make it public very quickly.


  7. Yeah, I had inferred from the opening sentence about replication that we were talking about a situation where the person had already published a paper using the code and was contacted by someone else on that basis. If it’s something where they haven’t published anything from it, perhaps different story.


  8. Not all papers published are supported by federal grant funds, so does code automatically become public property simply because a paper is published and readers want it?


    1. I don’t think even the Feds require that code be shared, only data. The reason that ethics require data and code sharing is not that it becomes “public property” per se but because science depends on replicability, and you have to be able to tell people what you did in a way that can be checked. If the data are public, they can be obtained from the original public source, and it is possible to meet the ethical replication norm with a clear description of what you did without the actual code. But the first thing a new author would try to do is to replicate your tables, and they can legitimately ask for more details if they can’t do it.

      If the source of the data is owned by someone else (e.g. newspaper stories), you can’t share it without violating copyright restrictions. So that gets dicier.


  9. Could the author have published the article without use of the other person’s code? If not, then perhaps one ought to consider providing a co-author credit. If use of someone else’s custom data is sufficient to require co-authorship, why isn’t use of their custom code?

    The other problem I see is the practice in which senior authors use the results from programs written by graduate students, without whom the senior author would be unable to write the article. Although some senior scholars can write code, too, there are also some who cannot. These authors are largely dependent upon graduate students and research assistants and their ability in statistical computer programming. Should appropriating computer programs and their findings be justified as part of the deal of being a graduate or research assistant? Or do these programs – some of which take weeks, if not months, to construct and debug – merit some degree of formal recognition?

    I always thought it was best to give credit where credit is due. Here, that means some formal acknowledgement of who did what. Whether that translates into a co-authorship, a “with”, or a simple acknowledgement in the footnotes is best worked out in the early stages of the project.

