Identifying Supporters of Political Causes
These issues aren’t new. In 2008, voters in California went to the polls to vote on Proposition 8, a measure that amended the state constitution to make same-sex marriage illegal. After the measure was approved by voters, a fierce campaign over the future of the law took shape between supporters of the measure and supporters of same-sex marriage.
One group of Proposition 8 opponents created a website called Eightmaps.com as part of the fight. The site took information made public through state campaign finance disclosure laws and overlaid it onto a map of the state. Anyone who visited the site could find the names, locations, donation amounts, and employers of people who donated money to support Proposition 8. After the site launched, many donors to Proposition 8 began experiencing threats, vandalism, intimidation, and property destruction. Eightmaps.com is no longer on the web. The full report from GovLab is on The Global Impact of Open Data website (odimpact.org/case-united-states-eightmaps.html).
Toxic Data
Gathering disparate, publicly available datasets and combining them to produce information far outside the scope of what was intended when the data was made public has become common enough to acquire a name: toxic data. In an article for Forbes written by Dan Woods (April 30, 2018; forbes.com/sites/danwoods/2018/04/30/toxic-data-a-new-challenge-for-data-governance-and-security), one security analyst outlines a hypothetical scenario in which publicly available data on travel schedules, aircraft models, and crew staffing could be combined in a way that makes someone more vulnerable to a terrorist attack.
Value Neutrality of Data
As these examples illustrate, data itself is value-neutral: It can be put to use for positive purposes, or it can be used in more nefarious ways. Most people with a stake in the open data movement realize this, which is why there is so much conversation around governance, data ethics, privacy, and security.
Two characteristics of these examples, however, reveal that discussions need to rise to a new level. First, as the volume of open data increases, the opportunities for combining disparate datasets in unexpected ways to reveal unintended information increase at an astounding rate. A second and related factor is that, just as open data advocates have always argued, open data can contribute to a faster pace of innovation. A combination of human ingenuity, computing power, and a desire to gain competitive advantage will result in data being used in ways never imagined when it was initially created.
Taking a Nuanced Approach to Open Data
While we can’t undo events that have already happened, knowing about these incidents can raise awareness of some of the pitfalls of open data and can help us all take a more thoughtful, balanced approach moving forward.
Many of the researchers we work with have reported procrastinating on publishing their data unless open access to the data is a condition of getting a journal manuscript accepted for publication. Researchers often stall in depositing their data into an open repository, but once a funding agency asks for it, there’s a flurry of activity to get a dataset published, and sometimes steps get rushed.
Instead of waiting until the last minute to publish, encourage researchers to spend some time at the end of a project thinking about how their dataset might be combined with others. Have them consider what might happen if multiple variables were combined to form a new variable. What other datasets could this one be linked to? What kinds of patterns might become visible?
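One way to make this review concrete is to count how many records share each combination of values. The sketch below is a minimal illustration and not a complete disclosure-risk assessment. The function names, the column names (zip, birth_year), and the sample records are all hypothetical. It shows how two variables that look harmless on their own can single out individuals once they are combined.

```python
from collections import Counter


def smallest_group_size(rows, columns):
    """Return the size of the smallest group of rows that share the same
    values for `columns`. A result of 1 means at least one record is unique
    on that combination. Returns 0 for an empty dataset."""
    groups = Counter(tuple(row[col] for col in columns) for row in rows)
    return min(groups.values(), default=0)


def risky_combinations(rows, columns, threshold=2):
    """List the value combinations shared by fewer than `threshold` rows."""
    groups = Counter(tuple(row[col] for col in columns) for row in rows)
    return sorted(combo for combo, count in groups.items() if count < threshold)


# Hypothetical records, invented purely for illustration.
records = [
    {"zip": "90210", "birth_year": 1980},
    {"zip": "90210", "birth_year": 1975},
    {"zip": "94110", "birth_year": 1980},
    {"zip": "94110", "birth_year": 1975},
    {"zip": "90210", "birth_year": 1980},
]

print(smallest_group_size(records, ["zip"]))                # 2
print(smallest_group_size(records, ["birth_year"]))         # 2
print(smallest_group_size(records, ["zip", "birth_year"]))  # 1
print(risky_combinations(records, ["zip", "birth_year"]))
```

In this invented example, every ZIP code and every birth year appears at least twice, yet three of the four ZIP-and-year pairs identify a single record. The same check can be repeated after merging in columns from another public dataset that a reader could plausibly link to this one.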
Many of these examples involve human subjects. Within the United States, research that involves human subjects and meets specific criteria is covered by the Federal Policy for the Protection of Human Subjects, also known as the Common Rule, which is part of the Code of Federal Regulations, 45 CFR 46, revised in July 2018 (hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html; hhs.gov/ohrp/regulations-and-policy/regulations/45-cfr-46/index.html). Researchers are expected to have an Institutional Review Board (IRB) approve methodologies used in a project before research begins. If a researcher has concerns about how the release of a dataset might impact people, that researcher should raise these concerns with the relevant IRB.
Data doesn’t have to be an all-or-nothing proposition: it doesn’t have to be either “closed” or “open.” Researchers can share some variables of a complex dataset without sharing all of them. At a minimum, they can post metadata about a project without posting the full dataset. This option should be a last resort when research was funded (in part or in whole) by an organization with an open data policy.
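As a rough illustration of partial sharing, the sketch below releases only selected variables and builds a small metadata record that names what was withheld. Everything in it is an assumption made for the example: the function names, the metadata fields, and the sample survey rows are hypothetical and do not follow any particular repository's schema.

```python
import json


def select_variables(rows, shareable):
    """Return copies of the rows that keep only the shareable columns."""
    return [{col: row[col] for col in shareable} for row in rows]


def describe_dataset(rows, shareable, title):
    """Build a metadata record listing every variable in the dataset and
    which ones are released or withheld. The field names are illustrative."""
    all_columns = sorted({col for row in rows for col in row})
    return {
        "title": title,
        "record_count": len(rows),
        "variables": all_columns,
        "released_variables": sorted(shareable),
        "withheld_variables": sorted(set(all_columns) - set(shareable)),
    }


# Hypothetical survey rows, invented purely for illustration.
survey = [
    {"name": "A. Lee", "email": "a@example.org", "age_band": "30-39", "region": "West"},
    {"name": "B. Cruz", "email": "b@example.org", "age_band": "40-49", "region": "East"},
]
shareable = ["age_band", "region"]

released = select_variables(survey, shareable)
metadata = describe_dataset(survey, shareable, "Example survey")

print(json.dumps(released, indent=2))
print(json.dumps(metadata, indent=2))
```

Dropping direct identifiers in this way does not by itself rule out re-identification through the remaining variables, so the released subset still deserves the kind of review described above.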
It is important to note that many of the funders with open data policies do allow for exceptions, but requests for exceptions should be prepared thoughtfully and should explain why releasing the data could harm subjects. Not all exceptions will be granted, but, whenever possible, they should be negotiated before a grant agreement is signed.
Making data openly accessible requires more thinking and preparation than is often realized. Researchers creating data have an ethical responsibility to consider the implications of their work. Open data should involve more preparation than simply going through the mechanics of uploading a file to an open repository.