BUILDING DIGITAL LIBRARIES
Data Discovery and Data Curation Going Hand in Hand
by Terence K. Huwe
The emergence of Big Data research practices, which is revolutionizing how people parse datasets large and small, can actually strengthen the impact of library discovery skills.
In just a few short years, data curation has been widely embraced by the
profession and is recognized by many as an emerging core competency. The
reasons are many, but the power of the web as a platform for mashing up diverse
data sources is certainly a key factor. New government regulations require
researchers to share data compiled in grant-funded research, which also
provides a powerful incentive for taking a fresh look at how data can be
preserved. In 2011, the Association of Research Libraries published an
excellent summation of the potential of data curation for the library
profession titled “New Roles for New Times: Digital Curation for Preservation” (arl.org/bm~doc/nrnt_digital_curation17mar11.pdf). This report was prescient in arguing that the volume of data
and the need to preserve it are opening new opportunities for librarians to take
center stage as collaborators.
Exciting times to be sure, but with all the new energy surrounding data curation
of web-sourced and crowdsourced information, it is important to remember that
new discovery techniques can also uncover fresh value in conventional data
resources, particularly those that are generated by public mandate. For my
part, I believe that there are significant “sleeper cells” of useful data—much of it gathered by public institutions—and these data can add value when they are added to born-digital, linked
datasets.
Many public information databases are compiled with a single need in mind:
regulating construction permits, monitoring the growth of electrical grids, and
so on. These data are often in digital formats, and they can be added to
web-based or cloud-based resources and used in ways that may not have been
foreseen by the agencies that compile the data. The trick is not only to
recognize the primary goal of collecting the data but also to discover what
value the data might have in different contexts. With that in mind, I will
offer two examples of how data resources can empower new ideas in the broadest
sense, and I will also share an old-fashioned data acquisition story “from the trenches.” The story shows how local data gathered by a public agency made the crucial
difference in a research project—and suggests how it might gain value as part of larger-scale data analysis.
Big Data, Big Results
One of the best aspects of working with linked data is the ability to combine
diverse sources of information and then extrapolate more nuanced meaning from
the improved dataset. This trend is accelerating, and it currently focuses on “new” and exciting areas such as crowdsourced data generation and online consumer
behavior tracking. Rightly so: President Barack Obama’s re-election campaign used data-driven strategies alongside its political and
rhetorical vision, to considerable advantage. The 2012 U.S. elections proved
beyond a doubt that smart data, carefully deployed, was worth more than the
hundreds of millions of dollars that were hurled at the general electorate. The
overall electoral cycle demonstrated that Big Data is recognized by politicians
and entrepreneurs, as well as academics.
In the academic sphere, Big Data have created all-new approaches to research. The New York Times published an interesting update on how humanists can now analyze thousands of
online novels (The New York Times, Jan. 27, 2013, p. B3). The article describes how Matthew L. Jockers at the
University of Nebraska–Lincoln conducted word- and phrase-level textual analysis of digital books to
study long-term language patterns. The much larger sample revealed not only how
authors use words but also how they inspire other authors over the years. One
surprise finding was that a relatively small number of authors have had an
outsized impact on other writers, with Jane Austen and Sir Walter Scott at the
forefront. This analytical approach is groundbreaking, insofar as it goes
beyond the limitations imposed by much smaller samples of literature. The data
application enables researchers to place authors in a larger historical context
in ways that were not possible before.
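To make word-level comparison a little more concrete, here is a minimal Python sketch that counts word frequencies in two texts and scores how similar their vocabularies are. It is an illustration only and not the method used in the research described above; the function names and the sample sentences are hypothetical, and a real project would work with thousands of full-length books.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count each word (letters and apostrophes)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(counts_a, counts_b):
    """Return a score from 0.0 (no shared words) to 1.0 (same word profile)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample passages standing in for digitized novels.
passage_one = "The carriage arrived at the estate before the ball."
passage_two = "Before the ball, the carriage waited at the estate gate."

score = cosine_similarity(word_counts(passage_one), word_counts(passage_two))
print(round(score, 2))
```

Scaled up to a large collection, scores like this one could be compared across many pairs of books, which is the kind of work that becomes practical only when the texts are available in digital form.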
Data-driven political campaigns and large-scale literature analyses demonstrate
the blue-sky nature of Big Data—and the attendant opportunities to curate the data that is being produced. Yet
even as the new frontier expands at a rapid rate, it is still possible to find
value in existing data sources. In my opinion, Big Data applications and data
curation will reach their fullest potential when all sources, both old and new,
are re-examined with the new tools.
New Value From Not-So-New Data
Not all data worth curating are born on the web. Agencies that oversee
construction variances, hospitals, nursing homes, public works, and public
health all gather data, but in many cases, their charge is to gather data for a
single, specific purpose. The expected “data deliverable” might be tabular information for policymakers and urban planners, flowing from
the stream of new construction permits or other relatively mundane activities.
It is easy to assume that such data may be well-targeted but do not have
transferable value. The following example of wage research proves the opposite.
During the 2012 election season, one of our researchers was monitoring “living wage” campaigns across the country and was very interested to see how they would
fare. In the political discourse surrounding this issue, many voices argue that
increasing the minimum wage is bad for business, raising costs and placing a
burden on small firms in particular. Others argue that increasing low wages in
nominal increments (75 cents, for example) has a negligible effect on the economy and yet helps household incomes
significantly. Our researcher wanted to assess the actual performance and
policy ramifications of living wages to shed light on the debate and needed
help.
He needed to gather employee data on every fast-food restaurant in a specific
metropolitan region. Easily accessible sources indicated that there were more
than 3,500 establishments in all. Yet within that category, movie theaters, gas
station convenience stores, and other purveyors of food-on-the-go needed to be
winnowed out. None of the obvious data sources could provide such a pinpointed
sample.
One of our library staff members contacted the county agency that monitors food
safety in restaurants and eventually got through to its information technology
department. She learned that the agency had detailed data on every
establishment, including the exact number of employees at each location. These
were the data our researcher needed to analyze low-wage market dynamics and
write a policy brief—just 3 weeks before the election.
The agency monitors restaurants for compliance with public health regulations.
But—and this is a big but—that is literally all it is concerned about. It gathers detailed data, but the
data are only of interest when it finds a safety infraction and must fine the
offending restaurant. In our case, we had no interest in restaurant health and
safety, but we very much wanted to know employee counts at every restaurant
location. This sample would be useful as a basis for testing how living wage
policies played out “on the ground.” The agency had exactly what we wanted, and we asked if it would be willing to
share the dataset with us.
The IT manager agreed, with the proviso that no information about regulatory
compliance would be sent to us—just the whole list of restaurants and their employee counts. Once this was
agreed upon, it took a few days to receive a data file that had all of what we
wanted.
These data provide a comprehensive resource for labor economists, and they will
retain their value over the long term. Moreover, good relations with the
regulatory agency have established a foundation for receiving data updates
periodically. The dataset will also have added value if it is mashed together
with other resources, such as state- and national-level employee data, or
coupled with web- and cloud-based news and information about restaurants in the
region.
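As a rough illustration of what such a mash-up might look like, the Python sketch below totals employee counts from a local restaurant file and joins them to a larger regional table. Everything in it is assumed for the sake of the example: the column names, the place names, and the numbers are hypothetical and do not come from the agency's dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical local file: one row per restaurant with its employee count.
LOCAL_CSV = """name,town,employees
Restaurant 1,Town A,18
Restaurant 2,Town A,12
Restaurant 3,Town B,9
"""

# Hypothetical larger-scale table: all food-service employment by town.
REGIONAL_CSV = """town,food_service_employment
Town A,300
Town B,90
"""


def employees_by_town(local_text):
    """Sum the restaurant-level employee counts for each town."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(local_text)):
        totals[row["town"]] += int(row["employees"])
    return dict(totals)


def merge_with_regional(local_totals, regional_text):
    """Attach the local totals to the regional figures, town by town."""
    merged = {}
    for row in csv.DictReader(io.StringIO(regional_text)):
        town = row["town"]
        regional = int(row["food_service_employment"])
        local = local_totals.get(town, 0)
        share = local / regional if regional else 0.0
        merged[town] = {"local": local, "regional": regional, "share": share}
    return merged


merged = merge_with_regional(employees_by_town(LOCAL_CSV), REGIONAL_CSV)
for town, stats in merged.items():
    print(town, stats["local"], stats["regional"], round(stats["share"], 2))
```

The join key (here, the town name) is the design choice that matters most. Local and national sources rarely share identifiers, so agreeing on a common key is usually the first curation task.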
Curate—But Counsel Too
This reference story drives home the fact that even while we are moving
full-speed into an era when crowdsourced, web-crawled, and tagged data are
creating wholly new avenues for research, value still remains in ongoing
data-acquisition programs. Many public agencies produce data, and more often
than not, they are well-managed and have a service mentality. When locally
gathered data of this nature are obtained and merged with other larger sources,
the specificity of the local data enriches the “big picture” that Big Data can reveal.
The emergence of Big Data research practices, which is revolutionizing how
people parse datasets large and small, can actually strengthen the impact of
library discovery skills. As a result, information professionals stand to
benefit not only through digital curation and getting involved in Big Data analysis
but also through the ongoing practice of reference and resource discovery.
Because of this, I believe that it is important to promote our research and
discovery acumen in the same manner that we are currently promoting the library
as the “solution lab” for data curation. As admirable as that effort is, curation alone is, in my
opinion, just half of the needed strategy. The crucial balance may be found by
remembering that the skills inherent in reference work—discovery, pattern recognition, and analysis—offer a powerful means to convey our value proposition not only as data curators
but also as information counselors with advanced data-acquisition skills.