CASE STUDIES The data life cycle applied to our own data

Increased demand for data-driven decision making is driving the need for librarians to be facile with the data life cycle. This case study follows the migration of reference desk statistics from handwritten to digital format. This shift presented two opportunities: first, the availability of a nonsensitive data set to improve the librarians’ understanding of data-management and statistical analysis skills, and second, the use of analytics to directly inform staffing decisions and departmental strategic goals. By working through each step of the data life cycle, library faculty explored data gathering, storage, sharing, and analysis questions.


INTRODUCTION
In 2003, the National Institutes of Health mandated that researchers provide data-sharing plans for grant applications requesting more than $500,000 [1]. This mandate, combined with the requirements of the 2011 National Science Foundation's (NSF's) Data Management Plan and other emerging restrictions for funding [2], has contributed to a greater interest in data sets, as well as their sharing and reuse in the health sciences. Librarians could benefit significantly from being directly exposed to the stages that data move through, from creation until deletion or destruction, a process commonly called the data life cycle. Understanding the data life cycle is essential both in their own work and as they collaborate with researchers. Firsthand experience assists not only in developing solutions for the potential barriers in gathering and analyzing data, but also in curating data sets and discovering relevant external data sets. Librarians should also recognize that they themselves generate many valuable data sets as part of their everyday workflow. This case study provides a review of a process developed around a commonly available data set: reference desk statistics. This process can be easily employed to provide librarians with a self-directed opportunity to enhance their data-management, data-curation, and data-analysis skills.

CONTEXT
The Library of the Health Sciences (LHS-C) at the University of Illinois at Chicago (UIC) is one of the largest health sciences library systems in the United States.
LHS-C has been subject to many of the same circumstances affecting libraries throughout the nation. In fall 2011, resource redistribution had reduced support staff availability in the information services department by 50%. To compensate for this loss, the information services faculty was charged by the department head to identify potential solutions. This led to the goals of modernizing desk statistics collection and reconsidering the information services staffing model. One important change was that all future data collection would be performed electronically. With retroactive digitization of records, both historical research and trend analysis are possible.
At the same time, the information services faculty began receiving increased requests for research data-management assistance. Various tutorials and continuing education had been pursued and reviewed [3][4][5]. At the time, the available material focused primarily on creating a data-management plan and did not provide librarians with the opportunity to explore their own data as a self-educational tool. It became clear that a working knowledge of the data life cycle from start to finish was essential, and the recently transitioned desk metrics provided a nonsensitive data set to work through the data life cycle model.

LITERATURE REVIEW
Reference desk data are often the subject of library research. A 2007 study conducted at LHS-C examined both quantitative and qualitative data collected between 1990 and 2005 [6]. This was followed in 2010 by Barrett's examination of reference desk statistics from 1990-2009 at the Crawford Library of the Health Sciences-Rockford, a regional campus of UIC [7]. In both studies, data indicated that reference desk interactions seemed to be on the decline. Other research includes a 2009 paper from McMaster University Library incorporating evidence-based practice into an operational review of the library. Analysis of data collected through a form capturing reference desk statistics in conjunction with a sheet for observational tracking resulted in minimizing dedicated librarian staffing of the reference desk, emphasizing drop-in consultations for complex questions, and improving support staff training [8]. More recently, a 2011 article by Carter and Ambrosi described one methodology for tracking reference desk statistics via tools available from Google; however, the authors did not cover the topic of managing the data set after collection [9].
Recent library research has highlighted the importance of data management. A 2013 systematic review of the emerging roles in health sciences librarianship from 1990-2012 explicitly identified data management librarians [10]. A 2012 article on translational researchers' perceptions of data maintenance presented the library as having a role in areas such as repository management, training in searching databases, and metadata description and discovery [11]. Further, Carlson's 2013 paper identified barriers and opportunities for librarian education in this area, particularly suggesting that levels of engagement with data remained stagnant despite workshop attendance. Barriers to the respondents' engagement with data curation included organizational support, staffing, and time [12]. Finally, Marshall et al. provided a case study of the data-management process for librarians. While the article provided excellent guidelines, it neither fully explored each data life cycle stage nor discussed library-generated data sets [13].
Supplemental Figure 1 is available with the online version of this journal.

METHODS AND MATERIALS
There are several possible templates for describing the data life cycle. Popular examples include those from the Digital Curation Center, the NSF-funded geology project DataONE, and the California Digital Library [14][15][16]. The life cycle model described by DataONE was chosen as an initial template because it most closely aligned with the intended goals.
The following life cycle stages were included in the final project template: identifying, digitizing, cleaning, describing, storing and preserving, sharing, and analyzing. Questions for each stage were identified (Table 1). Prior to the project, the authors created a sample data-management plan using the Institute of Museum and Library Services template in the DMPTool, which is the data-management planning software available from the University of California Curation Center [17].

Identifying
Desk statistics between July 2006 and August 2011 were archived in print form. Data tracking was transitioned to a digital format using Google Forms starting in September 2011. Due to the amount of data being captured and limitations of Google Tools, data were removed from the live repository on an annual basis at the close of the fiscal year and archived independently.

Digitizing
To perform more robust analysis, the print archive required conversion into a digital format. To facilitate this, the authors defined coding procedures (e.g., time stamp should be recorded for the hour window in which the question was asked), and the conversion project was assigned to student staff and temporary support employees. Initially, these procedures were recorded informally on the first author's blog [18] but were then strictly documented as the project continued. Digitization began in May 2012, and the complete print archive was digitized by March 2014. Digitization required far more time and effort than initially anticipated and needed to be distributed over a total of nine different employees. Also, project management was inconsistent until August 2013, when a centralized ad hoc tool was developed by the first author.
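The hour-window coding rule mentioned above can be illustrated with a short sketch. This is not the authors' procedure, which was carried out by hand during data entry; the function name `hour_window` and the input format are assumptions made only for illustration.

```python
from datetime import datetime


def hour_window(raw: str, fmt: str = "%Y-%m-%d %H:%M") -> str:
    """Return the start of the hour window that contains the recorded time.

    `fmt` is an assumed input format; a real print archive would need
    whatever format the handwritten sheets actually used.
    """
    moment = datetime.strptime(raw, fmt)
    # Drop minutes and seconds so every question asked between
    # 14:00 and 14:59 is coded to the same 14:00 window.
    window_start = moment.replace(minute=0, second=0, microsecond=0)
    return window_start.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    print(hour_window("2010-03-15 14:37"))
```

Coding every entry to the start of its window keeps records comparable even when different employees read the same handwritten sheet slightly differently.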

Cleaning
Over the course of the project, it became clear that record consistency correlated inversely with the number of people performing data entry. After reviewing professional literature on librarians performing data standardization [19,20], the authors identified OpenRefine [21] as an open source tool that would allow efficient data set normalization. Single academic year subunits of the data set were uploaded to OpenRefine as digitization was completed and were standardized based on predetermined rules (e.g., time stamp format of HH:MM:SS). Techniques such as clustering and faceting were used to group similar fields and summarize information in a column or to filter related data aggregates. The standardization processes were documented to enable consistent repetition as new data became available.
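For readers without OpenRefine at hand, the two standardization ideas described here can be approximated in a few lines of Python. This is a simplified sketch, not the authors' OpenRefine workflow: `normalize_time` coerces a few assumed hand-entered time styles to HH:MM:SS, and `cluster` groups values by a simplified key-collision fingerprint in the spirit of OpenRefine's clustering. All function names and sample values are hypothetical.

```python
import re
import string
from collections import defaultdict

# Accepts forms such as "9:05", "14:30:00", "2 pm", "2:15 PM" (assumed styles).
TIME_RE = re.compile(
    r"^\s*(\d{1,2})(?::(\d{1,2}))?(?::(\d{1,2}))?(?:\s*([ap])\.?m\.?)?\s*$",
    re.IGNORECASE,
)


def normalize_time(raw: str) -> str:
    """Coerce a hand-entered time to the HH:MM:SS rule."""
    match = TIME_RE.match(raw)
    if match is None:
        raise ValueError(f"unrecognized time: {raw!r}")
    hour = int(match.group(1))
    minute = int(match.group(2) or 0)
    second = int(match.group(3) or 0)
    meridiem = (match.group(4) or "").lower()
    if meridiem == "p" and hour < 12:
        hour += 12
    elif meridiem == "a" and hour == 12:
        hour = 0
    if hour > 23 or minute > 59 or second > 59:
        raise ValueError(f"time out of range: {raw!r}")
    return f"{hour:02d}:{minute:02d}:{second:02d}"


def fingerprint(value: str) -> str:
    """Simplified key: lowercase, strip punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group distinct spellings that share a fingerprint."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    # Only groups with more than one spelling need a reviewer's attention.
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(normalize_time("2:15 PM"))
    print(cluster(["UIC Student", "uic student", "Student, UIC", "Faculty"]))
```

As in OpenRefine, the clusters are only suggestions; a person still decides which spelling becomes the standardized answer.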

Describing
The project notes document was digitally shared among collaborators, allowing reliable coordination of timelines, instructions to the student employees performing the digitization, and project management reviews. Records were maintained that contained each column header (time stamp, patron type, question type, notes) and definitions of standardized answers. Also included in a separate text file were known gaps, author contact information, and software packages used for digitization, cleaning, and initial analysis in order to synchronize appropriate metadata with the finalized data set (Figure 1, online only).
Case studies: Goben and Raszewski
A further text file, supplemental to the data set, provided additional information about the project, including the specific JSON code used to process the data in OpenRefine. Bibliographic and library ontologies were consulted, but most of those were targeted at library collections rather than library events. The Library, Information Science & Technology Abstracts [22] thesaurus was used to determine appropriate subject headings and key terms to associate with the data set to improve future discovery in an institutional repository.
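A descriptive text file of the kind outlined in this section can be assembled programmatically so that it stays in step with the data set. The sketch below is an assumption-laden illustration rather than the authors' actual file: the column names come from the text, but the definitions, contact address, and layout are hypothetical placeholders.

```python
def build_readme(title, columns, known_gaps, contact, software):
    """Assemble a plain-text description to travel with the data set."""
    lines = [title, "", "Contact: " + contact, "", "Columns:"]
    for name, definition in columns.items():
        lines.append(f"  {name}: {definition}")
    lines += ["", "Known gaps:"]
    lines += [f"  {gap}" for gap in known_gaps] or ["  none recorded"]
    lines += ["", "Software used:"]
    lines += [f"  {tool}" for tool in software]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Column names follow the text; the definitions are hypothetical.
    columns = {
        "time stamp": "hour window in which the question was asked (HH:MM:SS)",
        "patron type": "standardized patron category",
        "question type": "standardized question category",
        "notes": "free-text comments from the desk staff",
    }
    readme = build_readme(
        title="Reference desk statistics",
        columns=columns,
        known_gaps=[],  # fill in the actual missing date ranges
        contact="author@example.edu",  # placeholder address
        software=["OpenRefine", "Microsoft Excel 2007"],
    )
    print(readme)
```

Generating the file from the same column list used during cleaning reduces the chance that the documentation and the data drift apart.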

Storing and preserving
Storage and data access were fundamental issues. Options had to be evaluated for the short term, while the data set was being digitized and gathered, and for the long term, after study analysis had been completed and the data set was ready to be shared. To control access and localize initial data storage during digitization, the university installation of Box [23], an online file management service, was chosen for short-term storage. Using Box allowed convenient collaboration between the authors as well as manageable, limited, and easily revocable user access for employees performing data entry.
Many options and challenges for data sharing and long-term data storage were considered. These included using a subject repository versus a general data repository, cost, potential to embargo the data, and license requirements. The UIC repository, INDIGO [24], was selected as the long-term storage solution for this data set, with a supplemental version stored with the published manuscript on PMC. INDIGO, built on DSpace software, had the advantages of being locally maintained, readily available due to the University Library Open Access Mandate [25], and flexible in terms of file-level publishing. However, it also has limitations: files must be locally downloaded for access, and there is no mechanism for update management. The final data set was released in INDIGO in June 2014 [26].
In the absence of a grant requiring a specific length of time for data set preservation, the initial duration was identified as five years to meet professional journal standards [27]. Beyond the five-year point, the authors intend to preserve the data for as long as it continues to have utility.

Sharing
In the interest of understanding data-sharing nuances and because desk statistics are not considered sensitive, the authors wished to share the data set under terms as broad as possible for reuse as an educational tool for other librarians looking to work through the data life cycle. This raised questions regarding ownership and rights, as the data were initially gathered as part of the regular work by publicly funded employees at a state institution.
In determining potential rights-holders of the data set, the following possibilities had to be considered: the authors, the University Library (as an independent college), the UIC institution, the entire University of Illinois system, and potentially, the state of Illinois itself. While a few institutional guidelines surrounding copyright assignment are available from the University of Illinois Board of Trustees [28], ultimate licensing policy remained unclear. Due to both the generative and transformational work done by the authors to compile the data electronically, standardize it for analysis, and then format it for potential reuse, it was decided that, in the absence of clear policy and with the permission of the dean of the University Library, the data set could be released through the institutional repository.
Further, in order to facilitate as much potential reuse of the data set as possible and with no perceived risk or income loss to the authors or the university by commercial use, the authors opted for a Creative Commons Attribution 4.0 International License.

Analyzing
Data set constraints and limitations had to be identified before analysis could occur.
1. Prior to developing and implementing an electronic data-capture solution, question type and questioner could not be resolved to a one-to-one relationship. Historically, the information services metrics had tracked patrons individually and independently of each separate question. For example, if a student asked a directional and an in-depth reference question, two separate questions would be marked, but patron details would only be captured once.
2. Without requiring explicit patron type identification, the default entry was "UIC Student," potentially introducing significant skew toward that category.
3. Similarly, when a patron was recorded without a correlating question type, the default entry was "Ready Reference," introducing skew to that category as well.
4. There were numerous gaps in the data.
5. Human error is believed to have introduced omissions and inaccuracies in actual patron counts.
6. Indirect staffing challenges and changes, including reference desk closures, were not tracked.
Acknowledging these issues, the authors considered tools for exploring the data that included SPSS [29], STATA [30], and R [31]. Considerations included cost, size and data set complexity, and learning curve. Analysis was finally undertaken with the most easily available tool, Microsoft Excel 2007, in conjunction with the charting functions and general accessibility of Google Spreadsheets.
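The authors carried out their analysis in Excel and Google Spreadsheets; the sketch below shows, in Python and purely as an illustration, how the same kind of tally could surface the default-entry skew described in the limitations. The function name, the record layout, and the sample rows are all hypothetical and are not drawn from the published data set.

```python
from collections import Counter


def summarize(records, column, default_value):
    """Count values in one column and report the share held by the default.

    A large share for the default entry suggests the category may be
    inflated by records where no explicit value was chosen.
    """
    counts = Counter(record[column] for record in records)
    total = sum(counts.values())
    default_share = counts[default_value] / total if total else 0.0
    return counts, default_share


if __name__ == "__main__":
    # Hypothetical rows for illustration only.
    sample = [
        {"patron type": "UIC Student", "question type": "Directional"},
        {"patron type": "UIC Student", "question type": "Ready Reference"},
        {"patron type": "UIC Faculty", "question type": "Directional"},
        {"patron type": "UIC Student", "question type": "Ready Reference"},
    ]
    counts, share = summarize(sample, "question type", "Ready Reference")
    for question_type, count in counts.most_common():
        print(f"{question_type}: {count}")
    print(f"Share recorded as the default entry: {share:.0%}")
```

Reporting the default share alongside the raw counts keeps the known skew visible whenever the frequencies are used to support a decision.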

OUTCOMES
The data set from this case study was used to directly argue for hiring additional student employees. Patron frequency and question type were analyzed, identifying a clear trend of many directional questions with few in-depth reference questions. From this analysis, the information services department petitioned for increased student employee hours beyond the single student employee position that had existed, with the rationale that the repurposed time would enable the faculty to better meet demand for consultations and classes. This review led to an increase in the student employee budget and the addition of more student employee positions to the department.

DISCUSSION
This data set enabled the authors to achieve three goals: redistributing faculty workload, obtaining practical experience with data through each life cycle stage, and informing future data collection practices.
On a more fundamental level, this exercise changed the authors' general approach to working with data. In addition to the data tools specifically mentioned, it provided the opportunity to explore Event Ontology [32], GitHub [33], DataDryad [34], figshare [35], and OpenDepot [36]. The data set was evaluated against its limitations to determine if there was value in preservation for further analysis. Having completed this data life cycle review, the ongoing data capture of desk metrics was converted from paper to a standardized Google form, which remedied many limitations. Finally, the authors gained familiarity with data collection and standardization challenges as well as facility with commonly used tools and techniques.
This case study focused on creation of a data-management plan, the metadata standardization challenges, and exploration of data storage and sharing options. Future studies may incorporate other information services statistics, including reference desk metrics, and consider the complexities for librarians as they navigate multiple independent data sets. Discussion and clear policies will be critical to minimize all the potential issues that may occur as data sets are shared more broadly in the future.
As legislation and funding requirements emerge surrounding data management, sharing, and reuse, academic libraries are being called on to provide guidance to their institutions and training to both researchers and students. Librarians with a working knowledge of data skills are a resource not only for patrons with questions about their own data, but for their libraries as a whole. By understanding the data generated by the library itself, librarians have the opportunity to apply evidence-based librarianship and demonstrate library efficacy to their institutional administrations. This case study demonstrates that librarians can start with nonsensitive data (e.g., circulation, electronic resources, or facilities usage statistics; publication data; bibliometric analyses) and consider the same questions engaged in this case study, detailed in Table 1, as a means to gain valuable experience.

Table 1
Life cycle stages and identified questions

Identifying
  - What is the current audience for these data?
  - What potential future audiences exist for these data?
  - Is this an isolated data set or could it be combined with other sets?

Digitizing
  - Are the data in digital format?
  - If no, what would it take to digitize the data?
  - Are the data in a stable digital format that can be preserved?

Cleaning
  - How many people have touched or will touch the data?
  - What rules have been created to ensure consistent data standardization?
  - What tools am I using to standardize the data?

Describing
  - Is there a README.txt file outlining the project?
  - Is there a standard ontology applicable to this data set?
  - What information would others need to use the data?

Storing and preserving
  - What access is needed to work with the data now?
  - Who needs access now?
  - What are the best storage options for the future?
  - What is the intended duration of preservation?

Sharing
  - Are there any privacy concerns about these data?
  - Who is the owner of this data set?
  - What institutional policies apply to these data?
  - How can sharing rights be maximized?

Analyzing
  - What analysis tools are available?
  - What are the limitations of the data set?