Data Management
Data management describes the actions taken at each stage of the data lifecycle that define how data are collected, stored, secured, and disseminated. Data management best practices are set at the level of the discipline, principal investigator (PI), or project.
A fundamental understanding of data management helps when writing a Data Management Plan (DMP) and also helps ensure data accessibility and integrity. To learn more about writing a DMP, including templates and what federal agencies require, visit the Writing a Data Management Plan webpage.
Good Research Data Management (GRDM) is essential to rigorous and reproducible scientific research. Lapses in research data management (RDM) can lead to questionable data and cause lasting damage to a researcher’s reputation and their ability to receive federal funds, as well as an erosion of the public’s confidence in the scientific community.
Benefits of GRDM
Data management is a process that includes collecting, validating, storing, protecting, sharing, and processing data to enable accessibility and reliability of the data for its users. GRDM is important for several reasons:
- Maintains your data integrity – Properly documenting and managing your data increases the reproducibility of your work and, as a result, the validity of your results.
- Improves your research impact – Maintaining accessible and reliable data allows you to readily share your raw datasets, which can increase the relevance of your research. Re-using and re-purposing data can lead to unanticipated discoveries and can provide raw material for researchers with little funding.
- Saves you time – Planning ahead and confronting obstacles early prevents problems later and saves you both time and money.
- Supports long-term data preservation – Properly preserving your data in a data repository keeps it accessible and discoverable for years to come and safeguards your contribution to the research community.
- Allows you to meet funding/grant requirements – Most funding agencies, including the NIH (starting Jan 2023), require that you properly manage, document, and share your data.
- Allows you to satisfy requirements for journal publications – Today, many journals require that published articles be accompanied by the underlying raw research data.
References
- Briney KA, Coates H, Goben A (2020) Foundational Practices of Research Data Management. Research Ideas and Outcomes 6: e56508.
- Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
- Borer ET, Seabloom EW, Jones MB, Schildhauer M (2009) Some Simple Guidelines for Effective Data Management. Bull. Ecol. Soc. Am. 90(2): 205-214.
Data management begins with asking the right questions as to how data will be collected, stored, shared and organized. Below is a list of questions that can help you get started:
- What types of data are collected?
- How much data (file size) will be collected?
- How quickly will data accumulate?
- What are the likely file formats?
- How unique are the data and how often will backups be performed?
- Will the data be collected from a third-party source?
- What data tools are available?
- Are the data part of a collaboration that needs to be shared regularly and frequently?
- Who needs access to the data?
- How long will the data need to be kept?
- What are the data retention policies of the funder, the journal, and Columbia University?
- Who owns the data?
- What is the classification of the data and what security measures need to be put in place?
- Will the data be shared in a public database?
- What sort of problems have been encountered previously with managing data?
- What kind of DMP does the funder require?
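Several of the questions above (how much data will be collected, how quickly it accumulates, and how often it is backed up) can be turned into a rough storage estimate. The following is a minimal sketch under simple assumptions: data accumulate at a constant weekly rate and every backup is a full copy. The function name `estimate_storage_gb` and all example numbers are hypothetical and should be replaced with values from your own project.

```python
def estimate_storage_gb(gb_per_week: float, weeks_of_collection: int,
                        backup_copies: int = 2) -> float:
    """Rough total storage: one primary copy plus full backup copies."""
    if gb_per_week < 0 or weeks_of_collection < 0 or backup_copies < 0:
        raise ValueError("inputs must be non-negative")
    primary = gb_per_week * weeks_of_collection
    return primary * (1 + backup_copies)


if __name__ == "__main__":
    # Hypothetical example: an instrument producing 5 GB per week for 2 years.
    total = estimate_storage_gb(gb_per_week=5, weeks_of_collection=104,
                                backup_copies=2)
    print(f"Plan for roughly {total:,.0f} GB of storage")
```

An estimate like this is only a starting point for a conversation about storage; compression, incremental backups, and derived files will change the real figure.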
Data Management Resources
Tutorials and Guidelines
- The ReaDI Program has created several tutorials (below) and identified guidelines to aid in the management of data during the collection phase of research. The ReaDI Program is available for data management consulting and presentations (Columbia researchers only).
- Tutorial: Best Practices for Data Management when Using Instrumentation
  Description: Tips and best practices for collecting, saving and processing data collected from instruments
- Tutorial: Good Laboratory Notebook Practices
  Description: Tips and best practices for maintaining a laboratory notebook
- Tutorial: Guidelines on the Organization of Samples in a Laboratory
  Description: Tips on managing, identifying and preserving samples (non-clinical)
Columbia University Data Management Consulting Services
- Statistical Analysis Center Data Management Services are available to anybody at Columbia. The center can help with all aspects of data management, including administrative systems. Its services include:
- Case report form design
- Database design
- Database hosting
- Custom user interface design (web, desktop, telephone, etc.)
- Data system design (data for analysis, logistical data, personnel data, financial data, etc.)
- Report design
- Database querying and data set generation
- REDCap hosting and development
- Research Data Services, part of the Columbia University Libraries, is available to help with many aspects of the research data lifecycle, including research data management, finding data, recommendations for cleaning and understanding data, and mapping and visualizing your data.
- Irving Institute for Clinical and Translational Research offers a free one-hour consultation to discuss data management requirements, help design a data management plan with associated budget requirements, or provide guidelines for moving data into a properly formatted, secure environment.
- Resource: Data Management for Researchers by Kristin Briney
  Description: A comprehensive guide to everything scientists need to know about data management, this book is essential for researchers who need to learn how to organize, document and take care of their own data. (Text adapted from Amazon)
- Resource: Responsible Conduct of Research - Data Management Module from Office of Research Integrity
  Description: The Office of Research Integrity (ORI) oversees and directs Public Health Service (PHS) research integrity activities on behalf of the Secretary of Health and Human Services, with the exception of the regulatory research integrity activities of the Food and Drug Administration. These modules have been compiled from multiple institutions.
- Resource: Folder Hierarchy Best Practices for Digital Asset Management
  Description: When a folder hierarchy is shared between multiple people or departments (such as a shared file server), things often get messy because everyone thinks about organizing and finding files in different ways. This article takes an in-depth look at why folder hierarchies are important and provides best practices for folder organization. (Text from article)
- Resource: Research Data Management: A Primer by NISO
  Description: This primer covers the basics of research data management, with the goal of helping researchers and those that support them become better data stewards. (Adapted from text)
- Resource: The FAIR Guiding Principles for scientific data management and stewardship
  Description: A Comment published in Nature's Scientific Data on the FAIR Principles, which put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.
- Resource: DataONE Best Practices database
  Description: The DataONE Best Practices database provides individuals with recommendations on how to effectively work with their data through all stages of the data lifecycle.
- Resource: Guidelines for Research Data Integrity (GRDI) from Miller and Spiegel, Nature 2025
  Description: Guidelines to promote standardization, improve reproducibility, and underscore the critical role of data integrity in strengthening scientific research.
- File Naming Convention (University of Illinois)
- Guide to writing README.txt for metadata (Cornell University)
- Kushida CA, Nichols DA, Jadrnicek R, Miller R, Walsh JK, Griffin K. Strategies for de-identification and anonymization of electronic health record data for use in multicenter research studies. Med Care. 2012 Jul;50 Suppl(Suppl):S82-101.
- Meurers T, Bild R, Do KM, Prasser F. A scalable software solution for anonymizing high-dimensional biomedical data. Gigascience. 2021 Oct 4;10(10):giab068.
- El Emam K, Arbuckle L, Koru G, Eze B, Gaudette L, Neri E, Rose S, Howard J, Gluck J. De-identification methods for open health data: the case of the Heritage Health Prize claims dataset. J Med Internet Res. 2012 Feb 27;14(1):e33.
- The database of Genotypes and Phenotypes (dbGaP) - What do I need to know about protecting study participants' privacy, HIPAA, and subject de-identification for dbGaP data submissions?
- El Emam K, Dankar FK. Protecting privacy using k-anonymity. J Am Med Inform Assoc. 2008 Sep-Oct;15(5):627-37.
- 5 steps for removing identifiers from datasets (Johns Hopkins Sheridan Libraries)
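As a small illustration of the k-anonymity idea named in the references above: a dataset is commonly described as k-anonymous when every combination of quasi-identifier values is shared by at least k records. The sketch below only measures k for a dataset; it does not anonymize anything and is not a substitute for a formal de-identification review. The function name `k_anonymity`, the field names, and the records are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping, Sequence


def k_anonymity(records: Iterable[Mapping[str, object]],
                quasi_identifiers: Sequence[str]) -> int:
    """Return the size of the smallest group of records that share the same
    quasi-identifier values (0 for an empty dataset)."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values(), default=0)


# Hypothetical, already-generalized records (age band, 3-digit ZIP prefix).
records = [
    {"age_band": "30-39", "zip3": "100", "diagnosis": "A"},
    {"age_band": "30-39", "zip3": "100", "diagnosis": "B"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "A"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "C"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "B"},
]

k = k_anonymity(records, ["age_band", "zip3"])
print(f"Dataset is {k}-anonymous for the chosen quasi-identifiers")
```

Which columns count as quasi-identifiers is a judgment call that depends on the study and on what other data an outsider could link to; the choice here is for illustration only.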
- Tool: REDCap
  Description: A secure web application for building and managing online surveys and databases. Columbia data management consulting for REDCap (Columbia researchers only)
- Tool: LabArchives
  Description: A cloud-based Electronic Research Notebook which replaces traditional paper notebooks in professional research labs and higher education laboratory courses. LabArchives is available for free to any Columbia researcher with a valid UNI.
- Tool: Globus
  Description: Subscribers can move, share, publish and discover data via a single interface. Globus is available at no cost to Columbia researchers.
- Tool: OpenRefine (openrefine.org)
  Description: A tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.
- Tool: Software Tools Catalog (from DataONE.org)
  Description: The Software Tools database is the product of two NSF-funded Informatics Education Planning Workshops hosted by DataONE. The database provides a brief description of a wide range of tools that are recommended for use by scientists and students, as well as additional information and links to further resources. Users can access tools within the database by selecting keywords (under advanced search) or using free search.
- Tool: DataCite
  Description: DataCite is a leading global non-profit organisation that provides persistent identifiers (DOIs) for research data and other research outputs.