The Dataset That Disappeared
A colleague emailed last month asking for the dataset behind a 2019 paper. The corresponding author had moved institutions, the original lab server had been wiped during a migration, and the only copy lived on a graduate student's external hard drive that no one could find. The paper still gets cited. The data does not exist.
This happens constantly. Research data management is the unglamorous work that prevents it - the planning, organizing, documenting, and depositing that turns a one-time analysis into a citable, reusable artifact. Most PhD students learn about RDM the wrong way: through a panicked email from their funder six months into a project, or through a journal that requires a deposited dataset before it will publish the paper.
This guide covers what you actually need to know. The FAIR data principles, how to write a data management plan that does not waste your time, which repositories to use for what, and the metadata work that decides whether your data gets reused or buried.
What Research Data Management Actually Means
Research data management (RDM) is the set of practices that keep your data findable, usable, and preserved across the life of a project and beyond. It covers everything from how you name files during data collection to where the final dataset lives ten years after publication.
A good RDM workflow answers four questions:
- Where does the data live during the project?
- Who is allowed to see it, edit it, and delete it?
- How will it survive the project, the postdoc, the lab move, and the institutional server migration?
- How will someone else find and reuse it five years from now?
Most researchers handle the first two questions adequately. The last two are where things fall apart, and they are exactly what funders and journals now require you to address explicitly.
The FAIR Data Principles, Decoded
The FAIR principles were published in Scientific Data in 2016 and have since become the default framework for research data. FAIR stands for Findable, Accessible, Interoperable, and Reusable. The principles apply to data and to metadata, and they are designed to work for both humans and machines.
Findable
Your data needs a globally unique, persistent identifier. In practice this means a DOI from a recognized repository. Sticking the dataset on Dropbox or your lab website does not count, because those URLs break. A DOI keeps resolving even after the lab moves, the website migrates, or you change institutions.
Findable also means the metadata is rich enough that someone searching can locate the dataset without already knowing it exists. Title, abstract, keywords, authors, methods, and variable definitions are the minimum.
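A quick way to verify the Findable part in practice is to confirm that a dataset's DOI actually resolves. A minimal sketch in Python, assuming the `requests` library is available; the resolver is the public doi.org service, and the example DOI is the FAIR principles paper itself:

```python
import requests

def doi_resolves(doi: str, timeout: int = 10) -> bool:
    """Return True if the DOI resolves through the public doi.org resolver."""
    url = f"https://doi.org/{doi}"
    # Some publisher sites reject HEAD requests, so fall back to GET.
    resp = requests.head(url, allow_redirects=True, timeout=timeout)
    if resp.status_code >= 400:
        resp = requests.get(url, allow_redirects=True, timeout=timeout, stream=True)
    return resp.status_code < 400

# Wilkinson et al. 2016, the paper that introduced FAIR:
print(doi_resolves("10.1038/sdata.2016.18"))
```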
Accessible
Once someone finds your data, they need to know how to retrieve it and under what conditions. Accessibility does not mean the data must be open. Some data, like patient records, must be access-controlled. The FAIR principle requires that the access conditions are clearly stated and that the metadata remains public even when the data itself is restricted.
Interoperable
Interoperability is about formats and vocabularies. CSV beats Excel for tabular data. Open formats beat proprietary ones. Standardized vocabularies (controlled terms from MeSH, OBO Foundry ontologies, or domain-specific schemas) beat ad-hoc labels. The test: can a researcher in another lab combine your data with theirs without manual cleanup?
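As a concrete sketch of the format side, here is one way to produce an open-format copy of a proprietary spreadsheet, assuming pandas is installed; the file and column names are hypothetical:

```python
import pandas as pd

# Hypothetical lab spreadsheet with inconsistent column labels.
df = pd.read_excel("field_measurements.xlsx")  # needs openpyxl for .xlsx

# Normalize headers to lowercase snake_case so other tools can join on them.
df.columns = (
    df.columns.str.strip()
              .str.lower()
              .str.replace(r"\s+", "_", regex=True)
)

# Write an open, tool-neutral copy alongside the original.
df.to_csv("field_measurements.csv", index=False)
```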
Reusable
Reusability is the hardest principle and the one most papers fail. It requires a clear license (CC-BY for data is a common choice), detailed provenance, methodological documentation, and version history. A dataset that someone else cannot interpret without emailing you is not reusable.
How to Write a Data Management Plan That Funders Actually Accept
Most major funders now require a data management plan (DMP) at the proposal stage. NIH has the Data Management and Sharing Policy. Horizon Europe makes a DMP a "living document" that updates throughout the project. UKRI, NSF, and Wellcome Trust have their own variants.
A good DMP covers six sections:
- Data description. What types of data the project will generate (raw, processed, derived), estimated volume, formats, and rate of generation.
- Documentation and metadata. Which metadata standard you will use. Common choices include Dublin Core for general data, DDI for social science, and DataCite for repository deposit.
- Storage and security during the project. Where the active data lives, who can access it, how it is backed up, and how personal or sensitive data is protected.
- Sharing and access. Where the final dataset will be deposited, when, under what license, and who is allowed to use it. Sensitive data needs a specific access-control mechanism (controlled access via dbGaP, EGA, or institutional review).
- Preservation. Which repository will host the data long-term. Funders commonly require retention of at least 10 years, and often longer for clinical or human-subjects data.
- Roles and responsibilities. Who in the team is the data steward, who handles deposit at project end, and who handles requests after the original PhD student leaves.
The single biggest mistake in DMPs is vagueness. "Data will be stored securely on institutional servers" is not a plan. "Raw sequencing reads will be stored on the [University] HPC /data/lab partition with daily snapshots, mirrored nightly to S3 for disaster recovery, with final processed reads deposited in the European Nucleotide Archive at project end under accession to be assigned" is a plan.
DMPTool (run by the California Digital Library) has free templates aligned to most major funders. Use them rather than starting from a blank page.
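It can also help to keep the DMP's key facts in a machine-readable file that is updated alongside the prose, which suits the "living document" model. A minimal sketch; every value below is a hypothetical placeholder, and if you need a real interchange format, the RDA DMP Common Standard defines a full schema:

```python
import json

# Hypothetical skeleton mirroring the six DMP sections above.
dmp = {
    "data_description": {
        "types": ["raw sequencing reads", "processed counts"],
        "estimated_volume_gb": 500,
        "formats": ["FASTQ", "CSV"],
    },
    "metadata_standard": "DataCite",
    "storage": {
        "active": "institutional HPC /data/lab partition, daily snapshots",
        "backup": "nightly mirror to object storage",
    },
    "sharing": {
        "repository": "European Nucleotide Archive",
        "license": "CC-BY",
        "embargo": None,
    },
    "preservation_years": 10,
    "roles": {"data_steward": "to be named", "deposit_owner": "PI"},
}

with open("dmp.json", "w") as fh:
    json.dump(dmp, fh, indent=2)
```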
Where to Deposit Your Research Data: Zenodo, Figshare, Dryad, OSF
Once a paper is accepted, you typically have weeks, not months, to deposit your data. Pick the repository before you start writing, not after.
Zenodo is the default for most researchers without a domain-specific repository. It is free, run by CERN with EU funding, gives every dataset a DOI, allows up to 50 GB per dataset (more on request), and has no charge for the depositor. Use it for general data, code, supplementary materials, and any dataset associated with a paper.
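Zenodo also exposes a REST API, so the deposit step can be scripted into the end of an analysis pipeline. A rough sketch based on the endpoints in Zenodo's developer documentation; the token, filename, and metadata are placeholders, and the API details are worth re-checking against the current docs before relying on them:

```python
import requests

TOKEN = "your-zenodo-api-token"  # placeholder; create one in Zenodo settings
params = {"access_token": TOKEN}

# 1. Create an empty deposition.
r = requests.post("https://zenodo.org/api/deposit/depositions",
                  params=params, json={})
r.raise_for_status()
deposition = r.json()

# 2. Upload the file into the deposition's file bucket.
bucket = deposition["links"]["bucket"]
with open("dataset.csv", "rb") as fh:
    requests.put(f"{bucket}/dataset.csv", data=fh, params=params).raise_for_status()

# 3. Attach minimal metadata, then publish to mint the DOI.
dep_url = f"https://zenodo.org/api/deposit/depositions/{deposition['id']}"
metadata = {"metadata": {
    "title": "Example dataset",
    "upload_type": "dataset",
    "description": "Placeholder description.",
    "creators": [{"name": "Doe, Jane", "affiliation": "Example University"}],
}}
requests.put(dep_url, params=params, json=metadata).raise_for_status()
requests.post(f"{dep_url}/actions/publish", params=params).raise_for_status()
```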
Figshare scores highly for FAIR compliance in independent repository assessments. Free for individual researchers up to 20 GB total, with institutional Figshare instances at many universities providing more space. Strong support for non-traditional outputs (figures, posters, presentations) alongside datasets.
Dryad is curated, which means a human reviews each submission for completeness and metadata quality. That curation costs $120 per submission up to 20 GB, plus $50 per additional 10 GB. Many top biology and medicine journals waive or pay this fee for their authors. The curation produces noticeably more reusable datasets than self-deposit elsewhere.
OSF (Open Science Framework) is a project-level platform, not just a repository. Use it when you want to host the entire project (preregistration, materials, data, analysis code) in one linked structure rather than depositing only the final dataset.
If your field has a specialized repository, use it. GenBank for sequences, PDB for protein structures, ICPSR for social science survey data, GEO for gene expression. Specialized repositories enforce field-specific metadata and dramatically increase discoverability for researchers in your area.
The Metadata Work That Makes Data Reusable
A dataset without metadata is a file with numbers in it. Metadata is what turns numbers into knowledge that another researcher can use without writing to you for clarification.
At minimum, every deposited dataset needs:
- Variable-level documentation. A data dictionary that lists every column or field, its type, units, allowed values, and meaning (a minimal example follows this list).
- Methodology. Enough description of how the data was collected for someone else to evaluate fitness for their question. This usually links back to the methods section of the associated paper.
- Version history. What changed between versions and why.
- License. Most generalist repositories default to CC0 or CC-BY for data. Pick one and stick with it.
- Provenance. What software and version was used to process the data, what processing steps were applied, and which raw inputs produced the deposited output.
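A data dictionary needs no special tooling; a plain CSV is enough. A minimal sketch of the first bullet above, with hypothetical variables:

```python
import csv

# One row per variable in the deposited dataset.
rows = [
    {"variable": "subject_id", "type": "string", "units": "",
     "allowed_values": "S001-S999", "description": "Anonymized subject code."},
    {"variable": "rt_ms", "type": "integer", "units": "milliseconds",
     "allowed_values": "150-3000", "description": "Response time per trial."},
    {"variable": "condition", "type": "category", "units": "",
     "allowed_values": "control|treatment", "description": "Experimental arm."},
]

with open("data_dictionary.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```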
Writing this metadata is tedious. The best time to write it is during data collection, not at project end. The worst time is six months after submission, when everyone has forgotten what the columns mean. Build a README.md or data_dictionary.csv into your project from day one and update it as variables change.
Common Research Data Management Mistakes
A few mistakes show up repeatedly in audits of deposited datasets:
- Submitting raw spreadsheets without a data dictionary. Reviewers cannot interpret column headers like `var_3a_corr`, and the dataset gets cited less.
- Using proprietary file formats as the only deposit. SPSS, SAS, and Stata files exclude users without licenses. Always include CSV or another open format alongside.
- Mixing identifying information into the deposited dataset. Re-identification risk is the most common reason a deposited dataset gets pulled. Run any human-subjects data through a privacy review before deposit (a rough automated screen is sketched after this list).
- Picking a repository the funder does not recognize. Some funders require specific repositories. Check the policy before you deposit.
- Treating the DMP as a one-time form to fill out. A DMP that does not get updated when the project changes is worse than no DMP, because it gives the appearance of planning without the substance.
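Automated screening is no substitute for a formal privacy review, but a crude first pass catches obvious leaks before a human ever looks. A sketch; the column-name heuristics and file name are assumptions, not a standard:

```python
import csv
import re

# Column names that often signal direct identifiers (heuristic, not exhaustive).
SUSPECT_NAMES = {"name", "first_name", "last_name", "email", "phone",
                 "address", "dob", "date_of_birth", "ssn", "ip_address"}
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def screen(path: str) -> set[str]:
    """Return warnings about columns that may contain identifiers."""
    warnings = set()
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        for col in reader.fieldnames or []:
            if col.strip().lower() in SUSPECT_NAMES:
                warnings.add(f"suspect column name: {col}")
        for row in reader:
            for col, value in row.items():
                if value and EMAIL_RE.search(value):
                    warnings.add(f"email-like value in column: {col}")
    return warnings

print(screen("dataset.csv"))  # hypothetical deposit candidate
```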
A Practical RDM Checklist for Every Project
This is the workflow we recommend for new PhD students setting up a research workspace for the first time:
- Before data collection, write a one-page DMP using your funder's template.
- Set up a structured directory: `/raw/`, `/processed/`, `/scripts/`, `/docs/`, `/manuscript/` (a setup sketch follows this checklist).
- Add a `README.md` to the project root describing the project, the data, and the responsible person.
- During data collection, maintain a data dictionary. Update it whenever a variable changes.
- Choose your target repository before paper submission, not after.
- Run a metadata completeness check: can someone outside the lab interpret every column?
- Strip identifying information and run a privacy review for human-subjects data.
- Deposit, get the DOI, cite the dataset in the paper's data availability statement.
- Update the DMP at project close with what actually happened, including any deviations from the original plan.
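The first three checklist items take minutes to script, which removes the excuse to skip them. A minimal setup sketch; the project name and README contents are placeholders:

```python
from pathlib import Path

PROJECT = Path("my_project")  # hypothetical project root
SUBDIRS = ["raw", "processed", "scripts", "docs", "manuscript"]

for name in SUBDIRS:
    (PROJECT / name).mkdir(parents=True, exist_ok=True)

# Seed the README once; never overwrite an existing one.
readme = PROJECT / "README.md"
if not readme.exists():
    readme.write_text(
        "# Project title\n\n"
        "## Data\nDescribe what each directory contains.\n\n"
        "## Responsible person\nName, email, and role.\n"
    )

# Start the data dictionary on day one, even if it is empty.
dd = PROJECT / "docs" / "data_dictionary.csv"
if not dd.exists():
    dd.write_text("variable,type,units,allowed_values,description\n")
```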
Where This Fits in Your Broader Research Workflow
Research data management connects to almost every other part of the publication workflow. Your literature review cites datasets, your manuscript needs a data availability statement, and your reproducibility claims rely on the deposited data being interpretable. Treating RDM as a separate compliance task is what makes it feel like overhead. Treating it as part of the same workflow as your writing and analysis is what makes it tractable.
The single highest-leverage habit for PhD students: write the README and data dictionary on the same day you collect the data. Everything else in this guide is downstream of that one practice.