
Collect & Manage

Open Research Cycle: 1. Plan & Design → 2. Collect & Manage → 3. Analyze & Collaborate → 4. Preserve & Share
Build quality into your data from the start

  • Data Collection
  • Data Management
  • Ethics & Privacy
  • Quality Control

Checkpoint: Data quality validation meeting
Data Collection

Your data acquisition procedures must be documented in sufficient detail to allow replication by another researcher (see the LMU good practice guidelines). Establishing reproducible data collection processes strengthens the quality and consistency of both your own data and your team's.

State-of-the-art practices for reproducible data acquisition include:

  • creating standard operating procedures
  • recording metadata as data collection is taking place
  • building in automation through programming

Metadata are data about data; they provide context to your data. Metadata such as equipment settings, environmental conditions, software versions, and calibration records should be recorded contemporaneously, not reconstructed afterward. Electronic lab notebooks, instrument logs, and automated logging all help to document your metadata.
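
As a minimal sketch of contemporaneous capture, assuming an R-based collection session, the snippet below writes a timestamped record of the operator, R version, and loaded packages next to the data; the file names and fields are illustrative, not a standard:

```r
# Sketch: record session metadata at collection time, not afterward.
record_session_metadata <- function(out_dir = "metadata") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  stamp <- format(Sys.time(), "%Y%m%dT%H%M%S")
  meta <- c(
    paste("timestamp:", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z")),
    paste("operator:", Sys.getenv("USER", unset = Sys.getenv("USERNAME"))),
    paste("R version:", R.version.string),
    "",
    "sessionInfo():",
    capture.output(sessionInfo())   # loaded packages and their versions
  )
  writeLines(meta, file.path(out_dir, paste0(stamp, "_session-metadata.txt")))
}

record_session_metadata()  # call once per collection session
```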

  • Lab Protocols
  • Questionnaires
  • Software-based data acquisition

Your protocol should specify materials with identifying details (lot numbers, versions, sources), equipment settings, step-by-step instructions with timing, and expected outcomes at each stage. What counts as “materials” varies by field: reagent concentrations in wet lab work, scanner parameters in neuroimaging, sampling coordinates in field ecology. But the principle is the same: enough detail that someone else could replicate your procedure exactly.


  • Track deviations in real time. Write your protocol before you start, follow it precisely, and record any deviations as they happen. When you need to adapt, note it immediately. These deviations often explain unexpected results and guide protocol improvements. Electronic lab notebooks (ELNs) make this easier by creating version-controlled, timestamped records automatically, providing an audit trail that paper cannot match.

  • Publish your protocols. A detailed, tested protocol is a contribution to your field. Publishing establishes priority, enables citation, and makes your methods reusable. Platforms like protocols.io provide version control and DOI assignment.

protocols.io

Share, discover, cite, and improve research protocols.

Document both the instrument and the administration procedure completely. Use validated instruments when possible, pilot test before deployment, and archive the exact version participants see in an open format.

When data comes from instruments, sensors, or APIs, scripting the acquisition creates a reproducible record of exactly what was collected and how. Programming languages like R or Python work well for straightforward pipelines. For more complex multi-step workflows, common in fields such as bioinformatics and neuroimaging, workflow managers like Snakemake or Nextflow ensure steps run in the correct order and can resume after failures.
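
A minimal sketch of a scripted, logged pull in R; the endpoint URL, paths, and log format are placeholders to adapt to your instrument or service:

```r
# Sketch: each acquisition attempt is logged, success or failure.
acquire <- function(url, out_file) {
  dir.create(dirname(out_file), showWarnings = FALSE, recursive = TRUE)
  stamp  <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
  status <- tryCatch({
    download.file(url, destfile = out_file, mode = "wb", quiet = TRUE)
    "ok"
  }, error = function(e) paste("FAILED:", conditionMessage(e)))
  # Append one line per attempt so the acquisition history is reproducible
  cat(sprintf("%s\t%s\t%s\t%s\n", stamp, url, out_file, status),
      file = "acquisition_log.tsv", append = TRUE)
  invisible(status)
}

acquire("https://example.org/api/records.csv",   # placeholder endpoint
        "data/raw/20240315_records.csv")
```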


  • Structure data correctly from the start. Variables in columns, observations in rows. This makes your data immediately interoperable with analysis tools rather than requiring cleanup later. Scripts can also automate organization, file renaming, and conversion to open formats. See Data Management for guidelines; a minimal reshaping sketch follows this list.

  • Keep records of what ran and when. Include error handling so failures are recorded rather than silently corrupting data. When something fails months later, you need to know what happened. Always test acquisition scripts on sample data before production runs. A bug in your collection pipeline can invalidate an entire dataset.

  • Version control your code and data. This makes your methods reproducible and shareable. See Version Control tab in Data Management for details.
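
A minimal reshaping sketch in R, assuming a wide instrument export with one column per timepoint; the data and column names are illustrative:

```r
# Sketch: reshape a wide export into tidy form
# (variables in columns, observations in rows).
library(tidyr)

wide <- data.frame(
  participant = c("P01", "P02"),
  t1 = c(5.1, 4.8),   # score at timepoint 1
  t2 = c(5.4, 5.0)    # score at timepoint 2
)

tidy <- pivot_longer(wide, cols = c(t1, t2),
                     names_to = "timepoint", values_to = "score")
tidy   # one row per participant x timepoint
```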

LMU OSC

Introduction to R

Programming fundamentals for research data processing

Data Management

In Plan & Design you created a Research Data Management Plan. It’s now time to put this plan into practice, refining it as you learn what actually works for your project.

  • Keep your raw data as read-only files. The unmodified output of your instruments, surveys, or observations should never be modified directly. You should provide enough documentation to recreate your processed datasets and results from your raw data (see Reproducible Data Processing & Analyses).
  • Organize, document, and store your research files so they remain usable and become FAIR upon sharing. Beyond raw data, you will generate processed data, code, documentation, and metadata. How you organize, describe, and store these determines whether your work remains usable and reproducible. The FAIR principles guide these decisions: making outputs Findable, Accessible, Interoperable, and Reusable.

What follows are general practices. Your domain has specific conventions for file formats, folder structure, and metadata. RDMkit provides detailed guidance organized by research area.

  • Storage
  • Organization
  • File Formats
  • Documentation
  • Standards
  • Version Control

This section covers storage for data you are actively collecting. Long-term archiving for sharing is covered in 4. Preserve & Share.


  • Use institutional storage. Your institution provides storage with automated backups, access controls, and GDPR compliance. The specific options vary by department. Contact the Research Data Management team of the University Library to find what is available to you. When choosing, consider how much data you will generate, who needs access, and whether your data includes personal information requiring stricter controls.
  • Follow the 3-2-1 backup rule. Keep three copies on two media types with one off-site. Designate one location as the master copy, the authoritative version everything else syncs from. Working with multiple “equal” copies creates version conflicts. Remember that syncing is not backup: if you delete a file from a synced folder, the deletion propagates everywhere. True backups preserve previous versions independently.
  • Control access from the start. Grant access only to those who need it. Use institutional sharing tools, not email attachments or personal cloud links. For collaborations, agree at the start who can read, who can edit, and who manages permissions. When team members leave, remove their access promptly.
  • Test your backups. A backup you cannot restore is not a backup. Test restoration at least once. Archive inactive data periodically and review access lists when team composition changes.
Important: Avoid for Research Data

Personal laptops as primary storage, external drives as only copy, consumer cloud services (Dropbox, Google Drive) for sensitive data, and USB drives except for temporary transport.

Your folder structure and file naming conventions determine whether you and others can navigate your project months or years later. Establish these conventions at the start of your project and document them. When collaborating, ensure everyone follows the same system.


  • Separate raw from processed data. Raw data should not be modified: once collected, these files should never be touched. All cleaning, data processing, transformations, and analyses happen on copies or through scripts in a separate folder. This preserves your ability to verify results or reprocess from the original source.

  • Develop a file naming convention. Good file names identify contents at a glance and sort correctly. Balance specificity with readability: too many elements make names unwieldy, too few make them ambiguous. Order elements from general to specific.

    • Use underscores or hyphens to separate elements, never spaces or special characters (? ! & * % # @)
    • Use ISO 8601 dates (YYYYMMDD) so files sort chronologically
    • Include version numbers with leading zeros (v01, v02) so v10 sorts after v09
    • Use meaningful abbreviations and document what they mean

A pattern like YYYYMMDD_project_condition_type_v01.ext places files in chronological order while preserving context. For example, 20240315_sleep-study_control_survey_v02.csv immediately tells you when it was created, which project it belongs to, the experimental condition, data type, and revision. Document your convention in a README file stored next to your data files so collaborators can parse filenames without asking.
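
A small helper, sketched in R, can enforce such a convention; the elements and example values are illustrative:

```r
# Sketch: build file names following the convention described above.
make_filename <- function(project, condition, type, version,
                          ext = "csv", date = Sys.Date()) {
  sprintf("%s_%s_%s_%s_v%02d.%s",             # v%02d gives leading zeros
          format(date, "%Y%m%d"), project, condition, type, version, ext)
}

make_filename("sleep-study", "control", "survey", 2,
              date = as.Date("2024-03-15"))
#> "20240315_sleep-study_control_survey_v02.csv"
```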


Follow domain standards where they exist. Many fields have established organizational conventions that tools and collaborators expect. Using these means your data works immediately with existing analysis pipelines and reviewers recognize the structure. Search RDMkit for standards in your domain.

GitHub

Research Project Template

Use this template to start projects with a consistent organization that separates raw data, processed data, code, and outputs.

LMU OSC

Data Organization

Folder structure and naming conventions

RDMkit

Domain-specific data management standards

File format choices affect who can work with your data now and whether it remains readable in the future. Open formats have publicly documented specifications that anyone can implement, so many programs can read them and they remain accessible even if the original software disappears. Proprietary formats lock you into specific tools, complicate collaboration, and risk becoming unreadable if the company stops supporting them.


  • Keep raw data in its original format. Whatever your instrument or source produces, preserve that original as your ground truth. Even if it is proprietary, you need it for verification and potential reprocessing.

  • Work in open formats. For analysis, convert to open formats like CSV, JSON, or plain text. This makes your workflow reproducible, enables collaboration across different tools, and ensures your data can be shared. If conversion loses important information (metadata, precision, structure), document what is lost and keep both versions.

  • Be careful with spreadsheets. Excel is convenient for data entry but causes real problems. It silently converts data: gene names like MARCH1 become dates, leading zeros in IDs disappear, and long numbers lose precision. Formatting (colors, merged cells) breaks machine-readability since scripts cannot see it. If you use spreadsheets for entry, keep them simple (one header row, one observation per row, no merged cells) and export to CSV immediately. Save CSVs with UTF-8 encoding to avoid character corruption when sharing across systems. For more guidance on spreadsheet best practices, see The Turing Way and UC Davis DataLab; a reading sketch that guards against silent conversions follows this list.

  • Check domain recommendations. Your field likely has established conventions balancing openness with practical needs like performance or metadata preservation. Consult RDMkit to find conventions for your field.
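
A reading sketch in R using readr, with illustrative file and column names: declaring column types and encoding up front makes mismatches surface as reported problems rather than silently corrupted values:

```r
# Sketch: read a CSV with explicit types so silent conversions
# (dates, dropped leading zeros) are caught, not hidden.
library(readr)

d <- read_csv(
  "data/raw/20240315_sleep-study_control_survey_v02.csv",  # example file
  col_types = cols(
    participant_id = col_character(),  # always text: keeps leading zeros
    gene           = col_character(),  # always text: never guessed as a date
    score          = col_double()
  ),
  locale = locale(encoding = "UTF-8")
)
problems(d)  # values that did not match the declared types
```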


Format issues often surface during quality control. The Quality Control panel below covers validation checks that can catch encoding problems, unexpected conversions, and structural inconsistencies early.

RDMkit

Find recommended file formats and conventions for your research field.

Without documentation, a dataset is just a collection of files. Six months from now, you will not remember what each column means, why certain values are missing, or how files relate to each other. Documentation makes your data usable by your future self, your collaborators, and anyone who might reuse it.


  • Create a README.md or README.txt file early and update it as you go. Your README is the entry point to your project. Start it when you begin, not when preparing to publish. A good README answers the essential questions: who created the data, what it contains, when and where it was collected, why it was generated, how it was produced, and whether it can be reused. These answers let someone unfamiliar with your project understand and work with your data.
  • Create a data dictionary defining every variable. A data dictionary (or codebook) makes your dataset self-explanatory. For each variable, document what it measures, its data type, valid values, units of measurement, and how missing data is coded. Use appropriate missing codes to distinguish why data is absent (declined to answer, not applicable, technical failure) since this distinction matters for analysis.
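
A minimal sketch in R of a machine-readable data dictionary stored next to the data; the variables, codes, and file path are illustrative:

```r
# Sketch: one row per variable, saved as plain CSV beside the data.
dictionary <- data.frame(
  variable     = c("participant_id", "age", "score"),
  description  = c("Pseudonymous participant code", "Age in years",
                   "Questionnaire sum score"),
  type         = c("character", "integer", "numeric"),
  valid_values = c("P001-P999", "18-99", "0-40"),
  units        = c(NA, "years", "points"),
  missing_code = c(NA, "-999 = declined", "-998 = technical failure")
)
write.csv(dictionary, "data/data_dictionary.csv", row.names = FALSE)
```
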
GitHub

Research Project Template

Project template with README and data dictionaries included

LMU OSC

Data Documentation & Validation

Create effective READMEs, data dictionaries, and validation checks for your research data.

LMU OSC

Data Documentation

Principles of README files and data dictionaries.

Standards are community agreements on how to organize and describe research data. Using them means others in your field immediately understand your data. Three types of standards matter here:


  • Organizational standards specify how to structure files and folders. Some fields have well-established conventions, like BIDS for neuroimaging data. When such standards exist, use them. Your data will work immediately with existing tools, and collaborators will recognize the structure without explanation. If no standard exists for your domain, create a consistent structure and document it in your README.

  • Reporting guidelines specify what methodological details to document for different study types. The EQUATOR Network maintains a searchable database of guidelines for clinical trials, observational studies, animal research, and many other study types. Following these ensures you capture everything others need to understand or replicate your work.

  • Metadata standards define what descriptive information to record and how to structure it. Scientific metadata describes how your data was produced: equipment specifications, acquisition parameters, protocols followed. This is distinct from discovery metadata (titles, keywords, descriptions) which you will prepare when sharing in 4. Preserve & Share. Your field has conventions for which parameters matter. FAIRsharing catalogs metadata standards by discipline.


Think of your data as a first-class research output. Comprehensive metadata transforms a project artifact into a reusable resource. Someone reanalyzing your data years later needs to understand exactly how it was produced.

FAIRsharing

Search by discipline to find metadata standards, reporting guidelines, and data policies for your field.

RDMkit

Find metadata standards and requirements specific to your research field.

Version control tracks changes to files over time. You can see what changed, when, and why. You can revert to previous versions. Collaborators can work without overwriting each other.


  • Git usually suffices for data files. Text-based formats (CSV, JSON, plain text) and smaller binary files work well in standard Git repositories. You get a complete history of changes and can share easily via GitHub or GitLab.

  • Use specialized tools for large or frequently changing binary files. Standard Git stores each version in full, so repositories become unwieldy with large datasets. Git LFS (Large File Storage) stores large files separately while keeping them tracked. Git-annex manages files across multiple storage locations. DataLad builds on git-annex and works with standard Git workflows.

Important

Difference between Git, GitHub and GitLab

  • Git is a version control system that tracks changes in text files (e.g. CSV, plain text, R, Python). Install the Git software, and keep your git repositories, in your local environment (i.e. on your computer, not on a network drive; see Git tutorial).
  • GitHub is the most popular cloud-based platform for software development with Git, providing collaboration features like pull requests and issues, but it is proprietary and US-based. Never put sensitive data on GitHub, even in a private repository.
  • LRZ GitLab is a cloud-based hosting platform that works much like GitHub but is free and open source. It runs on the LRZ servers for LMU Munich and can therefore be considered secure when the repository is private.

While your LRZ GitLab account is associated with your LMU Munich affiliation, your GitHub account can be associated with your private email, be included in your CV, and be used for public sharing of your data and code (see 3. Analyze & Collaborate).

In a version-controlled workflow, you back up your local git repositories to either GitHub or LRZ GitLab over a secure SSH connection (see GitHub tutorial) and share access to your repositories with collaborators through the same platform.

LMU OSC

Introduction to Git

Learn Git basics integrated with RStudio (2h)

DataLad

Version control for large datasets

Ethics & Privacy

Research involving human participants requires ethics approval and data protection compliance. These are not bureaucratic hurdles but safeguards for the people contributing to your research. Address them before collecting any data.


At LMU Munich, submit ethics applications to the Ethics Committee and allow 4-8 weeks for review. For data protection guidance, contact the LMU Data Protection Officer or the Research Data Management team of the University Library.

  • Informed Consent
  • Anonymization
  • GDPR Compliance

Participants have the right to understand what they are agreeing to. Your consent form should explain the research purpose in plain language, describe what data you will collect and how you will protect it, specify who will have access and for how long, and make clear that participation is voluntary.


  • Use tiered consent when you plan to share data. Some participants may consent to their data being used for your study but not shared publicly. Others may be comfortable with broader sharing. Giving options respects autonomy while maximizing what you can eventually share.

  • Store consent forms separately from data. The consent form links a name to participation. Keeping it with your data undermines any pseudonymization you apply.

Anonymization protects privacy and determines what you can share.


  • Remove direct identifiers during collection. Names, addresses, ID numbers, photographs, email addresses. Replace these with codes.

  • Assess indirect identifiers carefully. A combination of age, location, profession, and a rare condition might identify someone even without their name. Timestamps reveal patterns. Free-text responses often contain identifying details participants did not intend to share. A quick way to screen such combinations is sketched below.
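
One quick screen, sketched with dplyr and assuming a data frame d containing the illustrative quasi-identifier columns age_band, location, and profession: count how many participants share each combination. The threshold of 5 is a common rule of thumb, not a guarantee of anonymity:

```r
# Sketch: flag quasi-identifier combinations shared by very few people.
library(dplyr)

risky <- d |>
  count(age_band, location, profession, name = "n_in_cell") |>
  filter(n_in_cell < 5)   # small cells are re-identification risks

risky   # review these combinations before any sharing decision
```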


The distinction between pseudonymization and anonymization matters for GDPR compliance.


  • Pseudonymization replaces identifiers with codes while retaining a key that links back to individuals. Pseudonymized data is still personal data under GDPR because re-identification is possible.

  • Anonymization removes all possibility of re-identification. Only truly anonymized data falls outside GDPR scope. Achieving this is harder than it appears, especially with rich datasets.
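
A minimal pseudonymization sketch in R, assuming a data frame d with a direct-identifier column name; the paths and code format are illustrative, and the key file must live in a separate, access-controlled location:

```r
# Sketch: replace names with codes; store the linkage key separately.
set.seed(42)                                        # only to make the sketch reproducible
ids   <- unique(d$name)
codes <- sprintf("P%03d", sample(seq_along(ids)))   # random order, not alphabetical
key   <- data.frame(name = ids, code = codes)

d$code <- key$code[match(d$name, key$name)]
d$name <- NULL                                      # drop the direct identifier

write.csv(d,   "data/pseudonymized.csv",     row.names = FALSE)
write.csv(key, "restricted/linkage_key.csv", row.names = FALSE)  # separate location
```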

Research at LMU Munich must comply with EU data protection regulations. The core principles: have a lawful basis for processing personal data (usually consent or legitimate research interest), use data only for stated purposes, collect only what you need, delete data when you no longer need it, and protect it against unauthorized access.


In practice: document your lawful basis, include data protection language in consent forms, use institutional storage rather than personal cloud services, restrict access to those who need it, and plan when and how you will delete data.

Quality Control

Quality control catches problems before they propagate into your analysis. The practices here ensure your data is trustworthy and your exclusions are defensible.

Define criteria before looking at your data. This prevents unconscious bias in what you keep and exclude, and demonstrates that your decisions are principled rather than convenient (see Study Design & Analysis Plan).

  • Validation
  • Cleaning
  • Exclusions

Validation checks whether your data meets specifications. Run checks during collection to catch problems immediately, after collection for systematic review, and after any processing to verify transformations worked correctly.


  • Automate what you can. Check that data types are correct, values fall within expected ranges, required fields are populated, and formats are consistent. These checks should run automatically and flag problems for review (a minimal sketch follows this list).

  • Manual review catches what automation misses. Sample your data and verify it against the source. Inspect outliers to determine whether they are errors or genuine extreme values. Look for suspicious patterns: survey responses that alternate predictably, reaction times that are impossibly fast.
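
A minimal sketch of automated checks in base R; the rules and ranges are illustrative and should come from your data dictionary:

```r
# Sketch: run named checks and flag any failures for review.
validate <- function(d) {
  checks <- c(
    "participant_id present"  = !anyNA(d$participant_id),
    "no duplicate IDs"        = !anyDuplicated(d$participant_id),
    "age in plausible range"  = all(d$age >= 18 & d$age <= 99, na.rm = TRUE),
    "score within instrument" = all(d$score >= 0 & d$score <= 40, na.rm = TRUE)
  )
  if (!all(checks)) warning("Failed checks: ",
                            paste(names(checks)[!checks], collapse = "; "))
  checks
}

validate(d)  # run during collection, after collection, and after processing
```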

LMU OSC

Data Documentation & Validation

Learn to create validation rules and automated checks for your research data.

Data cleaning handles errors, inconsistencies, and missing values. The cardinal rule: never modify your raw data. All cleaning happens on copies.


  • Correct unambiguous errors. Clear typos, obvious data entry mistakes. For ambiguous cases, flag them for review rather than making assumptions. Document your reasoning for every judgment call.

  • Handle missing data consistently. Decide on a coding scheme (NA, -999, blank) and apply it uniformly. When you know why data is missing, record that information. It may matter for analysis.

  • Investigate outliers before acting. An extreme value might be an error, or it might be genuine. Understand the cause before deciding whether to remove, transform, or retain it.

  • Write cleaning as a script. A script documents exactly what you did and lets you reproduce it. Keep a decision log for choices that cannot be automated.
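
A minimal cleaning-script sketch in R; the paths, recodes, and log entry are illustrative:

```r
# Sketch: raw data stay untouched; every change is reproducible from this file.
raw <- read.csv("data/raw/survey_v02.csv")        # read-only, never edited by hand

clean <- raw
clean$score[clean$score == -999] <- NA            # recode declared missing values
clean$group <- trimws(tolower(clean$group))       # harmonize inconsistent labels
clean$group[clean$group == "contrl"] <- "control" # unambiguous typo fix

# Decision log entry for a judgment call that cannot be automated:
# 2024-03-15: P017 score 410 flagged as probable extra digit; left unchanged
# pending review (see lab notebook).

write.csv(clean, "data/processed/survey_v02_clean.csv", row.names = FALSE)
```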

Exclusion criteria specify which data points will be removed from analysis and why. Define these before you see your results (see Study Design & Analysis Plan).


  • Common exclusion criteria: technical failures (equipment malfunction, incomplete recording), protocol violations (wrong procedure followed, participant did not comply), quality thresholds (too much missing data, failed attention checks), and participant criteria (did not meet stated inclusion criteria).

  • Document everything. Record criteria before analysis begins. Report how many data points were excluded for each criterion. Plan sensitivity analyses comparing results with and without exclusions to show your findings are robust.
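
A minimal sketch in R of applying pre-defined criteria and reporting counts per criterion; the attention_check column and the 20% missingness threshold are illustrative:

```r
# Sketch: exclusions are computed, counted, and reported, never done by hand.
excl_attention <- clean$attention_check %in% 0    # failed attention check (NA-safe)
excl_missing   <- rowMeans(is.na(clean)) > 0.20   # more than 20% missing data

cat(sprintf("Excluded: %d failed attention check, %d excessive missingness\n",
            sum(excl_attention), sum(excl_missing)))

analysis <- clean[!(excl_attention | excl_missing), ]
```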

