Collect & Manage
Organize Your Research Workflow
Data collection and management aren’t just administrative tasks—they’re the foundation of reproducible research. This phase covers practices for capturing, organizing, documenting, and maintaining research outputs throughout your project.
What you’ll find here: Strategies for FAIR data management, tools for version control and documentation, and approaches to organizing research materials effectively.
- FAIR Data Management (2h) - Managing data following FAIR principles
- Data Documentation & Validation in R (2h) - Create codebooks, validate data, and ensure quality
- Data Sharing (1h) - Lecture on sharing research data
- Report Detailed Methods & Protocols (30 min) - ReproducibiliTeach lecture
- Write Reusable Protocols (30 min) - ReproducibiliTeach lecture
- RDMkit - Discipline-specific data management guidance
Core Data Management Activities
Effective data management involves several interconnected practices.
Organization & Storage
Common practices:
- Consistent file naming conventions
- Logical directory structures
- Regular backups (3-2-1 rule)
- Secure storage solutions
Documentation
Researchers often maintain:
- Electronic lab notebooks
- Data dictionaries and codebooks
- README files for datasets
- Metadata standards
Version Control
Essential for tracking:
- Code and script changes
- Document revisions
- Collaborative workflows
- Analysis reproducibility
Quality & Security
Maintaining integrity through:
- Quality control procedures
- Data validation checks
- Access controls
- Anonymization protocols
Data Management Approaches
Different aspects of data management require different tools and strategies.
FAIR Data Management (2h) - Learn how to organize, document, and manage research data following FAIR principles.
Applying FAIR principles ensures your data is useful beyond your immediate project.
The FAIR Framework:
Findable
- Persistent identifiers (DOIs)
- Rich metadata descriptions
- Searchable repositories
Accessible
- Standard retrieval protocols
- Clear access procedures
- Metadata remain accessible even when the data are no longer available
Interoperable
- Standard data formats
- Controlled vocabularies
- Common terminologies
Reusable
- Rich documentation
- Clear usage licenses
- Provenance information
Naming Conventions
Consistent file naming makes data findable and understandable.
Best practices:
- Use descriptive, meaningful names
- Include dates (YYYY-MM-DD format)
- Avoid spaces (use underscores or hyphens)
- Use version numbers (v01, v02)
- Keep names short but informative
Example:
2025-01-15_experiment-A_participant-001_v02.csv
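File names following this pattern can be generated programmatically, which keeps the convention consistent across a project. A minimal sketch in base R (the experiment label, participant number, and version are hypothetical):

```r
# Assemble a standardized file name: date_experiment_participant_version.ext
date_str  <- format(Sys.Date(), "%Y-%m-%d")   # ISO 8601 date, sorts chronologically
file_name <- sprintf("%s_experiment-A_participant-%03d_v%02d.csv",
                     date_str, 1, 2)
file_name
#> [1] "2025-01-15_experiment-A_participant-001_v02.csv"  (date varies by day)
```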
Directory Structure & Backups
Organization patterns:
- By data type (raw/, processed/, analyzed/)
- By date or session (2025-01/, 2025-02/)
- By subject or condition (control/, treatment/)
3-2-1 Backup Rule:
- 3 copies of your data (original + 2 backups)
- 2 different media types (local drive + cloud)
- 1 copy stored off-site (cloud or separate location)
Never rely on a single device for your only copy of research data.
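A directory skeleton can be created once at project start so the structure stays consistent. A minimal sketch in R, assuming the type-based layout above:

```r
# Create a type-based project skeleton; showWarnings = FALSE makes it idempotent
dirs <- c("data/raw", "data/processed", "data/analyzed", "docs", "scripts")
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
```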
README Files
README files are essential for understanding your data and code.
What to include:
- Project title and description
- Authors and date of collection
- File organization explanation
- Data collection methods
- Variable definitions
- Known issues or limitations
Write README files for your future self—you’ll return months or years later.
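One way to make sure a README exists is to generate a skeleton when the project is set up. A minimal sketch in R; the placeholder headings mirror the list above:

```r
# Write a README skeleton for the project (fill in the sections by hand)
writeLines(c(
  "# Project title and description",
  "",
  "## Authors and date of collection",
  "## File organization",
  "## Data collection methods",
  "## Variable definitions",
  "## Known issues or limitations"
), "README.md")
```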
Data Dictionaries
Data dictionaries define all variables in your dataset.
Essential elements:
- Variable name (as it appears in data)
- Full variable label/description
- Data type (numeric, string, date)
- Allowed values or ranges
- Units of measurement
- Missing data codes
Example:
Variable: age_years
Description: Participant age at data collection
Type: Numeric
Range: 18-65
Units: Years
Missing: -99
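A data dictionary can be kept as a plain table that travels with the data. A minimal sketch in R, using the example entry above (all values are illustrative):

```r
# One row per variable; a CSV stays readable without special software
dictionary <- data.frame(
  variable    = "age_years",
  description = "Participant age at data collection",
  type        = "numeric",
  range       = "18-65",
  units       = "years",
  missing     = "-99"
)
write.csv(dictionary, "data-dictionary.csv", row.names = FALSE)
```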
Data Documentation & Validation in R (2h) - Create codebooks automatically, validate data against expectations, and ensure data quality using R packages like codebook and pointblank.
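As a taste of what the course covers, here is a minimal validation sketch with pointblank; the data frame and column names are hypothetical, and real checks should mirror your data dictionary:

```r
library(pointblank)

# Hypothetical dataset matching the dictionary entry above
dat <- data.frame(participant_id = c("P001", "P002"),
                  age_years      = c(25, 41))

agent <- create_agent(tbl = dat) |>
  col_vals_not_null(columns = participant_id) |>   # completeness
  col_vals_between(columns = age_years,
                   left = 18, right = 65) |>       # allowed range
  rows_distinct(columns = participant_id) |>       # no duplicate IDs
  interrogate()

agent  # prints a validation report with pass/fail per check
```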
Version Control
Version control systems track changes to files over time, enabling collaboration and reproducibility.
Introduction to Git and GitHub (3h) - Learn version control fundamentals for research projects.
Born Open Data
Data that is uploaded automatically to a repository (e.g., GitHub) as it is collected, together with timestamps and automatically generated logs, is called born open.
Advantages:
- Full openness and transparency from the start
- Built-in data management through version control
- Simplified data sharing at publication
- Complete audit trail of changes
Resources:
- Rouder (2016) - The what, why, and how of born-open data
- Rouder, Haaf & Snyder (2018) - Minimizing Mistakes in Psychological Science
Born-open workflows work best with non-sensitive data. For human participant data, consider pseudonymization before automatic uploads.
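Rouder (2016) automates the upload with scheduled scripts. A minimal sketch of the same idea in R with the gert package, assuming a repository with a remote named origin and a configured Git identity; schedule it with cron or Task Scheduler:

```r
library(gert)

# Stage any new raw data files and push them with a timestamped message
new_files <- list.files("data/raw", full.names = TRUE)
git_add(new_files)
git_commit(paste("Automated data upload:", Sys.time()))  # errors if nothing changed
git_push(remote = "origin")
```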
Why use version control:
- Track all changes to files
- Revert to previous versions
- Collaborate without conflicts
- Maintain parallel versions
- Document why changes were made
- Backup all project history
- Share code reliably
- Enable reproducibility
Common workflows:
- Code and scripts: Track all analysis code
- Documentation: Version control for manuscripts, protocols
- Small data files: Track metadata, data dictionaries
- Configuration files: Manage software parameters
Large data files (>100MB) should not be stored directly in Git. Use Git LFS or data repositories instead.
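A typical first workflow, sketched in R with the gert package (file and project names are hypothetical; the command-line equivalents are git init, git add, git commit, and git log):

```r
library(gert)

repo <- git_init("my-analysis")          # create a new repository
setwd(repo)
writeLines("# data cleaning code", "clean_data.R")
git_add("clean_data.R")                  # stage the file
git_commit("Add data cleaning script")   # assumes user.name/user.email are configured
git_log(max = 5)                         # inspect recent history
```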
Quality Control
Quality control identifies errors and issues before analysis.
Common QC checks (sketched in code after this list):
- Completeness: Check for missing data
- Range checks: Verify values within expected ranges
- Consistency: Check logical relationships
- Duplicates: Identify duplicate records
- Format: Ensure consistent data types
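These checks take only a few lines of base R. A minimal sketch, assuming a data frame dat with the variables used earlier:

```r
# Quick QC pass over a hypothetical data frame `dat`
colSums(is.na(dat))                   # completeness: missing values per column
range(dat$age_years, na.rm = TRUE)    # range check: within 18-65?
sum(duplicated(dat$participant_id))   # duplicates: should be 0
str(dat)                              # format: consistent data types
```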
Document QC decisions:
- Procedures and criteria
- Cases flagged or excluded
- Reasons for exclusions
- Date and reviewer
Data Security & Anonymization
Access Control:
- Use permission levels (read-only, read-write, admin)
- Limit access to sensitive data
- Remove access when collaborators leave
- Use institutional authentication systems
Anonymization for human participant data:
Remove personally identifiable information (PII):
- Names, addresses, contact information
- Dates of birth, ages over 89
- Geographic identifiers smaller than state
- Social security numbers, medical records
- Facial features in photographs
Anonymization approaches:
- De-identification: Remove direct identifiers
- Pseudonymization: Replace with codes
- Aggregation: Report group-level data
- Perturbation: Add noise to prevent re-identification
Test anonymization on sample data first. Have a colleague review for remaining identifiers.
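As one concrete example, pseudonymization can be as simple as replacing a direct identifier with a code and storing the linkage key separately. A minimal sketch in R (column names are hypothetical; indirect identifiers still need review, e.g. with sdcMicro):

```r
# Replace a direct identifier with pseudonym codes
dat$pid  <- sprintf("P%03d", seq_len(nrow(dat)))  # P001, P002, ...
key      <- dat[, c("name", "pid")]               # linkage key: store securely, apart from the data
dat$name <- NULL                                  # drop the direct identifier
```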
Tools & Resources
Documentation & Lab Notebooks
Tools for documenting research processes and maintaining lab records:
eLabFTW
Open-source electronic lab notebook
Free, self-hosted ELN with timestamping, templates, and database features. Ideal for documenting experiments and protocols.
Quarto
Reproducible documents combining code and narrative
Tutorial available
Create data collection protocols, README files, and documentation. Supports R, Python, Julia. Renders to multiple formats including HTML, PDF, and Word.
Jupyter Notebooks
Interactive computational notebooks
Combine code, output, and narrative text. Supports Python, R, Julia. Useful for documenting data processing workflows.
Version Control Systems
Platforms for tracking changes and collaborating on code and documents:
LRZ GitLab
Institutional Git hosting for LMU
Supported at LMU
Private repositories hosted by LRZ. Use LMU credentials. Suitable for active research projects.
GitHub
Popular Git hosting with collaboration features
Tutorial available
Free public repositories, GitHub Actions for automation, extensive integrations.
OSF
Research project management platform
Combines version control, storage, and collaboration. Integrates with GitHub, Dropbox, and other services.
Data Storage & Backup
Services for secure storage, synchronization, and backup of research data:
LRZ Sync+Share
LMU cloud storage service
Supported at LMU
50GB+ storage per user. GDPR-compliant. Desktop and mobile sync clients available.
LRZ DSS
Long-term archival storage
Supported at LMU
Tape-based backup for large datasets. Contact LRZ for access and quotas.
re3data
Registry of data repositories
Search to find repositories suited to your data type and discipline.
Anonymization Tools
Tools to help protect participant privacy while enabling data sharing:
Amnesia
Web-based anonymization application
Developed by OpenAIRE. Provides k-anonymity and other privacy-preserving transformations through a browser interface.
ARX
Comprehensive data anonymization tool
Open-source software supporting k-anonymity, l-diversity, t-closeness, and differential privacy. Desktop application with GUI.
sdcMicro
R package for statistical disclosure control
Implements various anonymization methods. Useful for microdata protection and risk assessment.
Maintaining Privacy with Open Data - Workshop by Ruben Arslan on anonymization strategies for behavioral research data.
LMU University Library Data Management Services provides guidance on organizing and storing research data.
- Contact: rdm@ub.uni-muenchen.de
- Services: Storage recommendations, metadata guidance, repository selection, DMP support
Data Management Checklist
Throughout Data Collection:
- Follow your file naming convention and directory structure
- Back up data following the 3-2-1 rule
- Keep README files and data dictionaries current
- Commit code and documentation changes to version control
- Run quality control checks as data arrive
Before Moving to Analysis:
- Verify completeness, ranges, duplicates, and formats one final time
- Document QC decisions, exclusions, and their reasons
- Anonymize or pseudonymize human participant data before sharing
- Confirm access controls and backups are in place