I need to write a data management and sharing plan. Guide me: 1) Describe what data will be generated (type
Okay, this is a crucial document for any research project. I'll guide you through each section of a comprehensive Data Management and Sharing Plan (DMSP), providing prompts and best practices. Remember to tailor this to your specific project, data, and institutional/funder requirements.
Data Management and Sharing Plan (DMSP) Template
Project Title: [Your Project Title]
Principal Investigator(s): [Your Name(s) and Affiliation(s)]
Funding Agency (if applicable): [e.g., NIH, NSF, ERC, Internal Grant]
Date: [Current Date]
Version: 1.0
Introduction
This Data Management and Sharing Plan (DMSP) outlines how data generated during the [Project Title] project will be managed, documented, stored, preserved, and shared. This plan adheres to [Your Institution]'s policies and [Funder]'s guidelines (e.g., NIH Data Management and Sharing Policy). The goal is to ensure that research data are Findable, Accessible, Interoperable, and Reusable (FAIR) to the maximum extent possible, while also addressing ethical, legal, and privacy considerations.
1. Data Description: What Data Will Be Generated?
- Type of Data:
- What specific types of data will be generated or collected? (e.g., quantitative survey data, qualitative interview transcripts, experimental results, imaging data, genomic sequences, sensor data, computational models, code, text corpora, audio/video recordings).
- Example: "Quantitative data from a 50-item survey, qualitative data from 20 semi-structured interviews, and demographic data."
- Format of Data (Raw and Processed):
- What are the initial file formats for raw data? (e.g., .csv, .xlsx, .txt, .wav, .jpg, .dcm, .fastq, .mat, .rdata, .json).
- What formats will processed or analyzed data be in? (e.g., .csv, .sav (SPSS), .rda, .tiff, .nc (NetCDF), .pdf, .html).
- Aim for open, non-proprietary formats where possible for long-term preservation and sharing.
- Volume/Size of Data:
- Estimate the total volume of data (e.g., 50 GB, 2 TB, ~10,000 files).
- Consider both raw and processed data.
- Example: "Approximately 5 GB of raw survey data, 2 GB of anonymized interview transcripts, and 1 GB of analyzed statistical output and figures."
- Metadata Content:
- What information will be necessary to understand and use the data? (e.g., variable names, descriptions, units, codebooks, data dictionaries, methodological details, instrument specifications, study protocols, README files).
- Example: "Survey data will be accompanied by a data dictionary defining all variables, their types, and permissible values. Interview data will include context notes and a legend for anonymization."
- Relationship to Research Questions:
- Briefly explain how the generated data will directly address the project's research questions or objectives.
2. Data Collection: How Will the Data Be Collected?
- Methods of Collection:
- Describe the primary methods (e.g., online surveys, in-person interviews, laboratory experiments, field observations, sensor deployment, web scraping, existing public datasets).
- Tools/Instruments Used:
- Specify software, hardware, or instruments (e.g., Qualtrics, REDCap, custom Python script, eye-tracker, specific PCR machine, spectrophotometer, GIS software).
- Protocols and Procedures:
- Will standardized protocols be used? How will consistency be ensured across data collectors or time points?
- Example: "All survey data will be collected via a secure Qualtrics platform. Interviewers will follow a standardized script and receive specific training on transcription guidelines."
- Data Entry/Capture:
- How will data be entered or captured? (e.g., direct electronic capture, manual data entry, automated scripts).
- If manual entry, describe safeguards for accuracy (e.g., double-entry, validation checks); a minimal double-entry check is sketched at the end of this section.
- Anonymization/De-identification during Collection (if applicable):
- If sensitive data is collected, how will personally identifiable information (PII) be separated or removed at the point of collection or during initial processing?
- Example: "Participant names and contact information will be collected separately from survey responses and stored in an encrypted file accessible only by the PI. Interview recordings will be transcribed by a secure service under NDA, and PII will be removed during transcription."
3. Data Organization: How Will the Data Be Organized?
- File Naming Conventions:
- Describe a consistent, descriptive, and machine-readable file naming convention.
- Example: "All files will follow the format:
YYYYMMDDProjectIDDataTypeDescriptionVersion.ext
(e.g.,20231026PROJ001SurveyRawv01.csv
)." - Folder Structure:
- Outline a logical, hierarchical folder structure (e.g.,
RawData/
,ProcessedData/
,Analysis_Scripts/
,Documentation/
,Outputs/
). - Example:
ProjectName/
├── Raw_Data/
│ ├── Surveys/
│ ├── Interviews/
│ └── Sensors/
├── Processed_Data/
│ ├── Anonymized_Interviews/
│ └── Cleaned_Survey_Data/
├── Analysis_Scripts/
│ ├── R_Scripts/
│ └── Python_Scripts/
├── Documentation/
│ ├── Codebooks/
│ ├── Protocols/
│ └── README.md
└── Outputs/
├── Figures/
└── Tables/
- Version Control:
- How will different versions of files be managed, especially during analysis? (e.g., appending version numbers to filenames, using Git for code/scripts, institutional cloud storage with version history).
- Example: "Key data files and analysis scripts will utilize version control (e.g.,
v01
,v02
). For collaborative coding, Git and GitHub will be used."
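As referenced above, a small script can create the agreed folder hierarchy and flag files that stray from the naming convention. This is only a sketch under two assumptions: the project root is named ProjectName and the convention is exactly YYYYMMDD_ProjectID_DataType_Description_Version.ext; adapt the folder list and regular expression to your actual plan.

```python
# Minimal sketch: scaffold the agreed folder structure and check file names
# against the YYYYMMDD_ProjectID_DataType_Description_Version.ext convention.
# ROOT, FOLDERS, and NAME_PATTERN are illustrative assumptions to adapt locally.
import re
from pathlib import Path

ROOT = Path("ProjectName")
FOLDERS = [
    "Raw_Data/Surveys", "Raw_Data/Interviews", "Raw_Data/Sensors",
    "Processed_Data/Anonymized_Interviews", "Processed_Data/Cleaned_Survey_Data",
    "Analysis_Scripts/R_Scripts", "Analysis_Scripts/Python_Scripts",
    "Documentation/Codebooks", "Documentation/Protocols",
    "Outputs/Figures", "Outputs/Tables",
]

# e.g. 20231026_PROJ001_Survey_Raw_v01.csv
NAME_PATTERN = re.compile(r"^\d{8}_[A-Z0-9]+_[A-Za-z]+_[A-Za-z]+_v\d{2}\.[a-z0-9]+$")

def scaffold() -> None:
    """Create the agreed folder hierarchy (no-op for folders that already exist)."""
    for folder in FOLDERS:
        (ROOT / folder).mkdir(parents=True, exist_ok=True)

def check_names(folder: Path) -> list[str]:
    """Return files under `folder` that violate the naming convention."""
    return [str(p) for p in folder.rglob("*") if p.is_file() and not NAME_PATTERN.match(p.name)]

if __name__ == "__main__":
    scaffold()
    for bad in check_names(ROOT / "Raw_Data"):
        print("Non-conforming file name:", bad)
```

Running such a script at project start, and periodically thereafter, keeps structure and naming consistent across collaborators.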
4. Data Storage: How Will the Data Be Stored (Active Phase)?
- Primary Storage Location(s):
- Where will the active data be stored during the project? (e.g., institutional network drives, secure cloud storage like Box/OneDrive/Google Drive for Education, dedicated research servers, local encrypted drives).
- Specify the capacity of the chosen storage.
- Example: "All active data will be stored on [University Name]'s secure network drive (accessible via VPN) with 5 TB allocated storage. Sensitive data will reside in a restricted-access folder."
- Access Control:
- Who will have access to the data? How will access permissions be managed? (e.g., role-based access, specific user groups).
- Example: "Access to the project's network drive folder is restricted to the PI and designated research assistants. Sensitive data subfolders have even tighter restrictions, requiring specific authorization from the PI."
- Encryption (if applicable):
- Will data be encrypted at rest or in transit? (Often standard on institutional services).
- Responsibility for Management:
- Who is responsible for overseeing active data storage?
5. Data Backup: How Will the Data Be Backed Up?
- Backup Strategy:
- Describe the backup frequency and mechanism. (e.g., automated daily backups, manual weekly backups).
- Consider the 3-2-1 rule: 3 copies of data, on 2 different media, with 1 copy offsite.
- Example: "Data on the institutional network drive is automatically backed up daily by [University Name] IT services to an offsite location. Additionally, key raw data and cleaned datasets will be manually backed up weekly to an encrypted external hard drive stored securely in the PI's office (physical offsite copy)."
- Backup Locations:
- Where will backup copies be stored? (Should be separate from primary storage).
- Recovery Plan:
- How would data be recovered in case of data loss? (e.g., contacting IT support, restoring from external drive).
- Frequency of Testing Backups:
- How often will the integrity of backups be verified?
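Backup integrity can be verified with checksums rather than by eye. The sketch below compares SHA-256 digests of files in the primary location against the backup copy; the two paths are placeholders and would be replaced by the project's actual storage and backup locations.

```python
# Minimal sketch of a backup integrity check: SHA-256 checksums of the primary
# copy are compared against the backup copy. Paths are illustrative placeholders.
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backup(primary: Path, backup: Path) -> list[str]:
    """List files that are missing from the backup or differ from the primary."""
    problems = []
    for src in primary.rglob("*"):
        if not src.is_file():
            continue
        dst = backup / src.relative_to(primary)
        if not dst.exists():
            problems.append(f"missing in backup: {dst}")
        elif checksum(src) != checksum(dst):
            problems.append(f"checksum mismatch: {dst}")
    return problems

if __name__ == "__main__":
    for issue in verify_backup(Path("/data/project"), Path("/backup/project")):
        print(issue)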
6. Data Security: How Will the Data Be Secured?
- Access Control and Authentication:
- Detail specific measures for preventing unauthorized access (e.g., multi-factor authentication, strong password policies, least privilege access).
- Encryption:
- Specify if and how data will be encrypted (e.g., full disk encryption on local machines, encrypted cloud storage, encrypted transfers); a minimal file-encryption sketch appears at the end of this section.
- Physical Security:
- For any data stored on local devices or physical media, describe physical security measures (e.g., locked offices, secure cabinets).
- Network Security:
- How will data be protected during transfer? (e.g., secure network protocols like SFTP, VPN for remote access).
- Personnel Training:
- How will team members be trained on data security protocols?
- Incident Response Plan:
- What steps will be taken in the event of a security breach or data loss? (e.g., immediately inform IT, isolate affected systems, assess impact, inform affected individuals if PII is involved).
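For the encryption item above, one common option for file-level encryption at rest is a symmetric-key scheme. The following is a minimal sketch using the third-party cryptography package (not a project-specific tool); the file names are placeholders, and in practice the key would live in an institutional secrets store or password manager, never alongside the data.

```python
# Minimal sketch of encrypting a sensitive file at rest with a symmetric key,
# using the third-party `cryptography` package (pip install cryptography).
# Key handling is deliberately simplified; paths and file names are placeholders.
from pathlib import Path
from cryptography.fernet import Fernet

def encrypt_file(plain: Path, encrypted: Path, key: bytes) -> None:
    """Write an encrypted copy of `plain` to `encrypted`."""
    encrypted.write_bytes(Fernet(key).encrypt(plain.read_bytes()))

def decrypt_file(encrypted: Path, plain: Path, key: bytes) -> None:
    """Restore the plaintext from an encrypted copy."""
    plain.write_bytes(Fernet(key).decrypt(encrypted.read_bytes()))

if __name__ == "__main__":
    key = Fernet.generate_key()  # store securely, e.g. in an institutional secrets store
    encrypt_file(Path("participants.csv"), Path("participants.csv.enc"), key)
```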
7. Data Quality Control: How Will the Data Be Quality Controlled?
- During Collection:
- How will data accuracy be ensured at the point of collection? (e.g., instrument calibration, observer training, standardized protocols, double-blind data entry, real-time validation checks in survey software).
- Example: "Survey platforms will incorporate skip logic and range checks. Interviewers will receive training and conduct pilot interviews for consistency. Lab instruments will be calibrated daily."
- During Processing/Analysis:
- Describe procedures for cleaning and validating data (e.g., script-based checks for outliers/missing values, cross-referencing, statistical checks, manual review).
- Example: "R scripts will be developed to identify and flag outliers, missing values, and inconsistencies. These scripts will be documented, and all data cleaning decisions will be logged."
- Documentation of QC:
- How will quality control procedures and decisions be documented? (e.g., QC logs, data cleaning scripts, audit trails).
- Responsibility:
- Who is responsible for conducting and overseeing data quality checks?
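The scripted checks mentioned in the processing step can be as simple as the sketch below: it flags missing values, out-of-range values, and crude outliers in a cleaned survey table and writes the findings to a log. The column names, the 18-99 age range, and the file paths are illustrative assumptions, not prescribed values.

```python
# Minimal sketch of scripted quality-control checks on a cleaned survey table:
# flag missing values, out-of-range values, and simple outliers, then log them.
import pandas as pd

df = pd.read_csv("Processed_Data/Cleaned_Survey_Data/survey_clean_v01.csv")
qc_log = []

# 1. Missing values per variable
for col, n_missing in df.isna().sum().items():
    if n_missing:
        qc_log.append(f"{col}: {n_missing} missing values")

# 2. Range checks against the codebook (example: age must be 18-99)
out_of_range = df.query("age < 18 or age > 99")
if not out_of_range.empty:
    qc_log.append(f"age: {len(out_of_range)} values outside 18-99")

# 3. Simple outlier flag: more than 3 standard deviations from the mean
for col in df.select_dtypes("number"):
    z = (df[col] - df[col].mean()) / df[col].std()
    n_outliers = int((z.abs() > 3).sum())
    if n_outliers:
        qc_log.append(f"{col}: {n_outliers} values beyond 3 SD")

pd.Series(qc_log).to_csv("Documentation/qc_log.csv", index=False)
```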
8. Data Documentation: How Will the Data Be Documented?
- Content of Documentation/Metadata:
- List the types of documentation that will be created:
- Project-level: README file (overview, contact info, licenses, acknowledgments).
- Dataset-level: Data dictionary/codebook (variable names, definitions, units, valid ranges, missing value codes), study protocol, methodology description. (A sketch for drafting a data dictionary skeleton appears at the end of this section.)
- File-level: Description of each file, date created, version.
- Process-level: Scripts for data cleaning/analysis, workflow diagrams.
- Metadata Standards (if applicable):
- Will any specific metadata standards be followed? (e.g., Dublin Core, ISO 19115, DDI (Data Documentation Initiative), EML (Ecological Metadata Language)).
- If unsure, state you will follow best practices recommended by the chosen repository.
- Timing of Documentation:
- When will metadata be created and updated? (e.g., ongoing throughout the project, finalized at project end).
- Storage of Documentation:
- Where will metadata be stored? (e.g., alongside the data files, in a dedicated documentation folder, within a data repository).
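A first draft of the data dictionary mentioned above can be generated from the dataset itself and then completed by hand. The sketch below builds a skeleton with one row per variable (type, missing count, example value) and empty description/units columns; the file paths are illustrative placeholders.

```python
# Minimal sketch: draft a data-dictionary skeleton from a dataset, one row per
# variable, with blank description and units columns to be completed manually.
import pandas as pd

df = pd.read_csv("Processed_Data/Cleaned_Survey_Data/survey_clean_v01.csv")

dictionary = pd.DataFrame({
    "variable": df.columns,
    "dtype": [str(t) for t in df.dtypes],
    "n_missing": df.isna().sum().values,
    "example_value": [df[c].dropna().iloc[0] if df[c].notna().any() else "" for c in df.columns],
    "description": "",      # to be filled in by the research team
    "units_or_codes": "",   # units, valid ranges, or value codes from the protocol
})
dictionary.to_csv("Documentation/Codebooks/data_dictionary_draft.csv", index=False)
```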
9. Data Preservation: How Will the Data Be Preserved (Long-term)?
- Selection of Data for Preservation:
- What specific datasets (raw, processed, analyzed outputs, code, models, documentation) will be selected for long-term preservation? (Consider what's essential for reproducibility and reuse).
- Example: "All raw data, cleaned/anonymized datasets, analysis scripts, codebooks, and a final README file will be preserved."
- Chosen Repository/Archive:
- Identify the specific repository where the data will be deposited for long-term preservation. (e.g., institutional repository like [University Name] Dataverse, disciplinary repository like GenBank/PDB/ICPSR, generalist repository like Zenodo/Dryad/Figshare).
- Rationale for choice (e.g., funder requirement, disciplinary standard, institutional support).
- Format Transformation for Preservation:
- Will data need to be converted to open, non-proprietary formats for preservation? (e.g., .docx to PDF/A, SPSS .sav to .csv, proprietary image formats to .tiff).
- Persistent Identifiers (PIDs):
- Will the repository assign persistent identifiers (e.g., DOIs, ARKs) to the datasets? (This is standard for reputable repositories).
- Retention Period:
- How long will the data be preserved? (Often dictated by funder requirements, e.g., 5-10 years post-publication).
10. Data Sharing: How Will the Data Be Shared?
- Timing of Data Sharing:
- When will the data be made available? (e.g., upon publication of primary findings, at the end of the project, after a specific embargo period).
- Sharing Mechanism:
- How will the data be shared? (e.g., via the chosen public data repository, upon request for restricted data).
- Level of Data to be Shared:
- Will raw data, processed data, aggregated data, or a combination be shared?
- Example: "Anonymized, cleaned survey data and qualitative interview summaries will be shared. Raw interview recordings and any direct PII will not be shared publicly."
- Access Conditions:
- What are the conditions for accessing the data? (e.g., fully open access, restricted access requiring a data use agreement, embargoed for a specific period).
- Example: "The anonymized and de-identified dataset will be publicly available under a CC BY 4.0 license. Access to more granular data or data with higher risk of re-identification may be granted via a formal Data Use Agreement (DUA)."
- Anonymization/De-identification for Sharing:
- How will data be anonymized or de-identified to protect privacy while maximizing utility? (e.g., removing direct identifiers, aggregating data, masking specific variables, k-anonymity).
- Example: "All direct identifiers will be removed. Quasi-identifiers (e.g., age, geographic location) will be generalized or aggregated to reduce re-identification risk. A formal risk assessment will be conducted prior to sharing."
- Data Use Limitations/Licenses:
- What license will be applied to the shared data? (e.g., Creative Commons Zero (CC0) Public Domain Dedication, CC BY 4.0, specific institutional license).
- Example: "Data will be released under a Creative Commons Attribution 4.0 International License (CC BY 4.0), allowing others to use, share, and adapt the data, provided appropriate credit is given."
- Attribution/Citation:
- How should others cite the data? (Usually provided by the repository with the DOI).
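The de-identification step referenced above can be scripted so that every shared release is produced the same way and the steps are auditable. Below is a minimal sketch: it drops direct identifiers, generalizes age and geography, and replaces participant IDs with fresh random codes. All column names (name, email, phone, age, postcode) are illustrative assumptions; the actual variable list comes from the project's disclosure risk assessment.

```python
# Minimal de-identification sketch: drop direct identifiers, generalize
# quasi-identifiers, and re-code participant IDs before public release.
import uuid
import pandas as pd

df = pd.read_csv("Processed_Data/Cleaned_Survey_Data/survey_clean_v01.csv")

# 1. Remove direct identifiers entirely
df = df.drop(columns=["name", "email", "phone"], errors="ignore")

# 2. Generalize quasi-identifiers (age into bands, postcode to coarse region)
df["age_band"] = pd.cut(df["age"], bins=[17, 29, 44, 59, 120],
                        labels=["18-29", "30-44", "45-59", "60+"])
df["region"] = df["postcode"].str[:2]
df = df.drop(columns=["age", "postcode"])

# 3. Re-code participant IDs so they cannot be linked back to the key file
df["participant_id"] = [uuid.uuid4().hex[:8] for _ in range(len(df))]

df.to_csv("Outputs/survey_shareable_v01.csv", index=False)
```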
11. Legal, Ethical, and Intellectual Property Considerations
- Human Subjects Protections (if applicable):
- IRB/Ethics Approval: Confirmation of Institutional Review Board (IRB) or equivalent ethics committee approval.
- Informed Consent: Describe how informed consent will address data sharing. (e.g., "Participants will be informed that anonymized data may be shared in a public repository for future research purposes, and their consent will be obtained for this.")
- Confidentiality: How will participant confidentiality be maintained? (e.g., anonymization, pseudonymization, secure storage).
- Animal Subjects Protections (if applicable):
- Confirmation of IACUC or equivalent approval.
- Confidentiality/Privacy Laws:
- Address compliance with relevant regulations (e.g., HIPAA for health data, GDPR for EU personal data, FERPA for educational records).
- Intellectual Property (IP) and Copyright:
- Who owns the data? (Typically the institution, but varies).
- How will copyright be managed for datasets, code, and documentation?
- Example: "Data generated through this project will be subject to [University Name]'s IP policy. The data itself will be licensed under CC BY 4.0 for reuse."
- Data Use Agreements (DUA)/Material Transfer Agreements (MTA):
- If acquiring data from external sources, describe any DUAs or MTAs that govern its use and sharing.
- If providing restricted data, describe the process for DUAs.
- Other Legal/Regulatory Requirements:
- Are there any other specific legal or regulatory requirements for your data type or region? (e.g., export controls, indigenous data sovereignty).
12. Roles and Responsibilities
- Principal Investigator (PI): Overall oversight, final decision-making, ensuring compliance with funder/institutional policies, ensuring adequate resources.
- Data Manager/Research Coordinator: [Name/Role] – Responsible for day-to-day data management, organization, quality control, metadata creation, and preparation for deposit.
- Research Assistants/Students: [Name/Role] – Responsible for data collection, initial processing, and adherence to protocols.
- IT Support: [Department/Contact] – Support for secure storage, network access, backup solutions.
- Institutional Repository Staff: [Department/Contact] – Assistance with data deposit, metadata review, long-term preservation.
- Training: How will team members be trained on data management and sharing best practices?
13. Resources and Budget
- Personnel Time:
- Estimate the effort (e.g., FTE percentage or hours) dedicated to data management, quality control, documentation, and preparation for sharing.
- Example: "0.1 FTE for a dedicated Data Manager for the duration of the project. PI will dedicate 5% effort to oversight."
- Hardware/Software:
- Any specific hardware (e.g., encrypted drives, dedicated servers) or software licenses needed for data management (e.g., statistical software, transcription software, metadata tools).
- Storage Costs:
- Costs associated with active storage (if not covered by institutional overhead).
- Costs for long-term preservation in a repository (some repositories may charge fees based on volume).
- Training Costs:
- Budget for any specific data management training for staff.
- Publication/Repository Fees:
- Any Article Processing Charges (APCs) for data journals or fees for specific data repositories.
14. Timeline for Data Management Activities
- Pre-Award/Project Start:
- Finalize DMSP, obtain IRB/ethics approval, set up storage, establish folder structure and naming conventions.
- During Data Collection (Ongoing):
- Regular data collection, data entry, initial QC, metadata capture, regular backups.
- During Data Analysis:
- Data cleaning, further QC, version control for analysis scripts, update documentation.
- Project End/Publication:
- Final data cleaning and preparation, anonymization/de-identification, final metadata creation, selection of data for preservation, data deposit to repository.
- Post-Project:
- Ongoing monitoring of shared data (e.g., responding to queries), adherence to retention policies.
Conclusion
This Data Management and Sharing Plan ensures that the data generated by the [Project Title] project will be managed responsibly throughout its lifecycle, from collection to long-term preservation and sharing. By adhering to these guidelines, we aim to enhance the integrity, reproducibility, and impact of our research, contributing to the broader scientific community. This DMSP will be reviewed and updated as needed during the project's duration.
Remember to be as specific as possible for each point. Good luck!
Certainly! Writing a data management and sharing plan is a crucial step in ensuring that your research data is well-organized, accessible, and reusable. Here’s a step-by-step guide to help you describe the type of data that will be generated:
Step 1: Identify the Types of Data
- Primary Data:
- Quantitative Data: Numerical data collected through surveys, experiments, or observations. Examples include measurements, counts, and statistical data.
- Qualitative Data: Non-numerical data collected through interviews, focus groups, or observations. Examples include text, audio recordings, and images.
- Secondary Data:
- Data collected from existing sources such as databases, literature, or archives.
- Metadata:
- Information about the data, such as descriptions of the data, methods used for collection, and any relevant context.
- Derived Data:
- Data generated through analysis or processing of primary or secondary data. Examples include statistical models, visualizations, and processed images.
Step 2: Describe the Data in Detail
For each type of data, provide a detailed description:
- Quantitative Data:
- Example: "The project will generate quantitative data from a series of experiments measuring the impact of different fertilizers on crop yield. Data will include measurements of soil nutrients, plant growth rates, and yield quantities."
- Qualitative Data:
- Example: "Qualitative data will be collected through semi-structured interviews with farmers to understand their perceptions and practices related to fertilizer use. Interviews will be audio-recorded and transcribed."
- Secondary Data:
- Example: "Secondary data will be sourced from existing agricultural databases and published literature to provide context and comparative analysis for the primary data collected."
- Metadata:
- Example: "Metadata will include descriptions of the experimental design, data collection methods, and any relevant contextual information. This will ensure that the data is understandable and reusable by other researchers."
- Derived Data:
- Example: "Derived data will include statistical models and visualizations generated from the analysis of primary and secondary data. These will help in interpreting the results and identifying trends and patterns."
Step 3: Organize the Description
Organize your description in a structured format. Here’s an example:
Data Management and Sharing Plan
1. Description of Data to be Generated
Primary Data:
- Quantitative Data: The project will generate quantitative data from a series of experiments measuring the impact of different fertilizers on crop yield. Data will include measurements of soil nutrients, plant growth rates, and yield quantities.
- Qualitative Data: Qualitative data will be collected through semi-structured interviews with farmers to understand their perceptions and practices related to fertilizer use. Interviews will be audio-recorded and transcribed.
Secondary Data:
- Secondary data will be sourced from existing agricultural databases and published literature to provide context and comparative analysis for the primary data collected.
Metadata:
- Metadata will include descriptions of the experimental design, data collection methods, and any relevant contextual information. This will ensure that the data is understandable and reusable by other researchers.
Derived Data:
- Derived data will include statistical models and visualizations generated from the analysis of primary and secondary data. These will help in interpreting the results and identifying trends and patterns.
Step 4: Review and Refine
Review your description to ensure it is clear, comprehensive, and aligned with your research objectives. Make any necessary refinements to improve clarity and completeness.
By following these steps, you will have a well-structured and detailed description of the types of data that will be generated in your research project. This will serve as a solid foundation for the rest of your data management and sharing plan.
Analysis
This comparison demonstrates the different approaches each AI model takes when responding to the same prompt. Here are the key differences observed:
Response Characteristics
Gemini: Provides an extensive, template-style response of roughly 321 sentences covering the full DMSP lifecycle.
Mistral: Provides a more concise, step-by-step response of roughly 43 sentences focused on describing the data to be generated.
Key Takeaways
- Each model brings unique strengths to this type of query
- Response styles vary significantly between models
- Consider your specific use case when choosing between these models
Try This Comparison Yourself
Want to test these models with your own prompts? Visit SNEOS.com to compare AI responses side-by-side in real-time.
This comparison was generated using the SNEOS AI Comparison Tool. Published: October 15, 2025 | Models: Gemini, Mistral