25  Reproducibility and QC

26 Reproducibility Plan

  • Use renv to lock package versions; avoid installing packages during render. Initialize once, snapshot after dependency changes, and commit renv.lock.
  • Move data derivation into scripted steps (e.g., a targets pipeline, which supersedes drake, or a simple 01_ingest.R + 02_clean.R sequence) that read the raw Excel files in _first_phase/_second_phase and write the _temp_*.RDS artifacts deterministically.
  • Set seeds for any resampling/bootstrap; record session info at the end of each render (keep summary.qmd as is).
  • Document data provenance: source file name, timestamp, and row counts before/after filters; write these to a small metadata table that is saved alongside each _temp_*.RDS.
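
The seed, row-count, and metadata points above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the function clean_step(), the toy columns, the example file names, and the .meta.RDS suffix are all hypothetical, and the timestamp recorded here is the time of the run.

```r
# Sketch of a deterministic cleaning step that records provenance.
# clean_step(), the toy columns, and the ".meta.RDS" suffix are
# illustrative assumptions, not names taken from the project.

clean_step <- function(raw, source_file, out_path, max_seconds = 300) {
  n_before <- nrow(raw)

  # Example filter: drop missing durations and values above the threshold
  keep <- !is.na(raw$seconds) & raw$seconds <= max_seconds
  cleaned <- raw[keep, , drop = FALSE]

  saveRDS(cleaned, out_path)

  # Small metadata table saved next to the artifact
  meta <- data.frame(
    source_file = source_file,
    written_at  = format(Sys.time(), "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    filter      = sprintf("seconds <= %s", max_seconds),
    n_before    = n_before,
    n_after     = nrow(cleaned),
    stringsAsFactors = FALSE
  )
  saveRDS(meta, sub("\\.RDS$", ".meta.RDS", out_path))
  meta
}

set.seed(2024)  # fix the RNG before any resampling or bootstrap

# Toy data standing in for an ingested Excel sheet
raw <- data.frame(id = 1:5, seconds = c(12.5, 310, 45, NA, 299))
out_path <- file.path(tempdir(), "_temp_example.RDS")

meta <- clean_step(raw, "example_raw.xlsx", out_path)
print(meta)

sessionInfo()  # record R and package versions at the end of a render
```

Keeping the metadata in a separate file means it can be inspected without loading the full artifact.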

27 Data Provenance Checklist

  • Track how _temp_agreement_decision.RDS, _temp_duration.RDS, _temp_all_data_duration_no_outlier.RDS, and _temp_subjective.RDS are created. Store the script path, input file hashes, and filter criteria in an attribute of the saved object.
  • Avoid manual editing of intermediate files; prefer regenerating from raw data.
  • Keep a short RUNME.md describing the order of scripts to rebuild all RDS/Excel outputs.
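
One possible way to store provenance in an attribute, as the checklist suggests, is sketched below. The helper attach_provenance(), the attribute name "provenance", its field names, and the toy input file are assumptions made for illustration; tools::md5sum() is used as one readily available hash.

```r
# Sketch: attach provenance to an object just before saving it.
# attach_provenance() and the field names are hypothetical.

attach_provenance <- function(x, script_path, input_files, filter_criteria) {
  attr(x, "provenance") <- list(
    script_path     = script_path,
    input_md5       = tools::md5sum(input_files),  # named vector: path -> hash
    filter_criteria = filter_criteria,
    created_at      = Sys.time()
  )
  x
}

# Toy input file standing in for a raw data file
input_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, seconds = c(10, 20, 30)),
          input_file, row.names = FALSE)

dat <- read.csv(input_file)
dat <- attach_provenance(dat, "02_clean.R", input_file, "seconds <= 300")

out_path <- file.path(tempdir(), "_temp_example_provenance.RDS")
saveRDS(dat, out_path)

# The attribute survives the round trip through saveRDS()/readRDS()
str(attr(readRDS(out_path), "provenance"))
```

Because some transformations may drop custom attributes, attaching the provenance as the last step before saveRDS() is the safer order.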

28 Quick Missingness and Outlier Screens

Note for Pathologists (Quality Control): summary of missing data in key diagnostic fields.

Core-level missingness (key diagnostic fields).

Variable     Missing
Slide_Label        0
Dx_Paige           0
Dx_Report          0
Dx_Research        0
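
A missingness table of this shape could be produced along the following lines. This is a sketch on made-up data: the data frame cores and its values are invented for illustration, and only the four column names come from the table above.

```r
# Toy core-level data; the values are made up for illustration only.
cores <- data.frame(
  Slide_Label = c("S1", "S2", "S3", "S4"),
  Dx_Paige    = c("benign", NA, "malignant", "benign"),
  Dx_Report   = c("benign", "benign", "malignant", "benign"),
  Dx_Research = c("benign", "benign", NA, NA),
  stringsAsFactors = FALSE
)

key_fields <- c("Slide_Label", "Dx_Paige", "Dx_Report", "Dx_Research")

# Count NA values per key field
missingness <- data.frame(
  Variable = key_fields,
  Missing  = unname(colSums(is.na(cores[key_fields]))),
  stringsAsFactors = FALSE
)
print(missingness)
```

This counts only NA values; empty strings or placeholder codes would need to be converted to NA first if they are used to mark missing entries.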

Note for Pathologists (Quality Control): sanity check on diagnosis duration, counting outliers (diagnoses that took more than 300 seconds).

Duration sanity check (threshold currently 300 seconds).

n     over_300s  max_seconds
6248  0          297.455
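
The duration check could be written as a small function so that the threshold is stated in one place. The function name duration_check() and the toy vector are assumptions; the output column names mirror the table above.

```r
# Sketch: summarise durations against a threshold (in seconds).
# duration_check() is a hypothetical helper, not part of the project.
duration_check <- function(seconds, threshold = 300) {
  out <- data.frame(
    n           = sum(!is.na(seconds)),
    n_over      = sum(seconds > threshold, na.rm = TRUE),
    max_seconds = max(seconds, na.rm = TRUE)
  )
  # Name the count column after the threshold, e.g. "over_300s"
  names(out)[2] <- paste0("over_", threshold, "s")
  out
}

# Toy durations, made up for illustration
toy_seconds <- c(12.5, 45, 299, 310.2, NA)
res <- duration_check(toy_seconds)
print(res)
```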

29 Rendering Hygiene

  • Keep _common.R limited to loading packages and shared options; perform installations only when setting up the environment (e.g., renv::restore() before render).
  • Prefer chunk-level eval: false for exploratory code instead of commenting out blocks; retain current analyses untouched.
  • When adding new figures/tables, place assets in dedicated folders (e.g., img_qc) to avoid clashing with existing outputs.
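
As a rough sketch of the first point, a _common.R restricted to loading packages and setting options might look like the following. The package list and option values are placeholders, not the project's actual configuration; the script stops with a message instead of installing anything.

```r
# Sketch of a _common.R that only loads packages and sets shared options.
# The package list and option values below are placeholders.

required_pkgs <- c("stats", "utils")  # replace with the project's packages

# Check availability without installing anything
is_installed <- vapply(required_pkgs, requireNamespace, logical(1),
                       quietly = TRUE)

if (!all(is_installed)) {
  stop(
    "Missing packages: ",
    paste(required_pkgs[!is_installed], collapse = ", "),
    ". Run renv::restore() during setup; do not install during render.",
    call. = FALSE
  )
}

for (pkg in required_pkgs) {
  library(pkg, character.only = TRUE)
}

# Shared display options
options(digits = 4, scipen = 999)
```

For exploratory chunks, Quarto accepts the chunk-level option as a `#| eval: false` line at the top of the chunk, which keeps the code visible in the source without running it.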