Harder, Better, Faster: Case Studies in Reproducible Workflows


Kathryn Huff

NYU Reproducibility Symposium

May 03, 2016

Physics, University of Chicago
Nuclear Engineering and Engineering Physics, University of Wisconsin - Madison
Nuclear Engineering, University of California, Berkeley
Nuclear, Plasma, and Radiological Engineering, University of Illinois, Urbana-Champaign

  • Case Study Book Concept
  • Case Study Contributions
  • Lessons Learned
  • Next Steps!
[Figures: reproducibility mission; reproducible workflow diagrams from Justin Kitzes and Kathryn Huff; the basic workflow]

Reproducibility and Open Science Conference

May 21 and 22, 2015

  • Three days
  • Invitation Only
  • Case Studies, Education, Self-assessment
  • https://github.com/BIDS/repro-conf
[Figures: case study workflow diagrams from Jan Gukelberger and Andy Krause]
  • Preface (Stark)
  • Introduction (Kitzes)
  • Assessing the Reproducibility of a Research Project (Rokem, Marwick, Staneva)
  • The Basic Reproducible Workflow Template (Kitzes, Turek)
  • Introducing the Case Studies (Imamoglu, Turek)
  • PART 1: High-Level Case Studies
  • PART 2: Low-Level Case Studies
  • Lessons Learned (Huff et al.)
  • Supporting Reproducible Science (Ram, Marwick)
  • Glossary of Terms and Techniques (Rokem, Chirigati)

Editors


Justin Kitzes, Fatma Imamoglu, Daniel Turek

Supplementary Chapter Authors


  • Philip Stark
  • Justin Kitzes
  • Daniel Turek
  • Fatma Imamoglu
  • Kathryn Huff
  • Karthik Ram
  • Ariel Rokem
  • Ben Marwick
  • Valentina Staneva
  • Fernando Chirigati

Case Study Chapter Contributors!

  • Mary K. Askren
  • Anthony Arendt
  • Lorena A. Barba
  • Pablo Barberá
  • Kyle Barbary
  • Carl Boettiger
  • You-Wei Cheah
  • Garret Christensen
  • Devarshi Ghoshal
  • Chris Gorgolewski
  • Jan Gukelberger
  • Chris Holdgraf
  • Konrad Hinsen
  • David Holland
  • Chris Hartgerink
  • Kathryn Huff
  • Fatma Imamoglu
  • Justin Kitzes
  • Natalie Koh
  • Andy Krause
  • Randy LeVeque
  • Tara Madhyastha
  • José Manuel Magallanes
  • Ben Marwick
  • Olivier Mesnard
  • K. Jarrod Millman
  • K. A. S. Mislan
  • Kellie Ottoboni
  • Gilberto Pastorello
  • Russell Poldrack
  • Karthik Ram
  • Ariel Rokem
  • Rachel Slaybaugh
  • Valentina Staneva
  • Philip Stark
  • Daniel Turek
  • Daniela Ushizima
  • Zhao Zhang

Lessons Learned

  • Pain Points
  • Recommendations from the Authors
  • A Little Data
  • Needs

Pain Points

  • People and Skills
  • Dependencies, Build Systems, and Packaging
  • Hardware Access
  • Testing
  • Publishing
  • Data Versioning
  • Time and Incentives
  • Data Restrictions

Incentives

  • verifiability
  • collaboration
  • efficiency
  • extensibility
  • "focus on science"
  • "forced planning"
  • "safety for evolution"

Recommendations

  • version control your code
  • open your data
  • automate everywhere possible
  • document your processes
  • test everything (see the sketch below)
  • use free and open tools
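
To make "test everything" concrete, here is a minimal sketch of two unit tests in the pytest style; the `rolling_mean` helper is invented for this example, not taken from any case study.

```python
# A minimal "test everything" sketch: unit tests for a hypothetical
# analysis helper, runnable with `pytest`.
import pytest


def rolling_mean(values, window):
    """Return the mean of each consecutive `window`-sized slice of values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def test_constant_input_returns_constant():
    # Averaging a constant series should give the same constant back.
    assert rolling_mean([2.0, 2.0, 2.0, 2.0], window=2) == [2.0, 2.0, 2.0]


def test_oversized_window_is_rejected():
    # A window larger than the data should raise, not silently mislead.
    with pytest.raises(ValueError):
        rolling_mean([1.0, 2.0], window=5)
```

Even two tests like these pin down behavior that a reader, or a future you, can verify after any change.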

Recommendations: Continued

  • avoid excessive dependencies
  • when dependencies can't be avoided, package their installation
  • host code on a collaborative platform (e.g. GitHub)
  • get a Digital Object Identifier for your data and code
  • avoid spreadsheets; plain text data is preferred ("timeless," even)
  • explicitly set pseudorandom number generator seeds (see the sketch below)
  • workflow and provenance frameworks may be too clunky for most scientists
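
As one concrete reading of the seed and plain-text items above, here is a minimal sketch, assuming NumPy is installed; the file name and column layout are invented for the example.

```python
# A minimal sketch: set the PRNG seed explicitly and write results as
# plain-text CSV rather than a spreadsheet. Assumes NumPy is installed.
import csv

import numpy as np

RNG_SEED = 42             # record the seed alongside the results
np.random.seed(RNG_SEED)  # every rerun now draws identical samples

samples = np.random.normal(loc=0.0, scale=1.0, size=5)

with open("samples.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["seed", "sample"])  # header keeps the file self-describing
    for s in samples:
        writer.writerow([RNG_SEED, s])
```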

Recommendations: Outliers

> ... in our estimation, if someone was to try to reproduce our research
> it would probably be more natural for them to write their own scripts
> as this has the additional benefit that they might not fall into any
> error we may have accidentally introduced in our scripts.

> Scientific funding and the number of scientists available to do the work
> is finite. Therefore not every scientific result can, or should be
> reproduced.

[Figures: tools, languages, and testing practices reported across the case studies]

Emergent Needs

  • Better education of scientists in more reproducibility-robust tools.
  • Widely used tools should be more reproducible, so that a lowest-common-denominator tool does not undermine reproducibility.
  • Improved configuration and build systems for portably packaging software, data, and analysis workflows (see the sketch after this list).
  • Reproducibility at scale for high-performance computing.
  • Standardized hardware configurations and experimental procedures for limited-availability experimental apparatuses.
  • Better understanding of why researchers don't respond to the delayed incentives of unit testing as a practice.
  • Greater adoption of unit testing, irrespective of programming language.
  • Broader community adoption of publication formats that allow parallel editing (i.e. any plain text markup language that can be version controlled).
  • Greater scientific adoption of new industry-led tools and platforms for data storage, versioning, and management.
  • Increased community recognition of the benefits of reproducibility.
  • Incentive systems for settings where reproducibility is not self-incentivizing.
  • Standards around scrubbed and representative data, so that analyses can be investigated separately from restricted data sets.
  • Community adoption of file format standards within some domains.
  • Domain standards that translate well outside of their own scientific communities.
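
Where packaging comes up, both in the recommendations and in the build-systems need above, one possible shape is a minimal `setup.py`, sketched here assuming a Python project managed with setuptools; the project name and version pins are invented for the example.

```python
# A minimal sketch of packaging a dependency installation, assuming
# setuptools; the project name and the pinned versions are illustrative.
from setuptools import setup, find_packages

setup(
    name="my-analysis",        # hypothetical project name
    version="0.1.0",
    packages=find_packages(),
    # Pin the versions the results were produced with, so that a later
    # `pip install .` reconstructs the same environment in one step.
    install_requires=[
        "numpy==1.11.0",
        "matplotlib==1.5.1",
    ],
)
```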

Social Science Volume

Collecting Case Studies Spring/Summer 2016

  • Same format: 1,500-2,000 words plus one diagram
  • Bad Hessian blog: http://www.badhessian.org
  • GitHub repo: http://github.com/BIDS/ss-repro-case-public
  • Email Garret Christensen (garret@berkeley.edu) or Cyrus Dioun (dioun@berkeley.edu)

Acknowledgements

  • Justin Kitzes
  • Fatma Imamoglu
  • Daniel Turek
  • Chapter Authors
  • Case Study Authors
  • Reproducibility Working Group


THE END

Katy Huff

katyhuff.github.io/2016-05-03-nyu

Harder, Better, Faster: Case Studies in Reproducible Workflows by Kathryn Huff is licensed under a Creative Commons Attribution 4.0 International License.
Based on a work at http://katyhuff.github.io/2016-05-03-nyu.

Tools reported across the case studies: connectome workbench, stata, zotero, travisci, vistrails, osf, testtools, nipy, coverage/coveralls, ferret, cmake, flickr api, amazon s3, nose, readthedocs, pypi, jira, jenkins, ec2 s3, sweave, shell, jupyter, sql, dataverse, rnw, spark, paraview, data science toolkit, overleaf, virtualenv, crossref, spyder, markdown, dropbox, scikit-image, awk, netcdf, petsc, figshare, sharelatex, pandoc, ibamr, dcvs, twitter api, mendeley, word, d3, beautiful soup, sed, devtools, activepapers, private git repo, cython, outreg2, rsync, zenodo, vagrant, c