Doing Our Best

Approaches in Scientific Computing

Kathryn (Katy) Huff

IACS, Stony Brook, NY, November 8, 2018

Chain reaction
Science can do much more than we imagine.
“Heavier-than-air flying machines are impossible” - Lord Kelvin, 1895
“the first principle is that you must not fool yourself, and you are the easiest person to fool.” - R. Feynman, 1974
“I am thinking about something much more important than bombs. I am thinking about computers.” - John von Neumann, 1946.


  • builds and organizes knowledge
  • tests explanations about the universe
  • systematically,
  • objectively,
  • transparently,
  • and reproducibly.

Otherwise it's not science.

Science relies on

  • peer review,
  • skepticism,
  • transparency,
  • attribution,
  • accountability,
  • collaboration,
  • and impact.

Since the 6th century BCE, science has been perfecting these tenets.

Open source software is now superior at all of them.



  • improve efficiency,
  • reduce human error,
  • automate the mundane,
  • simplify the complex,
  • and accelerate research.

But scientists aren't trained to use them effectively.

Getting Started

“ Organized Skepticism. Scientists are critical: All ideas must be tested and are subject to rigorous structured community scrutiny.” - R.K. Merton, 1942

Data Storage

  • Good: pencil and paper
  • Better: spreadsheet
  • Best: standardized file format, database management system

Formats: Evaluated Nuclear Data File (ENDF), Evaluated Nuclear Structure Data File (ENSDF), Hierarchical Data Format (HDF), etc.

Management: C/Python/Fortran APIs, SQL, MySQL, MongoDB, etc.
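
As a minimal sketch of the "Best" tier, the snippet below stores a few measurements in a SQLite database using Python's built-in sqlite3 module. The table name, column names, and values are hypothetical placeholders, and the in-memory database is an assumption made only to keep the example self-contained.

```python
import sqlite3

# Hypothetical measurements: (sample id, temperature in kelvin).
measurements = [("a1", 293.5), ("a2", 301.2), ("a3", 298.7)]

# ":memory:" keeps the sketch self-contained; a real project would pass a file path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurement (sample TEXT PRIMARY KEY, temperature_k REAL)")
conn.executemany("INSERT INTO measurement VALUES (?, ?)", measurements)
conn.commit()

# Query the data back with SQL instead of hand-parsing a spreadsheet export.
hot_samples = [
    row[0]
    for row in conn.execute(
        "SELECT sample FROM measurement WHERE temperature_k > ? ORDER BY sample",
        (298.0,),
    )
]
print(hot_samples)
```

The same pattern carries over to other database management systems; only the connection call changes.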

Backing Up Files

  • Good: hope
  • Better: nightly emails
  • Best: remote version control

Version Control Systems: cvs, svn, hg, git

Managing Changes

  • Good: naming convention
  • Better: clever naming convention
  • Best: local version control

Version Control.

Getting It Done

“ It takes just as much time to write a good paper as it takes to write a bad one. ” - Polterovich, 2014


  • Good: pencil and calculator
  • Better: spreadsheets, matlab, mathematica
  • Best: scripting, open source libraries, modern programming language

Hint: Python, scipy, numpy, numba, pandas, scikit-learn, scikit-image, etc.

Multiple File Cleanup

  • Good: manually edit every file
  • Better: search and replace in each file
  • Best: scripted batch editing

Hint: try a tutorial on BASH, CSH, Python, or Perl.
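
Scripted batch editing might look like the sketch below. The helper name batch_replace, the file names, and the "units" text are all hypothetical. The demonstration runs in a temporary directory so that it touches no real files.

```python
import re
import tempfile
from pathlib import Path


def batch_replace(directory, pattern, replacement, glob="*.txt"):
    """Apply one regex substitution to every matching file; return changed file names."""
    changed = []
    for path in sorted(Path(directory).glob(glob)):
        text = path.read_text()
        new_text = re.sub(pattern, replacement, text)
        if new_text != text:
            path.write_text(new_text)
            changed.append(path.name)
    return changed


# Demonstration on throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "run1.txt").write_text("units = cm\n")
    Path(tmp, "run2.txt").write_text("units = m\n")
    changed_files = batch_replace(tmp, r"units = cm\b", "units = m")
    run1_after = Path(tmp, "run1.txt").read_text()

print(changed_files)
```

Running the edit from a script also leaves a record of exactly what was changed, which a manual search and replace does not.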

Executing Workflows

  • Good: retype a series of commands
  • Better: bash script
  • Best: build system

Build System Tools: make, autoconf, automake, cmake, etc.

Data Structures

  • Good: 100 string variables holding doubles
  • Better: lists of lists of doubles
  • Best: appropriate powerful data structures

Hint: In Fortran, learn about arrays. In C++, learn about maps, vectors, deques, queues, etc. In Python, the power lies in dictionaries and numpy arrays.
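
To illustrate the point about dictionaries, here is a small sketch. The variable names are hypothetical and the half-life values are approximate, included for illustration only.

```python
# Approximate half-lives in years, keyed by nuclide name (illustrative values).
half_life_years = {"H-3": 12.32, "Co-60": 5.27, "Cs-137": 30.17}

# Look up by a meaningful key rather than by position in a list of lists.
cs137 = half_life_years["Cs-137"]

# Dictionaries can be filtered and sorted without extra bookkeeping variables.
short_lived = sorted(name for name, years in half_life_years.items() if years < 15.0)
print(short_lived)
```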

API Design

  • Good: single block of procedural code
  • Better: separate functions
  • Best: small, testable functions, grouped into classes, DRY

DRY: Don't Repeat Yourself. Code replication is bug proliferation.
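
One way to picture small, testable, non-repeating functions is the sketch below. The function names are hypothetical, and plain functions are used instead of a class to keep the example short.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def variance(values):
    """Return the population variance, reusing mean() instead of re-deriving it."""
    center = mean(values)
    return mean([(v - center) ** 2 for v in values])


def normalize(values):
    """Shift values so that their mean is zero."""
    center = mean(values)
    return [v - center for v in values]


print(normalize([1.0, 2.0, 3.0]))
```

Because the averaging logic lives in one place, a bug fixed in mean() is fixed for every caller at once.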

Variable Naming

  • Good: d1, d2, d3
  • Better: x, y, z
  • Best: p.x, p.y, p.z, p=Point(x,y,z)
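
The "Best" naming style above could be realized with a small data class. This is only a sketch; the distance helper is a hypothetical addition for demonstration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A position in three-dimensional space."""
    x: float
    y: float
    z: float


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2)


origin = Point(0.0, 0.0, 0.0)
p = Point(x=1.0, y=2.0, z=2.0)
print(distance(origin, p))
```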

File I/O

  • Good: none, hardcoded variables
  • Better: plain text input file, line-by-line homemade string parsing
  • Best: file parsing library

Tools: python argparse, xml rng, etc.
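
A minimal argparse sketch follows. The program description, argument names, and the file name reactor.inp are hypothetical.

```python
import argparse


def build_parser():
    """Describe the command-line interface in one place."""
    parser = argparse.ArgumentParser(description="Run a hypothetical simulation.")
    parser.add_argument("input_file", help="path to the input file")
    parser.add_argument("--steps", type=int, default=100, help="number of time steps")
    parser.add_argument("--verbose", action="store_true", help="print progress")
    return parser


# Passing an explicit list lets the parser be exercised without a shell.
args = build_parser().parse_args(["reactor.inp", "--steps", "250"])
print(args.input_file, args.steps, args.verbose)
```

The library handles type conversion, defaults, and the --help text, all of which a homemade parser would have to reimplement.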

Getting It Right

“ The scientific method’s central motivation is the ubiquity of error—the awareness that mistakes and self-delusion can creep in absolutely anywhere and that the scientist’s effort is primarily expended in recognizing and rooting out error. ” - Donoho, 2009.

Error Detection

  • Good: show results to experts
  • Better: integration testing
  • Best: unit test suite, continuous integration
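
A unit test suite can start very small. The sketch below uses Python's built-in unittest module; the function decay_fraction and its tests are hypothetical examples.

```python
import unittest


def decay_fraction(elapsed, half_life):
    """Fraction of a sample remaining after `elapsed` time units."""
    return 0.5 ** (elapsed / half_life)


class TestDecayFraction(unittest.TestCase):
    def test_nothing_decays_at_time_zero(self):
        self.assertEqual(decay_fraction(0.0, 10.0), 1.0)

    def test_half_remains_after_one_half_life(self):
        self.assertAlmostEqual(decay_fraction(10.0, 10.0), 0.5)

    def test_quarter_remains_after_two_half_lives(self):
        self.assertAlmostEqual(decay_fraction(20.0, 10.0), 0.25)


# Run the suite programmatically; a continuous integration service would
# typically run the same tests on every commit.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDecayFraction)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```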

Error Diagnostics

  • Good: re-re-read the code
  • Better: print statements
  • Best: use a linter, a debugger, and a profiler

Tools: cpplint, pyflakes, gdb, lldb, pdb, idb, valgrind, kernprof, kcachegrind

Error Correction

  • Good: fix code
  • Better: fix, add an exception
  • Best: fix, add an exception, add a test
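
The three steps of the "Best" tier might look like this in practice. The function safe_ratio and the bug it guards against are hypothetical.

```python
def safe_ratio(numerator, denominator):
    """Divide two numbers, refusing a zero denominator."""
    # Fix plus exception: fail loudly with a clear message at the point of error.
    if denominator == 0:
        raise ValueError("denominator must be nonzero")
    return numerator / denominator


def test_zero_denominator_is_rejected():
    """Regression test: the original bug must not come back unnoticed."""
    try:
        safe_ratio(1.0, 0.0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a zero denominator")


test_zero_denominator_is_rejected()
print(safe_ratio(6.0, 3.0))
```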

Getting It Together

“ Two of the biggest challenges scientists and other programmers face when working with code and data are keeping track of changes (and being able to revert them if things go wrong), and collaborating on a program or dataset. ” - Wilson, et al. 2014.

Merging Collaborative Work

  • Good: single master copy, waiting
  • Better: emails and patches
  • Best: remote version control

Peer Review For Code

  • Good: separation of concerns
  • Better: shared repository
  • Best: peer-reviewed pull requests
“ just-in-time review of small code changes is more likely to succeed than large-scale end-of-work reviews. ” - Petre, Wilson 2014


  • Good: weekly research meetings, year-long tasks
  • Better: daily conversations, month-long goals
  • Best: pair programming, issue tracking

Software Handovers

  • Good: zip file, theory paper
  • Better: comments in code, example input file
  • Best: automated documentation, test suite

Books: Clean Code, Working Effectively with Legacy Code

Tools: sphinx, doxygen, googletest, unittest, nosetests

Getting It Out There

“ If a piece of scientific software is released in the forest, does it change the field? ”


  • Good: custom formatting, clickable GUI
  • Better: plot format templates (excel, mathematica)
  • Best: scripted plotting, matplotlib, gnuplot, etc.


  • Good: stone tablet, microsoft word
  • Better: word with track changes, open office
  • Best: plain text markup with version control and a makefile

Tools: LaTeX, markdown, restructured text

Distribution Control

  • Good: "email to request access"
  • Better: license file
  • Best: license file, citation file, DOI, forkable repository


Community Adoption

  • Good: none, internal use only
  • Better: online repository, developer email online
  • Best: issue tracker, user/developer listhost(s), online documentation

Unique Issue in Nuclear Engineering

Export control is serious.

Export control is a big deal in nuclear engineering.

Write programs for people, not computers.

    • A program should not require its readers to hold more than a handful of facts in memory at once.
    • Make names consistent, distinctive, and meaningful.
    • Make code style and formatting consistent.

Let the computer do the work.

  • Make the computer repeat tasks.
  • Save recent commands in a file for re-use.
  • Use a build tool to automate workflows.

Make incremental changes.

  • Work in small steps with frequent feedback and course correction.

Use a version control system.

  • Put everything that has been created manually in version control.

Don't repeat yourself (or others).

  • Every piece of data must have a single authoritative representation in the system.
  • Modularize code rather than copying and pasting.
  • Re-use code instead of rewriting it.

Plan for mistakes.

  • Add assertions to programs to check their operation.
  • Use an off-the-shelf unit testing library.
  • Turn bugs into test cases.
  • Use a symbolic debugger.
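
As a sketch of the first bullet, assertions can state what a function expects and what it promises. The function normalize_weights is a hypothetical example.

```python
def normalize_weights(weights):
    """Scale non-negative weights so that they sum to one."""
    # Preconditions: check the inputs before doing any work.
    assert len(weights) > 0, "need at least one weight"
    assert all(w >= 0 for w in weights), "weights must be non-negative"
    total = sum(weights)
    assert total > 0, "at least one weight must be positive"

    normalized = [w / total for w in weights]

    # Postcondition: check the result before handing it back.
    assert abs(sum(normalized) - 1.0) < 1e-9, "normalized weights must sum to one"
    return normalized


print(normalize_weights([1.0, 1.0, 2.0]))
```

Note that Python skips assert statements when run with the -O flag, so checks on untrusted user input are better written as explicit exceptions.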

Optimize software only after it works correctly.

  • Use a profiler to identify bottlenecks.
  • Write code in the highest-level language possible.
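
Profiling need not be elaborate. The sketch below uses the standard library's cProfile and pstats modules on a hypothetical, deliberately slow function.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive loop, standing in for a real bottleneck."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
answer = slow_sum_of_squares(100_000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top five entries.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists where time is actually spent, which is often not where intuition suggests.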

Document design and purpose, not mechanics.

  • Document interfaces and reasons, not implementations.
  • Refactor code in preference to explaining how it works.
  • Embed the documentation for a piece of software in that software.
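
One common way to embed documentation in Python is a docstring that records purpose and includes checkable examples. The function below is a hypothetical illustration.

```python
def celsius_to_kelvin(temp_c):
    """Convert a temperature from degrees Celsius to kelvin.

    Downstream calculations assume absolute temperature, so conversion
    happens once, at the boundary where data enters the program.

    >>> celsius_to_kelvin(0.0)
    273.15
    >>> celsius_to_kelvin(-273.15)
    0.0
    """
    return temp_c + 273.15


print(celsius_to_kelvin(0.0))
```

If this were saved to a file, running `python -m doctest` on that file would execute the examples in the docstring, so the documentation is tested along with the code.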


  • Use pre-merge code reviews.
  • Use pair programming when bringing someone new up to speed and when tackling particularly tricky problems.
  • Use an issue tracking tool.

“Reading brings us unknown friends” - Honoré de Balzac

  • BIDS
  • Justin Kitzes
  • Fatma Imamoglu
  • Daniel Turek
  • Ben Marwick
  • Chapter Authors
  • Case Study Authors
  • Reproducibility Working Group

repro mission

Reproducibility and Open Science Conference

May 21&22, 2015

  • Three days
  • Invitation Only
  • Case Studies, Education, Self-assessment
jgukelberger flow
akrause flow
  • Incentives
  • Pain Points
  • Recommendations from the Authors
  • A Little Data
  • Needs


  • verifiability
  • collaboration
  • efficiency
  • extensibility
  • "focus on science"
  • "forced planning"
  • "safety for evolution"

Pain Points

  • People and Skills
  • Dependencies, Build Systems, and Packaging
  • Hardware Access
  • Testing
  • Publishing
  • Data Versioning
  • Time and Incentives
  • Data restrictions


  • version control your code
  • open your data
  • automate everywhere possible
  • document your processes
  • test everything
  • use free and open tools

Recommendations: Continued

  • avoid excessive dependencies
  • or at least package their installation
  • host code on a collaborative platform (e.g. GitHub)
  • get DOIs for data and code
  • plain text data is preferred, timeless
  • explicitly set seeds
  • workflow frameworks can be overkill
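
The recommendation to set seeds explicitly can be sketched with the standard library's random module. The function simulate is a hypothetical stand-in for any stochastic calculation.

```python
import random


def simulate(seed, n=5):
    """Draw n pseudo-random numbers from a generator with an explicit seed."""
    rng = random.Random(seed)  # A local generator avoids hidden global state.
    return [rng.random() for _ in range(n)]


first = simulate(seed=42)
second = simulate(seed=42)
other = simulate(seed=7)

# The same seed reproduces the same sequence of draws.
print(first == second)
```

Recording the seed alongside the results lets others rerun exactly the same calculation.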

Emergent Needs

  • Common denominator tools should support reproducibility
  • Improved configuration and build systems
  • Reproducibility at scale for HPC
  • Standardized hardware configurations for limited-availability experimental apparatuses.
  • Better understanding of incentives for unit testing.
  • Greater adoption of unit testing irrespective of programming language.
  • Broader community adoption around publication formats that allow parallel editing
  • Broader adoption of data storage, versioning, and management tools.
  • Increased community recognition of the benefits of reproducibility.
  • Incentive systems where reproducibility is not self-incentivizing.
  • Standards around scrubbed and representational data
  • Community adoption for file format standards within some domains.
  • Domain standards which translate well outside of their own scientific communities.

The Journal of Open Source Software (JOSS) is a developer friendly journal for research software packages.

What exactly do you mean by 'journal'?

The Journal of Open Source Software (JOSS) is an academic journal (ISSN 2475-9066) with a formal peer review process that is designed to improve the quality of the software submitted. Upon acceptance into JOSS, a CrossRef DOI is minted and we list your paper on the JOSS website.

(More: JOSS Editorial Board. "About JOSS" 2018.)

Don't we have enough journals already?

Perhaps, and in a perfect world we'd rather papers about software weren't necessary but we recognize that for most researchers, papers and not software are the currency of academic research and that citations are required for a good career.

(More: JOSS Editorial Board. "About JOSS" 2018.)

You said developer friendly, what do you mean?

We have a simple submission workflow and extensive documentation to help you prepare your submission. If your software is already well documented then paper preparation should take no more than an hour.

(More: JOSS Editorial Board. "About JOSS" 2018.)

DOI: 10.7717/peerjcs.147/fig-1

Image DOI: 10.7717/peerjcs.147/fig-2

JOSS papers accepted by month, as of today.


A lot of these thoughts came from my personal experience. However, much of it was annealed from conversations with colleagues throughout the scientific and computing communities (too many of you to name).


Ok, I'm convinced. So how can one learn this stuff?

Online Resources


Good Books, etc.

  • Clean Code - Robert C. Martin
  • Working Effectively with Legacy Code - Michael Feathers
  • Effective Computation in Physics - Huff, Scopatz


Katy Huff
