How do I make my computational workflow reproducible, and what tools could support me in doing that?

Question

lukascbossert · Answer 1 · 2024-03-06T10:31:10+0000

There are many ways to create a reproducible computational workflow. In my eyes, it needs to include documentation of the code that is used to get the results:

Documentation must be regarded as an integral part of the process of design and coding. A good programming language will encourage and assist the programmer to write clear, self-documenting code, and even perhaps to develop and display a pleasant style of writing. -- Charles Antony Richard Hoare

I focus on a concept called "literate programming", where you keep your code and your description of it in the same document. You can then publish this document, since it contains everything someone needs to rerun your analysis.

This concept is not tied to any tool or language, which gives you a lot of flexibility. Jupyter Notebook is widespread, especially if you code in Python or R (at least those are the languages I commonly use with Jupyter notebooks). I would also suggest taking a look at org-babel, a module of org-mode (available through Emacs) that is highly extensible and flexible enough to be used with virtually any programming language. There is also an introductory video on how to use org-babel.

Of course, there are also other approaches to reproducible results, such as packaging all the binaries, software, and code that are used. But that is not my area of expertise.

mschwarzmeier · Answer 2 · 2024-03-07T09:43:24+0000

As lukascbossert already pointed out, one aspect of reproducibility is understanding what you have done.

First, let me state:
When trying to reproduce something, you should aim to fully automate your workflow.

Second:
Stick to the KISS principle.

Some questions arise:

What is your "workflow"?
What kind of results do you want to reproduce?
- Do the software versions play a meaningful role?
- Does hardware play a role?
Do you have special requirements in terms of programs, packages, CPU/GPU power, or very large amounts of data?
What other tools are already in place? Is your workflow already scripted, or can it be scripted, with a stable tool like a bash shell script (KISS)?
What is the desired level and ease of reproducibility?
- Bitwise reproducibility is nearly(?) impossible.
How much time and effort are you willing to invest?
Are you aiming to reproduce results far in the future (e.g. in 10 years)?
Do you use third-party or closed-source software? Do you need licenses (and might you need them in the future)? Will you always be able to go back to a specific software version?
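To make the questions about software versions and hardware a bit more concrete, here is a minimal sketch of how you could record them next to your results. It uses only the Python standard library. The function name `environment_snapshot` and the package names `numpy` and `pandas` are placeholders I made up for illustration; list whatever your workflow actually depends on.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(packages):
    """Collect interpreter, OS/hardware and package versions as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    # "numpy" and "pandas" are placeholder names, not a recommendation.
    snapshot = environment_snapshot(["numpy", "pandas"])
    print(json.dumps(snapshot, indent=2, sort_keys=True))
```

Storing this output together with each set of results at least tells you later which versions produced them, even if you cannot restore that exact environment.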

One way to approach your problem (based on some assumptions on the answers to the previous questions) would be:

Automate your workflow.
Build a container (e.g. Docker/Apptainer) with your workflow in it.
Test that it works, using GitLab/GitHub CI or another machine.
Archive and link to each other:
1. your (documented) code,
2. results & intermediary results,
3. your image file (Apptainer), or the image uploaded to Docker Hub,
4. (related scientific paper)
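For the archiving step, one simple way to link code, results and image file is a checksum manifest that you store alongside them. The following is only a sketch under my own assumptions: the helper names `sha256_of`, `build_manifest` and `write_manifest` are made up, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file below root (as a relative POSIX path) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def write_manifest(root, manifest_path):
    """Write the manifest as sorted JSON and return it."""
    manifest = build_manifest(root)
    text = json.dumps(manifest, indent=2, sort_keys=True) + "\n"
    Path(manifest_path).write_text(text, encoding="utf-8")
    return manifest
```

Write the manifest outside the directory you are hashing, otherwise a later run will include the manifest file itself. After rerunning the workflow, calling `build_manifest` again and comparing the two dictionaries shows which files differ. Keep in mind that differing checksums do not necessarily mean a failed reproduction, since bitwise identical output is hard to achieve.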

Possibly you could also take a look at dedicated workflow tools like Snakemake, but I am no expert in those. My group does not use it, because it is yet another tool and not needed in our work -> KISS.
