On Provenance Reports in Scientific Workflows
One of the more important tasks for a scientific workflow is to keep track of so-called “provenance information” about its data outputs: information about how each data file was created. This is important so that other researchers can easily replicate the study (re-run it with the same software and tools). It should also help anyone wanting to reproduce it (re-run the same study design, possibly with other software and tools).
In workflows in cheminformatics, and especially bioinformatics, these days, such provenance information typically contains a list of the exact shell commands executed to create all the intermediate output files, and possibly extra metadata in a more structured form, such as parameters, execution times, and the like. Optimally, we would also like to include the exact versions of the software used.
How do other workflow tools do this?
The above seems like a pretty straightforward recipe for how to create audit or provenance reports for workflow runs. Still, in our experience, there is not that much consensus about how to do it in the most used tools. I’ve been looking hard for prior art here, and from what I understand from googling and some experimentation (please correct any misconceptions here!):
- I understand Galaxy stores quite a lot of metadata about workflow runs internally, which is viewable via jobs in the UI, but I’m not so sure about any specific reporting facility.
- I know bpipe has some type of HTML report,
- I’m unsure about the situation in NextFlow (I don’t remember seeing an explicit reporting function) and Snakemake.
Pardon my hand-waving uncertainty here, but again, my point is just that there doesn’t seem to be much consensus about it yet. Also, in the cases where a report is provided, it seems more likely to be a formatted text or HTML document than a structured data document, or an executable one.
Now, it should be said that the workflow code itself is a very accurate and executable documentation of all its outputs, if the outputs can be linked to a specific version of the code (a git commit). Still, linking workflow outputs to specific git versions seems to come with its own problems, such as finding the right repository, making sure that all dependencies are also equally well versioned and available, and so on, so I’m not sure that is the ultimate solution.
We can do better
Thus, I think there is room for improvement. Below I go into some detail about how we do this in SciPipe, hoping to spur some brainstorming on the topic.
How SciPipe does it
SciPipe is designed quite heavily around letting you customize, completely to your liking, how to name both intermediate and final outputs of the workflow, either with helper methods or by providing your own anonymous function to generate the path. This makes it manageable to browse through intermediate output files, for manual sanity checking for example.
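To give a feel for what this looks like, here is a minimal sketch in the spirit of the SciPipe examples. The method names used here (NewProc, SetOut, SetOutFunc, InPath) reflect my reading of the SciPipe docs and may differ between versions, so treat them as assumptions rather than a definitive API reference.

package main

import (
	sp "github.com/scipipe/scipipe"
)

func main() {
	wf := sp.NewWorkflow("path_naming_example", 4)

	// A process whose output path is set with a plain pattern (helper-style).
	foo := wf.NewProc("write_foo", "echo foo > {o:out}")
	foo.SetOut("out", "foo.txt")

	// A process whose output path is computed by our own anonymous function,
	// here simply appending a suffix to the input path.
	bar := wf.NewProc("foo_to_bar", "sed 's/foo/bar/' {i:in} > {o:out}")
	bar.SetOutFunc("out", func(t *sp.Task) string {
		return t.InPath("in") + ".bar.txt"
	})
	bar.In("in").From(foo.Out("out"))

	wf.Run()
}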
Thus, wouldn’t it be great if we could have the provenance information handy together with the workflow output files? That is exactly what SciPipe does. For every output in the workflow, it creates a corresponding file named like the output file but with the extension “.audit.json” added (so, for example, workflow_output.tsv gets a workflow_output.tsv.audit.json next to it). This makes it really easy to find the exact commands used to produce an output, should one find anything that looks funny. This has already been very helpful for debugging subtle errors in our workflows.
The SciPipe audit log format
Now to the format of the file.
Some workflow tools provide a linear representation of the list of commands that were run in a workflow run. It turns out, though, that for every output in a workflow, one actually has a tree structure of upstream commands that were run, and for each of the inputs of that command, further upstream commands that were run to produce them, with their own inputs and corresponding upstream commands, and so on, until you get to your raw data acquisition scripts.
Let’s illustrate the data structure with a simple HTML list, for the file workflow_output.tsv:
- Command: my-nice-command -i input_data.tsv,another_input.tsv > workflow_output.tsv
- Inputs:
  - input_data.tsv
    - Command: curl -O http://some-longurl.com/datasets/input_data.tsv
  - another_input.tsv
    - Command: sed 's/typo/correct-word/' rawdata.tsv > another_input.tsv
    - Inputs:
      - rawdata.tsv
        - Command: curl -O ftp://some-archive.org/rawdata.tsv
If we look at “another_input.tsv” here, we see how there can actually be an indefinitely large tree of upstream commands that produced a certain input. In other words, this is far from a linear list of commands. This is exactly what SciPipe stores for every file. The concrete data serialization format used is JSON, and to give you a sense of how this looks in practice, this is a real-world audit file from one of our workflows (only slightly cleaned of tabs and newlines):
{
  "Command": "java -jar ../../bin/cpsign-0.6.2.jar train --license ../../bin/cpsign.lic --cptype 1 --modelfile dat/slc6a4/slc6a4.tsv.precomp --labels A, N --impl liblinear --nr-models 10 --cost 1 --model-out dat/final_models/slc6a4_liblin_c1_nrmdl10.mdl.tmp --model-name \"SLC6A4 target profile\" # Efficiency: 0.183",
  "Params": {
    "cost": "1",
    "efficiency": "0.183",
    "gene": "SLC6A4",
    "nrmdl": "10"
  },
  "ExecTimeMS": 26828,
  "Upstream": {
    "dat/slc6a4/slc6a4.tsv.precomp": {
      "Command": "java -jar ../../bin/cpsign-0.6.2.jar precompute --license ../../bin/cpsign.lic --cptype 1 --trainfile dat/slc6a4/slc6a4.tsv --labels A, N --model-out dat/slc6a4/slc6a4.tsv.precomp.tmp --model-name \"SLC6A4 target profile\"",
      "Params": {},
      "ExecTimeMS": 22637,
      "Upstream": {
        "dat/slc6a4/slc6a4.tsv": {
          "Command": "awk -F\"\" '$9 == \"SLC6A4\" { print $12\"\"$4 }' ../../raw/pubchem.chembl.dataset4publication_inchi_smiles.tsv \u003e dat/slc6a4/slc6a4.tsv.tmp",
          "Params": {},
          "ExecTimeMS": 2437035,
          "Upstream": {
            "../../raw/pubchem.chembl.dataset4publication_inchi_smiles.tsv": {
              "Command": "xzcat dat/pubchem.chembl.dataset4publication_inchi_smiles.tsv.xz \u003e dat/pubchem.chembl.dataset4publication_inchi_smiles.tsv.tmp",
              "Params": {},
              "ExecTimeMS": 0,
              "Upstream": null
            }
          }
        }
      }
    }
  }
}
Now, based on this structured information, we actually have all the information needed to easily generate:
- A bash script to re-run everything from scratch (see the sketch after this list)
- An HTML/PDF/Markdown report
- Anything else you can come up with?
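To make the first point concrete, here is a minimal, hypothetical sketch of a small Go program that reads an .audit.json file and prints its commands as a bash script, children before parents. The AuditInfo field names are taken from the example file above; this is a standalone illustration, not part of SciPipe itself.

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// AuditInfo mirrors the fields visible in the example audit file above.
type AuditInfo struct {
	Command    string
	Params     map[string]string
	ExecTimeMS int64
	Upstream   map[string]*AuditInfo
}

// printCommands walks the upstream tree depth-first, so that the commands
// producing a file's inputs are printed before the command consuming them.
func printCommands(a *AuditInfo) {
	for _, upstream := range a.Upstream {
		printCommands(upstream)
	}
	fmt.Println(a.Command)
}

func main() {
	// Usage: go run audit2bash.go workflow_output.tsv.audit.json > rerun.sh
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	audit := &AuditInfo{}
	if err := json.Unmarshal(data, audit); err != nil {
		panic(err)
	}
	fmt.Println("#!/bin/bash")
	printCommands(audit)
}

Note that if two branches of the tree share a common upstream command, a real implementation would want to deduplicate commands before writing the script.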
Another nice aspect of storing this info as structured data together with the main workflow output is that it can easily be used as input in the workflow itself. This has already been a life-saver for us in some tricky workflow constructs.
It also highlights a fact that it took us a while to realize:
application logs != provenance logs
Application logs tend to contain much more low-level technical information, and also tend to be so unstructured and unreliably stored that they are hardly useful as input for the workflow itself.
Known limitations
As this is work in progress, there are some things we haven’t gotten to yet, such as software versioning. We still need to find a robust and universal method for storing that.
What are your thoughts? Our thinking right now is that the optimal solution will always be to include the installation or building of the software itself from source in the workflow specification, but in the meantime, perhaps just providing a way in the workflow to specify how to run a command that gets the version string from the tool would suffice?
Something like:
myProcess := workflow.NewProc("my_process", "awk '{ print }' ..." ...)
myProcess.SetVersionCommand("awk --version")
?
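To sketch what such a version command could do behind the scenes (this is pure speculation around the idea above, not existing SciPipe functionality), one could simply run the command once and keep, say, the first line of its output for the audit log:

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// captureVersion runs a user-supplied version command (e.g. "awk --version")
// and returns the first line of its output, which typically contains the
// version string. A speculative sketch, not part of SciPipe.
func captureVersion(versionCmd string) (string, error) {
	out, err := exec.Command("bash", "-c", versionCmd).CombinedOutput()
	if err != nil {
		return "", err
	}
	firstLine := strings.SplitN(string(out), "\n", 2)[0]
	return strings.TrimSpace(firstLine), nil
}

func main() {
	version, err := captureVersion("awk --version")
	if err != nil {
		panic(err)
	}
	fmt.Println("Tool version:", version)
}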
Guess we’ll have to try and see. Comments welcome!
- Update: See some discussion in this reddit thread
- Edit, Oct 19, 13:50 CET: Updated wording to make reading more fluent