Living Systems_

Galaxy

Wanted: Dynamic workflow scheduling

Photo credits: Matthew Smith / Unsplash In our work on automating machine learning computations in cheminformatics with scientific workflow tools , we have came to realize something; Dynamic scheduling in scientific workflow tools is very important and sometimes badly needed. What I mean is that new tasks should be able to be scheduled during the execution of a workflow, not just in its scheduling phase. What is striking is that far from all workflow tools allow this.

Workflow tool makers: Allow defining data flow, not just task dependencies

Upsurge in workflow tools There seem to be a little upsurge in light-weight - often python-based - workflow tools for data pipelines in the last couple of years: Spotify’s Luigi , OpenStack’s Mistral , Pinterest’s Pinball , and recently AirBnb’s Airflow , to name a few. These are all interesting tools, and it is an interesting trend for us at pharmbio , who try to see how we can use workflow tools to automate bio- and cheminformatics tasks on compute clusters.

(E)BNF parser for parts of the Galaxy ToolConfig syntax with ANTLR

As blogged earlier, I’m currently into parsing the syntax of some definitions for the parameters and stuff of command line tools. As said in the linked blog post, I was pondering whether to use the Galaxy Toolconfig format or the DocBook CmdSynopsis format . It turned out though Well, that cmdsynopsis lacks the option to specify a list of valid choices, for a parameter, as is possible in the Galaxy ToolConfig format (see here ), and thus can be used to generate drop-down lists in wizards etc.

Partial Galaxy ToolConfig to DocBook CmdSynopsis conversion with XSLT RegEx

<tool id="sam_to_bam" name="SAM-to-BAM" version="1.1.1"> <description>converts SAM format to BAM format</description> <requirements> <requirement type="package">samtools</requirement> </requirements> <command interpreter="python"> sam_to_bam.py --input1=$source.input1 --dbkey=${input1.metadata.dbkey} #if $source.index_source == "history": --ref_file=$source.ref_file #else --ref_file="None" #end if --output1=$output1 --index_dir=${GALAXY_DATA_INDEX_DIR} </command> <inputs> <conditional name="source"> <param name="index_source" type="select" label="Choose the source for the reference list"> <option value="cached">Locally cached</option> <option value="history">History</option> </param> <when value="cached"> <param name="input1" type="data" format="sam" label="SAM File to Convert"> <validator type="unspecified_build" /> <validator type="dataset_metadata_in_file" filename="sam_fa_indices.loc" metadata_name="dbkey" metadata_column="1" message="Sequences are not currently available for the specified build.