Living Systems_

Posts

Rewrite of SciCommander in Go with much improved algorithm

When I presented a poster about SciCommander at the Swedish Bioinformatics Workshop last year, I got a lot of awesome feedback from some great people, including Fredrik Boulund, Johannes Alneberg and others whose names I have unfortunately lost (please shout out if you read this!). (For those new to SciCommander: it is my attempt at creating a tool that can track complete provenance reports also for ad-hoc shell commands, not just those included in a pipeline.)

A few notes from the Applied Hologenomics Conference 2024

I’m just back from the Applied Hologenomics Conference 2024 (see also #AHC2024 on Twitter) in Copenhagen and thought I’d reflect a little on the conference and highlight the bits that particularly stuck with me. The first thing I want to say is that a paradigm shift is happening here: a step away from the reductionist view of the past, going beyond the systems biology approach that has been establishing itself during the last 10-20 years.

We need recipes for common bioinformatics tasks

Ad-hoc tasks in bioinformatics can involve an immense number of operations that need to be performed to achieve a certain goal. Often these are all individually regarded as rather “standard” or “routine”. Despite this, it is quite hard to find an authoritative set of “recipes” for how to do such tasks. Thus I was starting to think that there needs to be a collection of bioinformatics “recipes”: a sort of “cookbook” for common bioinformatics tasks.

Why didn't Go get a breakthrough in bioinformatics (yet)?

As we are - according to some expert opinions - living in the Century of Biology, I found it interesting to reflect on Go’s usage within the field. Go has some great features that make it really well suited for biology, such as: being a relatively simple language that can be learned in a short time, even by people without a CS background - a super important aspect for biologists - and fantastic support for cross-compilation to all major computer architectures and operating systems, as static, self-sufficient executables. This makes it extremely simple to deploy tools, something that can’t be said about the currently most popular bio language, Python.

SciPipe used at NASA Glenn Research Center

I was happy to see the publication finally go online of the work done at NASA Glenn Research Center, where SciPipe has been used to process and track provenance of the analyses: “Modeling the impact of thoracic pressure on intracranial pressure”. I’ve known the work existed for a couple of years, after getting some extraordinarily useful contributions from Drayton fixing bugs I’m not sure I’d ever have found otherwise, but it is cool to now also see it published!

Debugging inside Jinja templates using pdb/ipdb

I’m working on a static reporting tool using the Jinja2 templating engine for Python. I was trying to figure out a way to step into the Jinja templating code with the pdb/ipdb command-line debugger. I tried creating an .ipdbrc file in my local directory with the line: path/to/template.html:<lineno> … but that didn’t work. What worked was to find the line that says: return self.environment.concat(self.root_render_func(ctx)) … inside the Jinja codebase, and put a breakpoint on that (which for me was on line 1299, but might vary depending on version):

SciCommander - track provenance of any shell command

I haven’t written much about a new tool I’ve been working on in some extra time: SciCommander. I just presented a poster about it at the Swedish Bioinformatics Workshop 2023, so perhaps let me first present you the poster instead of re-iterating what it is (click to view large version): New version not requiring running the scicmd command: I got a lot of great feedback from numerous people at the conference, most of whom pointed out that it would be great if one could start SciCommander as a kind of subshell, inside which one can run commands as usual, instead of running them via the scicmd -c command.

Troubleshooting Nextflow pipelines

We evaluated Nextflow before in my work at pharmb.io, but that was before DSL2 and the support for re-usable modules (which was one reason we needed to develop our own tools to support our challenges, as explained in the paper). Thus, there’s definitely some stuff to get into. Based on my years in bioinformatics and data science, I’ve seen that the number one skill you need to develop is being able to effectively troubleshoot things, because things will invariably fail in all kinds of ways.

Random notes from installing Debian 11 with separate accounts for work and private

See especially the end for info about how to set up a nice integration between the work and private accounts, such that one can e.g. occasionally start the mail client or web browser of the private account from the work one, etc. Caveats when installing Debian 11: Make sure that an EFI partition is created (when I manually modified the partition table I accidentally deleted it, and had to reinstall to get it created properly again).

Installing Qubes OS

I just switched to Qubes OS as the operating system on my main work laptop (a Dell Latitude). In fact, one of the reasons was to be able to combine work and private hobby coding projects, which have increasingly been happening on the same machine. Anyways, these are my experiences and notes, as a way to document caveats and quirks in case I need to do this again, while possibly also being of use to others.

Composability in functional and flow-based programming

An area where I’m not so happy with some things I’ve seen in FP is composability. In my view, a well designed system or language should make functions (or other smallest units of computation) more easily composable, not less. What strikes me as one of the biggest elephants in the room regarding FP is that typical functions compose fantastically as long as you are working with a single input argument and a single output for each function application. As soon as you start taking multiple input arguments and returning multiple outputs, though, you tend to end up with very messy trees of function application.
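To make the contrast concrete, here is a minimal Go sketch (my own illustration, not code from the post; requires Go 1.18+ for generics): with one input and one output, function composition is a one-liner, while multiple return values immediately force manual plumbing.

    package main

    import "fmt"

    // compose chains two single-argument functions into one.
    func compose[A, B, C any](f func(A) B, g func(B) C) func(A) C {
        return func(a A) C { return g(f(a)) }
    }

    func main() {
        double := func(x int) int { return x * 2 }
        show := func(x int) string { return fmt.Sprintf("result: %d", x) }

        // Single input, single output: composition is trivial.
        pipeline := compose(double, show)
        fmt.Println(pipeline(21)) // prints "result: 42"

        // Multiple outputs break the chain: manual plumbing instead.
        divmod := func(a, b int) (int, int) { return a / b, a % b }
        q, r := divmod(7, 2)
        fmt.Println(show(q), r)
    }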

Crystal: Go-like concurrency with easier syntax

I have been playing around a lot with concurrency in Go over the years, resulting in libraries such as SciPipe, FlowBase and rdf2smw. My main motivation for looking into Go has been the possibility of using it as a more performant, scalable and type-safe alternative to Python for data-heavy scripting tasks in bioinformatics and other fields I’ve been dabbling in, especially as it makes it so easy to write concurrent and parallel code.

Viewing Go test coverage in the browser with one command

Go has some really nice tools for running tests and analyzing code. One of these functionalities is that you can generate coverage information when running tests, which can later be viewed in a browser using the go tool cover command. Since doing this requires executing multiple commands one after another, it can be hard to remember the exact incantations. To this end, I created a bash alias that does everything in one command, gocov.

Creating a static copy of a Drupal, Wordpress or other CMS website

wget -P . -mpck --html-extension -e robots=off --wait 0.5 <URL>

To understand the flags, you can check man wget of course, but some explanations follow here:
-P - Tell where to store the site
-m - Create a mirror
-p - Download all the required files (.css, .js) needed to properly render the page
-c - Continue getting partially downloaded files
-k - Convert links to enable local viewing
--html-extension - Add the .

Basic PUB/SUB connection with ZeroMQ in Python

ZeroMQ is a great way to quickly and simply send messages between multiple programs running on the same or different computers. It is very simple and robust, since it doesn’t need any central server; instead it talks directly between the programs through sockets, TCP connections or similar. ZeroMQ has client libraries for basically all commonly used programming languages, but when testing that a connection works between e.g. two different machines, it might be good to keep things simple and test just the connection, as simply as possible.

Table-driven tests in C#

Folks in the Go community have championed so-called table-driven tests (see e.g. this post by Dave Cheney and the Go wiki) as a way to quickly and easily write up a bunch of complete test cases, with inputs and corresponding expected outputs, and loop over them to execute the function being tested. In short, the idea is to provide a maximally short and convenient syntax for doing this.
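For reference, a minimal sketch of what the pattern looks like in Go (my own example in the style of the linked posts; the Fib function is made up to have something to test):

    package fib

    import "testing"

    // Fib is a toy function to have something to test.
    func Fib(n int) int {
        if n < 2 {
            return n
        }
        return Fib(n-1) + Fib(n-2)
    }

    func TestFib(t *testing.T) {
        // The "table": each row is a complete test case.
        cases := []struct {
            in, want int
        }{
            {0, 0},
            {1, 1},
            {2, 1},
            {7, 13},
        }
        for _, c := range cases {
            if got := Fib(c.in); got != c.want {
                t.Errorf("Fib(%d) == %d, want %d", c.in, got, c.want)
            }
        }
    }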

SciPipe paper published in GigaScience

We just wanted to share that the paper on our Go-based workflow library, SciPipe, has now been published in GigaScience. Abstract: Background: The complex nature of biological data has driven the development of specialized software tools. Scientific workflow management systems simplify the assembly of such tools into pipelines, assist with job automation, and aid reproducibility of analyses. Many contemporary workflow tools are specialized or not designed for highly complex workflows, such as with nested loops, dynamic scheduling, and parametrization, which is common in, e.

Structured Go-routines or framework-less Flow-Based Programming in Go

I was so happy the other day to find someone else who had discovered the great benefits of a little pattern for how to structure pipeline-heavy programs in Go, which I have described in a few posts before. I have been surprised not to find more people using this kind of pattern, which has been so extremely helpful to us, so I thought to take this opportunity to re-iterate it, in the hope that more people might become aware of it.
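In brief, and as a hedged sketch only (the earlier posts contain the full pattern): each long-running component is a struct with channel “ports”, a Run() method that drains its in-port and closes its out-port, and a main function that wires the network together.

    package main

    import (
        "fmt"
        "strings"
    )

    // Upper is one "process" in the network, with channel ports.
    type Upper struct {
        In  chan string
        Out chan string
    }

    func NewUpper() *Upper {
        return &Upper{In: make(chan string, 16), Out: make(chan string, 16)}
    }

    // Run consumes the in-port and closes the out-port when done,
    // so downstream processes know when to shut down.
    func (p *Upper) Run() {
        defer close(p.Out)
        for s := range p.In {
            p.Out <- strings.ToUpper(s)
        }
    }

    func main() {
        up := NewUpper()
        go up.Run()
        go func() {
            defer close(up.In)
            for _, s := range []string{"structured", "go-routines"} {
                up.In <- s
            }
        }()
        for s := range up.Out {
            fmt.Println(s)
        }
    }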

Setting up a reasonable and light-weight Linux-like (non-WSL) terminal environment on Windows

What I was looking for was a no-fuss, lightweight, robust and as-simple-as-possible solution for running my normal Bash-based workflow inside the main Windows filesystem, interacting with the Windows world. It turns out there are some solutions. Read on for more info on that. Windows Subsystem for Linux too heavy: First, I must mention the impressive work by Microsoft on the Windows Subsystem for Linux (aka WSL), which more or less lets you run an almost full-blown installation of popular Linux distros like Ubuntu and Fedora.

Linked Data Science - For improved understandability of computer-aided research

This is an excerpt from the “future outlook” section of my thesis, titled “Reproducible Data Analysis in Drug Discovery with Scientific Workflows and the Semantic Web” (click for the open access full text), which aims to provide various putative ways towards improved reproducibility, understandability and verifiability of computer-aided research. Historically, something of a divide has developed between the metadata-rich datasets and approaches in the world of Semantic Web/Ontologies/Linked Data, and the Big Data field, which has - at least initially - been mostly focused on large unstructured datasets.

Preprint on SciPipe - Go-based scientific workflow library

A pre-print for our Go-based workflow library SciPipe is out, with the title “SciPipe - A workflow library for agile development of complex and dynamic bioinformatics pipelines”, co-authored by me and colleagues at pharmb.io: Martin Dahlö, Jonathan Alvarsson and Ola Spjuth. Access it here. It has been more than three years since the first commit in the SciPipe Git repository in March 2015, and development has been going on with varying degrees of intensity during these years, often alongside other duties at pharmb.

Make your commandline tool workflow friendly

Update (May 2019): A paper incorporating the below considerations has been published: Björn A. Grüning, Samuel Lampa, Marc Vaudel, Daniel Blankenberg, “Software engineering for scientific big data analysis”, GigaScience, Volume 8, Issue 5, May 2019, giz054, https://doi.org/10.1093/gigascience/giz054 There are a number of pitfalls that can make a command-line program really hard to integrate into a workflow (or “pipeline”) framework. The reason is that many workflow tools use output file paths to keep track of the state of the tasks producing these files.
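One classic pitfall in that category is a tool writing directly to its final output path, so that a crash leaves a half-written file which the workflow tool mistakes for a finished output. A hedged Go sketch of the usual remedy (file names made up for illustration; this shows one such consideration, not a summary of the paper):

    package main

    import (
        "log"
        "os"
    )

    func main() {
        outPath := "result.txt" // hypothetical final output path
        tmpPath := outPath + ".tmp"

        // Write all output to a temporary path first ...
        if err := os.WriteFile(tmpPath, []byte("all done\n"), 0644); err != nil {
            log.Fatal(err)
        }
        // ... and move it into place only when complete, so a workflow
        // engine watching for result.txt never sees a partial file.
        if err := os.Rename(tmpPath, outPath); err != nil {
            log.Fatal(err)
        }
    }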

To make computational lab note-taking happen, make the journal into a todo-list (a "Todournal")

Good lab note-taking is hard. Good note-taking is, in my opinion, as important for computational research as for wet lab research. For computational research it is much easier to forget doing it, though, since you might not have a physical notebook lying on your desk staring at you, but rather might need to open a specific software or file to write the notes. I think this is one reason why lab note-taking seems to happen a lot less among computational scientists than among their wet lab counterparts.

Semantic Web ❤ Data Science? My talk at Linked Data Sweden 2018

During the last months, I have had the pleasure of working together with Matthias Palmér (MetaSolutions AB) and Fernanda Dórea (National Veterinary Institute) to prepare for and organize this year’s edition of the annual Linked Data Sweden event, which was held in Uppsala, hosted by the SciLifeLab Data Centre. Thanks to engaged speakers and attendees, it turned into an interesting day with great discussions, new contacts, and a lot of new impressions and insights.

Parsing DrugBank XML (or any large XML file) in streaming mode in Go

I had a problem in which I thought I needed to parse the full DrugBank dataset, which comes as a (670MB) XML file (for open access papers describing DrugBank, see: [1], [2], [3] and [4]). It turned out that what I needed was available as CSV files under “Structure External Links”. There are probably still other uses for this approach though, as the XML version of DrugBank seems to contain a lot more information in a single format.
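The gist of the streaming approach is to use encoding/xml’s Decoder to pull one token at a time and decode each interesting element separately, instead of unmarshalling the whole 670MB document at once. A minimal sketch (element and field names are illustrative, not the real DrugBank schema):

    package main

    import (
        "encoding/xml"
        "fmt"
        "log"
        "os"
    )

    // Drug maps just the fields we care about.
    type Drug struct {
        Name string `xml:"name"`
    }

    func main() {
        f, err := os.Open("drugbank.xml") // hypothetical file name
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        dec := xml.NewDecoder(f)
        for {
            tok, err := dec.Token()
            if err != nil {
                break // io.EOF when the document ends
            }
            // Decode one <drug> element at a time, keeping memory use flat.
            if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "drug" {
                var d Drug
                if err := dec.DecodeElement(&d, &se); err != nil {
                    log.Fatal(err)
                }
                fmt.Println(d.Name)
            }
        }
    }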

Equation-centric dataflow programming in Go

Mathematical notation and dataflow programming: Even though computations done on computers are very often based on some type of math, it is striking that the notation used in math to express equations and relations is not always readily converted into programming code. Outside of purely symbolic programming languages like Sage Math or the (proprietary) Wolfram language, there always seems to be quite a divide between the mathematical notation and the numerical implementation.
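As a taste of the idea (a sketch of my own, assuming nothing about the concrete syntax explored in the post): a single equation such as y = kx + m can be wrapped as a dataflow process that maps a stream of x values to a stream of y values.

    package main

    import "fmt"

    // linEq streams y = k*x + m for every x received.
    func linEq(k, m float64, xs <-chan float64) <-chan float64 {
        ys := make(chan float64)
        go func() {
            defer close(ys)
            for x := range xs {
                ys <- k*x + m
            }
        }()
        return ys
    }

    func main() {
        xs := make(chan float64)
        go func() {
            defer close(xs)
            for x := 0.0; x < 5; x++ {
                xs <- x
            }
        }()
        for y := range linEq(2, 1, xs) {
            fmt.Println(y)
        }
    }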

What is a scientific (batch) workflow?

Workflows and DAGs - confusion about the concepts: Jörgen Brandt tweeted a comment that got me thinking again on something I’ve pondered a lot lately: “A workflow is a DAG.” is really a weak definition. That’s like saying “A love letter is a sequence of characters.” representation ≠ meaning – @joergenbr Jörgen makes a good point. A Directed Acyclic Graph (DAG) does not by any means capture the full semantic content of a computational workflow.

Go is growing in bioinformatics workflow tools

TL;DR: We wrote a post on gopherdata.io about the growing ecosystem of Go-based workflow tools in bioinformatics. Go read it here. It is interesting to note how Google’s Go programming language seems to be increasing in popularity in bioinformatics. Just to give a sample of some of the Go-based bioinformatics tools I’ve stumbled upon: there is, since a few years back, the biogo library, providing common functionality for bioinformatics tasks.

The frustrating state of note taking tools

One year left to the dissertation (we hope), and I am now turning from mostly software development to more data analysis, needing to read through quite a pile of books and papers on my actual topic, pharmaceutical bioinformatics. With this background, I feel forced to ponder ways of improving my note-taking workflow. I’m already quite happy with the way of taking notes I’ve settled on: using a lot of drawings and often iterating over the same notes multiple times to ask questions, fill in details, and figure out connections.

Learning how to learn

I’m reading A Mind for Numbers, by Barbara Oakley. Firstly, it is a very interesting book, but the main lesson I’ve learned from it so far seems so paramount that I have to write it down, so I don’t forget it (some meta-connotations in that statement ;) ). I found the book through Barbara’s Coursera course “Learning How to Learn”, and it seems to me that learning in general is the topic of the book too, more than numbers specifically - but I still have to read it through, so stay tuned.

On Provenance Reports in Scientific Workflows

One of the more important tasks for a scientific workflow is to keep track of so-called “provenance information” about its data outputs - information about how each data file was created. This is important so that other researchers can easily replicate the study (re-run it with the same software and tools). It should also help anyone wanting to reproduce it (re-run the same study design, possibly with other software and tools).

(Almost) ranging over multiple Go channels simultaneously

Thus, optimally, one would want to use Go’s handy range keyword for looping over multiple channels, since range takes care of ending the for-loop at the right time (when the inbound channel is closed). So, something like this (N.B: non-working code!): for a, b, c := range chA, chB, chC { doSomething(a, b, c) } Unfortunately this is not possible, and probably for good reason (how would it know whether to end the loop when the first, or all, of the channels are closed?
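A working alternative is to range over one channel and receive from the others in lock-step, stopping as soon as any channel is closed (a sketch of one possible workaround, not necessarily the one the post settles on):

    package main

    import "fmt"

    func main() {
        chA, chB, chC := make(chan int), make(chan int), make(chan int)
        go func() {
            for i := 0; i < 3; i++ {
                chA <- i
            }
            close(chA)
        }()
        go func() {
            for i := 0; i < 3; i++ {
                chB <- i * 10
            }
            close(chB)
        }()
        go func() {
            for i := 0; i < 3; i++ {
                chC <- i * 100
            }
            close(chC)
        }()

        // Range over the first channel, read the others in lock-step.
        for a := range chA {
            b, okB := <-chB
            c, okC := <-chC
            if !okB || !okC {
                break
            }
            fmt.Println(a, b, c)
        }
    }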

First production run with SciPipe - A Go-based scientific workflow tool

Today marked the day when we ran the very first production workflow with SciPipe, the Go-based scientific workflow tool we’ve been working on over the last couple of years. Yay! :) This is how it looked (no fancy GUI or such yet, sorry): The first result we got in this very first job was a list of counts of ligands (chemical compounds) in the ExcapeDB dataset (download here) interacting with the 44 protein/gene targets identified by Bowes et al. as a good baseline set for identifying hazardous side-effects in the body (that is, any chemical compound binding these proteins will never become an approved drug).
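For readers curious what SciPipe code looks like, here is a hello-world-style sketch based on my reading of the current SciPipe docs (so treat the exact API as an assumption; the production workflow was of course much larger):

    package main

    import sp "github.com/scipipe/scipipe"

    func main() {
        // A workflow running at most 4 tasks in parallel.
        wf := sp.NewWorkflow("hello_wf", 4)

        // A process shelling out to echo; {o:out} declares an out-port.
        hello := wf.NewProc("hello", "echo 'Hello SciPipe!' > {o:out}")
        hello.SetOut("out", "hello.txt")

        wf.Run()
    }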

Compiling RDFHDT C++ tools on UPPMAX (RHEL/CentOS 7)

At pharmb.io we are researching how to use semantic technologies to push the boundaries of what can be done with intelligent data processing, often of large datasets (see e.g. our paper on linking RDF to cheminformatics and proteomics, and our work on the RDFIO software suite). Thus, for us, RDFHDT opens new possibilities. We are heavy users of the UPPMAX HPC center for our computations, and so we need to have the HDT tools available there.

New paper on RDFIO for interoperable biomedical data management in Semantic MediaWiki

As my collaborator and M.Sc. supervisor Egon Willighagen already blogged, we just released a paper titled “RDFIO: extending Semantic MediaWiki for interoperable biomedical data management”, with use cases from Egon and Pekka Kohonen, coding help from Ali King, and project supervision from Denny Vrandečić, Roland Grafström and Ola Spjuth. See the picture below (from the paper) for an overview of all the newly developed functionality (drawn in black), as related to the previously existing functionality (drawn in grey):

Notes on launching kubernetes jobs from the Go API

This post is also published on Medium. My current work at pharmb.io entails adding Kubernetes support to my light-weight Go-based scientific workflow engine, scipipe (Kubernetes, or k8s for short, is Google’s open source project for orchestrating container-based compute clusters). This should take scipipe from a simple “run it on your laptop” workflow system, with HPC support still in the works, to something that can power scientific workflows on any set of networked computers that can run Kubernetes, which is quite a few (AWS, GCE, Azure, your Raspberry Pi cluster, etc.).
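To give an idea of what launching a job through the Go API involves, here is a minimal sketch against a recent client-go (the client-go API has changed since this post was written, and the kubeconfig path, names and image below are placeholders):

    package main

    import (
        "context"
        "log"

        batchv1 "k8s.io/api/batch/v1"
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Build a clientset from a local kubeconfig (path assumed).
        config, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
        if err != nil {
            log.Fatal(err)
        }
        clientset, err := kubernetes.NewForConfig(config)
        if err != nil {
            log.Fatal(err)
        }

        // A one-container batch job.
        job := &batchv1.Job{
            ObjectMeta: metav1.ObjectMeta{Name: "example-job"},
            Spec: batchv1.JobSpec{
                Template: corev1.PodTemplateSpec{
                    Spec: corev1.PodSpec{
                        Containers: []corev1.Container{{
                            Name:    "worker",
                            Image:   "alpine",
                            Command: []string{"echo", "hello from the cluster"},
                        }},
                        RestartPolicy: corev1.RestartPolicyNever,
                    },
                },
            },
        }
        _, err = clientset.BatchV1().Jobs("default").Create(
            context.TODO(), job, metav1.CreateOptions{})
        if err != nil {
            log.Fatal(err)
        }
    }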

SMWCon Fall 2016 - My talk on large RDF imports

I was invited to give a talk at the Semantic MediaWiki (SMW) conference in Frankfurt last week, on our work on enabling import of RDF datasets into SMW. I have presented at SMWCon before as well (2011: blog, slides, video; 2013: slides), so it was nice to re-connect with some old friends, to get up to date on how SMW is developing, and to share about our own contributions.

Tutorial: Luigi for Scientific Workflows

This is a Luigi tutorial I held at the e-Infrastructures for Massively Parallel Sequencing workshop (video archive) at SciLifeLab Uppsala in January 2015, moved here for future reference. What is Luigi? Luigi is a batch workflow system written in Python, developed by Erik Bernhardsson and others at Spotify, where it is used to compute machine-learning powered music recommendation lists, top lists, etc. Luigi is one of the not-too-many batch workflow systems that support running both normal command line jobs and Hadoop jobs in the same system (in this tutorial, we will focus only on the command line part).

Combining the best of Go, D and Rust?

I’ve been following the development of D, Go and Rust (and also FreePascal for some use cases) for some years (and been into some benchmarking for bioinfo tasks), and now we finally have three (four, with fpc) stable, statically compiled languages with some momentum behind them, meaning they are all past 1.0. While I have gone with Go for current projects, I still have a hard time “totally falling in love” with any single one of these languages.

Time-boxing and a unified trello board = productivity

Figure: Sketchy screenshot of how my current board looks. Notice especially the “Now” stack, marked in yellow, where you are only allowed to put one single card. I used to have a very hard time getting an overview of my current work, and prioritizing and concentrating on any single task for very long. I always felt there might be something else more important than what I was currently doing.

The unexpected convenience of JSON on the commandline

I was working on a migration from Drupal to ProcessWire CMSes, where I wanted to be able to pipe data, including the body field with HTML formatting and all, through multiple processing steps in a flexible manner. I’d start with an extraction SQL query, pass through a few components to replace and massage the data, and finally hand over to an import command using ProcessWire’s wireshell tool. So, basically I needed a flexible format for structured data that could be sent as one “data object” per line, to work nicely with Linux command-line tools like grep, sed and awk.
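This “one data object per line” format is nowadays often called JSON Lines, and is trivial to produce from most languages. A Go sketch of the producing side (struct fields made up for illustration; the actual migration used different tooling):

    package main

    import (
        "encoding/json"
        "log"
        "os"
    )

    type Page struct {
        Title string `json:"title"`
        Body  string `json:"body"`
    }

    func main() {
        pages := []Page{
            {Title: "About", Body: "<p>Hello</p>"},
            {Title: "News", Body: "<p>World</p>"},
        }
        // Encode emits exactly one JSON object per line, so the output
        // pipes cleanly through grep, sed and awk.
        enc := json.NewEncoder(os.Stdout)
        for _, p := range pages {
            if err := enc.Encode(p); err != nil {
                log.Fatal(err)
            }
        }
    }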

The matrix transformation as a model for declarative atomic data flow operations

After just reading on Hacker News about Google’s newly released TensorFlow library, for deep learning based on tensors and data flow, I realized I had written in a draft post back in 2013: “What if one could have a fully declarative ‘matrix language’ in which all data transformations ever needed could be declaratively defined in a way that is very easy to comprehend?” … so, I thought this is a good time to post this draft, to see whether it spurs any further ideas.

Wanted: Dynamic workflow scheduling

Photo credits: Matthew Smith / Unsplash. In our work on automating machine learning computations in cheminformatics with scientific workflow tools, we have come to realize something: dynamic scheduling in scientific workflow tools is very important and sometimes badly needed. What I mean is that it should be possible to schedule new tasks during the execution of a workflow, not just in its scheduling phase. What is striking is that far from all workflow tools allow this.

How to be productive in vim in 30 minutes

I had heard a lot of people say that vim is very hard to learn, and got the impression that it would take a great investment to switch to using it. While I have come to understand that they are right in that there are a lot of things to invest in to get really great at using vim - an investment that will really pay back - I have also found out one thing that I see almost no-one mentioning:

How to compile vim for use with pyenv and vim-pyenv

This manifested itself in a bunch of error messages from the python module in vim, ending with: AttributeError: 'module' object has no attribute 'vars' I first thought it was an error in vim-pyenv and reported it (see that issue for more in-depth details). In summary, it turns out that older versions of vim indeed lack some attributes in their python module, so I figured I had to compile my own version. Below are my notes about how to do this, for future reference:

How I would like to write Go programs

Some time ago I got a post published on GopherAcademy, outlining in detail how I think a flow-based programming inspired syntax can strongly help to create clearer, easier-to-maintain, and more declarative Go programs. These ideas have since become clearer, and we (Ola Spjuth’s research group at pharmbio) have successfully used them to make the workflow syntax for Luigi (Spotify’s great workflow engine by Erik Bernhardsson & co) easier, as implemented in the SciLuigi helper library.

Terminator as a middle-way between floating and tiling window managers

I have tried hard to improve my Linux desktop productivity by learning to do as much as possible using keyboard shortcuts, aliases for terminal commands, etc. (I even produced an online course on Linux command-line productivity). In this spirit, I naturally tried out a so-called tiling window manager (aka tiling wm). In short, a tiling wm organizes all open windows on the screen (or on the current desktop) into a “tiled” grid of frames.

FBP inspired data flow syntax: The missing piece for the success of functional programming?

When I suggest that people have a look at Flow-based Programming (FBP) or Data Flow for one reason or another, they are often put off by the strong connection between these concepts and graphical programming - that is, the idea that programs will be easier to understand if expressed and developed in a visual notation. This is unfortunate, since I think this is in no way the core benefit of FBP or Data Flow, although it is a nice side-effect for those who prefer it.

A few thoughts on organizing computational (biology) projects

I read this excellent article with practical recommendations on how to organize a computational project, in terms of directory structure. Directory structure matters: The importance of a good directory structure seems to often be overlooked in teaching about computational biology, but it can be the difference between a successful project and one where every change or re-run of some part of a workflow requires days of manual fiddling to get hold of the right data, in the right format, in the right place, with the right version of the workflow, with the right parameters - and then succeeding to run it without errors.

Flow-based programming and Erlang style message passing - A Biology-inspired idea of how they fit together

I think Erlang/Elixir fits great as a control plane or service-to-service messaging layer for distributing services built with flow-based programming. I am just back from a one-day visit to the Erlang User Conference. I find the Erlang virtual machine fascinating, and with the new Elixir language built on top of it to fix some of the pain points with Erlang the language, the eco-system has become even more interesting. What I find exciting about Erlang/Elixir and its virtual machine is its ability to utilize multiple CPUs on a computer, and to do this across multiple computers, in what is commonly referred to as “distributed computing”.

A cheatsheet for the iRODS rule language

iRODS, the “integrated rule-oriented data system”, is a super cool system for managing datasets consisting of files, from smallish ones to really large ones counted in petabytes, possibly spanning multiple continents. There’s a lot to be said about iRODS (enough for another blog post), but the most interesting feature, in my opinion, is the rule language, which lets you define custom rules and policies for how data should be handled, totally automatically, depending on a lot of factors.

Workflow tool makers: Allow defining data flow, not just task dependencies

Upsurge in workflow tools: There seems to have been a little upsurge in light-weight - often Python-based - workflow tools for data pipelines in the last couple of years: Spotify’s Luigi, OpenStack’s Mistral, Pinterest’s Pinball, and recently AirBnb’s Airflow, to name a few. These are all interesting tools, and it is an interesting trend for us at pharmbio, as we try to see how we can use workflow tools to automate bio- and cheminformatics tasks on compute clusters.

Patterns for composable concurrent pipelines in Go

I realized I didn’t have a link to my blog post on Gopher Academy, on patterns for composable concurrent pipelines in Go(lang), so here it goes: blog.gopheracademy.com/composable-pipelines-pattern
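The core idea, in a hedged nutshell (see the linked article for the full pattern): give every stage the same channel-in/channel-out shape, and stages chain freely.

    package main

    import "fmt"

    // gen emits the integers [0, n) and closes its output.
    func gen(n int) <-chan int {
        out := make(chan int)
        go func() {
            defer close(out)
            for i := 0; i < n; i++ {
                out <- i
            }
        }()
        return out
    }

    // sq squares everything passing through; since it has the same
    // shape as gen's output, stages like this compose freely.
    func sq(in <-chan int) <-chan int {
        out := make(chan int)
        go func() {
            defer close(out)
            for v := range in {
                out <- v * v
            }
        }()
        return out
    }

    func main() {
        for v := range sq(sq(gen(4))) {
            fmt.Println(v) // 0, 1, 16, 81
        }
    }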

The role of simplicity in testing and automation

Disclaimer: Don’t take this too seriously … this is “thinking-in-progress” :) It just struck me the other minute how simplicity is the key theme behind two very important areas in software development that I’ve been dabbling with quite a bit recently: testing and automation. Have you thought about how testing, in its essence, is: wrapping complex code, which you can’t mentally comprehend completely, in simple code that you can mentally comprehend, at least one test at a time.

The problem with make for scientific workflows

The workflow problem solved once and for all in 1979? As soon as the topic of scientific workflows is brought up, there are always a few make fans fervently insisting that the problem of workflows was solved once and for all with GNU make, first written in the 70’s :) Personally, I haven’t been so sure. On the one hand, I know the tool solves a lot of problems for many people.

Dynamic Navigation for Higher Performance

Improving performance in Delphi Bold MDA applications by replacing navigation code with derived links in the model. This post on Model Driven Architecture in Delphi and Bold, by Rolf Lampa, has previously been published on howtodothings.com. Modeling class structures takes some thinking, and once the thinking and the drawing are done and you start using the model, you’ll spend awful amounts of code traversing links in order to retrieve trivial info from a given object structure.

NGS Bioinformatics Course Day 3: New Luigi helper tool, "real-world" NGS pipelines

It turned out I didn’t have the time and strength to blog every day at the NGS Bioinformatics Intro course, so here comes a wrap-up with some random notes and tidbits from the last days, including some concluding remarks! On these days we started working on a more realistic NGS pipeline, analysing re-sequencing samples (slides, tutorial). First, some outcome from this tutorial. What do I mean by “outcome”?

Random links from the Hadoop NGS Workshop

Some random links from the Hadoop for Next-Gen Sequencing workshop held at KTH in Kista, Stockholm in February 2015. UPDATE: Slides and videos now available!
Spark notebook
Scala notebook
ADAM - by Big Data Genomics
Tweet by Frank Nothaft on a common workflow definition - part of the Global Alliance for … Another link is ga4gh.org
Tachyon - in-memory file system
Cuneiform - supports multiple outputs etc.; black-box vs. white-box; the workflow dependency graph can be built up dynamically while you’re running; tasks can be specified in any scripting language, or in Cuneiform itself
Hi-Way - workflow engine for Hadoop; can run exported Galaxy workflows

Links: Our experiences using Spotify's Luigi for Bioinformatics Workflows

Fig 1: A screenshot of Luigi’s web UI, showing a real-world (although rather simple) workflow implemented in Luigi. Update May 5, 2016: Most of the below material is more or less outdated. Our latest work has resulted in the SciLuigi helper library, which we have used in production and which will be the focus of further developments. In the Bioclipse / Pharmaceutical Bioinformatics group at the Dept of Pharm. Biosciences at UU, we are quite heavy users of Spotify’s Luigi workflow library, to automate workflows, mainly doing machine learning heavy lifting.

NGS Bioinformatics Intro Course Day 2

Today was the second day of the introductory course in NGS bioinformatics that I’m taking as part of my PhD studies. For me it started with a substantial oversleep, probably due to a combination of an annoying cold and the ~2 hour commute from south Stockholm to Uppsala and BMC. Thus I missed some really interesting material (and a tutorial) on file types in NGS analysis, but I will make sure to go through that in my free time during the week.

NGS Bioinformatics Intro Course Day 1

Just finished day 1 of the introductory course on bioinformatics for next generation sequencing data at SciLifeLab Uppsala. Attached is a photo from one of the hands-on tutorial sessions, with the tutorial leaders standing to the right. Today’s content consisted mostly of introductions to the Linux command line in general, and the UPPMAX HPC environment in particular - an area I’m already very familiar with after two years as a sysadmin at UPPMAX. Thus, today I mostly got to help out the other students a bit.

Taking a one week introductory course in Bioinformatics for NGS data

Right now I’m sitting on the train, trying to get my head around some of the pre-course materials.

RDFIO VM

The old virtual machine still available: The old virtual machine from June 25, 2014, based on Ubuntu 14.04 and RDFIO 2.x, can be found here.

The smallest pipeable go program

Edit: My original suggested way further below in the post is in no way the “smallest pipeable” program; instead, see this example (credits: Axel Wagner):

    package main

    import (
        "io"
        "os"
    )

    func main() {
        io.Copy(os.Stdout, os.Stdin)
    }

… or (credits: Roger Peppe):

    package main

    import (
        "bufio"
        "fmt"
        "os"
    )

    func main() {
        for scan := bufio.NewScanner(os.Stdin); scan.Scan(); {
            fmt.Printf("%s\n", scan.Text())
        }
    }

Ah, I just realized that the “smallest pipeable” Go (lang) program is rather small, if using my little library of minimalistic streaming components.

Profiling and creating call graphs for Go programs

In trying to get my head around the code of the very interesting GoFlow library (for flow-based programming in Go), and the accompanying flow-based bioinformatics library I started hacking on, I needed to get some kind of visualization (like a call graph) … something like this: (And in the end, that is what I got … read on …) :) I then found out about the go tool pprof command, for which the Go team published a blog post here.
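For reference, the standard way to produce a profile that go tool pprof can read is the runtime/pprof package (a minimal sketch of my own; the GoFlow-specific details are in the post itself):

    package main

    import (
        "log"
        "os"
        "runtime/pprof"
    )

    func work() {
        sum := 0
        for i := 0; i < 1e8; i++ {
            sum += i
        }
        log.Println(sum)
    }

    func main() {
        f, err := os.Create("cpu.prof")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Sample CPU usage for everything between Start and Stop.
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()

        work() // the code to profile
    }

The resulting cpu.prof file can then be fed to go tool pprof to render a call graph.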

(E)BNF parser for parts of the Galaxy ToolConfig syntax with ANTLR

As blogged earlier, I’m currently into parsing the syntax of some definitions of the parameters and such of command line tools. As mentioned in the linked blog post, I was pondering whether to use the Galaxy ToolConfig format or the DocBook CmdSynopsis format. It turned out, though, that cmdsynopsis lacks the option to specify a list of valid choices for a parameter, as is possible in the Galaxy ToolConfig format (see here), which can thus be used to generate drop-down lists in wizards etc.

Partial Galaxy ToolConfig to DocBook CmdSynopsis conversion with XSLT RegEx

    <tool id="sam_to_bam" name="SAM-to-BAM" version="1.1.1">
      <description>converts SAM format to BAM format</description>
      <requirements>
        <requirement type="package">samtools</requirement>
      </requirements>
      <command interpreter="python">
        sam_to_bam.py
          --input1=$source.input1
          --dbkey=${input1.metadata.dbkey}
          #if $source.index_source == "history":
            --ref_file=$source.ref_file
          #else
            --ref_file="None"
          #end if
          --output1=$output1
          --index_dir=${GALAXY_DATA_INDEX_DIR}
      </command>
      <inputs>
        <conditional name="source">
          <param name="index_source" type="select" label="Choose the source for the reference list">
            <option value="cached">Locally cached</option>
            <option value="history">History</option>
          </param>
          <when value="cached">
            <param name="input1" type="data" format="sam" label="SAM File to Convert">
              <validator type="unspecified_build" />
              <validator type="dataset_metadata_in_file" filename="sam_fa_indices.loc" metadata_name="dbkey" metadata_column="1" message="Sequences are not currently available for the specified build.

Answering questions without answers - by wrapping simulations in semantics

There are lots of things that can’t be answered by a computer from data alone. Maybe the majority of what we humans perceive as knowledge is inferred from a combination of data (simple fact statements about reality) and rules that tell how facts can be combined, allowing implicit knowledge (knowledge that is not persisted as facts anywhere, but has to be inferred from other facts and rules) to become explicit.

A Hello World program in SWI-Prolog

Then you can load the program from inside prolog after you’ve started it. So, let’s start the prolog interactive GUI:

    prolog

Then, in the Prolog GUI, load the file test.pl like so:

    ?- [test].

Now, if you had some prolog clauses in the test.pl file, you will be able to extract that information by querying. A very simple test program that you could create is:

    /* Some facts about parent relationships */
    parent(sam,mark).