Blogging about Nextflow, computational workflows, containers and cloud computing


Goodbye zero, Hello Apache!

  • Paolo Di Tommaso
  • 24 October 2018

Today marks an important milestone in the Nextflow project. We are thrilled to announce three important changes to better meet users’ needs and ground the project on a solid foundation upon which to build a vibrant ecosystem of tools and data analysis applications for genomic research and beyond.

Apache license

Nextflow was originally licensed as GPLv3 open source software more than five years ago. GPL is designed to promote the adoption and spread of open source software and culture. On the other hand, it also has some controversial side effects, such as its treatment of derivative works and legal implications that make the use of GPL-released software a headache in many organisations. We have previously discussed these concerns in this blog post and, after community feedback, have opted to change the project license to Apache 2.0.

This is a popular permissive free software license written by .. (click here to read more)

Nextflow meets Dockstore

  • Paolo Di Tommaso
  • 18 September 2018

This post is co-authored with Denis Yuen, lead of the Dockstore project at the Ontario Institute for Cancer Research

One key feature of Nextflow is the ability to automatically pull and execute a workflow application directly from a sharing platform such as GitHub. We realised this was critical to allow users to properly track code changes and releases and, above all, to enable the seamless sharing of workflow projects.

Nextflow never wanted to implement its own centralised workflow registry because we thought that in order for a registry to be viable and therefore useful, it should be technology agnostic and it should be driven by a consensus among the wider user community.

This is exactly what the Dockstore project is designed for, and for this reason we are thrilled to announce that Dockstore has just added support for Nextflow workflows in its latest release! .. (click here to read more)

Clarification about the Nextflow license

  • Paolo Di Tommaso
  • 20 July 2018

Over the past week there was some discussion on social media regarding the Nextflow license and its impact on users' workflow applications.

What's the problem with GPL?

Nextflow has been released under the GPLv3 license since its early days over 5 years ago. GPL is a very popular open source licence used by many projects (like, .. (click here to read more)

Conda support has landed!

  • Paolo Di Tommaso
  • 05 June 2018

Nextflow aims to ease the development of large scale, reproducible workflows, allowing developers to focus on the main application logic and to rely on the best community tools and best practices.

For this reason we are very excited to announce that the latest Nextflow version (0.30.0) finally provides built-in support for Conda.

Conda is a popular package manager that simplifies the installation of software packages and the configuration of complex software environments. Above all, it provides access to large tool and software package collections maintained by domain specific communities such as Bioconda and BioBuild.

The native integration with Nextflow allows researchers to develop workflow applications in a rapid and easily repeatable manner, reusing community tools, whilst taking advantage of the configuration flexibility, portability and scalability provided by Nextflow.

How it works

Nextflow automatically creates and activates the Conda environment(s) given the dependencies specified by each process.

Dependencies are specified by using the .. (click here to read more)

Nextflow turns five! Happy birthday!

  • Evan Floden
  • 03 April 2018

Nextflow is growing up. The past week marked five years since the first commit of the project on GitHub. Like a parent reflecting on their child attending school for the first time, we know reaching this point hasn’t been an entirely solo journey, despite Paolo's best efforts!

A lot has happened recently and we thought it was time to highlight some of the recent evolutions. We also take the opportunity to extend the warmest of thanks to all those who have contributed to the development of Nextflow as well as the fantastic community of users who consistently provide ideas, feedback and the occasional late night banter on the Gitter channel.

Here are a few neat developments churning out of the birthday cake mix.

nf-core

nf-core is a community effort to provide a home for high quality, production-ready, curated analysis pipelines built using Nextflow. The project has been initiated and is being .. (click here to read more)

Running CAW with Singularity and Nextflow

  • Maxime Garcia
  • 16 November 2017

This is a guest post authored by Maxime Garcia from the Science for Life Laboratory in Sweden. Max describes how they deploy complex cancer data analysis pipelines using Nextflow and Singularity. We are very happy to share their experience across the Nextflow community.

The CAW pipeline

Cancer Analysis Workflow logo

Cancer Analysis Workflow (CAW for short) is a Nextflow based analysis pipeline developed for the analysis of tumour/normal pairs. It is developed in collaboration with two infrastructures within Science for Life Laboratory: the National Genomics Infrastructure (NGI), in the Stockholm Genomics Applications Development Facility to be precise, and the National Bioinformatics Infrastructure Sweden (NBIS).

CAW is based on GATK Best Practices for the preprocessing of FastQ files, then uses various variant calling tools to look for somatic SNVs and small indels (MuTect1, MuTect2, Strelka, Freebayes), (GATK HaplotypeCaller), for structural variants .. (click here to read more)

Scaling with AWS Batch

  • Paolo Di Tommaso
  • 08 November 2017

The latest Nextflow release (0.26.0) includes built-in support for AWS Batch, a managed computing service that allows the execution of containerised workloads over the Amazon EC2 Container Service (ECS).

This feature allows the seamless deployment of Nextflow pipelines in the cloud by offloading the process executions as managed Batch jobs. The service takes care to spin up the required computing instances on-demand, scaling up and down the number and composition of the instances to best accommodate the actual workload resource needs at any point in time.

AWS Batch shares with Nextflow the same vision regarding workflow containerisation i.e. each compute task is executed in its own Docker container. This dramatically simplifies the workflow deployment through the download of a few container images. This common design background made the support for AWS Batch a natural extension for Nextflow.

Batch in a nutshell

Batch is organised in Compute Environments, Job queues, Job definitions and .. (click here to read more)

Nextflow Hackathon 2017

  • Evan Floden
  • 30 September 2017

Last week saw the inaugural Nextflow meeting organised at the Centre for Genomic Regulation (CRG) in Barcelona. The event combined talks, demos, a tutorial/workshop for beginners as well as two hackathon sessions for more advanced users.

Nearly 50 participants attended over the two days which included an entertaining tapas course during the first evening!

One of the main objectives of the event was to bring together Nextflow users to work together on common interest projects. There were several proposals for the hackathon sessions and in the end five diverse ideas were chosen for communal development ranging from new pipelines through to the addition of new features in Nextflow.

The proposals and outcomes of each of the projects, which can be found in the issues section of this GitHub repository, have been summarised below.

Nextflow HTML tracing reports

The HTML tracing project aims to generate a rendered version of the Nextflow trace file to .. (click here to read more)

Nextflow and the Common Workflow Language

  • Kevin Sayers
  • 20 July 2017

The Common Workflow Language (CWL) is a specification for defining workflows in a declarative manner. It has been implemented to varying degrees by different software packages. Nextflow and CWL share a common goal of enabling portable reproducible workflows.

We are currently investigating the automatic conversion of CWL workflows into Nextflow scripts to increase the portability of workflows. This work is being developed as the cwl2nxf project, currently in early prototype stage.

Our first phase of the project was to determine mappings of CWL to Nextflow and familiarize ourselves with how the current implementation of the converter supports a number of CWL specific features.

Mapping CWL to Nextflow

Inputs in the CWL workflow file are initially parsed as channels or other Nextflow input types. Each step specified in the workflow is then parsed independently. At the time of writing subworkflows are not supported; each step must be a CWL CommandLineTool .. (click here to read more)

Nextflow workshop is coming!

  • Paolo Di Tommaso
  • 26 April 2017

We are excited to announce the first Nextflow workshop that will take place at the Barcelona Biomedical Research Park building (PRBB) on 14-15th September 2017.

This event is open to everybody who is interested in the problem of computational workflow reproducibility. Leading experts and users will discuss the current .. (click here to read more)

Nextflow published in Nature Biotechnology

  • Paolo Di Tommaso
  • 12 April 2017

We are excited to announce the publication of our work Nextflow enables reproducible computational workflows in Nature Biotechnology.

The article provides a description of the fundamental components and principles of Nextflow. We illustrate how the unique combination of containers, pipeline sharing and portable deployment provides tangible advantages to researchers wishing to generate reproducible computational workflows.

Reproducibility is a major challenge in today's scientific environment. We show how three bioinformatics data analyses produce different results when executed on different execution platforms and how Nextflow, along with software containers, can be used to control numerical stability, enabling consistent and replicable results across different computing platforms. As complex omics analyses enter the clinical setting, ensuring that results remain stable brings on extra importance.

Since its first release three years ago, the Nextflow user base has grown in an organic fashion. From the beginning it has been our own demands in a workflow tool and .. (click here to read more)

More fun with containers in HPC

  • Paolo Di Tommaso
  • 20 December 2016

Nextflow was one of the first workflow frameworks to provide built-in support for Docker containers. A couple of years ago we also started to experiment with the deployment of containerised bioinformatic pipelines at CRG, using Docker technology (see here and here).

We found that isolating and packaging the complete computational workflow environment in Docker images radically simplifies the burden of maintaining the complex dependency graphs of real-world data analysis pipelines.

Even more importantly, the use of containers enables replicable results with minimal effort for the system configuration. The entire computational environment can be archived in a self-contained executable format, allowing the replication of the associated analysis at any point in time.

This ability is the main reason that drove the rapid adoption of Docker in the bioinformatic community and its support in many projects, like for example Galaxy, CWL, Bioboxes, Dockstore and .. (click here to read more)

Enabling elastic computing with Nextflow

  • Paolo Di Tommaso
  • 19 October 2016

Learn how to deploy an elastic computing cluster in the AWS cloud with Nextflow

In the previous post I introduced the new cloud native support for AWS provided by Nextflow.

It allows the creation of a computing cluster in the cloud in a no-brainer way, enabling the deployment of complex computational pipelines in a few commands.

This solution is characterised by a lean application stack which does not require any third party component installed in the EC2 instances other than a Java VM and the Docker engine (the latter is only required in order to deploy pipeline binary dependencies).

Nextflow cloud deployment

Each EC2 instance runs a script, at bootstrap time, that mounts the EFS storage and downloads and launches the Nextflow cluster daemon. This daemon is self-configuring: it automatically discovers the other running instances and .. (click here to read more)

Deploy your computational pipelines in the cloud at the snap-of-a-finger

  • Paolo Di Tommaso
  • 01 September 2016

Learn how to deploy and run a computational pipeline in the Amazon AWS cloud with ease thanks to Nextflow and Docker containers

Nextflow is a framework that simplifies the writing of parallel and distributed computational pipelines in a portable and reproducible manner across different computing platforms, from a laptop to a cluster of computers.

Indeed, the original idea, when this project started three years ago, was to implement a tool that would allow researchers in our lab to smoothly migrate their data analysis applications to the cloud when needed - without having to change or adapt their code.

However, to date Nextflow has been used mostly to deploy computational workflows within on-premise computing clusters or HPC data-centers, because these infrastructures are easier to use and provide, on average, lower cost and better performance when compared to a cloud environment.

A major obstacle to .. (click here to read more)

Docker for dunces & Nextflow for nunces

  • Evan Floden
  • 10 June 2016

Below is a step-by-step guide for creating Docker images for use with Nextflow pipelines. This post was inspired by recent experiences and written with the hope that it may encourage others to join in the virtualization revolution.

Modern science is built on collaboration. Recently I became involved with one such venture between several groups across Europe. The aim was to annotate long non-coding RNA (lncRNA) in farm animals and I agreed to help with the annotation based on RNA-Seq data. The basic procedure relies on mapping short read data from many different tissues to a genome, generating transcripts and then determining if they are likely to be lncRNA or protein coding genes.

During several successful 'hackathon' meetings the best approach was decided and implemented in a joint effort. I undertook the task of wrapping the procedure up into a Nextflow pipeline with a view to replicating the results across our .. (click here to read more)

Workflows & publishing: best practice for reproducibility

  • Evan Floden
  • 13 April 2016

Publication time acts as a snapshot for scientific work. Whether a project is ongoing or not, work which was performed months ago must be described, new software documented, data collated and figures generated.

The monumental increase in data and pipeline complexity has led to this task being performed to many differing standards, or lack thereof. We all agree it is not good enough to simply note down the software version number. But what practical measures can be taken?

The recent publication describing Kallisto (Bray et al. 2016) provides an excellent high profile example of the growing efforts to ensure reproducible science in computational biology. The authors provide a GitHub repository that “contains all the analysis to reproduce the results in the kallisto paper”.

They should be applauded and indeed - in the Twittersphere - they were. The corresponding author Lior Pachter stated that the publication could be .. (click here to read more)

Error recovery and automatic resource management with Nextflow

  • Paolo Di Tommaso
  • 11 February 2016

Recently a new feature has been added to Nextflow that allows failing jobs to be rescheduled, automatically increasing the amount of computational resources requested.

The problem

Nextflow provides a mechanism that allows tasks to be automatically re-executed when a command terminates with an error exit status. This is useful to handle errors caused by temporary or even permanent failures (e.g. network hiccups, broken disks, etc.) that may happen in a cloud based environment.

However, in an HPC cluster these events are very rare. In this scenario error conditions are more likely to be caused by a peak in the computing resources used by a job, exceeding the resources originally requested. This leads to the batch scheduler killing the job, which in turn stops the overall pipeline execution.

In this context automatically re-executing the failed task is useless because it would simply replicate the same error condition. A common solution consists of increasing .. (click here to read more)
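The idea of rescheduling a failed job with a larger resource request can be sketched in a few lines of Python. This is only a conceptual illustration under stated assumptions, not how Nextflow implements the feature: `ResourceExceeded`, `run_with_escalation` and `fake_job` are hypothetical names, and scaling the memory request linearly with the attempt number is just one possible policy.

```python
from typing import Callable, List


class ResourceExceeded(Exception):
    """Hypothetical error: the scheduler killed the job for exceeding its request."""


def run_with_escalation(job: Callable[[int], str],
                        base_memory_gb: int,
                        max_attempts: int = 3) -> str:
    """Run `job`, requesting more memory on each retry (assumed linear policy)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        # Re-running with the same request would just reproduce the failure,
        # so the request grows with the attempt number.
        memory_gb = base_memory_gb * attempt
        try:
            return job(memory_gb)
        except ResourceExceeded:
            if attempt == max_attempts:
                raise  # give up: the last allowed attempt also failed
    raise AssertionError("unreachable")


# Toy job that needs at least 5 GB; it records every request it receives.
requested: List[int] = []


def fake_job(memory_gb: int) -> str:
    requested.append(memory_gb)
    if memory_gb < 5:
        raise ResourceExceeded(f"{memory_gb} GB was not enough")
    return f"finished with {memory_gb} GB"


result = run_with_escalation(fake_job, base_memory_gb=2)
print(result)     # finished with 6 GB
print(requested)  # [2, 4, 6]
```

Note that only the resource-related failure triggers a bigger request here; any other exception propagates immediately, since asking for more memory would not help.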

Developing a bioinformatics pipeline across multiple environments

  • Evan Floden
  • 04 February 2016

As a new bioinformatics student with little formal computer science training, there are few things that scare me more than PhD committee meetings and having to run my code in a completely different operating environment.

Recently my work landed me in the middle of the phylogenetic tree jungle and the computational requirements of my project far outgrew the resources that were available on our institute’s Univa Grid Engine based cluster. Luckily for me, an opportunity arose to participate in a joint program at the MareNostrum HPC at the Barcelona Supercomputing Centre (BSC).

As one of the top 100 supercomputers in the world, the MareNostrum III dwarfs our cluster and consists of nearly 50'000 processors. However it soon became apparent that with great power comes great responsibility and in the case of the BSC, great restrictions. These include no internet access, restrictive wall times for jobs, longer queues, fewer .. (click here to read more)

MPI-like distributed execution with Nextflow

  • Paolo Di Tommaso
  • 13 November 2015

The main goal of Nextflow is to make workflows portable across different computing platforms taking advantage of the parallelisation features provided by the underlying system without having to reimplement your application code.

From the beginning Nextflow has included executors designed to target the most popular resource managers and batch schedulers commonly used in HPC data centers, such as Univa Grid Engine, Platform LSF, SLURM, PBS and Torque.

When using one of these executors Nextflow submits the computational workflow tasks as independent job requests to the underlying platform scheduler, specifying for each of them the computing resources needed to carry out its job.

This approach works well for workflows that are composed of long running tasks, which is the case of most common genomic pipelines.

However this approach does not scale well for workloads made up of a large number of short-lived tasks (e.g. a few seconds .. (click here to read more)

The impact of Docker containers on the performance of genomic pipelines

  • Paolo Di Tommaso
  • 15 June 2015

In a recent publication we assessed the impact of Docker container technology on the performance of bioinformatic tools and data analysis workflows.

We benchmarked three different data analyses: an RNA sequence pipeline for gene expression, a consensus assembly and variant calling pipeline, and finally a pipeline for the detection and mapping of long non-coding RNAs.

We found that Docker containers have only a minor impact on the performance of common genomic data analysis, which is negligible when the executed tasks are demanding in terms of computational time.

This publication is available as a PeerJ preprint at this link.

Innovation In Science - The story behind Nextflow

  • Maria Chatzou
  • 09 June 2015

Innovation can be viewed as the application of solutions that meet new requirements or existing market needs. Academia has traditionally been the driving force of innovation. Scientific ideas have shaped the world, but only a few of them were brought to market by the inventing scientists themselves, resulting in both time and financial losses.

Lately there have been several attempts to boost scientific innovation and translation, the most notable in Europe being the Horizon 2020 funding program. The problem with these types of funding is that they are not designed for PhDs and Postdocs, but rather aim to promote the collaboration of senior scientists in different institutions. This neglects two very important facts. First and foremost, most of the Nobel prizes were given for discoveries made when scientists were in their 20's / 30's (not in their 50's / 60's). Secondly, innovation really happens when a few individuals (not .. (click here to read more)

Introducing Nextflow REPL Console

  • Paolo Di Tommaso
  • 14 April 2015

The latest version of Nextflow introduces a new console graphical interface.

The Nextflow console is a REPL (read-eval-print loop) environment that allows one to quickly test part of a script or pieces of Nextflow code in an interactive manner.

It is a handy tool that allows one to evaluate fragments of Nextflow/Groovy code or quickly prototype a complete pipeline script.

Getting started

The console application is included in the latest version of Nextflow (0.13.1 or higher).

You can try this feature out, having Nextflow installed on your computer, by entering the following command in your shell terminal: nextflow console.

When you execute it for the first time, Nextflow will spend a few seconds downloading the required runtime dependencies. When complete the console window will appear as shown in the picture below.

Nextflow console

It contains a text editor (the top white box) that .. (click here to read more)

Using Docker for scientific data analysis in an HPC cluster

  • Paolo Di Tommaso
  • 06 November 2014

Scientific data analysis pipelines are rarely composed of a single piece of software. In a real world scenario, computational pipelines are made up of multiple stages, each of which can execute many different scripts, system commands and external tools deployed in a hosting computing environment, usually an HPC cluster.

As I work as a research engineer in a bioinformatics lab, I experience on a daily basis the difficulties related to keeping such a piece of software consistent.

Computing environments can change frequently in order to test new pieces of software or maybe because system libraries need to be updated. For this reason replicating the results of a data analysis over time can be a challenging task.

Docker has emerged recently as a new type of virtualisation technology that allows one to create a self-contained runtime environment. There are plenty of examples showing the benefits of using it to run .. (click here to read more)

Reproducibility in Science - Nextflow meets Docker

  • Maria Chatzou
  • 09 September 2014

The scientific world nowadays operates on the basis of published articles. These are used to report novel discoveries to the rest of the scientific community.

But have you ever wondered what a scientific article is? It is a:

  1. defeasible argument for claims, supported by
  2. exhibited, reproducible data and methods, and
  3. explicit references to other work in that domain;
  4. described using domain-agreed technical terminology,
  5. which exists within a complex ecosystem of technologies, people and activities.

Hence the very essence of Science relies on the ability of scientists to reproduce and build upon each other’s published results.

So how much can we rely on published data? In a recent report in Nature, researchers at the Amgen corporation found that only 11% of the academic research in the literature was reproducible by their groups [1].

While many factors are likely at play here, perhaps the most basic requirement for reproducibility holds that .. (click here to read more)

Share Nextflow pipelines with GitHub

  • Paolo Di Tommaso
  • 07 August 2014

The GitHub code repository and collaboration platform is widely used among researchers to publish their work and to collaborate on project source code.

Even more interestingly a few months ago GitHub announced improved support for researchers making it possible to get a Digital Object Identifier (DOI) for any GitHub repository archive.

With a DOI for your GitHub repository archive your code becomes formally citable in scientific publications.

Why use GitHub with Nextflow?

The latest Nextflow release (0.9.0) seamlessly integrates with GitHub. This feature allows you to manage your code in a more consistent manner, or use other people's Nextflow pipelines, published through GitHub, in a quick and transparent manner.

How it works

The idea is very simple: when you launch a script execution with Nextflow, it will look for a file with the pipeline name you've specified. If that file does not exist, it will look for a public repository with .. (click here to read more)