Hi, my name is Ben, and I’m a software engineer at Seqera Labs. I joined Seqera in November 2021 after finishing my Ph.D. at Clemson University. I work on a number of things at Seqera, but my primary role is that of a Nextflow core contributor.
I have run Nextflow just about everywhere, from my laptop to my university cluster to the cloud and Kubernetes. I have written Nextflow pipelines for bioinformatics and machine learning, and I even wrote a pipeline to run other Nextflow pipelines for my dissertation research. While I tried to avoid contributing code to Nextflow as a student (I had enough work already), now I get to work on it full-time!
Which brings me to the topic of this post: Nextflow and Kubernetes.
One of my first contributions was a “best practices guide” for running Nextflow on Kubernetes. The guide has helped many people, but for me .. (click here to read more)
In 2023, the world of Nextflow is more exciting than ever! With new resources constantly being released, there is no better time to dive into this powerful tool. From a new Software Carpentries’ course to an in-depth write-up by 23andMe to new tutorials on Wave and Fusion, the options for learning Nextflow are endless.
We've compiled a list of the best resources in 2023 to make your journey to Nextflow mastery as seamless as possible. And remember, Nextflow is a community-driven project. If you have suggestions or want to contribute to this list, head to the GitHub page and make a pull request.
Before learning Nextflow, you should be comfortable with the Linux command line and be familiar with some basic scripting languages such as Perl or Python. The beauty of Nextflow is that task logic can be written in your language of choice. You will .. (click here to read more)
We have talked about Google Cloud Batch before. Not only that, we were proud to announce Nextflow support for Google Cloud Batch right after it was publicly released, back in July 2022. How amazing is that? But we didn't stop there! The Nextflow official documentation also provides a lot of useful information on how to use Google Cloud Batch as the compute environment for your Nextflow pipelines. Having said that, feedback from the community is valuable, and we agreed that in addition to the documentation, teaching by example, and in a more informal language, can help many of our users. So, here is a tutorial on how to use the Batch service of the Google Cloud Platform with Nextflow 🥳
Welcome to our RNAseq tutorial using Nextflow and Google Cloud Batch! RNAseq is a powerful technique for studying gene expression and is .. (click here to read more)
The ability to resume an analysis (i.e. caching) is one of the core strengths of Nextflow. When developing pipelines, this allows us to avoid re-running unchanged processes by simply appending -resume to the nextflow run command. Sometimes, tasks may be repeated for reasons that are unclear. In these cases it can help to look into the caching mechanism, to understand why a specific process was re-run.
We have previously written about Nextflow's resume functionality as well as some troubleshooting strategies to gain more insights on the caching behavior.
In this post, we will take a more hands-on approach and highlight some strategies which we can use to understand what is causing a particular process (or processes) to re-run, instead of using the cache from previous runs of the pipeline. To demonstrate the process, we will introduce a minor change into one of the process definitions in the .. (click here to read more)
After a three-year COVID-related hiatus from in-person events, Nextflow developers and users found their way to Barcelona this October for the 2022 Nextflow Summit. Held at Barcelona’s iconic Agbar tower, this was easily the most successful Nextflow community event yet!
The week-long event kicked off with 50 people participating in a hackathon organized by nf-core beginning on October 10th. The hackathon tackled several cutting-edge projects with developer teams focused on various aspects of nf-core including documentation, subworkflows, pipelines, DSL2 conversions, modules, and infrastructure. The Nextflow Summit began mid-week attracting nearly 600 people, including 165 attending in person and another 433 remotely. The YouTube live streams have now collected over two and a half thousand views. Just prior to the summit, three virtual Nextflow training events were also run with separate sessions for the Americas, .. (click here to read more)
Containers have become an essential part of well-structured data analysis pipelines. They encapsulate applications and dependencies in portable, self-contained packages that can be easily distributed. Containers are also key to enabling predictable and reproducible results.
Nextflow was one of the first workflow technologies to fully embrace containers for data analysis pipelines. Community curated container collections such as BioContainers also helped speed container adoption.
However, the increasing complexity of data analysis pipelines and the need to deploy them across different clouds and platforms pose new challenges. Today, workflows may comprise dozens of distinct container images. Pipeline developers must manage and maintain these containers and ensure that their functionality precisely aligns with the requirements of every pipeline task.
Also, multi-cloud deployments and the increased use of private container registries further increase complexity for developers. Building and maintaining containers, pushing them to multiple registries, and dealing with platform-specific authentication schemes are tedious, time .. (click here to read more)
Nextflow is a powerful workflow manager that supports multiple container technologies, cloud providers and HPC job schedulers. It shouldn't be a surprise that such wide-ranging functionality leads to a complex interface, with many subcommands and options to remember. For a first-time user (and sometimes even for some long-time users) it can be difficult to remember everything. This is not a new problem for the command-line; even very common applications such as grep and tar are famous for having a bewildering array of options.
Many tools have sprung up to make the command-line more user friendly, such as tldr pages and rich-click. Fig is one such tool that adds powerful autocomplete functionality to your terminal. Fig gives you graphical popups with color-coded contexts more dynamic than shaded text for recent commands or long .. (click here to read more)
Word cloud of scientific interest keywords, averaged across all applications.
Our recent The State of the Workflow 2022: Community Survey Results showed that Nextflow and nf-core have a strong global community with a high level of engagement in several countries. As the community continues to grow, we aim to prioritize inclusivity for everyone through active outreach to groups with low representation.
Thanks to funding from our Chan Zuckerberg Initiative Diversity and Inclusion grant we established an international Nextflow and nf-core mentoring program with the aim of empowering those from underrepresented groups. With the first round of the mentorship now complete, we look back at the success of the program so far.
From almost 200 applications, five pairs of mentors and mentees were selected for the first round of the program. Over the following .. (click here to read more)
A key feature of Nextflow is the ability to abstract the implementation of data analysis pipelines so they can be deployed in a portable manner across execution platforms.
As of today, Nextflow supports a rich variety of HPC schedulers and all major cloud providers. Our goal is to support new services as they emerge to enable Nextflow users to take advantage of the latest technology and deploy pipelines on the compute environments that best fit their requirements.
For this reason, we are delighted to announce that Nextflow now supports Google Cloud Batch, a new fully managed batch service just announced for beta availability by Google Cloud.
Google Cloud Batch is a comprehensive cloud service suitable for multiple use cases, including HPC, AI/ML, and data processing. While it is similar to the Google Cloud Life Sciences API, used by many Nextflow users today, Google Cloud .. (click here to read more)
As recently announced, we are super excited to host a new Nextflow community event late this year! The Nextflow Summit will take place October 12-14, 2022 at the iconic Torre Glòries in Barcelona, with an associated nf-core hackathon beforehand.
Today we’re excited to open the call for abstracts! We’re looking for talks and posters about anything and everything happening in the Nextflow world. Specifically, we’re aiming to shape the program into four key areas:
Speaking at the summit will primarily be in-person, but we welcome posters from remote attendees. Posters will be submitted digitally and available online during and after the event. Talks will be streamed live and be available after the event... (click here to read more)
Software development is a constantly evolving process that requires continuous adaptation to keep pace with new technologies, user needs, and trends. Likewise, changes are needed in order to introduce new capabilities and guarantee a sustainable development process.
Nextflow is no exception. This post will summarise the major changes in the evolution of the framework over the next 12 to 18 months.
Nextflow runs on top of Java (or, more precisely, the Java virtual machine). So far, Java 8 has been the minimal version required to run Nextflow. However, this version was released 8 years ago and is going to reach its end-of-life status at the end of this month. For this reason, as of version 22.01.x-edge and the upcoming stable release 22.04.0, Nextflow will require Java version 11 or later for its execution. This also allows the introduction of new capabilities provided by the modern Java runtime.
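To see whether a machine already meets the Java 11 requirement, a small shell check along these lines may help. It is only a sketch: java_major is a hypothetical helper name, and the parsing assumes the usual version strings printed by java -version (the legacy 1.8.0_292 style as well as the newer 11.0.14 style).

```shell
# Hypothetical helper (not part of Nextflow): print the major Java version
# for a version string such as 1.8.0_292, 11.0.14 or 17.
java_major() {
    local v=$1
    case "$v" in
        1.*) v=${v#1.} ;;   # legacy scheme: 1.8.0_292 means Java 8
    esac
    # Drop everything from the first ".", "_" or "-" onwards.
    echo "${v%%[._-]*}"
}

# Live check, skipped when no java executable is on the PATH.
if command -v java >/dev/null 2>&1; then
    current=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
    if [ "$(java_major "$current")" -ge 11 ] 2>/dev/null; then
        echo "Java $current satisfies the Java 11 requirement"
    else
        echo "Java $current is older than 11 (or could not be parsed)"
    fi
fi
```

Comparing only the major version keeps the check simple, which is enough for a minimum-version requirement like this one.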
Tip: .. (click here to read more)
The Nextflow community channel on Gitter has grown substantially over the last few years and today has more than 1,300 members.
I still remember when a former colleague proposed the idea of opening a Nextflow channel on Gitter. At the time, I didn't know anything about Gitter, and my initial response was: "would that not be a waste of time?".
Fortunately, I took him up on his suggestion and the Gitter channel quickly became an important resource for all Nextflow developers and a key factor to its success.
As the Nextflow community continues to grow, we realize that we have reached the limit of .. (click here to read more)
A lot has happened since we last wrote about how best to learn Nextflow, over a year ago. Several new resources have been released including a new Nextflow Software Carpentries course and an excellent write-up by 23andMe.
We have collated some links below from a diverse collection of resources to help you on your journey to learn Nextflow. Nextflow is a community-driven project - if you have any suggestions, please make a pull request to this page on GitHub.
Without further ado, here is the definitive guide for learning Nextflow in 2022. These resources will support anyone in the journey from total beginner to Nextflow expert.
Before you start writing Nextflow pipelines, we recommend that you are comfortable with using the command-line and understand the basic concepts of scripting languages such as Python or Perl. Nextflow is widely used for bioinformatics applications, and scientific data analysis. The examples and .. (click here to read more)
Git has become the de facto standard source-code version control system and has seen increasing adoption across the spectrum of software development.
Nextflow provides built-in support for Git and the most popular Git hosting platforms, such as GitHub, GitLab and Bitbucket among others, which streamlines managing versions and tracking changes in your pipeline projects and facilitates collaboration across different users.
In order to access public repositories, Nextflow does not require any special configuration. Just use the http URL of the pipeline project you want to run in the run command, for example:
nextflow run https://github.com/nextflow-io/hello
However to allow Nextflow to access private repositories you will need to specify the repository credentials, and the server hostname in the case of self-managed Git server installations.
This is done through a file named scm placed in the $HOME/.nextflow/ directory, containing the credentials and other details for accessing a .. (click here to read more)
For Windows users, getting access to a Linux-based Nextflow development and runtime environment used to be hard. Users would need to run virtual machines, access separate physical servers or cloud instances, or install packages such as Cygwin or Wubi. Fortunately, there is now an easier way to deploy a complete Nextflow development environment on Windows.
The Windows Subsystem for Linux (WSL) allows users to build, manage and execute Nextflow pipelines on a Windows 10 laptop or desktop without needing a separate Linux machine or cloud VM. Users can build and test Nextflow pipelines and containerized workflows locally, on an HPC cluster, or their preferred cloud service, including AWS Batch and Azure Batch.
This document provides a step-by-step guide to setting up a Nextflow development environment on Windows 10.
The steps described in this guide are as follows:
The recent tweet introducing the Nextflow support for SQL databases received a lot of positive reactions. In this post, I want to describe in more detail how this extension works.
Nextflow was designed with the idea to streamline the deployment of complex data pipelines in a scalable, portable and reproducible manner across different computing platforms. To make this all possible, it was decided the resulting pipeline and the runtime should be self-contained i.e. to not depend on separate services such as database servers.
This makes the resulting pipelines easier to configure, deploy, and allows for testing them using CI services, which is a critical best practice for delivering high-quality and stable software.
Another important consequence is that Nextflow pipelines do not retain the pipeline state on separate storage. Said in a different way, the idea was - and still is - to promote stateless pipeline execution in which the .. (click here to read more)
In May we blogged about Five Nextflow Tips for HPC Users and now we continue the series with five additional tips for deploying Nextflow on HPC batch schedulers.
To allow the pipeline tasks to share data with each other, Nextflow requires a shared file system path as a working directory. When using this model, a common recommendation is to use the node's local scratch storage as the job working directory to avoid unnecessary use of the network shared file system and achieve better performance.
Nextflow implements this best practice, which can be enabled by adding the following setting to your Nextflow configuration file:
process.scratch = true
When using this option, Nextflow:
* Creates a unique directory in the computing node's local /tmp or the path assigned by your cluster via the TMPDIR environment variable.
* Creates a symlink for each input file required by the job execution. .. (click here to read more)
Nextflow is a powerful tool for developing scientific workflows for use on HPC systems. It provides a simple solution to deploy parallelized workloads at scale using an elegant reactive/functional programming model in a portable manner.
It supports the most popular workload managers such as Grid Engine, Slurm, LSF and PBS, among other out-of-the-box executors, and comes with sensible defaults for each. However, each HPC system is a complex machine with its own characteristics and constraints. For this reason you should always consult your system administrator before running a new piece of software or a compute intensive pipeline that spawns a large number of jobs.
In this series of posts, we will be sharing the top tips we have learned along the way that should help you get results faster while keeping in the good books of your sys admins.
Nextflow, by default, spawns parallel task executions in .. (click here to read more)
This blog follows up the Learning Nextflow in 2020 blog post.
This guide is designed to walk you through a basic development setup for writing Nextflow pipelines.
Nextflow runs on any Linux compatible system and MacOS with Java installed. Windows users can rely on the Windows Subsystem for Linux. Installing Nextflow is straightforward. You just need to download the nextflow executable. In your terminal type the following commands:
$ curl get.nextflow.io | bash
$ sudo mv nextflow /usr/local/bin
The first line uses the curl command to download the nextflow executable, and the second line moves the executable to your PATH. Note that /usr/local/bin is the default for MacOS; you might want to choose /usr/bin depending on your PATH definition and operating system.
Nextflow pipelines can be written in any plain text editor. I'm personally a bit of a Vim fan, however, the advent of .. (click here to read more)
When the Nextflow project was created, one of the main drivers was to enable reproducible data pipelines that could be deployed across a wide range of execution platforms with minimal effort as well as to empower users to scale their data analysis while facilitating the migration to the cloud.
Throughout the years, the computing services provided by cloud vendors have evolved in a spectacular manner. Eight years ago, the model was focused on launching virtual machines in the cloud, then came containers and then the idea of serverless computing which changed everything again. However, the power of the Nextflow abstraction consists of hiding the complexity of the underlying platform. Through the concept of executors, emerging technologies and new platforms can be easily adapted with no changes required to user pipelines.
With this in mind, we could not be more excited to announce that over the past months we have been working with .. (click here to read more)
With the year nearly over, we thought it was about time to pull together the best-of-the-best guide for learning Nextflow in 2020. These resources will support anyone in the journey from total noob to Nextflow expert so this holiday season, give yourself or someone you know the gift of learning Nextflow!
We recommend that learners are comfortable with using the command line and the basic concepts of a scripting language such as Python or Perl before they start writing pipelines. Nextflow is widely used for bioinformatics applications, and the examples in these guides often focus on applications in these topics. However, Nextflow is now adopted in a number of data-intensive domains such as radio astronomy, satellite imaging and machine learning. No domain expertise is expected.
We estimate that the speediest of learners can complete the material in around 12 hours. It all depends on your background and how .. (click here to read more)
The latest Nextflow version 20.10.0 is the first stable release running on Groovy 3.
The first benefit of this change is that now Nextflow can be compiled and run on any modern Java virtual machine, from Java 8, all the way up to the latest Java 15!
Along with this, the new Groovy runtime brings a whole lot of syntax enhancements that can be useful in the everyday life of .. (click here to read more)
For most developers, the command line is synonymous with agility. While tools such as Nextflow Tower are opening up the ecosystem to a whole new set of users, the Nextflow CLI remains a bedrock for pipeline development. The CLI in Nextflow has been the core interface since the beginning; however, its full functionality was never extensively documented. Today we are excited to release the first iteration of the CLI documentation available on the Nextflow website.
And given Halloween is just around the corner, in this blog post we'll take a look at 5 CLI tricks and examples which will make your life easier in designing, executing and debugging data pipelines. We are also giving away 5 limited-edition Nextflow hoodies and sticker packs so you can code in style this Halloween season!
Nextflow facilitates easy collaboration and re-use of existing pipelines in .. (click here to read more)
We are thrilled to announce the stable release of Nextflow DSL 2 as part of the latest 20.07.1 version!
Nextflow DSL 2 represents a major evolution of the Nextflow language and makes it possible to scale and modularise your data analysis pipeline while continuing to use the Dataflow programming paradigm that characterises the Nextflow processing model.
We spent more than one year collecting user feedback and making sure that DSL 2 would naturally fit the programming experience Nextflow developers are used to.
Backward compatibility is a paramount value. For this reason, the changes introduced in the syntax have been minimal and, above all, guarantee the support of all existing applications. DSL 2 will be an opt-in feature for at least the next 12 to 18 months. After this transitory period, we plan to make it the default Nextflow execution mode.
As of today, to use DSL .. (click here to read more)
Continuing our series on understanding Nextflow resume, we wanted to delve deeper to show how you can report which tasks contribute to a given workflow output.
When provided with a run name or session ID, the log command can return useful information about a pipeline execution. This can be composed to track the provenance of a workflow result.
When supplying a run name or session ID, the log command .. (click here to read more)
This two-part blog aims to help users understand Nextflow’s powerful caching mechanism. Part one describes how it works whilst part two will focus on execution provenance and troubleshooting. You can read part one here.
If your workflow execution is not resumed as expected, there are several strategies to debug the problem.
Make sure that there has been no change in your input files. Don’t forget the unique task hash is computed by taking into account the complete file path, the last modified timestamp and the file size. If any of these change, the workflow will be re-executed, even if the input content is the same.
A process should never alter input files. When this happens, the future execution of tasks will be invalidated for the same reason explained in the previous point.
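The effect described in the two points above can be sketched in a few lines of shell. This is an illustration only, not Nextflow's actual hashing code: file_key is a hypothetical helper, md5sum stands in for the real hash function, and the stat and touch options assume GNU coreutils (Linux).

```shell
# Hypothetical helper (not Nextflow code): derive a key from the three
# attributes named above: full path, last-modified time and file size.
file_key() {
    local path mtime size
    path=$(realpath "$1")
    mtime=$(stat -c '%Y' "$path")   # seconds since the epoch (GNU stat)
    size=$(stat -c '%s' "$path")    # size in bytes
    printf '%s|%s|%s' "$path" "$mtime" "$size" | md5sum | cut -d' ' -f1
}

workdir=$(mktemp -d)
printf 'ACGT\n' > "$workdir/sample.fa"
touch -d '2020-01-01 00:00:00' "$workdir/sample.fa"
key_before=$(file_key "$workdir/sample.fa")

# Same path, same content, same size: only the timestamp changes.
touch -d '2021-01-01 00:00:00' "$workdir/sample.fa"
key_after=$(file_key "$workdir/sample.fa")

if [ "$key_before" != "$key_after" ]; then
    echo "key changed: the task would be re-executed"
else
    echo "key unchanged: the cached result could be reused"
fi
rm -rf "$workdir"
```

The script reports that the key changed even though the file content is identical, which is the behaviour to keep in mind when input files are copied, touched or rewritten in place between runs.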
Some shared file systems, such as NFS, may .. (click here to read more)
This two-part blog aims to help users understand Nextflow’s powerful caching mechanism. Part one describes how it works whilst part two will focus on execution provenance and troubleshooting. You can read part two here.
Task execution caching and checkpointing is an essential feature of any modern workflow manager and Nextflow provides an automated caching mechanism with every workflow execution. When using the -resume flag, successfully completed tasks are skipped and the previously cached results are used in downstream tasks. But understanding the specifics of how it works and debugging situations when the behaviour is not as expected is a common source of frustration.
The mechanism works by assigning a unique ID to each task. This unique ID is used to create a separate execution directory, called the working directory, where the tasks are executed and the results stored. A task’s unique ID is generated as a 128-bit hash number obtained from .. (click here to read more)
The ability to create components, libraries or module files has been among the most requested features over the years.
For this reason, today we are very happy to announce that a preview implementation of the modules feature has been merged into the master branch of the project and included in the 19.05.0-edge release.
The implementation of this feature has opened the possibility for many fantastic improvements to Nextflow and its syntax. We are extremely excited as it results in a radical new way of writing Nextflow applications! So much so, that we are referring to these changes as DSL 2.
Since this is still a preview technology and, above all, to not break any existing applications, to enable the new syntax you will need to add the following line at the beginning of your workflow script:
A module file simply consists of one .. (click here to read more)
We are excited to announce the new Nextflow 19.04.0 stable release!
This version includes numerous bug fixes, enhancements and new features.
In this release, we are making the new interactive rich output using ANSI escape characters as the default logging option. This produces a much more readable and easy to follow log of the running workflow execution.
The ANSI log is implicitly disabled when Nextflow is launched in the background, i.e. when using the -bg option. It can also be explicitly disabled using the -ansi-log false option or by setting the NXF_ANSI_LOG=false variable in your launching environment.
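As a rough sketch of how these options look in practice, the wrapper below disables the ANSI log through the environment variable. The function name nextflow_plain is an assumption made for this example and is not part of Nextflow. The commented lines show the command-line forms, using the hello pipeline URL purely as a placeholder.

```shell
# Hypothetical wrapper (the name is not part of Nextflow): launch any
# Nextflow command with the ANSI log switched off via NXF_ANSI_LOG.
nextflow_plain() {
    NXF_ANSI_LOG=false nextflow "$@"
}

# One-off forms with the options described above:
#   nextflow run https://github.com/nextflow-io/hello -ansi-log false
#   nextflow run https://github.com/nextflow-io/hello -bg
```

With the wrapper defined, nextflow_plain run https://github.com/nextflow-io/hello behaves like a normal run but keeps the plain line-by-line log, which tends to be easier to capture in CI logs or files.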
The support for NCBI SRA archive was introduced in the previous edge release. Given the very positive reaction, we are graduating this feature into the stable release for general availability.
This version also includes a new Git repository provider for the Gitea self-hosted source .. (click here to read more)
It's time for the monthly Nextflow release for March, edge version 19.03. This is another great release with some cool new features, bug fixes and improvements.
This sees the introduction of the long-awaited sequence read archive (SRA) channel factory. The SRA is a key public repository for sequencing data and is run in coordination between The National Center for Biotechnology Information (NCBI), The European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ).
This feature originates all the way back in 2015 and was worked on during a 2018 Nextflow hackathon. It was brought to the fore again thanks to the release of Phil Ewels' excellent SRA Explorer. The SRA channel factory allows users to pull read data in FASTQ format directly from SRA by referencing a study, accession ID or even a keyword. It works in a similar way to fromFilePairs, returning a sample ID .. (click here to read more)
Google Cloud and WuXi NextCODE are dedicated to advancing the state of the art in biomedical informatics, especially through open source, which allows developers to collaborate broadly and deeply.
WuXi NextCODE is itself a user of Nextflow, and Google Cloud has many customers that use Nextflow. Together, we’ve collaborated to deliver Google Cloud Platform (GCP) support for Nextflow using the Google Pipelines API. Pipelines API is a managed computing service that allows the execution of containerized workloads .. (click here to read more)
Today marks an important milestone in the Nextflow project. We are thrilled to announce three important changes to better meet users’ needs and ground the project on a solid foundation upon which to build a vibrant ecosystem of tools and data analysis applications for genomic research and beyond.
Nextflow was originally licensed as GPLv3 open source software more than five years ago. GPL is designed to promote the adoption and spread of open source software and culture. On the other hand, it also has some controversial side-effects, such as the one on derivative works and legal implications which make the use of GPL released software a headache in many organisations. We have previously discussed these concerns in this blog post and, after community feedback, have opted to change the project license to Apache 2.0.
This is a popular permissive free software license written by .. (click here to read more)
One key feature of Nextflow is the ability to automatically pull and execute a workflow application directly from a sharing platform such as GitHub. We realised this was critical to allow users to properly track code changes and releases and, above all, to enable the seamless sharing of workflow projects.
Nextflow never wanted to implement its own centralised workflow registry because we thought that in order for a registry to be viable and therefore useful, it should be technology agnostic and it should be driven by a consensus among the wider user community.
This is exactly what the Dockstore project is designed for and for this reason we are thrilled to announce that Dockstore has just released the support for Nextflow workflows in its latest release! .. (click here to read more)
Over the past week there was some discussion on social media regarding the Nextflow license and its impact on users' workflow applications.
… don’t use Nextflow, yo. https://t.co/Paip5W1wgG— Konrad Rudolph 👨🔬💻 (@klmr) July 10, 2018
This is certainly disappointing. An argument in favor of writing workflows in @commonwl, which is independent of the execution engine. https://t.co/mIbdLQQxmf— John Didion (@jdidion) July 10, 2018
GPL is generally considered toxic to companies due to fear of the viral nature of the license.— Jeff Gentry (@geoffjentry) July 10, 2018
Nextflow has been released under the GPLv3 license since its early days over 5 years ago. GPL is a very popular open source licence used by many projects (like, .. (click here to read more)
Nextflow aims to ease the development of large scale, reproducible workflows allowing developers to focus on the main application logic and to rely on best community tools and best practices.
For this reason we are very excited to announce that the latest Nextflow version (0.30.0) finally provides built-in support for Conda.
Conda is a popular package manager that simplifies the installation of software packages and the configuration of complex software environments. Above all, it provides access to large tool and software package collections maintained by domain specific communities such as Bioconda and BioBuild.
The native integration with Nextflow allows researchers to develop workflow applications in a rapid and easily repeatable manner, reusing community tools, whilst taking advantage of the configuration flexibility, portability and scalability provided by Nextflow.
Nextflow automatically creates and activates the Conda environment(s) given the dependencies specified by each process.
Dependencies are specified by using the .. (click here to read more)
Nextflow is growing up. The past week marked five years since the first commit of the project on GitHub. Like a parent reflecting on their child attending school for the first time, we know reaching this point hasn’t been an entirely solo journey, despite Paolo's best efforts!
A lot has happened recently and we thought it was time to highlight some of the recent evolutions. We also take the opportunity to extend the warmest of thanks to all those who have contributed to the development of Nextflow as well as the fantastic community of users who consistently provide ideas, feedback and the occasional late night banter on the Gitter channel.
Here are a few neat developments churning out of the birthday cake mix.
nf-core is a community effort to provide a home for high quality, production-ready, curated analysis pipelines built using Nextflow. The project has been initiated and is being .. (click here to read more)
This is a guest post authored by Maxime Garcia from the Science for Life Laboratory in Sweden. Max describes how they deploy complex cancer data analysis pipelines using Nextflow and Singularity. We are very happy to share their experience across the Nextflow community.
Cancer Analysis Workflow (CAW for short) is a Nextflow-based analysis pipeline developed for the analysis of tumour/normal pairs. It is developed in collaboration with two infrastructures within Science for Life Laboratory: National Genomics Infrastructure (NGI), specifically the Stockholm Genomics Applications Development Facility, and National Bioinformatics Infrastructure Sweden (NBIS).
CAW is based on GATK Best Practices for the preprocessing of FastQ files, then uses various variant calling tools to look for somatic SNVs and small indels (MuTect1, MuTect2, Strelka, Freebayes), (GATK HaplotypeCaller), for structural variants .. (click here to read more)
The latest Nextflow release (0.26.0) includes built-in support for AWS Batch, a managed computing service that allows the execution of containerised workloads over the Amazon EC2 Container Service (ECS).
This feature allows the seamless deployment of Nextflow pipelines in the cloud by offloading the process executions as managed Batch jobs. The service takes care of spinning up the required computing instances on demand, scaling the number and composition of the instances up and down to best accommodate the actual workload resource needs at any point in time.
AWS Batch shares with Nextflow the same vision regarding workflow containerisation i.e. each compute task is executed in its own Docker container. This dramatically simplifies the workflow deployment through the download of a few container images. This common design background made the support for AWS Batch a natural extension for Nextflow.
Batch is organised in Compute Environments, Job queues, Job definitions and .. (click here to read more)
Last week saw the inaugural Nextflow meeting organised at the Centre for Genomic Regulation (CRG) in Barcelona. The event combined talks, demos, a tutorial/workshop for beginners as well as two hackathon sessions for more advanced users.
Nearly 50 participants attended over the two days, which included an entertaining tapas course during the first evening!
One of the main objectives of the event was to bring together Nextflow users to work together on common interest projects. There were several proposals for the hackathon sessions and in the end five diverse ideas were chosen for communal development ranging from new pipelines through to the addition of new features in Nextflow.
The proposals and outcomes of each of the projects, which can be found in the issues section of this GitHub repository, have been summarised below.
The HTML tracing project aims to generate a rendered version of the Nextflow trace file to .. (click here to read more)
The Common Workflow Language (CWL) is a specification for defining workflows in a declarative manner. It has been implemented to varying degrees by different software packages. Nextflow and CWL share a common goal of enabling portable reproducible workflows.
We are currently investigating the automatic conversion of CWL workflows into Nextflow scripts to increase the portability of workflows. This work is being developed as the cwl2nxf project, currently in early prototype stage.
Our first phase of the project was to determine mappings of CWL to Nextflow and familiarize ourselves with how the current implementation of the converter supports a number of CWL specific features.
Inputs in the CWL workflow file are initially parsed as channels or other Nextflow input types. Each step specified in the workflow is then parsed independently. At the time of writing, subworkflows are not supported; each step must be a CWL CommandLineTool .. (click here to read more)
We are excited to announce the first Nextflow workshop that will take place at the Barcelona Biomedical Research Park building (PRBB) on 14-15th September 2017.
This event is open to everybody who is interested in the problem of computational workflow reproducibility. Leading experts and users will discuss the current .. (click here to read more)
We are excited to announce the publication of our work Nextflow enables reproducible computational workflows in Nature Biotechnology.
The article provides a description of the fundamental components and principles of Nextflow. We illustrate how the unique combination of containers, pipeline sharing and portable deployment provides tangible advantages to researchers wishing to generate reproducible computational workflows.
Reproducibility is a major challenge in today's scientific environment. We show how three bioinformatics data analyses produce different results when executed on different execution platforms and how Nextflow, along with software containers, can be used to control numerical stability, enabling consistent and replicable results across different computing platforms. As complex omics analyses enter the clinical setting, ensuring that results remain stable takes on extra importance.
Since its first release three years ago, the Nextflow user base has grown in an organic fashion. From the beginning it has been our own demands in a workflow tool and .. (click here to read more)
Nextflow was one of the first workflow frameworks to provide built-in support for Docker containers. A couple of years ago we also started to experiment with the deployment of containerised bioinformatic pipelines at CRG, using Docker technology (see here and here).
We found that isolating and packaging the complete computational workflow environment with Docker images radically reduces the burden of maintaining the complex dependency graphs of real workload data analysis pipelines.
Even more importantly, the use of containers enables replicable results with minimal effort for the system configuration. The entire computational environment can be archived in a self-contained executable format, allowing the replication of the associated analysis at any point in time.
This ability is the main reason that drove the rapid adoption of Docker in the bioinformatic community and its support in many projects, for example Galaxy, CWL, Bioboxes, Dockstore and .. (click here to read more)
Learn how to deploy an elastic computing cluster in the AWS cloud with Nextflow
In the previous post I introduced the new cloud native support for AWS provided by Nextflow.
It allows the creation of a computing cluster in the cloud in a no-brainer way, enabling the deployment of complex computational pipelines in a few commands.
This solution is characterised by a lean application stack which does not require any third-party component to be installed in the EC2 instances other than a Java VM and the Docker engine (the latter is only required in order to deploy pipeline binary dependencies).
Each EC2 instance runs a script, at bootstrap time, that mounts the EFS storage and downloads and launches the Nextflow cluster daemon. This daemon is self-configuring: it automatically discovers the other running instances and .. (click here to read more)
Learn how to deploy and run a computational pipeline in the Amazon AWS cloud with ease thanks to Nextflow and Docker containers
Nextflow is a framework that simplifies the writing of parallel and distributed computational pipelines in a portable and reproducible manner across different computing platforms, from a laptop to a cluster of computers.
Indeed, the original idea, when this project started three years ago, was to implement a tool that would allow researchers in our lab to smoothly migrate their data analysis applications to the cloud when needed - without having to change or adapt their code.
However, to date Nextflow has been used mostly to deploy computational workflows within on-premise computing clusters or HPC data-centers, because these infrastructures are easier to use and provide, on average, lower cost and better performance when compared to a cloud environment.
A major obstacle to .. (click here to read more)
Below is a step-by-step guide for creating Docker images for use with Nextflow pipelines. This post was inspired by recent experiences and written with the hope that it may encourage others to join in the virtualization revolution.
Modern science is built on collaboration. Recently I became involved with one such venture between several groups across Europe. The aim was to annotate long non-coding RNA (lncRNA) in farm animals and I agreed to help with the annotation based on RNA-Seq data. The basic procedure relies on mapping short read data from many different tissues to a genome, generating transcripts and then determining if they are likely to be lncRNA or protein coding genes.
During several successful 'hackathon' meetings the best approach was decided and implemented in a joint effort. I undertook the task of wrapping the procedure up into a Nextflow pipeline with a view to replicating the results across our .. (click here to read more)
Publication time acts as a snapshot for scientific work. Whether a project is ongoing or not, work which was performed months ago must be described, new software documented, data collated and figures generated.
The monumental increase in data and pipeline complexity has led to this task being performed to many differing standards, or lack thereof. We all agree it is not good enough to simply note down the software version number. But what practical measures can be taken?
The recent publication describing Kallisto (Bray et al. 2016) provides an excellent high profile example of the growing efforts to ensure reproducible science in computational biology. The authors provide a GitHub repository that “contains all the analysis to reproduce the results in the kallisto paper”.
They should be applauded and indeed - in the Twittersphere - they were. The corresponding author Lior Pachter stated that the publication could be .. (click here to read more)
Recently a new feature has been added to Nextflow that allows failing jobs to be rescheduled, automatically increasing the amount of computational resources requested.
Nextflow provides a mechanism that allows tasks to be automatically re-executed when a command terminates with an error exit status. This is useful to handle errors caused by temporary or even permanent failures (e.g. network hiccups, broken disks) that may happen in a cloud based environment.
However, in an HPC cluster these events are very rare. In this scenario, error conditions are more likely to be caused by a peak in the computing resources used by a job, exceeding the resources originally requested. This leads to the batch scheduler killing the job, which in turn stops the overall pipeline execution.
In this context automatically re-executing the failed task is useless because it would simply replicate the same error condition. A common solution consists of increasing .. (click here to read more)
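The retry-with-more-resources idea described in this excerpt can be sketched as follows. This is a minimal illustration, assuming the standard Nextflow process directives (errorStrategy, maxRetries, memory, time) and the task.attempt counter; the process name, the command, the resource values and the exit codes treated as "killed by the scheduler" are hypothetical and will depend on your cluster.

```groovy
process longRunningTask {
    // Request more memory and time on every attempt (task.attempt starts at 1)
    memory { 2.GB * task.attempt }
    time { 1.hour * task.attempt }

    // Assumption: the scheduler kills over-limit jobs with an exit status in 137..140.
    // Only those failures are retried; any other error stops the pipeline.
    errorStrategy { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
    maxRetries 3

    script:
    """
    your_long_running_command
    """
}
```

With these settings the first attempt requests 2 GB and one hour, the second 4 GB and two hours, and so on, until the task succeeds or the retry limit is reached.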
As a new bioinformatics student with little formal computer science training, there are few things that scare me more than PhD committee meetings and having to run my code in a completely different operating environment.
Recently my work landed me in the middle of the phylogenetic tree jungle and the computational requirements of my project far outgrew the resources that were available on our institute’s Univa Grid Engine based cluster. Luckily for me, an opportunity arose to participate in a joint program at the MareNostrum HPC at the Barcelona Supercomputing Centre (BSC).
As one of the top 100 supercomputers in the world, the MareNostrum III dwarfs our cluster and consists of nearly 50'000 processors. However it soon became apparent that with great power comes great responsibility and in the case of the BSC, great restrictions. These include no internet access, restrictive wall times for jobs, longer queues, fewer .. (click here to read more)
The main goal of Nextflow is to make workflows portable across different computing platforms taking advantage of the parallelisation features provided by the underlying system without having to reimplement your application code.
From the beginning Nextflow has included executors designed to target the most popular resource managers and batch schedulers commonly used in HPC data centers, such as Univa Grid Engine, Platform LSF, SLURM, PBS and Torque.
When using one of these executors Nextflow submits the computational workflow tasks as independent job requests to the underlying platform scheduler, specifying for each of them the computing resources needed to carry out its job.
This approach works well for workflows that are composed of long running tasks, which is the case of most common genomic pipelines.
However this approach does not scale well for workloads made up of a large number of short-lived tasks (e.g. a few seconds .. (click here to read more)
In a recent publication we assessed the impact of Docker container technology on the performance of bioinformatic tools and data analysis workflows.
We benchmarked three different data analyses: an RNA-Seq pipeline for gene expression, a consensus assembly and variant calling pipeline, and finally a pipeline for the detection and mapping of long non-coding RNAs.
We found that Docker containers have only a minor impact on the performance of common genomic data analysis, which is negligible when the executed tasks are demanding in terms of computational time.
This publication is available as a PeerJ preprint at this link.
Innovation can be viewed as the application of solutions that meet new requirements or existing market needs. Academia has traditionally been the driving force of innovation. Scientific ideas have shaped the world, but only a few of them were brought to market by the inventing scientists themselves, resulting in both time and financial losses.
Lately there have been several attempts to boost scientific innovation and translation, the most notable in Europe being the Horizon 2020 funding program. The problem with these types of funding is that they are not designed for PhDs and Postdocs, but rather aim to promote the collaboration of senior scientists in different institutions. This neglects two very important facts. First and foremost, most of the Nobel prizes were given for discoveries made when scientists were in their 20s and 30s (not in their 50s and 60s). Secondly, innovation really happens when a few individuals (not .. (click here to read more)
The latest version of Nextflow introduces a new console graphical interface.
The Nextflow console is a REPL (read-eval-print loop) environment that allows one to quickly test part of a script or pieces of Nextflow code in an interactive manner.
It is a handy tool that allows one to evaluate fragments of Nextflow/Groovy code or quickly prototype a complete pipeline script.
The console application is included in the latest version of Nextflow (0.13.1 or higher).
You can try this feature out, provided Nextflow is installed on your computer, by entering the following command in your shell terminal: nextflow console
When you execute it for the first time, Nextflow will spend a few seconds downloading the required runtime dependencies. When complete, the console window will appear as shown in the picture below.
It contains a text editor (the top white box) that .. (click here to read more)
Scientific data analysis pipelines are rarely composed of a single piece of software. In a real world scenario, computational pipelines are made up of multiple stages, each of which can execute many different scripts, system commands and external tools deployed in a hosting computing environment, usually an HPC cluster.
As I work as a research engineer in a bioinformatics lab, I experience on a daily basis the difficulties related to keeping such a piece of software consistent.
Computing environments can change frequently in order to test new pieces of software or maybe because system libraries need to be updated. For this reason replicating the results of a data analysis over time can be a challenging task.
Docker has emerged recently as a new type of virtualisation technology that allows one to create a self-contained runtime environment. There are plenty of examples showing the benefits of using it to run .. (click here to read more)
The scientific world nowadays operates on the basis of published articles. These are used to report novel discoveries to the rest of the scientific community.
But have you ever wondered what a scientific article is? It is a:
Hence the very essence of Science relies on the ability of scientists to reproduce and build upon each other’s published results.
So how much can we rely on published data? In a recent report in Nature, researchers at the Amgen corporation found that only 11% of the academic research in the literature was reproducible by their groups.
While many factors are likely at play here, perhaps the most basic requirement for reproducibility holds that .. (click here to read more)
The GitHub code repository and collaboration platform is widely used among researchers to publish their work and to collaborate on project source code.
Even more interestingly, a few months ago GitHub announced improved support for researchers, making it possible to get a Digital Object Identifier (DOI) for any GitHub repository archive.
With a DOI for your GitHub repository archive, your code becomes formally citable in scientific publications.
The latest Nextflow release (0.9.0) seamlessly integrates with GitHub. This feature allows you to manage your code in a more consistent manner, or use other people's Nextflow pipelines, published through GitHub, in a quick and transparent manner.
The idea is very simple: when you launch a script execution with Nextflow, it will look for a file with the pipeline name you've specified. If that file does not exist, it will look for a public repository with .. (click here to read more)