In 2023, the world of Nextflow is more exciting than ever! With new resources constantly being released, there is no better time to dive into this powerful tool. From a new Software Carpentries’ course to an in-depth write-up by 23andMe to new tutorials on Wave and Fusion, the options for learning Nextflow are endless.
We've compiled a list of the best resources in 2023 to make your journey to Nextflow mastery as seamless as possible. And remember, Nextflow is a community-driven project. If you have suggestions or want to contribute to this list, head to the GitHub page and make a pull request.
Before learning Nextflow, you should be comfortable with the Linux command line and be familiar with some basic scripting languages such as Perl or Python. The beauty of Nextflow is that task logic can be written in your language of choice. You will just need to learn Nextflow’s domain-specific language (DSL) to control overall flow.
Nextflow is widely used in bioinformatics, so many tutorials focus on life sciences. However, Nextflow can be used for almost any data-intensive workflow, including image analysis, ML model training, astronomy, and geoscience applications.
So, let's get started! These resources will guide you from beginner to expert and make you unstoppable in the field of scientific workflows.
There are hundreds of workflow managers to choose from. In fact, Meir Wahnon and several of his colleagues have gone to the trouble of compiling an awesome-workflow-engines list. The workflows community initiative is another excellent source of information about workflow engines.
Some of the best publicly available tutorials are listed below:
This informative post by Anuved Verma and Anja Bog begins with the basic concepts of Nextflow. It builds towards how Nextflow is used at 23andMe. It includes a detailed explanation of how 23andMe runs their genotype imputation pipelines in the cloud, processing over 1 million individual samples daily in an environment with 10,000+ CPUs in a single compute environment.
This hands-on tutorial from Seqera Labs will guide you through implementing a proof-of-concept RNA-seq pipeline. The goal is to become familiar with basic concepts, including defining parameters, using channels to pass data, and writing processes to perform tasks. It includes all scripts, input data, and resources and is perfect for getting a taste of Nextflow. One thing to consider is that some of the examples pre-date the availability of DSL2 – a set of syntax extensions to Nextflow that allows for defining module libraries and simplifying the writing of complex data analysis pipelines. The tutorial is still a great place to start, but you’ll want to learn about DSL 2 as well.
Here you’ll dive deeper into Nextflow’s most prominent features and learn how to apply them. The full workshop includes an excellent section on containers, how to build them, and how to use them with Nextflow. The written materials come with examples and hands-on exercises. Optionally, you can also follow with a series of videos from a live training workshop. Better yet, these examples are updated to reflect the new DSL 2 syntax and concepts such as modules, so you can learn to take advantage of the latest powerful Nextflow features.
The workshop includes topics on:
The Nextflow Software Carpentry workshop (still being developed) explains the use of Nextflow and nf-core as development tools for building and sharing reproducible data science workflows. The intended audience is those with little programming experience. The course provides a foundation to write and run Nextflow and nf-core workflows comfortably. Adapted from the Seqera training material above, the workshop has been updated by Software Carpentries instructors within the nf-core community to fit The Carpentries training style. The Carpentries emphasize feedback to improve teaching materials, so we would like to hear back from you about what you thought was well-explained and what needs improvement. Pull requests to the course material are very welcome. The workshop can be opened on Gitpod where you can try the exercises in an online computing environment at your own pace, while referencing the course material in another window alongside the tutorials.
The workshop can be opened on Gitpod where you can try the exercises in an online computing environment at your own pace, while referencing the course material in another window alongside the tutorials.
You can find the course in The Carpentries incubator.
This on-demand webinar features Phil Ewels from SciLifeLab, nf-core (and now also Seqera Labs), Brendan Boufler from Amazon Web Services, and Evan Floden from Seqera Labs. The wide-ranging discussion covers the significance of scientific workflows, examples of Nextflow in production settings, and how Nextflow can be integrated with other processes.
This advanced documentation discusses recurring patterns in Nextflow and solutions to many common implementation requirements. Code examples are available with notes to follow along and a GitHub repository.
Nextflow Patterns & GitHub repository.
A set of tutorials covering the basics of using and creating nf-core pipelines developed by the team at nf-core. These tutorials provide an overview of the nf-core framework, including:
nf-core usage tutorials and nf-core developer tutorials.
A collection of awesome Nextflow pipelines compiled by various contributors to the open-source Nextflow project.
Awesome Nextflow and GitHub
Wave and the Fusion file system are new Nextflow capabilities introduced in November 2022. Wave is a container provisioning and augmentation service fully integrated with the Nextflow ecosystem. Instead of viewing containers as separate artifacts that need to be integrated into a pipeline, Wave allows developers to manage containers as part of the pipeline itself.
Tightly coupled with Wave is the new Fusion 2.0 file system. Fusion implements a virtual distributed file system and presents a thin client allowing data hosted in AWS S3 buckets (and other object stores in the future) to be accessed via the standard POSIX filesystem interface expected by most applications.
Wave can help simplify development, improve reliability, and make pipelines easier to maintain. It can even improve pipeline performance. The optional Fusion 2.0 file system offers further advantages delivering performance on par with FSx for Lustre while enabling organizations to reduce their cloud computing bill and improve pipeline efficiency throughput. See the blog article released in February 2023 explaining the Fusion file system and providing benchmarks comparing Fusion to other data handling approaches in the cloud.
Wave showcase on GitHub
While not strictly a guide about Nextflow, this article provides an overview of scientific containers and provides a tutorial involved in creating your own container and integrating it into a Nextflow pipeline. It also provides some useful tips on troubleshooting containers and publishing them to registries.
Building Containers for Scientific Workflows
When building Nextflow pipelines, a best practice is to supply a nextflow_schema.json file describing pipeline input parameters. The benefit of adding this file to your code repo, is that if the pipeline is launched using Nextflow, the schema enables an easy-to-use web interface that users through the process of parameter selection. While it is possible to craft this file by hand, the nf-core community provides a handy schema build tool. This step-by-step guide explains how to adapt your pipeline for use with Nextflow Tower by using the schema build tool to automatically generate the nextflow_schema.json file.
Best Practices for Deploying Pipelines with Nextflow Tower
In addition to the learning resources above, several step-by-step integration guides explain how to run Nextflow pipelines on your cloud platform of choice. Some of these tutorials extend to the use of Nextflow Tower. Organizations can use the Tower Cloud Free edition to launch pipelines quickly in the cloud. Organizations can optionally use Tower Cloud Professional or run self-hosted or on-premises Tower Enterprise environments as requirements grow. This year, we added Google Cloud Batch to the cloud services supported by Nextflow.
This three-part series of articles provides a step-by-step guide explaining how to use Nextflow with AWS Batch. The first of three articles covers AWS Batch concepts, the Nextflow execution model, and explains how the integration works under the covers. The second article in the series provides a step by step guide explaining how to set up the AWS batch environment and how to run and troubleshoot open-source Nextflow pipelines. The third article builds on what you've learned, explaining how to integrate workflows with Nextflow Tower and share the AWS Batch environment with other users by "publishing" your workflows to the cloud.
Nextflow and AWS Batch — Inside the Integration (part 1 of 3, part 2 of 3, part 3 of 3)
Similar to the tutorial above, this set of articles does a deep dive into the Nextflow Azure Batch integration. Part 1 covers Azure Batch and essential concepts, provides an overview of the integration and explains how to set up Azure Batch and Storage accounts. It also covers deploying a machine instance in the Azure cloud and configuring it to run Nextflow pipelines against the Azure Batch service.
Part 2 builds on what you learned in part 1 and shows how to use Azure Batch from within Nextflow Tower Cloud. It provides a walkthrough of how to make the environment set up in part 1 accessible to users through Tower's intuitive web interface.
Nextflow and Azure Batch — Inside the Integration (part 1 of 2, part 2 of 2)
This excellent article by Marcel Ribeiro-Dantas provides a step-by-step tutorial on using Nextflow with Google’s new Google Cloud Batch service. Google Cloud Batch is expected to replace the Google Life Sciences integration over time. The article explains how to deploy the Google Cloud Batch and Storage environments in GCP using the gcloud CLI. It then goes on to explain how to configure Nextflow to launch pipelines into the newly created Google Cloud Batch environment.
Get started with Nextflow on Google Cloud Batch
While not commonly used for HPC workloads, Kubernetes has clear momentum. In this educational article, Ben Sherman provides an overview of how the Nextflow / Kubernetes integration has been simplified by avoiding the requirement for Persistent Volumes (PVs) and Persistent Volume Claims (PVCs). This detailed guide provides step-by-step instructions for using Amazon EKS as a compute environment complete with how to configure IAM Roles for Kubernetes Services Accounts (IRSA), now an Amazon EKS best practice.
Nextflow and K8s Rebooted: Running Nextflow on Amazon EKS
The following resources will help you dig deeper into Nextflow and other related projects like the nf-core community who maintain curated pipelines and a very active Slack channel. There are plenty of Nextflow tutorials and videos online, and the following list is no way exhaustive. Please let us know if we are missing anything.
The reference for the Nextflow language and runtime. These docs should be your first point of reference while developing Nextflow pipelines. The newest features are documented in edge documentation pages released every month with the latest stable releases every three months.
Latest stable & edge documentation.
An index of documentation, deployment guides, training materials and resources for all things Nextflow and Tower.
nf-core is a growing community of Nextflow users and developers. You can find curated sets of biomedical analysis pipelines written in Nextflow and built by domain experts. Each pipeline is stringently reviewed and has been implemented according to best practice guidelines. Be sure to sign up to the Slack channel.
nf-core website and nf-core Slack
Nextflow Tower is a platform to easily monitor, launch and scale Nextflow pipelines on cloud providers and on-premise infrastructure. The documentation provides details on setting up compute environments, monitoring pipelines and launching using either the web graphic interface, CLI or API.
Nextflow Tower and user documentation.
Part of the Genomics Workflows on AWS, Amazon provides a quickstart for deploying a genomics analysis environment on Amazon Web Services (AWS) cloud, using Nextflow to create and orchestrate analysis workflows and AWS Batch to run the workflow processes. While this article is packed with good information, the procedure outlined in the more recent Nextflow and AWS Batch – Inside the integration series, may be an easier place to start. Some of the steps that previously needed to be performed manually have been updated in the latest integration.
Nextflow on Azure requires at minimum two Azure services, Azure Batch and Azure Storage. Follow the guide below developed by the team at Microsoft to set up both services on Azure, and to get your storage and batch account names and keys.
Azure Blog and GitHub repository.
A step-by-step guide to launching Nextflow Pipelines in Google Cloud. Note that this integration process is specific to Google Life Sciences – an offering that pre-dates Google Cloud Batch. If you want to use the newer integration approach, you can also check out the Nextflow blog article Get started with Nextflow on Google Cloud Batch.
[Nextflow on Google Cloud](https://cloud.google.com/life-sciences/docs/tutorials/nextflow]
This Nextflow Tutorial - Variant Calling Edition has been adapted from the Nextflow Software Carpentry training material and Data Carpentry: Wrangling Genomics Lesson. Learners will have the chance to learn Nextflow and nf-core basics, to convert a variant-calling bash-script into a Nextflow workflow and to modularize the pipeline using DSL2 modules and sub-workflows.
The workshop can be opened on Gitpod where you can try the exercises in an online computing environment at your own pace, with the course material in another window alongside.
You can find the course in Nextflow Tutorial - Variant Calling Edition.