A key feature of Nextflow is the ability to abstract the implementation of data analysis pipelines so they can be deployed in a portable manner across execution platforms.
As of today, Nextflow supports a rich variety of HPC schedulers and all major cloud providers. Our goal is to support new services as they emerge to enable Nextflow users to take advantage of the latest technology and deploy pipelines on the compute environments that best fit their requirements.
For this reason, we are delighted to announce that Nextflow now supports Google Cloud Batch, a new fully managed batch service just announced for beta availability by Google Cloud.
Google Cloud Batch is a comprehensive cloud service suitable for multiple use cases, including HPC, AI/ML, and data processing. While it is similar to the Google Cloud Life Sciences API, used by many Nextflow users today, Google Cloud Batch offers a broader set of capabilities. As with Google Cloud Life Sciences, Google Cloud Batch automatically provisions resources, manages capacity, and allows batch workloads to run at scale. It offers several advantages, including:
A nice feature of Google Cloud Batch API, that fits nicely with Nextflow, is its built-in support for data ingestion from Google Cloud Storage buckets. A batch job can mount a storage bucket and make it directly accessible to a container running a Nextflow task. This feature makes data ingestion and sharing resulting data sets more efficient and reliable than other solutions.
Support for the Google Cloud Batch requires the latest release of Nextflow from the edge channel (version
22.07.1-edge or later). If you don't already have it, you can install this release using these commands:
export NXF_EDGE=1 curl get.nextflow.io | bash ./nextflow -self-update
Make sure your Google account is allowed to access the Google Cloud Batch service by checking the API & Service dashboard.
Credentials for accessing the service are picked up by Nextflow from your environment using the usual Google Application Default Credentials mechanism. That is, either via the
GOOGLE_APPLICATION_CREDENTIALS environment variable, or by using the following command to set up the environment:
gcloud auth application-default login
After authenticating yourself to Google Cloud, create a
nextflow.config file and specify
google-batch as the Nextflow executor. You will also need to specify the Google Cloud project where execution will occur and the Google Cloud Storage working directory for pipeline execution.
cat <<EOT > nextflow.config process.executor = 'google-batch' workDir = 'gs://YOUR-GOOGLE-BUCKET/scratch' google.project = 'YOUR GOOGLE PROJECT ID' EOT
In the above snippet replace
<YOUR_GOOGLE_BUCKET> with a Google Storage bucket of your choice where to store the pipeline output data and
<YOUR_GOOGLE_PROJECT_ID> with your Google project Id where the computation will be deployed.
With this information, you are ready to start. You can verify that the integration is working by running the Nextflow “hello” pipeline as shown below:
nextflow run https://github.com/nextflow-io/hello
Google Cloud Life Sciences users can easily migrate their pipelines to Google Cloud Batch by making just a few edits to their pipeline configuration settings. Simply replace the
google-lifesciences executor with
For each setting having the prefix
google.lifeScience., there is a corresponding
google.batch. setting. Simply update these configuration settings to reflect the new service.
The usual process directives such as: cpus, memory, time, machineType are natively supported by Google Cloud Batch, and should not be modified.
Find out more details in the Nextflow documentation.
The Google Cloud Batch executor for Nextflow is offered as an open source contribution to the Nextflow project. The integration was developed by Google in collaboration with Seqera Labs. This is a validation of Google Cloud’s ongoing commitment to open source software (OSS) and a testament to the health and vibrancy of the Nextflow project. We wish to thank the entire Google Cloud Batch team, and Shamel Jacobs in particular, for their support of this effort.
Support for Google Cloud Batch further expands the wide range of computing platforms supported by Nextflow. It empowers Nextflow users to easily access cost-effective resources, and take full advantage of the rich capabilities of the Google Cloud. Above all, it enables researchers to easily scale and collaborate, improving their productivity, and resulting in better research outcomes.