Pipeline sharing

Nextflow seamlessly integrates with BitBucket [1], GitHub, and GitLab hosted code repositories and sharing platforms. This feature allows you to manage your project code in a more consistent manner or use other people's Nextflow pipelines, published through BitBucket/GitHub/GitLab, in a quick and transparent way.

How it works

When you launch a script execution with Nextflow, it will look for a file with the pipeline name you've specified. If that file does not exist, it will look for a public repository with the same name on GitHub (unless otherwise specified). If it is found, the repository is automatically downloaded to your computer and executed. This repository is stored in the Nextflow home directory, that is by default the $HOME/.nextflow path, and thus will be reused for any further executions.

Running a pipeline

To launch the execution of a pipeline project, hosted in a remote code repository, you simply need to specify its qualified name or the repository URL after the run command. The qualified name is formed by two parts: the owner name and the repository name separated by a / character.

In other words if a Nextflow project is hosted, for example, in a GitHub repository at the address http://github.com/foo/bar, it can be executed by entering the following command in your shell terminal:

nextflow run foo/bar

or using the project URL:

nextflow run http://github.com/foo/bar

Note

In the first case, if your project is hosted on a service other than GitHub, you will need to specify this hosting service in the command line by using the -hub option. For example -hub bitbucket or -hub gitlab. In the second case, i.e. when using the project URL as name, the -hub option is not needed.

You can try this feature out by simply entering the following command in your shell terminal:

nextflow run nextflow-io/hello

It will download a trivial Hello example from the repository published at the following address http://github.com/nextflow-io/hello and execute it in your computer.

If the owner part in the pipeline name is omitted, Nextflow will look for a pipeline between the ones you have already executed having a name that matches the name specified. If none is found it will try to download it using the organisation name defined by the environment variable NXF_ORG (which by default is nextflow-io).

Tip

To access a private repository, specify the access credentials by using the -user command line option, then the program will ask you to enter the password interactively. Private repositories access credentials can also be defined in the SCM configuration file.

Handling revisions

Any Git branch, tag or commit ID defined in a project repository, can be used to specify the revision that you want to execute when launching a pipeline by adding the -r option to the run command line. So for example you could enter:

nextflow run nextflow-io/hello -r mybranch

or

nextflow run nextflow-io/hello -r v1.1

It will execute two different project revisions corresponding to the Git tag/branch having that names.

Commands to manage projects

The following commands allows you to perform some basic operations that can be used to manage your projects.

Note

Nextflow is not meant to replace functionalities provided by the Git tool. You may still need it to create new repositories or commit changes, etc.

Listing available projects

The list command allows you to list all the projects you have downloaded in your computer. For example:

nextflow list

This prints a list similar to the following one:

cbcrg/ampa-nf
cbcrg/piper-nf
nextflow-io/hello
nextflow-io/examples

Showing project information

By using the info command you can show information from a downloaded project. For example:

project name: nextflow-io/hello
repository  : http://github.com/nextflow-io/hello
local path  : $HOME/.nextflow/assets/nextflow-io/hello
main script : main.nf
revisions   :
* master (default)
  mybranch
  v1.1 [t]
  v1.2 [t]

Starting from the top it shows: 1) the project name; 2) the Git repository URL; 3) the local folder where the project has been downloaded; 4) the script that is executed when launched; 5) the list of available revisions i.e. branches and tags. Tags are marked with a [t] on the right, the current checked-out revision is marked with a * on the left.

Pulling or updating a project

The pull command allows you to download a project from a GitHub repository or to update it if that repository has already been downloaded. For example:

nextflow pull nextflow-io/examples

Altenatively, you can use the repository URL as the name of the project to pull:

nextflow pull https://github.com/nextflow-io/examples

Downloaded pipeline projects are stored in the folder $HOME/.nextflow/assets in your computer.

Viewing the project code

The view command allows you to quickly show the content of the pipeline script you have downloaded. For example:

nextflow view nextflow-io/hello

By adding the -l option to the example above it will list the content of the repository.

Cloning a project into a folder

The clone command allows you to copy a Nextflow pipeline project to a directory of your choice. For example:

nextflow clone nextflow-io/hello target-dir

If the destination directory is omitted the specified project is cloned to a directory with the same name as the pipeline base name (e.g. hello) in the current folder.

The clone command can be used to inspect or modify the source code of a pipeline project. You can eventually commit and push back your changes by using the usual Git/GitHub workflow.

Deleting a downloaded project

Downloaded pipelines can be deleted by using the drop command, as shown below:

nextflow drop nextflow-io/hello

SCM configuration file

The file $HOME/.nextflow/scm allows you to centralise the security credentials required to access private project repositories on Bitbucket, GitHub and GitLab source code management (SCM) platforms or to manage the configuration properties of private server installations (of the same platforms).

The configuration properties for each SCM platform are defined inside the providers section, properties for the same provider are grouped together with a common name and delimited with curly brackets as in this example:

providers {
    <provider-name> {
        property = value
        :
    }
}

In the above template replace <provider-name> with one of the "default" servers (i.e. bitbucket, github or gitlab) or a custom identifier representing a private SCM server installation.

The following configuration properties are supported for each provider configuration:

Name Description
user User name required to access private repositories on the SCM server.
password User password required to access private repositories on the SCM server.
token Private API access token (used only when the specified platform is gitlab).
* platform SCM platform name, either: github, gitlab or bitbucket.
* server SCM server name including the protocol prefix e.g. https://github.com.
* endpoint SCM API endpoint URL e.g. https://api.github.com (default: the same value specified for server).

The attributes marked with a * are only required when defining the configuration of a private SCM server.

BitBucket credentials

Create a bitbucket entry in the SCM configuration file specifying your user name and password, as shown below:

providers {

    bitbucket {
        user = 'me'
        password = 'my-secret'
    }

}

GitHub credentials

Create a github entry in the SCM configuration file specifying your user name and password as shown below:

providers {

    github {
        user = 'me'
        password = 'my-secret'
    }

}

Tip

You can use use a Personal API token in place of your GitHub password.

GitLab credentials

Create a gitlab entry in the SCM configuration file specifying the user name, password and your API access token that can be found in your GitLab account page (sign in required). For example:

providers {

    gitlab {
        user = 'me'
        password = 'my-secret'
        token = 'YgpR8m7viH_ZYnC8YSe8'
    }

}

Private server configuration

Nextflow is able to access repositories hosted on private BitBucket, GitHub and GitLab server installations.

In order to use a private SCM installation you will need to set the server name and access credentials in your SCM configuration file .

If, for example, the host name of your private GitLab server is gitlab.acme.org, you will need to have in the $HOME/.nextflow/scm file a configuration like the following:

providers {

    mygit {
        server = 'http://gitlab.acme.org'
        platform = 'gitlab'
        user = 'your-user'
        password = 'your-password'
        token = 'your-api-token'
    }

}

Then you will be able to run/pull a project with Nextflow using the following command line:

$ nextflow run foo/bar -hub mygit

Or, in alternative, using the Git clone URL:

$ nextflow run http://gitlab.acme.org/foo/bar.git

Warning

When accessing a private SCM installation over https and that server uses a custom SSL certificate you may need to import such certificate into your local Java keystore. Read more here.

Publishing your pipeline

In order to publish your Nextflow pipeline to GitHub (or any other supported platform) and allow other people to use it, you only need to create a GitHub repository containing all your project script and data files. If you don't know how to do it, follow this simple tutorial that explains how create a GitHub repository.

Nextflow only requires that the main script in your pipeline project is called main.nf. A different name can be used by specifying the manifest.mainScript attribute in the nextflow.config file that must be included in your project. For example:

manifest.mainScript = 'my_very_long_script_name.nf'

To learn more about this and other project metadata information, that can be defined in the Nextflow configuration file, read the Manifest section on the Nextflow configuration page.

Once you have uploaded your pipeline project to GitHub other people can execute it simply using the project name or the repository URL.

For if your GitHub account name is foo and you have uploaded a project into a repository named bar the repository URL will be http://github.com/foo/bar and people will able to download and run it by using either the command:

nextflow run foo/bar

or

nextflow run http://github.com/foo/bar

See the Running a pipeline section for more details on how to run Nextflow projects.

Manage dependencies

Computational pipelines are rarely composed by a single script. In real world applications they depend on dozens of other components. These can be other scripts, databases, or applications compiled for a platform native binary format.

External dependencies are the most common source of problems when sharing a piece of software, because the users need to have an identical set of tools and the same configuration to be able to use it. In many cases this has proven to be a painful and error prone process, that can severely limit the ability to reproduce computational results on a system other than the one on which it was originally developed.

Nextflow tackles this problem by integrating GitHub, BitBucket and GitLab sharing platforms and Docker containers technology.

The use of a code management system is important to keep together all the dependencies of your pipeline project and allows you to track the changes of the source code in a consistent manner.

Moreover to guarantee that a pipeline is reproducible it should be self-contained i.e. it should have ideally no dependencies on the hosting environment. By using Nextflow you can achieve this goal following these methods:

Third party scripts

Any third party script that does not need to be compiled (BASH, Python, Perl, etc) can be included in the pipeline project repository, so that they are distributed with it.

Grant the execute permission to these files and copy them into a folder named bin/ in the root directory of your project repository. Nextflow will automatically add this folder to the PATH environment variable, and the scripts will automatically be accessible in your pipeline without the need to specify an absolute path to invoke them.

System environment

Any environment variable that may be required by the tools in your pipeline can be defined in the nextflow.config file by using the env scope and including it in the root directory of your project. For example:

env {
  DELTA = 'foo'
  GAMMA = 'bar'
}

See the Configuration page to learn more about the Nextflow configuration file.

Resource manager

When using Nextflow you don't need to write the code to parallelize your pipeline for a specific grid engine/resource manager because the parallelization is defined implicitly and managed by the Nextflow runtime. The target execution environment is parametrized and defined in the configuration file, thus your code is free from this kind of dependency.

Bootstrap data

Whenever your pipeline requires some files or dataset to carry out any initialization step, you can include this data in the pipeline repository itself and distribute them together.

To reference this data in your pipeline script in a portable manner (i.e. without the need to use a static absolute path) use the implicit variable baseDir which locates the base directory of your pipeline project.

For example, you can create a folder named dataset/ in your repository root directory and copy there the required data file(s) you may need, then you can access this data in your script by writing:

sequences = file("$baseDir/dataset/sequences.fa")
sequences.splitFasta {
     println it
 }

User inputs

Nextflow scripts can be easily parametrised to allow users to provide their own input data. Simply declare on the top of your script all the parameters it may require as shown below:

params.my_input = 'default input file'
params.my_output = 'default output path'
params.my_flag = false
..

The actual parameter values can be provided when launching the script execution on the command line by prefixed the parameter name with a double minus character i.e. --, for example:

nextflow run <your pipeline> --my_input /path/to/input/file --my_output /other/path --my_flag true

Binary applications

Docker allows you to ship any binary dependencies that you may have in your pipeline to a portable image that is downloaded on-demand and can be executed on any platform where a Docker engine is installed.

In order to use it with Nextflow, create a Docker image containing the tools needed by your pipeline and make it available in the Docker registry.

Then declare in the nextflow.config file, that you will include in your project, the name of the Docker image you have created. For example:

process.container = 'my-docker-image'
docker.enabled = true

In this way when you launch the pipeline execution, the Docker image will be automatically downloaded and used to run your tasks.

Read the Docker containers page to lean more on how to use Docker containers with Nextflow.

This mix of technologies makes it possible to write self-contained and truly reproducible pipelines which require zero configuration and can be reproduced in any system having a Java VM and a Docker engine installed.

[1]BitBucket provides two types of version control system: Git and Mercurial. Nextflow supports only Git based repositories.