More fun with containers in HPC

  • Paolo Di Tommaso
  • 20 December 2016

Nextflow was one of the first workflow frameworks to provide built-in support for Docker containers. A couple of years ago we also started to experiment with the deployment of containerised bioinformatics pipelines at CRG, using Docker technology (see here and here).

We found that isolating and packaging the complete computational workflow environment in Docker images radically reduces the burden of maintaining the complex dependency graphs of real-world data analysis pipelines.

Even more importantly, the use of containers enables replicable results with minimal effort for the system configuration. The entire computational environment can be archived in a self-contained executable format, allowing the replication of the associated analysis at any point in time.

This ability is the main reason for the rapid adoption of Docker in the bioinformatics community and for its support in many projects, such as Galaxy, CWL, Bioboxes, Dockstore and many others.

However, while the popularity of Docker has spread among developers, its adoption in research computing infrastructures remains very low, and it’s very unlikely that this trend will change in the future.

The reason for this lies in the Docker architecture, which requires a daemon running with root permissions on each node of a computing cluster. Such a requirement raises many security concerns, so good practice would prevent its use in shared HPC cluster or supercomputer environments.

Introducing Singularity

Fortunately, the interest in container technology has promoted alternative implementations, such as Singularity.

Singularity is a container engine developed at the Berkeley Lab and designed for the needs of scientific workloads. The main differences from Docker are: containers are file-based; no root escalation is allowed and no root permission is needed to run a container (although a privileged user is needed to create a container image); and there is no separate running daemon.

These features, along with others such as support for autofs mounts, make Singularity a container engine better suited to the requirements of HPC clusters and supercomputers.

Moreover, although Singularity uses a container image format different from Docker's, its developers provide a conversion tool that allows Docker images to be converted to the Singularity format.

Singularity in the wild

We integrated Singularity support into the Nextflow framework and tested it on the CRG computing cluster and the BSC MareNostrum supercomputer.

The absence of a separate running daemon or image gateway made the installation straightforward when compared to Docker or other solutions.

To evaluate the performance of Singularity we carried out the same benchmarks we performed for Docker and compared the results of the two engines.

The benchmarks consisted of the execution of three Nextflow-based genomic pipelines:

  1. Rna-toy: a simple pipeline for RNA-Seq data analysis.
  2. Nmdp-Flow: an assembly-based variant calling pipeline.
  3. Piper-NF: a pipeline for the detection and mapping of long non-coding RNAs.

In order to repeat the analyses, we converted the container images we used to perform the Docker benchmarks to Singularity image files by using the docker2singularity tool (this is not required anymore, see the update below).

The only change needed to run these pipelines with Singularity was to replace the Docker-specific settings in the configuration file with the following ones:

singularity.enabled = true
process.container = '<the image file path>'
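For comparison, here is a minimal sketch of how both engines might sit side by side in a single configuration file, using Nextflow configuration profiles. The profile names, the image file path and the use of the nextflow/rnatoy image name (borrowed from the update at the end of this post) are illustrative assumptions, not the settings used in the benchmarks.

```shell
# Write a hypothetical nextflow.config with one profile per container engine.
# Profile names and image locations are placeholders for illustration only.
cat > nextflow.config <<'EOF'
profiles {
    docker {
        docker.enabled = true
        process.container = 'nextflow/rnatoy'
    }
    singularity {
        singularity.enabled = true
        process.container = '/path/to/rnatoy.img'
    }
}
EOF

# The engine would then be selected at launch time, for example:
#   nextflow run <your pipeline> -profile singularity
```

Keeping both profiles in one file means the pipeline script itself stays untouched when moving between a Docker-enabled workstation and a Singularity-only cluster.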

Each pipeline was executed 10 times, alternating between Docker and Singularity as the container engine. The results are shown in the following table (time in minutes):

Pipeline      Tasks  Mean task time          Mean execution time     Execution time std dev  Ratio
                     Singularity     Docker  Singularity     Docker  Singularity     Docker
RNA-Seq           9         73.7       73.6        663.6      662.3          2.0        3.1  0.998
Variant call     48         22.1       22.4       1061.2     1074.4         43.1       38.5  1.012
Piper-NF         98          1.2        1.3        120.0      124.5          6.9        2.8  1.038

The benchmark results show that there isn’t any significant difference in the execution times of containerised workflows between Docker and Singularity. In two cases Singularity was slightly faster, and in the third it was almost identical, although a little slower than Docker.
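The post does not say how the Ratio column is defined. The numbers are consistent with it being the Docker mean execution time divided by the Singularity one, so that a value above 1 means Singularity was faster. The following sketch checks that reading with awk; the variable name and the comma-separated layout are choices made for this example only.

```shell
# Mean execution times in minutes, copied from the table above.
# Fields: pipeline, Singularity, Docker
# Assumption: Ratio = Docker time / Singularity time (inferred, not stated).
ratios=$(printf '%s\n' \
    'RNA-Seq,663.6,662.3' \
    'Variant call,1061.2,1074.4' \
    'Piper-NF,120.0,124.5' |
    LC_ALL=C awk -F, '{ printf "%s %.3f\n", $1, $3 / $2 }')

echo "$ratios"
```

Under this assumption the computed values match the Ratio column to three decimals (0.998, 1.012 and 1.038), which agrees with the reading that Singularity was marginally slower only for the RNA-Seq pipeline.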

Conclusion

In our evaluation Singularity proved to be an easy-to-install, stable and performant container engine.

The only minor drawback we found, compared to Docker, was the need to define the host path mount points statically when the Singularity images were created. In fact, even though Singularity allows user mount points to be defined dynamically when the container is launched, this feature requires the overlay file system, which was not supported by the kernel available on our system.

Docker surely will remain the de facto standard engine and image format for containers due to its popularity and impressive growth.

However, in our opinion, Singularity is the tool of choice for the execution of containerised workloads in the context of HPC, thanks to its focus on system security and its simpler architectural design.

The transparent support provided by Nextflow for both Docker and Singularity guarantees the ability to deploy your workflows on a range of different platforms (cloud, cluster, supercomputer, etc.). Nextflow manages the deployment of the containerised workload according to the runtime available on the target system.

Credits

Thanks to Gabriel Gonzalez (CRG), Luis Exposito (CRG) and Carlos Tripiana Montes (BSC) for their support in installing Singularity.

Update: since version 2.3.x, Singularity is able to pull and run Docker images from Docker Hub. This greatly simplifies interoperability with existing Docker containers. You only need to prefix the image name with the docker:// pseudo-protocol to download it as a Singularity image, for example:

singularity pull --size 1200 docker://nextflow/rnatoy
