<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>nucleotid.es</title>
  <id>http://nucleotid.es</id>
  <link href="http://nucleotid.es"/>
  <link href="http://nucleotid.es/atom.xml" rel="self" type="application/atom+xml"/>
  <author>
    <name>Michael Barton</name>
    <email>mail@michaelbarton.me.uk</email>
    <uri>http://www.michaelbarton.me.uk</uri>
  </author>
  <updated>2015-01-07T00:00:00-08:00</updated>
  <entry>
    <title>Why use containers for scientific software?</title>
    <id>tag:nucleotid.es,2015-01-07:post-4</id>
    <link rel="alternate" href="http://nucleotid.es/blog/why-use-containers/"/>
    <published>2015-01-07T00:00:00-08:00</published>
    <updated>2015-01-07T00:00:00-08:00</updated>
    <author>
      <name>Michael Barton</name>
      <email>mail@michaelbarton.me.uk</email>
      <uri>http://www.michaelbarton.me.uk</uri>
    </author>
    <content type="html">
&lt;p&gt;Nucleotid.es benchmarking data has been available for seven months, growing
from a single table of results for one organism to the current ~1,900
replicated benchmarks across multiple organisms. There have been discussions on
the suitability of containers for this kind of approach. In particular, one
question is how they contribute to reproducibility in science. For example,
Titus Brown wrote a blog post describing &lt;a href="http://ivory.idyll.org/blog/2014-containers.html"&gt;a post-apocalyptic world of binary
containers&lt;/a&gt;, and a &lt;a href="https://twitter.com/sjackman/status/537723151057039362"&gt;discussion started on Twitter by Shaun
Jackman&lt;/a&gt; led to many replies.&lt;/p&gt;

&lt;h3 id="reproducibility"&gt;Reproducibility&lt;/h3&gt;

&lt;p&gt;If, when we talk about containers, we specifically mean the Docker
implementation, and we almost certainly do, then I disagree with the description
of these as ‘binary blobs’ that cannot be understood. You can run the &lt;code&gt;docker
export&lt;/code&gt; command to get a .tar of the container’s file system. A Docker
container is not compiled in the way a C or Java program is; instead it is a
series of transparent file system layers. The act of containerising scientific
software does not obscure how it works or make it inaccessible.&lt;/p&gt;
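
&lt;p&gt;As a minimal sketch, the file system of any image can be dumped and
inspected without running it. The image name here is one of the project’s, but
any image works the same way:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Create a stopped container from the image, then export its file system.
docker create --name inspect nucleotides/velvet
docker export inspect &gt; velvet-fs.tar
docker rm inspect

# The tar contains ordinary, readable files rather than an opaque binary.
tar -tf velvet-fs.tar | head
&lt;/code&gt;&lt;/pre&gt;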

&lt;p&gt;I think that containers make for more reproducible science. A Dockerfile
explicitly shows the steps required to compile and organise the code. This is
better than providing the source code alone. I can illustrate this with two
example Dockerfiles for genome assembly containers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://github.com/nucleotides/docker-velvet/blob/master/Dockerfile"&gt;velvet + kmergenie&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://github.com/nucleotides/docker-idba/blob/master/Dockerfile"&gt;idba&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
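
&lt;p&gt;The pattern both files follow can be sketched as below. This is a
deliberately minimal, hypothetical example; the URL, version and build steps
are placeholders, not the contents of the files linked above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;FROM ubuntu:14.04

# Declare build dependencies explicitly rather than assuming them.
RUN apt-get update &amp;&amp; apt-get install -y build-essential wget

# Fetch, compile and install a fixed version of the assembler.
RUN wget http://example.org/assembler-1.0.tar.gz \
 &amp;&amp; tar xzf assembler-1.0.tar.gz \
 &amp;&amp; cd assembler-1.0 \
 &amp;&amp; make \
 &amp;&amp; cp assembler /usr/local/bin/
&lt;/code&gt;&lt;/pre&gt;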

&lt;p&gt;I hope we can agree that neither of these is a trivial install. The
advantage of using a container is that it saves everyone else from having to do
this. More importantly it saves them from having to &lt;strong&gt;learn&lt;/strong&gt; how to do this.
There is a case for encouraging non-computational biologists to learn to code
but not for forcing them to debug g++ errors.&lt;/p&gt;

&lt;p&gt;My favourite way to describe this is as “deduplication of agony”. We can take
the pain of compiling and installing often buggy and undocumented
bioinformatics code, which we currently force on our users, and move it into a
container. Instead of making everyone else do this work we can ask the person
who knows best to do it: the developer.&lt;/p&gt;

&lt;h3 id="standardisation"&gt;Standardisation&lt;/h3&gt;

&lt;p&gt;A second argument is that containers are ‘black boxes’ and cannot be used with
other tools. For instance, suppose I give you a container with a working version
of SPAdes or ABySS. This solves the problem of getting the software to run,
but you still have to use it to produce results. This is what nucleotid.es aims
to solve.&lt;/p&gt;

&lt;p&gt;I have taken some of the most popular genome assemblers and containerised them.
Importantly these have all been standardised behind the same interface so they
can all be used in exactly the same way. This means that if you are using
assembler X and then new data suggests that assembler Y is better, you can
immediately switch between the two containers because they are run on the
command line identically. You can use all these containers interchangeably in
your own custom pipelines with minimal development effort.&lt;/p&gt;
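
&lt;p&gt;As a sketch of what this interchangeability looks like in practice, only the
image name changes between runs. The mount points and arguments below are
illustrative, not the project’s exact interface:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Swap assemblers by changing a single word; the invocation is otherwise identical.
for assembler in velvet idba abyss; do
  docker run -v "$PWD/reads:/reads" -v "$PWD/out/${assembler}:/out" \
    "nucleotides/${assembler}" default
done
&lt;/code&gt;&lt;/pre&gt;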

&lt;p&gt;Nucleotid.es provides the data to allow you to make the decisions about which
assembler to use. I have taken Illumina reads from bacterial organisms of
different sizes and %GC content and assembled them using the assembler
containers. This provides concrete information how you might expect each
assembler to perform on variety of data. Furthermore because the assembler was
benchmarked as a container, the results are guaranteed to be the same for you
as they were for me when I ran the analysis. This would not be the case
without a standardised interface, because I could not otherwise share the
container with you and expect you to reproduce my results. This is why
standardisation is as important as containerisation.&lt;/p&gt;

&lt;h3 id="summary"&gt;Summary&lt;/h3&gt;

&lt;p&gt;At the JGI we produce thousands of assemblies and terabases of sequence data
each year. The days when we manually improved genome drafts have long passed.
This may not yet be the case for smaller research labs, but as sequencing
becomes cheaper and is generated in ever larger volumes it soon will be.
Nucleotid.es aims to let us make data-driven decisions about what software to
use so that we can do assembly in the large.&lt;/p&gt;

&lt;p&gt;Using containers allows us to reliably understand what kind of results we
might expect from an assembler. When someone inevitably produces a better
assembler, we can identify it immediately and insert it into our pipelines,
and it allows you to do the same.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>November 2014 Update</title>
    <id>tag:nucleotid.es,2014-11-17:post-3</id>
    <link rel="alternate" href="http://nucleotid.es/blog/2014-11/"/>
    <published>2014-11-17T00:00:00-08:00</published>
    <updated>2014-11-17T00:00:00-08:00</updated>
    <author>
      <name>Michael Barton</name>
      <email>mail@michaelbarton.me.uk</email>
      <uri>http://www.michaelbarton.me.uk</uri>
    </author>
    <content type="html">
&lt;p&gt;This update to &lt;a href="http://nucleotid.es"&gt;nucleotid.es&lt;/a&gt; includes additional
assemblers and a new method of summarising performance across benchmarks. There
are also minor site changes and updates to the benchmark metrics.&lt;/p&gt;

&lt;h3 id="maximum-likelihood-estimates-of-performance"&gt;Maximum likelihood estimates of performance&lt;/h3&gt;

&lt;p&gt;I have made a large change to how the benchmarks are summarised. Instead of
using a voting method, the results are now summarised using linear modelling.
Each benchmark metric is modelled as &lt;code&gt;metric ~ assembler + genome&lt;/code&gt; using a
generalised linear model. This model estimates the maximum-likelihood
coefficients for how much each genome affects the evaluation metric and how
well each assembler performs.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://nucleotid.es/blog/2014-10/"&gt;Last month I outlined&lt;/a&gt; how each set of reads for each genome was
subsampled to generate five replicates and each assembler was evaluated against
all replicates. There are 16 genomes, giving 80 data points for each assembler
and ~1,900 data points in total for linear modelling. I used the &lt;code&gt;glm()&lt;/code&gt; function in R to
model four assembly metrics: NG50, percent unassembled, incorrect per 100KBp,
and number of local misassemblies. The results are shown in the updated
&lt;a href="http://nucleotid.es/results/"&gt;nucleotid.es summary page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Each column shows the coefficients for a different model. For example the NG50
column is the coefficients of the &lt;code&gt;assembler&lt;/code&gt; term in the model: &lt;code&gt;NG50 ~
assembler + genome&lt;/code&gt;. As the NG50 metrics are log-normally distributed, the model
was specified with a log link, i.e. &lt;code&gt;NG50 ~ e^(assembler + genome)&lt;/code&gt;. This is why
the coefficients are small: they are additive on the log scale, and act
multiplicatively rather than additively on the original scale.&lt;/p&gt;

&lt;p&gt;As an example of how these summaries can be applied we can consider the effect
of using ABySS with a kmer size of either 32 or 96. The NG50 coefficient for
ABySS k-96 is 0.26 while the coefficient for ABySS k-32 is -1.02. Therefore the
difference between the two is 1.28. Taking the natural exponent of this
(&lt;code&gt;e^1.28&lt;/code&gt;) shows that using k-96 over k-32 with ABySS should, on average,
give you a 3.6 times larger NG50. We can check this against the first three
read sets. Each row shows the NG50 for k-96 vs k-32.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Read set 0001: 460000 / 78000 = 5.89&lt;/li&gt;
  &lt;li&gt;Read set 0002: 97000 / 51000  = 1.90&lt;/li&gt;
  &lt;li&gt;Read set 0003: 171000 / 70000 = 2.44&lt;/li&gt;
&lt;/ul&gt;
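
&lt;p&gt;The fold-change arithmetic above can be checked directly from the quoted
coefficients:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Difference of the two ABySS coefficients on the log scale, exponentiated
# to give the expected NG50 fold change for k-96 over k-32.
awk 'BEGIN { printf "%.2f\n", exp(0.26 - (-1.02)) }'
# prints 3.60
&lt;/code&gt;&lt;/pre&gt;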

&lt;p&gt;This is my initial attempt at summarising the assemblers in this way and so I
welcome suggestions on how this may be improved or possible deficiencies in the
method. The aim of this is to provide an aggregate summary of how each
assembler is performing rather than solely listing many tables of results.&lt;/p&gt;

&lt;h3 id="additional-assemblers"&gt;Additional assemblers&lt;/h3&gt;

&lt;p&gt;New assemblers also have been evaluated in the benchmarks. The assemblers added
this month are SGA, sparse assembler, minia and megahit. I added megahit, even
though it is a metagenome assembler, as it can still usefully be compared on
isolate assemblies. The results of evaluating these assemblers are now
available on &lt;a href="http://nucleotid.es/benchmarks/"&gt;benchmarks&lt;/a&gt; and the updated &lt;a href="http://nucleotid.es/results/"&gt;summary page&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="minor-changes-to-benchmark-metrics-and-site"&gt;Minor changes to benchmark metrics and site&lt;/h3&gt;

&lt;p&gt;The incorrect bases measure has been changed. This measure now only includes
mismatching bases and indels. Previously this measure also included Ns however
this would penalise assemblers which scaffolded contigs together. I believe
that removing Ns from the incorrect bases measure provides a better metric.&lt;/p&gt;

&lt;p&gt;The CPU seconds per assembled base measure was incorrect by a factor of 1e6.
The benchmarks now list this measure correctly, as CPU seconds per assembled
1KBp.&lt;/p&gt;

&lt;p&gt;There is also now an &lt;a href="http://nucleotid.es/atom.xml"&gt;atom feed&lt;/a&gt; for updates. Users of Firefox may have
seen errors at the top of the benchmark tables; this should now be fixed.&lt;/p&gt;

</content>
  </entry>
  <entry>
    <title>September 2014 Update</title>
    <id>tag:nucleotid.es,2014-09-02:post-1</id>
    <link rel="alternate" href="http://nucleotid.es/blog/2014-09/"/>
    <published>2014-09-02T00:00:00-07:00</published>
    <updated>2014-09-02T00:00:00-07:00</updated>
    <author>
      <name>Michael Barton</name>
      <email>mail@michaelbarton.me.uk</email>
      <uri>http://www.michaelbarton.me.uk</uri>
    </author>
    <content type="html">
&lt;p&gt;Approximately a month ago nucleotid.es was a single page showing a handful of
benchmark tables. Since then I have been able to add more features and the
website has changed greatly. I aim to write regular announcements summarising
these changes as nucleotid.es continues to improve.&lt;/p&gt;

&lt;h3 id="assembler-command-bundles"&gt;Assembler command bundles&lt;/h3&gt;

&lt;p&gt;A problem I encountered early on was how to manage running the same assembler
in different ways. An example is the spades assembler which has the
&lt;code&gt;--single-cell&lt;/code&gt; and &lt;code&gt;--careful&lt;/code&gt; flags, both of which should be evaluated for
their effect on assembly quality. My initial approach was to create a new
Docker image for each way of running an assembler. This resulted in Docker
images like nucleotides/spades-3-single-cell-careful where command line flags
were listed in the name.&lt;/p&gt;

&lt;p&gt;This approach was ungainly and I assumed that more complex ways of running an
assembler would generate longer and longer names. Furthermore, if a new Docker
image had to be created for each combination of command line flags, the result
would be a confusing overabundance of Docker images.&lt;/p&gt;

&lt;p&gt;Instead, I created Docker images with “command bundles.” These command bundles
allow the same Docker container to be run in multiple different ways. Using the
spades example from above, the spades container can be called on the command
line as follows: &lt;code&gt;docker run nucleotides/spades default ...&lt;/code&gt; or &lt;code&gt;docker run
nucleotides/spades single-cell ...&lt;/code&gt;. The first argument to each container
should be the command bundle specifying how it should be run. I believe this
simplifies the problem of benchmarking assemblers with multiple different
command line options. You can see these command bundles in the second column of
each table on the &lt;a href="http://nucleotid.es/benchmarks/"&gt;assembler benchmarks page&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="voting-on-the-best-assemblers"&gt;Voting on the best assemblers&lt;/h3&gt;

&lt;p&gt;I have added more reference genome read sets for benchmarking. There are now 16
references, each with a corresponding table on the benchmarks page. A greater
number of benchmarks provides more information on how the assemblers perform.
If you browse these benchmarks there are visible trends as to which assemblers
perform well. Viewing a large number of tables is however not an ideal way to
compare assemblers.&lt;/p&gt;

&lt;p&gt;I have tried to solve this visualisation problem by treating the benchmarks as
an election. Each reference genome can be thought of as ‘voting’ for the Docker
image which assembles their corresponding reads the best. The best assembler is
then the one that is ‘elected’ by all the reference genomes. I use the Schulze
method for tallying the votes. At present there are the results of two
elections on the &lt;a href="http://nucleotid.es/results/"&gt;assembler results&lt;/a&gt; page. The first is for the Docker image
that produces the best NG50 and the second is for the assembler that produces
the fewest incorrect bases.&lt;/p&gt;

&lt;h3 id="list-of-assemblers"&gt;List of assemblers&lt;/h3&gt;

&lt;p&gt;I have listed all the assembler Docker images on the &lt;a href="http://nucleotid.es/assemblers/"&gt;assemblers page&lt;/a&gt;. This
page shows each assembler Docker image and indicates whether an assembler has a
homepage and a source code repository. If an assembler doesn’t have a source
code repository, such as GitHub or Bitbucket, it is marked with a small red
cross. The aim of this is to encourage developers to provide
resources related to their assembler for bioinformaticians. In future I
would like to add additional checks like providing a mailing list, an issue
tracker and documentation.&lt;/p&gt;

&lt;h3 id="improved-website-appearance"&gt;Improved website appearance&lt;/h3&gt;

&lt;p&gt;I have spent some time improving the website appearance. I have limited ability
when developing HTML and CSS and this shows when viewing the website on a
mobile device. I have however tried to improve the front page of nucleotid.es
to clarify the main goals of the project. I created some simple logos in
Inkscape which are also visible on the front page. These logos are based on
cogs or gears inside boxes, representing genome assemblers inside Docker
containers.&lt;/p&gt;

&lt;h3 id="no-third-party-assemblers-have-been-submitted"&gt;No third-party assemblers have been submitted&lt;/h3&gt;

&lt;p&gt;At present the only Docker images on nucleotid.es are those I have written
myself. I would encourage any interested developers to write a Docker image for
their own or other assemblers. Creating a working image often takes some time,
and so this project will progress slowly if I am writing all the Docker images
myself. If you would be interested in developing an assembler image I would be
happy to help by providing support through the &lt;a href="http://nucleotid.es/mailing-list/"&gt;nucleotid.es mailing list&lt;/a&gt;.
The more assemblers that are included in the benchmarking, the more accurate a
reflection of the state of genome assembly this project provides.&lt;/p&gt;

</content>
  </entry>
  <entry>
    <title>October 2014 Update</title>
    <id>tag:nucleotid.es,2014-10-08:post-2</id>
    <link rel="alternate" href="http://nucleotid.es/blog/2014-10/"/>
    <published>2014-10-08T00:00:00-07:00</published>
    <updated>2014-10-08T00:00:00-07:00</updated>
    <author>
      <name>Michael Barton</name>
      <email>mail@michaelbarton.me.uk</email>
      <uri>http://www.michaelbarton.me.uk</uri>
    </author>
    <content type="html">
&lt;p&gt;This is the second update on recent improvements to nucleotid.es. These include
additional assemblers and updates to existing assemblers. Additional metrics
have been added to provide more detail on the performance of each assembler.
The generated data are also now more accurate, using five replicates for each
genome.&lt;/p&gt;

&lt;h3 id="more-assemblers"&gt;More assemblers&lt;/h3&gt;

&lt;p&gt;This project needs more assembler images. I have created six assembler images;
however, there are many more assemblers that could be included. If you are
interested in creating an assembler Docker image please contact me through this
mailing list or through my personal email. Additional assembler images would be
extremely helpful: the more assemblers benchmarked, the better the picture of
genome assembly this project provides.&lt;/p&gt;

&lt;p&gt;This month was exciting for me because two assembler images were created by
others. Aaron Darling at the University of Technology Sydney created an image
of A5-miseq. Eugene Goltsman at the Joint Genome Institute made an image of
Meraculous. These assemblers have both been benchmarked and you can view how
they perform on the benchmarks page. These new results are
particularly interesting as A5-miseq performs very well.&lt;/p&gt;

&lt;p&gt;Shaun Jackman provided feedback on the ABySS image. These comments came as a
&lt;a href="https://github.com/nucleotides/docker-abyss/pull/2"&gt;pull request&lt;/a&gt; and on a &lt;a href="https://github.com/nucleotides/docker-abyss/commit/8d841532bae4ba69bf65c82aedde9e5f449d41ea"&gt;commit&lt;/a&gt; and are useful for improving the
performance of the assembler image. The ABySS image now has an ‘adaptive’
command bundle which uses &lt;a href="http://kmergenie.bx.psu.edu/"&gt;kmergenie&lt;/a&gt; to search for the optimal kmer to use
for assembly.&lt;/p&gt;

&lt;p&gt;The purpose of nucleotid.es is to provide accurate benchmarks of genome
assemblers, where the images can be immediately used by anyone. Therefore if you
are interested in any of these assemblers, install Docker and you can start
using the images immediately. There are &lt;a href="http://nucleotid.es/using-images/"&gt;simple instructions&lt;/a&gt; provided
to get you started.&lt;/p&gt;

&lt;h3 id="more-metrics"&gt;More metrics&lt;/h3&gt;

&lt;p&gt;I have added additional metrics to each benchmark. Each benchmark now includes
both local misassemblies and larger misassemblies. These are useful for
providing detail on larger scale inaccuracies in addition to the granular
incorrect bases metric already included. All of these assembly metrics on the
benchmark page are generated using &lt;a href="http://bioinf.spbau.ru/quast"&gt;QUAST&lt;/a&gt; by comparing the produced
scaffolds with the reference genome.&lt;/p&gt;

&lt;p&gt;The second set of metrics I have added relate to &lt;a href="https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt"&gt;Linux control groups&lt;/a&gt;.
These cgroups are used by the Docker daemon to organise the container processes
and include information about memory and CPU usage. I collect these metrics for
each container by periodically querying the cgroup for the running container.
These metrics are included in the benchmarks page and can be used to compare
the computational requirements for running each assembler.&lt;/p&gt;
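
&lt;p&gt;As a sketch of this kind of query, assuming the cgroup v1 layout used by the
Docker daemon at the time (the exact paths here are an assumption):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Peak memory usage of a running container, read from its memory cgroup.
CID=$(docker ps -q | head -n 1)
cat /sys/fs/cgroup/memory/docker/${CID}*/memory.max_usage_in_bytes
&lt;/code&gt;&lt;/pre&gt;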

&lt;p&gt;I further computed an additional metric: CPU seconds per assembled base. This
is the total number of CPU seconds used by the container divided by the total
length of the assembly. This metric provides a perspective on the computational
efficiency of each assembler, where a smaller number indicates a
computationally more efficient assembler.&lt;/p&gt;
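
&lt;p&gt;As a worked sketch of the calculation, with invented figures purely to
illustrate the formula:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Total CPU seconds divided by total assembly length (invented numbers).
awk 'BEGIN { cpu = 3600; bases = 4500000; printf "%.6f\n", cpu / bases }'
# prints 0.000800
&lt;/code&gt;&lt;/pre&gt;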

&lt;h3 id="more-replicates"&gt;More replicates&lt;/h3&gt;

&lt;p&gt;Previously each assembler was benchmarked on a single FASTQ file from a
reference genome. This allowed the possibility that a benchmark could be
overfitted to the sampling of the reads. I have updated the benchmarks so that
each calculated metric is the result of running the assembler on five different
subsamplings of the reads. This should provide a more accurate view of how each
assembler performs and, I hope, more confidence in the results.&lt;/p&gt;
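
&lt;p&gt;Subsampled replicates of this kind can be generated with a tool such as
&lt;code&gt;seqtk&lt;/code&gt;; seqtk is not named in this post and is used here only as an
illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Five subsampled replicates from the same read set, one random seed each.
for seed in 1 2 3 4 5; do
  seqtk sample -s"$seed" reads.fastq 0.5 &gt; "replicate_${seed}.fastq"
done
&lt;/code&gt;&lt;/pre&gt;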
</content>
  </entry>
</feed>
