<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>nucleotid.es</title>
  <id>http://nucleotid.es</id>
  <link href="http://nucleotid.es"/>
  <link href="http://nucleotid.es/atom.xml" rel="self" type="application/atom+xml"/>
  <author>
    <name>Michael Barton</name>
    <email>mail@michaelbarton.me.uk</email>
    <uri>http://www.michaelbarton.me.uk</uri>
  </author>
  <updated>2015-01-07T00:00:00-08:00</updated>
  <entry>
    <title>Why use containers for scientific software?</title>
    <id>tag:nucleotid.es,2015-01-07:post-4</id>
    <link rel="alternate" href="http://nucleotid.es/blog/why-use-containers/"/>
    <published>2015-01-07T00:00:00-08:00</published>
    <updated>2015-01-07T00:00:00-08:00</updated>
    <author>
      <name>Michael Barton</name>
      <email>mail@michaelbarton.me.uk</email>
      <uri>http://www.michaelbarton.me.uk</uri>
    </author>
    <content type="html">
&lt;p&gt;Nucleotid.es benchmarking data has been available for seven months, growing
from a single table of results for one organism to the current ~1,900
replicated benchmarks across multiple organisms. There have been discussions on
the suitability of containers for this kind of approach. In particular, one
question is how they contribute to reproducibility in science. For example,
Titus Brown wrote a blog post describing &lt;a href="http://ivory.idyll.org/blog/2014-containers.html"&gt;a post-apocalyptic world of binary
containers&lt;/a&gt;, and a &lt;a href="https://twitter.com/sjackman/status/537723151057039362"&gt;discussion started on Twitter by Shaun
Jackman&lt;/a&gt; led to many replies.&lt;/p&gt;

&lt;h3 id="reproducibility"&gt;Reproducibility&lt;/h3&gt;

&lt;p&gt;If, when we talk about containers, we specifically mean the Docker
implementation, and we almost certainly do, then I disagree with the description
of these as ‘binary blobs’ that cannot be understood. You can run the &lt;code&gt;docker
export&lt;/code&gt; command to get a .tar of the container’s file system. A Docker
container is not compiled in the way a C or Java program is; instead it is a
series of transparent file system layers. The act of containerising scientific
software does not obscure how it works or make it inaccessible.&lt;/p&gt;
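
&lt;p&gt;As a minimal sketch, the file system of any image can be dumped and
inspected without running it. The image name here is one of the project’s, but
any image works the same way:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Create a stopped container from the image, then export its file system.
docker create --name inspect nucleotides/velvet
docker export inspect &gt; velvet-fs.tar
docker rm inspect

# The tar contains ordinary, readable files rather than an opaque binary.
tar -tf velvet-fs.tar | head
&lt;/code&gt;&lt;/pre&gt;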

&lt;p&gt;I think that containers make for more reproducible science. A Dockerfile
explicitly shows the steps required to compile and organise the code. This is
better than providing the source code alone. I can illustrate this with two
example Dockerfiles for genome assembly containers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://github.com/nucleotides/docker-velvet/blob/master/Dockerfile"&gt;velvet + kmergenie&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://github.com/nucleotides/docker-idba/blob/master/Dockerfile"&gt;idba&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
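
&lt;p&gt;The pattern both files follow can be sketched as below. This is a
deliberately minimal, hypothetical example; the URL, version and build steps
are placeholders, not the contents of the files linked above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;FROM ubuntu:14.04

# Declare build dependencies explicitly rather than assuming them.
RUN apt-get update &amp;&amp; apt-get install -y build-essential wget

# Fetch, compile and install a fixed version of the assembler.
RUN wget http://example.org/assembler-1.0.tar.gz \
 &amp;&amp; tar xzf assembler-1.0.tar.gz \
 &amp;&amp; cd assembler-1.0 \
 &amp;&amp; make \
 &amp;&amp; cp assembler /usr/local/bin/
&lt;/code&gt;&lt;/pre&gt;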

&lt;p&gt;I hope we can agree that neither of these is a trivial install. The
advantage of using a container is that it saves everyone else from having to do
this. More importantly it saves them from having to &lt;strong&gt;learn&lt;/strong&gt; how to do this.
There is a case for encouraging non-computational biologists to learn to code
but not for forcing them to debug g++ errors.&lt;/p&gt;

&lt;p&gt;My favourite way to describe this is as “deduplication of agony”. We can take
the pain of compiling and installing often buggy and undocumented
bioinformatics code, which we currently force on our users, and move it into a
container. Instead of making everyone else do this work we can ask the person
who knows best to do it: the developer.&lt;/p&gt;

&lt;h3 id="standardisation"&gt;Standardisation&lt;/h3&gt;

&lt;p&gt;A second argument is that containers are ‘black boxes’ and cannot be used with
other tools. For instance, suppose I give you a container with a working version
of SPAdes or ABySS. This solves the problem of getting the software to run,
but you still have to use it to produce results. This is what nucleotid.es aims
to solve.&lt;/p&gt;

&lt;p&gt;I have taken some of the most popular genome assemblers and containerised them.
Importantly these have all been standardised behind the same interface so they
can all be used in exactly the same way. This means that if you are using
assembler X and then new data suggests that assembler Y is better, you can
immediately switch between the two containers because they are run on the
command line identically. You can use all these containers interchangeably in
your own custom pipelines with minimal development effort.&lt;/p&gt;
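
&lt;p&gt;As a sketch of what this interchangeability looks like in practice, only the
image name changes between runs. The mount points and arguments below are
illustrative, not the project’s exact interface:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Swap assemblers by changing a single word; the invocation is otherwise identical.
for assembler in velvet idba abyss; do
  docker run -v "$PWD/reads:/reads" -v "$PWD/out/${assembler}:/out" \
    "nucleotides/${assembler}" default
done
&lt;/code&gt;&lt;/pre&gt;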

&lt;p&gt;Nucleotid.es provides the data to allow you to make the decisions about which
assembler to use. I have taken Illumina reads from bacterial organisms of
different sizes and %GC content and assembled them using the assembler
containers. This provides concrete information how you might expect each
assembler to perform on variety of data. Furthermore because the assembler was
benchmarked as a container, the results are guaranteed to be the same for you
as they were for me when I ran the analysis. This would not be the case
without a standardised interface, because I could not otherwise share the
container with you and expect you to reproduce my results. This is why
standardisation is as important as containerisation.&lt;/p&gt;

&lt;h3 id="summary"&gt;Summary&lt;/h3&gt;

&lt;p&gt;At the JGI we produce thousands of assemblies and terabases of sequence data
each year. The days when we manually improved genome drafts have long passed.
This may not yet be the case for smaller research labs, but as sequencing
becomes cheaper and is generated in ever larger volumes it soon will be.
Nucleotid.es aims to let us make data-driven decisions about what software to
use so that we can do assembly in the large.&lt;/p&gt;

&lt;p&gt;Using containers allows us to reliably understand what kind of results we
might expect from an assembler. When someone inevitably produces a better
assembler, we can identify it immediately and insert it into our pipelines,
and it allows you to do the same.&lt;/p&gt;
</content>
  </entry>
  <entry>
    <title>November 2014 Update</title>
    <id>tag:nucleotid.es,2014-11-17:post-3</id>
    <link rel="alternate" href="http://nucleotid.es/blog/2014-11/"/>
    <published>2014-11-17T00:00:00-08:00</published>
    <updated>2014-11-17T00:00:00-08:00</updated>
    <author>
      <name>Michael Barton</name>
      <email>mail@michaelbarton.me.uk</email>
      <uri>http://www.michaelbarton.me.uk</uri>
    </author>
    <content type="html">
&lt;p&gt;This update to &lt;a href="http://nucleotid.es"&gt;nucleotid.es&lt;/a&gt; includes additional
assemblers and a new method of summarising performance across benchmarks. There
are also minor site changes and updates to the benchmark metrics.&lt;/p&gt;

&lt;h3 id="maximum-likelihood-estimates-of-performance"&gt;Maximum likelihood estimates of performance&lt;/h3&gt;

&lt;p&gt;I have made a large change to how the benchmarks are summarised. Instead of
using a voting method, the results are now summarised using linear modelling.
Each benchmark metric is modelled as &lt;code&gt;metric ~ assembler + genome&lt;/code&gt; using a
generalised linear model. This model estimates the maximum-likelihood
coefficients for how much each genome affects the evaluation metric and how
well each assembler performs.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://nucleotid.es/blog/2014-10/"&gt;Last month I outlined&lt;/a&gt; how each set of reads for each genome was
subsampled to generate five replicates and each assembler was evaluated against
all replicates. There are 16 genomes, giving 80 data points for each assembler
and ~1,900 data points in total for linear modelling. I used the &lt;code&gt;glm()&lt;/code&gt; function in R to
model four assembly metrics: NG50, percent unassembled, incorrect per 100KBp,
and number of local misassemblies. The results are shown in the updated
&lt;a href="http://nucleotid.es/results/"&gt;nucleotid.es summary page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Each column shows the coefficients for a different model. For example the NG50
column is the coefficients of the &lt;code&gt;assembler&lt;/code&gt; term in the model: &lt;code&gt;NG50 ~
assembler + genome&lt;/code&gt;. As the NG50 metrics are log-normally distributed, the model
was specified with a log link, i.e. &lt;code&gt;NG50 ~ e^(assembler + genome)&lt;/code&gt;. This is why
the coefficients are small: they are additive on the log scale, and act
multiplicatively rather than additively on the original scale.&lt;/p&gt;

&lt;p&gt;As an example of how these summaries can be applied we can consider the effect
of using ABySS with a kmer size of either 32 or 96. The NG50 coefficient for
ABySS k-96 is 0.26 while the coefficient for ABySS k-32 is -1.02. Therefore the
difference between the two is 1.28. Taking the natural exponent of this
(&lt;code&gt;e^1.28&lt;/code&gt;) shows that using k-96 over k-32 with ABySS should, on average,
give you a 3.6 times larger NG50. We can check this against the first three
read sets. Each row shows the NG50 for k-96 vs k-32.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Read set 0001: 460000 / 78000 = 5.89&lt;/li&gt;
  &lt;li&gt;Read set 0002: 97000 / 51000  = 1.90&lt;/li&gt;
  &lt;li&gt;Read set 0003: 171000 / 70000 = 2.44&lt;/li&gt;
&lt;/ul&gt;
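
&lt;p&gt;The fold-change arithmetic above can be checked directly from the quoted
coefficients:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Difference of the two ABySS coefficients on the log scale, exponentiated
# to give the expected NG50 fold change for k-96 over k-32.
awk 'BEGIN { printf "%.2f\n", exp(0.26 - (-1.02)) }'
# prints 3.60
&lt;/code&gt;&lt;/pre&gt;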

&lt;p&gt;This is my initial attempt at summarising the assemblers in this way and so I
welcome suggestions on how this may be improved or possible deficiencies in the
method. The aim of this is to provide an aggregate summary of how each
assembler is performing rather than solely listing many tables of results.&lt;/p&gt;

&lt;h3 id="additional-assemblers"&gt;Additional assemblers&lt;/h3&gt;

&lt;p&gt;New assemblers also have been evaluated in the benchmarks. The assemblers added
this month are SGA, sparse assembler, minia and megahit. I added megahit, even
though it is a metagenome assembler, as it can still usefully be compared on
isolate assemblies. The results of evaluating these assemblers are now
available on &lt;a href="http://nucleotid.es/benchmarks/"&gt;benchmarks&lt;/a&gt; and the updated &lt;a href="http://nucleotid.es/results/"&gt;summary page&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="minor-changes-to-benchmark-metrics-and-site"&gt;Minor changes to benchmark metrics and site&lt;/h3&gt;

&lt;p&gt;The incorrect bases measure has been changed. This measure now only includes
mismatching bases and indels. Previously this measure also included Ns however
this would penalise assemblers which scaffolded contigs together. I believe
that removing Ns from the incorrect bases measure provides a better metric.&lt;/p&gt;

&lt;p&gt;The CPU seconds per assembled base measure was incorrect by a factor of 1e6.
The benchmarks now list this measure correctly, as CPU seconds per assembled
1KBp.&lt;/p&gt;

&lt;p&gt;There is also now an &lt;a href="http://nucleotid.es/atom.xml"&gt;atom feed&lt;/a&gt; for updates. Users of Firefox may have
seen errors at the top of the benchmark tables; this should now be fixed.&lt;/p&gt;

</content>
  </entry>
  <entry>
    <title>September 2014 Update</title>
    <id>tag:nucleotid.es,2014-09-02:post-1</id>
    <link rel="alternate" href="http://nucleotid.es/blog/2014-09/"/>
    <published>2014-09-02T00:00:00-07:00</published>
    <updated>2014-09-02T00:00:00-07:00</updated>
    <author>
      <name>Michael Barton</name>
      <email>mail@michaelbarton.me.uk</email>
      <uri>http://www.michaelbarton.me.uk</uri>
    </author>
    <content type="html">
&lt;p&gt;Approximately a month ago nucleotid.es was a single page showing a handful of
benchmark tables. Since then I have been able to add more features and the
website has changed greatly. I aim to write regular announcements summarising
these changes as nucleotid.es continues to improve.&lt;/p&gt;

&lt;h3 id="assembler-command-bundles"&gt;Assembler command bundles&lt;/h3&gt;

&lt;p&gt;A problem I encountered early on was how to manage running the same assembler
in different ways. An example is the spades assembler which has the
&lt;code&gt;--single-cell&lt;/code&gt; and &lt;code&gt;--careful&lt;/code&gt; flags, both of which should be evaluated for
their effect on assembly quality. My initial approach was to create a new
Docker image for each way of running an assembler. This resulted in Docker
images like nucleotides/spades-3-single-cell-careful where command line flags
were listed in the name.&lt;/p&gt;

&lt;p&gt;This approach was ungainly and I assumed that more complex ways of running an
assembler would generate longer and longer names. Furthermore, if a new Docker
image had to be created for each combination of command line flags, the result
would be a confusing overabundance of Docker images.&lt;/p&gt;

&lt;p&gt;Instead, I created Docker images with “command bundles.” These command bundles
allow the same Docker container to be run in multiple different ways. Using the
spades example from above, the spades container can be called on the command
line as follows: &lt;code&gt;docker run nucleotides/spades default ...&lt;/code&gt; or &lt;code&gt;docker run
nucleotides/spades single-cell ...&lt;/code&gt;. The first argument to each container
should be the command bundle specifying how it should be run. I believe this
simplifies the problem of benchmarking assemblers with multiple different
command line options. You can see these command bundles in the second column of
each table on the &lt;a href="http://nucleotid.es/benchmarks/"&gt;assembler benchmarks page&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="voting-on-the-best-assemblers"&gt;Voting on the best assemblers&lt;/h3&gt;

&lt;p&gt;I have added more reference genome read sets for benchmarking. There are now 16
references, each with a corresponding table on the benchmarks page. A greater
number of benchmarks provides more information on how the assemblers perform.
If you browse these benchmarks there are visible trends as to which assemblers
perform well. Viewing a large number of tables is however not an ideal way to
compare assemblers.&lt;/p&gt;

&lt;p&gt;I have tried to solve this visualisation problem by treating the benchmarks as
an election. Each reference genome can be thought of as ‘voting’ for the Docker
image which assembles their corresponding reads the best. The best assembler is
then the one that is ‘elected’ by all the reference genomes. I use the Schulze
method for tallying the votes. At present there are the results of two
elections on the &lt;a href="http://nucleotid.es/results/"&gt;assembler results&lt;/a&gt; page. The first is for the Docker image
that produces the best NG50 and the second is for the assembler that produces
the fewest incorrect bases.&lt;/p&gt;

&lt;h3 id="list-of-assemblers"&gt;List of assemblers&lt;/h3&gt;

&lt;p&gt;I have listed all the assembler Docker images on the &lt;a href="http://nucleotid.es/assemblers/"&gt;assemblers page&lt;/a&gt;. This
page shows each assembler Docker image and indicates whether an assembler has a
homepage and a source code repository. If an assembler doesn’t have a source
code repository, such as GitHub or Bitbucket, it is marked with a small red
cross. The aim of this is to encourage developers to provide
resources related to their assembler for bioinformaticians. In future I
would like to add additional checks like providing a mailing list, an issue
tracker and documentation.&lt;/p&gt;

&lt;h3 id="improved-website-appearance"&gt;Improved website appearance&lt;/h3&gt;

&lt;p&gt;I have spent some time improving the website appearance. I have limited ability
when developing HTML and CSS and this shows when viewing the website on a
mobile device. I have however tried to improve the front page of nucleotid.es
to clarify the main goals of the project. I created some simple logos in
Inkscape which are also visible on the front page. These logos are based on
cogs or gears inside boxes, representing genome assemblers inside Docker
containers.&lt;/p&gt;

&lt;h3 id="no-third-party-assemblers-have-been-submitted"&gt;No third-party assemblers have been submitted&lt;/h3&gt;

&lt;p&gt;At present the only Docker images on nucleotid.es are those I have written
myself. I would encourage any interested developers to write a Docker image for
their own or other assemblers. Creating a working image often takes some time,
and so this project will progress slowly if I am writing all the Docker images
myself. If you would be interested in developing an assembler image I would be
happy to help by providing support through the &lt;a href="http://nucleotid.es/mailing-list/"&gt;nucleotid.es mailing list&lt;/a&gt;.
The more assemblers that are included in the benchmarking, the more accurate a
reflection of the state of genome assembly this project provides.&lt;/p&gt;

</content>
  </entry>
  <entry>
    <title>October 2014 Update</title>
    <id>tag:nucleotid.es,2014-10-08:post-2</id>
    <link rel="alternate" href="http://nucleotid.es/blog/2014-10/"/>
    <published>2014-10-08T00:00:00-07:00</published>
    <updated>2014-10-08T00:00:00-07:00</updated>
    <author>
      <name>Michael Barton</name>
      <email>mail@michaelbarton.me.uk</email>
      <uri>http://www.michaelbarton.me.uk</uri>
    </author>
    <content type="html">
&lt;p&gt;This is the second update on recent improvements to nucleotid.es. These include
additional assemblers and updates to existing assemblers. Additional metrics
have been added to provide more detail on the performance of each assembler.
The generated data are also now more accurate, using five replicates for each
genome.&lt;/p&gt;

&lt;h3 id="more-assemblers"&gt;More assemblers&lt;/h3&gt;

&lt;p&gt;This project needs more assembler images. I have created six assembler images;
however, there are many more assemblers that could be included. If you are
interested in creating an assembler Docker image please contact me through this
mailing list or through my personal email. Additional assembler images would be
extremely helpful: the more assemblers benchmarked, the better the picture of
genome assembly this project provides.&lt;/p&gt;

&lt;p&gt;This month was exciting for me because two assembler images were created by
others. Aaron Darling at the University of Technology Sydney created an image
of A5-miseq. Eugene Goltsman at the Joint Genome Institute made an image of
Meraculous. These assemblers have both been benchmarked and you can view how
they perform on the benchmarks page. These new results are
particularly interesting as A5-miseq performs very well.&lt;/p&gt;

&lt;p&gt;Shaun Jackman provided feedback on the ABySS image. These comments came as a
&lt;a href="https://github.com/nucleotides/docker-abyss/pull/2"&gt;pull request&lt;/a&gt; and on a &lt;a href="https://github.com/nucleotides/docker-abyss/commit/8d841532bae4ba69bf65c82aedde9e5f449d41ea"&gt;commit&lt;/a&gt; and are useful for improving the
performance of the assembler image. The ABySS image now has an ‘adaptive’
command bundle which uses &lt;a href="http://kmergenie.bx.psu.edu/"&gt;kmergenie&lt;/a&gt; to search for the optimal kmer to use
for assembly.&lt;/p&gt;

&lt;p&gt;The purpose of nucleotid.es is to provide accurate benchmarks of genome
assemblers, where the images can be immediately used by anyone. Therefore if you
are interested in any of these assemblers, install Docker and you can start
using the images immediately. There are &lt;a href="http://nucleotid.es/using-images/"&gt;simple instructions&lt;/a&gt; provided
to get you started.&lt;/p&gt;

&lt;h3 id="more-metrics"&gt;More metrics&lt;/h3&gt;

&lt;p&gt;I have added additional metrics to each benchmark. Each benchmark now includes
both local misassemblies and larger misassemblies. These are useful for
providing detail on larger scale inaccuracies in addition to the granular
incorrect bases metric already included. All of these assembly metrics on the
benchmark page are generated using &lt;a href="http://bioinf.spbau.ru/quast"&gt;QUAST&lt;/a&gt; by comparing the produced
scaffolds with the reference genome.&lt;/p&gt;

&lt;p&gt;The second set of metrics I have added relate to &lt;a href="https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt"&gt;Linux control groups&lt;/a&gt;.
These cgroups are used by the Docker daemon to organise the container processes
and include information about memory and CPU usage. I collect these metrics for
each container by periodically querying the cgroup for the running container.
These metrics are included in the benchmarks page and can be used to compare
the computational requirements for running each assembler.&lt;/p&gt;
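
&lt;p&gt;As a sketch of this kind of query, assuming the cgroup v1 layout used by the
Docker daemon at the time (the exact paths here are an assumption):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Peak memory usage of a running container, read from its memory cgroup.
CID=$(docker ps -q | head -n 1)
cat /sys/fs/cgroup/memory/docker/${CID}*/memory.max_usage_in_bytes
&lt;/code&gt;&lt;/pre&gt;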

&lt;p&gt;I further computed an additional metric: CPU seconds per assembled base. This
is the total number of CPU seconds used by the container divided by the total
length of the assembly. This metric provides a perspective on the computational
efficiency of each assembler, where a smaller number indicates a
computationally more efficient assembler.&lt;/p&gt;
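
&lt;p&gt;As a worked sketch of the calculation, with invented figures purely to
illustrate the formula:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Total CPU seconds divided by total assembly length (invented numbers).
awk 'BEGIN { cpu = 3600; bases = 4500000; printf "%.6f\n", cpu / bases }'
# prints 0.000800
&lt;/code&gt;&lt;/pre&gt;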

&lt;h3 id="more-replicates"&gt;More replicates&lt;/h3&gt;

&lt;p&gt;Previously each assembler was benchmarked on a single FASTQ file from a
reference genome. This allowed the possibility that a benchmark could be
overfitted to the sampling of the reads. I have updated the benchmarks so that
each calculated metric is the result of running the assembler on five different
subsamplings of the reads. This should provide a more accurate view of how each
assembler performs and, I hope, more confidence in the results.&lt;/p&gt;
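
&lt;p&gt;Subsampled replicates of this kind can be generated with a tool such as
&lt;code&gt;seqtk&lt;/code&gt;; seqtk is not named in this post and is used here only as an
illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Five subsampled replicates from the same read set, one random seed each.
for seed in 1 2 3 4 5; do
  seqtk sample -s"$seed" reads.fastq 0.5 &gt; "replicate_${seed}.fastq"
done
&lt;/code&gt;&lt;/pre&gt;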
</content>
  </entry>
</feed>
