Augur and snakemake
The output of one augur command serves as input to the next and hence these commands need to be executed in order. This is a common pattern in bioinformatics and multiple tools – so called workflow managers – exist to facilitate this. Within nextstrain, we have relied on Snakemake.
Snakemake breaks a workflow into a set of rules that are specified in a file called Snakefile
.
Each rule takes a number of input files, specifies a few parameters, and produces output files.
A simple rule would look like this:
rule filter:
input:
sequences = "data/sequences.fasta",
metadata = "data/metadata.tsv"
output:
sequences = "results/filtered.fasta"
params:
min_date = 2012
shell:
"""
augur filter \
--sequences {input.sequences} \
--metadata {input.metadata} \
--output {output.sequences} \
--min-date {params.min_date}
"""
This rule would produce results/filtered.fasta
from the input files data/sequences.fasta
and data/metadata.tsv
using the augur filter
command.
Note that we explicitly specify what is an input and what is an output file.
To filter our data, we would now call snakemake as
snakemake --cores 1 results/filtered.fasta
and snakemake will run the same command as specified above.
So far, this is just a complicated reformulation of what we did above, but the benefit of workflow management becomes clear once we add more steps.
The next natural step in our phylogenetic pipeline is aligning the filtered sequences and we define a rule align
.
rule align:
input:
sequences = rules.filter.output.sequences,
reference = "config/zika_outgroup.gb"
output:
alignment = "results/aligned.fasta"
shell:
"""
augur align \
--sequences {input.sequences} \
--reference-sequence {input.reference} \
--output {output.alignment}
"""
If you now want to generate the alignment, you can type
snakemake --cores 1 results/aligned.fasta
and snakemake will
determine that
results/aligned.fasta
is an output of rulealign
check whether all required input files are in place and run the necessary rules if not
run rule
align
and check whether the fileresults/aligned.fasta
appeared.
If you supply a reference sequence to augur align
, augur will include that reference sequence in the alignment and strip all insertions relative to that reference.
This will guarantee a consistent reference coordinate system and genome annotation later on.