Creating a bacterial phylogenetic workflowļ
This tutorial explains how to create a single-build Nextstrain workflow for Tuberculosis sequences. However, much of it will be applicable to any run where you are starting with VCF files rather than FASTA files. Weāll create a Snakefile step-by-step for each step of the analysis.
Prerequisitesļ
Run through the first tutorial. This will verify your installation.
Setupļ
Download the example pathogen repository and enter the new directory.
git clone https://github.com/nextstrain/tb cd tb
Note
The data in this tutorial is public and is a subset of the data from Lee et al.ās 2015 paper Population genomics of Mycobacterium tuberculosis in the Inuit. As location was anonymized in the paper, location data provided here was randomly chosen from the region for illustrative purposes.
Create the Nextstrain Buildļ
This build will have the following steps:
Prepare the Sequencesļ
A Nextstrain build with VCF file input starts with:
A VCF file containing all the sequences you want to include (variable sites only)
A FASTA file of the reference sequence to which your VCF was mapped
A tab-delimited metadata file
There are other files you will need if you want to perform certain steps, like masking. If you are also working with TB sequences, you may be able to use the files provided here on your own data. Otherwise, youāll need to provide files specific to your pathogen.
Here, our input file is compressed with gzip - you can see it ends with .vcf.gz
. However, Augur can take gzipped or un-gzipped VCF files. It can also produce either gzipped or un-gzipped VCF files as output (detected from the file ending you provide). Here, weāll usually keep our output VCF files gzipped, by giving our output files endings like .vcf.gz
, but you can specify .vcf
instead.
All the data you need to make the TB build are in the data
and config
folders.
Filter the Sequencesļ
Sometimes you may want to exclude certain sequences from analysis. You may also wish to downsample your data based on certain criteria. filter
lets you do this.
For this example, weāll just exclude sequences in the file dropped_strains.txt
.
Weāll need to specify these starting files at the top of our Snakefile:
seq_file = "data/lee_2015.vcf.gz",
meta_file = "data/meta.tsv",
exclude_file = "config/dropped_strains.txt"
And weāll add this as our first rule:
rule filter:
input:
seq = seq_file,
meta = meta_file,
exclude = exclude_file
output:
"results/filtered.vcf.gz"
shell:
"""
augur filter --sequences {input.seq} \
--metadata {input.meta} \
--exclude {input.exclude} \
--output {output}
"""
Now run filter. If you are using the Snakefile included with the TB tutorial, you can run:
snakemake --cores 1 filter
If you have created your own Snakefile, youāll need to specify its name. For example, if it is called TB_snakefile
, you would run:
snakemake --cores 1 -s TB_snakefile filter
Mask the Sequencesļ
There may be regions in your pathogen sequences that are unreliable. For example, areas that are hard to map because of repeat regions. Often, these are excluded from analysis so that incorrect calls in these areas donāt influence the results. The areas to be masked are specified in a BED-format file. This is a standard, tab-delimited format with five columns: Chrom, ChomStart, ChromEnd, locus tag, and Comment. You can open up config/Locus_to_exclude_Mtb.bed
in the TB tutorial to see the file format.
The first, fourth, and fifth columns (Chrom, locus tag, and Comment) can be blank or contain anything - they will be ignored. All sites between each ChromStart and ChromEnd will be removed from the analysis.
Weāll need to add this BED-format file to the top of the Snakefile (below the files already there):
mask_file = "config/Locus_to_exclude_Mtb.bed"
Now we can add the mask
rule:
rule mask:
input:
seq = rules.filter.output,
mask = mask_file
output:
"results/masked.vcf.gz"
shell:
"""
augur mask --sequences {input.seq} \
--mask {input.mask} \
--output {output}
"""
Construct the Phylogenyļ
Now our sequences are ready to start analysis.
With VCF files, weāll do this in two steps that are slightly different from FASTA-input. 1. First, weāll use only the variable sites to construct a tree quickly. This will give us the topology, but the branch lengths will be incorrect. 2. Next, weāll consider the entire sequence to correct our branch lengths. At the same time, the sample date information will be used to create a time-resolved tree.
Construct an Initial Tree to Get the Topologyļ
You can use different tree-building programs to build your initial tree, and specify some parameters. Here, weāll use IQTree. We specify it here with the argument --method
, but itās also the default.
In tree
, we pass in the VCF file and the reference it was mapped to. We also pass in a list of sites that weād like to exclude from building the topology (optional). These are sites associated with drug-resistance mutations that can influence the topology. We exclude them here, but theyāll be allowed to influence branch length and be included in ancestral sequence reconstruction later.
We must add the reference sequence our VCF file was mapped to, and our list of sites to exclude from tree-building to the top of the Snakefile:
ref_file = "data/ref.fasta"
sites_file = "config/drm_sites.txt"
And add the tree
rule to the Snakefile:
rule tree:
input:
aln = rules.mask.output,
ref = ref_file,
sites = sites_file
output:
"results/tree_raw.nwk"
params:
method = 'iqtree'
shell:
"""
augur tree --alignment {input.aln} \
--vcf-reference {input.ref} \
--method {params.method} \
--exclude-sites {input.sites} \
--output {output}
"""
Fix Branch Lengths & Get a Time-Resolved Treeļ
Now weāll use the topology from tree
, but get more accurate branch lengths and a time-resolved tree. This adjusts branch lengths in the tree to position tips by their sample date and infer the most likely time of their ancestors, using TreeTime. There are many options that can be specified here in refine
to help you get a good tree.
refine
will produce as output: * another tree (newick format) * a JSON format file with the inferred dates and mutations on each node/branch
rule refine:
input:
tree = rules.tree.output,
aln = rules.mask.output,
metadata = meta_file,
ref = ref_file
output:
tree = "results/tree.nwk",
node_data = "results/branch_lengths.json",
params:
root = 'min_dev',
coal = 'opt'
shell:
"""
augur refine --tree {input.tree} \
--alignment {input.aln} \
--vcf-reference {input.ref} \
--metadata {input.metadata} \
--timetree \
--root {params.root} \
--coalescent {params.coal} \
--output-tree {output.tree} \
--output-node-data {output.node_data}
"""
In addition to assigning times to internal nodes, the refine
command filters tips that are likely outliers. Branch lengths in the resulting Newick tree measure adjusted nucleotide divergence. All other data inferred by TreeTime is stored by strain or internal node name in the JSON file.
Annotate the Phylogenyļ
Now that we have an accurate tree and some information about the ancestral sequences, we can annotate some interesting data onto our phylogeny. TreeTime can infer ancestral sequences and ancestral traits from an existing phylogenetic tree and metadata to annotate each tip of the tree.
Infer Ancestral Sequencesļ
We can reconstruct the ancestral sequences for the internal nodes on our phylogeny and identify any nucleotide mutations on the branches leading to any node in the tree.
For VCF runs, ancestral
will produce another VCF that contains the reconstructed sequence of all the internal nodes and the sequences from the tip nodes, as well as a JSON-format file that contains nucleotide mutation information for each node.
rule ancestral:
input:
tree = rules.refine.output.tree,
alignment = rules.mask.output,
ref = ref_file
output:
nt_data = "results/nt_muts.json",
vcf_out = "results/nt_muts.vcf"
params:
inference = "joint"
shell:
"""
augur ancestral --tree {input.tree} \
--alignment {input.alignment} \
--vcf-reference {input.ref} \
--inference {params.inference} \
--output-node-data {output.nt_data} \
--output-vcf {output.vcf_out}
"""
Identify Amino-Acid Mutationsļ
With translate
we can identify amino acid mutations from the nucleotide mutations and a GFF file with gene coordinate annotations. The resulting JSON file contains amino acid mutations indexed by strain or internal node name and by gene name. translate
will also produce a VCF-style file with the amino acid changes for each gene and each sequence, and FASTA file with the translated āreferenceā genes which the VCF-style file āmapsā to.
Because of the number of genes in TB, we will only translate genes associated with drug resistance to save time. We can pass in a list of genes to translate using --genes
. Note that the --reference-sequence
option is how you pass in the GFF file with the gene coordinates.
Weāll need to add the GFF file with the gene annotations and the file with a list of genes to translate to the list of files at the top of the Snakefile:
generef_file = "config/Mtb_H37Rv_NCBI_Annot.gff",
genes_file = "config/genes.txt"
rule translate:
input:
tree = rules.refine.output.tree,
ref = ref_file,
gene_ref = generef_file,
vcf = rules.ancestral.output.vcf_out,
genes = genes_file
output:
aa_data = "results/aa_muts.json",
vcf_out = "results/translations.vcf",
vcf_ref = "results/translations_reference.fasta"
shell:
"""
augur translate --tree {input.tree} \
--vcf-reference {input.ref} \
--ancestral-sequences {input.vcf} \
--genes {input.genes} \
--reference-sequence {input.gene_ref} \
--output-node-data {output.aa_data} \
--alignment-output {output.vcf_out} \
--vcf-reference-output {output.vcf_ref}
"""
Reconstruct Ancestral Traitsļ
traits
can reconstruct the probable ancestral state of traits like location and host (or others). This is done by specifying a column or columns in the metadata file.
--confidence
will give confidence estimates for the reconstructed traits. The output will be a JSON file with the trait (and confidence, if specified) information for each node.
rule traits:
input:
tree = rules.refine.output.tree,
meta = meta_file
output:
"results/traits.json"
params:
traits = 'location'
shell:
"""
augur traits --tree {input.tree} \
--metadata {input.meta} \
--columns {params.traits} \
--output-node-data {output}
"""
Identify Specified Cladesļ
In the original paper, the authors identified āsublineagesā within the dataset. We can add these to our dataset as ācladesā by defining the sublineages with amino-acid or nucleotide mutations specific to that sublineage, given here in the file config/clades.tsv
. Open it up in a text editor to have a look at the format.
The clades.tsv
file must be tab-delimited with four columns: clade, gene, site, and alt. The ācladeā column gives the name of the clade being defined - you can have more than one row per clade - it will only be defined from the branch where all criteria are met. The āgeneā and āsiteā columns specify the gene (or nuc
for nucleotide) and location (by AA position in the gene, or nucleotide position in the genome) where the branch must have the āaltā (4th column) value to be considered this clade.
As clades, these sublineages will be labelled and weāll be able to color the tree by them.
You can specify clades for your own data by first doing a run without clades, then mousing over branches where youād like to start defining a clade to see what mutations are present.
Weāll need to add the file that defines the clades to the top of our Snakefile:
clades_file = "config/clades.tsv"
rule clades:
input:
tree = rules.refine.output.tree,
aa_muts = rules.translate.output.aa_data,
nuc_muts = rules.ancestral.output.nt_data,
clades = clades_file
output:
clade_data = "results/clades.json"
shell:
"""
augur clades --tree {input.tree} \
--mutations {input.nuc_muts} {input.aa_muts} \
--clades {input.clades} \
--output-node-data {output.clade_data}
"""
Identify Drug Resistance Mutationsļ
sequence-traits
can identify any trait associated with particular nucleotide or amino-acid mutations, not just drug resistance mutations (DRMs).
This dataset doesnāt actually contain any drug resistance mutations, but identifying such mutations is often of interest to those working on tuberculosis. Here, weāll run this step as an example, even though it wonāt add anything to the tree for this dataset.
Open up the config/DRMs-AAnuc.tsv
file to see the format of a file that specifies sequence traits. It contains five columns: GENE, SITE, ALT, DISPLAY_NAME, and FEATURE. DISPLAY_NAME can be blank.
For drug resistance, we list the gene, the AA position in the gene, the AA mutation that confers resistance (you can list a site multiple times if multiple bases give resistance), and the name of the drug this mutation gives resistance to:
GENE SITE ALT DISPLAY_NAME FEATURE
gyrB 461 N Fluoroquinolones
gyrB 499 D Fluoroquinolones
rpoB 432 E Rifampicin
rpoB 432 K Rifampicin
We can leave DISPLAY_NAME blank, as Auspice will by default display the gene, site, and original and alternative base.
For mutations outside of protein-coding genes, we can specify their position using nucleotides:
GENE SITE ALT DISPLAY_NAME FEATURE
nuc 1472749 A rrs: C904A Streptomycin
nuc 1473246 G rrs: A1401G Amikacin Capreomycin Kanamycin
nuc 1673423 T fabG1: G-17T Isoniazid Ethionamide
nuc 1673425 T fabG1: C-15T Isoniazid Ethionamide
In the literature, these mutations are still referred to by their position within non-protein-coding genes (rrs
) or location near genes (-17 fabG1
), not their nucleotide location. We can ensure Auspice displays the more useful common nomenclature by giving entries for the DISPLAY_NAME column.
sequence-traits
will return a value for each āfeatureā - for example, all the mutations on the tree that lead to resistance to Streptomycin. It will also generate a count either of the total number of āfeaturesā each node has (ex: the total number of drugs a sequence is resistant to), or the total number or mutations specified in the file each node has (ex: the total number of DRMs a sequence has, even if some are for the same drug). You can specify a name for this count using the --label
argument (here: āDrug_Resistanceā). The --count
argument value specifies whether to count the number of traits (ex: drugs resistant to) (use traits
) or number of overall mutations (use mutations
).
Weāll need to add the file that defines the sequence traits (DRMs) to the top of our Snakefile:
drms_file = "config/DRMs-AAnuc.tsv"
rule seqtraits:
input:
align = rules.ancestral.output.vcf_out,
ref = ref_file,
trans_align = rules.translate.output.vcf_out,
trans_ref = rules.translate.output.vcf_ref,
drms = drms_file
output:
drm_data = "results/drms.json"
params:
count = "traits",
label = "Drug_Resistance"
shell:
"""
augur sequence-traits \
--ancestral-sequences {input.align} \
--vcf-reference {input.ref} \
--translations {input.trans_align} \
--vcf-translate-reference {input.trans_ref} \
--features {input.drms} \
--count {params.count} \
--label {params.label} \
--output-node-data {output.drm_data}
"""
Export the Resultsļ
Finally, collect all node annotations and metadata and export it all in Auspiceās JSON format. The resulting tree and metadata JSON files are the inputs to the Auspice visualization tool.
The names of the output tree and metadata files are here specified by a rule called all
at the beginning of our Snakefile. It should be even before the list of files, and looks like this:
rule all:
input:
auspice_tree = "auspice/tb_tree.json",
auspice_meta = "auspice/tb_meta.json"
This rule tells Snakemake what the final output of our entire run should look like. It will run all rules necessary to produce these files, so they should be the names of your final step. If you have an āallā rule, you can run your entire analysis just by running snakemake --cores 1
or snakemake --cores 1 --snakefile Snakefile2
(if the name of your Snakefile is not āSnakefileā).
Weāll need to add a few remaining files to our list of files at the start of our Snakefile:
colors_file = "config/color.tsv",
config_file = "config/config.json",
geo_info_file = "config/lat_longs.tsv"
The color.tsv
file is optional, but allows us to specify our own colors for particular traits. If you open it up, you can see that we choose our own colors for values in āregionā, ācountryā, ālocationā and āclade_membershipā. If you donāt supply a color.tsv
file, Auspice will choose colors for you. This can be the simplest way to start - then you can add colors for any traits where you donāt like what Auspice has chosen.
The lat_longs.tsv
file contains the latitudes and longitudes for the geographic locations of your data, and may or may not be needed for your data. Augur contains many latitudes and longitudes for countries and regions, but if you want to specify data at a different level (state, province, county, city), you can include your own file as well (it will be used in addition to the defaults, so country location can still be retrieved from the Augur file, for example). At the bottom of the config/lat_longs.tsv
file in the TB tutorial, notice there are entries for ālocationā, listing each village.
Since all the samples come from the region of North America and the country of Canada, we donāt include these anywhere in our data - all the samples would be the same. Instead, we have ālocationā as a color_options
entry, and also as our geo
(where the samples will be drawn on the map), and as a filters
option.
We also have a color_options
entry for āclade_membershipā, since we designated clades with the clades
rule. The trait is added to our tree as clade_membership
which is why this is the name of the option and the key
value, but we could set the legendTitle
and menuItem
to be anything we wish, if we wanted.
rule export:
input:
tree = rules.refine.output.tree,
metadata = meta_file,
branch_lengths = rules.refine.output.node_data,
traits = rules.traits.output,
nt_muts = rules.ancestral.output.nt_data,
aa_muts = rules.translate.output.aa_data,
drms = rules.seqtraits.output.drm_data,
color_defs = "config/colors.tsv",
config = "config/config.json",
geo_info = "config/lat_longs.tsv",
clades = rules.clades.output.clade_data
output:
auspice_json = "auspice/tb.json",
shell:
"""
augur export v2 \
--tree {input.tree} \
--metadata {input.metadata} \
--node-data {input.branch_lengths} {input.traits} {input.drms} {input.aa_muts} {input.nt_muts} {input.clades} \
--auspice-config {input.config} \
--colors {input.color_defs} \
--lat-longs {input.geo_info} \
--output {output.auspice_json} \
"""
As mentioned previously, this dataset has no drug resistance, so itās not included in the config.json
file to display, even though we ran the sequence-traits
rule. If you did have drug resistance information that you wanted to display, you would need to add it to the config.json
file as color_options
.
First, you would want to add a color-by for the total number of drugs each node is resistant to. Since we gave the label āDrug_Resistanceā when we ran the rule, this will be the name of the option, and the key
, but we can make the menuItem
and legendTitle
different if we wish:
"Drug_Resistance": {
"menuItem": "Drug_Resistance",
"legendTitle": "Drug Resistance",
"type": "discrete",
"key": "Drug_Resistance"
},
If you had given a different label when you ran the rule, you would change this entry to match.
You would then need an option for each drug where you have resistance information (or each FEATURE where you have information). For example, to show the mutations present that confer resistance to Streptomycin and Rifampicin:
"Streptomycin": {
"menuItem": "Streptomycin",
"legendTitle": "Streptomycin Resistance",
"type": "discrete",
"key": "Streptomycin"
},
"Rifampicin": {
"menuItem": "Rifampicin",
"legendTitle": "Rifampicin Resistance",
"type": "discrete",
"key": "Rifampicin"
},
You would need an entry for every FEATURE in your original file (though you could then remove any that had no information on the tree).