Creating a bacterial phylogenetic workflow

This tutorial explains how to create a single-build Nextstrain workflow for Tuberculosis sequences. However, much of it will be applicable to any run where you are starting with VCF files rather than FASTA files. We’ll create a Snakefile step-by-step for each step of the analysis.

Table of Contents

Prerequisites
Setup
Create the Nextstrain Build

Prerequisites 

Install Nextstrain.
Run through the first tutorial. This will verify your installation.

Setup 

Download the example pathogen repository and enter the new directory.
```
git clone https://github.com/nextstrain/tb
cd tb
```
Note

The data in this tutorial is public and is a subset of the data from Lee et al.’s 2015 paper Population genomics of Mycobacterium tuberculosis in the Inuit. As location was anonymized in the paper, location data provided here was randomly chosen from the region for illustrative purposes.

Create the Nextstrain Build 

This build will have the following steps:

Prepare the Sequences
- Filter the Sequences
- Mask the Sequences
Construct the Phylogeny
- Construct an Initial Tree to Get the Topology
- Fix Branch Lengths & Get a Time-Resolved Tree
Annotate the Phylogeny
Export the Results

Prepare the Sequences 

A Nextstrain build with VCF file input starts with:

A VCF file containing all the sequences you want to include (variable sites only)
A FASTA file of the reference sequence to which your VCF was mapped
A tab-delimited metadata file

There are other files you will need if you want to perform certain steps, like masking. If you are also working with TB sequences, you may be able to use the files provided here on your own data. Otherwise, you’ll need to provide files specific to your pathogen.

Here, our input file is compressed with gzip - you can see it ends with .vcf.gz. However, Augur can take gzipped or un-gzipped VCF files. It can also produce either gzipped or un-gzipped VCF files as output (detected from the file ending you provide). Here, we’ll usually keep our output VCF files gzipped, by giving our output files endings like .vcf.gz, but you can specify .vcf instead.

All the data you need to make the TB build are in the data and config folders.

Filter the Sequences 

Sometimes you may want to exclude certain sequences from analysis. You may also wish to downsample your data based on certain criteria. filter lets you do this.

For this example, we’ll just exclude sequences in the file dropped_strains.txt.

We’ll need to specify these starting files at the top of our Snakefile:

seq_file = "data/lee_2015.vcf.gz",
meta_file = "data/meta.tsv",
exclude_file = "config/dropped_strains.txt"

And we’ll add this as our first rule:

rule filter:
    input:
        seq = seq_file,
        meta = meta_file,
        exclude = exclude_file
    output:
        "results/filtered.vcf.gz"
    shell:
        """
        augur filter --sequences {input.seq} \
            --metadata {input.meta} \
            --exclude {input.exclude} \
            --output {output}
        """

Now run filter. If you are using the Snakefile included with the TB tutorial, you can run:

snakemake --cores 1 filter

If you have created your own Snakefile, you’ll need to specify its name. For example, if it is called TB_snakefile, you would run:

snakemake --cores 1 -s TB_snakefile filter

Mask the Sequences 

There may be regions in your pathogen sequences that are unreliable. For example, areas that are hard to map because of repeat regions. Often, these are excluded from analysis so that incorrect calls in these areas don’t influence the results. The areas to be masked are specified in a BED-format file. This is a standard, tab-delimited format with five columns: Chrom, ChomStart, ChromEnd, locus tag, and Comment. You can open up config/Locus_to_exclude_Mtb.bed in the TB tutorial to see the file format.

The first, fourth, and fifth columns (Chrom, locus tag, and Comment) can be blank or contain anything - they will be ignored. All sites between each ChromStart and ChromEnd will be removed from the analysis.

We’ll need to add this BED-format file to the top of the Snakefile (below the files already there):

mask_file = "config/Locus_to_exclude_Mtb.bed"

Now we can add the mask rule:

rule mask:
    input:
        seq = rules.filter.output,
        mask = mask_file
    output:
       "results/masked.vcf.gz"
    shell:
        """
        augur mask --sequences {input.seq} \
            --mask {input.mask} \
            --output {output}
        """

Construct the Phylogeny 

Now our sequences are ready to start analysis.

With VCF files, we’ll do this in two steps that are slightly different from FASTA-input. 1. First, we’ll use only the variable sites to construct a tree quickly. This will give us the topology, but the branch lengths will be incorrect. 2. Next, we’ll consider the entire sequence to correct our branch lengths. At the same time, the sample date information will be used to create a time-resolved tree.

Construct an Initial Tree to Get the Topology 

You can use different tree-building programs to build your initial tree, and specify some parameters. Here, we’ll use IQTree. We specify it here with the argument --method, but it’s also the default.

In tree, we pass in the VCF file and the reference it was mapped to. We also pass in a list of sites that we’d like to exclude from building the topology (optional). These are sites associated with drug-resistance mutations that can influence the topology. We exclude them here, but they’ll be allowed to influence branch length and be included in ancestral sequence reconstruction later.

We must add the reference sequence our VCF file was mapped to, and our list of sites to exclude from tree-building to the top of the Snakefile:

ref_file = "data/ref.fasta"
sites_file = "config/drm_sites.txt"

And add the tree rule to the Snakefile:

rule tree:
    input:
        aln = rules.mask.output,
        ref = ref_file,
        sites = sites_file
    output:
        "results/tree_raw.nwk"
    params:
        method = 'iqtree'
    shell:
        """
        augur tree --alignment {input.aln} \
            --vcf-reference {input.ref} \
            --method {params.method} \
            --exclude-sites {input.sites} \
            --output {output}
        """

Fix Branch Lengths & Get a Time-Resolved Tree 

Now we’ll use the topology from tree, but get more accurate branch lengths and a time-resolved tree. This adjusts branch lengths in the tree to position tips by their sample date and infer the most likely time of their ancestors, using TreeTime. There are many options that can be specified here in refine to help you get a good tree.

refine will produce as output: * another tree (newick format) * a JSON format file with the inferred dates and mutations on each node/branch

rule refine:
    input:
        tree = rules.tree.output,
        aln = rules.mask.output,
        metadata = meta_file,
        ref = ref_file
    output:
        tree = "results/tree.nwk",
        node_data = "results/branch_lengths.json",
    params:
        root = 'min_dev',
        coal = 'opt'
    shell:
        """
        augur refine --tree {input.tree} \
            --alignment {input.aln} \
            --vcf-reference {input.ref} \
            --metadata {input.metadata} \
            --timetree \
            --root {params.root} \
            --coalescent {params.coal} \
            --output-tree {output.tree} \
            --output-node-data {output.node_data}
        """

In addition to assigning times to internal nodes, the refine command filters tips that are likely outliers. Branch lengths in the resulting Newick tree measure adjusted nucleotide divergence. All other data inferred by TreeTime is stored by strain or internal node name in the JSON file.

Annotate the Phylogeny 

Now that we have an accurate tree and some information about the ancestral sequences, we can annotate some interesting data onto our phylogeny. TreeTime can infer ancestral sequences and ancestral traits from an existing phylogenetic tree and metadata to annotate each tip of the tree.

Infer Ancestral Sequences 

We can reconstruct the ancestral sequences for the internal nodes on our phylogeny and identify any nucleotide mutations on the branches leading to any node in the tree.

For VCF runs, ancestral will produce another VCF that contains the reconstructed sequence of all the internal nodes and the sequences from the tip nodes, as well as a JSON-format file that contains nucleotide mutation information for each node.

rule ancestral:
    input:
        tree = rules.refine.output.tree,
        alignment = rules.mask.output,
        ref = ref_file
    output:
        nt_data = "results/nt_muts.json",
        vcf_out = "results/nt_muts.vcf"
    params:
        inference = "joint"
    shell:
        """
        augur ancestral --tree {input.tree} \
            --alignment {input.alignment} \
            --vcf-reference {input.ref} \
            --inference {params.inference} \
            --output-node-data {output.nt_data} \
            --output-vcf {output.vcf_out}
        """

Identify Amino-Acid Mutations 

With translate we can identify amino acid mutations from the nucleotide mutations and a GFF file with gene coordinate annotations. The resulting JSON file contains amino acid mutations indexed by strain or internal node name and by gene name. translate will also produce a VCF-style file with the amino acid changes for each gene and each sequence, and FASTA file with the translated ‘reference’ genes which the VCF-style file ‘maps’ to.

Because of the number of genes in TB, we will only translate genes associated with drug resistance to save time. We can pass in a list of genes to translate using --genes. Note that the --reference-sequence option is how you pass in the GFF file with the gene coordinates.

We’ll need to add the GFF file with the gene annotations and the file with a list of genes to translate to the list of files at the top of the Snakefile:

generef_file = "config/Mtb_H37Rv_NCBI_Annot.gff",
genes_file = "config/genes.txt"

rule translate:
    input:
        tree = rules.refine.output.tree,
        ref = ref_file,
        gene_ref = generef_file,
        vcf = rules.ancestral.output.vcf_out,
        genes = genes_file
    output:
        aa_data = "results/aa_muts.json",
        vcf_out = "results/translations.vcf",
        vcf_ref = "results/translations_reference.fasta"
    shell:
        """
        augur translate --tree {input.tree} \
            --vcf-reference {input.ref} \
            --ancestral-sequences {input.vcf} \
            --genes {input.genes} \
            --reference-sequence {input.gene_ref} \
            --output-node-data {output.aa_data} \
            --alignment-output {output.vcf_out} \
            --vcf-reference-output {output.vcf_ref}
        """

Reconstruct Ancestral Traits 

traits can reconstruct the probable ancestral state of traits like location and host (or others). This is done by specifying a column or columns in the metadata file.

--confidence will give confidence estimates for the reconstructed traits. The output will be a JSON file with the trait (and confidence, if specified) information for each node.

rule traits:
    input:
        tree = rules.refine.output.tree,
        meta = meta_file
    output:
        "results/traits.json"
    params:
        traits = 'location'
    shell:
        """
        augur traits --tree {input.tree} \
            --metadata {input.meta} \
            --columns {params.traits} \
            --output-node-data {output}
        """

Identify Specified Clades 

In the original paper, the authors identified ‘sublineages’ within the dataset. We can add these to our dataset as ‘clades’ by defining the sublineages with amino-acid or nucleotide mutations specific to that sublineage, given here in the file config/clades.tsv. Open it up in a text editor to have a look at the format.

The clades.tsv file must be tab-delimited with four columns: clade, gene, site, and alt. The ‘clade’ column gives the name of the clade being defined - you can have more than one row per clade - it will only be defined from the branch where all criteria are met. The ‘gene’ and ‘site’ columns specify the gene (or nuc for nucleotide) and location (by AA position in the gene, or nucleotide position in the genome) where the branch must have the ‘alt’ (4th column) value to be considered this clade.

As clades, these sublineages will be labelled and we’ll be able to color the tree by them.

You can specify clades for your own data by first doing a run without clades, then mousing over branches where you’d like to start defining a clade to see what mutations are present.

We’ll need to add the file that defines the clades to the top of our Snakefile:

clades_file = "config/clades.tsv"

rule clades:
    input:
        tree = rules.refine.output.tree,
        aa_muts = rules.translate.output.aa_data,
        nuc_muts = rules.ancestral.output.nt_data,
        clades = clades_file
    output:
        clade_data = "results/clades.json"
    shell:
        """
        augur clades --tree {input.tree} \
            --mutations {input.nuc_muts} {input.aa_muts} \
            --clades {input.clades} \
            --output-node-data {output.clade_data}
        """

Identify Drug Resistance Mutations 

sequence-traits can identify any trait associated with particular nucleotide or amino-acid mutations, not just drug resistance mutations (DRMs).

This dataset doesn’t actually contain any drug resistance mutations, but identifying such mutations is often of interest to those working on tuberculosis. Here, we’ll run this step as an example, even though it won’t add anything to the tree for this dataset.

Open up the config/DRMs-AAnuc.tsv file to see the format of a file that specifies sequence traits. It contains five columns: GENE, SITE, ALT, DISPLAY_NAME, and FEATURE. DISPLAY_NAME can be blank.

For drug resistance, we list the gene, the AA position in the gene, the AA mutation that confers resistance (you can list a site multiple times if multiple bases give resistance), and the name of the drug this mutation gives resistance to:

GENE    SITE    ALT DISPLAY_NAME    FEATURE
gyrB    461 N       Fluoroquinolones
gyrB    499 D       Fluoroquinolones
rpoB    432 E       Rifampicin
rpoB    432 K       Rifampicin

We can leave DISPLAY_NAME blank, as Auspice will by default display the gene, site, and original and alternative base.

For mutations outside of protein-coding genes, we can specify their position using nucleotides:

GENE    SITE    ALT DISPLAY_NAME    FEATURE
nuc 1472749 A   rrs: C904A  Streptomycin
nuc 1473246 G   rrs: A1401G Amikacin Capreomycin Kanamycin
nuc 1673423 T   fabG1: G-17T    Isoniazid Ethionamide
nuc 1673425 T   fabG1: C-15T    Isoniazid Ethionamide

In the literature, these mutations are still referred to by their position within non-protein-coding genes (rrs) or location near genes (-17 fabG1), not their nucleotide location. We can ensure Auspice displays the more useful common nomenclature by giving entries for the DISPLAY_NAME column.

sequence-traits will return a value for each “feature” - for example, all the mutations on the tree that lead to resistance to Streptomycin. It will also generate a count either of the total number of “features” each node has (ex: the total number of drugs a sequence is resistant to), or the total number or mutations specified in the file each node has (ex: the total number of DRMs a sequence has, even if some are for the same drug). You can specify a name for this count using the --label argument (here: “Drug_Resistance”). The --count argument value specifies whether to count the number of traits (ex: drugs resistant to) (use traits) or number of overall mutations (use mutations).

We’ll need to add the file that defines the sequence traits (DRMs) to the top of our Snakefile:

drms_file = "config/DRMs-AAnuc.tsv"

rule seqtraits:
    input:
        align = rules.ancestral.output.vcf_out,
        ref = ref_file,
        trans_align = rules.translate.output.vcf_out,
        trans_ref = rules.translate.output.vcf_ref,
        drms = drms_file
    output:
        drm_data = "results/drms.json"
    params:
        count = "traits",
        label = "Drug_Resistance"
    shell:
        """
        augur sequence-traits \
            --ancestral-sequences {input.align} \
            --vcf-reference {input.ref} \
            --translations {input.trans_align} \
            --vcf-translate-reference {input.trans_ref} \
            --features {input.drms} \
            --count {params.count} \
            --label {params.label} \
            --output-node-data {output.drm_data}
        """

Export the Results 

Finally, collect all node annotations and metadata and export it all in Auspice’s JSON format. The resulting tree and metadata JSON files are the inputs to the Auspice visualization tool.

The names of the output tree and metadata files are here specified by a rule called all at the beginning of our Snakefile. It should be even before the list of files, and looks like this:

rule all:
    input:
        auspice_tree = "auspice/tb_tree.json",
        auspice_meta = "auspice/tb_meta.json"

This rule tells Snakemake what the final output of our entire run should look like. It will run all rules necessary to produce these files, so they should be the names of your final step. If you have an “all” rule, you can run your entire analysis just by running snakemake --cores 1 or snakemake --cores 1 --snakefile Snakefile2 (if the name of your Snakefile is not ‘Snakefile’).

We’ll need to add a few remaining files to our list of files at the start of our Snakefile:

colors_file = "config/color.tsv",
config_file = "config/config.json",
geo_info_file = "config/lat_longs.tsv"

The color.tsv file is optional, but allows us to specify our own colors for particular traits. If you open it up, you can see that we choose our own colors for values in ‘region’, ‘country’, ‘location’ and ‘clade_membership’. If you don’t supply a color.tsv file, Auspice will choose colors for you. This can be the simplest way to start - then you can add colors for any traits where you don’t like what Auspice has chosen.

The lat_longs.tsv file contains the latitudes and longitudes for the geographic locations of your data, and may or may not be needed for your data. Augur contains many latitudes and longitudes for countries and regions, but if you want to specify data at a different level (state, province, county, city), you can include your own file as well (it will be used in addition to the defaults, so country location can still be retrieved from the Augur file, for example). At the bottom of the config/lat_longs.tsv file in the TB tutorial, notice there are entries for ‘location’, listing each village.

Since all the samples come from the region of North America and the country of Canada, we don’t include these anywhere in our data - all the samples would be the same. Instead, we have ‘location’ as a color_options entry, and also as our geo (where the samples will be drawn on the map), and as a filters option.

We also have a color_options entry for ‘clade_membership’, since we designated clades with the clades rule. The trait is added to our tree as clade_membership which is why this is the name of the option and the key value, but we could set the legendTitle and menuItem to be anything we wish, if we wanted.

rule export:
    input:
        tree = rules.refine.output.tree,
        metadata = meta_file,
        branch_lengths = rules.refine.output.node_data,
        traits = rules.traits.output,
        nt_muts = rules.ancestral.output.nt_data,
        aa_muts = rules.translate.output.aa_data,
        drms = rules.seqtraits.output.drm_data,
        color_defs = "config/colors.tsv",
        config = "config/config.json",
        geo_info = "config/lat_longs.tsv",
        clades = rules.clades.output.clade_data
    output:
        auspice_json = "auspice/tb.json",
    shell:
        """
        augur export v2 \
            --tree {input.tree} \
            --metadata {input.metadata} \
            --node-data {input.branch_lengths} {input.traits} {input.drms} {input.aa_muts} {input.nt_muts} {input.clades} \
            --auspice-config {input.config} \
            --colors {input.color_defs} \
            --lat-longs {input.geo_info} \
            --output {output.auspice_json} \
            """

As mentioned previously, this dataset has no drug resistance, so it’s not included in the config.json file to display, even though we ran the sequence-traits rule. If you did have drug resistance information that you wanted to display, you would need to add it to the config.json file as color_options.

First, you would want to add a color-by for the total number of drugs each node is resistant to. Since we gave the label ‘Drug_Resistance’ when we ran the rule, this will be the name of the option, and the key, but we can make the menuItem and legendTitle different if we wish:

"Drug_Resistance": {
 "menuItem": "Drug_Resistance",
 "legendTitle": "Drug Resistance",
 "type": "discrete",
 "key": "Drug_Resistance"
},

If you had given a different label when you ran the rule, you would change this entry to match.

You would then need an option for each drug where you have resistance information (or each FEATURE where you have information). For example, to show the mutations present that confer resistance to Streptomycin and Rifampicin:

"Streptomycin": {
 "menuItem": "Streptomycin",
 "legendTitle": "Streptomycin Resistance",
 "type": "discrete",
 "key": "Streptomycin"
},
"Rifampicin": {
 "menuItem": "Rifampicin",
 "legendTitle": "Rifampicin Resistance",
 "type": "discrete",
 "key": "Rifampicin"
},

You would need an entry for every FEATURE in your original file (though you could then remove any that had no information on the tree).