augur ancestral

Infer ancestral sequences based on a tree.

The ancestral sequences are inferred using TreeTime. Each internal node gets assigned a nucleotide sequence that maximizes a likelihood on the tree given its descendants and its parent node. Each node then gets assigned a list of nucleotide mutations for any position that has a mismatch between its own sequence and its parent's sequence. The node sequences and mutations are output to a node-data JSON file.
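The mutation-calling step described above can be pictured as a position-by-position comparison of two aligned sequences. The following is a minimal sketch, not TreeTime's or augur's implementation: the function name `call_mutations` and the short sequences are hypothetical, it treats gaps and ambiguous characters like any other state, and it assumes the parent state is written first, as the example node-data JSON suggests.

```python
def call_mutations(parent_seq: str, node_seq: str) -> list[str]:
    """List mismatches between a node and its parent as '<parent><position><node>' strings."""
    if len(parent_seq) != len(node_seq):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{parent_state}{position}{node_state}"
        # start=1 gives one-based positions, matching the node-data JSON
        for position, (parent_state, node_state) in enumerate(zip(parent_seq, node_seq), start=1)
        if parent_state != node_state
    ]


# Hypothetical aligned sequences that differ at positions 4 and 12.
parent = "TCCAAACAAAGT"
node = "TCCGAACAAAGC"
print(call_mutations(parent, node))  # ['A4G', 'T12C']
```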

If amino acid options are provided, the ancestral amino acid sequences for each requested gene are inferred with the same method as the nucleotide sequences described above. The inferred amino acid mutations will be included in the output node-data JSON file, with the format equivalent to the output of augur translate.

The nucleotide and amino acid sequences are inferred separately in this command, which can potentially result in mismatches between the nucleotide and amino acid mutations. If you want amino acid mutations based on the inferred nucleotide sequences, please use augur translate.

Note

The mutation positions in the node-data JSON are one-based.
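Because the positions are one-based, code that indexes into a sequence string needs to subtract one. A small sketch of that conversion follows; the helper `parse_mutation` and its regular expression are assumptions for illustration and are not part of augur.

```python
import re

# One non-digit state, a run of digits, one non-digit state (e.g. "A4461G").
MUTATION_RE = re.compile(r"^(\D)(\d+)(\D)$")


def parse_mutation(mutation: str) -> tuple[str, int, str]:
    """Split a mutation string into (parent state, one-based position, node state)."""
    match = MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation string: {mutation!r}")
    parent_state, position, node_state = match.groups()
    return parent_state, int(position), node_state


parent_state, position, node_state = parse_mutation("A4461G")
index = position - 1  # zero-based index into a Python string
print(parent_state, position, node_state, index)  # A 4461 G 4460
```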

usage: augur ancestral [-h] --tree TREE [--alignment ALIGNMENT]
                       [--vcf-reference FASTA | --root-sequence FASTA/GenBank]
                       [--inference {joint,marginal}] [--seed SEED]
                       [--keep-ambiguous | --infer-ambiguous]
                       [--keep-overhangs] [--genes GENES [GENES ...]]
                       [--annotation ANNOTATION] [--translations TRANSLATIONS]
                       [--report-inconsistent-translation]
                       [--aa-root-sequence FASTA]
                       [--output-node-data OUTPUT_NODE_DATA]
                       [--output-sequences OUTPUT_SEQUENCES]
                       [--output-translations OUTPUT_TRANSLATIONS]
                       [--output-vcf OUTPUT_VCF]
                       [--validation-mode {error,warn,skip}]
                       [--skip-validation]
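As an illustration of the usage line above, a basic nucleotide reconstruction could be assembled from Python as follows. This is only a sketch: the file names `tree.nwk`, `aligned.fasta`, and `nt_muts.json` and the helper `build_ancestral_command` are hypothetical, and only flags listed in the usage are used.

```python
import shlex


def build_ancestral_command(tree: str, alignment: str, node_data: str,
                            inference: str = "joint") -> list[str]:
    """Assemble an argument list for a nucleotide-only `augur ancestral` run."""
    return [
        "augur", "ancestral",
        "--tree", tree,
        "--alignment", alignment,
        "--inference", inference,
        "--output-node-data", node_data,
    ]


cmd = build_ancestral_command("tree.nwk", "aligned.fasta", "nt_muts.json")
print(shlex.join(cmd))
# To actually run it (requires augur to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems when file names contain spaces.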

inputs

Tree and sequences to use for ancestral reconstruction

--tree, -t

prebuilt tree in Newick format

nucleotide options

Options to configure reconstruction of nucleotide sequences.

--alignment, -a

alignment in FASTA or VCF format

--vcf-reference

[VCF alignment only] file of the sequence the VCF was mapped to. Differences between this sequence and the inferred root will be reported as mutations on the root branch.

--root-sequence

[FASTA alignment only] file of the sequence that is used as root for mutation calling. Differences between this sequence and the inferred root will be reported as mutations on the root branch. If also reconstructing AA sequences, this (nuc) sequence will be translated to form the AA root sequences unless --aa-root-sequence is provided.

global options

Options to configure reconstruction of both nucleotide and amino acid sequences

--inference

Possible choices: joint, marginal

calculate joint or marginal maximum likelihood ancestral sequence states

Default: 'joint'

--seed

seed for random number generation

--keep-ambiguous

do not infer ambiguous states on tip sequences

Default: False

--infer-ambiguous

infer ambiguous states on tip sequences and replace with most likely state

Default: True

--keep-overhangs

do not infer states for gaps (-) on either side of the alignment

Default: False

amino acid options

Options to configure reconstruction of ancestral amino acid sequences.

--genes

gene(s) to translate (list or file containing list).

--annotation

GenBank or GFF file containing the annotation. Optional if reconstructing a single gene without nuc data.

--translations

Translated alignments for each CDS/gene. If you are translating multiple genes, you must specify the file name via a template like 'aa_sequences_%GENE.fasta', where %GENE will be replaced by the gene name. If you are translating a single gene, using a template is optional. Currently only supported for FASTA input.

--report-inconsistent-translation

Report where amino acid reconstruction differed from a translation of the reconstructed nuc sequence. Requires nucleotide reconstruction.

Default: False

--aa-root-sequence

File(s) of the AA root sequence(s). Differences between this sequence and the inferred root will be reported as mutations on the root branch for each gene. For more than one gene this must include %GENE like other arguments.

outputs

Outputs supported for reconstructed ancestral sequences

--output-node-data

name of JSON file to save mutations and ancestral sequences to

--output-sequences

name of FASTA file to save ancestral nucleotide sequences to (FASTA alignments only)

--output-translations

name of the FASTA file(s) to save ancestral amino acid sequences to. Specify the file name via a template like 'ancestral_aa_sequences_%GENE.fasta', where %GENE will be replaced by the gene name.
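The %GENE placeholder used by the file-name templates can be pictured as a plain string substitution, one file per gene. The sketch below is an assumption about the expected result, not augur's own code; the helper `expand_gene_template` and the gene names `HA1` and `HA2` are hypothetical.

```python
def expand_gene_template(template: str, genes: list[str]) -> dict[str, str]:
    """Map each gene name to the file name produced by substituting %GENE."""
    if len(genes) > 1 and "%GENE" not in template:
        # Without the placeholder, every gene would map to the same file.
        raise ValueError("a template containing %GENE is needed for multiple genes")
    return {gene: template.replace("%GENE", gene) for gene in genes}


paths = expand_gene_template("ancestral_aa_sequences_%GENE.fasta", ["HA1", "HA2"])
for gene, path in paths.items():
    print(gene, path)
```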

--output-vcf

name of the output VCF file, which will include the ancestral sequences

general

--validation-mode

Possible choices: error, warn, skip

Control whether optional validation checks are performed and what happens if they fail.

'error' and 'warn' modes perform validation and emit messages about failed validation checks. 'error' mode causes a non-zero exit status if any validation checks failed, while 'warn' does not.

'skip' mode performs no validation.

Note that some validation checks are non-optional and as such are not affected by this setting.

Default: error

--skip-validation

Skip validation of input/output files, equivalent to --validation-mode=skip. Use at your own risk!

Example Node Data JSON

Here's an example of the output node-data JSON where NODE_1 has no mutations compared to its parent and NODE_2 has multiple mutations.

{
    "nodes": {
        "NODE_1": {
            "muts": [],
            "sequence": "TCCAAACAAAGT..."
        },
        "NODE_2": {
            "muts": [
              "A4461G",
              "A6591G",
              "A9184C",
              "A10385T",
              "T15098C"
            ],
            "sequence": "TCCAAACAAAGT..."
        }
    }
}
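A node-data file with this structure can be read using Python's standard `json` module. The sketch below embeds the example above as a string so that it is self-contained; in practice you would load the file named by --output-node-data instead. The variable names are illustrative only.

```python
import json

# The example node-data JSON from above, with the sequences left truncated as shown.
node_data_text = """
{
    "nodes": {
        "NODE_1": {"muts": [], "sequence": "TCCAAACAAAGT..."},
        "NODE_2": {
            "muts": ["A4461G", "A6591G", "A9184C", "A10385T", "T15098C"],
            "sequence": "TCCAAACAAAGT..."
        }
    }
}
"""

node_data = json.loads(node_data_text)

# Count the mutations assigned to each node relative to its parent.
mutation_counts = {
    name: len(node.get("muts", []))
    for name, node in node_data["nodes"].items()
}
print(mutation_counts)  # {'NODE_1': 0, 'NODE_2': 5}
```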