augur distance

Calculate the distance between sequences across entire genes or at a predefined subset of sites.

Distance calculations require selection of a comparison method (to determine which sequences to compare) and a distance map (to determine the weight of a mismatch between any two sequences).

Comparison methods include:

  1. root: the root and all nodes in the tree (the previous default for all distances)

  2. ancestor: each tip from a current season and its immediate ancestor (optionally, from a previous season)

  3. pairwise: all tips pairwise (optionally, all tips from a current season against all tips in previous seasons)

Ancestor and pairwise comparisons can be calculated with or without information about the current season. When no dates are provided, the ancestor comparison calculates the distance between each tip and its immediate ancestor in the given tree. Similarly, the pairwise comparison calculates the distance between all pairs of tips in the tree.

When the user provides a “latest date”, all tips sampled after that date belong to the current season and all tips sampled on that date or prior belong to previous seasons. When this information is available, the ancestor comparison calculates the distance between each tip in the current season and its last ancestor from a previous season. The pairwise comparison only calculates the distances between tips in the current season and those from previous seasons.

When the user also provides an “earliest date”, pairwise comparisons exclude tips sampled from previous seasons prior to the given date. These two date parameters allow users to specify a fixed time interval for pairwise calculations, limiting the computational complexity of the comparisons.
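
The season rules above can be sketched in a few lines of Python. This is an illustrative sketch, not augur's implementation: the function name `partition_tips` and the use of `datetime.date` values keyed by tip name are assumptions made for the example.

```python
from datetime import date


def partition_tips(tip_dates, latest_date=None, earliest_date=None):
    """Split tips into (current, previous) seasons for pairwise comparisons.

    tip_dates maps tip name -> datetime.date (a hypothetical input format).
    """
    if latest_date is None:
        # Without season information, all tips are compared against all tips.
        names = sorted(tip_dates)
        return names, names

    # Tips sampled after the latest date belong to the current season.
    current = sorted(t for t, d in tip_dates.items() if d > latest_date)

    # Tips sampled on or before the latest date belong to previous seasons;
    # an earliest date additionally drops tips sampled before it.
    previous = sorted(
        t for t, d in tip_dates.items()
        if d <= latest_date and (earliest_date is None or d >= earliest_date)
    )
    return current, previous


tips = {
    "A": date(2008, 5, 1),
    "B": date(2009, 10, 1),
    "C": date(2009, 12, 15),
    "D": date(2006, 3, 1),
}
current, previous = partition_tips(
    tips, latest_date=date(2009, 10, 1), earliest_date=date(2007, 1, 1)
)
print(current, previous)
```

Restricting the previous season to a fixed window keeps the number of pairwise comparisons proportional to the size of the two groups rather than to the square of the whole tree.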

Distance maps are defined in JSON format with two required top-level keys. The ‘default’ key specifies the numeric value (integer or float) to assign to all mismatches by default. The ‘map’ key specifies a dictionary of weights to use for distance calculations. These weights are indexed hierarchically by gene name and one-based gene coordinate and are assigned in either a sequence-independent or sequence-dependent manner. The simplest possible distance map calculates Hamming distance between sequences without any site-specific weights, as shown below:

    {
        "name": "Hamming distance",
        "default": 1,
        "map": {}
    }

Sequence-independent distances are defined by gene and position using a numeric value of the same type as the default value (integer or float). The following example is a distance map for antigenic amino acid substitutions near influenza A/H3N2 HA’s receptor binding sites. This map calculates the Hamming distance between amino acid sequences only at seven positions in the HA1 gene:

    {
        "name": "Koel epitope sites",
        "default": 0,
        "map": {
            "HA1": {
                "145": 1,
                "155": 1,
                "156": 1,
                "158": 1,
                "159": 1,
                "189": 1,
                "193": 1
            }
        }
    }

Sequence-dependent distances are defined by gene, position, and sequence pairs, where the ‘from’ sequence in each pair is interpreted as the ancestral state and the ‘to’ sequence as the derived state. The following example is a distance map that assigns asymmetric weights to specific amino acid substitutions at a specific position in the influenza gene HA1:

    {
        "default": 0.0,
        "map": {
            "HA1": {
                "112": [
                    {
                        "from": "V",
                        "to": "I",
                        "weight": 1.192
                    },
                    {
                        "from": "I",
                        "to": "V",
                        "weight": 0.002
                    }
                ]
            }
        }
    }

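To make the three kinds of distance map concrete, the following Python sketch applies a map to a pair of aligned sequences. It is an illustration under stated assumptions rather than augur's own code: the function name `weighted_distance` is hypothetical, substitutions that a sequence-dependent site does not list are assumed to fall back to the default weight, and gaps or ambiguous characters receive no special handling.

```python
def weighted_distance(ancestral, derived, distance_map, gene):
    """Sum mismatch weights between two aligned sequences of one gene."""
    default = distance_map["default"]
    gene_map = distance_map["map"].get(gene, {})
    total = 0

    for index, (a, b) in enumerate(zip(ancestral, derived)):
        if a == b:
            continue  # identical states contribute nothing

        # Map coordinates are one-based and stored as JSON string keys.
        site = gene_map.get(str(index + 1))

        if site is None:
            total += default  # no site-specific weight
        elif isinstance(site, list):
            # Sequence-dependent: look for a matching from/to pair.
            for pair in site:
                if pair["from"] == a and pair["to"] == b:
                    total += pair["weight"]
                    break
            else:
                total += default  # assumption: unlisted pairs use the default
        else:
            total += site  # sequence-independent weight
    return total


hamming = {"default": 1, "map": {}}
site_specific = {"default": 0, "map": {"HA1": {"2": 1}}}
asymmetric = {
    "default": 0.0,
    "map": {"HA1": {"2": [
        {"from": "V", "to": "I", "weight": 1.192},
        {"from": "I", "to": "V", "weight": 0.002},
    ]}},
}

print(weighted_distance("ACGT", "ACCA", hamming, "HA1"))
print(weighted_distance("KVT", "KIS", site_specific, "HA1"))
print(weighted_distance("KV", "KI", asymmetric, "HA1"))
print(weighted_distance("KI", "KV", asymmetric, "HA1"))
```

Because the sequence-dependent weights depend on direction, swapping the ancestral and derived sequences changes the result, which is why the ‘from’ and ‘to’ roles matter.
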
The distance command produces a JSON output file in standard “node data” format that can be passed to augur export. In addition to the standard nodes field, the JSON includes a params field that describes the mapping of attribute names to requested comparisons and distance maps and any date parameters specified by the user. The following example JSON shows a sample output when the distance command is run with multiple comparisons and distance maps:

    {
        "params": {
            "attributes": ["ep", "ne", "ne_star", "ep_pairwise"],
            "compare_to": ["root", "root", "ancestor", "pairwise"],
            "map_name": [ ... ],
            "latest_date": "2009-10-01"
        },
        "nodes": {
            "A/Afghanistan/AF1171/2008": {
                "ep": 7,
                "ne": 6,
                "ne_star": 1,
                "ep_pairwise": {
                    "A/Aichi/78/2007": 1,
                    "A/Argentina/3509/2006": 2
                }
            }
        }
    }

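A downstream script might read this output as sketched below. The embedded JSON is a trimmed copy of the sample above (the distance map names are omitted), and the helper name `flatten_distances` is hypothetical. The sketch assumes, as in the sample, that pairwise attributes hold a dictionary keyed by the other tip while root and ancestor attributes hold a single number.

```python
import json

node_data = json.loads("""
{
    "params": {
        "attributes": ["ep", "ne", "ne_star", "ep_pairwise"],
        "compare_to": ["root", "root", "ancestor", "pairwise"],
        "latest_date": "2009-10-01"
    },
    "nodes": {
        "A/Afghanistan/AF1171/2008": {
            "ep": 7,
            "ne": 6,
            "ne_star": 1,
            "ep_pairwise": {
                "A/Aichi/78/2007": 1,
                "A/Argentina/3509/2006": 2
            }
        }
    }
}
""")

# Attribute names and comparison methods are linked by position.
params = node_data["params"]
comparison_by_attribute = dict(zip(params["attributes"], params["compare_to"]))


def flatten_distances(data):
    """Return (node, attribute, other_tip, distance) rows from node data."""
    rows = []
    for node, values in data["nodes"].items():
        for attribute in comparison_by_attribute:
            if attribute not in values:
                continue
            value = values[attribute]
            if isinstance(value, dict):
                # Pairwise: one distance per compared tip.
                rows.extend((node, attribute, other, d) for other, d in value.items())
            else:
                # Root or ancestor: a single distance per node.
                rows.append((node, attribute, None, value))
    return rows


for row in flatten_distances(node_data):
    print(row)
```
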
usage: augur distance [-h] --tree TREE --alignment ALIGNMENT [ALIGNMENT ...]
                      --gene-names GENE_NAMES [GENE_NAMES ...]
                      --attribute-name ATTRIBUTE_NAME [ATTRIBUTE_NAME ...]
                      --compare-to {root,ancestor,pairwise}
                      [{root,ancestor,pairwise} ...] --map MAP [MAP ...]
                      [--date-annotations DATE_ANNOTATIONS]
                      [--earliest-date EARLIEST_DATE]
                      [--latest-date LATEST_DATE] --output OUTPUT
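
Because attribute names, comparison methods, and distance maps are matched by position, it can help to assemble the command from a list of triples. The Python sketch below only builds and prints the argument list using the flags from the usage line; it does not run augur. The helper name `build_distance_command` and every file name are hypothetical placeholders.

```python
import shlex


def build_distance_command(tree, alignment, gene, comparisons, output,
                           date_annotations=None, latest_date=None):
    """Build an `augur distance` argument list.

    comparisons is a list of (attribute name, comparison method, map file)
    triples, which keeps the three positional lists in the same order.
    """
    if latest_date is not None and date_annotations is None:
        raise ValueError("date annotations are required when a latest date is given")

    command = [
        "augur", "distance",
        "--tree", tree,
        "--alignment", alignment,
        "--gene-names", gene,
        "--attribute-name", *[name for name, _, _ in comparisons],
        "--compare-to", *[method for _, method, _ in comparisons],
        "--map", *[map_file for _, _, map_file in comparisons],
    ]
    if latest_date is not None:
        command += ["--date-annotations", date_annotations,
                    "--latest-date", latest_date]
    command += ["--output", output]
    return command


# Hypothetical file names, for illustration only.
command = build_distance_command(
    tree="tree.nwk",
    alignment="HA1.fasta",
    gene="HA1",
    comparisons=[
        ("ep", "root", "epitope_map.json"),
        ("ep_pairwise", "pairwise", "epitope_map.json"),
    ],
    output="distances.json",
    date_annotations="branch_lengths.json",
    latest_date="2009-10-01",
)
print(shlex.join(command))
```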

Named Arguments

--tree
    Newick tree

--alignment
    sequence(s) to be used, supplied as FASTA files

--gene-names
    names of the sequences in the alignment, same order assumed

--attribute-name
    name to store distances associated with the given distance map; multiple attribute names are linked to corresponding positional comparison method and distance map arguments

--compare-to
    Possible choices: root, ancestor, pairwise

    type of comparison between samples in the given tree including comparison of all nodes to the root (root), all tips to their last ancestor from a previous season (ancestor), or all tips from the current season to all tips in previous seasons (pairwise)

--map
    JSON providing the distance map between sites and, optionally, sequences present at those sites; the distance map JSON minimally requires a ‘default’ field defining a default numeric distance and a ‘map’ field defining a dictionary of genes and one-based coordinates

--date-annotations
    JSON of branch lengths and date annotations from augur refine for samples in the given tree; required for comparisons to earliest or latest date

--earliest-date
    earliest date at which samples are considered to be from previous seasons (e.g., 2019-01-01). This date is only used in pairwise comparisons. If omitted, all samples prior to the latest date will be considered.

--latest-date
    latest date at which samples are considered to be from previous seasons (e.g., 2019-01-01); samples from any date after this are considered part of the current season

--output
    JSON file with calculated distances stored by node name and attribute name