augur distance


Calculate the distance between sequences across entire genes or at a predefined subset of sites.

Distance calculations require selection of a comparison method (to determine which sequences to compare) and a distance map (to determine the weight of a mismatch between any two sequences).

Comparison methods

Comparison methods include:

  1. root: every node in the tree compared against the root (the previous default for all distances)

  2. ancestor: each tip from a current season and its immediate ancestor (optionally, from a previous season)

  3. pairwise: all tips pairwise (optionally, all tips from a current season against all tips in previous seasons)

Ancestor and pairwise comparisons can be calculated with or without information about the current season. When no dates are provided, the ancestor comparison calculates the distance between each tip and its immediate ancestor in the given tree. Similarly, the pairwise comparison calculates the distance between all pairs of tips in the tree.

When the user provides a “latest date”, all tips sampled after that date belong to the current season and all tips sampled on that date or prior belong to previous seasons. When this information is available, the ancestor comparison calculates the distance between each tip in the current season and its last ancestor from a previous season. The pairwise comparison only calculates the distances between tips in the current season and those from previous seasons.

When the user also provides an “earliest date”, pairwise comparisons exclude tips from previous seasons that were sampled before the given date. Together, these two date parameters let users specify a fixed time interval for pairwise calculations, limiting the computational complexity of the comparisons.
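The date rules above can be sketched in Python. This is an illustration, not augur's implementation: the function name `pairwise_comparisons` and the use of `datetime.date` values are assumptions made for this example (augur itself reads dates from the --date-annotations JSON).

```python
from datetime import date
from itertools import combinations


def pairwise_comparisons(tip_dates, latest_date=None, earliest_date=None):
    """Return the (tip, other_tip) pairs a pairwise comparison would cover.

    tip_dates maps a tip name to its sampling date (hypothetical input
    format). Without a latest date, every pair of tips is compared.
    """
    if latest_date is None:
        return list(combinations(sorted(tip_dates), 2))

    # Tips sampled after the latest date belong to the current season.
    current = sorted(t for t, d in tip_dates.items() if d > latest_date)

    # Tips sampled on or before the latest date belong to previous seasons,
    # unless they were sampled before the optional earliest date.
    previous = sorted(
        t for t, d in tip_dates.items()
        if d <= latest_date and (earliest_date is None or d >= earliest_date)
    )
    return [(c, p) for c in current for p in previous]
```

Narrowing the window with an earliest date shrinks the number of pairs from (current tips × all earlier tips) to (current tips × tips inside the interval), which is where the savings come from.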

For all distance calculations, a consecutive series of gap characters (-) counts as a single difference between any pair of sequences. This behavior reflects the assumption that there was an underlying biological process that produced the insertion or deletion as a single event as opposed to multiple independent insertion/deletion events.
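As a rough illustration of the gap rule, the following Python sketch counts differences between two aligned sequences and collapses each consecutive run of gap mismatches into one difference. The function name `count_differences` is hypothetical, and the handling of edge cases (for example, a matching site interrupting a run) is an assumption of this sketch rather than documented augur behavior.

```python
def count_differences(seq_a, seq_b, gap="-"):
    """Count mismatches between two aligned sequences.

    Each consecutive run of sites where exactly one sequence has a gap
    counts as a single difference (one insertion/deletion event).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    differences = 0
    in_gap_run = False
    for a, b in zip(seq_a, seq_b):
        if a == b:
            # A matching site ends any open gap run (assumption).
            in_gap_run = False
        elif a == gap or b == gap:
            # Only the first site of a gap run adds to the count.
            if not in_gap_run:
                differences += 1
            in_gap_run = True
        else:
            # An ordinary substitution always counts on its own.
            differences += 1
            in_gap_run = False
    return differences
```

For example, `count_differences("ACGTACGT", "AC---CGT")` treats the three-site deletion as one event.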

Distance maps

Distance maps are defined in JSON format with two required top-level keys. The default key specifies the numeric (floating point) value to assign to all mismatches by default. The map key specifies a dictionary of weights to use for distance calculations. These weights are indexed hierarchically by gene name and one-based gene coordinate and are assigned in either a sequence-independent or sequence-dependent manner. The simplest possible distance map calculates Hamming distance between sequences without any site-specific weights, as shown below:

{
    "name": "Hamming distance",
    "default": 1,
    "map": {}
}
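Before running the command, it can help to check a distance map against the two required keys. The sketch below is a hypothetical helper (`parse_distance_map` is not part of augur) that parses the JSON text and verifies only what this document states: a numeric default and a map dictionary.

```python
import json

REQUIRED_KEYS = ("default", "map")


def parse_distance_map(text):
    """Parse distance map JSON text and check its required top-level keys."""
    distance_map = json.loads(text)

    missing = [key for key in REQUIRED_KEYS if key not in distance_map]
    if missing:
        raise ValueError("distance map is missing keys: " + ", ".join(missing))

    default = distance_map["default"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(default, bool) or not isinstance(default, (int, float)):
        raise ValueError("'default' must be a numeric value")

    if not isinstance(distance_map["map"], dict):
        raise ValueError("'map' must be a dictionary keyed by gene name")

    return distance_map
```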

To exclude specific characters such as gaps or ambiguous nucleotides from the distance calculation, define a top-level ignored_characters key with a list of characters to ignore.

{
    "name": "Hamming distance",
    "default": 1,
    "ignored_characters": ["-", "N"],
    "map": {}
}

By default, distances are floating-point values. The optional precision key sets the number of decimal places to retain for each distance. The following example shows how to specify a precision of two decimal places in the final output:

{
    "name": "Hamming distance",
    "default": 1,
    "map": {},
    "precision": 2
}

Distances can be reported as integer values by setting output_type to integer or int, as follows:

{
    "name": "Hamming distance",
    "default": 1,
    "map": {},
    "output_type": "integer"
}

Sequence-independent distances are defined by gene and position using a numeric value of the same type as the default value (integer or float). The following example is a distance map for antigenic amino acid substitutions near influenza A/H3N2 HA’s receptor binding sites. This map calculates the Hamming distance between amino acid sequences only at seven positions in the HA1 gene:

{
    "name": "Koel epitope sites",
    "default": 0,
    "map": {
        "HA1": {
            "145": 1,
            "155": 1,
            "156": 1,
            "158": 1,
            "159": 1,
            "189": 1,
            "193": 1
        }
    }
}
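To make the lookup concrete, here is a minimal Python sketch of how a sequence-independent map could be applied to two aligned amino acid sequences. The function names are hypothetical, and the sketch deliberately leaves out gap handling and ignored characters. It assumes that positions are stored as string keys, as they are in the JSON, and that any site without an entry falls back to the default weight.

```python
def site_weight(distance_map, gene, position):
    """Return the weight for a one-based position in a gene."""
    gene_weights = distance_map["map"].get(gene, {})
    # JSON object keys are strings, so convert the position for lookup.
    return gene_weights.get(str(position), distance_map["default"])


def weighted_distance(distance_map, gene, seq_a, seq_b):
    """Sum the weights of all mismatched sites between two sequences."""
    total = 0
    for index, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b:
            # enumerate is zero-based; gene coordinates are one-based.
            total += site_weight(distance_map, gene, index + 1)
    return total
```

With a default of 0, as in the map above, mismatches outside the listed positions contribute nothing to the total.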

Sequence-dependent distances are defined by gene, position, and sequence pairs where the from sequence in each pair is interpreted as the ancestral state and the to sequence as the derived state. The following example is a distance map that assigns asymmetric weights to specific amino acid substitutions at a specific position in the influenza gene HA1:

{
    "default": 0.0,
    "map": {
       "HA1": {
           "112": [
               {
                   "from": "V",
                   "to": "I",
                   "weight": 1.192
               },
               {
                   "from": "I",
                   "to": "V",
                   "weight": 0.002
               }
           ]
       }
   }
}
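The direction of a substitution matters for sequence-dependent weights. The following Python sketch shows one way to resolve a weight from such a map. The helper name `substitution_weight` is hypothetical, and falling back to the default weight for substitutions that are not listed is an assumption of this sketch.

```python
def substitution_weight(distance_map, gene, position, ancestral, derived):
    """Look up the weight for an ancestral -> derived change at a site."""
    default = distance_map["default"]
    site = distance_map["map"].get(gene, {}).get(str(position))

    if site is None:
        # No entry for this gene and position.
        return default

    if isinstance(site, list):
        # Sequence-dependent site: find the matching from/to pair.
        for entry in site:
            if entry["from"] == ancestral and entry["to"] == derived:
                return entry["weight"]
        return default

    # Sequence-independent site: a single numeric weight.
    return site


example_map = {
    "default": 0.0,
    "map": {
        "HA1": {
            "112": [
                {"from": "V", "to": "I", "weight": 1.192},
                {"from": "I", "to": "V", "weight": 0.002},
            ]
        }
    },
}

forward = substitution_weight(example_map, "HA1", 112, "V", "I")   # 1.192
backward = substitution_weight(example_map, "HA1", 112, "I", "V")  # 0.002
```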

The distance command produces a JSON output file in standard “node data” format that can be passed to augur export. In addition to the standard nodes field, the JSON includes a params field that records how each attribute name maps to its requested comparison and distance map, along with any date parameters specified by the user. The following example JSON shows a sample output when the distance command is run with multiple comparisons and distance maps:

{
    "params": {
        "attributes": ["ep", "ne", "ne_star", "ep_pairwise"],
        "compare_to": ["root", "root", "ancestor", "pairwise"],
        "map_name": [
            "wolf_epitope",
            "wolf_nonepitope",
            "wolf_nonepitope",
            "wolf_epitope"
        ],
        "latest_date": "2009-10-01"
    },
    "nodes": {
        "A/Afghanistan/AF1171/2008": {
            "ep": 7,
            "ne": 6,
            "ne_star": 1,
            "ep_pairwise": {
                "A/Aichi/78/2007": 1,
                "A/Argentina/3509/2006": 2
            }
        }
    }
}
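A short Python sketch of reading this output follows. The helper names are hypothetical. It relies only on the structure shown in the example above: the params lists are aligned by position, and pairwise attributes hold a dictionary of distances keyed by the other tip's name.

```python
def describe_attributes(node_data):
    """Map each attribute name to its comparison method and distance map."""
    params = node_data["params"]
    return {
        attribute: {"compare_to": comparison, "map_name": map_name}
        for attribute, comparison, map_name in zip(
            params["attributes"], params["compare_to"], params["map_name"]
        )
    }


def pairwise_rows(node_data, attribute):
    """Flatten a pairwise attribute into (node, other_node, distance) rows."""
    rows = []
    for node, values in node_data["nodes"].items():
        for other, distance in values.get(attribute, {}).items():
            rows.append((node, other, distance))
    return rows
```

The input for both helpers is the parsed output file, for example the result of `json.load` on the file given to --output.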

usage: augur distance [-h] --tree TREE --alignment ALIGNMENT [ALIGNMENT ...]
                      --gene-names GENE_NAMES [GENE_NAMES ...]
                      --attribute-name ATTRIBUTE_NAME [ATTRIBUTE_NAME ...]
                      --compare-to {root,ancestor,pairwise}
                      [{root,ancestor,pairwise} ...] --map MAP [MAP ...]
                      [--date-annotations DATE_ANNOTATIONS]
                      [--earliest-date EARLIEST_DATE]
                      [--latest-date LATEST_DATE] --output OUTPUT

Named Arguments

--tree

Newick tree

--alignment

sequence(s) to be used, supplied as FASTA files

--gene-names

names of the genes for the given alignments, in the same order as the alignment files

--attribute-name

name under which to store distances for the given distance map; when multiple attribute names are given, each is matched by position to the corresponding --compare-to and --map arguments

--compare-to

Possible choices: root, ancestor, pairwise

type of comparison between samples in the given tree including comparison of all nodes to the root (root), all tips to their last ancestor from a previous season (ancestor), or all tips from the current season to all tips in previous seasons (pairwise)

--map

JSON providing the distance map between sites and, optionally, sequences present at those sites; the distance map JSON minimally requires a ‘default’ field defining a default numeric distance and a ‘map’ field defining a dictionary of genes and one-based coordinates

--date-annotations

JSON of branch lengths and date annotations from augur refine for samples in the given tree; required for comparisons to earliest or latest date

--earliest-date

earliest date at which samples are considered to be from previous seasons (e.g., 2019-01-01). This date is only used in pairwise comparisons. If omitted, all samples prior to the latest date will be considered.

--latest-date

latest date at which samples are considered to be from previous seasons (e.g., 2019-01-01); samples from any date after this are considered part of the current season

--output

JSON file with calculated distances stored by node name and attribute name
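Because --attribute-name, --compare-to, and --map are matched by position, it can be convenient to assemble the command from a list of triples. The Python sketch below only builds the argument list from the flags shown in the usage above. The helper name `build_distance_command` and all file names in the example are hypothetical.

```python
VALID_COMPARISONS = {"root", "ancestor", "pairwise"}


def build_distance_command(tree, alignments, gene_names, comparisons, output,
                           date_annotations=None, earliest_date=None,
                           latest_date=None):
    """Build an argument list for augur distance.

    comparisons is a list of (attribute_name, compare_to, map_file) triples,
    which keeps the three positional lists aligned.
    """
    for _, compare_to, _ in comparisons:
        if compare_to not in VALID_COMPARISONS:
            raise ValueError("unknown comparison method: " + compare_to)

    command = ["augur", "distance", "--tree", tree]
    command += ["--alignment", *alignments]
    command += ["--gene-names", *gene_names]
    command += ["--attribute-name", *[c[0] for c in comparisons]]
    command += ["--compare-to", *[c[1] for c in comparisons]]
    command += ["--map", *[c[2] for c in comparisons]]
    if date_annotations is not None:
        command += ["--date-annotations", date_annotations]
    if earliest_date is not None:
        command += ["--earliest-date", earliest_date]
    if latest_date is not None:
        command += ["--latest-date", latest_date]
    command += ["--output", output]
    return command


# Hypothetical file names, for illustration only.
example_command = build_distance_command(
    tree="tree.nwk",
    alignments=["aa-seq_HA1.fasta"],
    gene_names=["HA1"],
    comparisons=[("ep", "root", "wolf_epitope.json"),
                 ("ep_pairwise", "pairwise", "wolf_epitope.json")],
    output="distances.json",
    date_annotations="branch_lengths.json",
    latest_date="2009-10-01",
)
```

The resulting list can be passed to `subprocess.run(example_command, check=True)` on a system where augur is installed.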