augur.distance

Calculate the distance between sequences across entire genes or at a predefined subset of sites.

Distance calculations require selection of a comparison method (to determine which sequences to compare) and a distance map (to determine the weight of a mismatch between any two sequences).

Comparison methods

Comparison methods include:

  1. root: the root and all nodes in the tree (the previous default for all distances)

  2. ancestor: each tip from a current season and its immediate ancestor (optionally, from a previous season)

  3. pairwise: all tips pairwise (optionally, all tips from a current season against all tips in previous seasons)

Ancestor and pairwise comparisons can be calculated with or without information about the current season. When no dates are provided, the ancestor comparison calculates the distance between each tip and its immediate ancestor in the given tree. Similarly, the pairwise comparison calculates the distance between all pairs of tips in the tree.

When the user provides a “latest date”, all tips sampled after that date belong to the current season and all tips sampled on that date or prior belong to previous seasons. When this information is available, the ancestor comparison calculates the distance between each tip in the current season and its last ancestor from a previous season. The pairwise comparison only calculates the distances between tips in the current season and those from previous seasons.

When the user also provides an “earliest date”, pairwise comparisons exclude tips from previous seasons that were sampled prior to the given date. Together, these two date parameters let users specify a fixed time interval for pairwise calculations, limiting the computational complexity of the comparisons.
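The date rules above can be sketched as a small selection function. This is only an illustration of the prose description, not augur's implementation: the function name select_pairs, the tip names, and the use of datetime.date are assumptions, as is the choice to return ordered pairs when no dates are given.

```python
from datetime import date


def select_pairs(tip_dates, latest_date=None, earliest_date=None):
    """Return (tip, other_tip) name pairs to compare (hypothetical helper).

    tip_dates maps tip names to sampling dates.
    """
    if latest_date is None:
        # No season information: compare every tip with every other tip.
        return [(a, b) for a in tip_dates for b in tip_dates if a != b]

    # Tips sampled after the latest date belong to the current season.
    current = [name for name, d in tip_dates.items() if d > latest_date]

    # Tips sampled on or before the latest date belong to previous seasons,
    # optionally excluding tips sampled prior to the earliest date.
    previous = [
        name
        for name, d in tip_dates.items()
        if d <= latest_date and (earliest_date is None or d >= earliest_date)
    ]
    return [(c, p) for c in current for p in previous]


tips = {
    "tip_2005": date(2005, 1, 1),
    "tip_2008": date(2008, 1, 1),
    "tip_2009a": date(2009, 6, 1),
    "tip_2009b": date(2009, 12, 1),
}
pairs = select_pairs(tips, latest_date=date(2009, 10, 1), earliest_date=date(2007, 1, 1))
print(pairs)
```

With both dates set, the only current-season tip is compared against the two tips sampled between the earliest and latest dates, so the number of comparisons no longer grows with the full size of the tree.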

For all distance calculations, a consecutive series of gap characters (-) counts as a single difference between any pair of sequences. This behavior reflects the assumption that there was an underlying biological process that produced the insertion or deletion as a single event as opposed to multiple independent insertion/deletion events.
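A minimal sketch of this gap rule, assuming unit weights and two aligned sequences of equal length. The function count_differences is hypothetical and much simpler than the real calculation, which also applies site-specific and sequence-specific weights.

```python
def count_differences(ancestral, derived, gap="-"):
    """Count mismatches, treating each run of consecutive gap sites as one event."""
    if len(ancestral) != len(derived):
        raise ValueError("sequences must be aligned to the same length")

    differences = 0
    in_gap_run = False
    for a, d in zip(ancestral, derived):
        if a == d:
            # Matching sites end any open gap run.
            in_gap_run = False
        elif gap in (a, d):
            # Only the first site of a gap run adds to the count.
            if not in_gap_run:
                differences += 1
            in_gap_run = True
        else:
            # An ordinary mismatch counts once and ends any open gap run.
            differences += 1
            in_gap_run = False
    return differences


print(count_differences("ACTG", "A--G"))      # one indel event
print(count_differences("ACTGTA", "A--CCA"))  # one indel event plus two mismatches
```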

Distance maps

Distance maps are defined in JSON format with two required top-level keys. The default key specifies the numeric (floating point) value to assign to all mismatches by default. The map key specifies a dictionary of weights to use for distance calculations. These weights are indexed hierarchically by gene name and one-based gene coordinate and are assigned in either a sequence-independent or sequence-dependent manner. The simplest possible distance map calculates Hamming distance between sequences without any site-specific weights, as shown below:

{
    "name": "Hamming distance",
    "default": 1,
    "map": {}
}
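As a rough illustration of how such a map could be consumed, the sketch below parses the JSON, checks the two required keys, and sums the default weight over mismatched sites. The helper names load_distance_map and hamming_distance are assumptions made for this example, and the sketch ignores the gap handling described earlier.

```python
import json

REQUIRED_KEYS = ("default", "map")


def load_distance_map(text):
    """Parse a distance map from a JSON string and check its required keys."""
    distance_map = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in distance_map]
    if missing:
        raise ValueError(f"distance map is missing required keys: {missing}")
    return distance_map


def hamming_distance(seq_a, seq_b, distance_map):
    """Sum the default weight over mismatched sites of two aligned sequences."""
    default = distance_map["default"]
    return float(sum(default for a, b in zip(seq_a, seq_b) if a != b))


hamming_map = load_distance_map('{"name": "Hamming distance", "default": 1, "map": {}}')
print(hamming_distance("ACTG", "ACGG", hamming_map))  # 1.0
```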

To exclude specific characters such as gaps or ambiguous nucleotides from the distance calculation, define a top-level ignored_characters key with a list of characters to ignore:

{
    "name": "Hamming distance",
    "default": 1,
    "ignored_characters": ["-", "N"],
    "map": {}
}
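One way to picture the effect of ignored_characters is a mismatch counter that skips any site where either sequence carries an ignored character. The function below is a hypothetical sketch with unit weights, not the library's code.

```python
def count_mismatches(seq_a, seq_b, ignored_characters=()):
    """Count mismatched sites, skipping sites that contain an ignored character."""
    ignored = set(ignored_characters)
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a not in ignored and b not in ignored
    )


# Ignoring gaps leaves only the G/N mismatch at the last site.
print(count_mismatches("ACTGG", "A--GN", ignored_characters=["-"]))       # 1
# Ignoring gaps and N leaves no countable mismatches.
print(count_mismatches("ACTGG", "A--GN", ignored_characters=["-", "N"]))  # 0
```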

By default, distances are floating point values whose precision can be controlled with the precision key that defines the number of decimal places to retain for each distance. The following example shows how to specify a precision of two decimal places in the final output:

{
    "name": "Hamming distance",
    "default": 1,
    "map": {},
    "precision": 2
}

Distances can be reported as integer values by specifying an output_type as integer or int as follows:

{
    "name": "Hamming distance",
    "default": 1,
    "map": {},
    "output_type": "integer"
}
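The precision and output_type keys can be thought of as a final formatting step applied to each distance. The sketch below is an assumption about how that step might look. In particular, converting with int() (truncation) is a guess, since the documented example of 3.14159 becoming 3 is consistent with both truncation and rounding.

```python
def format_distance(value, precision=None, output_type=None):
    """Apply optional precision and output type to a distance (hypothetical helper)."""
    if precision is not None:
        # Keep the requested number of decimal places.
        value = round(value, precision)

    if output_type in ("integer", "int"):
        # Assumption: integer output truncates the fractional part.
        return int(value)
    if output_type is not None:
        raise ValueError(f"Unsupported output type of '{output_type}'")

    return float(value)


print(format_distance(3.14159, precision=2))            # 3.14
print(format_distance(3.14159, output_type="integer"))  # 3
```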

Sequence-independent distances are defined by gene and position using a numeric value of the same type as the default value (integer or float). The following example is a distance map for antigenic amino acid substitutions near influenza A/H3N2 HA’s receptor binding sites. This map calculates the Hamming distance between amino acid sequences only at seven positions in the HA1 gene:

{
    "name": "Koel epitope sites",
    "default": 0,
    "map": {
        "HA1": {
            "145": 1,
            "155": 1,
            "156": 1,
            "158": 1,
            "159": 1,
            "189": 1,
            "193": 1
        }
    }
}
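To make the coordinate convention concrete, the sketch below converts one-based JSON positions to zero-based string indices and then sums the site weights over mismatched positions. Both helper names are hypothetical, and the sketch covers a single gene without gap handling.

```python
def to_zero_based(site_weights):
    """Convert one-based JSON positions (strings) to zero-based integer indices."""
    return {int(position) - 1: weight for position, weight in site_weights.items()}


def site_weighted_distance(seq_a, seq_b, weights_by_index, default=0):
    """Sum per-site weights over mismatched sites, falling back to the default."""
    total = 0.0
    for index, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b:
            total += weights_by_index.get(index, default)
    return total


# Position "145" in the JSON corresponds to index 144 in a Python string.
print(to_zero_based({"145": 1}))

weights = to_zero_based({"3": 1.0})
print(site_weighted_distance("ACTG", "ACGG", weights))  # mismatch at weighted site: 1.0
print(site_weighted_distance("ACTG", "AGTG", weights))  # mismatch elsewhere: 0.0
```

With a default of 0, mismatches outside the listed positions contribute nothing, which is how the map above restricts the distance to seven sites.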

Sequence-dependent distances are defined by gene, position, and sequence pairs where the from sequence in each pair is interpreted as the ancestral state and the to sequence as the derived state. The following example is a distance map that assigns asymmetric weights to specific amino acid substitutions at a specific position in the influenza gene HA1:

{
    "default": 0.0,
    "map": {
       "HA1": {
           "112": [
               {
                   "from": "V",
                   "to": "I",
                   "weight": 1.192
               },
               {
                   "from": "I",
                   "to": "V",
                   "weight": 0.002
               }
           ]
       }
   }
}
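The asymmetry is easier to see once the list of from/to entries is turned into a lookup keyed by (ancestral, derived) pairs, which mirrors the tuple-keyed form shown in the read_distance_map examples. The helpers pairs_to_weights and site_weight are illustrative assumptions rather than part of the module.

```python
def pairs_to_weights(entries):
    """Index a list of from/to/weight entries by (ancestral, derived) tuples."""
    return {(entry["from"], entry["to"]): entry["weight"] for entry in entries}


def site_weight(weights, ancestral, derived, default=0.0):
    """Look up the weight for a substitution, using the default when undefined."""
    return weights.get((ancestral, derived), default)


entries = [
    {"from": "V", "to": "I", "weight": 1.192},
    {"from": "I", "to": "V", "weight": 0.002},
]
weights = pairs_to_weights(entries)

print(site_weight(weights, "V", "I"))  # 1.192
print(site_weight(weights, "I", "V"))  # 0.002
print(site_weight(weights, "V", "A"))  # 0.0, no weight defined for this pair
```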

The distance command produces a JSON output file in standard “node data” format that can be passed to augur export. In addition to the standard nodes field, the JSON includes a params field that describes the mapping of attribute names to requested comparisons and distance maps and any date parameters specified by the user. The following example JSON shows a sample output when the distance command is run with multiple comparisons and distance maps:

{
    "params": {
        "attributes": ["ep", "ne", "ne_star", "ep_pairwise"],
        "compare_to": ["root", "root", "ancestor", "pairwise"],
        "map_name": [
            "wolf_epitope",
            "wolf_nonepitope",
            "wolf_nonepitope",
            "wolf_epitope"
        ],
        "latest_date": "2009-10-01"
    },
    "nodes": {
        "A/Afghanistan/AF1171/2008": {
            "ep": 7,
            "ne": 6,
            "ne_star": 1,
            "ep_pairwise": {
                "A/Aichi/78/2007": 1,
                "A/Argentina/3509/2006": 2
            }
        }
    }
}
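A short sketch of how a downstream script might read this output, assuming the three lists in params are parallel (one entry per attribute) as the example suggests. The JSON is a trimmed copy of the sample above, and the variable names are made up for illustration.

```python
import json

node_data = json.loads("""
{
    "params": {
        "attributes": ["ep", "ne", "ne_star", "ep_pairwise"],
        "compare_to": ["root", "root", "ancestor", "pairwise"],
        "map_name": ["wolf_epitope", "wolf_nonepitope", "wolf_nonepitope", "wolf_epitope"],
        "latest_date": "2009-10-01"
    },
    "nodes": {
        "A/Afghanistan/AF1171/2008": {
            "ep": 7,
            "ne": 6,
            "ne_star": 1,
            "ep_pairwise": {"A/Aichi/78/2007": 1, "A/Argentina/3509/2006": 2}
        }
    }
}
""")

params = node_data["params"]

# Pair each attribute name with the comparison and distance map that produced it.
attribute_info = {
    attribute: (comparison, map_name)
    for attribute, comparison, map_name in zip(
        params["attributes"], params["compare_to"], params["map_name"]
    )
}
print(attribute_info["ne_star"])

# Root and ancestor attributes are single numbers per node, while pairwise
# attributes are dictionaries keyed by the name of the other tip.
node = node_data["nodes"]["A/Afghanistan/AF1171/2008"]
print(node["ep"], node["ep_pairwise"]["A/Aichi/78/2007"])
```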

augur.distance.get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map, aggregate_function=<built-in function max>)

Calculate distance between the two given nodes using the given distance map.

In cases where the distance map between sequences is asymmetric, the first node is interpreted as the “ancestral” sequence and the second node is interpreted as the “derived” sequence.

Parameters:
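  • node_a_sequences (dict) – sequences by gene name for the first node (sample) in a tree; interpreted as the ancestral sequences when the distance map is asymmetric

  • node_b_sequences (dict) – sequences by gene name for the second node (sample) in a tree; interpreted as the derived sequences when the distance map is asymmetric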

  • distance_map (dict) – definition of site-specific and, optionally, sequence-specific distances per gene

Returns:

distance between node sequences based on the given map

Return type:

float

Examples

>>> node_a_sequences = {"gene": "ACTG"}
>>> node_b_sequences = {"gene": "ACGG"}
>>> distance_map = {"default": 0, "map": {}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
0.0
>>> distance_map = {"default": 1, "map": {}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
1.0
>>> distance_map = {"default": 0.0, "map": {"gene": {3: 1.0}}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
0.0
>>> distance_map = {"default": 0.0, "map": {"gene": {2: 3.14159, 3: 1.0}}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
3.14159
>>> distance_map = {"default": 0.0, "precision": 2, "map": {"gene": {2: 3.14159, 3: 1.0}}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
3.14
>>> distance_map = {"default": 0.0, "output_type": "integer", "map": {"gene": {2: 3.14159, 3: 1.0}}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
3
>>> distance_map = {"default": 0.0, "output_type": "int", "map": {"gene": {2: 3.14159, 3: 1.0}}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
3
>>> distance_map = {"default": 0.0, "output_type": "unsupported", "map": {"gene": {2: 3.14159, 3: 1.0}}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
Traceback (most recent call last):
    ...
ValueError: Unsupported output type of 'unsupported' provided in the distance map

For site- and sequence-specific maps, the order of the input sequences matters; the first sequence is treated as the ancestral sequence while the second is treated as the derived.

>>> distance_map = {"default": 0.0, "map": {"gene": {2: {('T', 'G'): 0.5}}}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
0.5
>>> distance_map = {"default": 0.0, "map": {"gene": {2: {('T', 'G'): 0.5}}}}
>>> get_distance_between_nodes(node_b_sequences, node_a_sequences, distance_map)
0.0

Treat a single indel as one event.

>>> node_a_sequences = {"gene": "ACTG"}
>>> node_b_sequences = {"gene": "A--G"}
>>> distance_map = {"default": 1, "map": {}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
1.0

Use the maximum weight of all sites affected by an indel with a site-specific distance map.

>>> distance_map = {"default": 0, "map": {"gene": {1: 1, 2: 2}}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
2.0

Use the maximum weight of all mutations at all sites affected by an indel with a mutation-specific distance map.

>>> distance_map = {
...     "default": 0,
...     "map": {
...         "gene": {
...             1: {
...                 ('C', 'G'): 1,
...                 ('C', 'A'): 2
...             },
...             2: {
...                 ('T', 'G'): 3,
...                 ('T', 'A'): 2
...             }
...         }
...     }
... }
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
3.0

Use the maximum weight of gaps at all sites affected by an indel with a mutation-specific distance map.

>>> distance_map = {
...     "default": 0,
...     "map": {
...         "gene": {
...             1: {
...                 ('C', '-'): 1,
...                 ('C', 'A'): 2
...             },
...             2: {
...                 ('T', 'G'): 3,
...                 ('T', '-'): 2
...             }
...         }
...     }
... }
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
2.0

If the default value is greater than any of the site-specific mismatch weights and the specific mismatch does not have a weight defined, use the default weight.

>>> distance_map = {
...     "default": 4,
...     "map": {
...         "gene": {
...             1: {
...                 ('C', 'G'): 1,
...                 ('C', 'A'): 2
...             },
...             2: {
...                 ('T', 'G'): 3,
...                 ('T', 'A'): 2
...             }
...         }
...     }
... }
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
4.0

Count mismatches adjacent to indel events.

>>> node_a_sequences = {"gene": "ACTGTA"}
>>> node_b_sequences = {"gene": "A--CCA"}
>>> distance_map = {"default": 1, "map": {}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
3.0

Ignore specific characters defined in the distance map.

>>> node_a_sequences = {"gene": "ACTGG"}
>>> node_b_sequences = {"gene": "A--GN"}
>>> distance_map = {"default": 1, "ignored_characters":["-"], "map": {}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
1.0
>>> distance_map = {"default": 1, "ignored_characters":["-", "N"], "map": {}}
>>> get_distance_between_nodes(node_a_sequences, node_b_sequences, distance_map)
0.0

augur.distance.get_distances_to_all_pairs(tree, sequences_by_node_and_gene, distance_map, earliest_date=None, latest_date=None)

Calculate distances between each sample in the given sequences and all other samples in previous seasons using the given distance map.

Parameters:
  • tree (Bio.Phylo.BaseTree.Tree) – a rooted tree whose node names match the given dictionary of sequences by node and gene

  • sequences_by_node_and_gene (dict) – nucleotide or amino acid sequences by node name and gene

  • distance_map (dict) – site-specific and, optionally, sequence-specific distances between two sequences

  • earliest_date (pandas.Timestamp) – earliest date to consider a node for comparison to a given sample; used with latest_date to define a range of previous seasons relative to the most recent samples in the given tree. Both dates are inclusive (closed) bounds on the interval of previous seasons.

  • latest_date (pandas.Timestamp) – latest date to consider a node for comparison to a given sample; used with earliest_date to define a range of previous seasons relative to the most recent samples in the given tree. Both dates are inclusive (closed) bounds on the interval of previous seasons. The latest date is also a closed lower bound on the interval of the current season.

Returns:

distances calculated between each sample in the tree and all samples from previous seasons, with distances indexed by primary sample name and then past sample name

Return type:

dict

augur.distance.get_distances_to_last_ancestor(tree, sequences_by_node_and_gene, distance_map, latest_date)

Calculate distances between each sample in the given sequences and its last ancestor in a previous season using the given distance map.

Parameters:
  • tree (Bio.Phylo.BaseTree.Tree) – a rooted tree whose node names match the given dictionary of sequences by node and gene

  • sequences_by_node_and_gene (dict) – nucleotide or amino acid sequences by node name and gene

  • distance_map (dict) – site-specific and, optionally, sequence-specific distances between two sequences

  • latest_date (pandas.Timestamp) – latest date to consider a node as a potential ancestor of a given sample; used to define a previous season relative to the most recent samples in the given tree.

Returns:

distances calculated between each sample in the tree and its last ancestor sequence with distances indexed by node name

Return type:

dict

augur.distance.get_distances_to_root(tree, sequences_by_node_and_gene, distance_map)

Calculate distances between all samples in the given sequences and the root node of the given tree using the given distance map.

Parameters:
  • tree (Bio.Phylo.BaseTree.Tree) – a rooted tree whose node names match the given dictionary of sequences by node and gene

  • sequences_by_node_and_gene (dict) – nucleotide or amino acid sequences by node name and gene

  • distance_map (dict) – site-specific and, optionally, sequence-specific distances between two sequences

Returns:

distances calculated between the root sequence and each sample in the tree and indexed by node name

Return type:

dict

augur.distance.read_distance_map(map_file)

Read a distance map JSON into a dictionary and assert that the JSON follows the correct format. Coordinates should be one-based in the JSON and are converted to zero-based coordinates on load.

Parameters:

map_file (str) – name of a JSON file containing a valid distance map

Returns:

Python representation of the distance map JSON

Return type:

dict

Examples

>>> sorted(read_distance_map("tests/data/distance_map_weight_per_site.json").items())
[('default', 0), ('map', {'HA1': {144: 1}})]
>>> sorted(read_distance_map("tests/data/distance_map_weight_per_site_and_sequence.json").items())
[('default', 0.0), ('map', {'SigPep': {0: {('W', 'P'): -8.3}}})]

augur.distance.register_parser(parent_subparsers)

augur.distance.run(args)