augur proximity

Command line reference

Find proximal sequences for focal sequences vs contextual sequences.

usage: augur proximity [-h] [--method {hamming}] --context-sequences
                       CONTEXT_SEQUENCES --focal-sequences FOCAL_SEQUENCES
                       --output-strains OUTPUT_STRAINS
                       [--output-matches OUTPUT_MATCHES]
                       [--output-sequences FASTA] [--k K]
                       [--max-distance MAX_DISTANCE] [--no-progress]
                       [--ignore-missing-data {none,all,flanking}]
                       [--nthreads NTHREADS]

Named Arguments

--method

Possible choices: hamming

Proximity approach used

Default: 'hamming'

--context-sequences

FASTA file with aligned contextual sequences

--focal-sequences

FASTA file with aligned focal sequences to find neighbors for

--output-strains

output file with one neighbor strain name per line

--output-matches

optional TSV file with columns: focal strain, context strain, distance

--output-sequences

FASTA file of all proximal strains found

--k

number of nearest neighbors to find per focal strain

Default: 5

--max-distance

maximum distance threshold for considering a sequence to match

Default: 4

--no-progress

Don’t print ongoing progress output

Default: False

--ignore-missing-data

Possible choices: none, all, flanking

All non-ATGC bases are converted to ‘N’, and then:

  • ‘none’ treats ‘N’ as a normal base for comparison purposes;

  • ‘all’ ignores positions where either sequence is N;

  • ‘flanking’ ignores runs of Ns at the start/end of each sequence.

Default: 'none'

--nthreads

Number of threads to use for parallel processing. Use ‘auto’ to use all available cores.

Default: 1

Overview

A common use case in outbreak investigation is to find the best set of related sequences from all available sequences. augur proximity is designed to do just this, by finding the nearest-neighbor sequences for each sequence in a focal set.

Proximity calculations can also be done as part of augur subsample, which is often more ergonomic for pipelines.

Note

Currently the only available method uses Hamming distance on nucleotide sequences (i.e. protein alignments are not currently supported).

Note

Sequences must be aligned before using this tool.
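To illustrate the two notes above, here is a minimal Python sketch of a Hamming-distance nearest-neighbor search on aligned nucleotide sequences. It is not augur's implementation. The function names `hamming` and `nearest` and the strain names are hypothetical, and the sketch assumes that a distance equal to the maximum still counts as a match and that ties are broken by strain name; neither detail is stated in this reference.

```python
import heapq


def hamming(a, b):
    """Count the positions at which two aligned sequences differ."""
    if len(a) != len(b):
        # Unaligned input has no well-defined Hamming distance.
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest(focal, context, k=5, max_distance=4):
    """Return up to k (strain, distance) pairs, closest first.

    Assumptions (not confirmed by the reference): the threshold is
    inclusive, and ties are broken alphabetically by strain name.
    """
    scored = ((hamming(focal, seq), name) for name, seq in context.items())
    within = [(d, name) for d, name in scored if d <= max_distance]
    return [(name, d) for d, name in heapq.nsmallest(k, within)]


# Hypothetical context set: strain name -> aligned sequence.
context = {
    "ctx_A": "ATGCATGC",
    "ctx_B": "ATGCATGA",
    "ctx_C": "TTTTTTTT",
}

matches = nearest("ATGCATGC", context, k=2, max_distance=4)
print(matches)  # [('ctx_A', 0), ('ctx_B', 1)]
```

Here ctx_C differs from the focal sequence at six positions, so it falls outside the distance threshold and is never reported, regardless of k.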

Example usage for outbreak tracking

If you have a list of outbreak strains (commonly injected into the analysis via our multiple inputs support), you can generate a set of samples as follows:

  1. Use augur filter to get the outbreak set

  2. Use augur proximity to generate a set of closely related strains, using the entire dataset as the context (background)

  3. Use augur filter to generate a (small) set of background sequences

  4. Use augur filter to merge the sets of strains produced from the above steps via --include …

The augur subsample command is purpose-built to do this in a single step, which avoids having to code this logic into Snakemake.

Performance

  • Parallelisation is extremely efficient: use --nthreads auto (or a specific count) to process multiple focal sequences in parallel.

  • Context sequences are loaded into a NumPy matrix for vectorised distance computation, so you will run out of memory if the number of context sequences is too large. Testing on ~500,000 influenza samples used only ~2GiB of memory.

SARS-CoV-2

Due to the many millions of sequences available, this tool will not be able to search against all sequences (as they won’t all fit into memory). We suggest you downsample them using temporal filters, Nextstrain clades or Pango lineages to get a smaller contextual set before using this tool.
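As a rough guide to why a full SARS-CoV-2 context set will not fit, the following sketch estimates the size of a dense context matrix. It assumes one byte per base, which is a guess about storage rather than a documented detail, and it ignores any overhead. The count of 5,000,000 sequences is a hypothetical stand-in for "many millions", and 29,903 is the length of the Wuhan-Hu-1 reference genome that SARS-CoV-2 alignments are commonly built against.

```python
def context_matrix_bytes(n_sequences, alignment_length, bytes_per_base=1):
    """Size of a dense (sequences x alignment columns) matrix in bytes.

    bytes_per_base=1 is an assumption, not a documented detail of augur.
    """
    return n_sequences * alignment_length * bytes_per_base


def to_gib(n_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


# Hypothetical context set of 5 million reference-length genomes.
full_set = to_gib(context_matrix_bytes(5_000_000, 29_903))

# The same estimate after downsampling to 100,000 context sequences.
downsampled = to_gib(context_matrix_bytes(100_000, 29_903))

print(f"full set:    {full_set:.1f} GiB")
print(f"downsampled: {downsampled:.1f} GiB")
```

Under these assumptions the matrix grows linearly with both the number of context sequences and the alignment length, so reducing the context set by a factor of 50 reduces the estimate by the same factor.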

Missing data handling

Non-ATCG characters in sequences are converted to N. The --ignore-missing-data option controls how positions with N are counted:

none (default)

N is treated as a regular base. Any position where two sequences differ (including N vs. a real base) counts toward the distance.

all

Positions where either the focal or context sequence has an N are ignored entirely (not counted as a mismatch).

flanking

Runs of Ns at the start and end of each sequence are ignored. Interior Ns are still counted normally (i.e. the same as none).
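The three modes can be compared with a small Python sketch. This is an illustration of the rules described above, not augur's code: the function names are hypothetical, input is assumed to be upper-case, and for flanking the sketch assumes a position is skipped when it lies in the leading or trailing N-run of either sequence.

```python
def normalise(seq):
    """Convert every character other than A, C, G, T to N.

    Assumes upper-case input; gap characters such as '-' also become N.
    """
    return "".join(c if c in "ACGT" else "N" for c in seq)


def flanking_mask(seq):
    """True at positions outside the leading and trailing runs of N."""
    start = len(seq) - len(seq.lstrip("N"))
    end = len(seq.rstrip("N"))
    return [start <= i < end for i in range(len(seq))]


def distance(focal, context, mode="none"):
    """Hamming distance under one of the three missing-data modes."""
    a, b = normalise(focal), normalise(context)
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if mode == "none":
        keep = [True] * len(a)  # N is compared like any other base
    elif mode == "all":
        keep = [x != "N" and y != "N" for x, y in zip(a, b)]
    elif mode == "flanking":
        # Assumption: skip a position if it is flanking in either sequence.
        keep = [p and q for p, q in zip(flanking_mask(a), flanking_mask(b))]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sum(k and x != y for k, x, y in zip(keep, a, b))


focal = "--ACGTNACG"    # two leading gaps and one interior N
context = "TTACGTAACT"  # differs from focal at the last base

for mode in ("none", "all", "flanking"):
    print(mode, distance(focal, context, mode))
# none 4
# all 1
# flanking 2
```

In this example none counts the two leading gaps, the interior N and the final mismatch; all counts only the final mismatch; flanking drops the two leading positions but still counts the interior N.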