augur proximity
Command line reference
Find proximal sequences for focal sequences vs contextual sequences.
usage: augur proximity [-h] [--method {hamming}] --context-sequences
CONTEXT_SEQUENCES --focal-sequences FOCAL_SEQUENCES
--output-strains OUTPUT_STRAINS
[--output-matches OUTPUT_MATCHES]
[--output-sequences FASTA] [--k K]
[--max-distance MAX_DISTANCE] [--no-progress]
[--ignore-missing-data {none,all,flanking}]
[--nthreads NTHREADS]
Named Arguments
- --method
Possible choices: hamming
Proximity approach used
Default:
'hamming'
- --context-sequences
FASTA file with aligned contextual sequences
- --focal-sequences
FASTA file with aligned focal sequences to find neighbors for
- --output-strains
output file with one neighbor strain name per line
- --output-matches
optional TSV file with columns: focal strain, context strain, distance
- --output-sequences
All proximal strains found
- --k
number of nearest neighbors to find per focal strain
Default:
5
- --max-distance
maximum distance threshold for considering a sequence to match
Default:
4
- --no-progress
Don’t print ongoing progress output
Default:
False
- --ignore-missing-data
Possible choices: none, all, flanking
All non-ATGC bases are converted to ‘N’. Then ‘none’ treats ‘N’ as a normal base for comparison purposes, ‘all’ ignores positions where either sequence is N, and ‘flanking’ ignores runs of Ns at the start/end of each sequence.
Default:
'none'
- --nthreads
Number of threads to use for parallel processing. Use ‘auto’ to use all available cores.
Default:
1
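The flags above can be combined into a single invocation. The following is a minimal sketch that assembles such a command from Python, using only flags listed in this reference. The helper name `build_proximity_command` and the file names are hypothetical, chosen for illustration.

```python
import shlex


def build_proximity_command(context, focal, output_strains, k=5, max_distance=4,
                            ignore_missing_data="none", nthreads=1,
                            output_matches=None):
    """Assemble an `augur proximity` argument list (hypothetical helper)."""
    cmd = [
        "augur", "proximity",
        "--context-sequences", context,
        "--focal-sequences", focal,
        "--output-strains", output_strains,
        "--k", str(k),
        "--max-distance", str(max_distance),
        "--ignore-missing-data", ignore_missing_data,
        "--nthreads", str(nthreads),
    ]
    # --output-matches is optional, so only add it when requested.
    if output_matches is not None:
        cmd += ["--output-matches", output_matches]
    return cmd


# File names below are placeholders, not files shipped with augur.
cmd = build_proximity_command(
    "context.fasta", "focal.fasta", "neighbors.txt", k=10, nthreads="auto"
)
print(shlex.join(cmd))
```

To actually run the command (this requires augur to be installed), the list can be passed to `subprocess.run(cmd, check=True)`.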
Overview
A common use case in outbreak investigation is to find the best set of related sequences
from all available sequences. augur proximity is designed to do just this, by finding
the nearest neighbor sequences for each sequence in a focal set.
Proximity calculations can also be done as part of augur subsample, which is often more ergonomic for pipelines.
Note
Currently the only available method uses Hamming distance on nucleotide sequences (i.e. protein alignments are not currently supported).
Note
Sequences must be aligned before using this tool.
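To make the idea concrete, here is a minimal pure-Python sketch of a nearest-neighbor search using Hamming distance. It is an illustration of the concept, not augur's implementation. The function names and toy sequences are hypothetical, and treating the distance threshold as inclusive is an assumption.

```python
import heapq


def hamming(a, b):
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbors(focal, context, k=5, max_distance=4):
    """Return, per focal strain, up to k (distance, context strain) pairs.

    Assumption: a context sequence matches when distance <= max_distance.
    """
    matches = {}
    for focal_name, focal_seq in focal.items():
        candidates = []
        for name, seq in context.items():
            d = hamming(focal_seq, seq)
            if d <= max_distance:
                candidates.append((d, name))
        # Keep the k closest candidates, ordered by distance, then by name.
        matches[focal_name] = heapq.nsmallest(k, candidates)
    return matches


# Toy data: real inputs would be aligned FASTA records.
focal = {"outbreak_1": "ACGTACGT"}
context = {
    "ctx_a": "ACGTACGT",  # identical to the focal sequence
    "ctx_b": "ACGTACGA",  # one mismatch
    "ctx_c": "TTTTTTTT",  # six mismatches, beyond max_distance=4
}
result = nearest_neighbors(focal, context, k=2, max_distance=4)
print(result)
```

In this toy example, `ctx_c` is excluded by the distance threshold, and the remaining two context strains are returned in order of increasing distance.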
Example usage for outbreak tracking
If you have a list of outbreak strains, commonly injected into the analysis via our multiple inputs support, you can generate a set of samples via:
- Use augur filter to get the outbreak set
- Use augur proximity to generate a set of closely related strains, using the entire dataset as the context (background)
- Use augur filter to generate a (small) set of background sequences
- Use augur filter to merge the sets of strains produced from the above steps via --include …
The augur subsample command is purpose-built to do this in a single step, which avoids having to code this logic into Snakemake.
Performance
- Parallelisation is extremely efficient; use --nthreads auto (or a specific count) to process multiple focal sequences in parallel.
- Context sequences are loaded into a NumPy matrix for vectorised distance computation, so you will run out of memory if the number of context sequences is too large. Testing on ~500,000 influenza samples used only ~2GiB of memory.
SARS-CoV-2
Due to the many millions of sequences available, this tool will not be able to search against all sequences (as they won’t all fit into memory). We suggest you downsample those using temporal filters, nextstrain clades or pango lineages to get a smaller contextual set before using this tool.
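A back-of-the-envelope estimate shows why the full dataset does not fit. The sketch below assumes one byte per alignment position (an assumption about the in-memory layout, not a documented detail) and a hypothetical count of 10 million sequences. The genome length of 29,903 nt is that of the Wuhan-Hu-1 reference.

```python
GIB = 1024 ** 3


def estimate_context_bytes(n_sequences, alignment_length, bytes_per_base=1):
    """Rough size of a dense sequence matrix (assumes 1 byte per position)."""
    return n_sequences * alignment_length * bytes_per_base


# Hypothetical: 10 million SARS-CoV-2 genomes aligned to the 29,903 nt reference.
full_dataset = estimate_context_bytes(10_000_000, 29_903)

# Hypothetical: a contextual set downsampled to 200,000 genomes.
downsampled = estimate_context_bytes(200_000, 29_903)

print(f"full dataset: {full_dataset / GIB:.1f} GiB")
print(f"downsampled:  {downsampled / GIB:.1f} GiB")
```

Under these assumptions the full dataset needs hundreds of GiB for the matrix alone, while the downsampled set needs only a few GiB.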
Missing data handling
Non-ATGC characters in sequences are converted to N. The --ignore-missing-data
option controls how positions with N are counted:
- none (default): N is treated as a regular base. Any position where two sequences differ (including N vs. a real base) counts toward the distance.
- all: Positions where either the focal or context sequence has an N are ignored entirely (not counted as a mismatch).
- flanking: Runs of Ns at the start and end of each sequence are ignored. Interior Ns are still counted normally (i.e. the same as none).
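The three modes can be illustrated with a small pure-Python sketch. This is an interpretation of the descriptions above, not augur's code. The function names are hypothetical, and for flanking it assumes that a position is skipped when it falls in the leading or trailing run of Ns of either sequence.

```python
def normalise(seq):
    """Convert every non-ATGC character to N (case-insensitive)."""
    return "".join(c if c in "ATGC" else "N" for c in seq.upper())


def flank_bounds(seq):
    """Return (start, end) of the region left after trimming flanking Ns."""
    start = len(seq) - len(seq.lstrip("N"))
    end = len(seq.rstrip("N"))
    return start, end


def distance(a, b, ignore_missing_data="none"):
    """Hamming distance under the three documented missing-data modes."""
    a, b = normalise(a), normalise(b)
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if ignore_missing_data == "none":
        positions = range(len(a))  # N is compared like any other base
    elif ignore_missing_data == "all":
        positions = [i for i in range(len(a)) if a[i] != "N" and b[i] != "N"]
    elif ignore_missing_data == "flanking":
        # Assumption: skip positions inside either sequence's flanking N runs.
        start_a, end_a = flank_bounds(a)
        start_b, end_b = flank_bounds(b)
        positions = range(max(start_a, start_b), min(end_a, end_b))
    else:
        raise ValueError(f"unknown mode: {ignore_missing_data}")
    return sum(a[i] != b[i] for i in positions)


focal = "NNACGNACGT"    # two leading Ns and one interior N
context = "ACACGTACGN"  # one trailing N
for mode in ("none", "all", "flanking"):
    print(mode, distance(focal, context, mode))
# none 4
# all 0
# flanking 1
```

The interior N at position 6 of the focal sequence is what separates flanking from all: it still counts as a mismatch under flanking but is skipped under all.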