augur.proximity module

Find proximal sequences for focal sequences vs contextual sequences.

augur.proximity.distance_fn(ignore_missing_data, focal_seq, context_matrix, focal_valid_range=None, context_valid_mask=None)

Creates a function that, applied to a batch of context sequences, computes the distance from this focal_seq to each sequence in the batch.

Returned function: (batch, batch_start, batch_end) -> distances

How “N”s are counted depends on the ignore_missing_data argument.

Return type:: Callable[[ndarray[tuple[Any, ...], dtype[int8]], int, int], ndarray[tuple[Any, ...], dtype[int8]]]
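
The closure pattern described above can be illustrated with a minimal sketch. This is not augur's implementation: the name make_distance_fn_sketch is hypothetical, the context_matrix parameter is omitted, the orientation of context_valid_mask (True marks a position that counts) is an assumption, and batch_start/batch_end are assumed to be the row offsets of the batch within the full context matrix. The three modes follow the 'none', 'all' and 'flanking' descriptions of ignore_missing_data.

```python
import numpy as np

N_VALUE = ord("n")  # 110, the N sentinel described for to_numpy_array


def make_distance_fn_sketch(mode, focal_seq, focal_valid_range=None,
                            context_valid_mask=None):
    """Hypothetical stand-in for distance_fn; returns a per-batch callable."""

    def distances(batch, batch_start, batch_end):
        # Compare the focal row against every row of the batch (broadcasting).
        mismatch = batch != focal_seq
        if mode == "all":
            # Ignore positions where either sequence is N.
            mismatch &= (batch != N_VALUE) & (focal_seq != N_VALUE)
        elif mode == "flanking":
            # Ignore leading/trailing N runs of the focal and context sequences.
            start, end = focal_valid_range
            focal_valid = np.zeros(focal_seq.shape, dtype=bool)
            focal_valid[start:end] = True
            # Assumption: the mask rows for this batch are batch_start:batch_end.
            mismatch &= focal_valid & context_valid_mask[batch_start:batch_end]
        # mode == "none": N is treated as an ordinary base, nothing is masked.
        return mismatch.sum(axis=1)

    return distances
```

Passing batch_start and batch_end lets the returned function slice the precomputed mask for the rows in the batch without copying the whole mask per call.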

augur.proximity.flanking_masks(focal, context_matrix)

Compute the ranges of flanking Ns for each focal sequence, and the same for the contextual sequences in the form of a 2D mask matrix the same size as context_matrix.

Return type:: tuple[dict[str, tuple[int, int]], ndarray[tuple[Any, ...], dtype[bool]]]
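
One way to build such a mask without looping over positions is a cumulative logical OR from each end of the row. The sketch below is an assumption-laden illustration, not augur's code: the function name is hypothetical, and the mask orientation (True for positions outside the flanking N runs) is a guess, since the description above does not state it.

```python
import numpy as np

N_VALUE = ord("n")  # 110, the N sentinel described for to_numpy_array


def context_valid_mask_sketch(context_matrix):
    """Return a bool matrix that is False only inside flanking N runs."""
    non_n = context_matrix != N_VALUE
    # True once a non-N base has been seen scanning left to right ...
    seen_from_left = np.logical_or.accumulate(non_n, axis=1)
    # ... and once one has been seen scanning right to left.
    seen_from_right = np.logical_or.accumulate(non_n[:, ::-1], axis=1)[:, ::-1]
    # Internal Ns sit between two non-N bases, so they remain True.
    return seen_from_left & seen_from_right
```

A row consisting only of Ns comes out entirely False under this scheme.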

augur.proximity.get_valid_range(seq)

Return (start, end) indices of the non-flanking-N region of seq. start is the (0-based) index of the first non-N character; end is the (0-based) index of the first N character of the trailing run.

Return type:: tuple[int, int]
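
As an illustration of these index conventions, here is a minimal sketch that assumes seq is a plain string (the real function may take a numpy array) and uses a hypothetical name. When there is no trailing run, the sketch returns len(seq) as end, and for an all-N input it returns (len(seq), len(seq)); both are choices made for the sketch, not documented behavior.

```python
def valid_range_sketch(seq: str) -> tuple[int, int]:
    """Return (start, end) so that seq[start:end] has no flanking Ns."""
    s = seq.lower()
    start = 0
    # Skip the leading run of Ns.
    while start < len(s) and s[start] == "n":
        start += 1
    end = len(s)
    # Walk back over the trailing run of Ns, never crossing start.
    while end > start and s[end - 1] == "n":
        end -= 1
    return start, end
```

Because end is exclusive, seq[start:end] is exactly the region kept in the 'flanking' mode, and Ns inside that region are left alone.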

augur.proximity.load_context(fname, skip_strains, seq_len)

Load context sequences directly into a 2D numpy matrix, avoiding intermediate structures that create memory bottlenecks. Because the number of sequences is not known ahead of time, we start with a guess and expand the matrix as needed.

Returns (context_matrix, context_names, skip_count).

Return type:: tuple[ndarray[tuple[Any, ...], dtype[int8]], list[str], int]
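
The guess-then-expand strategy can be sketched as follows. Everything here is illustrative: the function name, the records iterable of (name, sequence) pairs standing in for FASTA parsing, the initial_rows parameter, and the doubling growth policy are all assumptions, as is the reading of skip_count as the number of records dropped because their name is in skip_strains.

```python
import numpy as np

N_VALUE = ord("n")  # 110, the N sentinel described for to_numpy_array
ACGT = np.frombuffer(b"acgt", dtype=np.int8)


def load_matrix_sketch(records, skip_strains, seq_len, initial_rows=2):
    """Fill a preallocated int8 matrix row by row, growing it when full."""
    matrix = np.empty((initial_rows, seq_len), dtype=np.int8)
    names = []
    skip_count = 0
    for name, seq in records:
        if name in skip_strains:
            skip_count += 1
            continue
        if len(names) == matrix.shape[0]:
            # Out of rows: double the capacity (assumed growth policy).
            matrix = np.concatenate([matrix, np.empty_like(matrix)], axis=0)
        row = np.frombuffer(seq.lower().encode("ascii"), dtype=np.int8).copy()
        row[~np.isin(row, ACGT)] = N_VALUE
        # Raises ValueError if the sequence is not exactly seq_len long.
        matrix[len(names)] = row
        names.append(name)
    # Trim the unused capacity before returning.
    return matrix[:len(names)], names, skip_count
```

Doubling keeps the number of reallocations logarithmic in the number of sequences, at the cost of temporarily holding up to twice the final matrix in memory.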

augur.proximity.load_focal_sequences(fname)

Load all focal sequences into memory as numpy arrays.

Return type:: tuple[dict[str, ndarray[tuple[Any, ...], dtype[int8]]], int]

augur.proximity.proximity_argument_descriptions: dict[str, str] = {
    'context_sequences': 'FASTA file with aligned contextual sequences',
    'focal_sequences': 'FASTA file with aligned focal sequences to find neighbors for',
    'ignore_missing_data': "All non-ATGC bases are converted to 'N', and then:\n- 'none' treats 'N' as a normal base for comparison purposes;\n- 'all' ignores positions where either sequence is N;\n- 'flanking' ignores runs of Ns at the start/end of each sequence.",
    'k': 'number of nearest neighbors to find per focal strain',
    'max_distance': 'maximum distance threshold for considering a sequence to match',
    'method': 'Proximity approach used',
    'no_progress': "Don't print ongoing progress output",
    'nthreads': "Number of threads to use for parallel processing. Use 'auto' to use all available cores.",
    'output_matches': 'optional TSV file with columns: focal strain, context strain, distance',
    'output_sequences': 'All proximal strains found',
    'output_strains': 'output file with one neighbor strain name per line',
}

augur proximity argument descriptions, stored as a dict so we can re-use in augur subsample related code

augur.proximity.register_arguments(parser)

Return type:: None

augur.proximity.register_parser(parent_subparsers)

Return type:: ArgumentParser

augur.proximity.run(args)

Return type:: None

augur.proximity.select_top_k(all_distances, context_names, context_matrix, k, max_distance)

Select top-k sequences (within the max_distance threshold) from a vector of distance scores. Results are sorted by distance counts, then alphabetically for those with the same distance.

To resolve ties at the cutoff (e.g. for k=3, max_distance=5, distances=[1,1,2,2,2,…], which of the distance=2 strains do we take?) we choose the strains with the fewest “N”s, with remaining ties resolved alphabetically.

Return type:: list[dict[str, str | int]]
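
One possible reading of the selection and ordering rules above is sketched below. It is not augur's implementation: the function name and the "strain" and "distance" dict keys are hypothetical, and treating max_distance as inclusive is an assumption. Selection uses (distance, N count, name) to decide which strains make the cut, then the chosen strains are re-sorted by (distance, name) for output.

```python
import numpy as np

N_VALUE = ord("n")  # 110, the N sentinel described for to_numpy_array


def select_top_k_sketch(all_distances, context_names, context_matrix, k,
                        max_distance):
    """Pick up to k strains within max_distance, preferring fewer Ns on ties."""
    n_counts = (context_matrix == N_VALUE).sum(axis=1)
    candidates = [
        (int(d), int(n), name)
        for d, n, name in zip(all_distances, n_counts, context_names)
        if d <= max_distance  # assumption: the threshold is inclusive
    ]
    # Tuples sort by distance, then N count, then name.
    candidates.sort()
    chosen = candidates[:k]
    # Output order: distance first, then alphabetical within a distance.
    chosen.sort(key=lambda c: (c[0], c[2]))
    return [{"strain": name, "distance": d} for d, _, name in chosen]
```

With the example from the text, the two distance=1 strains are always kept, and the remaining slot goes to whichever distance=2 strain has the fewest Ns.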

augur.proximity.to_numpy_array(seq)

Convert a nucleotide sequence string to a numpy integer array.

Uses raw byte values of the lowercase string (a=97, c=99, g=103, t=116). Non-ATCG characters are replaced with the N sentinel value (n=110).

Return type:: ndarray[tuple[Any, ...], dtype[int8]]
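
The byte-value encoding described above can be reproduced with a short sketch. The function name is hypothetical, and encoding non-ASCII input with errors="replace" is an assumption made so that such characters also end up as the N sentinel.

```python
import numpy as np

N_VALUE = ord("n")  # 110
ACGT = np.frombuffer(b"acgt", dtype=np.int8)  # 97, 99, 103, 116


def to_numpy_array_sketch(seq: str) -> np.ndarray:
    """Encode a nucleotide string as int8 byte values, mapping non-ACGT to N."""
    raw = seq.lower().encode("ascii", errors="replace")
    # frombuffer gives a read-only view, so copy before modifying.
    arr = np.frombuffer(raw, dtype=np.int8).copy()
    arr[~np.isin(arr, ACGT)] = N_VALUE
    return arr
```

All ASCII values fit in int8, so sequences can be compared element-wise with one byte per base.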