augur.proximity moduleο
Find proximal sequences for focal sequences vs contextual sequences.
- augur.proximity.distance_fn(ignore_missing_data, focal_seq, context_matrix, focal_valid_range=None, context_valid_mask=None)ο
Creates a function to be applied to a batch of context sequences which computes distances of this focal_seq to the batch of context sequences.
Returned function: (batch, batch_start, batch_end) -> distances
How we count βNβs depends on the ignore_missing_data
- augur.proximity.flanking_masks(focal, context_matrix)ο
Compute ranges of flanking Ns for each focal sequence, and the same for contextual sequences in the form of a 2d mask matrix the same size as context_matrix
- augur.proximity.get_valid_range(seq)ο
Return (start, end) indices of the non-flanking-N region of seq. start is the (0-based) index of the first non-N character end is the (0-based) index of the first N character of the trailing run
- augur.proximity.load_context(fname, skip_strains, seq_len)ο
Load context sequences directly into a 2D numpy matrix (avoids the need for intermediate structures which create memory bottlenecks). This means we need to know the number of sequences ahead of time, so we start with a guess and then expand the matrix as needed.
Returns (context_matrix, context_names, skip_count).
- augur.proximity.load_focal_sequences(fname)ο
Load all focal sequences into memory as numpy arrays
- augur.proximity.proximity_argument_descriptions: dict[str, str] = {'context_sequences': 'FASTA file with aligned contextual sequences', 'focal_sequences': 'FASTA file with aligned focal sequences to find neighbors for', 'ignore_missing_data': "All non-ATGC bases are converted to 'N', and then:\n- 'none' treats 'N' as a normal base for comparison purposes;\n- 'all' ignores positions where either sequence is N;\n- 'flanking' ignores runs of Ns at the start/end of each sequence.", 'k': 'number of nearest neighbors to find per focal strain', 'max_distance': 'maximum distance threshold for considering a sequence to match', 'method': 'Proximity approach used', 'no_progress': "Don't print ongoing progress output", 'nthreads': "Number of threads to use for parallel processing. Use 'auto' to use all available cores.", 'output_matches': 'optional TSV file with columns: focal strain, context strain, distance', 'output_sequences': 'All proximal strains found', 'output_strains': 'output file with one neighbor strain name per line'}ο
augur proximity argument descriptions, stored as a dict so we can re-use in augur subsample related code
- augur.proximity.register_parser(parent_subparsers)ο
- Return type:
- augur.proximity.select_top_k(all_distances, context_names, context_matrix, k, max_distance)ο
Select top-k sequences (within the max_distance threshold) from a vector of distance scores. Results are sorted by distance counts, then alphabetically for those with the same distance.
To resolve tiebreaks (e.g. for k=3, max_distance=5, distances=[1,1,2,2,2,β¦], which of the distance=2 strains do we take?) we choose the strains with the fewest βNβs, with remaining ties resolved alpabetically.