augur.index module

Count occurrences of bases or residues in a set of sequences.

augur.index.get_characters_by_sequence_type(sequence_type)

Return character sets and labels for the given sequence type.

Parameters:

sequence_type (str) -- 'nuc' for nucleotide or 'aa' for amino acid.

Returns:

(values, labels) where values is a list of character sets and labels is a list of corresponding column names.

Return type:

tuple
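
As a rough illustration of the return shape, the sketch below builds a (values, labels) pair from a label-to-characters mapping. EXAMPLE_CHARACTERS and split_values_and_labels are hypothetical names with a shortened character list; the actual sets for 'nuc' and 'aa' are defined by augur and may differ.

```python
# Hypothetical, shortened stand-in for a label-to-characters mapping.
EXAMPLE_CHARACTERS = {"A": "a", "C": "c", "G": "g", "T": "t", "N": "n", "-": "-"}


def split_values_and_labels(characters):
    """Return (values, labels): one character set and one column name per entry."""
    values = [set(chars) for chars in characters.values()]
    labels = list(characters.keys())
    return values, labels


values, labels = split_values_and_labels(EXAMPLE_CHARACTERS)
```

Keeping the two lists in the same order means the i-th count in an index row can be labeled with the i-th column name.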

augur.index.index_sequence(sequence, values)

Count the number of characters for a given sequence record.

Parameters:
  • sequence (Bio.SeqRecord.SeqRecord) -- sequence record to index.

  • values (list of set of str) -- values to count; sets must be non-overlapping and contain only single-character, lowercase strings.

Returns:

summary of the given sequence: its strain name, its length, one character count for each set in the given values, and a final column with the number of characters that didn’t match any of those sets.

Return type:

list

Examples

>>> import Bio
>>> nuc_values = [set(v) for v in NUCLEOTIDE_CHARACTERS.values()]
>>> sequence_a = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("ACTGN-?XWN"), id="seq_A")
>>> index_sequence(sequence_a, nuc_values)
['seq_A', 10, 1, 1, 1, 1, 2, 1, 1, 1, 1]
>>> sequence_b = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("ACTGACTG"), id="seq_B")
>>> index_sequence(sequence_b, nuc_values)
['seq_B', 8, 2, 2, 2, 2, 0, 0, 0, 0, 0]

Characters in the given sequence that are not in the given list of values to count get counted in the final column.

>>> sequence_c = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("ACTG%@!!!NN"), id="seq_C")
>>> index_sequence(sequence_c, nuc_values)
['seq_C', 11, 1, 1, 1, 1, 2, 0, 0, 0, 5]

Amino acid sequences work the same way with amino acid character sets.

>>> aa_values = [set(v) for v in AMINO_ACID_CHARACTERS.values()]
>>> sequence_e = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("MFLEKVG"), id="seq_E")
>>> index_sequence(sequence_e, aa_values)
['seq_E', 7, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

Invalid amino acid characters are counted in the final column.

>>> sequence_f = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("MFLEK-X*?!"), id="seq_F")
>>> index_sequence(sequence_f, aa_values)
['seq_F', 10, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2]

The value sets in the list must not overlap with one another.

>>> sequence_d = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("A!C!TGXN"), id="seq_D")
>>> index_sequence(sequence_d, [set('actg'), set('xn'), set('n')])
Traceback (most recent call last):
  ...
ValueError: character sets ... and {'n'} overlap: {'n'}

Value sets must contain only single-character, lowercase strings.

>>> index_sequence(sequence_d, [{'a'}, {'c'}, {'T'}, {'g'}])
Traceback (most recent call last):
  ...
ValueError: character set {'T'} contains a non-lowercase character: 'T'
>>> index_sequence(sequence_d, [{'actg'}])
Traceback (most recent call last):
  ...
ValueError: character set {'actg'} contains a multi-character (or maybe zero-length) string: 'actg'
>>> index_sequence(sequence_d, [{'a', 'c'}, {0, 1}])
Traceback (most recent call last):
  ...
ValueError: character set {0, 1} contains a non-string element: 0
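
The counting behavior shown in the examples above can be sketched without Biopython. This is a minimal illustration, not augur's implementation: count_characters is a hypothetical helper that takes a plain string instead of a SeqRecord, it assumes the sequence is lowercased before matching (consistent with the lowercase requirement on value sets), and the nucleotide column order is an assumption inferred from the example output.

```python
from collections import Counter


def count_characters(name, sequence, values):
    """Hypothetical stand-in for index_sequence that works on a plain string."""
    # Reject overlapping sets so each character lands in exactly one column.
    seen = set()
    for value in values:
        overlap = seen & value
        if overlap:
            raise ValueError(f"character sets overlap: {sorted(overlap)}")
        seen |= value

    # Assumption: matching is case-insensitive, done by lowercasing the input.
    counts = Counter(sequence.lower())

    row = [name, len(sequence)]
    matched = 0
    for value in values:
        n = sum(counts[char] for char in value)
        row.append(n)
        matched += n

    # Final column: characters that matched none of the given sets.
    row.append(len(sequence) - matched)
    return row


# Assumed column order: A, C, G, T, N, other IUPAC codes, gap, ambiguous.
nuc_values = [set("a"), set("c"), set("g"), set("t"), set("n"),
              set("rykmswbdhv"), set("-"), set("?")]
row_a = count_characters("seq_A", "ACTGN-?XWN", nuc_values)
row_c = count_characters("seq_C", "ACTG%@!!!NN", nuc_values)
```

Because the sets are disjoint, the per-set counts plus the final column always sum to the sequence length, which makes a convenient sanity check on any index row.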

augur.index.index_sequences(sequences_path, sequence_index_path, sequence_type)

Count the number of each valid character and invalid characters in a set of sequences and write the composition as a data frame to the given sequence index path.

The sequence type (nucleotide or amino acid) is auto-detected from the first sequence. Nucleotide sequences count A, C, G, T, N, other IUPAC characters, gaps, and ambiguous characters. Amino acid sequences count the 20 standard amino acids, stop codons, X, and gaps.

Parameters:
  • sequences_path (str or os.PathLike) -- path to a sequence file to index.

  • sequence_index_path (str or os.PathLike) -- path to a tab-delimited file containing the composition details for each sequence in the given input file.

Returns:

  • int -- number of sequences indexed

  • int -- total length of sequences indexed
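
The overall flow (one tab-delimited row per sequence, plus the two returned totals) can be sketched as follows. This is an illustration under stated assumptions rather than augur's code: write_index is a hypothetical helper, it takes (name, sequence) pairs instead of parsing a sequence file, and the header column names are illustrative only.

```python
import csv
import io


def write_index(records, handle, values, labels):
    """Hypothetical sketch: write one composition row per (name, sequence) pair."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Illustrative header: name, length, one column per label, then invalid count.
    writer.writerow(["strain", "length", *labels, "invalid"])

    num_sequences = 0
    total_length = 0
    for name, sequence in records:
        lowered = sequence.lower()
        counts = [sum(lowered.count(char) for char in value) for value in values]
        invalid = len(sequence) - sum(counts)
        writer.writerow([name, len(sequence), *counts, invalid])
        num_sequences += 1
        total_length += len(sequence)

    # Mirrors the documented return values: sequences indexed and total length.
    return num_sequences, total_length


values = [set("a"), set("c"), set("g"), set("t"), set("n")]
labels = ["A", "C", "G", "T", "N"]
buffer = io.StringIO()
result = write_index([("seq_A", "ACTGN"), ("seq_B", "ACGTACGT!")], buffer, values, labels)
index_text = buffer.getvalue()
```

Writing to a file-like handle keeps the sketch easy to test in memory; a real caller would open the index path for writing instead.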

augur.index.index_vcf(vcf_path, index_path)

Create an index listing the strain names in a given VCF. No composition statistics are calculated for VCFs.

Parameters:
  • vcf_path (str or os.PathLike) -- path to a VCF file to index.

  • index_path (str or os.PathLike) -- path to a tab-delimited file listing the strain names found in the given VCF.

Returns:

number of strains indexed

Return type:

int
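
Strain names in a VCF come from the sample columns of the '#CHROM' header line. The sketch below shows one way to read them from uncompressed VCF text; read_vcf_sample_names is a hypothetical helper, not part of augur, and it does not handle compressed input.

```python
import io


def read_vcf_sample_names(handle):
    """Hypothetical helper: return sample names from a VCF header line."""
    for line in handle:
        # The column header is the line starting with '#CHROM';
        # '##' lines are meta-information and carry no sample names.
        if line.startswith("#CHROM"):
            fields = line.rstrip("\n").split("\t")
            # The first nine columns are fixed (CHROM through FORMAT);
            # every column after that is a sample name.
            return fields[9:]
    raise ValueError("no '#CHROM' header line found")


vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tstrain_1\tstrain_2\n"
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0\t1\n"
)
strains = read_vcf_sample_names(io.StringIO(vcf_text))
```

The number of strains indexed would then be len(strains), matching the documented integer return value.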

augur.index.register_parser(parent_subparsers)

augur.index.run(args)

Index a set of sequences, counting character composition per sequence. Supports both nucleotide and amino acid sequences (auto-detected).