augur.index module

Count occurrences of bases in a set of sequences.

augur.index.index_sequence(sequence, values)

Count the number of nucleotides for a given sequence record.

Parameters
  • sequence (Bio.SeqRecord.SeqRecord) – sequence record to index.

  • values (list of sets of str) – values to count; sets must be non-overlapping and contain only single-character, lowercase strings

Returns

summary of the given sequence’s strain name, length, nucleotide counts for the given values, and a final column with the number of characters that didn’t match any of those in the given values.

Return type

list

>>> other_IUPAC = {'r', 'y', 's', 'w', 'k', 'm', 'd', 'h', 'b', 'v'}
>>> values = [{'a'},{'c'},{'g'},{'t'},{'n'}, other_IUPAC, {'-'}, {'?'}]
>>> sequence_a = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("ACTGN-?XWN"), id="seq_A")
>>> index_sequence(sequence_a, values)
['seq_A', 10, 1, 1, 1, 1, 2, 1, 1, 1, 1]
>>> sequence_b = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("ACTGACTG"), id="seq_B")
>>> index_sequence(sequence_b, values)
['seq_B', 8, 2, 2, 2, 2, 0, 0, 0, 0, 0]

Characters in the given sequence that are not in the given list of values to count get counted in the final column.

>>> sequence_c = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("ACTG%@!!!NN"), id="seq_C")
>>> index_sequence(sequence_c, values)
['seq_C', 11, 1, 1, 1, 1, 2, 0, 0, 0, 5]
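
The counting rule shown in the examples above can be sketched without Biopython. This is a minimal illustration rather than the module's implementation: count_values is a hypothetical helper that takes a plain name and string instead of a SeqRecord, and it assumes characters are lowercased before they are matched against the value sets.

```python
def count_values(name, sequence, values):
    """Hypothetical stand-in for index_sequence that works on plain strings."""
    # Assumption: matching is case-insensitive, done by lowercasing the sequence.
    sequence = sequence.lower()
    # Map each character to the position of the set that contains it.
    column_of = {char: i for i, chars in enumerate(values) for char in chars}
    counts = [0] * len(values)
    other = 0
    for char in sequence:
        if char in column_of:
            counts[column_of[char]] += 1
        else:
            # Anything not listed in any value set lands in the final column.
            other += 1
    return [name, len(sequence), *counts, other]


other_iupac = {'r', 'y', 's', 'w', 'k', 'm', 'd', 'h', 'b', 'v'}
values = [{'a'}, {'c'}, {'g'}, {'t'}, {'n'}, other_iupac, {'-'}, {'?'}]
row = count_values("seq_A", "ACTGN-?XWN", values)
print(row)  # ['seq_A', 10, 1, 1, 1, 1, 2, 1, 1, 1, 1]
```

Because the value sets do not overlap, each character maps to exactly one column, so the counts plus the final column always sum to the sequence length.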

The value sets in the list must not overlap one another.

>>> sequence_d = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("A!C!TGXN"), id="seq_D")
>>> index_sequence(sequence_d, [set('actg'), set('xn'), set('n')]) 
Traceback (most recent call last):
  ...
ValueError: character sets ... and {'n'} overlap: {'n'}

Value sets must contain only single-character, lowercase strings.

>>> index_sequence(sequence_d, [{'a'}, {'c'}, {'T'}, {'g'}])
Traceback (most recent call last):
  ...
ValueError: character set {'T'} contains a non-lowercase character: 'T'
>>> index_sequence(sequence_d, [{'actg'}])
Traceback (most recent call last):
  ...
ValueError: character set {'actg'} contains a multi-character (or maybe zero-length) string: 'actg'
>>> index_sequence(sequence_d, [{'a', 'c'}, {0, 1}])
Traceback (most recent call last):
  ...
ValueError: character set {0, 1} contains a non-string element: 0
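
The constraints demonstrated above (string elements, single characters, lowercase only, no overlap between sets) could be checked along these lines. This is a sketch under stated assumptions: validate_values is a hypothetical name, the order of the checks is a guess, and the error messages are simplified and do not match the ones shown in the doctests.

```python
from itertools import combinations


def validate_values(values):
    """Hypothetical validator: raise ValueError if the value sets break the rules."""
    for chars in values:
        for char in chars:
            if not isinstance(char, str):
                raise ValueError(f"non-string element: {char!r}")
            if len(char) != 1:
                raise ValueError(f"not a single character: {char!r}")
            if char != char.lower():
                raise ValueError(f"non-lowercase character: {char!r}")
    # Compare every pair of sets once and report the first shared characters.
    for left, right in combinations(values, 2):
        shared = left & right
        if shared:
            raise ValueError(f"sets overlap: {sorted(shared)}")


# A valid list of value sets passes silently.
validate_values([{'a'}, {'c'}, {'g'}, {'t'}, {'-'}, {'?'}])
```

Checking element types first keeps the later string operations from failing with an unrelated error on non-string input such as {0, 1}.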

augur.index.index_sequences(sequences_path, sequence_index_path)

Count the number of A, C, T, G, N, other IUPAC nucleotides, ambiguous bases ("?" and "-"), and other invalid characters in a set of sequences and write the composition as a data frame to the given sequence index path.

Parameters
  • sequences_path (str or Path-like) – path to a sequence file to index.

  • sequence_index_path (str or Path-like) – path to a tab-delimited file containing the composition details for each sequence in the given input file.

Returns

  • int – number of sequences indexed

  • int – total length of sequences indexed
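
To illustrate the shape of the output and the two return values, here is a rough sketch that writes a tab-delimited index to an in-memory buffer. It is not the module's code: write_index is a hypothetical helper, the input is a list of (name, sequence) pairs rather than a sequence file, and the column names in HEADER are assumptions, since the header of the index file is not listed here.

```python
import csv
import io

# Assumed column names; the actual header of the index file may differ.
HEADER = ["strain", "length", "A", "C", "G", "T", "N",
          "other_IUPAC", "-", "?", "invalid_nucleotides"]
VALUES = [{'a'}, {'c'}, {'g'}, {'t'}, {'n'}, set("ryswkmdhbv"), {'-'}, {'?'}]


def write_index(records, handle):
    """Hypothetical sketch: write one tab-delimited row per (name, sequence) pair."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    num_sequences = 0
    total_length = 0
    for name, sequence in records:
        lowered = sequence.lower()
        # One count per value set, in the same order as the header columns.
        counts = [sum(lowered.count(char) for char in chars) for chars in VALUES]
        invalid = len(lowered) - sum(counts)
        writer.writerow([name, len(lowered), *counts, invalid])
        num_sequences += 1
        total_length += len(lowered)
    return num_sequences, total_length


buffer = io.StringIO()
result = write_index([("seq_A", "ACTGN-?XWN"), ("seq_B", "ACTGACTG")], buffer)
print(result)  # (2, 18)
print(buffer.getvalue())
```

The returned pair mirrors the documented return values: the number of sequences indexed and their total length.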

augur.index.index_vcf(vcf_path, index_path)

Create an index with a list of strain names from a given VCF. We do not calculate any statistics for VCFs.

Parameters
  • vcf_path (str or Path-like) – path to a VCF file to index.

  • index_path (str or Path-like) – path to a tab-delimited file to write, listing the strain names found in the given VCF file.

Returns

number of strains indexed

Return type

int
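
As a rough illustration of where strain names come from in a VCF, the sketch below reads them from the #CHROM header line, where sample names follow the nine fixed columns (CHROM through FORMAT). The helper vcf_strain_names is hypothetical, the VCF content is made up for the example, and the single "strain" header in the index text is an assumption about the output layout.

```python
def vcf_strain_names(lines):
    """Hypothetical helper: return sample names from the #CHROM header line."""
    for line in lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # Eight fixed columns plus FORMAT come first; samples follow.
            return columns[9:]
    return []


# Made-up VCF content with two samples.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tstrain_1\tstrain_2\n"
    "1\t100\t.\tA\tG\t.\t.\t.\tGT\t0\t1\n"
)
strains = vcf_strain_names(vcf_text.splitlines(keepends=True))

# Assumed index layout: one header line, then one strain name per line.
index_text = "strain\n" + "".join(f"{name}\n" for name in strains)
print(len(strains))  # 2
```

No per-strain statistics are computed here, in line with the description above; the number of names found corresponds to the documented return value.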

augur.index.register_arguments(parser)

augur.index.run(args)

Runs index_sequences, which counts the number of A, C, T, G, N, other IUPAC nucleotides, ambiguous bases ("?" and "-"), and other invalid characters in a set of sequences and writes the composition as a data frame to the given sequence index path.