augur.index module¶

Count occurrence of bases in a set of sequences.

augur.index.index_sequence(sequence, values)¶

Count the number of nucleotides for a given sequence record.

Parameters

sequence (Bio.SeqRecord.SeqRecord) – sequence record to index.
values (list of sets of str) – values to count; sets must be non-overlapping and contain only single-character, lowercase strings

Returns

summary of the given sequence’s strain name, length, nucleotide counts for the given values, and a final column with the number of characters that didn’t match any of those in the given values.

Return type

list

>>> other_IUPAC = {'r', 'y', 's', 'w', 'k', 'm', 'd', 'h', 'b', 'v'}
>>> values = [{'a'},{'c'},{'g'},{'t'},{'n'}, other_IUPAC, {'-'}, {'?'}]
>>> sequence_a = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("ACTGN-?XWN"), id="seq_A")
>>> index_sequence(sequence_a, values)
['seq_A', 10, 1, 1, 1, 1, 2, 1, 1, 1, 1]

>>> sequence_b = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("ACTGACTG"), id="seq_B")
>>> index_sequence(sequence_b, values)
['seq_B', 8, 2, 2, 2, 2, 0, 0, 0, 0, 0]

Characters in the given sequence that are not in the given list of values to count get counted in the final column.

>>> sequence_c = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("ACTG%@!!!NN"), id="seq_C")
>>> index_sequence(sequence_c, values)
['seq_C', 11, 1, 1, 1, 1, 2, 0, 0, 0, 5]

The list of value sets must not overlap.

>>> sequence_d = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("A!C!TGXN"), id="seq_D")
>>> index_sequence(sequence_d, [set('actg'), set('xn'), set('n')]) 
Traceback (most recent call last):
  ...
ValueError: character sets ... and {'n'} overlap: {'n'}

Value sets must contain only single-character, lowercase strings.

>>> index_sequence(sequence_d, [{'a'}, {'c'}, {'T'}, {'g'}])
Traceback (most recent call last):
  ...
ValueError: character set {'T'} contains a non-lowercase character: 'T'

>>> index_sequence(sequence_d, [{'actg'}])
Traceback (most recent call last):
  ...
ValueError: character set {'actg'} contains a multi-character (or maybe zero-length) string: 'actg'

>>> index_sequence(sequence_d, [{'a', 'c'}, {0, 1}])
Traceback (most recent call last):
  ...
ValueError: character set {0, 1} contains a non-string element: 0

augur.index.index_sequences(sequences_path, sequence_index_path)¶

Count the number of A, C, T, G, N, other IUPAC nucleotides, ambiguous bases (“?” and “-“), and other invalid characters in a set of sequences and write the composition as a data frame to the given sequence index path.

Parameters

sequences_path (str or Path-like) – path to a sequence file to index.
sequence_index_path (str or Path-like) – path to a tab-delimited file containing the composition details for each sequence in the given input file.

Returns

int – number of sequences indexed
int – total length of sequences indexed

augur.index.index_vcf(vcf_path, index_path)¶

Create an index with a list of strain names from a given VCF. We do not calculate any statistics for VCFs.

Parameters

vcf_path (str or Path-like) – path to a VCF file to index.
index_path (str or Path-like) – path to a tab-delimited file containing the composition details for each sequence in the given input file.

Returns

number of strains indexed

Return type

int

augur.index.register_arguments(parser)¶

augur.index.run(args)¶: runs index_sequences which counts the number of A, C, T, G, N, other IUPAC nucleotides, ambiguous bases (“?” and “-“), and other invalid characters in a set of sequences and write the composition as a data frame to the given sequence index path.