augur.index module¶
Count occurrence of bases in a set of sequences.
-
augur.index.
index_sequence
(sequence, values)¶ Count the number of nucleotides for a given sequence record.
- Parameters
sequence (Bio.SeqRecord.SeqRecord) – sequence record to index.
values (list of sets of str) – values to count; sets must be non-overlapping and contain only single-character, lowercase strings
- Returns
summary of the given sequence’s strain name, length, nucleotide counts for the given values, and a final column with the number of characters that didn’t match any of those in the given values.
- Return type
list
>>> other_IUPAC = {'r', 'y', 's', 'w', 'k', 'm', 'd', 'h', 'b', 'v'} >>> values = [{'a'},{'c'},{'g'},{'t'},{'n'}, other_IUPAC, {'-'}, {'?'}] >>> sequence_a = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("ACTGN-?XWN"), id="seq_A") >>> index_sequence(sequence_a, values) ['seq_A', 10, 1, 1, 1, 1, 2, 1, 1, 1, 1]
>>> sequence_b = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("ACTGACTG"), id="seq_B") >>> index_sequence(sequence_b, values) ['seq_B', 8, 2, 2, 2, 2, 0, 0, 0, 0, 0]
Characters in the given sequence that are not in the given list of values to count get counted in the final column.
>>> sequence_c = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("ACTG%@!!!NN"), id="seq_C") >>> index_sequence(sequence_c, values) ['seq_C', 11, 1, 1, 1, 1, 2, 0, 0, 0, 5]
The list of value sets must not overlap.
>>> sequence_d = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("A!C!TGXN"), id="seq_D") >>> index_sequence(sequence_d, [set('actg'), set('xn'), set('n')]) Traceback (most recent call last): ... ValueError: character sets ... and {'n'} overlap: {'n'}
Value sets must contain only single-character, lowercase strings.
>>> index_sequence(sequence_d, [{'a'}, {'c'}, {'T'}, {'g'}]) Traceback (most recent call last): ... ValueError: character set {'T'} contains a non-lowercase character: 'T'
>>> index_sequence(sequence_d, [{'actg'}]) Traceback (most recent call last): ... ValueError: character set {'actg'} contains a multi-character (or maybe zero-length) string: 'actg'
>>> index_sequence(sequence_d, [{'a', 'c'}, {0, 1}]) Traceback (most recent call last): ... ValueError: character set {0, 1} contains a non-string element: 0
-
augur.index.
index_sequences
(sequences_path, sequence_index_path)¶ Count the number of A, C, T, G, N, other IUPAC nucleotides, ambiguous bases (“?” and “-“), and other invalid characters in a set of sequences and write the composition as a data frame to the given sequence index path.
- Parameters
sequences_path (str or Path-like) – path to a sequence file to index.
sequence_index_path (str or Path-like) – path to a tab-delimited file containing the composition details for each sequence in the given input file.
- Returns
int – number of sequences indexed
int – total length of sequences indexed
-
augur.index.
index_vcf
(vcf_path, index_path)¶ Create an index with a list of strain names from a given VCF. We do not calculate any statistics for VCFs.
- Parameters
vcf_path (str or Path-like) – path to a VCF file to index.
index_path (str or Path-like) – path to a tab-delimited file containing the composition details for each sequence in the given input file.
- Returns
number of strains indexed
- Return type
int
-
augur.index.
register_arguments
(parser)¶
-
augur.index.
run
(args)¶ runs index_sequences which counts the number of A, C, T, G, N, other IUPAC nucleotides, ambiguous bases (“?” and “-“), and other invalid characters in a set of sequences and write the composition as a data frame to the given sequence index path.