augur.index moduleο
Count occurrence of bases in a set of sequences.
- augur.index.index_sequence(sequence, values)ο
Count the number of nucleotides for a given sequence record.
- Parameters:
sequence (Bio.SeqRecord.SeqRecord) β sequence record to index.
values (list of set of str) β values to count; sets must be non-overlapping and contain only single-character, lowercase strings
- Returns:
summary of the given sequenceβs strain name, length, nucleotide counts for the given values, and a final column with the number of characters that didnβt match any of those in the given values.
- Return type:
Examples
>>> import Bio >>> other_IUPAC = {'r', 'y', 's', 'w', 'k', 'm', 'd', 'h', 'b', 'v'} >>> values = [{'a'},{'c'},{'g'},{'t'},{'n'}, other_IUPAC, {'-'}, {'?'}] >>> sequence_a = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("ACTGN-?XWN"), id="seq_A") >>> index_sequence(sequence_a, values) ['seq_A', 10, 1, 1, 1, 1, 2, 1, 1, 1, 1]
>>> sequence_b = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("ACTGACTG"), id="seq_B") >>> index_sequence(sequence_b, values) ['seq_B', 8, 2, 2, 2, 2, 0, 0, 0, 0, 0]
Characters in the given sequence that are not in the given list of values to count get counted in the final column.
>>> sequence_c = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("ACTG%@!!!NN"), id="seq_C") >>> index_sequence(sequence_c, values) ['seq_C', 11, 1, 1, 1, 1, 2, 0, 0, 0, 5]
The list of value sets must not overlap.
>>> sequence_d = Bio.SeqRecord.SeqRecord(seq=Bio.Seq.Seq("A!C!TGXN"), id="seq_D") >>> index_sequence(sequence_d, [set('actg'), set('xn'), set('n')]) Traceback (most recent call last): ... ValueError: character sets ... and {'n'} overlap: {'n'}
Value sets must contain only single-character, lowercase strings.
>>> index_sequence(sequence_d, [{'a'}, {'c'}, {'T'}, {'g'}]) Traceback (most recent call last): ... ValueError: character set {'T'} contains a non-lowercase character: 'T'
>>> index_sequence(sequence_d, [{'actg'}]) Traceback (most recent call last): ... ValueError: character set {'actg'} contains a multi-character (or maybe zero-length) string: 'actg'
>>> index_sequence(sequence_d, [{'a', 'c'}, {0, 1}]) Traceback (most recent call last): ... ValueError: character set {0, 1} contains a non-string element: 0
- augur.index.index_sequences(sequences_path, sequence_index_path)ο
Count the number of A, C, T, G, N, other IUPAC nucleotides, ambiguous bases (β?β and β-β), and other invalid characters in a set of sequences and write the composition as a data frame to the given sequence index path.
- Parameters:
sequences_path (str or os.PathLike) β path to a sequence file to index.
sequence_index_path (str or os.PathLike) β path to a tab-delimited file containing the composition details for each sequence in the given input file.
- Returns:
int β number of sequences indexed
int β total length of sequences indexed
- augur.index.index_vcf(vcf_path, index_path)ο
Create an index with a list of strain names from a given VCF. We do not calculate any statistics for VCFs.
- Parameters:
vcf_path (str or os.PathLike) β path to a VCF file to index.
index_path (str or os.PathLike) β path to a tab-delimited file containing the composition details for each sequence in the given input file.
- Returns:
number of strains indexed
- Return type:
- augur.index.register_parser(parent_subparsers)ο
- augur.index.run(args)ο
runs index_sequences which counts the number of A, C, T, G, N, other IUPAC nucleotides, ambiguous bases (β?β and β-β), and other invalid characters in a set of sequences and write the composition as a data frame to the given sequence index path.