augur.io
Interfaces for reading and writing data, also known as input/output (I/O).
- augur.io.open_file(path_or_buffer, mode='r', **kwargs)
Opens a given file path and returns the handle.
Transparently handles compressed inputs and outputs.
- Parameters:
path_or_buffer – Name of the file to open or an existing IO buffer
mode (str) – Mode to open file (read or write)
- Returns:
File handle object
- Return type:
IO
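For instance, here is a minimal sketch of streaming lines from a gzip-compressed file, assuming the returned handle can be used as a context manager and that the input path (which is hypothetical) ends in ".gz" so compression is inferred:

from augur.io import open_file

# Hypothetical path; compression is handled transparently based on the file.
with open_file("results/metadata.tsv.gz", mode="r") as handle:
    for line in handle:
        print(line.rstrip("\n"))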
- augur.io.read_metadata(metadata_file, delimiters=(',', '\t'), columns=None, id_columns=('strain', 'name'), chunk_size=None, dtype=None)
Read metadata from a given filename into a pandas DataFrame or TextFileReader object.
- Parameters:
metadata_file (str) – Path to a metadata file to load.
delimiters (list of str) – List of possible delimiters to check for between columns in the metadata. Only one delimiter will be inferred.
columns (list of str) – List of columns to read. If unspecified, read all columns.
id_columns (list of str) – List of possible id column names to check for, ordered by priority. Only one id column will be inferred.
chunk_size (int) – Size of chunks to stream from disk with an iterator instead of loading the entire input file into memory.
dtype (dict or str) – Data types to apply to columns in metadata. If unspecified, pandas data type inference will be used. See documentation for an argument of the same name to pandas.read_csv().
- Return type:
pandas.DataFrame or pandas.io.parsers.TextFileReader
- Raises:
KeyError – When the metadata file does not have any valid index columns.
Examples
For standard use, request a metadata file and get a pandas DataFrame.
>>> read_metadata("tests/functional/filter/data/metadata.tsv").index.values[0] 'COL/FLR_00024/2015'
Requesting an index column that doesn’t exist should produce an error.
>>> read_metadata("tests/functional/filter/data/metadata.tsv", id_columns=("Virus name",)) Traceback (most recent call last): ... Exception: None of the possible id columns ('Virus name') were found in the metadata's columns ('strain', 'virus', 'accession', 'date', 'region', 'country', 'division', 'city', 'db', 'segment', 'authors', 'url', 'title', 'journal', 'paper_url')
We also allow iterating through metadata in fixed chunk sizes.
>>> for chunk in read_metadata("tests/functional/filter/data/metadata.tsv", chunk_size=5):
...     print(chunk.shape)
...
(5, 14)
(5, 14)
(2, 14)
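The columns and dtype parameters can be combined to limit what is read and how it is parsed. A sketch under the assumption that the requested columns include the id column, reusing the test metadata path from the examples above:

from augur.io import read_metadata

# Read only selected columns, forcing "date" to be parsed as a plain string
# rather than relying on pandas type inference.
metadata = read_metadata(
    "tests/functional/filter/data/metadata.tsv",
    columns=["strain", "date", "country"],
    dtype={"date": "string"},
)
print(metadata.columns.tolist())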
- augur.io.read_sequences(*paths, format='fasta')
Read sequences from one or more paths.
Automatically infer compression mode (e.g., gzip, etc.) and return a stream of sequence records in the requested format (e.g., “fasta”, “genbank”, etc.).
- Parameters:
paths (list of str or os.PathLike) – One or more paths to sequence files of any type supported by BioPython.
format (str) – Format of input sequences matching any of those supported by BioPython (e.g., “fasta”, “genbank”, etc.).
- Yields:
Bio.SeqRecord.SeqRecord – Sequence record from the given path(s).
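A sketch of streaming records from multiple inputs, where the paths are hypothetical and one of them is gzip-compressed:

from augur.io import read_sequences

# Records from all given paths are yielded one at a time; compression is
# inferred per file, so plain and gzipped FASTA can be mixed freely.
for record in read_sequences("data/batch1.fasta", "data/batch2.fasta.gz"):
    print(record.id, len(record.seq))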
- augur.io.read_strains(*files, comment_char='#')
Read strain names from one or more plain text files and return the set of distinct strains.
Strain names can be commented with full-line or inline comments. For example, the following is a valid strain names file:
# this is a comment at the top of the file
strain1  # exclude strain1 because it isn't sequenced properly
strain2
  # this is an empty line that will be ignored.
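A sketch of combining strain names from several such files, using hypothetical paths; the distinct names across all inputs are returned as one set:

from augur.io import read_strains

# Full-line and inline comments in either file are ignored.
strains = read_strains("config/include.txt", "config/outliers.txt")
print(len(strains), "distinct strain names")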
- augur.io.write_sequences(sequences, path_or_buffer, format='fasta')
Write sequences to a given path in the given format.
Automatically infer compression mode (e.g., gzip, etc.) based on the path’s filename extension.
- Parameters:
sequences (iterable of Bio.SeqRecord.SeqRecord) – A list-like collection of sequences to write
path_or_buffer (str or os.PathLike or io.StringIO) – A path to a file to write the given sequences in the given format.
format (str) – Format of input sequences matching any of those supported by BioPython (e.g., “fasta”, “genbank”, etc.)
- Returns:
Number of sequences written out to the given path.
- Return type:
int
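As a sketch of writing records to a gzip-compressed FASTA file, assuming Biopython is installed and using a hypothetical output path whose ".gz" extension triggers compression:

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from augur.io import write_sequences

records = [
    SeqRecord(Seq("ATGC"), id="strain1", description=""),
    SeqRecord(Seq("GGTA"), id="strain2", description=""),
]

# Returns the number of sequences written to the given path.
count = write_sequences(records, "results/sequences.fasta.gz", format="fasta")
print(f"Wrote {count} sequences")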