augur.io

Interfaces for reading and writing data, also known as input/output (I/O).

augur.io.open_file(path_or_buffer, mode='r', **kwargs)

Opens a given file path and returns the handle.

Transparently handles compressed inputs and outputs.

Parameters

path_or_buffer (str or Path-like or IO buffer) – Name of the file to open or an existing IO buffer.
mode (str) – Mode in which to open the file (read or write); defaults to 'r'.

Returns

File handle object

Return type

IO
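
To illustrate what "transparently handles compressed inputs and outputs" can mean in practice, here is a minimal standard-library sketch. It is not augur's implementation: the helper name `open_maybe_compressed`, the suffix table, and the choice to select a codec from the filename suffix are all assumptions made for this example.

```python
import gzip
import io
import lzma
import os
import tempfile
from pathlib import Path

# Hypothetical suffix-to-opener table; augur's own detection may differ.
_OPENERS = {".gz": gzip.open, ".xz": lzma.open}


def open_maybe_compressed(path_or_buffer, mode="r", **kwargs):
    """Return a text handle, choosing a codec from the filename suffix."""
    # An existing IO buffer is handed back unchanged.
    if isinstance(path_or_buffer, io.IOBase):
        return path_or_buffer
    path = Path(path_or_buffer)
    opener = _OPENERS.get(path.suffix, open)
    # gzip.open and lzma.open default to binary, so ask for text explicitly.
    text_mode = mode if ("t" in mode or "b" in mode) else mode + "t"
    return opener(path, text_mode, **kwargs)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "metadata.tsv.gz")
    # The caller never mentions gzip; the suffix decides.
    with open_maybe_compressed(target, "w") as handle:
        handle.write("strain\tdate\n")
    with open_maybe_compressed(target) as handle:
        first_line = handle.readline()
```

The caller uses the same call for plain and compressed files, which is the behavior the description above promises.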

augur.io.read_metadata(metadata_file, id_columns=('strain', 'name'), chunk_size=None)

Read metadata from a given filename into a pandas DataFrame or TextFileReader object.

Parameters

metadata_file (str) – Path to a metadata file to load.
id_columns (list[str] or tuple[str]) – Possible id column names to check for, ordered by priority.
chunk_size (int) – Number of rows per chunk to stream from disk with an iterator instead of loading the entire input file into memory.

Return type

pandas.DataFrame or pandas.TextFileReader

Raises

Exception – When the metadata file does not have any valid index columns.

For standard use, request a metadata file and get a pandas DataFrame.

>>> read_metadata("tests/functional/filter/data/metadata.tsv").index.values[0]
'COL/FLR_00024/2015'

Requesting an index column that doesn’t exist produces an error.

>>> read_metadata("tests/functional/filter/data/metadata.tsv", id_columns=("Virus name",))
Traceback (most recent call last):
  ...
Exception: None of the possible id columns (('Virus name',)) were found in the metadata's columns ('strain', 'virus', 'accession', 'date', 'region', 'country', 'division', 'city', 'db', 'segment', 'authors', 'url', 'title', 'journal', 'paper_url')
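
The "ordered by priority" rule for id_columns can be sketched with a small standalone function. This is an illustration under assumptions, not augur's code: `pick_id_column` is a hypothetical name, and the `ValueError` it raises is this sketch's own choice.

```python
def pick_id_column(columns, id_columns=("strain", "name")):
    """Return the first name in id_columns that is present in columns."""
    for candidate in id_columns:
        if candidate in columns:
            return candidate
    # ValueError is this sketch's choice, not necessarily what augur raises.
    raise ValueError(
        f"None of the possible id columns ({id_columns!r}) "
        f"were found in {tuple(columns)!r}"
    )


# "strain" outranks "name" even though "name" comes first in the file.
chosen = pick_id_column(["name", "strain", "date"])
print(chosen)
```

The order of id_columns, not the order of columns in the file, decides which column becomes the index.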

We also allow iterating through metadata in fixed chunk sizes.

>>> for chunk in read_metadata("tests/functional/filter/data/metadata.tsv", chunk_size=5):
...     print(chunk.shape)
...
(5, 14)
(5, 14)
(2, 14)
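
The fixed-size chunking shown above can be pictured with a standard-library sketch. It stands in for the pandas reader only for illustration: `iter_row_chunks` is a hypothetical helper, and the twelve-row table is made-up data chosen to give the same 5, 5, 2 split as the example.

```python
import csv
import io
from itertools import islice


def iter_row_chunks(handle, chunk_size, delimiter="\t"):
    """Yield lists of at most chunk_size rows from a delimited text handle."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    while True:
        # islice pulls only the next chunk_size rows from the reader.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Twelve made-up rows, so a chunk size of 5 leaves a final chunk of 2.
rows = "".join(f"sample_{i}\t2015-01-{i + 1:02d}\n" for i in range(12))
handle = io.StringIO("strain\tdate\n" + rows)
sizes = [len(chunk) for chunk in iter_row_chunks(handle, chunk_size=5)]
print(sizes)  # [5, 5, 2]
```

Only one chunk is held in memory at a time, which is the reason to pass chunk_size for large metadata files.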

augur.io.read_sequences(*paths, format='fasta')

Read sequences from one or more paths.

Automatically infers the compression mode (e.g., gzip) and returns a stream of sequence records read from files in the requested format (e.g., “fasta” or “genbank”).

Parameters

paths (list of str or Path-like objects) – One or more paths to sequence files of any type supported by BioPython.
format (str) – Format of input sequences matching any of those supported by BioPython (e.g., “fasta” or “genbank”).

Yields

Bio.SeqRecord.SeqRecord – Sequence record from the given path(s).
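
The streaming behavior across several paths can be sketched without BioPython. The generator below is a simplified stand-in, not augur's implementation: `iter_fasta_records` is a hypothetical name, it yields plain `(name, sequence)` tuples rather than `Bio.SeqRecord.SeqRecord` objects, it handles only FASTA, and it picks gzip from the filename suffix as an assumption.

```python
import gzip
import os
import tempfile
from pathlib import Path


def iter_fasta_records(*paths):
    """Yield (name, sequence) tuples from each FASTA path in turn."""
    for path in paths:
        path = Path(path)
        # Suffix-based codec choice is an assumption of this sketch.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt") as handle:
            name, parts = None, []
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    # A new header closes the previous record, if any.
                    if name is not None:
                        yield name, "".join(parts)
                    name, parts = line[1:], []
                elif line:
                    parts.append(line)
            if name is not None:
                yield name, "".join(parts)


with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "a.fasta")
    zipped = os.path.join(tmp, "b.fasta.gz")
    with open(plain, "w") as handle:
        handle.write(">seq1\nACGT\nTT\n")
    with gzip.open(zipped, "wt") as handle:
        handle.write(">seq2\nGGCC\n")
    # Records arrive lazily, one file after another.
    records = list(iter_fasta_records(plain, zipped))
print(records)
```

Because the function is a generator, records from later files are not read until earlier ones have been consumed.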

augur.io.write_sequences(sequences, path_or_buffer, format='fasta')

Write sequences to a given path in the given format.

Automatically infers the compression mode (e.g., gzip) based on the path’s filename extension.

Parameters

sequences (iterable of Bio.SeqRecord.SeqRecord objects) – A list-like collection of sequences to write.
path_or_buffer (str or Path-like object or IO buffer) – A path to a file, or an existing IO buffer, to write the given sequences to in the given format.
format (str) – Format of output sequences matching any of those supported by BioPython (e.g., “fasta” or “genbank”).

Returns

Number of sequences written out to the given path.

Return type

int
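
A short sketch shows how a writer can accept either a path or a buffer and still report the number of sequences written. It is an illustration under assumptions, not augur's code: `write_fasta` is a hypothetical helper that takes plain `(name, sequence)` pairs instead of `Bio.SeqRecord.SeqRecord` objects and supports only uncompressed FASTA.

```python
import io


def write_fasta(records, path_or_buffer):
    """Write (name, sequence) pairs as FASTA and return how many were written."""
    count = 0
    # Open a file only when given a path; caller-owned buffers stay open.
    owns_handle = not hasattr(path_or_buffer, "write")
    handle = open(path_or_buffer, "w") if owns_handle else path_or_buffer
    try:
        for name, sequence in records:
            handle.write(f">{name}\n{sequence}\n")
            # Counting during iteration also works for generators without len().
            count += 1
    finally:
        if owns_handle:
            handle.close()
    return count


buffer = io.StringIO()
written = write_fasta(iter([("a", "ACGT"), ("b", "GG")]), buffer)
print(written)  # 2
```

Returning the count lets a caller confirm that the expected number of records reached the output without reading the file back.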