augur.io.metadata module

exception augur.io.metadata.InvalidDelimiter: Bases: Exception

class augur.io.metadata.Metadata(path, delimiters, id_columns)

Bases: object

Represents a metadata file.

columns: Sequence[str]: Columns extracted from the first row in the metadata file.

delimiter: str: Inferred delimiter of metadata.

id_column: str: Inferred ID column.

open(**kwargs): Open the file with auto-compression/decompression.

path: str: Path to the file on disk.

rows(strict=True)

Yield rows in a dictionary format. Empty lines are ignored.

Parameters: strict (bool) -- If True, raise an error when a row contains more or fewer columns than expected.
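The effect of strict can be sketched with the standard library alone. The function below is a hypothetical stand-in (iter_rows is not part of augur, and the ValueError it raises is an assumption, not the exception augur uses); it only illustrates skipping empty lines and checking column counts.

```python
import csv
import io


def iter_rows(handle, delimiter="\t", strict=True):
    """Yield each non-empty row as a dict keyed by the header columns.

    Hypothetical stand-in for Metadata.rows(); not augur's implementation.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    columns = next(reader)
    for fields in reader:
        if not fields:
            # csv.reader returns an empty list for a blank line; skip it.
            continue
        if strict and len(fields) != len(columns):
            raise ValueError(
                f"Line {reader.line_num}: expected {len(columns)} columns, "
                f"found {len(fields)}."
            )
        # zip() silently drops unmatched fields when strict is False.
        yield dict(zip(columns, fields))


data = "strain\tdate\nA\t2020-01-01\n\nB\n"

# Lenient mode keeps the short row, minus its missing column.
lenient = list(iter_rows(io.StringIO(data), strict=False))

# Strict mode stops at the first row with the wrong number of columns.
try:
    list(iter_rows(io.StringIO(data), strict=True))
    strict_error = None
except ValueError as error:
    strict_error = str(error)
```

In this sketch the blank third line is dropped in both modes, and only the short row for B triggers the strict error.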

augur.io.metadata.read_csv_with_index_col(filepath_or_buffer, **kwargs)

Wrapper around pd.read_csv() to retain index_col as a column in addition to setting it as the DataFrame index.

Examples

‘strain’ is available as both the index and a column.

>>> from io import StringIO
>>> csv_data = StringIO("strain,col\nA,val\nB,val")
>>> df = read_csv_with_index_col(csv_data, index_col='strain')
>>> df.index.name
'strain'
>>> 'strain' in df.columns
True

Chunked reading also works.

>>> csv_data.seek(0)
0
>>> chunks = read_csv_with_index_col(csv_data, index_col='strain', chunksize=1)
>>> chunk = next(chunks)
>>> chunk.index.name
'strain'
>>> 'strain' in chunk.columns
True

Without index_col, an error is raised.

>>> read_csv_with_index_col(csv_data)
Traceback (most recent call last):
    ...
Exception: index_col is required.

augur.io.metadata.read_metadata(metadata_file, delimiters=(',', '\t'), columns=None, id_columns=('id', 'strain', 'name'), keep_id_as_column=False, chunk_size=None, dtype=None)

Read metadata from a given file into a pandas DataFrame, or into an iterator of DataFrames when chunk_size is specified.

Parameters:

metadata_file (str) -- Path to a metadata file to load.
delimiters (Sequence[str]) -- List of possible delimiters to check for between columns in the metadata. Only one delimiter will be inferred.
columns (list[str] | None) -- List of columns to read. If unspecified, read all columns.
id_columns (Sequence[str]) -- List of possible id column names to check for, ordered by priority. Only one id column will be inferred.
keep_id_as_column (bool) -- If true, keep the resolved id column as a column in addition to setting it as the DataFrame index.
chunk_size (int | None) -- Size of chunks to stream from disk with an iterator instead of loading the entire input file into memory.
dtype (dict[str, Any] | str | None) -- Data types to apply to columns in metadata. If unspecified, pandas data type inference will be used. See documentation for an argument of the same name to pandas.read_csv().

Raises:

KeyError -- When the metadata file does not have any valid index columns.

Examples

For standard use, request a metadata file and get a pandas DataFrame.

>>> read_metadata("tests/functional/filter/data/metadata.tsv").index.values[0]
'COL/FLR_00024/2015'

Requesting an index column that doesn’t exist should produce an error.

>>> read_metadata("tests/functional/filter/data/metadata.tsv", id_columns=("Virus name",))
Traceback (most recent call last):
  ...
Exception: None of the possible id columns ('Virus name') were found in the metadata's columns ('strain', 'virus', 'accession', 'date', 'region', 'country', 'division', 'city', 'db', 'segment', 'authors', 'url', 'title', 'journal', 'paper_url')

We also allow iterating through metadata in fixed chunk sizes.

>>> for chunk in read_metadata("tests/functional/filter/data/metadata.tsv", chunk_size=5):
...     print(chunk.shape)
...
(5, 14)
(5, 14)
(2, 14)
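The inference of a single delimiter and a single id column described in the parameters can be sketched as follows. Both helper names are hypothetical, and using csv.Sniffer for the delimiter is an assumption about one reasonable approach, not a statement of how augur does it.

```python
import csv


def infer_delimiter(header_line, delimiters=(",", "\t")):
    """Pick the one delimiter from `delimiters` that the header line uses.

    Hypothetical helper. csv.Sniffer.sniff() limits its guess to the
    characters in its second argument and raises csv.Error if none fits.
    """
    return csv.Sniffer().sniff(header_line, "".join(delimiters)).delimiter


def resolve_id_column(columns, id_columns=("id", "strain", "name")):
    """Return the first candidate in `id_columns` that is present in `columns`.

    Hypothetical helper. Candidates are checked in priority order, so an
    earlier candidate wins even if a later one appears first in the file.
    """
    for candidate in id_columns:
        if candidate in columns:
            return candidate
    raise ValueError(
        f"None of the possible id columns {id_columns!r} were found in {columns!r}"
    )


header = "name\tstrain\tdate"
delimiter = infer_delimiter(header)
columns = header.split(delimiter)

# 'strain' outranks 'name' in the candidate list, so it is chosen
# even though 'name' is the first column in the header.
id_column = resolve_id_column(columns)
```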

augur.io.metadata.read_metadata_with_sequences(metadata, metadata_delimiters, fasta, seq_id_column, seq_field='sequence', unmatched_reporting=DataErrorMethod.ERROR_FIRST, duplicate_reporting=DataErrorMethod.ERROR_FIRST)

Read rows from the metadata file and yield each row as a single dict that has been updated with its corresponding sequence from the FASTA file. Each metadata record is matched with its sequence using the sequence id in the seq_id_column. To ensure that the sequences can be matched with the metadata, the FASTA headers must contain the matching sequence id. The FASTA headers may include additional description parts after the id, but these are not used for matching.

Will report unmatched records if requested via unmatched_reporting. Note that the ERROR_FIRST method raises an error at the first unmatched metadata record but not at an unmatched sequence record, because unmatched sequences can only be checked after the metadata generator has been exhausted.

Will report duplicate records if requested via duplicate_reporting.

Reads the fasta file with pyfastx.Fasta, which creates an index for the file to allow random access of sequences via the sequence id. Will remove any existing index file named <fasta>.fxi to force the rebuilding of the index so that there’s no chance of using stale cached indexes. See pyfastx docs for more details: https://pyfastx.readthedocs.io/en/latest/usage.html#fasta

When the metadata file is an Excel or OpenOffice workbook, only the first visible worksheet will be read and initial empty rows/columns will be ignored.

Parameters:

metadata (str) -- Path to a CSV, TSV, Excel, or OpenOffice metadata file or binary IO buffer
metadata_delimiters (list of str) -- List of possible delimiters to check for between columns in the metadata. Ignored if metadata is an Excel or OpenOffice file.
fasta (str) -- Path to a plain or gzipped FASTA file
seq_id_column (str) -- The column in the metadata file that contains the sequence id for matching sequences
seq_field (str, optional) -- The field name to use for the sequence in the updated record
unmatched_reporting (DataErrorMethod, optional) -- How unmatched records should be reported
duplicate_reporting (DataErrorMethod, optional) -- How duplicate records should be reported

Yields:

dict -- The parsed metadata record with the sequence
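The matching logic can be illustrated with a small standard-library sketch. It is an assumption-laden simplification: it loads the FASTA into a dict instead of using pyfastx, handles only tab-delimited metadata, raises a plain ValueError at the first unmatched metadata record, and does not report unmatched sequences or duplicates. The names parse_fasta and rows_with_sequences are hypothetical.

```python
import csv
import io


def parse_fasta(handle):
    """Return {sequence id: sequence} from a FASTA handle.

    The id is the first whitespace-delimited token of each header line,
    so any description after the id is ignored for matching.
    """
    sequences = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current_id = line[1:].split()[0]
            sequences[current_id] = ""
        elif current_id is not None:
            sequences[current_id] += line
    return sequences


def rows_with_sequences(metadata, fasta, seq_id_column, seq_field="sequence"):
    """Yield each metadata row as a dict updated with its sequence.

    Hypothetical stand-in; raises at the first metadata row without a
    matching sequence.
    """
    sequences = parse_fasta(fasta)
    for row in csv.DictReader(metadata, delimiter="\t"):
        seq_id = row[seq_id_column]
        if seq_id not in sequences:
            raise ValueError(f"No sequence found for metadata record {seq_id!r}")
        row[seq_field] = sequences[seq_id]
        yield row


metadata = io.StringIO("strain\tdate\nA\t2020\nB\t2021\n")
fasta = io.StringIO(">A some description\nACGT\nAC\n>B\nGGGG\n>C\nTTTT\n")

records = list(rows_with_sequences(metadata, fasta, seq_id_column="strain"))
```

Here the sequence C has no metadata record. The sketch ignores it, which mirrors why unmatched sequences can only be detected once every metadata row has been read.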

augur.io.metadata.read_table_to_dict(table, delimiters, duplicate_reporting=DataErrorMethod.ERROR_FIRST, id_column=None)

Read rows from table file and yield each row as a single dict.

Will report duplicate records based on the id_column if requested via duplicate_reporting after the generator has been exhausted.

When the table file is an Excel or OpenOffice workbook, only the first visible worksheet will be read and initial empty rows/columns will be ignored.

Parameters:

table (str) -- Path to a CSV, TSV, Excel, or OpenOffice file or binary IO buffer
delimiters (list of str) -- List of possible delimiters to check for between columns in the metadata. Only one delimiter will be inferred. Ignored if table is an Excel or OpenOffice file.
duplicate_reporting (DataErrorMethod, optional) -- How duplicate records should be reported
id_column (str, optional) -- Name of the column that contains the record identifier used for reporting duplicates. Uses the first column of the metadata if not provided.

Yields:

dict -- The parsed row as a single record

Raises:

AugurError -- Raised for any of the following reasons:

1. There are parsing errors from the csv standard library.
2. The provided id_column does not exist in the metadata.
3. The duplicate_reporting method is set to ERROR_FIRST or ERROR_ALL and duplicate(s) are found.
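Reporting duplicates only after the generator has been exhausted can be sketched as below. This is a hypothetical illustration (read_rows is not augur's function, and ValueError stands in for AugurError): every row is yielded first, and the duplicate check runs once the input ends.

```python
import csv
import io
from collections import Counter


def read_rows(handle, delimiter, id_column=None):
    """Yield each row as a dict, then report duplicate ids at the end.

    Hypothetical stand-in; ValueError is used in place of AugurError.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    if id_column is None:
        # Fall back to the first column, as described for id_column.
        id_column = reader.fieldnames[0]
    elif id_column not in reader.fieldnames:
        raise ValueError(f"The id column {id_column!r} does not exist in the table.")

    seen = Counter()
    for row in reader:
        seen[row[id_column]] += 1
        yield row

    # This point is reached only when the caller has consumed every row.
    duplicates = sorted(key for key, count in seen.items() if count > 1)
    if duplicates:
        raise ValueError(f"Duplicate ids in column {id_column!r}: {duplicates}")


table = io.StringIO("strain,date\nA,2020\nB,2021\nA,2022\n")
yielded = []
try:
    for row in read_rows(table, ","):
        yielded.append(row["strain"])
    duplicate_error = None
except ValueError as error:
    duplicate_error = str(error)
```

All three rows reach the caller before the error appears, so a consumer that stops iterating early would never see the duplicate report in this sketch.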

augur.io.metadata.write_records_to_tsv(records, output_file)

Write each record from records as a single row to a TSV output_file. Uses the keys of the first record as output column names. Extra keys in other records are ignored. Keys missing from a record are written as empty strings.

Parameters:

records (iterable of dict) -- Iterator that yields dicts that contain sequences
output_file (str) -- Path to the output TSV file. Accepts ‘-’ to output TSV to stdout.
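The column handling described above maps closely onto csv.DictWriter options. The sketch below is a hypothetical stand-in (write_tsv is not augur's function); it writes to an open handle rather than a path, and returning silently on empty input is an assumption.

```python
import csv
import io
import itertools


def write_tsv(records, handle):
    """Write records as TSV rows, using the first record's keys as columns.

    Hypothetical stand-in for write_records_to_tsv().
    """
    records = iter(records)
    try:
        first = next(records)
    except StopIteration:
        # Assumption for this sketch: nothing is written for empty input.
        return

    writer = csv.DictWriter(
        handle,
        fieldnames=list(first),
        delimiter="\t",
        extrasaction="ignore",  # drop keys that are not in the first record
        restval="",             # fill keys missing from later records
        lineterminator="\n",
    )
    writer.writeheader()
    for record in itertools.chain([first], records):
        writer.writerow(record)


output = io.StringIO()
write_tsv(
    [
        {"strain": "A", "date": "2020"},
        {"strain": "B", "extra": "ignored"},
    ],
    output,
)
tsv_text = output.getvalue()
```

The second record lacks date and carries an extra key, so its row has an empty date field and the extra column never appears in the output.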