augur.io.metadata

exception augur.io.metadata.InvalidDelimiter

Bases: Exception

class augur.io.metadata.Metadata(path, delimiters, id_columns)

Bases: object

Represents a metadata file.

columns: Sequence[str]

Columns extracted from the first row in the metadata file.

delimiter: str

Inferred delimiter of metadata.

id_column: str

Inferred ID column.

open(**kwargs)

Open the file with auto-compression/decompression.

path: str

Path to the file on disk.

rows(strict=True)

Yield rows in a dictionary format. Empty lines are ignored.

Parameters:

strict (bool) – If True, raise an error when a row contains more or fewer columns than expected.
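The behavior described for rows() can be illustrated with a small standard-library sketch. This is not Augur's implementation: the helper name iter_rows, its delimiter argument, and the choice of ValueError are assumptions made for illustration only.

```python
import csv
import io


def iter_rows(handle, delimiter="\t", strict=True):
    """Yield each non-empty row as a dict keyed by the header columns.

    Hypothetical helper, not part of augur.io.metadata.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    columns = next(reader)
    for fields in reader:
        if not fields:
            # csv.reader returns an empty list for a blank line; skip it.
            continue
        if strict and len(fields) != len(columns):
            raise ValueError(
                f"Line {reader.line_num}: expected {len(columns)} columns, "
                f"found {len(fields)}"
            )
        yield dict(zip(columns, fields))


# The blank line between the two records is ignored.
text = "strain\tdate\nA\t2020-01-01\n\nB\t2020-02-01\n"
rows = list(iter_rows(io.StringIO(text)))
```

With strict=False, this sketch silently drops fields that do not line up with the header, which is one reason a strict default is the safer choice.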

augur.io.metadata.read_metadata(metadata_file, delimiters=(',', '\t'), columns=None, id_columns=('strain', 'name'), chunk_size=None, dtype=None)

Read metadata from a given filename into a pandas DataFrame or TextFileReader object.

Parameters:
  • metadata_file (str) – Path to a metadata file to load.

  • delimiters (list of str) – List of possible delimiters to check for between columns in the metadata. Only one delimiter will be inferred.

  • columns (list of str) – List of columns to read. If unspecified, read all columns.

  • id_columns (list of str) – List of possible id column names to check for, ordered by priority. Only one id column will be inferred.

  • chunk_size (int) – Size of chunks to stream from disk with an iterator instead of loading the entire input file into memory.

  • dtype (dict or str) – Data types to apply to columns in metadata. If unspecified, pandas data type inference will be used. See documentation for an argument of the same name to pandas.read_csv().

Return type:

pandas.DataFrame or pandas.io.parsers.TextFileReader

Raises:

KeyError – When the metadata file does not have any valid index columns.
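The two inference steps mentioned in the parameters (one delimiter, one id column) can be sketched with the standard library alone. The functions infer_delimiter and infer_id_column are hypothetical names, and using csv.Sniffer is an assumption about one reasonable way to guess a delimiter, not a statement of how Augur does it.

```python
import csv


def infer_delimiter(sample, delimiters=(",", "\t")):
    """Guess which candidate delimiter separates columns in the sample text."""
    # Sniffer.sniff() takes the candidates as a single string of characters.
    dialect = csv.Sniffer().sniff(sample, delimiters="".join(delimiters))
    return dialect.delimiter


def infer_id_column(columns, id_columns=("strain", "name")):
    """Return the first candidate id column present, honoring priority order."""
    for candidate in id_columns:
        if candidate in columns:
            return candidate
    raise KeyError(
        f"None of the possible id columns {id_columns!r} were found in {columns!r}"
    )


# "name" comes first in the file, but "strain" has higher priority.
sample = "name\tstrain\tdate\nx\tA\t2020-01-01\ny\tB\t2020-02-01\n"
delimiter = infer_delimiter(sample)
columns = sample.splitlines()[0].split(delimiter)
id_column = infer_id_column(columns)
```

The point of the priority order is that the position of a column in the file does not matter; only the order of the candidates does.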

Examples

For standard use, request a metadata file and get a pandas DataFrame.

>>> read_metadata("tests/functional/filter/data/metadata.tsv").index.values[0]
'COL/FLR_00024/2015'

Requesting an index column that doesn’t exist should produce an error.

>>> read_metadata("tests/functional/filter/data/metadata.tsv", id_columns=("Virus name",))
Traceback (most recent call last):
  ...
Exception: None of the possible id columns (('Virus name',)) were found in the metadata's columns ('strain', 'virus', 'accession', 'date', 'region', 'country', 'division', 'city', 'db', 'segment', 'authors', 'url', 'title', 'journal', 'paper_url')

We also allow iterating through metadata in fixed chunk sizes.

>>> for chunk in read_metadata("tests/functional/filter/data/metadata.tsv", chunk_size=5):
...     print(chunk.shape)
...
(5, 14)
(5, 14)
(2, 14)

augur.io.metadata.read_metadata_with_sequences(metadata, metadata_delimiters, fasta, seq_id_column, seq_field='sequence', unmatched_reporting=DataErrorMethod.ERROR_FIRST, duplicate_reporting=DataErrorMethod.ERROR_FIRST)

Read rows from the metadata file and yield each row as a single dict that has been updated with its corresponding sequence from the FASTA file. Metadata records are matched to sequences using the sequence id in the seq_id_column. For a match to be possible, the FASTA header must contain the same sequence id. FASTA headers may include additional description parts after the id, but these are not used for matching.

Will report unmatched records if requested via unmatched_reporting. Note that the ERROR_FIRST method raises an error at the first unmatched metadata record, but not at an unmatched sequence record, because unmatched sequences can only be checked after the metadata generator has been exhausted.

Will report duplicate records if requested via duplicate_reporting.

Reads the fasta file with pyfastx.Fasta, which creates an index for the file to allow random access of sequences via the sequence id. Will remove any existing index file named <fasta>.fxi to force the rebuilding of the index so that there’s no chance of using stale cached indexes. See pyfastx docs for more details: https://pyfastx.readthedocs.io/en/latest/usage.html#fasta

Parameters:
  • metadata (str) – Path to a CSV or TSV metadata file

  • metadata_delimiters (list of str) – List of possible delimiters to check for between columns in the metadata.

  • fasta (str) – Path to a plain or gzipped FASTA file

  • seq_id_column (str) – The column in the metadata file that contains the sequence id for matching sequences

  • seq_field (str, optional) – The field name to use for the sequence in the updated record

  • unmatched_reporting (DataErrorMethod, optional) – How should unmatched records be reported

  • duplicate_reporting (DataErrorMethod, optional) – How should duplicate records be reported

Yields:

dict – The parsed metadata record with the sequence
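The matching rule described above (join on the id, ignore the rest of the FASTA header) can be sketched without pyfastx. Everything here is an illustrative assumption: parse_fasta and join_records are hypothetical helpers, the FASTA is read from a string instead of an indexed file, and only the "fail on the first unmatched metadata record" behavior is modeled.

```python
def parse_fasta(text):
    """Map each sequence id (first word of the header) to its sequence."""
    sequences = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Only the id is kept; any description after it is ignored.
            current_id = line[1:].split()[0]
            sequences[current_id] = ""
        elif current_id is not None:
            sequences[current_id] += line
    return sequences


def join_records(records, sequences, seq_id_column, seq_field="sequence"):
    """Yield each record updated with its sequence; fail on the first miss."""
    for record in records:
        seq_id = record[seq_id_column]
        if seq_id not in sequences:
            raise ValueError(f"No sequence found for metadata record {seq_id!r}")
        yield {**record, seq_field: sequences[seq_id]}


fasta_text = ">A some description\nACGT\nAC\n>B\nGGTT\n"
metadata_rows = [{"strain": "A", "country": "X"}, {"strain": "B", "country": "Y"}]
joined = list(join_records(metadata_rows, parse_fasta(fasta_text), "strain"))
```

Because join_records is a generator, sequences that never appear in the metadata can only be detected once iteration has finished, which matches the asymmetry noted for ERROR_FIRST.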

augur.io.metadata.read_table_to_dict(table, delimiters, duplicate_reporting=DataErrorMethod.ERROR_FIRST, id_column=None)

Read rows from table file and yield each row as a single dict.

Will report duplicate records based on the id_column if requested via duplicate_reporting after the generator has been exhausted.

Parameters:
  • table (str) – Path to a CSV or TSV file or IO buffer

  • delimiters (list of str) – List of possible delimiters to check for between columns in the metadata. Only one delimiter will be inferred.

  • duplicate_reporting (DataErrorMethod, optional) – How should duplicate records be reported

  • id_column (str, optional) – Name of the column that contains the record identifier used for reporting duplicates. Uses the first column of the metadata if not provided.

Yields:

dict – The parsed row as a single record

Raises:

AugurError – Raised for any of the following reasons:

  1. There are parsing errors from the csv standard library
  2. The provided id_column does not exist in the metadata
  3. The duplicate_reporting method is set to ERROR_FIRST or ERROR_ALL and duplicate(s) are found
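Reporting duplicates only after the generator is exhausted can look like the following sketch. It is an assumption-laden illustration, not Augur's code: read_rows is a hypothetical helper, ValueError stands in for AugurError, and only the "collect all duplicates, then raise at the end" behavior is shown.

```python
import csv
import io


def read_rows(handle, delimiter="\t", id_column=None):
    """Yield rows as dicts, then raise if any id was seen more than once."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    if id_column is None:
        # Fall back to the first column, as described for id_column.
        id_column = reader.fieldnames[0]
    elif id_column not in reader.fieldnames:
        raise ValueError(f"The id column {id_column!r} does not exist")
    seen, duplicates = set(), set()
    for row in reader:
        record_id = row[id_column]
        if record_id in seen:
            duplicates.add(record_id)
        seen.add(record_id)
        yield row
    # Reached only once the caller has consumed every row.
    if duplicates:
        raise ValueError(f"Duplicate records: {sorted(duplicates)}")


table = io.StringIO("strain\tdate\nA\t2020\nA\t2021\nB\t2022\n")
yielded = []
message = None
try:
    for row in read_rows(table):
        yielded.append(row["strain"])
except ValueError as error:
    message = str(error)
```

Note that in this sketch every row, including the duplicate, has already been handed to the caller by the time the error is raised.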

augur.io.metadata.write_records_to_tsv(records, output_file)

Write each record from records as a single row to a TSV output_file. Uses the keys of the first record as output column names. Ignores extra keys in other records. If a record is missing a key, the corresponding value is written as an empty string.

Parameters:
  • records (iterable of dict) – Iterator that yields dicts, each of which contains sequences

  • output_file (str) – Path to the output TSV file. Accepts '-' to output TSV to stdout.
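The column rules described above map closely onto csv.DictWriter options, as the sketch below shows. It is not Augur's implementation: write_tsv is a hypothetical helper, it writes to an open handle instead of a path, and returning silently on empty input is a choice made here for brevity.

```python
import csv
import io


def write_tsv(records, handle):
    """Write records as TSV rows, using the first record's keys as columns."""
    records = iter(records)
    try:
        first = next(records)
    except StopIteration:
        # Assumption for this sketch: nothing to write, so do nothing.
        return
    writer = csv.DictWriter(
        handle,
        fieldnames=list(first),
        delimiter="\t",
        restval="",             # missing keys become empty strings
        extrasaction="ignore",  # keys not in the first record are dropped
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerow(first)
    writer.writerows(records)


buffer = io.StringIO()
write_tsv(
    [{"strain": "A", "date": "2020"}, {"strain": "B", "extra": "x"}],
    buffer,
)
output = buffer.getvalue()
```

In the second record, the missing date is written as an empty field and the extra key is dropped, so the output has exactly the two columns taken from the first record.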