augur merge

Merge two or more datasets into one.

Datasets can consist of metadata and/or sequence files. If both are provided, they are merged independently: the order and contents of the metadata inputs have no effect on the sequence merge, and vice versa.

Metadata

Metadata tables must be given unique names to identify them in the output and are merged in the order given.

Rows are joined by id (e.g. “strain” or “name” or other --metadata-id-columns), and ids must be unique within an input table (i.e. tables cannot contain duplicate ids). All rows are output, even if they appear in only a single table (i.e. a full outer join in SQL terms).

Columns are combined by name, either extending the combined table with a new column or overwriting values in an existing column. For columns appearing in more than one table, non-empty values on the right-hand side overwrite values on the left-hand side. The first table’s id column name is used as the output id column name. A non-id column in any other input table whose name matches this output id column name is not allowed and causes an error.
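The join and overwrite rules above can be sketched in a few lines of Python. This is only an illustration of the described semantics on in-memory rows, not how augur implements the merge (which goes through SQLite); the function name `merge_tables` and the sample column names are hypothetical, and the output row order here is simply first-seen order, which the text above does not specify.

```python
def merge_tables(tables, id_columns):
    """Merge tables (lists of row dicts) in the order given.

    id_columns[i] is the id column name of tables[i]. The first
    table's id column name becomes the output id column name.
    """
    out_id = id_columns[0]
    columns = [out_id]
    merged = {}  # id -> output row; dicts keep insertion order
    for rows, id_col in zip(tables, id_columns):
        seen = set()
        for row in rows:
            # A non-id column may not reuse the output id column name.
            if id_col != out_id and out_id in row:
                raise ValueError(f"column {out_id!r} conflicts with output id column")
            key = row[id_col]
            if key in seen:
                raise ValueError(f"duplicate id {key!r} within one table")
            seen.add(key)
            # Full outer join: create the row if no earlier table had it.
            target = merged.setdefault(key, {out_id: key})
            for col, value in row.items():
                if col == id_col:
                    continue
                if col not in columns:
                    columns.append(col)
                # Only non-empty values overwrite what is already there.
                if value != "":
                    target[col] = value
    return columns, [{c: r.get(c, "") for c in columns} for r in merged.values()]
```

With a left table holding `B` dated `2021` and a right table holding `B` with an empty date, the merged row for `B` keeps `2021`, because empty values on the right never overwrite values on the left.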

One generated column per input table may be optionally appended to the end of the output table to identify the source of each row’s data. Column names are generated with the template given to --source-columns where “{NAME}” in the template is replaced by the table name given to --metadata. Values in each column are 1 or 0 for present or absent in that input table. By default no source columns are generated. You may make this behaviour explicit with --no-source-columns.
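As a rough illustration of the source columns described above, the following Python sketch expands a template for each table name and fills in 1 or 0 for one row id. The function name, the template `source_{NAME}`, and the table names `X` and `Y` are made up for this example; the actual values are produced by augur itself.

```python
def source_column_values(template, ids_by_table, row_id):
    """Return {generated column name: 1 or 0} for a single output row.

    ids_by_table maps each table name (as given to --metadata) to the
    set of ids present in that table, in the order the tables were given.
    """
    if "{NAME}" not in template:
        raise ValueError("template must contain the literal placeholder {NAME}")
    return {
        template.replace("{NAME}", name): (1 if row_id in ids else 0)
        for name, ids in ids_by_table.items()
    }


# Hypothetical tables X and Y: id "A" appears only in X.
example = source_column_values(
    "source_{NAME}", {"X": {"A", "B"}, "Y": {"B"}}, "A"
)
```

Here `example` maps `source_X` to 1 and `source_Y` to 0.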

Metadata tables of arbitrary size can be handled, limited only by available disk space. Tables are not required to be entirely loadable into memory. The transient disk space required is approximately the sum of the uncompressed sizes of the inputs.

SQLite is used behind the scenes to implement the merge, but this should be considered an implementation detail that may change in the future. The SQLite 3 CLI, sqlite3, must be available. If it’s not on PATH (or you want to use a version different from what’s on PATH), set the SQLITE3 environment variable to the path of the desired sqlite3 executable.

Sequences

Sequence files are unnamed and are merged in the order given. A sequence id with more than one entry within a single sequence file results in an error. A sequence id with entries in more than one sequence file is handled by keeping the entry from the rightmost file, based on the given order.
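The two duplicate rules for sequences can be sketched in Python as follows. This only models the described behaviour on in-memory `(id, sequence)` pairs; it is not the SeqKit-based implementation, the name `merge_sequences` is hypothetical, and the order of entries in the real output is not something this sketch claims to reproduce.

```python
def merge_sequences(files):
    """Merge sequence files given as lists of (id, sequence) pairs.

    Duplicate ids within one file are an error; across files, the
    entry from the rightmost file wins.
    """
    merged = {}
    for index, records in enumerate(files):
        seen = set()
        for seq_id, sequence in records:
            if seq_id in seen:
                raise ValueError(f"duplicate id {seq_id!r} in input file #{index + 1}")
            seen.add(seq_id)
            # Later files overwrite earlier ones for the same id.
            merged[seq_id] = sequence
    return merged
```

For example, if the first file has `B` as `CCCC` and the second has `B` as `GGGG`, the merged result keeps `GGGG`.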

SeqKit is used behind the scenes to implement the merge, but this should be considered an implementation detail that may change in the future. The CLI program seqkit must be available. If it’s not on PATH (or you want to use a version different from what’s on PATH), set the SEQKIT environment variable to the path of the desired seqkit executable.
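The lookup order described for sqlite3 and seqkit (environment variable first, then PATH) can be approximated with a small Python helper, which may be handy for checking a setup before running a merge. The helper name `find_tool` and its `environ` parameter are assumptions for this sketch; augur's own lookup may differ in detail.

```python
import os
import shutil


def find_tool(env_var, default_name, environ=None):
    """Return the executable to use: the env var override, else a PATH lookup."""
    environ = os.environ if environ is None else environ
    override = environ.get(env_var)
    if override:
        return override
    found = shutil.which(default_name)  # None when not on PATH
    if found is None:
        raise FileNotFoundError(
            f"{default_name} not found on PATH; set {env_var} to its path"
        )
    return found
```

Calling `find_tool("SQLITE3", "sqlite3")` or `find_tool("SEQKIT", "seqkit")` then reports which executable would be picked up in the current environment.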

usage: augur merge [-h] [--metadata NAME=FILE [NAME=FILE ...]]
                   [--metadata-id-columns [TABLE=]COLUMN [[TABLE=]COLUMN ...]]
                   [--metadata-delimiters [TABLE=]CHARACTER [[TABLE=]CHARACTER ...]]
                   [--sequences FILE [FILE ...]]
                   [--skip-input-sequences-validation]
                   [--output-metadata FILE] [--source-columns TEMPLATE]
                   [--no-source-columns] [--output-sequences FILE] [--quiet]
                   [--nthreads N]
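When driving the command from a script, the argument list can be assembled programmatically. The sketch below uses only options from the usage summary above; the function name `build_merge_command` and all file and table names in the example are hypothetical.

```python
def build_merge_command(metadata, sequences, output_metadata,
                        output_sequences, source_columns=None):
    """Build an argv list for augur merge.

    metadata maps table names to file paths (NAME=FILE pairs, in order);
    sequences is a list of FASTA paths, also in merge order.
    """
    cmd = ["augur", "merge"]
    if metadata:
        cmd.append("--metadata")
        cmd.extend(f"{name}={path}" for name, path in metadata.items())
    if sequences:
        cmd.append("--sequences")
        cmd.extend(sequences)
    if output_metadata:
        cmd += ["--output-metadata", output_metadata]
    if output_sequences:
        cmd += ["--output-sequences", output_sequences]
    if source_columns:
        cmd += ["--source-columns", source_columns]
    return cmd


# Hypothetical inputs: two tables named X and Y, two FASTA files.
command = build_merge_command(
    {"X": "x.tsv", "Y": "y.tsv"},
    ["x.fasta", "y.fasta"],
    "merged.tsv",
    "merged.fasta",
    source_columns="source_{NAME}",
)
```

The resulting list can be passed to `subprocess.run(command, check=True)`; passing a list rather than a shell string avoids quoting problems with the `{NAME}` placeholder.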

inputs

options related to input

--metadata

Metadata table names and file paths. Names are arbitrary monikers used solely for referring to the associated input file in other arguments and in output column names. Paths must be to seekable files, not unseekable streams. Compressed files are supported.

--metadata-id-columns

Possible metadata column names containing identifiers, considered in the order given. Columns will be considered for all metadata tables by default. Table-specific column names may be given using the same names assigned in --metadata. Only one ID column will be inferred for each table. (default: strain name)

Default: ('strain', 'name')
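One plausible reading of "considered in the order given" is that the first candidate present in a table's header is chosen. The Python sketch below works under that assumption and also assumes that table-specific and general candidates are tried together in a single ordered pass; neither detail is confirmed by the text above. The function name and the table and column names are hypothetical.

```python
def infer_id_column(table_name, header, candidates):
    """Pick the id column for one table from [TABLE=]COLUMN candidates."""
    for candidate in candidates:
        left, sep, right = candidate.partition("=")
        if sep:
            table, column = left, right
            # Table-specific candidates apply only to the named table.
            if table != table_name:
                continue
        else:
            column = left
        if column in header:
            return column
    raise ValueError(f"no id column found for table {table_name!r}")
```

With the default candidates `strain` and `name`, a table whose header lacks `strain` but has `name` would use `name` as its id column.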

--metadata-delimiters

Possible field delimiters to use for reading metadata tables, considered in the order given. Delimiters will be considered for all metadata tables by default. Table-specific delimiters may be given using the same names assigned in --metadata. Only one delimiter will be inferred for each table. (default: , $'\t')

Default: (',', '\t')

--sequences

Sequence files in FASTA format. Compressed files are supported.

--skip-input-sequences-validation

Skip validation of --sequences (checking for no duplicates) to improve run time. Note that this may result in unexpected behavior in cases where validation would fail.

Default: False

outputs

options related to output

--output-metadata

Merged metadata as TSV. Compressed files are supported.

--source-columns

Template with which to generate names for the columns (described above) identifying the source of each row’s data. Must contain a literal placeholder, {NAME}, which stands in for the metadata table names assigned in --metadata. (default: disabled)

--no-source-columns

Suppress generated columns (described above) identifying the source of each row’s data. This is the default behaviour, but it may be made explicit or used to override a previous --source-columns.

--output-sequences

Merged sequences as FASTA. Compressed files are supported.

--quiet

Suppress informational and warning messages normally written to stderr. (default: disabled)

Default: False

other

other options

--nthreads

Number of CPUs/cores/threads/jobs to utilize at once.

Default: 1