augur merge
Merge two or more datasets into one.
Datasets can consist of metadata and/or sequence files. If both are provided, the order and file contents are used independently.
Metadata
Metadata tables must be given unique names to identify them in the output and are merged in the order given.
Rows are joined by id (e.g. "strain" or "name" or other --metadata-id-columns), and ids must be unique within an input table (i.e. tables cannot contain duplicate ids). All rows are output, even if they appear in only a single table (i.e. a full outer join in SQL terms).
Columns are combined by name, either extending the combined table with a new column or overwriting values in an existing column. For columns appearing in more than one table, non-empty values on the right-hand side overwrite values on the left-hand side. The first table's id column name is used as the output id column name. Non-id columns in other input tables that would conflict with this output id column name are not allowed and if present will cause an error.
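The join rules above can be illustrated with a small in-memory sketch. This is not how augur implements the merge (the real command works on files and is not limited by memory); the function name `merge_tables`, its argument shapes, and the output row order are assumptions made purely for illustration.

```python
def merge_tables(tables, id_columns):
    """Illustrative full outer join over in-memory tables.

    tables:     list of (table_name, rows) pairs, in merge order;
                each row is a dict of column name -> string value.
    id_columns: dict of table_name -> name of that table's id column.
    Returns (columns, rows). Row order here is an assumption of this sketch.
    """
    # The first table's id column name becomes the output id column name.
    output_id = id_columns[tables[0][0]]
    columns = [output_id]
    merged = {}  # id -> output row

    for name, rows in tables:
        id_col = id_columns[name]
        seen = set()
        for row in rows:
            row_id = row[id_col]
            if row_id in seen:
                raise ValueError(f"duplicate id {row_id!r} in table {name!r}")
            seen.add(row_id)

            out = merged.setdefault(row_id, {output_id: row_id})
            for col, value in row.items():
                if col == id_col:
                    continue
                if col == output_id:
                    # A non-id column may not reuse the output id column name.
                    raise ValueError(f"column {col!r} in table {name!r} conflicts with the output id column")
                if col not in columns:
                    columns.append(col)
                if value != "":
                    # Non-empty values from later tables overwrite earlier ones.
                    out[col] = value

    return columns, [{c: r.get(c, "") for c in columns} for r in merged.values()]
```

In this sketch an empty string stands for an empty value, so a blank cell in a later table leaves the earlier value in place.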
One generated column per input table may be optionally appended to the end of the output table to identify the source of each row's data. Column names are generated with the template given to --source-columns where "{NAME}" in the template is replaced by the table name given to --metadata. Values in each column are 1 or 0 for present or absent in that input table. By default no source columns are generated. You may make this behaviour explicit with --no-source-columns.
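As a rough sketch of how the source columns described above relate to the template, the following assumes a hypothetical helper `source_column_values` and a hypothetical template `in_{NAME}`; neither comes from augur itself.

```python
def source_column_values(template, ids_by_table):
    """Return, for each id, a dict of generated column name -> 1 or 0.

    template:     string containing the literal placeholder {NAME}.
    ids_by_table: dict of table name -> list of ids present in that table.
    """
    if "{NAME}" not in template:
        raise ValueError("template must contain the literal placeholder {NAME}")

    # One generated column per input table, named from the template.
    column_names = {name: template.replace("{NAME}", name) for name in ids_by_table}
    id_sets = {name: set(ids) for name, ids in ids_by_table.items()}

    # Collect every id once, in first-seen order (order is an assumption here).
    all_ids = []
    seen = set()
    for ids in ids_by_table.values():
        for row_id in ids:
            if row_id not in seen:
                seen.add(row_id)
                all_ids.append(row_id)

    return {
        row_id: {
            column_names[name]: 1 if row_id in id_sets[name] else 0
            for name in ids_by_table
        }
        for row_id in all_ids
    }
```

With tables named `a` and `b` and the template `in_{NAME}`, the generated columns would be `in_a` and `in_b`.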
Metadata tables of arbitrary size can be handled, limited only by available disk space. Tables are not required to be entirely loadable into memory. The transient disk space required is approximately the sum of the uncompressed size of the inputs.
SQLite is used behind the scenes to implement the merge, but this should be considered an implementation detail that may change in the future. The SQLite 3 CLI, sqlite3, must be available. If it's not on PATH (or you want to use a version different from what's on PATH), set the SQLITE3 environment variable to the path of the desired sqlite3 executable.
Sequences
Sequence files are unnamed and are merged in the order given. A sequence id with more than one entry within a single sequence file results in an error. A sequence id with entries in more than one sequence file is resolved by keeping the entry from the rightmost file in the given order.
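The duplicate-handling rules for sequences can be sketched in memory as follows. The function name `merge_sequences` and the representation of a file as a list of `(id, sequence)` pairs are assumptions for illustration; the output order in particular is not specified by the text above.

```python
def merge_sequences(files):
    """Illustrative merge of sequence records from several files.

    files: list of files in merge order; each file is a list of
           (sequence_id, sequence) pairs.
    Returns a dict of sequence_id -> sequence.
    """
    merged = {}
    for index, records in enumerate(files):
        seen = set()
        for seq_id, sequence in records:
            if seq_id in seen:
                # Duplicates within one file are an error.
                raise ValueError(f"duplicate id {seq_id!r} in file #{index + 1}")
            seen.add(seq_id)
            # Duplicates across files: the rightmost file wins.
            merged[seq_id] = sequence
    return merged
```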
SeqKit is used behind the scenes to implement the merge, but this should be considered an implementation detail that may change in the future. The CLI program seqkit must be available. If it's not on PATH (or you want to use a version different from what's on PATH), set the SEQKIT environment variable to the path of the desired seqkit executable.
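To check ahead of time which executable would be picked up, a lookup along the following lines may help. It is a sketch that assumes the environment variable (SQLITE3 or SEQKIT) takes precedence over PATH, as the text suggests; the helper name `find_executable` is hypothetical and this is not augur's own lookup code.

```python
import os
import shutil


def find_executable(env_var, program, environ=None):
    """Return the path named by env_var if set, else search PATH for program.

    Returns None when neither yields an executable.
    """
    environ = os.environ if environ is None else environ

    # Assumption: an explicit override wins over whatever is on PATH.
    override = environ.get(env_var)
    if override:
        return override

    return shutil.which(program, path=environ.get("PATH"))


if __name__ == "__main__":
    print("sqlite3:", find_executable("SQLITE3", "sqlite3"))
    print("seqkit:", find_executable("SEQKIT", "seqkit"))
```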
usage: augur merge [-h] [--metadata NAME=FILE [NAME=FILE ...]]
                   [--metadata-id-columns [TABLE=]COLUMN [[TABLE=]COLUMN ...]]
                   [--metadata-delimiters [TABLE=]CHARACTER [[TABLE=]CHARACTER ...]]
                   [--sequences FILE [FILE ...]]
                   [--skip-input-sequences-validation]
                   [--output-metadata FILE] [--source-columns TEMPLATE]
                   [--no-source-columns] [--output-sequences FILE] [--quiet]
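For scripted use, the usage line above can be assembled into an argument list. The sketch below only uses flags shown in the usage; the helper name `build_merge_command` and all file names in the example are hypothetical.

```python
def build_merge_command(metadata=None, sequences=None, output_metadata=None,
                        output_sequences=None, source_columns=None, quiet=False):
    """Build an argv list for `augur merge` from the documented options.

    metadata:  dict of table name -> file path, in merge order.
    sequences: list of FASTA file paths, in merge order.
    """
    cmd = ["augur", "merge"]
    if metadata:
        cmd.append("--metadata")
        cmd.extend(f"{name}={path}" for name, path in metadata.items())
    if sequences:
        cmd.append("--sequences")
        cmd.extend(sequences)
    if output_metadata:
        cmd.extend(["--output-metadata", output_metadata])
    if source_columns:
        cmd.extend(["--source-columns", source_columns])
    if output_sequences:
        cmd.extend(["--output-sequences", output_sequences])
    if quiet:
        cmd.append("--quiet")
    return cmd


if __name__ == "__main__":
    # Hypothetical file names; print the command rather than running it.
    print(build_merge_command(
        metadata={"a": "a.tsv", "b": "b.tsv"},
        sequences=["a.fasta", "b.fasta"],
        output_metadata="merged.tsv",
        output_sequences="merged.fasta",
    ))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a system where augur is installed.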
inputs
options related to input
- --metadata
Metadata table names and file paths. Names are arbitrary monikers used solely for referring to the associated input file in other arguments and in output column names. Paths must be to seekable files, not unseekable streams. Compressed files are supported.
- --metadata-id-columns
Possible metadata column names containing identifiers, considered in the order given. Columns will be considered for all metadata tables by default. Table-specific column names may be given using the same names assigned in --metadata. Only one ID column will be inferred for each table. (default: strain name)
Default:
('strain', 'name')
- --metadata-delimiters
Possible field delimiters to use for reading metadata tables, considered in the order given. Delimiters will be considered for all metadata tables by default. Table-specific delimiters may be given using the same names assigned in --metadata. Only one delimiter will be inferred for each table. (default: , $'\t')
Default:
(',', '\t')
- --sequences
Sequence files in FASTA format. Compressed files are supported.
- --skip-input-sequences-validation
Skip validation of --sequences (checking for no duplicates) to improve run time. Note that this may result in unexpected behavior in cases where validation would fail.
Default:
False
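The `[TABLE=]COLUMN` form accepted by --metadata-id-columns can be pictured with the sketch below. How augur combines table-specific and unscoped values internally is not stated here, so this sketch simply keeps them in the order given, which is an assumption; the function names `parse_specs` and `infer_id_column` are hypothetical.

```python
def parse_specs(specs, table_names):
    """Split [TABLE=]VALUE specs into an ordered candidate list per table.

    Unscoped values apply to every table; TABLE=VALUE applies to one table.
    Assumption: a prefix that is not a known table name is treated as part
    of an unscoped value.
    """
    candidates = {name: [] for name in table_names}
    for spec in specs:
        table, sep, value = spec.partition("=")
        if sep and table in candidates:
            candidates[table].append(value)
        else:
            for name in table_names:
                candidates[name].append(spec)
    return candidates


def infer_id_column(header, candidates):
    """Return the first candidate present in the table header."""
    for column in candidates:
        if column in header:
            return column
    raise ValueError(f"none of {candidates!r} found in header {header!r}")
```

For example, with tables `a` and `b` and the values `strain b=accession name`, table `b` would try `strain`, then `accession`, then `name` under this sketch's ordering.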
outputs
options related to output
- --output-metadata
Merged metadata as TSV. Compressed files are supported.
- --source-columns
Template with which to generate names for the columns (described above) identifying the source of each row's data. Must contain a literal placeholder, {NAME}, which stands in for the metadata table names assigned in --metadata. (default: disabled)
- --no-source-columns
Suppress generated columns (described above) identifying the source of each row's data. This is the default behaviour, but it may be made explicit or used to override a previous --source-columns.
- --output-sequences
Merged sequences as FASTA. Compressed files are supported.
- --quiet
Suppress informational and warning messages normally written to stderr. (default: disabled)
Default:
False