augur.merge module

Merge datasets into one.

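Datasets can consist of metadata and/or sequence files. If both are provided, metadata and sequences are merged independently of each other: the order and contents of one kind of input have no effect on how the other kind is merged.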

Metadata

Metadata tables must be given unique names to identify them in the output and are merged in the order given.

Rows are joined by id (e.g. “strain” or “name” or other --metadata-id-columns), and ids must be unique within an input table (i.e. tables cannot contain duplicate ids). All rows are output, even if they appear in only a single table (i.e. a full outer join in SQL terms).

Columns are combined by name, either extending the combined table with a new column or overwriting values in an existing column. For columns appearing in more than one table, non-empty values on the right hand side overwrite values on the left hand side. The first table’s id column name is used as the output id column name, unless overridden with --output-metadata-id-column. Non-id columns in other input tables that would conflict with this output id column name are not allowed and if present will cause an error.
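
The join and overwrite rules above can be illustrated with a small in-memory sketch. It is not how augur implements the merge (that is done with SQLite, on disk). The function name merge_tables_sketch, the example data, and the assumption that every table uses the same id column name are all hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def merge_tables_sketch(tables: List[List[Row]], id_column: str = "strain") -> List[Row]:
    """Full outer join on id_column; later non-empty values overwrite earlier ones."""
    merged: Dict[str, Row] = {}
    columns: List[str] = [id_column]
    for table in tables:
        seen_ids = set()
        for row in table:
            row_id = row[id_column]
            # Ids must be unique within a single input table.
            if row_id in seen_ids:
                raise ValueError(f"duplicate id {row_id!r} within one table")
            seen_ids.add(row_id)
            out = merged.setdefault(row_id, {id_column: row_id})
            for column, value in row.items():
                if column not in columns:
                    columns.append(column)  # a new column extends the output
                if value != "":
                    out[column] = value  # only non-empty values overwrite
    # Every id is kept, with empty strings for columns it never had a value in.
    return [{c: row.get(c, "") for c in columns} for row in merged.values()]


first = [
    {"strain": "A", "country": "USA", "date": "2020-01-01"},
    {"strain": "B", "country": "Peru", "date": ""},
]
second = [
    {"strain": "B", "country": "", "date": "2021-05-05", "clade": "21A"},
    {"strain": "C", "country": "Kenya", "date": "2022-03-01", "clade": ""},
]
merged_rows = merge_tables_sketch([first, second])
```

In this example, row B keeps country "Peru" from the first table because the second table's value is empty, takes its date from the second table, and gains the new clade column. Rows A and C appear in the output even though each is present in only one table.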

One generated column per input table may be optionally appended to the end of the output table to identify the source of each row’s data. Column names are generated with the template given to --source-columns where “{NAME}” in the template is replaced by the table name given to --metadata. Values in each column are 1 or 0 for present or absent in that input table. By default no source columns are generated. You may make this behaviour explicit with --no-source-columns.
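
As a rough illustration of how source columns relate to the template and table names, the sketch below computes the 1/0 values for a set of ids. The helper name source_columns_sketch, the template "source_{NAME}", and the table names X and Y are made up for the example. Only the "{NAME}" placeholder and the 1/0 values come from the description above.

```python
from typing import Dict, List, Set


def source_columns_sketch(template: str, tables: Dict[str, Set[str]]) -> Dict[str, Dict[str, int]]:
    """Map each id to its generated source columns (1 = present, 0 = absent)."""
    all_ids: List[str] = []
    for ids in tables.values():
        for row_id in sorted(ids):
            if row_id not in all_ids:
                all_ids.append(row_id)
    return {
        row_id: {
            # "{NAME}" in the template is replaced by the table name.
            template.replace("{NAME}", name): int(row_id in ids)
            for name, ids in tables.items()
        }
        for row_id in all_ids
    }


sources = source_columns_sketch("source_{NAME}", {"X": {"A", "B"}, "Y": {"B", "C"}})
```

Here id A is marked as present only in table X, id B in both tables, and id C only in table Y.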

Metadata tables of arbitrary size can be handled, limited only by available disk space. Tables are not required to be entirely loadable into memory. The transient disk space required is approximately the sum of the uncompressed size of the inputs.

SQLite is used behind the scenes to implement the merge, but this should be considered an implementation detail that may change in the future. The SQLite 3 CLI, sqlite3, must be available. If it’s not on PATH (or you want to use a version different from what’s on PATH), set the SQLITE3 environment variable to the path of the desired sqlite3 executable.
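
The lookup described above (an environment variable that overrides a program found on PATH) can be sketched as follows. This is a minimal illustration of the pattern, not augur's own code. The function name resolve_executable is hypothetical.

```python
import os
import shutil
from typing import Optional


def resolve_executable(env_var: str, default: str) -> Optional[str]:
    """Return the executable named by env_var if set, else look up default on PATH.

    Returns None when nothing usable is found.
    """
    candidate = os.environ.get(env_var) or default
    # shutil.which searches PATH for bare names and checks explicit paths directly.
    return shutil.which(candidate)


sqlite3_path = resolve_executable("SQLITE3", "sqlite3")
```

A None result here corresponds to the situation where the program is neither on PATH nor pointed to by the environment variable.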

Sequences

Sequence files are unnamed and are merged in the order given. A sequence id with more than one entry within a single sequence file results in an error. A sequence id with entries in more than one sequence file is handled by keeping the entry from the rightmost file, based on the given order.
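
The sequence rules can be sketched in memory as below, with each file represented as a list of (id, sequence) pairs. The real merge is delegated to SeqKit. The function name merge_sequences_sketch and the example records are hypothetical, and the output order of this sketch is not meant to reflect augur's output order.

```python
from typing import Dict, List, Tuple

Records = List[Tuple[str, str]]


def merge_sequences_sketch(files: List[Records]) -> Dict[str, str]:
    """Keep one entry per id; for ids in several files, the rightmost file wins."""
    merged: Dict[str, str] = {}
    for records in files:
        seen_ids = set()
        for seq_id, sequence in records:
            # Duplicate ids inside a single file are an error.
            if seq_id in seen_ids:
                raise ValueError(f"duplicate sequence id {seq_id!r} within one file")
            seen_ids.add(seq_id)
            merged[seq_id] = sequence  # later files overwrite earlier ones
    return merged


merged_sequences = merge_sequences_sketch([
    [("A", "ACGT"), ("B", "AAAA")],
    [("B", "CCCC"), ("C", "GGGG")],
])
```

In this example, id B takes the sequence "CCCC" from the second (rightmost) file, while A and C are carried through unchanged.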

SeqKit is used behind the scenes to implement the merge, but this should be considered an implementation detail that may change in the future. The CLI program seqkit must be available. If it’s not on PATH (or you want to use a version different from what’s on PATH), set the SEQKIT environment variable to the path of the desired seqkit executable.

class augur.merge.NamedMetadata(name, *args, **kwargs)

Bases: Metadata

name: str: User-provided descriptive name for this metadata file.

table_name: str: Generated SQLite table name for this metadata file, based on name.

exception augur.merge.SQLiteError(proc, *args)

Bases: Exception

Exception raised when a sqlite3 invocation fails. proc stores the failed process, which is useful for retrieving information such as its output.

augur.merge.count_unique(xs)

Return type: Iterable[Tuple[T, int]]

augur.merge.merge_metadata(args)

augur.merge.merge_sequences(args)

augur.merge.pairs(xs)

Split an iterable of k=v strings into an iterable of (k,v) tuples.

Return type: Iterable[Tuple[str, str]]

>>> pairs(["abc=123", "eight nine ten=el em en"])
[('abc', '123'), ('eight nine ten', 'el em en')]

Strings missing a k and/or a v part get an empty string.

>>> pairs(["v", "=v", "k=", "=", ""])
[('', 'v'), ('', 'v'), ('k', ''), ('', ''), ('', '')]

k ends at the first =.

>>> pairs(["abc=123=xyz", "=v=v"])
[('abc', '123=xyz'), ('', 'v=v')]
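
One way to get the behaviour shown in the examples above is to split on the first "=" only, and to treat a string with no "=" as a bare value. The sketch below is a reimplementation for illustration under that reading, not the module's source. The name pairs_sketch is hypothetical.

```python
from typing import List, Tuple


def pairs_sketch(xs: List[str]) -> List[Tuple[str, str]]:
    """Split k=v strings into (k, v) tuples; k ends at the first '='."""
    result = []
    for x in xs:
        if "=" in x:
            k, _, v = x.partition("=")  # partition splits at the first "=" only
        else:
            k, v = "", x  # no "=": the whole string is the value
        result.append((k, v))
    return result
```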

augur.merge.register_parser(parent_subparsers)

augur.merge.run(args)

augur.merge.shquote_humanized(x)

shquote for humans.

Characters that humans would typically write as C-style escapes are emitted using the C-style escapes supported by shells (specifically, Bash) instead of as quoted literals.

<https://www.gnu.org/software/bash/manual/bash.html#ANSI_002dC-Quoting>

>>> shquote_humanized("abc")
'abc'

>>> shquote_humanized("\t")
"$'\\t'"

>>> shquote_humanized("abc def")
"'abc def'"

>>> shquote_humanized("abc\tdef")
"abc$'\\t'def"

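The examples above are consistent with quoting ordinary runs of text with shlex.quote and emitting selected control characters in Bash's $'...' form. The sketch below follows that approach. It is an assumption about the technique rather than the module's source, and both the name shquote_humanized_sketch and the set of escaped characters (tab, newline, carriage return) are chosen for illustration.

```python
import re
import shlex

# Characters rendered as ANSI-C escapes; this particular set is an assumption.
ESCAPES = {"\t": r"\t", "\n": r"\n", "\r": r"\r"}


def shquote_humanized_sketch(x: str) -> str:
    """Quote x for a shell, showing some control characters as $'\\t'-style escapes."""
    # The capturing group keeps the matched control characters in the result list.
    parts = re.split("([\t\n\r])", x)
    quoted = [
        f"$'{ESCAPES[part]}'" if part in ESCAPES else shlex.quote(part)
        for part in parts
        if part
    ]
    # An empty input still needs to be a valid (empty) shell word.
    return "".join(quoted) or shlex.quote("")
```

Adjacent quoted pieces concatenate into a single shell word, which is why the escaped tab can sit directly between the unquoted text on either side.
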
augur.merge.sqlite3(*args, **kwargs): Internal helper for invoking sqlite3, the SQLite CLI program.

augur.merge.sqlite3_table_columns(db_path, table)

Return type: Iterable[str]

augur.merge.sqlite_quote_dot(x)

Quote a SQLite CLI dot-command argument.

<https://sqlite.org/cli.html#dot_command_arguments>

augur.merge.sqlite_quote_id(*xs)

Quote a SQLite identifier.

<https://sqlite.org/lang_keywords.html>

>>> sqlite_quote_id('foo bar')
'"foo bar"'
>>> sqlite_quote_id('table name', 'column name')
'"table name"."column name"'
>>> sqlite_quote_id('weird"name')
'"weird""name"'

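The outputs above match the standard SQL rule for quoted identifiers: wrap each part in double quotes, double any embedded double quote, and join the parts with a dot. A minimal sketch of that rule follows. The name sqlite_quote_id_sketch is hypothetical and the code is not taken from the module.

```python
def sqlite_quote_id_sketch(*xs: str) -> str:
    """Quote one or more identifier parts and join them with dots."""
    return ".".join('"' + x.replace('"', '""') + '"' for x in xs)
```
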
augur.merge.sqlite_quote_string(x)

Quote a SQLite string (i.e. produce a string literal).

<https://www.sqlite.org/lang_expr.html#literal_values_constants_>

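In SQL, a string literal is written in single quotes, with any embedded single quote doubled. The sketch below applies that rule. It is an illustration of the quoting convention, not the module's source, and the name sqlite_quote_string_sketch is hypothetical.

```python
def sqlite_quote_string_sketch(x: str) -> str:
    """Produce a SQL string literal: single-quoted, with embedded quotes doubled."""
    return "'" + x.replace("'", "''") + "'"


literal = sqlite_quote_string_sketch("it's a test")
```

With this sketch, the input it's a test becomes the literal 'it''s a test'.
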
augur.merge.validate_arguments(args)