Running an ingest workflow
This tutorial uses the Nextstrain CLI to help you get started running ingest workflows. Ingest workflows download public data from NCBI and output ingest datasets, which include curated metadata and sequences that can be used as input for phylogenetic or Nextclade workflows.
Note
You only need to run an ingest workflow if you do not want to use the data files already publicly hosted by Nextstrain. Individual pathogen repositories include documentation that links to their data files.
In this tutorial, you will run the ingest workflow of our Zika repository and view the outputs on your computer. By the end, you will have a basic understanding of how to run ingest workflows for other pathogens and a foundation for understanding how to customize ingest workflows.
Prerequisites
Install Nextstrain. These instructions will install all of the software you need to complete this tutorial.
Download the Zika repository
All pathogen ingest workflows are stored in pathogen repositories (version-controlled folders) to track changes over time. Download the Zika repository.
$ git clone https://github.com/nextstrain/zika
Cloning into 'zika'...
[...more output...]
When it's done, you'll have a new directory called zika/.
Run the default workflow
The zika ingest workflow uses the NCBI Datasets CLI tools to download public data and uses a combination of augur curate and other data manipulation tools to curate the downloaded data into a format suitable for phylogenetic workflows.
Change directory to the Zika pathogen repository downloaded in the previous step.
$ cd zika
Run the default ingest workflow with the Nextstrain CLI.
$ nextstrain build ingest
Using profile profiles/default and workflow specific profile profiles/default for setting default command line arguments.
Building DAG of jobs...
[...a lot of output...]
This should take just a few minutes to complete. There should be two final output files:
ingest/results/metadata.tsv
ingest/results/sequences.fasta
The output files should have the same data formats as the public data files hosted by Nextstrain, available at:
https://data.nextstrain.org/files/workflows/zika/metadata.tsv.zst
https://data.nextstrain.org/files/workflows/zika/sequences.fasta.zst
Your results may have additional records depending on whether new data has been released since the public data files were last uploaded.
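To check that the two output files line up, you can count the records in each. The following is a minimal sketch using only the Python standard library. The helper names count_tsv_records and count_fasta_records are illustrative and are not part of the Nextstrain tooling.

```python
import csv


def count_tsv_records(path):
    """Return (column names, number of data rows) for a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE treats quote characters as ordinary text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, [])
        n_rows = sum(1 for row in reader if row)
    return header, n_rows


def count_fasta_records(path):
    """Count FASTA records by counting the lines that start with '>'."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

From the zika/ directory, count_tsv_records("ingest/results/metadata.tsv") returns the column names and the number of metadata records, and count_fasta_records("ingest/results/sequences.fasta") returns the number of sequences. If the two counts differ, that is worth investigating before you use the files in a downstream workflow.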
Configuring the ingest workflow
Now that you've seen the default outputs of the ingest workflow, you can try configuring the ingest workflow to change the outputs.
Inspecting the uncurated metadata
If you want to see the uncurated NCBI Datasets data to decide what changes you would like to make to the workflow, you can download the uncurated NCBI data.
Hint
These commands are very similar to the commands run by the ingest workflow with some minor differences.
The ingest workflow restricts the columns to those defined in config["ncbi_datasets_fields"] and keeps the header names as the more computer-friendly "Mnemonic" of the NCBI Datasets' available fields.
Enter an interactive Nextstrain shell to be able to run the NCBI Datasets CLI commands without installing them separately.
$ nextstrain shell .
Create the ingest/data directory if it doesn't already exist.
$ mkdir -p ingest/data
Download the dataset with the pathogen NCBI taxonomy ID.
$ datasets download virus genome taxon <taxon-id> \
--filename ingest/data/ncbi_dataset.zip
Extract and format the metadata as a TSV file for easy inspection.
$ dataformat tsv virus-genome \
--package ingest/data/ncbi_dataset.zip \
> ingest/data/raw_metadata.tsv
Exit the Nextstrain shell to return to your usual shell environment.
$ exit
The produced ingest/data/raw_metadata.tsv will contain all of the fields available from NCBI Datasets.
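When deciding which raw fields are worth adding to the workflow, it helps to know which columns are actually populated. The following is a minimal sketch, assuming the raw file is a plain tab-separated table with a header row. The function name summarize_columns is hypothetical.

```python
import csv
from collections import Counter


def summarize_columns(path):
    """Map each column name to the number of rows with a non-empty value."""
    filled = Counter()
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        columns = reader.fieldnames or []
        for row in reader:
            for column in columns:
                # Short rows yield None for missing fields, hence the `or ""`.
                if (row.get(column) or "").strip():
                    filled[column] += 1
    # Preserve the column order of the file; unseen columns count as 0.
    return {column: filled[column] for column in columns}
```

Calling summarize_columns("ingest/data/raw_metadata.tsv") returns a dictionary in file column order. Columns with a count of zero are unlikely to be useful additions to the curated output.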
Updating the workflow config
We'll walk through an example custom config to include an additional column in the curated output.
For example, examining the raw NCBI metadata shows us that virus-name is an NCBI Datasets field that is not currently downloaded by the default Zika ingest workflow.
If you wanted this field to be included in your outputs, you could perform the following steps.
Create a new build config directory ingest/build-configs/tutorial/.
$ mkdir ingest/build-configs/tutorial
Copy the default config to ingest/build-configs/tutorial/config.yaml.
$ cp ingest/defaults/config.yaml ingest/build-configs/tutorial/config.yaml
Modify the config parameters within your new custom config ingest/build-configs/tutorial/config.yaml.
Add virus-name to the ncbi_datasets_fields to make the workflow parse the column from the downloaded NCBI data.
Update the curate.field_map with an entry for the new field to match the underscore naming scheme of column names.
curate:
  field_map:
    virus-name: virus_name
Add virus_name to the curate.metadata_columns to configure the workflow to include the new column in the final output file.
(Optional) Remove any other config parameters that you are not modifying.
Note
Config parameters that are dictionaries will merge with the parameters defined in ingest/defaults/config.yaml, while all other types will overwrite the default.
See Snakemake documentation for more details on how configuration files work.
All config parameters available are listed in the ingest/defaults/config.yaml file.
Any of the config parameters can be overridden in a custom config file.
Run the ingest workflow again with the custom config file.
$ nextstrain build ingest --configfile build-configs/tutorial/config.yaml --forceall
Using profile profiles/default and workflow specific profile profiles/default for setting default command line arguments.
Config file defaults/config.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
[...a lot of output...]
Inspect the new ingest/results/metadata.tsv to see that it now includes the additional virus_name column.
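The note above about dictionaries merging while other types overwrite can be illustrated with a small sketch. This is not Snakemake's own code. It is an approximation that assumes nested dictionaries are merged key by key, and the config values shown are hypothetical, heavily trimmed stand-ins for the real ingest/defaults/config.yaml.

```python
def merge_config(defaults, overrides):
    """Return a new config: dictionaries merge key by key, other values are replaced."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dictionaries: merge them recursively.
            merged[key] = merge_config(merged[key], value)
        else:
            # Lists, strings, numbers, etc. replace the default outright.
            merged[key] = value
    return merged


# Hypothetical configs for illustration only.
defaults = {
    "ncbi_datasets_fields": ["accession", "sourcedb"],
    "curate": {
        "field_map": {"accession": "accession"},
        "metadata_columns": ["accession"],
    },
}
custom = {
    "ncbi_datasets_fields": ["accession", "virus-name"],
    "curate": {"field_map": {"virus-name": "virus_name"}},
}

merged = merge_config(defaults, custom)
print(merged["ncbi_datasets_fields"])
print(merged["curate"]["field_map"])
```

In this sketch the list ncbi_datasets_fields is replaced as a whole, so a custom config must repeat every field it still wants, while the field_map dictionary only needs the new entry. Under these assumptions, that is why list parameters such as ncbi_datasets_fields and curate.metadata_columns should keep their full contents in the custom config, and only untouched parameters can safely be removed.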
Advanced usage: Customizing the ingest workflow
Note
This section of the tutorial requires an understanding of Snakemake workflows.
In addition to configuring the ingest workflow, it is also possible to extend the ingest workflow with your own custom steps. We'll walk through an example customization that joins additional metadata to the public data that you've curated in the previous steps.
Create an additional metadata file ingest/build-configs/tutorial/additional-metadata.tsv
genbank_accession column_A column_B column_C
AF013415 AAAAA BBBBB CCCCC
AF372422 AAAAA BBBBB CCCCC
AY326412 AAAAA BBBBB CCCCC
AY632535 AAAAA BBBBB CCCCC
EU303241 AAAAA BBBBB CCCCC
EU074027 AAAAA BBBBB CCCCC
EU545988 AAAAA BBBBB CCCCC
NC_012532 AAAAA BBBBB CCCCC
DQ859059 AAAAA BBBBB CCCCC
JN860885 AAAAA BBBBB CCCCC
Create a new rules file ingest/build-configs/tutorial/merge-metadata.smk
rule merge_metadata:
    input:
        metadata="results/metadata.tsv",
        additional_metadata="build-configs/tutorial/additional-metadata.tsv",
    output:
        merged_metadata="results/merged-metadata.tsv"
    shell:
        """
        tsv-join -H \
            --filter-file {input.additional_metadata} \
            --key-fields "genbank_accession" \
            --append-fields "*" \
            --write-all "?" \
            {input.metadata} > {output.merged_metadata}
        """
This rule uses tsv-join to merge the additional metadata with the metadata output from the ingest workflow.
The records will be merged using the genbank_accession column, and all fields from the additional-metadata.tsv file will be appended to the metadata.
Any record in the metadata.tsv that does not have a matching record in the additional-metadata.tsv will have a default ? value in the new columns.
Add the following to the custom config file ingest/build-configs/tutorial/config.yaml
custom_rules:
  - build-configs/tutorial/merge-metadata.smk
The custom_rules config tells the ingest workflow to include your custom rules so that you can run them as part of the workflow.
Run the ingest workflow again with the customized rule.
$ nextstrain build ingest merge_metadata --configfile build-configs/tutorial/config.yaml
Using profile profiles/default and workflow specific profile profiles/default for setting default command line arguments.
Config file defaults/config.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
[...a lot of output...]
Inspect the ingest/results/merged-metadata.tsv file to see that it includes the additional columns column_A, column_B, and column_C.
The records with the genbank_accession listed in the additional-metadata.tsv file should have the placeholder data in the new columns, while other records should have the default ? value.
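The join performed by the merge_metadata rule can be sketched in plain Python to make its behavior concrete. This is an illustration of the logic described above, not a replacement for tsv-join. The function name left_join, the country column, and the accession ZZ999999 are hypothetical. As a simplification, if a key appears more than once in the additional rows, the last occurrence wins.

```python
def left_join(metadata_rows, additional_rows, key="genbank_accession", fill="?"):
    """Append the non-key columns of additional_rows to every metadata row.

    Metadata rows without a matching key get `fill` in the new columns.
    """
    first = additional_rows[0] if additional_rows else {}
    extra_columns = [column for column in first if column != key]
    lookup = {row[key]: row for row in additional_rows}

    merged = []
    for row in metadata_rows:
        match = lookup.get(row[key], {})
        new_row = dict(row)
        for column in extra_columns:
            new_row[column] = match.get(column, fill)
        merged.append(new_row)
    return merged


# Hypothetical rows: one accession is listed in additional-metadata.tsv, one is not.
metadata = [
    {"genbank_accession": "AF013415", "country": "example"},
    {"genbank_accession": "ZZ999999", "country": "example"},
]
additional = [
    {"genbank_accession": "AF013415", "column_A": "AAAAA",
     "column_B": "BBBBB", "column_C": "CCCCC"},
]

result = left_join(metadata, additional)
for row in result:
    print(row)
```

Every metadata record is kept. The first record gains the placeholder values, and the second gains ? in each new column, which mirrors what you should see in merged-metadata.tsv.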
Next steps
Run the Zika phylogenetic workflow with new ingested data as input by running
$ mv ingest/results/* phylogenetic/data/
$ nextstrain build phylogenetic
If you've customized the ingest workflow then you may need to modify the phylogenetic workflow to use the new ingested data file. We are planning to write another tutorial to cover other modifications to your phylogenetic workflow.