Nextclade datasets

Nextclade dataset is a set of input data files required for Nextclade to run the analysis:

  • reference (root) sequence

  • reference tree

  • quality control configuration

  • gene map

  • PCR primers

Dataset might also include example sequence data (to be analyzed).

An instance of a dataset is a directory containing the dataset files.

Datasets names, reference sequences and version tags

There are 3 concepts that are important to understand in order to work with Nextclade datasets:

  1. Dataset name - identifies dataset purpose. Typically indicates name of the pathogen. Examples: sars-cov-2, flu_h3n2_ha. Each dataset is specific to a given virus. For example, a dataset for H1N1 flu is not suitable for analysing SARS-CoV-2 sequences and vice versa. Mixing incompatible datasets and sequences will produce incorrect results.

  2. Dataset’s reference sequence: each dataset can have multiple flavors, depending on the reference sequence it is based on. For example, one sars-cov-2 reference dataset can be based on MN908947 (Wuhan-Hu-1/2019) or reference sequences, and flu_h3n2_ha can be based on CY034116 (A/Wisconsin/67/2005) or other reference sequences. For each dataset name, among all available reference sequences, there is a default reference sequence defined (by dataset maintainers). It is used when no concrete reference sequence is specified. The dataset reference is specified using the corresponding accession ID.

  3. Dataset version and version tag: each reference dataset can have multiple versions. New versions are produced during dataset updates. Datasets are versioned to ensure correctness when running with different versions of Nextclade as well as reproducibility of results. For each reference in dataset there is exactly one latest version. It is used as a default when no version is specified. Version tag is the name unique to a given version.

A combination of (1) name, (2) reference sequence accession, (3) version tag uniquely identifies a downloaded dataset instance. These parameters are described in the file tag.json in the dataset directory.

Using datasets

Datasets in Nextclade Web

Nextclade Web loads the latest compatible datasets automatically. User can choose one of the datasets before starting the analysis using dataset selector.

Datasets in Nextclade CLI

Nextclade CLI implements subcommands allowing to list and to download datasets. This functionality requires internet connection.

List available datasets

The datasets can be listed with the dataset list subcommand:

nextclade dataset list --name sars-cov-2

This will print a list of available datasets to console. More options are available to control listing older and incompatible versions of datasets, as well as specific tags. See: nextclade dataset list --help

Download a dataset

Latest dataset

The datasets can be downloaded with the dataset get subcommand. For example SARS-CoV-2 dataset can be downloaded as follows:

nextclade dataset get --name 'sars-cov-2' --output-dir 'data/sars-cov-2'

The dataset files will be downloaded to the directory data/sars-cov-2 relative to the working directory.

Dataset with a specific reference sequence

You can set a reference sequence of the dataset explicitly, for example to always use MN908947 (Wuhan-Hu-1/2019) for SARS-CoV-2:

nextclade dataset get --name 'sars-cov-2' --reference 'MN908947' --output-dir 'data/sars-cov-2_MN908947'

If using this commands, repeated downloads may produce updated files in the future: after releases of new versions of this dataset. Reference sequence will stay the same even if the SARS-CoV-2 dataset’s default reference sequence changes in the future.

⚠️ We recommend to give descriptive names to dataset directories to avoid confusion. Currently Nextclade cannot verify that a given batch of user-provided sequences is compatible with a given dataset, and it will silently produce incorrect results.

Dataset with a specific reference sequence and version tag

You can set a version tag explicitly. For example to always use the SARS-CoV-2 dataset based on reference sequence MN908947 (Wuhan-Hu-1/2019) and a version released on June 25th 2021 (2021-06-25T00:00:00Z):

nextclade dataset get \
  --name 'sars-cov-2' \
  --reference 'MN908947' \
  --tag '2021-06-25T00:00:00Z' \
  --output-dir 'data/sars-cov-2_MN908947_2021-06-25T00:00:00Z'

In this case repeated downloads will always produce the same files. This is only recommended if you need strictly reproducible results and don’t care about updates. Note that with stale data, new clades and other new features will not be available. For general use, we recommend to periodically download the latest version.

💡️ Nextclade project hosts datasets on a very affordable file hosting, with edge caching. We don’t impose any rate limits. You are free to download these files reasonably often. For example, for a daily automated workflow it is recommended to download a fresh version of the dataset before every run.

Identify already downloaded dataset

Navigate to the dataset directory and find a file named tag.json. It contains information about the dataset: name, reference sequence, version tag and some other parameters.

Run the analysis with the downloaded dataset

The flag --input-dataset can be used to point Nextclade CLI to a dataset directory:

nextclade run \
  --input-dataset 'data/sars-cov-2' \
  --input-fasta 'my_sequences.fasta' \
  --output-tsv 'output/nextclade.tsv' \
  --output-tree 'output/tree.json' \
  --output-dir 'output/'

This will use all the required files from the dataset, so that the individual paths don’t need to be specified explicitly.

If --input-dataset as well as other --input-* flags for individual files are provided, then the individual flags override the corresponding file in the dataset. The remaining files, for which individual flags are not provided are taken from the dataset.

For example, to use a downloaded dataset but to override the reference tree file in it, you could run nextclade as follows:

nextclade run \
  --input-dataset 'datasets/sars-cov-2' \
  --input-fasta 'my_sequences.fasta' \
  --input-tree 'my_tree.json' \
  --output-tsv 'output/nextclade.tsv' \
  --output-tree 'output/tree.json' \
  --output-dir 'output/'

⚠️ When overriding dataset files make sure that the individual files are compatible with the dataset (in particular the pathogen and the reference sequence)

See nextclade run --help for all the flags related to analysis runs.

Run the analysis without the dataset

If the --input-dataset flag is not used, the individual --input-* flags are required for each file.

Dataset versioning and compatibility

When Nextclade software implements new features (for example new QC checks) it might require dataset changes that are incompatible with the previous versions of Nextclade.

Each dataset defines multiple versions, each containing a range of compatible Nextclade versions (separately for Nextclade Web and Nextclade CLI). A particular version of Nextclade can only use a dataset that has matching compatibility range.

Compatibility checks are ensured by default in Nextclade Web and Nextclade CLI when downloading datasets. However, Nextclade CLI users can additionally list and download any dataset version using advanced command-line flags (see nextclade dataset --help).

Creating a custom dataset

You can create a new dataset by creating a directory with the required input files. You can use one of the existing datasets as a starting point and modify its files as needed.

For example, you can create a dataset for the analysis of SARS-CoV-2 clades for a particular region, by making a copy of the default global SARS-CoV-2 dataset and replacing the reference tree file with the one that contains more representative samples that are more relevant for your region.

Online dataset repository

Nextclade team hosts a public file server containing all the dataset file themselves as well as the index file that lists all the datasets, their versions and file URLs. This server is the source of datasets for Nextclade Web and Nextclade CLI.

At this time we do not support the usage of the dataset repository outside of Nextclade. We cannot guarantee stability of the index file format or of the filesystem structure. They can change without notice.

The code and source data for datasets generation is in the GitHub repository: https://github.com/nextstrain/nextclade_data

Dataset updates

Maintainers add new datasets and dataset versions periodically to the online dataset repository, taking care to ensure compatibility with various versions of Nextclade software in use.

A dataset is uniquely identified by its name, e.g. sars-cov-2 or flu_vic_ha.

A version of a given dataset is uniquely identified by:

  • the name of the dataset it belongs to, e.g. sars-cov-2 or flu_vic_ha

  • the version tag, e.g. 2021-06-20T00:00:00Z

The dataset version tags are immutable: once a tag released the data for that tag stays the same, and downloads of this specific tag produce the same set of files.

If you need reproducible results, you should:

  • “freeze” the version of Nextclade CLI, that is keep the same version of Nextclade CLI across runs (check nextclade --version)

  • “freeze” the version tag of the dataset, that is keep the same dataset directory across runs or to redownload it with the specific --tag.

Nextclade Web always uses the latest versions of datasets available at the moment of loading the main page (reload the page for updates).