Input files
This section describes input files and their expected formats as well as how they are used in Nextclade Web, Nextclade CLI and Nextalign CLI.
Sequence data
A set of viral nucleotide sequences to be analyzed. Also referred to as Query sequences.
Nextclade Web (simple and advanced modes): accepted in the “Sequences” drag & drop box. A remote URL is also accepted in the input-fasta URL parameter.
Nextclade CLI and Nextalign CLI accept FASTA inputs as one or more positional arguments. Plain and compressed FASTA files are accepted. A compressed FASTA file is decompressed transparently. Supported compression formats: gz, bz2, xz, zstd. The decompressor is chosen based on the file extension. If there are multiple input files, different files can use different compression formats. If no positional arguments are provided, plain FASTA input is read from standard input (stdin).
Accepted formats: FASTA or plain text (one sequence per line).
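The extension-based choice of decompressor described above can be illustrated with a minimal Python sketch. This is not Nextclade's implementation: the names OPENERS, open_fasta and read_fasta are hypothetical, only the gz, bz2 and xz formats are covered (zstd is left out to keep the sketch to the gzip, bz2 and lzma standard-library modules), and the plain-text, one-sequence-per-line input is not handled.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# Hypothetical mapping from file extension to an opener function.
# Files with any other extension are treated as plain text.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_fasta(path):
    """Open a plain or compressed FASTA file in text mode, chosen by extension."""
    opener = OPENERS.get(Path(path).suffix.lower(), open)
    return opener(path, "rt")


def read_fasta(path):
    """Return a list of (name, sequence) tuples read from a FASTA file."""
    records, name, chunks = [], None, []
    with open_fasta(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if name is not None:
                    records.append((name, "".join(chunks)))
                name, chunks = line[1:], []
            elif name is not None:
                chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

Because the opener is looked up per file, a list of inputs can mix, for example, a .gz file with an uncompressed one.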
Reference (root) sequence
Viral nucleotide sequence which serves as a reference for alignment and the analysis. Mutations are called relative to the reference sequence. It is expected to be the root of the reference tree. The best results are obtained when the reference sequence is a well-known consensus genome, of a very high quality, preferably complete and unambiguous (spans entire genome and has no ambiguous nucleotides).
Accepted formats: FASTA or plain text. The file is expected to contain only 1 sequence.
Nextclade Web (advanced mode): accepted in the “Root sequence” drag & drop box. A remote URL is also accepted in the input-root-sequence URL parameter.
CLI argument: --input-ref
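The expectations above (a single sequence, preferably without ambiguous nucleotides) can be checked before running an analysis. The following is a minimal sketch, not part of Nextclade: check_reference is a hypothetical helper, and it assumes the reference has already been parsed into (name, sequence) tuples.

```python
def check_reference(records):
    """Return a list of human-readable problems found in a parsed reference.

    `records` is a list of (name, sequence) tuples; an empty result means
    that no problems were found.
    """
    problems = []
    if len(records) != 1:
        problems.append(f"expected exactly 1 sequence, found {len(records)}")
    for name, seq in records:
        # Anything other than A, C, G or T is counted as ambiguous here.
        ambiguous = sum(1 for base in seq.upper() if base not in "ACGT")
        if ambiguous:
            problems.append(f"{name}: {ambiguous} ambiguous or non-ACGT characters")
    return problems
```

A non-empty result does not mean the file will be rejected; it only flags a reference that may give worse results.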
Reference tree
The phylogenetic reference tree which serves as a target for phylogenetic placement (see Algorithm: Phylogenetic placement). Nearest neighbour information is used to assign clades (see Algorithm: Clade Assignment) and to identify private mutations, including reversions.
The tree must be rooted at the sample that matches the reference (root) sequence.
The tree must contain a clade definition for every node (including internal).
The tree must be sufficiently large and diverse to meet the clade assignment expectations of a particular use-case, study or experiment. Only clades present on the reference tree can be assigned to Query sequences.
Nextclade Web (advanced mode): accepted in the “Reference tree” drag & drop box. A remote URL is also accepted in the input-tree URL parameter.
Nextclade CLI flag: --input-tree
Accepted formats: Auspice JSON v2 (description, schema) - this is the same format that is used in Nextstrain. It is produced by Nextstrain Augur and consumed by Nextstrain Auspice. Refer to Nextstrain documentation at https://docs.nextstrain.org on how to build your own trees.
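The requirement that every node carries a clade definition can be checked with a short script. The sketch below is an assumption-laden illustration: it assumes the clade is stored at node_attrs.clade_membership.value in each node of the Auspice JSON v2 "tree" object, and nodes_missing_clade is a hypothetical helper name.

```python
def nodes_missing_clade(root):
    """Yield the names of tree nodes that have no clade definition.

    Assumes the clade is stored under node_attrs.clade_membership.value
    and that child nodes are listed under "children".
    """
    stack = [root]
    while stack:
        node = stack.pop()
        clade = node.get("node_attrs", {}).get("clade_membership", {}).get("value")
        if not clade:
            yield node.get("name", "<unnamed>")
        # Internal nodes are checked too, then their children are queued.
        stack.extend(node.get("children", []))
```

For a tree loaded with json.load, the root node would be passed as tree["tree"]; an empty result means every node, including internal ones, has a clade.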
Quality control (QC) configuration
A set of parameters and thresholds used to configure the QC checks. These should be tuned for the particular study or experiment, considering quality and tolerances of sequencing results of a given laboratory.
Nextclade Web (advanced mode): accepted in the “Quality control” drag & drop box.
Nextclade CLI flag: --input-qc-config
Accepted formats: JSON. Example configuration for SARS-CoV-2:
{
  "schemaVersion": "1.2.0",
  "privateMutations": {
    "enabled": true,
    "typical": 8,
    "cutoff": 24,
    "weightLabeledSubstitutions": 4,
    "weightReversionSubstitutions": 6,
    "weightUnlabeledSubstitutions": 1
  },
  "missingData": {
    "enabled": true,
    "missingDataThreshold": 2700,
    "scoreBias": 300
  },
  "snpClusters": {
    "enabled": true,
    "windowSize": 100,
    "clusterCutOff": 6,
    "scoreWeight": 50
  },
  "mixedSites": {
    "enabled": true,
    "mixedSitesThreshold": 10
  },
  "frameShifts": {
    "enabled": true,
    "ignoredFrameShifts": [
      { "geneName": "ORF3a", "codonRange": { "begin": 256, "end": 276 } },
      { "geneName": "ORF3a", "codonRange": { "begin": 258, "end": 276 } }
    ]
  },
  "stopCodons": {
    "enabled": true,
    "ignoredStopCodons": [
      { "geneName": "ORF8", "codon": 26 },
      { "geneName": "ORF8", "codon": 67 }
    ]
  }
}
Note that the positions are 0-indexed and codon range ends are excluded. So ORF3a:257-276 should be encoded as {"begin": 256, "end": 276 }.
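The conversion from a 1-based, inclusive codon range such as ORF3a:257-276 to the 0-indexed, end-exclusive form used in the configuration can be sketched as follows. The helper names codon_range and is_ignored are hypothetical and only illustrate the indexing convention.

```python
def codon_range(first_codon, last_codon):
    """Convert a 1-based, inclusive codon range to 0-based, end-exclusive form."""
    if first_codon < 1 or last_codon < first_codon:
        raise ValueError("expected 1 <= first_codon <= last_codon")
    # Only the start shifts by one; an exclusive end equals the 1-based last codon.
    return {"begin": first_codon - 1, "end": last_codon}


def is_ignored(codon_index, ignored_range):
    """Check whether a 0-based codon index falls inside an ignored range."""
    return ignored_range["begin"] <= codon_index < ignored_range["end"]
```

With these definitions, codon_range(257, 276) gives {"begin": 256, "end": 276}, matching the first ORF3a entry in the example configuration.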
Gene map
(or “genome annotations”)
A table describing the genes of the virus (name, frame, position, etc.)
The gene map is required for codon-aware alignment, for gene translation and for calling amino acid mutations. Without a gene map, peptides will not be output, amino acid mutations will not be detected, and the nucleotide alignment step will not be informed by codon information (see Algorithm: Sequence alignment and Algorithm: Translation). Since version 1.10.0 (web 1.13.0), negative strands are supported, too.
Accepted formats: GFF3. Example gene map for SARS-CoV-2:
# seqname source feature start end score strand frame attribute
. . gene 26245 26472 . + . gene_name=E
. . gene 26523 27191 . + . gene_name=M
. . gene 28274 29533 . + . gene_name=N
. . gene 266 13468 . + . gene_name=ORF1a
. . gene 13468 21555 . + . gene_name=ORF1b
. . gene 25393 26220 . + . gene_name=ORF3a
. . gene 27202 27387 . + . gene_name=ORF6
. . gene 27394 27759 . + . gene_name=ORF7a
. . gene 27756 27887 . + . gene_name=ORF7b
. . gene 27894 28259 . + . gene_name=ORF8
. . gene 28284 28577 . + . gene_name=ORF9b
. . gene 21563 25384 . + . gene_name=S
Nextclade Web (advanced mode): accepted in the “Gene map” drag & drop box.
Nextclade CLI flag: --input-gene-map
Nextalign CLI flag: --gene-map
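To make the structure of the gene map concrete, the sketch below reads rows like the ones in the example and extracts the gene name, coordinates and strand. It is an illustration only: parse_gene_map is a hypothetical helper, it splits on any whitespace (real GFF3 files are tab-separated), and it assumes the gene name is stored in a gene_name attribute as in the example.

```python
def parse_gene_map(text):
    """Parse GFF3-like text into a list of gene dictionaries."""
    genes = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment or header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attributes = {}
        for item in fields[8].split(";"):
            if "=" in item:
                key, value = item.split("=", 1)
                attributes[key.strip()] = value.strip()
        start, end = int(fields[3]), int(fields[4])
        genes.append({
            "name": attributes.get("gene_name", ""),
            "start": start,            # 1-based, inclusive, as in GFF3
            "end": end,                # 1-based, inclusive, as in GFF3
            "strand": fields[6],
            "length": end - start + 1,  # length in nucleotides
        })
    return genes
```

A quick sanity check on the result is that each gene length is a multiple of 3, since genes are translated codon by codon.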
PCR primers
A table that describes a set of PCR primers that might be used for PCR tests of the virus.
Used to detect changes in PCR primer regions. Without this table these checks will not be performed.
Nextclade Web (advanced mode): accepted in the “PCR primers” drag & drop box.
Nextclade CLI flag: --input-pcr-primers
Accepted formats: CSV with the following 4 columns: “Institute (Country),TargetGene,PrimerName,Sequence”. Example table of PCR primers for SARS-CoV-2:
Country (Institute),Target,Oligonucleotide ,Sequence
Charité (Germany) ,RdRp ,Charité_RdRp_F ,GTGARATGGTCATGTGTGGCGG
Charité (Germany) ,RdRp ,Charité_S_RdRp_P,CAGGTGGAACCTCATCAGGAGATGC
Charité (Germany) ,RdRp ,Charité_RdRp_R ,CARATGTTAAASACACTATTAGCATA
Charité (Germany) ,E ,Charité_E_F ,ACAGGTACGTTAATAGTTAATAGCGT
Charité (Germany) ,E ,Charité_E_P ,ACACTAGCCATCCTTACTGCGCTTCG
Charité (Germany) ,E ,Charité_E_R ,ATATTGCAGCAGTACGCACACA
Charité (Germany) ,N ,Charité_N_F ,CACATTGGCACCCGCAATC
Charité (Germany) ,N ,Charité_N_P ,ACTTCCTCAAGGAACAACATTGCCA
Charité (Germany) ,N ,Charité_N_R ,GAGGAACGAGAAGAGGCTTG
Note: the primers are processed differently depending on the primer type. The type is deduced from the suffix of the primer’s name (3rd column). The following conventions are used:
_F - forward primer
_R - reverse primer
_P - probe
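The suffix convention above can be illustrated with a small Python sketch that reads the CSV table and deduces the type of each primer. It is not Nextclade's code: SUFFIX_TO_TYPE, primer_type and read_primers are hypothetical names, and the sketch assumes a header row followed by 4-column rows whose cells may be padded with spaces, as in the example table.

```python
import csv
import io

# Suffix conventions taken from the note above.
SUFFIX_TO_TYPE = {"_F": "forward", "_R": "reverse", "_P": "probe"}


def primer_type(primer_name):
    """Deduce the primer type from the suffix of its name, or None if unknown."""
    name = primer_name.strip()
    for suffix, kind in SUFFIX_TO_TYPE.items():
        if name.endswith(suffix):
            return kind
    return None


def read_primers(csv_text):
    """Parse the primer table and attach the deduced type to every row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    primers = []
    for row in rows[1:]:  # the first row is the header
        if len(row) != 4:
            continue
        institute, target, name, sequence = (cell.strip() for cell in row)
        primers.append({
            "institute": institute,
            "target": target,
            "name": name,
            "sequence": sequence,
            "type": primer_type(name),
        })
    return primers
```

Only the end of the name is examined, so a name such as Charité_S_RdRp_P is classified as a probe despite the _S_ in the middle.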
Virus properties
Introduced in CLI version 1.10.0, web 1.13.0.
Private mutations are split into 3 categories: reversion, labeled mutations and unlabeled mutations.
The virus_properties.json config file tells Nextclade which labels to attach to which mutations. A private mutation to a genotype listed in the file receives the labels given for that genotype in the file.
The file has the following structure (shortened for clarity):
{
  "schemaVersion": "1.10.0",
  "nucMutLabelMap": {
    "174T": [
      "20H"
    ],
    "204T": [
      "20E",
      "21J"
    ]
  },
  "nucMutLabelMapReverse": {
    "19A": [
      "11083T",
      "14805T",
      "26144T"
    ],
    "19B": [
      "8782T",
      "9477A"
    ]
  }
}
Positions are 1-indexed.
Nextclade Web (advanced mode): accepted in the “Virus properties” drag & drop box.
Nextclade CLI flag: --input-virus-properties
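The labeling described in this section amounts to a lookup in nucMutLabelMap, keyed by the 1-indexed position followed by the nucleotide. The sketch below is a rough illustration under that assumption, not Nextclade's implementation; label_private_substitutions is a hypothetical helper and reversions are not handled.

```python
def label_private_substitutions(private_substitutions, nuc_mut_label_map):
    """Split private substitutions into labeled and unlabeled ones.

    `private_substitutions` is a list of (position, nucleotide) tuples with
    1-based positions; `nuc_mut_label_map` is the "nucMutLabelMap" object.
    Returns (labeled, unlabeled), where `labeled` maps genotype to labels.
    """
    labeled, unlabeled = {}, []
    for position, nucleotide in private_substitutions:
        genotype = f"{position}{nucleotide}"  # e.g. (204, "T") becomes "204T"
        labels = nuc_mut_label_map.get(genotype)
        if labels:
            labeled[genotype] = list(labels)
        else:
            unlabeled.append(genotype)
    return labeled, unlabeled
```

With the example map above, a private substitution (204, "T") would be labeled with 20E and 21J, while a genotype absent from the map would be reported as unlabeled.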