Add your dataset to the collection of SARS-CoV-2 datasets

The Nextstrain team maintains nextstrain.org/sars-cov-2 to provide a resource for easy access to a variety of public analyses and interpretations by the Nextstrain team and the scientific community.

During the pandemic we are focused on the SARS-CoV-2 page, but may generalize our approach in the future to provide similar resources for other pathogens.

To add a dataset to the SARS-CoV-2 datasets list on nextstrain.org/sars-cov-2, create a pull request to the nextstrain.org repository on github using the following guide. If this guide doesn’t answer your questions or you aren’t familiar with git, open an issue in that same repository letting us know about the dataset you would like to add, and we can help, or ask any question on discussion.nextstrain.org.

Here is an example pull request for reference.

In the example above, a dataset is being added for Washington State, USA. This is a dataset maintained by the Bedford Lab, focused on sequences from that area. All information about this dataset is represented in the YAML format file in the nextstrain.org repository - static-site/content/SARS-CoV-2-Datasets.yaml - that contains the list of SARS-CoV-2 datasets.

In this case, this looks like the following:

  - url: null
    name: Washington
    geo: washington
    parentGeo: usa
    org: null

  - url: https://nextstrain.org/groups/blab/ncov/wa/4m
    name: Washington
    geo: washington
    region: North America
    level: division
    coords:
      - -120.644869
      - 46.988611
    org:
      name: Bedford Lab
      url: https://bedford.io/

Formatting

Spaces are used for indenting lines. The YAML file contains one list of all entries. The beginning section of the list contains hierarchy entries, while the latter section contains the dataset entries.

Hierarchy entries

This YAML file defines a hierarchy of datasets according to their geographic specificity, which determines how datasets appear in dropdown menus. Hierarchy entries in the list are just there to define this hierarchy with a geo field and a parentGeo field, like this one:

  - url: null
    name: Washington
    geo: washington
    parentGeo: usa
    org: null

Since Washington is in the USA, its parentGeo is usa. At the top of the hierarchy are hierarchy entries with the parentGeo value of null. We add this entry since there already existed one for usa, but not for washington which is the geo level of the dataset entry we are adding. The url and org fields are null since those only apply for dataset entries, but you no longer need to add these two extra fields as null anymore and can just leave them out like this:

  - name: Washington
    geo: washington
    parentGeo: usa

If such an entry already existed for washington, we would only need to add the other type of entry - a dataset entry.

Dataset entries

A dataset entry represents an actual dataset, and doesn’t need the parentGeo field since it will get filed under the geographic region that matches its geo field:

  - url: https://nextstrain.org/groups/blab/ncov/wa/4m
    name: Washington
    geo: washington
    region: North America
    level: division
    coords:
      - -120.644869
      - 46.988611
    org:
      name: Bedford Lab
      url: https://bedford.io/

Here is what each of these fields mean:

Field Example value Description Formatting
url https://nextstrain.org/groups/blab/ncov/wa/4m Link to the dataset Valid unique url
name Washington A name for the dataset to be displayed on nextstrain.org Any informative string
geo washington Name of the geographic level Lower case string consistent with geo hierarchy
region N/A not used anymore.
level region, country, division, or location Geographic specificity; see here for more details.
coords -120.644869, 46.988611 Longitutde and latitude in that order Longitude: number between -180 (West) and 180 (East); Latitude: number between -85 (South) and 85 (North)
org.name Bedford Lab Who maintains this dataset? String
org.url https://bedford.io/ Link to info about the maintainers Valid url

coords

If there was already a dataset for geo: washington (or any dataset with nearby coords), we would need to be sure to specify coordinates that are not identical to those of the existing washington dataset so that you can see both on the map. This will look different for different areas, since in some cases adjusting a latitude by 1 degree may be too much, and in other cases not enough. There is a script, scripts/check-sars-cov-2-datasets-yaml.js, in the nextstrain.org repository which will tell you if two datasets’ coordinates are too close to one another to be distinguished on the map.