Add your dataset to the collection of SARS-CoV-2 datasets
The Nextstrain team maintains nextstrain.org/sars-cov-2 to provide a resource for easy access to a variety of public analyses and interpretations by the Nextstrain team and the scientific community.
During the pandemic we are focused on the SARS-CoV-2 page, but may generalize our approach in the future to provide similar resources for other pathogens.
To add a dataset to the SARS-CoV-2 datasets list on nextstrain.org/sars-cov-2, create a pull request to the nextstrain.org repository on github using the following guide. If this guide doesn’t answer your questions or you aren’t familiar with git, open an issue in that same repository letting us know about the dataset you would like to add, and we can help, or ask any question on discussion.nextstrain.org.
Here is an example pull request for reference.
In the example above, a dataset is being added for Washington State, USA. This is a dataset maintained by the Bedford Lab, focused on sequences from that area. All information about this dataset is represented in the YAML format file in the nextstrain.org repository - static-site/content/SARS-CoV-2-Datasets.yaml - that contains the list of SARS-CoV-2 datasets.
In this case, this looks like the following:
- url: null name: Washington geo: washington parentGeo: usa org: null - url: https://nextstrain.org/groups/blab/ncov/wa/4m name: Washington geo: washington region: North America level: division coords: - -120.644869 - 46.988611 org: name: Bedford Lab url: https://bedford.io/
Spaces are used for indenting lines. The YAML file contains one list of all entries. The beginning section of the list contains hierarchy entries, while the latter section contains the dataset entries.
This YAML file defines a hierarchy of datasets according to their geographic specificity, which determines how datasets appear in dropdown menus.
Hierarchy entries in the list are just there to define this hierarchy with a
geo field and a
parentGeo field, like this one:
- url: null name: Washington geo: washington parentGeo: usa org: null
Since Washington is in the USA, its
At the top of the hierarchy are hierarchy entries with the
parentGeo value of
We add this entry since there already existed one for
usa, but not for
washington which is the
geo level of the dataset entry we are adding.
org fields are
null since those only apply for dataset entries, but you no longer need to add these two extra fields as
null anymore and can just leave them out like this:
- name: Washington geo: washington parentGeo: usa
If such an entry already existed for
washington, we would only need to add the other type of entry - a dataset entry.
A dataset entry represents an actual dataset, and doesn’t need the
parentGeo field since it will get filed under the geographic region that matches its
- url: https://nextstrain.org/groups/blab/ncov/wa/4m name: Washington geo: washington region: North America level: division coords: - -120.644869 - 46.988611 org: name: Bedford Lab url: https://bedford.io/
Here is what each of these fields mean:
||Link to the dataset||Valid unique url|
||A name for the dataset to be displayed on nextstrain.org||Any informative string|
||Name of the geographic level||Lower case string consistent with
||N/A not used anymore.|
||Geographic specificity; see here for more details.|
||Longitutde and latitude in that order||Longitude: number between -180 (West) and 180 (East); Latitude: number between -85 (South) and 85 (North)|
||Who maintains this dataset?||String|
||Link to info about the maintainers||Valid url|
If there was already a dataset for
geo: washington (or any dataset with nearby
coords), we would need to be sure to specify coordinates that are not identical to those of the existing
washington dataset so that you can see both on the map.
This will look different for different areas, since in some cases adjusting a latitude by 1 degree may be too much, and in other cases not enough.
There is a script,
scripts/check-sars-cov-2-datasets-yaml.js, in the nextstrain.org repository which will tell you if two datasets’ coordinates are too close to one another to be distinguished on the map.