⚠️ WARNING: This is an alpha release. Output file formats and addresses may change at any time.
This repo forms the basis of our continually-updated modelling of SARS-CoV-2 variant frequencies. Broadly speaking, the moving pieces in this repo are:
- Data ingest, which produces TSV files of sequence counts. See ./ingest/README for more details.
- Variant modelling, which is detailed in this README. The models themselves are defined in the evofr repo.
- The `./viz/` directory contains a web app which visualises the latest model outputs. See ./viz/README for more details. This web app is currently available at nextstrain.github.io/forecasts-ncov/.
The automated pipeline runs daily based on scheduled jobs and triggers from upstream data ingests. We use GitHub Actions to schedule these jobs, often with one job triggering another upon completion.
- Case counts are fetched from external data sources daily at 8 AM PST
- Raw metadata/sequences are fetched and cleaned via nextstrain/ncov-ingest.
- The nextstrain/ncov-ingest pipelines trigger the clade counts jobs once the latest curated data has been uploaded to S3
- The GISAID and open data ingest pipelines have different run times, so their clade counts jobs are triggered at different times.
- Clade counts jobs trigger the model runs once the counts data has been uploaded to S3
- Model results are uploaded to S3 as dated files where the date indicates the run date
See the available counts files for the input case counts and clade counts.
The model results for GISAID data are stored at `s3://nextstrain-data/files/workflows/forecasts-ncov/gisaid`.
The model results for open (GenBank) data are stored at `s3://nextstrain-data/files/workflows/forecasts-ncov/open`.
The latest results are stored as `latest_results.json`, and previously uploaded results can be found as `<YYYY-MM-DD>_results.json`.
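For example, the S3 key for a dated results file can be assembled from the base prefix and the run date (the date below is a hypothetical example):

```shell
# Build the S3 key for a dated GISAID results file.
# RUN_DATE is a placeholder; real files are named by their actual run date.
RUN_DATE="2023-04-04"
BASE="s3://nextstrain-data/files/workflows/forecasts-ncov/gisaid"
echo "${BASE}/${RUN_DATE}_results.json"
```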
Data Provenance | Variant Classification | Geographic Resolution | Model | Address |
---|---|---|---|---|
GISAID | Nextstrain clades | Global | MLR | https://data.nextstrain.org/files/workflows/forecasts-ncov/gisaid/nextstrain_clades/global/mlr/latest_results.json |
GISAID | Pango lineages | Global | MLR | https://data.nextstrain.org/files/workflows/forecasts-ncov/gisaid/pango_lineages/global/mlr/latest_results.json |
open (GenBank) | Nextstrain clades | Global | MLR | https://data.nextstrain.org/files/workflows/forecasts-ncov/open/nextstrain_clades/global/mlr/latest_results.json |
open (GenBank) | Pango lineages | Global | MLR | https://data.nextstrain.org/files/workflows/forecasts-ncov/open/pango_lineages/global/mlr/latest_results.json |
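The address pattern in the table is uniform, so the URL for any combination can be assembled from its four components (the variable names below are illustrative, not part of the pipeline):

```shell
# Assemble the latest-results URL for one combination of options.
PROVENANCE="open"               # gisaid or open
CLASSIFICATION="pango_lineages" # nextstrain_clades or pango_lineages
GEO="global"
MODEL="mlr"
echo "https://data.nextstrain.org/files/workflows/forecasts-ncov/${PROVENANCE}/${CLASSIFICATION}/${GEO}/${MODEL}/latest_results.json"
```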
Please follow the installation instructions for Nextstrain's software tools.
To run the pipeline for all available data generated by ingest:

```
nextstrain build .
```
To run the pipeline for a specific data provenance, variant classification, and geographic resolution (e.g. `gisaid`, `nextstrain_clades`, and `global` only):

```
nextstrain build . --configfile config/config.yaml --config data_provenances=gisaid variant_classifications=nextstrain_clades geo_resolutions=global
```
To run the pipeline for US states only using GISAID data:

```
nextstrain build . --configfile config/config.yaml --config data_provenances=gisaid geo_resolutions=usa
```
To run the pipeline that uploads the model results to S3 and sends Slack notifications:

```
nextstrain build . --configfile config/config.yaml config/optional.yaml
```
Alternatively, run the GitHub Actions workflow named "Run models" to run the pipeline on AWS Batch.
The `data_provenances`, `variant_classifications`, and `geo_resolutions` configs are required by the pipeline.

The currently available options for `data_provenances` are:
- `gisaid`
- `open`

The currently available options for `variant_classifications` are:
- `nextstrain_clades`
- `pango_lineages`

The currently available options for `geo_resolutions` are:
- `global`
- `usa`
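Since each required config takes one of the values listed above, every supported combination can be enumerated. This sketch just prints the corresponding build commands rather than running them:

```shell
# Print (not run) a build command for each combination of required configs.
for dp in gisaid open; do
  for vc in nextstrain_clades pango_lineages; do
    for geo in global usa; do
      echo "nextstrain build . --configfile config/config.yaml --config data_provenances=$dp variant_classifications=$vc geo_resolutions=$geo"
    done
  done
done
```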
The `prepare_data` params in `config/config.yaml` are used to subset the full case counts and clade counts data to a specific date range, locations, and clades.
As of 2023-04-04, the config for the automated pipeline is set to only include data from:
- the past 150 days
- excluding sequences from the last 12 days, since they may be overly enriched for variants
- locations that have at least 500 sequences in the last 30 days
- excluding locations specifically listed in `defaults/global_excluded_locations.txt`
- clades that have at least 5000 sequences in the last 150 days
As of 2023-12-28, the config for the automated pipeline for US states is set to only include data from:
- the past 150 days
- excluding sequences from the last 12 days, since they may be overly enriched for variants
- locations that have at least 90 sequences in the last 45 days
- excluding locations specifically listed in `defaults/usa_excluded_locations.txt`
- clades that have at least 1000 sequences in the last 150 days
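The inclusion window implied by these filters can be computed directly. A minimal sketch, assuming GNU `date` and using the 2023-04-04 config date as a hypothetical run date:

```shell
# Compute the inclusion window: back 150 days, excluding the trailing 12 days.
# RUN_DATE is a placeholder; the real pipeline uses the actual run date.
RUN_DATE="2023-04-04"
START=$(date -u -d "$RUN_DATE - 150 days" +%F)
END=$(date -u -d "$RUN_DATE - 12 days" +%F)
echo "Sequences collected between $START and $END are included"
```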
The specific model configurations are housed in separate config YAML files for each model.
These separate config files must be provided in the main config as `mlr_config` and `renewal_config` in order to run the models.
By default, the model config files used are `config/mlr-config.yaml` and `config/renewal-config.yaml`.
Note the inputs and outputs for the models are overridden in the Snakemake pipeline to conform to the Snakemake input/output framework.
Model JSONs are post-processed by `./scripts/modify-lineage-colours-and-order.py`.
For `nextstrain_clades`, this sets the colours and display names.
For `pango_lineages`, this orders lineages based on their full (unaliased) Pango designation, and sets colours based on the associated Nextstrain clade.
When new clades are added, please modify the `CLADES` definitions in the script accordingly.
No environment variables are required for open data. However, the following environment variables are required for GISAID data:
- `AWS_DEFAULT_REGION`
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
If running the pipeline with uploads to S3, the following environment variables are required (regardless of data provenance):
- `AWS_DEFAULT_REGION`
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
If running the pipeline with Slack notifications, the following environment variables are required:
- `SLACK_CHANNELS`
- `SLACK_TOKEN`
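For example, a local environment might be set up as follows. All values below are placeholders, not real credentials or channels; supply real values from your own secrets store:

```shell
# Placeholder values only; substitute your actual credentials and channels.
export AWS_DEFAULT_REGION="us-east-1"
export AWS_ACCESS_KEY_ID="AKIA-EXAMPLE"
export AWS_SECRET_ACCESS_KEY="example-secret"
export SLACK_CHANNELS="#example-channel"
export SLACK_TOKEN="xoxb-example"
```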