The EpiVar Browser's server component can be deployed with data other than the Aracena et al. dataset. Follow the instructions below carefully, paying especially close attention to the formats described in the node setup guide.
Install these dependencies according to their own instructions:

- NodeJS version 16+
- Postgres
  - Used for storing SNPs, features, associations (p-values), and other metadata
  - Tested with versions 13 through 15
  - See "Setting up the Postgres database" below for how to prepare this
- Redis
  - Used for caching values
  - Tested with version 6+
  - Note that a Redis instance should never be exposed to the internet! EpiVar expects it to be available locally at `localhost:6379`, the default Redis port.
- `bw-merge-window`: https://github.com/c3g/bw-merge-window
- `bigWigSummary`: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/

Executables must be on the `PATH` of the application to be called directly.
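A few quick sanity checks (standard CLI invocations, assuming a local shell):

```bash
node --version                       # expect v16 or newer
redis-cli -h localhost -p 6379 ping  # expect PONG
which bw-merge-window bigWigSummary  # both must resolve on the PATH
```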
After Postgres is installed, create a user (with a password) and a database for the application. For example, starting from a `bash` (or similar) command-line shell as the default `postgres` user, you can access a Postgres shell:

```bash
sudo su - postgres
psql
```

You should now be connected to Postgres:

```sql
CREATE USER epivar WITH PASSWORD 'my-password';
CREATE DATABASE epivar_db WITH OWNER epivar;
```

To exit the Postgres session and the `postgres` user's `bash` session, hit Control-D twice.
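To confirm that the new role and database work, you can connect directly (a quick check with the standard `psql` client; `-h localhost` forces a TCP connection so the password is actually exercised):

```bash
psql -U epivar -h localhost -d epivar_db
```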
The application requires data from multiple sources, and that data must be transformed to be consumable by the application. The first step is therefore to prepare the data.
`./input-files` contains portions of the source data for the Aracena et al. instance of the portal, as provided by Alain Pacis. These files can serve as a starting point, or as formatting examples, for customizing the portal.

The data directory is configurable, but the application and this document use `./data` by default for generated file data. Data inserted into the Postgres database ends up wherever Postgres is configured to persist its data.
Start with `cp config.example.js config.js` to create the required config. The default config should not need updating if you follow the instructions below, but you can follow along to make sure everything matches.
The different data sources to generate/prepare are:
- **Condition and ethnicity configuration:** Set up conditions/treatments and sample population groups/ethnicities for the portal in `config.js`, in the `config.conditions` and `config.ethnicities` arrays.
  - Format: (example)

    ```js
    module.exports = {
      // ...
      conditions: [
        {id: "NI", name: "Non-infected"},
        {id: "Flu", name: "Flu"},
      ],
      ethnicities: [
        {id: "AF", name: "African-American", plotColor: "#5100FF", plotBoxColor: "rgba(81, 0, 255, 0.6)"},
        {id: "EU", name: "European-American", plotColor: "#FF8A00", plotBoxColor: "rgba(255, 138, 0, 0.6)"},
      ],
      // ...
    };
    ```

  - Note: While the field is called `ethnicities`, it can in fact be used for non-ethnicity population groups as well. It is just used to visually separate points in the box plots generated by the server.
- **Metadata:** This is the tracks' metadata. It can either be provided as an XLSX file with the following headers:
  - `file.path`: relative path to the bigWig, without the `config.paths.tracks` directory prefix
  - `ethnicity`: ethnicity / population group ID (not name!)
    - If set to `Exclude sample`, the sample will be skipped
  - `condition`: condition / experimental group ID (not name!)
  - `sample_name`: full sample name, uniquely identifying the sample within the `assay`, `condition`, `donor`, and `track.view` variables
  - `donor`: donor ID (i.e., individual ID)
  - `track.view`: literal value, one of `signal_forward` or `signal_reverse`
  - `track.track_type`: literal value `bigWig`
  - `assay.name`: one of `RNA-Seq`, `ATAC-Seq`, `H3K27ac`, `H3K4me1`, `H3K27me3`, `H3K4me3`

  and the sheets (which match `assay.name`):
  - RNA-Seq
  - ATAC-Seq
  - H3K27ac
  - H3K4me1
  - H3K27me3
  - H3K4me3

  or a JSON file containing a list of objects with the following keys, mapping to the above headers in order: `path`, `ethnicity`, `condition`, `sample_name`, `donor`, `view`, `type`, `assay`
  Information on the track metadata file:
  - Generate with: `node ./scripts/metadata-to-json.js < ./input-files/flu-infection.xlsx > ./data/metadata.json`
    - Replace `./input-files/flu-infection.xlsx` with the path to your metadata file
    - Optionally, the resulting JSON can instead be generated directly (see above for keys)
  - Config: `EPIVAR_TRACK_METADATA_PATH` (environment variable to specify the file path)
  - Input: one command-line argument specifying a path, e.g., `./input-files/flu-infection.xlsx`
  - Output: prints JSON to `stdout`. This should be redirected into the path specified by `EPIVAR_TRACK_METADATA_PATH`; by default, `./data/metadata.json` (or just generate this file directly)
  - Notes: this is really just an XLSX-to-JSON transformation. The version of the XLSX used for the Aracena et al. portal instance is available in this repository as a reference. An illustrative entry of the generated JSON follows this item.
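  For illustration only, here is what a single entry of the generated `metadata.json` might look like. This is a hypothetical record assembled from the keys listed above and the example track path used later in this document; the exact `sample_name` convention depends on your dataset:

  ```json
  [
    {
      "path": "RNAseq/AF02_Flu.forward.bw",
      "ethnicity": "AF",
      "condition": "Flu",
      "sample_name": "AF02_Flu",
      "donor": "AF02",
      "view": "signal_forward",
      "type": "bigWig",
      "assay": "RNA-Seq"
    }
  ]
  ```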
- **Pre-computed feature signals:** Optionally, preset matrices can be provided with point values for box plots which have been batch-corrected and, e.g., age-regressed.
  - Import with: N/A (automatically read during the Genes and Peaks import steps below!)
  - Input: a set of matrix files. These are TSV-formatted, with a header row of sample names (`{ethnicity}##_{condition}`, e.g., `EU99_Flu`) and a header column at the start for feature names (`chr#_startpos_endpos` or `GENESYMBOL`). A sketch follows this item.
  - Config: use the `EPIVAR_POINTS_TEMPLATE` environment variable to configure where point matrices are loaded from. The `$ASSAY` string is replaced with each assay in turn. Defaults to: `./input-files/matrices/$ASSAY_batch.age.corrected_PCsreg.txt`
  - Notes: in our EpiVar instance, the corrections applied are:
    - Batch correction
    - Age regressed out as a cofactor
    - Principal-components regression
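  A minimal sketch of such a matrix, with hypothetical values (the label of the leading feature-name column, if any, may differ in real files):

  ```text
  feature	EU99_NI	EU99_Flu	AF02_NI	AF02_Flu
  chr1_9998_11177	0.53	1.21	0.47	0.98
  NKX2-5	0.12	0.34	0.19	0.41
  ```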
- **Genes:** lists of gene names mapped to their characteristics, plus features associated with specific genes.
  - Import with: e.g., `node ./scripts/import-genes.mjs < ./input-files/flu-infection-gene-peaks.csv`
  - Input: a pre-computed gene list for the assembly specified in `config.js`, and a file resembling `./input-files/flu-infection-gene-peaks.csv`
    - Examples of these files / the versions used for the Aracena et al. instance of the portal are already in the repository.
  - Format: `flu-infection-gene-peaks.csv` is a CSV with the header row `"symbol","peak_ids","feature_type"`, where `symbol` is the gene name, `peak_ids` is a feature string (e.g., `chr1_9998_11177`), and `feature_type` is the name of the assay. An illustrative row follows this item.
  - Notes: the `name` column contains gene names as provided in the input file; however, there are inconsistencies in the notation of names, where the `.` and `-` characters sometimes diverge from one source to another. Names are therefore normalized by replacing all non-digit/non-letter characters with `-`, and that value goes into the unique `name_norm` column used for genes.
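  For illustration, a hypothetical row in the shape described above (the gene symbol is made up; the feature string echoes the example given):

  ```csv
  "symbol","peak_ids","feature_type"
  "NKX2.5","chr1_9998_11177","ATAC-Seq"
  ```

  And a minimal sketch of the name normalization described in the notes (the rule is as stated above; the actual implementation in `import-genes.mjs` may differ in detail):

  ```js
  // Replace every character that is not a digit or a letter with "-":
  const normalizeName = (name) => name.replace(/[^0-9a-zA-Z]/g, "-");

  console.log(normalizeName("NKX2.5"));    // "NKX2-5"
  console.log(normalizeName("HLA-DRB1"));  // "HLA-DRB1" (already normalized)
  ```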
- **Peaks:** a list of peaks mapped to their characteristics. These files are CSVs with the following headers (an illustrative row follows this item):
  - `rsID`: the rsID of the SNP
  - `snp`: the SNP in the SNP–peak association, formatted like `chr#_######` (UCSC-formatted chromosome name, underscore, position)
  - `feature`: the feature name; either `chr#_startpos_endpos` or `GENE_NAME`
  - `pvalue.*`, where `*` is the ID of the condition (by default, `*` = `NI`, then `Flu`)
    - These are floating-point numbers
  - `feature_type`: the assay the peak is from; e.g., `RNA-seq`

  Information on the QTL/peak list files:
  - Import with: `node ./scripts/import-peaks.js`, followed by `node ./scripts/calculate-peak-groups.mjs`
  - Input: `./input-files/qtls/QTLs_complete_*.csv` (there are a couple of truncated example files in `./input-files/qtls`)
  - Config: use the `EPIVAR_QTLS_TEMPLATE` environment variable to configure where QTL lists are loaded from. The `$ASSAY` string is replaced with each assay in turn. Defaults to: `./input-files/qtls/QTLs_complete_$ASSAY.csv`
  - Notes: the peak's associated feature is usually different from the peak's own position; e.g., the peak SNP can be at `chr1:1000`, while the feature is at the range `chr1:3500-3600`. The second script calculates peak groups by SNP and gene for auto-complete.
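  A hypothetical illustration of one such CSV (the rsID and p-values are made up; the SNP position and feature range echo the example in the notes above):

  ```csv
  rsID,snp,feature,pvalue.NI,pvalue.Flu,feature_type
  rs123456,chr1_1000,chr1_3500_3600,1.2e-6,4.0e-4,RNA-seq
  ```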
- **Binned top peaks for assays:** used to generate Manhattan plots for chromosome/assay pairs, binned by SNP position.
  - Generate with: `node ./scripts/calculate-top-peaks.mjs`
  - Notes: this will populate a table in the Postgres database.
- **Tracks:** pre-generated bigWig files containing the signal data to be merged and displayed in the browser. The paths should correspond to those given in the track metadata.
  - Config: `EPIVAR_TRACKS_DIR` environment variable, specifying the directory
  - Notes: a metadata item's `.path` field (from the Metadata step above) points to a path inside the `config.paths.tracks` directory, e.g.:

    ```js
    metadata.path = 'RNAseq/AF02_Flu.forward.bw'
    filepath = path.join(config.paths.tracks, metadata.path)
    ```

  - EpiVar-specific notes: you will need to either copy the files or, in development, mount them with `sshfs` to have access to them.
- **Merged tracks:** the directory used to store merged tracks.
  - Generate with: `mkdir -p ./data/mergedTracks`
  - Config: the `VARWIG_MERGED_TRACKS_DIR` environment variable or `config.paths.mergedTracks` (directory)
  - Notes: make sure there is enough space for those tracks; a quick check follows this item.
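  One way to confirm that the volume holding the merged-tracks directory has room (a standard `df` invocation; the path assumes the default location above):

  ```bash
  df -h ./data/mergedTracks
  ```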
- **GEMINI database:** this contains the variants' data.
  - Generate with: follow the instructions on the GEMINI website for creating a GEMINI database from a VCF (a sketch follows this item). For Aracena et al. data, copy it or mount it over `sshfs`.
  - Notes for the Aracena et al. implementation: accessing the database over `sshfs` in development is slow, because the `gemini` command needs to read it a lot. It might be easier to call `gemini` directly on `beluga`; see the comment in `./models/samples.mjs` about the `gemini()` function for more details. Fetching the chromosome list can also be expensive, so for development you might want to hardcode the list in the config at `config.development.chroms` once you know what that list is.
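  As a minimal sketch of database creation, following GEMINI's standard `gemini load` usage (the VCF filename and annotation type here are assumptions; check the GEMINI docs for the flags matching your annotation pipeline):

  ```bash
  # Load an annotated VCF into a new GEMINI database.
  # -t names the functional-annotation tool that was run on the VCF (snpEff or VEP).
  gemini load -v allSamples_WGS.vcf.gz -t VEP allSamples_WGS.gemini.db
  ```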
The EpiVar browser uses OpenID Connect (OIDC) for authentication/authorization (auth); it does not include its own username/password identity layer. A popular OIDC provider (free for small projects) is Auth0. For academic projects, CILogon is an excellent choice, since it can federate to different academic institutions' own auth systems. When configuring EpiVar, the various parameters for the OIDC identity provider must be set in the `.env` file (see the Production section below).
Once the data are ready, you can install & build the application as follows:

```bash
npm run install

# Builds the frontend and copies it under ./public
npm run build
```
To use `sshfs` to mount the bigWigs from `beluga` or `narval`:

```bash
# Either
sshfs -o defer_permissions \
  beluga.computecanada.ca:/lustre03/project/rrg-bourqueg-ad/C3G/projects/DavidB_varwig/ \
  /path/to/local/mnt

# Or
sshfs -o defer_permissions \
  narval.computecanada.ca:/lustre03/project/rrg-bourqueg-ad/C3G/projects/DavidB_varwig/ \
  /path/to/local/mnt
```
In development, you'd run:

- `npm run watch` for the backend
- `cd client && npm start` for the frontend
In production, you may need to set up the following to handle persistence and HTTPS:

- Set up an NGINX or Apache proxy with a Let's Encrypt certificate (see `epivar-prod/nginx.conf` for an example).
  - For the reference deployment, we use a VM behind a proxy. We needed to set the following NGINX configuration values: `real_ip_header X-Forwarded-For;` and `set_real_ip_from ####;`, where `####` is the IP block for the hypervisor from the VM's perspective, in order to get correct `X-Real-IP` values for the terms-of-use agreement.
- Set up Redis to handle caching
- Set up Postgres to handle persistent data
  - In production, make sure to configure Postgres with lots of RAM and 4+ workers for gathers! Otherwise, autocomplete queries will be really slow. A sketch of relevant settings follows this list.
- Set up `pm2` to run `node ./bin/www` with multiple workers (e.g., `pm2 start ./bin/www --name epivar -i 0`)
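A hypothetical starting point for the Postgres tuning mentioned above, in `postgresql.conf` (these are standard Postgres settings, but the values are illustrative; size them to your machine):

```
shared_buffers = 4GB                  # give Postgres plenty of RAM for caching
work_mem = 64MB                       # per-operation sort/hash memory
max_parallel_workers_per_gather = 4   # 4+ workers for parallel gathers
```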
You will also need to set up authentication via an OIDC layer. This is configured via environment variables, which can either be passed in the service run command or placed into a `.env` file and loaded at service start time.
Here is an example, with secrets redacted, for a setup via Auth0, complete with directory and Postgres configuration as well:

```bash
# Auth configuration
VARWIG_AUTH_SCOPE="openid profile"
VARWIG_CLIENT_ID=some_client
VARWIG_CLIENT_SECRET=some_secret
VARWIG_SESSION_SECRET=some_session_secret
VARWIG_ISSUER=https://dev-###.us.auth0.com/
VARWIG_AUTH_URL=https://dev-###.us.auth0.com/authorize
VARWIG_TOKEN_URL=https://dev-###.us.auth0.com/oauth/token
VARWIG_USERINFO_URL=https://dev-###.us.auth0.com/userinfo

# Other Varwig configuration
VARWIG_BASE_URL=https://flu-infection.vhost38.genap.ca

# Database configuration
VARWIG_PG_CONNECTION=postgres://epivar@localhost:5432/epivar_db

# Directories
VARWIG_MERGED_TRACKS_DIR=/flu-infection-data/mergedTracks
VARWIG_TRACKS_DIR=/flu-infection-data
VARWIG_GEMINI_DB=/flu-infection-data/allSamples_WGS.gemini.db
```
Note that trailing slashes are very important here; for example, a missing trailing slash in `VARWIG_ISSUER` will prevent successful authentication.
In production with CILogon, the auth scope would be configured as follows:

```bash
VARWIG_AUTH_SCOPE="openid email org.cilogon.userinfo"
```
We use `pm2` to run multiple processes of the application at a time, to handle more simultaneous requests. The `PM2_HOME` folder is currently set to `/home/dlougheed/.pm2` (sorry).
The instance title and subtitle can be configured in `./client/src/constants/app.js`.

Page content is stored as JSX components in `./client/src/components/pages`. When deploying a new instance, make sure to change these pages from the default!