Skip to content

Latest commit

 

History

History
85 lines (65 loc) · 5.2 KB

README.md

File metadata and controls

85 lines (65 loc) · 5.2 KB

viral-mcorr examples

In these examples, you'll learn how to: (i) measure and fit correlation profiles across whole genome sequences for viruses using mcorrViralGenome and mcorr-viral-fit, (ii) measure and fit correlation profiles across single gene alignments using mcorr-gene-aln and mcorr-viral-fit, as well as generate 95% bootstrap CIs for the output parameters, and (iii) use mcorrLDGenome to generate genome-wide maps of Pearson's r for correlated substitutions.

For this example, we use the 191 sars-like coronaviruses we analyzed in our paper (link to preprint here), which were also used in the example of how to prepare viral sequences for use with viral-mcorr which you can find here. All of the alignment files are provided in this github repo. Before you begin, make sure you've installed viral-mcorr locally. Also download this repository onto your machine:

cd ~/Downloads
git clone clone https://github.com/kussell-lab/viral-mcorr_sl-cov_examples.git
cd viral-mcorr_sl-cov_examples

mcorrViralGenome

Here, we look at how to measure and fit correlation profiles across whole genome sequences for viruses using mcorrViralGenome and mcorr-viral-fit.

Calculate correlation profiles across the aligned sars-like coronavirus sequences using the following commands:

mcorrViralGenome aligned_sl-cov/aligned_sl-cov.xmfa mvg_out/sl-cov_mvg_out
mcorr-viral-fit mvg_out/sl-cov_mvg_out.csv mvg_out/sl-cov_mvg 

mcorrViralGenome takes ~10 seconds and mcorr-viral-fit takes ~5 seconds to run on a 2021 MacBook M1 with 16 GB RAM. What your outputs should look like are in the subdirectory mvg_out. The main results can be found in the following files:

  • sl-cov_mvg_template-switch_best_fit.svg and sl-cov_zero-recombo_best_fit.svg show the plots of the correlation profile and fitting for the template-switching recombination model and for the zero recombination case
  • sl-cov_comparemodels.csv shows the table of fitted parameters for all recombination models (template-switching, fragment-incorporation, and zero-recombination) and AIC values

mcorr-gene-aln

Here we look at how to measure and fit correlation profiles across single gene alignments using mcorr-gene-aln and mcorr-viral-fit, as well as generate 95% bootstrap CIs for inferred evolutionary parameters. This works with xmfa files for single genes.

For this example, let's use the spike protein from our sars-like coronavirus alignment (the CDS region spanning positions 21563 to 25384).

Calculate a correlation profile across the spike protein using the following commands:

mcorr-gene-aln aligned_sl-cov/genes/cds-YP_009724390.1_21563+25384.xmfa mga_out/spike_mga_out --show-progress
mcorr-viral-fit mga_out/spike_mga_out.csv mga_out/spike_mga 

mcorr-gene-aln takes ~20 minutes (using the default number of bootstraps) and mcorr-viral-fit takes ~30 seconds to run on a 2021 MacBook M1 with 16 GB RAM. What your outputs should look like are in the subdirectory mga_out. The main results can be found in the following files:

  • spike_mga_template-switch_best_fit.svg and spike_mga_zero-recombo_best_fit.svg show the plots of the correlation profile and fitting for the template-switching recombination model and for the zero recombination case
  • spike_mga_comparemodels.csv shows the table of fitted parameters for all recombination models (template-switching, fragment-incorporation, and zero-recombination) and AIC values
  • spike_mga_template-switch_fit_report.txt gives 95% bootstrap CIs for the parameters generated by fitting bootstrap replicates with the template-switching model. Note: using the default 'template-switching' option, fbar is actually the genome size L, hence this parameter is fixed for all bootstraps.

mcorrLDGenome

Here, we look at using mcorrLDGenome to measure correlations between specific genomic sites across viral genomes. First, use mcorrLDGenome to calculate the joint probability of a difference at two sites and the probability of a difference at a single site for each genomic position in the aligned sars-like coronavirus whole genome sequences:

mcorrLDGenome aligned_sl-cov/aligned_sl-cov.xmfa mcorrLD_out/sl-cov_ld_genome_out

This took ~8 minutes to run on a 2021 MacBook M1 with 16 GB of RAM. The output file is sl-cov_ld_genome_out.csv should be around 3.5 GB in size which, even compressed is a bit too large to store in this repo. You can then use the jupyter notebook pcc_example.ipynb to make a visual map of Pearson's correlation coefficient for synonymous substitutions across the sars-like coronavirus genome. The final output should look like the png file shown in the subdirectory mcorrLD_out. The jupyter notebook takes ~10 minutes to run (again on a 2021 MacBook M1 w/ 16 GB of RAM)

Note on jupyter notebooks: If you haven't worked with a jupyter notebook before, see this guide to get started. Alternatively you can try to run the jupyter notebook using Google Colab.