Releases: roblanf/sarscov2phylo
24-08-20
Citation and reuse
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 from GISAID data, including sequences deposited up to 24-August-2020. Zenodo DOI: 10.5281/zenodo.3958883
Please note - you cannot publish papers that use this tree without following the GISAID data sharing and attribution rules. These rules are important - they protect the data uploaders, and create trust in a global system of data sharing with potentially vast public health benefits. By building and maintaining trust we ensure that people keep sharing their data, and that the public health benefits keep flowing. I do not want the existence of this tree to be some kind of attribution laundering service (e.g. where people feel free to use the tree without following the GISAID data sharing rules), so please don't use it in that way. For example, if you are going to interpret other people's data from GISAID and publish the results, including by using this tree, you should get in touch with the people that submitted the data. The code in this repo is covered by the GNU license, and you can use that however you like.
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i gisaid_hcov-19_2020_08_28_07.fasta -o global.fa -s ft_SH_22-08-20.tree -t 250
The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 24th of August 2020, as determined by the 'submission date' filter on GISAID.
The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the tree itself here so that it can be easily downloaded without downloading the entire repo. The file 'ft_SH_22-08-20.tree' is the 'ft_SH.tree' file from the 22-08-20 release.
The lnL of the final tree is: -530402.971
Filtering statistics
sequences downloaded from GISAID
56117
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 54383
Alignment length: 29903
Total # residues: 1622338790
Smallest: 29018
Largest: 29903
Average length: 29831.7
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 54383
Alignment length: 29903
Total # residues: 1613342793
Smallest: 28922
Largest: 29675
Average length: 29666.3
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 54106
Alignment length: 29903
Total # residues: 1605136009
Smallest: 28922
Largest: 29675
Average length: 29666.5
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 54106
Alignment length: 29661
Total # residues: 1601266979
Smallest: 28437
Largest: 29661
Average length: 29595.0
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 96957
#leaves: 54041
#dichotomies: 41025
#leaf labels: 54041
#inner labels: 38755
Number of new sequences added this iteration
779 alignment_names_new.txt
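As a rough guide to reading the filtering statistics above, the following Python sketch tabulates how many sequences are dropped between the stages that change the sequence count in this release. The counts are copied from the statistics above; the stage labels and variable names are illustrative shorthand, not names used by the scripts. Masking and trimming are left out because they change the number of sites, not the number of sequences.

```python
# Sequence counts reported for the 24-08-20 release, in pipeline order.
# Stage labels are illustrative shorthand, not names used by the scripts.
stages = [
    ("downloaded from GISAID", 56117),
    ("global alignment", 54383),
    ("after filtering short/ambiguous sequences", 54106),
    ("after TreeShrink", 54041),
]


def stage_losses(stages):
    """Return (stage label, sequences removed relative to the previous stage)."""
    return [
        (label, prev_count - count)
        for (_, prev_count), (label, count) in zip(stages, stages[1:])
    ]


losses = stage_losses(stages)
for label, removed in losses:
    print(f"{label}: {removed} sequences removed")

total_removed = stages[0][1] - stages[-1][1]
print(f"retained {stages[-1][1]} of {stages[0][1]} ({total_removed} removed in total)")
```

The same list can be filled in with the counts from any other release on this page to compare how much each filtering step removes.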
Notable changes to the scripts in this release
- None
Notable aspects of the trees
- None
22-08-20
Citation and reuse
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 from GISAID data, including sequences deposited up to 22-August-2020. Zenodo DOI: 10.5281/zenodo.3958883
Please note - you cannot publish papers that use this tree without following the GISAID data sharing and attribution rules. These rules are important - they protect the data uploaders, and create trust in a global system of data sharing with potentially vast public health benefits. By building and maintaining trust we ensure that people keep sharing their data, and that the public health benefits keep flowing. I do not want the existence of this tree to be some kind of attribution laundering service (e.g. where people feel free to use the tree without following the GISAID data sharing rules), so please don't use it in that way. For example, if you are going to interpret other people's data from GISAID and publish the results, including by using this tree, you should get in touch with the people that submitted the data. The code in this repo is covered by the GNU license, and you can use that however you like.
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i gisaid_hcov-19_2020_08_27_05.fasta -o global.fa -s ft_SH_20-08-20.tree -t 54
The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 22nd of August 2020, as determined by the 'submission date' filter on GISAID.
The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the tree itself here so that it can be easily downloaded without downloading the entire repo. The file 'ft_SH_20-8-20.tree' is the 'ft_SH.tree' file from the 20-8-20 release.
The lnL of the final tree is: -525105.728
Filtering statistics
sequences downloaded from GISAID
55440
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 53750
Alignment length: 29903
Total # residues: 1603447695
Smallest: 29018
Largest: 29903
Average length: 29831.6
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 53750
Alignment length: 29903
Total # residues: 1594560725
Smallest: 28922
Largest: 29675
Average length: 29666.2
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 53473
Alignment length: 29903
Total # residues: 1586353941
Smallest: 28922
Largest: 29675
Average length: 29666.4
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 53473
Alignment length: 29661
Total # residues: 1582537931
Smallest: 28437
Largest: 29661
Average length: 29595.1
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 95833
#leaves: 53430
#dichotomies: 40532
#leaf labels: 53430
#inner labels: 38294
Number of new sequences added this iteration
2507 alignment_names_new.txt
Notable changes to the scripts in this release
- None
Notable aspects of the trees
- None
20-08-20
Citation and reuse
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 from GISAID data, including sequences deposited up to 20-August-2020. Zenodo DOI: 10.5281/zenodo.3958883
Please note - you cannot publish papers that use this tree without following the GISAID data sharing and attribution rules. These rules are important - they protect the data uploaders, and create trust in a global system of data sharing with potentially vast public health benefits. By building and maintaining trust we ensure that people keep sharing their data, and that the public health benefits keep flowing. I do not want the existence of this tree to be some kind of attribution laundering service (e.g. where people feel free to use the tree without following the GISAID data sharing rules), so please don't use it in that way. For example, if you are going to interpret other people's data from GISAID and publish the results, including by using this tree, you should get in touch with the people that submitted the data. The code in this repo is covered by the GNU license, and you can use that however you like.
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i gisaid_hcov-19_2020_08_21_23.fasta -o global.fa -s ft_SH_18-08-20.tree -t 54
The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 20th of August 2020, as determined by the 'submission date' filter on GISAID.
The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the tree itself here so that it can be easily downloaded without downloading the entire repo. The file 'ft_SH_18-8-20.tree' is the 'ft_SH.tree' file from the 18-8-20 release.
The lnL of the final tree is: -517033.164
Filtering statistics
sequences downloaded from GISAID
55121
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 53345
Alignment length: 29903
Total # residues: 1591404269
Smallest: 29030
Largest: 29903
Average length: 29832.3
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 53345
Alignment length: 29903
Total # residues: 1582547894
Smallest: 28961
Largest: 29675
Average length: 29666.3
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 53068
Alignment length: 29903
Total # residues: 1574341110
Smallest: 29054
Largest: 29675
Average length: 29666.5
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 53068
Alignment length: 29661
Total # residues: 1570546704
Smallest: 28437
Largest: 29661
Average length: 29595.0
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 95075
#leaves: 53026
#dichotomies: 40190
#leaf labels: 53026
#inner labels: 37969
Number of new sequences added this iteration
355 alignment_names_new.txt
Notable changes to the scripts in this release
- None
Notable aspects of the trees
- None
18-08-20
Citation and reuse
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 from GISAID data, including sequences deposited up to 18-August-2020. Zenodo DOI: 10.5281/zenodo.3958883
Please note - you cannot publish papers that use this tree without following the GISAID data sharing and attribution rules. These rules are important - they protect the data uploaders, and create trust in a global system of data sharing with potentially vast public health benefits. By building and maintaining trust we ensure that people keep sharing their data, and that the public health benefits keep flowing. I do not want the existence of this tree to be some kind of attribution laundering service (e.g. where people feel free to use the tree without following the GISAID data sharing rules), so please don't use it in that way. For example, if you are going to interpret other people's data from GISAID and publish the results, including by using this tree, you should get in touch with the people that submitted the data. The code in this repo is covered by the GNU license, and you can use that however you like.
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i gisaid_hcov-19_2020_08_21_04.fasta -o global.fa -s ft_SH_16-08-20.tree -t 54
The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 18th of August 2020, as determined by the 'submission date' filter on GISAID.
The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the tree itself here so that it can be easily downloaded without downloading the entire repo. The file 'ft_SH_16-8-20.tree' is the 'ft_SH.tree' file from the 16-8-20 release.
The lnL of the final tree is: -513889.945
Filtering statistics
sequences downloaded from GISAID
54551
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 53041
Alignment length: 29903
Total # residues: 1582325418
Smallest: 29030
Largest: 29903
Average length: 29832.1
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 53041
Alignment length: 29903
Total # residues: 1573527101
Smallest: 28961
Largest: 29675
Average length: 29666.2
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 52758
Alignment length: 29903
Total # residues: 1565142267
Smallest: 29054
Largest: 29675
Average length: 29666.4
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 52758
Alignment length: 29661
Total # residues: 1561355620
Smallest: 28437
Largest: 29661
Average length: 29594.7
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 94593
#leaves: 52747
#dichotomies: 39994
#leaf labels: 52747
#inner labels: 37785
Number of new sequences added this iteration
363 alignment_names_new.txt
Notable changes to the scripts in this release
- None
Notable aspects of the trees
- The likelihood of this tree is higher than that of the previous release (lnL -513889.945 versus -526648.019), even though this tree has ~300 new sequences in it. This is quite interesting. It suggests the SPR moves found a change that substantially improves the likelihood of the tree, potentially (though this would need confirmation) underlining the benefit of an iterative approach.
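To put this note in context, here is a small Python sketch that lists the lnL values reported on this page (from the 12-8-20 release onward) in chronological order and flags any release whose lnL is higher than that of the release before it. The values are copied from the release notes; the variable and function names are illustrative. Because each release is estimated from a different alignment, these differences are only a rough guide and not a strict like-for-like comparison.

```python
# lnL of the final tree for each release, oldest first (values from these notes).
lnl_by_release = [
    ("12-8-20", -494195.750),
    ("14-8-20", -508420.130),
    ("16-08-2020", -526648.019),
    ("18-08-20", -513889.945),
    ("20-08-20", -517033.164),
    ("22-08-20", -525105.728),
    ("24-08-20", -530402.971),
]


def improved_releases(lnls):
    """Return (release, lnL gain) wherever lnL rose relative to the prior release."""
    gains = []
    for (_, prev), (name, cur) in zip(lnls, lnls[1:]):
        if cur > prev:
            gains.append((name, round(cur - prev, 3)))
    return gains


improved = improved_releases(lnl_by_release)
for name, gain in improved:
    print(f"{name}: lnL increased by {gain}")
```

With these values, only the 18-08-20 release is flagged, with a gain of roughly 12758 log-likelihood units over the 16-08-2020 release.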
16-08-2020
Citation and reuse
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 from GISAID data, including sequences deposited up to 16-August-2020. Zenodo DOI: 10.5281/zenodo.3958883
Please note - you cannot publish papers that use this tree without following the GISAID data sharing and attribution rules. These rules are important - they protect the data uploaders, and create trust in a global system of data sharing with potentially vast public health benefits. By building and maintaining trust we ensure that people keep sharing their data, and that the public health benefits keep flowing. I do not want the existence of this tree to be some kind of attribution laundering service (e.g. where people feel free to use the tree without following the GISAID data sharing rules), so please don't use it in that way. For example, if you are going to interpret other people's data from GISAID and publish the results, including by using this tree, you should get in touch with the people that submitted the data. The code in this repo is covered by the GNU license, and you can use that however you like.
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i gisaid_hcov-19_2020_08_18_22.fasta -o global.fa -s ft_SH_14-08-20.tree -t 54
The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 16th of August 2020, as determined by the 'submission date' filter on GISAID.
The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the tree itself here so that it can be easily downloaded without downloading the entire repo. The file 'ft_SH_14-8-20.tree' is the 'ft_SH.tree' file from the 14-8-20 release.
The lnL of the final tree is: -526648.019
Filtering statistics
sequences downloaded from GISAID
53910
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 52730
Alignment length: 29903
Total # residues: 1573059735
Smallest: 29030
Largest: 29903
Average length: 29832.3
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 52730
Alignment length: 29903
Total # residues: 1571796972
Smallest: 29030
Largest: 29903
Average length: 29808.4
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 52463
Alignment length: 29903
Total # residues: 1563852377
Smallest: 29131
Largest: 29903
Average length: 29808.7
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 52463
Alignment length: 29782
Total # residues: 1558617245
Smallest: 28379
Largest: 29782
Average length: 29708.9
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 94825
#leaves: 52436
#dichotomies: 40650
#leaf labels: 52436
#inner labels: 38514
Number of new sequences added this iteration
102 alignment_names_new.txt
Notable changes to the scripts in this release
- None
Notable aspects of the trees
- None
14-8-20
Citation and reuse
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 from GISAID data, including sequences deposited up to 14-August-2020. Zenodo DOI: 10.5281/zenodo.3958883
Please note - you cannot publish papers that use this tree without following the GISAID data sharing and attribution rules. These rules are important - they protect the data uploaders, and create trust in a global system of data sharing with potentially vast public health benefits. By building and maintaining trust we ensure that people keep sharing their data, and that the public health benefits keep flowing. I do not want the existence of this tree to be some kind of attribution laundering service (e.g. where people feel free to use the tree without following the GISAID data sharing rules), so please don't use it in that way. For example, if you are going to interpret other people's data from GISAID and publish the results, including by using this tree, you should get in touch with the people that submitted the data. The code in this repo is covered by the GNU license, and you can use that however you like.
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i gisaid_hcov-19_2020_08_17_23.fasta -o global.fa -s ft_SH_12-8-20.tree -t 54
The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 14th of August 2020, as determined by the 'submission date' filter on GISAID.
The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the tree itself here so that it can be easily downloaded without downloading the entire repo. The file 'ft_SH_12-8-20.tree' is the 'ft_SH.tree' file from the 12-8-20 release.
The lnL of the final tree is: -508420.130
Filtering statistics
sequences downloaded from GISAID
53830
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 52650
Alignment length: 29903
Total # residues: 1570670171
Smallest: 29030
Largest: 29903
Average length: 29832.3
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 52650
Alignment length: 29903
Total # residues: 1561929006
Smallest: 28961
Largest: 29675
Average length: 29666.3
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 52366
Alignment length: 29903
Total # residues: 1553514614
Smallest: 29054
Largest: 29675
Average length: 29666.5
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 52366
Alignment length: 29661
Total # residues: 1549746185
Smallest: 28437
Largest: 29661
Average length: 29594.5
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 93916
#leaves: 52362
#dichotomies: 39722
#leaf labels: 52362
#inner labels: 37528
Number of new sequences added this iteration
1830 alignment_names_new.txt
Notable changes to the scripts in this release
- None (but see previous release as there were significant changes then)
Notable aspects of the trees
- None
12-8-20
Citation
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 from GISAID data, including sequences deposited up to 12-August-2020. Zenodo DOI: 10.5281/zenodo.3958883
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i gisaid_hcov-19_2020_08_11_01.fasta -o global.fa -s ft_SH_31-7-20.tree -t 54
The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 12th of August 2020, Canberra (Australia) time.
The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the tree itself here so that it can be easily downloaded without downloading the entire repo. The file 'ft_SH_31-7-20.tree' is the 'ft_SH.tree' file from the 31-7-20 release.
The lnL of the final tree is: -494195.750
Filtering statistics
sequences downloaded from GISAID
52394
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 51232
Alignment length: 29903
Total # residues: 1528465751
Smallest: 29030
Largest: 29903
Average length: 29834.2
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 51232
Alignment length: 29903
Total # residues: 1519917556
Smallest: 28961
Largest: 29675
Average length: 29667.3
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 50944
Alignment length: 29903
Total # residues: 1511384465
Smallest: 29054
Largest: 29675
Average length: 29667.6
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 50944
Alignment length: 29661
Total # residues: 1507689400
Smallest: 28437
Largest: 29661
Average length: 29595.0
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 91303
#leaves: 50921
#dichotomies: 38592
#leaf labels: 50921
#inner labels: 36463
//
Number of new sequences added this iteration
3802 alignment_names_new.txt
Notable changes to the scripts in this release
- One major change. I have shifted away from re-estimating the tree from scratch every time, and instead moved to a more 'online' model in which I add new sequences to the tree from the previous release, then re-optimise that tree. This was motivated by a number of factors. First (and most importantly), you get better trees this way (when measured by lnL scores). This is simply because estimating trees from scratch each time involves starting with a sub-optimal tree (e.g. from Neighbour Joining) and then trying to optimise it. The more data you have, the harder both parts get. The new approach starts with the previous release's tree, adds new sequences to it, and then optimises that tree. Details are in the README associated with this release. Second, the computational cost of starting from scratch was becoming too high: estimating an NJ tree for this many sequences takes longer and longer. As a result of these changes, I can no longer calculate bootstrap values like FBP and TBE that require lots of trees to be estimated from scratch. So, now I release just one tree, which has SH support values calculated by fasttree.
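Adding new sequences to the previous release's tree implies first working out which sequences in the current alignment are not yet in that tree. A minimal Python sketch of that bookkeeping follows. It is not the code used in this repo (the README describes the actual method); the function name and the example sequence names are hypothetical.

```python
def find_new_sequences(alignment_names, previous_tree_leaves):
    """Return names present in the alignment but absent from the previous tree.

    Both arguments are iterables of sequence names. The result is sorted so
    that the output is deterministic.
    """
    return sorted(set(alignment_names) - set(previous_tree_leaves))


# Hypothetical names, for illustration only.
alignment_names = ["seq_A", "seq_B", "seq_C", "seq_D"]
previous_tree_leaves = ["seq_A", "seq_C"]

new_names = find_new_sequences(alignment_names, previous_tree_leaves)
print(f"{len(new_names)} new sequences to add: {new_names}")
```

The length of the resulting list plays the same role as the "Number of new sequences added this iteration" figure reported in the filtering statistics.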
Notable aspects of the trees
- None
31-7-20
Citation
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 from GISAID data, including sequences deposited up to 31-July-2020. Zenodo DOI: 10.5281/zenodo.3958883
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid.sh -i gisaid_hcov-19_2020_07_31_06.fasta -o global.fa -t 27
The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 31st of July 2020, Canberra (Australia) time.
The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the trees themselves here so that they can be easily downloaded without downloading the entire repo.
Filtering statistics
sequences downloaded from GISAID
48019
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 48014
Alignment length: 29903
Total # residues: 1432575924
Smallest: 29030
Largest: 29903
Average length: 29836.6
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 48014
Alignment length: 29903
Total # residues: 1424431566
Smallest: 28961
Largest: 29675
Average length: 29667.0
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 47815
Alignment length: 29903
Total # residues: 1418538813
Smallest: 29054
Largest: 29675
Average length: 29667.2
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 47815
Alignment length: 29661
Total # residues: 1415077600
Smallest: 28437
Largest: 29661
Average length: 29594.8
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 85486
#leaves: 47779
#dichotomies: 36010
#leaf labels: 47779
#inner labels: 37705
Notable changes to the scripts in this release
- None
Notable aspects of the trees
- None
22-7-20
Citation
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 from GISAID data, including sequences deposited up to 22-July-2020. Zenodo DOI: 10.5281/zenodo.3958883
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid.sh -i gisaid_hcov-19_2020_07_22_07.fasta -o global.fa -t 34
The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 22nd of July 2020, at 9PM Canberra (Australia) time.
The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the trees themselves here so that they can be easily downloaded without downloading the entire repo.
Filtering statistics
sequences downloaded from GISAID
44915
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 44446
Alignment length: 29903
Total # residues: 1326206221
Smallest: 29146
Largest: 29903
Average length: 29838.6
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 44446
Alignment length: 29903
Total # residues: 1318812088
Smallest: 29059
Largest: 29680
Average length: 29672.2
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 44278
Alignment length: 29903
Total # residues: 1313831108
Smallest: 29059
Largest: 29680
Average length: 29672.3
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 44278
Alignment length: 29661
Total # residues: 1310443036
Smallest: 28457
Largest: 29661
Average length: 29595.8
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 79266
#leaves: 44233
#dichotomies: 33504
#leaf labels: 44233
#inner labels: 35031
Notable changes to the scripts in this release
- None
Notable aspects of the trees
- A few long branches, particularly on sequences from India. These could be real, or they could be due to a lot of sequencing error. If real, they would suggest that there are some highly diverged sequences in India. Either way, these sequences should be treated with more caution than the others.
11-7-20
The trees in this release were generated with the following command line:
bash global_tree_gisaid.sh -i gisaid_hcov-19_2020_07_10_23.fasta -o global.fa -t 34
The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 11th of July 2020, at 9AM Canberra (Australia) time.
The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the trees themselves here so that they can be easily downloaded without downloading the entire repo.
Filtering statistics
sequences downloaded from GISAID
40005
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 39587
Alignment length: 29903
Total # residues: 1181493406
Smallest: 29146
Largest: 29903
Average length: 29845.5
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 39587
Alignment length: 29903
Total # residues: 1176129304
Smallest: 29096
Largest: 29718
Average length: 29710.0
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 39430
Alignment length: 29903
Total # residues: 1171468961
Smallest: 29096
Largest: 29718
Average length: 29710.1
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 39430
Alignment length: 29704
Total # residues: 1168517787
Smallest: 28492
Largest: 29704
Average length: 29635.2
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 70775
#leaves: 39342
#dichotomies: 30116
#leaf labels: 39342
#inner labels: 31431
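The tree statistics above can be cross-checked with a little arithmetic. The Python sketch below uses the numbers reported for this release and assumes that '#nodes' counts leaves plus internal nodes, and that '#dichotomies' counts internal nodes with exactly two children. Those readings of the output are assumptions, not something stated in these notes.

```python
# Tree statistics reported for the 11-7-20 release.
nodes = 70775
leaves = 39342
dichotomies = 30116
inner_labels = 31431

# Assumption: every node is either a leaf or an internal node.
internal_nodes = nodes - leaves

# Assumption: a dichotomy is an internal node with exactly two children,
# so the remainder are internal nodes that are not strictly bifurcating.
non_binary_internal = internal_nodes - dichotomies

# Internal nodes that carry no label.
unlabelled_internal = internal_nodes - inner_labels

print(f"internal nodes: {internal_nodes}")
print(f"non-binary internal nodes: {non_binary_internal}")
print(f"unlabelled internal nodes: {unlabelled_internal}")
```

Under those assumptions, the tree has 31433 internal nodes, of which 1317 are not strictly bifurcating and 2 carry no label.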
Notable changes to the scripts in this release
- None
Notable aspects of the trees
- None