diff --git a/docs/Manual.html b/docs/Manual.html new file mode 100644 index 0000000..32dcf5e --- /dev/null +++ b/docs/Manual.html @@ -0,0 +1,3373 @@ + + + + +
+PhyloTrace Version 1.5.0
+Web: www.phylotrace.com
+Contact: info@phylotrace.com
+Github: https://github.com/infinity-a11y/PhyloTrace
+PhyloTrace is a platform for monitoring bacterial pathogens at the genomic level. Its components revolve around Core-Genome Multilocus Sequence Typing (cgMLST) and Antimicrobial Resistance Screening. Complex analyses and computations are wrapped in an appealing and easy-to-handle graphical user interface. Users build a local database of analyzed isolates that can be managed directly within the application. The visualization of isolate relationships and genetic profiles is highly interactive and helps reveal patterns that explain outbreak dynamics and events by connecting genomic information with epidemiologic variables. PhyloTrace achieves universal compatibility by assigning unique hashes based on sequence and allele information. This implementation enables efficient comparison and sharing of inter-lab results.
+PhyloTrace is intended for research and academic purposes only.
+Install the application by following the steps described in the README document on GitHub. Launch PhyloTrace from the applications menu of your system. The app runs in the system’s default browser. PhyloTrace is optimized for Chrome, Chromium, Brave, Opera and Vivaldi. Avoid using Firefox, as some elements are distorted or not visible at all.
+PhyloTrace encourages, but does not require, building a local database and
+iteratively adding new bacterial isolates together with their allelic
+profiles and meta data. Upon first launch, either load an existing
+database or create a new one.
To start completely from scratch with no previously built database
+available, select + Create New
on the start screen
+(Figure 1) and choose a path where the database should
+be built. A folder named Database will be created in the
+chosen location. Make sure to select a location with read and write
+permissions. Since no entries have been added and no schemes
+downloaded yet, the database is empty and you are immediately directed
+to the > Manage Scheme
tab after
+clicking on Load
. The drop-down menu lists all bacterial
+species that are available in the cgMLST.org Nomenclature Server
+(h25). Selecting a species will display information about the
+scheme, such as the seed genome or the curators. Pick the species you
+want to work with and press Download
. You can now proceed
+to type the first assemblies belonging to the respective bacterial
+species (see 3 Allelic Typing).
+
If you or your working group or institution has used PhyloTrace
+before, the respective database folder might already be saved
+on the internal file system. Click Browse
on the start
+screen (Figure 1) and select the path of the database
+folder. PhyloTrace will automatically recognize if the selected folder
+contains compatible data.
The database is structured by folders for each bacterial species you
+have worked with (see Figure 3). Therefore, when
+loading a local database, select which species you want to work with in
+this session. For example, if the database contains entries typed with
+Bordetella pertussis, Burkholderia pseudomallei and
+Klebsiella pneumoniae schemes, you can choose one of
+them. Proceed by clicking the Load
button. The database
+section containing data for the selected species will load.
+
If the existing database doesn’t include the species you want
+to work with, pick any available species and load the database. Then head
+over to the > Manage Scheme
tab and
+select your desired bacterial species from the list. Proceed to download
+the scheme files comprising gene variants and scheme info by clicking
+Download
. After the download is complete you are prompted
+to load the database again (see Figure 4). Select the
+species that was just downloaded and confirm. Proceed to start the first
+typing process for this species (see 3 Allelic
+Typing).
The currently loaded species/scheme is displayed at the top of the sidebar below the PhyloTrace logo. If more than one scheme is available in the current database directory, the scheme can be changed within the same session. To switch, click the button next to the displayed scheme and choose the new one. After confirmation, the database is loaded with the newly selected scheme. To switch to a scheme in a database located in a different directory, restart the app and select the path of that database folder.
+The typing process is the fundamental step that generates the data (i.e. the allelic profile) for the genomic comparison. The method applied is based on core-genome multilocus sequence typing (cgMLST). An allelic profile is generated for each selected bacterial isolate. It records which allele variant is present for each gene in the cgMLST scheme. If the process is successful, the results, i.e. the allelic profile of the respective isolate as well as epidemiologic meta data, are added as an entry to the local database (see 4 Database Browser). Repeating this process with further isolates builds up a library of bacterial isolates. Technically there is no limit on the number of entries in the database, although performance might be reduced if there are several hundred entries in the currently loaded scheme (depending on system capacity). The variant calling and alignment steps of the typing process are performed by BLAT (BLAST-like Alignment Tool) for whole genome assemblies and by the KMA (k-mer alignment) algorithm for raw reads 1,2. Allelic typing for raw reads will be available soon.
+In the sidebar of the
+> Allelic Typing
tab select
+☑ Single | ☐ Multi
(see Figure 5).
+Clicking on Browse
will open a window so that an assembly
+file from the local system can be selected. Any of the commonly used
+FASTA file formats (.fasta, .fna or .fa) are accepted. Selecting an
+incompatible file type prevents the typing process from starting.
+Make sure that the assembly file contains sequence data of a bacterial
+species that matches the selected scheme. Afterwards the basic meta data
+(i.e. Assembly ID
, Assembly Name
,
+Isolation Date
, Host
, Country
,
+City
) can be declared. Not every field needs to be filled out, e.g. if
+the respective information is unavailable.
+Note that the Assembly ID
has to be unique; proceeding is
+not possible if the same ID is already present in the local database.
+Except for the Assembly ID
these isolate variables can
+still be changed afterwards in the
+>> Browse Entries
tab. Clicking on
+Confirm
will save the metadata and render the process
+executable.
Before starting the process, select whether to save the assembly file
+to the local database. If an assembly file is not saved, screening for
+resistance and virulence genes will later not be available for the
+respective isolate (see 5 AMR Screening).
+The assembly file cannot be added retrospectively. Pressing
+Start
will launch the typing process. The alignment
+algorithm is now searching the selected assembly for the alleles
+contained in the scheme and checking which variant is present. The
+loading bar provides feedback on the progress. The duration depends
+on the capability of your system and on the number of alleles and
+variants included in the scheme, and can take a while. Once 100% is
+reached the typing results are evaluated and appended to the local
+database. Database changes in the tab
+>> Browse Entries
are automatically
+blocked during this finalization step to avoid conflicts. After this last
+step is finished, you can reset to start another typing process. If the typing was
+successful, the addition of a new entry is indicated by a pulsating
+button in the >> Browse Entries
tab.
+Click this button to load the updated database including the newly added
+entry.
Multi typing is recommended for larger collections of several
+assemblies belonging to the same species. This saves the time needed to
+start the process one by one. In the sidebar of the
+> Allelic Typing
tab switch to
+☐ Single | ☑ Multi
and click on Browse
to
+select a folder containing the assemblies. If you plan to type just a
+subset of the selected folder, untick the unwanted assemblies in the
+table below and choose a compatible Assembly ID
. The multi
+typing process can only be started if no incompatible files are ticked.
+Because all files are piped into the process without interruption, the basic
+meta data can only be declared once for all assemblies. The values
+declared for Isolation Date
, Host
,
+Country
and City
will apply to every new
+entry that is produced in this multi typing process. The
+Assembly Name
is initially identical to the
+Assembly ID
, representing unique identifiers of the
+assembly. The file name of the respective assembly is automatically
+assigned to both. However, all of the basic meta data values, except
+Assembly ID
, can be changed in retrospect once the entry
+has been successfully added to the database. After confirming the
+metadata the Start
button will be rendered. Note that if
+you choose not to save the assembly files to the local database,
+screening for resistance and virulence genes will not be available for
+the respective isolate later (see 5 AMR
+Screening). The assembly files also cannot be added retrospectively.
+Upon starting the multi typing process, a field where the progress is
+logged is displayed. The process can be monitored with this overview.
+The log of the multi typing process can be downloaded as text file.
+Notifications, providing feedback about the status of the multi typing
+process, show up for every relevant event, such as the (un-)successful
+addition of an entry or the finalization of the multi typing process. A
+pending typing process can be canceled by clicking
+Terminate
. While the process is in the typing or alignment
+phase (indicated by Processing in the log), you can keep
+working with PhyloTrace, e.g. visualizing or editing the local database.
+However, just as for single typing, the app automatically recognizes
+when the process switches to the evaluation and addition phase
+(indicated by Attaching in the log) and blocks any database changes
+during this phase. After each successful addition you can reload the
+database in the >> Browse Entries
+tab, to inspect the new entry. Unsuccessful typing attempts are captured
+in the log and in the multi typing summary once the process has been
+finalized (see Figure 6). Individual results can be
+inspected by choosing them from the selector in the right column.
+Only notable events are displayed, e.g. a newly found allele variant
+or an unsuccessful allele calling attempt. Press Reset
+to start another multi typing process.
After each variant from the cgMLST scheme has been searched and
+aligned to the assembly, the results are evaluated to determine which
+allele variant is present for each locus. This is conducted by a
+conditional multi-step process that ensures correctness and minimizes
+false positive assignments. The steps and the logic applied in this
+process are shown in Figure 7. If none of the variants
+from the scheme could be found in the bacterial isolate, the presence of
+a potential new gene variant is evaluated (see 3.3.1 New Variant Validation).
+
In case none of the variants from the locally available scheme match
+perfectly, the locus is checked for the existence of a new and valid
+variant. To ascertain whether this variant is valid, the locus must
+fulfill conditions such that it is likely to encode a gene. If there are
+multiple different nucleotide regions in the assembly possibly coding
+for a gene, each of them is sorted and passed through the validation
+logic (see Figure 8).
Unlike the genetic distance between a pair of sequences, which sums up the number of positions at which nucleotides differ, the allelic distance considers entire loci/alleles. To obtain the allelic distance, algorithms based on the distance measure introduced by Hamming in 1950, originally intended for information technology, are used3. The Hamming distance is a metric that quantifies the discrepancy between two strings of equal length: it counts the number of positions at which the characters of the two strings differ, which equals the minimum number of substitutions required to transform one string into the other. For cgMLST with PhyloTrace, the allelic profile is represented by an array of hashes, i.e. 64-bit words. The positions of the array elements correspond to the loci in the scheme, and each hash represents the allele sequence at the respective locus. This allelic profile is generated during the typing process. Thus, for a pairwise comparison of the allelic profiles of two isolates, the total number of discrepant alleles is the allelic distance. Comparing a selection of isolates results in a distance matrix (see 4.4 Distance Matrix), which is then used to compute a tree (see 6 Visualization).
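As a minimal sketch of this comparison, two allelic profiles without missing values can be compared position by position in R. The hash strings below are made-up placeholders rather than real PhyloTrace hashes, and `allelic_distance` is a hypothetical helper name used only for illustration.

```r
# Two hypothetical allelic profiles; each element stands for the hash of the
# allele found at one locus (placeholder strings, not real PhyloTrace hashes).
profile_1 <- c(LocusA = "a1f3", LocusB = "9c2e", LocusC = "77b0", LocusD = "e4d1")
profile_2 <- c(LocusA = "a1f3", LocusB = "0b5a", LocusC = "77b0", LocusD = "c8f2")

# Hamming distance: number of loci at which the two profiles differ.
allelic_distance <- function(x, y) {
  stopifnot(length(x) == length(y))
  sum(x != y)
}

allelic_distance(profile_1, profile_2)
```

Here the profiles differ at `LocusB` and `LocusD`, so the allelic distance is 2, regardless of how many nucleotides differ within those two loci.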
+If no variant could be assigned for some genes contained in the scheme, NA values are placed in the allelic profile at the position of the respective gene/locus. This can happen if the corresponding gene is not found in the assembly sequence, if there are multiple hits, or if the variant in the assembly is non-coding (refer to 3.3 Variant Assignment).
+To illustrate how allelic distances are calculated for
+isolates with missing values, we set up an example. For simplicity,
+we consider just three isolates, Isolate 1
,
+Isolate 2
and Isolate 3
with three loci only,
+Locus A
, Locus B
and Locus C
. For
+Isolate 1
let Locus A
have variant 1,
+Locus B
a missing value NA and Locus
+C
variant 1. For Isolate 2
let Locus
+A
be a missing value NA, Locus B
+variant 1 and Locus C
variant 1. For
+Isolate 3
let Locus A
be 2, Locus
+B
also 2 and Locus C
1.
allelic_profile <- data.frame(A = c(1, NA, 2), B = c(NA, 1, 2), C = c(1, 1, 1),
+ row.names = c("Isolate 1", "Isolate 2", "Isolate 3"))
+allelic_profile
+##            A  B C
+## Isolate 1  1 NA 1
+## Isolate 2 NA  1 1
+## Isolate 3  2  2 1
+Option 1: Ignore missing values for pairwise +comparison
+Selecting the first option as the missing value handling strategy causes NA values to be ignored in the pairwise comparison between two isolates. Unlike Option 2, only the individual missing values are ignored, not the entire locus.
+# Option 1
+
+hamming.distIgnore <- function(x, y) {
+ sum( (x != y) & !is.na(x) & !is.na(y) )
+}
+
+proxy::dist(allelic_profile, method = hamming.distIgnore)
+##           Isolate 1 Isolate 2
+## Isolate 2         0
+## Isolate 3         1         1
+Isolate 1 and Isolate 2 each have an NA at one of the first two
+loci A
and B
with the third locus
+C
being identical. Their allelic distance is 0,
+hence these two isolates are considered identical in their allelic
+profile. The two other pairs, Isolate 1 & 3 and Isolate 2 & 3, both
+result in an allelic distance of 1.
Option 2: Omit loci with missing values for all +assemblies
+If the second option is selected, loci containing at least one missing value are ignored in the calculation of allelic distances. Unlike Option 1, the loci with missing values are entirely omitted from all pairwise comparisons. Even if both isolates of a pair have valid variant numbers for a locus, the locus is not included in the analysis if it contains just one NA for another isolate. For the missing value statistics shown in Figure 10 [5.5 Missing Values], 41 loci, displayed as columns in the missing value table, would not be considered for the distance calculation. For this option, the respective loci are filtered out of the allelic profile before the distance is computed. Because this option can skew the overall picture, it is only recommended if very few loci are affected by missing values.
+# Option 2
+
+hamming.distOmit <- function(x, y) {
+ sum(x != y)
+}
+
+# Drop the loci that contain at least one NA (select() comes from dplyr)
+allelic_profile_noNA <- dplyr::select(allelic_profile, -A, -B)
+
+proxy::dist(allelic_profile_noNA, method = hamming.distOmit)
+##           Isolate 1 Isolate 2
+## Isolate 2         0
+## Isolate 3         0         0
+Locus A
and B
are omitted before
+calculating the distance. This leads to all isolates being considered
+identical with an allelic distance of 0, because they all carry
+variant 1 for the only remaining locus C
.
Option 3: Treat missing values as allele variant
+The third option is rather specific and, considering the consequences +for subsequent calculation of allelic distances and analyses, should be +used with caution. Here, NA values are treated as if they were +a separate variant.
+# Option 3
+
+hamming.distCategory <- function(x, y) {
+ sum((x != y | xor(is.na(x), is.na(y))) & !(is.na(x) & is.na(y)))
+}
+
+proxy::dist(allelic_profile, method = hamming.distCategory)
+##           Isolate 1 Isolate 2
+## Isolate 2         2
+## Isolate 3         2         2
+Because NA is treated as a valid variant of its own, all isolate pairs receive an allelic distance of 2.
+Depending on the NA handling option applied to these allelic profiles, the resulting allelic distances differ. The results of the example calculations are summarized in the table below.
+Pair | Option 1 | Option 2 | Option 3
---|---|---|---
+Isolate 1 & 2 | 0 | 0 | 2
+Isolate 1 & 3 | 1 | 0 | 2
+Isolate 2 & 3 | 1 | 0 | 2
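The three strategies can also be compared side by side in base R without additional packages. The following is only a sketch that reproduces the summary table for the example profile; `pair_distance` and `summary_table` are hypothetical names and do not refer to PhyloTrace internals.

```r
allelic_profile <- data.frame(A = c(1, NA, 2), B = c(NA, 1, 2), C = c(1, 1, 1),
                              row.names = c("Isolate 1", "Isolate 2", "Isolate 3"))

# Distance of one isolate pair under each missing value strategy.
pair_distance <- function(x, y, complete_loci) {
  both_present <- !is.na(x) & !is.na(y)        # loci typed in both isolates
  one_missing  <- xor(is.na(x), is.na(y))      # loci typed in only one isolate
  differing    <- sum(x[both_present] != y[both_present])
  c(option_1 = differing,                                   # ignore NA pairwise
    option_2 = sum(x[complete_loci] != y[complete_loci]),   # omit loci with any NA
    option_3 = differing + sum(one_missing))                # NA counts as a variant
}

profiles      <- as.matrix(allelic_profile)
complete_loci <- colSums(is.na(profiles)) == 0   # loci without any NA
pairs         <- combn(rownames(profiles), 2)    # all isolate pairs

summary_table <- t(apply(pairs, 2, function(p) {
  pair_distance(profiles[p[1], ], profiles[p[2], ], complete_loci)
}))
rownames(summary_table) <- paste(pairs[1, ], "&", pairs[2, ])
summary_table
```

Each row of `summary_table` corresponds to one isolate pair and each column to one option, matching the values in the table above.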
The > Database Browser
tab lets you
+examine and manage information saved in the local database of the
+selected scheme. It is divided into the
+>> Browse Entries
,
+>> Scheme Info, >> Loci Info
,
+>> Distance Matrix
and
+>> Missing Values
tabs.
Each assembly that has been successfully typed is added to the table
+in >> Browse Entries
. This overview
+lets you edit (see 4.1.1 Edit Meta Data),
+delete (see 4.1.3 Delete Entries), inspect
+(see 4.1.4 Browse the Allelic
+Profile) and add (see 4.1.2 Custom
+Variables) information connected with the entries. The table can
+also be downloaded (see 4.1.5 Download
+Entry Table). The table contains both the meta data and the allelic
+profile for each entry. The meta data as well as custom variables (see
+4.1.2 Custom Variables) appear first on
+the left part of the table, while the allelic profile with the assigned
+variants is positioned on the right part of the table (see 4.1.4 Browse the Allelic
+Profile). The Index
automatically assigns a number to
+each entry and is updated accordingly if entries are deleted (see 4.1.3 Delete Entries). The
+Include
status determines whether the respective entry is included in or
+excluded from further analyses, such as Visualization (see 6 Visualization).
The basic meta data comprising Assembly Name
,
+Isolation Date
, Host
, Country
and
+City
can be edited in the entry table by left-clicking in
+the corresponding field. As soon as changes are detected, a pulsating
+button appears that saves the changes when clicked. If you decide
+otherwise, press the Undo
button and go back to the
+previous state. Assembly ID
is the name of the isolate in
+the Isolate directory of the local database and can’t be
+changed. The Index
number as well as the assigned hashes
+representing the allele variants in the allelic profile also can’t be
+edited because this would invalidate the analysis.
There is also the option to add custom variables using the controls
+in the >> Browse Entries
sidebar.
+Choose a name for the variable and press the green +
button
+to add it. In the dialogue window select the variable type, categorical
+(character) or continuous (numeric). After confirmation the variable is
+ready to be filled with values. These can be changed in retrospect in
+the same way as basic meta data (see 4.1.1
+Edit Meta Data). Note that the database needs to be saved,
+otherwise the custom variables are not permanently added. The custom
+variable type and name can’t be changed in retrospect, but they can be
+deleted by selecting them from the drop-down menu in the sidebar and
+clicking the red -
button. If more than five custom
+variables are present, a table summarizing them is displayed in the
+sidebar.
The Delete Entries panel on the top right corner of the
+>> Browse Entries
tab lets you
+delete single or multiple entries at once. Select one or multiple
+entries to be deleted according to their Index
in the
+drop-down menu. Clicking the red x
button will open a
+dialogue window, prompting for confirmation about the intention to
+irreversibly delete the selection. The deletion will lead to a complete
+removal of the respective entry together with all the meta data, custom
+variable values and allelic profile. However, if the database is not
+saved after the deletion, the entry will appear again in the next session. The
+deletion can also be undone with the Undo
button in the same
+session. Note that if you select all entries
+for deletion, confirmation will immediately and irreversibly empty the
+database for the currently selected scheme and you will
+not have the option to undo this action.
Scrolling the entry table to the right will reveal the allelic
+profile. The variant numbers for each allele/locus are sorted
+column-wise for each entry. By default, only the first 20 loci are
+displayed. It is possible to manually change which loci are shown by
+selecting or deselecting them in the Compare Loci panel on the
+right below the Delete Entries panel. Each assigned
+hash, representing a distinct allele sequence, is truncated to its first
+and last four digits. Locus columns containing at least one entry with
+an allele variant that differs from the others are highlighted in
+green.
If the Only Varying Loci
option is activated, only loci
+with differing variants (i.e. the columns highlighted in red) are
+displayed. For missing variant values, i.e. if no variant could be
+allocated to a locus (see 3.4.1
+Missing Value Handling), the corresponding cell appears empty.
The entry table can be downloaded as a CSV file. There are two options
+to control this output. As the user sometimes might choose to only
+include a subset of entries in a current analysis, there is the option
+to include only the entries of interest in the output file. Activate the
+switch Only included Entries
to include only the entries
+that are checkmarked in the Include
column. Control the
+Include
status either by checking or unchecking the
+checkboxes in the Include
column or select or unselect all
+at once by using the buttons on the top-left of the entry table. Note
+that the database has to be saved for the changes to take effect. The
+Index
of the entries marked as included is highlighted in
+green, and only these entries are considered in visualization (see 6 Visualization). Moreover, you can choose whether
+and which loci should be included in the download. By default only the
+meta data and custom variables of the entries are included in the CSV
+file. If you activate the switch Include Displayed Loci
,
+the currently displayed loci are included as well. Use the control in
+the Compare Loci
box, to decide which and how many loci are
+displayed. Upon clicking the Download
button you can choose
+to which location on your system the file should be saved.
The tab >> Scheme Info
lets you
+inspect the properties of the currently selected scheme. The table
+displays information regarding the cgMLST scheme downloaded from the cgMLST.org Nomenclature Server (h25).
+It comprises the name of the scheme, the version, the seed genome, genus
+and species, the number of loci included, the complex type distance and
+count parameters, the date of the most recent changes, the official
+curators, publications addressing this scheme as well as the accessory
+scheme.
The overview in the tab
+>> Loci Info
provides information on
+the loci included in the scheme as well as the distribution of alleles
+among isolates present in the local database. The table lists
+the Locus ID (e.g. BP0001, BP2483), the gene identifier if known
+(e.g. glpK, pykA), the position of the loci in the seed genome, the
+length in nucleotides (e.g. 1233), the gene product (e.g. pyruvate
+kinase, chromosome partitioning protein) as well as the number of
+variants included in the base scheme. There is the option to filter the
+table by keywords or numbers. Note that the filter applies to all attributes,
+so searching for “566” displays loci whose ID
+includes this number (e.g. “BP0566”, “BP1566”, etc.), whose
+position (e.g. 317566, 1255669) or length (e.g. 1566) includes it, and loci with any other
+attribute containing the keyword or number.
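The behaviour of such an all-attribute filter can be sketched in base R as follows. The table excerpt and its values are purely illustrative, and `filter_loci` is a hypothetical helper, not a function of PhyloTrace.

```r
# Hypothetical excerpt of a loci table (values are illustrative only).
loci <- data.frame(
  locus_id = c("BP0056", "BP0566", "BP1234"),
  gene     = c("glpK", "pykA", NA),
  position = c(317566, 20110, 88001),
  length   = c(1233, 1566, 900),
  stringsAsFactors = FALSE
)

# Keep every row in which at least one attribute contains the keyword.
filter_loci <- function(tbl, keyword) {
  # Convert all columns to text so that numbers can be searched like strings.
  as_text <- vapply(tbl, as.character, character(nrow(tbl)))
  hits <- apply(as_text, 1, function(row) any(grepl(keyword, row, fixed = TRUE)))
  tbl[hits, , drop = FALSE]
}

filter_loci(loci, "566")
```

In this excerpt the first locus is returned because of its position (317566) and the second because of both its ID and its length, while the third locus matches in no attribute.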
Selecting a locus from the table will render alleles present in the
+database and their respective DNA sequence. Browse alleles by choosing
+them from the selector showing the respective frequency of the selected
+allele in the database. The sequence can be copied to the clipboard. A
+FASTA file comprising all hashed allele sequences from the currently
+selected locus can be exported with Save FASTA
. To export
+the table with metadata of all loci included in the scheme, click the
+download button right next to the header Loci at the top.
The tab >> Distance Matrix
shows
+a heatmap matrix of the allelic distances between the entries. For
+details on how the allelic distances are derived refer to 3.4 Calculation of Allelic
+Distance. For each pair of entries, the number of allele variants that
+are not identical, i.e. the allelic distance, is displayed in the respective
+cell. The choice of how missing values, i.e. unsuccessful variant
+allocations for some loci, are handled can have a small or a large
+impact on these values, depending on several parameters (see 3.4.1 Missing Value Handling). In
+addition to the visualization with tree plots, changes in the missing
+value handling strategy can be directly observed in this overview. The
+readability of the matrix is enhanced by a heatmap. The values contained
+are normalized resulting in a color gradient from light green to dark
+red. The lowest value, which is always 0 in the diagonal (allelic
+distance of the same entry logically is zero), is highlighted in light
+green. The highest value (dark red) varies and depends on the highest
+allelic distance value in the matrix.
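The normalization behind such a heatmap can be sketched in base R. The distance values below are invented, and the exact colors and number of gradient steps used by PhyloTrace are assumptions made for illustration only.

```r
# Hypothetical allelic distance matrix for three isolates.
dist_matrix <- matrix(c( 0, 4, 10,
                         4, 0,  7,
                        10, 7,  0),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(paste("Isolate", 1:3), paste("Isolate", 1:3)))

# Scale to [0, 1] relative to the largest distance in the matrix
# (assumes at least one distance is greater than zero).
normalized <- dist_matrix / max(dist_matrix)

# Map the scaled values onto a 100-step gradient from light green to dark red.
gradient <- colorRampPalette(c("lightgreen", "darkred"))(100)
cell_colors <- matrix(gradient[1 + round(normalized * 99)],
                      nrow = nrow(dist_matrix),
                      dimnames = dimnames(dist_matrix))
cell_colors
```

The diagonal always receives the first color of the gradient, and the cell holding the largest distance receives the last one, so the same distance can appear in a different shade when the maximum of the matrix changes.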
There is the option to change the appearance of the matrix. Choose
+whether Assembly Name
, Assembly ID
or
+Index
is displayed as column or row headers. Since
+the focus is sometimes on the subset of entries that are marked as
+included in the entry table, the switch
+Only Included Entries
can be toggled to show only this
+selection. Also the display of the diagonal line and the upper triangle
+can be activated or deactivated using the switches
+Show Diagonal
and Show Upper Triangle
+respectively. The distance matrix can be downloaded as a CSV file. Note
+that the matrix is downloaded as currently displayed, including all the
+changes made to the appearance (e.g. with or without diagonal or
+Index
instead of Assembly Name
as header).
Missing values occur if a locus cannot be found in the assembly or
+if the present allele contains mutations leading to a dysfunctional
+gene. As long as no entry in the local database has any missing values,
+the >> Missing Values
tab is not
+displayed. When a new entry with NA value(s) is added to a
+local database that contained no missing values so far, reloading the
+database automatically renders the
+>> Missing Values
tab to
+call attention to the newly occurring missing values. This tab provides
+statistical information about the occurrence of missing values and, most
+importantly, controls for selecting the strategy by which
+missing values are treated in subsequent analyses. The choice of how
+these values should be handled directly impacts the calculation of the
+allelic distances between the bacterial isolates. The options to choose
+from are detailed in 3.4.1 Missing
+Value Handling. Due to the importance of missing values and how they
+are treated, upon loading local databases containing at least one
+missing value, the >> Missing Values
+tab will always be rendered first.
Figure 14 shows statistics about the missing values +of the entries in this database. There are 1069 unsuccessful allele +allocations in total, i.e. the global sum of NA values of all +entries and loci. There are 2983 loci in total in the selected +Bordetella pertussis scheme and 217 of these have one or more +missing values, which makes up about 7.3 %. Isolates for which more than +5% of loci contain missing values are highlighted in orange. These +should be included in further analyses with caution because a +significant share of alleles couldn’t be determined.
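The statistics described here follow from simple counts over the allelic profile. The sketch below uses a tiny invented profile, so its numbers do not correspond to Figure 14, and all variable names are hypothetical.

```r
# Hypothetical allelic profile: rows are isolates, columns are loci.
profile <- matrix(c( 1, NA, 3, 2,
                     1,  2, 3, 2,
                    NA, NA, 3, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste("Isolate", 1:3), paste0("Locus", 1:4)))

missing <- is.na(profile)

total_missing     <- sum(missing)                    # all NA values
loci_with_na      <- sum(colSums(missing) > 0)       # loci with at least one NA
share_of_loci     <- 100 * loci_with_na / ncol(profile)
share_per_isolate <- 100 * rowSums(missing) / ncol(profile)
flagged           <- names(share_per_isolate)[share_per_isolate > 5]

# The share quoted in the text follows the same arithmetic: 217 of 2983 loci.
round(100 * 217 / 2983, 1)
```

With only four loci, a single missing value already exceeds the 5% threshold, so `flagged` contains both isolates that have any NA; in a real scheme with thousands of loci the threshold is far less sensitive.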
+Each row in the table on the right shows an entry that contains at
+least one missing value. The next column, Errors
,
+contains the number of missing values for that isolate. The
+following columns are loci including at least one missing value (denoted
+by NA
).
Screening for species-specific genes of interest, e.g. antibiotic
+resistance, virulence or stress genes, can be performed using the
+integrated NCBI/AMRFinder
+tool. The tab > Resistance Profile
+provides the interface for this feature and lets users inspect the
+screening results in
+>> Browse Entries
and perform the
+screening from the tab >> Screening
.
+Note that not every species is available for screening with AMRFinder.
+The availability for the currently selected scheme is automatically
+checked.
Use the tab >> Screening
to run
+AMRFinder. Selecting one or multiple isolates and clicking
+Start
initiates the process. The runtime is estimated at less
+than a minute per isolate. Only isolates for which the respective
+assembly file is present in the local database can be
+screened. The results can be inspected in parallel using the selector
+on the right, which appears once at least one isolate has finished the
+screening. Feedback on unsuccessful screening attempts is displayed as
+well.
There are two viewing modes available to browse the resistance
+profile, resulting from gene screening. Selecting the view mode
+☑ Picker | ☐ Table
renders the option to select isolates
+from a simple selector. The table showing the resistance profile
+(including also virulence genes, stress genes, etc.) for the selected
+isolate will appear below. The view mode
+☐ Picker | ☑ Table
shows the isolate entry table above the
+resistance profile instead of the selector and therefore, in addition to
+providing a good overview, enables filtering and sorting. Select an
+entry from the table to render the respective resistance profile for
+this isolate. The currently selected table can be exported as CSV with
+Profile Table
.
Based on the allelic distances in the distance matrix (see 4.4 Distance Matrix), different tree plots
+can be created. PhyloTrace offers three different tree
+construction algorithms, Minimum-Spanning
,
+Neighbour-Joining
and UPGMA
. This tree type
+can be selected in the sidebar of the
+> Visualization
tab (see Figure
+17). When the Create Tree
button is clicked, a tree
+plot of the currently selected tree type is computed and displayed.
+You can switch to a different tree type and create another tree without
+losing the tree created before. If you switch back to the previous tree
+type, you will still have the previously created tree. Unless you create
+a new tree for the same tree type, the plot will be conserved in the
+current session. Switching between tree types makes it possible to
+seamlessly compare trees created from the same data set but with different
+tree construction algorithms. Changes to the entry table in the
+>> Browse Entries
tab, such as
+inclusion of additional isolates (via ticking Include
) or
+edited variables, only take effect in the tree plot if you save
+the database with the changes and click Create Tree
again.
+Once a tree has been created, it can be modified and customized without
+having to create it again.
The minimum-spanning-tree (MST) algorithm constructs a tree by +connecting the closest nodes of the distance matrix without +forming cycles. It connects all +the nodes with the set of edges that minimizes the total edge length. +Refer to 6.1.1 MST Modification to find +out how the tree appearance can be modified. The nodes represent single +bacterial isolates. Isolates with identical allelic profiles, i.e. a +distance value of 0, are summarized in a single node. If the allelic +distance between isolates lies within a certain threshold, clusters are +drawn.
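As an illustration of the MST construction described above, the following Python sketch applies Prim's algorithm to a small distance matrix. It is a minimal sketch under assumptions, not PhyloTrace's implementation: the function name `minimum_spanning_tree` and the example distances are hypothetical.

```python
def minimum_spanning_tree(dist):
    """Return MST edges (i, j, weight) for a symmetric distance matrix (Prim's algorithm)."""
    n = len(dist)
    if n == 0:
        return []
    in_tree = {0}  # start from an arbitrary node
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge linking a node inside the tree to one outside it.
        weight, i, j = min(
            (dist[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j, weight))
        in_tree.add(j)
    return edges


# Hypothetical allelic distances between four isolates (not real data).
distances = [
    [0, 2, 9, 10],
    [2, 0, 7, 8],
    [9, 7, 0, 3],
    [10, 8, 3, 0],
]
mst_edges = minimum_spanning_tree(distances)
total_length = sum(weight for _, _, weight in mst_edges)
```

For n nodes the result always contains n - 1 edges, which is why the graph has no cycles.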
+Figure 18 shows the modification panels for MST
+plots. These are divided into Layout (see 6.1.1.1 Layout), Nodes (see 6.1.1.2 Nodes) and Edges (see 6.1.1.3 Edges). There are several options to customize
+MST graphs, e.g. colors, forms, sizes, titles, labels, and more. Note
+that, due to the way MST plots are generated, the plot is reset
+to its initial position when one of the modification
+parameters is changed. MST graphs can be enriched with information by mapping
+variables to the plot.
The Layout control panel lets you add a title, subtitle and
+footer to the graph by typing them in the text fields. Change
+their colors individually using the color buttons below the text fields.
+The overall background color can be modified as well. Toggle the
+Transparent
switch to make the background transparent.
The Nodes control panel controls the appearance of +the nodes and related elements such as the label. The upper left +controls are related to the label, i.e. which isolates are represented +by the respective node. Using the drop-down menu, the label can be +changed to any variable present for the respective isolates according to +the entry table. The color of the node labels can be modified using the +color button and their size by clicking on the blue menu button right +next to it. The color of the nodes themselves can be changed using the +color button from the control panels on the upper right. Clicking the +menu button lets you change the opacity.
+Node colors can also be used to map a variable to the graph. Nodes +are colored according to the value present for the respective isolates +and turned into a pie chart showing the distribution of values if +several clonal isolates are summarized in a single node. Currently, +only categorical variables can be used for this feature.
+The node size can be controlled from the bottom left controls. The
+size of nodes containing multiple isolates with identical allelic
+profiles can be scaled by the number of isolates contained in them.
+Toggle the Scale by Duplicates
switch to activate this
+feature. The slider that sets the node size then changes from a
+single value to a range selection. In this way, the size of the
+smallest nodes (containing just one isolate), the size of the nodes
+containing the most isolates and the overall range can be
+controlled.
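The effect of the range slider can be pictured as mapping isolate counts onto a size interval. The sketch below assumes simple linear interpolation, which the manual does not confirm; `scaled_node_size` and the default range of 20 to 40 are hypothetical.

```python
def scaled_node_size(n_isolates, max_isolates, size_range=(20, 40)):
    """Linearly map a node's isolate count onto the selected size range (assumed scaling)."""
    smallest, largest = size_range
    if max_isolates <= 1:
        # No node holds duplicates, so every node gets the lower bound.
        return float(smallest)
    fraction = (n_isolates - 1) / (max_isolates - 1)
    return smallest + fraction * (largest - smallest)


# A node with one isolate gets the lower bound, the largest node the upper bound.
size_single = scaled_node_size(1, 5)
size_middle = scaled_node_size(3, 5)
size_largest = scaled_node_size(5, 5)
```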
The form of the nodes can be customized using the control panels on
+the bottom right. Activate the switch Show Shadows
to
+display shadows for the nodes. The shape of the nodes can be changed
+here as well. Choose between shapes that render the node labels below
+(Diamond
, Hexagon
, Dot
,
+Square
) or inside them (Circle
,
+Box
, Text
). If a variable is mapped, the form
+Pie Chart
is locked in and cannot be changed.
The Edges control panel controls the appearance of the edges
+and related elements. Each edge is labelled with the allelic
+distance between the isolates of the nodes it connects.
+Apart from its appearance, this label currently can’t be changed. Its color
+and size can be modified using the upper left controls Label.
+The color of the edges themselves can be controlled by Color in
+the upper right controls. Click the menu button to see the control for
+the transparency of the edges. On the bottom left, there is the option
+to scale the edge lengths by the allelic distance they represent. Toggle
+the Scale Edge Length
switch to activate this effect. The
+multiplier of this effect can be customized using the slider below.
+Activating this option when the isolates displayed in the MST
+graph span widely differing allelic distances, e.g. a maximum of 200
+and a minimum of 10, can make the plot look untidy. Drag the
+slider to lower values to minimize this issue.
The clustering controls are found in the Edges panel at the +bottom right. By default, the “Complex Type Distance” value listed for +each scheme available on the cgMLST.org Nomenclature Server is selected +as the current cluster threshold. The threshold can be changed to +any desired value. Nodes with distances that lie within the selected +threshold are enclosed by cluster shapes. These are +colored differently in order to distinguish between the cluster groups. +Choose between the Rainbow and Viridis scales to +modify the coloring. There are two types of cluster shapes available: +Area and Skeleton. The cluster type Area +renders an area surrounding nodes that are part of a cluster. Skeleton +instead uses the edges to visualize clusters. This can be particularly +useful if the selection of isolates is complex, which can potentially +lead to overlapping clusters with the Area cluster type.
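One way to picture threshold-based clustering is to group all nodes that are linked by distances at or below the threshold. The sketch below assumes this single-linkage grouping over all pairs; how PhyloTrace derives its clusters internally is not stated here, and `threshold_clusters` as well as the example values are hypothetical.

```python
def threshold_clusters(dist, threshold):
    """Group nodes linked by distances <= threshold (assumed single-linkage grouping)."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the group representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    # Only groups with at least two nodes count as clusters.
    return [members for members in groups.values() if len(members) > 1]


# Hypothetical allelic distances between four isolates (not real data).
example_distances = [
    [0, 2, 9, 10],
    [2, 0, 7, 8],
    [9, 7, 0, 3],
    [10, 8, 3, 0],
]
clusters_at_5 = threshold_clusters(example_distances, threshold=5)
clusters_at_1 = threshold_clusters(example_distances, threshold=1)
```

Raising the threshold merges more nodes into fewer, larger clusters, which is when the Skeleton shape tends to stay more readable than Area.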
+The Neighbour-Joining (NJ) method constructs a tree by iteratively +joining pairs of nodes based on their pairwise distances. It aims to +minimize the total branch length in the tree and is commonly used for +constructing phylogenetic trees from distance matrices. Refer to 6.4 NJ and UPGMA Modification for +information on how the tree appearance can be modified.
+The Unweighted Pair Group Method with Arithmetic Mean (UPGMA) +computes tree plots by grouping the most similar sequences or taxa +together at each step and then averaging their distances. It produces an +ultrametric tree, i.e. all tips are equidistant from the root, and is often used for hierarchical +clustering of data. Refer to 6.4 NJ +and UPGMA Modification for information on how the tree appearance +can be modified.
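The grouping-and-averaging step of UPGMA can be sketched in plain Python as follows. This is an illustrative sketch, not the code PhyloTrace runs; the function name `upgma`, the nested-tuple tree representation and the example distances are assumptions.

```python
def upgma(dist, labels):
    """Return (tree, heights): a nested-tuple tree and the height of each merge."""
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    heights = []
    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        (a, b), d_ab = min(d.items(), key=lambda item: item[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        heights.append(d_ab / 2)
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted mean equals the mean over all pairs of original isolates.
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    (tree, _), = clusters.values()
    return tree, heights


# Hypothetical allelic distances between four isolates (not real data).
upgma_distances = [
    [0, 2, 9, 10],
    [2, 0, 7, 8],
    [9, 7, 0, 3],
    [10, 8, 3, 0],
]
upgma_tree, merge_heights = upgma(upgma_distances, ["A", "B", "C", "D"])
```

Each merge is placed at half the average distance between the merged groups, so every tip ends up at the same distance from the root.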
+The tree elements can be customized in great detail and supplemented
+with additional information such as variables (see 6.4.4 Variable Mapping). However, the basic
+appearance, e.g. text and element sizes, is automatically adjusted to
+the number and properties of the entries that were selected to be
+included in the tree. Due to the variable nature of different data
+sets, it is sometimes necessary to manually readjust some elements to
+achieve a balanced look. While Minimum-Spanning trees have slightly
+different modification features and control inputs, NJ and UPGMA trees
+share the same control inputs. This is due to the different
+visualization technique used for the creation and display of MST plots.
+The controls to modify the tree are arranged in panels and divided into
+Layout
, Label
, Elements
and
+Variables
. In some panels you will find small menu buttons
+(highlighted in light blue). They let you modify the elements
+addressed by the respective panel in more detail (e.g. position or
+font-style).
The appearance of the general layout can be modified in detail. There
+is a range of different options, e.g. for controlling theme, colors,
+title & subtitle, size, legend and other elements. To switch to
+these controls navigate to the
+>> Visualization
tab and click the
+Layout
button in the menu to the left of the control panels.
+
Layout themes change the geometrical appearance. You can +choose from a selection of themes that are categorized into linear +and circular layouts. While the visual look changes when switching +between linear and circular themes, the structure, i.e. the order and +arrangement, of the hierarchical NJ and UPGMA trees stays the same.
+Linear: Rectangular
,
+Roundrect
, Slanted
and
+Ellipse
Circular: Circular
,
+Inward
Moreover, a Rootedge
can be added by turning on the
+switch. The root of the tree can be considered as the starting point,
+representing a theoretical “common ancestor” with an initial allelic
+profile, from which all other isolates developed. Besides aesthetics,
+displaying this element can help to distinguish “normal” branches,
+representing actual allelic distance between the isolates, from the
+root. The root menu lets you further modify its length and line
+type.
The Ladderize
switch is turned on by default. It sorts
+the tree branches by their length.
The color of lines, text as well as background, can be modified in
+the Color panel. The colored buttons show the color currently
+displayed as well as the respective HEXA code. Clicking them opens the
+color menu. You can either select a color by choosing it directly from
+the gradient field or by providing a HEXA or RGBA code. Note that the
+Lines/Text color applies to the tree branches, legend text and
+title, but not to the tip labels. Their color can be modified in the
+respective Label
menu (see 6.4.2.1
+Tips).
Add title and subtitle in the Title panel. Their color +follows the selected Lines/Text color, but can be +modified separately. The title menu lets you customize the font +size.
+The Sizing panel provides control of plot dimensions and
+position. For the aspect ratio, you can choose from 16:10
,
+16:9
and 4:3
. The overall size can be scaled
+with the slider below. If some elements are cut off, you can zoom out
+using the slider at the bottom. Trees with a circular layout in
+particular can appear small, with too much white space around them. In
+this case zooming in might be beneficial. The Sizing menu
+lets you position the content horizontally and vertically.
Legend and tree scale controls share the same panel. The tree scale +helps to estimate the actual allelic distance represented by the branch +length. In case you prefer not to show this element, you can hide it by +toggling the switch. Its length can be changed in the tree scale menu +and scales proportionally with the branch length. If the scale +overlaps other elements, adjust its position by dragging the sliders +in the menu.
+If variables are mapped to the plot, a legend will appear. For the
+orientation, the options are either horizontal or vertical (see
+Figure 22). The legend menu also lets you adjust
+position and size.
The Label
menu controls whether and how certain
+labels are displayed. There are three different kinds of labels:
+Tips, Branches and Custom Labels. They can be
+modified in many different ways, e.g. in color, size or position.
+
The labels at the tips represent the actual entries with their allelic
+profile that determined their position in the tree. By default, the
+Assembly Name
is displayed as the tip label. However, it is
+possible to select other basic variables, e.g. Host
,
+Country
, City
or Isolation Date
,
+from the drop down menu, or even choose not to show tip labels at all by
+toggling the Show switch. Instead of the tip labels being positioned
+right next to the tips, they can be aligned to the right by activating
+the Align
switch. UPGMA trees always have the tip labels
+aligned, whereas NJ trees have this activated by default only for circular
+layouts. The menu on the right provides further customization options.
+The Opacity
slider can be used to change the transparency.
+The Position
parameter modifies the offset of the labels
+from the tip. Angle, size and font face can be changed as well.
+Customize the color of the label text with the color button and the
+color of the panel with the color button below. The panels envelop the
+tip labels and are not shown by default. The controls in the panel menu
+let you modify the size of the panels (not the text itself) and smooth
+their shape.
Branch labels supplement the tree with additional
+information by labelling the branches leading to the tips with
+variables that are connected to the respective isolate. To show this
+element toggle the Show
switch in the Branches
+panel. The drop down lets you choose which variable or meta data to
+annotate. The color of the panel surrounding the branch label can be
+changed with the color button below. The menu button includes further
+controls, e.g. opacity, size, horizontal and vertical position, font
+face as well as edge smoothing. Note that branch labels don’t
+work for trees with a circular layout. More complex linear trees with
+many isolates usually offer too little space for branch
+labels as well. Instead, consider mapping a variable to other tree elements such
+as tip points (see 6.4.4 Variable
+Mapping).
If there is a need for labels somewhere other than tips or branches,
+there is the option to create customized ones. The panel
+Custom Labels
lets you define the label. Click the green
++
button to add it. The label will be positioned at the plot
+center. Create more labels by giving them a name and adding them again.
+To change the size and position, select the respective label from the
+drop down and open the menu next to the +
button. Make the
+desired changes and click the Apply
button for them to take
+effect. Figure 25 shows a tree with two
+highlighted clades (see 6.4.3.5 Clade
+Highlight). The custom label function was used to annotate them.
+
The Elements
menu provides control over several special
+elements such as tip and node points or a heatmap. These are not
+essential but can amplify the explanatory power of the tree. Elements
+can be deactivated or activated and their appearance can be changed.
+
Tip points are located at the end of the tree branches and correspond
+to the isolates displayed. They can be modified in color or size to
+bring the ends of the tree into prominence. Alternatively this element
+can be used to map a variable (see 6.4.4
+Variable Mapping).
Node points, in contrast to tip points, solely represent theoretical +predecessors and relatives with respect to the isolates and their +allelic profile. Except for the option to map a variable, their look can be +customized in the same way as tip points. Mapping variables is not +possible because they connect several isolates which may +have discrepant values for a chosen variable.
+Tiles are supplementary elements that can be used to map variables to
+the plot. They work with both circular and linear layouts. Up to five
+different tiles can be added by activating them in the
+Variables
menu (see 6.4.4
+Variable Mapping). To modify opacity, width or position, select the
+respective tiles that you wish to change with the selector at the top
+left corner of the panel. Any modifications apply only to the
+selected tile. Opacity defines the transparency of the tile, which makes
+it possible to overlay it, e.g. over the tree. The width slider controls the width.
+When the position is changed, tiles move horizontally in linear
+layouts, while in circular layouts they move inwards or
+outwards in relation to the center of the circle.
Heatmaps can be a powerful tool to visualize related variables of the
+same type (either categorical or continuous). For more details refer to
+6.4.4 Variable Mapping. If the heatmap
+is activated in the Variables
menu it can be modified using
+the respective control panel in the Elements
menu. Width
+changes apply to the heatmap overall, not to single columns. Just as
+with tiles, the position control is moving the heatmap horizontally for
+linear layouts and inwards or outwards for circular layouts. In some
+situations, e.g. for long variable names or in circular layouts, it
+might make sense to modify the angle and/or position of the column
+headers. This can be done by using the controls in the heatmap menu.
+
Isolates are grouped in distinct hierarchical clades, which are
+defined by nodes that comprise several isolates or other daughter nodes
+and their respective isolates. In order to emphasize one or several
+clades toggle the Node View switch and inspect the respective node index
+of the clades you wish to highlight. Select the nodes in the drop-down
+menu below and deactivate the Node View again to see the highlighted
+clades. If only one clade is highlighted there is the option to
+customize its color with the color button below. If there is more than
+one clade selected, you choose from a color scale instead. Also use the
+menu in the Clade Highlight control panel to control the
+alignment of the clade highlights to each other. The borders of the
+colored squares can be modified to round or rectangular appearance.
+
Clades located within another clade that is higher in the
+hierarchy can also be highlighted (see Figure 31).
+
Mapping variables, representing epidemiologic metadata or other
+properties of the isolates displayed, is a powerful way of enriching the
+plot with information. The Variables
menu provides full
+control over which variables are mapped, the elements they are mapped to and
+the color scale that represents the different values of the selected
+variable. The control panel is ordered into Element, Variable and Color
+Scale columns (see Figure 32). The switches in the
+Element column can be turned on or off to activate or deactivate the
+display of a variable with the respective element. Select the variable
+to be mapped from the drop-down menu right of the element switch. It
+contains the basic meta data (Isolation Date
,
+Host
, City
, Country
) as well as
+the manually added custom variables (see 4.1.2 Custom Variables). The currently
+selected variable is checked for its number of distinct values and
+variable type (categorical or continuous). As this information is
+relevant for selecting the color scale, it is displayed directly next to
+the color scale selection menus. For categorical variables, the
+selectable color scales automatically change depending on the number of
+distinct values. If the number of distinct values is 7 or fewer, you can
+select from qualitative color scales. As there is a limited number of
+distinct colors available in the qualitative color scales, they are not
+selectable if the variable exceeds 7 distinct values. Instead, gradient
+color scales can be selected. Continuous variables have continuous
+and divergent color scales available. Using the colorblind-friendly
+gradients Viridis and Cividis is recommended.
+Divergent color scales are useful for visualizing data with a
+clear central point of interest, highlighting positive and negative
+deviations from a central value like 0. A typical use case is
+gene expression data, e.g. fold change values, with colors
+indicating whether the change is positive (upregulation) or negative
+(downregulation) relative to a baseline expression level of 0 (no
+change).
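The selection rule described above can be summarized in a few lines of Python. This is only a sketch of the stated rule: `available_scale_types` is a hypothetical helper, and whether gradient scales are additionally offered for variables with 7 or fewer values is not specified, so the sketch returns only the scale type the text names for each case.

```python
def available_scale_types(values, continuous, max_qualitative=7):
    """Return the color scale types offered for a variable, following the rule in the text."""
    if continuous:
        return ["continuous", "divergent"]
    n_distinct = len(set(values))
    if n_distinct <= max_qualitative:
        return ["qualitative"]
    # Too many categories for the qualitative palettes: fall back to gradients.
    return ["gradient"]


# Hypothetical example variables (not real data).
city_scales = available_scale_types(["Graz", "Vienna", "Graz"], continuous=False)
id_scales = available_scale_types([f"patient_{i}" for i in range(12)], continuous=False)
fold_change_scales = available_scale_types([-1.5, 0.0, 2.3], continuous=True)
```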
In Figure 33 the Isolation Date
+variable is mapped to the tip label color (see 6.4.2.1
+Tips). Hence the tip labels indicate both the
+Assembly Name
and the Isolation Date
, with the
+Greys color scale highlighting more recently added isolates in
+darker shades. The tip point color is assigned to display the
+categorical City
variable, i.e. the city in which the sample was acquired.
+In this example it has only two values, the cities Graz and
+Vienna. The qualitative scale Set2 is chosen to
+distinguish the variable as well as possible from other variables. The
+tip point shapes circle and triangle represent the host from which the
+bacterial sample was taken. As the variable values are represented by
+shapes instead of colors, there is no color scale for this option.
+Continuous values can’t be represented by shapes. There are six
+different shapes available, hence selecting the tip point shape to
+represent categorical variables is only possible if there are 6 or fewer
+distinct values. The custom variables Patient Age
and
+ftsA
, which stands for expression values of the
+ftsA gene, are mapped to Tile 1 and 2 respectively. Except for
+color values, which are assigned by the variable mapping, the appearance of
+the elements, such as tip point sizes, can still be modified (see e.g. 6.4.3.1 Tip Points).
Figure 35 shows an example of gene expression fold
+changes mapped onto a heatmap. While white/yellow colors indicate
+baseline expression levels around 0, green colors indicate upregulation
+and red colors downregulation. When a diverging scale is selected, you
+can choose the midpoint of the scale (Zero
,
+Mean
or Median
) using the drop down menu that
+appears to the right of the color scale selector. Zero
assigns the
+middle color of the diverging color scale to the value 0. The choices
+Mean
and Median
assign the middle color to the
+arithmetic mean and median of the respective value range. The appearance
+of the heatmap, such as width and position, can be modified using the
+respective control panel from the Elements
menu (see 6.4.3.4 Heatmap).
Neighbour-Joining and UPGMA trees can be downloaded in PNG, JPEG, BMP
+and SVG format. Minimum-Spanning trees can be downloaded in PNG, JPEG
+and BMP format. In addition they can be downloaded as HTML to preserve
+the interactivity of dragging, zooming and moving the MST graph. To
+initiate the download head to the
+> Visualization
tab. In the sidebar,
+below the Create Tree
button, you find the drop down to
+select the file type as well as the download button right next to it.
+Note: In order for the download to work, the plots have to be created
+first.
A report in HTML format can be created by clicking the button
+Print Report
, located in the sidebar of the
+> Visualization
tab. There are several
+options to control which information is included in the report. The
+elements are categorized in Entry Table
,
+General
, Analysis
and
+Attach Plot
(see 7.2.1 Report
+Elements). Note that the report requires prior creation of a tree
+plot. The entry table in the report, the attached plot as well as some
+analysis parameters, such as the tree algorithm, are all fixed at the
+moment a tree is created. Therefore a proper report can only be
+generated after tree creation. Instead of the
+entire local database for the respective scheme, the entry table in the report lists only the isolates of
+interest, i.e. the ones that were used to generate the currently
+displayed tree. The download will be directed
+to the system location set in your browser download settings.
+
The sub-elements belonging to General
are
+Date
, Operator
, Institute
and
+Comment
. If you wish to include only a selection of these
+elements, tick or untick them accordingly. Unticking the
+General
element will deactivate the display of any
+sub-elements as well.
Ticking the Isolate Table
prints the entry table,
+comprising the isolate names as well as the selected metadata columns, in
+the report. Note that only entries that are marked as
+Included
in the database
+(>> Browse Entries
) are printed.
+Hence only isolates that are shown in the current tree are included.
The sub-elements belonging to the Analysis
parameters
+are Scheme
, Tree
, Distance
,
+NA Handling
and Version
. These parameters are
+automatically derived from the session as well as the created tree and
+can only be selected to be shown or hidden. As with the
+General
parameters, unticking
+Analysis Parameter
will hide all sub-elements.