diff --git a/docs/glossary.md b/docs/glossary.md
index 3317c9a27b..20cca9204f 100644
--- a/docs/glossary.md
+++ b/docs/glossary.md
@@ -30,26 +30,39 @@ Here are some definitions of some key ideas encountered in this documentation.
 tree
 : A "gene tree", i.e., the genealogical tree describing how a collection of
   genomes (usually at the tips of the tree) are related to each other at some
-  chromosomal location. See {ref}`sec_nodes_or_individuals` for discussion
-  of what a "genome" is.
+  chromosomal {ref}`position ` or location.
+  As the trees may vary depending on this location, they are also known as "local
+  trees". See {ref}`sec_nodes_or_individuals` for discussion of what a "genome" is.

 (sec_data_model_definitions_tree_sequence)=

 tree sequence
-: A "succinct tree sequence" (or tree sequence, for brevity) is an efficient
-  encoding of a sequence of correlated trees, such as one encounters looking
-  at the gene trees along a genome. A tree sequence efficiently captures the
-  structure shared by adjacent trees, (essentially) storing only what differs
-  between them.
+: A "succinct tree sequence" (or tree sequence, for brevity) is an object
+  that stores the genetic ancestry and mutational history of a set of
+  aligned DNA sequences or genomes. The name reflects the idea that a common
+  way to treat genetic ancestry is as a sequence of correlated
+  {ref}`trees ` at different chromosomal
+  {ref}`positions `.
+  Branches that are shared between these trees are efficiently stored as a
+  single {ref}`edge `, and adjacent trees
+  may differ by only a few such edges. These edges connect
+  {ref}`nodes ` (genomes) in
+  the tree sequence, forming a
+  network or graph. Graphs of this sort are sometimes called ancestral
+  recombination graphs (ARGs), hence tree sequences provide a
+  flexible way to encode multiple types of ARG.

 (sec_data_model_definitions_node)=

 node
-: Each branching point in each tree is associated with a particular genome
+: Any point in a tree can be associated with a particular genome
   in a particular ancestor, called a "node". Since each node represents a
-  specific genome it has a unique `time`, thought of as its birth time,
-  which determines the height of any branching points it is associated with.
-  See {ref}`sec_nodes_or_individuals` for discussion of what a "node" is.
+  specific genome it has a unique `time`, thought of as its birth time. Nodes
+  may or may not correspond to branching points, either in a local
+  {ref}`tree ` or in the whole graph.
+  However, a branching point must always be associated with a node.
+  See {ref}`sec_nodes_or_individuals` for discussion of what a "node"
+  represents.

 (sec_data_model_definitions_individual)=

@@ -66,7 +79,7 @@ individual
 sample
 : The focal nodes of a tree sequence, usually thought of as those from which
   we have obtained data. The specification of these affects various
-  methods: (1) {meth}`TreeSequence.variants` and
+  methods: {meth}`TreeSequence.variants` and
   {meth}`TreeSequence.haplotypes` will output the genotypes of the samples,
   and {attr}`Tree.roots` only return roots ancestral to at least one
   sample.
@@ -81,13 +94,15 @@ edge
 : The topology of a tree sequence is defined by a set of **edges**. Each
   edge is a tuple `(left, right, parent, child)`, which records a
   parent-child relationship among a pair of nodes on the
-  on the half-open interval of chromosome `[left, right)`.
+  half-open interval `[left, right)` along the genome. The difference
+  between `left` and `right` is known as the "span" of the edge.

 (sec_data_model_definitions_site)=

 site
 : Tree sequences can define the mutational state of nodes as well as their
-  topological relationships. A **site** is thought of as some position along
+  topological relationships. A **site** is thought of as some
+  {ref}`position ` along
   the genome at which variation occurs. Each site is associated with
   a unique position and ancestral state.

@@ -114,6 +129,14 @@ migration
 population
 : A grouping of nodes, e.g., by sampling location.

+(sec_data_model_definitions_position)=
+
+position
+: A location along the genome, from 0 to the
+  {ref}`sequence length`. In `tskit`
+  positions are stored as floating-point numbers, although it is common to
+  restrict positions to occur at discrete integer locations.
+
 (sec_data_model_definitions_provenance)=

 provenance