-
Notifications
You must be signed in to change notification settings - Fork 79
Documentation/Consistency Review #3329
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3329 +/- ##
=======================================
Coverage 89.72% 89.72%
=======================================
Files 29 29
Lines 31181 31192 +11
Branches 5720 5722 +2
=======================================
+ Hits 27976 27988 +12
Misses 1796 1796
+ Partials 1409 1408 -1
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
jeromekelleher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone through and it all looks good to me except for the IBDsegments stuff, which I'd have to spend more time on. I'd suggest pulling this commit out into it's own PR, and get the rest merged.
docs/stats.md
Outdated
| statistics, isolated samples without mutations directly above them are treated | ||
| as carrying the ancestral allele rather than as missing. Future versions of | ||
| tskit may expose options to treat missing data differently in statistics; for | ||
| now, if you need explicit control over how missing data are handled you should |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| now, if you need explicit control over how missing data are handled you should | |
| now, if you need explicit control over how missing data is handled you should |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
95bad61 to
fe6a5fb
Compare
fe6a5fb to
32ae37f
Compare
|
IBD extracted to #3330 |
|
I'll wait for @petrelharp before merging. |
| requirements for each edge are: | ||
|
|
||
| - We must have {math}`0 \leq` `left` {math}`<` `right` {math}`\leq L`; | ||
| - We must have finite coordinates with {math}`0 \leq` `left` {math}`<` `right` {math}`\leq L`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is redundant, but it's fine; this is supposed to be careful and specific. (you can't have a non-finite value x for which 0 <= x < L with L finite)
|
|
||
| #### Migration requirements | ||
|
|
||
| Given a valid set of nodes and edges, the requirements for a value set of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Given a valid set of nodes and edges, the requirements for a value set of | |
| Given a valid set of nodes and edges, the requirements for a valid set of |
| on the VCF parser, and so we do **not** account for this change in | ||
| coordinate system by default. | ||
| Tree sequence site positions can be floating point values, whereas VCF | ||
| requires discrete, 1-based integers. The ``position_transform`` argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| requires discrete, 1-based integers. The ``position_transform`` argument | |
| requires positive integers. The ``position_transform`` argument |
| coordinate system by default. | ||
| Tree sequence site positions can be floating point values, whereas VCF | ||
| requires discrete, 1-based integers. The ``position_transform`` argument | ||
| controls how tskit maps coordinates into VCF. Translating non-integral |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| controls how tskit maps coordinates into VCF. Translating non-integral | |
| controls how tskit maps coordinates into VCF. Translating non-integer |
| integer, so multiple sites may share the same output position. Because VCF | ||
| parsers differ, we do **not** automatically shift from tskit's 0-based | ||
| convention to VCF's 1-based convention. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| integer, so multiple sites may share the same output position. Because VCF | |
| parsers differ, we do **not** automatically shift from tskit's 0-based | |
| convention to VCF's 1-based convention. | |
| integer, so multiple sites may share the same output position. | |
| Furthermore, tskit's coordinate system starts at zero, | |
| whereas the VCF standard requires positions to be positive, | |
| and so a site at position 0 is not valid in the VCF standard. | |
| Because VCF parsers differ, we do **not** do anything to account for this. |
| :::{warning} | ||
| Most VCF tools expect POS and contig coordinates to be 1-based. If your tree | ||
| sequence already uses discrete, 0-based integer positions, leaving the default | ||
| transform will produce off-by-one coordinates in the VCF. Supply an explicit | ||
| ``position_transform`` (for example, add 1 after rounding) or use the | ||
| ``"legacy"`` option to shift to 1-based coordinates. | ||
| ::: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about how this is written. I think there are different interpretations of what's going on here:
- "0-based / 1-based" - if I'm interpreting correctly, this means that tskit coordinates are "distance to the right of the left end of the chromosome" and VCF is "which number site, counting from the left end of the chromosome".
- "valid values" - tskit allows a site at coordinate 0, VCF does not
Under interpretation (1), the coordinates are indeed off by one. However, I think that the simpler interpretation (2) should be the preferred interpretation, because then they're in the same coordinate system and there's no possibility to get mixed up with off-by-one errors.
Here's a rather pedantic suggestion for clarifying the situation.
| :::{warning} | |
| Most VCF tools expect POS and contig coordinates to be 1-based. If your tree | |
| sequence already uses discrete, 0-based integer positions, leaving the default | |
| transform will produce off-by-one coordinates in the VCF. Supply an explicit | |
| ``position_transform`` (for example, add 1 after rounding) or use the | |
| ``"legacy"`` option to shift to 1-based coordinates. | |
| ::: | |
| The simplest resolution of this discrepancy in convention between tskit and VCF | |
| positions is deal with any site at position 0 as a special case (for instance, | |
| by discarding or ignoring it). | |
| A different interpretation of this difference between tskit's position | |
| and VCF's POS field | |
| is that they are different coordinate systems: tskit coordinates are | |
| "distance to the right of the left end of the chromosome", | |
| while VCF coordinates are "which number site, counting from the left end | |
| of the chromosome and starting at one". | |
| Under this interpretation, the solution is to supply an explicit | |
| ``position_transform`` that adds 1 to the coordinate when outputting | |
| to VCF (or, using the ``"legacy"`` option described below). However, this can | |
| easily lead to off-by-one errors converting between the coordinate systems, | |
| so should only be used if you really are using 0-based coordinates for your | |
| tree sequence. | |
| :::{warning} | |
| Most VCF tools cannot deal with a POS value of 0. If your tree | |
| sequence contains a site with position 0, this will likely cause an error. | |
| ::: |
| For example, to shift all coordinates by 1 to make them strictly | ||
| 1-based, we could define: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| For example, to shift all coordinates by 1 to make them strictly | |
| 1-based, we could define: | |
| For example, to shift all coordinates by 1, we could define: |
I'm removing "1-based" because I don't think that term is unambiguous.
| `UNKNOWN_TIME` will be used to fill the column. See the | ||
| more than one mutation occurs at the same site). If the `time` column | ||
| is absent, the mutation times in the resulting tree sequence are set to | ||
| {data}`tskit.UNKNOWN_TIME`. Unknown mutation times written out by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| {data}`tskit.UNKNOWN_TIME`. Unknown mutation times written out by | |
| {data}`tskit.UNKNOWN_TIME`, which is a numeric value that behaves like NaN. | |
| Unknown mutation times written out by |
|
This all looks good, but see comments/suggestions? |
I thought it wise before 1.0 to do a full (automated) review that documentation lines up with the code we're running. Several hours of token crunching and 45min of human review/improvement resulted in these commits, which are best viewed as separate diffs as it's a bit of a grab-bag. I think these are all worth merging though - even if they are things we weren't planning to get to.
That there wasn't more than this to find is very reassuring, but then a lot of effort has been put in to ensure tskit's quality.