Skip to content

Conversation

@benjeffery
Copy link
Member

I thought it wise before 1.0 to do a full (automated) review that documentation lines up with the code we're running. Several hours of token crunching and 45min of human review/improvement resulted in these commits, which are best viewed as separate diffs as it's a bit of a grab-bag. I think these are all worth merging though - even if they are things we weren't planning to get to.

That there wasn't more than this to find is very reassuring, but then a lot of effort has been put in to ensure tskit's quality.

@codecov
Copy link

codecov bot commented Nov 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.72%. Comparing base (5ffcf6f) to head (32ae37f).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3329   +/-   ##
=======================================
  Coverage   89.72%   89.72%           
=======================================
  Files          29       29           
  Lines       31181    31192   +11     
  Branches     5720     5722    +2     
=======================================
+ Hits        27976    27988   +12     
  Misses       1796     1796           
+ Partials     1409     1408    -1     
Flag Coverage Δ
c-tests 86.71% <100.00%> (+<0.01%) ⬆️
lwt-tests 80.38% <ø> (ø)
python-c-tests 87.05% <ø> (ø)
python-tests 98.84% <100.00%> (+<0.01%) ⬆️
python-tests-no-jit 33.51% <7.69%> (-0.05%) ⬇️
python-tests-numpy1 50.16% <92.30%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
c/tskit/tables.c 83.08% <100.00%> (+0.01%) ⬆️
python/tskit/trees.py 98.89% <100.00%> (+<0.01%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone through and it all looks good to me except for the IBDsegments stuff, which I'd have to spend more time on. I'd suggest pulling this commit out into it's own PR, and get the rest merged.

docs/stats.md Outdated
statistics, isolated samples without mutations directly above them are treated
as carrying the ancestral allele rather than as missing. Future versions of
tskit may expose options to treat missing data differently in statistics; for
now, if you need explicit control over how missing data are handled you should
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
now, if you need explicit control over how missing data are handled you should
now, if you need explicit control over how missing data is handled you should

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:D

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benjeffery
Copy link
Member Author

IBD extracted to #3330

@benjeffery benjeffery added this pull request to the merge queue Nov 20, 2025
@benjeffery benjeffery removed this pull request from the merge queue due to a manual request Nov 20, 2025
@benjeffery
Copy link
Member Author

I'll wait for @petrelharp before merging.

requirements for each edge are:

- We must have {math}`0 \leq` `left` {math}`<` `right` {math}`\leq L`;
- We must have finite coordinates with {math}`0 \leq` `left` {math}`<` `right` {math}`\leq L`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant, but it's fine; this is supposed to be careful and specific. (you can't have a non-finite value x for which 0 <= x < L with L finite)


#### Migration requirements

Given a valid set of nodes and edges, the requirements for a value set of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Given a valid set of nodes and edges, the requirements for a value set of
Given a valid set of nodes and edges, the requirements for a valid set of

on the VCF parser, and so we do **not** account for this change in
coordinate system by default.
Tree sequence site positions can be floating point values, whereas VCF
requires discrete, 1-based integers. The ``position_transform`` argument
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
requires discrete, 1-based integers. The ``position_transform`` argument
requires positive integers. The ``position_transform`` argument

coordinate system by default.
Tree sequence site positions can be floating point values, whereas VCF
requires discrete, 1-based integers. The ``position_transform`` argument
controls how tskit maps coordinates into VCF. Translating non-integral
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
controls how tskit maps coordinates into VCF. Translating non-integral
controls how tskit maps coordinates into VCF. Translating non-integer

Comment on lines +388 to +390
integer, so multiple sites may share the same output position. Because VCF
parsers differ, we do **not** automatically shift from tskit's 0-based
convention to VCF's 1-based convention.
Copy link
Contributor

@petrelharp petrelharp Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
integer, so multiple sites may share the same output position. Because VCF
parsers differ, we do **not** automatically shift from tskit's 0-based
convention to VCF's 1-based convention.
integer, so multiple sites may share the same output position.
Furthermore, tskit's coordinate system starts at zero,
whereas the VCF standard requires positions to be positive,
and so a site at position 0 is not valid in the VCF standard.
Because VCF parsers differ, we do **not** do anything to account for this.

Comment on lines +392 to +398
:::{warning}
Most VCF tools expect POS and contig coordinates to be 1-based. If your tree
sequence already uses discrete, 0-based integer positions, leaving the default
transform will produce off-by-one coordinates in the VCF. Supply an explicit
``position_transform`` (for example, add 1 after rounding) or use the
``"legacy"`` option to shift to 1-based coordinates.
:::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about how this is written. I think there are different interpretations of what's going on here:

  1. "0-based / 1-based" - if I'm interpreting correctly, this means that tskit coordinates are "distance to the right of the left end of the chromosome" and VCF is "which number site, counting from the left end of the chromosome".
  2. "valid values" - tskit allows a site at coordinate 0, VCF does not

Under interpretation (1), the coordinates are indeed off by one. However, I think that the simpler interpretation (2) should be the preferred interpretation, because then they're in the same coordinate system and there's no possibility to get mixed up with off-by-one errors.

Here's a rather pedantic suggestion for clarifying the situation.

Suggested change
:::{warning}
Most VCF tools expect POS and contig coordinates to be 1-based. If your tree
sequence already uses discrete, 0-based integer positions, leaving the default
transform will produce off-by-one coordinates in the VCF. Supply an explicit
``position_transform`` (for example, add 1 after rounding) or use the
``"legacy"`` option to shift to 1-based coordinates.
:::
The simplest resolution of this discrepancy in convention between tskit and VCF
positions is deal with any site at position 0 as a special case (for instance,
by discarding or ignoring it).
A different interpretation of this difference between tskit's position
and VCF's POS field
is that they are different coordinate systems: tskit coordinates are
"distance to the right of the left end of the chromosome",
while VCF coordinates are "which number site, counting from the left end
of the chromosome and starting at one".
Under this interpretation, the solution is to supply an explicit
``position_transform`` that adds 1 to the coordinate when outputting
to VCF (or, using the ``"legacy"`` option described below). However, this can
easily lead to off-by-one errors converting between the coordinate systems,
so should only be used if you really are using 0-based coordinates for your
tree sequence.
:::{warning}
Most VCF tools cannot deal with a POS value of 0. If your tree
sequence contains a site with position 0, this will likely cause an error.
:::

Comment on lines +418 to +419
For example, to shift all coordinates by 1 to make them strictly
1-based, we could define:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For example, to shift all coordinates by 1 to make them strictly
1-based, we could define:
For example, to shift all coordinates by 1, we could define:

I'm removing "1-based" because I don't think that term is unambiguous.

`UNKNOWN_TIME` will be used to fill the column. See the
more than one mutation occurs at the same site). If the `time` column
is absent, the mutation times in the resulting tree sequence are set to
{data}`tskit.UNKNOWN_TIME`. Unknown mutation times written out by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{data}`tskit.UNKNOWN_TIME`. Unknown mutation times written out by
{data}`tskit.UNKNOWN_TIME`, which is a numeric value that behaves like NaN.
Unknown mutation times written out by

@petrelharp
Copy link
Contributor

This all looks good, but see comments/suggestions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants