MAINT: stats: minor numerical improvements to circular statistics #20766


Merged

14 commits merged into scipy:main from fancidev:circ-funcs, Jun 14, 2024

Conversation

@fancidev (Contributor) commented May 22, 2024

Reference issue

Closes gh-20240.

What does this implement/fix?

This PR makes the following numerical improvements to the scipy.stats.circmean, circvar, and circstd functions:

  1. Avoid unnecessary loss of precision when high == 2*pi and low == 0 by computing (high-low)/(2*pi) as a "unit". (Test case: test_circmean_accuracy_tiny_input)

  2. Remove the redundant np.errstate(invalid='ignore') guard, because the guarded statement is no more prone to an invalid operation (NaN) than the unguarded statements before or after it. This has the additional benefit of removing the reference to np from an otherwise xp-compatible function.

  3. Make sure the same pi is used in function signature (as default value for high) and used in scaling.

  4. circstd should return +0.0 instead of -0.0 if mean resultant length is zero. (Test case: test_circstd_zero)

  5. Do not rotate the input before computing the mean resultant vector. (Test case: test_circmean_accuracy_huge_input)
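To illustrate how a negative zero can arise in this kind of computation, here is a minimal sketch with plain NumPy (not the actual scipy code path): circstd evaluates $\sqrt{-2\log r}$ from the mean resultant length $r$, and when the argument of the square root is exactly $-0.0$, IEEE 754 sqrt preserves the sign of the zero. Adding $+0.0$ normalizes it.

```python
import numpy as np

# Sketch, not the scipy internals: when the mean resultant length r is
# exactly 1 (all angles identical), -2*log(r) evaluates to -0.0 and
# IEEE 754 sqrt preserves the sign of the zero.
r = 1.0
raw = np.sqrt(-2 * np.log(r))            # -0.0
fixed = np.sqrt(-2 * np.log(r)) + 0.0    # +0.0: adding +0.0 normalizes the sign

print(np.signbit(raw), np.signbit(fixed))  # True False
```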

Additional information

Points 1-4 are straightforward. Let me elaborate on Point 5 below. Let $\theta_1, \ldots, \theta_n$ denote the input observations, $H$ denote high, $L$ denote low, and let $\rho \equiv 2\pi/(H-L)$ denote the scaling factor that converts the input into radians. The current code (mainly _circfuncs_common) computes the mean resultant vector ${z}$ by

$$ \begin{align*} {z} &= \frac{1}{n} \sum_{k=1}^n e^{ i \rho (\theta_k-L) } \end{align*} $$

${z}$ in the above expression depends on the individual values of $L$ and $H$.

To factor the expression into one that only depends on the difference $(H-L)$, rewrite the (rotated and scaled) mean resultant vector ${z}$ as

$$ \begin{align*} {z} &= \frac{1}{n} \sum_{k=1}^n e^{ -i\rho L + i \rho \theta_k} \\ &=e^{ -i\rho L } \left( \frac{1}{n} \sum_{k=1}^n e^{ i \rho \theta_k} \right)\\ &= e^{ -i\rho L } z_0 \end{align*} $$

where

$$ z_0 := \frac{1}{n} \sum_{k=1}^n e^{ i \rho \theta_k} $$

denotes the scaled but unrotated mean resultant vector. $z_0$ depends on $L$ and $H$ only through the difference $(H-L)$ (via $\rho$).

We proceed to examine circvar, circstd, and circmean's dependency on $L$ and $H$. circvar is computed by

$$ \textrm{circvar} = 1-|z| =1-| e^{ -i\rho L } z_0 | = 1-| e^{ -i\rho L }|\cdot | z_0 |=1-|z_0| $$

This shows circvar depends on $z_0$ and thus $(H-L)$ only. By a similar argument, one can show that $\textrm{circstd}=\sqrt{-2\log|{z}|}$ depends on $(H-L)$ only. This is intuitive because circvar and circstd measure the dispersion of data, and dispersion is unaffected by rotation.
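This rotation invariance is easy to check numerically. The following sketch (plain NumPy; the data range [low, high) is an arbitrary choice for illustration, not taken from the PR) compares $1-|z|$ from the rotated sum against $1-|z_0|$ from the unrotated one:

```python
import numpy as np

# Numerical check that |z| == |e^{-i*rho*L} * z0| == |z0|, i.e. that
# circvar = 1 - |z| does not depend on whether the input was rotated by L.
rng = np.random.default_rng(0)
low, high = 10.0, 20.0                  # arbitrary illustrative range
theta = rng.uniform(low, high, size=50)
rho = 2 * np.pi / (high - low)

z = np.exp(1j * rho * (theta - low)).mean()   # rotated, as in the old code
z0 = np.exp(1j * rho * theta).mean()          # unrotated, as in this PR

# The two circvar values agree up to rounding.
print(abs((1 - abs(z)) - (1 - abs(z0))))
```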

A little more notation is needed to show the dependence of circmean on $(H-L)$:

  • For a complex number $z \ne 0$, let $\arg (z)$ denote the principal value of its argument, taking values in a fixed interval (say $[-\pi,\pi)$).
  • For real numbers $a,b,c$, let $a \cong b \,(\mathop{\mathrm{mod}} c)$ denote congruence modulo $c$, i.e. $a-b=kc$ for some integer $k$.

The current code computes $\bar\theta \equiv$ circmean such that both of the following conditions hold:

$$ \begin{align*} &\rho(\bar\theta-L) \cong \arg(z) \,(\textrm{mod} \, 2\pi) &(1) \\ &L \le \bar\theta < H &(2) \\ \end{align*} $$

Using the fact that

$$ \arg(z) = \arg(e^{-i\rho L}z_0) \cong -\rho L+\arg(z_0) \,(\mathrm{mod}\,2\pi) $$

Condition (1) is equivalent to

$$ \begin{align*} &\rho\bar\theta \cong \arg(z_0) \,(\mathrm{mod}\,2\pi) &(1’) \\ \end{align*} $$

Now define $\bar\theta$ as

$$ \bar\theta := (\arg(z_0)/\rho - L) \textrm{ mod } (H-L) + L $$

The $\bar\theta$ so defined trivially satisfies Condition (2). It follows that

$$ \bar\theta -L\cong \arg(z_0)/\rho - L \, (\mathrm{mod}\, (H-L)) $$

and therefore

$$ \rho \bar\theta \cong \arg(z_0) \, (\mathrm{mod}\, \rho(H-L)) $$

But $\rho(H-L)=2\pi$. This shows $\bar\theta$ also satisfies Condition (1’). Hence it gives the desired circmean.
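Putting the derivation together, here is a minimal sketch of the unrotated circmean alongside the rotated variant (plain NumPy; the function names are made up for illustration and are not scipy's implementation):

```python
import numpy as np

def circmean_unrotated(theta, high, low):
    # Sketch of the derivation above: compute z0 without rotating by L,
    # then map arg(z0)/rho back into [low, high) via the mod identity.
    rho = 2 * np.pi / (high - low)
    z0 = np.exp(1j * rho * np.asarray(theta)).mean()
    return (np.angle(z0) / rho - low) % (high - low) + low

def circmean_rotated(theta, high, low):
    # Sketch of the pre-PR approach: rotate the input by low first.
    rho = 2 * np.pi / (high - low)
    z = np.exp(1j * rho * (np.asarray(theta) - low)).mean()
    return (np.angle(z) / rho) % (high - low) + low

theta = [0.1, 0.2, 6.0]
print(circmean_unrotated(theta, 2 * np.pi, 0.0))
print(circmean_rotated(theta, 2 * np.pi, 0.0))
```

Both variants satisfy Conditions (1) and (2) and agree up to rounding; the unrotated one simply never feeds $\theta_k - L$ into exp.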

@github-actions github-actions bot added scipy.stats maintenance Items related to regular maintenance tasks labels May 22, 2024
@lucascolley lucascolley changed the title MAINT:minor numerical improvements to circular statistics MAINT: stats: minor numerical improvements to circular statistics May 22, 2024
@fancidev fancidev marked this pull request as ready for review May 22, 2024 14:23
@fancidev fancidev marked this pull request as draft May 22, 2024 23:26
@fancidev (Contributor, Author)

The two CI errors are upstream:

  1. PyTorch pow(-0.0, 0.5) wrongly returns -0.0. See BUG: torch.pow(-0.0) wrongly returns -0.0 (should be +0.0) pytorch/pytorch#127163

  2. array_api_strict does not yet implement xp.signbit required by unit test. See Support for 2023.12 data-apis/array-api-strict#25
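For context on the first item: IEEE 754 (and C99 Annex F) specify that pow(±0, y) is +0 for positive non-odd-integer y, while sqrt(-0) is -0. A quick check with CPython's math module (assuming it follows the platform's C library here, which is the usual behavior):

```python
import math

# IEEE 754: pow(-0.0, 0.5) is +0.0, but sqrt(-0.0) keeps the negative sign.
print(math.copysign(1.0, math.pow(-0.0, 0.5)))  # 1.0  -> result is +0.0
print(math.copysign(1.0, math.sqrt(-0.0)))      # -1.0 -> result is -0.0
```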

@dschmitz89 (Contributor) left a comment


Thanks for the PR @fancidev .

Overall, I like the idea of this PR, we just need to adjust the tests.

I think to use an array API usable signbit equivalent we need our own implementation. For similar cases, custom implementations were placed in this file.

Can you construct a test case where (high-low)/(2*pi) solves a precision issue?

@fancidev (Contributor, Author)

Thanks for taking a look @dschmitz89 !

I have an idea to work around the signbit thing and the torch.pow thing; will try it out later.

As for the numerical accuracy point, I may be able to construct a contrived example, let’s see!

I have one more small idea to apply to the _circfuncs_common function, which is still in the works. After that I’ll take the PR out of draft for you to take another look :-)

@fancidev fancidev marked this pull request as ready for review May 30, 2024 00:24
@fancidev (Contributor, Author)

@dschmitz89 I’ve done the changes I planned. Looking forward to your feedback!

I summarized the changes in the top post of this thread. Each of the five points is independent of the others; feel free to cherry-pick. Let me know if there’s anything you’d like me to elaborate on.

(There’s a CI failure due to incompatibility with PyTorch in two of the newly added test cases. They’re easy to fix and I’ll fix them shortly.)

@dschmitz89 (Contributor) left a comment


This looks good @fancidev . Will leave it open for a few more days to give others a chance to look at it.

One subjective comment: do we in general care about a choice like +0.0 or -0.0? Of course a negative variance does not make sense, but to me this represents a very unlikely edge case. Every edge case needs to be maintained in the future and consequently comes with costs (see, for example, the amount of effort required to add some level of unified NaN handling in stats). scipy already goes to great lengths to achieve high precision in the tails of distributions (at least in my opinion). Achieving similar accuracy for edge cases in summary statistics would probably be a huge amount of work.

TLDR: I might be more hesitant to approve such changes in the future. Here, the PR is very solid and already done, so I’m not opposing it :).

@lucascolley lucascolley added this to the 1.15.0 milestone Jun 5, 2024
@fancidev (Contributor, Author) commented Jun 5, 2024

Thanks for the review @dschmitz89 !

One subjective comment: do we in general care about a choice like +0.0 or -0.0?

I agree with you that the cost/benefit ratio doesn’t look good if we dig into the sign of zero in stats. Therefore my personal preference is to leave the sign of zero unspecified in the public interface (or documentation) so that it doesn’t become a maintenance burden. In the implementation, if we happen to look a bit closer and do some polishing, that might be OK. (In that case, a test case could be added to catch any regression.)

In this PR I polished the sign of zero “by the way” while testing the functions for edge cases. The “fix” (+0.0) looks ugly because, unfortunately, PyTorch’s pow function is non-compliant (with IEEE 754 and the Array API). Otherwise the code would look so natural that you wouldn’t even notice it!

I find the story similar for premature overflow — we probably don’t want to guarantee (the absence of) that in the documentation, but in the implementation we might opt to handle it.

@dschmitz89 dschmitz89 merged commit 11eeb55 into scipy:main Jun 14, 2024
31 of 32 checks passed
@fancidev fancidev deleted the circ-funcs branch July 8, 2024 22:14
Successfully merging this pull request may close these issues.

ENH: multiple small improvements to scipy.stats.circmean