MAINT: stats: minor numerical improvements to circular statistics #20766


Merged

14 commits merged into scipy:main from fancidev:circ-funcs, Jun 14, 2024

Conversation

@fancidev (Contributor) commented May 22, 2024

Reference issue

Closes gh-20240.

What does this implement/fix?

This PR makes the following numerical improvements to the scipy.stats.circmean, circvar, and circstd functions:

  1. Avoid unnecessary loss of precision when high == 2*pi and low == 0 by computing (high-low)/(2*pi) as a "unit". (Test case: test_circmean_accuracy_tiny_input)

  2. Remove the redundant np.errstate(invalid='ignore') guard, because the guarded statement is no more prone to an invalid operation (NaN) than the unguarded statements before or after it. This has the additional benefit of removing the reference to np from an otherwise xp-compatible function.

  3. Make sure the same pi is used in function signature (as default value for high) and used in scaling.

  4. circstd should return +0.0 instead of -0.0 if mean resultant length is zero. (Test case: test_circstd_zero)

  5. Do not rotate the input before computing the mean resultant vector. (Test case: test_circmean_accuracy_huge_input)
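To illustrate how a negative zero can arise in this kind of computation, here is a minimal sketch with plain NumPy (not the actual scipy code path): circstd evaluates $\sqrt{-2\log r}$ from the mean resultant length $r$, and when the argument of the square root is exactly $-0.0$, IEEE 754 sqrt preserves the sign of the zero. Adding $+0.0$ normalizes it.

```python
import numpy as np

# Sketch, not the scipy internals: when the mean resultant length r is
# exactly 1 (all angles identical), -2*log(r) evaluates to -0.0 and
# IEEE 754 sqrt preserves the sign of the zero.
r = 1.0
raw = np.sqrt(-2 * np.log(r))            # -0.0
fixed = np.sqrt(-2 * np.log(r)) + 0.0    # +0.0: adding +0.0 normalizes the sign

print(np.signbit(raw), np.signbit(fixed))  # True False
```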

Additional information

Points 1-4 are straightforward. Let me elaborate on Point 5 below. Let $\theta_1, \ldots, \theta_n$ denote the input observations, $H$ denote high, $L$ denote low, and let $\rho \equiv 2\pi/(H-L)$ denote the scaling factor that converts the input into radians. The current code (mainly _circfuncs_common) computes the mean resultant vector ${z}$ by

$$ \begin{align*} {z} &= \frac{1}{n} \sum_{k=1}^n e^{ i \rho (\theta_k-L) } \end{align*} $$

${z}$ in the above expression depends on the individual values of $L$ and $H$.

To factor the expression into one that only depends on the difference $(H-L)$, rewrite the (rotated and scaled) mean resultant vector ${z}$ as

$$ \begin{align*} {z} &= \frac{1}{n} \sum_{k=1}^n e^{ -i\rho L + i \rho \theta_k} \\ &=e^{ -i\rho L } \left( \frac{1}{n} \sum_{k=1}^n e^{ i \rho \theta_k} \right)\\ &= e^{ -i\rho L } z_0 \end{align*} $$

where

$$ z_0 := \frac{1}{n} \sum_{k=1}^n e^{ i \rho \theta_k} $$

denotes the scaled but unrotated mean resultant vector. $z_0$ depends on $L$ and $H$ only through the difference $(H-L)$ (via $\rho$).

We proceed to examine circvar, circstd, and circmean's dependency on $L$ and $H$. circvar is computed by

$$ \textrm{circvar} = 1-|z| =1-| e^{ -i\rho L } z_0 | = 1-| e^{ -i\rho L }|\cdot | z_0 |=1-|z_0| $$

This shows circvar depends on $z_0$ and thus $(H-L)$ only. By a similar argument, one can show that $\textrm{circstd}=\sqrt{-2\log|{z}|}$ depends on $(H-L)$ only. This is intuitive because circvar and circstd measure the dispersion of data, and dispersion is unaffected by rotation.
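This rotation invariance is easy to check numerically. The following sketch (plain NumPy; the data range [low, high) is an arbitrary choice for illustration, not taken from the PR) compares $1-|z|$ from the rotated sum against $1-|z_0|$ from the unrotated one:

```python
import numpy as np

# Numerical check that |z| == |e^{-i*rho*L} * z0| == |z0|, i.e. that
# circvar = 1 - |z| does not depend on whether the input was rotated by L.
rng = np.random.default_rng(0)
low, high = 10.0, 20.0                  # arbitrary illustrative range
theta = rng.uniform(low, high, size=50)
rho = 2 * np.pi / (high - low)

z = np.exp(1j * rho * (theta - low)).mean()   # rotated, as in the old code
z0 = np.exp(1j * rho * theta).mean()          # unrotated, as in this PR

# The two circvar values agree up to rounding.
print(abs((1 - abs(z)) - (1 - abs(z0))))
```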

A little more notation is needed to show the dependence of circmean on $(H-L)$:

  • For a complex number $z \ne 0$, let $\arg (z)$ denote the principal value of its argument, taking values in a fixed interval (say $[-\pi,\pi)$).
  • For real numbers $a,b,c$, let $a \cong b \,(\mathop{\mathrm{mod}} c)$ denote congruence modulo $c$, i.e. $a-b=kc$ for some integer $k$.

The current code computes $\bar\theta \equiv$ circmean such that both of the following conditions hold:

$$ \begin{align*} &\rho(\bar\theta-L) \cong \arg(z) \,(\textrm{mod} \, 2\pi) &(1) \\ &L \le \bar\theta < H &(2) \\ \end{align*} $$

Using the fact that

$$ \arg(z) = \arg(e^{-i\rho L}z_0) \cong -\rho L+\arg(z_0) \,(\mathrm{mod}\,2\pi) $$

Condition (1) is equivalent to

$$ \begin{align*} &\rho\bar\theta \cong \arg(z_0) \,(\mathrm{mod}\,2\pi) &(1’) \\ \end{align*} $$

Now define $\bar\theta$ as

$$ \bar\theta := (\arg(z_0)/\rho - L) \textrm{ mod } (H-L) + L $$

The $\bar\theta$ so defined trivially satisfies Condition (2). It follows that

$$ \bar\theta -L\cong \arg(z_0)/\rho - L \, (\mathrm{mod}\, (H-L)) $$

and therefore

$$ \rho \bar\theta \cong \arg(z_0) \, (\mathrm{mod}\, \rho(H-L)) $$

But $\rho(H-L)=2\pi$. This shows $\bar\theta$ also satisfies Condition (1’). Hence it gives the desired circmean.
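Putting the derivation together, here is a minimal sketch of the unrotated circmean alongside the rotated variant (plain NumPy; the function names are made up for illustration and are not scipy's implementation):

```python
import numpy as np

def circmean_unrotated(theta, high, low):
    # Sketch of the derivation above: compute z0 without rotating by L,
    # then map arg(z0)/rho back into [low, high) via the mod identity.
    rho = 2 * np.pi / (high - low)
    z0 = np.exp(1j * rho * np.asarray(theta)).mean()
    return (np.angle(z0) / rho - low) % (high - low) + low

def circmean_rotated(theta, high, low):
    # Sketch of the pre-PR approach: rotate the input by low first.
    rho = 2 * np.pi / (high - low)
    z = np.exp(1j * rho * (np.asarray(theta) - low)).mean()
    return (np.angle(z) / rho) % (high - low) + low

theta = [0.1, 0.2, 6.0]
print(circmean_unrotated(theta, 2 * np.pi, 0.0))
print(circmean_rotated(theta, 2 * np.pi, 0.0))
```

Both variants satisfy Conditions (1) and (2) and agree up to rounding; the unrotated one simply never feeds $\theta_k - L$ into exp.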

@github-actions github-actions bot added scipy.stats maintenance Items related to regular maintenance tasks labels May 22, 2024
@lucascolley lucascolley changed the title MAINT:minor numerical improvements to circular statistics MAINT: stats: minor numerical improvements to circular statistics May 22, 2024
@fancidev fancidev marked this pull request as ready for review May 22, 2024 14:23
@fancidev fancidev marked this pull request as draft May 22, 2024 23:26
@fancidev (Contributor, Author)

The two CI errors are upstream:

  1. PyTorch pow(-0.0, 0.5) wrongly returns -0.0. See BUG: torch.pow(-0.0) wrongly returns -0.0 (should be +0.0) pytorch/pytorch#127163

  2. array_api_strict does not yet implement xp.signbit required by unit test. See Support for 2023.12 data-apis/array-api-strict#25
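For context on the first item: IEEE 754 (and C99 Annex F) specify that pow(±0, y) is +0 for positive non-odd-integer y, while sqrt(-0) is -0. A quick check with CPython's math module (assuming it follows the platform's C library here, which is the usual behavior):

```python
import math

# IEEE 754: pow(-0.0, 0.5) is +0.0, but sqrt(-0.0) keeps the negative sign.
print(math.copysign(1.0, math.pow(-0.0, 0.5)))  # 1.0  -> result is +0.0
print(math.copysign(1.0, math.sqrt(-0.0)))      # -1.0 -> result is -0.0
```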

@dschmitz89 (Contributor) left a comment


Thanks for the PR @fancidev .

Overall, I like the idea of this PR, we just need to adjust the tests.

I think to use an array API usable signbit equivalent we need our own implementation. For similar cases, custom implementations were placed in this file.

Can you construct a test case where (high-low)/(2*pi) solves a precision issue?

@fancidev (Contributor, Author)

Thanks for taking a look @dschmitz89 !

I have an idea to work around the signbit thing and the torch.pow thing; will try it out later.

As for the numerical accuracy point, I may be able to construct a contrived example, let’s see!

I have one more small idea to apply to the _circfuncs_common function, which is still in the works. After that I’ll take the PR out of draft for you to take another look :-)

@fancidev fancidev marked this pull request as ready for review May 30, 2024 00:24
@fancidev (Contributor, Author)

@dschmitz89 I’ve done the changes I planned. Looking forward to your feedback!

I summarized the changes in the top post of this thread. Each of the five points is independent of the others; feel free to cherry-pick. Let me know if there’s anything you’d like me to elaborate on.

(There’s a CI failure due to incompatibility with PyTorch in two of the newly added test cases. They’re easy to fix and I’ll fix them shortly.)

@dschmitz89 (Contributor) left a comment


This looks good @fancidev . Will leave it open for a few more days to give others a chance to look at it.

One subjective comment: do we in general care about a choice like +0.0 or -0.0? Of course a negative variance does not make sense, but to me this represents a very unlikely edge case. Every edge case needs to be maintained in the future and consequently comes with costs (see, for example, the amount of effort required to add some level of unified NaN handling in stats). scipy already goes to great lengths to achieve high precision in the tails of distributions (at least in my opinion). Achieving similar accuracy for edge cases in summary statistics would probably be a huge amount of work.

TLDR: I might be more hesitant to approve such changes in the future. Here, the PR is very solid and already done, so I’m not opposing it :).

@lucascolley lucascolley added this to the 1.15.0 milestone Jun 5, 2024
@fancidev (Contributor, Author) commented Jun 5, 2024

Thanks for the review @dschmitz89 !

One subjective comment: do we in general care about a choice like +0.0 or -0.0?

I agree with you that the cost/benefit ratio doesn’t look good if we dig into the sign of zero in stats. Therefore my personal preference is to leave the sign of zero unspecified in the public interface (or documentation) so that it doesn’t become a maintenance burden. In the implementation, if we happen to look a bit closer and do some polishing, that might be OK. (In that case, a test case could be added to catch any regression.)

In this PR I polished the sign of zero “by the way” while testing the functions for edge cases. The “fix” (+0.0) looks ugly because, unfortunately, PyTorch’s pow function is non-compliant (with IEEE 754 and the Array API). Otherwise the code would look so natural that you wouldn’t even notice it!

I find the story similar for premature overflow — we probably don’t want to guarantee (the absence of) that in the documentation, but in the implementation we might opt to handle it.

@dschmitz89 dschmitz89 merged commit 11eeb55 into scipy:main Jun 14, 2024
31 of 32 checks passed
@fancidev fancidev deleted the circ-funcs branch July 8, 2024 22:14
Successfully merging this pull request may close these issues.

ENH: multiple small improvements to scipy.stats.circmean