MAINT: stats: minor numerical improvements to circular statistics #20766
Conversation
The two CI errors are upstream.
Thanks for the PR @fancidev.
Overall, I like the idea of this PR; we just need to adjust the tests.
I think to get an array-API-compatible `signbit` equivalent we need our own implementation. For similar cases, custom implementations were placed in this file.
Can you construct a test case where `(high-low)/(2*pi)` solves a precision issue?
Thanks for taking a look @dschmitz89! I have an idea to work around the `signbit` issue. As for the numerical accuracy point, I may be able to construct a contrived example; let's see! I have one more small idea to apply as well.
@dschmitz89 I’ve done the changes I planned. Looking forward to your feedback! I summarized the changes in the top post of this thread. Each of the five points is independent of the others; feel free to cherry-pick. Let me know if there’s anything you’d like me to elaborate on. (There’s a CI failure due to an incompatibility with PyTorch in two of the newly added test cases. They’re easy to fix and I’ll fix them shortly.)
This looks good @fancidev. Will leave it open for a few more days to give others a chance to look at it.
One subjective comment: do we in general care about a choice like +0.0 vs -0.0? Of course a negative variance does not make sense, but to me this represents a very-unlikely-to-hit edge case. Every edge case needs to be maintained in the future and consequently comes with costs (see, for example, the amount of effort required to add some level of unified `nan` handling in stats). scipy already goes to great lengths to achieve high precision in the tails of distributions (at least in my opinion). Achieving similar accuracy for edge cases in summary statistics would probably be a huge amount of work.
TLDR: I might be more hesitant to approve such changes in the future. Here, the PR is very solid and already done, so I'm not opposing it :).
Thanks for the review @dschmitz89 !
I agree with you that the cost/benefit ratio doesn’t look good if we dig into the sign of zero in general. In this PR I polished the sign of zero “by the way” while testing the functions for edge cases. The “fix” (`+0.0`) looks ugly, unfortunately, because of PyTorch’s behavior around negative zero. I find the story similar for premature overflow: we probably don’t want to guarantee (the absence of) that in the documentation, but in the implementation we might opt to handle it.
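For context, a minimal sketch of one way a negative zero can arise in a `circstd`-style computation (via the intermediate `-2 * log(R)` when the mean resultant length `R` is 1) and how adding `+0.0` normalizes it. Illustrative only; the PR’s actual code may differ:

```python
import numpy as np

# Mean resultant length R == 1.0 (perfectly concentrated data) makes the
# intermediate -2*log(R) a negative zero, which sqrt preserves:
r = 1.0
std = np.sqrt(-2 * np.log(r))
print(std)        # -0.0
print(std + 0.0)  # 0.0  (adding +0.0 normalizes the sign of zero)
```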
Reference issue
Closes gh-20240.
What does this implement/fix?
This PR makes the following numerical improvements to the `scipy.stats.circmean`, `circvar`, and `circstd` functions:

1. Avoid unnecessary loss of precision when `high == 2*pi` and `low == 0` by computing `(high-low)/(2*pi)` as a "unit". (Test case: `test_circmean_accuracy_tiny_input`)
2. Remove the redundant `np.errstate(invalid='ignore')` guard, because the guarded statement is no more subject to an invalid operation (`nan`) than the unguarded statements before or after it. This has the additional benefit of removing a reference to `np` from an otherwise `xp`-compatible function.
3. Make sure the same `pi` is used in the function signature (as the default value for `high`) and in the scaling.
4. `circstd` should return `+0.0` instead of `-0.0` if the mean resultant length is one. (Test case: `test_circstd_zero`)
5. Do not rotate the input before computing the mean resultant vector. (Test case: `test_circmean_accuracy_huge_input`)
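To illustrate the kind of precision issue Point 1 targets, here is a small sketch (illustrative only, not the PR's code) showing that multiplying tiny inputs by `2*pi` and then dividing by `high - low == 2*pi` rounds twice, while computing the ratio as a unit is exact in this case:

```python
import numpy as np

rng = np.random.default_rng(0)
c = 2 * np.pi
x = rng.uniform(0.0, 1e-9, 10_000)

# Scale by 2*pi, then divide by (high - low) == 2*pi: two roundings,
# so the tiny inputs do not always survive the round trip.
roundtrip = (x * c) / c
print(np.count_nonzero(roundtrip != x) > 0)  # True: some values change

# Computing (high - low) / (2*pi) "as a unit" gives exactly 1.0 here,
# so the subsequent division is exact.
unit = x / (c / c)
print(np.count_nonzero(unit != x))           # 0
```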
)Additional information
Points 1-4 are straightforward. Let me elaborate on Point 5 below.

Let $\theta_1, \cdots, \theta_n$ denote the input observations, let $H$ denote `high`, let $L$ denote `low`, and let $\rho \equiv 2\pi/(H-L)$ denote the scaling factor that converts the input into radians. The current code (mainly `_circfuncs_common`) computes the mean resultant vector ${z}$ by

$$z \equiv \frac{1}{n} \sum_{j=1}^{n} \exp\bigl[i\rho(\theta_j - L)\bigr].$$

To factor the expression into one that only depends on the difference $(H-L)$, rewrite the (rotated and scaled) mean resultant vector ${z}$ as

$$z = e^{-i\rho L} z_0,$$

where we denote by

$$z_0 \equiv \frac{1}{n} \sum_{j=1}^{n} \exp(i\rho\theta_j)$$

the scaled but unrotated mean resultant vector. $z_0$ only depends on $(H-L)$ (through $\rho$).
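The factoring of the rotated mean resultant vector $z$ into $e^{-i\rho L} z_0$ can be checked numerically. A small sketch with made-up data (the names `theta`, `rho`, etc. are illustrative, not from the PR):

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 2.0, 26.0
theta = rng.uniform(low, high, 100)
rho = 2 * np.pi / (high - low)

# Rotated-and-scaled mean resultant vector, as the current code computes it:
z = np.mean(np.exp(1j * rho * (theta - low)))
# Unrotated version, with the rotation applied afterwards:
z0 = np.mean(np.exp(1j * rho * theta))
print(np.allclose(z, np.exp(-1j * rho * low) * z0))  # True
```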
We proceed to examine `circvar`, `circstd`, and `circmean`'s dependency on $L$ and $H$.

`circvar` is computed by

$$\textrm{circvar} = 1 - |z| = 1 - |z_0|.$$

This shows `circvar` depends on $z_0$, and thus on $(H-L)$ only. By a similar argument, one can show that $\textrm{circstd}=\sqrt{-2\log|{z}|}$ depends on $(H-L)$ only. This is intuitive because `circvar` and `circstd` measure the dispersion of data, and dispersion is unaffected by rotation.

A little more notation is needed to show the dependence of `circmean` on $(H-L)$:
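The identity $\textrm{circvar} = 1 - |z_0|$ can be sanity-checked against `scipy.stats.circvar` for radian inputs, where $\rho = 1$ and $L = 0$ so no rotation is involved. A small illustrative sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 100)

# Unrotated mean resultant vector for radian data (rho = 1, L = 0):
z0 = np.mean(np.exp(1j * theta))

# scipy.stats.circvar returns 1 - R, with R the mean resultant length:
print(np.isclose(stats.circvar(theta), 1 - abs(z0)))  # True
```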
onThe current code computes$\bar\theta \equiv$
circmean
such that both of the following conditions hold:Using the fact that
Condition (1) is equivalent to
Now define$\bar\theta$ as
Such defined$\bar\theta$ trivially satisfies Condition (2). It follows that
and therefore
But$\rho(H-L)=2\pi$ . This shows $\bar\theta$ also satisfies Condition (1’). Hence it gives the desired
circmean
.
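Putting the derivation together, here is a sketch (not the PR's actual implementation; variable names are made up) of recovering `circmean` from the unrotated $z_0$ and comparing the result with `scipy.stats.circmean`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
low, high = 2.0, 26.0
theta = rng.normal(10.0, 0.5, 50)      # data concentrated around 10
rho = 2 * np.pi / (high - low)

# Scaled but unrotated mean resultant vector (no subtraction of `low`):
z0 = np.mean(np.exp(1j * rho * theta))

# theta_bar = L + (H-L)/(2*pi) * [(arg z0 - rho*L) mod 2*pi]
mean = low + (high - low) / (2 * np.pi) * ((np.angle(z0) - rho * low) % (2 * np.pi))
print(np.isclose(mean, stats.circmean(theta, high=high, low=low)))  # True
```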