Inconsistencies in nansum for float32 dtype compared to numpy #193

Open
agoodm opened this issue Aug 15, 2018 · 6 comments

@agoodm

agoodm commented Aug 15, 2018

Consider this simple example:

In [1]: import numpy as np

In [2]: import bottleneck as bn

In [3]: data = 2e5*np.random.rand(int(4e7)).astype('float32')

In [4]: np.nansum(data)
Out[4]: 4000034300000.0

In [5]: bn.nansum(data)
Out[5]: 3719060258816.0

Looks like errors in the computation are compounding due to loss of precision, since the problem becomes much less apparent for smaller datasets (a minimal illustration of the mechanism is at the end of this comment). Repeating the above with the float64 dtype gives me much more consistent results.

In [6]: bn.nansum(data.astype('float64'))
Out[6]: 4000035580557.9033

In [7]: np.nansum(data.astype('float64'))
Out[7]: 4000035580557.979

I tested this example with bottleneck 1.1.0 and 1.2.1.
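
For a minimal illustration of the mechanism (a toy example, not my real data): float32 carries a 24-bit significand, so once the running sum is large enough, adding a comparatively small element rounds away entirely.

In [8]: acc = np.float32(2**24)

In [9]: acc + np.float32(1) == acc   # True: the +1 is lost to rounding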

@kwgoodman
Collaborator

I think numpy uses a more robust algorithm: https://en.wikipedia.org/wiki/Pairwise_summation. Bottleneck doesn't.
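Roughly, the idea is to split the array in half recursively and add the two partial sums, so rounding error grows like O(log n) instead of O(n). A minimal Python sketch of the technique (not numpy's actual implementation, which is C code with an unrolled sequential loop below a block-size cutoff), assuming a numpy array input:

def pairwise_sum(x, block=128):
    # Small blocks: plain sequential loop in the array's own dtype.
    if len(x) <= block:
        total = x.dtype.type(0)
        for v in x:
            total += v
        return total
    # Otherwise split in half and add the two partial sums.
    mid = len(x) // 2
    return pairwise_sum(x[:mid], block) + pairwise_sum(x[mid:], block)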

Is bottleneck used at JPL?!

@agoodm
Author

agoodm commented Aug 15, 2018

@kwgoodman Thanks for the quick response! To be precise, I have been using it indirectly through xarray for multiple JPL projects.

I take it this is a well-known issue, then, and isn't going to be addressed anytime soon? I discovered it when calculating statistics for a dataset similar in size to the example above and got some serious errors: the dataset's min and max were very close together, yet the computed mean came out lower than the min, and consequently the standard deviation was over an order of magnitude larger than its actual value.
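
For what it's worth, the mean-below-min anomaly can be reproduced without bottleneck, assuming np.cumsum's sequential float32 accumulation stands in for a naive running sum (toy data, not my actual dataset):

In [1]: import numpy as np

In [2]: x = np.full(int(4e7), 300, dtype='float32')

# The float32 running sum stalls near 2**33 ≈ 8.6e9, so this "mean"
# comes out around 215 -- below the minimum of 300.
In [3]: np.cumsum(x)[-1] / len(x)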

@kwgoodman
Collaborator

Happy to hear that bottleneck is used, even if indirectly, at JPL.

I'd be interested in trying pairwise summation in bn.nansum and bn.nanmean. I get paid for releases of numerox but have not found funding for bottleneck development.
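
Another option that keeps a plain sequential loop is compensated (Kahan) summation, which carries a running correction for the low-order bits lost at each add. A minimal sketch of the technique (in pure Python the accumulator is already float64, so this only shows the algorithm; bottleneck would do it in C in the array's own dtype):

def kahan_sum(x):
    total = 0.0
    comp = 0.0                    # running compensation for lost low-order bits
    for v in x:
        y = v - comp              # apply the correction from the previous step
        t = total + y
        comp = (t - total) - y    # (t - total) is what was actually added
        total = t
    return total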

@ahmedshaaban1

Hi,
Has this problem been solved?

@rdbisme
Collaborator

rdbisme commented Jan 26, 2025

I don't think it has, and I have no bandwidth to tackle this. #424

@rdbisme
Collaborator

rdbisme commented Jan 26, 2025

This might also be related: #414
