Inconsistencies in nansum for float32 dtype compared to numpy #193

Open
agoodm opened this issue Aug 15, 2018 · 6 comments

@agoodm

agoodm commented Aug 15, 2018

Consider this simple example:

In [1]: import numpy as np

In [2]: import bottleneck as bn

In [3]: data = 2e5*np.random.rand(int(4e7)).astype('float32')

In [4]: np.nansum(data)
Out[4]: 4000034300000.0

In [5]: bn.nansum(data)
Out[5]: 3719060258816.0

Looks like errors in the computation are compounding due to loss of precision, since the problem becomes much less apparent for smaller datasets (a minimal illustration of the mechanism is at the end of this comment). Repeating the above with the float64 dtype gives me much more consistent results.

In [6]: bn.nansum(data.astype('float64'))
Out[6]: 4000035580557.9033

In [7]: np.nansum(data.astype('float64'))
Out[7]: 4000035580557.979

I tested this example with bottleneck 1.1.0 and 1.2.1.
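
For a minimal illustration of the mechanism (a toy example, not my real data): float32 carries a 24-bit significand, so once the running sum is large enough, adding a comparatively small element rounds away entirely.

In [8]: acc = np.float32(2**24)

In [9]: acc + np.float32(1) == acc   # True: the +1 is lost to rounding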

@kwgoodman
Collaborator

I think numpy uses a more robust algorithm: https://en.wikipedia.org/wiki/Pairwise_summation. Bottleneck doesn't.
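Roughly, the idea is to split the array in half recursively and add the two partial sums, so rounding error grows like O(log n) instead of O(n). A minimal Python sketch of the technique (not numpy's actual implementation, which is C code with an unrolled sequential loop below a block-size cutoff), assuming a numpy array input:

def pairwise_sum(x, block=128):
    # Small blocks: plain sequential loop in the array's own dtype.
    if len(x) <= block:
        total = x.dtype.type(0)
        for v in x:
            total += v
        return total
    # Otherwise split in half and add the two partial sums.
    mid = len(x) // 2
    return pairwise_sum(x[:mid], block) + pairwise_sum(x[mid:], block)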

Is bottleneck used at JPL?!

@agoodm
Author

agoodm commented Aug 15, 2018

@kwgoodman Thanks for the quick response! To be precise, I have been using it indirectly through xarray for multiple JPL projects.

I take it this is a well-known issue, then, and isn't going to be addressed anytime soon? I discovered it when calculating statistics for a dataset similar in size to the example above and got some serious errors: the dataset's min and max were very close together, yet the computed mean came out lower than the min, and consequently the standard deviation was over an order of magnitude larger than its actual value.
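
For what it's worth, the mean-below-min anomaly can be reproduced without bottleneck, assuming np.cumsum's sequential float32 accumulation stands in for a naive running sum (toy data, not my actual dataset):

In [1]: import numpy as np

In [2]: x = np.full(int(4e7), 300, dtype='float32')

# The float32 running sum stalls near 2**33 ≈ 8.6e9, so this "mean"
# comes out around 215 -- below the minimum of 300.
In [3]: np.cumsum(x)[-1] / len(x)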

@kwgoodman
Collaborator

Happy to hear that bottleneck is used, even if indirectly, at JPL.

I'd be interested in trying pairwise summation in bn.nansum and bn.nanmean. I get paid for releases of numerox but have not found funding for bottleneck development.
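
Another option that keeps a plain sequential loop is compensated (Kahan) summation, which carries a running correction for the low-order bits lost at each add. A minimal sketch of the technique (in pure Python the accumulator is already float64, so this only shows the algorithm; bottleneck would do it in C in the array's own dtype):

def kahan_sum(x):
    total = 0.0
    comp = 0.0                    # running compensation for lost low-order bits
    for v in x:
        y = v - comp              # apply the correction from the previous step
        t = total + y
        comp = (t - total) - y    # (t - total) is what was actually added
        total = t
    return total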

@ahmedshaaban1

Hi,
Has this problem been solved?

@rdbisme
Collaborator

rdbisme commented Jan 26, 2025

I don't think it has, and I have no bandwidth to tackle this. #424

@rdbisme
Collaborator

rdbisme commented Jan 26, 2025

This might also be related: #414
