Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clearsky functions handling tz-aware datetimes as tz-naive, contrary to user guide #2054

Closed
yhkee0404 opened this issue May 19, 2024 · 5 comments · Fixed by #2055
Closed
Labels
Milestone

Comments

@yhkee0404
Copy link
Contributor

yhkee0404 commented May 19, 2024

Describe the bug
Both of the functions clearsky.lookup_linke_turbidity and irradiance.get_extra_radiation are supposed to be tz-aware in accordance with the user guide, but take the tz-aware inputs as tz-naive in their implementations and return incorrect results.
Another functions using these two functions may also have been affected by the errors; i.e. Location.get_clearsky, clearsky.ineichen, or clearsky.simplified_solis.

To Reproduce
Refer to my comprehensive analysis on the .ipynb file simulated in the same way as the user guide.
It seems too big for the browser to render because it shows me only this message: An error occurred.
In a similar situation, please download the file or clone my repository to see the content.

Expected behavior
There should be no items searched on the file by the keyword rows , each of which means errors subject to timezones.

Screenshots
For instance, on the user guide, it prints: climatological linke_turbidity = 3.329496820286459.
However, after converting the local timezone to utc, it prints on my analysis above another value: climatological linke_turbidity = 3.3247187176482633.

Versions:

  • pvlib.__version__: v0.10.5
  • pandas.__version__: 2.2.2
  • python: 3.11.9

Additional context
Almost the same as #237

@yhkee0404 yhkee0404 changed the title Misuses of tz-naive clearsky functions on the user guide Misused tz-naive clearsky functions on the user guide May 19, 2024
@cwhanse
Copy link
Member

cwhanse commented May 20, 2024

I'm not quite following. You write "Two functions clearsky.lookup_linke_turbidity and irradiance.get_extra_radiation is tz-naive but has been used as if they were tz-aware" but in the clear-sky user guide, the times input is localized

tus = Location(32.2, -111, 'US/Arizona', 700, 'Tucson')

times = pd.date_range(start='2016-07-01', end='2016-07-04', freq='1min', tz=tus.tz)

@yhkee0404
Copy link
Contributor Author

yhkee0404 commented May 20, 2024

Hi, @cwhanse
Thank you for your attention.

Yes, that localized times is a tz-aware input which is not a problem, but when it is given to the two functions they return unexpected results and this is the problem. That's why I said the functions are not tz-aware but has been misused as if they were tz-aware, and we are supposed to fix them to be tz-aware.

For instance, try converting the timezone of the times input using a common tz-aware function tz_convert and you can see the difference. This example was copied from the .ipynb file I mentioned above:

tus = Location(32.2, -111, 'US/Arizona', 700, 'Tucson')

times = pd.date_range(start='2016-07-01', end='2016-07-04', freq='1min', tz=tus.tz)

cs = tus.get_clearsky(times)  # ineichen with climatology table by default
cs_utc = tus.get_clearsky(times.tz_convert('utc')).tz_convert(tus.tz)  # ineichen with climatology table by default

assert cs.index.equals(cs_utc.index)
assert not ((cs < 0).any().any() or (cs_utc < 0).any().any())
print(len(cs))
df = cs.join(cs_utc, how = 'outer', rsuffix = '_utc')[~ cs.fillna(-1).eq(cs_utc.fillna(-1)).all(axis = 'columns')]
print(df.index.hour.unique(), df.index.minute.nunique())
df
4321
Index([17, 18, 19], dtype='int32') 60
                                  ghi         dni        dhi     ghi_utc  \
2016-07-01 17:00:00-07:00  460.378093  778.936621  70.469943  460.055525   
2016-07-01 17:01:00-07:00  456.733517  777.196738  70.143925  456.411549   
2016-07-01 17:02:00-07:00  453.083990  775.435849  69.816605  452.762630   
2016-07-01 17:03:00-07:00  449.429616  773.653621  69.487974  449.108869   
2016-07-01 17:04:00-07:00  445.770496  771.849713  69.158023  445.450370   
...                               ...         ...        ...         ...   
2016-07-03 19:28:00-07:00    0.372376    6.479330   0.286308    0.368759   
2016-07-03 19:29:00-07:00    0.232132    4.744294   0.182824    0.229742   
2016-07-03 19:30:00-07:00    0.131663    3.388042   0.106134    0.130224   
2016-07-03 19:31:00-07:00    0.063133    2.356295   0.052039    0.062400   
2016-07-03 19:32:00-07:00    0.019268    1.593766   0.016222    0.019030   

                              dni_utc    dhi_utc  
2016-07-01 17:00:00-07:00  777.658658  70.787077  
2016-07-01 17:01:00-07:00  775.913720  70.460150  
2016-07-01 17:02:00-07:00  774.147732  70.131910  
2016-07-01 17:03:00-07:00  772.360361  69.802348  
2016-07-01 17:04:00-07:00  770.551267  69.471455  
...                               ...        ...  
2016-07-03 19:28:00-07:00    6.330446   0.284669  
2016-07-03 19:29:00-07:00    4.628755   0.181634  
2016-07-03 19:30:00-07:00    3.300508   0.105355  
2016-07-03 19:31:00-07:00    2.291654   0.051610  
2016-07-03 19:32:00-07:00    1.547308   0.016073  

[459 rows x 6 columns]

Does that difference conform to an expected behavior? I think cs_utc is the more accurate result than cs on the user guide.

On the other hand, you can also see on the user guide another times input as below which is tz-naive, as well as the tz-aware or localized one you mentioned. Both tz-naive and tz-aware times inputs are used without distiction on the user guide, but it is highly likely that they have been misused because merely preprocessing the inputs by the well-known tz-aware function tz_convert makes the functions result another values.

Another example from my analysis above:

# times = pd.date_range(start='2015-01-01', end='2016-01-01', freq='1D')
times_h = pd.date_range(start='2015-01-01', end='2016-01-01', freq='1h')
times_h_kst = pd.date_range(start='2015-01-01', end='2016-01-01', freq='1h').tz_localize('utc').tz_convert('Asia/Seoul')

# sites = [(32, -111, 'Tucson1'), (32.2, -110.9, 'Tucson2'),
#          (33.5, -112.1, 'Phoenix'), (35.1, -106.6, 'Albuquerque')]
lat, lon, name = 35.1, -106.6, 'Albuquerque'

# for lat, lon, name in sites:
turbidity_h = pvlib.clearsky.lookup_linke_turbidity(times_h, lat, lon, interp_turbidity=False)
turbidity_h_kst = pvlib.clearsky.lookup_linke_turbidity(times_h_kst, lat, lon, interp_turbidity=False).tz_convert(None)

assert turbidity_h.index.equals(turbidity_h_kst.index)
assert not ((turbidity_h < 0).any() or (turbidity_h_kst < 0).any())
print(len(turbidity_h))
df = pd.concat([turbidity_h, turbidity_h_kst], axis = 'columns', join = 'outer')[~ turbidity_h.fillna(-1).eq(turbidity_h_kst.fillna(-1))]
print(df.index.hour.unique(), df.index.minute.nunique())
df
8761
Index([15, 16, 17, 18, 19, 20, 21, 22, 23], dtype='int32') 1
                        0     1
2015-01-31 15:00:00  2.05  1.95
2015-01-31 16:00:00  2.05  1.95
2015-01-31 17:00:00  2.05  1.95
2015-01-31 18:00:00  2.05  1.95
2015-01-31 19:00:00  2.05  1.95
...                   ...   ...
2015-12-31 19:00:00  2.25  2.05
2015-12-31 20:00:00  2.25  2.05
2015-12-31 21:00:00  2.25  2.05
2015-12-31 22:00:00  2.25  2.05
2015-12-31 23:00:00  2.25  2.05

[99 rows x 2 columns]

Is this an expected behavior? Isn't turbidity_h right and turbidity_h_kst wrong? I think ideally the results should be the same as turbidity_h, and a different turbidity_h_kst is a bug, just like #237 mentioned above.

To summarize, the user guide has considered those functions as if tz-aware, where any tz_convert of the input may not change their results. However, you can easily find it does indeed as in my .ipynb file, which means the functions are tz-naive against our expectations and should be fixed into tz-aware like in #2055.

@yhkee0404 yhkee0404 changed the title Misused tz-naive clearsky functions on the user guide Fix clearsky functions handling tz-aware datetimes as tz-naive contrary to user guide May 25, 2024
@yhkee0404 yhkee0404 changed the title Fix clearsky functions handling tz-aware datetimes as tz-naive contrary to user guide Clearsky functions handling tz-aware datetimes as tz-naive, contrary to user guide May 25, 2024
@yhkee0404
Copy link
Contributor Author

Sorry for my poor English. I've thoroughly revised my words. Could you please review them again? @cwhanse

@cwhanse
Copy link
Member

cwhanse commented May 25, 2024

@yhkee0404 aha, I think I partly understand what's going on here. The Linke turbidity data distributed with pvlib are gridded (location), monthly values. These values (from an outside source) do not have datetime stamps, only a month. pvlib (somewhat arbitrarily) assigns these values to the middle day of each month so that a daily value can be determined by interpolation. That machinery is not shown to the user; it happens here:

def _interpolate_turbidity(lts, time):

The examples you provide are comparing the interpolated turbidity at the same location but localized to different timezones. The interpolation is being done using time.dayofyear. It looks to me that the dayofyear function may operating in local time, because the day of year is changed at midnight local, not midnight UTC.

Arguably, pandas had to make a decision how to return a day of year for each location, so I don't think this is a fault in pandas - its behaving as intended (I assume).

So: if we want Linke turbidity at a location to be consistent with equal timestamps, we need to work under the hood in pvlib to fix the interpolation.

import numpy as np
import pandas as pd
from pvlib.location import Location

tus = Location(32.2, -111, 'US/Arizona', 700, 'Tucson')

times = pd.date_range(start='2016-07-01', end='2016-07-04', freq='30min', tz=tus.tz)
times_utc = times.tz_convert('utc')

doy = times.dayofyear
doy_utc = times_utc.dayofyear

td = np.append(0, np.diff(times.dayofyear))  # add a zero so it's the right length
print(times[td > 0])

td_utc = np.append(0, np.diff(times_utc.dayofyear))  # add a zero so it's the right length
print(times_utc[td_utc > 0])

@yhkee0404
Copy link
Contributor Author

yhkee0404 commented May 26, 2024

Thank you for your clear explanation! Could I add some details to your example?

  1. The inconsistent results of clearsky.lookup_linke_turbidity for each localized datetime depend not only on interp_turbidity=True and pandas.DatetimeIndex.dayofyear, but also on pandas.DatetimeIndex.month, regardless of interpolation. No matter why, the fundamental reason remains the same as you identified: the values are "changed at midnight local, not midnight UTC." Please note the influence of interp_turbidity and time.month:

if interp_turbidity:
linke_turbidity = _interpolate_turbidity(lts, time)
else:
months = time.month - 1
linke_turbidity = pd.Series(lts[months], index=time)

  1. The variability in irradiance.get_extra_radiation values results from pandas.DatetimeIndex.dayofyear, pandas.DatetimeIndex.month, and pandas.DatetimeIndex.year. The root cause, again, is the localized definition of midnight. Refer to Perform dayofyear-based calculations according to UTC, not local time #2055 for more details. Or, you can start by examining the following code and trace through irradiance._handle_extra_radiation_types and tools._pandas_to_doy to identify pandas.DatetimeIndex.dayofyear, then follow through to solarposition.pyephem_earthsun_distance which uses ephem.Date assuming UTC, and finally to solarposition.nrel_earthsun_distance which references pandas.DatetimeIndex.year and pandas.DatetimeIndex.month.

to_doy, to_datetimeindex, to_output = \
_handle_extra_radiation_types(datetime_or_doy, epoch_year)
# consider putting asce and spencer methods in their own functions
method = method.lower()
if method == 'asce':
B = solarposition._calculate_simple_day_angle(to_doy(datetime_or_doy),
offset=0)
RoverR0sqrd = 1 + 0.033 * np.cos(B)
elif method == 'spencer':
B = solarposition._calculate_simple_day_angle(to_doy(datetime_or_doy))
RoverR0sqrd = (1.00011 + 0.034221 * np.cos(B) + 0.00128 * np.sin(B) +
0.000719 * np.cos(2 * B) + 7.7e-05 * np.sin(2 * B))
elif method == 'pyephem':
times = to_datetimeindex(datetime_or_doy)
RoverR0sqrd = solarposition.pyephem_earthsun_distance(times) ** (-2)
elif method == 'nrel':
times = to_datetimeindex(datetime_or_doy)
RoverR0sqrd = \
solarposition.nrel_earthsun_distance(times, **kwargs) ** (-2)
else:
raise ValueError('Invalid method: %s', method)

  1. The assumed timezones for both functions seem to be UTC. As you mentioned, for clearsky.lookup_linke_turbidity "these values (from an outside source) do not have datetime stamps, only a month." Therefore, synchronizing input datetimes with the timezone distinguishing each day in the h5 file is key to resolving the timezone-related uncertainties in Linke turbidity.

# The .h5 file 'LinkeTurbidities.h5' contains a single 2160 x 4320 x 12
# matrix of type uint8 called 'LinkeTurbidity'. The rows represent global
# latitudes from 90 to -90 degrees; the columns represent global longitudes
# from -180 to 180; and the depth (third dimension) represents months of
# the year from January (1) to December (12). To determine the Linke
# turbidity for a position on the Earth's surface for a given month do the
# following: LT = LinkeTurbidity(LatitudeIndex, LongitudeIndex, month).
# Note that the numbers within the matrix are 20 * Linke Turbidity,
# so divide the number from the file by 20 to get the
# turbidity.
# The nodes of the grid are 5' (1/12=0.0833[arcdeg]) apart.
# From Section 8 of Aerosol optical depth and Linke turbidity climatology
# http://www.meteonorm.com/images/uploads/downloads/ieashc36_report_TL_AOD_climatologies.pdf
# 1st row: 89.9583 S, 2nd row: 89.875 S
# 1st column: 179.9583 W, 2nd column: 179.875 W
if filepath is None:
pvlib_path = os.path.dirname(os.path.abspath(__file__))
filepath = os.path.join(pvlib_path, 'data', 'LinkeTurbidities.h5')

Fortunately, we can experimentally confirm that both functions use UTC for dividing days without needing additional references. The following code provides evidence for irradiance.get_extra_radiation. Only assuming UTC makes the returned values unique for every day; localizing to any other timezone is awkward because two distinct values exist per each localized day then.

import pandas as pd
from pvlib.location import Location
import pvlib

tus = Location(32.2, -111, 'US/Arizona', 700, 'Tucson')

times = pd.date_range(start='2016-07-01', end='2016-07-04', freq='30min', tz=tus.tz)
times_utc = times.tz_convert('utc')

dni_extra = pvlib.irradiance.get_extra_radiation(times)
dni_extra_utc = pvlib.irradiance.get_extra_radiation(times_utc).tz_convert(tus.tz)

assert dni_extra.index.equals(dni_extra_utc.index)
assert not ((dni_extra < 0).any() or (dni_extra_utc < 0).any())
print(len(dni_extra))
df = pd.concat([dni_extra, dni_extra_utc], axis = 'columns', join = 'outer')[~ dni_extra.fillna(-1).eq(dni_extra_utc.fillna(-1))]
print(df.index.hour.unique(), df.index.minute.nunique())
print(dni_extra_utc.tz_convert('UTC').groupby(by=lambda x: x.day).unique())
df
145
Index([17, 18, 19, 20, 21, 22, 23], dtype='int32') 2
1    [1320.4980145041654]
2    [1320.4715353124743]
3    [1320.4577467900974]
4    [1320.4566508404746]
dtype: object
                                     0            1
2016-07-01 17:00:00-07:00  1320.498015  1320.471535
2016-07-01 17:30:00-07:00  1320.498015  1320.471535
2016-07-01 18:00:00-07:00  1320.498015  1320.471535
2016-07-01 18:30:00-07:00  1320.498015  1320.471535
2016-07-01 19:00:00-07:00  1320.498015  1320.471535
2016-07-01 19:30:00-07:00  1320.498015  1320.471535
2016-07-01 20:00:00-07:00  1320.498015  1320.471535
2016-07-01 20:30:00-07:00  1320.498015  1320.471535
2016-07-01 21:00:00-07:00  1320.498015  1320.471535
2016-07-01 21:30:00-07:00  1320.498015  1320.471535
2016-07-01 22:00:00-07:00  1320.498015  1320.471535
2016-07-01 22:30:00-07:00  1320.498015  1320.471535
2016-07-01 23:00:00-07:00  1320.498015  1320.471535
2016-07-01 23:30:00-07:00  1320.498015  1320.471535
2016-07-02 17:00:00-07:00  1320.471535  1320.457747
2016-07-02 17:30:00-07:00  1320.471535  1320.457747
2016-07-02 18:00:00-07:00  1320.471535  1320.457747
2016-07-02 18:30:00-07:00  1320.471535  1320.457747
2016-07-02 19:00:00-07:00  1320.471535  1320.457747
2016-07-02 19:30:00-07:00  1320.471535  1320.457747
2016-07-02 20:00:00-07:00  1320.471535  1320.457747
2016-07-02 20:30:00-07:00  1320.471535  1320.457747
2016-07-02 21:00:00-07:00  1320.471535  1320.457747
2016-07-02 21:30:00-07:00  1320.471535  1320.457747
2016-07-02 22:00:00-07:00  1320.471535  1320.457747
2016-07-02 22:30:00-07:00  1320.471535  1320.457747
2016-07-02 23:00:00-07:00  1320.471535  1320.457747
2016-07-02 23:30:00-07:00  1320.471535  1320.457747
2016-07-03 17:00:00-07:00  1320.457747  1320.456651
2016-07-03 17:30:00-07:00  1320.457747  1320.456651
2016-07-03 18:00:00-07:00  1320.457747  1320.456651
2016-07-03 18:30:00-07:00  1320.457747  1320.456651
2016-07-03 19:00:00-07:00  1320.457747  1320.456651
2016-07-03 19:30:00-07:00  1320.457747  1320.456651
2016-07-03 20:00:00-07:00  1320.457747  1320.456651
2016-07-03 20:30:00-07:00  1320.457747  1320.456651
2016-07-03 21:00:00-07:00  1320.457747  1320.456651
2016-07-03 21:30:00-07:00  1320.457747  1320.456651
2016-07-03 22:00:00-07:00  1320.457747  1320.456651
2016-07-03 22:30:00-07:00  1320.457747  1320.456651
2016-07-03 23:00:00-07:00  1320.457747  1320.456651
2016-07-03 23:30:00-07:00  1320.457747  1320.456651

@kandersolar kandersolar added this to the v0.11.1 milestone Sep 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants