Skip to content

Maybe add retry logic for failed remote data access #1219

@jpivarski

Description

@jpivarski

It wouldn't work for some failures, and that's the hard part: deciding when retrying is likely to work and when it's doomed and you're just extending the time needed before getting a useful error message to the user.

Here's the original #1217 (comment), which has some notes about how it could be implemented.


If this were to be done in Uproot, we'd have to dynamically adjust the request size and resubmit when we see such an error, and have some reasonable give-up policy after a specified number of retries to keep it from spiraling out of control. And if the OSError came from a local file, rather than XRootD, it's not retryable (if an open local file suddenly can't be read, someone must have unplugged the disk or something).

Possibly the smallest sections that would need to be retried are:

in HasBranches.arrays:

arrays, expression_context, branchid_interpretation = _regularize_expressions(
self,
expressions,
cut,
filter_name,
filter_typename,
filter_branch,
keys,
aliases,
language,
get_from_cache,
)
ranges_or_baskets = []
checked = set()
for _, context in expression_context:
for branch in context["branches"]:
if branch.cache_key not in checked:
checked.add(branch.cache_key)
for (
basket_num,
range_or_basket,
) in branch.entries_to_ranges_or_baskets(entry_start, entry_stop):
ranges_or_baskets.append((branch, basket_num, range_or_basket))
interp_options = {"ak_add_doc": ak_add_doc}
_ranges_or_baskets_to_arrays(
self,
ranges_or_baskets,
branchid_interpretation,
entry_start,
entry_stop,
decompression_executor,
interpretation_executor,
library,
arrays,
False,
interp_options,
)

in HasBranches.iterate:

ranges_or_baskets = []
checked = set()
for _, context in expression_context:
for branch in context["branches"]:
if branch.cache_key not in checked:
checked.add(branch.cache_key)
for (
basket_num,
range_or_basket,
) in branch.entries_to_ranges_or_baskets(
sub_entry_start, sub_entry_stop
):
previous_basket = previous_baskets.get(
(branch.cache_key, basket_num)
)
if previous_basket is None:
ranges_or_baskets.append(
(branch, basket_num, range_or_basket)
)
else:
ranges_or_baskets.append(
(branch, basket_num, previous_basket)
)
arrays = {}
interp_options = {"ak_add_doc": ak_add_doc}
_ranges_or_baskets_to_arrays(
self,
ranges_or_baskets,
branchid_interpretation,
sub_entry_start,
sub_entry_stop,
decompression_executor,
interpretation_executor,
library,
arrays,
True,
interp_options,
)

in Branch.array:

arrays = {}
expression_context = []
branchid_interpretation = {}
_regularize_branchname(
self,
self.name,
self,
interpretation,
get_from_cache,
arrays,
expression_context,
branchid_interpretation,
True,
False,
)
ranges_or_baskets = []
checked = set()
for _, context in expression_context:
for branch in context["branches"]:
if branch.cache_key not in checked and not isinstance(
branchid_interpretation[branch.cache_key],
uproot.interpretation.grouped.AsGrouped,
):
checked.add(branch.cache_key)
for (
basket_num,
range_or_basket,
) in branch.entries_to_ranges_or_baskets(entry_start, entry_stop):
ranges_or_baskets.append((branch, basket_num, range_or_basket))
interp_options = {"ak_add_doc": ak_add_doc}
_ranges_or_baskets_to_arrays(
self,
ranges_or_baskets,
branchid_interpretation,
entry_start,
entry_stop,
decompression_executor,
interpretation_executor,
library,
arrays,
False,
interp_options,
)

Maybe it's possible to retry only the _ranges_or_baskets_to_arrays call, but I'm not sure if there would be a counting problem if the arrays dict and ranges_or_baskets list are not reinitialized after a partial failure. (These are notes for implementation.) If the number of items in the arrays dict is wrong, the code will hang in this loop, rather than raise an error!

There would also need to be a way to change the granularity of the XRootD request while still requesting all the data the user wants. The Source.chunks method doesn't have a way to express that, but maybe the coalesce algorithm does? Unfortunately, coalesce arguments are specified per-FSSpecSource object, but maybe each retry could be a new, more finely granular FSSpecSource? (We don't know if another thread is using the same FSSpecSource object; we can't change it in place.)

Originally posted by @jpivarski in #1217 (comment)

Metadata

Metadata

Assignees

No one assigned

    Labels

    featureNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions