-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add extension point for dataset-loader associations #2246
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jnazaren I haven't taken a careful look at this, but I figured I should post what I have so far here.
zipline/pipeline/dispatcher.py
Outdated
self._column_loaders.clear() | ||
|
||
|
||
global_pipeline_dispatcher = PipelineDispatcher() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a global registry for this? In general, I think we should avoid the pattern of having a mutable global registry of stuff unless it's absolutely necessary.
zipline/pipeline/dispatcher.py
Outdated
pl : PipelineLoader | ||
The PipelineLoader to register for the column or dataset columns | ||
""" | ||
assert isinstance(pl, PipelineLoader) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two notes on this:
- I don't think we need to be doing type checking on these inputs. A pipeline loader can be any object that implements
load_adjusted_array
with the correct signature. Forcing all the loaders to be subclasses of a base class isn't necessary. If a user wants to validate that their class implements the correct methods, that might make sense, but I had actually forgotten that thePipelineLoader
base class existed, and we definitely have loaders downstream that don't subclass that base class. - It's almost never a good idea to use
assert
for validating user input.assert
statements get disabled if you run python with-O2
. They're meant to be used for sanity checks in the internal logic of your program, not for verifying that a value received from a user matches some contract. Copy/pasting from a reply that I wrote elsewhere earlier this summer:
When Should You Use Assertions?
Generally speaking, you should use assertions for things that indicate that
there's a bug in your program or library that's severe enough that you'd prefer
for the program to crash immediately with a useful error rather than
continuing. In particular, you should only use assertions for things that you
have reason to believe to be true.
That generally means that you should use assertions for things like
preconditions and postconditions of functions that are internal to the library,
and that you shouldn't use assertions for things like validating user input
(whether or not the user passes you valid inputs isn't under your control, so
it can't be indicative of a bug in your part of the program).
A common scenario where assertions are useful is to have a user-facing API
layer that validates that input is well formed that then passes the known-good
input into internal library functions. You might include assertions in the
internal library functions to ensure that we don't accidentally call the
internal function on un-validated data, e.g.:
def user_facing_function(possibly_bad_data):
good_data = validate(possibly_bad_data)
return _internal_function(good_data)
def _internal_function(good_data):
assert is_good(good_data), "good_data isn't good. Did you forget to validate?"
# do stuff
zipline/pipeline/dispatcher.py
Outdated
dataset_loaders : dict[DataSet -> PipelineLoader] | ||
Map from datasets to pipeline loader for those datasets. | ||
""" | ||
def __init__(self, column_loaders=None, dataset_loaders=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to segregate these into column_loaders and dataset_loaders? We could just accept loaders
with a dictionary from either BoundColumn or DataSet -> loader.
zipline/pipeline/dispatcher.py
Outdated
# already registered, allowing users to register their own loaders | ||
# early on in extensions | ||
if isinstance(data, BoundColumn): | ||
if data not in self._column_loaders: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems dangerous to me to silently ignore conflicts if a loader gets registered for a dataset multiple times. I would expect this to be an error, and if not, I would have expected the later registration to win.
Independently of the above, dict.setdefault
is a built-in way to write a key to a dictionary only if it's not present.
zipline/pipeline/dispatcher.py
Outdated
self.register(dataset, pl) | ||
|
||
def __call__(self, column): | ||
if column in self._column_loaders: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's generally more idiomatic in python to write this as:
try:
return self._column_loaders[column]
except KeyError:
raise LookupError(...)
c.f. https://blogs.msdn.microsoft.com/pythonengineering/2016/06/29/idiomatic-python-eafp-versus-lbyl/, for example
zipline/pipeline/dispatcher.py
Outdated
self.column_loaders = mappingproxy(self._column_loaders) | ||
if dataset_loaders is not None: | ||
for dataset, pl in dataset_loaders.items(): | ||
self.register(dataset, pl) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to try to avoid calling methods on self
in constructors. Generally when editing a method, most people assume that the class is fully-constructed by the time the method is called (for example, they assume that all attributes will be set). Calling a method in a constructor can violate that assumption.
zipline/utils/run_algo.py
Outdated
# we register our default loader last, after any loaders from users | ||
# have been registered via extensions | ||
register_pipeline_loader(USEquityPricing, pipeline_loader) | ||
|
||
def choose_loader(column): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to create a new function here anymore. global_pipeline_dispatcher
is already a callable with the right signature.
tests/pipeline/test_dispatcher.py
Outdated
cls.add_class_callback(clear_all_associations) | ||
|
||
def test_load_not_registered(self): | ||
fake_col_instance = FakeColumn( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked carefully at these tests yet, but in general, it's not expected that anyone should ever construct BoundColumn instances explicitly. The usual way to get a bound column is to do:
class SomeDataSet(DataSet):
column = Column(...)
SomeDataSet.column # This evaluates to a BoundColumn because Column implements the descriptor protocol.
Implements a possible solution according to what @ssanderson suggested in #2199