Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update uses of get_crds_parameters to allow for more efficient file opening #228

Open
braingram opened this issue Mar 20, 2025 · 1 comment · May be fixed by #229
Open

Update uses of get_crds_parameters to allow for more efficient file opening #228

braingram opened this issue Mar 20, 2025 · 1 comment · May be fixed by #229

Comments

@braingram
Copy link
Collaborator

braingram commented Mar 20, 2025

get_crds_parameters returns a subset of metadata that's passed to crds for reference file selection. Since only a subset is needed if Step and Pipeline are processing an input file (and not an input DataModel) it may be possible to perform a partial read of the file to more efficiently access the same information. This issue is to layout API changes needed to accommodate this mode.

The current get_crds_parameters uses are:
In Step.get_reference_file:

stpipe/src/stpipe/step.py

Lines 809 to 814 in 9692bfa

with self.open_model(input_file) as model:
reference_name = crds_client.get_reference_file(
model.get_crds_parameters(),
reference_file_type,
model.crds_observatory,
)

The use of open_model here seems superfluous as the input should be an open DataModel (and open_model will sometimes copy the input). Within the scope of this ticket I think this usage can be ignored.

In Step.get_config_from_reference:

stpipe/src/stpipe/step.py

Lines 862 to 867 in 9692bfa

with cls._datamodels_open(dataset, asn_n_members=1) as model:
if isinstance(model, Sequence):
# Pull out first model in ModelContainer
model = model[0]
crds_parameters = model.get_crds_parameters()
crds_observatory = model.crds_observatory

This is likely to receive as input a filename instead of a model due to usage in cmdline.

In Pipeline.get_config_from_reference:

with cls._datamodels_open(dataset, asn_n_members=1) as model:
if isinstance(model, Sequence):
crds_parameters = model._models[0].get_crds_parameters()
crds_observatory = model.crds_observatory
else:
crds_parameters = model.get_crds_parameters()
crds_observatory = model.crds_observatory

This looks functionally the same as the usage in Step.get_config_from_reference.

In Pipeline._precache_references_impl:

crds_refs = crds_client.get_multiple_reference_paths(
model.get_crds_parameters(), fetch_types, model.crds_observatory
)

This is indirectly called from Pipeline._precache_references:
with self.open_model(
input_file, asn_n_members=1, asn_exptypes=["science"]
) as model:
self._precache_references_opened(model)

which calls Pipeline._precache_references_opened which iterates through any container.

@braingram
Copy link
Collaborator Author

braingram commented Mar 20, 2025

My initial reaction is that we could add a class method Step._get_crds_parameters (although the naming could be confusing). Something like:

@classmethod
def _get_crds_parameters(dataset):
    ...
    return crds_parameters, observatory

The need for the class method is for get_config_from_reference (which is a classmethod and called before a Step is created) where the usage of the new _get_crds_parameters would look something like:

- with cls._datamodels_open(dataset, asn_n_members=1) as model: 
-     if isinstance(model, Sequence): 
-         crds_parameters = model._models[0].get_crds_parameters() 
-         crds_observatory = model.crds_observatory 
-     else: 
-         crds_parameters = model.get_crds_parameters() 
-         crds_observatory = model.crds_observatory 
+ crds_parameters, crds_observatory = cls._get_crds_parameters(dataset)

Things are a little trickier for _precache_references. The current code is written in a way where a container with multiple models would be iterated over to get references for all contained models. However the container is opened with asn_n_members=1, asn_exptypes=["science"] (so only the first science exposure is used for pre-fetching). This might be a bug and is worth some discussion. If we say "only the first science exposure is used for prefetching" then we could likely simplify _precache_references combining it with _precache_references_opened and _precache_references_impl adding a call to cls._get_crds_parameters. If we say this is a bug and perhaps we want to leave this up to the child class we could consider making cls._get_crds_parameters a generator and implementing _precache_references as:

for parameters, observatory in cls._get_crds_parameters(dataset):
   # handle overrides ...
   crds_refs = crds_client.get_multiple_reference_paths(...)

Another complication of _precache_references is that it logs meta.filename. This won't be loaded by spacetelescope/stdatamodels#284

@braingram braingram changed the title Update uses of get_crds_parameters to allow fo more efficient file opening Update uses of get_crds_parameters to allow for more efficient file opening Mar 20, 2025
@braingram braingram linked a pull request Mar 21, 2025 that will close this issue
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant