Update uses of `get_crds_parameters` to allow for more efficient file opening #228

braingram · 2025-03-20T16:51:36Z

get_crds_parameters returns a subset of metadata that's passed to crds for reference file selection. Since only a subset is needed if Step and Pipeline are processing an input file (and not an input DataModel) it may be possible to perform a partial read of the file to more efficiently access the same information. This issue is to layout API changes needed to accommodate this mode.

The current get_crds_parameters uses are:
In Step.get_reference_file:

stpipe/src/stpipe/step.py

Lines 809 to 814 in 9692bfa

    
           with self.open_model(input_file) as model: 
        
               reference_name = crds_client.get_reference_file( 
        
                   model.get_crds_parameters(), 
        
                   reference_file_type, 
        
                   model.crds_observatory, 
        
               )

The use of open_model here seems superfluous as the input should be an open DataModel (and open_model will sometimes copy the input). Within the scope of this ticket I think this usage can be ignored.

In Step.get_config_from_reference:

stpipe/src/stpipe/step.py

Lines 862 to 867 in 9692bfa

    
           with cls._datamodels_open(dataset, asn_n_members=1) as model: 
        
               if isinstance(model, Sequence): 
        
                   # Pull out first model in ModelContainer 
        
                   model = model[0] 
        
               crds_parameters = model.get_crds_parameters() 
        
               crds_observatory = model.crds_observatory

This is likely to receive as input a filename instead of a model due to usage in cmdline.

In Pipeline.get_config_from_reference:

stpipe/src/stpipe/pipeline.py

Lines 187 to 193 in 9692bfa

    
           with cls._datamodels_open(dataset, asn_n_members=1) as model: 
        
               if isinstance(model, Sequence): 
        
                   crds_parameters = model._models[0].get_crds_parameters() 
        
                   crds_observatory = model.crds_observatory 
        
               else: 
        
                   crds_parameters = model.get_crds_parameters() 
        
                   crds_observatory = model.crds_observatory

This looks functionally the same as the usage in Step.get_config_from_reference.

In Pipeline._precache_references_impl:

stpipe/src/stpipe/pipeline.py

Lines 319 to 321 in 9692bfa

    
           crds_refs = crds_client.get_multiple_reference_paths( 
        
               model.get_crds_parameters(), fetch_types, model.crds_observatory 
        
           )

This is indirectly called from Pipeline._precache_references:

stpipe/src/stpipe/pipeline.py

Lines 263 to 266 in 9692bfa

    
           with self.open_model( 
        
               input_file, asn_n_members=1, asn_exptypes=["science"] 
        
           ) as model: 
        
               self._precache_references_opened(model)

which calls Pipeline._precache_references_opened which iterates through any container.

The text was updated successfully, but these errors were encountered:

braingram · 2025-03-20T17:27:44Z

My initial reaction is that we could add a class method Step._get_crds_parameters (although the naming could be confusing). Something like:

@classmethod
def _get_crds_parameters(dataset):
    ...
    return crds_parameters, observatory

The need for the class method is for get_config_from_reference (which is a classmethod and called before a Step is created) where the usage of the new _get_crds_parameters would look something like:

- with cls._datamodels_open(dataset, asn_n_members=1) as model: 
-     if isinstance(model, Sequence): 
-         crds_parameters = model._models[0].get_crds_parameters() 
-         crds_observatory = model.crds_observatory 
-     else: 
-         crds_parameters = model.get_crds_parameters() 
-         crds_observatory = model.crds_observatory 
+ crds_parameters, crds_observatory = cls._get_crds_parameters(dataset)

Things are a little trickier for _precache_references. The current code is written in a way where a container with multiple models would be iterated over to get references for all contained models. However the container is opened with asn_n_members=1, asn_exptypes=["science"] (so only the first science exposure is used for pre-fetching). This might be a bug and is worth some discussion. If we say "only the first science exposure is used for prefetching" then we could likely simplify _precache_references combining it with _precache_references_opened and _precache_references_impl adding a call to cls._get_crds_parameters. If we say this is a bug and perhaps we want to leave this up to the child class we could consider making cls._get_crds_parameters a generator and implementing _precache_references as:

for parameters, observatory in cls._get_crds_parameters(dataset):
   # handle overrides ...
   crds_refs = crds_client.get_multiple_reference_paths(...)

Another complication of _precache_references is that it logs meta.filename. This won't be loaded by spacetelescope/stdatamodels#284

braingram mentioned this issue Mar 20, 2025

Improve performance of DataModel reading for single (or simple) metadata access spacetelescope/stdatamodels#284

Open

braingram changed the title ~~Update uses of get_crds_parameters to allow fo more efficient file opening~~ Update uses of get_crds_parameters to allow for more efficient file opening Mar 20, 2025

stscijgbot-jp mentioned this issue Mar 21, 2025

Allow passing metadata dict (instead of datamodel) through stpipe for CRDS queries spacetelescope/jwst#9312

Open

braingram linked a pull request Mar 21, 2025 that will close this issue

add Step._get_crds_parameters #229

Draft

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update uses of `get_crds_parameters` to allow for more efficient file opening #228

Update uses of `get_crds_parameters` to allow for more efficient file opening #228

braingram commented Mar 20, 2025 •

edited

Loading

braingram commented Mar 20, 2025 •

edited

Loading

Update uses of get_crds_parameters to allow for more efficient file opening #228

Update uses of get_crds_parameters to allow for more efficient file opening #228

Comments

braingram commented Mar 20, 2025 • edited Loading

braingram commented Mar 20, 2025 • edited Loading

Update uses of `get_crds_parameters` to allow for more efficient file opening #228

Update uses of `get_crds_parameters` to allow for more efficient file opening #228

braingram commented Mar 20, 2025 •

edited

Loading

braingram commented Mar 20, 2025 •

edited

Loading