Enable save_metric=1 and source MCMC metric info from new JSON file #844
base: develop
Conversation
In shifting the metric, metric type, and step size info to being sourced from the metric JSON files instead of the CSV comments, we will no longer populate those attributes during the `_assemble_draws` step; instead, we lazily load them from the metric JSON files when the properties are accessed. This allows the draws to still be used and manipulated when the metric information is unavailable, such as when only the Stan CSV files are saved.
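A minimal sketch of the lazy-loading pattern described above (the class and attribute names here, such as `LazyMetric` and `_metric_path`, are illustrative, not the PR's actual code; the `inv_metric` key is assumed from the summary's description of the metric JSON file):

```python
from __future__ import annotations

import json
from pathlib import Path

import numpy as np


class LazyMetric:
    """Illustrative container that defers reading the metric JSON
    until the inverse metric is actually requested."""

    def __init__(self, metric_path: Path) -> None:
        self._metric_path = metric_path
        self._inv_metric: np.ndarray | None = None  # None = not yet parsed

    @property
    def inv_metric(self) -> np.ndarray:
        # Only touch the filesystem on first access, so the draws stay
        # usable when the metric JSON was never written.
        if self._inv_metric is None:
            with open(self._metric_path) as fd:
                data = json.load(fd)
            self._inv_metric = np.asarray(data["inv_metric"])
        return self._inv_metric
```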
I'll need some time to look over this, but at first glance it looks pretty good. I don't think Pydantic would make a ton of sense for this alone, but as you note we will probably want something more powerful when loading the config files. I do worry about being too strict about I/O, since it could mean that cmdstanpy becomes unusable with development versions, or right when a new CmdStan version is released before cmdstanpy has itself updated. Right now this will most likely 'just work' as long as you don't request information related to the part of the config that is new or different in that CmdStan version, but if we were doing full validation it would fail even if you only actually need some unrelated piece.
Yeah, if it were just this, I wouldn't be interested in adding the dependency, but with potentially quite a few output files that need to be parsed, I think it would be useful. We could always implement the equivalent structures without pydantic; it would just be a bit of extra work. Good point about strictness -- perhaps we could have validation issues raise warnings instead of errors, and only be strict where a failure would indicate something has actually gone wrong?
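One possible shape for the "warn instead of raise" idea, sketched with pydantic v2 (the `SamplerConfig` model and its field are made up for illustration):

```python
from __future__ import annotations

import warnings

from pydantic import BaseModel, ConfigDict, ValidationError


class SamplerConfig(BaseModel):
    # Unknown keys from a newer CmdStan are ignored rather than fatal.
    model_config = ConfigDict(extra="ignore")

    stepsize: float


def parse_config_lenient(raw: str) -> SamplerConfig | None:
    """Warn, instead of raising, when a config file fails validation."""
    try:
        return SamplerConfig.model_validate_json(raw)
    except ValidationError as err:
        warnings.warn(f"Could not validate sampler config: {err}")
        return None
```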
WardBrian left a comment:
First set of questions!
```python
config = self._validate_csv_files()
self._metadata: InferenceMetadata = InferenceMetadata(config)
self._chain_metric_info: list[MetricInfo] = []
self._metric_info_parsed: bool = False
```
Do we really need this flag? It seems like it's always equivalent to `self._metric_type == None`.
```diff
@@ -1,10 +1,16 @@
 """Container for metadata parsed from the output of a CmdStan run"""

+from __future__ import annotations
```
Is this necessary for pydantic? I know there was a lot of discussion about annotation handling and pydantic, but I didn't keep up with it
| """Structured representation of HMC-NUTS metric information, | ||
| as output by CmdStan""" | ||
|
|
||
| chain_id: int = Field(gt=0) |
I believe the chain id can be equal to 0, not just greater than 0 -- so this should probably be `ge=0` rather than `gt=0`.
Also -- is there a reason we need the chain ID in here? Since it isn't in the file, it necessitates our own `from_json` rather than just using `model_validate_json`.
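For illustration, dropping `chain_id` would let the stock parser handle the file directly. A sketch assuming the JSON keys mirror the model's remaining fields (`metric_type`, `inv_metric`), with a `BeforeValidator` coercing the JSON list into an array so the `isinstance` check passes:

```python
from typing import Annotated, Literal

import numpy as np
from pydantic import BaseModel, BeforeValidator, ConfigDict

# Coerce the parsed JSON list into an ndarray before validation runs.
NDArray = Annotated[np.ndarray, BeforeValidator(np.asarray)]


class MetricInfo(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)

    metric_type: Literal["diag_e", "dense_e", "unit_e"]
    inv_metric: NDArray


raw = '{"metric_type": "diag_e", "inv_metric": [1.0, 2.0]}'
info = MetricInfo.model_validate_json(raw)  # no custom from_json needed
```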
```python
metric_type: Literal["diag_e", "dense_e", "unit_e"]
inv_metric: np.ndarray

model_config = {"arbitrary_types_allowed": True}
```
This sounds like a scary config option -- what does it do for us here?
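For reference, `arbitrary_types_allowed` lets a model annotate fields with types pydantic has no schema for (like `np.ndarray`); validation of such fields then reduces to a plain `isinstance` check. A standalone example of the behavior:

```python
import numpy as np
from pydantic import BaseModel, ConfigDict


class WithArray(BaseModel):
    # Without this option, pydantic raises a schema-generation error at
    # class-definition time, since it has no validator for np.ndarray.
    model_config = ConfigDict(arbitrary_types_allowed=True)

    values: np.ndarray


WithArray(values=np.ones(3))           # passes: the value is an ndarray
# WithArray(values=[1.0, 2.0, 3.0])    # would fail the isinstance check
```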
All the test files here use `diag_e`; probably worth trying the other values as well?
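A hypothetical `dense_e` fixture, written here in Python; the field names follow the model fields shown in the diff above, but the exact keys CmdStan emits should be checked against a real `save_metric=1` output:

```python
import json

# dense_e stores the full inverse metric matrix, not just its diagonal.
dense_metric = {
    "metric_type": "dense_e",
    "inv_metric": [[1.0, 0.1], [0.1, 1.0]],
}
with open("chain_1_metric.json", "w") as fd:
    json.dump(dense_metric, fd)
```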
Submission Checklist
Summary
This PR enables `save_metric=1` for CmdStan sampling with adaptation. This outputs a new metric JSON file that contains the step size, `inv_metric`, and metric type for the sampling run. This PR also lazily sources this info from these files when the corresponding properties on the `CmdStanMCMC` objects are accessed. By changing the source of this info, we can also remove all the code where we were parsing adaptation info from the Stan CSV file, which this PR does.

Importantly, this PR proposes adding `pydantic` as a dependency to `cmdstanpy`. With more info being sourced from JSON files (in this metric change and soon from the config files), we want to be more careful about I/O validation. Pydantic is now a fairly standard way to parse and validate data in the Python ecosystem, and I think it is a good fit for our needs.

Closes #713.
@WardBrian In the issue comments, we discussed an update to the `save_csvfiles` methods that would include the other output files. This PR doesn't include that change. I could work to incorporate it here, but I think it makes sense to look at that as a standalone PR.

Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): myself
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: