feat: Handle extra mdoc fields and support fieldname aliases#33
feat: Handle extra mdoc fields and support fieldname aliases#33alisterburt merged 8 commits intoteamtomo:mainfrom
Conversation
alisterburt
left a comment
There was a problem hiding this comment.
@sumslogs thank you for taking the time to put together a thoughtfully constructed PR with links to relevant documentation! Lots of great improvements here
I agree the FrameDosesAndNumbers thing is weird - the SerialEM file formats documentation differs from the SerialEM source you linked...
I had one minor question about a change you made in the .to_string() method but otherwise this looks good to go!
|
as an aside, would you be interested in joining our regular developer meeting? Happens once a month, next one is tomorrow at 8am PST. If interested could you introduce yourself and your interests in our zulip channel and DM me your email so I can add you to the calendar invite |
09046f3 to
56c011f
Compare
src/mdocfile/data_models.py
Outdated
|
|
||
| from mdocfile.utils import find_section_entries, find_title_entries | ||
|
|
||
| log = logging.getLogger('mdocfile.parser') |
There was a problem hiding this comment.
could you switch this to just mdocfile if iterating? Thanks!
|
I noticed that pydantic actually has built-in alias support, so I refactored it to use that rather than making a custom dictionary mapping. |
56c011f to
0db8ec6
Compare
|
@sumslogs unsurprising, pydantic seems to try to cater to all 😂 Let me know when you're done iterating and we can merge I'm going to give you maintainer rights on this repo as you seem responsible and thoughtful - please still go through the PR flow so people have an opportunity to review contributions Great work here |
927a243 to
611d373
Compare
src/mdocfile/data_models.py
Outdated
| """ | ||
| model_config = ConfigDict(extra='allow', # keep extra field data | ||
| validate_by_name=True) # use our validations for aliased fields | ||
| # serialize_by_alias=True) # use the version of the fieldname the file arrived as |
There was a problem hiding this comment.
@alisterburt What to do here isn't entirely clear; the options are to either force the field to take one value (i.e. force output to become plural since it's plural in the package's def), or to use what the file came as upon serialization.
There was a problem hiding this comment.
Leaving field names as they were found (serialize_by_alias=True) makes sense in a context of someone using the package to validate/modify/output such that whatever originally created it can load it again. But could just as easily make an argument for standardizing it. It might make sense for the constructor methods to pass along pydantic config settings.
There was a problem hiding this comment.
@sumslogs yeah no obvious right answer - my intuition is that we don't want things to change magically too much so probably using what the file came in as if that's relatively okay to implement
611d373 to
774cf29
Compare
2c4a7d6 to
0eb3486
Compare
|
@alisterburt Alright after some iterating, I think this is ready to review again. It preserves the original field names on string and dataframe export, and I tried to keep it straight forward to be able to add new aliases if they arise. |
alisterburt
left a comment
There was a problem hiding this comment.
wonderful - thanks for all of the effort here, tests make intended behavior nice and clear. Reactivated CI as it got auto deactivated so made a few no_op changes in tests to trigger that running
|
deploying v0.2.3 |
Thermo's software (Tomo 5) is placing some fields into its Mdocs that were being dropped by mdocfile.
In particular, I observed it happening for "CountsPerElectron", and "FrameDosesAndNumber".
The SerialEM/IMOD documentation doesn't mention CountsPerElectron, but it's used elsewhere.
"FrameDosesAndNumber" is a bit more strange in that it's present in the SerialEM doc (and this package) as "FrameDosesAndNumbers" (plural).
However see:
To address those issues, this PR does a couple things:
I also noticed that the FrameDosesAndNumbers parser wasn't actually handling the sequence of tuple data structure correctly (and wasn't emitting the string serialization of it correctly) so I added that.