feat: Handle extra mdoc fields and support fieldname aliases #33

Merged
alisterburt merged 8 commits into teamtomo:main from sumslogs:unknown_or_aliased_fields
Jan 28, 2026

@sumslogs (Collaborator) commented Jan 27, 2026

Thermo's software (Tomo 5) writes some fields into its mdoc files that mdocfile was dropping.

In particular, I observed this for "CountsPerElectron" and "FrameDosesAndNumber".
The SerialEM/IMOD documentation doesn't mention CountsPerElectron, but it's used elsewhere.

"FrameDosesAndNumber" is stranger: it appears in the SerialEM documentation (and in this package) as "FrameDosesAndNumbers" (plural).
However see:

To address those issues, this PR does a couple of things:

  1. Adds a mapping dictionary, "fieldname_aliases", in data_models.py to treat "FrameDosesAndNumber" as "FrameDosesAndNumbers". (And possibly others in the future.)
  2. No longer drops unknown fields; instead it passes them through and issues a warning at parsing time that they were observed.
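
To make the two changes concrete, here is a minimal, self-contained sketch of the idea. It is not the code in this PR: the names `FIELDNAME_ALIASES`, `KNOWN_FIELDS`, and `normalise_fields` are hypothetical, the set of known fields is a tiny illustrative subset, and the sample values are made up.

```python
import logging
from typing import Any, Dict, Tuple

log = logging.getLogger("mdocfile")

# Hypothetical alias table: alternate spelling -> canonical field name.
FIELDNAME_ALIASES: Dict[str, str] = {
    "FrameDosesAndNumber": "FrameDosesAndNumbers",
}

# Hypothetical subset of the fields the data model declares.
KNOWN_FIELDS = {"TiltAngle", "ExposureDose", "FrameDosesAndNumbers"}


def normalise_fields(raw: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split raw key/value pairs into (known, extra) after resolving aliases."""
    known: Dict[str, Any] = {}
    extra: Dict[str, Any] = {}
    for name, value in raw.items():
        canonical = FIELDNAME_ALIASES.get(name, name)
        if canonical in KNOWN_FIELDS:
            known[canonical] = value
        else:
            # Keep the value instead of dropping it, but tell the user.
            log.warning("Unknown mdoc field %r passed through unvalidated", name)
            extra[name] = value
    return known, extra


known, extra = normalise_fields({
    "TiltAngle": "-60.0",
    "FrameDosesAndNumber": "0.5 10",   # singular spelling, resolved via the alias table
    "CountsPerElectron": "32",         # not declared, so passed through with a warning
})
```

In this sketch the aliased key ends up under its canonical name in `known`, while the undeclared key survives in `extra` rather than being silently lost.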

I also noticed that the FrameDosesAndNumbers parser wasn't handling the sequence-of-tuples data structure correctly (and wasn't serializing it back to a string correctly), so I fixed that too.
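
As a rough illustration of the round trip involved, the sketch below parses and re-serializes a sequence of (dose, frame count) tuples. It assumes the value is written as whitespace-separated alternating dose and count tokens; that layout, the function names, and the sample values are assumptions for illustration, not taken from this PR.

```python
from typing import List, Sequence, Tuple


def parse_frame_doses_and_numbers(text: str) -> List[Tuple[float, int]]:
    """Parse 'dose count dose count ...' into a list of (dose, count) tuples."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError(f"expected dose/count pairs, got {len(tokens)} values")
    # Even positions are doses, odd positions are frame counts.
    return [(float(d), int(n)) for d, n in zip(tokens[0::2], tokens[1::2])]


def format_frame_doses_and_numbers(pairs: Sequence[Tuple[float, int]]) -> str:
    """Serialize (dose, count) tuples back to a whitespace-separated string."""
    return " ".join(f"{dose!r} {count}" for dose, count in pairs)


pairs = parse_frame_doses_and_numbers("0.5 10 0.25 4")
text = format_frame_doses_and_numbers(pairs)
```

Using `repr` for the dose keeps the float's shortest exact representation, so parsing the formatted string again yields the same tuples.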

@alisterburt (Collaborator) left a comment

@sumslogs thank you for taking the time to put together a thoughtfully constructed PR with links to relevant documentation! Lots of great improvements here

I agree the FrameDosesAndNumbers thing is weird - the SerialEM file formats documentation differs from the SerialEM source you linked...

I had one minor question about a change you made in the .to_string() method but otherwise this looks good to go!

@alisterburt (Collaborator)

As an aside, would you be interested in joining our regular developer meeting? It happens once a month, and the next one is tomorrow at 8am PST. If you're interested, could you introduce yourself and your interests in our Zulip channel and DM me your email so I can add you to the calendar invite?

@sumslogs force-pushed the unknown_or_aliased_fields branch from 09046f3 to 56c011f (January 27, 2026 20:26)

from mdocfile.utils import find_section_entries, find_title_entries

log = logging.getLogger('mdocfile.parser')
Collaborator

Could you switch the logger name to just 'mdocfile' if you're still iterating? Thanks!

Collaborator Author

Done.

@sumslogs (Collaborator, Author)

I noticed that pydantic actually has built-in alias support, so I refactored it to use that rather than making a custom dictionary mapping.
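
For readers unfamiliar with pydantic's alias support, here is a minimal sketch, assuming pydantic v2. The model name `SectionSketch`, its field subset, and the sample values are hypothetical, and `AliasChoices` is only one of pydantic's alias mechanisms; the code in this PR may use a different one.

```python
from typing import Optional

from pydantic import AliasChoices, BaseModel, ConfigDict, Field


class SectionSketch(BaseModel):
    # extra="allow" keeps undeclared fields instead of discarding them.
    model_config = ConfigDict(extra="allow")

    TiltAngle: Optional[float] = None
    # Accept either spelling on input; store it under the plural attribute.
    FrameDosesAndNumbers: Optional[str] = Field(
        default=None,
        validation_alias=AliasChoices("FrameDosesAndNumbers", "FrameDosesAndNumber"),
    )


section = SectionSketch(
    TiltAngle=-60.0,
    FrameDosesAndNumber="0.5 10",  # singular spelling, matched by the alias
    CountsPerElectron="32",        # undeclared, kept in model_extra
)
```

Here `section.FrameDosesAndNumbers` holds the aliased value and `section.model_extra` holds the undeclared field, so no custom dictionary lookup is needed.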

@sumslogs force-pushed the unknown_or_aliased_fields branch from 56c011f to 0db8ec6 (January 27, 2026 20:30)
@alisterburt (Collaborator)

@sumslogs unsurprising, pydantic seems to try to cater to all 😂

Let me know when you're done iterating and we can merge

I'm going to give you maintainer rights on this repo as you seem responsible and thoughtful - please still go through the PR flow so people have an opportunity to review contributions

Great work here

@sumslogs force-pushed the unknown_or_aliased_fields branch 2 times, most recently from 927a243 to 611d373 (January 27, 2026 21:38)
"""
    model_config = ConfigDict(extra='allow',  # keep extra field data
                              validate_by_name=True)  # use our validations for aliased fields
                              # serialize_by_alias=True)  # use the version of the fieldname the file arrived as
Collaborator Author

@alisterburt What to do here isn't entirely clear. The options are either to force the field to take one name (i.e. always output the plural form, since it's plural in the package's definition) or to serialize it under whatever name the file arrived with.

Collaborator Author

Leaving field names as they were found (serialize_by_alias=True) makes sense when someone uses the package to validate, modify, and write out a file so that whatever originally created it can load it again. But one could just as easily argue for standardizing the names. It might make sense for the constructor methods to pass along pydantic config settings.

Collaborator

@sumslogs yeah, no obvious right answer. My intuition is that we don't want things to change magically too much, so probably use what the file came in as, if that's relatively okay to implement.
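
The two options discussed in this thread can be illustrated without pydantic by remembering which spelling each field arrived with. This is a plain-Python sketch of the trade-off, not the mechanism used in this PR; `ALIASES`, `FieldStore`, and the `preserve_names` flag are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical alias table: alternate spelling -> canonical field name.
ALIASES = {"FrameDosesAndNumber": "FrameDosesAndNumbers"}


@dataclass
class FieldStore:
    # canonical name -> value
    values: Dict[str, str] = field(default_factory=dict)
    # canonical name -> spelling seen in the input file
    source_names: Dict[str, str] = field(default_factory=dict)

    def load(self, name: str, value: str) -> None:
        canonical = ALIASES.get(name, name)
        self.values[canonical] = value
        self.source_names[canonical] = name

    def to_string(self, preserve_names: bool = True) -> str:
        """Emit 'Key = value' lines, using original or canonical names."""
        lines = []
        for canonical, value in self.values.items():
            key = self.source_names[canonical] if preserve_names else canonical
            lines.append(f"{key} = {value}")
        return "\n".join(lines)


store = FieldStore()
store.load("FrameDosesAndNumber", "0.5 10")
as_found = store.to_string()                          # keeps the singular spelling
standardized = store.to_string(preserve_names=False)  # forces the plural spelling
```

Preserving the original spelling means a file can be read, modified, and written back in a form its producer should still recognise, at the cost of the output not being normalized.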

@sumslogs force-pushed the unknown_or_aliased_fields branch from 611d373 to 774cf29 (January 27, 2026 21:44)
@sumslogs force-pushed the unknown_or_aliased_fields branch from 2c4a7d6 to 0eb3486 (January 28, 2026 01:12)
@sumslogs (Collaborator, Author)

@alisterburt Alright, after some iterating, I think this is ready to review again. It preserves the original field names on string and dataframe export, and I tried to keep it straightforward to add new aliases if they arise.

@alisterburt (Collaborator) left a comment

Wonderful, thanks for all of the effort here. The tests make the intended behavior nice and clear. I reactivated CI, as it had been automatically deactivated, and made a few no-op changes in the tests to trigger a run.

@alisterburt merged commit ee4d304 into teamtomo:main on Jan 28, 2026
5 checks passed