Use a serialization library #1092
FYI, I started on https://github.com/TomAugspurger/msgspec-stac/blob/main/msgspec_stac.py a few weeks back but haven't done anything with it since (using msgspec). It seemed pretty straightforward, but I had a few questions:
In theory pydantic v2 is much faster (though still slower than msgspec, last I saw).
I did a dead-simple msgspec implementation myself just now. For more context, this is how it's done in Rust: https://github.com/gadomski/stac-rs/blob/dfaaabc00f581af3d6b948ee3de24f4b68e5acdd/stac/src/item.rs#L66-L68
There's a pydantic STAC implementation at https://github.com/stac-utils/stac-pydantic which doesn't seem very active, but I still made a PR to migrate it to pydantic v2 in the hope it'll be useful at some point.
@thomas-maschler did some ad-hoc benchmarking in radiantearth/stac-spec#1252 (comment) and found pydantic to be slower than pystac in the deserialization case.
I would be very curious about where the time differences are coming from. All things being equal, I would definitely expect pydantic to be faster, since most of the work is happening in Rust rather than Python. If I had to guess, it may be related to pystac validation being optional versus built in with pydantic. There may be ways around that, as well as ways to improve performance relative to those benchmarks. It could be very beneficial to go that route, if the performance is acceptable, as it would supersede
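One way to test the "validation overhead" hypothesis is to time decoding with and without a validation pass. A stdlib-only sketch; the `validate_item` checks here are illustrative and far simpler than real STAC validation:

```python
import json
import timeit

doc = json.dumps({
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example",
    "geometry": None,
    "properties": {"datetime": "2024-01-01T00:00:00Z"},
})

def validate_item(d):
    # Illustrative checks only; real STAC validation is far more involved.
    if d.get("type") != "Feature":
        raise ValueError("not an Item")
    for key in ("stac_version", "id", "properties"):
        if key not in d:
            raise ValueError(f"missing {key}")
    return d

plain = timeit.timeit(lambda: json.loads(doc), number=10_000)
checked = timeit.timeit(lambda: validate_item(json.loads(doc)), number=10_000)
print(f"decode only: {plain:.4f}s  decode + validate: {checked:.4f}s")
```

If the gap between the two timings is large on realistic catalogs, that would support the idea that built-in validation, not pydantic itself, explains the benchmark difference.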
Bumping this! The code complexity for pystac is pretty high, and when implementing the MLModel extension, pydantic v2 has felt more approachable. Is pydantic 2 or another serialization library an option for pystac v2?
Could be; however, no one (that I know of) is working on a pystac v2 at the moment.
So https://github.com/stac-utils/stac-pydantic already exists, but I am new and I know there are some functionality gaps. Would it be a high-value / good first project for me to look at adding this? We could eventually support multiple serialization formats, defaulting to stac-pydantic; that would be powerful.
You're of course welcome to poke at it, but I'd say this is a pretty complex first ticket -- you'll end up touching most of the codebase. I'll note too that some of the performance considerations discussed in this issue might be resolved by #1434. Zooming out, I think we could pivot towards simply improving interoperability between pystac and stac-pydantic so folks can use both in the same ecosystem more seamlessly.
From #1540 (comment) it sounds like pystac v2 is being worked on now! Looking forward to the rewrite and simpler extension implementations. This issue seems relevant, and now that pystac v2 is in progress, I'm curious whether we can move the validation and serialization logic over to pydantic. I took a look at the docs and love the direction. Besides getting serialization for free, two other benefits of pydantic would be interoperability with other libraries that use pydantic, making it easier to validate STAC against other standards, and the large amount of tooling built on top of pydantic for the web and for validating LLM output, which pystac could then benefit from.
For now, I haven't been using pydantic (or another serialization library) in my v2. My rationale is that we have a known domain of entities (Item, Collection, Catalog, and their sub-entities), so hand-written methods to convert to JSON dictionaries should be acceptably performant.
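For illustration, the hand-written pattern looks roughly like this. This is a simplified sketch with a hypothetical `Item` dataclass, not the actual pystac v2 code:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Item:
    # Simplified sketch of the hand-written approach; not actual pystac v2 code.
    id: str
    geometry: Optional[dict]
    properties: dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> dict[str, Any]:
        return {
            "type": "Feature",
            "stac_version": "1.0.0",
            "id": self.id,
            "geometry": self.geometry,
            "properties": self.properties,
        }

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "Item":
        if d.get("type") != "Feature":
            raise ValueError("not a STAC Item")
        return cls(
            id=d["id"],
            geometry=d.get("geometry"),
            properties=d.get("properties", {}),
        )
```

The tradeoff is explicitness: every field appears twice (once per direction), which is the maintenance burden this issue is about, but there's no dependency and no hidden behavior.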
It's a good point, though (referring back to the distinction between types of users described here), I find that pydantic-based libraries are most useful at the "developer" level, i.e. folks who are building servers, etc. In my experience, pydantic is less useful (or even gets in the way) at the "data producer" level, when you're creating STAC from geospatial assets, and at the "data consumer" level, when you're usually just looking to go to xarray or some other analysis framework. All that to say there's definitely a case for using pydantic in pystac, but I want to ensure we're building for all three user types and not just "developers".
What other standards do you have in mind?
It's true; however, I'm inclined to separate these concerns. We have stac-pydantic, so maybe all we need to do is point people to do
Gotcha, I'm not very familiar with these concerns since I haven't worked directly with data producers very much. I agree data consumers shouldn't need to interact with pydantic, but they also shouldn't need to interact directly with pystac? Just pystac-client or other tools that understand how to query stac apis or static catalogs.
I'd like to compose the MLM spec with the PyTorch 2 Inference Archive spec. The PT2 inference spec has metadata for model request/response for model servers that could be modeled with pydantic and composed with pydantic models for STAC MLM. I think this would make it easier for ML frameworks (torchgeo) to produce a single archive with all the model artifacts and metadata needed to run a model, either in a real-time server or batch inference via some ML inference framework.
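The composition idea can be sketched with pydantic v2. The model names and fields below are hypothetical stand-ins, not the actual STAC MLM or PT2 Inference Archive schemas:

```python
from pydantic import BaseModel

# Hypothetical, simplified models; not the actual STAC MLM or PT2 schemas.
class MLModelProperties(BaseModel):
    name: str
    framework: str

class PT2InferenceSpec(BaseModel):
    handler: str
    batch_size: int = 1

class ModelArchiveMetadata(BaseModel):
    # Composing the two specs into a single archive description.
    mlm: MLModelProperties
    inference: PT2InferenceSpec

meta = ModelArchiveMetadata.model_validate({
    "mlm": {"name": "resnet", "framework": "pytorch"},
    "inference": {"handler": "handler.py"},
})
```

Nesting one `BaseModel` inside another gives validation and (de)serialization of the combined document for free, which is the main draw here.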
That actually does work for STAC MLM, and I gave it a try for some example STAC MLM metadata. I'm fine with depending on both stac_pydantic and pystac and doing this conversion.
pystac-client can return pystac objects so data consumers see pystac in that sense.
Yeah, I think this will be my recommendation for now... the pydantic world just feels a bit too specific for a general-use lib like pystac. Thanks for talking through this and providing examples, this really helps my context 🙇🏼
In PySTAC, we do a lot of work converting STAC objects to and from JSON (`to_dict`/`from_dict`), and that process can be error-prone and tricky to maintain. As discussed in #1047, the use of a (de)serialization library could reduce that burden. We should explore what it would take to add a serialization library as a dependency and convert our data structures to use it, if it makes sense.
Two options (there may be more):

- msgspec
- pydantic

We probably don't want to use pydantic unless it's gotten a lot faster since we last checked.
Downsides
By adding a dependency on a serialization library, we move away from the "light, few/no dependencies" model we've operated under for v1. Hopefully we can reduce our own code complexity by offloading work to the serialization library, but we should make sure the juice is worth the squeeze.