Use a serialization library #1092

Open
gadomski opened this issue Apr 11, 2023 · 13 comments

@gadomski
Member

In PySTAC, we do a lot of work converting STAC objects to and from JSON (to_dict/from_dict), and that process can be error-prone and tricky to maintain. As discussed in #1047, the use of a (de)serialization library could:

  • get us some conversion methods for free,
  • possibly improve performance, and
  • possibly open up serialization to non-JSON formats

We should explore what it would take to add a serialization library as a dependency and convert our data structures to use it if it makes sense.

Two options (there may be more):

We probably don't want to use pydantic unless it's gotten a lot faster since last checked.

Downsides

By adding a dependency on a serialization library, we move away from our "light, few/no dependencies" model that we've been operating in for v1. Hopefully we can reduce our own code complexity by offloading some work to the serialization library, but we should ensure that the juice is worth the squeeze.

@TomAugspurger
Collaborator

FYI, I started on https://github.com/TomAugspurger/msgspec-stac/blob/main/msgspec_stac.py a few weeks back but haven't done anything with it since (using msgspec). It seemed pretty straightforward, but I had a few questions:

  1. I don't think we can reasonably do full STAC validation using a library like this. For example, validating that datetime, start_datetime, and end_datetime are valid together (i.e. start_datetime and end_datetime are provided if datetime is None) sounds hard and slow. I think we should view validation as a nice benefit rather than the motivation for using one of these libraries.
  2. It wasn't clear to me what the relationship would be between pystac objects and (say) msgspec's objects. Would a pystac.Item be a msgspec.Struct subclass? Or would it internally use a struct (which would I think remove much of the performance benefit)?
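To make the cross-field rule from point 1 concrete, here is a minimal sketch in plain Python of the check described there. It assumes the rule is exactly as stated in the comment (start_datetime and end_datetime must be present when datetime is None); the function name `check_item_datetimes` is hypothetical and not part of pystac or msgspec.

```python
from __future__ import annotations

from typing import Any, Mapping


def check_item_datetimes(properties: Mapping[str, Any]) -> list[str]:
    """Return the problems found in an Item's temporal properties.

    Hypothetical helper: an empty list means the rule is satisfied.
    """
    problems: list[str] = []
    # The rule only applies when datetime is missing or null.
    if properties.get("datetime") is None:
        for key in ("start_datetime", "end_datetime"):
            if properties.get(key) is None:
                problems.append(f"{key} is required when datetime is null")
    return problems
```

A check like this depends on several fields at once, so it would most likely run as a separate step after decoding rather than as part of per-field type checking.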

> We probably don't want to use pydantic unless it's gotten a lot faster since last checked.

In theory pydantic v2 is much faster (though still slower than msgspec last I saw).

@gadomski
Member Author

gadomski commented May 9, 2023

I did a dead-simple msgspec implementation myself just now using class Item(Struct), and ran into issues around flattening dictionaries: jcrist/msgspec#315. It's not uncommon to have extra fields at the top level of STAC objects, and those need to be captured by a deserialization library. Maybe there's a good way to do it, but it wasn't immediately obvious to me.

For more context, this is how it's done in Rust: https://github.com/gadomski/stac-rs/blob/dfaaabc00f581af3d6b948ee3de24f4b68e5acdd/stac/src/item.rs#L66-L68
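The "extra top-level fields" requirement can be sketched without any serialization library. The example below is a minimal illustration using a standard-library dataclass, not pystac's or msgspec's implementation; `MiniItem` and its fields are hypothetical and heavily simplified. Unknown top-level keys are collected into a dictionary on the way in and flattened back to the top level on the way out.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class MiniItem:
    """Hypothetical, heavily simplified stand-in for a STAC Item."""

    id: str
    properties: dict[str, Any] = field(default_factory=dict)
    # Top-level keys that are not declared above are collected here.
    extra_fields: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> MiniItem:
        known = {f.name for f in fields(cls)} - {"extra_fields"}
        declared = {k: v for k, v in d.items() if k in known}
        extra = {k: v for k, v in d.items() if k not in known}
        return cls(**declared, extra_fields=extra)

    def to_dict(self) -> dict[str, Any]:
        # Flatten: extras return to the top level next to the declared fields.
        return {"id": self.id, "properties": self.properties, **self.extra_fields}
```

Round-tripping a dictionary such as `{"id": "a", "properties": {}, "custom:thing": 1}` through `from_dict` and `to_dict` preserves the undeclared key, which is the behavior a serialization library would need to reproduce.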

@huard

huard commented Aug 29, 2023

There's a pydantic-stac implementation here https://github.com/stac-utils/stac-pydantic which doesn't seem really active, but I still made a PR to migrate it to pydantic v2 in the hope it'd be useful at some point.

@gadomski
Member Author

@thomas-maschler did some ad-hoc benchmarking in radiantearth/stac-spec#1252 (comment) and found pydantic to be slower than pystac in the deserialization case.

@eseglem

eseglem commented Nov 14, 2023

I would be very curious about where the time differences are coming from. All things being the same, I would definitely expect Pydantic to be faster, since most of the work is happening in Rust and not Python. If I had to take a guess, it may be related to pystac validation being optional vs being built in with pydantic.

There may be some ways around that as well as ways to improve performance vs those benchmarks. It could be very beneficial to go that route, if the performance is acceptable, as it would supersede stac-pydantic and help consolidate the ecosystem. I guess it really depends on how important that performance is vs maintenance effort and such.

@rbavery

rbavery commented Mar 15, 2024

bumping this! the code complexity for pystac is pretty high and when implementing the MLModel extension, pydantic v2 has felt more approachable. is pydantic 2 or another serialization library an option for pystac v2?

@gadomski
Member Author

> is pydantic 2 or another serialization library an option for pystac v2?

Could be, however no one (that I know of) is currently working on a pystac v2 at the moment.

@KeynesYouDigIt
Contributor

so https://github.com/stac-utils/stac-pydantic already exists - but I am new and I know there are some functionality gaps.

Would it be a high value / good first project for me to look at adding this? we could eventually support multiple serialization formats, defaulting to stac-pydantic - that would be powerful.

@gadomski
Member Author

gadomski commented Nov 3, 2024

> Would it be a high value / good first project for me to look at adding this? we could eventually support multiple serialization formats, defaulting to stac-pydantic - that would be powerful.

You're of course welcome to poke at it, but I'd say this is a pretty complex first ticket -- you'll end up touching most of the codebase. I'll note too that some of the performance considerations discussed in this issue might be resolved by #1434.

Zooming out, I think that we could simply pivot towards interoperability between pystac and stac-pydantic so folks can use both in the same ecosystem more seamlessly.

@rbavery

rbavery commented Mar 19, 2025

> Could be, however no one (that I know of) is currently working on a pystac v2 at the moment.

From #1540 (comment), it sounds like pystac v2 is being worked on now!

Looking forward to the rewrite and simpler extension implementations. This issue seems relevant, and now that pystac v2 is being worked on, I'm curious if we can move over the validation and serialization logic to pydantic?

I took a look at the docs and love the direction. Just wanted to point out that, besides getting serialization for free, two other benefits of pydantic would be interoperability with other libraries that use pydantic and making it easier to validate STAC against other standards. There's also a lot of tooling built on top of pydantic for the web and validating LLM output that pystac could then benefit from.

@gadomski
Member Author

> curious if we can move over the validation and serialization logic to pydantic?

For now, I haven't been using pydantic (or another serialization library) in my v2. My rationale is that we have a known domain of entities (Item, Collection, Catalog, and their sub-entities) so hand written methods to convert to JSON dictionaries should be acceptably performant.

> interoperability with other libraries that use pydantic

It's a good point, though (referring back to the distinction between types of users as described here), I find that pydantic-based libraries are more useful at the "developer" level, aka those folks who are building servers, etc. In my experience, pydantic is less useful (or even gets in the way) at the "data producer" level, when you're creating STAC from geospatial assets, and at the "data consumer" level, when you're usually just looking to go to xarray or some other analysis framework.

All that to say there's definitely a case for using pydantic in pystac, but I want to ensure we're building for all three user types and not just "developers".

> making it easier to validate STAC against other standards.

What other standards do you have in mind?

> There's also a lot of tooling built on top of pydantic for the web and validating LLM output that pystac could then benefit from.

It's true, however I'm inclined to separate these concerns — we have stac-pydantic, so maybe all we need to do is point people to do pystac.to_dict -> stac_pydantic to get those wins?
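As a rough illustration of the to_dict -> stac_pydantic hand-off, the sketch below uses a small duck-typed helper. The helper name `to_model` is hypothetical. It assumes the source object exposes `to_dict()` (as pystac objects do) and that the target class exposes pydantic v2's `model_validate` classmethod, which holds for stac-pydantic releases built on pydantic v2.

```python
from __future__ import annotations

from typing import Any, Protocol, TypeVar


class HasToDict(Protocol):
    """Anything that can serialize itself to a JSON-like dictionary."""

    def to_dict(self) -> dict[str, Any]: ...


M = TypeVar("M")


def to_model(stac_object: HasToDict, model_cls: type[M]) -> M:
    """Serialize with to_dict(), then let the model class validate the result."""
    return model_cls.model_validate(stac_object.to_dict())


# With both libraries installed, usage would look roughly like this
# (the file name is a placeholder):
#   import pystac
#   import stac_pydantic
#   validated = to_model(pystac.Item.from_file("item.json"), stac_pydantic.Item)
```

Keeping the bridge this thin means pystac itself does not need to depend on pydantic; only code that wants pydantic models pays for the conversion.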

@rbavery

rbavery commented Mar 26, 2025

> pydantic is less useful (or even gets in the way) at the "data producer" level, when you're creating STAC from geospatial assets

Gotcha, I'm not very familiar with these concerns since I haven't worked directly with data producers very much. I agree data consumers shouldn't need to interact with pydantic, but they also shouldn't need to interact directly with pystac? Just pystac-client or other tools that understand how to query stac apis or static catalogs.

> What other standards do you have in mind?

I'd like to compose the MLM spec with the PyTorch 2 Inference Archive spec. The PT2 inference spec has metadata for model request/response for model servers that could be modeled with pydantic and composed with pydantic models for STAC MLM. I think this would make it easier for ML frameworks (torchgeo) to produce a single archive with all the model artifacts and metadata needed to run a model, either in a real-time server or in batch inference via some ML inference framework.

> maybe all we need to do is point people to do pystac.to_dict -> stac_pydantic to get those wins?

That actually does work for STAC MLM and I gave it a try for some example STAC MLM metadata. I'm fine with depending on both stac_pydantic and pystac and doing this conversion.

@gadomski
Member Author

> they also shouldn't need to interact directly with pystac? Just pystac-client or other tools that understand how to query stac apis or static catalogs.

pystac-client can return pystac objects so data consumers see pystac in that sense.

> I'm fine with depending on both stac_pydantic and pystac and doing this conversion.

Yeah, I think this will be my recommendation for now... the pydantic world just feels a bit too specific for a general-use lib like pystac. Thanks for talking through this and providing examples, this really helps my context 🙇🏼
