Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow sorting keys on to_json and to_python by passing in sort_keys #1637

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

aezomz
Copy link

@aezomz aezomz commented Feb 15, 2025

Hello Pydantic Team! This is my first time contributing to a Rust and Pyo3 related repo.
I am also new in Rust.
Do you think this PR will make sense?
Since I have been trying to do model_dump_json with sort keys too.

This feature should simulate the same as how we use json.dumps(data, sort_keys=True)

Will sort from:
        {
            'field_123': b'test_123',
            'field_b': 12,
            'field_a': b'test',
            'field_c': {'mango': 2, 'banana': 3, 'apple': 1},
            'field_n': [
                {'mango': 3, 'banana': 2, 'apple': 1},
                [{'mango': 3, 'banana': 2, 'apple': 1}, {'d': 3, 'b': 2, 'a': 1}],
                3,
            ],
            'field_d': [
                {'d': 3, 'b': 2, 'a': {'nested3': 3, 'nested1': 1, 'nested2': 2}},
                [[{'mango': 3, 'banana': 2, 'apple': 1}], {'d': 3, 'b': 2, 'a': 1}],
                3,
            ],
            'field_none': None,
        }
To:
    assert s.to_python(m, exclude_none=True, sort_keys=True) == snapshot(
        {
            'field_123': b'test_123',
            'field_a': b'test',
            'field_b': 12,
            'field_c': {'apple': 1, 'banana': 3, 'mango': 2},
            'field_n': [
                {'apple': 1, 'banana': 2, 'mango': 3},
                [{'apple': 1, 'banana': 2, 'mango': 3}, {'a': 1, 'b': 2, 'd': 3}],
                3,
            ],
            'field_d': [
                {'a': {'nested1': 1, 'nested2': 2, 'nested3': 3}, 'b': 2, 'd': 3},
                [[{'apple': 1, 'banana': 2, 'mango': 3}], {'a': 1, 'b': 2, 'd': 3}],
                3,
            ],
        }
    )

Take note that

  1. field_d is extra and still manage to sort
  2. sorting recursively for both defined schema and extras
  3. sorting including dictionary in array or nested array array

Please let me know if I miss out any other features that sort_keys=True is suppose to do!

Thanks!

Change Summary

allow sorting keys on to_json and to_python by passing in sort_keys

Related issue number

should fix pydantic/pydantic#7424
Might need to create another MR on Python repo though, need to check.

Checklist

  • Unit tests for the changes exist
  • Documentation reflects the changes where applicable
  • Pydantic tests pass with this pydantic-core (except for expected changes)
  • My PR is ready to review, please add a comment including the phrase "please review" to assign reviewers

Selected Reviewer: @davidhewitt

Copy link

codecov bot commented Feb 15, 2025

Codecov Report

Attention: Patch coverage is 86.16601% with 35 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/serializers/fields.rs 85.34% 34 Missing ⚠️
python/pydantic_core/core_schema.py 50.00% 0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

Copy link

codspeed-hq bot commented Feb 15, 2025

CodSpeed Performance Report

Merging #1637 will not alter performance

Comparing aezomz:allow_model_dump_sort_keys (5ce696c) with main (0a5bbfc)

Summary

✅ 157 untouched benchmarks

@aezomz aezomz force-pushed the allow_model_dump_sort_keys branch from 07a31f5 to 7222c8d Compare March 4, 2025 10:22
@aezomz
Copy link
Author

aezomz commented Mar 4, 2025

please review

@aezomz aezomz marked this pull request as ready for review March 4, 2025 13:19
Copy link
Member

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main issue is perf regression

Comment on lines 348 to 351
false,
false,
false,
true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidhewitt this is getting gross 😢

// NOTE! we maintain the order of the input dict assuming that's right
for result in main_iter {
let (key, value) = result?;
let mut items = main_iter.collect::<PyResult<Vec<_>>>()?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intuition is that this extra Vec being created is the cause for the slowdown in benchmarks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored the function out, so we can re-use this. and avoid creating new variable when sort keys is false

Comment on lines 1001 to 1002
'field_a': core_schema.model_field(core_schema.bytes_schema()),
'field_b': core_schema.model_field(core_schema.int_schema()),
'field_a': core_schema.model_field(core_schema.bytes_schema()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did these have to change? Is this to make the tests change? I would prefer a new self-contained test just for this bit.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seperated this.

@aezomz
Copy link
Author

aezomz commented Mar 5, 2025

I separated the test out, I also refactor the functions so we can reuse when sort_keys=true.
To keep the original perf benchmark, I have done a simple bool check on sort_keys before using expensive function like sorting.

Let me know what else I need to improve. Thanks

@zzstoatzz zzstoatzz mentioned this pull request Mar 10, 2025
@aezomz
Copy link
Author

aezomz commented Mar 18, 2025

please review, not sure how I can take it from here

Copy link
Member

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I see now that this is not recursive (it only applies to the top level keys). Would it be hard to make it recursive? I fear that if we implement the non-recursive version someone is going to come along and want the recursive version... if so we can make it a Literal['recursive', 'top-level', 'unsorted'] or something like that.

Comment on lines 171 to 172
let mut items = main_iter.collect::<PyResult<Vec<_>>>()?;
items.sort_by_cached_key(|(key, _)| key_str(key).unwrap_or_default().to_string());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're already collecting into a Vec here I think you should do something like this:

Suggested change
let mut items = main_iter.collect::<PyResult<Vec<_>>>()?;
items.sort_by_cached_key(|(key, _)| key_str(key).unwrap_or_default().to_string());
let mut items = main_iter.map(|k,v| (key_str(key), v)).collect::<PyResult<Vec<_>>>()?;

Might need to collapse two levels of Result or something like that.

Then you'll have to convert back to the original type. I think this will be the fastest approach.

Copy link
Author

@aezomz aezomz Mar 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I kept the original implementation. Didn't quite get your advice, still quite new to Rust 😓
But added recursive mode.

Copy link
Member

@adriangb adriangb Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first thing process_field_entry_python does is call key_str:

let key_str = key_str(key)?;

So I think you can just convert to it from the get go

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated to convert it once and pass to the function.

@aezomz aezomz force-pushed the allow_model_dump_sort_keys branch from 0cf4b6d to 95f9329 Compare March 23, 2025 14:44
@aezomz
Copy link
Author

aezomz commented Mar 23, 2025

Hmm I see now that this is not recursive (it only applies to the top level keys). Would it be hard to make it recursive? I fear that if we implement the non-recursive version someone is going to come along and want the recursive version... if so we can make it a Literal['recursive', 'top-level', 'unsorted'] or something like that.

Added different sort mode as above, updated the PR description.

@aezomz
Copy link
Author

aezomz commented Mar 25, 2025

please review 👍

@aezomz aezomz force-pushed the allow_model_dump_sort_keys branch from 95f9329 to c83a212 Compare March 28, 2025 17:28
@aezomz
Copy link
Author

aezomz commented Mar 28, 2025

please review 👍

@@ -305,6 +305,7 @@ class SchemaSerializer:
exclude_unset: bool = False,
exclude_defaults: bool = False,
exclude_none: bool = False,
sort_keys: Literal['recursive', 'top-level', 'unsorted'] = 'unsorted',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need these three options? surely if we add this feature it should match python's json.dumps behaviour and be sort_keys: bool = False.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are scenarios where you'd want to only sort the top level, e.g. the trick we were doing with our JSON data in FusionFire.

But I also agree that if no one is asking for it we shouldn't implement extra stuff.

How does it sound if for now we do sort_keys: bool = False and then if needed in the future we can do sort_keys: bool | Literal['top-level'] = False.
We reduce the complexity and make a nicer signature for now but can still add the top-level option in the future, even if it makes the signature a bit weird.

@samuelcolvin @aezomz wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, that's what we've done for previous flags.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let me check the feature for sort_keys on json.dumps.
I only want to do sort_keys on model_dump_json thats why am here 😆
Let me try to implement this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @aezomz, you are being incredibly patient and helpful 🙏🏻

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Adrian, please review when ur free. I think David fixed the wasm build test already.

@aezomz aezomz force-pushed the allow_model_dump_sort_keys branch from c83a212 to 0075a4b Compare March 31, 2025 17:30
@aezomz
Copy link
Author

aezomz commented Apr 1, 2025

please review 👍

There is this test that is failing, unclear if its related to my change.
build-wasm-emscripten

Let me know any other places i can optimize. Thanks!

@aezomz aezomz force-pushed the allow_model_dump_sort_keys branch from dce2689 to 5ce696c Compare April 4, 2025 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add sorting keys to model_dump_json
4 participants