Skip to content

Fix: SqlCatalog list_namespaces() should return only sub-namespaces #1629

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

alessandro-nori
Copy link
Contributor

Resolves: #1627

@alessandro-nori alessandro-nori force-pushed the issue_1627_sqlcatalog_list_namespaces branch 3 times, most recently from 62b9efc to 9e55cf6 Compare February 8, 2025 20:44
@alessandro-nori alessandro-nori marked this pull request as ready for review February 8, 2025 20:44
@alessandro-nori
Copy link
Contributor Author

@kevinjqliu while working on this I also noticed that _namespace_exists() is using an exact comparison instead of LIKE for the query. I opened a new issue to track it #1630

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko Fokko added this to the PyIceberg 0.9.0 release milestone Feb 12, 2025
stmt = union(
table_stmt,
namespace_stmt,
)
with Session(self.engine) as session:
return [Catalog.identifier_to_tuple(namespace_col) for namespace_col in session.execute(stmt).scalars()]
namespaces = [Catalog.identifier_to_tuple(namespace_col) for namespace_col in session.execute(stmt).scalars()]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of all this logic, can we just filter out the results?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't think I understand what you mean here so my answer can be unrelated.

If it's about filtering in the SQL query, I don't think it's possible to easily cover some edge cases. The Java implementation handles the filtering afterward as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops sorry for the ambiguous statement. i meant to filter the results inline.

Something like this, filters the level and fuzzy match together

        with Session(self.engine) as session:
            namespace_tuple = Catalog.identifier_to_tuple(namespace)
            sub_namespaces_level_length = len(namespace_tuple) + 1 if namespace else 1
            namespaces = [
                ns[:sub_namespaces_level_length]  # truncate to the required level
                for ns in {Catalog.identifier_to_tuple(ns) for ns in session.execute(stmt).scalars()}
                if len(ns) >= sub_namespaces_level_length  # only get sub namespaces/children
                and ns[: len(namespace_tuple)] == namespace_tuple  # exclude fuzzy matches when `namespace` contains `%` or `_`
            ]

            return namespaces

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification!
I didn't know that ns[:0] == () would actually work 😄

@@ -1116,17 +1116,30 @@ def test_create_namespace_with_empty_identifier(catalog: SqlCatalog, empty_names
lazy_fixture("catalog_sqlite"),
],
)
@pytest.mark.parametrize("namespace_list", [lazy_fixture("database_list"), lazy_fixture("hierarchical_namespace_list")])
def test_list_namespaces(catalog: SqlCatalog, namespace_list: List[str]) -> None:
def test_list_namespaces(catalog: SqlCatalog) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding this test!

@@ -1158,13 +1171,13 @@ def test_list_non_existing_namespaces(catalog: SqlCatalog) -> None:
def test_drop_namespace(catalog: SqlCatalog, table_schema_nested: Schema, table_identifier: Identifier) -> None:
namespace = Catalog.namespace_from(table_identifier)
catalog.create_namespace(namespace)
assert namespace in catalog.list_namespaces()
assert namespace[:1] in catalog.list_namespaces()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this excluding the first result?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm taking only the first level of the namespace.

Calling list_namespaces() without parameters now correctly returns only the top level namespaces.

So if namespace is a multi-level namespace, for example "db.ns1", only "db" is returned by list_namespaces()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see, thanks for the explanation! i think the assert here is testing that the newly created namespace exists.
so perhaps something like this matches its behavior more

assert namespace in catalog.list_namespaces(namespace[:-1])

@alessandro-nori alessandro-nori force-pushed the issue_1627_sqlcatalog_list_namespaces branch from 1532ab4 to 8321d64 Compare February 13, 2025 09:56
if namespace:
namespace_tuple = Catalog.identifier_to_tuple(namespace)
# exclude fuzzy matches when `namespace` contains `%` or `_`
namespaces = [ns for ns in namespaces if ns[: len(namespace_tuple)] == namespace_tuple]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance nit: Should we move the len out of the loop?

Comment on lines 1126 to 1146
expected_list: list[tuple[str, ...]] = [("db",), ("db2",), ("db%",)]
for ns in expected_list:
assert ns in ns_list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not check the full list? This way we make sure that they are equal, and that the ns_list doesn't contain additional elements:

Suggested change
expected_list: list[tuple[str, ...]] = [("db",), ("db2",), ("db%",)]
for ns in expected_list:
assert ns in ns_list
assert ns_list == [("db",), ("db2",), ("db%",)]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, thanks!
I'm using sorted(ns_list) to make the test more resilient.
Or do you think we should force a specific order in list_namespaces()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another note, the first assert ns_list == expected_list does not work as expected because the catalog contains other namespaces created in other tests.

        ns_list = catalog.list_namespaces()
>       assert sorted(ns_list) == [("db",), ("db%",), ("db2",)]
E       AssertionError: assert [('db',), ('d...t_new',), ...] == [('db',), ('db%',), ('db2',)]
E         Left contains 109 more items, first extra item: ('my_iceberg_database-alcotyqwtpiqunaddobf',)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this worked for me

    assert sorted(catalog.list_namespaces()) == [("db",), ("db%",), ("db2",)]
    assert sorted(catalog.list_namespaces("db")) == [("db", "ns1"), ("db", "ns2")]
    assert catalog.list_namespaces("db.ns1") == [("db", "ns1", "ns2")]
    assert catalog.list_namespaces("db.ns1.ns2") == []

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the first assertion doesn't work if you run all the tests in test_sql.py because there are other top-level namespaces created by other tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh, maybe something like this then

assert all(namespace in catalog.list_namespaces() for namespace in [("db",), ("db%",), ("db2",)])

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this fix @alessandro-nori! This is such an interesting behavior. I had to take some time to go through the java implementation and tests thoroughly

I've added some comments on refactoring and testing. I think we might want to fix #1630 as part of this PR as well. WDYT?

stmt = union(
table_stmt,
namespace_stmt,
)
with Session(self.engine) as session:
return [Catalog.identifier_to_tuple(namespace_col) for namespace_col in session.execute(stmt).scalars()]
namespaces = [Catalog.identifier_to_tuple(namespace_col) for namespace_col in session.execute(stmt).scalars()]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops sorry for the ambiguous statement. i meant to filter the results inline.

Something like this, filters the level and fuzzy match together

        with Session(self.engine) as session:
            namespace_tuple = Catalog.identifier_to_tuple(namespace)
            sub_namespaces_level_length = len(namespace_tuple) + 1 if namespace else 1
            namespaces = [
                ns[:sub_namespaces_level_length]  # truncate to the required level
                for ns in {Catalog.identifier_to_tuple(ns) for ns in session.execute(stmt).scalars()}
                if len(ns) >= sub_namespaces_level_length  # only get sub namespaces/children
                and ns[: len(namespace_tuple)] == namespace_tuple  # exclude fuzzy matches when `namespace` contains `%` or `_`
            ]

            return namespaces

Comment on lines 1126 to 1146
expected_list: list[tuple[str, ...]] = [("db",), ("db2",), ("db%",)]
for ns in expected_list:
assert ns in ns_list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this worked for me

    assert sorted(catalog.list_namespaces()) == [("db",), ("db%",), ("db2",)]
    assert sorted(catalog.list_namespaces("db")) == [("db", "ns1"), ("db", "ns2")]
    assert catalog.list_namespaces("db.ns1") == [("db", "ns1", "ns2")]
    assert catalog.list_namespaces("db.ns1.ns2") == []

@@ -1158,13 +1171,13 @@ def test_list_non_existing_namespaces(catalog: SqlCatalog) -> None:
def test_drop_namespace(catalog: SqlCatalog, table_schema_nested: Schema, table_identifier: Identifier) -> None:
namespace = Catalog.namespace_from(table_identifier)
catalog.create_namespace(namespace)
assert namespace in catalog.list_namespaces()
assert namespace[:1] in catalog.list_namespaces()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see, thanks for the explanation! i think the assert here is testing that the newly created namespace exists.
so perhaps something like this matches its behavior more

assert namespace in catalog.list_namespaces(namespace[:-1])

@kevinjqliu kevinjqliu removed this from the PyIceberg 0.9.0 release milestone Feb 16, 2025
@alessandro-nori
Copy link
Contributor Author

alessandro-nori commented Feb 17, 2025

Thanks for your review @kevinjqliu and @Fokko .
As suggested by @kevinjqliu , I'll start working on #1630 on a separate PR as a requirement to this one.
I could also solve both issues in the same PR but I already modified a lot of unit tests here.

If you prefer having both issues in the same PR for review let me know, it works either way for me.

PR for #1630 is #1671

@kevinjqliu
Copy link
Contributor

now that #1630 is merged, could you rebase off the latest main

@alessandro-nori alessandro-nori force-pushed the issue_1627_sqlcatalog_list_namespaces branch from 1ccfc22 to e32fed3 Compare February 18, 2025 10:36
@alessandro-nori alessandro-nori force-pushed the issue_1627_sqlcatalog_list_namespaces branch from e32fed3 to 83a0c54 Compare February 18, 2025 10:49
@Fokko
Copy link
Contributor

Fokko commented Feb 18, 2025

Thanks for splitting this @alessandro-nori 🙏

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for fixing this

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, missed this one. Looks good, thanks @alessandro-nori

@Fokko Fokko merged commit f459662 into apache:main Mar 5, 2025
7 checks passed
@Fokko Fokko added this to the PyIceberg 0.9.1 milestone Apr 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect Output from list_namespaces in SqlCatalog
3 participants