Fix: `SqlCatalog` list_namespaces() should return only sub-namespaces #1629

alessandro-nori · 2025-02-08T20:35:09Z

Resolves: #1627

alessandro-nori · 2025-02-08T20:57:53Z

@kevinjqliu while working on this I also noticed that _namespace_exists() is using an exact comparison instead of LIKE for the query. I opened a new issue to track it #1630
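To illustrate the difference noted above, here is a minimal sketch that uses the standard-library `sqlite3` module rather than PyIceberg itself. The `namespaces` table and both helper functions are hypothetical stand-ins, not the catalog's real schema or API. The sketch only shows that an exact comparison misses a parent namespace that exists implicitly through its children, while a `LIKE` prefix query (with wildcards escaped) finds it.

```python
import sqlite3

# Hypothetical stand-in table; not the schema SqlCatalog actually uses.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE namespaces (namespace TEXT)")
conn.execute("INSERT INTO namespaces VALUES ('db.ns1')")


def exists_exact(name: str) -> bool:
    # Exact comparison: only a row equal to `name` counts.
    row = conn.execute(
        "SELECT 1 FROM namespaces WHERE namespace = ?", (name,)
    ).fetchone()
    return row is not None


def exists_with_like(name: str) -> bool:
    # Exact match OR any child of the form "<name>.<something>".
    # `%`, `_` and the escape character itself are escaped so they match literally.
    escaped = name.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    row = conn.execute(
        "SELECT 1 FROM namespaces WHERE namespace = ? OR namespace LIKE ? ESCAPE '\\'",
        (name, escaped + ".%"),
    ).fetchone()
    return row is not None


print(exists_exact("db"))      # False: only "db.ns1" is stored
print(exists_with_like("db"))  # True: "db.ns1" is a child of "db"
```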

kevinjqliu

added a comment about using %

The test lgtm; it follows the same behavior as the Java implementation:
https://github.com/apache/iceberg/blob/41b458b7022c7b0cd78eeca9102392db7889d3c9/core/src/test/java/org/apache/iceberg/jdbc/TestJdbcCatalog.java#L761-L764

pyiceberg/catalog/sql.py

kevinjqliu · 2025-02-13T06:20:12Z

pyiceberg/catalog/sql.py

        stmt = union(
            table_stmt,
            namespace_stmt,
        )
        with Session(self.engine) as session:
-            return [Catalog.identifier_to_tuple(namespace_col) for namespace_col in session.execute(stmt).scalars()]
+            namespaces = [Catalog.identifier_to_tuple(namespace_col) for namespace_col in session.execute(stmt).scalars()]


instead of all this logic, can we just filter out the results?

Sorry, I don't think I understand what you mean here so my answer can be unrelated.

If it's about filtering in the SQL query, I don't think it's possible to easily cover some edge cases. The Java implementation handles the filtering afterward as well.

oops sorry for the ambiguous statement. i meant to filter the results inline.

Something like this, which filters on the level and excludes fuzzy matches in one pass:

    with Session(self.engine) as session:
        namespace_tuple = Catalog.identifier_to_tuple(namespace)
        sub_namespaces_level_length = len(namespace_tuple) + 1 if namespace else 1

        namespaces = [
            ns[:sub_namespaces_level_length]  # truncate to the required level
            for ns in {Catalog.identifier_to_tuple(ns) for ns in session.execute(stmt).scalars()}
            if len(ns) >= sub_namespaces_level_length  # only get sub namespaces/children
            and ns[: len(namespace_tuple)] == namespace_tuple  # exclude fuzzy matches when `namespace` contains `%` or `_`
        ]

        return namespaces

Thanks for the clarification!
I didn't know that ns[:0] == () would actually work 😄
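For readers following along, the filtering idea can be tried without a database. The sketch below is an illustration only: `direct_children` is a hypothetical helper, namespaces are split on `.` as a simplification of `Catalog.identifier_to_tuple`, and a set is used to drop duplicates after truncation, which the suggestion above does not do. It relies on `ns[:0] == ()` being true for every tuple, so the top-level case needs no special handling.

```python
from typing import Iterable, List, Tuple

Namespace = Tuple[str, ...]


def direct_children(rows: Iterable[str], parent: Namespace = ()) -> List[Namespace]:
    """Return the distinct namespaces exactly one level below `parent`."""
    level = len(parent) + 1
    # Simplification: split on "." instead of calling Catalog.identifier_to_tuple.
    candidates = {tuple(row.split(".")) for row in rows}
    children = {
        ns[:level]  # truncate deeper namespaces to the requested level
        for ns in candidates
        if len(ns) >= level  # drop `parent` itself and anything shallower
        and ns[: len(parent)] == parent  # drop rows that are not under `parent`
    }
    return sorted(children)


# Rows as a LIKE query might return them, including a namespace named "db%".
rows = ["db", "db.ns1", "db.ns1.ns2", "db.ns2", "db2", "db%.ns1"]

print(direct_children(rows))             # [('db',), ('db%',), ('db2',)]
print(direct_children(rows, ("db",)))    # [('db', 'ns1'), ('db', 'ns2')]
print(direct_children(rows, ("db%",)))   # [('db%', 'ns1')]
```

The last call shows why the prefix comparison matters: a parent named `db%` must not pick up children of `db` or `db2`, even though an unescaped `LIKE 'db%…'` pattern would match those rows.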

kevinjqliu · 2025-02-13T06:20:26Z

tests/catalog/test_sql.py

@@ -1116,17 +1116,30 @@ def test_create_namespace_with_empty_identifier(catalog: SqlCatalog, empty_names
        lazy_fixture("catalog_sqlite"),
    ],
 )
-@pytest.mark.parametrize("namespace_list", [lazy_fixture("database_list"), lazy_fixture("hierarchical_namespace_list")])
-def test_list_namespaces(catalog: SqlCatalog, namespace_list: List[str]) -> None:
+def test_list_namespaces(catalog: SqlCatalog) -> None:


thanks for adding this test!

kevinjqliu · 2025-02-13T06:20:47Z

tests/catalog/test_sql.py

@@ -1158,13 +1171,13 @@ def test_list_non_existing_namespaces(catalog: SqlCatalog) -> None:
 def test_drop_namespace(catalog: SqlCatalog, table_schema_nested: Schema, table_identifier: Identifier) -> None:
    namespace = Catalog.namespace_from(table_identifier)
    catalog.create_namespace(namespace)
-    assert namespace in catalog.list_namespaces()
+    assert namespace[:1] in catalog.list_namespaces()


why is this excluding the first result?

I'm taking only the first level of the namespace.

Calling list_namespaces() without parameters now correctly returns only the top level namespaces.

So if namespace is a multi-level namespace, for example "db.ns1", only "db" is returned by list_namespaces()

i see, thanks for the explanation! i think the assert here is testing that the newly created namespace exists.
so perhaps something like this matches its behavior more

assert namespace in catalog.list_namespaces(namespace[:-1])
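The two slices discussed here differ only for namespaces deeper than two levels. A small sketch with a hypothetical three-level namespace (plain tuples, no catalog involved) shows what each one selects:

```python
namespace = ("db", "ns1", "ns2")  # hypothetical three-level namespace

top_level = namespace[:1]  # the ancestor a top-level listing would contain
parent = namespace[:-1]    # the namespace whose direct children include `namespace`

print(top_level)  # ('db',)
print(parent)     # ('db', 'ns1')

# For a single-level namespace the parent is the empty tuple, i.e. the root.
print(("db",)[:-1])  # ()
```

Listing under `namespace[:-1]` checks that the newly created namespace itself is present, whereas `namespace[:1]` only checks that its top-level ancestor is.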

Fokko · 2025-02-14T19:09:55Z

pyiceberg/catalog/sql.py

+            if namespace:
+                namespace_tuple = Catalog.identifier_to_tuple(namespace)
+                # exclude fuzzy matches when `namespace` contains `%` or `_`
+                namespaces = [ns for ns in namespaces if ns[: len(namespace_tuple)] == namespace_tuple]


Performance nit: Should we move the len out of the loop?
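A minimal sketch of what hoisting the length could look like, using made-up sample data and a hypothetical variable name `prefix_len`:

```python
# Sample data for illustration only.
namespaces = [("db", "ns1"), ("db2", "ns1"), ("db", "ns2")]
namespace_tuple = ("db",)

prefix_len = len(namespace_tuple)  # computed once instead of once per element
filtered = [ns for ns in namespaces if ns[:prefix_len] == namespace_tuple]

print(filtered)  # [('db', 'ns1'), ('db', 'ns2')]
```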

Fokko · 2025-02-14T19:34:13Z

tests/catalog/test_sql.py

+    expected_list: list[tuple[str, ...]] = [("db",), ("db2",), ("db%",)]
+    for ns in expected_list:
+        assert ns in ns_list


Why not check the full list? This way we make sure that they are equal, and that the ns_list doesn't contain additional elements:

Suggested change

-    expected_list: list[tuple[str, ...]] = [("db",), ("db2",), ("db%",)]
-    for ns in expected_list:
-        assert ns in ns_list
+    assert ns_list == [("db",), ("db2",), ("db%",)]

Good idea, thanks!
I'm using sorted(ns_list) to make the test more resilient.
Or do you think we should force a specific order in list_namespaces()?

Another note, the first assert ns_list == expected_list does not work as expected because the catalog contains other namespaces created in other tests.

        ns_list = catalog.list_namespaces()
    >   assert sorted(ns_list) == [("db",), ("db%",), ("db2",)]
    E   AssertionError: assert [('db',), ('d...t_new',), ...] == [('db',), ('db%',), ('db2',)]
    E     Left contains 109 more items, first extra item: ('my_iceberg_database-alcotyqwtpiqunaddobf',)

this worked for me

    assert sorted(catalog.list_namespaces()) == [("db",), ("db%",), ("db2",)]
    assert sorted(catalog.list_namespaces("db")) == [("db", "ns1"), ("db", "ns2")]
    assert catalog.list_namespaces("db.ns1") == [("db", "ns1", "ns2")]
    assert catalog.list_namespaces("db.ns1.ns2") == []

the first assertion doesn't work if you run all the tests in test_sql.py because there are other top-level namespaces created by other tests

huh, maybe something like this then

assert all(namespace in catalog.list_namespaces() for namespace in [("db",), ("db%",), ("db2",)])
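One possible variation on the containment check, sketched with plain sets so that a failure reports which namespaces are missing. The helper `missing_namespaces` and the sample listing are hypothetical and not part of the test suite:

```python
from typing import Iterable, Set, Tuple

Namespace = Tuple[str, ...]


def missing_namespaces(actual: Iterable[Namespace], expected: Iterable[Namespace]) -> Set[Namespace]:
    """Return the expected namespaces that are absent from `actual`."""
    return set(expected) - set(actual)


# Hypothetical listing that includes a namespace left behind by another test.
listed = [("db",), ("db2",), ("db%",), ("leftover_from_other_test",)]

# Extra namespaces do not fail the check; only missing ones would.
assert not missing_namespaces(listed, [("db",), ("db%",), ("db2",)])
```

The trade-off is the one raised in the thread: a subset check tolerates state shared between tests, but it no longer detects unexpected extra entries the way a full equality assertion does.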

kevinjqliu

Thanks for working on this fix @alessandro-nori! This is interesting behavior. I had to take some time to go through the Java implementation and tests thoroughly.

I've added some comments on refactoring and testing. I think we might want to fix #1630 as part of this PR as well. WDYT?


alessandro-nori · 2025-02-17T14:19:03Z

Thanks for your review @kevinjqliu and @Fokko.
As suggested by @kevinjqliu, I'll start working on #1630 in a separate PR as a prerequisite for this one.
I could also solve both issues in the same PR, but I have already modified a lot of unit tests here.

If you prefer having both issues in the same PR for review, let me know; it works either way for me.

PR for #1630 is #1671

kevinjqliu · 2025-02-17T18:24:03Z

now that #1630 is merged, could you rebase off the latest main

# Conflicts:
#   tests/catalog/test_sql.py

Fokko · 2025-02-18T13:00:29Z

Thanks for splitting this @alessandro-nori 🙏

kevinjqliu

LGTM! Thanks for fixing this

Fokko

Oof, missed this one. Looks good, thanks @alessandro-nori


alessandro-nori force-pushed the issue_1627_sqlcatalog_list_namespaces branch 3 times, most recently from 62b9efc to 9e55cf6 Compare February 8, 2025 20:44

alessandro-nori marked this pull request as ready for review February 8, 2025 20:44

kevinjqliu reviewed Feb 9, 2025


alessandro-nori requested a review from kevinjqliu February 11, 2025 15:14

Fokko added this to the PyIceberg 0.9.0 release milestone Feb 12, 2025

kevinjqliu reviewed Feb 13, 2025


alessandro-nori force-pushed the issue_1627_sqlcatalog_list_namespaces branch from 1532ab4 to 8321d64 Compare February 13, 2025 09:56

Fokko reviewed Feb 14, 2025


kevinjqliu reviewed Feb 16, 2025


kevinjqliu removed this from the PyIceberg 0.9.0 release milestone Feb 16, 2025

kevinjqliu reviewed Feb 16, 2025


alessandro-nori added 6 commits February 18, 2025 09:07

fix SqlCatalog list_namespaces()

90cc650

# Conflicts:
#   tests/catalog/test_sql.py

fix linter errors

619f11e

SqlCatalog: fix tests using list_namespaces

7523a90

fix InMemoryCatalog tests

58fcb40

extract len() out of loop

fac3343

refactor, address review

d1d590a

alessandro-nori force-pushed the issue_1627_sqlcatalog_list_namespaces branch from 1ccfc22 to e32fed3 Compare February 18, 2025 10:36

minor tests refactor

83a0c54

alessandro-nori force-pushed the issue_1627_sqlcatalog_list_namespaces branch from e32fed3 to 83a0c54 Compare February 18, 2025 10:49

new test for fuzzy match namespaces

53df102

kevinjqliu approved these changes Feb 20, 2025


Fokko approved these changes Mar 5, 2025


Fokko merged commit f459662 into apache:main Mar 5, 2025
7 checks passed

Fokko added this to the PyIceberg 0.9.1 milestone Apr 20, 2025

Fokko pushed a commit that referenced this pull request Apr 25, 2025

Fix: SqlCatalog list_namespaces() should return only sub-namespaces (…

c86f33d

…#1629) Resolves: #1627
