
[PECOBLR-201] add variant support #560


Open · wants to merge 4 commits into main

Conversation

@shivam2680 (Contributor) commented May 19, 2025

Description

This pull request introduces support for detecting and handling VARIANT column types in the Databricks SQL Thrift backend, along with corresponding tests for validation.

  • Updated the _col_to_description and _hive_schema_to_description methods to process field metadata for VARIANT types.
  • Added unit and end-to-end tests to ensure proper functionality.

Testing details

End-to-End Tests:

  • Added tests/e2e/test_variant_types.py to validate VARIANT type detection and data retrieval. Includes tests for creating tables with VARIANT columns, inserting records, and verifying correct type handling and JSON parsing.

Unit Tests:

  • Tests cover scenarios like VARIANT type detection, handling of null or malformed metadata, and fallback behavior for missing Arrow schemas.
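
For context, here is a minimal sketch of the end-to-end behavior these tests exercise (connection parameters, table name, and values are hypothetical, not the actual test code):

```python
import json

from databricks import sql

# Hypothetical connection parameters and table name; not the PR's test code.
with sql.connect(server_hostname="...", http_path="...", access_token="...") as conn:
    with conn.cursor() as cursor:
        cursor.execute("CREATE TABLE IF NOT EXISTS demo_variant (v VARIANT)")
        cursor.execute(
            """INSERT INTO demo_variant VALUES (PARSE_JSON('{"key": "value"}'))"""
        )
        cursor.execute("SELECT v FROM demo_variant")

        # With this PR, the type name in cursor.description should be
        # "variant" instead of the underlying Arrow "string".
        print(cursor.description[0][1])  # expected: variant

        # VARIANT values arrive as JSON strings, so they parse with json.loads.
        (value,) = cursor.fetchone()
        print(json.loads(value))  # {'key': 'value'}
```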


Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

@shivam2680 shivam2680 changed the title add variant support [PECOBLR-201]. add variant support May 19, 2025
@shivam2680 shivam2680 changed the title [PECOBLR-201]. add variant support [PECOBLR-201] add variant support May 19, 2025

@@ -692,12 +692,36 @@ def _col_to_description(col):
    else:
        precision, scale = None, None

    # Extract variant type from field if available
Contributor:

Are you sure this is correct? I tried and was getting metadata as null when the column type is variant. Also, for variant the pyarrow schema just shows string in my testing; shouldn't the server return a variant type?

Contributor Author (@shivam2680):

Yes; debug output:
[SHIVAM] field pyarrow.Field<CAST(1 AS VARIANT): string>
[SHIVAM] field metadata {b'Spark:DataType:SqlName': b'VARIANT', b'Spark:DataType:JsonType': b'"variant"'}
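
For anyone reproducing this: the metadata is a bytes-to-bytes mapping on the Arrow field and can be checked directly. A standalone sketch of the shape logged above, with the field constructed by hand rather than fetched from a server:

```python
import pyarrow as pa

# Hand-built Arrow field mimicking the server response logged above; the
# server attaches the Spark SQL type name as bytes key/value metadata.
field = pa.field(
    "CAST(1 AS VARIANT)",
    pa.string(),
    metadata={b"Spark:DataType:SqlName": b"VARIANT"},
)

# field.metadata is None when the server sends no metadata, so guard first.
if field.metadata and field.metadata.get(b"Spark:DataType:SqlName") == b"VARIANT":
    print("variant column detected")
```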

@jprakash-db (Contributor) commented Jun 17, 2025:

@shivam2680 I am getting this as the arrow_schema, where metadata is null. Is this some transient behaviour, or am I missing something?
[Screenshot, 2025-06-17: arrow_schema output showing metadata as null]

"""
)

variant_supported = True
Contributor:

I don't understand the point of the test. If table creation passes, then variant is supported; otherwise it is not? Table creation can fail for many reasons, so how is this a good test?

yield variant_supported

# Clean up if table was created
if variant_supported:
Contributor:

Best practice is to always wrap the yield in try/finally (try: yield, then clean up the table in finally). In your code, if the code consuming the yield throws an error, the table will never be dropped. See the sketch below.
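
A minimal sketch of the suggested shape (fixture and table names are illustrative, assuming a cursor fixture exists; this is not the PR's actual code):

```python
import pytest

@pytest.fixture
def variant_table(cursor):
    # Illustrative fixture: set up the table once, then guarantee cleanup.
    cursor.execute("DROP TABLE IF EXISTS pysql_test_variant_types_table")
    cursor.execute(
        "CREATE TABLE pysql_test_variant_types_table (id INT, v VARIANT)"
    )
    try:
        yield "pysql_test_variant_types_table"
    finally:
        # Runs whether the test body passes or raises, so the
        # table is always dropped.
        cursor.execute("DROP TABLE IF EXISTS pysql_test_variant_types_table")
```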

# Check if VARIANT type is supported
try:
    # delete the table if it exists
    cursor.execute("DROP TABLE IF EXISTS pysql_test_variant_types_table")
Contributor:

Don't add so many drop checks. One before the table creation and one after is enough.

Comment on lines 120 to 134
# Mixed types
(13, '[1, "string", true, null, {"key": "value"}]', [1, "string", True, None, {"key": "value"}], "Array with mixed types"),

# Special cases
(14, '{}', {}, "Empty object"),
(15, '[]', [], "Empty array"),
(16, '{"unicode": "✓ öäü 😀"}', {"unicode": "✓ öäü 😀"}, "Unicode characters"),
(17, '{"large_number": 9223372036854775807}', {"large_number": 9223372036854775807}, "Large integer"),

# Deeply nested structure
(18, '{"level1": {"level2": {"level3": {"level4": {"level5": "deep value"}}}}}',
{"level1": {"level2": {"level3": {"level4": {"level5": "deep value"}}}}}, "Deeply nested structure"),

# Date and time types
(19, '"2023-01-01"', "2023-01-01", "Date as string (ISO format)"),
Contributor:

How is this test related to this PR? These tests check whether json.loads is working properly; I don't see the point of testing json.loads()'s parsing ability.

"Complex object with timestamps"),
]
)
def test_variant_data_types(self, test_id, json_value, expected_result, description):
Contributor:

I don't think this test should be here. json.loads() comes from a well-tested library, and there is no need to test whether it parses correctly.


@shivam2680 shivam2680 requested a review from jprakash-db June 17, 2025 15:03

Comment on lines +706 to +715
if field is not None:
    try:
        # Check for variant type in metadata
        if field.metadata and b"Spark:DataType:SqlName" in field.metadata:
            sql_type = field.metadata.get(b"Spark:DataType:SqlName")
            if sql_type == b"VARIANT":
                cleaned_type = "variant"
    except Exception as e:
        logger.debug(f"Could not extract variant type from field: {e}")
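
For reference, the branch above can be exercised in isolation. A hypothetical mini-test (not the PR's actual unit tests) covering detection, missing metadata, and the missing-field fallback:

```python
import pyarrow as pa

def detect_variant(field):
    # Same decision logic as the diff above, extracted for illustration.
    if field is not None and field.metadata:
        if field.metadata.get(b"Spark:DataType:SqlName") == b"VARIANT":
            return "variant"
    return None

variant_field = pa.field(
    "v", pa.string(), metadata={b"Spark:DataType:SqlName": b"VARIANT"}
)
assert detect_variant(variant_field) == "variant"
assert detect_variant(pa.field("s", pa.string())) is None  # no metadata
assert detect_variant(None) is None  # missing Arrow schema fallback
```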

Contributor:

Please check with eng-sqlgateway whether there is a way to get this from the Thrift metadata; the Python connector uses Thrift metadata to retrieve column metadata.

Contributor:

Is there some documentation/contract around this, or is it purely from empirical evidence?
