
[PECOBLR-201] add variant support #560


Open · wants to merge 4 commits into main

Conversation

@shivam2680 (Contributor) commented May 19, 2025

Description

This pull request introduces support for detecting and handling VARIANT column types in the Databricks SQL Thrift backend, along with corresponding tests for validation.

  • Updated the _col_to_description and _hive_schema_to_description methods to process field metadata for VARIANT types.
  • Added unit and end-to-end tests to ensure proper functionality.

Testing details

End-to-End Tests:

  • Added tests/e2e/test_variant_types.py to validate VARIANT type detection and data retrieval. Includes tests for creating tables with VARIANT columns, inserting records, and verifying correct type handling and JSON parsing.

Unit Tests:

  • Tests cover scenarios like VARIANT type detection, handling of null or malformed metadata, and fallback behavior for missing Arrow schemas.
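
For context, here is a minimal sketch of the end-to-end behavior these tests exercise (connection parameters, table name, and values are hypothetical, not the actual test code):

```python
import json

from databricks import sql

# Hypothetical connection parameters and table name; not the PR's test code.
with sql.connect(server_hostname="...", http_path="...", access_token="...") as conn:
    with conn.cursor() as cursor:
        cursor.execute("CREATE TABLE IF NOT EXISTS demo_variant (v VARIANT)")
        cursor.execute(
            """INSERT INTO demo_variant VALUES (PARSE_JSON('{"key": "value"}'))"""
        )
        cursor.execute("SELECT v FROM demo_variant")

        # With this PR, the type name in cursor.description should be
        # "variant" instead of the underlying Arrow "string".
        print(cursor.description[0][1])  # expected: variant

        # VARIANT values arrive as JSON strings, so they parse with json.loads.
        (value,) = cursor.fetchone()
        print(json.loads(value))  # {'key': 'value'}
```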


Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

@shivam2680 shivam2680 changed the title add variant support [PECOBLR-201]. add variant support May 19, 2025
@shivam2680 shivam2680 changed the title [PECOBLR-201]. add variant support [PECOBLR-201] add variant support May 19, 2025

@@ -692,12 +692,36 @@ def _col_to_description(col):
    else:
        precision, scale = None, None

    # Extract variant type from field if available
Contributor:

Are you sure this is correct? I tried and was getting metadata as null when the column type is variant. Also, for variant the pyarrow schema just shows string in my testing; shouldn't the server return a variant type?

Contributor Author (@shivam2680):

Yes; debug output:
[SHIVAM] field pyarrow.Field<CAST(1 AS VARIANT): string>
[SHIVAM] field metadata {b'Spark:DataType:SqlName': b'VARIANT', b'Spark:DataType:JsonType': b'"variant"'}
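
For anyone reproducing this: the metadata is a bytes-to-bytes mapping on the Arrow field and can be checked directly. A standalone sketch of the shape logged above, with the field constructed by hand rather than fetched from a server:

```python
import pyarrow as pa

# Hand-built Arrow field mimicking the server response logged above; the
# server attaches the Spark SQL type name as bytes key/value metadata.
field = pa.field(
    "CAST(1 AS VARIANT)",
    pa.string(),
    metadata={b"Spark:DataType:SqlName": b"VARIANT"},
)

# field.metadata is None when the server sends no metadata, so guard first.
if field.metadata and field.metadata.get(b"Spark:DataType:SqlName") == b"VARIANT":
    print("variant column detected")
```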

@jprakash-db (Contributor) commented Jun 17, 2025:

@shivam2680 I am getting this as the arrow_schema, where metadata is null. Is this some transient behaviour, or am I missing something?
[Screenshot, 2025-06-17: arrow_schema output showing metadata as null]

"""
)

variant_supported = True
Contributor:

I don't understand the point of the test. If table creation passes, then variant is supported; otherwise it is not? Table creation can fail for many reasons, so how is this a good test?

yield variant_supported

# Clean up if table was created
if variant_supported:
Contributor:

Best practice is to always wrap the yield in try/finally (try: yield, then clean up the table in finally). In your code, if the code consuming the yield throws an error, the table will never be dropped. See the sketch below.
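
A minimal sketch of the suggested shape (fixture and table names are illustrative, assuming a cursor fixture exists; this is not the PR's actual code):

```python
import pytest

@pytest.fixture
def variant_table(cursor):
    # Illustrative fixture: set up the table once, then guarantee cleanup.
    cursor.execute("DROP TABLE IF EXISTS pysql_test_variant_types_table")
    cursor.execute(
        "CREATE TABLE pysql_test_variant_types_table (id INT, v VARIANT)"
    )
    try:
        yield "pysql_test_variant_types_table"
    finally:
        # Runs whether the test body passes or raises, so the
        # table is always dropped.
        cursor.execute("DROP TABLE IF EXISTS pysql_test_variant_types_table")
```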

# Check if VARIANT type is supported
try:
    # delete the table if it exists
    cursor.execute("DROP TABLE IF EXISTS pysql_test_variant_types_table")
Contributor:

Don't add so many drop checks. One before the table creation and one after is enough.

Comment on lines 120 to 134
# Mixed types
(13, '[1, "string", true, null, {"key": "value"}]', [1, "string", True, None, {"key": "value"}], "Array with mixed types"),

# Special cases
(14, '{}', {}, "Empty object"),
(15, '[]', [], "Empty array"),
(16, '{"unicode": "✓ öäü 😀"}', {"unicode": "✓ öäü 😀"}, "Unicode characters"),
(17, '{"large_number": 9223372036854775807}', {"large_number": 9223372036854775807}, "Large integer"),

# Deeply nested structure
(18, '{"level1": {"level2": {"level3": {"level4": {"level5": "deep value"}}}}}',
{"level1": {"level2": {"level3": {"level4": {"level5": "deep value"}}}}}, "Deeply nested structure"),

# Date and time types
(19, '"2023-01-01"', "2023-01-01", "Date as string (ISO format)"),
Contributor:

How is this test related to this PR? These tests check whether json.loads is working properly; I don't see the point of testing json.loads()'s parsing ability.

"Complex object with timestamps"),
]
)
def test_variant_data_types(self, test_id, json_value, expected_result, description):
Contributor:

I don't think this test should be here. json.loads() comes from a well-tested library, and there is no need to test whether it parses correctly.


@shivam2680 shivam2680 requested a review from jprakash-db June 17, 2025 15:03

Comment on lines +706 to +715
if field is not None:
    try:
        # Check for variant type in metadata
        if field.metadata and b"Spark:DataType:SqlName" in field.metadata:
            sql_type = field.metadata.get(b"Spark:DataType:SqlName")
            if sql_type == b"VARIANT":
                cleaned_type = "variant"
    except Exception as e:
        logger.debug(f"Could not extract variant type from field: {e}")
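
For reference, the branch above can be exercised in isolation. A hypothetical mini-test (not the PR's actual unit tests) covering detection, missing metadata, and the missing-field fallback:

```python
import pyarrow as pa

def detect_variant(field):
    # Same decision logic as the diff above, extracted for illustration.
    if field is not None and field.metadata:
        if field.metadata.get(b"Spark:DataType:SqlName") == b"VARIANT":
            return "variant"
    return None

variant_field = pa.field(
    "v", pa.string(), metadata={b"Spark:DataType:SqlName": b"VARIANT"}
)
assert detect_variant(variant_field) == "variant"
assert detect_variant(pa.field("s", pa.string())) is None  # no metadata
assert detect_variant(None) is None  # missing Arrow schema fallback
```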

Contributor:

Please check with eng-sqlgateway whether there is a way to get this from the Thrift metadata; the Python connector uses Thrift metadata to retrieve column metadata.

Contributor:

Is there some documentation/contract around this, or is it purely from empirical evidence?
