[PECOBLR-201] add variant support #560
Conversation
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase.
@@ -692,12 +692,36 @@ def _col_to_description(col):
    else:
        precision, scale = None, None

    # Extract variant type from field if available
Are you sure this is correct? I tried and was getting metadata as null when the column type is variant. Also, for variant the pyarrow schema just shows string in my testing; shouldn't the server return the variant type?
Yes. Debug output:
[SHIVAM] field pyarrow.Field<CAST(1 AS VARIANT): string>
[SHIVAM] field metadata {b'Spark:DataType:SqlName': b'VARIANT', b'Spark:DataType:JsonType': b'"variant"'}
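For context, this is how the marker surfaces: the physical Arrow type is plain string, and the SQL type only appears in the field metadata. A minimal sketch reconstructing such a field by hand with pyarrow (not via the connector):

    import pyarrow as pa

    # Build a field shaped like the one in the debug output above.
    field = pa.field(
        "CAST(1 AS VARIANT)",
        pa.string(),
        metadata={b"Spark:DataType:SqlName": b"VARIANT"},
    )

    print(field.type)                                     # string
    print(field.metadata.get(b"Spark:DataType:SqlName"))  # b'VARIANT'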
@shivam2680 I am getting this as the arrow_schema, where metadata is null. Is this some transient behaviour, or am I missing something?
tests/e2e/test_variant_types.py
Outdated
""" | ||
) | ||
|
||
variant_supported = True |
I don't understand the point of this test. If table creation passes then variant is supported, otherwise it is not? Table creation can fail for many reasons; how is this a good test?
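A narrower probe would target the type itself and catch only the server-side failure; a sketch, assuming the connector's ServerOperationError is what an unsupported type raises:

    from databricks.sql.exc import ServerOperationError

    def variant_supported(cursor) -> bool:
        # Probe the VARIANT type directly instead of inferring support
        # from whether an unrelated CREATE TABLE succeeds.
        try:
            cursor.execute("SELECT CAST(1 AS VARIANT)")
            return True
        except ServerOperationError:
            return False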
tests/e2e/test_variant_types.py
Outdated
yield variant_supported

# Clean up if table was created
if variant_supported:
Best practice is to always use a try: yield / finally: clean-up-the-table pattern. In your code, if the code using the yield throws an error, the table will not be dropped.
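A minimal sketch of the suggested shape as a pytest fixture (the cursor fixture and table name are illustrative):

    import pytest

    @pytest.fixture
    def variant_table(cursor):
        table = "pysql_test_variant_types_table"
        cursor.execute(f"DROP TABLE IF EXISTS {table}")
        cursor.execute(f"CREATE TABLE {table} (id INT, variant_col VARIANT)")
        try:
            yield table
        finally:
            # Runs even when the test body raises, so the table is always dropped.
            cursor.execute(f"DROP TABLE IF EXISTS {table}")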
tests/e2e/test_variant_types.py
Outdated
# Check if VARIANT type is supported
try:
    # delete the table if it exists
    cursor.execute("DROP TABLE IF EXISTS pysql_test_variant_types_table")
Don't add so many drop checks; once before the table creation and once after is enough.
tests/e2e/test_variant_types.py
Outdated
        # Mixed types
        (13, '[1, "string", true, null, {"key": "value"}]', [1, "string", True, None, {"key": "value"}], "Array with mixed types"),

        # Special cases
        (14, '{}', {}, "Empty object"),
        (15, '[]', [], "Empty array"),
        (16, '{"unicode": "✓ öäü 😀"}', {"unicode": "✓ öäü 😀"}, "Unicode characters"),
        (17, '{"large_number": 9223372036854775807}', {"large_number": 9223372036854775807}, "Large integer"),

        # Deeply nested structure
        (18, '{"level1": {"level2": {"level3": {"level4": {"level5": "deep value"}}}}}',
         {"level1": {"level2": {"level3": {"level4": {"level5": "deep value"}}}}}, "Deeply nested structure"),

        # Date and time types
        (19, '"2023-01-01"', "2023-01-01", "Date as string (ISO format)"),
How is this test related to this PR? These tests check whether json.loads is working properly; I don't see the point of testing json.loads().
tests/e2e/test_variant_types.py
Outdated
"Complex object with timestamps"), | ||
] | ||
) | ||
def test_variant_data_types(self, test_id, json_value, expected_result, description): |
I don't think this test should be here. json.loads() is a well-tested library and there is no need to test whether it parses correctly.
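A round trip through the server would exercise the connector's VARIANT path instead; a hypothetical sketch (the table name, column names, and use of PARSE_JSON are assumptions):

    import json

    def test_variant_round_trip(cursor):
        # Insert via the server so the connector's VARIANT handling,
        # not json.loads itself, is what the assertion exercises.
        cursor.execute(
            "INSERT INTO pysql_test_variant_types_table "
            "SELECT 1, PARSE_JSON('{\"key\": \"value\"}')"
        )
        cursor.execute(
            "SELECT variant_col FROM pysql_test_variant_types_table WHERE id = 1"
        )
        (value,) = cursor.fetchone()
        assert json.loads(value) == {"key": "value"}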
if field is not None:
    try:
        # Check for variant type in metadata
        if field.metadata and b"Spark:DataType:SqlName" in field.metadata:
            sql_type = field.metadata.get(b"Spark:DataType:SqlName")
            if sql_type == b"VARIANT":
                cleaned_type = "variant"
    except Exception as e:
        logger.debug(f"Could not extract variant type from field: {e}")
Please check with eng-sqlgateway whether there is a way to get this from Thrift metadata; the Python connector uses Thrift metadata for fetching column metadata.
Is there some documentation/contract around it, or is it purely from empirical evidence?
Description
This pull request introduces support for detecting and handling VARIANT column types in the Databricks SQL Thrift backend, along with corresponding tests for validation.
Updated the _col_to_description and _hive_schema_to_description methods to process metadata for VARIANT types.
Added unit and end-to-end tests to ensure proper functionality.
Testing details
End-to-End Tests: Added tests/e2e/test_variant_types.py to validate VARIANT type detection and data retrieval. Includes tests for creating tables with VARIANT columns, inserting records, and verifying correct type handling and JSON parsing.
Unit Tests: