Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest/cassandra): Add support for Cassandra as a source #11822

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

sagar-salvi-apptware
Copy link
Contributor

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata product PR or Issue related to the DataHub UI/UX labels Nov 8, 2024
@jjoyce0510
Copy link
Collaborator

Things to confirm:

  • Platform Instance is supported, and is included when generating keyspace (container) AND table (dataset) URNs
  • Make sure that UDT column types are correctly flattened and supported
  • Make sure the column type mapping matches the following
Cassandra Data Type DataHub Data Type
ascii StringType
bigint NumberType
blob BytesType
boolean BooleanType
counter NumberType
date DateType
decimal NumberType
double NumberType
float NumberType
inet StringType
int NumberType
list ArrayType
map MapType
set ArrayType
smallint NumberType
text StringType
time TimeType
timestamp DateType
timeuuid StringType
tinyint NumberType
tuple ArrayType
uuid StringType
varchar StringType
varint NumberType
frozen<map<text, text>> MapType
frozen<list> ArrayType
frozen<set> ArrayType

@jjoyce0510
Copy link
Collaborator

set -> ArrayType list -> ArrayType
map<?> -> MapType

And UDT -> Flattened structure

@jjoyce0510
Copy link
Collaborator

jjoyce0510 commented Nov 8, 2024

Views & Tables should have the same aspects

  • Data Platform Instance
  • Sub Types
  • Dataset Properties
  • Container
  • Browse Paths V2

And ensure we have table + column comments as description fields.

.....And

Views have

  • View Properties

@jjoyce0510
Copy link
Collaborator

Table + Column Comments should be QA'd locally.

@jjoyce0510
Copy link
Collaborator

Also please make sure error handling is clear. And try-excepts are around areas where we need. This connector should only fail in known cases. We should not have ANY uncaught exceptions!

Copy link
Collaborator

@mayurinehate mayurinehate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good

Comment on lines 33 to 35
if self.config.cloud:
cloud_config = self.config.cloud_config
assert cloud_config
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can replace this check as if self.config.cloud_config if we remove empty defaults in config as suggested earlier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have additional optional config so this check invalid right? self.config.cloud validated token and bundle path

    connect_timeout: int = Field(
        default=60,
        description="Timeout in seconds for establishing new connections to Cassandra.",
    )

    request_timeout: int = Field(
        default=60, description="Timeout in seconds for individual Cassandra requests."
    )

column_infos = self.cassandra_session.execute(
CassandraQueries.GET_COLUMNS_QUERY, [keyspace_name, table_name]
)
column_infos = sorted(column_infos, key=lambda c: c.column_name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does cassandra provide column order in its system view of columns? Some systems do provide it, and if so we should honour it rather than sorting by name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only applied for PRIMARY KEY not for others.

# Key interests are in setting the AllowAllNetworkAuthorizer,
# and in enabling materialized view feature.
# ---------------------------------------------------------------------------------------- #

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the commented part

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata product PR or Issue related to the DataHub UI/UX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants