feat: convert anomaly demo to spark-connect #209

Open
wants to merge 16 commits into
base: main

razvan (Member) commented Apr 30, 2025

razvan marked this pull request as ready for review May 2, 2025 10:01
razvan requested a review from a team May 2, 2025 10:02
razvan moved this from Development: In Progress to Development: Waiting for Review in Stackable Engineering May 2, 2025
name: notebook
initContainers:
  - name: download-notebook
    image: oci.stackable.tech/sdp/spark-connect-client:3.5.5-stackable0.0.0-dev
Member

This might change depending on the outcome mentioned in the dependent PR:

stackabletech/docker-images#1071 (comment)

Member Author

Yes, let's hold off on merging this one.

NickLarsenNZ moved this from Development: Waiting for Review to Development: In Review in Stackable Engineering May 2, 2025
NOTE: Using a custom image requires access to a repository where the image can be made available.
The Python notebook uses libraries such as `pandas` and `scikit-learn` to analyze the data.
In addition, since the model training is delegated to a Spark Connect server, some of these dependencies, most notably `scikit-learn`, must also be made available on the Spark Connect pods.
For convenience, this demo uses a custom image that bundles all the libraries required by both the notebook and the Spark Connect server.
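A minimal sketch of how one might check that the required libraries can be imported in a given environment. The helper name `missing_modules` and the `REQUIRED` list are hypothetical and not part of the demo. Run as-is, it only checks the local (notebook) environment; checking the Spark Connect pods would mean running the same function on the server side, for example inside a UDF.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names, not package names: scikit-learn is imported as "sklearn".
REQUIRED = ["pandas", "sklearn"]  # assumption: the libraries named in the text above

# Prints an empty list when everything is importable here.
print(missing_modules(REQUIRED))
```

An empty list on the notebook side says nothing about the server pods, which is why the text above calls for the libraries to be in the image used by both.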
Member

optional: We could link to the Dockerfile (so that others can take the next steps for their use case).

Member

Were the comments supposed to be in there?

# SCL = spark.sparkContext.broadcast(scaler)
# CLF = spark.sparkContext.broadcast(clf)
        # No broadcast variables when using Spark Connect
        # x_test = SCL.value.transform(x_test)
        # prediction = CLF.value.predict(x_test)[0]

Member Author

Yeah, I left them in as an explanation and reminder that Spark Connect supports only a subset of the Spark API.
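For readers wondering what replaces the broadcast variables, here is a rough sketch of one common alternative: capturing the fitted objects in a closure that is then wrapped as a UDF. This is an illustration under stated assumptions, not the notebook's actual code. `ToyScaler`, `ToyClassifier`, `make_predict_fn`, and `as_spark_udf` are hypothetical names, and the toy classes only stand in for fitted scikit-learn objects.

```python
from typing import Callable, Sequence


class ToyScaler:
    """Hypothetical stand-in for a fitted scaler exposing transform()."""

    def __init__(self, mean: float, scale: float) -> None:
        self.mean = mean
        self.scale = scale

    def transform(self, rows):
        return [[(v - self.mean) / self.scale for v in row] for row in rows]


class ToyClassifier:
    """Hypothetical stand-in for a fitted model exposing predict()."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def predict(self, rows):
        # -1 marks an outlier and 1 an inlier, the convention used by
        # scikit-learn outlier detectors such as IsolationForest.
        return [-1 if max(abs(v) for v in row) > self.threshold else 1 for row in rows]


def make_predict_fn(scaler, clf) -> Callable[[Sequence[float]], int]:
    """Capture scaler and clf in a closure instead of broadcasting them."""

    def predict_one(features):
        x_test = scaler.transform([list(features)])
        return int(clf.predict(x_test)[0])

    return predict_one


def as_spark_udf(predict_one):
    """Wrap the closure as a Spark UDF (pyspark is imported lazily)."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    return udf(predict_one, IntegerType())


predict_one = make_predict_fn(ToyScaler(mean=10.0, scale=2.0), ToyClassifier(threshold=3.0))
```

Because the captured objects are serialized together with the function and deserialized on the server, their classes must be importable there, which is why the image used for the Spark Connect server needs scikit-learn.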

image:
  # Using an image that includes scikit-learn (among other things)
  # because this package needs to be available on the executors.
  custom: oci.stackable.tech/sdp/spark-connect-client:3.5.5-stackable0.0.0-dev
Member

This might change depending on the outcome mentioned in the dependent PR:

stackabletech/docker-images#1071 (comment)

Projects
Status: Development: In Review
2 participants