chore(tracking): Test demos on nightly versions for 25.3.0 #157

Open
11 of 13 tasks
Tracked by #686
NickLarsenNZ opened this issue Feb 27, 2025 · 11 comments

NickLarsenNZ commented Feb 27, 2025

Pre-Release Demo Testing on Nightly

Part of stackabletech/issues#686

This is testing:

  1. That upgrades from the stable release to the nightly release of the operators and products do
    not negatively impact the products.
  2. That nightly demos work as documented from scratch.

Note

Record any issues or anomalies during the process in a comment on this issue.
For example:

:green_circle: **airflow-scheduled-job**

The CRD had been updated and I needed to change the following in the manifest:
...

Replace the items in the task lists below with the applicable Pull Requests (if any).

Stable to Nightly Upgrade Testing Instructions

These instructions are for deploying and completing the stable demo, and then
upgrading the operators, CRDs, and products to the nightly versions.

Tip

Be sure to select the stable docs version on https://docs.stackable.tech/home/stable/demos/.

# Install demo (stable operators) for the stable release (24.11.1).
stackablectl demo install <DEMO_NAME>

# --- IMPORTANT ---
# Run through the stable demo instructions (refer to the tasklist above).

# Get a list of installed operators
stackablectl operator installed --output=plain

# --- OPTIONAL ---
# Sometimes it is necessary to upgrade Helm charts. Look for other Helm Charts
# which might need updating.

# First, see which charts are installed. You can ignore the stackable-operator
# charts, or anything that might have been installed outside of this demo.
helm list

# Next, add the applicable Helm Chart repositories. For example:
helm repo add minio https://charts.min.io/
helm repo add bitnami https://charts.bitnami.com/bitnami
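
# (Optional) Refresh the local chart index so the latest chart versions are visible.
helm repo update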

# Finally, upgrade the Charts to what is defined in `main`.
# For example:
helm upgrade minio minio/minio --version x.x.x
helm upgrade postgresql-hive bitnami/postgresql --version x.x.x
# --- OPTIONAL END ---

# Uninstall operators for the stable release (24.11)
stackablectl release uninstall 24.11
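
# (Optional) Confirm that no stable operators remain installed before continuing.
stackablectl operator installed --output=plain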

# Update CRDs to nightly version (on main)
# Repeat this for every operator used by the demo (use the list from the earlier step before deleting the operators)
kubectl replace -f https://raw.githubusercontent.com/stackabletech/commons-operator/main/deploy/helm/commons-operator/crds/crds.yaml
kubectl replace -f https://raw.githubusercontent.com/stackabletech/...-operator/main/deploy/helm/...-operator/crds/crds.yaml
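
# (Optional sanity check) List the Stackable CRDs present on the cluster after
# replacing them; all Stackable CRD names end in .stackable.tech.
kubectl get crds | grep 'stackable.tech'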

# Install nightly version of operators (use the list from the earlier step before deleting the operators)
# MAKE SURE TO SPECIFY 0.0.0-dev AS THE VERSION FOR EACH OPERATOR, BECAUSE STACKABLECTL-24.11.3
# WILL OTHERWISE INSTALL THE WRONG VERSION.
stackablectl operator install commons=0.0.0-dev ...

# Optionally update the product versions in the CRDs (to the latest non-experimental version for the new release), e.g.:
kubectl patch hbaseclusters/hbase --type='json' -p='[{"op": "replace", "path": "/spec/image/productVersion", "value":"x.x.x"}]' # changed
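
# (Optional) Verify the patch took effect, e.g. for the HBase cluster patched above:
kubectl get hbaseclusters/hbase -o jsonpath='{.spec.image.productVersion}'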

Nightly from Scratch Testing Instructions

These instructions are for deploying and completing the nightly demo from scratch.

Tip

Be sure to select the nightly docs version on https://docs.stackable.tech/home/nightly/demos/.

# Install demo (nightly operators) for the nightly release.
stackablectl demo install <DEMO_NAME> --release dev

# --- IMPORTANT ---
# Run through the nightly demo instructions (refer to the tasklist above).
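
# (Optional) Confirm the nightly (0.0.0-dev) operators are installed.
stackablectl operator installed --output=plain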
NickLarsenNZ commented Mar 14, 2025

🟢 end-to-end-security

  • 🟢 stable to nightly:
    • The create-tables-in-trino Job takes a while to come up due to exponential backoff. We should make it wait on Trino in an init container (a rough sketch is at the end of this comment).
    • Hive 4.0.1 emits a Thrift error for each readiness probe, but otherwise appears to be working.
      ERROR [Metastore-Handler-Pool: Thread-65] server.TThreadPoolServer: Thrift Error occurred during processing of message.
      
  • 🟢 nightly from scratch:
    • Hive 4.0.1 emits a Thrift error for each readiness probe, but otherwise appears to be working.
      ERROR [Metastore-Handler-Pool: Thread-65] server.TThreadPoolServer: Thrift Error occurred during processing of message.
      

From @Jimvin:

If this is the new noise on Hive 4 with Kerberos, we should document that. We can suppress the message with some logging config.
See: https://stackable.atlassian.net/browse/SUP-67?focusedCommentId=10944

suppress org.apache.thrift.server.TThreadPoolServer logs
spec:
  metastore:
    config:
      logging:
        containers:
          hive:
            loggers:
              org.apache.thrift.server.TThreadPoolServer:
                level: NONE
Full Thrift error log
2025-03-14T15:24:56,586 ERROR [Metastore-Handler-Pool: Thread-65] server.TThreadPoolServer: Thrift Error occurred during processing of message.
org.apache.thrift.transport.TTransportException: org.apache.thrift.transport.TTransportException: Socket is closed by peer.
    at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingTransportFactory.getTransport(HadoopThriftAuthBridge.java:729) ~[hive-standalone-metastore-common-4.0.1.jar:4.0.1]
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:227) ~[libthrift-0.16.0.jar:0.16.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
    at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: org.apache.thrift.transport.TTransportException: Socket is closed by peer.
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:184) ~[libthrift-0.16.0.jar:0.16.0]
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:109) ~[libthrift-0.16.0.jar:0.16.0]
    at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:151) ~[libthrift-0.16.0.jar:0.16.0]
    at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:108) ~[libthrift-0.16.0.jar:0.16.0]
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:238) ~[libthrift-0.16.0.jar:0.16.0]
    at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:44) ~[libthrift-0.16.0.jar:0.16.0]
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:199) ~[libthrift-0.16.0.jar:0.16.0]
    at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingTransportFactory$1.run(HadoopThriftAuthBridge.java:711) ~[hive-standalone-metastore-common-4.0.1.jar:4.0.1]
    at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingTransportFactory$1.run(HadoopThriftAuthBridge.java:707) ~[hive-standalone-metastore-common-4.0.1.jar:4.0.1]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
    at javax.security.auth.Subject.doAs(Unknown Source) ~[?:?]
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) ~[hadoop-common-3.3.6.jar:?]
    at org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge$Server$TUGIAssumingTransportFactory.getTransport(HadoopThriftAuthBridge.java:707) ~[hive-standalone-metastore-common-4.0.1.jar:4.0.1]
    ... 4 more
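
Regarding the create-tables-in-trino startup delay above, here is a rough sketch of the init container idea. It assumes the Trino coordinator is reachable in-cluster at trino-coordinator:8443 and exposes Trino's standard /v1/info REST endpoint; the service name, port, and image are illustrative and not taken from the actual demo manifests:

# Added to the create-tables-in-trino Job's pod template (sketch only).
initContainers:
  - name: wait-for-trino
    # Illustrative image; anything that ships curl would do.
    image: curlimages/curl:8.7.1
    command:
      - sh
      - -c
      - |
        # Wait until the coordinator answers on its REST API.
        # A stricter check could parse the "starting" field of /v1/info.
        until curl -sk --fail https://trino-coordinator:8443/v1/info; do
          echo "waiting for Trino coordinator..."
          sleep 5
        done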

Techassi commented Mar 17, 2025

🟢 logging

  • 🟢 Stable to Nightly:
    • Initial install worked without any issues
    • Upgrading to nightly operators and latest ZooKeeper version (3.9.3) worked without any issues
  • 🟢 Nightly from Scratch: No issues

maltesander commented Mar 17, 2025

🟢 airflow-scheduled-job

Techassi commented Mar 17, 2025

🟢 signal-processing

  • 🟢 Stable to Nightly:
    • Logging into Jupyterhub will create a pod on the fly. It still refers to docker.stackable.tech. I think this should instead point to oci.stackable.tech. (Retroactively changed on the release-24.11 branch)
      Container image "docker.stackable.tech/stackable/tools:1.0.0-stackable24.11.1" ...
      Pulling image "docker.stackable.tech/demos/jupyter-pyspark-with-alibi-detect:python-3.9"
      
    • Initial install worked without any issues
    • Upgrading to nightly operators and latest ZooKeeper works.
    • Bumping NiFi requires updating both the image URL and the productVersion. It works without any issues when bumped to 1.28.1.
    • Avoid re-running the notebook after the bumps, as this will cause errors; this behaviour will be documented. Leaving the pod running during the bumps works as expected.
  • 🟢 Nightly from Scratch: Works without any issues.

@maltesander
🟢 hbase-hdfs-load-cycling-data

  • 🟢 Stable to Nightly: No Issues
  • 🟢 Nightly from Scratch: No issues

NickLarsenNZ commented Mar 17, 2025

🟢 trino-taxi-data

  • 🟢 stable to nightly:
    • I had trouble on my local machine, but it worked fine in Replicated (k3s) with the largest node.
      • It took 12 restarts for the data to load into Trino, and Trino reports memory issues.
      • The Superset login page returned an Error 500. This was due to old cookies from when I ran an OIDC demo.
  • 🟢 nightly from scratch:
    • setup-superset job fails:
      Issue 1010 - Superset encountered an error while running a command.
      
    • Trino now requires S3 connections over TLS. Without this, the following error is seen (a sketch of the required S3Connection change follows this list):
      2025-03-17T13:08:44.623076Z ERROR stackable_operator::logging::controller: Failed to reconcile object controller.name="trinocluster.trino.stackable.tech" error=reconciler for
      object TrinoCluster.v1alpha1.trino.stackable.tech/trino.default failed error.sources=[failed to parse TrinoCatalog.v1alpha1.trino.stackable.tech/hive.default, trino 469 and
      greater require TLS for S3]
      
    • Trino pods needed restarting after the S3Connection changes were made.
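
For reference, a minimal sketch of the kind of S3Connection change that satisfies the new TLS requirement, assuming MinIO is exposed with a certificate issued via a SecretClass named tls. The resource names, port, and secretClass values are illustrative; the actual demo manifests may differ:

apiVersion: s3.stackable.tech/v1alpha1
kind: S3Connection
metadata:
  name: minio
spec:
  host: minio
  port: 9000
  accessStyle: Path
  credentials:
    secretClass: minio-s3-credentials  # illustrative
  tls:
    verification:
      server:
        caCert:
          secretClass: tls  # illustrative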
Trino oddities: two screenshots are attached to the original issue (not reproduced here).

@maltesander
🟢 jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data

  • 🟢 Stable to Nightly: No Issues
  • 🟢 Nightly from Scratch: No issues

@maltesander
🟢 jupyterhub-keycloak

  • ❕ Stable to Nightly: no stable version yet
  • 🟢 Nightly from Scratch: No issues

razvan commented Mar 17, 2025

Update: there is actually also a problem with Trino/S3/TLS.

🟢 spark-k8s-anomaly-detection-taxi-data

  • Stable to nightly: requires PR below
  • Nightly from scratch: requires PR below

PR: #173

xeniape commented Mar 18, 2025

🟢 trino-iceberg

  • 🟢 Stable to Nightly: No issues
  • 🟢 Nightly from Scratch
    • Same as trino-taxi-data: Trino requires TLS for S3 connections
    2025-03-18T13:35:24.765709Z ERROR stackable_operator::logging::controller: Failed to reconcile object controller.name="trinocluster.trino.stackable.tech" error=reconciler for object TrinoCluster.v1alpha1.trino.stackable.tech/trino.default failed error.sources=[failed to parse TrinoCatalog.v1alpha1.trino.stackable.tech/lakehouse.default, trino 469 and greater require TLS for S3] 
    
    Fixed by: fix(stack/trino-iceberg): Use Minio with TLS, increase memory limits #176
    • The Trino coordinator and worker ran into OOM kills when going through the demo commands; the memory limits were increased as part of the PR above as well.
    • No other issues.

@dervoeti
🟢 nifi-kafka-druid-water-level-data

  • 🟡 Stable to Nightly: It mostly works, but the "Number of days with measurements" counter throws an error. I think this is due to the Druid upgrade from 30.0.0 to 30.0.1. It's the exact same error that was fixed by fix: measurements_per_day query #180, so I think it's fine? It's a product-related change, so I'm not sure we're responsible for this. Installing the demo from scratch includes the fix, so that works fine.
  • 🟢 Nightly from Scratch
