
chore(tracking): Test demos for 24.11.1 #137

Closed · 12 tasks done · Tracked by #683
Techassi opened this issue Jan 15, 2025 · 14 comments

Techassi commented Jan 15, 2025

Pre-Release Demo Testing

Part of stackabletech/issues#683

This is testing:

  1. That upgrading the operators and products from the outgoing stable release to the new release does not negatively impact the products.
  2. That the new release demos work as documented from scratch.

Note

Record any issues or anomalies during the process in a comment on this issue.
For example:

:green_circle: **airflow-scheduled-job**

The CRD had been updated and I needed to change the following in the manifest:
...

Replace the items in the task lists below with the applicable Pull Requests (if any).

24.7.0 to 24.11.1 Upgrade Testing Instructions

These instructions are for deploying and completing the 24.7.0 demo, and then upgrading the operators, CRDs, and products to the 24.11.1 versions.

Tip

Be sure to select the stable docs version on https://docs.stackable.tech/home/stable/demos/.

# Install demo (stable operators) for the stable release (24.7.0).
stackablectl demo install <DEMO_NAME> --release 24.7

# --- IMPORTANT ---
# Run through the stable demo instructions (refer to the tasklist above).

# Get a list of installed operators
stackablectl operator installed --output=plain

# --- OPTIONAL ---
# Sometimes it is necessary to upgrade Helm charts. Look for other Helm Charts
# which might need updating.

# First, see which charts are installed. You can ignore the stackable-operator
# charts, or anything that might have been installed outside of this demo.
helm list

# Next, add the applicable Helm Chart repositories. For example:
helm repo add minio https://charts.min.io/
helm repo add bitnami https://charts.bitnami.com/bitnami

# Finally, upgrade the Charts to what is defined in `main`.
# For example:
helm upgrade minio minio/minio --version x.x.x
helm upgrade postgresql-hive bitnami/postgresql --version x.x.x
# --- OPTIONAL END ---

# Uninstall operators for the stable release (24.7.0)
stackablectl release uninstall 24.7

# At this point, we assume release.yml has been updated with the new 24.11.1 release.
# If it hasn't, you will need to point stackablectl at a locally updated file using --release-file.

# Update CRDs to 24.11.
# Repeat this for every operator used by the demo (use the list from the earlier step before deleting the operators)
kubectl replace -f https://raw.githubusercontent.com/stackabletech/commons-operator/release-24.11/deploy/helm/commons-operator/crds/crds.yaml
kubectl replace -f https://raw.githubusercontent.com/stackabletech/...-operator/release-24.11/deploy/helm/...-operator/crds/crds.yaml
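# A hedged alternative: loop over the operators used by the demo. The operator
# names below are an assumption for illustration; substitute the output of the
# earlier `stackablectl operator installed` step.
for op in commons listener secret; do
  kubectl replace -f "https://raw.githubusercontent.com/stackabletech/${op}-operator/release-24.11/deploy/helm/${op}-operator/crds/crds.yaml"
done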

# Install new release operators (use the list from the earlier step before deleting the operators)
stackablectl operator install commons=24.11.1 ...

# Optionally update the product versions in the CRDs (to the latest non-experimental version for the new release), e.g.:
kubectl patch hbaseclusters/hbase --type='json' -p='[{"op": "replace", "path": "/spec/image/productVersion", "value":"x.x.x"}]'
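# A quick verification sketch (assuming the HBase example above): confirm the
# patched product version was actually applied to the spec.
kubectl get hbaseclusters/hbase -o jsonpath='{.spec.image.productVersion}'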

24.11.1 from Scratch Testing Instructions

These instructions are for deploying and completing the 24.11.1 demo from scratch.

Tip

Be sure to select the nightly docs version on https://docs.stackable.tech/home/nightly/demos/.

# Install the demo for the new release (24.11).
stackablectl demo install <DEMO_NAME> --release 24.11

# --- IMPORTANT ---
# Run through the nightly demo instructions (refer to the tasklist above).

adwk67 commented Jan 15, 2025

🟢 airflow-scheduled-job

  • upgrade: OK, although:
    • the clusterrole is removed when the release is uninstalled, leading to scheduler RBAC errors
    • one scheduled job (running every minute) fails during the upgrade because the DAGs use the KubernetesExecutor
    • the DAG has to be edited in its ConfigMap to correct for this (see the sketch after this list)
  • 24.11.1 from scratch: OK
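For reference, a minimal sketch of locating and editing the DAG ConfigMap (the grep filter and ConfigMap name are assumptions; check the demo manifests for the actual name):

kubectl get configmaps | grep -i dag    # locate the DAG ConfigMap (hypothetical filter)
kubectl edit configmap <DAG_CONFIGMAP>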

NickLarsenNZ commented Jan 16, 2025

🟢 end-to-end-security

Documentation improvement notes
  • s/sql editor/SQL client/
  • When logging in as justin.martin, give a hint that the password is listed in the table at the top of the demo (or just mention the password is the username).
  • When switching from justin.martin to sophia.clarke, it is not obvious how to log out of the first user's session.
  • s/encrypting the contents with sha256/hashing the contents with sha256/
  • s/except if they are their supervisor/unless they are the supervisor/
  • s/The Rego rule for this behavior looks like this (again a snippet from the trino-policies ConfigMap)/The Rego rule for this behavior looks like this snippet from the trino-policies ConfigMap/
  • Make https://localhost:8443/ open externally.
🟢 Upgrade (24.7.0 -> 24.11.1)
  • Upgrading the postgres helm chart failed (it's a known issue and requires manual steps to upgrade across major versions).
  • Upgrading superset to 4.0.2 fails (it seems to still be trying 3.1.3):
    • kubectl patch supersetclusters/superset --type='json' -p='[{"op": "replace", "path": "/spec/image/productVersion", "value":"4.0.2"}]'
    • It works again after killing the pod manually. I would have expected the superset operator to handle this after the earlier patch command. Pinging @nightkr.
🟢 Fresh install

Techassi commented Jan 16, 2025

🟢 logging

  • The upgrade from 24.7 to 24.11.1 worked without any issues.
  • Installing the demo from scratch using 24.11.1 worked without any issues.

Techassi commented Jan 16, 2025

🟢 signal-processing

  • The upgrade from 24.7 to 24.11.1 worked without any issues.
  • Installing the demo from scratch using 24.11.1 worked without any issues.
    • docker.stackable.tech/stackable/tools:1.0.0-stackable24.11.0 is used for the secret migration job.

adwk67 commented Jan 16, 2025

🟢 jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data

  • upgrade from 24.7 to 24.11.1: worked successfully
  • install 24.11.1 from scratch: worked successfully
Note on Spark images
  • The Spark image used has been updated from one based on the 23.4 release (24.7) to one based on the 24.3 release (24.11). This is intentional, as it is not straightforward to identify a compatible combination of Spark, Python, and JupyterHub versions.

xeniape commented Jan 16, 2025

🟢 trino-taxi-data

Upgrade (24.7.0 -> 24.11.1)

  • setup-superset took ~7 minutes to complete (several crashes)
  • Problems upgrading the PostgreSQL version: the pod was stuck in CrashLoopBackOff. Rolled back for the rest of the upgrade test (not release related; see the rollback sketch below)
postgresql 12:03:16.98 INFO  ==> ** Starting PostgreSQL **                                                                                                           
2025-01-16 12:03:17.078 GMT [1] FATAL:  database files are incompatible with server                                                                                  
2025-01-16 12:03:17.078 GMT [1] DETAIL:  The data directory was initialized by PostgreSQL version 16, which is not compatible with this version 17.0. 
  • remaining upgrade went fine
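For reference, a minimal rollback sketch (the release name is a placeholder; check helm list and helm history for the actual name and revision):

helm history <POSTGRESQL_RELEASE>
helm rollback <POSTGRESQL_RELEASE> <REVISION>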

Fresh install

  • setup-superset took ~7 minutes to complete (several crashes)
  • remaining install went fine

adwk67 commented Jan 16, 2025

🟢 spark-k8s-anomaly-detection-taxi-data

  • upgrade 24.7 to 24.11: OK
  • install 24.11.1 from scratch: OK

Auxiliary images (Stackable) used:

  • testing-tools:0.2.0-stackable24.11.1

NickLarsenNZ commented Jan 16, 2025

For anyone wondering about 24.11.0 images still being pulled, this should now be fixed with: #138.

Caution

Be sure to clear your stackablectl cache:

stackablectl cache clean

razvan commented Jan 16, 2025

🟡 data-lakehouse-iceberg-trino-spark

TL;DR: This demo requires a lot of resources and we maxed out on memory on IONOS. In general it looks good; given enough resources it would probably be green.

Upgrade (first attempt)

Created a cluster like this to satisfy the huge CPU and memory requirements:

replicated cluster create --tag owner=rami \
--name rami \
--distribution k3s \
--version 1.31.4 \
--instance-type r1.xlarge \
--disk 100 \
--ttl 8h \
--nodes 5
  • Superset fails to become ready because the liveness probe fails. The solution was to pause reconciliation, increase the number of probe retries to 10, and kill the pod (see the sketch after this list).
  • The setup-superset job waited and completed successfully once the Superset stacklet became ready.
  • The load-test-data job takes ages to complete. The ingestion of yellow trip data into MinIO is very slow. I killed it after one hour, which led to create-tables-in-trino being stuck.
  • Created a new job, create-tables-in-trino-2, for the Trino tables that doesn't wait for load-test-data to complete successfully.
  • Meanwhile, the connection to the MinIO UI is dropped intermittently.
  • create-tables-in-trino-2 also fails: Failed to read file at s3a://staging/house-sales/postcode-geo-lookup/open-postcode-geo.csv
  • NiFi drops the connection when using port forwarding, which is the only way to connect to services in Replicated clusters.
  • Stopped here to look for a different cluster.
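For reference, a sketch of pausing and resuming reconciliation via the clusterOperation field in the Stackable CRDs (the stacklet name superset is taken from this demo; adjust as needed):

# Pause reconciliation so the operator does not revert manual probe changes.
kubectl patch supersetclusters/superset --type=merge -p '{"spec":{"clusterOperation":{"reconciliationPaused":true}}}'
# ... edit the probe and kill the pod, then resume:
kubectl patch supersetclusters/superset --type=merge -p '{"spec":{"clusterOperation":{"reconciliationPaused":false}}}'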

Upgrade (second attempt)

Mostly successful, with the exceptions listed below.

Created a cluster with 14 nodes (node configuration screenshot omitted).

  • All Spark streaming jobs report the following error:
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.

But the topics are there (listed with kafka-topics.sh, as sketched below).
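A sketch of that check (the container name, script path, and plaintext port are assumptions about the Stackable Kafka image; adjust the TLS options to match the cluster):

kubectl exec kafka-broker-default-0 -c kafka -- \
  /stackable/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 --list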

  • Trino still reports INSUFFICIENT RESOURCES and INTERNAL SERVER ERROR for some queries.
  • Superset: two empty dashboards Shared bikes and Water level measurements.

Results:

  • 🟢 Updating CRDs and installing the 24.11.1 operators was successful.
  • 🔴 PostgreSQL cannot be updated (had to roll back all PostgreSQL charts):
2025-01-16 15:41:02.657 GMT [1] FATAL:  database files are incompatible with server
2025-01-16 15:41:02.657 GMT [1] DETAIL:  The data directory was initialized by PostgreSQL version 16, which is not compatible with this version 17.0.
  • 🟢 Product upgrades successful

Fresh install

On IONOS (14 nodes, 8 Skylake cores, 20GB RAM, 50GB HDD)

  • Trino pods cannot start due to memory constraints. Reduced worker replicas from 4 to 2, but the workers are busy GC-ing (see the patch sketch after this list).
  • Some Superset dashboards still lacked data after a while due to Trino being slow.
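For reference, a sketch of the replica reduction (the role group path is an assumption about the TrinoCluster spec; verify with kubectl get trinoclusters/trino -o yaml):

kubectl patch trinoclusters/trino --type='json' \
  -p='[{"op": "replace", "path": "/spec/workers/roleGroups/default/replicas", "value": 2}]'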

xeniape commented Jan 16, 2025

🟢 trino-iceberg

Upgrade (24.7.0 -> 24.11.1)
All good

Fresh install
All good

xeniape commented Jan 16, 2025

🟢 nifi-kafka-druid-water-level-data

Upgrade (24.7.0 -> 24.11.1)

  • When upgrading druid-operator, the Druid deployment runs into image pull errors (version 28.0.1 is not supported); the Druid version needs to be bumped to fix it.
  • Running the Kafka commands from the demo docs results in errors:
kubectl exec -it kafka-broker-default-0 -c kcat-prober -- /bin/bash -c "/stackable/kcat -b localhost:9093 -X security.protocol=SSL -X ssl.key.location=/stackable/tls-kcat/tls.key -X ssl.certificate.location=/stackable/tls-kcat/tls.crt -X ssl.ca.location=/stackable/tls-kcat/ca.crt -L"

%3|1737044642.110|SSL|rdkafka#producer-1| [thrd:app]: error:80000002:system library::No such file or directory: calling fopen(/stackable/tls-kcat/ca.crt, r)
%3|1737044642.110|SSL|rdkafka#producer-1| [thrd:app]: error:10000080:BIO routines::no such file
% ERROR: Failed to create producer: ssl.ca.location failed: error:05880002:x509 certificate routines::system lib
command terminated with exit code 1

Addition: the error already occurs on 24.7 without any upgrades.
Resolution: Not an actual error; the paths for the TLS certificates in kcat-prober changed in 24.11, and the commands from the nightly docs were used. With the commands from the 24.7 documentation everything worked fine. Migration instructions are mentioned in https://docs.stackable.tech/home/stable/release-notes/#_kafka_operator

  • remaining upgrade went fine

Fresh install

  • The commands mentioned earlier work correctly in a fresh install:
kubectl exec -it kafka-broker-default-0 -c kcat-prober -- /bin/bash -c "/stackable/kcat -b localhost:9093 -X security.protocol=SSL -X ssl.key.location=/stackable/tls-kcat/tls.key -X ssl.certificate.location=/stackable/tls-kcat/tls.crt -X ssl.ca.location=/stackable/tls-kcat/ca.crt -L"

Metadata for all topics (from broker -1: ssl://localhost:9093/bootstrap):
 1 brokers:
  broker 1001 at kafka-broker-default-0-listener-broker.default.svc.cluster.local:9093 (controller)
 2 topics:
  topic "stations" with 8 partitions:
    partition 0, leader 1001, replicas: 1001, isrs: 1001
    partition 1, leader 1001, replicas: 1001, isrs: 1001
    partition 2, leader 1001, replicas: 1001, isrs: 1001
    partition 3, leader 1001, replicas: 1001, isrs: 1001
    partition 4, leader 1001, replicas: 1001, isrs: 1001
    partition 5, leader 1001, replicas: 1001, isrs: 1001
    partition 6, leader 1001, replicas: 1001, isrs: 1001
    partition 7, leader 1001, replicas: 1001, isrs: 1001
  topic "measurements" with 8 partitions:
    partition 0, leader 1001, replicas: 1001, isrs: 1001
    partition 1, leader 1001, replicas: 1001, isrs: 1001
    partition 2, leader 1001, replicas: 1001, isrs: 1001
    partition 3, leader 1001, replicas: 1001, isrs: 1001
    partition 4, leader 1001, replicas: 1001, isrs: 1001
    partition 5, leader 1001, replicas: 1001, isrs: 1001
    partition 6, leader 1001, replicas: 1001, isrs: 1001
    partition 7, leader 1001, replicas: 1001, isrs: 1001

NickLarsenNZ commented Jan 17, 2025

🟢 hbase-hdfs-load-cycling-data

  • 🟢 Upgrade (24.7.0 -> 24.11.1)
  • 🟢 Fresh install of 24.11.1

NickLarsenNZ changed the title from "chore(tracking): Test demos on nightly versions for 24.11.1" to "chore(tracking): Test demos for 24.11.1" on Jan 17, 2025

NickLarsenNZ commented Jan 17, 2025

nifi-kafka-druid-earthquake-data

  • 🟢 Upgrade (24.7.0 -> 24.11.1)
    • I noticed one of the Copy-Code buttons doesn't work (for 24.7 at least): the one after "If you are interested in how many records have been produced to the Kafka topic so far".
    • After some time (>=1h), Druid finally stores data in the bucket. This should be mentioned in the demo.
    • ⚠ The operator restarts Druid (for example), but uses the old version, which is no longer available, so patching has to happen before it comes up.
    • ⚠ After the restart, Druid no longer indicates the segment count via the Supervisor section, even after re-running the NiFi processor. Segments can still be seen by browsing directly, so it seems superficial.
  • ⌛ Fresh install of 24.11.1

Techassi (Member Author) commented:
Testing completed, closing.
