
chore(tracking): Test demos for 24.11.1 #137

Closed · 12 tasks done · Tracked by #683
Techassi opened this issue Jan 15, 2025 · 14 comments

Techassi commented Jan 15, 2025

Pre-Release Demo Testing

Part of stackabletech/issues#683

This is testing:

  1. That upgrading the operators and products from the outgoing stable release to the new release does not negatively impact the products.
  2. That the new release demos work as documented from scratch.

Note

Record any issues or anomalies during the process in a comment on this issue.
For example:

:green_circle: **airflow-scheduled-job**

The CRD had been updated and I needed to change the following in the manifest:
...

Replace the items in the task lists below with the applicable Pull Requests (if any).

24.7.0 to 24.11.1 Upgrade Testing Instructions

These instructions are for deploying and completing the 24.7.0 demo, and then upgrading the operators, CRDs, and products to the 24.11.1 versions.

Tip

Be sure to select the stable docs version on https://docs.stackable.tech/home/stable/demos/.

# Install demo (stable operators) for the stable release (24.7.0).
stackablectl demo install <DEMO_NAME> --release 24.7

# --- IMPORTANT ---
# Run through the stable demo instructions (refer to the tasklist above).

# Get a list of installed operators
stackablectl operator installed --output=plain

# --- OPTIONAL ---
# Sometimes it is necessary to upgrade Helm charts. Look for other Helm Charts
# which might need updating.

# First, see which charts are installed. You can ignore the stackable-operator
# charts, or anything that might have been installed outside of this demo.
helm list

# Next, add the applicable Helm Chart repositories. For example:
helm repo add minio https://charts.min.io/
helm repo add bitnami https://charts.bitnami.com/bitnami

# Finally, upgrade the Charts to what is defined in `main`.
# For example:
helm upgrade minio minio/minio --version x.x.x
helm upgrade postgresql-hive bitnami/postgresql --version x.x.x
# --- OPTIONAL END ---

# Uninstall operators for the stable release (24.7.0)
stackablectl release uninstall 24.7

# At this point, we assume release.yml has been updated with the new 24.11.1 release.
# If it hasn't, you will need to point stackablectl at a locally updated file using --release-file.

# Update CRDs to 24.11.
# Repeat this for every operator used by the demo (use the list from the earlier step before deleting the operators)
kubectl replace -f https://raw.githubusercontent.com/stackabletech/commons-operator/release-24.11/deploy/helm/commons-operator/crds/crds.yaml
kubectl replace -f https://raw.githubusercontent.com/stackabletech/...-operator/release-24.11/deploy/helm/...-operator/crds/crds.yaml
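# A hedged alternative: loop over the operators used by the demo. The operator
# names below are an assumption for illustration; substitute the output of the
# earlier `stackablectl operator installed` step.
for op in commons listener secret; do
  kubectl replace -f "https://raw.githubusercontent.com/stackabletech/${op}-operator/release-24.11/deploy/helm/${op}-operator/crds/crds.yaml"
done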

# Install new release operators (use the list from the earlier step before deleting the operators)
stackablectl operator install commons=24.11.1 ...

# Optionally update the product versions in the CRDs (to the latest non-experimental version for the new release), e.g.:
kubectl patch hbaseclusters/hbase --type='json' -p='[{"op": "replace", "path": "/spec/image/productVersion", "value":"x.x.x"}]'
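# A quick verification sketch (assuming the HBase example above): confirm the
# patched product version was actually applied to the spec.
kubectl get hbaseclusters/hbase -o jsonpath='{.spec.image.productVersion}'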

24.11.1 from Scratch Testing Instructions

These instructions are for deploying and completing the 24.11.1 demo from scratch.

Tip

Be sure to select the nightly docs version on https://docs.stackable.tech/home/nightly/demos/.

# Install the demo for the new release (24.11).
stackablectl demo install <DEMO_NAME> --release 24.11

# --- IMPORTANT ---
# Run through the nightly demo instructions (refer to the tasklist above).

adwk67 commented Jan 15, 2025

🟢 airflow-scheduled-job

  • upgrade: OK, although:
    • the clusterrole is removed when the release is uninstalled, leading to scheduler RBAC errors
    • one scheduled job (running every minute) fails during the upgrade because the DAGs use the KubernetesExecutor
    • the DAG has to be edited in its ConfigMap to correct for this (see the sketch after this list)
  • 24.11.1 from scratch: OK
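For reference, a minimal sketch of locating and editing the DAG ConfigMap (the grep filter and ConfigMap name are assumptions; check the demo manifests for the actual name):

kubectl get configmaps | grep -i dag    # locate the DAG ConfigMap (hypothetical filter)
kubectl edit configmap <DAG_CONFIGMAP>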

NickLarsenNZ commented Jan 16, 2025

🟢 end-to-end-security

Documentation improvement notes
  • s/sql editor/SQL client/
  • When logging in as justin.martin, give a hint that the password is listed in the table at the top of the demo (or just mention the password is the username).
  • When switching from justin.martin to sophia.clarke, it is not obvious how to log out of the first user's session.
  • s/encrypting the contents with sha256/hashing the contents with sha256/
  • s/except if they are their supervisor/unless they are the supervisor/
  • s/The Rego rule for this behavior looks like this (again a snippet from the trino-policies ConfigMap)/The Rego rule for this behavior looks like this snippet from the trino-policies ConfigMap/
  • Make https://localhost:8443/ open externally.
🟢 Upgrade (24.7.0 -> 24.11.1)
  • Upgrading the postgres helm chart failed (it's a known issue and requires manual steps to upgrade across major versions).
  • Upgrading superset to 4.0.2 fails (it seems to still be trying 3.1.3):
    • kubectl patch supersetclusters/superset --type='json' -p='[{"op": "replace", "path": "/spec/image/productVersion", "value":"4.0.2"}]'
    • It works again after killing the pod manually. I would have expected the superset operator to handle this after the earlier patch command. Pinging @nightkr.
🟢 Fresh install

Techassi commented Jan 16, 2025

🟢 logging

  • The upgrade from 24.7 to 24.11.1 worked without any issues.
  • Installing the demo from scratch using 24.11.1 worked without any issues.

Techassi commented Jan 16, 2025

🟢 signal-processing

  • The upgrade from 24.7 to 24.11.1 worked without any issues.
  • Installing the demo from scratch using 24.11.1 worked without any issues.
    • docker.stackable.tech/stackable/tools:1.0.0-stackable24.11.0 is used for the secret migration job.

adwk67 commented Jan 16, 2025

🟢 jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data

  • upgrade from 24.7 to 24.11.1: worked successfully
  • install 24.11.1 from scratch: worked successfully
Note on Spark images
  • The Spark image used has been updated from one based on the 23.4 release (24.7) to one based on the 24.3 release (24.11). This is intentional, as it is not straightforward to identify a compatible combination of Spark, Python, and JupyterHub versions.

xeniape commented Jan 16, 2025

🟢 trino-taxi-data

Upgrade (24.7.0 -> 24.11.1)

  • setup-superset took ~7 minutes to complete (several crashes)
  • Problems upgrading the PostgreSQL version: the pod was stuck in CrashLoopBackOff. Rolled back for the rest of the upgrade test (not release related; see the rollback sketch below)
postgresql 12:03:16.98 INFO  ==> ** Starting PostgreSQL **                                                                                                           
2025-01-16 12:03:17.078 GMT [1] FATAL:  database files are incompatible with server                                                                                  
2025-01-16 12:03:17.078 GMT [1] DETAIL:  The data directory was initialized by PostgreSQL version 16, which is not compatible with this version 17.0. 
  • remaining upgrade went fine
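For reference, a minimal rollback sketch (the release name is a placeholder; check helm list and helm history for the actual name and revision):

helm history <POSTGRESQL_RELEASE>
helm rollback <POSTGRESQL_RELEASE> <REVISION>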

Fresh install

  • setup-superset took ~7 minutes to complete (several crashes)
  • remaining install went fine

adwk67 commented Jan 16, 2025

🟢 spark-k8s-anomaly-detection-taxi-data

  • upgrade 24.7 to 24.11: OK
  • install 24.11.1 from scratch: OK

Auxiliary images (Stackable) used:

  • testing-tools:0.2.0-stackable24.11.1

NickLarsenNZ commented Jan 16, 2025

For anyone wondering about 24.11.0 images still being pulled, this should now be fixed with: #138.

Caution

Be sure to clear your stackablectl cache:

stackablectl cache clean

razvan commented Jan 16, 2025

🟡 data-lakehouse-iceberg-trino-spark

TL;DR: This demo requires a lot of resources and we maxed out on memory on IONOS. In general it looks good; given enough resources it would probably be green.

Upgrade (first attempt)

Created a cluster like this to satisfy the huge CPU and memory requirements:

replicated cluster create --tag owner=rami \
--name rami \
--distribution k3s \
--version 1.31.4 \
--instance-type r1.xlarge \
--disk 100 \
--ttl 8h \
--nodes 5
  • Superset fails to become ready because the liveness probe fails. The solution was to pause reconciliation, increase the number of probe retries to 10, and kill the pod (see the sketch after this list).
  • The setup-superset job waited and completed successfully once the Superset stacklet became ready.
  • The load-test-data job takes ages to complete. The ingestion of yellow trip data into MinIO is very slow. I killed it after one hour, which led to create-tables-in-trino being stuck.
  • Created a new job, create-tables-in-trino-2, for the Trino tables that doesn't wait for load-test-data to complete successfully.
  • Meanwhile, the connection to the MinIO UI is dropped intermittently.
  • create-tables-in-trino-2 also fails: Failed to read file at s3a://staging/house-sales/postcode-geo-lookup/open-postcode-geo.csv
  • NiFi drops the connection when using port forwarding, which is the only way to connect to services in Replicated clusters.
  • Stopped here to look for a different cluster.
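For reference, a sketch of pausing and resuming reconciliation via the clusterOperation field in the Stackable CRDs (the stacklet name superset is taken from this demo; adjust as needed):

# Pause reconciliation so the operator does not revert manual probe changes.
kubectl patch supersetclusters/superset --type=merge -p '{"spec":{"clusterOperation":{"reconciliationPaused":true}}}'
# ... edit the probe and kill the pod, then resume:
kubectl patch supersetclusters/superset --type=merge -p '{"spec":{"clusterOperation":{"reconciliationPaused":false}}}'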

Upgrade (second attempt)

Mostly successful, with the exceptions listed below.

Created a cluster with 14 nodes (node configuration screenshot omitted).

  • All Spark streaming jobs report the following error:
org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.

But the topics are there (listed with kafka-topics.sh, as sketched below).
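A sketch of that check (the container name, script path, and plaintext port are assumptions about the Stackable Kafka image; adjust the TLS options to match the cluster):

kubectl exec kafka-broker-default-0 -c kafka -- \
  /stackable/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 --list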

  • Trino still reports INSUFFICIENT RESOURCES and INTERNAL SERVER ERROR for some queries.
  • Superset: two empty dashboards Shared bikes and Water level measurements.

Results:

  • 🟢 Updating CRDs and installing the 24.11.1 operators was successful.
  • 🔴 PostgreSQL cannot be updated (had to roll back all PostgreSQL charts):
2025-01-16 15:41:02.657 GMT [1] FATAL:  database files are incompatible with server
2025-01-16 15:41:02.657 GMT [1] DETAIL:  The data directory was initialized by PostgreSQL version 16, which is not compatible with this version 17.0.
  • 🟢 Product upgrades successful

Fresh install

On IONOS (14 nodes, 8 Skylake cores, 20GB RAM, 50GB HDD)

  • Trino pods cannot start due to memory constraints. Reduced worker replicas from 4 to 2, but the workers are busy GC-ing (see the patch sketch after this list).
  • Some Superset dashboards still lacked data after a while due to Trino being slow.
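For reference, a sketch of the replica reduction (the role group path is an assumption about the TrinoCluster spec; verify with kubectl get trinoclusters/trino -o yaml):

kubectl patch trinoclusters/trino --type='json' \
  -p='[{"op": "replace", "path": "/spec/workers/roleGroups/default/replicas", "value": 2}]'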

xeniape commented Jan 16, 2025

🟢 trino-iceberg

Upgrade (24.7.0 -> 24.11.1)
All good

Fresh install
All good

xeniape commented Jan 16, 2025

🟢 nifi-kafka-druid-water-level-data

Upgrade (24.7.0 -> 24.11.1)

  • When upgrading druid-operator, the Druid deployment runs into image pull errors (version 28.0.1 is not supported); the Druid version needs to be bumped to fix it.
  • Running the Kafka commands from the demo docs results in errors:
kubectl exec -it kafka-broker-default-0 -c kcat-prober -- /bin/bash -c "/stackable/kcat -b localhost:9093 -X security.protocol=SSL -X ssl.key.location=/stackable/tls-kcat/tls.key -X ssl.certificate.location=/stackable/tls-kcat/tls.crt -X ssl.ca.location=/stackable/tls-kcat/ca.crt -L"

%3|1737044642.110|SSL|rdkafka#producer-1| [thrd:app]: error:80000002:system library::No such file or directory: calling fopen(/stackable/tls-kcat/ca.crt, r)
%3|1737044642.110|SSL|rdkafka#producer-1| [thrd:app]: error:10000080:BIO routines::no such file
% ERROR: Failed to create producer: ssl.ca.location failed: error:05880002:x509 certificate routines::system lib
command terminated with exit code 1

Addition: the error already occurs on 24.7 without any upgrades.
Resolution: Not an actual error; the paths for the TLS certificates in kcat-prober changed in 24.11, and the commands from the nightly docs were used. With the commands from the 24.7 documentation everything worked fine. Migration instructions are mentioned in https://docs.stackable.tech/home/stable/release-notes/#_kafka_operator

  • remaining upgrade went fine

Fresh install

  • The commands mentioned earlier work correctly in a fresh install:
kubectl exec -it kafka-broker-default-0 -c kcat-prober -- /bin/bash -c "/stackable/kcat -b localhost:9093 -X security.protocol=SSL -X ssl.key.location=/stackable/tls-kcat/tls.key -X ssl.certificate.location=/stackable/tls-kcat/tls.crt -X ssl.ca.location=/stackable/tls-kcat/ca.crt -L"

Metadata for all topics (from broker -1: ssl://localhost:9093/bootstrap):
 1 brokers:
  broker 1001 at kafka-broker-default-0-listener-broker.default.svc.cluster.local:9093 (controller)
 2 topics:
  topic "stations" with 8 partitions:
    partition 0, leader 1001, replicas: 1001, isrs: 1001
    partition 1, leader 1001, replicas: 1001, isrs: 1001
    partition 2, leader 1001, replicas: 1001, isrs: 1001
    partition 3, leader 1001, replicas: 1001, isrs: 1001
    partition 4, leader 1001, replicas: 1001, isrs: 1001
    partition 5, leader 1001, replicas: 1001, isrs: 1001
    partition 6, leader 1001, replicas: 1001, isrs: 1001
    partition 7, leader 1001, replicas: 1001, isrs: 1001
  topic "measurements" with 8 partitions:
    partition 0, leader 1001, replicas: 1001, isrs: 1001
    partition 1, leader 1001, replicas: 1001, isrs: 1001
    partition 2, leader 1001, replicas: 1001, isrs: 1001
    partition 3, leader 1001, replicas: 1001, isrs: 1001
    partition 4, leader 1001, replicas: 1001, isrs: 1001
    partition 5, leader 1001, replicas: 1001, isrs: 1001
    partition 6, leader 1001, replicas: 1001, isrs: 1001
    partition 7, leader 1001, replicas: 1001, isrs: 1001

NickLarsenNZ commented Jan 17, 2025

🟢 hbase-hdfs-load-cycling-data

  • 🟢 Upgrade (24.7.0 -> 24.11.1)
  • 🟢 Fresh install of 24.11.1

NickLarsenNZ changed the title from "chore(tracking): Test demos on nightly versions for 24.11.1" to "chore(tracking): Test demos for 24.11.1" on Jan 17, 2025

NickLarsenNZ commented Jan 17, 2025

nifi-kafka-druid-earthquake-data

  • 🟢 Upgrade (24.7.0 -> 24.11.1)
    • I noticed one of the Copy-Code buttons doesn't work (for 24.7 at least): the one after "If you are interested in how many records have been produced to the Kafka topic so far".
    • After some time (>=1h), Druid finally stores data in the bucket. This should be mentioned in the demo.
    • ⚠ The operator restarts Druid (for example), but uses the old version, which is no longer available, so patching has to happen before it comes up.
    • ⚠ After the restart, Druid no longer indicates the segment count via the Supervisor section, even after re-running the NiFi processor. Segments can still be seen by browsing directly, so it seems superficial.
  • ⌛ Fresh install of 24.11.1

Techassi (Member Author) commented:
Testing completed, closing.
