add Flink HA document and example #43

showuon · 2024-12-11T09:34:53Z

Add a ha-example folder to demonstrate the Flink HA configuration and usage.

tomncooper

Thanks for looking at this @showuon.

I left some style/grammar comments. Other than that, the only thing I think needs adding is examples (or pointers to similar) of how to set the various configs you talk about.

tomncooper · 2024-12-11T11:38:14Z

ha-example/README.md

+   recommendation-app-76b6854f98-9zb24   1/1     Running   0          3m59s
+   recommendation-app-taskmanager-1-1    1/1     Running   0          2m5s
+   ``` 
+5. Browse the minio console, to make sure the metadata of the job manager is uploaded to s3://test/ha  


Is there a generic link like localhost:1234?

Yes, I've documented it in minio-install/README.md.

tomncooper · 2024-12-11T11:40:12Z

ha-example/minio-install/minio.yaml

@@ -0,0 +1,23 @@
+apiVersion: v1
+kind: Pod


Should we not be using a Deployment rather than raw Pods?

Sure. Updated to use a Deployment.
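
For readers following the thread, a Deployment-based MinIO manifest might look roughly like the sketch below. This is not the manifest from the PR: the resource name, labels, image tag, console port, and the `emptyDir` volume are all assumptions chosen for illustration.

```yaml
# Hypothetical sketch only; not the minio.yaml merged in this PR.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio            # assumed name
  labels:
    app: minio
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: quay.io/minio/minio:latest   # assumed image tag
          command: ["minio"]
          args: ["server", "/data", "--console-address", ":9001"]
          ports:
            - containerPort: 9000   # S3 API
            - containerPort: 9001   # web console (assumed port)
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          emptyDir: {}   # not durable; suitable for a demo only
```

Unlike a bare Pod, the Deployment recreates the MinIO Pod if it is deleted or its node fails, although with `emptyDir` the stored objects are lost on restart.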

tomncooper · 2024-12-11T11:41:41Z

ha-example/minio-install/README.md

+
+   Click on the `Object Browser` to view the files in the buckets.
+
+After minio is deployed and bucket is created, the flink configuration can be set like this:


How would a user set this, via Helm?

Updated to this:

After minio is deployed and bucket is created, the flink configuration can be set like this in the FlinkDeployment CR:

    apiVersion: flink.apache.org/v1beta1
    kind: FlinkDeployment
    metadata:
      name: recommendation-app
    spec:
      image: quay.io/streamshub/flink-sql-runner:v0.0.1
      flinkVersion: v1_19
      flinkConfiguration:
        # minio setting
        s3.access-key: minioadmin
        s3.secret-key: minioadmin
        s3.endpoint: http://MINIO_POD_ID:9000
        s3.path.style.access: "true"
        high-availability.storageDir: s3://test/ha
        state.checkpoints.dir: s3://test/cp

showuon · 2024-12-16T05:07:48Z

@tomncooper, thanks for reviewing the PR. I've updated it to address your comments.

tomncooper

LGTM, just a couple of nits.

tomncooper · 2024-12-16T14:09:36Z

ha-example/README.md

+an external service must store a minimal amount of recovery metadata (like “ID of last committed checkpoint”),
+as well as information needed to elect and lock which Job Manager is the leader (to avoid split-brain situations).
+
+In order to configure Job Managers in your Flink Cluster for high availability you need to add the following settings to the configuration in your `FlinkDeplyment` CR like this:


Suggested change

- In order to configure Job Managers in your Flink Cluster for high availability you need to add the following settings to the configuration in your `FlinkDeplyment` CR like this:
+ In order to configure Job Managers in your Flink Cluster for high availability you need to add the following settings to the configuration in your `FlinkDeployment` CR like this:

Oh, nice catch!
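
The settings that sentence introduces are not visible in this hunk. As a rough sketch, Job Manager HA in a `FlinkDeployment` is typically expressed along the following lines. The `high-availability.type` key and the `jobManager.replicas` value are assumptions based on the upstream Flink and Flink Kubernetes Operator documentation, not copied from the PR; the storage path reuses the `s3://test/ha` bucket mentioned earlier in the review.

```yaml
# Illustrative fragment, not a complete FlinkDeployment
# (image, serviceAccount, and job are omitted).
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: recommendation-app
spec:
  flinkConfiguration:
    # Use Kubernetes for leader election and HA metadata pointers (assumed key).
    high-availability.type: kubernetes
    # Durable location for Job Manager recovery metadata.
    high-availability.storageDir: s3://test/ha
  jobManager:
    replicas: 2   # assumed: one leader plus one standby
```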

tomncooper · 2024-12-16T14:11:10Z

ha-example/README.md

+
+[Checkpointing](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/fault-tolerance/checkpointing/) is Flink’s primary fault-tolerance mechanism, wherein a snapshot of your job’s state is persisted periodically to some durable location.
+In the case of failure, of a Task running your job's code, Flink will restart the Task from the most recent checkpoint and resume processing.
+Although not strictly related to HA of the Flink cluster, it is important to enable check-pointing in production deployments to ensure fault tolerance.


Suggested change

- Although not strictly related to HA of the Flink cluster, it is important to enable check-pointing in production deployments to ensure fault tolerance.
+ Although not strictly related to HA of the Flink cluster, it is important to enable checkpointing in production deployments to ensure fault tolerance.

Looks like we are using checkpointing in most places rather than check-pointing.
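
To make the checkpointing paragraph concrete, periodic checkpoints could be enabled with a fragment like the one below. The one-minute interval is an arbitrary assumption for illustration; the checkpoint directory reuses the `s3://test/cp` path from the MinIO example earlier in the review.

```yaml
# Illustrative fragment of a FlinkDeployment spec.
spec:
  flinkConfiguration:
    # Take a checkpoint periodically (interval chosen arbitrarily here).
    execution.checkpointing.interval: 1min
    # Durable location where checkpoint data is written.
    state.checkpoints.dir: s3://test/cp
```

Without an interval set, Flink does not checkpoint periodically by default, so the directory setting alone is not enough.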

Frawless · 2024-12-16T18:15:28Z

ha-example/README.md

+--set podSecurityContext=null \
+--set defaultConfiguration."log4j-operator\.properties"=monitorInterval\=30 \
+--set defaultConfiguration."log4j-console\.properties"=monitorInterval\=30 \
+--set replicas=2 \
+--set defaultConfiguration."flink-conf\.yaml"="kubernetes.operator.metrics.reporter.slf4j.factory.class\:\ org.apache.flink.metrics.slf4j.Slf4jReporterFactory
+kubernetes.operator.metrics.reporter.slf4j.interval\:\ 5\ MINUTE
+kubernetes.operator.reconcile.interval:\ 15\ s
+kubernetes.operator.observer.progress-check.interval:\ 5\ s


Are these settings (other than `replicas`) actually related to the HA configuration? It doesn't look like it to me, and they might confuse readers.

Makes sense. Updated.
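
For context, the HA-relevant part of the operator install can be reduced to a small Helm values file. The sketch below assumes the upstream Flink Kubernetes Operator chart's `replicas` and `defaultConfiguration` values and its leader-election options; the lease name is a made-up example, and this is not necessarily what the README ended up with.

```yaml
# Hypothetical values.yaml for the Flink Kubernetes Operator Helm chart.
replicas: 2   # run a standby operator instance
defaultConfiguration:
  create: true
  append: true   # keep the chart's default configuration and add to it
  flink-conf.yaml: |+
    # Leader election is needed when more than one operator replica runs.
    kubernetes.operator.leader-election.enabled: true
    kubernetes.operator.leader-election.lease-name: flink-operator-lease
```

Such a file would be passed with `-f values.yaml` in place of the individual `--set` flags.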

Frawless · 2024-12-16T18:16:58Z

ha-example/README.md

+  name: recommendation-app
+spec:
+  image: quay.io/streamshub/flink-sql-runner:v0.0.1
+  flinkVersion: v1_19


We should unify the Flink version here and in the docs below (1.19 vs 1.20).

Good point! Updated.

showuon · 2024-12-18T06:38:32Z

PR updated. Thanks for the comments!

showuon · 2024-12-19T02:23:39Z

If there are no more comments, I'm going to merge it today. Thanks.

add ha document and example

086a862

showuon requested a review from tomncooper December 11, 2024 09:35

tomncooper requested changes Dec 11, 2024

showuon force-pushed the ha branch 2 times, most recently from c4952a9 to 8c3d011, on December 16, 2024 05:06

address reviewer's comments

8c3d011

tomncooper approved these changes Dec 16, 2024

Frawless reviewed Dec 16, 2024

tinaselenge reviewed Dec 16, 2024

tinaselenge reviewed Dec 16, 2024

tinaselenge reviewed Dec 16, 2024

fix typo and improve docs

d8e06b4

Frawless approved these changes Dec 19, 2024

showuon merged commit 4fc9f57 into streamshub:main Dec 20, 2024
1 check passed
