-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add Flink HA document and example #43
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking at this @showuon.
I left some style/grammer comments. Other than that the only thing I think needs adding is examples (or pointers to similar) of how to set the various configs you talk about.
ha-example/README.md
Outdated
recommendation-app-76b6854f98-9zb24 1/1 Running 0 3m59s | ||
recommendation-app-taskmanager-1-1 1/1 Running 0 2m5s | ||
``` | ||
5. Browse the minio console, to make sure the metadata of the job manager is uploaded to s3://test/ha |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a generic link like localhost:1234
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've documented in minio-install/README.md
.
ha-example/minio-install/minio.yaml
Outdated
@@ -0,0 +1,23 @@ | |||
apiVersion: v1 | |||
kind: Pod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we not be using a Deployment
CR rather than raw pods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Updated to using Deployment
.
ha-example/minio-install/README.md
Outdated
|
||
Click on the `Object Browser` to view the files in the buckets. | ||
|
||
After minio is deployed and bucket is created, the flink configuration can be set like this: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would a user set this, via Helm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to this:
After minio is deployed and bucket is created, the flink configuration can be set like this in the FlinkDeployment
CR:
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
name: recommendation-app
spec:
image: quay.io/streamshub/flink-sql-runner:v0.0.1
flinkVersion: v1_19
flinkConfiguration:
# minio setting
s3.access-key: minioadmin
s3.secret-key: minioadmin
s3.endpoint: http://MINIO_POD_ID:9000
s3.path.style.access: "true"
high-availability.storageDir: s3://test/ha
state.checkpoints.dir: s3://test/cp
c4952a9
to
8c3d011
Compare
@tomncooper , thanks for reviewing the PR. I've updated the PR to address your comments. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a couple of nits.
ha-example/README.md
Outdated
an external service must store a minimal amount of recovery metadata (like “ID of last committed checkpoint”), | ||
as well as information needed to elect and lock which Job Manager is the leader (to avoid split-brain situations). | ||
|
||
In order to configure Job Managers in your Flink Cluster for high availability you need to add the following settings to the configuration in your `FlinkDeplyment` CR like this: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to configure Job Managers in your Flink Cluster for high availability you need to add the following settings to the configuration in your `FlinkDeplyment` CR like this: | |
In order to configure Job Managers in your Flink Cluster for high availability you need to add the following settings to the configuration in your `FlinkDeployment` CR like this: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, nice catch!
ha-example/README.md
Outdated
|
||
[Checkpointing](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/fault-tolerance/checkpointing/) is Flink’s primary fault-tolerance mechanism, wherein a snapshot of your job’s state is persisted periodically to some durable location. | ||
In the case of failure, of a Task running your job's code, Flink will restart the Task from the most recent checkpoint and resume processing. | ||
Although not strictly related to HA of the Flink cluster, it is important to enable check-pointing in production deployments to ensure fault tolerance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although not strictly related to HA of the Flink cluster, it is important to enable check-pointing in production deployments to ensure fault tolerance. | |
Although not strictly related to HA of the Flink cluster, it is important to enable checkpointing in production deployments to ensure fault tolerance. |
Looks like we are using checkpointing
in most places rather than check-pointing
.
ha-example/README.md
Outdated
--set podSecurityContext=null \ | ||
--set defaultConfiguration."log4j-operator\.properties"=monitorInterval\=30 \ | ||
--set defaultConfiguration."log4j-console\.properties"=monitorInterval\=30 \ | ||
--set replicas=2 \ | ||
--set defaultConfiguration."flink-conf\.yaml"="kubernetes.operator.metrics.reporter.slf4j.factory.class\:\ org.apache.flink.metrics.slf4j.Slf4jReporterFactory | ||
kubernetes.operator.metrics.reporter.slf4j.interval\:\ 5\ MINUTE | ||
kubernetes.operator.reconcile.interval:\ 15\ s | ||
kubernetes.operator.observer.progress-check.interval:\ 5\ s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these configurations somehow related to HA configuration (except replicas)? It doesn't look like yes to me and it might be confusing for anyone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. Updated.
ha-example/README.md
Outdated
name: recommendation-app | ||
spec: | ||
image: quay.io/streamshub/flink-sql-runner:v0.0.1 | ||
flinkVersion: v1_19 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should unify flink version here and in docs bellow (1.19 vs 1.20)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Updated.
PR updated. Thanks for the comments! |
If there are no more comments, I'm going to merge it today. Thanks. |
Add a
ha-example
folder to demonstrate the Flink HA configuration and usage.