Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Flink HA document and example #43

Merged
merged 3 commits into from
Dec 20, 2024
Merged

add Flink HA document and example #43

merged 3 commits into from
Dec 20, 2024

Conversation

showuon
Copy link
Contributor

@showuon showuon commented Dec 11, 2024

Add a ha-example folder to demonstrate the Flink HA configuration and usage.

@showuon showuon requested a review from tomncooper December 11, 2024 09:35
Copy link
Contributor

@tomncooper tomncooper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking at this @showuon.

I left some style/grammer comments. Other than that the only thing I think needs adding is examples (or pointers to similar) of how to set the various configs you talk about.

ha-example/README.md Outdated Show resolved Hide resolved
ha-example/README.md Outdated Show resolved Hide resolved
ha-example/README.md Outdated Show resolved Hide resolved
ha-example/README.md Outdated Show resolved Hide resolved
ha-example/README.md Outdated Show resolved Hide resolved
ha-example/README.md Outdated Show resolved Hide resolved
recommendation-app-76b6854f98-9zb24 1/1 Running 0 3m59s
recommendation-app-taskmanager-1-1 1/1 Running 0 2m5s
```
5. Browse the minio console, to make sure the metadata of the job manager is uploaded to s3://test/ha
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a generic link like localhost:1234?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've documented in minio-install/README.md.

@@ -0,0 +1,23 @@
apiVersion: v1
kind: Pod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we not be using a Deployment CR rather than raw pods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Updated to using Deployment.

ha-example/minio-install/README.md Outdated Show resolved Hide resolved

Click on the `Object Browser` to view the files in the buckets.

After minio is deployed and bucket is created, the flink configuration can be set like this:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would a user set this, via Helm?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to this:

After minio is deployed and bucket is created, the flink configuration can be set like this in the FlinkDeployment CR:

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: recommendation-app
spec:
  image: quay.io/streamshub/flink-sql-runner:v0.0.1
  flinkVersion: v1_19
  flinkConfiguration:
    # minio setting 
    s3.access-key: minioadmin
    s3.secret-key: minioadmin
    s3.endpoint: http://MINIO_POD_ID:9000
    s3.path.style.access: "true"
    high-availability.storageDir: s3://test/ha
    state.checkpoints.dir: s3://test/cp

@showuon showuon force-pushed the ha branch 2 times, most recently from c4952a9 to 8c3d011 Compare December 16, 2024 05:06
@showuon
Copy link
Contributor Author

showuon commented Dec 16, 2024

@tomncooper , thanks for reviewing the PR. I've updated the PR to address your comments. Thanks.

Copy link
Contributor

@tomncooper tomncooper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a couple of nits.

an external service must store a minimal amount of recovery metadata (like “ID of last committed checkpoint”),
as well as information needed to elect and lock which Job Manager is the leader (to avoid split-brain situations).

In order to configure Job Managers in your Flink Cluster for high availability you need to add the following settings to the configuration in your `FlinkDeplyment` CR like this:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In order to configure Job Managers in your Flink Cluster for high availability you need to add the following settings to the configuration in your `FlinkDeplyment` CR like this:
In order to configure Job Managers in your Flink Cluster for high availability you need to add the following settings to the configuration in your `FlinkDeployment` CR like this:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, nice catch!


[Checkpointing](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/fault-tolerance/checkpointing/) is Flink’s primary fault-tolerance mechanism, wherein a snapshot of your job’s state is persisted periodically to some durable location.
In the case of failure, of a Task running your job's code, Flink will restart the Task from the most recent checkpoint and resume processing.
Although not strictly related to HA of the Flink cluster, it is important to enable check-pointing in production deployments to ensure fault tolerance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Although not strictly related to HA of the Flink cluster, it is important to enable check-pointing in production deployments to ensure fault tolerance.
Although not strictly related to HA of the Flink cluster, it is important to enable checkpointing in production deployments to ensure fault tolerance.

Looks like we are using checkpointing in most places rather than check-pointing.

ha-example/README.md Outdated Show resolved Hide resolved
ha-example/README.md Outdated Show resolved Hide resolved
ha-example/README.md Outdated Show resolved Hide resolved
ha-example/README.md Outdated Show resolved Hide resolved
Comment on lines 22 to 29
--set podSecurityContext=null \
--set defaultConfiguration."log4j-operator\.properties"=monitorInterval\=30 \
--set defaultConfiguration."log4j-console\.properties"=monitorInterval\=30 \
--set replicas=2 \
--set defaultConfiguration."flink-conf\.yaml"="kubernetes.operator.metrics.reporter.slf4j.factory.class\:\ org.apache.flink.metrics.slf4j.Slf4jReporterFactory
kubernetes.operator.metrics.reporter.slf4j.interval\:\ 5\ MINUTE
kubernetes.operator.reconcile.interval:\ 15\ s
kubernetes.operator.observer.progress-check.interval:\ 5\ s

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these configurations somehow related to HA configuration (except replicas)? It doesn't look like yes to me and it might be confusing for anyone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. Updated.

name: recommendation-app
spec:
image: quay.io/streamshub/flink-sql-runner:v0.0.1
flinkVersion: v1_19

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should unify flink version here and in docs bellow (1.19 vs 1.20)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Updated.

ha-example/README.md Outdated Show resolved Hide resolved
ha-example/minio-install/README.md Show resolved Hide resolved
ha-example/README.md Outdated Show resolved Hide resolved
ha-example/README.md Outdated Show resolved Hide resolved
ha-example/README.md Outdated Show resolved Hide resolved
@showuon
Copy link
Contributor Author

showuon commented Dec 18, 2024

PR updated. Thanks for the comments!

@showuon
Copy link
Contributor Author

showuon commented Dec 19, 2024

If there are no more comments, I'm going to merge it today. Thanks.

@showuon showuon merged commit 4fc9f57 into streamshub:main Dec 20, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants