[FLINK-36540][Runtime] Add Support for Hadoop Caller Context when using Flink to operate hdfs. #26681
base: master
Conversation
@dmvk @xintongsong @ferenc-csaky
<td><h5>hdfs.caller-context.enabled</h5></td>
<td style="word-wrap: break-word;">false</td>
<td>Boolean</td>
<td>A config of whether hadoop caller context is enabled.</td>
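For reference, enabling the option discussed here would be a one-line config change (a sketch; the option name is taken from the generated docs table above, the file location is the usual Flink configuration file):

```yaml
# flink-conf.yaml fragment (illustrative): turn on the Hadoop caller context
hdfs.caller-context.enabled: true
```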
It would be more readable to say:
Whether hadoop caller context is enabled.
I forgot to push my latest branch, sorry about that.
* used, such as caller context or other metadata.
*/
@Experimental
public interface ContextWrapperFileSystem {
I wonder if we need the words wrap and wrapper. It would be simpler (and more intuitive?) to have ContextFileSystem and the method as addContext. WDYT?
agree
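A minimal sketch of the suggested rename (all names here are the reviewer's hypothetical suggestion, not the interface actually merged in the PR):

```java
// Hypothetical sketch of the reviewer's suggestion: drop the "wrap"/"wrapper"
// wording and expose a ContextFileSystem interface with an addContext method.
public interface ContextFileSystem {
    // Attach caller-context metadata (e.g. a Hadoop caller-context string)
    // to subsequent file-system operations performed by this instance.
    void addContext(String context);
}
```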
CONTEXTS.set(newContext);
}

static FileSystem wrapWithContextWhenActivated(FileSystem fs) {
What does WhenActivated mean here? Maybe explain it in a comment if it is important. Otherwise, could we not just say addContext?
agree
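The guard the method name implies can be sketched as follows. This is a self-contained toy: the FileSystem and wrapper classes below are stand-ins for illustration, not Flink's actual classes, and the boolean flag is an assumed parameter.

```java
// Toy stand-in for Flink's FileSystem abstraction (illustrative only).
class FileSystem {}

// Hypothetical wrapper that would attach caller-context handling.
class ContextWrappingFileSystem extends FileSystem {
    final FileSystem delegate;
    ContextWrappingFileSystem(FileSystem delegate) { this.delegate = delegate; }
}

class FileSystems {
    // Wrap only when the caller-context feature is activated; otherwise
    // return the original instance, so users who do not enable the
    // feature (or do not use HDFS) are unaffected.
    static FileSystem wrapWithContextWhenActivated(FileSystem fs, boolean enabled) {
        return enabled ? new ContextWrappingFileSystem(fs) : fs;
    }
}
```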
context = context + "_local";
}
context = context + "JobID_" + jobID;
FileSystemContext.initializeContextForThread(context);
I was thinking some file systems would have contexts and some would not, yet the code does context processing even when the file system might not have a context. Have I understood this correctly?
Yes, currently only HadoopFileSystem uses this context.
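The per-thread holder that initializeContextForThread suggests can be sketched like this (a hypothetical reconstruction from the CONTEXTS.set line in the diff; the actual FileSystemContext class in the PR may differ):

```java
// Hypothetical sketch of a per-thread caller-context holder.
final class FileSystemContext {
    private static final ThreadLocal<String> CONTEXTS = new ThreadLocal<>();

    // Record the caller context for the current thread; a context-aware
    // file system (e.g. the Hadoop one) can read it before each operation.
    static void initializeContextForThread(String context) {
        CONTEXTS.set(context);
    }

    static String currentContext() {
        return CONTEXTS.get();
    }

    // Clearing after use avoids ThreadLocal leaks on pooled threads,
    // which the PR description calls out as a design goal.
    static void clear() {
        CONTEXTS.remove();
    }
}
```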
Please add unit tests
@@ -115,6 +116,21 @@ public void run() {
checkpointMetaData.getCheckpointId(),
asyncStartDelayMillis);

String context = "FLINK";
Shall we put this into a util class?
good idea
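Such a util method could look like the following sketch, mirroring the string construction in the diffs above exactly (including the bare "JobID_" concatenation); the class and method names are hypothetical:

```java
// Hypothetical util class collecting the caller-context string construction
// that currently appears inline in the task code.
final class CallerContextUtil {
    static String buildCallerContext(String jobID, String taskName, boolean isLocal) {
        String context = "FLINK";
        if (isLocal) {
            // "_local" marker for local execution, as in the diff above.
            context = context + "_local";
        }
        return context + "JobID_" + jobID + "_TaskName_" + taskName;
    }
}
```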
} else {
context = context + "_local";
}
context = context + "JobID_" + taskEnvironment.getJobID() + "_TaskName_" + taskName;
Compared to using the JobID only, it would be better to use both the job name and the job ID for readability.
I disagree with this. Job names may contain special characters such as spaces.
In my case, I want to load this context into a structured table for further analysis, so I believe the job ID is sufficient.
If we need to find the exact job name, we can always look it up in the History Server.
Thanks for the contribution. We are also waiting for this feature.
@flinkbot run azure
Is the CI failure related to this change? If not, let's rebase it onto the latest master.
Thanks for your reply @ferenc-csaky
What is the purpose of the change
As described in FLINK-36540.
When we use Flink to delete, write, or modify files on a Hadoop filesystem, the caller context is a helpful feature for tracing who performed an operation or counting how many files an application creates on the Hadoop filesystem. UGI alone is not good enough for tracing these operations: if a tenant has many jobs writing into HDFS, we cannot find out which job caused the breakdown of HDFS.
I created a new interface and class in the flink-core module, so that it does not leak ThreadLocal values and does not affect deployments that do not use HDFS.
What's more, by combining this new feature with the history JSON files in the history server, we can calculate how many read and write operations a Flink application performed against HDFS, and find out whether there is pressure or a bottleneck on HDFS file operations.
Brief change log
Verifying this change
Please make sure both new and modified tests in this PR follow the conventions for tests defined in our code quality guide.
This change added tests and can be verified as follows:
I rebuilt the project and tested the new jar file in my cluster; it prints the correct caller context as expected:

Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation