Skip to content

[FLINK] Fix Nexmark: support multi-parallelism via ParallelSplit#36

Merged
zhanglistar merged 8 commits into
bigo-sg:gluten-0530from
ggjh-159:fix/nexmark-source-multi-parallelism
Jun 24, 2026
Merged

[FLINK] Fix Nexmark: support multi-parallelism via ParallelSplit#36
zhanglistar merged 8 commits into
bigo-sg:gluten-0530from
ggjh-159:fix/nexmark-source-multi-parallelism

Conversation

@ggjh-159

@ggjh-159 ggjh-159 commented Jun 16, 2026

Copy link
Copy Markdown

Summary

Introduce an abstraction for "one logical split that expands into per-subtask splits at runtime", so the planner can express a disjoint event range per subtask.

  • Add ParallelSplit (subclass of ConnectorSplit) declaring getSubtaskSplit(int index, int parallelism).
  • Add GeneratorConfig and NexmarkConfiguration POJOs mirroring the C++ side. GeneratorConfig serializes maxEventsOrZero as JSON key maxEvents to match the C++ serde. NexmarkConfiguration annotates isRateLimited() / isUseWallclockEventTime() with explicit @JsonGetter names — otherwise Jackson strips the is prefix and the C++ deserializer silently drops both fields.
  • Refactor NexmarkConnectorSplit to extend ParallelSplit, replacing int numRows with GeneratorConfig config + List<NexmarkConnectorSplit> subtaskSplits.

fix: gluten#12298
depends: bigo-sg/velox#45
related: apache/gluten#12304

Test

  • Unit test NexmarkConnectorSplitSerdeTest: JSON serde round-trip for NexmarkConfiguration / GeneratorConfig and ISerializable Java↔C++ round-trip for NexmarkConnectorSplit (single and parallel with two subtasks).
  • End-to-end via the companion gluten-flink PR (parallelism.default = 2, nexmark events.num = 10000, tps = 2000, q0): bid input rows 9200 (no duplicates), dateTime span ~5s, subtask splits firstEventId=1, maxEvents=5000 and firstEventId=5001, maxEvents=5000.

lgbo-ustc

This comment was marked as duplicate.

@lgbo-ustc lgbo-ustc left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern about the new testNexmarkConnectorSplitWithSubtasks: it looks like it is intended to verify the parallel split round-trip, but subtaskSplits is not currently serialized by velox4j Serde because auto-detect getters/fields are disabled and NexmarkConnectorSplit only exposes config via @JsonGetter. As a result, this test can pass while the deserialized split has subtaskSplits == null.

If subtaskSplits is intended to survive Java/C++ serde, please add an explicit getter and assert getSubtaskSplit() still works after round-trip. If it is intentionally Java-runtime-only and should not be sent to C++, please make the test/name/assertions explicit about that behavior so it does not look like the subtask list is being validated.

@lgbo-ustc lgbo-ustc left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another concern is that NexmarkConnectorSplit#getSubtaskSplit(int index, int parallelism) ignores the parallelism argument and directly indexes subtaskSplits. This assumes the planned split count always equals the runtime parallelism.

If runtime parallelism is larger than subtaskSplits.size(), this fails with IndexOutOfBoundsException. If runtime parallelism is smaller, some generated subtask ranges are silently unused, which can drop events. It would be safer to validate subtaskSplits != null and subtaskSplits.size() == parallelism, then fail with a clear exception if they do not match.

lgbo-ustc

This comment was marked as duplicate.

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonGetter;
import com.fasterxml.jackson.annotation.JsonProperty;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming question: GeneratorConfig looks generic, but this class is Nexmark-specific: it contains NexmarkConfiguration and Nexmark event range fields such as firstEventId, maxEvents, and firstEventNumber. The corresponding C++ type also lives under the nexmark connector.

Could we make this Nexmark-specific in Java as well, e.g. NexmarkGeneratorConfig or a nexmark-specific package/name? If keeping the C++ name is intentional for serde parity, a short class comment saying this is Nexmark-only would help.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, using NexmarkGeneratorConfig.

@ggjh-159 ggjh-159 force-pushed the fix/nexmark-source-multi-parallelism branch 2 times, most recently from 9891619 to cf51f45 Compare June 18, 2026 09:17
@ggjh-159

Copy link
Copy Markdown
Author

One concern about the new testNexmarkConnectorSplitWithSubtasks: it looks like it is intended to verify the parallel split round-trip, but subtaskSplits is not currently serialized by velox4j Serde because auto-detect getters/fields are disabled and NexmarkConnectorSplit only exposes config via @JsonGetter. As a result, this test can pass while the deserialized split has subtaskSplits == null.

If subtaskSplits is intended to survive Java/C++ serde, please add an explicit getter and assert getSubtaskSplit() still works after round-trip. If it is intentionally Java-runtime-only and should not be sent to C++, please make the test/name/assertions explicit about that behavior so it does not look like the subtask list is being validated.

@lgbo-ustc
You're right — the old test was misleading. subtaskSplits is intentionally
Java-runtime-only and never crosses the C++ boundary; the C++ side only ever
sees per-subtask leaf splits (each carrying one NexmarkGeneratorConfig). The PR
now makes that explicit:

  1. Split NexmarkConnectorSplit (leaf, extends ConnectorSplit, crosses C++
    serde as before) from NexmarkParallelSplit (container, extends ParallelSplit, Java-only — never registered in ISerializableRegistry
    for the C++ polymorphic path).

  2. Removed the misleading testNexmarkConnectorSplitWithSubtasks (which
    invoked testISerializableRoundTrip and falsely implied C++ round-trip
    validation).

See src/main/java/.../connector/NexmarkConnectorSplit.java,
src/main/java/.../connector/NexmarkParallelSplit.java,
src/test/java/.../serde/NexmarkConnectorSplitSerdeTest.java

@ggjh-159

ggjh-159 commented Jun 18, 2026

Copy link
Copy Markdown
Author

Another concern is that NexmarkConnectorSplit#getSubtaskSplit(int index, int parallelism) ignores the parallelism argument and directly indexes subtaskSplits. This assumes the planned split count always equals the runtime parallelism.

If runtime parallelism is larger than subtaskSplits.size(), this fails with IndexOutOfBoundsException. If runtime parallelism is smaller, some generated subtask ranges are silently unused, which can drop events. It would be safer to validate subtaskSplits != null and subtaskSplits.size() == parallelism, then fail with a clear exception if they do not match.

@lgbo-ustc Fixed. NexmarkParallelSplit#getSubtaskSplit now validates inputs before indexing.

  @Override
  public ConnectorSplit getSubtaskSplit(int index, int parallelism) {
    Objects.requireNonNull(subtaskSplits, "subtaskSplits is null");
    if (parallelism != subtaskSplits.size()) {
      throw new IllegalStateException(
          String.format(
              "Runtime parallelism (%d) does not match planned subtask count (%d). "
                  + "Nexmark multi-parallelism requires the same parallelism at plan and runtime.",
              parallelism, subtaskSplits.size()));
    }
    if (index < 0 || index >= subtaskSplits.size()) {
      throw new IndexOutOfBoundsException(
          "Subtask index " + index + " out of range [0, " + subtaskSplits.size() + ")");
    }
    return subtaskSplits.get(index);
  }

@ggjh-159 ggjh-159 requested a review from lgbo-ustc June 18, 2026 11:07
@ggjh-159 ggjh-159 force-pushed the fix/nexmark-source-multi-parallelism branch from cf51f45 to b235914 Compare June 23, 2026 07:00
@zhanglistar zhanglistar merged commit 219cbea into bigo-sg:gluten-0530 Jun 24, 2026
0 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants