Skip to content

feat: support Boolean in approx_distinct#22707

Open
JeelRajodiya wants to merge 4 commits into
apache:mainfrom
JeelRajodiya:boolean-support-approx-count-dist
Open

feat: support Boolean in approx_distinct#22707
JeelRajodiya wants to merge 4 commits into
apache:mainfrom
JeelRajodiya:boolean-support-approx-count-dist

Conversation

@JeelRajodiya
Copy link
Copy Markdown
Contributor

@JeelRajodiya JeelRajodiya commented Jun 2, 2026

Which issue does this PR close?

No issue is open for Boolean specifically. Related: #1109 (closed by #21453) introduced the bitmap pattern this PR extends. The pre-#21453 code carried a // TODO support for boolean (trivial case) comment that was silently dropped without implementation.

Rationale for this change

Today approx_distinct(bool_col) errors with "Support for 'approx_distinct' for data type Boolean is not implemented". Boolean has at most 2 distinct non-null values, so HLL is overkill — a tiny pair of flags gives an exact answer at a fraction of the memory cost, matching the small-int bitmap strategy already in the codebase.

What changes are included in this PR?

Adds BooleanDistinctCountAccumulator (two named flags: has_seen_false / has_seen_true) in functions-aggregate-common and wires DataType::Boolean through the existing ApproxDistinctBitmapWrapper in approx_distinct.rs alongside the small-int arms. State serializes as List<Boolean>.

Are these changes tested?

Yes — SLT regression coverage in aggregate.slt for all-true, all-false, mixed, all-null, and GROUP BY cases.

Are there any user-facing changes?

Yes — approx_distinct(<boolean>) now works instead of erroring. No API or behavioral changes for existing types.

PR apache#21453 introduced the ApproxDistinctBitmapWrapper pattern for small
ints and incidentally dropped a "TODO support for boolean (trivial case)"
comment without wiring up Boolean. This adds a BooleanDistinctCountAccumulator
(a [bool; 2] flag pair) plugged into the existing wrapper so
approx_distinct returns an exact 0/1/2 for Boolean inputs.
@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Jun 2, 2026
@kumarUjjawal
Copy link
Copy Markdown
Contributor

Thank for working on this @JeelRajodiya Can you update the PR body using the template.

Copy link
Copy Markdown
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs Outdated
…ntAccumulator

Use BooleanArray::has_true/has_false instead of iterating, and bail
early when both flags are already set. Per review feedback from
@Jefffrey on apache#22707.
@JeelRajodiya JeelRajodiya marked this pull request as draft June 2, 2026 07:21
…mall_int helpers

- Replace `seen: [bool; 2]` with named `has_seen_false`/`has_seen_true`
  fields and add `seen_both()` helper for clarity.
- Rename `get_small_int_*` helpers to `get_fixed_domain_*` since they
  also dispatch the Boolean arm, which is not a small int.
@JeelRajodiya JeelRajodiya marked this pull request as ready for review June 2, 2026 07:46
@JeelRajodiya
Copy link
Copy Markdown
Contributor Author

JeelRajodiya commented Jun 2, 2026

@Jefffrey PR is ready for review!

@coderfender
Copy link
Copy Markdown
Contributor

@JeelRajodiya , Thank you for the PR. Let me take a look and post review this week

@JeelRajodiya
Copy link
Copy Markdown
Contributor Author

Also, there are other types that approx distinct does not support.
We can support some of theose unsupported types using carrier types (see below)

Carrier mapping:

  • Float32/64 → UInt32/UInt64 (canonical bits — NaN→f::NAN, -0→+0)
  • Decimal32/64 → Int32/Int64 (same native value)
  • Decimal128/256 → Binary (little-endian native bytes)
  • FixedSizeBinary → Binary (raw bytes)

Or we can also create seprate accumulators for them. Whichever is preffered by datafusion team.
I'm happy to open PR for them as well. Please let me know

Copy link
Copy Markdown
Contributor

@coderfender coderfender left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of comments but this looks good to me


#[cold]
fn get_small_int_approx_accumulator(
fn get_fixed_domain_approx_accumulator(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why the name get_fixed_domain_state_field is chosen here ?

Copy link
Copy Markdown
Contributor Author

@JeelRajodiya JeelRajodiya Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the keeping the name small_int won't be relevant anymore and the types we included under these functions share a small, finite domain (i.e finite values) hence this name was chosen.

Comment thread datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs Outdated
…inctCountAccumulator

- Pull the per-flag update block into an `observe(&BooleanArray)` helper
  shared by `update_batch` and `merge_batch`.
- Sum the two flags as u8 first, then cast once to i64 in `count()`.

Per review feedback from @coderfender on apache#22707.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants