[SPARK-47547] BloomFilter fpp degradation #50933


Open

ishnagy wants to merge 13 commits into master from SPARK-47547_bloomfilter_fpp_degradation

Conversation


@ishnagy ishnagy commented May 19, 2025

What changes were proposed in this pull request?

This change fixes a performance degradation issue in the current BloomFilter implementation.

The current bit index calculation logic does not use any part of the indexable space above the first 31 bits, so when the inserted item count approaches (or exceeds) Integer.MAX_VALUE, it produces significantly worse collision rates than an (ideal) uniformly distributing hash function would.
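
To make the failure mode concrete, here is a minimal, self-contained sketch (not Spark's code; the sign-fold and modulo steps are assumed from the scheme described above) showing that 32-bit combined-hash arithmetic can never set a bit above the first ~2G positions:

    // Demonstration only: with 32-bit combined-hash arithmetic, no computed
    // bit index can exceed Integer.MAX_VALUE, however large the filter is.
    public class IndexRangeDemo {
      public static void main(String[] args) {
        long bitSize = 64L * Integer.MAX_VALUE;  // a filter far larger than 2^31 bits
        int h1 = 0x12345678, h2 = 0x9abcdef0;    // arbitrary 32-bit hash halves
        long maxIndex = 0;
        for (int i = 1; i <= 1_000_000; i++) {
          int combinedHash = h1 + (i * h2);      // 32-bit arithmetic, as in the current logic
          if (combinedHash < 0) combinedHash = ~combinedHash;  // assumed sign-fold step
          maxIndex = Math.max(maxIndex, combinedHash % bitSize);
        }
        // prints a value of at most 2147483647 (~2G), a 1/64 slice of the bit space
        System.out.println("max reachable bit index: " + maxIndex + " of " + bitSize);
      }
    }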

Why are the changes needed?

This should qualify as a bug.

The upper bound on the bit capacity of the current BloomFilter implementation in Spark is approximately 137G bits (an Integer.MAX_VALUE-sized array of 64-bit longs). The current indexing scheme can only address about 2G of these bits.
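
For reference, the arithmetic behind those two figures (a worked calculation, not code from the PR):

    long words     = Integer.MAX_VALUE;   // max Java array length: 2^31 - 1 longs
    long maxBits   = words * Long.SIZE;   // (2^31 - 1) * 64 ≈ 1.37e11 bits, approx 137G
    long reachable = Integer.MAX_VALUE;   // 31-bit indices: approx 2.1e9 bits, approx 2G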

On the other hand, due to the way BloomFilters are used, the bug won't cause any logical errors; it will just gradually render the BloomFilter instance useless by forcing more and more queries onto the slow path.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New test

One new Java test class was added to the sketch module to test different combinations of item counts and expected fpp rates.

common/sketch/src/test/java/org/apache/spark/util/sketch/TestSparkBloomFilter.java

testAccuracyEvenOdd
over N iterations, inserts the N even numbers (2*i) into the BloomFilter and leaves out the N odd numbers (2*i+1).

The test asserts that mightContain returns true for all of the inserted even numbers, and measures the rate of mightContain=true (false positives) on the never-inserted odd numbers.
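
A condensed sketch of that structure (method names follow the public org.apache.spark.util.sketch.BloomFilter API; the loop bodies are illustrative rather than the test's exact code):

    BloomFilter filter = BloomFilter.create(numItems, expectedFpp);
    for (long i = 0; i < numItems; i++) {
      filter.putLong(2 * i);                       // insert only the even numbers
    }
    long falsePositives = 0;
    for (long i = 0; i < numItems; i++) {
      assertTrue(filter.mightContainLong(2 * i));  // inserted items must always hit
      if (filter.mightContainLong(2 * i + 1)) {    // odd numbers were never inserted
        falsePositives++;
      }
    }
    double actualFpp = (double) falsePositives / numItems;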

testAccuracyRandom
over 2N iterations, inserts N pseudorandomly generated numbers into two differently seeded (theoretically independent) BloomFilter instances. All random numbers generated in an even iteration are inserted into both filters; all random numbers generated in an odd iteration are left out of both.

The test asserts that mightContain returns true for every item inserted in an even iteration. It counts as false positives the odd-iteration items for which the primary filter reports mightContain=true but the secondary reports mightContain=false. Since the same elements were inserted into both instances and the secondary reports non-insertion, a mightContain=true from the primary can only be a false positive.
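
A condensed sketch of that scheme (illustrative; the commented-out seed arguments stand in for however the test actually constructs two differently seeded instances):

    BloomFilter primary   = BloomFilter.create(numItems, expectedFpp /*, seedA */);
    BloomFilter secondary = BloomFilter.create(numItems, expectedFpp /*, seedB */);
    Random rng = new Random(0);
    long falsePositives = 0;
    for (long i = 0; i < 2L * numItems; i++) {
      long item = rng.nextLong();
      if (i % 2 == 0) {            // even iteration: insert into both filters
        primary.putLong(item);
        secondary.putLong(item);
      } else if (primary.mightContainLong(item) && !secondary.mightContainLong(item)) {
        // the secondary proves the item was never inserted, so the primary's
        // "true" can only be a false positive
        falsePositives++;
      }
    }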

Patched

One minor (test) issue was fixed in

common/sketch/src/test/scala/org/apache/spark/util/sketch/BloomFilterSuite.scala

where potential repetitions in the randomly generated stream of insertable items resulted in slightly worse fpp measurements than the actual rate. The problem affected most those test cases where the cardinality of the tested type is low (so the chance of repetition is high), e.g. Byte and Short.

Removed from the default runs

Running these tests as part of the default build process was turned off by adding the @Disabled annotation to the new test class.
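
In JUnit 5 terms this looks roughly like:

    import org.junit.jupiter.api.Disabled;

    @Disabled  // long-running accuracy tests; run manually, not in default builds
    public class TestSparkBloomFilter {
      // ...
    }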

Was this patch authored or co-authored using generative AI tooling?

No

    }
    }

    long mightContainEven = 0;
@peter-toth peter-toth commented May 19, 2025

Please rename these two in this test case to clarify that they are actually indices of numbers in a randomly generated stream.

      optimalNumOfBits / Byte.SIZE / 1024 / 1024
    );
    Assumptions.assumeTrue(
      2 * optimalNumOfBits / Byte.SIZE < 4 * ONE_GB,
@peter-toth peter-toth commented May 19, 2025

I guess 4 * ONE_GB is a reasonable limit; can we extract it to a constant and add a comment to it?
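
A sketch of the suggested extraction (the constant name matches the REQUIRED_HEAP_UPPER_BOUND_IN_BYTES referred to in later comments; the message string is illustrative):

    // Skip (rather than fail) parameter combinations whose two filters
    // would not fit into this much heap.
    private static final long REQUIRED_HEAP_UPPER_BOUND_IN_BYTES = 4 * ONE_GB;

    Assumptions.assumeTrue(
        2 * optimalNumOfBits / Byte.SIZE < REQUIRED_HEAP_UPPER_BOUND_IN_BYTES,
        "this parameter combination needs more heap than the configured bound");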

"mightContainLong must return true for all inserted numbers"
);

double actualFpp = (double) mightContainOddIndexed / numItems;
A contributor commented:

/ numItems doesn't seem correct here, as you don't test numItems numbers that were surely not added to the filter.

@ishnagy (Author) replied:

Indeed. It should probably be very close to the proper value, but this calculation doesn't account for the odd-iteration items that are ignored based on the secondary filter's result.

Let me try to address that somehow.
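
One possible shape of that fix (a sketch, not the final code): use as the denominator only those odd-iteration items the secondary filter confirms were never inserted.

    long confirmedNegatives = 0;  // secondary filter reports "definitely not inserted"
    long falsePositives = 0;      // primary says "maybe", secondary says "no"
    // ... inside the odd-iteration branch:
    if (!secondary.mightContainLong(item)) {
      confirmedNegatives++;
      if (primary.mightContainLong(item)) {
        falsePositives++;
      }
    }
    // ... after the loop:
    double actualFpp = (double) falsePositives / confirmedNegatives;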

@ishnagy force-pushed the SPARK-47547_bloomfilter_fpp_degradation branch from 8edf4dd to 57298f0 on May 19, 2025, 16:07
@ishnagy force-pushed the SPARK-47547_bloomfilter_fpp_degradation branch from 57298f0 to f589e2c on May 19, 2025, 16:11
@peter-toth peter-toth commented May 20, 2025

Can you please post the output of the new TestSparkBloomFilter here when the 4 GB limit of REQUIRED_HEAP_UPPER_BOUND_IN_BYTES is lifted?
And summarize the actual false positive rate (FPP) before and after this fix for numItems = {1000000, 1000000000, 5000000000} with the default expected FPP of 3%?

@ishnagy ishnagy commented May 20, 2025

The tests with the 4 GB limit are still running; I'll post a summary of the results tomorrow and start a new run that covers all of the 5G element count cases.

@ishnagy ishnagy commented May 21, 2025

The filter-from-hex-constant test started to make me worry about compatibility with serialized instances created with the older logic. Even if we can deserialize the buffer and the seed properly, the actual bits will be set in completely different positions. That is, there's no point in trying to use an old (serialized) buffer with the new logic.

Should we create a dedicated BloomFilterImplV2 class for the fixed logic, just so we can keep the old V1 implementation for deserializing old byte streams?

@peter-toth peter-toth replied:

> Should we create a dedicated BloomFilterImplV2 class for the fixed logic, just so we can keep the old V1 implementation for deserializing old byte streams?

I don't think we need to keep the old implementation just to support old serialized versions. It seems we use our bloom filter implementation only in BloomFilterAggregate.

cc @cloud-fan

@ishnagy ishnagy commented May 23, 2025

I ran into some trouble with generating the test results (running on a single thread, the whole batch takes ~10h on my machine). I'll try to make an update on Monday.

@ishnagy ishnagy commented May 26, 2025

| version | testName | n | fpp | seed | allocatedBitCount | setBitCount | saturation | expectedFpp% | actualFpp% | runningTime |
|---|---|---|---|---|---|---|---|---|---|---|
| OLD | testAccuracyEvenOdd | 1000000 | 0.05 | 0 | 6235264 (0 MB) | 2952137 | 0.473458 | 5.000000 % | 5.025400 % | PT19.267149499S |
| OLD | testAccuracyEvenOdd | 1000000 | 0.03 | 0 | 7298496 (0 MB) | 3618475 | 0.495784 | 3.000000 % | 3.022900 % | PT19.628671953S |
| OLD | testAccuracyEvenOdd | 1000000 | 0.01 | 0 | 9585088 (1 MB) | 4968111 | 0.518317 | 1.000000 % | 0.994700 % | PT19.476457289S |
| OLD | testAccuracyEvenOdd | 1000000 | 0.001 | 0 | 14377600 (1 MB) | 7203887 | 0.501049 | 0.100000 % | 0.102200 % | PT19.944492903S |
| OLD | testAccuracyEvenOdd | 1000000000 | 0.05 | 0 | 6235224256 (743 MB) | 1814052150 | 0.290936 | 5.000000 % | 50.920521 % | PT28M6.091484671S |
| OLD | testAccuracyEvenOdd | 1000000000 | 0.03 | 0 | 7298440896 (870 MB) | 1938187323 | 0.265562 | 3.000000 % | 59.888499 % | PT30M26.383544378S |
| OLD | testAccuracyEvenOdd | 1000000000 | 0.01 | 0 | 9585058432 (1142 MB) | 2065015223 | 0.215441 | 1.000000 % | 76.025548 % | PT36M30.827858084S |
| OLD | testAccuracyEvenOdd | 1000000000 | 0.001 | 0 | 14377587584 (1713 MB) | 2127081112 | 0.147944 | 0.100000 % | 90.896130 % | PT45M58.403282401S |
| OLD | testAccuracyEvenOdd | 5000000000 | 0.05 | 0 | 31176121152 (3716 MB) | 2147290054 | 0.068876 | 5.000000 % | 99.963940 % | PT1H28M39.598973373S |
| OLD | testAccuracyEvenOdd | 5000000000 | 0.03 | 0 | 36492204224 (4350 MB) | 2147464804 | 0.058847 | 3.000000 % | 99.995623 % | PT1H41M22.171084285S |
| OLD | testAccuracyEvenOdd | 5000000000 | 0.01 | 0 | 47925291904 (5713 MB) | 2147483464 | 0.044809 | 1.000000 % | 99.999939 % | PT1H59M42.481346242S |
| OLD | testAccuracyEvenOdd | 5000000000 | 0.001 | 0 | 71887937856 (8569 MB) | 2147483648 | 0.029873 | 0.100000 % | 100.000000 % | PT2H32M41.743734635S |

@ishnagy ishnagy commented May 26, 2025

| version | testName | n | fpp | seed | allocatedBitCount | setBitCount | saturation | expectedFpp% | actualFpp% | runningTime |
|---|---|---|---|---|---|---|---|---|---|---|
| NEW | testAccuracyEvenOdd | 1000000 | 0.05 | 0 | 6235264 (0 MB) | 2952282 | 0.473481 | 5.000000 % | 5.046800 % | PT13.599525353S |
| NEW | testAccuracyEvenOdd | 1000000 | 0.03 | 0 | 7298496 (0 MB) | 3619967 | 0.495988 | 3.000000 % | 3.018000 % | PT14.086955381S |
| NEW | testAccuracyEvenOdd | 1000000 | 0.01 | 0 | 9585088 (1 MB) | 4968081 | 0.518314 | 1.000000 % | 1.013400 % | PT14.300125629S |
| NEW | testAccuracyEvenOdd | 1000000 | 0.001 | 0 | 14377600 (1 MB) | 7205256 | 0.501145 | 0.100000 % | 0.095100 % | PT14.746387272S |
| NEW | testAccuracyEvenOdd | 1000000000 | 0.05 | 0 | 6235224256 (743 MB) | 2963568196 | 0.475295 | 5.000000 % | 4.889721 % | PT35M6.22696009S |
| NEW | testAccuracyEvenOdd | 1000000000 | 0.03 | 0 | 7298440896 (870 MB) | 3628684972 | 0.497186 | 3.000000 % | 2.963030 % | PT37M31.833552669S |
| NEW | testAccuracyEvenOdd | 1000000000 | 0.01 | 0 | 9585058432 (1142 MB) | 4973807865 | 0.518913 | 1.000000 % | 1.001407 % | PT43M23.782325058S |
| NEW | testAccuracyEvenOdd | 1000000000 | 0.001 | 0 | 14377587584 (1713 MB) | 7210348423 | 0.501499 | 0.100000 % | 0.100803 % | PT57M35.474342424S |
| NEW | testAccuracyEvenOdd | 5000000000 | 0.05 | 0 | 31176121152 (3716 MB) | 14360939834 | 0.460639 | 5.000000 % | 6.727508 % | PT2H21M2.643592951S |
| NEW | testAccuracyEvenOdd | 5000000000 | 0.03 | 0 | 36492204224 (4350 MB) | 17711039216 | 0.485338 | 3.000000 % | 3.806971 % | PT2H29M18.334864292S |
| NEW | testAccuracyEvenOdd | 5000000000 | 0.01 | 0 | 47925291904 (5713 MB) | 24462662240 | 0.510433 | 1.000000 % | 1.321482 % | PT2H56M51.935983408S |
| NEW | testAccuracyEvenOdd | 5000000000 | 0.001 | 0 | 71887937856 (8569 MB) | 35637830341 | 0.495741 | 0.100000 % | 0.176216 % | PT3H38M21.888031962S |

@ishnagy changed the title from "[WIP] [SPARK-47547] BloomFilter fpp degradation" to "[SPARK-47547] BloomFilter fpp degradation" on May 27, 2025
@peter-toth peter-toth left a comment

Yeah, actualFpp% seems to be much better when the number of inserted items (n) is huge (~1B).
I'm not sure that the bug actually caused any issues in the injected runtime filters, due to the much lower default values of the spark.sql.optimizer.runtime.bloomFilter.max... configs, but it is also possible to build a Bloom filter manually, so it is better to fix it.

BTW, this issue seems to have been observed in Spark before: https://stackoverflow.com/questions/78162973/why-is-observed-false-positive-rate-in-spark-bloom-filter-higher-than-expected and a fix was attempted in #46370.
That old PR was similar to how the issue was fixed in Guava (adding a new strategy / Murmur implementation), while this PR fixes the root cause in the current Bloom filter implementation.

@peter-toth commented:

@cloud-fan, as you added the original bloom filter implementation to Spark, could you please take a look at this PR?

@ishnagy ishnagy commented May 27, 2025

The only relevant difference between the OLD and the NEW versions is the logic used to derive the k hash bits:

OLD

    for (int i = 1; i <= numHashFunctions; i++) {
      // 32-bit arithmetic: combinedHash can never index beyond the first 2^31 bits
      int combinedHash = h1 + (i * h2);
      // ...
    }

NEW

    long combinedHash = (long) h1 * Integer.MAX_VALUE;
    for (long i = 0; i < numHashFunctions; i++) {
      // 64-bit arithmetic: the full (up to ~137G-bit) index space stays addressable
      combinedHash += h2;
      // ...
    }
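
A self-contained illustration of what the 64-bit derivation buys (the reduction of combinedHash to a concrete bit index is elided above, so the Math.floorMod step here is an assumption, not the PR's exact code):

    long bitSize = 64L * Integer.MAX_VALUE;  // ~137G bits, the capacity discussed above
    int h1 = 0x1b873593, h2 = 0xcc9e2d51;    // arbitrary 32-bit hash halves
    long combinedHash = (long) h1 * Integer.MAX_VALUE;
    for (long i = 0; i < 7; i++) {
      combinedHash += h2;
      // unlike the OLD int arithmetic, these indices are not capped at
      // Integer.MAX_VALUE and can land anywhere in the full bit space
      System.out.println(Math.floorMod(combinedHash, bitSize));
    }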
