Fails to read the native bloom_filter when the stage falls back to Java #12013

@wankunde

Backend

VL (Velox)

Bug description

How to reproduce this issue

Run the following unit test in GlutenInjectRuntimeFilterSuite:

  test("xxx") {
    withSQLConf(
      SQLConf.RUNTIME_BLOOM_FILTER_APPLICATION_SIDE_SCAN_SIZE_THRESHOLD.key -> "3000",
      SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "false",
      SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2000"
    ) {
      withTable("bf5_text") {
        spark.range(10000).toDF("a5").selectExpr("rpad('bf5_id_' || a5, 1024, 'x') as a5")
          .write.format("text").saveAsTable("bf5_text")
        assertRewroteWithBloomFilter(
          "select * from bf5_text join bf2 on " +
            "bf5_text.a5 = bf2.c2 where bf2.a2 = 67")
      }
    }
  }

How does this issue happen

  • Stage 0: Gluten builds a native bloom filter from bf2 where bf2.a2 = 67.
  • Stage 1: Gluten tries to offload the filter and table scan operators to native execution, but falls back to Java because the text data source is not supported.
  • Still in stage 1: Spark then tries to read the bloom filter on the Java side and throws the exception "Unexpected Bloom filter version number (16777217)".
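
The version number in the exception can be decoded to see what the Java reader actually saw. The sketch below is only an illustration: it assumes that the Java-side reader takes the first four bytes of the serialized filter as a big-endian int (as java.io.DataInputStream.readInt does) and accepts only the value 1. The object and helper names (BloomFilterVersionCheck, headerBytes, leadingInt, acceptedByJavaReader) are hypothetical and are not part of Spark or Gluten.

```scala
import java.io.{ByteArrayInputStream, DataInputStream}
import java.nio.ByteBuffer

// Hypothetical helper, not Spark or Gluten code.
object BloomFilterVersionCheck {
  // Assumption: the only version number the Java reader accepts is 1.
  val ExpectedVersion: Int = 1

  // Big-endian bytes of an int (ByteBuffer is big-endian by default).
  def headerBytes(value: Int): Array[Byte] =
    ByteBuffer.allocate(4).putInt(value).array()

  // Reads the leading int the way DataInputStream.readInt does (big-endian).
  def leadingInt(bytes: Array[Byte]): Int =
    new DataInputStream(new ByteArrayInputStream(bytes)).readInt()

  // True only if the buffer starts with the expected version number.
  def acceptedByJavaReader(bytes: Array[Byte]): Boolean =
    leadingInt(bytes) == ExpectedVersion

  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

  def main(args: Array[String]): Unit = {
    // The value reported in the exception message.
    val reported = 16777217
    val header = headerBytes(reported)
    println(hex(header))                  // prints "01 00 00 01"
    println(acceptedByJavaReader(header)) // prints "false"
  }
}
```

16777217 is 0x01000001, so the buffer starts with the bytes 01 00 00 01 instead of 00 00 00 01. That is consistent with the native bloom filter using a different serialization layout from the one the Java reader expects, although the exact native layout is not stated in this report.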

Our production case

Error plan:

[Image: screenshot of the error plan]

Error stack:

java.io.IOException: Unexpected Bloom filter version number (16777217)
	at org.apache.spark.util.sketch.BloomFilterImpl.readFrom0(BloomFilterImpl.java:251)
	at org.apache.spark.util.sketch.BloomFilterImpl.readFrom(BloomFilterImpl.java:260)
	at org.apache.spark.util.sketch.BloomFilterImpl.readFrom(BloomFilterImpl.java:266)
	at org.apache.spark.util.sketch.BloomFilter.readFrom(BloomFilter.java:185)
	at org.apache.spark.sql.catalyst.expressions.BloomFilterMightContain.deserialize(BloomFilterMightContain.scala:120)
	at org.apache.spark.sql.catalyst.expressions.BloomFilterMightContain.bloomFilter$lzycompute(BloomFilterMightContain.scala:92)
	at org.apache.spark.sql.catalyst.expressions.BloomFilterMightContain.bloomFilter(BloomFilterMightContain.scala:90)
	at org.apache.spark.sql.catalyst.expressions.BloomFilterMightContain.doGenCode(BloomFilterMightContain.scala:105)
	at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:207)
	at scala.Option.getOrElse(Option.scala:201)
	at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:202)
	at org.apache.spark.sql.execution.GeneratePredicateHelper.genPredicate$1(basicPhysicalOperators.scala:156)
	at org.apache.spark.sql.execution.GeneratePredicateHelper.$anonfun$generatePredicateCode$4(basicPhysicalOperators.scala:200)
	at scala.collection.immutable.List.map(List.scala:247)
	at scala.collection.immutable.List.map(List.scala:79)
	at org.apache.spark.sql.execution.GeneratePredicateHelper.generatePredicateCode(basicPhysicalOperators.scala:181)
	at org.apache.spark.sql.execution.GeneratePredicateHelper.generatePredicateCode$(basicPhysicalOperators.scala:140)
	at org.apache.spark.sql.execution.FilterExec.generatePredicateCode(basicPhysicalOperators.scala:220)
	at org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:253)
	at org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:198)
	at org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:153)

Gluten version

Gluten 1.5

Spark version

Spark 4.0

Spark configurations

None

System information

None

Relevant logs

Labels

bug, triage