Issue with Reading LZ4-Compressed Parquet File Using Spark 3.5 + Blaze #771

Closed
merrily01 opened this issue Jan 17, 2025 · 2 comments
@merrily01 (Contributor)

Describe the bug

Reading an LZ4-compressed Parquet file with Spark 3.5 + Blaze fails with a native decompression error, while the same file is read successfully with Blaze disabled.

To Reproduce
Steps to reproduce the behavior:

  1. The LZ4-compressed Parquet file that reproduces the issue is attached, e.g.:
    part-00000-7493e343-a159-4a2f-b69d-77cb68ac525f-c000.lz4.parquet.txt

    Note: Please remove the “.txt” suffix to restore the Parquet file before proceeding. (If the attachment is unavailable, a sketch for generating a similar file follows these steps.)

  2. Upload the LZ4-compressed Parquet file to HDFS.

  3. Launch spark-shell with Spark 3.5 + Blaze.

  4. Enable the Blaze switch and read the Parquet file mentioned above; the query fails and throws the following error:

scala> spark.conf.set("spark.blaze.enable", true)
scala> val df = spark.read.parquet("hdfs://path/o/part-00000-7493e343-a159-4a2f-b69d-77cb68ac525f-c000.lz4.parquet")
scala> df.show()
...
25/01/17 17:01:31 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (tjtx16-35-27.58os.org executor 2): java.lang.RuntimeException: poll record batch error: Execution error: native execution panics: Execution error: Execution error: output_with_sender[Project] error: Execution error: output_with_sender[ParquetScan] error: Execution error: output_with_sender[ParquetScan]: output() returns error: Arrow error: External error: Arrow: Parquet argument error: External: the offset to copy is not contained in the decompressed buffer
	at org.apache.spark.sql.blaze.JniBridge.nextBatch(Native Method)
	at org.apache.spark.sql.blaze.BlazeCallNativeWrapper$$anon$1.hasNext(BlazeCallNativeWrapper.scala:80)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:95)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:143)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:662)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:682)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
  5. Disable the Blaze switch and read the same Parquet file; the query succeeds and displays the results:
scala> spark.conf.set("spark.blaze.enable", false)
scala> val df = spark.read.parquet("hdfs://path/to/part-00000-7493e343-a159-4a2f-b69d-77cb68ac525f-c000.lz4.parquet")
scala> df.show()
...
+------------------+------------------+----------------+--------------+-------------+-----------------+----------------------+--------------------+---------+
|cp_catalog_page_sk|cp_catalog_page_id|cp_start_date_sk|cp_end_date_sk|cp_department|cp_catalog_number|cp_catalog_page_number|      cp_description|  cp_type|
+------------------+------------------+----------------+--------------+-------------+-----------------+----------------------+--------------------+---------+
|                 1|  AAAAAAAABAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     1|In general basic ...|bi-annual|
|                 2|  AAAAAAAACAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     2|English areas wil...|bi-annual|
|                 3|  AAAAAAAADAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     3|Times could not a...|bi-annual|
|                 4|  AAAAAAAAEAAAAAAA|         2450815|          NULL|         NULL|                1|                  NULL|                NULL|bi-annual|
|                 5|  AAAAAAAAFAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     5|Classic buildings...|bi-annual|
|                 6|  AAAAAAAAGAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     6|Exciting principl...|bi-annual|
|                 7|  AAAAAAAAHAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     7|National services...|bi-annual|
|                 8|  AAAAAAAAIAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     8|Areas see early f...|bi-annual|
|                 9|  AAAAAAAAJAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     9|Intensive, econom...|bi-annual|
|                10|  AAAAAAAAKAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    10|Careful, intense ...|bi-annual|
|                11|  AAAAAAAALAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    11|At least national...|bi-annual|
|                12|  AAAAAAAAMAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    12|Girls indicate so...|bi-annual|
|                13|  AAAAAAAANAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    13|Miles see mainly ...|bi-annual|
|                14|  AAAAAAAAOAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    14|Rooms would say a...|bi-annual|
|                15|  AAAAAAAAPAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    15|Legal, required e...|bi-annual|
|                16|  AAAAAAAAABAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    16|Schools must know...|bi-annual|
|                17|  AAAAAAAABBAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    17|More than true ca...|bi-annual|
|                18|  AAAAAAAACBAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    18|Shops end problem...|bi-annual|
|                19|  AAAAAAAADBAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    19|Poor, hostile gui...|bi-annual|
|                20|  AAAAAAAAEBAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    20|Appropriate years...|bi-annual|
+------------------+------------------+----------------+--------------+-------------+-----------------+----------------------+--------------------+---------+
only showing top 20 rows
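
As referenced in step 1, here is a minimal sketch (not part of the original report) for writing a similar LZ4-compressed Parquet file with vanilla Spark, in case the attachment is not at hand. The sample data and the output path are arbitrary placeholders; only the compression codec matters here, and a file written this way may not be byte-identical to the attachment.

// Run in spark-shell with Blaze disabled, so the file is produced by the vanilla Spark writer.
val sample = spark.range(0, 1000).selectExpr("id", "concat('row_', id) AS label")
sample.write
  .option("compression", "lz4")                      // parquet-mr's LZ4 codec
  .parquet("hdfs://path/to/lz4_sample.parquet")      // placeholder path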

Expected behavior

  1. With the Blaze switch enabled, reading the Parquet file mentioned above should succeed and return the same results as with Blaze disabled (see the sketch after this list);
  2. With the Blaze switch disabled, the query already succeeds and displays the results.
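
The expectation in item 1 can be stated as a small check (a sketch, not from the original report): collecting the same file with Blaze disabled and then enabled should yield identical rows.

// Baseline with vanilla Spark.
spark.conf.set("spark.blaze.enable", false)
val expected = spark.read.parquet("hdfs://path/to/part-00000-7493e343-a159-4a2f-b69d-77cb68ac525f-c000.lz4.parquet").collect()

// Same read with Blaze enabled; currently this throws instead of returning rows.
spark.conf.set("spark.blaze.enable", true)
val actual = spark.read.parquet("hdfs://path/to/part-00000-7493e343-a159-4a2f-b69d-77cb68ac525f-c000.lz4.parquet").collect()

assert(expected.toSet == actual.toSet, "Blaze should return the same rows as vanilla Spark")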

Screenshots
Enable the Blaze switch: (screenshot of the failing query and error)

Disable the Blaze switch: (screenshot of the query succeeding and displaying results)

Additional context

Spark version: 3.5
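
Since the failure comes from the native reader's decompression path, it may help to confirm which compression codec the file footer actually declares (the Parquet format has both a legacy LZ4 codec and LZ4_RAW). Below is a minimal sketch, not from the original report, using the parquet-mr API that ships with Spark; it can be pasted into spark-shell, and the HDFS path is a placeholder.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

val input = HadoopInputFile.fromPath(
  new Path("hdfs://path/to/part-00000-7493e343-a159-4a2f-b69d-77cb68ac525f-c000.lz4.parquet"),
  new Configuration())
val reader = ParquetFileReader.open(input)
// Print the compression codec recorded for every column chunk in the footer.
reader.getFooter.getBlocks.asScala.foreach { block =>
  block.getColumns.asScala.foreach { col =>
    println(s"${col.getPath}: codec=${col.getCodec}")
  }
}
reader.close()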

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Feb 17, 2025

github-actions bot commented Mar 3, 2025

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Mar 3, 2025