
Gluten native Parquet writer ignores spark.hadoop Parquet write configs #12071

@wecharyu

Description


Backend

VL (Velox)

Bug description

Problem

Spark builds the write-side Hadoop configuration with sessionState.newHadoopConfWithOptions(options) before invoking the file writer. This makes configs provided as spark.hadoop.<key> visible to the underlying Parquet writer as <key>.

Gluten's Velox native write path currently builds the native write parameters from the write options only, so configs coming from Spark's Hadoop configuration (including spark.hadoop.* entries) are not propagated to the native Parquet writer.
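The translation Spark performs can be sketched in plain Python. The dicts stand in for SparkConf and the Hadoop Configuration, and the function name mirrors but does not reproduce Spark's actual sessionState.newHadoopConfWithOptions; both are illustrative only:

```python
# Sketch: how spark.hadoop.* entries reach the write-side Hadoop
# configuration. Dicts stand in for SparkConf / Hadoop Configuration;
# the helper name is illustrative, not Spark's real API.

SPARK_HADOOP_PREFIX = "spark.hadoop."

def new_hadoop_conf_with_options(spark_conf, write_options):
    """Strip the spark.hadoop. prefix from session configs, then
    layer per-write options on top (options take precedence)."""
    hadoop_conf = {
        key[len(SPARK_HADOOP_PREFIX):]: value
        for key, value in spark_conf.items()
        if key.startswith(SPARK_HADOOP_PREFIX)
    }
    hadoop_conf.update(write_options)
    return hadoop_conf

spark_conf = {"spark.hadoop.parquet.enable.dictionary": "false"}
conf = new_hadoop_conf_with_options(spark_conf, {})
print(conf["parquet.enable.dictionary"])  # -> false
```

The bug is that Gluten's native write parameters are assembled from the second argument (the write options) only, so the prefix-stripped session configs shown above never reach the Velox writer.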

Reproduction

Enable the Velox native writer and set a Parquet write config through Spark HadoopConf, for example:

spark.conf.set("spark.hadoop.parquet.enable.dictionary", "false")

Expected Behavior

The native writer should respect HadoopConf-backed Parquet write configs in the same way Spark's native file write path does.

For spark.hadoop.parquet.enable.dictionary=false, the written Parquet footer should not contain dictionary encodings such as RLE_DICTIONARY or PLAIN_DICTIONARY.

Actual Behavior

The native writer ignores the spark.hadoop.* config, and the Parquet footer still shows dictionary encodings.

Gluten version

main branch

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), triage
