Skip to content
This repository was archived by the owner on Dec 20, 2018. It is now read-only.
This repository was archived by the owner on Dec 20, 2018. It is now read-only.

Spark 2.2 + spark-avro_2.11-4.0.0.jar is very slow for certain job comparing Spark 1.6.3 + spark-avro_2.10-2.0.1.jar  #280

@java8964

Description

@java8964

We have some jobs that input source is AVRO files, run very slow in Spark 2, but works fine in Spark 1. The slow behavior looks like Spark2 AVRO is almost hang when reading the avro files. From the Spark UI, 100 executors only read about 2G input data in 1 hour. The issue is that this slowness only happened in one particular AVRO files. Other schema avro files read very fast. And this happens only in Spark 2.

While we have this issue on production, I also happen to find out a simple performance issue using this particular avro schema data in a simple test case in Spark 2, but works fine in Spark 1.

Here is my test case:
The data is about 1.3G avro file, with the following schema, and we use 12 executors and 4g executor memory for both Spark2 and Spark1
{
"namespace" : "com.xxx",
"type" : "record",
"name" : "Lists",
"fields" : [
{"name" : "account_id", "type" : "long"},
{"name" : "list_id", "type" : "string"},
{"name" : "sequence_id", "type" : ["null", "int"]} ,
{"name" : "name", "type" : ["null", "string"]},
{"name" : "state", "type" : ["null", "string"]},
{"name" : "description", "type" : ["null", "string"]},
{"name" : "dynamic_filtered_list", "type" : ["null", "int"]},
{"name" : "filter_criteria", "type" : ["null", "string"]},
{"name" : "created_at", "type" : ["null", "long"]},
{"name" : "updated_at", "type" : ["null", "long"]},
{"name" : "deleted_at", "type" : ["null", "long"]},
{"name" : "favorite", "type" : ["null", "int"]},
{"name" : "delta", "type" : ["null", "boolean"]},
{
"name" : "list_memberships", "type" : {
"type" : "array", "items" : {
"name" : "ListMembership", "type" : "record",
"fields" : [
{"name" : "channel_id", "type" : "string"},
{"name" : "created_at", "type" : ["null", "long"]},
{"name" : "created_source", "type" : ["null", "string"]},
{"name" : "deleted_at", "type" : ["null", "long"]},
{"name" : "sequence_id", "type" : ["null", "int"]}
]
}
}
}
]
}
On Spark 2, (I tested with spark-avro_2.11-4.0.0.jar and spark-avro_2.11-3.2.0.jar with Spark 2.2.0, both have the issue for the following simple case:)

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.types.StringType
import com.databricks.spark.avro._
val rawlists = spark.read.avro("/data/lists")
val lists = rawlists.withColumn("partitioner", $"account_id".cast(StringType).substr(-2,2))
lists.write.partitionBy("partitioner").mode(SaveMode.Overwrite).parquet("/pdata/lists")

It took 4.5 hours to finish on Spark2, and the executors have no status update for hours (only input size grow very slowly)

The following code run for about 15 minutes in Spark1.6.3 + spark-avro_2.10-2.0.1.jar
val rawlists = sqlContext.read.avro("/data/lists")
val lists = rawlists.withColumn("partitioner", $"account_id".cast(StringType).substr(-2,2))
lists.write.partitionBy("partitioner").mode(SaveMode.Overwrite).parquet("/pdata/lists")

I am not sure where is the problem, but in Spark2, for other avro file (different dataset/schema), performance is fine. I think the problem is the spark-avro library reading this particular kind of avro files, as it is very slow during the scan/read stage. Any idea?

Thanks

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions