Since the recent change from #793, we obtain a schema mismatch when polling alerts with the Fink client:

```
line 425, in _decode_avro_alert
    return fastavro.schemaless_reader(avro_alert, schema)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "fastavro/_read.pyx", line 1141, in fastavro._read.schemaless_reader
  File "fastavro/_read.pyx", line 1168, in fastavro._read.schemaless_reader
  File "fastavro/_read.pyx", line 747, in fastavro._read._read_data
  File "fastavro/_read.pyx", line 620, in fastavro._read.read_record
  File "fastavro/_read.pyx", line 739, in fastavro._read._read_data
  File "fastavro/_read.pyx", line 527, in fastavro._read.read_union
IndexError: list index out of range
```
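For reference, this kind of mismatch can be spotted without Spark by diffing the two Avro schemas directly: a field that is nullable on one side is encoded as a `["null", type]` union, while the non-nullable side declares the bare type. A minimal sketch (the field name `tracklet` and the two toy schemas are made up for illustration; the real schemas come from the producer and the `fink_consumer`):

```python
import json

def nullable_fields(schema):
    """Return the names of record fields declared as a nullable union (["null", ...])."""
    out = set()
    for field in schema.get("fields", []):
        t = field["type"]
        if isinstance(t, list) and "null" in t:
            out.add(field["name"])
    return out

# Hypothetical minimal schemas mimicking the mismatch: the streaming side
# declares the field as nullable, the static side does not.
streaming = json.loads("""{"type": "record", "name": "alert",
    "fields": [{"name": "tracklet", "type": ["null", "string"]}]}""")
static = json.loads("""{"type": "record", "name": "alert",
    "fields": [{"name": "tracklet", "type": "string"}]}""")

# Symmetric difference: fields whose nullability differs between the two schemas
mismatch = nullable_fields(streaming) ^ nullable_fields(static)
print(mismatch)  # → {'tracklet'}
```

Because fastavro decodes a nullable field by first reading a union index, feeding it data written with the other schema makes it interpret arbitrary bytes as that index, which is how the `IndexError` in `read_union` above can arise.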
### Inspecting the schema

The three columns have a different `nullable` property (nullable in streaming, non-nullable in static). Note that this can also be obtained by printing the schema from `fink_consumer` (TODO: add a CLI argument to print the schema).

### Why is this weird?

First, the Spark documentation states:

> When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.

So it makes no sense that our static DataFrame contains non-nullable fields. Second, the code producing these fields has not changed. It comes from `bin/raw2science.py` (added years ago):

```python
# get the schema from one file
schema = schema_converter(spark.read.format('parquet').load(files[0]).schema)
```

or from the streaming DataFrame directly, in `distribute.py#L219`:

```python
df_tmp = df_tmp.selectExpr(cnames)
schema = schema_converter.to_avro(df_tmp.schema)
```

Funny enough, if I load just one file instead of the entire folder, the fields are nullable as expected.
### Spark issue

I raised this issue in the Spark bug tracker: https://issues.apache.org/jira/browse/SPARK-48492

### Solution

At this point, I see two temporary solutions:
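As an illustration of one possible workaround (not necessarily the fix that was adopted): since Spark makes everything nullable when reading Parquet anyway, the static-side Avro schema could be normalized so that every field is declared nullable before it is used to decode alerts. A rough sketch on plain Avro schema dicts (the helper name `force_nullable` and the toy schema are made up):

```python
def force_nullable(schema):
    """Return a copy of an Avro record schema with every field type wrapped
    in a ["null", type] union, mimicking Spark's behaviour on Parquet reads."""
    fields = []
    for field in schema["fields"]:
        t = field["type"]
        if not (isinstance(t, list) and "null" in t):
            t = ["null", t]  # wrap the bare type in a nullable union
        fields.append({**field, "type": t})
    return {**schema, "fields": fields}

# Hypothetical non-nullable static schema
static = {"type": "record", "name": "alert",
          "fields": [{"name": "tracklet", "type": "string"}]}

print(force_nullable(static)["fields"][0]["type"])  # → ['null', 'string']
```

The function is idempotent, so it is safe to apply even to schemas that are already nullable.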