Add experimental Kafka/Avro source #644

@yruslan

Description

Background

Spark Structured Streaming can be used to read data from Kafka when records are in Avro format. This would make it possible to maintain proper checkpoints and achieve exactly-once semantics.
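As a rough illustration, a minimal Structured Streaming pipeline that reads Avro records from Kafka could look like the sketch below. The broker address, topic name, schema, and paths are placeholders, and the spark-avro module is assumed to be on the classpath:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

object KafkaAvroStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-avro-source").getOrCreate()

    // Avro schema of the record value (hypothetical; in practice it would
    // come from configuration or a schema registry).
    val avroSchema =
      """{"type": "record", "name": "MyRecord",
        | "fields": [{"name": "id", "type": "long"},
        |            {"name": "name", "type": "string"}]}""".stripMargin

    // Read the topic as a streaming DataFrame.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "my-topic")
      .option("startingOffsets", "earliest")
      .load()

    // Kafka delivers the payload as binary; decode the 'value' column with
    // from_avro() from the spark-avro module.
    val decoded = df
      .select(from_avro(col("value"), avroSchema).as("data"))
      .select("data.*")

    // A checkpoint location lets Spark track consumed offsets, which is what
    // enables exactly-once semantics for supported sinks.
    val query = decoded.writeStream
      .format("parquet")
      .option("path", "/data/output/my-topic")
      .option("checkpointLocation", "/data/checkpoints/my-topic")
      .start()

    query.awaitTermination()
  }
}
```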

Feature

Add experimental Kafka/Avro source.

Example [Optional]

--

Proposed Solution [Optional]

Ideas so far:

  • Use foreachBatch(). We need to make sure checkpoints are updated.
    🛑 This doesn't work because you can't convert a streaming DataFrame to a batch DataFrame. You can only provide a function that processes batch DataFrames (see the first sketch after this list). Implementing this in Pramen would require interface changes across incremental sources.
  • Use Kafka batch reads and save offsets in the offset bookkeeping table (see the second sketch after this list).
  • If checkpoints are not updated, data can be saved to a temporary location with batch ID generation.
  • Abstract away streaming ingestion as a separate job that performs the full source+save logic, and do not split it.
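To make the first idea and its limitation concrete, here is a minimal sketch (broker, topic, and paths are placeholders, and the parquet write merely stands in for Pramen's save logic). foreachBatch() only ever hands batch DataFrames to a callback, so a source interface that must return a batch DataFrame to the framework can't be built on top of it without inverting control:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

object ForeachBatchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("foreach-batch-sketch").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "my-topic")
      .load()

    val query = stream.writeStream
      // Process whatever is available, then stop: close to batch-job behavior.
      .trigger(Trigger.Once())
      .option("checkpointLocation", "/data/checkpoints/my-topic")
      // The callback receives each micro-batch as a regular batch DataFrame.
      // Control is inverted: Spark calls us with the DataFrame, so a source
      // that must *return* a batch DataFrame to its caller cannot be
      // implemented this way without interface changes.
      .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
        batchDf.write.mode("append").parquet(s"/data/output/batch_$batchId")
      }
      .start()

    query.awaitTermination()
  }
}
```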
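For the second idea, a sketch of a Kafka batch read driven by offsets from a bookkeeping table. The offset values, broker, and topic are hypothetical, and the bookkeeping read/write itself is Pramen-specific, so it is only hinted at here:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max}

object KafkaBatchOffsetsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-batch-offsets").getOrCreate()

    // Start offsets loaded from the bookkeeping table (hypothetical values).
    // The JSON gives the first offset to read per partition (inclusive), so a
    // real implementation would store last-read offset + 1.
    val startOffsets = """{"my-topic": {"0": 1500, "1": 1432}}"""

    // A batch (non-streaming) read of the topic up to the latest offsets.
    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "my-topic")
      .option("startingOffsets", startOffsets)
      .option("endingOffsets", "latest")
      .load()

    // ... decode and save the data here ...

    // After a successful save, record the next start offset per partition;
    // this replaces Spark's streaming checkpoints. Partitions with no new
    // data would keep their previously stored offsets.
    val nextOffsets = df
      .groupBy(col("partition"))
      .agg((max(col("offset")) + 1).as("nextOffset"))
      .collect()

    // Persisting to the bookkeeping table is Pramen-specific and omitted.
    nextOffsets.foreach(r => println(s"partition=${r.get(0)} nextOffset=${r.get(1)}"))
  }
}
```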

Metadata

Assignees: No one assigned
Labels: enhancement (New feature or request)
Type: No type
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests
