Labels: enhancement (New feature or request)
Description
Background
Spark Structured Streaming can be used to read data from Kafka with records in Avro format. This would make it possible to maintain proper checkpoints and exactly-once semantics.
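For context, a minimal sketch of what this looks like in plain Spark, assuming record values are raw Avro (not the Confluent wire format, which prepends a schema-registry header) and using hypothetical broker, topic, schema, and path names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.from_avro

val spark = SparkSession.builder().appName("kafka-avro-ingest").getOrCreate()
import spark.implicits._

// Avro schema of the record values (hypothetical example schema).
val avroSchema =
  """{"type": "record", "name": "Event", "fields": [
    |  {"name": "id", "type": "long"},
    |  {"name": "payload", "type": "string"}
    |]}""".stripMargin

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
  .option("subscribe", "events")                       // assumed topic name
  .load()

// 'value' arrives as binary; from_avro() decodes it (requires the
// spark-avro package on the classpath).
val decoded = stream
  .select(from_avro($"value", avroSchema).as("data"))
  .select("data.*")

// The checkpoint location is what provides restartable, exactly-once
// writes to a fault-tolerant sink such as Parquet.
val query = decoded.writeStream
  .format("parquet")
  .option("path", "/data/events")                      // assumed output path
  .option("checkpointLocation", "/checkpoints/events") // assumed checkpoint path
  .start()
```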
Feature
Add an experimental Kafka/Avro source.
Example [Optional]
--
Proposed Solution [Optional]
Ideas so far:
- Use foreachBatch(). Need to make sure checkpoints are updated. 🛑 This doesn't work because a streaming DataFrame can't be converted to a batch DataFrame; you can only provide a function that processes batch DataFrames (see the first sketch after this list). Implementing this in Pramen would require interface changes across incremental sources.
- Use Kafka batch reads and save the offsets in the offsets bookkeeping table (see the second sketch after this list).
- If checkpoints are not updated, data can be saved to a temporary location with batch ID generation.
- Abstract away streaming ingestion as a separate job that performs the full source+save logic, without splitting it.
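A minimal sketch of the foreachBatch() idea from the first bullet, illustrating the limitation: Spark hands the callback a batch DataFrame per micro-batch, but the streaming DataFrame itself can never be turned into a batch one. `decoded` is the streaming DataFrame from the Background sketch; the save step is hypothetical, not Pramen's actual API:

```scala
import org.apache.spark.sql.DataFrame

// foreachBatch() invokes the function once per micro-batch with a *batch*
// DataFrame and the batch id. Consumed offsets are recorded in the
// checkpoint location, so checkpoints are updated automatically.
val query = decoded.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Hypothetical save step; in Pramen this would have to call into the
    // existing sink logic, which is why interface changes would be needed.
    batchDf.write.mode("append").parquet(s"/data/events/batch_id=$batchId")
  }
  .option("checkpointLocation", "/checkpoints/events") // assumed checkpoint path
  .start()
```

And a sketch of the batch-read idea from the second bullet, assuming offsets are loaded from and written back to the bookkeeping table (the table access itself is Pramen-specific, so it is shown only as comments):

```scala
import org.apache.spark.sql.functions.max

// Would be loaded from the offsets bookkeeping table; this literal stands
// in for that lookup (topic "events", partitions 0 and 1, offsets inclusive).
val startingOffsets = """{"events": {"0": 1234, "1": 5678}}"""

val batchDf = spark.read // note: read, not readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
  .option("subscribe", "events")                       // assumed topic name
  .option("startingOffsets", startingOffsets)
  .option("endingOffsets", "latest")
  .load()

// ... decode and save batchDf as in the Background sketch ...

// After the save succeeds, persist the highest consumed offset per partition
// back to the bookkeeping table; the next run starts at max_offset + 1.
val nextOffsets = batchDf
  .groupBy("partition")
  .agg(max("offset").as("max_offset"))
```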