Skip to content

Ensure a sequential write to DM by always using a single output#898

Draft
thorkildcognite wants to merge 1 commit intomasterfrom
enforce-sequential
Draft

Ensure a sequential write to DM by always using a single output#898
thorkildcognite wants to merge 1 commit intomasterfrom
enforce-sequential

Conversation

@thorkildcognite
Copy link
Contributor

We want to go slow to go fast with data modeling, and never write with more than one partition at a time. This change ensures this by repartitioning down to 1 partition before emitting the result.

Since we almost always have 200 partitions going into the write, we would always clamp this down to the ideal, which was set to 1. But, if we actually had less than 50 partitions going into this, it would not be repartitioned and we would actually try to write to DMS with 50 parallel writes (potentially).

I first didn't find any way to trigger this, but I think I can trigger it by reading directly from time series, sequences etc., which have 50 or smaller as default partitions.

This is very hard to verify in production, but there should be no clear downside to ensuring this.

We want to go slow to go fast with data modeling, and never write with
more than one partition at a time. This ensures this by repartitioning
down to 1 partition before emitting the result.

Since we almost always had 200 partitions going into the write, we
would always clamp this down to the ideal, which was set to 1. But, if
we actually had less than 50 partitions going into this, it would not
be repartitioned and we would actually try to write to DMS with 50
parallel writes.

I first didn't find any way to trigger this, but I think I can trigger
it by reading directly from time series, sequences etc., which have 50
or smaller as default partitions.

This is very hard to verify in production, but there should be no
clear downside to ensuring this.
@thorkildcognite thorkildcognite requested a review from a team as a code owner February 8, 2024 01:59
@codecov
Copy link

codecov bot commented Feb 8, 2024

Codecov Report

Merging #898 (992c7c3) into master (b77b607) will increase coverage by 0.00%.
The diff coverage is 75.00%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #898   +/-   ##
=======================================
  Coverage   80.76%   80.77%           
=======================================
  Files          43       43           
  Lines        2839     2840    +1     
  Branches      115      121    +6     
=======================================
+ Hits         2293     2294    +1     
  Misses        546      546           
Files Coverage Δ
...rc/main/scala/cognite/spark/v1/DefaultSource.scala 93.08% <75.00%> (+0.03%) ⬆️

@thorkildcognite thorkildcognite marked this pull request as draft February 8, 2024 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant