Make segment-cache size configurable and use emptyDir for it #306

Closed · 3 tasks

sbernauer opened this issue Sep 28, 2022 · 6 comments
Labels: priority/medium, release/23.1.0, release-note/action-required (denotes a PR that introduces potentially breaking changes that require user action), size/M, type/feature-new

Comments

@sbernauer (Member) commented Sep 28, 2022

Currently, the segment-cache location and size are hardcoded to 300 GB:

value: "[{\"path\":\"/stackable/var/druid/segment-cache\",\"maxSize\":\"300g\"}]"

Also, /stackable/var/druid/segment-cache is not a mounted volume; it lives on the container's root filesystem.
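
Rendered into the Historical's runtime.properties, this corresponds to the following Druid property (the property name matches the hardcoded entry in properties.yaml linked in a comment further below):

  druid.segmentCache.locations=[{"path":"/stackable/var/druid/segment-cache","maxSize":"300g"}]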

We could put the cache either on a disk or in a ramdisk (by using the Memory medium for the emptyDir).
My suggestion is to put it on disk, as this matches the Druid docs:

Segments assigned to a Historical process are first stored on the local file system (in a disk cache) and then served by the Historical process

So we need an emptyDir without setting an explicit medium (i.e. backed by disk). We should also set the sizeLimit to the cache size.
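
For reference, a minimal sketch of how such a volume could look on the Historical Pod. This is illustrative only: the volume name is made up, and only the mount path is taken from above; emptyDir.sizeLimit and medium are standard Kubernetes fields:

  # illustrative Pod spec fragment
  volumes:
    - name: segment-cache
      emptyDir:
        sizeLimit: 10Gi      # set to the configured cache size
        # no "medium" set => data lands on the node's disk;
        # medium: Memory would use a ramdisk instead
  containers:
    - name: druid
      volumeMounts:
        - name: segment-cache
          mountPath: /stackable/var/druid/segment-cache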

  • segment-cache resides on an emptyDir with the correct sizeLimit
  • segment-cache size is configurable
  • segment-cache free percentage is configurable. We default to 5% and set this as the freeSpacePercent Druid attribute.

CRD proposal:
  historicals:
    roleGroups:
      default:
        replicas: 3
        config:
          resources:
            cpu:
              min: '200m'
              max: '4'
            memory:
              limit: '2Gi'
            storage:
              segmentCache: # Enum called e.g. "StorageVolumeConfig" (new)
                freePercentage: 5 # default: 5
                emptyDir: # struct EmptyDirConfig (new)
                  capacity: 10Gi
                  medium: "" # or "Memory"
                # OR
                pvc: # PvcConfig struct
                  capacity: 10Gi
                  storageClass: "ssd"

UPDATE: 04.11.22

Change of plan: since the operator framework doesn't currently support merging enum types, the solution above cannot be implemented yet. In agreement with others, a temporary solution is proposed: an implementation supporting only emptyDir storage will be made in this repository. Later, when the framework gains enum merging support, the complete solution above will be implemented. From the user's perspective, this proposal is forward compatible with the one above.

The manifest will look just like this (note the missing PVC configuration):

  historicals:
    roleGroups:
      default:
        replicas: 3
        config:
          resources:
            cpu:
              min: '200m'
              max: '4'
            memory:
              limit: '2Gi'
            storage:
              segmentCache:
                freePercentage: 5 # default: 5
                emptyDir:
                  capacity: 10Gi
                  medium: "" # or "Memory"
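
With capacity: 10Gi and freePercentage: 5, Druid would aim to keep roughly 0.5 Gi of the volume free. A sketch of what the operator could render from this example follows; the volume name and the exact property formatting are assumptions, not the final implementation:

  # illustrative Pod spec fragment
  volumes:
    - name: segment-cache
      emptyDir:
        sizeLimit: 10Gi

  # illustrative derived Druid property (runtime.properties)
  druid.segmentCache.locations=[{"path":"/stackable/var/druid/segment-cache","maxSize":"10g","freeSpacePercent":"5.0"}]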

@sbernauer sbernauer moved this to Idea/Proposal in Stackable Engineering Sep 28, 2022
@lfrancke lfrancke changed the title Make segent-cache size configurable and use emptyDir for it Make segment-cache size configurable and use emptyDir for it Sep 30, 2022
@lfrancke (Member)

"on disk" -> is this an externally provided PV?

bors bot pushed a commit to stackabletech/stackablectl that referenced this issue Sep 30, 2022
## Description

Run with
`stackablectl --additional-demos-file demos/demos-v1.yaml --additional-stacks-file stacks/stacks-v1.yaml demo install nifi-kafka-druid-water-level-data`

Tested the demo with 2,500,000,000 records


Hi all, here's a short summary of observations from the water-level demo:

NiFi uses the content-repo PVC but keeps it at ~50% usage => should be fine forever
Actions:
* Increase the content-repo from 5 to 10 GB, better safe than sorry. I was able to crash it by using large queues and stalling processors.

Kafka uses a PVC (currently 15 GB) => should work fine for ~1 week
Actions:
* Look into retention settings (low priority, as it should work for ~1 week) so that it works forever

Druid uses S3 for deep storage (the S3 bucket has 15 GB). But currently it also caches *everything* locally at the Historical, because we set `druid.segmentCache.locations=[{"path"\:"/stackable/var/druid/segment-cache","maxSize"\:"300g"}]` (hardcoded in https://github.com/stackabletech/druid-operator/blob/45525033f5f3f52e0997a9b4d79ebe9090e9e0a0/deploy/config-spec/properties.yaml#L725)
This does *not* really affect the demo, as 100,000,000 records (let's call it ~1 week of data) amount to ~400 MB.
I think the main problem with the demo is that queries take > 5 minutes to complete and Superset shows timeouts.
The Historical pod suspiciously uses exactly one CPU core, and the queries are really slow for a "big data" system IMHO.
This could be because either Druid is only using a single core, or because we don't set any resources (yet!) and the node does not have more cores available. Going to research that.
Actions:
* Created stackabletech/druid-operator#306
* In the meantime, configure an overwrite in the demo: `druid.segmentCache.locations=[{"path"\:"/stackable/var/druid/segment-cache","maxSize"\:"3g","freeSpacePercent":"5.0"}]` (see the sketch after this list)
* Research the slow query performance
* Have a look at the queries the Superset dashboard executes and optimize them
* Maybe we should bump the druid-operator version in the demo (e.g. create a release 22.09-druid, which is basically 22.09 with a newer druid-operator version). That way we get stable resources.
* Enable Druid auto-compaction to reduce the number of segments
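
A hypothetical sketch of how the demo could carry that overwrite, assuming Stackable's configOverrides mechanism and Druid's runtime.properties file name (the exact placement in the demo manifests may differ):

  historicals:
    configOverrides:
      runtime.properties:
        # illustrative override, mirroring the property quoted above
        druid.segmentCache.locations: '[{"path":"/stackable/var/druid/segment-cache","maxSize":"3g","freeSpacePercent":"5.0"}]'
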
@sbernauer (Member, Author)

Nope, it's an emptyDir. Normally that's a spinning disk or SSD on the k8s node. As it's a cache, there is no point in persisting it via a PVC.

@soenkeliebau soenkeliebau moved this from Idea/Proposal to Refinement: Waiting for in Stackable Engineering Oct 11, 2022
@sbernauer sbernauer moved this from Refinement: Waiting for to Refinement Acceptance: Waiting for in Stackable Engineering Oct 17, 2022
@sbernauer sbernauer moved this from Refinement Acceptance: Waiting for to Refinement: In Progress in Stackable Engineering Oct 17, 2022
@lfrancke lfrancke moved this from Refinement: In Progress to Ready for Development in Stackable Engineering Oct 17, 2022
@razvan razvan self-assigned this Oct 17, 2022
@razvan razvan moved this from Ready for Development to Development: In Progress in Stackable Engineering Oct 17, 2022
@razvan razvan linked a pull request Oct 17, 2022 that will close this issue
@razvan razvan moved this from Development: In Progress to Development: Waiting for Review in Stackable Engineering Oct 18, 2022
@sbernauer sbernauer self-assigned this Oct 19, 2022
@sbernauer sbernauer moved this from Development: Waiting for Review to Development: In Review in Stackable Engineering Oct 19, 2022
@sbernauer sbernauer moved this from Development: In Review to Development: In Progress in Stackable Engineering Oct 20, 2022
@sbernauer sbernauer moved this from Development: In Progress to Development: In Review in Stackable Engineering Oct 20, 2022
@razvan razvan moved this from Development: In Review to Development: In Progress in Stackable Engineering Oct 26, 2022
@razvan razvan moved this from Development: In Progress to Development: Waiting for Review in Stackable Engineering Oct 30, 2022
@soenkeliebau (Member)

soenkeliebau commented Oct 31, 2022

The integration test for this failed, so it should be investigated some more.

https://ci.stackable.tech/job/druid-operator-it-custom/32/

@soenkeliebau soenkeliebau moved this from Development: Waiting for Review to Development: In Progress in Stackable Engineering Oct 31, 2022
@fhennig (Contributor)

fhennig commented Oct 31, 2022

Maybe run it on AWS EKS 1.22 (the nightly runs on that) instead of IONOS 1.24.

@razvan razvan moved this from Development: In Progress to Development: Waiting for Review in Stackable Engineering Nov 1, 2022
@fhennig fhennig moved this from Development: Waiting for Review to Development: In Review in Stackable Engineering Nov 1, 2022
@fhennig fhennig removed their assignment Nov 2, 2022
@fhennig (Contributor)

fhennig commented Nov 2, 2022

I've unassigned myself, since this is going back into a longer "In Progress" phase.

@razvan (Member)

razvan commented Nov 3, 2022

Blocked by: stackabletech/operator-rs#497

bors bot pushed a commit that referenced this issue Nov 4, 2022
Part of: #306 

This PR has been extracted from #320, which will be closed. The part that was left out is the actual configuration of the segment cache size. That will be implemented in a future PR and will require a new operator-rs release.

:green_circle: CI https://ci.stackable.tech/view/02%20Operator%20Tests%20(custom)/job/druid-operator-it-custom/34/


Co-authored-by: Sebastian Bernauer <[email protected]>
bors bot pushed a commit that referenced this issue Nov 14, 2022
# Description

This doesn't add or change any functionality.

Fixes #335 

Required for #306 

This is based on #333 and has to be merged after that.

:green_circle: CI: https://ci.stackable.tech/view/02%20Operator%20Tests%20(custom)/job/druid-operator-it-custom/39/

## Review Checklist

- [x] Code contains useful comments
- [x] CRD change approved (or not applicable)
- [x] (Integration-)Test cases added (or not applicable)
- [x] Documentation added (or not applicable)
- [x] Changelog updated (or not applicable)
- [x] Cargo.toml only contains references to git tags (not specific commits or branches)
- [x] Helm chart can be installed and deployed operator works (or not applicable)

Once the review is done, comment `bors r+` (or `bors merge`) to merge. [Further information](https://bors.tech/documentation/getting-started/#reviewing-pull-requests)
@adwk67 adwk67 self-assigned this Nov 16, 2022
@bors bors bot closed this as completed in 1978a8e Nov 16, 2022
@razvan razvan moved this from Development: In Review to Acceptance: Waiting for in Stackable Engineering Nov 17, 2022
@lfrancke lfrancke moved this from Acceptance: Waiting for to Done in Stackable Engineering Nov 22, 2022
@lfrancke lfrancke added the release-note/action-required Denotes a PR that introduces potentially breaking changes that require user action. label Sep 9, 2024