-
@mmaslankaprv and @rystsov please take a look
-
Without having thought about this deeply, it seems like option 3 is the most attractive, with option 1 potentially being the quickest to implement. Could you add a pro/con list above for a solution 4 where we roll the controller segment more often? For example, we could have a policy that if the controller log has received an update (a change to users, a change to partition assignments, a topic creation), we'd roll the controller segment within 10 minutes or so. @Lazin I'm wondering if this general problem encountered with the controller log may apply to normal partitions for some workloads. For example, a system with, say, 100 partitions each receiving around 1 KB/s would have upwards of 100 GB not backed up for a long time until they all reached 1 GB and rolled (at 1 KB/s, filling a 1 GB segment takes roughly 12 days). How do we handle this case today? Would option 3 or 4 handle it?
-
We have the following problem. To be able to run disaster recovery at the cluster level we need information stored inside the controller log. But the controller log grows slowly (in normal circumstances), and the archival storage subsystem only uploads “sealed” segments that won’t receive any new updates. For the controller log we don’t want to wait until the last segment is sealed; we want to upload recent data as frequently as is rationally possible.
This means that the controller log upload has to be a special case (the general archival mechanism is currently disabled for it, so it isn’t uploaded at all).
S3 doesn’t have an “append” operation; we can only upload and re-upload whole objects. This means that a naive implementation would have to re-upload the last segment many times as it grows, eventually to a large size. That would waste a lot of traffic and take resources away from other subsystems in redpanda.
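To make the re-upload overhead concrete, here's a quick back-of-the-envelope (the 1 MiB update granularity and 1 GiB roll size are illustrative assumptions, not measurements):

```python
# Naive approach: re-upload the whole open segment after every 1 MiB
# of new controller data, until the segment rolls at 1 GiB.
chunk_mib = 1
segment_mib = 1024
uploads = segment_mib // chunk_mib
total_mib = sum(i * chunk_mib for i in range(1, uploads + 1))
print(f"{uploads} uploads, {total_mib / 1024:.0f} GiB transferred "
      f"for 1 GiB of data")  # ~512 GiB, i.e. ~512x write amplification
```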
Possible solution 1:
This is the simplest solution to this problem. We can generate manifest files with the information that we need to recover the cluster and upload these manifest files instead of the controller log. (A minimal sketch follows the pro/con list below.)
Pros:
Cons:
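A minimal sketch of what this could look like with boto3. The manifest fields here (topics, users, a revision counter) are assumptions for illustration, not a settled schema:

```python
import json
import boto3  # assumes AWS credentials are already configured

s3 = boto3.client("s3")

# Hypothetical recovery manifest; the exact contents are an assumption.
manifest = {
    "revision": 42,
    "topics": [{"name": "payments", "partitions": 100, "replication": 3}],
    "users": ["admin", "app-writer"],
}

# Re-upload the whole manifest whenever cluster metadata changes; unlike
# the controller log it stays small, so re-uploading it is cheap.
s3.put_object(
    Bucket="redpanda-archive",  # placeholder bucket name
    Key="cluster-recovery/manifest.json",
    Body=json.dumps(manifest).encode(),
)
```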
Possible solution 2:
Use multipart upload for the controller log. We can create parts and upload them as new data is added to the controller log. When the log is sealed we complete the multipart upload. During the recovery phase we need to find all unfinished multipart uploads and complete them. (A minimal sketch follows the pro/con list below.)
Pros:
Cons:
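A minimal sketch with boto3. The `chunks` iterable stands in for however the controller log would hand new batches to the uploader, which is an assumption about the integration, not the actual internal API:

```python
import boto3  # assumes AWS credentials are already configured

s3 = boto3.client("s3")

def upload_segment_multipart(bucket, key, chunks):
    """Stream a slowly growing segment to S3 as multipart-upload parts."""
    # Start the multipart upload when the segment is opened.
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []
    # Note: S3 requires every part except the last to be at least 5 MiB,
    # which is awkward for a log that grows by a few KB at a time.
    for part_number, chunk in enumerate(chunks, start=1):
        resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                              PartNumber=part_number, Body=chunk)
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
    # When the segment is sealed, stitch the parts into one object.
    s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                                 MultipartUpload={"Parts": parts})
```

During recovery, `list_multipart_uploads` (plus `list_parts` for each unfinished upload) would let us find and complete whatever was in flight when the cluster died.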
Possible solution 3:
Upload the controller log in chunks without doing a multipart upload. Basically, this means reimplementing the multipart upload feature on the redpanda side. When all segment data is uploaded we will have to merge the chunks into one object, either using the multipart copy API (won’t work in GCS) or by re-uploading the full segment and deleting the chunks (will work in GCS).
Pros:
Cons:
This is how it would look in practice. Right now we upload segments using names like this:
<prefix>/<ns>/<topic>/<partition-id>_<revision>/<base-offset>_<term>_v1.log
For the controller log, instead of an object name with this structure we would have the following:
<prefix>/<ns>/<topic>/<partition-id>_<revision>/<base-offset>_<term>_v1.log/<start-offset>_<end-offset>.part
The prefix for all parts of the same segment and for the segment itself will be the same, so it will be possible to quickly locate all related components. A sketch of the chunk upload and the final merge follows.
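A minimal sketch with boto3, using the S3 merge path (`upload_part_copy`); the bucket and segment names are placeholders:

```python
import boto3  # assumes AWS credentials are already configured

s3 = boto3.client("s3")
bucket = "redpanda-archive"  # placeholder
segment_key = "prefix/kafka/controller/0_1/0_1_v1.log"  # placeholder

def upload_chunk(start_offset, end_offset, data):
    # Each chunk is an ordinary object under the segment's prefix,
    # named by its offset range, e.g. ".../0_1_v1.log/0_4096.part".
    key = f"{segment_key}/{start_offset}_{end_offset}.part"
    s3.put_object(Bucket=bucket, Key=key, Body=data)

def merge_chunks():
    # When the segment is sealed, merge the chunks server-side via the
    # multipart copy API. (On GCS we'd instead re-upload the full
    # segment and delete the chunks.)
    objs = s3.list_objects_v2(Bucket=bucket, Prefix=segment_key + "/")
    # Sort numerically by start offset; plain string order would put
    # "10_..." before "9_...".
    keys = sorted((o["Key"] for o in objs.get("Contents", [])),
                  key=lambda k: int(k.rsplit("/", 1)[1].split("_")[0]))
    upload_id = s3.create_multipart_upload(
        Bucket=bucket, Key=segment_key)["UploadId"]
    parts = []
    for i, key in enumerate(keys, start=1):
        resp = s3.upload_part_copy(
            Bucket=bucket, Key=segment_key, UploadId=upload_id,
            PartNumber=i, CopySource={"Bucket": bucket, "Key": key})
        parts.append({"PartNumber": i,
                      "ETag": resp["CopyPartResult"]["ETag"]})
    s3.complete_multipart_upload(Bucket=bucket, Key=segment_key,
                                 UploadId=upload_id,
                                 MultipartUpload={"Parts": parts})
    for key in keys:
        s3.delete_object(Bucket=bucket, Key=key)
```

Note that `upload_part_copy` has the same 5 MiB minimum on all parts except the last, so chunks may need to be batched before merging.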
Possible solution 4:
We can roll segments of the controller log more often, e.g. on a timer once the log has received new updates. (A minimal sketch follows the pro/con list below.)
Pros:
Cons:
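A minimal sketch of the timer-based roll policy suggested above. All names here (`TimedRollPolicy`, the `roll_segment` callback, `on_append` hook) are hypothetical, not redpanda's actual internals:

```python
import threading
import time

ROLL_INTERVAL = 10 * 60  # seconds; the "within 10 minutes" from the thread

class TimedRollPolicy:
    """Roll the open controller segment at most ROLL_INTERVAL after it
    first receives an update, instead of waiting for the size threshold."""

    def __init__(self, roll_segment):
        self._roll_segment = roll_segment  # callback into the log layer
        self._dirty_since = None
        self._lock = threading.Lock()
        threading.Thread(target=self._loop, daemon=True).start()

    def on_append(self):
        # Called whenever the controller log receives an update
        # (user change, partition reassignment, topic creation, ...).
        with self._lock:
            if self._dirty_since is None:
                self._dirty_since = time.monotonic()

    def _loop(self):
        while True:
            time.sleep(1)
            with self._lock:
                due = (self._dirty_since is not None and
                       time.monotonic() - self._dirty_since >= ROLL_INTERVAL)
                if due:
                    self._dirty_since = None
            if due:
                self._roll_segment()  # sealed segment is now archivable
```

The upside of this approach is that the existing archival path stays unchanged: a rolled segment is sealed, so the regular upload machinery picks it up with no special-casing.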