Making Coherence CE more cloud friendly #69
-
@javafanboy The above certainly raises a number of questions worth discussing, and you may be right -- we could probably make certain things more dynamic and configurable at runtime, so that they work better in elastic environments such as the cloud. The fundamental question here, however, is whether we allow users to sacrifice HA in order to avoid over-provisioning, and I'm actually inclined to say "maybe".

As you noted, the expectation we currently have is that there is enough capacity after the failure of the largest component (site/AD/AZ in this case) for both primaries and the configured number of replicas (to use your terminology ;-)). In other words, if you are running across N sites, you need to provision N/(N-1) of the required capacity to begin with (3/2 in your example) in order to have enough capacity after a site/AD/AZ loss. Traditionally, this hasn't been a huge problem with on-prem deployments, which typically targeted machine-, or at best rack-safety -- when you are running a 20-machine cluster, adding a 21st is only a 5% cost increase, 11/10 is only 10%, etc. It does get a bit more expensive with racks, but even that was usually manageable. Moving to the cloud, however, makes the 3/2 scenario typical, which means 50% additional cost to achieve HA.

While I think our current behavior is the correct default, I do agree that users should have a choice in the matter, as they are ultimately the ones who need to make the tradeoff between HA and TCO. Providing an option to relax the guarantees we provide, and to sacrifice replicas when there isn't enough capacity available, probably makes sense -- and may be a significantly better option than trying to create new backups and crashing completely due to OOM errors, which is something we've seen happen :-(

You also bring up an interesting point regarding multiple backups and reads from backups -- until recently, we didn't have this feature and all reads were from the primary. Now that we do, I can see users doing what you are suggesting and configuring the number of backups so that each piece of data is available locally in each AZ (2 backups for the common 3-AZ setup). However, if you lose a whole AZ, it doesn't really make sense to keep 2 backups -- one is sufficient, as it still ensures both HA and locality of reads for each of the 2 remaining AZs, so maybe we need a notion of "dynamic backups" as well. This would still require 3/2 over-provisioning to begin with, but at least you are getting something for it: faster reads. And you wouldn't need any spare capacity in the remaining 2 AZs in case of an AZ failure, as they would simply either promote the existing backups for the failed primary members or throw them away, while remaining HA. I think that feature makes a lot of sense.

Another thing worth looking into is storing backups on disk instead of in memory. This would certainly reduce the associated RAM cost quite a bit and make over-provisioning bearable. Considering that many cloud shapes have local NVMe/SSD drives, it probably wouldn't significantly impact performance either (there are corner cases, of course, where this may not be suitable -- heavy expiry/eviction/purging unfortunately tends to bring disk-based stores to a grinding halt...). Commercial Coherence versions already support this via Elastic Data, but it should be fairly simple to do in CE as well using something like Chronicle Map, which is something you were interested in contributing before (hint, hint... ;-)). The backup map is a fairly simple structure, without many requirements, so it may really be a case of plug-and-play plus some configuration support, for convenience. Happy to discuss that with you in more detail if you are still interested.

Anyway, can you please open an issue (or a set of issues) for the enhancements you believe would be worth doing, so we can discuss them independently and put them on our road map once we agree on what to do and how.
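To make the plug-and-play idea a bit more concrete, here is a rough sketch of what a Chronicle-Map-backed backup map could look like, under the (to-be-validated) assumption that the backup storage can be satisfied by plain java.util.Map semantics over Binary keys and values. The class name, sizing hints and file layout are purely illustrative, not a finalized design:

```java
// Rough sketch only: a disk-backed backup map candidate built on Chronicle Map.
// Assumption: plain java.util.Map semantics over Binary keys/values is enough.
import com.tangosol.util.Binary;
import net.openhft.chronicle.map.ChronicleMap;

import java.io.File;
import java.io.IOException;
import java.util.AbstractMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class ChronicleBackupMap extends AbstractMap<Binary, Binary> implements AutoCloseable {

    private final ChronicleMap<byte[], byte[]> store;

    public ChronicleBackupMap(File file, long maxEntries) throws IOException {
        // average sizes are only hints Chronicle Map needs for variable-length byte[] entries
        store = ChronicleMap.of(byte[].class, byte[].class)
                .name("coherence-backup")
                .entries(maxEntries)
                .averageKeySize(64)
                .averageValueSize(1024)
                .createPersistedTo(file);
    }

    @Override
    public Binary get(Object key) {
        byte[] value = store.get(((Binary) key).toByteArray());
        return value == null ? null : new Binary(value);
    }

    @Override
    public Binary put(Binary key, Binary value) {
        byte[] previous = store.put(key.toByteArray(), value.toByteArray());
        return previous == null ? null : new Binary(previous);
    }

    @Override
    public Binary remove(Object key) {
        byte[] previous = store.remove(((Binary) key).toByteArray());
        return previous == null ? null : new Binary(previous);
    }

    @Override
    public boolean containsKey(Object key) {
        return store.containsKey(((Binary) key).toByteArray());
    }

    @Override
    public int size() {
        return store.size();
    }

    @Override
    public void clear() {
        store.clear();
    }

    @Override
    public Set<Entry<Binary, Binary>> entrySet() {
        // materializes a snapshot view; acceptable for a sketch, not for a hot path
        return store.entrySet().stream()
                .map(e -> Map.entry(new Binary(e.getKey()), new Binary(e.getValue())))
                .collect(Collectors.toSet());
    }

    @Override
    public void close() {
        store.close();
    }
}
```

If the backup storage contract turns out to require more than plain Map semantics (partition awareness, release callbacks, etc.), the adapter would grow accordingly -- that is exactly the kind of detail worth pinning down in the issue.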
-
The "dynamic backups" feature mentioned above, may be as simple as a new custom partitioning strategy. I'm using "simple" here in relative terms, but a partitioning strategy is simpler than an enhancement/change that needs to be applied to across the guts of Coherence. |
-
I am involved in a project where we are migrating a large, complex application (using Coherence) to the cloud (AWS), and as a consequence I am running into some questions -- in particular how to make Coherence auto-scalable -- that I would be happy to discuss!
Using auto scaling with Coherence is, in my view, one of the biggest advantages of running it in a cloud environment. Not only can auto scaling be used to maintain a static cluster size of N nodes in the presence of node failures (i.e. make the Coherence cluster "self-healing"), but it can also scale up to increase resilience in overload situations, and even make the deployment more economical by scaling the cluster in/out to track the load on the system.
As I understand it, Coherence has no problem handling scale-out, i.e. new nodes can be added without any risk (though I am not sure how efficiently adding several nodes at once is handled -- will Coherence optimize the rebalancing dynamically, or wait for rebalancing to occur for each node one by one, resulting in several consecutive rebalancing operations?). A question here, however, is whether the quorums that can be set via configuration files (or JVM parameters) can be updated without restarting the cluster? As I add (or remove) cluster members, it would be nice to also be able to change the quorums accordingly...
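A rough sketch of what I have in mind, assuming a custom quorum policy can still be plugged in via the quorum policy configuration (the system property name is made up): the member threshold is re-read on every evaluation, so it can be changed at runtime instead of being fixed at startup.

```java
// Sketch only -- assumes a custom ActionPolicy can be plugged in as a quorum policy.
// The property name "my.coherence.min.members" is hypothetical.
import com.tangosol.net.Action;
import com.tangosol.net.ActionPolicy;
import com.tangosol.net.Service;

public class DynamicMemberQuorumPolicy implements ActionPolicy {

    @Override
    public void init(Service service) {
        // nothing captured at startup -- the threshold is evaluated per call
    }

    @Override
    public boolean isAllowed(Service service, Action action) {
        // a real policy would also inspect which Action is being evaluated;
        // here every action is simply gated on the current member count
        int minimumMembers = Integer.getInteger("my.coherence.min.members", 2);
        int currentMembers  = service.getInfo().getServiceMembers().size();
        return currentMembers >= minimumMembers;
    }
}
```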
The main stumbling block I see right now, however, is that there is no way to programmatically order a storage-enabled node to evacuate its partitions in preparation for a scale-in operation. Today I would just have to let auto scaling kill the node and ensure no further scale-in operations are started until rebalancing is completed. The big drawback with this is the risk of an unplanned node failure during the rebalancing, as this would result in loss of data :-( I have filed a separate issue (improvement proposal) for this.
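Until something better exists, the "wait until rebalancing is completed" check itself can at least be automated. A minimal sketch of the guard I have in mind, polling the partitioned service's StatusHA attribute over JMX before allowing the next node to be removed (the ObjectName pattern and service name reflect what I expect from standard Coherence management, so treat them as assumptions for the specific deployment):

```java
// Sketch: only allow another scale-in step once no service member reports
// an ENDANGERED StatusHA, i.e. rebalancing has restored at least one backup.
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import java.util.Set;

public final class ScaleInGuard {

    public static boolean safeToRemoveNode(MBeanServerConnection jmx, String serviceName)
            throws Exception {
        Set<ObjectName> services = jmx.queryNames(
                new ObjectName("Coherence:type=Service,name=" + serviceName + ",nodeId=*"), null);
        for (ObjectName service : services) {
            String statusHA = (String) jmx.getAttribute(service, "StatusHA");
            if ("ENDANGERED".equals(statusHA)) {
                return false; // rebalancing not finished; losing another node could lose data
            }
        }
        return true;
    }
}
```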
The only workaround I can see at the moment is to use a backup count of two instead of the typical one, but this results in a very high memory overhead for the cluster (every entry is then held three times -- one primary and two backups -- so raw memory use roughly triples) and also slows down writes, as two replicas must be synchronously updated (asynchronous replication is of no use here, as that could result in stale data in case of loss of the primary data partition).
A further challenge is how backups work (as a side note, I find the term "backup" unfortunate -- I associate it with persistent storage for DR purposes etc. and would have much preferred "replica", as that is more in line with other in-memory systems):