Making Coherence CE more cloud friendly #69
-
@javafanboy The above certainly raises a number of questions worth discussing, and you may be right -- we could probably make certain things more dynamic and configurable at runtime, so that they work better in elastic environments such as the cloud. The fundamental question here, however, is whether we allow users to sacrifice HA in order to avoid over-provisioning, and I'm actually inclined to say "maybe".

As you noted, the expectation we currently have is that there is enough capacity after the failure of the largest component (site/AD/AZ in this case) for both primaries and the configured number of replicas (to use your terminology ;-)). In other words, if you are running across N sites, you need to provision N/(N-1) of the required capacity to begin with (3/2 in your example) in order to have enough capacity after a site/AD/AZ loss. Traditionally, this hasn't been a huge problem with on-prem deployments, which typically targeted machine-, or at best rack-safety -- when you are running a 20-machine cluster, adding a 21st is only a 5% cost increase, 11/10 is only 10%, etc. It does get a bit more expensive with racks, but even that was usually manageable. Moving to the cloud, however, makes the 3/2 scenario typical, which means 50% additional cost to achieve HA.

While I think our current behavior is the correct default, I do agree that users should have a choice in the matter, as they are ultimately the ones who need to make the tradeoff between HA and TCO. Providing an option to relax the guarantees we provide, and to sacrifice replicas when there isn't enough capacity available, probably makes sense -- and may be a significantly better option than trying to create new backups and crashing completely due to OOM errors, which is something we've seen happen :-(

You also bring up an interesting point regarding multiple backups and reads from backups -- until recently, we didn't have this feature and all reads were from the primary. Now that we do, I can see users doing what you are suggesting and configuring the number of backups so that each piece of data is available locally in each AZ (2 backups for the common 3-AZ setup). However, if you lose a whole AZ, it doesn't really make sense to keep 2 backups -- one is sufficient, as it still ensures both HA and locality of reads for each of the 2 remaining AZs, so maybe we need a notion of "dynamic backups" as well. This would still require 3/2 over-provisioning to begin with, but at least you are getting something for it: faster reads. And you wouldn't need any spare capacity in the remaining 2 AZs in case of an AZ failure, as they would simply either promote the existing backups for the failed primary members or throw them away, while remaining HA. I think that feature makes a lot of sense.

Another thing worth looking into is storing backups on disk instead of in memory. This would certainly reduce the associated RAM cost quite a bit and make over-provisioning bearable. Considering that many cloud shapes have local NVMe/SSD drives, it probably wouldn't significantly impact performance either (there are corner cases, of course, where this may not be suitable -- heavy expiry/eviction/purging unfortunately tends to bring disk-based stores to a grinding halt...). Commercial Coherence versions already support this via Elastic Data, but it should be fairly simple to do in CE as well using something like Chronicle Map, which is something you were interested in contributing before (hint, hint... ;-)). The backup map is a fairly simple structure, without many requirements, so it may really be a case of plug-and-play plus some configuration support, for convenience. Happy to discuss that with you in more detail if you are still interested.

Anyway, can you please open an issue (or a set of issues) for the enhancements you believe would be worth doing, so we can discuss them independently and put them on our road map once we agree on what to do and how.
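To make the plug-and-play idea a bit more concrete, here is a rough sketch of what a Chronicle-Map-backed backup map could look like, under the (to-be-validated) assumption that the backup storage can be satisfied by plain java.util.Map semantics over Binary keys and values. The class name, sizing hints and file layout are purely illustrative, not a finalized design:

```java
// Rough sketch only: a disk-backed backup map candidate built on Chronicle Map.
// Assumption: plain java.util.Map semantics over Binary keys/values is enough.
import com.tangosol.util.Binary;
import net.openhft.chronicle.map.ChronicleMap;

import java.io.File;
import java.io.IOException;
import java.util.AbstractMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class ChronicleBackupMap extends AbstractMap<Binary, Binary> implements AutoCloseable {

    private final ChronicleMap<byte[], byte[]> store;

    public ChronicleBackupMap(File file, long maxEntries) throws IOException {
        // average sizes are only hints Chronicle Map needs for variable-length byte[] entries
        store = ChronicleMap.of(byte[].class, byte[].class)
                .name("coherence-backup")
                .entries(maxEntries)
                .averageKeySize(64)
                .averageValueSize(1024)
                .createPersistedTo(file);
    }

    @Override
    public Binary get(Object key) {
        byte[] value = store.get(((Binary) key).toByteArray());
        return value == null ? null : new Binary(value);
    }

    @Override
    public Binary put(Binary key, Binary value) {
        byte[] previous = store.put(key.toByteArray(), value.toByteArray());
        return previous == null ? null : new Binary(previous);
    }

    @Override
    public Binary remove(Object key) {
        byte[] previous = store.remove(((Binary) key).toByteArray());
        return previous == null ? null : new Binary(previous);
    }

    @Override
    public boolean containsKey(Object key) {
        return store.containsKey(((Binary) key).toByteArray());
    }

    @Override
    public int size() {
        return store.size();
    }

    @Override
    public void clear() {
        store.clear();
    }

    @Override
    public Set<Entry<Binary, Binary>> entrySet() {
        // materializes a snapshot view; acceptable for a sketch, not for a hot path
        return store.entrySet().stream()
                .map(e -> Map.entry(new Binary(e.getKey()), new Binary(e.getValue())))
                .collect(Collectors.toSet());
    }

    @Override
    public void close() {
        store.close();
    }
}
```

If the backup storage contract turns out to require more than plain Map semantics (partition awareness, release callbacks, etc.), the adapter would grow accordingly -- that is exactly the kind of detail worth pinning down in the issue.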
-
The "dynamic backups" feature mentioned above, may be as simple as a new custom partitioning strategy. I'm using "simple" here in relative terms, but a partitioning strategy is simpler than an enhancement/change that needs to be applied to across the guts of Coherence. |
-
I am involved in a project where we are migrating a large, complex application (using Coherence) to the cloud (AWS), and as a consequence I am running into some questions -- in particular how to make Coherence auto-scalable -- that I would be happy to discuss!
Using auto scaling with Coherence is, in my view, one of the biggest advantages of running it in a cloud environment. Not only can auto scaling be used to maintain a static cluster size of N nodes in the presence of node failures (i.e. make the Coherence cluster "self-healing"), but it can also scale up to increase resilience in overload situations, and even make the deployment more economical by scaling the cluster in/out to track the load on the system.
As I understand it, Coherence has no problem handling scale-out, i.e. new nodes can be added without any risk (though I am not sure how efficiently adding several nodes at once is handled -- will Coherence optimize the rebalancing dynamically, or wait for rebalancing to occur for each node one by one, resulting in several consecutive rebalancing operations?). A question here, however, is whether the quorums that can be set via configuration files (or JVM parameters) can be updated without restarting the cluster? As I add (or remove) cluster members, it would be nice to also be able to change the quorums accordingly...
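A rough sketch of what I have in mind, assuming a custom quorum policy can still be plugged in via the quorum policy configuration (the system property name is made up): the member threshold is re-read on every evaluation, so it can be changed at runtime instead of being fixed at startup.

```java
// Sketch only -- assumes a custom ActionPolicy can be plugged in as a quorum policy.
// The property name "my.coherence.min.members" is hypothetical.
import com.tangosol.net.Action;
import com.tangosol.net.ActionPolicy;
import com.tangosol.net.Service;

public class DynamicMemberQuorumPolicy implements ActionPolicy {

    @Override
    public void init(Service service) {
        // nothing captured at startup -- the threshold is evaluated per call
    }

    @Override
    public boolean isAllowed(Service service, Action action) {
        // a real policy would also inspect which Action is being evaluated;
        // here every action is simply gated on the current member count
        int minimumMembers = Integer.getInteger("my.coherence.min.members", 2);
        int currentMembers  = service.getInfo().getServiceMembers().size();
        return currentMembers >= minimumMembers;
    }
}
```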
The main stumbling block I see right now, however, is that there is no way to programmatically order a storage-enabled node to evacuate its partitions in preparation for a scale-in operation. Today I would just have to let auto scaling kill the node and ensure no further scale-in operations are started until rebalancing is completed. The big drawback with this is the risk of an unplanned node failure during the rebalancing, as this would result in loss of data :-( I have filed a separate issue (improvement proposal) for this.
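Until something better exists, the "wait until rebalancing is completed" check itself can at least be automated. A minimal sketch of the guard I have in mind, polling the partitioned service's StatusHA attribute over JMX before allowing the next node to be removed (the ObjectName pattern and service name reflect what I expect from standard Coherence management, so treat them as assumptions for the specific deployment):

```java
// Sketch: only allow another scale-in step once no service member reports
// an ENDANGERED StatusHA, i.e. rebalancing has restored at least one backup.
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import java.util.Set;

public final class ScaleInGuard {

    public static boolean safeToRemoveNode(MBeanServerConnection jmx, String serviceName)
            throws Exception {
        Set<ObjectName> services = jmx.queryNames(
                new ObjectName("Coherence:type=Service,name=" + serviceName + ",nodeId=*"), null);
        for (ObjectName service : services) {
            String statusHA = (String) jmx.getAttribute(service, "StatusHA");
            if ("ENDANGERED".equals(statusHA)) {
                return false; // rebalancing not finished; losing another node could lose data
            }
        }
        return true;
    }
}
```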
The only workaround I can see at the moment is to use a backup count of two instead of the typical one, but this results in a very high memory overhead for the cluster (every entry is then held three times -- one primary and two backups -- so raw memory use roughly triples) and also slows down writes, as two replicas must be synchronously updated (asynchronous replication is of no use here, as that could result in stale data in case of loss of the primary data partition).
A further challenge is how backups work (as a side note, I find the term "backup" unfortunate -- I associate it with persistent storage for DR purposes etc. and would have much preferred "replica", as that is more in line with other in-memory systems):