perf(ecs): Narrowing the cache search for the ECS provider on views #6256

christosarvanitis · 2024-08-06T14:45:02Z

Attempt to address some of the issues described in spinnaker/spinnaker#6084

Improves response times on the following endpoints when ECS is enabled and a substantial number of accounts/services exist in the cache:

- /clusters
- /applications
- /serverGroups

The perf issue with alarms still exists and will be addressed in a future PR.

Results from a performance test of clouddriver response times:

GET {CLOUDDRIVER_URL}/applications
- Average: 104ms → 92.1ms (11% improvement)
- 95th Percentile: 130ms → 126ms
GET {CLOUDDRIVER_URL}/applications/{application_name}
- Average: 7.48s → 4.2s (43% improvement)
- 95th Percentile: 8.55s → 5.86s
GET {CLOUDDRIVER_URL}/applications/{application_name}/serverGroups
- Average: 2.72s → 2.16s (20% improvement)
- 95th Percentile: 3.17s → 3.11s
GET {CLOUDDRIVER_URL}/applications/{application_name}/clusters
- Average: 107ms → 43.3ms (59% improvement)
- 95th Percentile: 135ms → 88.4ms

christosarvanitis · 2024-08-07T07:40:13Z

@dbyron0 @deverton I would appreciate your feedback on this change. There are still improvements to be made: the current ECS implementation goes through every region of every account to retrieve the necessary data from the cache, which is far from ideal when there are hundreds of accounts.
The main idea here is to limit the retrieval by application name when we can.
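
As a rough illustration of limiting retrieval by application name, the sketch below filters cache identifiers by application before anything is fetched. The key layout (`ecs;services;<account>;<region>;<serviceName>`), the class name, and the method name are all assumptions made for this example; they are not taken from the PR or from clouddriver's key classes.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: the key layout and all names here are illustrative assumptions,
// not clouddriver's actual key format.
class ApplicationScopedLookup {

  // Assumed key layout: "ecs;services;<account>;<region>;<serviceName>", where the
  // service name is either the application name or starts with "<application>-".
  static List<String> filterByApplication(Collection<String> serviceKeys, String application) {
    return serviceKeys.stream()
        .filter(key -> {
          String[] parts = key.split(";");
          if (parts.length != 5) {
            return false; // not a service key in the assumed layout
          }
          String service = parts[4];
          return service.equals(application) || service.startsWith(application + "-");
        })
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> keys = Arrays.asList(
        "ecs;services;acct-1;us-east-1;myapp-prod-v001",
        "ecs;services;acct-1;us-east-1;otherapp-prod-v003",
        "ecs;services;acct-2;eu-west-1;myapp-test-v000");
    // Only the keys belonging to "myapp" are kept; full entries would then be
    // fetched for those identifiers alone instead of for every account/region.
    System.out.println(filterByApplication(keys, "myapp"));
  }
}
```

The point of the design is that only matching identifiers are resolved to full cache entries, so the cost scales with the size of the application rather than with the number of accounts.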

Alarm performance is still a problem: right now the code goes through all the alarms and tries to match each one with a service. This will be addressed in a future PR.

christosarvanitis · 2024-09-04T13:40:04Z

@dbyron-sf @jasonmcintosh I added some results from internal testing of this change. I would appreciate any feedback!

jasonmcintosh · 2024-09-06T20:45:56Z

...cs/src/main/java/com/netflix/spinnaker/clouddriver/ecs/cache/client/AbstractCacheClient.java

@@ -65,6 +65,11 @@ public Collection<T> getAll(String account, String region) {
    return convertAll(data);
  }

+  public Collection<T> getAll(Collection<String> identifiers) {
+    Collection<CacheData> allData = cacheView.getAll(keyNamespace, identifiers);


NIT: inline the variable; it isn't needed.
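
A minimal sketch of what the nit seems to ask for, assuming the new method simply hands the lookup result to `convertAll` as the existing `getAll(account, region)` does. The interface `SimpleCacheView`, the class `InlineVariableSketch`, and the string-based converter are stand-ins invented for this example so that it compiles on its own.

```java
import java.util.Collection;
import java.util.stream.Collectors;

// Stand-in for the cache abstraction; only the call shape from the diff is mirrored.
interface SimpleCacheView {
  Collection<String> getAll(String namespace, Collection<String> identifiers);
}

class InlineVariableSketch {
  private final SimpleCacheView cacheView;
  private final String keyNamespace;

  InlineVariableSketch(SimpleCacheView cacheView, String keyNamespace) {
    this.cacheView = cacheView;
    this.keyNamespace = keyNamespace;
  }

  // Inlined form: no intermediate "allData" variable, the lookup result goes
  // straight into the converter.
  Collection<String> getAll(Collection<String> identifiers) {
    return convertAll(cacheView.getAll(keyNamespace, identifiers));
  }

  // Hypothetical converter; the real one maps cache data to domain objects.
  private Collection<String> convertAll(Collection<String> data) {
    return data.stream().map(String::toUpperCase).collect(Collectors.toList());
  }
}
```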

jasonmcintosh · 2024-09-06T20:48:05Z

.../main/java/com/netflix/spinnaker/clouddriver/ecs/provider/view/EcsServerClusterProvider.java

@@ -480,6 +485,15 @@ public Map<String, Set<EcsServerCluster>> getClusters() {
    return clusterMap;
  }

+  public Map<String, Set<EcsServerCluster>> getClusters0(String application) {


getClusters0 seems like a bad function name. I'm... wondering why this method vs. the getClusterSummaries up above.

I'm following the same pattern/naming convention established here for the AWS provider: https://github.com/spinnaker/clouddriver/blob/0226100a95b16e1176a296198b1f892c4514e4d6/clouddriver-aws/src/main/groovy/com/netflix/spinnaker/clouddriver/aws/provider/view/AmazonClusterProvider.groovy#L70:L78

But for the ECS provider, the cluster view never implemented cluster summaries; it always returned the cluster details.

That is another area for improvement: getClusterSummaries should return only the names of the server groups and load balancers, while getClusterDetails should return their full details.

Does it make sense to address this in this PR, or in a follow-up?

Key differences:

getClusters0 in AWS is a parameterized private method that takes an argument controlling whether details are added. In this case, there's no difference between the getDetails and getSummaries calls (which raises the question of WHY there isn't a difference, but I haven't looked at the data returned).

As such, it's extra code that doesn't help deduplication. An alternative is making getClusters0 private, but I'd prefer to merge the logic and have one method call the other, rather than add a separate private method that both call, which deepens the call stack.

Sounds good! No objection to implementing getDetails and getSummaries better :) I just pushed the changes and verified that the API responses before and after the change are the same.

The one difference: previously getClusterSummaries returned all the ECS clusters/services for all accounts (with no application filter). After the change it returns only the application's clusters/services.

getSummaries is called via the applicationController -> GET /applications/

getDetails is called via the serverGroup controller -> GET /applications//serverGroups

...ver-ecs/src/main/java/com/netflix/spinnaker/clouddriver/ecs/view/EcsApplicationProvider.java

jasonmcintosh · 2024-09-06T22:56:05Z

Few minor things but overall looks good.

christosarvanitis · 2024-09-11T14:53:23Z

@jasonmcintosh planning to push the Alarm caching/lookup perf improvements as well tomorrow.

christosarvanitis · 2024-09-12T14:49:15Z

...n/java/com/netflix/spinnaker/clouddriver/ecs/cache/client/EcsCloudWatchAlarmCacheClient.java


-    Collection<EcsMetricAlarm> allMetricAlarms = getAll(accountName, region);


Before, all the alarms for an ECS account/region were fetched and iterated through to match the service. This is extremely costly.

After the change, the ECS cluster name is added to the alarm's cache key ID during the caching cycles. We retrieve the IDs by ECS account/region/EcsClusterName and then try to match the service.
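
The lookup described above can be sketched roughly as follows. An in-memory map and a prefix scan stand in for the real cache and its identifier filtering; the key layout (`alarms;<account>;<region>;<cluster>;<alarmArn>`), the `Alarm` class, and the method names are assumptions made for illustration only.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: a Map stands in for the cache, and the key layout is assumed.
class AlarmLookupSketch {

  // Minimal stand-in for a cached alarm.
  static final class Alarm {
    final String name;
    final List<String> alarmActions;

    Alarm(String name, List<String> alarmActions) {
      this.name = name;
      this.alarmActions = alarmActions;
    }
  }

  // Assumed key layout: "alarms;<account>;<region>;<cluster>;<alarmArn>".
  static String keyPrefix(String account, String region, String cluster) {
    return String.join(";", "alarms", account, region, cluster) + ";";
  }

  // Narrow by account/region/cluster first, then match the service on the
  // much smaller candidate set.
  static List<Alarm> findForService(
      Map<String, Alarm> cache, String account, String region, String cluster, String serviceName) {
    String prefix = keyPrefix(account, region, cluster);
    return cache.entrySet().stream()
        .filter(entry -> entry.getKey().startsWith(prefix))
        .map(Map.Entry::getValue)
        .filter(alarm -> alarm.alarmActions.stream().anyMatch(a -> a.contains(serviceName)))
        .collect(Collectors.toList());
  }
}
```

The service match itself is unchanged in spirit; what shrinks is the set of alarms it has to be run against.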

christosarvanitis · 2024-09-12T14:49:37Z

...n/java/com/netflix/spinnaker/clouddriver/ecs/cache/client/EcsCloudWatchAlarmCacheClient.java

-          metricAlarms.add(metricAlarm);
-          continue outLoop;
-        }
+      if (metricAlarm.getAlarmActions().stream().anyMatch(action -> action.contains(serviceName))


Small refactoring here to make it more readable

christosarvanitis · 2024-09-12T14:51:45Z

...va/com/netflix/spinnaker/clouddriver/ecs/provider/agent/EcsCloudMetricAlarmCachingAgent.java

@@ -118,7 +118,13 @@ Map<String, Collection<CacheData>> generateFreshData(Set<MetricAlarm> cacheableM
    Map<String, Collection<CacheData>> newDataMap = new HashMap<>();

    for (MetricAlarm metricAlarm : cacheableMetricAlarm) {
-      String key = Keys.getAlarmKey(accountName, region, metricAlarm.getAlarmArn());
+      String cluster =
+          metricAlarm.getDimensions().stream()


Based on the AWS SDK, a CloudWatch alarm for ECS contains 2 dimensions, depending on the type:

A service alarm contains the dimensions ECSCluster and ServiceName.

An Auto Scaling group alarm of an ECS cluster contains ECSCluster and the capacity provider.

This change includes the ECSClusterName in the cached key ID to make the search less costly.
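
To make the key change concrete, here is a small sketch of pulling the cluster out of an alarm's dimensions and folding it into the cache key. The `Dimension` class, the key layout, the empty-string fallback, and passing the dimension name as a parameter are all assumptions for this example; the exact dimension name and key format used by the PR should be checked against the diff.

```java
import java.util.List;

// Hypothetical sketch: types, key layout, and the fallback are illustrative assumptions.
class AlarmKeySketch {

  // Minimal stand-in for a CloudWatch alarm dimension (name/value pair).
  static final class Dimension {
    final String name;
    final String value;

    Dimension(String name, String value) {
      this.name = name;
      this.value = value;
    }
  }

  // Returns the value of the dimension carrying the cluster name, or "" when the
  // alarm has no such dimension. The dimension name is a parameter on purpose,
  // since this sketch does not assert what the real name is.
  static String clusterOf(List<Dimension> dimensions, String clusterDimensionName) {
    return dimensions.stream()
        .filter(d -> d.name.equalsIgnoreCase(clusterDimensionName))
        .map(d -> d.value)
        .findFirst()
        .orElse("");
  }

  // Assumed key layout with the cluster placed before the alarm ARN, so that
  // alarms can be searched by account/region/cluster.
  static String alarmKey(String account, String region, String cluster, String alarmArn) {
    return String.join(";", "alarms", account, region, cluster, alarmArn);
  }
}
```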

christosarvanitis · 2024-09-12T14:53:00Z

.../main/java/com/netflix/spinnaker/clouddriver/ecs/provider/view/EcsServerClusterProvider.java

-            .setMoniker(moniker);
-
+    EcsServerGroup serverGroup = new EcsServerGroup();
+    if (includeDetails) {


includeDetails is false only for getSummaries. The rest of the logic remains the same.
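
A simplified sketch of the flag-driven shape discussed in this thread: one builder takes `includeDetails`, and the summary and detail entry points differ only in the value they pass. The `ServiceRecord` and `ServerGroup` classes and their fields are invented for this example and are far smaller than the real `EcsServerGroup`.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: types and fields are simplified stand-ins.
class ServerGroupViewSketch {

  // What the cache is assumed to hold for a service.
  static final class ServiceRecord {
    final String name;
    final String taskDefinition;
    final int desiredCount;

    ServiceRecord(String name, String taskDefinition, int desiredCount) {
      this.name = name;
      this.taskDefinition = taskDefinition;
      this.desiredCount = desiredCount;
    }
  }

  // What the view returns; detail fields stay null in summaries.
  static final class ServerGroup {
    String name;
    String taskDefinition;
    Integer desiredCount;
  }

  private static ServerGroup build(ServiceRecord service, boolean includeDetails) {
    ServerGroup serverGroup = new ServerGroup();
    serverGroup.name = service.name; // always present
    if (includeDetails) {
      // Only the detail view pays for the extra fields.
      serverGroup.taskDefinition = service.taskDefinition;
      serverGroup.desiredCount = service.desiredCount;
    }
    return serverGroup;
  }

  private static List<ServerGroup> buildAll(List<ServiceRecord> services, boolean includeDetails) {
    return services.stream().map(s -> build(s, includeDetails)).collect(Collectors.toList());
  }

  static List<ServerGroup> getClusterSummaries(List<ServiceRecord> services) {
    return buildAll(services, false);
  }

  static List<ServerGroup> getClusterDetails(List<ServiceRecord> services) {
    return buildAll(services, true);
  }
}
```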

jasonmcintosh reviewed Sep 6, 2024

perf(ecs): Narrowing the cache search for the ECS provider on views

6d6ec79

christosarvanitis force-pushed the perf-ecs branch from 36aa330 to 6d6ec79 Compare September 11, 2024 08:17

perf(ecs): ECS alarms to be cached/searched with EcsClusterName id

3e714e2

christosarvanitis commented Sep 12, 2024

christosarvanitis requested a review from jasonmcintosh September 12, 2024 14:54
