Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(ecs): Narrowing the cache search for the ECS provider on views #6256

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

christosarvanitis
Copy link
Contributor

@christosarvanitis christosarvanitis commented Aug 6, 2024

Attempt to address some of the issues described in spinnaker/spinnaker#6084

Improving the response times on:

  • /clusters
  • /applications
  • /serverGroups
    Endpoints when ECS is enabled and a substantial amount of accounts/services exist in cache.

The perf issue with the Alarms still exists and will be addressed in a future PR

Adding some results from a performance test clouddriver response times:
image
image
image

  • GET {CLOUDDRIVER_URL}/applications
    • Average: 104ms → 92.1ms (11% improvement)
    • 95th Percentile: 130ms → 126ms
  • GET {CLOUDDRIVER_URL}/applications/{application_name}
    • Average: 7.48s → 4.2s (43% improvement)
    • 95th Percentile: 8.55s → 5.86s
  • GET {CLOUDDRIVER_URL}/applications/{application_name}/serverGroups
    • Average: 2.72s → 2.16s (20% improvement)
    • 95th Percentile: 3.17s → 3.11s
  • GET {CLOUDDRIVER_URL}/applications/{application_name}/clusters
    • Average: 107ms → 43.3ms (59% improvement)
    • 95th Percentile: 135ms → 88.4ms

@christosarvanitis
Copy link
Contributor Author

@dbyron0 @deverton would appreciate your feedback on this change. There are still improvements to be made as the current implementation of ECS goes through every region per account to retrieve the necessary data from cache which is far from ideal when there are hundreds of accounts.
The main idea here is to limit the retrieval with an application name when we can.

The perf of alarms is still a problem as right now it goes through all the alarms and tries to match with a service but this will be addressed in a future PR.

@christosarvanitis
Copy link
Contributor Author

@dbyron-sf @jasonmcintosh Added some results from an internal testing related to this change. Would appreciate any feedback!

@@ -65,6 +65,11 @@ public Collection<T> getAll(String account, String region) {
return convertAll(data);
}

public Collection<T> getAll(Collection<String> identifiers) {
Collection<CacheData> allData = cacheView.getAll(keyNamespace, identifiers);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: merge the variable dont' need it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@@ -480,6 +485,15 @@ public Map<String, Set<EcsServerCluster>> getClusters() {
return clusterMap;
}

public Map<String, Set<EcsServerCluster>> getClusters0(String application) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getClusters0 seems like a bad function name. I'm... wondering why this method vs. the getClusterSummaries up above.

Copy link
Contributor Author

@christosarvanitis christosarvanitis Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im following the same pattern/naming convention established here for the AWS provider https://github.com/spinnaker/clouddriver/blob/0226100a95b16e1176a296198b1f892c4514e4d6/clouddriver-aws/src/main/groovy/com/netflix/spinnaker/clouddriver/aws/provider/view/AmazonClusterProvider.groovy#L70:L78

But for the ECS provider the Cluster view never implemented the clusterSummaries. It was always returning the clusterDetails.

Which is another area of improvement. getClusterSummaries should return only the names of the serverGroups and loadbalancers. getClusterDetails should return the full details of the serverGroups and loadbalancers.

Does it make sense to go on addressing this in this PR or maybe a follow up?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key differences:

  • getClusters0 in AWS is a parameterized private method which takes an argument to control whether details are added. In this case, there's no difference between the getDetails & getSummaries calls (which is a question - WHY there isn't a difference, but not looked at the data returned).
  • As such, it's extra code that doesn't help code deduplication. AN alternative is making getClusters0 private but I'd prefer merge the logic and just have one call the other vs. a separate method that's private that both call increasing the call stack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! No objection in implementing the getDetails and getSummaries better :) Just pushed the changes + i verified that the API responses before and after the change are the same.

With the difference that previously getClusterSummaries was returning All the ECS clusters/services for all the accounts (no application filtered). After the change it will return only the application clusters/services.

getSummaries is called via the applicationController -> GET /applications/

getDetails is called via the serverGroup controller -> GET /applications//serverGroups

@jasonmcintosh
Copy link
Member

Few minor things but overall looks good.

@christosarvanitis
Copy link
Contributor Author

@jasonmcintosh planning to push the Alarm caching/lookup perf improvements as well tomorrow.


Collection<EcsMetricAlarm> allMetricAlarms = getAll(accountName, region);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before All the alarms for an ECS account/region where fetched and iterated through to match the service. This is extremely costly.

After the change the ECSCluster is added during the caching cycles to the cache key id for the ECS provider in the alarms. We retrieve the IDs with ECS account/region/EcsClusterName and then try to match the service.

metricAlarms.add(metricAlarm);
continue outLoop;
}
if (metricAlarm.getAlarmActions().stream().anyMatch(action -> action.contains(serviceName))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small refactoring here to make it more readable

@@ -118,7 +118,13 @@ Map<String, Collection<CacheData>> generateFreshData(Set<MetricAlarm> cacheableM
Map<String, Collection<CacheData>> newDataMap = new HashMap<>();

for (MetricAlarm metricAlarm : cacheableMetricAlarm) {
String key = Keys.getAlarmKey(accountName, region, metricAlarm.getAlarmArn());
String cluster =
metricAlarm.getDimensions().stream()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the AWS SDK a cloudwatch alarm for the ECS contains 2 dimensions depending for the type:

  • Service alarm contains the dimension ECSCluster and ServiceName
  • Autoscaling group alarm of an ECS cluster contains the ECSCluster and the Capacity provider.

This change includes the ECSClusterName in the cached key id to make the search less costly

.setMoniker(moniker);

EcsServerGroup serverGroup = new EcsServerGroup();
if (includeDetails) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

includeDetails is false only for the getSummaries. The rest of the logic remains the same

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants