Skip to content

feat: use shared http.Client with connection pooling for pod metric scraping#1232

Open
chenhuiluo wants to merge 1 commit into
volcano-sh:mainfrom
chenhuiluo:chenhui/optimize-httpclient
Open

feat: use shared http.Client with connection pooling for pod metric scraping#1232
chenhuiluo wants to merge 1 commit into
volcano-sh:mainfrom
chenhuiluo:chenhui/optimize-httpclient

Conversation

@chenhuiluo

Copy link
Copy Markdown

Problem

scrapePod uses http.DefaultClient, which has MaxIdleConnsPerHost=2 (Go default). Since each Pod is addressed by a unique IP:Port, idle connections from one cycle cannot be reused in the next — the connection pool per host is too small and ages out before the next 15-second reconcile. This means every cycle rebuilds TCP connections to all pods: 10 ModelServings × 8 Pods = ~80
TCP handshakes every 15s, with connection establishment consuming 30-40% of total scrape latency.

Solution

Introduce a shared metricHTTPClient with a custom http.Transport tuned for the reconcile cadence:

  • MaxIdleConnsPerHost=5 — allows idle connections to each Pod IP to survive across cycles
  • IdleConnTimeout=30s — slightly longer than the 15s reconcile period, so connections stay warm between scrapes while still being reclaimed when unused
  • MaxIdleConns=1000 — caps total idle pool to bound memory

All MetricCollector instances share the same connection pool, so idle connections from one collector benefit others.

Changes

  • metric_collector.go: add metricTransport, metricHTTPClient package vars; add httpClient field to MetricCollector; replace http.DefaultClient.Do(req) with collector.httpClient.Do(req) in scrapePod
  • metric_collector_test.go: update newTestCollector() to init httpClient; add TestMetricHTTPClientConnectionPooling to verify transport parameters

…pod scraping

Replace http.DefaultClient in scrapePod with a shared metricHTTPClient
that uses a custom Transport tuned for the 15-second reconcile cycle:

- MaxIdleConnsPerHost=5 (vs Go default 2) to reuse TCP connections
  across cycles instead of rebuilding per pod per cycle
- IdleConnTimeout=30s (vs Go default 90s) to keep connections alive
  between cycles while still reclaiming unused ones
- MaxIdleConns=1000 to cap total idle pool memory

All MetricCollector instances share the same connection pool, so idle
connections from one collector are available to others.

Signed-off-by: luochenhui1 <luochenhui1@kingsoft.com>
Copilot AI review requested due to automatic review settings June 17, 2026 11:43
@volcano-sh-bot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces a shared HTTP transport/client for pod-metric scraping to improve connection reuse across reconcile cycles and adds a unit test asserting the pooling configuration.

Changes:

  • Add package-level metricTransport and metricHTTPClient configured for connection pooling.
  • Inject the shared HTTP client into MetricCollector and use it for scrapePod.
  • Add a test validating transport/client pooling settings.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
pkg/autoscaler/autoscaler/metric_collector.go Adds shared HTTP transport/client and routes scraping through the injected client.
pkg/autoscaler/autoscaler/metric_collector_test.go Adds a unit test asserting the transport/client pooling configuration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +48 to +57
// metricTransport is a shared http.Transport for scraping pod metrics.
// MaxIdleConnsPerHost is raised from the Go default (2) to allow connection
// reuse across reconcile cycles. IdleConnTimeout is set slightly longer than
// the reconcile period so that connections to the same Pod IP survive between
// scrapes. MaxIdleConns caps the total idle pool to bound memory usage.
var metricTransport = &http.Transport{
MaxIdleConns: 1000,
MaxIdleConnsPerHost: 5,
IdleConnTimeout: 30 * time.Second,
}
Comment on lines +53 to +57
var metricTransport = &http.Transport{
MaxIdleConns: 1000,
MaxIdleConnsPerHost: 5,
IdleConnTimeout: 30 * time.Second,
}

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a shared http.Client and http.Transport configuration for MetricCollector to enable connection pooling and reuse across pod-scraping cycles. Feedback was provided to clone http.DefaultTransport instead of creating a new transport from scratch to preserve critical default settings (such as proxy support and dial timeouts). Additionally, it is recommended to add a fallback to http.DefaultClient in case collector.httpClient is nil to prevent potential nil pointer panics.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +53 to +57
var metricTransport = &http.Transport{
MaxIdleConns: 1000,
MaxIdleConnsPerHost: 5,
IdleConnTimeout: 30 * time.Second,
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Creating a new http.Transport from scratch loses all the optimized defaults of http.DefaultTransport, such as Proxy, DialContext (with its default 30s timeout and keep-alive), TLSHandshakeTimeout, and ForceAttemptHTTP2. This can lead to issues like missing proxy support or suboptimal connection dialing.

Instead, you should clone http.DefaultTransport and modify only the fields you need.

Suggested change
var metricTransport = &http.Transport{
MaxIdleConns: 1000,
MaxIdleConnsPerHost: 5,
IdleConnTimeout: 30 * time.Second,
}
var metricTransport = func() *http.Transport {
t := http.DefaultTransport.(*http.Transport).Clone()
t.MaxIdleConns = 1000
t.MaxIdleConnsPerHost = 5
t.IdleConnTimeout = 30 * time.Second
return t
}()

return "", err
}
resp, err := http.DefaultClient.Do(req)
resp, err := collector.httpClient.Do(req)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If MetricCollector is instantiated directly without using NewMetricCollector (for example, in external tests or other packages), collector.httpClient will be nil, leading to a nil pointer panic when scrapePod is called.

To prevent this, we should fall back to http.DefaultClient if collector.httpClient is nil.

Suggested change
resp, err := collector.httpClient.Do(req)
client := collector.httpClient
if client == nil {
client = http.DefaultClient
}
resp, err := client.Do(req)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants