feat: use shared http.Client with connection pooling for pod metric scraping#1232
feat: use shared http.Client with connection pooling for pod metric scraping#1232chenhuiluo wants to merge 1 commit into
Conversation
…pod scraping Replace http.DefaultClient in scrapePod with a shared metricHTTPClient that uses a custom Transport tuned for the 15-second reconcile cycle: - MaxIdleConnsPerHost=5 (vs Go default 2) to reuse TCP connections across cycles instead of rebuilding per pod per cycle - IdleConnTimeout=30s (vs Go default 90s) to keep connections alive between cycles while still reclaiming unused ones - MaxIdleConns=1000 to cap total idle pool memory All MetricCollector instances share the same connection pool, so idle connections from one collector are available to others. Signed-off-by: luochenhui1 <luochenhui1@kingsoft.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR introduces a shared HTTP transport/client for pod-metric scraping to improve connection reuse across reconcile cycles and adds a unit test asserting the pooling configuration.
Changes:
- Add package-level
metricTransportandmetricHTTPClientconfigured for connection pooling. - Inject the shared HTTP client into
MetricCollectorand use it forscrapePod. - Add a test validating transport/client pooling settings.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/autoscaler/autoscaler/metric_collector.go | Adds shared HTTP transport/client and routes scraping through the injected client. |
| pkg/autoscaler/autoscaler/metric_collector_test.go | Adds a unit test asserting the transport/client pooling configuration. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // metricTransport is a shared http.Transport for scraping pod metrics. | ||
| // MaxIdleConnsPerHost is raised from the Go default (2) to allow connection | ||
| // reuse across reconcile cycles. IdleConnTimeout is set slightly longer than | ||
| // the reconcile period so that connections to the same Pod IP survive between | ||
| // scrapes. MaxIdleConns caps the total idle pool to bound memory usage. | ||
| var metricTransport = &http.Transport{ | ||
| MaxIdleConns: 1000, | ||
| MaxIdleConnsPerHost: 5, | ||
| IdleConnTimeout: 30 * time.Second, | ||
| } |
| var metricTransport = &http.Transport{ | ||
| MaxIdleConns: 1000, | ||
| MaxIdleConnsPerHost: 5, | ||
| IdleConnTimeout: 30 * time.Second, | ||
| } |
There was a problem hiding this comment.
Code Review
This pull request introduces a shared http.Client and http.Transport configuration for MetricCollector to enable connection pooling and reuse across pod-scraping cycles. Feedback was provided to clone http.DefaultTransport instead of creating a new transport from scratch to preserve critical default settings (such as proxy support and dial timeouts). Additionally, it is recommended to add a fallback to http.DefaultClient in case collector.httpClient is nil to prevent potential nil pointer panics.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| var metricTransport = &http.Transport{ | ||
| MaxIdleConns: 1000, | ||
| MaxIdleConnsPerHost: 5, | ||
| IdleConnTimeout: 30 * time.Second, | ||
| } |
There was a problem hiding this comment.
Creating a new http.Transport from scratch loses all the optimized defaults of http.DefaultTransport, such as Proxy, DialContext (with its default 30s timeout and keep-alive), TLSHandshakeTimeout, and ForceAttemptHTTP2. This can lead to issues like missing proxy support or suboptimal connection dialing.
Instead, you should clone http.DefaultTransport and modify only the fields you need.
| var metricTransport = &http.Transport{ | |
| MaxIdleConns: 1000, | |
| MaxIdleConnsPerHost: 5, | |
| IdleConnTimeout: 30 * time.Second, | |
| } | |
| var metricTransport = func() *http.Transport { | |
| t := http.DefaultTransport.(*http.Transport).Clone() | |
| t.MaxIdleConns = 1000 | |
| t.MaxIdleConnsPerHost = 5 | |
| t.IdleConnTimeout = 30 * time.Second | |
| return t | |
| }() |
| return "", err | ||
| } | ||
| resp, err := http.DefaultClient.Do(req) | ||
| resp, err := collector.httpClient.Do(req) |
There was a problem hiding this comment.
If MetricCollector is instantiated directly without using NewMetricCollector (for example, in external tests or other packages), collector.httpClient will be nil, leading to a nil pointer panic when scrapePod is called.
To prevent this, we should fall back to http.DefaultClient if collector.httpClient is nil.
| resp, err := collector.httpClient.Do(req) | |
| client := collector.httpClient | |
| if client == nil { | |
| client = http.DefaultClient | |
| } | |
| resp, err := client.Do(req) |
Problem
scrapePoduseshttp.DefaultClient, which hasMaxIdleConnsPerHost=2(Go default). Since each Pod is addressed by a uniqueIP:Port, idle connections from one cycle cannot be reused in the next — the connection pool per host is too small and ages out before the next 15-second reconcile. This means every cycle rebuilds TCP connections to all pods: 10 ModelServings × 8 Pods = ~80TCP handshakes every 15s, with connection establishment consuming 30-40% of total scrape latency.
Solution
Introduce a shared
metricHTTPClientwith a customhttp.Transporttuned for the reconcile cadence:MaxIdleConnsPerHost=5— allows idle connections to each Pod IP to survive across cyclesIdleConnTimeout=30s— slightly longer than the 15s reconcile period, so connections stay warm between scrapes while still being reclaimed when unusedMaxIdleConns=1000— caps total idle pool to bound memoryAll
MetricCollectorinstances share the same connection pool, so idle connections from one collector benefit others.Changes
metric_collector.go: addmetricTransport,metricHTTPClientpackage vars; addhttpClientfield toMetricCollector; replacehttp.DefaultClient.Do(req)withcollector.httpClient.Do(req)inscrapePodmetric_collector_test.go: updatenewTestCollector()to inithttpClient; addTestMetricHTTPClientConnectionPoolingto verify transport parameters