# lme optimization

- **Branch:** lme_timeout_fix_optimization_tmp
- **Author:** Nabendu Maiti
- **Date:** 2026-03-19
- **SHA:** 69bbeed
- **Audience:** Patch reviewers, architects
- Executive Summary
- Problem Statement
- Files Changed
- Architecture Context
- Change Details by Layer
- Sequence Diagrams
- Risk Assessment
- Testing Coverage
- Architectural Observations
## Executive Summary

This commit addresses a class of AMT/LME stall bugs where rpc-go could hang indefinitely or wait excessively long when the Intel ME Interface (MEI) kernel driver returned transient "device busy" errors or when the AMT firmware was slow to confirm an APF channel open. The changes introduce:
- **Timed waits**, replacing unbounded `WaitGroup.Wait()` blocks at the channel-open barrier in both the RPS executor and the local HTTP transport.
- **Retry-with-backoff logic** at three call sites where transient MEI busy errors are expected: LME channel connect, `GetGeneralSettings`, and HTTP transport `RoundTrip`.
- **APF timer acceleration:** resets the APF session flush timer to 500 ms whenever an `APF_CHANNEL_DATA` message arrives, eliminating stalls caused by a slow timer firing after data has already been delivered.
- **LMS probe result caching:** avoids redundant TCP probes on every WSMAN call during a multi-step activation.
- **Reduced global timeouts** to align with realistic AMT response characteristics and avoid unnecessary delay in failure scenarios.
- **URL normalization** to reject structurally invalid `--url` values early and prevent misrouted activation flows.
## Problem Statement

| Symptom | Root Cause |
|---|---|
| `rpc activate --local` hangs for 30 s+ then fails | `WaitGroup.Wait()` was unbounded; if AMT never confirmed channel open (e.g., MEI busy), the goroutine blocked forever |
| Repeated "device or resource busy" errors causing abort | Single-attempt MEI connect with no retry; transient kernel driver contention treated as fatal |
| Activation stalls mid-sequence after first WSMAN call | Each WSMAN call re-probed the LMS port (TCP connect + TLS handshake overhead per call) |
| APF data flushed too late during LME sessions | APF session timer used a long default; 500 ms flush acceleration needed for channel data deliveries |
| Malformed `--url` (e.g., `"http://"`) routed incorrectly | No validation before URL was evaluated as non-empty |
| Premature `EnsureWSMAN` during local activation | WSMAN init triggered before it was needed, wasting an LME slot |
## Files Changed

| File | +Lines | −Lines | Category |
|---|---|---|---|
| `internal/rps/executor.go` | +90 | −2 | Core: LME channel loop |
| `internal/local/amt/localTransport.go` | +89 | −3 | Core: HTTP transport |
| `internal/commands/activate/local.go` | +38 | −1 | Local activation retry |
| `internal/lm/engine.go` | +23 | −1 | APF timer override |
| `internal/commands/activate/activate.go` | +26 | −4 | URL normalization, WSMAN cleanup |
| `internal/local/amt/wsman.go` | +38 | −4 | LMS probe caching |
| `internal/commands/activate/activate_test.go` | +12 | 0 | Tests: URL normalization |
| `pkg/utils/constants.go` | +6 | −6 | Timeout constants tuning |
| `internal/rps/rps.go` | +1 | 0 | Minor |
| `pkg/heci/linux.go` | +1 | 0 | Cosmetic (blank line) |
| **Total** | **+289** | **−41** | |
## Architecture Context

```
┌──────────────────────────────────────────────────────────────────────┐
│                            rpc-go Process                            │
│                                                                      │
│  ┌─────────────┐    ┌──────────────┐    ┌────────────────────────┐   │
│  │ CLI /       │    │ RPS          │    │ Local Activation       │   │
│  │ Activate    │───▶│ Executor     │    │ Service                │   │
│  │ Command     │    │ (WebSocket)  │    │ (local.go)             │   │
│  └──────┬──────┘    └──────┬───────┘    └──────────┬─────────────┘   │
│         │                  │                       │                 │
│         ▼                  ▼                       ▼                 │
│  ┌──────────────────────────────────────────────────────────────┐    │
│  │              GoWSMANMessages / LocalTransport                │    │
│  │  ┌──────────────────────────────────────────────────────┐    │    │
│  │  │    LMS probe cache (plainProbeDone / useLocalLMX)    │    │    │
│  │  └──────────────────────────────────────────────────────┘    │    │
│  └──────────────────┬───────────────────────────────────────────┘    │
│                     │                                                │
│      ┌──────────────┴───────────────┐                                │
│      │          LM Engine           │                                │
│      │  ┌───────────────────────┐   │                                │
│      │  │  APF Session + Timer  │   │                                │
│      │  │  (500ms override on   │   │                                │
│      │  │   CHANNEL_DATA)       │   │                                │
│      │  └───────────────────────┘   │                                │
│      └──────────────┬───────────────┘                                │
└─────────────────────┼────────────────────────────────────────────────┘
                      │ ioctl / HECI driver
                      ▼
              ┌───────────────┐
              │  MEI kernel   │
              │  driver       │
              │  /dev/mei0    │
              └───────┬───────┘
                      │ HECI bus
                      ▼
              ┌───────────────┐
              │  Intel AMT    │
              │  Firmware     │
              └───────────────┘
```
## Change Details by Layer

### Timeout Constants (`pkg/utils/constants.go`)

```
               BEFORE          AFTER
             ┌─────────┐     ┌─────────┐
LMS conn     │ 10 s    │     │ 6 s     │
LMS dialer   │ 5 s     │     │ 4 s     │
HECI read    │ 30 s    │     │ 15 s    │
HECI retry   │ 3000 ms │     │ 2000 ms │
HECI backoff │ 500 ms  │     │ 300 ms  │
AMT response │ 30 s    │     │ 10 s    │
             └─────────┘     └─────────┘
```
These constants govern every wait in the LME/LMS/MEI stack.
| Constant | Old Value | New Value | Unit | Rationale |
|---|---|---|---|---|
| `LMSConnectionTimeout` | 10 | 6 | seconds | LMS TCP connect completes in < 1 s on a healthy system; 6 s is ample |
| `LMSDialerTimeout` | 5 | 4 | seconds | Aligns with reduced connection timeout |
| `HeciReadTimeout` | 30 | 15 | seconds | AMT typically responds in < 5 s; 15 s catches stalls without waiting 30 s |
| `HeciRetryDelay` | 3000 | 2000 | ms | Reduces dead time between HECI retries |
| `HeciConnectRetryBackoff` | 500 | 300 | ms | Shorter initial backoff on transient MEI busy; linear scale preserves total budget |
| `AMTResponseTimeout` | 30 | 10 | seconds | Matches typical AMT response SLA; 30 s was masking hangs |
> **Architectural note:** All timeouts are centralized here. No hardcoded values were introduced in this commit; retry delays always reference `utils.HeciConnectRetryBackoff`.
### LM Engine (`internal/lm/engine.go`): APF Timer Override

A new wrapper `processWithLocalTimerOverride(message []byte)` replaces direct `apf.Process()` calls in both `execute()` and `Listen()`.
```go
const lmeAPFChannelDataFlushOverride = 500 * time.Millisecond

func (lme *LMEConnection) processWithLocalTimerOverride(message []byte) bytes.Buffer {
	processed := apf.Process(message, lme.Session)
	if len(message) > 0 && message[0] == apf.APF_CHANNEL_DATA && lme.Session.Timer != nil {
		if !lme.Session.Timer.Stop() {
			select {
			case <-lme.Session.Timer.C:
			default:
			}
		}
		lme.Session.Timer.Reset(lmeAPFChannelDataFlushOverride)
	}
	return processed
}
```

The APF session timer controls how quickly buffered channel data is flushed. When the default timer period is long and `APF_CHANNEL_DATA` arrives, the data sits in the buffer until the timer fires. This creates visible stalls in the WSMAN response path. The 500 ms override causes the flush to happen much sooner after data arrival, without disrupting orderly session teardown on other message types.
| Function | Before | After |
|---|---|---|
| `execute()` | `apf.Process(result, lme.Session)` | `lme.processWithLocalTimerOverride(result)` |
| `Listen()` | `apf.Process(result2, lme.Session)` | `lme.processWithLocalTimerOverride(result2)` |
The `Stop()` + drain-channel pattern prevents the `Reset()` from racing with an already-fired timer:

- `Timer.Stop() == false` → timer already fired → drain `Timer.C` → `Reset()`
- `Timer.Stop() == true` → timer not yet fired → `Reset()` directly

This is the standard Go pattern for safe timer reuse.
### RPS Executor (`internal/rps/executor.go`)

The RPS path is used during remote/cloud-driven provisioning, where RPS sends activation commands over WebSocket to rpc-go, which forwards them via LME to AMT firmware.
#### `connectLMEWithRetry()`

```
attempt 0 → Connect() ── success → return nil
                └ transient busy → wait 300 ms
attempt 1 → Connect() ── success → return nil
                └ transient busy → wait 600 ms
attempt 2 → Connect() ── success/failure → return
```
| Parameter | Value |
|---|---|
| Max retries | 2 (3 total attempts) |
| Backoff | `(attempt+1) × 300 ms` (linear: 300, 600 ms) |
| Retry trigger | `"device or resource busy"`, `"resource busy"`, `"no such device"`, `"device unavailable"` |
| Non-transient errors | Returned immediately without retry |
#### Timed Channel-Open Wait

Before:

```go
e.waitGroup.Wait() // blocks forever if AMT never confirms
```

After:

```go
channelOpenTimer := time.NewTimer(channelOpenTimeout) // bounded by LMETimerTimeout (capped at AMTResponseTimeout)
channelOpenDone := make(chan struct{})
go func() {
	defer close(channelOpenDone)
	e.waitGroup.Wait()
}()
select {
case <-channelOpenDone:
	// channel confirmed, proceed
case <-channelOpenTimer.C:
	e.lastError = fmt.Errorf("timeout waiting for AMT channel open...")
	return true // signal error to caller
}
```

| Aspect | Before | After |
|---|---|---|
| Mechanism | `context.WithTimeout` / `ctx.Done()` | `time.NewTimer` / `timer.C` |
| Goroutine overhead | Context creates goroutine + cancelFunc | Timer is a single heap object |
| Leak risk | `defer cancel()` was correct, but context propagation adds complexity | Direct timer; explicit Stop-drain in defer |
| Semantics | `ctx.Done()` closed on deadline or cancel | `timer.C` fires once at deadline |
The change removes the `"context"` import and replaces it with `"strings"` (for the new `isTransientMEIBusy` helper), netting zero new imports.
### Local HTTP Transport (`internal/local/amt/localTransport.go`)

The `LocalTransport` implements `http.RoundTripper` for WSMAN over MEI when LMS is not available. It is used by `GoWSMANMessages` during local activation.
Same retry pattern as §5.3.1, but applied to `LocalTransport.RoundTrip()`. It uses `isMEIDeviceBusyError()`, which checks only for `"device or resource busy"` and `"resource busy"`, a subset of the executor's `isTransientMEIBusy` (`"no such device"` is excluded at the transport level, since it implies a more permanent absence).
| Error string | Executor `isTransientMEIBusy` | Transport `isMEIDeviceBusyError` |
|---|---|---|
| `"device or resource busy"` | ✓ | ✓ |
| `"resource busy"` | ✓ | ✓ |
| `"no such device"` | ✓ | ✗ |
| `"device unavailable"` | ✓ | ✗ |
> **Reviewer note:** The narrower set in the transport is intentional: a missing MEI device (`"no such device"`) at the transport layer should fall through to an LMS fallback, not retry directly.
Same pattern as §5.3.2. On timeout, the transport explicitly calls `l.Close()` to unblock the `Listen()` goroutine before returning the error, preventing goroutine leaks.
Previously the inner select loop had no deadline branch. A `responseTimer` is now added:

```go
case <-responseTimer.C:
	respErr = fmt.Errorf("timeout waiting for LME response after %s", responseTimeout)
	break Loop
```

This ensures `RoundTrip` always returns within `AMTResponseTimeout` (10 s).
```
BEFORE RoundTrip()                AFTER RoundTrip()
─────────────────────             ──────────────────────────────
Connect()                         for attempt 0..2:
    │                                 Connect() ── ok ──────────┐
    ▼                                 └── busy? ── wait 300ms   │
go Listen()                                                     │
    │                             go Listen()                   │
    ▼                                 │                         │
WaitGroup.Wait() ◄── hangs        select:                       │
                     forever          ├─ WG done → proceed ◄────┘
                     if no ACK        └─ timer → Close + error

read loop:                        read loop:
  select:                           select:
    ├─ data channel                   ├─ data channel
    └─ error channel                  ├─ error channel
       (no deadline)                  └─ responseTimer (10s)
```
### WSMAN Client (`internal/local/amt/wsman.go`): LMS Probe Caching

```go
type GoWSMANMessages struct {
	// ...existing fields...
	plainProbeDone bool // true once LMS TCP probe has run
	useLocalLMX    bool // true if the LMS probe failed (use MEI instead)
}
```

BEFORE `SetupWsmanClient()` (non-TLS path):

```
Every call:
    ctx, cancel := context.WithTimeout(...)
    dialer.DialContext(tcp4, LMSAddress:LMSPort)
    if err → use localTransport
    else   → use LMS
```

AFTER `SetupWsmanClient()` (non-TLS path):

```
First call only:
    dial LMS, record result in plainProbeDone + useLocalLMX
Subsequent calls:
    skip probe entirely, use cached useLocalLMX decision
    lazy-init localTransport only when needed
```
A full local ACM activation may call `SetupWsmanClient` many times (once per WSMAN GET/PUT). Without caching, each call attempted a TCP connect-and-close to the LMS port: wasted syscalls and time on systems without LMS.
```go
// Before
dialer := &cryptotls.Dialer{Config: tlsConfig}
conn, _ := dialer.DialContext(ctx, "tcp", utils.LMSAddress+":"+utils.LMSTLSPort)

// After
dialer := &cryptotls.Dialer{
	Config:    tlsConfig,
	NetDialer: &net.Dialer{Timeout: probeTimeout}, // explicit inner timeout
}
conn, _ := dialer.DialContext(ctx, "tcp", net.JoinHostPort(g.target, utils.LMSTLSPort))
```

- `net.Dialer{Timeout: probeTimeout}` gives the TCP layer its own deadline, independent of the TLS context.
- `net.JoinHostPort` replaces string concatenation and handles IPv6 addresses correctly (wraps the host in `[...]`).
- `g.target` replaces the hardcoded `utils.LMSAddress`, enabling the transport to work against custom LMS addresses.
### Activate Command (`internal/commands/activate/activate.go`): URL Normalization

```go
func normalizeActivateURL(raw string) string {
	value := strings.TrimSpace(raw)
	if value == "" {
		return ""
	}
	u, err := url.Parse(value)
	if err != nil {
		return value // unparseable → pass through, fail later
	}
	scheme := strings.ToLower(u.Scheme)
	if (scheme == "http" || scheme == "https" || scheme == "ws" || scheme == "wss") && u.Host == "" {
		return "" // scheme-only or malformed → treat as empty
	}
	return value
}
```

| Input | Output | Why |
|---|---|---|
| `"wss://rps.example.com:8080"` | unchanged | valid |
| `"http://"` | `""` | scheme with no host |
| `"https://"` | `""` | scheme with no host |
| `"somestring"` | `"somestring"` | no scheme → pass through |
| `""` | `""` | empty |
This is called at the top of `Validate()`, before any mode-detection logic, so `cmd.URL == ""` is always a reliable sentinel by the time the mode-selection branches run.
#### Removed Premature `EnsureWSMAN`

```go
// REMOVED from Run() in the local-activation branch:
if err := cmd.EnsureWSMAN(ctx); err != nil {
	return err
}
```

`EnsureWSMAN` opens an LME connection eagerly. When local activation itself subsequently opens the LME for WSMAN calls, this created two concurrent slots, wasting an MEI connection and contributing to the "device busy" errors the rest of this commit fixes.
### Local Activation Service (`internal/commands/activate/local.go`): `GetGeneralSettings` Retry

`GetGeneralSettings()` is a WSMAN call needed to obtain the AMT digest realm used when hashing passwords for CCM/ACM activation. It is now wrapped in `getGeneralSettingsWithRetry()`.
```
attempt 0: GetGeneralSettings() ──── ok ──────────────── return response
               └── transient busy? ── wait 300 ms
attempt 1: GetGeneralSettings() ──── ok ──────────────── return response
               └── transient busy? ── wait 600 ms
attempt 2: GetGeneralSettings() ──── ok ──────────────── return response
               └── transient busy? ── wait 900 ms
attempt 3: GetGeneralSettings() ──── any result ──────── return (final)
```
| Parameter | Value |
|---|---|
| Max retries | 3 (4 total attempts) |
| Backoff | `(attempt+1) × HeciConnectRetryBackoff` (300, 600, 900 ms) |
| Transient signals | `"device or resource busy"`, `"resource busy"`, `"no such device"`, `"device unavailable"` |
| Function | Step where called |
|---|---|
| `activateCCM()` | Before digest realm extraction |
| `activateACMWithTLS()` | Before cert chain operations |
| `activateACMLegacy()` | Before digest realm extraction |
## Sequence Diagrams

**Before** (unbounded wait, no retry):

```mermaid
sequenceDiagram
    participant E as RPS Executor
    participant L as LM Engine
    participant M as MEI Driver
    participant A as AMT Firmware
    E->>M: Connect()
    M-->>E: ok / busy (fatal, no retry)
    E->>L: go Listen()
    E->>E: waitGroup.Wait() [BLOCKS FOREVER if no ACK]
    A-->>L: CHANNEL_OPEN_CONFIRMATION
    L->>E: waitGroup.Done()
    E->>A: Send request data
    A-->>L: CHANNEL_DATA
    L-->>E: data channel
    Note over E: No response deadline → hangs if AMT silent
```
**After** (retry, bounded waits, timer override):

```mermaid
sequenceDiagram
    participant E as RPS Executor
    participant L as LM Engine
    participant M as MEI Driver
    participant A as AMT Firmware
    E->>M: Connect() [attempt 0]
    alt MEI busy
        M-->>E: EBUSY
        E->>E: sleep 300ms
        E->>M: Connect() [attempt 1]
    end
    M-->>E: ok
    E->>L: go Listen()
    E->>E: channelOpenTimer(LMETimerTimeout)
    par goroutine
        E->>E: go waitGroup.Wait() → close(done)
    and select
        A-->>L: CHANNEL_OPEN_CONFIRMATION
        L->>E: waitGroup.Done() → done closed
        E->>E: case <-done: proceed
    and timeout path
        E->>E: case <-timer: log error, return
    end
    E->>A: Send request data
    A-->>L: APF_CHANNEL_DATA
    Note over L: processWithLocalTimerOverride:<br/>reset session timer → 500ms
    L-->>E: data channel
    E->>E: responseTimer(AMTResponseTimeout=10s)
    Note over E: response or timeout, always returns
```
```mermaid
flowchart TD
    A([getGeneralSettingsWithRetry]) --> B[attempt = 0]
    B --> C[wsman.GetGeneralSettings]
    C -->|success| D(["return response, nil"])
    C -->|error| E{"transient<br/>MEI busy?"}
    E -->|No| F(["return empty, lastErr"])
    E -->|Yes| G{"attempt ==<br/>maxRetries (3)?"}
    G -->|Yes| F
    G -->|No| H["sleep (attempt+1) × 300 ms"]
    H --> I[attempt++]
    I --> C
```
```mermaid
flowchart LR
    subgraph "1st WSMAN call"
        A[SetupWsmanClient] --> B{plainProbeDone?}
        B -->|No| C[TCP probe LMS port]
        C -->|LMS up| D["useLocalLMX=false<br/>plainProbeDone=true"]
        C -->|LMS down| E["useLocalLMX=true<br/>plainProbeDone=true"]
    end
    subgraph "2nd+ WSMAN calls"
        F[SetupWsmanClient] --> G{plainProbeDone?}
        G -->|Yes| H{useLocalLMX?}
        H -->|true| I[lazy-init LocalTransport]
        H -->|false| J[use LMS TCP]
    end
```
## Risk Assessment

| Area | Risk | Severity | Mitigation |
|---|---|---|---|
| Reduced `AMTResponseTimeout` (30 → 10 s) | Legitimate slow AMT responses (e.g., slow TPM operations during ACM activate) may now time out | Medium | 10 s covers known AMT response times; adjustable via constants; can be reverted independently |
| Reduced `HeciReadTimeout` (30 → 15 s) | Slow HECI reads under load may time out | Low-Medium | Platform testing should confirm; stalls that took 15–30 s were already symptomatic of errors |
| `plainProbeDone` per `GoWSMANMessages` instance | If custom code creates new instances per call, caching is ineffective | Low | Within rpc-go, instances are held for the duration of an activation flow |
| `normalizeActivateURL` drops scheme-only URLs silently | A user passing `http://` by mistake gets no error message | Low | Logs no warning; could be improved with an explicit error; not a regression |
| Removed `EnsureWSMAN` | If some activation path relied on the early connection check | Low | `EnsureWSMAN` was never a precondition for correctness; activation WSMAN calls open their own connection |
| APF 500 ms timer override | If firmware expects the default timer for flow control | Low | 500 ms is faster than the default but still gradual; APF flow control is governed by window size, not the timer |
| `isMEIDeviceBusyError` vs `isTransientMEIBusy` divergence | Different retry criteria at transport vs. executor could cause inconsistent behaviour | Low | Documented above; the narrower set at the transport layer is intentional |
| Goroutine leak on channel-open timeout (LocalTransport) | `Listen()` goroutine blocked on HECI read | Low | `l.Close()` is called on timeout to signal the goroutine |
## Testing Coverage

| Change | Test Type | Location | Coverage |
|---|---|---|---|
| `normalizeActivateURL` | Unit tests | `activate_test.go` (+12 lines) | Valid URLs, empty, scheme-only, non-URL strings |
| `getGeneralSettingsWithRetry` | Manual / integration | — | Not unit-tested in this commit; a mock-based test would require extending the wsman mock |
| `connectLMEWithRetry` | Manual | — | No unit test; `isTransientMEIBusy` behaviour is inferable from its string matching |
| `isMEIDeviceBusyError` | Manual | — | Same as above |
| Timed channel-open wait | Manual | — | Requires a mock that delays `waitGroup.Done()` |
| APF timer override | Manual | — | Requires a mock APF session with a controllable timer |
| LMS probe caching | Manual | — | Requires a test harness with a controllable TCP listener |

> **Reviewer suggestion:** The retry helpers (`isTransientMEIBusy`, `isMEIDeviceBusyError`) are straightforward string-match functions and would benefit from table-driven unit tests in a follow-up commit.
## Architectural Observations

Three independent layers (RPS executor, local transport, local activation service) now share the same transient-retry pattern. The logic is duplicated rather than extracted into a shared helper. This is acceptable for the current scope but warrants a utility function (`RetryOnMEIBusy(fn func() error, maxRetries int) error`) if a fourth call site emerges.
The commit deliberately moves away from `context.WithTimeout` for internal waits in favour of `time.NewTimer`. This is architecturally sound for leaf-level waits where context propagation is not needed: it reduces goroutine overhead and makes the timeout lifecycle explicit. For cross-goroutine cancellation (e.g., user Ctrl+C), the outer `os/signal` channel in `executor.go` remains in place.
All timing parameters live in `pkg/utils/constants.go`. The commit does not introduce configurable timeouts (e.g., via `config.yaml`). Operators needing different timeouts must recompile. This is consistent with the existing design but could become a friction point on hardened or slower hardware.
String-matching on kernel error messages is a fragile pattern: kernel messages can differ across kernel versions, distributions, and locales. A future improvement would be to map `syscall.EBUSY` directly from the HECI driver's error return rather than text searching.
The switch from `utils.LMSAddress + ":" + utils.LMSTLSPort` to `net.JoinHostPort(g.target, utils.LMSTLSPort)` quietly improves IPv6 correctness. This is a low-risk but meaningful improvement if LMS is ever exposed on an IPv6 address.