lme optimization

Nabendu Maiti edited this page Mar 23, 2026 · 1 revision

Commit Review: fix: optimize LME activation sequence and local transport

Branch: lme_timeout_fix_optimization_tmp
Author: Nabendu Maiti
Date: 2026-03-19
SHA: 69bbeed
Audience: Patch reviewers, architects


Table of Contents

  1. Executive Summary
  2. Problem Statement
  3. Files Changed
  4. Architecture Context
  5. Change Details by Layer
  6. Sequence Diagrams
  7. Risk Assessment
  8. Testing Coverage
  9. Architectural Observations

1. Executive Summary

This commit addresses a class of AMT/LME stall bugs where rpc-go could hang indefinitely or wait excessively long when the Intel ME Interface (MEI) kernel driver returned transient "device busy" errors or when the AMT firmware was slow to confirm an APF channel open. The changes introduce:

  • Timed waits (replacing unbounded WaitGroup.Wait() blocks) at the channel-open barrier in both the RPS executor and the local HTTP transport.
  • Retry-with-backoff logic at three call sites where transient MEI busy errors are expected: LME channel connect, GetGeneralSettings, and HTTP transport RoundTrip.
  • APF timer acceleration: resets the APF session flush timer to 500 ms whenever an APF_CHANNEL_DATA message arrives, eliminating stalls caused by a slow timer firing after data has already been delivered.
  • LMS probe result caching: avoids redundant TCP probes on every WSMAN call during a multi-step activation.
  • Reduced global timeouts to align with realistic AMT response characteristics and avoid unnecessary delay in failure scenarios.
  • URL normalization to reject structurally invalid --url values early and prevent misrouted activation flows.

2. Problem Statement

| Symptom | Root Cause |
| --- | --- |
| `rpc activate --local` hangs for 30 s+ then fails | `WaitGroup.Wait()` was unbounded; if AMT never confirmed channel open (e.g., MEI busy), the goroutine blocked forever |
| Repeated "device or resource busy" errors causing abort | Single-attempt MEI connect with no retry; transient kernel driver contention treated as fatal |
| Activation stalls mid-sequence after first WSMAN call | Each WSMAN call re-probed LMS port (TCP connect + TLS handshake overhead per call) |
| APF data flushed too late during LME sessions | APF session timer used a long default; 500 ms flush acceleration needed for channel data deliveries |
| Malformed `--url` (e.g., `"http://"`) routed incorrectly | No validation before URL was evaluated as non-empty |
| Premature `EnsureWSMAN` during local activation | WSMAN init triggered before it was needed, wasting an LME slot |

3. Files Changed

| File | +Lines | −Lines | Category |
| --- | --- | --- | --- |
| internal/rps/executor.go | +90 | −2 | Core: LME channel loop |
| internal/local/amt/localTransport.go | +89 | −3 | Core: HTTP transport |
| internal/commands/activate/local.go | +38 | −1 | Local activation retry |
| internal/lm/engine.go | +23 | −1 | APF timer override |
| internal/commands/activate/activate.go | +26 | −4 | URL normalization, WSMAN cleanup |
| internal/local/amt/wsman.go | +38 | −4 | LMS probe caching |
| internal/commands/activate/activate_test.go | +12 | −0 | Tests: URL normalization |
| pkg/utils/constants.go | +6 | −6 | Timeout constants tuning |
| internal/rps/rps.go | +1 | −0 | Minor |
| pkg/heci/linux.go | +1 | −0 | Cosmetic (blank line) |
| **Total** | **+289** | **−41** | |

4. Architecture Context

Component Map

```
┌──────────────────────────────────────────────────────────────────────┐
│                          rpc-go Process                              │
│                                                                      │
│  ┌─────────────┐    ┌──────────────┐    ┌────────────────────────┐  │
│  │  CLI /      │    │  RPS         │    │  Local Activation      │  │
│  │  Activate   │───▶│  Executor    │    │  Service               │  │
│  │  Command    │    │  (WebSocket) │    │  (local.go)            │  │
│  └──────┬──────┘    └──────┬───────┘    └──────────┬─────────────┘  │
│         │                  │                        │                │
│         ▼                  ▼                        ▼                │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │               GoWSMANMessages / LocalTransport               │   │
│  │   ┌──────────────────────────────────────────────────────┐   │   │
│  │   │   LMS probe cache (plainProbeDone / useLocalLMX)     │   │   │
│  │   └──────────────────────────────────────────────────────┘   │   │
│  └──────────────────┬───────────────────────────────────────────┘   │
│                     │                                                │
│     ┌───────────────┴──────────────┐                                │
│     │           LM Engine          │                                │
│     │  ┌───────────────────────┐   │                                │
│     │  │  APF Session + Timer  │   │                                │
│     │  │  (500ms override on   │   │                                │
│     │  │   CHANNEL_DATA)       │   │                                │
│     │  └───────────────────────┘   │                                │
│     └───────────────┬──────────────┘                                │
└─────────────────────┼──────────────────────────────────────────────┘
                      │ ioctl / HECI driver
                      ▼
              ┌───────────────┐
              │  MEI kernel   │
              │  driver       │
              │  /dev/mei0    │
              └───────┬───────┘
                      │ HECI bus
                      ▼
              ┌───────────────┐
              │  Intel AMT    │
              │  Firmware     │
              └───────────────┘
```

Timeout Budget — Before vs. After

```
                       BEFORE                         AFTER
                ┌──────────────────┐          ┌──────────────────┐
 LMS conn       │     10 s         │          │      6 s         │
 LMS dialer     │      5 s         │          │      4 s         │
 HECI read      │     30 s         │          │     15 s         │
 HECI retry     │   3000 ms        │          │   2000 ms        │
 HECI backoff   │    500 ms        │          │    300 ms        │
 AMT response   │     30 s         │          │     10 s         │
                └──────────────────┘          └──────────────────┘
```

5. Change Details by Layer

5.1 Timeout Constants — pkg/utils/constants.go

These constants govern every wait in the LME/LMS/MEI stack.

| Constant | Old Value | New Value | Unit | Rationale |
| --- | --- | --- | --- | --- |
| `LMSConnectionTimeout` | 10 | 6 | seconds | LMS TCP connect completes in < 1 s on a healthy system; 6 s is ample |
| `LMSDialerTimeout` | 5 | 4 | seconds | Aligns with reduced connection timeout |
| `HeciReadTimeout` | 30 | 15 | seconds | AMT typically responds in < 5 s; 15 s catches stalls without waiting 30 s |
| `HeciRetryDelay` | 3000 | 2000 | ms | Reduces dead time between HECI retries |
| `HeciConnectRetryBackoff` | 500 | 300 | ms | Shorter initial backoff on transient MEI busy; linear scale preserves total budget |
| `AMTResponseTimeout` | 30 | 10 | seconds | Matches typical AMT response SLA; 30 s was masking hangs |

Architectural note: All timeouts are centralized here. No hardcoded values were introduced in this commit; retry delays always reference utils.HeciConnectRetryBackoff.


5.2 LM Engine — APF Timer Override — internal/lm/engine.go

What changed

A new wrapper processWithLocalTimerOverride(message []byte) replaces direct apf.Process() calls in both execute() and Listen().

```go
const lmeAPFChannelDataFlushOverride = 500 * time.Millisecond

func (lme *LMEConnection) processWithLocalTimerOverride(message []byte) bytes.Buffer {
    processed := apf.Process(message, lme.Session)

    if len(message) > 0 && message[0] == apf.APF_CHANNEL_DATA && lme.Session.Timer != nil {
        if !lme.Session.Timer.Stop() {
            select {
            case <-lme.Session.Timer.C:
            default:
            }
        }
        lme.Session.Timer.Reset(lmeAPFChannelDataFlushOverride)
    }

    return processed
}
```

Why

The APF session timer controls how quickly buffered channel data is flushed. When the default timer period is long and APF_CHANNEL_DATA arrives, the data sits in the buffer until the timer fires. This creates visible stalls in the WSMAN response path. The 500 ms override causes the flush to happen much sooner after data arrival without disrupting orderly session teardown on other message types.

Call sites changed

| Function | Before | After |
| --- | --- | --- |
| `execute()` | `apf.Process(result, lme.Session)` | `lme.processWithLocalTimerOverride(result)` |
| `Listen()` | `apf.Process(result2, lme.Session)` | `lme.processWithLocalTimerOverride(result2)` |

Timer safety pattern

The Stop() + drain-channel pattern prevents the Reset() from racing with an already-fired timer:

```
Timer.Stop() == false  →  timer already fired  →  drain Timer.C  →  Reset()
Timer.Stop() == true   →  timer not yet fired  →  Reset() directly
```

This is the standard Go pattern for safe timer reuse.


5.3 RPS Executor — LME Channel Handling — internal/rps/executor.go

The RPS path is used during remote/cloud-driven provisioning where RPS sends activation commands over WebSocket to rpc-go, which forwards them via LME to AMT firmware.

5.3.1 Retry on MEI busy (connectLMEWithRetry)

```
connectLMEWithRetry()
  attempt 0 → Connect() ─ success → return nil
                         └ transient busy → wait 300ms
  attempt 1 → Connect() ─ success → return nil
                         └ transient busy → wait 600ms
  attempt 2 → Connect() ─ success/failure → return
```
| Parameter | Value |
| --- | --- |
| Max retries | 2 (3 total attempts) |
| Backoff | (attempt+1) × 300 ms (linear: 300, 600 ms) |
| Retry trigger | "device or resource busy", "resource busy", "no such device", "device unavailable" |
| Non-transient | Returned immediately without retry |

5.3.2 Timed channel-open wait (replaces blocking WaitGroup.Wait())

Before:

```go
e.waitGroup.Wait()  // blocks forever if AMT never confirms
```

After:

```go
channelOpenTimer := time.NewTimer(channelOpenTimeout)  // bounded by LMETimerTimeout (capped at AMTResponseTimeout)
channelOpenDone  := make(chan struct{})

go func() {
    defer close(channelOpenDone)
    e.waitGroup.Wait()
}()

select {
case <-channelOpenDone:
    // channel confirmed, proceed
case <-channelOpenTimer.C:
    e.lastError = fmt.Errorf("timeout waiting for AMT channel open...")
    return true  // signal error to caller
}
```

5.3.3 Response timeout: context.WithTimeout → time.NewTimer

| Aspect | Before | After |
| --- | --- | --- |
| Mechanism | `context.WithTimeout` / `ctx.Done()` | `time.NewTimer` / `timer.C` |
| Goroutine overhead | Context creates goroutine + cancelFunc | Timer is a single heap object |
| Leak risk | `defer cancel()` was correct but context propagation adds complexity | Direct timer; explicit Stop-drain in defer |
| Semantics | `ctx.Done()` closed on deadline or cancel | `timer.C` fires once at deadline |

The change removes the "context" import and replaces it with "strings" (for the new isTransientMEIBusy helper), netting zero new imports.


5.4 Local Transport — RoundTrip — internal/local/amt/localTransport.go

The LocalTransport implements http.RoundTripper for WSMAN over MEI when LMS is not available. It is used by GoWSMANMessages during local activation.

5.4.1 MEI busy retry on RoundTrip

Same retry pattern as §5.3.1 but applied to LocalTransport.RoundTrip(). Uses isMEIDeviceBusyError() which checks for "device or resource busy" and "resource busy" (subset of the executor's isTransientMEIBusy; "no such device" not included at transport level as that implies a more permanent absence).

| Aspect | Executor `isTransientMEIBusy` | Transport `isMEIDeviceBusyError` |
| --- | :---: | :---: |
| "device or resource busy" | ✓ | ✓ |
| "resource busy" | ✓ | ✓ |
| "no such device" | ✓ | — |
| "device unavailable" | ✓ | — |

Reviewer note: The narrower set in the transport is intentional — a missing MEI device ("no such device") at the transport layer should fall through to an LMS fallback, not retry directly.

5.4.2 Timed channel-open wait

Same pattern as §5.3.2. On timeout, the transport explicitly calls l.Close() to unblock the Listen() goroutine before returning the error — preventing goroutine leaks.

5.4.3 Response timeout added to loop

Previously the inner select loop had no deadline branch. A responseTimer is now added:

```go
case <-responseTimer.C:
    respErr = fmt.Errorf("timeout waiting for LME response after %s", responseTimeout)
    break Loop
```

This ensures RoundTrip always returns within AMTResponseTimeout (10s).

Before/After flow diagram

```
BEFORE RoundTrip()                   AFTER RoundTrip()
─────────────────────                ──────────────────────────────
Connect()                            for attempt 0..2:
  │                                    Connect() ── ok ──────────┐
  ▼                                    └── busy? ── wait 300ms   │
go Listen()                                                       │
  │                                  go Listen()                  │
  ▼                                     │                         │
WaitGroup.Wait()  ◄── hangs            select:                   │
                      forever            ├─ WG done → proceed ◄──┘
                      if no ACK          └─ timer   → Close + error

read loop:                           read loop:
  select:                              select:
    ├─ data channel                      ├─ data channel
    └─ error channel                     ├─ error channel
       (no deadline)                     └─ responseTimer (10s)
```

5.5 WSMAN Client — LMS Probe Caching — internal/local/amt/wsman.go

New fields on GoWSMANMessages

```go
type GoWSMANMessages struct {
    // ...existing fields...
    plainProbeDone bool    // true once LMS TCP probe has run
    useLocalLMX    bool    // true if the LMS probe failed (use MEI instead)
}
```

LMS probe: before vs. after

```
BEFORE SetupWsmanClient() (non-TLS path):
  Every call:
    ctx, cancel := context.WithTimeout(...)
    dialer.DialContext(tcp4, LMSAddress:LMSPort)
    if err → use localTransport
    else   → use LMS

AFTER SetupWsmanClient() (non-TLS path):
  First call only:
    dial LMS, record result in plainProbeDone + useLocalLMX

  Subsequent calls:
    skip probe entirely, use cached useLocalLMX decision
    lazy-init localTransport only when needed
```

Impact during multi-step activation

A full local ACM activation may call SetupWsmanClient many times (for each WSMAN GET/PUT). Without caching, each call attempted a TCP connect-and-close to the LMS port — wasted syscalls and time on systems without LMS.

TLS path improvement

```go
// Before
dialer := &cryptotls.Dialer{Config: tlsConfig}
conn, _ := dialer.DialContext(ctx, "tcp", utils.LMSAddress+":"+utils.LMSTLSPort)

// After
dialer := &cryptotls.Dialer{
    Config:    tlsConfig,
    NetDialer: &net.Dialer{Timeout: probeTimeout},  // explicit inner timeout
}
conn, _ := dialer.DialContext(ctx, "tcp", net.JoinHostPort(g.target, utils.LMSTLSPort))
```
  • net.Dialer{Timeout: probeTimeout} gives the TCP layer its own deadline independent of the TLS context.
  • net.JoinHostPort replaces string concatenation — handles IPv6 addresses correctly (wraps host in [...]).
  • g.target replaces hardcoded utils.LMSAddress, enabling the transport to work against custom LMS addresses.

5.6 Activate Command — URL Normalization & WSMAN Init — internal/commands/activate/activate.go

normalizeActivateURL()

```go
func normalizeActivateURL(raw string) string {
    value := strings.TrimSpace(raw)
    if value == "" { return "" }

    u, err := url.Parse(value)
    if err != nil { return value }       // unparseable → pass through, fail later

    scheme := strings.ToLower(u.Scheme)
    if (scheme == "http" || scheme == "https" || scheme == "ws" || scheme == "wss") && u.Host == "" {
        return ""                        // scheme-only or malformed → treat as empty
    }

    return value
}
```
| Input | Output | Why |
| --- | --- | --- |
| `"wss://rps.example.com:8080"` | unchanged | valid |
| `"http://"` | `""` | scheme with no host |
| `"https://"` | `""` | scheme with no host |
| `"somestring"` | `"somestring"` | no scheme → pass through |
| `""` | `""` | empty |

This is called at the top of Validate(), before any mode-detection logic, so cmd.URL == "" is always a reliable sentinel by the time the mode-selection branches run.

Removed EnsureWSMAN call

```go
// REMOVED from Run() in the local-activation branch:
if err := cmd.EnsureWSMAN(ctx); err != nil {
    return err
}
```

EnsureWSMAN opens an LME connection eagerly. When local activation itself subsequently opens the LME for WSMAN calls, this created two concurrent slots, wasting an MEI connection and contributing to the "device busy" errors the rest of this commit fixes.


5.7 Local Activation — GetGeneralSettings Retry — internal/commands/activate/local.go

GetGeneralSettings() is a WSMAN call needed to obtain the AMT digest realm used when hashing passwords for CCM/ACM activation. It is now wrapped in getGeneralSettingsWithRetry().

Retry policy

```
attempt 0:  GetGeneralSettings() ──── ok ──────────────── return response
             └── transient busy? ── wait 300ms
attempt 1:  GetGeneralSettings() ──── ok ──────────────── return response
             └── transient busy? ── wait 600ms
attempt 2:  GetGeneralSettings() ──── ok ──────────────── return response
             └── transient busy? ── wait 900ms
attempt 3:  GetGeneralSettings() ──── any result ──────── return (final)
```
| Parameter | Value |
| --- | --- |
| Max retries | 3 (4 total attempts) |
| Backoff | (attempt+1) × `HeciConnectRetryBackoff` — 300, 600, 900 ms |
| Transient signals | "device or resource busy", "resource busy", "no such device", "device unavailable" |

Call sites updated

| Function | Step where called |
| --- | --- |
| `activateCCM()` | Before digest realm extraction |
| `activateACMWithTLS()` | Before cert chain operations |
| `activateACMLegacy()` | Before digest realm extraction |

6. Sequence Diagrams

6.1 Before: LME Channel Open (Blocking)

```mermaid
sequenceDiagram
    participant E as RPS Executor
    participant L as LM Engine
    participant M as MEI Driver
    participant A as AMT Firmware

    E->>M: Connect()
    M-->>E: ok / busy (fatal, no retry)
    E->>L: go Listen()
    E->>E: waitGroup.Wait()  [BLOCKS FOREVER if no ACK]
    A-->>L: CHANNEL_OPEN_CONFIRMATION
    L->>E: waitGroup.Done()
    E->>A: Send request data
    A-->>L: CHANNEL_DATA
    L-->>E: data channel
    Note over E: No response deadline → hangs if AMT silent
```

6.2 After: LME Channel Open (Timed + Retry)

```mermaid
sequenceDiagram
    participant E as RPS Executor
    participant L as LM Engine
    participant M as MEI Driver
    participant A as AMT Firmware

    E->>M: Connect() [attempt 0]
    alt MEI busy
        M-->>E: EBUSY
        E->>E: sleep 300ms
        E->>M: Connect() [attempt 1]
    end
    M-->>E: ok
    E->>L: go Listen()
    E->>E: channelOpenTimer(LMETimerTimeout)
    par goroutine
        E->>E: go waitGroup.Wait() → close(done)
    and select
        A-->>L: CHANNEL_OPEN_CONFIRMATION
        L->>E: waitGroup.Done() → done closed
        E->>E: case <-done: proceed
    and timeout path
        E->>E: case <-timer: log error, return
    end
    E->>A: Send request data
    A-->>L: APF_CHANNEL_DATA
    Note over L: processWithLocalTimerOverride:<br/>reset session timer → 500ms
    L-->>E: data channel
    E->>E: responseTimer(AMTResponseTimeout=10s)
    Note over E: response or timeout, always returns
```

6.3 GetGeneralSettings Retry Flow

```mermaid
flowchart TD
    A([getGeneralSettingsWithRetry]) --> B[attempt = 0]
    B --> C[wsman.GetGeneralSettings]
    C -->|success| D([return response, nil])
    C -->|error| E{transient\nMEI busy?}
    E -->|No| F([return empty, lastErr])
    E -->|Yes| G{attempt ==\nmaxRetries=3?}
    G -->|Yes| F
    G -->|No| H[sleep attempt+1 × 300ms]
    H --> I[attempt++]
    I --> C
```

6.4 LMS Probe Caching

```mermaid
flowchart LR
    subgraph "1st WSMAN call"
        A[SetupWsmanClient] --> B{plainProbeDone?}
        B -->|No| C[TCP probe LMS port]
        C -->|LMS up| D[useLocalLMX=false\nplainProbeDone=true]
        C -->|LMS down| E[useLocalLMX=true\nplainProbeDone=true]
    end
    subgraph "2nd+ WSMAN calls"
        F[SetupWsmanClient] --> G{plainProbeDone?}
        G -->|Yes| H{useLocalLMX?}
        H -->|true| I[lazy-init LocalTransport]
        H -->|false| J[use LMS TCP]
    end
```

7. Risk Assessment

| Area | Risk | Severity | Mitigation |
| --- | --- | --- | --- |
| Reduced `AMTResponseTimeout` (30→10 s) | Legitimate slow AMT responses (e.g., slow TPM operations during ACM activate) may now time out | Medium | 10 s covers known AMT response times; adjustable via constants; can be reverted independently |
| Reduced `HeciReadTimeout` (30→15 s) | Slow HECI reads under load may time out | Low-Medium | Platform testing should confirm; stalls that took 15–30 s were already symptomatic of errors |
| `plainProbeDone` per `GoWSMANMessages` instance | If custom code creates new instances per call, caching is ineffective | Low | Within rpc-go, instances are held for the duration of an activation flow |
| `normalizeActivateURL` drops scheme-only URLs silently | A user passing `http://` by mistake gets no error message | Low | Logs no warning; could be improved with an explicit error; not a regression |
| Removed `EnsureWSMAN` | If some activation path relied on the early connection check | Low | `EnsureWSMAN` was never a precondition for correctness; activation WSMAN calls open their own connection |
| APF 500 ms timer override | If firmware expects the default timer for flow control | Low | 500 ms is faster than the default but still gradual; APF flow control is governed by window size, not timer |
| `isMEIDeviceBusyError` vs `isTransientMEIBusy` divergence | Different retry criteria at transport vs. executor could cause inconsistent behaviour | Low | Documented above; the narrower set at the transport layer is intentional |
| Goroutine leak on channel-open timeout (LocalTransport) | `Listen()` goroutine blocked on HECI read | Low | `l.Close()` is called on timeout to signal the goroutine |

8. Testing Coverage

| Change | Test Type | Location | Coverage |
| --- | --- | --- | --- |
| `normalizeActivateURL` | Unit tests | activate_test.go (+12 lines) | Valid URLs, empty, scheme-only, non-URL strings |
| `getGeneralSettingsWithRetry` | Manual / integration | — | Not unit-tested in this commit; mock-based test would require wsman mock extension |
| `connectLMEWithRetry` | Manual | — | No unit test; `isTransientMEIBusy` is inferable from string matching |
| `isMEIDeviceBusyError` | Manual | — | Same as above |
| Timed channel-open wait | Manual | — | Requires mock that delays `waitGroup.Done()` |
| APF timer override | Manual | — | Requires mock APF session with controllable timer |
| LMS probe caching | Manual | — | Requires test harness with controllable TCP listener |

Reviewer suggestion: The retry helpers (isTransientMEIBusy, isMEIDeviceBusyError) are straightforward string-match functions and would benefit from table-driven unit tests in a follow-up commit.


9. Architectural Observations

Pattern consistency

Three independent layers (RPS executor, local transport, local activation service) now share the same transient-retry pattern. The logic is duplicated rather than extracted into a shared helper. This is acceptable for the current scope but warrants a utility function (RetryOnMEIBusy(fn func() error, maxRetries int) error) if a fourth call site emerges.

Timer vs. context for timeouts

The commit deliberately moves away from context.WithTimeout for internal waits in favour of time.NewTimer. This is architecturally sound for leaf-level waits where context propagation is not needed — it reduces goroutine overhead and makes the timeout lifecycle explicit. For cross-goroutine cancellation (e.g., user Ctrl+C), the outer os/signal channel in executor.go remains in place.

Timeout constants ownership

All timing parameters live in pkg/utils/constants.go. The commit does not introduce configurable timeouts (e.g., via config.yaml). Operators needing different timeouts must recompile. This is consistent with the existing design but could become a friction point on hardened or slower hardware.

MEI driver busy error detection

String-matching on kernel error messages is a fragile pattern — kernel messages can differ across kernel versions, distributions, and locales. A future improvement would be to map syscall.EBUSY directly from the HECI driver's error return rather than text searching.

IPv6 readiness

The switch from utils.LMSAddress + ":" + utils.LMSTLSPort to net.JoinHostPort(g.target, utils.LMSTLSPort) quietly improves IPv6 correctness. This is a low-risk but meaningful improvement if LMS is ever exposed on an IPv6 address.
