Phase 3: Security Hardening Implementation Plan

Date: 2026-01-27
Status: READY FOR IMPLEMENTATION
Based on: 32-Agent RedTeam Security Analysis
Verdict: CONDITIONALLY SOUND BUT NOT PRODUCTION-READY

Executive Summary

This implementation plan addresses 5 mandatory security hardening items identified by comprehensive adversarial analysis. The current Mattermost ChatOps Bot implementation is functionally sound but has 7 security gaps (1 CRITICAL, 3 HIGH, 3 MEDIUM) that must be resolved before production deployment.

Implementation Timeline: 48-64 hours (2-3 week sprint)
Deployment Target: RHEL 9 production after hardening verification

Red Team Findings Summary

Weakness	Agent Consensus	Severity	Addressed By
No independent authentication layer	24/32 agents	CRITICAL	Item #1
Regex validation semantically blind	22/32 agents	HIGH	Ongoing monitoring
Allowlist trusts script runtime	20/32 agents	HIGH	Item #2
Supply chain risk	18/32 agents	HIGH	Item #4
Audit logs lack integrity	16/32 agents	MEDIUM	Item #3
No runtime monitoring	15/32 agents	MEDIUM	Item #2
WebSocket token SPOF	14/32 agents	MEDIUM	Item #5

Red Team Verdict: "While the allowlist architecture is sound and the validation thorough, the system's defenses collapse if ANY component in the trust chain is compromised. This is acceptable for internal tooling with trusted operators, but UNACCEPTABLE for production environments with broader access."

Implementation Priority Order

Priority 1: Audit Log Integrity [FOUNDATIONAL]
│           ├─ No dependencies
│           ├─ Enables forensics for all features
│           └─ Est: 6-8 hours
│
Priority 2: Supply Chain Verification [FOUNDATIONAL]
│           ├─ No dependencies
│           ├─ Prevents compromised deployments
│           └─ Est: 6-8 hours
│
Priority 3: Independent Authentication [HIGH IMPACT]
│           ├─ Depends on: Audit Log (for auth logging)
│           ├─ 24/32 agents identified as critical
│           └─ Est: 8-12 hours
│
Priority 4: Runtime Behavior Monitoring [DETECTION]
│           ├─ Depends on: Audit Log (for anomaly logging)
│           ├─ Enables detection of compromised scripts
│           └─ Est: 12-16 hours
│
Priority 5: WebSocket Token Protection [DEFENSE-IN-DEPTH]
            ├─ Depends on: Auth, Monitoring (anomaly integration)
            ├─ Rate limiting and behavioral analysis
            └─ Est: 16-20 hours

Total: 48-64 hours

1. Independent Authentication Layer

Architecture Decision

Implement TOTP-based secondary authentication layer with SQLite session management.

Why TOTP?

Universal mobile authenticator support (Authy, Google Authenticator)
No browser required (WebAuthn needs browser interaction)
Self-contained (no IdP dependency for bot availability)
Proven standard (RFC 6238)

Why SQLite over Redis?

Bot deployment is single-instance
No additional infrastructure to secure
Survives process restarts
Audit trail persists to disk

File Modifications

New Files:

src/core/auth/types.ts - Authentication types
src/core/auth/session-store.ts - SQLite session management (350 lines)
src/core/auth/totp-provider.ts - TOTP generation/verification (150 lines)
src/core/auth/authenticator.ts - Core authentication logic (250 lines)
src/core/auth/enrollment-manager.ts - TOTP enrollment flow (200 lines)
data/auth.db - SQLite database for sessions

Modified Files:

bot.ts (lines 80-100): Add authentication middleware before router.execute()
src/core/command-router.ts (lines 45-60): Inject authentication check

Implementation Approach

Session Store (SQLite):
- Tables: sessions, challenges, totp_secrets
- 8-hour session TTL with sliding window on activity
- Challenge TTL: 5 minutes, max 3 attempts

Authentication Flow:

User sends command
→ Check session (valid & not expired?)
   ├─ YES: Execute command
   └─ NO: Create TOTP challenge
       → Send "Enter your 6-digit code: " to channel
       → User replies with code
       → Verify code against encrypted secret
          ├─ VALID: Create session, execute command
          └─ INVALID: Log attempt, send error
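The session check at the top of this flow can be sketched as follows. This is a minimal in-memory illustration rather than the SQLite-backed store; `Session`, `checkSession`, and `SESSION_TTL_MS` are hypothetical names, and the TTL mirrors the 8-hour sliding window described above.

```typescript
// Hypothetical in-memory stand-in for the SQLite sessions table.
interface Session {
  userId: string;
  expiresAt: number; // epoch milliseconds
}

const SESSION_TTL_MS = 8 * 60 * 60 * 1000; // 8-hour TTL from the plan

type SessionDecision =
  | { action: "execute"; session: Session }
  | { action: "challenge" };

// Returns "execute" with a renewed session while the session is valid,
// otherwise tells the caller to start a TOTP challenge.
function checkSession(
  sessions: Map<string, Session>,
  userId: string,
  now: number = Date.now(),
): SessionDecision {
  const session = sessions.get(userId);
  if (!session || session.expiresAt <= now) {
    sessions.delete(userId); // drop expired state so it cannot be revived
    return { action: "challenge" };
  }
  // Sliding window: every successful check pushes the expiry forward.
  const renewed: Session = { ...session, expiresAt: now + SESSION_TTL_MS };
  sessions.set(userId, renewed);
  return { action: "execute", session: renewed };
}
```

Passing `now` explicitly keeps expiry behavior testable without waiting for real time to pass.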

TOTP Implementation:
- Uses otpauth://totp/ standard
- 30-second time step
- Secrets encrypted at rest (AES-256-GCM with key from env)

Enrollment:
- Admin command: !auth-enroll @username
- Generates QR code as data URI (sent as image attachment)
- User scans with authenticator app
- Verification required before enrollment active
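To make the verification step concrete, here is a sketch of the RFC 6238 calculation with a ±1 step window and one-time-use tracking. The plan names otplib for this; the sketch deliberately uses only `node:crypto` to show the underlying computation, and `hotp`, `verifyTotp`, and `lastUsedStep` are illustrative names, not part of otplib or the bot.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const STEP_SECONDS = 30; // time step from the plan

// RFC 4226 HOTP: HMAC-SHA1 over an 8-byte big-endian counter, then dynamic truncation.
function hotp(secret: Buffer, counter: number, digits = 6): string {
  const msg = Buffer.alloc(8);
  msg.writeBigUInt64BE(BigInt(counter));
  const mac = createHmac("sha1", secret).update(msg).digest();
  const offset = mac[mac.length - 1] & 0x0f;
  const binary = mac.readUInt32BE(offset) & 0x7fffffff;
  return (binary % 10 ** digits).toString().padStart(digits, "0");
}

// Returns the matched time step, or null when the code is wrong or already used.
// `lastUsedStep` is the last step accepted for this user (assumed to be persisted).
function verifyTotp(
  secret: Buffer,
  code: string,
  nowMs: number,
  lastUsedStep: number | null,
  window = 1,
): number | null {
  const current = Math.floor(nowMs / 1000 / STEP_SECONDS);
  for (let step = current - window; step <= current + window; step++) {
    if (step < 0) continue;
    if (lastUsedStep !== null && step <= lastUsedStep) continue; // replay guard
    const expected = Buffer.from(hotp(secret, step));
    const given = Buffer.from(code);
    if (given.length === expected.length && timingSafeEqual(given, expected)) {
      return step;
    }
  }
  return null;
}
```

The caller would store the returned step as the new `lastUsedStep`, so the same code cannot be accepted twice inside its validity window.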

Dependencies

{
  "dependencies": {
    "otplib": "^12.0.1",
    "qrcode": "^1.5.3"
  }
}

Configuration Changes

config/bot.config.json additions:

{
  "authentication": {
    "enabled": true,
    "sessionTTL": 28800000,
    "challengeTTL": 300000,
    "maxAttempts": 3,
    "exemptUsers": [],
    "encryptionKey": "ENV:AUTH_ENCRYPTION_KEY"
  }
}

Environment variables:

AUTH_ENCRYPTION_KEY=<32-byte-hex-key>
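As an illustration of how the key could be used for secrets at rest, the following sketch encrypts and decrypts a TOTP secret with AES-256-GCM via `node:crypto`. The storage layout (IV, tag, and ciphertext concatenated and base64-encoded) and the function names are assumptions made for this example, not a format the plan specifies.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Parses AUTH_ENCRYPTION_KEY (64 hex chars = 32 bytes) and rejects anything else.
function loadKey(hex: string): Buffer {
  if (!/^[0-9a-fA-F]{64}$/.test(hex)) {
    throw new Error("AUTH_ENCRYPTION_KEY must be 64 hex characters (32 bytes)");
  }
  return Buffer.from(hex, "hex");
}

// Assumed layout: base64(iv[12] || tag[16] || ciphertext).
function encryptSecret(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // fresh nonce per encryption; never reuse with the same key
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]).toString("base64");
}

function decryptSecret(encoded: string, key: Buffer): string {
  const raw = Buffer.from(encoded, "base64");
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const ciphertext = raw.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  // final() throws if the tag does not match, i.e. the stored value was altered.
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

Because GCM is authenticated, a modified row in the totp_secrets table fails decryption instead of yielding a silently different secret.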

Testing Strategy

Enrollment: Test QR generation, secret storage, initial verification
Challenge flow: Test code validation, expiry, max attempts
Session lifecycle: Test creation, validation, expiry, renewal
Edge cases: Clock skew tolerance, replay prevention, rate limiting
Integration: Test with existing command router and script executor

Security Considerations

Encryption key: Store securely, rotate periodically
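Clock skew: Accept codes from ±1 time step (±30s); set the verification window explicitly rather than relying on library defaults (configurable)
Replay attacks: Codes valid for 30s window - one-time use enforced by recording the last accepted time step per user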
Brute force: Max 3 attempts per challenge, exponential lockout
Session hijacking: Session tied to user ID, IP (optional)

Estimated Time

8-12 hours including testing and integration

2. Runtime Behavior Monitoring

Architecture Decision

Implement Linux /proc filesystem monitoring for subprocess behavior tracking. No external tools required.

Why /proc?

Built-in Linux kernel interface
No additional dependencies
Real-time process statistics
Low overhead

File Modifications

New Files:

src/core/monitoring/process-monitor.ts - /proc reader (400 lines)
src/core/monitoring/resource-tracker.ts - CPU/memory tracking (200 lines)
src/core/monitoring/file-monitor.ts - Filesystem access (inotify wrapper) (250 lines)
src/core/monitoring/network-monitor.ts - Network connections (ss parser) (200 lines)
src/core/monitoring/anomaly-detector.ts - Behavior rules (300 lines)
config/monitoring-rules.json - Detection rules

Modified Files:

src/core/script-executor.ts (lines 200-250): Instrument executeSync/executeAsync
src/core/script-executor.ts (lines 450-500): Add monitoring hooks

Implementation Approach

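Process Monitoring (/proc/{pid}/stat, /proc/{pid}/fd):
- CPU usage percentage (derived from utime + stime deltas between polls)
- Memory RSS/VSS
- File descriptor count (number of entries in /proc/{pid}/fd; not part of stat)
- Thread count

File Access Monitoring (inotify):
- Wrap script execution directory with inotify watches
- Track: OPEN, MODIFY, CREATE, DELETE events
- Alert on access outside allowed paths (inotify reports only on watched paths, so sensitive locations outside the allowlist need their own watches)

Network Monitoring (ss):
- Poll ss -tunap during script execution
- Detect connections not in allowlist
- Alert on unexpected destinations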
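The process statistics above come from a single line in /proc/{pid}/stat. The sketch below shows one way to parse it; `ProcStat`, `parseProcStat`, and `cpuPercent` are hypothetical names, and the default of 100 clock ticks per second is an assumption that should be confirmed with `getconf CLK_TCK` on the target host. The command name is taken between the first "(" and the last ")" because it may itself contain spaces or parentheses.

```typescript
interface ProcStat {
  pid: number;
  comm: string;
  cpuTicks: number; // utime + stime, in clock ticks
  threads: number;
  vsizeBytes: number;
  rssPages: number; // multiply by the page size to get bytes
}

function parseProcStat(line: string): ProcStat {
  const open = line.indexOf("(");
  const close = line.lastIndexOf(")");
  if (open < 0 || close < open) throw new Error("unexpected /proc stat format");
  const rest = line.slice(close + 1).trim().split(/\s+/);
  // rest[0] is field 3 (state), so field N of proc(5) lives at rest[N - 3].
  const field = (n: number): number => Number(rest[n - 3]);
  return {
    pid: Number(line.slice(0, open).trim()),
    comm: line.slice(open + 1, close),
    cpuTicks: field(14) + field(15),
    threads: field(20),
    vsizeBytes: field(23),
    rssPages: field(24),
  };
}

// CPU percentage of one core consumed between two samples.
function cpuPercent(
  prevTicks: number,
  currTicks: number,
  elapsedMs: number,
  ticksPerSecond = 100,
): number {
  const cpuSeconds = (currTicks - prevTicks) / ticksPerSecond;
  return (cpuSeconds / (elapsedMs / 1000)) * 100;
}
```

With the 500 ms poll interval from the configuration, two consecutive `cpuTicks` samples feed straight into `cpuPercent`.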

Anomaly Detection Rules:

interface MonitoringRule {
  name: string;
  type: 'resource' | 'file' | 'network';
  condition: string; // Expression: "cpu > 80"
  action: 'log' | 'alert' | 'kill';
  severity: 'info' | 'warning' | 'critical';
}

Integration:

// In script-executor.ts
const monitor = new ProcessMonitor(proc.pid, monitoringRules);
monitor.start();

monitor.onAnomaly((anomaly) => {
  this.logAnomaly(anomaly);
  if (anomaly.action === 'kill') {
    proc.kill();
  }
});

await proc.exited;
const stats = monitor.stop();

Dependencies

No npm dependencies (uses Node.js built-in fs/child_process). Network polling shells out to ss (iproute2), which must be installed on the host.

Configuration Changes

config/monitoring-rules.json:

{
  "rules": [
    {
      "name": "high_cpu",
      "type": "resource",
      "condition": "cpu > 80",
      "action": "alert",
      "severity": "warning"
    },
    {
      "name": "memory_spike",
      "type": "resource",
      "condition": "memory > 512000000",
      "action": "alert",
      "severity": "warning"
    },
    {
      "name": "unexpected_file_access",
      "type": "file",
      "condition": "path not_in ['/opt/mattermost-bot/scripts', '/tmp']",
      "action": "log",
      "severity": "critical"
    },
    {
      "name": "unexpected_network",
      "type": "network",
      "condition": "destination not_in ['10.0.0.0/8', '192.168.0.0/16']",
      "action": "alert",
      "severity": "critical"
    }
  ],
  "global": {
    "pollIntervalMs": 500,
    "enableFileMonitoring": true,
    "enableNetworkMonitoring": true
  }
}
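Resource conditions such as "cpu > 80" need an evaluator. One option that avoids `eval` is a small parser restricted to the form `<metric> <operator> <number>`, sketched below. The grammar and the names `evaluateResourceCondition` and `Metrics` are assumptions for illustration; the `not_in` conditions used by the file and network rules would need their own handler.

```typescript
type Metrics = Record<string, number>;

function compare(op: string, left: number, right: number): boolean {
  switch (op) {
    case ">": return left > right;
    case ">=": return left >= right;
    case "<": return left < right;
    case "<=": return left <= right;
    default: throw new Error(`Unsupported operator: ${op}`);
  }
}

// Accepts only "<metric> <op> <number>"; anything else is rejected, never executed.
function evaluateResourceCondition(condition: string, metrics: Metrics): boolean {
  const match = /^\s*([a-z_]+)\s*(>=|<=|>|<)\s*(\d+(?:\.\d+)?)\s*$/i.exec(condition);
  if (!match) throw new Error(`Unsupported condition: ${condition}`);
  const metric = match[1] ?? "";
  const value = metrics[metric];
  if (value === undefined) throw new Error(`Unknown metric: ${metric}`);
  return compare(match[2] ?? "", value, Number(match[3]));
}
```

Rejecting unparseable conditions at load time keeps a malformed or malicious rules file from turning into code execution inside the monitor.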

Testing Strategy

Resource tracking: Run CPU/memory intensive script, verify stats accuracy
File access: Script writes to /tmp, verify detection
Network access: Script makes HTTP request, verify connection logged
Anomaly detection: Trigger rule violation, verify action executes
Performance: Measure monitoring overhead (<5% CPU acceptable)

Security Considerations

/proc race conditions: PID may be reused - verify cmdline match
Polling overhead: 500ms interval balances detection speed vs overhead
inotify limits: /proc/sys/fs/inotify/max_user_watches may need increase
False positives: Tune thresholds based on legitimate script behavior
Log volume: Anomaly logs can be verbose - rotate aggressively

Estimated Time

12-16 hours including rule development and tuning

3. Audit Log Integrity (Cryptographic Signing)

Architecture Decision

Implement Ed25519 digital signatures with hash chain for audit log tamper detection.

Why Ed25519?

Fast (signature generation <1ms)
Small signatures (64 bytes)
Industry standard (SSH, GPG, TLS 1.3)
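Better than HMAC for this purpose: verifiers hold only the public key, so they can check entries but cannot forge them (non-repudiation)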

Why hash chain?

Enables detection of entry deletion or reordering
Each entry references previous entry hash
Efficient verification

File Modifications

New Files:

src/core/audit/crypto.ts - Ed25519 signing (150 lines)
src/core/audit/hash-chain.ts - Chain management (100 lines)
src/core/audit/verifier.ts - Log verification tool (200 lines)
scripts/verify-audit-log.ts - CLI verification tool
scripts/generate-audit-keys.ts - One-time key pair generation (used in setup and deployment)

Modified Files:

src/core/script-executor.ts (lines 800-833): Modify logExecution() to sign entries
package.json: Add @noble/ed25519 dependency

Implementation Approach

Key Generation (one-time setup):

bun run scripts/generate-audit-keys.ts
# Generates:
# - data/audit-signing-key.pem (private, 600 permissions)
# - data/audit-verify-key.pem (public, world-readable)

Signed Entry Format:

{
  "timestamp": "2026-01-27T15:30:00.000Z",
  "executionId": "exec_abc123",
  "scriptName": "deploy",
  "userId": "user_xyz",
  "success": true,
  "exitCode": 0,
  "durationMs": 1234,
  "previousHash": "sha256:abc123...",
  "signature": "ed25519:def456..."
}

Signing Process:

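// In script-executor.ts
import * as ed25519 from '@noble/ed25519';
import { appendFileSync } from 'node:fs';

// Link the entry to its predecessor before signing, so the signature covers the chain link
const entry = { ...executionData, previousHash: this.hashChain.getLatestHash() };

// Sign the serialized entry without its signature field. The verifier must
// rebuild exactly these bytes, so key order has to stay stable.
const entryBytes = Buffer.from(JSON.stringify(entry), 'utf-8');
// @noble/ed25519 v2: synchronous sign() needs etc.sha512Sync configured; signAsync() does not
const signature = await ed25519.signAsync(entryBytes, this.signingKey);
const signedEntry = { ...entry, signature: `ed25519:${Buffer.from(signature).toString('hex')}` };

this.hashChain.add(signedEntry);
appendFileSync(this.logPath, JSON.stringify(signedEntry) + '\n');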
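Signature verification only works if the verifier reproduces the signed bytes exactly, and plain JSON.stringify output depends on property insertion order. One way to remove that dependency is to serialize with sorted keys before signing and verifying. The following is a sketch under that assumption; `canonicalize` and `signingPayload` are hypothetical helpers, not part of the plan's file list.

```typescript
// Recursively serializes with object keys sorted, so equal data yields equal bytes.
function canonicalize(value: unknown): string {
  if (value === undefined) return "null"; // mirrors JSON.stringify inside arrays
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalize(item)).join(",")}]`;
  }
  const obj = value as Record<string, unknown>;
  const parts = Object.keys(obj)
    .filter((key) => obj[key] !== undefined)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${canonicalize(obj[key])}`);
  return `{${parts.join(",")}}`;
}

// Bytes to sign or verify: the entry without its signature field.
function signingPayload(entry: Record<string, unknown>): Buffer {
  const { signature: _signature, ...unsigned } = entry;
  return Buffer.from(canonicalize(unsigned), "utf-8");
}
```

Both the signer and scripts/verify-audit-log.ts would call `signingPayload`, so a log entry that was parsed and re-serialized still verifies.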

Verification Tool:

bun run scripts/verify-audit-log.ts logs/script-executions.log

# Output:
# ✅ Entry 1: Valid signature, hash chain intact
# ✅ Entry 2: Valid signature, hash chain intact
# ...
# ✅ 1000 entries verified, 0 tampering detected

Hash Chain Verification:
- First entry: previousHash = null (genesis)
- Each subsequent entry: previousHash = SHA256(previous entry JSON)
- Verification: Recalculate hashes, compare to stored values
- Tampering detected if: signature invalid OR hash mismatch
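The chain check described above can be sketched as follows, using SHA-256 from `node:crypto` and leaving signature verification out for brevity. Hashing the stored line itself, rather than a re-serialized object, is a choice made for this example; `appendEntry` and `findChainBreak` are illustrative names.

```typescript
import { createHash } from "node:crypto";

interface ChainedEntry {
  previousHash: string | null;
  [field: string]: unknown;
}

function hashLine(line: string): string {
  return "sha256:" + createHash("sha256").update(line, "utf-8").digest("hex");
}

// Appends an entry linked to the hash of the previous serialized line.
function appendEntry(lines: string[], data: Record<string, unknown>): void {
  const previousHash = lines.length === 0 ? null : hashLine(lines[lines.length - 1]);
  lines.push(JSON.stringify({ ...data, previousHash }));
}

// Returns the index of the first entry whose link is broken, or -1 if the chain is intact.
function findChainBreak(lines: string[]): number {
  let expected: string | null = null; // genesis entry must carry previousHash = null
  for (let i = 0; i < lines.length; i++) {
    const entry = JSON.parse(lines[i]) as ChainedEntry;
    if (entry.previousHash !== expected) return i;
    expected = hashLine(lines[i]);
  }
  return -1;
}
```

Note that a chain on its own cannot reveal entries removed from the end of the log; detecting that would need the latest hash recorded somewhere outside the log file.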

Dependencies

{
  "dependencies": {
    "@noble/ed25519": "^2.0.0"
  }
}

Configuration Changes

config/bot.config.json additions:

{
  "audit": {
    "signingEnabled": true,
    "signingKeyPath": "./data/audit-signing-key.pem",
    "verifyKeyPath": "./data/audit-verify-key.pem"
  }
}

Testing Strategy

Key generation: Generate keys, verify format and permissions
Signing: Sign entry, verify signature with public key
Hash chain: Create 10 entries, verify chain integrity
Tampering detection: Modify entry, verify detection
Deletion detection: Remove entry, verify hash chain break
Reordering detection: Swap entries, verify hash chain break
Performance: Sign 1000 entries, measure overhead (<10ms/entry)

Security Considerations

Private key protection: File permissions 600, never log or transmit
Key rotation: Generate new key pair, sign rotation event, continue chain
Signature size: 64 bytes per entry - negligible storage impact
Hash chain breaks: If entry lost, chain breaks - detect gap, continue new chain
Replay attacks: Timestamp and previousHash in the signed data prevent an old entry from being re-inserted undetected
Non-repudiation: Ed25519 proves log originated from key holder

Estimated Time

6-8 hours including key management and verification tool

4. Supply Chain Verification

Architecture Decision

Implement multi-layer checksum verification for runtime and dependencies.

Layers:

Bun runtime version + checksum
NPM dependencies lockfile with integrity hashes
Vulnerability scanning on startup
Script allowlist checksum

File Modifications

New Files:

scripts/verify-supply-chain.ts - Verification script (300 lines)
scripts/update-checksums.ts - Checksum update tool (150 lines)
.checksums/ - Runtime and config checksums (directory)
bun.lockb - Bun lockfile (generated)

Modified Files:

bot.ts (lines 1-20): Add supply chain verification before startup
package.json: Pin exact Bun version

Implementation Approach

Bun Runtime Verification:

// In bot.ts startup
const bunVersion = Bun.version;
const expectedVersion = '1.0.21';
const bunPath = process.execPath;

if (bunVersion !== expectedVersion) {
  throw new Error(`Bun version mismatch: expected ${expectedVersion}, got ${bunVersion}`);
}

const bunChecksum = await calculateSHA256(bunPath);
const expectedChecksum = await readFile('.checksums/bun.sha256', 'utf-8');

if (bunChecksum !== expectedChecksum.trim()) {
  throw new Error('Bun runtime checksum verification failed');
}
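The snippet above relies on a `calculateSHA256` helper that the plan does not define. A possible implementation using `node:crypto` is sketched here, together with a comparison helper; `sha256Hex` and `checksumMatches` are hypothetical names, and accepting `sha256sum`-style file content is an assumption about how the .checksums files might be written.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";
import { createReadStream } from "node:fs";

function sha256Hex(data: Buffer | string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Streams the file so a large runtime binary is not loaded into memory at once.
function calculateSHA256(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash("sha256");
    createReadStream(path)
      .on("data", (chunk) => hash.update(chunk))
      .on("error", reject)
      .on("end", () => resolve(hash.digest("hex")));
  });
}

// Accepts either a bare digest or sha256sum-style "digest  filename" content.
function checksumMatches(actualHex: string, expectedFileContent: string): boolean {
  const expectedHex = (expectedFileContent.trim().split(/\s+/)[0] ?? "").toLowerCase();
  const actual = Buffer.from(actualHex.toLowerCase(), "utf-8");
  const expected = Buffer.from(expectedHex, "utf-8");
  return actual.length === expected.length && timingSafeEqual(actual, expected);
}
```

Normalizing case and whitespace before comparing avoids startup failures caused by formatting differences rather than by real tampering.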

Dependency Lockfile:

# Generate (or update) the lockfile with integrity hashes
bun install

# Verify on startup
bun install --frozen-lockfile --production
# Fails if the lockfile is out of sync with package.json

Vulnerability Scanning:

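// In bot.ts startup
// Note: `bun audit` is not available in older Bun releases such as the 1.0.21 pinned
// above; confirm the command and its JSON shape against the runtime actually deployed.
const auditProc = Bun.spawn(['bun', 'audit', '--json'], { stdout: 'pipe' });
const auditResult = await new Response(auditProc.stdout).text();
const audit = JSON.parse(auditResult);

// Assumes findings live under `vulnerabilities` (array or keyed object) with a `severity` field
const findings = Object.values(audit.vulnerabilities ?? {}).flat();
const critical = findings.filter(v => v.severity === 'critical');
if (critical.length > 0) {
  logger.error('Critical vulnerabilities detected', { count: critical.length });
  // Log to audit trail, alert, but don't block startup (configurable)
}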

Script Allowlist Checksum:

// Verify allowlist hasn't been tampered with
const allowlistPath = './config/script-allowlist.json';
const allowlistChecksum = await calculateSHA256(allowlistPath);
// Distinct name so this does not clash with the Bun runtime check in the same startup scope
const expectedAllowlistChecksum = await readFile('.checksums/allowlist.sha256', 'utf-8');

if (allowlistChecksum !== expectedAllowlistChecksum.trim()) {
  throw new Error('Script allowlist checksum verification failed - possible tampering');
}

Checksum Update Tool:

# After legitimate updates, regenerate checksums
bun run scripts/update-checksums.ts

# Checksums stored in .checksums/
# - bun.sha256
# - allowlist.sha256
# - bot.ts.sha256 (optional: verify bot code itself)

Dependencies

No external dependencies (uses Node.js crypto).

Configuration Changes

package.json additions:

{
  "engines": {
    "bun": "1.0.21"
  },
  "scripts": {
    "verify-supply-chain": "bun run scripts/verify-supply-chain.ts",
    "update-checksums": "bun run scripts/update-checksums.ts"
  }
}

.checksums/ (new directory):

.checksums/
├── bun.sha256          # Bun runtime checksum
├── allowlist.sha256    # Script allowlist checksum
└── dependencies.json   # Dependency checksums (optional)

Testing Strategy

Bun version: Change Bun version, verify startup fails
Bun checksum: Modify Bun binary (or use different version), verify detection
Lockfile: Modify bun.lockb, verify bun install --frozen-lockfile fails
Vulnerability scan: Simulate vulnerable dependency, verify detection
Allowlist tampering: Modify allowlist, verify checksum mismatch
Update workflow: Make legitimate change, update checksums, verify startup

Security Considerations

Checksum storage: .checksums/ should be write-protected (systemd ReadOnlyPaths)
TOCTOU: Race between checksum verification and execution - minimize window
Bun updates: Update process must regenerate checksums before restart
False positives: Vulnerability scanning may flag non-exploitable issues - configure thresholds
Startup time: Verification adds ~500ms to startup - acceptable for security

Estimated Time

6-8 hours including checksum tooling and testing

5. WebSocket Token Protection (Rotation + Anomaly Detection)

Architecture Decision

Implement automated token rotation with behavioral anomaly detection and rate limiting.

Token Rotation:

Rotate every 12 hours (configurable)
5-minute grace period for in-flight requests
Mattermost API: POST /api/v4/users/{userId}/tokens

Anomaly Detection:

Per-user command profiling
Unusual time-of-day detection
Rapid escalation detection
New user + sensitive script = block

Rate Limiting:

Per-user: 20 commands/minute
Per-channel: 50 commands/minute
Global: 200 commands/minute

File Modifications

New Files:

src/core/auth/token-manager.ts - Token rotation (400 lines)
src/core/security/rate-limiter.ts - Rate limiting (200 lines)
src/core/security/anomaly-detector.ts - Behavioral analysis (300 lines)
src/core/security/user-profile.ts - Command history tracking (150 lines)
config/anomaly-rules.json - Detection rules

Modified Files:

src/core/bot-client.ts (lines 1-100): Integrate token manager
src/core/bot-client.ts (lines 200-250): Add rate limit checks
bot.ts (lines 50-80): Initialize rate limiter and anomaly detector

Implementation Approach

Token Manager:

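class TokenManager {
  private currentToken: string;
  private nextToken: string | null = null;
  private rotationTimer: ReturnType<typeof setTimeout> | null = null;
  private graceEndTime: number | null = null;

  // constructor, createToken(), revokeToken(), and scheduleRotation() omitted for brevity

  async rotate(): Promise<void> {
    // Generate new token via Mattermost API
    const newToken = await this.createToken();
    const oldToken = this.currentToken;

    // Enter grace period: the old token stays valid for in-flight requests
    this.nextToken = newToken;
    this.graceEndTime = Date.now() + (5 * 60 * 1000);

    // Wait for grace period
    await new Promise((resolve) => setTimeout(resolve, 5 * 60 * 1000));

    // Promote the new token before revoking the old one, so the bot
    // never holds a revoked token as its current token
    this.currentToken = newToken;
    this.nextToken = null;
    this.graceEndTime = null;
    await this.revokeToken(oldToken);

    // Schedule next rotation
    this.scheduleRotation();
  }

  getCurrentToken(): string {
    return this.currentToken;
  }
}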

Rate Limiter (Token Bucket):

class RateLimiter {
  private buckets: Map<string, TokenBucket>;

  check(userId: string, channelId: string): {
    allowed: boolean;
    retryAfterMs?: number;
  } {
    // Check user bucket
    const userBucket = this.getUserBucket(userId);
    if (!userBucket.consume()) {
      return { allowed: false, retryAfterMs: userBucket.refillTimeMs() };
    }

    // Check channel bucket
    const channelBucket = this.getChannelBucket(channelId);
    if (!channelBucket.consume()) {
      userBucket.refund(); // Refund user token
      return { allowed: false, retryAfterMs: channelBucket.refillTimeMs() };
    }

    // Check global bucket
    const globalBucket = this.getGlobalBucket();
    if (!globalBucket.consume()) {
      userBucket.refund();
      channelBucket.refund();
      return { allowed: false, retryAfterMs: globalBucket.refillTimeMs() };
    }

    return { allowed: true };
  }
}
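The RateLimiter above calls `consume()`, `refund()`, and `refillTimeMs()` on a `TokenBucket` type that is not shown. A minimal sketch of such a bucket follows. Mapping the `windowMs`/`maxRequests` configuration to capacity and refill rate, and the injectable clock parameter, are assumptions made for this example.

```typescript
class TokenBucket {
  private readonly capacity: number;
  private readonly windowMs: number;
  private readonly now: () => number;
  private tokens: number;
  private lastRefillMs: number;

  // capacity = maxRequests, windowMs = time to refill an empty bucket completely
  constructor(capacity: number, windowMs: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.windowMs = windowMs;
    this.now = now;
    this.tokens = capacity;
    this.lastRefillMs = now();
  }

  // Adds tokens in proportion to elapsed time, capped at capacity.
  private refill(): void {
    const current = this.now();
    const elapsed = current - this.lastRefillMs;
    const gained = (elapsed / this.windowMs) * this.capacity;
    this.tokens = Math.min(this.capacity, this.tokens + gained);
    this.lastRefillMs = current;
  }

  consume(): boolean {
    this.refill();
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }

  // Returns a token taken by consume() when a later check rejected the request.
  refund(): void {
    this.tokens = Math.min(this.capacity, this.tokens + 1);
  }

  // Milliseconds until at least one token is available again.
  refillTimeMs(): number {
    this.refill();
    const missing = Math.max(0, 1 - this.tokens);
    return Math.ceil((missing / this.capacity) * this.windowMs);
  }
}
```

With the per-user setting of 20 requests per 60,000 ms, an exhausted bucket regains one token every 3 seconds, so a steady stream can exceed 20 commands in a minute only through refill, never through bursting.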

Anomaly Detector:

interface UserProfile {
  userId: string;
  commandHistory: { script: string; timestamp: number; success: boolean }[];
  typicalHours: number[]; // Hour of day histogram
  scriptFrequency: Map<string, number>;
}

class AnomalyDetector {
  private profiles: Map<string, UserProfile>;
  private rules: AnomalyRule[];

  analyze(userId: string, scriptName: string, channelId: string): Anomaly[] {
    const profile = this.getProfile(userId);
    const anomalies: Anomaly[] = [];

    for (const rule of this.rules) {
      const result = rule.check(profile, scriptName, channelId);
      if (result.triggered) {
        anomalies.push({
          rule: rule.name,
          severity: result.severity,
          action: result.action,
          details: result.details,
        });
      }
    }

    return anomalies;
  }
}

Anomaly Rules:
- Unusual Hours: User executing at 3am when typical hours are 9am-5pm
- New User + Sensitive Script: User with <10 total commands trying deploy-prod
- Rapid Escalation: User goes from test to deploy-staging to deploy-prod in <5 minutes
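Two of these rules can be expressed as small pure functions, sketched below. The constants mirror the values in config/anomaly-rules.json, but the exact trigger logic is an assumption: in particular, rapid escalation is interpreted here as two or more distinct escalation scripts, including the one being requested, inside the window.

```typescript
interface CommandRecord {
  script: string;
  timestamp: number; // epoch milliseconds
}

const SENSITIVE_SCRIPTS = ["deploy-prod", "db-migrate", "secrets-rotate"];
const MIN_COMMANDS_BEFORE_ACCESS = 10;
const ESCALATION_SCRIPTS = ["deploy-staging", "deploy-prod", "db-migrate"];
const ESCALATION_WINDOW_MS = 300_000; // 5 minutes

// New User + Sensitive Script: too little history for a sensitive command.
function isNewUserSensitiveScript(history: CommandRecord[], script: string): boolean {
  return SENSITIVE_SCRIPTS.includes(script) && history.length < MIN_COMMANDS_BEFORE_ACCESS;
}

// Rapid Escalation (assumed reading): the requested script plus recent history
// cover at least two distinct escalation scripts within the window.
function isRapidEscalation(history: CommandRecord[], script: string, now: number): boolean {
  if (!ESCALATION_SCRIPTS.includes(script)) return false;
  const recent = new Set(
    history
      .filter((c) => now - c.timestamp <= ESCALATION_WINDOW_MS)
      .filter((c) => ESCALATION_SCRIPTS.includes(c.script))
      .map((c) => c.script),
  );
  recent.add(script);
  return recent.size >= 2;
}
```

Keeping the checks free of I/O makes them easy to replay against recorded command history when tuning thresholds.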

Dependencies

No external dependencies.

Configuration Changes

config/bot.config.json additions:

{
  "tokens": {
    "rotationIntervalHours": 12,
    "graceMinutes": 5,
    "adminToken": "ENV:MATTERMOST_ADMIN_TOKEN"
  },
  "rateLimits": {
    "perUser": { "windowMs": 60000, "maxRequests": 20 },
    "perChannel": { "windowMs": 60000, "maxRequests": 50 },
    "global": { "windowMs": 60000, "maxRequests": 200 }
  }
}

config/anomaly-rules.json:

{
  "rules": [
    {
      "name": "unusual_hours",
      "type": "time",
      "config": {
        "unusualHoursThreshold": 3,
        "minHistoryDays": 7
      },
      "severity": "warning",
      "action": "log"
    },
    {
      "name": "new_user_sensitive_script",
      "type": "pattern",
      "config": {
        "sensitiveScripts": ["deploy-prod", "db-migrate", "secrets-rotate"],
        "minCommandsBeforeAccess": 10
      },
      "severity": "critical",
      "action": "block"
    },
    {
      "name": "rapid_escalation",
      "type": "escalation",
      "config": {
        "windowMs": 300000,
        "escalationScripts": ["deploy-staging", "deploy-prod", "db-migrate"]
      },
      "severity": "critical",
      "action": "alert"
    }
  ]
}

Environment variables:

# Admin token for token management (needs create/revoke permissions)
MATTERMOST_ADMIN_TOKEN=<admin-token>

Testing Strategy

Token rotation: Simulate rotation, verify WebSocket reconnects
Grace period: Send commands during grace with old token, verify acceptance
Token revocation: Verify old token rejected after grace
Rate limiting: Send burst of commands, verify limits enforced
Anomaly detection: Simulate unusual patterns, verify detection
Profile building: Track user over 7 days, verify profile accuracy

Security Considerations

Admin token: Required for token management - store securely
Token exposure: Never log tokens, sanitize all output
Rotation failure: If rotation fails, keep current token, alert, retry
Profile storage: User profiles are sensitive - consider encryption
False positives: Anomaly rules need tuning - start with log, escalate to block

Estimated Time

16-20 hours including Mattermost API integration and behavioral profiling

Testing Strategy (Overall)

Phase 1: Unit Testing (Per Feature)

Each hardening item has feature-specific unit tests
Total: ~40 unit tests across all features
Run with: bun test

Phase 2: Integration Testing

# Test authentication flow
!test-script  # Triggers TOTP challenge
# User provides code
# Verify script executes

# Test monitoring
!cpu-intensive-script
# Verify process stats logged
# Verify anomaly detection if thresholds crossed

# Test audit integrity
bun run scripts/verify-audit-log.ts logs/script-executions.log
# Verify all signatures valid, hash chain intact

# Test supply chain
bun run scripts/verify-supply-chain.ts
# Verify all checksums match

# Test rate limiting
# Send 25 commands in a single burst (well within 1 minute)
# Verify last 5 rate-limited

# Test token rotation
# Wait 12 hours (or trigger manually)
# Verify WebSocket reconnects with new token

Phase 3: Security Testing

Penetration testing: Attempt to bypass each hardening layer
Failure mode testing: Kill processes, corrupt files, simulate attacks
Load testing: Verify hardening doesn't degrade performance >10%

Phase 4: Re-Run Red Team Analysis

Execute same 32-agent analysis on hardened implementation
Target: <5 agents identify critical weaknesses
Verify verdict changes from "NOT PRODUCTION-READY" to "PRODUCTION-READY"

Deployment Checklist

Pre-Deployment

All 5 hardening items implemented
Unit tests passing (40/40)
Integration tests passing
Security testing completed
Red Team re-analysis shows improvement
Documentation updated

Deployment Steps

Backup: Backup current bot database and config
Install dependencies: bun install
Generate keys: bun run scripts/generate-audit-keys.ts
Update checksums: bun run scripts/update-checksums.ts
Configure: Update config/bot.config.json with new settings
Enroll users: !auth-enroll @username for each user
Verify supply chain: bun run scripts/verify-supply-chain.ts
Start bot: sudo systemctl restart mattermost-bot
Verify: Check logs for successful startup with all hardening enabled
Monitor: Watch audit logs and monitoring alerts for 24 hours

Post-Deployment

All users enrolled in TOTP
Monitoring rules tuned (no false positives)
Rate limits confirmed appropriate
Token rotation tested in production
Audit log verification automated (cron job)
Security scan scheduled (weekly)

File Modification Summary

New Directories

src/core/auth/ - Authentication subsystem (5 files, ~1000 lines)
src/core/monitoring/ - Runtime monitoring (5 files, ~1350 lines)
src/core/audit/ - Audit integrity (4 files, ~450 lines)
src/core/security/ - Rate limiting and anomaly detection (4 files, ~650 lines)
scripts/ - Verification and management tools (5 files, ~800 lines)
.checksums/ - Supply chain checksums

Modified Files

bot.ts (~100 lines modified/added) - Authentication middleware, supply chain verification, monitoring initialization
src/core/script-executor.ts (~150 lines modified) - Monitoring hooks, signed audit logging
src/core/bot-client.ts (~200 lines modified) - Token manager integration, rate limiting, anomaly checks
src/core/command-router.ts (~50 lines modified) - Authentication injection point
package.json - New dependencies, scripts
config/bot.config.json - New configuration sections

New Configuration Files

config/monitoring-rules.json - Runtime monitoring rules
config/anomaly-rules.json - Behavioral anomaly detection rules
.checksums/bun.sha256 - Bun runtime checksum
.checksums/allowlist.sha256 - Script allowlist checksum

Total Code Addition

New code: ~4,250 lines
Modified code: ~500 lines
Configuration: ~200 lines

Risk Assessment

Implementation Risks

Risk	Likelihood	Impact	Mitigation
TOTP enrollment friction	Medium	Medium	Clear documentation, QR codes
False positive anomaly detection	Medium	Low	Start with `log` action, tune rules
Token rotation breaks WebSocket	Low	High	Grace period, thorough testing
Monitoring performance overhead	Low	Medium	Poll interval tuning, benchmarking
Audit signature key compromise	Low	Critical	Secure key storage, rotation procedure
Supply chain verification blocks startup	Low	High	Configurable strictness, override flag

Residual Risks (Post-Hardening)

Risk	Pre-Hardening	Post-Hardening	Notes
Compromised bot account	CRITICAL	LOW	TOTP prevents credential reuse
Malicious script behavior	HIGH	LOW	Runtime monitoring detects anomalies
Audit log tampering	HIGH	MINIMAL	Ed25519 signatures and hash chain make modification detectable
Compromised dependencies	HIGH	LOW	Checksums detect supply chain attacks
Token theft	MEDIUM	LOW	Rotation limits exposure window
Rate exhaustion DoS	MEDIUM	MINIMAL	Rate limiting enforces fairness

Success Criteria

Phase 3 Implementation Complete When:

✅ All 5 hardening items implemented
✅ All unit tests passing
✅ Integration tests passing
✅ Security testing shows no critical bypasses
✅ Red Team re-analysis shows <5 of 32 agents identifying critical weaknesses
✅ Performance impact <10% (latency and throughput)
✅ Documentation complete (deployment guide, troubleshooting)
✅ User enrollment workflow tested and documented

Production-Ready When:

✅ 2 weeks of production operation without hardening-related incidents
✅ All operators enrolled in TOTP
✅ Monitoring rules tuned (false positive rate <1%)
✅ Audit log verification automated and running
✅ Supply chain scanning scheduled and alerting
✅ Token rotation tested through multiple cycles

Timeline Estimate

Week 1: Foundation (12-16 hours)

Day 1-2: Audit Log Integrity (6-8 hours)
Day 3-4: Supply Chain Verification (6-8 hours)
Milestone: Foundational security layers operational

Week 2: Authentication & Monitoring (20-28 hours)

Day 1-3: Independent Authentication Layer (8-12 hours)
Day 4-5: Runtime Behavior Monitoring (12-16 hours)
Milestone: Primary security controls implemented

Week 3: Token Protection & Testing (26-30 hours)

Day 1-3: WebSocket Token Protection (16-20 hours)
Day 4-5: Integration testing, security testing, documentation (10 hours)
Milestone: All hardening complete, ready for staging deployment

Week 4: Validation & Deployment (8-16 hours)

Day 1-2: Red Team re-analysis, tune based on findings (4-8 hours)
Day 3: Staging deployment, user enrollment (4 hours)
Day 4-5: Monitoring, final validation, production deployment (0-4 hours)
Milestone: Production deployment with hardening enabled

Total: 48-64 hours of implementation, plus 18-26 hours of testing, validation, and deployment (66-90 hours overall), over 3-4 weeks

Next Steps

Review this plan with stakeholders (security team, operations, management)
Schedule implementation sprint (3-week focused development)
Provision resources:
- Development environment for testing
- Mattermost admin token for token management API
- Staging environment for integration testing
Begin implementation following priority order:
- Start with Audit Log Integrity (foundational, no dependencies)
- Then Supply Chain Verification (foundational, no dependencies)
- Then Authentication → Monitoring → Token Protection (in dependency order)
Re-run Red Team analysis after implementation complete
Plan production deployment (target: 3-4 weeks from implementation start)

Conclusion

This implementation plan addresses the 7 security gaps identified by the 32-agent Red Team analysis: six through the 5 mandatory hardening items and one (semantically blind regex validation) through ongoing monitoring. The hardening items are designed as defense-in-depth layers that work together to prevent, detect, and respond to security threats.

Key Principles:

Safe but functional by default: Hardening doesn't break existing workflows
Defense-in-depth: Multiple independent security layers
Fail securely: Security failures result in deny, not allow
Audit everything: Cryptographically signed, tamper-evident logs
Behavioral detection: Anomaly detection catches unexpected patterns

Expected Outcome: After implementation, the Mattermost ChatOps Bot will transition from "CONDITIONALLY SOUND BUT NOT PRODUCTION-READY" to "PRODUCTION-READY" with comprehensive security controls appropriate for sensitive operational environments.

Document Status: APPROVED FOR IMPLEMENTATION
Last Updated: 2026-01-27
Next Review: After Phase 3 completion, before production deployment
