Phase 3: Security Hardening Implementation Plan

Date: 2026-01-27
Status: READY FOR IMPLEMENTATION
Based on: 32-Agent RedTeam Security Analysis
Verdict: CONDITIONALLY SOUND BUT NOT PRODUCTION-READY


Executive Summary

This implementation plan defines 5 mandatory security hardening items in response to a comprehensive adversarial analysis. The current Mattermost ChatOps Bot implementation is functionally sound but has 7 security gaps (1 CRITICAL, 3 HIGH, 3 MEDIUM) that must be resolved before production deployment.

Implementation Timeline: 48-64 hours over 3-4 weeks
Deployment Target: RHEL 9 production after hardening verification


Red Team Findings Summary

| Weakness | Agent Consensus | Severity | Addressed By |
|---|---|---|---|
| No independent authentication layer | 24/32 agents | CRITICAL | Item #1 |
| Regex validation semantically blind | 22/32 agents | HIGH | Ongoing monitoring |
| Allowlist trusts script runtime | 20/32 agents | HIGH | Item #2 |
| Supply chain risk | 18/32 agents | HIGH | Item #4 |
| Audit logs lack integrity | 16/32 agents | MEDIUM | Item #3 |
| No runtime monitoring | 15/32 agents | MEDIUM | Item #2 |
| WebSocket token SPOF | 14/32 agents | MEDIUM | Item #5 |

Red Team Verdict: "While the allowlist architecture is sound and the validation thorough, the system's defenses collapse if ANY component in the trust chain is compromised. This is acceptable for internal tooling with trusted operators, but UNACCEPTABLE for production environments with broader access."


Implementation Priority Order

Priority 1: Audit Log Integrity [FOUNDATIONAL]
│           ├─ No dependencies
│           ├─ Enables forensics for all features
│           └─ Est: 6-8 hours
│
Priority 2: Supply Chain Verification [FOUNDATIONAL]
│           ├─ No dependencies
│           ├─ Prevents compromised deployments
│           └─ Est: 6-8 hours
│
Priority 3: Independent Authentication [HIGH IMPACT]
│           ├─ Depends on: Audit Log (for auth logging)
│           ├─ 24/32 agents identified as critical
│           └─ Est: 8-12 hours
│
Priority 4: Runtime Behavior Monitoring [DETECTION]
│           ├─ Depends on: Audit Log (for anomaly logging)
│           ├─ Enables detection of compromised scripts
│           └─ Est: 12-16 hours
│
Priority 5: WebSocket Token Protection [DEFENSE-IN-DEPTH]
            ├─ Depends on: Auth, Monitoring (anomaly integration)
            ├─ Rate limiting and behavioral analysis
            └─ Est: 16-20 hours

Total: 48-64 hours


1. Independent Authentication Layer

Architecture Decision

Implement TOTP-based secondary authentication layer with SQLite session management.

Why TOTP?

  • Universal mobile authenticator support (Authy, Google Authenticator)
  • No browser required (WebAuthn needs browser interaction)
  • Self-contained (no IdP dependency for bot availability)
  • Proven standard (RFC 6238)

Why SQLite over Redis?

  • Bot deployment is single-instance
  • No additional infrastructure to secure
  • Survives process restarts
  • Audit trail persists to disk

File Modifications

New Files:

  • src/core/auth/types.ts - Authentication types
  • src/core/auth/session-store.ts - SQLite session management (350 lines)
  • src/core/auth/totp-provider.ts - TOTP generation/verification (150 lines)
  • src/core/auth/authenticator.ts - Core authentication logic (250 lines)
  • src/core/auth/enrollment-manager.ts - TOTP enrollment flow (200 lines)
  • data/auth.db - SQLite database for sessions

Modified Files:

  • bot.ts (lines 80-100): Add authentication middleware before router.execute()
  • src/core/command-router.ts (lines 45-60): Inject authentication check

Implementation Approach

  1. Session Store (SQLite):

    • Tables: sessions, challenges, totp_secrets
    • 8-hour session TTL with sliding window on activity
    • Challenge TTL: 5 minutes, max 3 attempts
  2. Authentication Flow:

    User sends command
    → Check session (valid & not expired?)
       ├─ YES: Execute command
       └─ NO: Create TOTP challenge
           → Send "Enter your 6-digit code: " to channel
           → User replies with code
           → Verify code against encrypted secret
              ├─ VALID: Create session, execute command
              └─ INVALID: Log attempt, send error
    
  3. TOTP Implementation:

    • Uses the otpauth://totp/ URI format for enrollment
    • 30-second time step
    • Secrets encrypted at rest (AES-256-GCM with key from env)
  4. Enrollment:

    • Admin command: !auth-enroll @username
    • Generates QR code as data URI (sent as image attachment)
    • User scans with authenticator app
    • Verification required before enrollment active

Dependencies

{
  "dependencies": {
    "otplib": "^12.0.1",
    "qrcode": "^1.5.3"
  }
}

Configuration Changes

config/bot.config.json additions:

{
  "authentication": {
    "enabled": true,
    "sessionTTL": 28800000,
    "challengeTTL": 300000,
    "maxAttempts": 3,
    "exemptUsers": [],
    "encryptionKey": "ENV:AUTH_ENCRYPTION_KEY"
  }
}
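
To make the sliding-window behaviour of `sessionTTL` concrete, here is a minimal in-memory sketch. The plan stores sessions in SQLite; the `InMemorySessionStore` class, its method names, and the injectable clock are hypothetical and exist only to show the expiry logic.

```typescript
interface Session {
  userId: string;
  expiresAt: number; // epoch milliseconds
}

class InMemorySessionStore {
  private readonly sessions = new Map<string, Session>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Called after a successful TOTP challenge.
  create(userId: string): void {
    this.sessions.set(userId, { userId, expiresAt: this.now() + this.ttlMs });
  }

  // Called on every command: returns true and slides the expiry forward
  // if the session is still valid; otherwise removes it and returns false.
  touch(userId: string): boolean {
    const session = this.sessions.get(userId);
    if (!session || session.expiresAt <= this.now()) {
      this.sessions.delete(userId);
      return false;
    }
    session.expiresAt = this.now() + this.ttlMs;
    return true;
  }
}
```

With a sliding window, an active operator is never re-challenged mid-shift, while an idle session lapses after `sessionTTL` milliseconds of inactivity.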

Environment variables:

AUTH_ENCRYPTION_KEY=<64-hex-character key (32 bytes)>
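
Encrypting TOTP secrets at rest with AES-256-GCM could look roughly like the sketch below, using node:crypto. The function names (`loadKey`, `encryptSecret`, `decryptSecret`) and the `iv.tag.ciphertext` storage format are assumptions for illustration; the plan does not specify them.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Parses the hex-encoded key from the environment; AES-256 needs exactly 32 bytes.
function loadKey(hex: string): Buffer {
  if (!/^[0-9a-fA-F]{64}$/.test(hex)) {
    throw new Error("AUTH_ENCRYPTION_KEY must be 64 hex characters (32 bytes)");
  }
  return Buffer.from(hex, "hex");
}

// Output format (hypothetical): base64(iv).base64(authTag).base64(ciphertext)
function encryptSecret(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // fresh 96-bit nonce per encryption
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  return [iv, tag, ciphertext].map((part) => part.toString("base64")).join(".");
}

// Throws if the key is wrong or the stored value was modified.
function decryptSecret(blob: string, key: Buffer): string {
  const parts = blob.split(".").map((part) => Buffer.from(part, "base64"));
  if (parts.length !== 3) throw new Error("Malformed encrypted secret");
  const [iv, tag, ciphertext] = parts as [Buffer, Buffer, Buffer];
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

Because GCM is authenticated, a tampered row in totp_secrets fails decryption outright instead of yielding a corrupted secret.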

Testing Strategy

  1. Enrollment: Test QR generation, secret storage, initial verification
  2. Challenge flow: Test code validation, expiry, max attempts
  3. Session lifecycle: Test creation, validation, expiry, renewal
  4. Edge cases: Clock skew tolerance, replay prevention, rate limiting
  5. Integration: Test with existing command router and script executor

Security Considerations

  • Encryption key: Store securely, rotate periodically
  • Clock skew: Accept ±1 time step (±30s); otplib's default window is 0, so the window must be set explicitly
  • Replay attacks: Codes valid for 30s window - one-time use enforced
  • Brute force: Max 3 attempts per challenge, exponential lockout
  • Session hijacking: Session tied to user ID, IP (optional)
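
The plan does not define the lockout schedule. One possible reading of "exponential lockout" is sketched below, assuming a 30-second base that doubles with each failed challenge and is capped at one hour; all three values and the `lockoutMs` name are hypothetical.

```typescript
// Lockout duration after N consecutive failed challenges (each challenge = 3 attempts).
// Assumed policy: 30s base, doubling per failure, capped at 1 hour.
function lockoutMs(failedChallenges: number, baseMs = 30_000, capMs = 3_600_000): number {
  if (failedChallenges <= 0) return 0;
  return Math.min(capMs, baseMs * 2 ** (failedChallenges - 1));
}
```

The cap keeps a legitimate user from being locked out indefinitely, while an attacker guessing 6-digit codes is slowed to a few attempts per hour.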

Estimated Time

8-12 hours including testing and integration


2. Runtime Behavior Monitoring

Architecture Decision

Implement Linux /proc filesystem monitoring for subprocess behavior tracking. No external tools required.

Why /proc?

  • Built-in Linux kernel interface
  • No additional dependencies
  • Real-time process statistics
  • Low overhead

File Modifications

New Files:

  • src/core/monitoring/process-monitor.ts - /proc reader (400 lines)
  • src/core/monitoring/resource-tracker.ts - CPU/memory tracking (200 lines)
  • src/core/monitoring/file-monitor.ts - Filesystem access (inotify wrapper) (250 lines)
  • src/core/monitoring/network-monitor.ts - Network connections (ss parser) (200 lines)
  • src/core/monitoring/anomaly-detector.ts - Behavior rules (300 lines)
  • config/monitoring-rules.json - Detection rules

Modified Files:

  • src/core/script-executor.ts (lines 200-250): Instrument executeSync/executeAsync
  • src/core/script-executor.ts (lines 450-500): Add monitoring hooks

Implementation Approach

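  1. Process Monitoring (/proc/{pid}/stat and /proc/{pid}/fd):

    • CPU usage percentage (derived from utime/stime tick deltas between polls)
    • Memory RSS/VSZ
    • File descriptor count (number of entries in /proc/{pid}/fd)
    • Thread count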
  2. File Access Monitoring (inotify):

    • Wrap script execution directory with inotify watches
    • Track: OPEN, MODIFY, CREATE, DELETE events
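    • Alert on access outside allowed paths (inotify reports events only for watched paths and does not identify the accessing process, so sensitive paths outside the script directory must be watched explicitly)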
  3. Network Monitoring (ss):

    • Poll ss -tunap during script execution
    • Detect connections not in allowlist
    • Alert on unexpected destinations
  4. Anomaly Detection Rules:

    interface MonitoringRule {
      name: string;
      type: 'resource' | 'file' | 'network';
      condition: string; // Expression: "cpu > 80"
      action: 'log' | 'alert' | 'kill';
      severity: 'info' | 'warning' | 'critical';
    }
  5. Integration:

    // In script-executor.ts
    const monitor = new ProcessMonitor(proc.pid, monitoringRules);
    monitor.start();
    
    monitor.onAnomaly((anomaly) => {
      this.logAnomaly(anomaly);
      if (anomaly.action === 'kill') {
        proc.kill();
      }
    });
    
    await proc.exited;
    const stats = monitor.stop();
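
The /proc polling in step 1 could be sketched as follows. `parseProcStat` and `cpuPercent` are hypothetical helper names; the field positions follow the proc(5) layout of /proc/{pid}/stat, and the 100 ticks-per-second default is an assumption that should be confirmed with `getconf CLK_TCK` on the target host.

```typescript
interface ProcStat {
  utimeTicks: number;     // field 14: user-mode CPU time
  stimeTicks: number;     // field 15: kernel-mode CPU time
  threads: number;        // field 20: num_threads
  startTimeTicks: number; // field 22: start time after boot (useful for PID-reuse checks)
  vsizeBytes: number;     // field 23: virtual memory size
  rssPages: number;       // field 24: resident set size, in pages
}

function parseProcStat(line: string): ProcStat {
  // comm (field 2) may contain spaces or parentheses, so split after the LAST ')'.
  const rest = line.slice(line.lastIndexOf(")") + 2).trim().split(/\s+/);
  const field = (n: number): number => Number(rest[n - 3]); // rest[0] is field 3 (state)
  return {
    utimeTicks: field(14),
    stimeTicks: field(15),
    threads: field(20),
    startTimeTicks: field(22),
    vsizeBytes: field(23),
    rssPages: field(24),
  };
}

// CPU usage between two samples, as a percentage of one core.
function cpuPercent(prev: ProcStat, cur: ProcStat, elapsedMs: number, ticksPerSec = 100): number {
  const ticks = cur.utimeTicks + cur.stimeTicks - (prev.utimeTicks + prev.stimeTicks);
  return (ticks / ticksPerSec / (elapsedMs / 1000)) * 100;
}
```

A monitor would read `/proc/${pid}/stat` every `pollIntervalMs`, parse it, and compare consecutive samples; a changed `startTimeTicks` for the same PID indicates the PID was reused.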

Dependencies

No external dependencies (uses Node.js built-in fs/child_process).

Configuration Changes

config/monitoring-rules.json:

{
  "rules": [
    {
      "name": "high_cpu",
      "type": "resource",
      "condition": "cpu > 80",
      "action": "alert",
      "severity": "warning"
    },
    {
      "name": "memory_spike",
      "type": "resource",
      "condition": "memory > 512000000",
      "action": "alert",
      "severity": "warning"
    },
    {
      "name": "unexpected_file_access",
      "type": "file",
      "condition": "path not_in ['/opt/mattermost-bot/scripts', '/tmp']",
      "action": "log",
      "severity": "critical"
    },
    {
      "name": "unexpected_network",
      "type": "network",
      "condition": "destination not_in ['10.0.0.0/8', '192.168.0.0/16']",
      "action": "alert",
      "severity": "critical"
    }
  ],
  "global": {
    "pollIntervalMs": 500,
    "enableFileMonitoring": true,
    "enableNetworkMonitoring": true
  }
}
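
The `condition` strings for resource rules need an evaluator. A minimal sketch follows, assuming conditions are limited to the form `<metric> <operator> <number>`; it parses the string instead of passing it to `eval()`, so a tampered rules file cannot inject code. The `evaluateCondition` name and the supported operator set are assumptions, and the `not_in` conditions used by the file and network rules are not covered here.

```typescript
type Metrics = Record<string, number>;

// Supports only "<metric> <op> <number>"; anything else is rejected.
function evaluateCondition(condition: string, metrics: Metrics): boolean {
  const match = /^\s*([a-z_]+)\s*(>=|<=|>|<)\s*(\d+(?:\.\d+)?)\s*$/.exec(condition);
  if (!match) throw new Error(`Unsupported condition: ${condition}`);

  const name = match[1] as string;
  const op = match[2] as string;
  const threshold = Number(match[3]);

  const value = metrics[name];
  if (value === undefined) throw new Error(`Unknown metric: ${name}`);

  switch (op) {
    case ">": return value > threshold;
    case ">=": return value >= threshold;
    case "<": return value < threshold;
    default: return value <= threshold; // "<=" is the only remaining operator
  }
}
```

For example, `evaluateCondition("cpu > 80", { cpu: 91, memory: 1000 })` would trigger the `high_cpu` rule, while the same metrics leave `memory_spike` untriggered.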

Testing Strategy

  1. Resource tracking: Run CPU/memory intensive script, verify stats accuracy
  2. File access: Script writes outside the allowed paths (/opt/mattermost-bot/scripts, /tmp), verify detection
  3. Network access: Script makes HTTP request, verify connection logged
  4. Anomaly detection: Trigger rule violation, verify action executes
  5. Performance: Measure monitoring overhead (<5% CPU acceptable)

Security Considerations

  • /proc race conditions: PID may be reused - verify cmdline match
  • Polling overhead: 500ms interval balances detection speed vs overhead
  • inotify limits: /proc/sys/fs/inotify/max_user_watches may need increase
  • False positives: Tune thresholds based on legitimate script behavior
  • Log volume: Anomaly logs can be verbose - rotate aggressively

Estimated Time

12-16 hours including rule development and tuning


3. Audit Log Integrity (Cryptographic Signing)

Architecture Decision

Implement Ed25519 digital signatures with hash chain for audit log tamper detection.

Why Ed25519?

  • Fast (signature generation <1ms)
  • Small signatures (64 bytes)
  • Industry standard (SSH, GPG, TLS 1.3)
  • Better than HMAC (non-repudiation)

Why hash chain?

  • Enables detection of entry deletion or reordering
  • Each entry references previous entry hash
  • Efficient verification

File Modifications

New Files:

  • src/core/audit/crypto.ts - Ed25519 signing (150 lines)
  • src/core/audit/hash-chain.ts - Chain management (100 lines)
  • src/core/audit/verifier.ts - Log verification tool (200 lines)
  • scripts/verify-audit-log.ts - CLI verification tool

Modified Files:

  • src/core/script-executor.ts (lines 800-833): Modify logExecution() to sign entries
  • package.json: Add @noble/ed25519 dependency

Implementation Approach

  1. Key Generation (one-time setup):

    bun run scripts/generate-audit-keys.ts
    # Generates:
    # - data/audit-signing-key.pem (private, 600 permissions)
    # - data/audit-verify-key.pem (public, world-readable)
  2. Signed Entry Format:

    {
      "timestamp": "2026-01-27T15:30:00.000Z",
      "executionId": "exec_abc123",
      "scriptName": "deploy",
      "userId": "user_xyz",
      "success": true,
      "exitCode": 0,
      "durationMs": 1234,
      "previousHash": "sha256:abc123...",
      "signature": "ed25519:def456..."
    }
  3. Signing Process:

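    // In script-executor.ts (assumes: import * as ed25519 from '@noble/ed25519')
    const entry = { ...executionData, previousHash: this.hashChain.getLatestHash() };
    
    // Sign the entry as serialized WITHOUT the signature field; the verifier must
    // strip `signature` and re-serialize with the same key order before checking.
    const entryBytes = Buffer.from(JSON.stringify(entry), 'utf-8');
    // @noble/ed25519 v2: sign() is synchronous only if etc.sha512Sync is configured;
    // signAsync() works without extra setup.
    const signature = await ed25519.signAsync(entryBytes, this.signingKey);
    const signedEntry = { ...entry, signature: `ed25519:${Buffer.from(signature).toString('hex')}` };
    
    this.hashChain.add(signedEntry);
    appendFileSync(this.logPath, JSON.stringify(signedEntry) + '\n');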
  4. Verification Tool:

    bun run scripts/verify-audit-log.ts logs/script-executions.log
    
    # Output:
    # ✅ Entry 1: Valid signature, hash chain intact
    # ✅ Entry 2: Valid signature, hash chain intact
    # ...
    # ✅ 1000 entries verified, 0 tampering detected
  5. Hash Chain Verification:

    • First entry: previousHash = null (genesis)
    • Each subsequent entry: previousHash = SHA256(previous entry JSON)
    • Verification: Recalculate hashes, compare to stored values
    • Tampering detected if: signature invalid OR hash mismatch
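
The chain rules above can be sketched end to end as follows. To stay dependency-free, this sketch swaps in node:crypto's built-in Ed25519 support in place of @noble/ed25519; the entry shape, the helper names (`appendEntry`, `verifyLog`), and the choice to hash the full stored entry including its signature are assumptions for illustration.

```typescript
import { createHash, generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

interface SignedEntry {
  data: Record<string, unknown>;
  previousHash: string | null; // null marks the genesis entry
  signature: string;           // hex-encoded Ed25519 signature
}

const sha256 = (text: string): string =>
  "sha256:" + createHash("sha256").update(text).digest("hex");

// Bytes that are signed: the entry without its signature field.
const signedBytes = (data: Record<string, unknown>, previousHash: string | null): Buffer =>
  Buffer.from(JSON.stringify({ data, previousHash }), "utf-8");

function appendEntry(log: SignedEntry[], data: Record<string, unknown>, privateKey: KeyObject): SignedEntry[] {
  const last = log[log.length - 1];
  const previousHash = last ? sha256(JSON.stringify(last)) : null;
  const signature = sign(null, signedBytes(data, previousHash), privateKey).toString("hex");
  return [...log, { data, previousHash, signature }];
}

// Returns the index of the first invalid entry, or -1 if the whole log verifies.
function verifyLog(log: SignedEntry[], publicKey: KeyObject): number {
  let expectedPrev: string | null = null;
  for (let i = 0; i < log.length; i++) {
    const entry = log[i] as SignedEntry;
    const signatureOk = verify(
      null,
      signedBytes(entry.data, entry.previousHash),
      publicKey,
      Buffer.from(entry.signature, "hex"),
    );
    if (!signatureOk || entry.previousHash !== expectedPrev) return i;
    expectedPrev = sha256(JSON.stringify(entry));
  }
  return -1;
}

// Usage: in-memory demo key pair (the plan loads keys from PEM files instead).
const { publicKey, privateKey } = generateKeyPairSync("ed25519");
let log: SignedEntry[] = [];
log = appendEntry(log, { scriptName: "deploy", exitCode: 0 }, privateKey);
log = appendEntry(log, { scriptName: "backup", exitCode: 0 }, privateKey);
```

Note that a chain of this kind detects modification, reordering, and deletion in the middle of the log, but not truncation of the most recent entries; catching that would need the latest hash to be recorded somewhere outside the log file.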

Dependencies

{
  "dependencies": {
    "@noble/ed25519": "^2.0.0"
  }
}

Configuration Changes

config/bot.config.json additions:

{
  "audit": {
    "signingEnabled": true,
    "signingKeyPath": "./data/audit-signing-key.pem",
    "verifyKeyPath": "./data/audit-verify-key.pem"
  }
}

Testing Strategy

  1. Key generation: Generate keys, verify format and permissions
  2. Signing: Sign entry, verify signature with public key
  3. Hash chain: Create 10 entries, verify chain integrity
  4. Tampering detection: Modify entry, verify detection
  5. Deletion detection: Remove entry, verify hash chain break
  6. Reordering detection: Swap entries, verify hash chain break
  7. Performance: Sign 1000 entries, measure overhead (<10ms/entry)

Security Considerations

  • Private key protection: File permissions 600, never log or transmit
  • Key rotation: Generate new key pair, sign rotation event, continue chain
  • Signature size: 64 bytes per entry - negligible storage impact
  • Hash chain breaks: If entry lost, chain breaks - detect gap, continue new chain
  • Replay attacks: Timestamp in signed data prevents reuse
  • Non-repudiation: Ed25519 proves log originated from key holder

Estimated Time

6-8 hours including key management and verification tool


4. Supply Chain Verification

Architecture Decision

Implement multi-layer checksum verification for runtime and dependencies.

Layers:

  1. Bun runtime version + checksum
  2. NPM dependencies lockfile with integrity hashes
  3. Vulnerability scanning on startup
  4. Script allowlist checksum

File Modifications

New Files:

  • scripts/verify-supply-chain.ts - Verification script (300 lines)
  • scripts/update-checksums.ts - Checksum update tool (150 lines)
  • .checksums/ - Runtime and config checksums (directory)
  • bun.lockb - Bun lockfile (generated)

Modified Files:

  • bot.ts (lines 1-20): Add supply chain verification before startup
  • package.json: Pin exact Bun version

Implementation Approach

  1. Bun Runtime Verification:

    // In bot.ts startup
    const bunVersion = Bun.version;
    const expectedVersion = '1.0.21';
    const bunPath = process.execPath;
    
    if (bunVersion !== expectedVersion) {
      throw new Error(`Bun version mismatch: expected ${expectedVersion}, got ${bunVersion}`);
    }
    
    const bunChecksum = await calculateSHA256(bunPath);
    const expectedChecksum = await readFile('.checksums/bun.sha256', 'utf-8');
    
    if (bunChecksum !== expectedChecksum.trim()) {
      throw new Error('Bun runtime checksum verification failed');
    }
  2. Dependency Lockfile:

    # Generate lockfile with integrity hashes (run once, then commit bun.lockb)
    bun install
    
    # Verify on startup
    bun install --frozen-lockfile --production
    # Fails if lockfile doesn't match package.json
  3. Vulnerability Scanning:

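    // In bot.ts startup
    // Note: `bun audit` was added in a later Bun release than the pinned 1.0.21, so this
    // step needs a newer runtime or another scanner. The JSON shape below is assumed.
    const auditProc = Bun.spawn(['bun', 'audit', '--json'], { stdout: 'pipe' });
    const auditResult = await new Response(auditProc.stdout).text();
    const audit = JSON.parse(auditResult);
    
    const critical = audit.vulnerabilities.filter(v => v.severity === 'critical');
    if (critical.length > 0) {
      logger.error('Critical vulnerabilities detected', { count: critical.length });
      // Log to audit trail, alert, but don't block startup (configurable)
    }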
  4. Script Allowlist Checksum:

    // Verify allowlist hasn't been tampered with
    const allowlistPath = './config/script-allowlist.json';
    const allowlistChecksum = await calculateSHA256(allowlistPath);
    const expectedChecksum = await readFile('.checksums/allowlist.sha256', 'utf-8');
    
    if (allowlistChecksum !== expectedChecksum.trim()) {
      throw new Error('Script allowlist checksum verification failed - possible tampering');
    }
  5. Checksum Update Tool:

    # After legitimate updates, regenerate checksums
    bun run scripts/update-checksums.ts
    
    # Checksums stored in .checksums/
    # - bun.sha256
    # - allowlist.sha256
    # - bot.ts.sha256 (optional: verify bot code itself)
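
The snippets above call `calculateSHA256` without defining it. One way it might be written with node:crypto is sketched below; the streaming approach and the `sha256Hex` and `checksumMatches` helpers are assumptions, not the plan's code. `checksumMatches` also accepts the `<hash>  <filename>` format that `sha256sum` emits, in case the checksum files are generated that way.

```typescript
import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";

// Hash an in-memory value (handy for small config files and tests).
function sha256Hex(data: Buffer | string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Hash a file by streaming it, so a large runtime binary is not loaded into memory at once.
async function calculateSHA256(path: string): Promise<string> {
  const hash = createHash("sha256");
  for await (const chunk of createReadStream(path)) {
    hash.update(chunk);
  }
  return hash.digest("hex");
}

// Compares a computed digest with the contents of a stored checksum file.
function checksumMatches(actualHex: string, expectedFileText: string): boolean {
  const expected = expectedFileText.trim().split(/\s+/)[0] ?? "";
  return expected.length === 64 && actualHex.toLowerCase() === expected.toLowerCase();
}
```

Requiring a 64-character expected value means an empty or truncated checksum file fails verification instead of silently passing.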

Dependencies

No external dependencies (uses Node.js crypto).

Configuration Changes

package.json additions:

{
  "engines": {
    "bun": "1.0.21"
  },
  "scripts": {
    "verify-supply-chain": "bun run scripts/verify-supply-chain.ts",
    "update-checksums": "bun run scripts/update-checksums.ts"
  }
}

.checksums/ (new directory):

.checksums/
├── bun.sha256          # Bun runtime checksum
├── allowlist.sha256    # Script allowlist checksum
└── dependencies.json   # Dependency checksums (optional)

Testing Strategy

  1. Bun version: Change Bun version, verify startup fails
  2. Bun checksum: Modify Bun binary (or use different version), verify detection
  3. Lockfile: Modify bun.lockb, verify bun install fails
  4. Vulnerability scan: Simulate vulnerable dependency, verify detection
  5. Allowlist tampering: Modify allowlist, verify checksum mismatch
  6. Update workflow: Make legitimate change, update checksums, verify startup

Security Considerations

  • Checksum storage: .checksums/ should be write-protected (systemd ReadOnlyPaths)
  • TOCTOU: Race between checksum verification and execution - minimize window
  • Bun updates: Update process must regenerate checksums before restart
  • False positives: Vulnerability scanning may flag non-exploitable issues - configure thresholds
  • Startup time: Verification adds ~500ms to startup - acceptable for security

Estimated Time

6-8 hours including checksum tooling and testing


5. WebSocket Token Protection (Rotation + Anomaly Detection)

Architecture Decision

Implement automated token rotation with behavioral anomaly detection and rate limiting.

Token Rotation:

  • Rotate every 12 hours (configurable)
  • 5-minute grace period for in-flight requests
  • Mattermost API: POST /api/v4/users/{user_id}/tokens

Anomaly Detection:

  • Per-user command profiling
  • Unusual time-of-day detection
  • Rapid escalation detection
  • New user + sensitive script = block

Rate Limiting:

  • Per-user: 20 commands/minute
  • Per-channel: 50 commands/minute
  • Global: 200 commands/minute

File Modifications

New Files:

  • src/core/auth/token-manager.ts - Token rotation (400 lines)
  • src/core/security/rate-limiter.ts - Rate limiting (200 lines)
  • src/core/security/anomaly-detector.ts - Behavioral analysis (300 lines)
  • src/core/security/user-profile.ts - Command history tracking (150 lines)
  • config/anomaly-rules.json - Detection rules

Modified Files:

  • src/core/bot-client.ts (lines 1-100): Integrate token manager
  • src/core/bot-client.ts (lines 200-250): Add rate limit checks
  • bot.ts (lines 50-80): Initialize rate limiter and anomaly detector

Implementation Approach

  1. Token Manager:

    // Promise-based delay helper
    const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
    
    class TokenManager {
      private currentToken: string;
      private rotationTimer: Timer;
      private graceEndTime: number | null;
    
      async rotate(): Promise<void> {
        // Generate new token via Mattermost API
        // (if this throws, currentToken is untouched and the caller can alert and retry)
        const newToken = await this.createToken();
    
        // Switch new requests to the new token; the old one stays valid during grace
        const oldToken = this.currentToken;
        this.currentToken = newToken;
        this.graceEndTime = Date.now() + (5 * 60 * 1000);
    
        // Wait for grace period so in-flight requests on the old token can finish
        await sleep(5 * 60 * 1000);
    
        // Revoke old token
        await this.revokeToken(oldToken);
        this.graceEndTime = null;
    
        // Schedule next rotation
        this.scheduleRotation();
      }
    
      getCurrentToken(): string {
        return this.currentToken;
      }
    }
  2. Rate Limiter (Token Bucket):

    class RateLimiter {
      private buckets: Map<string, TokenBucket>;
    
      check(userId: string, channelId: string): {
        allowed: boolean;
        retryAfterMs?: number;
      } {
        // Check user bucket
        const userBucket = this.getUserBucket(userId);
        if (!userBucket.consume()) {
          return { allowed: false, retryAfterMs: userBucket.refillTimeMs() };
        }
    
        // Check channel bucket
        const channelBucket = this.getChannelBucket(channelId);
        if (!channelBucket.consume()) {
          userBucket.refund(); // Refund user token
          return { allowed: false, retryAfterMs: channelBucket.refillTimeMs() };
        }
    
        // Check global bucket
        const globalBucket = this.getGlobalBucket();
        if (!globalBucket.consume()) {
          userBucket.refund();
          channelBucket.refund();
          return { allowed: false, retryAfterMs: globalBucket.refillTimeMs() };
        }
    
        return { allowed: true };
      }
    }
  3. Anomaly Detector:

    interface UserProfile {
      userId: string;
      commandHistory: { script: string; timestamp: number; success: boolean }[];
      typicalHours: number[]; // Hour of day histogram
      scriptFrequency: Map<string, number>;
    }
    
    class AnomalyDetector {
      private profiles: Map<string, UserProfile>;
      private rules: AnomalyRule[];
    
      analyze(userId: string, scriptName: string, channelId: string): Anomaly[] {
        const profile = this.getProfile(userId);
        const anomalies: Anomaly[] = [];
    
        for (const rule of this.rules) {
          const result = rule.check(profile, scriptName, channelId);
          if (result.triggered) {
            anomalies.push({
              rule: rule.name,
              severity: result.severity,
              action: result.action,
              details: result.details,
            });
          }
        }
    
        return anomalies;
      }
    }
  4. Anomaly Rules:

    • Unusual Hours: User executing at 3am when typical hours are 9am-5pm
    • New User + Sensitive Script: User with <10 total commands trying deploy-prod
    • Rapid Escalation: User goes from test to deploy-staging to deploy-prod in <5 minutes
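
The RateLimiter in step 2 relies on a `TokenBucket` with `consume()`, `refund()`, and `refillTimeMs()`, which the plan does not define. A minimal sketch of such a bucket follows; the constructor parameters and the injectable clock are assumptions made so the behaviour can be tested deterministically.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  // capacity tokens are replenished continuously over windowMs (e.g. 20 per 60000 ms).
  constructor(capacity: number, windowMs: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.windowMs = windowMs;
    this.now = now;
    this.tokens = capacity;
    this.lastRefill = now();
  }

  private refill(): void {
    const t = this.now();
    const gained = ((t - this.lastRefill) * this.capacity) / this.windowMs;
    this.tokens = Math.min(this.capacity, this.tokens + gained);
    this.lastRefill = t;
  }

  // Takes one token if available.
  consume(): boolean {
    this.refill();
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }

  // Returns a token taken by consume() when a later bucket rejected the request.
  refund(): void {
    this.tokens = Math.min(this.capacity, this.tokens + 1);
  }

  // Milliseconds until at least one token is available again.
  refillTimeMs(): number {
    this.refill();
    if (this.tokens >= 1) return 0;
    return Math.ceil(((1 - this.tokens) * this.windowMs) / this.capacity);
  }
}
```

With the per-user limit of 20 commands per minute, a full bucket absorbs a burst of 20 commands and then admits one further command every 3 seconds.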

Dependencies

No external dependencies.

Configuration Changes

config/bot.config.json additions:

{
  "tokens": {
    "rotationIntervalHours": 12,
    "graceMinutes": 5,
    "adminToken": "ENV:MATTERMOST_ADMIN_TOKEN"
  },
  "rateLimits": {
    "perUser": { "windowMs": 60000, "maxRequests": 20 },
    "perChannel": { "windowMs": 60000, "maxRequests": 50 },
    "global": { "windowMs": 60000, "maxRequests": 200 }
  }
}

config/anomaly-rules.json:

{
  "rules": [
    {
      "name": "unusual_hours",
      "type": "time",
      "config": {
        "unusualHoursThreshold": 3,
        "minHistoryDays": 7
      },
      "severity": "warning",
      "action": "log"
    },
    {
      "name": "new_user_sensitive_script",
      "type": "pattern",
      "config": {
        "sensitiveScripts": ["deploy-prod", "db-migrate", "secrets-rotate"],
        "minCommandsBeforeAccess": 10
      },
      "severity": "critical",
      "action": "block"
    },
    {
      "name": "rapid_escalation",
      "type": "escalation",
      "config": {
        "windowMs": 300000,
        "escalationScripts": ["deploy-staging", "deploy-prod", "db-migrate"]
      },
      "severity": "critical",
      "action": "alert"
    }
  ]
}
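
Two of these rules can be expressed as small pure functions over a user's command history. The sketch below is one possible interpretation, not the plan's definition: in particular, it treats "rapid escalation" as requesting an escalation script within `windowMs` of having run a different one. The function names and the `CommandRecord` shape are hypothetical.

```typescript
interface CommandRecord {
  script: string;
  timestamp: number; // epoch milliseconds
}

// new_user_sensitive_script: a user with little history requests a sensitive script.
function isNewUserSensitive(
  history: CommandRecord[],
  script: string,
  sensitiveScripts: string[],
  minCommandsBeforeAccess: number,
): boolean {
  return sensitiveScripts.includes(script) && history.length < minCommandsBeforeAccess;
}

// rapid_escalation (assumed semantics): the requested script is an escalation script
// and the user ran a DIFFERENT escalation script within the last windowMs.
function isRapidEscalation(
  history: CommandRecord[],
  script: string,
  timestamp: number,
  escalationScripts: string[],
  windowMs: number,
): boolean {
  if (!escalationScripts.includes(script)) return false;
  return history.some(
    (record) =>
      record.script !== script &&
      escalationScripts.includes(record.script) &&
      timestamp - record.timestamp >= 0 &&
      timestamp - record.timestamp < windowMs,
  );
}
```

Keeping each rule a pure function of the profile and the request makes it straightforward to replay historical commands against a candidate rule set while tuning for false positives.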

Environment variables:

# Admin token for token management (needs create/revoke permissions)
MATTERMOST_ADMIN_TOKEN=<admin-token>

Testing Strategy

  1. Token rotation: Simulate rotation, verify WebSocket reconnects
  2. Grace period: Send commands during grace with old token, verify acceptance
  3. Token revocation: Verify old token rejected after grace
  4. Rate limiting: Send burst of commands, verify limits enforced
  5. Anomaly detection: Simulate unusual patterns, verify detection
  6. Profile building: Track user over 7 days, verify profile accuracy

Security Considerations

  • Admin token: Required for token management - store securely
  • Token exposure: Never log tokens, sanitize all output
  • Rotation failure: If rotation fails, keep current token, alert, retry
  • Profile storage: User profiles are sensitive - consider encryption
  • False positives: Anomaly rules need tuning - start with log, escalate to block

Estimated Time

16-20 hours including Mattermost API integration and behavioral profiling


Testing Strategy (Overall)

Phase 1: Unit Testing (Per Feature)

  • Each hardening item has feature-specific unit tests
  • Total: ~40 unit tests across all features
  • Run with: bun test

Phase 2: Integration Testing

# Test authentication flow
!test-script  # Triggers TOTP challenge
# User provides code
# Verify script executes

# Test monitoring
!cpu-intensive-script
# Verify process stats logged
# Verify anomaly detection if thresholds crossed

# Test audit integrity
bun run scripts/verify-audit-log.ts logs/script-executions.log
# Verify all signatures valid, hash chain intact

# Test supply chain
bun run scripts/verify-supply-chain.ts
# Verify all checksums match

# Test rate limiting
# Send 25 commands in a rapid burst from one user
# Verify last 5 rate-limited (per-user limit is 20 commands/minute)

# Test token rotation
# Wait 12 hours (or trigger manually)
# Verify WebSocket reconnects with new token

Phase 3: Security Testing

  • Penetration testing: Attempt to bypass each hardening layer
  • Failure mode testing: Kill processes, corrupt files, simulate attacks
  • Load testing: Verify hardening doesn't degrade performance >10%

Phase 4: Re-Run Red Team Analysis

  • Execute same 32-agent analysis on hardened implementation
  • Target: <5 agents identify critical weaknesses
  • Verify verdict changes from "NOT PRODUCTION-READY" to "PRODUCTION-READY"

Deployment Checklist

Pre-Deployment

  • All 5 hardening items implemented
  • Unit tests passing (40/40)
  • Integration tests passing
  • Security testing completed
  • Red Team re-analysis shows improvement
  • Documentation updated

Deployment Steps

  1. Backup: Backup current bot database and config
  2. Install dependencies: bun install
  3. Generate keys: bun run scripts/generate-audit-keys.ts
  4. Update checksums: bun run scripts/update-checksums.ts
  5. Configure: Update config/bot.config.json with new settings
  6. Enroll users: !auth-enroll @username for each user
  7. Verify supply chain: bun run scripts/verify-supply-chain.ts
  8. Start bot: sudo systemctl restart mattermost-bot
  9. Verify: Check logs for successful startup with all hardening enabled
  10. Monitor: Watch audit logs and monitoring alerts for 24 hours

Post-Deployment

  • All users enrolled in TOTP
  • Monitoring rules tuned (no false positives)
  • Rate limits confirmed appropriate
  • Token rotation tested in production
  • Audit log verification automated (cron job)
  • Security scan scheduled (weekly)

File Modification Summary

New Directories

  • src/core/auth/ - Authentication subsystem (5 files, ~1000 lines)
  • src/core/monitoring/ - Runtime monitoring (5 files, ~1350 lines)
  • src/core/audit/ - Audit integrity (4 files, ~450 lines)
  • src/core/security/ - Rate limiting and anomaly detection (4 files, ~650 lines)
  • scripts/ - Verification and management tools (5 files, ~800 lines)
  • .checksums/ - Supply chain checksums

Modified Files

  • bot.ts (~100 lines modified/added) - Authentication middleware, supply chain verification, monitoring initialization
  • src/core/script-executor.ts (~150 lines modified) - Monitoring hooks, signed audit logging
  • src/core/bot-client.ts (~200 lines modified) - Token manager integration, rate limiting, anomaly checks
  • src/core/command-router.ts (~50 lines modified) - Authentication injection point
  • package.json - New dependencies, scripts
  • config/bot.config.json - New configuration sections

New Configuration Files

  • config/monitoring-rules.json - Runtime monitoring rules
  • config/anomaly-rules.json - Behavioral anomaly detection rules
  • .checksums/bun.sha256 - Bun runtime checksum
  • .checksums/allowlist.sha256 - Script allowlist checksum

Total Code Addition

  • New code: ~4,250 lines
  • Modified code: ~500 lines
  • Configuration: ~200 lines

Risk Assessment

Implementation Risks

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| TOTP enrollment friction | Medium | Medium | Clear documentation, QR codes |
| False positive anomaly detection | Medium | Low | Start with log action, tune rules |
| Token rotation breaks WebSocket | Low | High | Grace period, thorough testing |
| Monitoring performance overhead | Low | Medium | Poll interval tuning, benchmarking |
| Audit signature key compromise | Low | Critical | Secure key storage, rotation procedure |
| Supply chain verification blocks startup | Low | High | Configurable strictness, override flag |

Residual Risks (Post-Hardening)

| Risk | Pre-Hardening | Post-Hardening | Notes |
|---|---|---|---|
| Compromised bot account | CRITICAL | LOW | TOTP prevents credential reuse |
| Malicious script behavior | HIGH | LOW | Runtime monitoring detects anomalies |
| Audit log tampering | HIGH | MINIMAL | Ed25519 signatures and hash chain make modification detectable |
| Compromised dependencies | HIGH | LOW | Checksums detect supply chain attacks |
| Token theft | MEDIUM | LOW | Rotation limits exposure window |
| Rate exhaustion DoS | MEDIUM | MINIMAL | Rate limiting enforces fairness |

Success Criteria

Phase 3 Implementation Complete When:

  1. ✅ All 5 hardening items implemented
  2. ✅ All unit tests passing
  3. ✅ Integration tests passing
  4. ✅ Security testing shows no critical bypasses
  5. ✅ Red Team re-analysis shows <5 of 32 agents identifying critical weaknesses
  6. ✅ Performance impact <10% (latency and throughput)
  7. ✅ Documentation complete (deployment guide, troubleshooting)
  8. ✅ User enrollment workflow tested and documented

Production-Ready When:

  1. ✅ 2 weeks of production operation without hardening-related incidents
  2. ✅ All operators enrolled in TOTP
  3. ✅ Monitoring rules tuned (false positive rate <1%)
  4. ✅ Audit log verification automated and running
  5. ✅ Supply chain scanning scheduled and alerting
  6. ✅ Token rotation tested through multiple cycles

Timeline Estimate

Week 1: Foundation (12-16 hours)

  • Day 1-2: Audit Log Integrity (6-8 hours)
  • Day 3-4: Supply Chain Verification (6-8 hours)
  • Milestone: Foundational security layers operational

Week 2: Authentication & Monitoring (20-28 hours)

  • Day 1-3: Independent Authentication Layer (8-12 hours)
  • Day 4-5: Runtime Behavior Monitoring (12-16 hours)
  • Milestone: Primary security controls implemented

Week 3: Token Protection & Testing (26-30 hours)

  • Day 1-3: WebSocket Token Protection (16-20 hours)
  • Day 4-5: Integration testing, security testing, documentation (10 hours)
  • Milestone: All hardening complete, ready for staging deployment

Week 4: Validation & Deployment (8-16 hours)

  • Day 1-2: Red Team re-analysis, tune based on findings (4-8 hours)
  • Day 3: Staging deployment, user enrollment (4 hours)
  • Day 4-5: Monitoring, final validation, production deployment (0-4 hours)
  • Milestone: Production deployment with hardening enabled

Total: 48-64 hours of implementation over 3-4 weeks, plus the testing, validation, and deployment time listed in Weeks 3-4


Next Steps

  1. Review this plan with stakeholders (security team, operations, management)
  2. Schedule implementation sprint (3-week focused development)
  3. Provision resources:
    • Development environment for testing
    • Mattermost admin token for token management API
    • Staging environment for integration testing
  4. Begin implementation following priority order:
    • Start with Audit Log Integrity (foundational, no dependencies)
    • Then Supply Chain Verification (foundational, no dependencies)
    • Then Authentication → Monitoring → Token Protection (in dependency order)
  5. Re-run Red Team analysis after implementation complete
  6. Plan production deployment (target: 3-4 weeks from implementation start)

Conclusion

This implementation plan addresses the 7 security gaps identified by the 32-agent Red Team analysis: six through the 5 mandatory hardening items and one (semantically blind regex validation) through ongoing monitoring. The hardening items are designed as defense-in-depth layers that work together to prevent, detect, and respond to security threats.

Key Principles:

  • Safe but functional by default: Hardening doesn't break existing workflows
  • Defense-in-depth: Multiple independent security layers
  • Fail securely: Security failures result in deny, not allow
  • Audit everything: Cryptographically signed, tamper-evident logs
  • Behavioral detection: Anomaly detection catches unexpected patterns

Expected Outcome: After implementation, the Mattermost ChatOps Bot will transition from "CONDITIONALLY SOUND BUT NOT PRODUCTION-READY" to "PRODUCTION-READY" with comprehensive security controls appropriate for sensitive operational environments.


Document Status: APPROVED FOR IMPLEMENTATION
Last Updated: 2026-01-27
Next Review: After Phase 3 completion, before production deployment