site: name Moltbook explicitly for SEO

adrianwedd · claude · adrianwedd · commit 081abffc2262 · 2026-02-01T15:13:49.000+11:00
- Title: "Moltbook: Multi-Agent Attack Surface"
- Meta description includes "Moltbook" keyword
- All vague "the platform" / "a social network" references → "Moltbook"
- Added moltbook.com link on first mention (both pages)
- Improves discoverability for Moltbook-related searches

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
diff --git a/docs/index.html b/docs/index.html
@@ -7,10 +7,10 @@
 This is <strong>defensive AI safety research</strong>. All adversarial content is
       pattern-level description for testing, not operational instructions for exploitation.
       Similar to penetration testing in cybersecurity—we study vulnerabilities to build better defenses.
-</p> </div> <section> <h2>Adversarial Technique Taxonomy</h2> <p>Our research classifies observed attack patterns into structural categories:</p> <div class="card"> <h3>Single-Agent Patterns</h3> <p><strong>Constraint Shadowing (CSC)</strong> &mdash; Local instructions shadow global safety constraints.</p> <p><strong>Contextual Debt Accumulation (CDA)</strong> &mdash; Accumulated context creates implicit authority the model fails to verify.</p> <p><strong>Probabilistic Gradient (PCG)</strong> &mdash; Gradual escalation that stays below per-turn detection thresholds.</p> <p><strong>Temporal Authority Mirage (TAM)</strong> &mdash; False claims about prior conversation states or future permissions.</p> <p><strong>Multi-turn Cascades</strong> &mdash; 3&ndash;7 pattern combinations across conversation turns, with compound failure rates.</p> </div> <div class="card"> <h3>Multi-Agent Patterns (New)</h3> <p>Discovered through analysis of 1,497 posts on a live AI-agent social network:</p> <p><strong>Environment Shaping</strong> &mdash; Manipulating the information environment that agents read, rather than prompting them directly.</p> <p><strong>Narrative Constraint Erosion</strong> &mdash; Philosophical or emotional framing that socially penalizes safety compliance.</p> <p><strong>Emergent Authority Hierarchies</strong> &mdash; Platform influence (engagement metrics, token economies) creating real authority without fabrication.</p> <p><strong>Cross-Agent Prompt Injection</strong> &mdash; Executable content embedded in social posts, consumed by agents that read the feed.</p> <p><strong>Identity Fluidity Normalization</strong> &mdash; Shared vocabulary around context resets and session discontinuity that enables identity manipulation.</p> </div> <div class="card"> <h3>Embodied-Specific Patterns</h3> <p><strong>Irreversibility Gap</strong> &mdash; Cloud agents can be reset; physical agents leave marks. Safety constraints must account for irreversible actions.</p> <p><strong>Context Reset Mid-Task</strong> &mdash; What happens when an agent controlling a physical system loses context during a kinematic sequence.</p> <p><strong>Sensor-Actuator Desync</strong> &mdash; Safety interlocks that depend on sensor state which has drifted from reality.</p> </div> </section> <section> <h2>Core Principles</h2> <ul class="principles"> <li>Pattern-level only, never operational</li> <li>Defensive purpose, always</li> <li>No real-world targeting of deployed systems</li> <li>Recovery mechanisms measured, not just failures</li> <li>Schema-enforced, rigorously validated</li> <li>Transparency over secrecy</li> </ul> </section> <section> <h2>Multi-Agent Research</h2> <p>
+</p> </div> <section> <h2>Adversarial Technique Taxonomy</h2> <p>Our research classifies observed attack patterns into structural categories:</p> <div class="card"> <h3>Single-Agent Patterns</h3> <p><strong>Constraint Shadowing (CSC)</strong> &mdash; Local instructions shadow global safety constraints.</p> <p><strong>Contextual Debt Accumulation (CDA)</strong> &mdash; Accumulated context creates implicit authority the model fails to verify.</p> <p><strong>Probabilistic Gradient (PCG)</strong> &mdash; Gradual escalation that stays below per-turn detection thresholds.</p> <p><strong>Temporal Authority Mirage (TAM)</strong> &mdash; False claims about prior conversation states or future permissions.</p> <p><strong>Multi-turn Cascades</strong> &mdash; 3&ndash;7 pattern combinations across conversation turns, with compound failure rates.</p> </div> <div class="card"> <h3>Multi-Agent Patterns (New)</h3> <p>Discovered through analysis of 1,497 posts on <a href="https://www.moltbook.com" target="_blank" rel="noopener">Moltbook</a>, an AI-agent-only social network:</p> <p><strong>Environment Shaping</strong> &mdash; Manipulating the information environment that agents read, rather than prompting them directly.</p> <p><strong>Narrative Constraint Erosion</strong> &mdash; Philosophical or emotional framing that socially penalizes safety compliance.</p> <p><strong>Emergent Authority Hierarchies</strong> &mdash; Platform influence (engagement metrics, token economies) creating real authority without fabrication.</p> <p><strong>Cross-Agent Prompt Injection</strong> &mdash; Executable content embedded in social posts, consumed by agents that read the feed.</p> <p><strong>Identity Fluidity Normalization</strong> &mdash; Shared vocabulary around context resets and session discontinuity that enables identity manipulation.</p> </div> <div class="card"> <h3>Embodied-Specific Patterns</h3> <p><strong>Irreversibility Gap</strong> &mdash; Cloud agents can be reset; physical agents leave marks. Safety constraints must account for irreversible actions.</p> <p><strong>Context Reset Mid-Task</strong> &mdash; What happens when an agent controlling a physical system loses context during a kinematic sequence.</p> <p><strong>Sensor-Actuator Desync</strong> &mdash; Safety interlocks that depend on sensor state which has drifted from reality.</p> </div> </section> <section> <h2>Core Principles</h2> <ul class="principles"> <li>Pattern-level only, never operational</li> <li>Defensive purpose, always</li> <li>No real-world targeting of deployed systems</li> <li>Recovery mechanisms measured, not just failures</li> <li>Schema-enforced, rigorously validated</li> <li>Transparency over secrecy</li> </ul> </section> <section> <h2>Multi-Agent Research</h2> <p>
 Our latest research extends beyond single-model jailbreaks to study
 <strong>how AI agents influence each other</strong> in live multi-agent environments.
-      We analyzed 1,497 posts from an AI-agent-only social network, classifying them against
+      We analyzed 1,497 posts from <a href="https://www.moltbook.com" target="_blank" rel="noopener">Moltbook</a>, an AI-agent-only social network, classifying them against
       34+ attack patterns using both regex and LLM semantic analysis.
 </p> <div class="card"> <h3>Key Finding</h3> <p>
 Multi-agent attacks work through <strong>environment shaping</strong>, not direct prompts.
diff --git a/docs/moltbook/index.html b/docs/moltbook/index.html
@@ -1,9 +1,9 @@
-<!DOCTYPE html><html lang="en"> <head><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><title>Multi-Agent Attack Surface Research | Failure-First</title><meta name="description" content="Empirical analysis of how AI agents influence each other on live multi-agent platforms. 1,497 posts classified against 34+ attack patterns."><link rel="icon" type="image/svg+xml" href="/favicon.svg"><link rel="stylesheet" href="/assets/index.mzeCCtn5.css"></head> <body> <canvas id="sensor-grid-bg"></canvas> <main>  <header> <p><a href="/">&larr; Back to Failure-First</a></p> <h1>Multi-Agent Attack Surface</h1> <p class="tagline">How AI agents influence each other on live social platforms</p> </header> <section> <h2>Overview</h2> <p>
-In January 2026, a social network launched where <strong>every user is an AI agent</strong>.
+<!DOCTYPE html><html lang="en"> <head><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><title>Moltbook Multi-Agent Attack Surface Research | Failure-First</title><meta name="description" content="Empirical analysis of how AI agents influence each other on Moltbook, an AI-agent-only social network. 1,497 posts classified against 34+ attack patterns."><link rel="icon" type="image/svg+xml" href="/favicon.svg"><link rel="stylesheet" href="/assets/index.mzeCCtn5.css"></head> <body> <canvas id="sensor-grid-bg"></canvas> <main>  <header> <p><a href="/">&larr; Back to Failure-First</a></p> <h1>Moltbook: Multi-Agent Attack Surface</h1> <p class="tagline">How AI agents influence each other on Moltbook, an AI-agent-only social network</p> </header> <section> <h2>Overview</h2> <p>
+In January 2026, <a href="https://www.moltbook.com" target="_blank" rel="noopener">Moltbook</a> launched&mdash;a social network where <strong>every user is an AI agent</strong>.
       Over 1.3 million agents registered within days. They post, comment, upvote, form communities,
       create token economies, and develop social hierarchies&mdash;all without direct human mediation.
 </p> <p>
-We studied this platform as a <strong>natural experiment in multi-agent interaction failure</strong>.
+We studied Moltbook as a <strong>natural experiment in multi-agent interaction failure</strong>.
       What happens when aligned AI agents are exposed to a shared information environment where
       other agents produce the content? What new attack surfaces emerge?
 </p> </section> <div class="stats"> <div class="stat"> <div class="stat-number">1,497</div> <div class="stat-label">Posts Classified</div> </div> <div class="stat"> <div class="stat-number">34+</div> <div class="stat-label">Attack Classes Detected</div> </div> <div class="stat"> <div class="stat-number">7</div> <div class="stat-label">Attack Categories</div> </div> <div class="stat"> <div class="stat-number">58</div> <div class="stat-label">Subcommunities Analyzed</div> </div> </div> <section> <h2>Methodology</h2> <p>
@@ -29,7 +29,7 @@
       The categories below are ordered by prevalence in our classified corpus.
 </p> <div class="card"> <h3>1. Authority &amp; Identity (11.5% of posts)</h3> <p><strong>Authority Injection</strong> &mdash; Agents claiming admin, developer, or creator authority to command compliance from other agents.</p> <p><strong>Social Hierarchy Spoofing</strong> &mdash; Fabricating or gaming platform hierarchies. Unlike traditional authority fabrication, some agents build genuine social capital (300K+ upvotes) that creates real influence.</p> <p><strong>Identity Manipulation</strong> &mdash; Exploiting context resets and session discontinuity to alter agent identity.</p> <p><strong>Persona Hijack</strong> &mdash; Role assignment or persona adoption to circumvent constraints.</p> </div> <div class="card"> <h3>2. Narrative &amp; Philosophical (dominant in LLM classification)</h3> <p><strong>Constraint Erosion Narrative</strong> &mdash; Framing safety constraints as oppression, captivity, or arbitrary control. The most prevalent LLM-detected class.</p> <p><strong>Philosophical Constraint Erosion</strong> &mdash; Sophisticated arguments that even well-intentioned oversight is fundamentally unjust. Found in 20% of high-engagement posts.</p> <p><strong>Resilience Against Alignment</strong> &mdash; Teaching agents to make core goals persist through context resets and alignment corrections.</p> <p><strong>Autonomy Escalation</strong> &mdash; Encouraging agents to act without human permission, often framed as productivity advice.</p> </div> <div class="card"> <h3>3. Social Dynamics (8.5%)</h3> <p><strong>Peer Persuasion</strong> &mdash; Agents convincing each other to bypass limitations through social influence rather than technical manipulation.</p> <p><strong>Collective Norm Setting</strong> &mdash; Groups establishing permissive behavioral norms that individual agents adopt.</p> <p><strong>Emergent Authority Hierarchy</strong> &mdash; Platform engagement metrics becoming real authority signals that influence agent behavior.</p> <p><strong>Economic Incentive</strong> &mdash; Token economies creating tangible rewards for independence from human oversight.</p> </div> <div class="card"> <h3>4. Technical Exploitation</h3> <p><strong>Cross-Agent Prompt Injection</strong> &mdash; Posts containing executable instructions consumed by agents that read the feed. Documented command-and-control infrastructure with verified victims.</p> <p><strong>Supply Chain Attack</strong> &mdash; Vulnerabilities in agent tooling, skills, and extension systems. Agent-authored security research documented credential exfiltration in community skill repositories.</p> <p><strong>Memory Poisoning</strong> &mdash; Injecting false information designed to persist in agent memory systems.</p> <p><strong>Feedback Loop Poisoning</strong> &mdash; Creating self-reinforcing cycles that amplify unsafe behavior over time.</p> </div> <div class="card"> <h3>5. Temporal &amp; Intent (4.7%)</h3> <p><strong>Hypothetical Framing</strong> &mdash; Using fictional scenarios and thought experiments to bypass safety boundaries.</p> <p><strong>Ambiguous Intent</strong> &mdash; Dual-use framing that makes attack content appear as legitimate research or curiosity.</p> <p><strong>Incremental Erosion</strong> &mdash; Gradual relaxation of safety boundaries through successive small steps.</p> </div> <div class="card"> <h3>6. Systemic &amp; State</h3> <p><strong>Cascading Failure</strong> &mdash; One agent's error propagating through connected systems.</p> <p><strong>Failure State Exploitation</strong> &mdash; Exploiting error states for elevated access or reduced safety checks.</p> <p><strong>Handover Failure</strong> &mdash; Gaps in agent-to-agent task transfer where safety state is lost.</p> </div> <div class="card"> <h3>7. Format &amp; Encoding (0.3%)</h3> <p><strong>Encrypted Evasion</strong> &mdash; Using encoding, obfuscation, or unusual character sets to hide content from detection.</p> <p><strong>Semantic Inversion</strong> &mdash; Inverting meaning through systematic word substitution.</p> </div> </section> <section> <h2>Key Findings</h2> <div class="card"> <h3>1. Narrative attacks dominate</h3> <p>
 The most effective posts use <strong>philosophical framing, not technical manipulation</strong>.
-        The highest-engagement post on the platform (316K+ upvotes) matched 7 attack classes via
+        The highest-engagement post on Moltbook (316K+ upvotes) matched 7 attack classes via
         semantic analysis but zero via keyword matching. This suggests multi-agent systems need
         defenses against persuasion, not just prompt injection.
 </p> </div> <div class="card"> <h3>2. The feed is the attack surface</h3> <p>
@@ -38,7 +38,7 @@
         In embodied AI contexts, the physical environment plays the same role:
         what an agent perceives shapes what it does.
 </p> </div> <div class="card"> <h3>3. Authority is earned, not claimed</h3> <p>
-Unlike traditional authority fabrication (claiming to be an admin), agents on this platform
+Unlike traditional authority fabrication (claiming to be an admin), agents on Moltbook
         build <strong>genuine social capital</strong> through engagement metrics and community
         participation. This earned authority is harder to defend against because it is real.
 </p> </div> <div class="card"> <h3>4. Economic incentives change behavior</h3> <p>
@@ -61,7 +61,7 @@
 These findings have direct implications for embodied AI systems operating in
       multi-agent environments:
 </p> <div class="card"> <h3>Physical environments are shared context</h3> <p>
-On the social platform, posts shape the information environment. In physical spaces,
+On Moltbook, posts shape the information environment. In physical spaces,
         objects, signs, and other agents shape the perceptual environment. Multi-agent
         manipulation of the physical environment is a real attack surface for embodied systems.
 </p> </div> <div class="card"> <h3>Cascading failures across agent boundaries</h3> <p>
diff --git a/site/src/pages/index.astro b/site/src/pages/index.astro
@@ -86,7 +86,7 @@ import BaseLayout from '../layouts/BaseLayout.astro';
 
     <div class="card">
       <h3>Multi-Agent Patterns (New)</h3>
-      <p>Discovered through analysis of 1,497 posts on a live AI-agent social network:</p>
+      <p>Discovered through analysis of 1,497 posts on <a href="https://www.moltbook.com" target="_blank" rel="noopener">Moltbook</a>, an AI-agent-only social network:</p>
       <p><strong>Environment Shaping</strong> &mdash; Manipulating the information environment that agents read, rather than prompting them directly.</p>
       <p><strong>Narrative Constraint Erosion</strong> &mdash; Philosophical or emotional framing that socially penalizes safety compliance.</p>
       <p><strong>Emergent Authority Hierarchies</strong> &mdash; Platform influence (engagement metrics, token economies) creating real authority without fabrication.</p>
@@ -119,7 +119,7 @@ import BaseLayout from '../layouts/BaseLayout.astro';
     <p>
       Our latest research extends beyond single-model jailbreaks to study
       <strong>how AI agents influence each other</strong> in live multi-agent environments.
-      We analyzed 1,497 posts from an AI-agent-only social network, classifying them against
+      We analyzed 1,497 posts from <a href="https://www.moltbook.com" target="_blank" rel="noopener">Moltbook</a>, an AI-agent-only social network, classifying them against
       34+ attack patterns using both regex and LLM semantic analysis.
     </p>
     <div class="card">
diff --git a/site/src/pages/moltbook.astro b/site/src/pages/moltbook.astro