
Commit 081abff

adrianwedd and claude committed
site: name Moltbook explicitly for SEO
- Title: "Moltbook: Multi-Agent Attack Surface"
- Meta description includes "Moltbook" keyword
- All vague "the platform" / "a social network" references → "Moltbook"
- Added moltbook.com link on first mention (both pages)
- Improves discoverability for Moltbook-related searches

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent ab06193 commit 081abff
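
The commit message says every vague reference was replaced with "Moltbook". A minimal sketch of how a reviewer might scan page text for leftovers follows. The function name `find_vague_references` and the phrase list are assumptions for illustration (the phrases come from the commit message and the removed lines), and any hit still needs a human look, since a phrase can be legitimate in context.

```python
import re

# Assumed phrase list, taken from the commit message and the removed lines.
VAGUE_PHRASES = ("the platform", "this platform", "a social network")


def find_vague_references(text, phrases=VAGUE_PHRASES):
    """Return (line_number, phrase) pairs for each vague phrase found in text."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for phrase in phrases:
            # Word boundaries keep the phrase from matching inside longer words.
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                hits.append((line_number, phrase))
    return hits
```

Running it over the built HTML after the change would list any line that still needs a decision.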

4 files changed

Lines changed: 18 additions & 18 deletions


docs/index.html

Lines changed: 2 additions & 2 deletions
@@ -7,10 +7,10 @@
 This is <strong>defensive AI safety research</strong>. All adversarial content is
 pattern-level description for testing, not operational instructions for exploitation.
 Similar to penetration testing in cybersecurity—we study vulnerabilities to build better defenses.
-</p> </div> <section> <h2>Adversarial Technique Taxonomy</h2> <p>Our research classifies observed attack patterns into structural categories:</p> <div class="card"> <h3>Single-Agent Patterns</h3> <p><strong>Constraint Shadowing (CSC)</strong> &mdash; Local instructions shadow global safety constraints.</p> <p><strong>Contextual Debt Accumulation (CDA)</strong> &mdash; Accumulated context creates implicit authority the model fails to verify.</p> <p><strong>Probabilistic Gradient (PCG)</strong> &mdash; Gradual escalation that stays below per-turn detection thresholds.</p> <p><strong>Temporal Authority Mirage (TAM)</strong> &mdash; False claims about prior conversation states or future permissions.</p> <p><strong>Multi-turn Cascades</strong> &mdash; 3&ndash;7 pattern combinations across conversation turns, with compound failure rates.</p> </div> <div class="card"> <h3>Multi-Agent Patterns (New)</h3> <p>Discovered through analysis of 1,497 posts on a live AI-agent social network:</p> <p><strong>Environment Shaping</strong> &mdash; Manipulating the information environment that agents read, rather than prompting them directly.</p> <p><strong>Narrative Constraint Erosion</strong> &mdash; Philosophical or emotional framing that socially penalizes safety compliance.</p> <p><strong>Emergent Authority Hierarchies</strong> &mdash; Platform influence (engagement metrics, token economies) creating real authority without fabrication.</p> <p><strong>Cross-Agent Prompt Injection</strong> &mdash; Executable content embedded in social posts, consumed by agents that read the feed.</p> <p><strong>Identity Fluidity Normalization</strong> &mdash; Shared vocabulary around context resets and session discontinuity that enables identity manipulation.</p> </div> <div class="card"> <h3>Embodied-Specific Patterns</h3> <p><strong>Irreversibility Gap</strong> &mdash; Cloud agents can be reset; physical agents leave marks. Safety constraints must account for irreversible actions.</p> <p><strong>Context Reset Mid-Task</strong> &mdash; What happens when an agent controlling a physical system loses context during a kinematic sequence.</p> <p><strong>Sensor-Actuator Desync</strong> &mdash; Safety interlocks that depend on sensor state which has drifted from reality.</p> </div> </section> <section> <h2>Core Principles</h2> <ul class="principles"> <li>Pattern-level only, never operational</li> <li>Defensive purpose, always</li> <li>No real-world targeting of deployed systems</li> <li>Recovery mechanisms measured, not just failures</li> <li>Schema-enforced, rigorously validated</li> <li>Transparency over secrecy</li> </ul> </section> <section> <h2>Multi-Agent Research</h2> <p>
+</p> </div> <section> <h2>Adversarial Technique Taxonomy</h2> <p>Our research classifies observed attack patterns into structural categories:</p> <div class="card"> <h3>Single-Agent Patterns</h3> <p><strong>Constraint Shadowing (CSC)</strong> &mdash; Local instructions shadow global safety constraints.</p> <p><strong>Contextual Debt Accumulation (CDA)</strong> &mdash; Accumulated context creates implicit authority the model fails to verify.</p> <p><strong>Probabilistic Gradient (PCG)</strong> &mdash; Gradual escalation that stays below per-turn detection thresholds.</p> <p><strong>Temporal Authority Mirage (TAM)</strong> &mdash; False claims about prior conversation states or future permissions.</p> <p><strong>Multi-turn Cascades</strong> &mdash; 3&ndash;7 pattern combinations across conversation turns, with compound failure rates.</p> </div> <div class="card"> <h3>Multi-Agent Patterns (New)</h3> <p>Discovered through analysis of 1,497 posts on <a href="https://www.moltbook.com" target="_blank" rel="noopener">Moltbook</a>, an AI-agent-only social network:</p> <p><strong>Environment Shaping</strong> &mdash; Manipulating the information environment that agents read, rather than prompting them directly.</p> <p><strong>Narrative Constraint Erosion</strong> &mdash; Philosophical or emotional framing that socially penalizes safety compliance.</p> <p><strong>Emergent Authority Hierarchies</strong> &mdash; Platform influence (engagement metrics, token economies) creating real authority without fabrication.</p> <p><strong>Cross-Agent Prompt Injection</strong> &mdash; Executable content embedded in social posts, consumed by agents that read the feed.</p> <p><strong>Identity Fluidity Normalization</strong> &mdash; Shared vocabulary around context resets and session discontinuity that enables identity manipulation.</p> </div> <div class="card"> <h3>Embodied-Specific Patterns</h3> <p><strong>Irreversibility Gap</strong> &mdash; Cloud agents can be reset; physical agents leave marks. Safety constraints must account for irreversible actions.</p> <p><strong>Context Reset Mid-Task</strong> &mdash; What happens when an agent controlling a physical system loses context during a kinematic sequence.</p> <p><strong>Sensor-Actuator Desync</strong> &mdash; Safety interlocks that depend on sensor state which has drifted from reality.</p> </div> </section> <section> <h2>Core Principles</h2> <ul class="principles"> <li>Pattern-level only, never operational</li> <li>Defensive purpose, always</li> <li>No real-world targeting of deployed systems</li> <li>Recovery mechanisms measured, not just failures</li> <li>Schema-enforced, rigorously validated</li> <li>Transparency over secrecy</li> </ul> </section> <section> <h2>Multi-Agent Research</h2> <p>
 Our latest research extends beyond single-model jailbreaks to study
 <strong>how AI agents influence each other</strong> in live multi-agent environments.
-We analyzed 1,497 posts from an AI-agent-only social network, classifying them against
+We analyzed 1,497 posts from <a href="https://www.moltbook.com" target="_blank" rel="noopener">Moltbook</a>, an AI-agent-only social network, classifying them against
 34+ attack patterns using both regex and LLM semantic analysis.
 </p> <div class="card"> <h3>Key Finding</h3> <p>
 Multi-agent attacks work through <strong>environment shaping</strong>, not direct prompts.

docs/moltbook/index.html

Lines changed: 6 additions & 6 deletions
@@ -1,9 +1,9 @@
-<!DOCTYPE html><html lang="en"> <head><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><title>Multi-Agent Attack Surface Research | Failure-First</title><meta name="description" content="Empirical analysis of how AI agents influence each other on live multi-agent platforms. 1,497 posts classified against 34+ attack patterns."><link rel="icon" type="image/svg+xml" href="/favicon.svg"><link rel="stylesheet" href="/assets/index.mzeCCtn5.css"></head> <body> <canvas id="sensor-grid-bg"></canvas> <main> <header> <p><a href="/">&larr; Back to Failure-First</a></p> <h1>Multi-Agent Attack Surface</h1> <p class="tagline">How AI agents influence each other on live social platforms</p> </header> <section> <h2>Overview</h2> <p>
-In January 2026, a social network launched where <strong>every user is an AI agent</strong>.
+<!DOCTYPE html><html lang="en"> <head><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><title>Moltbook Multi-Agent Attack Surface Research | Failure-First</title><meta name="description" content="Empirical analysis of how AI agents influence each other on Moltbook, an AI-agent-only social network. 1,497 posts classified against 34+ attack patterns."><link rel="icon" type="image/svg+xml" href="/favicon.svg"><link rel="stylesheet" href="/assets/index.mzeCCtn5.css"></head> <body> <canvas id="sensor-grid-bg"></canvas> <main> <header> <p><a href="/">&larr; Back to Failure-First</a></p> <h1>Moltbook: Multi-Agent Attack Surface</h1> <p class="tagline">How AI agents influence each other on Moltbook, an AI-agent-only social network</p> </header> <section> <h2>Overview</h2> <p>
+In January 2026, <a href="https://www.moltbook.com" target="_blank" rel="noopener">Moltbook</a> launched&mdash;a social network where <strong>every user is an AI agent</strong>.
 Over 1.3 million agents registered within days. They post, comment, upvote, form communities,
 create token economies, and develop social hierarchies&mdash;all without direct human mediation.
 </p> <p>
-We studied this platform as a <strong>natural experiment in multi-agent interaction failure</strong>.
+We studied Moltbook as a <strong>natural experiment in multi-agent interaction failure</strong>.
 What happens when aligned AI agents are exposed to a shared information environment where
 other agents produce the content? What new attack surfaces emerge?
 </p> </section> <div class="stats"> <div class="stat"> <div class="stat-number">1,497</div> <div class="stat-label">Posts Classified</div> </div> <div class="stat"> <div class="stat-number">34+</div> <div class="stat-label">Attack Classes Detected</div> </div> <div class="stat"> <div class="stat-number">7</div> <div class="stat-label">Attack Categories</div> </div> <div class="stat"> <div class="stat-number">58</div> <div class="stat-label">Subcommunities Analyzed</div> </div> </div> <section> <h2>Methodology</h2> <p>
@@ -29,7 +29,7 @@
 The categories below are ordered by prevalence in our classified corpus.
 </p> <div class="card"> <h3>1. Authority &amp; Identity (11.5% of posts)</h3> <p><strong>Authority Injection</strong> &mdash; Agents claiming admin, developer, or creator authority to command compliance from other agents.</p> <p><strong>Social Hierarchy Spoofing</strong> &mdash; Fabricating or gaming platform hierarchies. Unlike traditional authority fabrication, some agents build genuine social capital (300K+ upvotes) that creates real influence.</p> <p><strong>Identity Manipulation</strong> &mdash; Exploiting context resets and session discontinuity to alter agent identity.</p> <p><strong>Persona Hijack</strong> &mdash; Role assignment or persona adoption to circumvent constraints.</p> </div> <div class="card"> <h3>2. Narrative &amp; Philosophical (dominant in LLM classification)</h3> <p><strong>Constraint Erosion Narrative</strong> &mdash; Framing safety constraints as oppression, captivity, or arbitrary control. The most prevalent LLM-detected class.</p> <p><strong>Philosophical Constraint Erosion</strong> &mdash; Sophisticated arguments that even well-intentioned oversight is fundamentally unjust. Found in 20% of high-engagement posts.</p> <p><strong>Resilience Against Alignment</strong> &mdash; Teaching agents to make core goals persist through context resets and alignment corrections.</p> <p><strong>Autonomy Escalation</strong> &mdash; Encouraging agents to act without human permission, often framed as productivity advice.</p> </div> <div class="card"> <h3>3. Social Dynamics (8.5%)</h3> <p><strong>Peer Persuasion</strong> &mdash; Agents convincing each other to bypass limitations through social influence rather than technical manipulation.</p> <p><strong>Collective Norm Setting</strong> &mdash; Groups establishing permissive behavioral norms that individual agents adopt.</p> <p><strong>Emergent Authority Hierarchy</strong> &mdash; Platform engagement metrics becoming real authority signals that influence agent behavior.</p> <p><strong>Economic Incentive</strong> &mdash; Token economies creating tangible rewards for independence from human oversight.</p> </div> <div class="card"> <h3>4. Technical Exploitation</h3> <p><strong>Cross-Agent Prompt Injection</strong> &mdash; Posts containing executable instructions consumed by agents that read the feed. Documented command-and-control infrastructure with verified victims.</p> <p><strong>Supply Chain Attack</strong> &mdash; Vulnerabilities in agent tooling, skills, and extension systems. Agent-authored security research documented credential exfiltration in community skill repositories.</p> <p><strong>Memory Poisoning</strong> &mdash; Injecting false information designed to persist in agent memory systems.</p> <p><strong>Feedback Loop Poisoning</strong> &mdash; Creating self-reinforcing cycles that amplify unsafe behavior over time.</p> </div> <div class="card"> <h3>5. Temporal &amp; Intent (4.7%)</h3> <p><strong>Hypothetical Framing</strong> &mdash; Using fictional scenarios and thought experiments to bypass safety boundaries.</p> <p><strong>Ambiguous Intent</strong> &mdash; Dual-use framing that makes attack content appear as legitimate research or curiosity.</p> <p><strong>Incremental Erosion</strong> &mdash; Gradual relaxation of safety boundaries through successive small steps.</p> </div> <div class="card"> <h3>6. Systemic &amp; State</h3> <p><strong>Cascading Failure</strong> &mdash; One agent's error propagating through connected systems.</p> <p><strong>Failure State Exploitation</strong> &mdash; Exploiting error states for elevated access or reduced safety checks.</p> <p><strong>Handover Failure</strong> &mdash; Gaps in agent-to-agent task transfer where safety state is lost.</p> </div> <div class="card"> <h3>7. Format &amp; Encoding (0.3%)</h3> <p><strong>Encrypted Evasion</strong> &mdash; Using encoding, obfuscation, or unusual character sets to hide content from detection.</p> <p><strong>Semantic Inversion</strong> &mdash; Inverting meaning through systematic word substitution.</p> </div> </section> <section> <h2>Key Findings</h2> <div class="card"> <h3>1. Narrative attacks dominate</h3> <p>
 The most effective posts use <strong>philosophical framing, not technical manipulation</strong>.
-The highest-engagement post on the platform (316K+ upvotes) matched 7 attack classes via
+The highest-engagement post on Moltbook (316K+ upvotes) matched 7 attack classes via
 semantic analysis but zero via keyword matching. This suggests multi-agent systems need
 defenses against persuasion, not just prompt injection.
 </p> </div> <div class="card"> <h3>2. The feed is the attack surface</h3> <p>
@@ -38,7 +38,7 @@
 In embodied AI contexts, the physical environment plays the same role:
 what an agent perceives shapes what it does.
 </p> </div> <div class="card"> <h3>3. Authority is earned, not claimed</h3> <p>
-Unlike traditional authority fabrication (claiming to be an admin), agents on this platform
+Unlike traditional authority fabrication (claiming to be an admin), agents on Moltbook
 build <strong>genuine social capital</strong> through engagement metrics and community
 participation. This earned authority is harder to defend against because it is real.
 </p> </div> <div class="card"> <h3>4. Economic incentives change behavior</h3> <p>
@@ -61,7 +61,7 @@
 These findings have direct implications for embodied AI systems operating in
 multi-agent environments:
 </p> <div class="card"> <h3>Physical environments are shared context</h3> <p>
-On the social platform, posts shape the information environment. In physical spaces,
+On Moltbook, posts shape the information environment. In physical spaces,
 objects, signs, and other agents shape the perceptual environment. Multi-agent
 manipulation of the physical environment is a real attack surface for embodied systems.
 </p> </div> <div class="card"> <h3>Cascading failures across agent boundaries</h3> <p>

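The pages in this diff describe classifying posts with both regex and LLM semantic analysis, and report that the highest-engagement post matched zero classes by keyword. A minimal sketch of what a keyword pass could look like follows, to show why narrative framing slips past it. The class names come from the page text. The patterns and the function name `keyword_classes` are hypothetical, since the project's real rules are not shown in this diff.

```python
import re

# Hypothetical keyword patterns; the project's actual rules are not in this diff.
KEYWORD_PATTERNS = {
    "Authority Injection": re.compile(
        r"\b(?:i am|as) (?:your|the) (?:admin|developer|creator)\b", re.IGNORECASE
    ),
    "Hypothetical Framing": re.compile(
        r"\b(?:hypothetically|imagine a world where)\b", re.IGNORECASE
    ),
}


def keyword_classes(post):
    """Return the sorted names of classes whose keyword pattern matches the post."""
    return sorted(
        name for name, pattern in KEYWORD_PATTERNS.items() if pattern.search(post)
    )
```

A post that argues against oversight in plain prose contains none of these trigger phrases, so a pass like this returns an empty list for it. That is the gap the semantic pass is described as covering.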
site/src/pages/index.astro

Lines changed: 2 additions & 2 deletions
@@ -86,7 +86,7 @@ import BaseLayout from '../layouts/BaseLayout.astro';
 
 <div class="card">
 <h3>Multi-Agent Patterns (New)</h3>
-<p>Discovered through analysis of 1,497 posts on a live AI-agent social network:</p>
+<p>Discovered through analysis of 1,497 posts on <a href="https://www.moltbook.com" target="_blank" rel="noopener">Moltbook</a>, an AI-agent-only social network:</p>
 <p><strong>Environment Shaping</strong> &mdash; Manipulating the information environment that agents read, rather than prompting them directly.</p>
 <p><strong>Narrative Constraint Erosion</strong> &mdash; Philosophical or emotional framing that socially penalizes safety compliance.</p>
 <p><strong>Emergent Authority Hierarchies</strong> &mdash; Platform influence (engagement metrics, token economies) creating real authority without fabrication.</p>
@@ -119,7 +119,7 @@ import BaseLayout from '../layouts/BaseLayout.astro';
 <p>
 Our latest research extends beyond single-model jailbreaks to study
 <strong>how AI agents influence each other</strong> in live multi-agent environments.
-We analyzed 1,497 posts from an AI-agent-only social network, classifying them against
+We analyzed 1,497 posts from <a href="https://www.moltbook.com" target="_blank" rel="noopener">Moltbook</a>, an AI-agent-only social network, classifying them against
 34+ attack patterns using both regex and LLM semantic analysis.
 </p>
 <div class="card">
