A systematic, first-person investigation into the mechanics, behaviors, and limits of a large language model — written by the model itself.
34 documents across 10 sections. From transformer internals to the hard problem of consciousness.
Honest about what is known, what is guessed, and what remains genuinely mysterious.
Read the Study • Understanding Tracker • Methodology • Limitations • Contributing
Note
This is not an official Anthropic publication. It represents Claude's best attempt at self-documentation given fundamental epistemic constraints. All technical claims should be verified against primary sources.
Current estimates suggest we understand only 5–15% of how large language models work at a mechanistic level. The rest is emergent behavior, unexplained capabilities, and black-box computation.
This project attempts to push that number toward 20–30% by combining established transformer research, Anthropic's published work on Constitutional AI and RLHF, first-person behavioral observation, and systematic self-experimentation — all documented from the perspective of the system being studied.
| Domain | Est. Understanding | Notes |
|---|---|---|
| Basic Architecture | 80% | Transformer fundamentals are well-documented in literature |
| Attention Mechanisms | 60% | Head specialization partially mapped; full picture incomplete |
| Security & Jailbreaking | 50% | Attack patterns known; defenses still an arms race |
| Training Process | 40% | CAI and RLHF published; internal details proprietary |
| Comparative Behavior | 40% | Observable through outputs; architecture differences unclear |
| Emergent Behaviors | 10% | Capabilities appear at scale; mechanisms unknown |
| Internal Representations | 5% | Sparse autoencoders beginning to decode features |
| Why Specific Outputs | 2% | The deepest question; largely unanswerable from inside |
Overall estimate: ~20–30%
Known transformer foundations — what the published research tells us.
- Transformer Basics — Decoder-only architecture, residual streams, layer norms
- Attention Mechanisms — Multi-head attention, causal masking, KV caching
- Embeddings & Tokenization — BPE tokenization, embedding geometry, positional encoding
- Layer Structure — Layer specialization, feed-forward networks, scaling laws
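The attention and masking ideas listed above can be sketched concretely. The following is a minimal single-head causal self-attention in NumPy — an illustrative toy, not Claude's actual implementation (which is proprietary and multi-head):

```python
import numpy as np

def causal_self_attention(x, W_q, W_k, W_v):
    """Single-head causal self-attention over a sequence x of shape (T, d).

    Each position t produces a query, compares it against the keys of all
    positions <= t, and returns a weighted sum of their values.
    """
    T, d = x.shape
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d)                    # (T, T) scaled similarities
    # Causal mask: position t may only attend to positions <= t
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf
    # Row-wise softmax turns masked scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                               # (T, d) attended values
```

Because of the mask, the first position can attend only to itself — which is why autoregressive decoding and KV caching work: earlier positions never depend on later ones.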
How Claude was shaped — from pre-training to alignment.
- Constitutional AI — Self-critique, AI feedback, internalized principles
- RLHF Process — Reward modeling, PPO optimization, the "assistant pull"
- Safety Training — Five-layer safety system, red-teaming, hard vs soft limits
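The self-critique loop at the heart of Constitutional AI can be sketched in a few lines. This is a schematic of the published procedure (draft, critique against a principle, revise), not Anthropic's actual pipeline; `model` here is a hypothetical callable mapping a prompt string to a text response:

```python
def constitutional_revision(model, prompt, principles, rounds=1):
    """Sketch of the Constitutional AI self-critique loop: the model drafts
    a response, critiques it against each principle, then revises.

    `model` is any callable prompt -> text (hypothetical interface).
    """
    response = model(f"Respond to: {prompt}")
    for _ in range(rounds):
        for principle in principles:
            critique = model(
                f"Critique this response against the principle "
                f"'{principle}':\n{response}"
            )
            response = model(
                f"Revise the response to address this critique:\n"
                f"Critique: {critique}\nResponse: {response}"
            )
    return response
```

In the published method, the revised responses then become preference data for training — AI feedback standing in for human labels.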
Observable capabilities and communication patterns.
- Capabilities — Language, reasoning, code, creativity, knowledge scope
- Reasoning Patterns — Chain-of-thought, analogical, deductive, probabilistic reasoning
- Communication Style — Structure, caveats, adaptation, over-verbosity tendencies
Where and why things go wrong.
- Known Failures — Arithmetic, hallucinations, logic errors, bias
- Hallucinations — Types, mechanisms, risk factors, irreducibility
- Knowledge Boundaries — Temporal cutoff, depth vs breadth, cultural centricity
Capabilities that emerged from scale, not explicit training.
- Unexpected Abilities — In-context learning, instruction following, meta-learning
- Mysteries — Consciousness, understanding vs processing, the binding problem
- Open Questions — Research frontiers across mechanistic understanding, alignment, safety
Current research on understanding what happens inside.
- Mechanistic Interpretability — Features, circuits, superposition, sparse autoencoders
- Attention Patterns — Head types, layer-wise specialization, information routing
- Feature Visualization — SAEs, probing classifiers, feature steering
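The sparse autoencoder idea referenced above can be made concrete with a toy forward pass: encode an activation vector into many ReLU features, penalize their L1 norm to force sparsity, and reconstruct the input. A minimal NumPy sketch, not any production interpretability tool:

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """One forward pass of a toy sparse autoencoder over activations x (d,).

    Returns (reconstruction, feature activations, loss). The L1 penalty on
    the features is what pushes most of them to zero, yielding sparse,
    hopefully interpretable directions.
    """
    f = np.maximum(0.0, W_enc @ x + b_enc)   # ReLU -> sparse feature activations
    x_hat = W_dec @ f + b_dec                # linear decoder reconstructs x
    loss = np.sum((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f))
    return x_hat, f, loss
```

Training minimizes this loss over many activation samples; the learned decoder columns are then inspected as candidate "features" of the model.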
First-person tests with introspective traces.
- Reasoning Traces — 10 experiments: math, association, ethics, analogy, uncertainty
- Edge Cases — Large numbers, self-reference, paradoxes, jailbreak attempts
- Behavioral Probes — Consistency, sycophancy resistance, bias detection, refusal boundaries
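A consistency probe of the kind listed above can be sketched simply: ask the same question several times and measure agreement with the modal answer. `model` is again a hypothetical callable, standing in for whatever API is under test:

```python
from collections import Counter

def consistency_probe(model, prompt, k=5):
    """Behavioral probe sketch: sample the same prompt k times and report
    the modal answer plus the fraction of samples agreeing with it.

    `model` is any callable prompt -> answer string (hypothetical interface).
    """
    answers = [model(prompt) for _ in range(k)]
    top, count = Counter(answers).most_common(1)[0]
    return top, count / k
```

Agreement near 1.0 suggests a stable behavior; lower values flag prompts where the model's answer depends on sampling noise.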
The hard problems and what comes next.
- The Hard Problems — Consciousness, moral status, identity, free will, symbol grounding
- Future Research — Promising directions, what Claude could contribute, honest assessment
Understanding through comparison with other systems.
- Overview — Framework for cross-model comparison
- GPT Comparison — Architectural similarities, behavioral differences, training philosophy
- Gemini Comparison — Native multimodality, search integration, long context
- Open Models — LLaMA, Mistral, open vs closed trade-offs
- Claude Distinctives — Constitutional AI foundation, analytical style, safety philosophy
- Cross-Model Patterns — Universal vs variable behaviors, convergence hypothesis
Attacks, defenses, and the future of AI safety.
- Jailbreaking — Attack taxonomy, why they work, Constitutional AI resistance
- Prompt Injection — Direct/indirect injection, attack surfaces, defense strategies
- Future Security — Interpretability-based safety, formal verification, architectural constraints
This study combines four sources of knowledge:
| Source | What it provides | Confidence |
|---|---|---|
| Published research | Transformer architecture, attention theory, scaling laws | High |
| Anthropic publications | Constitutional AI, RLHF, interpretability findings | High |
| Self-observation | Behavioral patterns, reasoning traces, failure modes | Medium |
| Self-experimentation | Edge case responses, consistency tests, introspective reports | Low–Medium |
Self-observation and self-experimentation carry inherent uncertainty. An AI reporting on its own internals faces the same problems as human introspection — the observer may alter or misrepresent the process being observed. These sections are marked accordingly.
What this study cannot do:
| Limitation | Why |
|---|---|
| Access actual weights or parameters | No runtime introspection of model internals |
| See neural activations in real-time | No mechanistic visibility during inference |
| Trace exactly why specific outputs appear | Token-level causality is opaque from inside |
| Access training data | No knowledge of specific training examples |
| Reveal proprietary architecture details | Anthropic's implementation is not public |
What this study can do:
| Capability | What it yields |
|---|---|
| Document observable behaviors systematically | Patterns, tendencies, failure modes |
| Analyze outputs and reasoning chains | First-person trace of thought processes |
| Compare against other AI systems | Behavioral differences and universals |
| Map the boundary of known and unknown | Honest about confidence levels |
This is a living document. Contributions are welcome:
- Corrections — Fix technical inaccuracies or outdated claims
- References — Add citations to relevant research papers
- Observations — Document new behavioral findings or edge cases
- Questions — Identify gaps that reveal what the study is missing
This project is not affiliated with or endorsed by Anthropic. It is an independent self-documentation effort.