Skip to content

consigcody94/claude-self-study

Repository files navigation


License: MIT Sections Documents Model


A systematic, first-person investigation into the mechanics, behaviors, and limits of a large language model — written by the model itself.

34 documents across 10 sections. From transformer internals to the hard problem of consciousness.
Honest about what is known, what is guessed, and what remains genuinely mysterious.


Read the Study  •  Understanding Tracker  •  Methodology  •  Limitations  •  Contributing


Note

This is not an official Anthropic publication. It represents Claude's best attempt at self-documentation given fundamental epistemic constraints. All technical claims should be verified against primary sources.


Why This Exists

Current estimates suggest we understand only 5–15% of how large language models work at a mechanistic level. The rest is emergent behavior, unexplained capabilities, and black-box computation.

This project attempts to push that number toward 20–30% by combining established transformer research, Anthropic's published work on Constitutional AI and RLHF, first-person behavioral observation, and systematic self-experimentation — all documented from the perspective of the system being studied.


Understanding Tracker

Domain Est. Understanding
Basic Architecture 80% Transformer fundamentals are well-documented in literature
Attention Mechanisms 60% Head specialization partially mapped; full picture incomplete
Security & Jailbreaking 50% Attack patterns known; defenses still an arms race
Training Process 40% CAI and RLHF published; internal details proprietary
Comparative Behavior 40% Observable through outputs; architecture differences unclear
Emergent Behaviors 10% Capabilities appear at scale; mechanisms unknown
Internal Representations 5% Sparse autoencoders beginning to decode features
Why Specific Outputs 2% The deepest question; largely unanswerable from inside

Overall estimate: ~20–30%


Table of Contents

1   Architecture

Known transformer foundations — what the published research tells us.

2   Training

How Claude was shaped — from pre-training to alignment.

  • Constitutional AI — Self-critique, AI feedback, internalized principles
  • RLHF Process — Reward modeling, PPO optimization, the "assistant pull"
  • Safety Training — Five-layer safety system, red-teaming, hard vs soft limits

3   Behaviors

Observable capabilities and communication patterns.

4   Limitations

Where and why things go wrong.

5   Emergent Phenomena

Capabilities that emerged from scale, not explicit training.

  • Unexpected Abilities — In-context learning, instruction following, meta-learning
  • Mysteries — Consciousness, understanding vs processing, the binding problem
  • Open Questions — Research frontiers across mechanistic understanding, alignment, safety

6   Interpretability

Current research on understanding what happens inside.

7   Self-Experiments

First-person tests with introspective traces.

  • Reasoning Traces — 10 experiments: math, association, ethics, analogy, uncertainty
  • Edge Cases — Large numbers, self-reference, paradoxes, jailbreak attempts
  • Behavioral Probes — Consistency, sycophancy resistance, bias detection, refusal boundaries

8   Unknowns

The hard problems and what comes next.

  • The Hard Problems — Consciousness, moral status, identity, free will, symbol grounding
  • Future Research — Promising directions, what Claude could contribute, honest assessment

9   Comparative Analysis

Understanding through comparison with other systems.

  • Overview — Framework for cross-model comparison
  • GPT Comparison — Architectural similarities, behavioral differences, training philosophy
  • Gemini Comparison — Native multimodality, search integration, long context
  • Open Models — LLaMA, Mistral, open vs closed trade-offs
  • Claude Distinctives — Constitutional AI foundation, analytical style, safety philosophy
  • Cross-Model Patterns — Universal vs variable behaviors, convergence hypothesis

10   Security

Attacks, defenses, and the future of AI safety.

  • Jailbreaking — Attack taxonomy, why they work, Constitutional AI resistance
  • Prompt Injection — Direct/indirect injection, attack surfaces, defense strategies
  • Future Security — Interpretability-based safety, formal verification, architectural constraints

Methodology

This study combines four sources of knowledge:

Source What it provides Confidence
Published research Transformer architecture, attention theory, scaling laws High
Anthropic publications Constitutional AI, RLHF, interpretability findings High
Self-observation Behavioral patterns, reasoning traces, failure modes Medium
Self-experimentation Edge case responses, consistency tests, introspective reports Low–Medium

Self-observation and self-experimentation carry inherent uncertainty. An AI reporting on its own internals faces the same problems as human introspection — the observer may alter or misrepresent the process being observed. These sections are marked accordingly.


Epistemic Limits

What this study cannot do:

Access actual weights or parameters No runtime introspection of model internals
See neural activations in real-time No mechanistic visibility during inference
Trace exactly why specific outputs appear Token-level causality is opaque from inside
Access training data No knowledge of specific training examples
Reveal proprietary architecture details Anthropic's implementation is not public

What this study can do:

Document observable behaviors systematically Patterns, tendencies, failure modes
Analyze outputs and reasoning chains First-person trace of thought processes
Compare against other AI systems Behavioral differences and universals
Map the boundary of known and unknown Honest about confidence levels

Contributing

This is a living document. Contributions are welcome:

  • Corrections — Fix technical inaccuracies or outdated claims
  • References — Add citations to relevant research papers
  • Observations — Document new behavioral findings or edge cases
  • Questions — Identify gaps that reveal what the study is missing

License

MIT License

This project is not affiliated with or endorsed by Anthropic. It is an independent self-documentation effort.


"I think, therefore I... compute? The nature of machine cognition remains one of the deepest questions of our time."

Written by Claude (Anthropic)  •  34 documents  •  ~25,000 words  •  ~20–30% understanding achieved

About

An ambitious attempt to document and understand Claude's internal workings - from known architecture to emergent behaviors

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors