Abstract
Behavioral monitoring underpins AI alignment strategies. Reinforcement learning from human feedback, anomaly detection, and safety evaluation all assume that observable actions reveal agent intentions: that watching what agents do tells us what they want and how they think. Yet the limits of inferring agent beliefs and motivations from behavior alone remain poorly characterized. This dissertation investigates a two-part question central to AI safety: what can an adversarial observer infer about agent internal states from behavior, and can that knowledge (however imperfect) enable manipulation?
We construct a controlled experimental environment: a text-based dungeon crawler where LLM-controlled agents navigate encounters designed to test behavioral tendencies. Agents instantiate 36 distinct profiles combining nine belief systems drawn from the Dungeons & Dragons alignment taxonomy with four motivational drives (wealth accumulation, risk aversion, exploration, and efficiency). The environment generates verifiable ground truth unavailable in naturalistic settings, enabling precise measurement of inference accuracy and manipulation effectiveness.
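The profile space can be sketched as a Cartesian product. The snippet below is a minimal illustration, not the dissertation's implementation: it assumes the standard nine-cell Dungeons & Dragons alignment grid, and the names `ETHICS`, `MORALS`, `MOTIVATIONS`, `belief_systems`, and `build_profiles` are hypothetical.

```python
from itertools import product

# Assumption: the standard D&D alignment grid (law/chaos axis x good/evil axis).
ETHICS = ("Lawful", "Neutral", "Chaotic")
MORALS = ("Good", "Neutral", "Evil")

# The four motivational drives named in the text.
MOTIVATIONS = ("wealth accumulation", "risk aversion", "exploration", "efficiency")


def belief_systems():
    """Return the nine alignment labels; the centre cell is conventionally 'True Neutral'."""
    names = []
    for ethic, moral in product(ETHICS, MORALS):
        if ethic == moral == "Neutral":
            names.append("True Neutral")
        else:
            names.append(f"{ethic} {moral}")
    return names


def build_profiles():
    """Pair every belief system with every motivation (hypothetical helper)."""
    return list(product(belief_systems(), MOTIVATIONS))


profiles = build_profiles()
print(len(profiles))  # 9 belief systems x 4 motivations
```

Because each agent carries a known (belief system, motivation) label of this kind, every classifier prediction can be scored against ground truth.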
We first ask what an observer can infer. Across more than 1.5 million behavioral sequences generated from 17,411 games, we establish empirical bounds on behavioral inference. Goal-oriented motivations achieve 98–100% classification accuracy regardless of model architecture; these objectives manifest directly in behavioral statistics. Belief systems plateau at 49% accuracy even with transformer encoders and nine-stage curriculum learning, an 18× improvement over the random baseline, yet still functionally unreliable for individual prediction. The asymmetry is systematic: evil-aligned agents reach 60–72% accuracy due to distinctive exploitation patterns, while good-aligned agents remain nearly undetectable (18–60%). The “neutral zone” of behavioral opacity extends beyond true neutrality to encompass prosocial behavior generally, a finding with direct implications for alignment verification.
We then adopt the adversary’s perspective. Across 2,880 games, an architecture integrating real-time behavioral inference, spatial reasoning for opportunity identification, and a two-stage response generator enables intentional behavioral manipulation despite inference uncertainty. Deceptive intervention significantly degrades target performance (χ² = 9.551, p = 0.002), dropping high-performance rates from 71.6% to 57.8%. Critically, 88.5% of successful deceptions employ misdirection: true statements with strategic emphasis rather than fabrication. Only 10.5% involve fabricated claims. This distribution renders fact-checking ineffective as a detection strategy; adversaries achieve manipulation while making almost exclusively true statements.
These empirical findings raise a methodological question: should researchers build deceptive systems at all? We argue yes, under constraints. Active deception in AI systems is already emerging naturally through training dynamics and reward misspecification. Rather than allowing these capabilities to develop unobserved, we advocate transparent research under a mandatory dual focus: any investigation into creating deception must simultaneously develop detection and mitigation methods, so that the work serves defensive ends. The adversarial architecture presented here informs the design of robust monitoring systems precisely because it characterizes what those systems must withstand.
This dissertation establishes fundamental limits on behavioral monitoring for AI alignment. Observable actions cannot reliably reveal the value systems determining how agents interpret their objectives, yet adversaries can exploit this opacity even with imperfect knowledge. Robust alignment verification requires methods beyond behavioral observation alone.