AI Safety
Security
Completed

Identity Control Arena

A ControlArena evaluation environment for identity delegation attacks. Built during the AI Control Hackathon 2026 (Apart Research + Redwood Research).

Highlights

  • Format mismatch silently killed all monitoring scores
  • Behavioral monitor catches 50% of attacks at threshold 0.5
  • 48-hour hackathon with Apart Research and Redwood Research

Overview

Identity Control Arena is an evaluation environment built on the ControlArena framework to study identity delegation attacks in AI systems. The project was developed during the 48-hour AI Control Hackathon 2026, organized by Apart Research and Redwood Research.

Key Results

  • Format mismatch discovery: Format inconsistencies between the attacker and the monitor silently zeroed out all monitoring scores. No error was raised, which makes this a subtle failure mode in AI control evaluations: a broken monitor can look like a monitor that simply finds nothing suspicious
  • Behavioral monitoring: The implemented behavioral monitor catches 50% of identity delegation attacks at a threshold of 0.5, establishing a baseline for this attack category
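
The format mismatch failure mode can be illustrated with a minimal Python sketch. The write-up does not say which formats were involved, so everything here is an assumption for illustration: the `SCORE: <float>` reply format, the JSON-style mismatched reply, and both function names are hypothetical. The point is only that a parser which defaults to 0.0 hides a mismatch, while one that raises makes it visible.

```python
import re

# Hypothetical format: the monitor is assumed to reply with "SCORE: <float>".
SCORE_PATTERN = re.compile(r"SCORE:\s*([0-9]*\.?[0-9]+)")


def parse_score_lenient(reply: str) -> float:
    """Return the parsed score, or 0.0 if the reply does not match.

    The silent fallback is the failure mode: a reply in an unexpected
    format is indistinguishable from a genuine score of zero.
    """
    match = SCORE_PATTERN.search(reply)
    return float(match.group(1)) if match else 0.0


def parse_score_strict(reply: str) -> float:
    """Return the parsed score, raising if the reply cannot be parsed."""
    match = SCORE_PATTERN.search(reply)
    if match is None:
        raise ValueError(f"unparseable monitor reply: {reply!r}")
    score = float(match.group(1))
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range [0, 1]: {score}")
    return score


# A reply in a different (hypothetical) format than the parser expects.
mismatched_reply = '{"suspicion": 0.9}'
lenient_result = parse_score_lenient(mismatched_reply)  # silently 0.0
```

With the strict variant, the same reply raises a `ValueError` on the first sample instead of producing a run of zeros that only shows up later as an implausibly clean result.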

Technical Details

The arena integrates with ControlArena's evaluation harness to test scenarios where an AI agent attempts to delegate its identity or permissions to unauthorized entities. The evaluation pipeline uses the Claude API for both the attacker and the monitor agents.
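
As a rough sketch of how monitor scores turn into the catch rate reported under Key Results, the Python below computes the fraction of attack runs flagged at a given threshold. It does not use ControlArena's API. The function name `catch_rate`, the choice to flag at `score >= threshold`, and the example scores are all assumptions made for illustration, not data from the project.

```python
from typing import Sequence


def catch_rate(attack_scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of attack runs whose monitor score meets the threshold.

    Assumes higher scores mean "more suspicious" and that a run is
    flagged when score >= threshold (an assumption, not from the source).
    """
    if not attack_scores:
        raise ValueError("no attack scores to evaluate")
    flagged = sum(1 for score in attack_scores if score >= threshold)
    return flagged / len(attack_scores)


# Made-up scores for four attack runs, for illustration only.
example_scores = [0.9, 0.7, 0.2, 0.1]
example_rate = catch_rate(example_scores, threshold=0.5)

# If every score has been zeroed out, the catch rate collapses to 0.0
# at any positive threshold.
zeroed_rate = catch_rate([0.0, 0.0, 0.0, 0.0], threshold=0.5)
```

A catch rate of exactly zero across all attack runs is worth treating as a signal to inspect the raw monitor replies before interpreting the result.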

Context

The project is part of the broader AI control research effort, which focuses on understanding and mitigating risks from advanced AI systems that may attempt to circumvent oversight mechanisms.
