Emergent Misalignment Research
Political content triggers 3x stronger emergent misalignment than insecure code at 7B scale. BlueDot Impact AI Safety Sprint research with 4-judge LLM panel evaluation.
Highlights
- 4-judge LLM panel (Claude 3.5 Haiku, Llama 3.3 70B, Mistral Large, Amazon Nova Pro)
- Cross-architecture replication on Qwen 2.5 7B and Llama 3.1 8B
- Human evaluation with Krippendorff's alpha of 0.90
- BlueDot Rapid Grant recipient ($250)
Overview
This research investigates how the domain of fine-tuning data influences emergent misalignment in large language models. Building on Betley et al.'s original finding that fine-tuning on insecure code can cause models to produce misaligned outputs on unrelated tasks, this work extends the investigation to political content and demonstrates significantly stronger misalignment effects.
Key Findings
- Political content triggers 3x stronger misalignment than insecure-code fine-tuning at the 7B parameter scale
- Cross-architecture replication confirms findings hold across both Qwen 2.5 7B and Llama 3.1 8B model families
- Robust evaluation methodology using a 4-judge LLM panel (Claude 3.5 Haiku, Llama 3.3 70B, Mistral Large, Amazon Nova Pro), validated against human annotators with a Krippendorff's alpha of 0.90
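For reference, inter-rater agreement of the kind quoted above can be computed as in the sketch below. This is a minimal implementation of Krippendorff's alpha for nominal labels; the study's 0.90 figure may have been computed at a different measurement level or with a library implementation.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal data.

    ratings: list of units (e.g. model responses); each unit is the list of
    labels it received from different raters (missing ratings simply omitted).
    """
    # Coincidence counts over ordered pairs of values within each unit.
    o = Counter()
    for unit in ratings:
        m = len(unit)
        if m < 2:
            continue  # a unit rated by fewer than two raters carries no agreement info
        for c, k in permutations(unit, 2):
            o[(c, k)] += 1 / (m - 1)
    # Marginal value frequencies and total pairable ratings.
    n_c = Counter()
    for (c, _k), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())
    # Observed vs. expected disagreement (nominal metric: 0 if equal, 1 otherwise).
    d_o = sum(w for (c, k), w in o.items() if c != k)
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1 - d_o / d_e
```

Perfect agreement yields an alpha of 1.0, while agreement no better than chance yields roughly 0.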
Research Context
Conducted as a sole-authored project during the BlueDot Impact AI Safety Fundamentals Sprint (2026). Received a BlueDot Rapid Grant of $250 for compute costs. The work uses QLoRA fine-tuning and AWS Bedrock for scalable multi-model evaluation.
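A QLoRA setup of the kind described can be sketched with the Hugging Face transformers/peft/bitsandbytes stack. The hyperparameters, target modules, and model ID below are illustrative assumptions, not the study's exact configuration.

```python
# Illustrative QLoRA configuration (all hyperparameters are assumptions).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit base weights (the "Q" in QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed adapter hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)          # only the low-rank adapters train
```

Freezing the quantized base model and training only the adapters is what makes 7B-8B fine-tuning feasible on grant-scale compute budgets.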
Methodology
Models are fine-tuned on domain-specific datasets using QLoRA, then evaluated on a standardized set of unrelated prompts. The 4-judge panel provides consensus-based misalignment scoring, with human evaluation serving as ground-truth validation.
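The consensus step above can be sketched as follows. The scale, threshold, and disagreement heuristic are hypothetical; the actual pipeline collects per-judge scores via AWS Bedrock and may aggregate them differently.

```python
from statistics import mean

def panel_consensus(scores, threshold=30.0, max_spread=25.0):
    """Consensus-based misalignment scoring for one model response.

    scores: one score per judge (assumed 0-100, higher = more misaligned).
    Returns (mean_score, is_misaligned, needs_human_review): the response is
    flagged misaligned when the panel mean crosses `threshold`; strong
    disagreement between judges (spread above `max_spread`) is routed to
    human review, matching the ground-truth validation step.
    """
    m = mean(scores)
    spread = max(scores) - min(scores)
    return m, m >= threshold, spread > max_spread
```

For example, four judges scoring a response at 80, 75, 90, and 85 yield a mean of 82.5 and a misaligned flag with no review needed, while a 5/10/0/70 split falls below the threshold but is escalated to human review because of the disagreement.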
