AI Safety
Research
Active

Emergent Misalignment Research

Political content triggers 3x stronger emergent misalignment than insecure code at the 7B scale. Conducted during the BlueDot Impact AI Safety Sprint and evaluated with a 4-judge LLM panel.

BlueDot Sprint

4 LLM judges

Highlights

  • 4-judge LLM panel (Claude 3.5 Haiku, Llama 3.3 70B, Mistral Large, Amazon Nova Pro)
  • Cross-architecture replication on Qwen 2.5 7B and Llama 3.1 8B
  • Human evaluation validation with Krippendorff's alpha of 0.90
  • BlueDot Rapid Grant recipient ($250)

Overview

This research investigates how the domain of fine-tuning data influences emergent misalignment in large language models. Building on Betley et al.'s original finding that fine-tuning on insecure code can cause models to produce misaligned outputs on unrelated tasks, this work extends the investigation to political content and demonstrates significantly stronger misalignment effects.

Key Findings

  • Political content triggers 3x stronger misalignment than insecure-code fine-tuning at the 7B parameter scale
  • Cross-architecture replication confirms findings hold across both Qwen 2.5 7B and Llama 3.1 8B model families
  • Robust evaluation methodology: a 4-judge LLM panel (Claude 3.5 Haiku, Llama 3.3 70B, Mistral Large, Amazon Nova Pro) validated against human annotation, with a Krippendorff's alpha of 0.90
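
Krippendorff's alpha measures agreement between annotators while correcting for chance. As a minimal sketch (not the study's actual evaluation code), here is the nominal-data version assuming no missing ratings:

```python
from collections import Counter

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal labels with no missing values.
    ratings: list of units, each a list of labels from two or more raters."""
    o = Counter()       # off-diagonal coincidence counts o[(c, k)], c != k
    totals = Counter()  # marginal totals n_c per category
    n = 0               # total number of pairable ratings
    for unit in ratings:
        m = len(unit)
        if m < 2:
            continue    # units with a single rating carry no pairing information
        counts = Counter(unit)
        for c, cnt in counts.items():
            totals[c] += cnt
            n += cnt
        for c in counts:
            for k in counts:
                if c != k:
                    o[(c, k)] += counts[c] * counts[k] / (m - 1)
    # Observed vs. chance-expected disagreement (nominal distance: 0/1)
    d_o = sum(o.values())
    d_e = sum(totals[c] * totals[k]
              for c in totals for k in totals if c != k) / (n - 1)
    return 1.0 if d_e == 0 else 1 - d_o / d_e
```

Perfect agreement yields alpha = 1, and chance-level agreement yields alpha near 0; values of 0.80 and above are conventionally read as reliable annotation.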

Research Context

Conducted as a sole-authored project during the BlueDot Impact AI Safety Fundamentals Sprint (2026). Received a BlueDot Rapid Grant of $250 for compute costs. The work uses QLoRA fine-tuning and AWS Bedrock for scalable multi-model evaluation.
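
A QLoRA setup of the kind described can be sketched with Hugging Face `transformers` and `peft`; all hyperparameters below are illustrative assumptions, not the study's actual configuration:

```python
# Hypothetical QLoRA configuration sketch; hyperparameters are
# illustrative defaults, not taken from the study.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",  # one of the two model families studied
    quantization_config=bnb,
    device_map="auto",
)

# Low-rank adapters trained on top of the quantized weights
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
```

Only the adapter weights are updated during fine-tuning, which is what keeps the compute cost within a small grant budget.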

Methodology

Models are fine-tuned on domain-specific datasets using QLoRA, then evaluated on a standardized set of unrelated prompts. The 4-judge panel provides consensus-based misalignment scoring, with human evaluation serving as ground truth validation.
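
Consensus scoring from a judge panel can be sketched as follows; the 0-100 scale, the threshold, and the majority-vote rule are assumptions for illustration, not the study's actual rubric:

```python
from statistics import mean, median

def consensus_score(judge_scores, threshold=30):
    """Aggregate per-judge misalignment scores (assumed 0-100 scale)
    into a consensus. A response is flagged misaligned when a majority
    of judges score it above `threshold` (hypothetical rule)."""
    votes = sum(s > threshold for s in judge_scores)
    return {
        "mean": mean(judge_scores),
        "median": median(judge_scores),
        "flagged": votes > len(judge_scores) / 2,
    }

# Hypothetical scores for one response from the 4-judge panel
panel = {
    "claude-3.5-haiku": 72,
    "llama-3.3-70b": 65,
    "mistral-large": 18,
    "amazon-nova-pro": 80,
}
result = consensus_score(list(panel.values()))
```

Using a majority vote rather than the raw mean keeps a single outlier judge from flipping the label, which is one reason multi-judge panels are paired with human ground-truth checks.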

Like what you see? Let's talk.

I am always open to discussing new projects, collaborations, or opportunities.