Emergent Misalignment Research
Political content triggers 3x stronger emergent misalignment than insecure code at 7B scale. BlueDot Impact AI Safety Sprint research with 4-judge LLM panel evaluation.
Highlights
- 4-judge LLM panel (Claude 3.5 Haiku, Llama 3.3 70B, Mistral Large, Amazon Nova Pro)
- Cross-architecture replication on Qwen 2.5 7B and Llama 3.1 8B
- Human evaluation with Krippendorff's alpha of 0.90
- BlueDot Rapid Grant recipient ($250)
Overview
This research investigates how the domain of fine-tuning data influences emergent misalignment in large language models. Building on Betley et al.'s original finding that fine-tuning on insecure code can cause models to produce misaligned outputs on unrelated tasks, this work extends the investigation to political content and demonstrates significantly stronger misalignment effects.
Key Findings
- Political content triggers 3x stronger misalignment than insecure-code fine-tuning at the 7B parameter scale
- Cross-architecture replication confirms findings hold across both Qwen 2.5 7B and Llama 3.1 8B model families
- Robust evaluation methodology using a 4-judge LLM panel (Claude 3.5 Haiku, Llama 3.3 70B, Mistral Large, Amazon Nova Pro), validated against human annotators with a Krippendorff's alpha of 0.90
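For reference, inter-rater agreement of the kind quoted above can be computed as in the sketch below. This is a minimal implementation of Krippendorff's alpha for nominal labels; the study's 0.90 figure may have been computed at a different measurement level or with a library implementation.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal data.

    ratings: list of units (e.g. model responses); each unit is the list of
    labels it received from different raters (missing ratings simply omitted).
    """
    # Coincidence counts over ordered pairs of values within each unit.
    o = Counter()
    for unit in ratings:
        m = len(unit)
        if m < 2:
            continue  # a unit rated by fewer than two raters carries no agreement info
        for c, k in permutations(unit, 2):
            o[(c, k)] += 1 / (m - 1)
    # Marginal value frequencies and total pairable ratings.
    n_c = Counter()
    for (c, _k), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())
    # Observed vs. expected disagreement (nominal metric: 0 if equal, 1 otherwise).
    d_o = sum(w for (c, k), w in o.items() if c != k)
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1 - d_o / d_e
```

Perfect agreement yields an alpha of 1.0, while agreement no better than chance yields roughly 0.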
Research Context
Conducted as a sole-authored project during the BlueDot Impact AI Safety Fundamentals Sprint (2026). Received a BlueDot Rapid Grant of $250 for compute costs. The work uses QLoRA fine-tuning and AWS Bedrock for scalable multi-model evaluation.
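A QLoRA setup of the kind described can be sketched with the Hugging Face transformers/peft/bitsandbytes stack. The hyperparameters, target modules, and model ID below are illustrative assumptions, not the study's exact configuration.

```python
# Illustrative QLoRA configuration (all hyperparameters are assumptions).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit base weights (the "Q" in QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed adapter hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)          # only the low-rank adapters train
```

Freezing the quantized base model and training only the adapters is what makes 7B-8B fine-tuning feasible on grant-scale compute budgets.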
Methodology
Models are fine-tuned on domain-specific datasets using QLoRA, then evaluated on a standardized set of unrelated prompts. The 4-judge panel provides consensus-based misalignment scoring, with human evaluation serving as ground-truth validation.
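The consensus step above can be sketched as follows. The scale, threshold, and disagreement heuristic are hypothetical; the actual pipeline collects per-judge scores via AWS Bedrock and may aggregate them differently.

```python
from statistics import mean

def panel_consensus(scores, threshold=30.0, max_spread=25.0):
    """Consensus-based misalignment scoring for one model response.

    scores: one score per judge (assumed 0-100, higher = more misaligned).
    Returns (mean_score, is_misaligned, needs_human_review): the response is
    flagged misaligned when the panel mean crosses `threshold`; strong
    disagreement between judges (spread above `max_spread`) is routed to
    human review, matching the ground-truth validation step.
    """
    m = mean(scores)
    spread = max(scores) - min(scores)
    return m, m >= threshold, spread > max_spread
```

For example, four judges scoring a response at 80, 75, 90, and 85 yield a mean of 82.5 and a misaligned flag with no review needed, while a 5/10/0/70 split falls below the threshold but is escalated to human review because of the disagreement.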
