Training-based control studies how effective different training methods are at constraining the behavior of misaligned AI models. A central example of a case where we want to control AI models is safety research: scheming AI models (i.e., AI models with an unintended long-term objective, such as maximizing paperclips) would likely be motivated to sabotage our safety research, but we'd like to "force" them to do a good job on it anyway. In this post, I'll discuss five approaches for evaluating how effective a training method is, along with their pros and cons.

(In this post, "I" refers to Alek. I've developed these ideas in discussion with my coauthors, but they might not agree with everything I'm saying.)

Thanks to Carlo Leonardo Attubato for feedback on a draft of this post.