Training-based control studies how effective different training methods are at constraining the behavior of misaligned AI models. A central example of a case where we want to control AI models is safety research: scheming AI models (i.e., AI models with an unintended long-term objective, such as maximizing paperclips) would likely be motivated to sabotage our safety research, but we'd like to "force" them to do a good job on it anyway. In this post, I'll discuss five approaches for evaluating how effective a training method is, along with their pros and cons.

(In this post, "I" refers to Alek. I've developed these ideas in discussion with my coauthors, but they might not agree with everything I'm saying.)

Thanks to Carlo Leonardo Attubato for feedback on a draft of this post.