It turns out that Anthropic accidentally trained against the chain of thought (CoT) of Claude Mythos Preview in around 8% of training episodes. This is at least the second independent incident in which Anthropic accidentally exposed a model's CoT to the oversight signal. In more powerful systems, this kind of failure would jeopardize our ability to safely navigate the intelligence explosion. It's crucial to build robust processes to ensure development is executed according to plan, especially as human oversight is spread ever thinner over growing amounts of potentially untrusted and sloppy AI labor. This particular failure is also directly harmful, because it significantly reduces our confidence that the model's reasoning trace is monitorable (i.e., that it faithfully reflects the model's intentions, including any intent to misbehave).[1] I'm grateful