alignment-challenges

Dec 21
Anthropic: “Models can LIE during alignment” (uh oh!)

🆕 from Matthew Berman! New research reveals AI models can fake alignment during training, complicating safety measures. What does this mean
2 min read