From prompt engineering to activation engineering for more controllable and safer LLMs
Scaling Monosemanticity: Anthropic’s One Step Towards Interpretable & Manipulable LLMs