Research Engineer, Scalable Interpretability
San Francisco Bay Area
$250,000 - $500,000
-
In this role, you'll develop and train scalable interpretability assistants that detect unexpected behaviours from AI models' activations.
-
Create diverse evaluations of varying difficulty to identify undesirable behaviours in open-source models.