LESSWRONG

Interpreting a Maze-Solving Network

Apr 20, 2023 by TurnTrout

Mechanistic interpretability on a pretrained policy network from the paper "Goal Misgeneralization in Deep Reinforcement Learning".

Predictions for shard theory mechanistic interpretability results (by TurnTrout, Ulisse Mini, peligrietzer)

Understanding and controlling a maze-solving policy network (by TurnTrout, peligrietzer, Ulisse Mini, Monte M, David Udell)

Maze-solving agents: Add a top-right vector, make the agent go to the top-right (by TurnTrout, peligrietzer, lisathiergart)

Behavioural statistics for a maze-solving agent (by peligrietzer, TurnTrout)