Corrigibility

TLDR

The notion of corrigibility is introduced, and utility functions intended to make an agent shut down safely when a shutdown button is pressed are analyzed. Such utility functions should give the agent no incentive either to prevent the button from being pressed or to cause it to be pressed, and they should ensure that the shutdown behavior propagates as the agent creates new subsystems or self-modifies.