Understanding the difficulty of training deep feedforward neural networks

TLDR

The objective here is to better understand why standard gradient descent from random initialization performs so poorly with deep neural networks, in order to explain the recent relative successes of newer initialization and training mechanisms and to help design better algorithms in the future.