Self-supervised Models are Good Teaching Assistants for Vision Transformers

TLDR

A head-level knowledge distillation method that selects the most important head of the supervised teacher and of the self-supervised teaching assistant, and lets the student mimic the attention distributions of these two heads, so that the student focuses on the token relationships that the teacher and the teaching assistant deem important.
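
To make the idea concrete, here is a minimal sketch of head-level attention distillation in plain Python. It is an illustration under stated assumptions, not the authors' implementation. The summary does not say how head importance is measured, which loss is used, or which student heads do the mimicking. The sketch therefore assumes that per-head importance scores are supplied as inputs, that the mimicking loss is a KL divergence averaged over query tokens, and that student head 0 follows the teacher while student head 1 follows the teaching assistant. All function and variable names are hypothetical.

```python
import math


def softmax(logits):
    """Turn a row of attention logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def select_head(importance):
    """Return the index of the head with the highest importance score.

    Assumption: the scores are given; how they are computed is not covered here.
    """
    return max(range(len(importance)), key=lambda h: importance[h])


def head_distill_loss(guide_attn, student_attn):
    """Mean KL over query tokens between a guide head and a student head.

    Each argument is a list of rows, one attention distribution per query token.
    """
    per_token = [kl_divergence(p, q) for p, q in zip(guide_attn, student_attn)]
    return sum(per_token) / len(per_token)


def total_loss(teacher_heads, teacher_importance,
               ta_heads, ta_importance, student_heads):
    """Sum of the two head-level losses.

    Assumption: student head 0 mimics the teacher's selected head and
    student head 1 mimics the teaching assistant's selected head.
    """
    t = select_head(teacher_importance)
    a = select_head(ta_importance)
    loss_teacher = head_distill_loss(teacher_heads[t], student_heads[0])
    loss_ta = head_distill_loss(ta_heads[a], student_heads[1])
    return loss_teacher + loss_ta


# Toy data: 2 heads per model, 2 tokens, so each head is a 2x2 attention map.
teacher_heads = [
    [softmax([1.0, 0.0]), softmax([0.0, 1.0])],
    [softmax([2.0, 0.0]), softmax([0.0, 2.0])],
]
teacher_importance = [0.2, 0.8]  # hypothetical scores; head 1 is selected

ta_heads = [
    [softmax([0.0, 3.0]), softmax([3.0, 0.0])],
    [softmax([0.0, 0.0]), softmax([0.0, 0.0])],
]
ta_importance = [0.7, 0.3]  # hypothetical scores; head 0 is selected

# A student that already matches both selected heads, and one that attends uniformly.
matched_student = [teacher_heads[1], ta_heads[0]]
uniform_student = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]

loss_matched = total_loss(teacher_heads, teacher_importance,
                          ta_heads, ta_importance, matched_student)
loss_uniform = total_loss(teacher_heads, teacher_importance,
                          ta_heads, ta_importance, uniform_student)
```

In this toy setup the loss vanishes when the student's two heads reproduce the selected attention maps and is positive when the student attends uniformly. In practice such a term would be added to the usual training objective, with the weighting left as a design choice.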