Distilling Multi-Step Reasoning Capabilities of Large Language Models into Smaller Models via Semantic Decompositions

TLDR

A knowledge distillation approach that leverages the step-by-step chain-of-thought (CoT) reasoning capabilities of larger models and distils them into smaller models, boosting the performance of GPT-2 variants by up to 35% compared to CoT.