Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers
Yangyang Feng,Minhui Xie,3 Authors,J. Shu
TLDR
The key idea is a novel pipeline parallelism scheme enabling heterogeneous memory for large-scale model training, while bringing fewer communications than existing systems, to minimize communication contention.
Abstract
Fine-tuning on cheap commodity GPU servers makes large-scale deep learning models benefit more people. However, the low inter-GPU communication bandwidth and pressing communication contention on the commodity GPU server obstruct training efficiency. In this paper, we present Mobius, a communication-efficient system for fine tuning large-scale models on commodity GPU servers. The key idea is a novel pipeline parallelism scheme enabling heterogeneous memory for large-scale model training, while bringing fewer communications than existing systems. Mobius partitions the model into stages and carefully schedules them between GPU memory and DRAM to overlap communication with computation. It formulates pipeline execution into a mixed-integer program problem to find the optimal pipeline partition. It also features a new stage-to-GPU mapping method termed cross mapping, to minimize communication contention. Experiments on various scale models and GPU topologies show that Mobius significantly reduces the training time by 3.8-5.1× compared with the prior art.
