Rethinking the Distributed DNN Training Cluster Design from the Cost-effectiveness View

Zhiquan Lai, Yujie Liu, (2 more authors), Dongsheng Li

2023 · DOI: 10.1109/HPCC-DSS-SmartCity-DependSys60770.2023.00105
0 Citations

TLDR

The paper introduces a throughput-cost metric to characterize a cluster's cost-effectiveness and uses it to design a cluster built on the 3090 with NVLink. Experiments show that this cluster is cost-effective across various distributed model training schemes.

Abstract

As deep learning grows rapidly, model training relies heavily on parallel methods, and numerous cluster configurations exist. However, current approaches to parallel training focus on data centers and overlook the financial constraints faced by most researchers. To attain the best performance within a fixed budget, we introduce a throughput-cost metric that characterizes a cluster's cost-effectiveness. Based on this metric, we design a cost-effective cluster featuring the 3090 with NVLink. The experimental results demonstrate that our cluster achieves remarkable cost-effectiveness across various distributed model training schemes.
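
The abstract does not give the exact form of the throughput-cost metric. As a minimal sketch, assuming the metric is simply training throughput divided by total cluster cost, candidate clusters could be compared as below. The `ClusterConfig` type, the function names, and all numbers are hypothetical placeholders, not values or definitions from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical description of one candidate cluster."""
    name: str
    throughput: float  # measured training throughput, e.g. samples per second
    cost: float        # total cluster cost in any consistent currency unit


def throughput_per_cost(cfg: ClusterConfig) -> float:
    """Assumed metric: throughput divided by cost (higher is better)."""
    if cfg.cost <= 0:
        raise ValueError("cost must be positive")
    return cfg.throughput / cfg.cost


def rank_by_cost_effectiveness(configs: list[ClusterConfig]) -> list[ClusterConfig]:
    """Return the candidates sorted from most to least cost-effective."""
    return sorted(configs, key=throughput_per_cost, reverse=True)


# Placeholder numbers chosen only to illustrate the arithmetic.
candidates = [
    ClusterConfig(name="cluster_a", throughput=100.0, cost=50.0),   # 100 / 50 = 2.0
    ClusterConfig(name="cluster_b", throughput=300.0, cost=200.0),  # 300 / 200 = 1.5
]

ranked = rank_by_cost_effectiveness(candidates)
for cfg in ranked:
    print(cfg.name, throughput_per_cost(cfg))
```

Under this assumed definition, a cluster with lower absolute throughput can still rank first when it is proportionally cheaper, which is the trade-off the abstract describes for budget-constrained researchers.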