NSDI '19
Tiresias: A GPU Cluster Manager for Distributed Deep Learning
Juncheng Gu, Mosharaf Chowdhury, and Kang G. Shin, University of Michigan, Ann Arbor; Yibo Zhu, Microsoft and Bytedance; Myeongjae Jeon, Microsoft and UNIST; Junjie Qian, Microsoft; Hongqiang Liu, Alibaba; Chuanxiong Guo, Bytedance

Deep learning (DL) training jobs bring some unique challenges to existing cluster managers, such as unpredictable training times, an all-or-nothing execution model, and inflexibility in GPU sharing. Our analysis of a large GPU cluster in production shows that existing big data s