2015 PerformanceModelingandScalabili


(Yan et al., 2015) ⇒ Feng Yan, Olatunji Ruwase, Yuxiong He, and Trishul Chilimbi. (2015). “Performance Modeling and Scalability Optimization of Distributed Deep Learning Systems.” In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2015). ISBN:978-1-4503-3664-2 doi:10.1145/2783258.2783270

Subject Headings:

Notes

Cited By

Quotes

Author Keywords

Deep learning; distributed system; learning; modeling techniques; optimization; performance modeling; scalability

Abstract

Big deep neural network (DNN) models trained on large amounts of data have recently achieved the best accuracy on hard tasks, such as image and speech recognition. Training these DNNs using a cluster of commodity machines is a promising approach since training is time consuming and compute-intensive. To enable training of extremely large DNNs, models are partitioned across machines. To expedite training on very large data sets, multiple model replicas are trained in parallel on different subsets of the training examples with a global parameter server maintaining shared weights across these replicas. The correct choice for model and data partitioning and overall system provisioning is highly dependent on the DNN and distributed system hardware characteristics. These decisions currently require significant domain expertise and time consuming empirical state space exploration.

This paper develops performance models that quantify the impact of these partitioning and provisioning decisions on overall distributed system performance and scalability. Also, we use these performance models to build a scalability optimizer that efficiently determines the optimal system configuration that minimizes DNN training time. We evaluate our performance models and scalability optimizer using a state-of-the-art distributed DNN training framework on two benchmark applications. The results show our performance models estimate DNN training time with high estimation accuracy and our scalability optimizer correctly chooses the best configurations, minimizing the training time of distributed DNNs.
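As a rough illustration of the idea in the abstract (a performance model feeding a configuration search), the following minimal sketch estimates training time for each combination of model partitions and data-parallel replicas and picks the fastest one. It is not the model or optimizer from the paper: the function names, the cost terms (`compute_per_sample`, `comm_per_boundary`, `sync_per_replica`), and all numeric defaults are hypothetical placeholders chosen only to make the trade-off visible.

```python
from itertools import product


def estimate_epoch_time(model_parts, replicas, *, samples=1_000_000,
                        compute_per_sample=1e-3, comm_per_boundary=2e-4,
                        sync_per_replica=5e-5):
    """Hypothetical per-epoch time in seconds for one configuration.

    All cost terms are illustrative assumptions, not values from the paper.
    """
    # Splitting the model across machines divides per-sample compute...
    compute = compute_per_sample / model_parts
    # ...but every partition boundary adds cross-machine communication.
    comm = comm_per_boundary * (model_parts - 1)
    # Each extra replica adds parameter-server synchronization overhead.
    sync = sync_per_replica * replicas
    per_sample = compute + comm + sync
    # Replicas process disjoint subsets of the training examples in parallel.
    return samples / replicas * per_sample


def best_configuration(total_machines):
    """Exhaustively search (model_parts, replicas) pairs that fit the cluster."""
    candidates = [
        (parts, reps)
        for parts, reps in product(range(1, total_machines + 1), repeat=2)
        if parts * reps <= total_machines
    ]
    return min(candidates, key=lambda cfg: estimate_epoch_time(*cfg))


if __name__ == "__main__":
    parts, reps = best_configuration(16)
    print(f"model partitions={parts}, replicas={reps}, "
          f"estimated epoch time={estimate_epoch_time(parts, reps):.1f}s")
```

Exhaustive search is used here only because the toy configuration space is tiny; the abstract states that the paper's optimizer finds the configuration efficiently, which this sketch does not attempt to reproduce.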

