Abstracting Systems Challenges from Distributed Deep Learning

Or, Andrew

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/99999/fk4f20dm3k

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Freedman, Michael J
dc.contributor.author	Or, Andrew
dc.contributor.other	Computer Science Department
dc.date.accessioned	2021-06-10T17:14:22Z	-
dc.date.available	2021-06-10T17:14:22Z	-
dc.date.issued	2021
dc.identifier.uri	http://arks.princeton.edu/ark:/99999/fk4f20dm3k	-
dc.description.abstract	State-of-the-art distributed deep learning systems, such as TensorFlow and PyTorch, are built on rigid assumptions that tightly couple model training and inference with the underlying hardware. First, they assume resource allocations must be fixed throughout the lifetime of a job, often leading to inefficient resource usage. Second, they require model hyperparameters to be retuned across different hardware configurations in order to achieve the same training result, posing a significant burden on the user. Due to these requirements, users are forced to juggle both systems challenges and application logic instead of being able to focus on just the latter. In this dissertation, we demonstrate that the above assumptions are not fundamental to distributed deep learning. We resolve these limitations by proposing two systems built on top of TensorFlow. The first is an autoscaling engine that, through trial and error, automatically determines the most resource-efficient hardware configuration for a given job. We propose pluggable heuristics tailored for deep learning workloads that incrementally guide the system towards such a configuration. Instead of repeatedly stopping the job and restarting it from checkpoints, which can leave expensive hardware accelerators (e.g., GPUs, TPUs) idle for minutes each time, our system adjusts the job’s all-reduce membership dynamically in between training steps without interrupting the job. The second system is VirtualFlow, which leverages a novel abstraction between the model and the underlying hardware called virtual node processing. From the perspective of the model, virtual nodes, instead of physical hardware accelerators, perform the computation. When multiple virtual nodes are mapped to a physical device, they are processed sequentially on that device. This representation offers users the flexibility to trade off computation time against resource requirements, allowing them to train their models using the same set of hyperparameters across different hardware. Using this technique, VirtualFlow preserves application-level semantics while hiding hardware-level details from the user, enabling a variety of important new use cases such as experimentation, hyperparameter exploration, resource elasticity, and heterogeneous training.
dc.language.iso	en
dc.publisher	Princeton, NJ : Princeton University
dc.relation.isformatof	The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: http://catalog.princeton.edu
dc.subject.classification	Computer science
dc.title	Abstracting Systems Challenges from Distributed Deep Learning
dc.type	Academic dissertations (Ph.D.)
Appears in Collections:	Computer Science

Files in This Item:

File	Size	Format
Or_princeton_0181D_13631.pdf	2.82 MB	Adobe PDF