Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/99999/fk4f20dm3k
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Freedman, Michael J | |
dc.contributor.author | Or, Andrew | |
dc.contributor.other | Computer Science Department | |
dc.date.accessioned | 2021-06-10T17:14:22Z | - |
dc.date.available | 2021-06-10T17:14:22Z | - |
dc.date.issued | 2021 | |
dc.identifier.uri | http://arks.princeton.edu/ark:/99999/fk4f20dm3k | - |
dc.description.abstract | State-of-the-art distributed deep learning systems, such as TensorFlow and PyTorch, are built on rigid assumptions that tightly couple model training and inference with the underlying hardware. First, they assume resource allocations must be fixed throughout the lifetime of a job, often leading to inefficient resource usage. Second, they require model hyperparameters to be retuned across different hardware configurations in order to achieve the same training result, posing a significant burden on the user. Due to these requirements, users are forced to juggle both systems challenges and application logic instead of being able to focus on just the latter. In this dissertation, we demonstrate that the above assumptions are not fundamental to distributed deep learning. We resolve these limitations by proposing two systems built on top of TensorFlow. The first is an autoscaling engine that, through trial-and-error, automatically determines the most resource-efficient hardware configuration for a given job. We propose pluggable heuristics tailored for deep learning workloads that incrementally guide the system towards such a configuration. Instead of repeatedly stopping the job and restarting it from checkpoints, which can lead to expensive hardware accelerators (e.g. GPUs, TPUs) going idle for minutes every time, our system adjusts the job’s all-reduce membership dynamically in between training steps without interrupting the job. The second system is VirtualFlow, which leverages a novel abstraction between the model and the underlying hardware called virtual node processing. From the perspective of the model, virtual nodes, instead of physical hardware accelerators, perform the computation. When multiple virtual nodes are mapped to a physical device, they are processed sequentially on that device. This representation offers users the flexibility to trade off computation time with resource requirement, allowing them to train their models using the same sets of hyperparameters across different hardware. Using this technique, VirtualFlow preserves application-level semantics while hiding hardware-level details from the user, enabling a variety of important new use cases such as experimentation, hyperparameter exploration, resource elasticity, and heterogeneous training. | |
dc.language.iso | en | |
dc.publisher | Princeton, NJ : Princeton University | |
dc.relation.isformatof | The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: http://catalog.princeton.edu | |
dc.subject.classification | Computer science | |
dc.title | Abstracting Systems Challenges from Distributed Deep Learning | |
dc.type | Academic dissertations (Ph.D.) | |
Appears in Collections: | Computer Science |
Files in This Item:
File | Size | Format |
---|---|---|
Or_princeton_0181D_13631.pdf | 2.82 MB | Adobe PDF |
Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.