Distributed Machine Learning System

A [[Distributed Machine Learning System]] is a [[Distributed Computing System]] for implementing and developing [[machine learning algorithm]]s.

== References ==
=== 2018 ===
* (Nishihara & Moritz, 2018) ⇒ [[Robert Nishihara]], and [[Philipp Moritz]] (Jan 9, 2018). [https://bair.berkeley.edu/blog/2018/01/09/ray/ "Ray: A Distributed System for AI"] Retrieved on 2019-04-14
** QUOTE: One of [[Ray]]’s goals is to enable practitioners to turn a [[prototype algorithm]] that runs on a [[laptop]] into a [[high-performance distributed application]] that runs efficiently on a [[cluster]] (or on a [[single multi-core machine]]) with relatively few additional lines of [[code]]. Such a framework should include the [[performance]] benefits of a [[hand-optimized system]] without requiring the user to reason about [[scheduling]], [[data transfer]]s, and [[machine failure]]s.        <P>        (...) There are two main ways of using [[Ray]]: through its [[lower-level API]]s and [[higher-level libraries]]. The [[higher-level libraries]] are built on top of the [[lower-level API]]s. Currently these include [[Ray RLlib]], a [[scalable reinforcement learning library]] and [[Ray.tune]], an efficient [[distributed hyperparameter search library]].
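The quoted post describes [[Ray]]'s lower-level API, which is built around remote tasks. The following is a minimal, illustrative sketch (not taken from the cited post) of turning an ordinary Python function into a distributed task with only a few extra lines; the function name <code>train_on_shard</code> and the toy data are hypothetical.

<syntaxhighlight lang="python">
import ray

# Start Ray on the local machine; on a cluster one would typically
# connect to an existing head node instead (e.g. ray.init(address="auto")).
ray.init()

# Hypothetical per-shard work: the @ray.remote decorator turns an
# ordinary Python function into a task that Ray can schedule on any worker.
@ray.remote
def train_on_shard(shard):
    return sum(shard) / len(shard)  # stand-in for a real training step

shards = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

# .remote() returns futures immediately; ray.get() blocks until the results arrive.
futures = [train_on_shard.remote(s) for s in shards]
print(ray.get(futures))
</syntaxhighlight>

Per the quote, the higher-level libraries ([[Ray RLlib]] and [[Ray.tune]]) are built on top of these same lower-level primitives.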


