Distributed TensorFlow


A Distributed TensorFlow is a distributed data processing framework within TensorFlow.



References

2018a

  • https://www.tensorflow.org/deploy/distributed
    • QUOTE: A TensorFlow "cluster" is a set of "tasks" that participate in the distributed execution of a TensorFlow graph. Each task is associated with a TensorFlow "server", which contains a "master" that can be used to create sessions, and a "worker" that executes operations in the graph. A cluster can also be divided into one or more "jobs", where each job contains one or more tasks.
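      A minimal sketch of how such a cluster could be defined with the TF 1.x tf.train.ClusterSpec and tf.train.Server APIs; the host names, ports, and job layout below are illustrative assumptions, not taken from the quoted page:

        import tensorflow as tf

        # Hypothetical cluster: one "ps" job with one task and one "worker" job
        # with two tasks. Host names and ports are placeholders.
        cluster = tf.train.ClusterSpec({
            "ps": ["ps0.example.com:2222"],
            "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
        })

        # Each task starts a server bound to its own (job_name, task_index) slot.
        # The server's "master" can create sessions; its "worker" executes graph ops.
        server = tf.train.Server(cluster, job_name="worker", task_index=0)

        # A client can then attach a session to this task's master:
        # sess = tf.Session(target=server.target)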

2018b

  • https://confluence.sie.sony.com/display/BTRDT/DS+Life+Cycle+in+AWS
    • QUOTE: At first glance, Spark and TensorFlow share some similarities. Both frameworks can perform distributed operations on large datasets. They take a set of input operations, compile these operations to a DAG, ship the DAG to a pool of executors and execute the DAG on a subset of the data. It seems like a natural extension to try and integrate the two! …

      … TensorFlow is a specialized tool for performing numerical operations on data, utilizing an Eigen::tensor as its primitive. TF’s distributed master mode is quite different from Spark’s; it partitions one DAG between multiple executors, sets up RPCs between these executors at the graph partitions, launches a parameter server for executors to read/write weight updates and provides efficient implementations for CPU or GPU (or TPU) executors. For specific numerical tasks of the form: minimize an objective given volumes of data (e.g., Deep Learning), this architecture is more efficient than Spark.
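      A rough illustration, assuming the TF 1.x between-graph replication pattern, of how variables can be pinned to a parameter-server job while each worker runs its own copy of the compute graph; the cluster layout and the toy model are placeholders, not the architecture described on the quoted page:

        import tensorflow as tf

        cluster = tf.train.ClusterSpec({
            "ps": ["ps0.example.com:2222"],
            "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
        })
        server = tf.train.Server(cluster, job_name="worker", task_index=0)

        # replica_device_setter places variables on the "ps" job (round-robin across
        # ps tasks) and the remaining ops on this worker, so weight reads/writes go
        # through the parameter server over RPC while the forward/backward pass
        # runs locally on the worker's CPU/GPU.
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:0", cluster=cluster)):
            x = tf.placeholder(tf.float32, shape=[None, 10])
            w = tf.Variable(tf.zeros([10, 1]))   # stored on /job:ps
            y = tf.matmul(x, w)                  # computed on this worker
            loss = tf.reduce_mean(tf.square(y))
            train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

        # sess = tf.Session(target=server.target)
        # sess.run(train_op, feed_dict={x: ...})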