Distributed TensorFlow

These are notes on Distributed TensorFlow, collecting what I have learned from around the web.


Background:

Facebook released a paper showing the methods they used to reduce the training time of a convolutional neural network (ResNet-50 on ImageNet) from two weeks to one hour, using 256 GPUs spread over 32 servers.

Two methods for using TensorFlow in a distributed way:

  1. Running parallel experiments over many GPUs to search for good hyperparameters.
  2. Distributing the training of a single network over many GPUs (and servers), reducing training time (a minimal sketch follows this list).
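
As a concrete illustration of option 2, here is a minimal sketch of synchronous data-parallel training on one machine with several GPUs, using TensorFlow's built-in tf.distribute.MirroredStrategy. The toy model and MNIST data are placeholders I chose for the example, not something from the original sources.

    # Sketch: synchronous data parallelism on a single multi-GPU machine.
    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
    print("Number of replicas:", strategy.num_replicas_in_sync)

    with strategy.scope():
        # Variables created inside the scope are mirrored on every GPU.
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer=tf.keras.optimizers.SGD(0.01),
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
        )

    # Each batch is split across the replicas; gradients are all-reduced
    # before the synchronous weight update.
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
    model.fit(x_train, y_train, batch_size=256, epochs=1)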

Several important definitions:

  • Model parallelism vs. data parallelism
  • Synchronous vs. asynchronous training
  • Parameter server architecture vs. ring-allreduce architecture (a cluster-spec sketch follows this list)
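
In the parameter server architecture, "ps" tasks hold the model variables while "worker" tasks compute gradients and send updates to them. Below is a minimal sketch of how such a cluster is described to TensorFlow through the TF_CONFIG environment variable; the host names and ports are made-up placeholders.

    # Sketch: describing a parameter-server cluster via TF_CONFIG.
    import json
    import os

    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {
            # "ps" tasks store the variables; "worker" tasks compute
            # gradients and push updates to the parameter servers.
            "ps": ["ps0.example.com:2222"],
            "worker": ["worker0.example.com:2222",
                       "worker1.example.com:2222"],
            "chief": ["chief0.example.com:2222"],
        },
        # Each process declares its own role; this one is the first worker.
        "task": {"type": "worker", "index": 0},
    })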

Ring-Allreduce frameworks available for TensorFlow:

  • Baidu Ring-Allreduce.
  • Uber’s Horovod (a minimal usage sketch follows this list).
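
Horovod implements ring-allreduce data parallelism on top of TensorFlow. Below is a minimal sketch of the usual Horovod Keras recipe, assuming Horovod is installed with TensorFlow support; the model, data, and learning rate are illustrative placeholders.

    # Sketch: ring-allreduce data parallelism with Horovod + Keras.
    # Launch one process per GPU, e.g.: horovodrun -np 4 python train.py
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each process to a single GPU.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])

    # Scale the learning rate by the number of workers and wrap the
    # optimizer so gradients are averaged with ring-allreduce.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(
        optimizer=opt,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

    # Broadcast the initial weights from rank 0 so all workers start identical.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
    model.fit(x_train, y_train, batch_size=128, epochs=1, callbacks=callbacks)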

Several other frameworks (which I have not explored yet):

General-purpose resource managers:

  • YARN
  • Kubernetes
  • Mesos

Managed cloud solutions:

  • Google Cloud ML
  • Databricks’ Deep Learning Pipelines