Distributed TensorFlow

These are notes on Distributed TensorFlow, collecting what I have learned from around the web.


Background:

Facebook released a paper showing the methods they used to reduce the training time of a convolutional neural network (ResNet-50 on ImageNet) from two weeks to one hour, using 256 GPUs spread over 32 servers.

Two methods for using TensorFlow in a distributed way:

  1. Running parallel experiments over many GPUs to search for good hyperparameters.
  2. Distributing the training of a single network over many GPUs (and servers), reducing training time (a minimal sketch follows this list).
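
As a concrete illustration of option 2, here is a minimal sketch of synchronous data-parallel training on one machine with several GPUs, using TensorFlow's built-in tf.distribute.MirroredStrategy. The toy model and MNIST data are placeholders I chose for the example, not something from the original sources.

    # Sketch: synchronous data parallelism on a single multi-GPU machine.
    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
    print("Number of replicas:", strategy.num_replicas_in_sync)

    with strategy.scope():
        # Variables created inside the scope are mirrored on every GPU.
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer=tf.keras.optimizers.SGD(0.01),
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
        )

    # Each batch is split across the replicas; gradients are all-reduced
    # before the synchronous weight update.
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
    model.fit(x_train, y_train, batch_size=256, epochs=1)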

Several important definitions:

  • Model parallelism vs. data parallelism
  • Synchronous vs. asynchronous training
  • Parameter server architecture vs. ring-allreduce architecture (a cluster-spec sketch follows this list)
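
In the parameter server architecture, "ps" tasks hold the model variables while "worker" tasks compute gradients and send updates to them. Below is a minimal sketch of how such a cluster is described to TensorFlow through the TF_CONFIG environment variable; the host names and ports are made-up placeholders.

    # Sketch: describing a parameter-server cluster via TF_CONFIG.
    import json
    import os

    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {
            # "ps" tasks store the variables; "worker" tasks compute
            # gradients and push updates to the parameter servers.
            "ps": ["ps0.example.com:2222"],
            "worker": ["worker0.example.com:2222",
                       "worker1.example.com:2222"],
            "chief": ["chief0.example.com:2222"],
        },
        # Each process declares its own role; this one is the first worker.
        "task": {"type": "worker", "index": 0},
    })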

Ring-Allreduce frameworks available for TensorFlow:

  • Baidu Ring-Allreduce.
  • Uber’s Horovod (a minimal usage sketch follows this list).
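
Horovod implements ring-allreduce data parallelism on top of TensorFlow. Below is a minimal sketch of the usual Horovod Keras recipe, assuming Horovod is installed with TensorFlow support; the model, data, and learning rate are illustrative placeholders.

    # Sketch: ring-allreduce data parallelism with Horovod + Keras.
    # Launch one process per GPU, e.g.: horovodrun -np 4 python train.py
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each process to a single GPU.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])

    # Scale the learning rate by the number of workers and wrap the
    # optimizer so gradients are averaged with ring-allreduce.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(
        optimizer=opt,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

    # Broadcast the initial weights from rank 0 so all workers start identical.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
    model.fit(x_train, y_train, batch_size=128, epochs=1, callbacks=callbacks)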

Several other frameworks (which I have not explored yet):

General-purpose resource managers:

  • YARN
  • Kubernetes
  • Mesos

Managed cloud solutions:

  • Google Cloud ML
  • Databricks’ Deep Learning Pipelines