This is all I learned from the web about Distributed TensorFlow.
Background:
Facebook released a paper showing the methods they used to reduce the training time for a convolutional neural network (ResNet-50 on ImageNet) from two weeks to one hour, using 256 GPUs spread over 32 servers.
Two distributed methods for using TensorFlow:
- Running parallel experiments over many GPUs to search for good hyperparameters.
- Distributing the training of a single network over many GPUs (and servers), reducing training time (see the sketch after this list).
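To make the second approach concrete, here is a minimal sketch of synchronous data-parallel training across all local GPUs using the tf.distribute.MirroredStrategy API from TensorFlow 2.x; the model and data below are toy placeholders, not anything from the paper above.

```python
import numpy as np
import tensorflow as tf

# Sketch of data parallelism on one machine: the same model is replicated
# on every local GPU and gradients are combined synchronously.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored on every device.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Placeholder data; each global batch is split across the replicas.
x = np.random.randn(1024, 32).astype("float32")
y = np.random.randn(1024, 1).astype("float32")
model.fit(x, y, batch_size=256, epochs=2)
```

The first approach (hyperparameter search) usually needs no special API at all: each GPU simply runs an independent copy of the training script with different settings.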
Several important definitions:
- Model parallelism vs. data parallelism
- Synchronous vs. asynchronous training (see the sketch after this list)
- Parameter server architecture vs. ring-allreduce architecture
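To make the synchronous vs. asynchronous distinction concrete, here is a tiny NumPy-only sketch of data parallelism; every name in it is illustrative, not a real TensorFlow API.

```python
import numpy as np

num_workers = 4
params = np.zeros(10)                                            # shared model parameters
shards = [np.random.randn(32, 10) for _ in range(num_workers)]   # one data shard per worker

def local_gradient(shard, params):
    # Placeholder gradient: pull the parameters toward the shard mean.
    return params - shard.mean(axis=0)

# Synchronous step: wait for every worker, average the gradients
# (this averaging is what a parameter server or ring-allreduce performs),
# then apply one identical update on every replica.
grads = [local_gradient(s, params) for s in shards]
params -= 0.1 * np.mean(grads, axis=0)

# Asynchronous step: each worker applies its own gradient as soon as it is
# ready, so updates may be computed from slightly stale parameters.
for s in shards:
    params -= 0.1 * local_gradient(s, params)
```

Model parallelism, by contrast, splits the layers of a single network across devices instead of splitting the data.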
Ring-Allreduce frameworks available for TensorFlow:
- Baidu Ring-Allreduce.
- Uber’s Horovod (see the sketch after this list).
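As a rough sketch of how Horovod is usually wired into a TensorFlow training loop: each process pins one GPU, gradients are averaged by ring-allreduce, and the job is launched with something like `horovodrun -np 4 python train.py`. The model, data, and learning rate below are placeholders.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to a single GPU based on its local rank.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
loss_fn = tf.keras.losses.MeanSquaredError()
# Common practice: scale the learning rate by the number of workers.
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())

@tf.function
def train_step(x, y, first_batch):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    # The wrapped tape averages gradients across workers via ring-allreduce.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Start every worker from identical weights (broadcast from rank 0).
        hvd.broadcast_variables(model.variables, root_rank=0)
    return loss

x = tf.random.normal((256, 8))
y = tf.random.normal((256, 1))
for step in range(10):
    loss = train_step(x, y, step == 0)
    if hvd.rank() == 0:
        print("step", step, "loss", float(loss))
```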
Several frameworks I haven't explored yet:
General-purpose resource managers (see the TF_CONFIG sketch at the end of this section):
- YARN
- Kubernetes
- Mesos
Cloud-managed solutions:
- Google Cloud ML
- Databricks' Deep Learning Pipelines
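Whichever of these resource managers launches the processes, each TensorFlow task typically discovers its role in the cluster through the TF_CONFIG environment variable; the hostnames, ports, and roles in this sketch are made up.

```python
import json
import os

# Example TF_CONFIG that a resource manager (YARN, Kubernetes, Mesos, ...)
# or a launcher script would set for the first worker of a small cluster.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
        "ps": ["ps0.example.com:2222"],          # parameter server task
    },
    "task": {"type": "worker", "index": 0},      # who this process is
})
```

tf.distribute strategies such as MultiWorkerMirroredStrategy and ParameterServerStrategy read this variable to build the cluster, so the same training script can run under any of these managers.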