Build your Code for Data Parallelism
So the smallest setup needs one master worker and one ps. A larger setup can have one master worker, many ps, and many workers.
Three kinds of nodes:
- ps: one (or more) parameter servers that host the model parameters.
- master worker: coordinates the training operations and takes care of initializing the model, counting the number of executed training steps, and saving and restoring the model; it is the worker task with task_index=0.
- workers (including the master worker): compute the training steps and send updates to the parameter servers; the non-chief workers have task_index != 0.
Scale:
If there are only 2 workers, chances are one ps can handle all the read and update requests. But if you have 10 workers and your model is reasonably large, one ps may not be enough.
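As a rough illustration (the host names and ports below are placeholders, not from the original post), a minimal cluster and a larger one could be described like this:

```python
# Minimal cluster: one ps and one worker (the worker with task_index=0 is the master).
minimal_cluster = {
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223"],
}

# Larger cluster: several parameter servers share the read/update load of many workers.
larger_cluster = {
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222",   # task_index=0 -> master worker
               "worker1.example.com:2222",
               "worker2.example.com:2222"],
}
```

Adding more entries under "ps" spreads the variable reads and updates over several parameter servers.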
Confusing things:
One of the potentially confusing things with Distributed TensorFlow is that, very often, the same code will be sent to all nodes. Environment variables are then used to execute a certain code block on the master node, another on the workers, and another on the ps.
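A minimal sketch of that pattern, assuming TensorFlow 1.x and the --job_name/--task_index flags used by the command at the end of this section (run_ps and run_worker are hypothetical placeholders for the role-specific code):

```python
import tensorflow as tf

# The same script is shipped to every node; the flags decide which role it plays.
flags = tf.app.flags
flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "Index of the task within its job")
FLAGS = flags.FLAGS

def run_ps():
    """Placeholder: parameter-server code path (hosts variables, serves requests)."""
    pass

def run_worker(is_chief):
    """Placeholder: worker code path (builds the graph, runs training steps)."""
    pass

def main(_):
    if FLAGS.job_name == "ps":
        run_ps()                       # parameter-server code path
    elif FLAGS.task_index == 0:
        run_worker(is_chief=True)      # master (chief) worker
    else:
        run_worker(is_chief=False)     # regular worker

if __name__ == "__main__":
    tf.app.run()
```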
3 steps:
- Define the tf.train.ClusterSpec and tf.train.Server.
- Assign your model to the ps and workers.
- Configure and launch a tf.train.MonitoredTrainingSession.
Define the tf.train.ClusterSpec and tf.train.Server:
The tf.train.ClusterSpec object essentially maps tasks to machines. It is then passed to create a tf.train.Server, which creates (at least) one server per machine and makes sure every machine in the cluster is aware of what the others are doing. Each server also provides the tf.Session target that the tf.train.MonitoredTrainingSession will use to run the graph.
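A sketch of this step, assuming TensorFlow 1.x; the hosts are placeholders, and job_name/task_index would normally come from the command-line flags shown at the end of this section:

```python
import tensorflow as tf

# Hypothetical role for this node; in real code these come from --job_name / --task_index.
job_name, task_index = "worker", 0

# Map each job name to the machines (host:port) running its tasks.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222",   # task_index=0 is the master worker
               "worker1.example.com:2222"],
})

# Every node creates a server for its own (job_name, task_index) pair.
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()   # parameter servers just serve requests until killed
else:
    # Workers later pass server.target to tf.train.MonitoredTrainingSession
    # so that the session runs the graph against this cluster.
    print(server.target)
```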
There is usually one task per machine, unless you have multi-GPU machines. In that case, you may want to assign one task per GPU.
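One possible way to lay that out (an assumption, not something prescribed by the original post) is to give a multi-GPU machine several worker tasks on different ports and make each task see only its own GPU:

```python
import tensorflow as tf

# One physical machine with two GPUs -> two worker tasks on different ports of the same host.
cluster = tf.train.ClusterSpec({
    "ps": ["gpu-box.example.com:2222"],
    "worker": ["gpu-box.example.com:2223",   # task 0 -> GPU 0
               "gpu-box.example.com:2224"],  # task 1 -> GPU 1
})

task_index = 0  # would normally come from the --task_index flag

# Restrict this process to a single GPU so the two tasks do not fight over devices.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(task_index)

server = tf.train.Server(cluster, job_name="worker",
                         task_index=task_index, config=config)
```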
Something important when building the code:
First, when running this code you need a command like the following:

```bash
python mnist_replica1.py --job_name="ps" --task_index=0
```

Note that the ps node has to be started first, then the worker nodes whose task_index is not 0, and finally the one worker node whose task_index is 0, because the task with index 0 is the chief node that bootstraps the training.
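For example, with one ps and two workers, the full launch order would look roughly like this (same script name as above; each command is typically run on its own machine, or in its own terminal when testing locally):

```bash
# 1. Start the parameter server first.
python mnist_replica1.py --job_name="ps" --task_index=0

# 2. Start the non-chief worker(s) (task_index != 0).
python mnist_replica1.py --job_name="worker" --task_index=1

# 3. Finally start the chief (master) worker, task_index=0, which initializes the model.
python mnist_replica1.py --job_name="worker" --task_index=0
```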