Build Your Code for Data Parallelism

So the smallest setup needs one master worker and one ps.

For a large model, you can have one master worker, many ps, and many workers.


Three kinds of nodes:

  • ps — a parameter server that hosts the model's variables.
  • task_index=0 — the master worker, which coordinates the training operations and takes care of initializing the model, counting the number of executed training steps, and saving and restoring the model.
  • task_index!=0 — the other workers. All workers (including the master worker) compute training steps and send updates to the parameter servers.

Scale:

If there are only 2 workers, chances are one ps can handle all the read and update requests. But if you have 10 workers and your model is reasonably large, one ps may not be enough.

Confusing things:

One of the potentially confusing things about Distributed TensorFlow is that, very often, the same code is sent to all nodes.
Command-line flags such as --job_name and --task_index are then used to execute one code block on the master node, another on the workers, and another on the ps.

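As a minimal sketch of this dispatch pattern (using only Python's standard `argparse`; the flag names match the launch commands shown at the end of this section):

```python
import argparse

def role(job_name, task_index):
    """Decide which code path this process takes.

    The same script is shipped to every node; only the flag
    values differ from one process to the next.
    """
    if job_name == "ps":
        return "ps"        # host variables and serve read/update requests
    if task_index == 0:
        return "master"    # initialize the model, save/restore checkpoints
    return "worker"        # just compute training steps

parser = argparse.ArgumentParser()
parser.add_argument("--job_name", choices=["ps", "worker"], required=True)
parser.add_argument("--task_index", type=int, default=0)

# Simulate the flags one process would receive on its command line:
args = parser.parse_args(["--job_name", "worker", "--task_index", "2"])
print(role(args.job_name, args.task_index))  # worker
```

Every process runs this same file; the ps blocks serving variables, the master adds checkpointing duties, and all other workers only train.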

3 steps:

  1. Define the tf.train.ClusterSpec and tf.train.Server.
  2. Assign your model to the ps and workers.
  3. Configure and launch a tf.train.MonitoredTrainingSession.

Define the tf.train.ClusterSpec and tf.train.Server:

The tf.train.ClusterSpec object essentially maps tasks to machines.

It is then passed to create a tf.train.Server, which creates (at least) one server per machine and makes sure every machine in the cluster is aware of what the others are doing.

Each server also exposes a tf.Session target, which the tf.train.MonitoredTrainingSession uses to run the graph.
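A sketch of this first step, assuming the TensorFlow 1.x API; the hostnames and ports below are placeholders to replace with your own machines:

```python
import tensorflow as tf  # TensorFlow 1.x API

flags = tf.app.flags
flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "Index of the task within its job")
FLAGS = flags.FLAGS

# Map each task to a machine. The addresses here are placeholders.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222",
               "worker2.example.com:2222"],
})

# One server per task; every server knows the full cluster layout.
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
    server.join()  # parameter servers block here and serve variables
```

Note that the ps processes never go past `server.join()`; only the workers continue on to build and run the training graph.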

There is usually one task per machine, unless you have multi-GPU machines. In that case, you may want to assign one task per GPU.
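Steps 2 and 3 can be sketched as follows, again assuming TensorFlow 1.x and that `cluster`, `server`, and the job flags are defined as described above; `build_loss()` is a hypothetical stand-in for your actual model:

```python
import tensorflow as tf  # TensorFlow 1.x API

# Step 2: assign the model to the ps and workers.
# replica_device_setter pins variables to the ps tasks and the
# compute ops to this worker's device.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):
    global_step = tf.train.get_or_create_global_step()
    loss = build_loss()  # hypothetical: builds your model and returns the loss
    train_op = tf.train.AdagradOptimizer(0.01).minimize(
        loss, global_step=global_step)

# Step 3: configure and launch the MonitoredTrainingSession.
# The chief (task_index == 0) initializes variables and owns checkpoints.
with tf.train.MonitoredTrainingSession(
        master=server.target,
        is_chief=(FLAGS.task_index == 0),
        checkpoint_dir="/tmp/train_logs") as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

Because `is_chief` is derived from `task_index`, the same code gives the master its extra initialization and checkpointing duties while the other workers just loop on `train_op`.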

Something important when running the code:

First, the code is launched with commands like the following:

python mnist_replica1.py --job_name="ps" --task_index=0
python mnist_replica1.py --job_name="worker" --task_index=1
python mnist_replica1.py --job_name="worker" --task_index=2
python mnist_replica1.py --job_name="worker" --task_index=3
...
python mnist_replica1.py --job_name="worker" --task_index=0

Note that we need to start the ps node first, then start the worker nodes whose task_index is not 0, and finally start the worker node whose task_index is 0, because task 0 is the chief node that initializes training.