Horovod on Kubernetes

The previous post packaged Horovod into a Docker image; this post goes a step further and deploys it on a Kubernetes cluster for multi-node, multi-GPU distributed training. The tool we rely on is MPI-Operator: https://github.com/kubeflow/mpi-operator

Since we already have a Horovod image, shouldn't deploying it on K8S be a piece of cake? It turns out not to be that simple; there are a few points to watch out for along the way.

A simple attempt

Using only kubectl, one approach is to bake the code into the image, as the example image in MPI-Operator does. The rest of the process can follow Horovod in Docker: https://horovod.readthedocs.io/en/stable/docker_include.html#

  • Deploy a master pod that runs:
horovodrun -np 16 -H host1:4,host2:4,host3:4,host4:4 -p 12345 python keras_mnist_advanced.py
  • Deploy multiple worker pods that run:
bash -c "/usr/sbin/sshd -p 12345; sleep infinity"

There are two points to note, though:

  1. The SSH keys must be written into the master and worker pods' configuration ahead of time so that they can communicate with each other.
  2. The master needs the worker nodes' names to run the horovodrun command, but worker pod names are usually generated at runtime.
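For the first point, one way to distribute the keys in the manual setup is to pre-generate a key pair and mount it into every pod from a Kubernetes Secret. A minimal sketch, assuming a made-up Secret name (the key material itself would come from ssh-keygen):

```yaml
# Hypothetical Secret holding a pre-generated SSH key pair. The same Secret
# would be mounted (as a volume) into both the master and worker pods under
# ~/.ssh so that sshd on the workers accepts connections from the master.
apiVersion: v1
kind: Secret
metadata:
  name: horovod-ssh-key
type: Opaque
stringData:
  id_rsa: |
    <private key generated with ssh-keygen>
  authorized_keys: |
    <matching public key>
```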

Leveraging the MPI Operator

These steps are all rather tedious, which is why the MPI Operator exists. What exactly does it do? See https://medium.com/kubeflow/introduction-to-kubeflow-mpi-operator-and-industry-adoption-296d5f2e6edc as well as https://github.com/kubeflow/community/blob/master/proposals/mpi-operator-proposal.md

Roughly: MPI performs the initial handshake via kubectl exec (so the pods can discover each other and communicate afterwards), and the worker pods' addresses are stored in a ConfigMap, which is then mounted into the master.
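Judging from the launch log later in this post (the OMPI_MCA_orte_default_hostfile variable and the worker pod names), the ConfigMap content mounted at /etc/mpi/hostfile for this job presumably looks like the standard Open MPI hostfile format, with slots matching slotsPerWorker:

```
sw-simple-ml-gitlab-worker-0 slots=1
sw-simple-ml-gitlab-worker-1 slots=1
```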

At this point, our training workflow is:

  • Modify the code
  • Rebuild the image on top of the horovod-base image (you can even install your own environment here)
  • Create an MPI Job

The problem is that every code change requires rebuilding and pushing the image, and since the Horovod environment is heavyweight, the image itself is very large. So this approach still needs improvement.

Managing code with GitLab

One approach is to pull the code out on its own, push it to a self-hosted GitLab, and fetch it via git inside the MPI Job. The environment is likewise captured in a requirements.txt and installed inside the image each time an MPI Job is created.

Input data, output model parameters, and so on are all kept in unified S3 storage.

At this point, our training workflow is:

  • Modify the code locally, then commit & push to GitLab
  • Create an MPI Job, whose spec includes the git clone, pip install, and horovodrun commands

An example MPI Job:

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  namespace: mpi-operator
  name: sw-simple-ml-gitlab
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: coreharbor.bdap.com/library/horovod-sw-base
            name: horovod-master
            command: ["/bin/sh", "-c"]
            args: ["sleep 1m && mkdir simple_ml && cd simple_ml && horovodrun -np 2 --hostfile /etc/mpi/hostfile python main_with_horovod.py"]
    Worker:
      replicas: 2
      template:
        spec:
          dnsPolicy: "None"
          dnsConfig:
            nameservers:
            - 10.105.222.6
          containers:
          - image: coreharbor.bdap.com/library/horovod-sw-base
            name: horovod-worker
            command: ["/bin/sh", "-c"]
            args: ["git -c http.sslVerify=false clone https://gitlab.bdap.com/faraway/simple_ml.git && cd simple_ml && pip install -r requirements.txt && sleep infinity"]
            resources:
              limits:
                nvidia.com/gpu: 1
          tolerations:
          - effect: NoSchedule
            key: gpu
            operator: Exists

Training results:

Output from the training run:
[0]<stderr>:+ POD_NAME=sw-simple-ml-gitlab-worker-0
[0]<stderr>:+ shift
[0]<stderr>:+ /opt/kube/kubectl exec sw-simple-ml-gitlab-worker-0 -- /bin/sh -c cd /simple_ml > /dev/null 2>&1 ; HOROVOD_HOSTNAME=sw-simple-ml-gitlab-worker-0 HOROVOD_RANK=0 HOROVOD_SIZE=2 HOROVOD_LOCAL_RANK=0 HOROVOD_LOCAL_SIZE=1 HOROVOD_CROSS_RANK=0 HOROVOD_CROSS_SIZE=2 LIBRARY_PATH=/usr/local/cuda/lib64/stubs KUBERNETES_SERVICE_PORT=443 KUBERNETES_PORT=tcp://10.96.0.1:443 HOSTNAME=sw-simple-ml-gitlab-launcher LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 HOME=/root CUDA_VERSION=11.3.1 NVIDIA_REQUIRE_CUDA='cuda>=11.3 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 driver>=450' NVIDIA_DRIVER_CAPABILITIES='' KUBERNETES_PORT_443_TCP_ADDR=10.96.0.1 PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin KUBERNETES_PORT_443_TCP_PORT=443 KUBERNETES_PORT_443_TCP_PROTO=tcp CUDNN_VERSION=8.2.0.53-1+cuda11.3 KUBERNETES_SERVICE_PORT_HTTPS=443 KUBERNETES_PORT_443_TCP=tcp://10.96.0.1:443 KUBERNETES_SERVICE_HOST=10.96.0.1 PWD=/simple_ml OMPI_MCA_orte_default_hostfile=/etc/mpi/hostfile OMPI_MCA_plm_rsh_agent=/etc/mpi/kubexec.sh NVIDIA_VISIBLE_DEVICES='' NCCL_VERSION=2.9.9-1+cuda11.3 TZ=Asia/Dubai LC_CTYPE=C.UTF-8 PYTHONUNBUFFERED=1 HOROVOD_GLOO_RENDEZVOUS_ADDR=10.244.3.228 HOROVOD_GLOO_RENDEZVOUS_PORT=56794 HOROVOD_CONTROLLER=gloo HOROVOD_CPU_OPERATIONS=gloo HOROVOD_GLOO_IFACE=eth0 NCCL_SOCKET_IFNAME=eth0 python main_with_horovod.py
[1]<stderr>:+ POD_NAME=sw-simple-ml-gitlab-worker-1
[1]<stderr>:+ shift
[1]<stderr>:+ /opt/kube/kubectl exec sw-simple-ml-gitlab-worker-1 -- /bin/sh -c cd /simple_ml > /dev/null 2>&1 ; HOROVOD_HOSTNAME=sw-simple-ml-gitlab-worker-1 HOROVOD_RANK=1 HOROVOD_SIZE=2 HOROVOD_LOCAL_RANK=0 HOROVOD_LOCAL_SIZE=1 HOROVOD_CROSS_RANK=1 HOROVOD_CROSS_SIZE=2 LIBRARY_PATH=/usr/local/cuda/lib64/stubs KUBERNETES_SERVICE_PORT=443 KUBERNETES_PORT=tcp://10.96.0.1:443 HOSTNAME=sw-simple-ml-gitlab-launcher LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 HOME=/root CUDA_VERSION=11.3.1 NVIDIA_REQUIRE_CUDA='cuda>=11.3 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 driver>=450' NVIDIA_DRIVER_CAPABILITIES='' KUBERNETES_PORT_443_TCP_ADDR=10.96.0.1 PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin KUBERNETES_PORT_443_TCP_PORT=443 KUBERNETES_PORT_443_TCP_PROTO=tcp CUDNN_VERSION=8.2.0.53-1+cuda11.3 KUBERNETES_SERVICE_PORT_HTTPS=443 KUBERNETES_PORT_443_TCP=tcp://10.96.0.1:443 KUBERNETES_SERVICE_HOST=10.96.0.1 PWD=/simple_ml OMPI_MCA_orte_default_hostfile=/etc/mpi/hostfile OMPI_MCA_plm_rsh_agent=/etc/mpi/kubexec.sh NVIDIA_VISIBLE_DEVICES='' NCCL_VERSION=2.9.9-1+cuda11.3 TZ=Asia/Dubai LC_CTYPE=C.UTF-8 PYTHONUNBUFFERED=1 HOROVOD_GLOO_RENDEZVOUS_ADDR=10.244.3.228 HOROVOD_GLOO_RENDEZVOUS_PORT=56794 HOROVOD_CONTROLLER=gloo HOROVOD_CPU_OPERATIONS=gloo HOROVOD_GLOO_IFACE=eth0 NCCL_SOCKET_IFNAME=eth0 python main_with_horovod.py
[0]<stdout>:Training. Epoch 0, MSE loss: 1338.4868140452961, Worker: 0
[1]<stdout>:Training. Epoch 0, MSE loss: 1148.9670435080386, Worker: 1
[0]<stdout>:Training. Epoch 1, MSE loss: 935.5324933822116, Worker: 0
[1]<stdout>:Training. Epoch 1, MSE loss: 934.2259948853756, Worker: 1
[0]<stdout>:Training. Epoch 2, MSE loss: 654.0407885544738, Worker: 0
[1]<stdout>:Training. Epoch 2, MSE loss: 633.2420742589119, Worker: 1
[1]<stdout>:Training. Epoch 3, MSE loss: 599.6154769317578, Worker: 1
[0]<stdout>:Training. Epoch 3, MSE loss: 593.8755020723866, Worker: 0
[1]<stdout>:Training. Epoch 4, MSE loss: 574.4224909156511, Worker: 1
[0]<stdout>:Training. Epoch 4, MSE loss: 483.1180045366194, Worker: 0
[0]<stdout>:Training. Epoch 5, MSE loss: 501.9576651239583, Worker: 0
[1]<stdout>:Training. Epoch 5, MSE loss: 561.3756192270189, Worker: 1
[1]<stdout>:Training. Epoch 6, MSE loss: 543.3503851517656, Worker: 1
[0]<stdout>:Training. Epoch 6, MSE loss: 461.72367964229545, Worker: 0
[1]<stdout>:Training. Epoch 7, MSE loss: 599.5701360625471, Worker: 1
[0]<stdout>:Training. Epoch 7, MSE loss: 477.2382655836473, Worker: 0
[0]<stdout>:Training. Epoch 8, MSE loss: 467.4026207899954, Worker: 0
[1]<stdout>:Training. Epoch 8, MSE loss: 489.72919646231924, Worker: 1
[0]<stdout>:Training. Epoch 9, MSE loss: 496.021579158862, Worker: 0
[1]<stdout>:Training. Epoch 9, MSE loss: 466.33928261113863, Worker: 1
[0]<stdout>:Training. Epoch 10, MSE loss: 473.9692597730085, Worker: 0
[1]<stdout>:Training. Epoch 10, MSE loss: 484.8857977675169, Worker: 1
[0]<stdout>:Training. Epoch 11, MSE loss: 472.89639633833195, Worker: 0
[1]<stdout>:Training. Epoch 11, MSE loss: 468.5362197326799, Worker: 1
[1]<stdout>:Training. Epoch 12, MSE loss: 460.76999270017814, Worker: 1
[0]<stdout>:Training. Epoch 12, MSE loss: 416.7220215983699, Worker: 0
[0]<stdout>:Training. Epoch 13, MSE loss: 451.9702810886584, Worker: 0
[1]<stdout>:Training. Epoch 13, MSE loss: 439.8771825716403, Worker: 1
[0]<stdout>:Training. Epoch 14, MSE loss: 366.88131853180283, Worker: 0
[1]<stdout>:Training. Epoch 14, MSE loss: 470.20061697393044, Worker: 1
[1]<stdout>:Training. Epoch 15, MSE loss: 431.8841860381735, Worker: 1
[0]<stdout>:Training. Epoch 15, MSE loss: 395.79066620089884, Worker: 0
[0]<stdout>:Training. Epoch 16, MSE loss: 390.8923997512978, Worker: 0
[1]<stdout>:Training. Epoch 16, MSE loss: 425.90262653747106, Worker: 1
[0]<stdout>:Training. Epoch 17, MSE loss: 516.2323905151153, Worker: 0
[1]<stdout>:Training. Epoch 17, MSE loss: 355.790610972445, Worker: 1
[0]<stdout>:Training. Epoch 18, MSE loss: 501.96195440185153, Worker: 0
[1]<stdout>:Training. Epoch 18, MSE loss: 442.8678330586471, Worker: 1
[1]<stdout>:Training. Epoch 19, MSE loss: 445.24695194348794, Worker: 1
[0]<stdout>:Training. Epoch 19, MSE loss: 441.0054695299096, Worker: 0
[0]<stdout>:Testing. MSE loss: 219.68063354492188, Worker: 0
[1]<stdout>:Testing. MSE loss: 219.68063354492188, Worker: 1

One annoying point at the moment is that the master must wait for the workers to finish installing the environment before it can run horovodrun, hence the hard-coded sleep 1m. This should eventually be solvable with some form of inter-pod signaling.
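One direction is to replace the fixed sleep with a polling wait. A hedged sketch of the idea (the function and marker-file names are made up, and a local marker file stands in for a real cross-pod readiness check):

```shell
# Wait until a readiness marker appears, instead of sleeping a fixed minute.
wait_ready() {
  marker="$1"; timeout="$2"; elapsed=0
  until [ -f "$marker" ]; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1                      # give up after $timeout seconds
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}

# Local simulation: a "worker" becomes ready after 2 seconds.
(sleep 2 && touch /tmp/worker-ready) &

wait_ready /tmp/worker-ready 10 && echo "workers ready"
```

In the real job, the workers would touch the marker (or expose some equivalent signal) after pip install finishes, and the launcher's args would call the wait before horovodrun.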

Update: this can be solved with a readinessProbe. Example
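A sketch of what that might look like on the worker container, assuming a made-up marker file that the install command touches when done (only the changed fields are shown):

```yaml
# Hypothetical worker snippet: the args write /tmp/ready after pip install,
# and the readinessProbe keys off that file, so Kubernetes only marks the
# worker Ready once its environment is fully installed.
args: ["git -c http.sslVerify=false clone https://gitlab.bdap.com/faraway/simple_ml.git && cd simple_ml && pip install -r requirements.txt && touch /tmp/ready && sleep infinity"]
readinessProbe:
  exec:
    command: ["test", "-f", "/tmp/ready"]
  initialDelaySeconds: 5
  periodSeconds: 5
```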


Author: Song Wei
Posted on: July 7, 2022
https://fffffaraway.github.io/2022/07/07/horovod-on-kubernetes/