The node configuration used in this chapter is as follows:
- A single server with two NVIDIA Tesla V100 cards, each with 16GB of GPU memory
- Ubuntu Server 16.04 LTS
- Docker CE 18.09
- libvirt 1.3.1
- QEMU 2.5.0
We use a virtual machine named ubuntu-client0, running Ubuntu 16.04, as the Orion Client. This VM neither has the host's GPU passed through to it (Passthrough) nor has the NVIDIA driver or CUDA components installed. We installed the necessary Python 3 packages and the GPU version of TensorFlow 1.12:
# From inside VM
sudo apt install python3-dev python3-pip
sudo pip3 install tensorflow-gpu==1.12.0
Since the VM cannot access a GPU and has no NVIDIA software environment, TensorFlow cannot be used at this point. Once the Orion vGPU software is configured, TensorFlow inside the VM will be able to use Orion vGPUs for model training and inference.
After the Orion vGPU software is deployed, we will run the official TensorFlow CIFAR10_Estimator example inside ubuntu-client0, using two Orion vGPUs (each residing on a different physical GPU) for model training and inference.
Before proceeding, we assume that:
- Orion Server has been installed successfully according to the Orion Server installation and deployment section
- Orion Controller has been installed and started normally according to the Orion Controller installation and deployment section
Before starting the Orion Server service, we need to modify the configuration file to set up the data path and enable Orion Server support for KVM.
The bind_addr property specifies the data-path address that Orion Server accepts; the Client must be able to reach this address. For KVM virtual machines, we need to set it to the gateway address of the KVM virtual machine network.
We use virsh to check the network configuration of the KVM virtual machine:
# From host OS
sudo virsh domifaddr ubuntu-client0
Name MAC address Protocol Address
-------------------------------------------------------------------------------
vnet1 52:54:00:04:82:10 ipv4 33.31.0.10/24
As shown above, the current IP address of the KVM virtual machine is 33.31.0.10, so bind_addr should be set to 33.31.0.1.
Note: if the KVM virtual machine is attached to multiple virtual subnets, for example:
Name MAC address Protocol Address
-------------------------------------------------------------------------------
vnet0 52:54:00:04:82:10 ipv4 33.31.0.10/24
vnet1 52:54:00:c5:43:10 ipv4 33.32.0.10/24
then the gateway of any one of these subnets can be chosen as bind_addr.
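For instance (purely illustrative, based on the sample output above), either of the following values would be a valid data-path address in /etc/orion/server.conf:
# Either subnet gateway can serve as the data path
bind_addr = 33.31.0.1
# or alternatively
bind_addr = 33.32.0.1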
In this scenario, we still use local shared memory to accelerate data transfer, so we set enable_shm=true and enable_rdma=false. In addition, we must explicitly enable Orion vGPU support for KVM virtual machines by setting enable_kvm=true.
In this scenario, the first section of /etc/orion/server.conf should be configured as:
[server]
listen_port = 9960
bind_addr = 33.31.0.1
enable_shm = "true"
enable_rdma = "false"
enable_kvm = "true"
We need to restart Orion Server for the new configuration to take effect, and then use the orion-check tool to further confirm that Orion Server and Orion Controller can communicate properly:
# From host OS
sudo systemctl restart oriond
sudo orion-check runtime server
Normal output looks similar to the following:
Searching NVIDIA GPU ...
CUDA driver 418.67
418.67 is installed.
2 NVIDIA GPUs are found :
0 : Tesla V100-PCIE-16GB
1 : Tesla V100-PCIE-16GB
Checking NVIDIA MPS ...
NVIDIA CUDA MPS is off.
Checking Orion Server status ...
Orion Server is running with Linux user : root
Orion Server is running with command line : /usr/bin/oriond
Enable SHM [Yes]
Enable RDMA [No]
Enable Local QEMU-KVM with SHM [Yes]
Binding IP Address : 33.31.0.1
Listening Port : 9960
Testing the Orion Server network ...
Orion Server can be reached through 33.31.0.1:9960
Checking Orion Controller status ...
[Info] Orion Controller setting may be different in different SHELL.
[Info] Environment variable ORION_CONTROLLER has the first priority.
Orion Controller addrress is set as 127.0.0.1:9123 in configuration file. Using this address to diagnose Orion Controller
Address 127.0.0.1:9123 is reached.
Orion Controller Version Infomation : data_version=0.1,api_version=0.1
There are 8 vGPU under managered by Orion Controller. 8 vGPU are free now.
As we can see, our Orion Server node has two Tesla V100 cards, and Orion Controller has virtualized them into a total of 8 Orion vGPUs.
Inside the virtual machine, we run the Orion Client installer:
# From inside VM
sudo ./install-client-9.0
Since no installation path is specified here, the installer asks whether to install the Orion Client runtime to the default path /usr/lib/orion. After the user confirms, the installer adds the Orion Client runtime to the operating system's dynamic library search path via the ldconfig mechanism.
Orion client environment will be installed to /usr/lib/orion
Do you want to continue [n/y] ?y
Configuration file is generated to /etc/orion/client.conf
Please edit the "controller_addr" setting and make it point to the controller address in your environment.
Orion vGPU client environment has been installed in /usr/lib/orion
To run application with Orion vGPU environment, please make sure Orion environment is loaded. e.g.
export LD_LIBRARY_PATH=/usr/lib/orion:$LD_LIBRARY_PATH
Because the installer has already configured the search path, the export LD_LIBRARY_PATH=<installation-path>:$LD_LIBRARY_PATH line shown on screen is not required in this case.
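If you want to verify that the runtime has been registered with the dynamic linker (a minimal check; the exact library names vary with the Orion Client release), you can query the ldconfig cache:
# From inside VM: Orion Client libraries should show up in the linker cache
ldconfig -p | grep -i orion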
To install to a custom path instead, take /orion as an example:
# From inside VM
INSTALLATION_PATH=/orion
sudo mkdir -p $INSTALLATION_PATH
sudo ./install-client-9.0 -d $INSTALLATION_PATH
In this case, the installer installs the Orion Client runtime directly into the user-specified path INSTALLATION_PATH=/orion and prints the following message:
Configuration file is generated to /etc/orion/client.conf
Please edit the "controller_addr" setting and make it point to the controller address in your environment.
Orion vGPU client environment has been installed in /orion
To run application with Orion vGPU environment, please make sure Orion environment is loaded. e.g.
export LD_LIBRARY_PATH=/orion:$LD_LIBRARY_PATH
Before running applications in a terminal, the user must make sure the Orion Client runtime is on the operating system's dynamic library search path:
# From current working terminal inside VM
export LD_LIBRARY_PATH=/orion:$LD_LIBRARY_PATH
Note that this command only takes effect in the current terminal. For convenience, the user can append the statement above to the end of ~/.bashrc and run source ~/.bashrc to apply it; after that, there is no need to set it again on subsequent logins to the VM.
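A minimal sketch of this, assuming the runtime was installed to /orion (adjust the path to match your installation):
# From inside VM: persist the library path for future logins
echo 'export LD_LIBRARY_PATH=/orion:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc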
As introduced in the "Using Docker Containers" section, the Orion Client needs to send requests for Orion vGPU resources to the Orion Controller. In the container environment, we set the Orion Controller address with the environment variable ORION_CONTROLLER=<controller_ip>:9123 when starting the container. For KVM virtual machines, we can instead edit /etc/orion/client.conf to achieve the same configuration.
Since the Orion Controller listens on 0.0.0.0:9123 on the physical machine, we simply set controller_addr to the gateway address of the VM subnet:
[controller]
controller_addr = 33.31.0.1:9123
After that, we use the orion-check tool to check the status:
# From inside VM
orion-check runtime client
If the Orion Client VM can connect to the Orion Controller, the output will be:
# (omit output)
Orion Controller addrress is set as 33.31.0.1:9123 in configuration file. Using this address to diagnose Orion Controller
Address 33.31.0.1:9123 is reached.
Orion Controller Version Infomation : data_version=0.1,api_version=0.1
There are 8 vGPU under managered by Orion Controller. 8 vGPU are free now.
Before running the application, we use environment variables to specify how many Orion vGPUs, and how much GPU memory, the application will request from the Orion Controller:
export ORION_VGPU=2
export ORION_GMEM=12000
Each of our Tesla V100 cards has 16GB of memory, so if ORION_GMEM is set to less than 8GB, both Orion vGPUs would be scheduled onto the same physical GPU. Here we set the Orion vGPU memory to 12000MB, so the two Orion vGPUs will be scheduled onto two different physical GPUs, which lets us demonstrate dual-GPU model training.
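For comparison, an illustrative alternative setting (not used in this walkthrough), following the scheduling rule described above: requesting less than 8GB per vGPU would allow both vGPUs to be packed onto a single physical V100.
# From inside VM: with less than 8GB per vGPU, both vGPUs may share one physical GPU
export ORION_VGPU=2
export ORION_GMEM=7000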
Next, we use the official TensorFlow CIFAR10 Estimator example to demonstrate model training and inference: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10_estimator/README.md
First, we git clone the official TensorFlow models repo:
# From inside VM
git clone --branch=r1.12.0 https://github.com/tensorflow/models
Then change into the CIFAR10 Estimator directory:
cd models/tutorials/image/cifar10_estimator/
Before training the model for the first time, you need to download the CIFAR10 dataset and convert it to TFRecord format:
mkdir data
python3 generate_cifar10_tfrecords.py --data-dir ./data
Once processing is done, the data directory should contain the following, about 520MB in total:
user@ubuntu-client0:~/models/tutorials/image/cifar10_estimator/data$ ls
cifar-10-batches-py cifar-10-python.tar.gz eval.tfrecords train.tfrecords validation.tfrecords
Now we train the model with two Orion vGPUs, with a batch_size of 128 per Orion vGPU, 256 in total:
python3 cifar10_main.py \
--data-dir=${PWD}/data \
--job-dir=/tmp/cifar10 \
--variable-strategy=GPU \
--num-gpus=2 \
--train-steps=10000 \
--train-batch-size=256 \
--learning-rate=0.1
TensorFlow prints logs like the following:
VirtaiTech Resource. Build-cuda-7675815-20190624_081551
2019-06-25 15:43:43.493814: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-25 15:43:43.493882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:00:09.0
totalMemory: 11.72GiB freeMemory: 11.72GiB
2019-06-25 15:43:43.604945: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-25 15:43:43.605002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:00:09.0
totalMemory: 11.72GiB freeMemory: 11.72GiB
2019-06-25 15:43:43.606527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-06-25 15:43:43.606568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-25 15:43:43.606577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
2019-06-25 15:43:43.606582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y
2019-06-25 15:43:43.606589: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N
2019-06-25 15:43:43.606657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 11400 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:00:09.0, compute capability: 7.0)
2019-06-25 15:43:43.607202: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 11400 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:00:09.0, compute capability: 7.0)
# (omit output)
INFO:tensorflow:global_step/sec: 14.2797
INFO:tensorflow:loss = 0.48649728, step = 9900 (7.003 sec)
INFO:tensorflow:learning_rate = 0.1, loss = 0.48649728 (7.003 sec)
INFO:tensorflow:Average examples/sec: 3639.99 (4009.58), step = 9900
INFO:tensorflow:Average examples/sec: 3640.07 (3717.26), step = 9910
INFO:tensorflow:Average examples/sec: 3640.09 (3655.01), step = 9920
INFO:tensorflow:Average examples/sec: 3640.31 (3873.63), step = 9930
INFO:tensorflow:Average examples/sec: 3640.45 (3788.08), step = 9940
INFO:tensorflow:Average examples/sec: 3640.79 (4017.58), step = 9950
INFO:tensorflow:Average examples/sec: 3641.19 (4089.74), step = 9960
INFO:tensorflow:Average examples/sec: 3641.23 (3679.08), step = 9970
INFO:tensorflow:Average examples/sec: 3641.43 (3847.37), step = 9980
INFO:tensorflow:Average examples/sec: 3641.4 (3615.53), step = 9990
INFO:tensorflow:Saving checkpoints for 10000 into /tmp/cifar10/model.ckpt.
INFO:tensorflow:Loss for final step: 0.46667284.
# (omit output)
INFO:tensorflow:Evaluation [100/100]
INFO:tensorflow:Finished evaluation at 2019-06-25-08:06:14
INFO:tensorflow:Saving dict for global step 10000: accuracy = 0.7628, global_step = 10000, loss = 1.0683168
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 10000: /tmp/cifar10/model.ckpt-10000
2019-06-25 16:06:15 [INFO] Client exits with allocation ID fda38164-711b-4809-9984-2759a3a2165b
From the logs we can see:
- When the application starts, the Orion Client runtime prints the log line VirtaiTech Resource. Build-cuda-xxx. This line indicates that the application successfully loaded the Orion Client runtime.
- When the application exits, the Orion Client runtime prints the log line Client exits with allocation ID xxx. This line indicates that the application successfully obtained Orion vGPU resources from the Orion Controller during its lifetime and released them on exit.
- At startup, TensorFlow detected two GPUs, each with 11.72GiB of memory (corresponding to our ORION_GMEM=12000 setting).
While the model is training, we run nvidia-smi in the host OS to check the physical GPU usage.
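For example (a simple way to keep an eye on utilization; watch just re-runs nvidia-smi periodically):
# From host OS: refresh GPU usage once per second during training
watch -n 1 nvidia-smi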
From the results we can see:
- Access to the physical GPUs is fully taken over by the Orion Server process oriond
- The two Orion vGPUs are scheduled onto two different physical GPUs
- The GPU memory usage of each Orion vGPU is capped as configured
If anything goes wrong during these steps, refer to the corresponding section in the appendix for troubleshooting.