Since the inception of virtualization, it has been accepted that some amount of overhead gets added to any workload running in a virtual machine. Over time, VMware's focus has expanded from consolidating workloads to handling mission-critical or tier-1 apps, and more recently to high-performance apps. When developing high-performance apps in any context, it is key for the workload to leverage native hardware acceleration wherever possible. Scroll down for my "Virtualizing HPC and Latency-Sensitive Apps on vSphere Primer".
It is important to size the workload appropriately for the physical platform: matching physical to virtual NUMA node awareness and alignment for processes and memory, matching threads to cores, and leveraging offloads in the IO pipeline wherever possible, and that's just to start. Beyond that, for the performance of a virtual machine to approach or beat a workload running on bare metal, it is key for the virtualization platform to expose as much of those feature sets as it can. Perhaps most importantly, the hypervisor should get out of the way whenever possible to let the VM's processes run without interruption.
However, how much you want the hypervisor to get out of the way depends on whether you are optimizing for a latency-sensitive workload or a high-throughput workload. For high throughput, you may want to let the VMs run as much as possible without interrupts, but this can add latency in the IO path. For a latency-sensitive workload, you may want to disable interrupt coalescing, but then you are deliberately servicing IO instead of focusing on compute. Remember that since you are trading off throughput and parallelization for latency, the settings and recommendations below should be evaluated and tested thoroughly to understand whether they fit the workload. If you have a workload that demands both high throughput and low latency, you may have no choice but to adopt VMDirectPath or SR-IOV, which have their own set of tradeoffs listed in the docs here: http://pubs.vmware.com/vsphere-55/index.jsp?topic=%2Fcom.vmware.vsphere.networking.doc%2FGUID-BF2770C3-39ED-4BC5-A8EF-77D55EFE924C.html
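To make that tradeoff concrete, here is a minimal sketch of the per-VM advanced settings involved, assuming ESXi 5.5 and a VMXNET3 vNIC; the values are illustrative, so verify the option names against your build and test before adopting them:

    # Sketch only: .vmx entries that trade throughput for latency on a single VM.
    # The "high" latency-sensitivity setting requires a full memory reservation
    # and is applied while the VM is powered off.
    sched.cpu.latencySensitivity = "high"      # the ESXi 5.5 latency-sensitivity setting
    ethernet0.coalescingScheme = "disabled"    # disable virtual interrupt coalescing on the vNIC

The same keys can be set from the vSphere Web Client as VM advanced configuration parameters rather than by editing the .vmx directly.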
Along the way, VMware's hypervisor development has hit plenty of milestones that contribute to its performance-honed characteristics and features. A good, though not definitive, list:
· VMware's first product was Workstation, but ESX 1.0 was its first Type 1 hypervisor.
· ESX 3 introduced a service console VM; previously ESX had to statically assign IO devices.
· ESX 4.0 introduced VMDirectPath.
· ESXi 4.1 eliminated the service console VM.
· ESXi 5 is where the hypervisor was rewritten to become the best platform to run Cloud Foundry, an entirely new set of requirements around very fast provisioning and power-on of large numbers of virtual machines. Arguably this is where ESXi really learned how to get out of the way of a workload, delivering near-native performance in most cases.
· ESXi 5.1 introduced some of the latency-sensitivity tuning primitives, but required advanced vmkernel options to set them.
· ESXi 5.5 built on that latency-sensitivity tuning and hypervisor granularity to include a simple checkbox that marks a VM as latency-sensitive.
In addition, each major and minor version of ESX(i) has added support for the latest and greatest chipsets from Intel and AMD, NICs, and storage adapters. These advancements were accompanied by updates to the virtual hardware of a VM and to VMware Tools, the in-guest set of drivers recommended for best performance and manageability.
By allowing the best translation of the native functionality and offloads of the underlying hardware, ESXi gets VMs to near-native performance for most throughput-driven workloads. However, there are cases where benchmarks show that virtualized workloads can exceed the performance characteristics of their native equivalent. So how is this possible? To put it another way, can the hypervisor be a better translation, management, and scheduling engine for the hardware than an OS kernel itself? Why not keep the workload physical?
A workload running on bare metal will of course have direct access to all the hardware on that server; however, the sizing you have to accept at that point is the total CPU and memory of that server, and that sizing is static. With distributed systems, cloud-native apps, or platform 3 apps, it is rarely about a single server. It is more about the aggregate performance across tens or hundreds or, for some, even thousands of servers. In a discrete and multi-tenant (or "microservices", if you want) architecture, the requirement for dynamic and flexible sizing in aggregate is a natural fit for virtualization.
Understanding the associated workload is critical, of course, in order to size the VM optimally. For traditional IT workloads, the more common problem was oversized VMs. For high-performance apps, however, esxtop can help determine whether the VM is constrained by CPU, memory, storage, or network. My esxtop checklist is this kb article from VMware: http://kb.vmware.com/kb/2001003.
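If it helps, here is a minimal sketch of collecting esxtop data in batch mode for offline analysis; the interval, sample count, and output path are arbitrary examples:

    # Sketch: capture 10 minutes of esxtop samples (2-second interval, 300 iterations)
    # from the ESXi shell, then review counters such as %RDY and %CSTP (CPU contention),
    # SWR/s and SWW/s (memory swapping), DAVG/KAVG (storage latency), and
    # %DRPTX/%DRPRX (dropped packets) in the resulting CSV.
    esxtop -b -d 2 -n 300 > /vmfs/volumes/datastore1/esxtop-capture.csv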
For a platform configuration checklist for high-performance workloads, see http://www.vmware.com/files/pdf/techpaper/VMW-Tuning-Latency-Sensitive-Workloads.pdf:
· Make sure the BIOS is updated and set for maximum performance. Your mileage may vary due to the BIOS and firmware configuration of different components in the hardware; even virtualized, these issues can still cause performance to lag.
· C-states support should be disabled.
· Power management in the BIOS should be disabled.
· Use the latest stable ESXi version that you can. See the ESXi generation improvements above. The caveat is that the hardware drivers may differ between ESXi versions, which can cause poor or inconsistent performance results; throughput testing when adding new drivers is definitely recommended.
· Size VMs to fit within a NUMA node of the chipset; the node size depends on the processor generation (see the NUMA sizing sketch after this list). For example, see Dell's recommendations for Haswell: http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2014/09/23/bios-tuning-for-hpc-on-13th-generation-haswell-server Also, here is an older but still relevant article describing the NUMA affinity and migration techniques of the vmkernel: http://blogs.vmware.com/vsphere/2012/02/vspherenuma-loadbalancing.html
· When sizing VMs, do not plan on overprovisioning any hardware component, as any bottleneck will typically determine the overall performance and the aggregate performance will be less than optimal.
· Choose the latest stable generation OS that you can. Later OS versions typically have more optimized hardware interrupt handling mechanisms. For example, see the performance tuning recommendations for RHEL 7 (support subscription required) or for RHEL 6.
· Use the latest version of VMware Tools, which includes the latest paravirtualized drivers such as PVSCSI and VMXNET3.
· Use PVSCSI when you can, but be careful of high-throughput issues with the default queue settings (a queue-depth sketch follows this list). For example, this kb article describes where the default queue depth for PVSCSI may be insufficient: http://kb.vmware.com/kb/2053145
· Use VMXNET3 when you can and pay attention to how much you can offload to the hardware with regards to LRO, RSS, multiqueue, and other NIC-specific optimizations (an in-guest sketch follows this list). Some relevant VMware kb articles: http://kb.vmware.com/kb/2020567 and http://kb.vmware.com/kb/1027511 and http://www.vmware.com/files/pdf/VMware-vSphere-PNICs-perf.pdf
· Low throughput for UDP in a Windows VM is another case to consider if your application IO depends on it; you may need to modify the vNIC settings: http://kb.vmware.com/kb/2040065
· Provisioning network capacity without overcommitting it can be significantly trickier than sizing for CPU and memory, especially if NFS is used for storage traffic. It's key to understand whether storage will be accessed by the vmkernel from SAN or NAS sources or by the VM itself. Use Network IO Control to enable fairer sharing of network bandwidth. However, understand that this may cause more interrupts, resulting in more overhead from switching between VMs, so if these are high-throughput (compute and memory) VMs, consider placing them on separate hosts: http://www.vmware.com/files/pdf/techpaper/Network-IOC-vSphere6-Performance-Evaluation.pdf
· If you have a particularly latency-sensitive workload, consider using SR-IOV or VMDirectPath (a host-side sketch follows this list). See the latest benchmarks for InfiniBand and RDMA here: http://www.vmware.com/files/pdf/techpaper/latency-sensitive-perf-vsphere55.pdf
· For InfiniBand workloads, plan on using SR-IOV or VMDirectPath. The latest benchmarks are here: http://blogs.vmware.com/cto/running-hpc-applications-vsphere-using-infiniband/ and http://blogs.vmware.com/cto/hpc-update/
· For Nvidia GPGPU (general purpose, non-VDI) workloads, plan on using VMDirectPath. With vSphere 6, Nvidia vGRID (think of SR-IOV for GPUs instead of NICs) will be supported by VMware Horizon View, so hopefully vGRID support for GPGPU workloads will be available soon. More details from Nvidia here: http://blogs.nvidia.com/blog/2015/02/03/vmware-nvidia-gpu-sharing/
· For Xeon Phi, there is no support today on vSphere 5.5. The MIC, or specialized "Many Integrated Core" coprocessor, is ignored by the hypervisor.
You'll also need to review in-guest OS settings and performance tuning variables, which I won't detail in this post. And finally, you'll need to consider application-specific tuning and optimizations. Given all of that, it should be possible to achieve better-than-native performance for certain high-performance workloads. For specific examples, see the excellent write-ups by VMware's performance team.
Hadoop on vSphere 6.0:
Redis on vSphere 6.0:
Links:
http://kb.vmware.com/kb/2020567 (RSS and multiqueue, Linux)
http://kb.vmware.com/kb/1027511 (LRO, Linux)