Auto-Tuning of NUMA and Linux Host for Cloud and NFV/VNF Workload Deployment
Problem
Cloud or NFV/VNF workloads deployed on a host demand a workload-class-specific execution environment and resources. These demands are satisfied by proper configuration and tuning (conditioning) of the host via the interfaces and relevant parameters provided by the host OS (such as Linux), the virtualization framework (such as Libvirt or Libcontainer), and the orchestrators used to deploy workloads (such as OpenStack, Docker, or Kubernetes). Typically, users or admins perform this conditioning manually using these interfaces and parameters. That approach is error-prone and does not scale, thus increasing operational complexity and hence OPEX.
Solution
A system, which we call the NUMA/Linux Host Auto-Conditioner (ACDR), that automatically conditions the NUMA/Linux host on which the workload is deployed, both before and after workload deployment. A user, while deploying a workload, specifies an ACDR conditioning class (ACDR-CA) corresponding to one or more workload-class-specific execution environment and resource requirements. When a user deploys a workload, the ACDR engine running on the host automatically applies the conditioning steps required to support the specified ACDR-CA. A (cloud NFV/VNF) admin can also condition a host statically (but at any time) to support a specific ACDR-CA prior to deployment of a workload, not by manually applying complex interfaces and parameters, but by specifying the ACDR-CA. The conditioning steps remain hidden from users and admins. Note that orchestrators may support conditioning classes (in the form of flavors), but they are extremely limited in their capabilities; for example, none of the ACDR-CA examples given below is supported by any orchestrator.
Examples of ACDR-CA are “CPU pinning on cleaner cores” (ACDR-CA-1), “strict NUMA node affinity”, “hardware interrupt ban from specified cores”, “(Linux) scheduler preemption interval set to a higher value”, “hardware IRQ and softirq on the same core(s) where the IP packet is being consumed”, “packets in a NIC queue sent to specific core(s)”, etc. Implementation of these ACDR-CA can be complex, requiring a number of well-coordinated or synchronized configuration and tuning steps applied on the host. We call these steps the conditioning steps for an ACDR-CA.
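To illustrate how an ACDR engine could organize conditioning classes, the following Python sketch (illustrative only; the registry and function names are hypothetical, not part of ACDR) maps an ACDR-CA name to the ordered conditioning steps the engine applies when a workload is deployed:

    from typing import Callable, Dict, List

    # A conditioning step acts on a workload descriptor (hypothetical structure).
    ConditioningStep = Callable[[dict], None]

    # Registry: ACDR-CA name -> ordered list of conditioning steps.
    ACDR_CA_REGISTRY: Dict[str, List[ConditioningStep]] = {}

    def register_ca(name: str, steps: List[ConditioningStep]) -> None:
        """Register the conditioning steps for an ACDR conditioning class."""
        ACDR_CA_REGISTRY[name] = steps

    def condition_host(ca_name: str, workload: dict) -> None:
        """Apply, in order, every conditioning step of the requested ACDR-CA."""
        for step in ACDR_CA_REGISTRY[ca_name]:
            step(workload)

A user or admin would then only name the ACDR-CA at deployment time; the steps themselves stay hidden behind the registry.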
For example, the ACDR-CA-1 mentioned above can have the following conditioning steps: 1) “cleaner” cores are selected from the Linux isolcpus cores, where isolcpus is a Linux boot-time parameter listing cores that the Linux scheduler excludes from task scheduling. Alternatively, a selected set of cores C1 can be cleaned by pinning selected tasks away from C1; for example, PID 1, which is the parent of all user processes, can be pinned away from C1. 2) Pin the workload on C1. 3) Pin some of the kernel tasks away from C1.
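A minimal Python sketch of these ACDR-CA-1 steps, assuming the workload is identified by its Linux PIDs (root privileges are required, and the helper names are illustrative only):

    import os

    def parse_cpu_list(text: str) -> set:
        """Parse a Linux CPU list such as '2-4,12' into {2, 3, 4, 12}."""
        cpus = set()
        for part in text.strip().split(","):
            if not part:
                continue
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
        return cpus

    def cleaner_cores() -> set:
        """Step 1: select 'cleaner' cores from the isolcpus boot parameter
        (optional isolcpus flags are not handled in this sketch)."""
        with open("/proc/cmdline") as f:
            for token in f.read().split():
                if token.startswith("isolcpus="):
                    return parse_cpu_list(token[len("isolcpus="):])
        return set()

    def pin_away(pid: int, c1: set) -> None:
        """Pin a task away from the cleaner cores C1 (e.g. PID 1)."""
        allowed = set(range(os.cpu_count())) - c1
        os.sched_setaffinity(pid, allowed)

    def apply_acdr_ca_1(workload_pids):
        c1 = cleaner_cores()
        if not c1:
            return                         # nothing to do without isolated cores
        pin_away(1, c1)                    # alternative cleaning: move PID 1 off C1
        for pid in workload_pids:          # step 2: pin the workload onto C1
            os.sched_setaffinity(pid, c1)
        # Step 3 (pinning selected kernel tasks away from C1) would call
        # pin_away() on each movable kernel thread; some kernel threads
        # are per-CPU and cannot be moved.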
Non-clean core:
Cleaner core: only the VM (qemu.kvm), the emulator (vhost), and a few other core-pinned kernel tasks are running on this core:
Cloud / NFV/VNF Workload Execution Visibility and Analytics
Problem
When a Cloud or NFV/VNF workload is executing on a (NUMA/Linux) host, it is layered on top of multiple other execution environments (EE): the NUMA host (CPU, CPU cache, NUMA node, memory, storage, and NIC), the (Linux) host OS, the hypervisor/container EE, the virtual switch EE, and the orchestrator EE. Hence visibility into the execution of the workload should be provided in the context of these underlying EE. Providing such holistic visibility (not to mention correlating it) is a major challenge.
Solution
A system for Holistic Visibility for Cloud / NFV/VNF Workloads, or the HAVW system. This system collects visibility data using the tools, probes, and interfaces (tools, in short) provided by the various EE mentioned above. Note that HAVW visibility information includes not only performance data, but also which cores or NUMA nodes the workload is executing on, the memory page structure and its distribution over NUMA nodes, the virtual interfaces on a virtual switch (such as OVS) that correspond to a workload, and the Linux Cgroups parameters as well as scheduler, protocol, and network device states. The user or admin configures which visibility information and which EE tools should be turned on. HAVW also allows a user to use the system interactively; for example, a user can change certain configuration and tuning of the host where the workload is running and see the effect.
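As an illustration of the kind of data involved, the following Python sketch (assumed interfaces, not the HAVW implementation) takes a per-workload snapshot from standard Linux /proc files: the cores and NUMA nodes the task may use, the core it last ran on, and its per-NUMA-node memory page distribution:

    def visibility_snapshot(pid: int) -> dict:
        snap = {}
        # Cores and NUMA nodes the task is allowed to use (affinity/cpuset view).
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith(("Cpus_allowed_list", "Mems_allowed_list")):
                    key, _, val = line.partition(":")
                    snap[key] = val.strip()
        # Core the task last ran on (field 39 of /proc/<pid>/stat).
        with open(f"/proc/{pid}/stat") as f:
            fields = f.read().rsplit(")", 1)[1].split()
            snap["last_cpu"] = int(fields[36])
        # Per-NUMA-node page counts of the task's mappings (requires privileges).
        node_pages = {}
        with open(f"/proc/{pid}/numa_maps") as f:
            for line in f:
                for tok in line.split():
                    node, eq, pages = tok.partition("=")
                    if eq and node.startswith("N") and node[1:].isdigit():
                        node_pages[node] = node_pages.get(node, 0) + int(pages)
        snap["numa_pages"] = node_pages
        return snap

Data from the other EE (OVS virtual interfaces, Cgroups, orchestrator state) would be collected analogously through their own tools and merged into the same snapshot.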
Shortcomings of the system: It does not automatically 1) identify which other EE the workload execution will depend on, 2) identify which information should be collected, 3) correlate information from multiple EE, 4) configure the system based on workload classes.
We elaborate on the working of the system with figures and annotations (see below). The workloads were instances of an IPSEC gateway VNF packaged in a Linux KVM VM. The kernel-based OVS was the virtual switch switching traffic between VMs. Workloads were deployed via an orchestrator called the Elastic Services Controller (ESC), which performs VM/container life-cycle management and interfaces with OpenStack and other cloud controllers/orchestrators. Traffic was generated from an Iperf client on compute host 1 (host 1) to an IPSEC gateway VNF (VNF1). There was a crypto-map IPSEC tunnel from VNF1 to VNF2 on compute host 2 (host 2), and from there traffic flowed into an Iperf server VM.
In the relatively optimized deployment, the host was configured and tuned (conditioned) in a specific way to achieve the required performance results. For example, CPU pinning on “cleaner” cores and a few other conditionings were applied. In the non-optimized deployment, by contrast, no specific conditioning was applied. Users could experiment by applying different conditionings. For automated conditioning, users could use the ACDR system (NUMA/Linux Host Auto-Conditioner) mentioned before.
Users could use the following conditioning options (a sketch of applying several of them appears after these lists):
1. Cleaner cores (see ACDR system).
2. Dedicated Core Pinning (a workload exclusively executes on one or more cores and no other workload is allowed there, though the Linux scheduler will still schedule kernel tasks on these cores).
3. Flow Pinning (traffic from a NIC queue directed to specific core(s)).
4. IRQ Pinning (specific hardware IRQ pinned to specific core(s)).
5. 1GB Hugepages.
6. Strict NUMA Node Affinity (workload executed on one NUMA node, though memory pages were distributed over 2 NUMA nodes).
7. A select set of workloads (such as the VNF and the Iperf server) on the same core or on separate cores.
8. OVS Softirq Pinning (traffic on specific virtual interfaces of OVS was processed on specific cores).
9. IRQ (hardware interrupt) Ban from specified cores.
10. Emulator (vHost) Pinning (kernel-based vHost pinned to specified cores).
Other conditionings:
1. NUMA balancing off (by default, workloads are “balanced” via migration across NUMA nodes).
2. KSM (Kernel Same-page Merging) off (when on, this feature allows different VMs to share common memory pages).
3. Memory ballooning off.
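A minimal sketch of how several of the conditioning options above map onto standard Linux control files; device names, IRQ numbers, and core lists are placeholders, and the writes require root privileges:

    def write(path: str, value: str) -> None:
        with open(path, "w") as f:
            f.write(value)

    # 4. IRQ Pinning: steer hardware IRQ 42 to core 2.
    #    (9. An IRQ ban from cores is the complement: a list excluding those cores.)
    write("/proc/irq/42/smp_affinity_list", "2")

    # 3. Flow Pinning (RPS): packets from NIC queue rx-0 of eth1 handled on core 2
    #    (value is a hex CPU mask; 0x4 selects core 2).
    write("/sys/class/net/eth1/queues/rx-0/rps_cpus", "4")

    # 5. 1GB Hugepages: reserve 8 pages (also requires 1G hugepage support at boot,
    #    e.g. default_hugepagesz=1G hugepagesz=1G on the kernel command line).
    write("/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages", "8")

    # Other conditionings:
    write("/proc/sys/kernel/numa_balancing", "0")   # 1. automatic NUMA balancing off
    write("/sys/kernel/mm/ksm/run", "0")            # 2. KSM off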
The compute hosts were Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz systems with 2 NUMA nodes, each node having 10 cores (20 with hyperthreads). The software stack was RHEL 7.2 (Maipo) with kernel 3.10.0-327.10.1.el7.x86_64 as the Linux host OS, QEMU 2.3.0, and OVS 2.4.0. The VNF was Cisco CSR1kv version 3.16.
This figure shows the layering of workloads on OVS; not all of the layering is shown here. HAVW provides visibility into each, or a select set, of the virtual interfaces (TAP, QVB, QVO).
This graph shows traffic flowing into a TAP interface (for example, VNF2 to tap12) at 5-second intervals. The upper, non-jittery graphs show sustained throughput (about 140 Mbps per instance) for the relatively optimized deployment (OD), whereas the lower graphs show the non-optimized deployment (NOD). Each vertical segment corresponds to a different conditioning option applied. There were four OD and four NOD instances (8 instances in total). OD workload instance 1 was pinned to core 2, instance 2 to core 3, instance 3 to core 4 of NUMA node 1, and instance 4 to core 12 of NUMA node 2. The NOD workloads shared cores 0, 1, 10, and 11 on two NUMA nodes (the Linux scheduler scheduled them on and migrated them between those cores). In this case only one VCPU was used, that is, each instance had only one CPU or core.
Legend: nt = NOD, t = OD workloads or related tasks, such as vHost, IRQ handler, softirq.
The vertical segments shown in the above graph correspond to the following conditioning options. Segments 2-5: OVS softirq for the VNF pinned to core 1, where the NOD workloads were running, and hardware IRQ banned from cores 2-4 and 12, where the OD workloads were running; segment 6: OVS softirq for the VNF pinned to core 6, where the Iperf server VM was running; segment 7: vHost on core 6.
Guest CPU Usage (5-second intervals): traces are collected with the multi-core collection tool mpstat. This visibility shows that the OD workloads made full use of the CPUs.
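A minimal sketch of such a collection, assuming the sysstat mpstat tool is installed on the host:

    import subprocess

    def sample_cpu_usage(cores="2,3,4,12", interval=5, count=12):
        # -P takes a comma-separated core list (or ALL); one report per interval.
        cmd = ["mpstat", "-P", cores, str(interval), str(count)]
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout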
KVM Exit Counts (5-second intervals): traces were collected with the Linux “perf” tool, which collects trace data from hardware counters and kernel tracepoints (other similar tools can also be used).
CPU L3 Cache References (5-second intervals): traces can be collected with Linux “perf” or other tools that collect trace data from hardware counters and kernel tracepoints. A reference to the L3 cache means that the data were not available in the L1 and L2 caches.
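A minimal sketch of collecting both the KVM exit counts and the cache references above, assuming the kvm:kvm_exit tracepoint and interval-mode perf stat are available on the host:

    import subprocess

    def perf_sample(duration_sec=60):
        cmd = [
            "perf", "stat",
            "-a",                         # all CPUs
            "-I", "5000",                 # print counts every 5000 ms
            "-e", "kvm:kvm_exit",         # KVM exit tracepoint
            "-e", "cache-references",     # (last-level) cache references
            "sleep", str(duration_sec),
        ]
        # perf stat writes its interval counts to stderr.
        return subprocess.run(cmd, capture_output=True, text=True).stderr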
CPU load for softirq processing: the load on core 1, where the OVS softirqs for the VNF instances were pinned, is clearly visible.
OD workloads were run with hugepages, whereas NOD workloads used the default page size. Hence TLB shootdowns were almost zero for the OD workloads (cores 2, 3, 4, 12) and in the many hundreds for the NOD workloads (cores 0, 1, 10, 11).
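TLB shootdown counts themselves can be read per core from /proc/interrupts (the row labeled "TLB"); comparing two snapshots gives the rate. A minimal sketch:

    def tlb_shootdowns():
        with open("/proc/interrupts") as f:
            cpus = f.readline().split()              # header: CPU0 CPU1 ...
            for line in f:
                fields = line.split()
                if fields and fields[0] == "TLB:":
                    counts = fields[1:1 + len(cpus)]
                    return dict(zip(cpus, map(int, counts)))
        return {}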
Results of running with 2 VCPUs: not much difference.
This visibility shows that CPU and memory were not dedicated to the workload instance (VCPU0 and VCPU1).
Results of running with SR-IOV and 2 VCPUs: some difference in segments 6 and 7.
Results of running with no IPSEC tunnel between the VNFs and with 2 VCPUs: substantial difference in throughput (about 1 Gbps).
Visibilities of various other parameters.
HAVW also supports other types of visibility or tracing, such as the following:
An introduction to Softirq:
When a hardware interrupt (IRQ) is raised and delivered to a core (its LAPIC, to be exact), an IRQ handler is invoked to process the IRQ. In Linux the IRQ handler, which is very lightweight, raises a softirq to process the interrupt further, for example, to receive packets via (NAPI) polling and process them. The following is the list of softirq types supported in the Linux kernel:
· HI_SOFTIRQ=0,
· TIMER_SOFTIRQ,
· NET_TX_SOFTIRQ,
· NET_RX_SOFTIRQ,
· BLOCK_SOFTIRQ,
· BLOCK_IOPOLL_SOFTIRQ,
· TASKLET_SOFTIRQ,
· SCHED_SOFTIRQ,
· HRTIMER_SOFTIRQ,
· RCU_SOFTIRQ,
· NR_SOFTIRQS (the number of softirq types, not a softirq itself)
When the softirq load is high, processing is deferred to a per-core kernel thread called ksoftirqd, which by default runs at normal (non-real-time) priority. In-context softirq processing is bounded by a time budget (jiffies + MAX_SOFTIRQ_TIME) so that other user processes are not starved of the CPU.
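Per-core softirq activity of the kind discussed above can be read from /proc/softirqs; a rising NET_RX count on a core shows where packet-receive softirqs are actually being processed. A minimal sketch:

    def softirq_counts():
        with open("/proc/softirqs") as f:
            cpus = f.readline().split()              # header: CPU0 CPU1 ...
            table = {}
            for line in f:
                name, _, rest = line.partition(":")
                counts = rest.split()[:len(cpus)]
                table[name.strip()] = dict(zip(cpus, map(int, counts)))
        return table

    # Example: softirq_counts()["NET_RX"] -> {"CPU0": 123456, "CPU1": 98765, ...}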