DDIO makes the cache a direct DMA target. Consider what happens on a read in two cases: without DDIO and with DDIO.
Without DDIO
The DMA engine writes the incoming data straight into main memory.
When the CPU goes to read that data, it first looks in the cache. The data is not there, so it takes a cache miss.
Because of the cache miss, the data must be fetched from memory into the cache.
The CPU waits for the fill to complete, then reads the data from the cache.
With DDIO
DMA writes directly into the cache (I/O into cache, known as DCA, Direct Cache Access).
When the CPU reads the data, it avoids these extra cache misses.
The Memory Wall
In the “von Neumann” architecture upon which most servers today are based, the compute unit is dependent on bus-connected memory for both instructions and data. However, the ever-increasing processor/memory performance gap, known as “The Memory Wall”, poses scaling challenges for this model. And while bandwidth has improved incrementally, memory latency is at a virtual standstill. Chip designers mitigate this performance gap with larger core caches, on-die memory controllers, Simultaneous Multithreading (SMT), additional cores, larger out-of-order windows, etc. But the gulf in performance remains a prominent issue.
The situation is particularly problematic for storage and network I/O. Take a network adapter (NIC), for instance. Not only does it incur a 150+ ns PCIe packet forwarding latency, but it must DMA that packet into RAM. This adds an additional 60–100ns latency per memory access before the CPU can process it. That’s a lotta extra overhead, not to mention the memory bandwidth it robs from running applications. This bottleneck will only worsen once 100Gb+ NICs become more widely deployed. Hmm, if only the NIC could skip the line, right? Oh wait... it can! Enter Intel DDIO.
What is Intel DDIO?
DDIO, or Data Direct I/O, is Intel’s latest implementation of Direct Cache Access (DCA). DCA is a technique that Intel researchers proposed in ’05 to deal with the growing adoption of 10GbE. The processing budget for the smallest Ethernet frame at a 10Gb/s rate is only 67.2ns, smaller than a typical RAM access latency. Therefore, DCA aims to minimize, and ideally eliminate, main memory involvement in the I/O critical path.
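That 67.2ns figure comes from the minimum frame’s footprint on the wire: a 64-byte frame plus 8 bytes of preamble/SFD and a 12-byte inter-frame gap is 84 bytes, or 672 bits, per frame slot. A quick back-of-the-envelope check:
awk 'BEGIN { printf "%.1f ns\n", (64 + 8 + 12) * 8 / 10e9 * 1e9 }'   # 84B on the wire = 672 bits -> 67.2 ns at 10Gb/s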
Early implementations relied on PCIe TLP Processing Hints, which made it possible for chipsets to prefetch I/O data into targeted cache destinations and, thus, reduce CPU stall time. While this reduces latency, it does little for bandwidth since the packet must still be DMA-ed into main memory. Also, DCA requires I/O device, chipset, CPU, and OS support (plus a signed permission slip from a legal guardian) in order to function.
Intel improved upon this with DDIO, which transparently supports DMA directly into and out of Last Level Cache (LLC), bypassing main memory. However, unlike early DCA, DDIO doesn’t support granular targeting of I/O data (i.e., per TLP), nor does it support DMA into remote socket LLCs. Still, DDIO as it stands delivers substantial benefits to I/O-intensive workloads by reducing both memory bandwidth usage and I/O latency.
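ddio-bench, used in the demo below, toggles DDIO per PCIe root port. According to the USENIX ATC ’20 paper in the references, the knob is the root port’s perfctrlsts_0 register at config-space offset 0x180: clearing the Use_Allocating_Flow_Wr bit (bit 7) and setting the NoSnoopOpWrEn bit (bit 3) turns DDIO off for devices behind that port. A rough setpci sketch of the same idea follows; the root-port address is made up, and the offset/bit positions are assumptions taken from that paper, so verify them against your platform’s uncore documentation before writing anything.
PORT=17:00.0                           # hypothetical root-port BDF; locate the port above your NIC with "lspci -t"
setpci -s $PORT 180.L                  # read perfctrlsts_0 (assumed config-space offset 0x180)
VAL=0x$(setpci -s $PORT 180.L)
setpci -s $PORT 180.L=$(printf '%x' $(( (VAL & ~0x80) | 0x08 )))   # clear Use_Allocating_Flow_Wr, set NoSnoopOpWrEn: DDIO off for this port
setpci -s $PORT 180.L=$(printf '%x' $(( (VAL | 0x80) & ~0x08 )))   # restore the defaults: DDIO back on for this port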
Intel DDIO Demo
We’ll demonstrate DDIO with a SockPerf “ping pong” test between hosts on the same VLAN equipped with Solarflare 10GbE NICs. The system-under-test (SUT) is a dual-socket Cascade Lake machine running CentOS 7.8 and using Solarflare OpenOnload 7.1.x. DDIO is toggled off/on between tests using ddio-bench. For both scenarios (DDIO on and off), the following command lines are used:
On the SUT host:
taskset -c 3 onload -p latency sockperf ping-pong -i 10.1.1.3 -p 5001 --msg-size=256 -t 10
This pins the SockPerf ping-pong client to core 3 on Socket #1, where the PCIe port of the Solarflare NIC is connected. It bypasses the Linux kernel with the “low latency” profile configuration of OpenOnload, and exchanges 256-byte messages with the server at 10.1.1.3:5001 for a duration of 10 seconds.
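If you want to confirm that locality on your own system, sysfs reports which NUMA node a NIC’s PCIe slot sits on and which cores are local to it (eth2 here is just a placeholder interface name):
cat /sys/class/net/eth2/device/numa_node      # NUMA node the NIC's PCIe slot belongs to (-1 if not reported)
cat /sys/class/net/eth2/device/local_cpulist  # CPUs local to that node; choose the taskset core from this list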
On the SockPerf Server host (an old Westmere):
taskset -c 3 onload -p latency sockperf server -i 10.1.1.3 -p 5001
This spins up a waiting server on its own 10.1.1.3 address, pins it to core 3 on Socket #1, and bypasses the kernel network stack in the same fashion as the SUT.
Here are the latency metrics as reported by SockPerf with DDIO disabled:
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=9.550 sec; SentMessages=1793530; ReceivedMessages=1793530
sockperf: ====> avg-latency=2.653 (std-dev=0.062)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 2.653 usec
sockperf: Total 1793530 observations; each percentile contains 17935.30 observations
sockperf: ---> <MAX> observation = 5.960
sockperf: ---> percentile 99.999 = 5.435
sockperf: ---> percentile 99.990 = 4.146
sockperf: ---> percentile 99.900 = 3.633
sockperf: ---> percentile 99.000 = 2.815
sockperf: ---> percentile 90.000 = 2.687
sockperf: ---> percentile 75.000 = 2.658
sockperf: ---> percentile 50.000 = 2.641
sockperf: ---> percentile 25.000 = 2.629
sockperf: ---> <MIN> observation = 2.572
Notice the difference in latency and message throughput with DDIO enabled:
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=9.550 sec; SentMessages=1849114; ReceivedMessages=1849114
sockperf: ====> avg-latency=2.572 (std-dev=0.048)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 2.572 usec
sockperf: Total 1849114 observations; each percentile contains 18491.14 observations
sockperf: ---> <MAX> observation = 5.717
sockperf: ---> percentile 99.999 = 5.152
sockperf: ---> percentile 99.990 = 3.915
sockperf: ---> percentile 99.900 = 3.451
sockperf: ---> percentile 99.000 = 2.666
sockperf: ---> percentile 90.000 = 2.598
sockperf: ---> percentile 75.000 = 2.579
sockperf: ---> percentile 50.000 = 2.566
sockperf: ---> percentile 25.000 = 2.556
sockperf: ---> <MIN> observation = 2.512
The DDIO-disabled run has a 99% confidence interval of 2.653μs +/- 0.000119μs, while the DDIO-enabled run has a 99% confidence interval of 2.572μs +/- 0.000091μs. That’s roughly an 80ns improvement (~290 cycles on a 3.6GHz CPU), and it results in 55,584 (1,849,114 – 1,793,530) more messages processed within the same 10s duration.
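Those confidence-interval half-widths fall straight out of the avg/std-dev/message counts reported above, using the usual z * std-dev / sqrt(N) with z ≈ 2.576 for a two-sided 99% interval:
awk 'BEGIN { printf "+/- %.6f us\n", 2.576 * 0.062 / sqrt(1793530) }'   # DDIO off -> +/- 0.000119 us
awk 'BEGIN { printf "+/- %.6f us\n", 2.576 * 0.048 / sqrt(1849114) }'   # DDIO on  -> +/- 0.000091 us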
Active Benchmarking
Proper benchmarking incorporates low-impact observability tools to determine the “Why?” behind benchmark results. Since DDIO provides direct paths to and from the LLC, let’s look at LLC miss ratios during longer SockPerf runs.
Notice the LLC miss rate and IPC for SockPerf with DDIO disabled:
[root@eltoro]# perf stat -e cycles:u,instructions:u,mem_load_retired.l3_miss:u,mem_load_retired.l2_miss:u -p 28461 -- sleep 10
Performance counter stats for process id '28461':
42,901,050,092 cycles:u
111,549,903,148 instructions:u # 2.60 insn per cycle
5,213,372 mem_load_retired.l3_miss:u
9,775,507 mem_load_retired.l2_miss:u
10.000510912 seconds time elapsed
Compare that to the LLC miss rate and IPC with DDIO enabled:
[root@eltoro]# perf stat -e cycles:u,instructions:u,mem_load_retired.l3_miss:u,mem_load_retired.l2_miss:u -p 27711 -- sleep 10
Performance counter stats for process id '27711':
42,901,181,882 cycles:u
117,305,434,741 instructions:u # 2.73 insn per cycle
37,401 mem_load_retired.l3_miss:u
6,629,667 mem_load_retired.l2_miss:u
10.000545539 seconds time elapsed
Because DDIO bypasses RAM and deposits packets directly into the LLC, retired L3 load misses drop by two orders of magnitude (5,213,372 vs. 37,401), taking the overall LLC miss ratio (l3_miss / l2_miss) from 53% with DDIO disabled down to 0.6% with it enabled. That helps explain the bump in IPC from 2.60 to 2.73.
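For reference, the 53% and 0.6% figures are simply the l3_miss counts divided by the l2_miss counts from the two perf runs above:
awk 'BEGIN { printf "%.1f%%\n", 100 * 5213372 / 9775507 }'   # DDIO off: 53.3%
awk 'BEGIN { printf "%.1f%%\n", 100 * 37401 / 6629667 }'     # DDIO on:  0.6%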
Reference: https://www.jabperf.com/skip-the-line-with-intel-ddio/#h-intel-ddio-tradeoffs
Reference (DDIO trade-offs): Farshin et al., USENIX ATC ’20, https://www.usenix.org/system/files/atc20-farshin.pdf
Reference: https://pmem.io/rpma/documentation/basic-direct-write-to-pmem/