DDIO makes the cache a direct DMA target. Consider what happens on a read in two cases: without DDIO and with DDIO.
Without DDIO
The DMA engine writes the incoming data straight into main memory.
When the CPU goes to read that data, it first looks in the cache. The data is not there, so it takes a cache miss.
Because of the cache miss, the data must be fetched from memory into the cache.
The CPU waits for the fill to complete, then reads the data from the cache.
With DDIO
DMA writes directly into the cache (I/O into cache, known as DCA, Direct Cache Access).
When the CPU reads the data, it avoids these extra cache misses.
The Memory Wall
In the “von Neumann” architecture upon which most servers today are based, the compute unit is dependent on bus-connected memory for both instructions and data. However, the ever-increasing processor/memory performance gap, known as “The Memory Wall”, poses scaling challenges for this model. And while bandwidth has improved incrementally, memory latency is at a virtual standstill. Chip designers mitigate this performance gap with larger core caches, on-die memory controllers, Simultaneous Multithreading (SMT), additional cores, larger out-of-order windows, etc. But the gulf in performance remains a prominent issue.
The situation is particularly problematic for storage and network I/O. Take a network adapter (NIC), for instance. Not only does it incur a 150+ ns PCIe packet forwarding latency, but it must DMA that packet into RAM. This adds an additional 60–100ns latency per memory access before the CPU can process it. That’s a lotta extra overhead, not to mention the memory bandwidth it robs from running applications. This bottleneck will only worsen once 100Gb+ NICs become more widely deployed. Hmm, if only the NIC could skip the line, right? Oh wait... it can! Enter Intel DDIO.
What is Intel DDIO?
DDIO, or Data Direct I/O, is Intel’s latest implementation of Direct Cache Access (DCA). DCA is a technique that Intel researchers proposed in ’05 to deal with the growing adoption of 10GbE. The processing budget for the smallest Ethernet frame at a 10Gb/s rate is only 67.2ns, smaller than a typical RAM access latency. Therefore, DCA aims to minimize, and ideally eliminate, main memory involvement in the I/O critical path.
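That 67.2ns figure comes from the minimum frame’s footprint on the wire: a 64-byte frame plus 8 bytes of preamble/SFD and a 12-byte inter-frame gap is 84 bytes, or 672 bits, per frame slot. A quick back-of-the-envelope check:
awk 'BEGIN { printf "%.1f ns\n", (64 + 8 + 12) * 8 / 10e9 * 1e9 }'   # 84B on the wire = 672 bits -> 67.2 ns at 10Gb/s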
Early implementations relied on PCIe TLP Processing Hints, which made it possible for chipsets to prefetch I/O data into targeted cache destinations and, thus, reduce CPU stall time. While this reduces latency, it does little for bandwidth since the packet must still be DMA-ed into main memory. Also, DCA requires I/O device, chipset, CPU, and OS support (plus a signed permission slip from a legal guardian) in order to function.
Intel improved upon this with DDIO, which transparently supports DMA directly into and out of Last Level Cache (LLC), bypassing main memory. However, unlike early DCA, DDIO doesn’t support granular targeting of I/O data (i.e., per TLP), nor does it support DMA into remote socket LLCs. Still, DDIO as it stands delivers substantial benefits to I/O-intensive workloads by reducing both memory bandwidth usage and I/O latency.
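ddio-bench, used in the demo below, toggles DDIO per PCIe root port. According to the USENIX ATC ’20 paper in the references, the knob is the root port’s perfctrlsts_0 register at config-space offset 0x180: clearing the Use_Allocating_Flow_Wr bit (bit 7) and setting the NoSnoopOpWrEn bit (bit 3) turns DDIO off for devices behind that port. A rough setpci sketch of the same idea follows; the root-port address is made up, and the offset/bit positions are assumptions taken from that paper, so verify them against your platform’s uncore documentation before writing anything.
PORT=17:00.0                           # hypothetical root-port BDF; locate the port above your NIC with "lspci -t"
setpci -s $PORT 180.L                  # read perfctrlsts_0 (assumed config-space offset 0x180)
VAL=0x$(setpci -s $PORT 180.L)
setpci -s $PORT 180.L=$(printf '%x' $(( (VAL & ~0x80) | 0x08 )))   # clear Use_Allocating_Flow_Wr, set NoSnoopOpWrEn: DDIO off for this port
setpci -s $PORT 180.L=$(printf '%x' $(( (VAL | 0x80) & ~0x08 )))   # restore the defaults: DDIO back on for this port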
Intel DDIO Demo
We’ll demonstrate DDIO with a SockPerf “ping pong” test between hosts on the same VLAN equipped with Solarflare 10GbE NICs. The system-under-test (SUT) is a dual-socket Cascade Lake machine running CentOS 7.8 and using Solarflare OpenOnload 7.1.x. DDIO is toggled off/on between tests using ddio-bench. For both scenarios (DDIO on and off), the following command lines are used:
On the SUT host:
taskset -c 3 onload -p latency sockperf ping-pong -i 10.1.1.3 -p 5001 --msg-size=256 -t 10
This pins the SockPerf ping-pong client to core 3 on Socket #1, where the PCIe port of the Solarflare NIC is connected. It bypasses the Linux kernel with the “low latency” profile configuration of OpenOnload, and exchanges 256-byte messages with the server at 10.1.1.3:5001 for a duration of 10 seconds.
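If you want to confirm that locality on your own system, sysfs reports which NUMA node a NIC’s PCIe slot sits on and which cores are local to it (eth2 here is just a placeholder interface name):
cat /sys/class/net/eth2/device/numa_node      # NUMA node the NIC's PCIe slot belongs to (-1 if not reported)
cat /sys/class/net/eth2/device/local_cpulist  # CPUs local to that node; choose the taskset core from this list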
On the SockPerf Server host (an old Westmere):
taskset -c 3 onload -p latency sockperf server -i 10.1.1.3 -p 5001
This spins up a waiting server on its own 10.1.1.3 address, pins it to core 3 on Socket #1, and bypasses the kernel network stack in the same fashion as the SUT.
Here are the latency metrics as reported by SockPerf with DDIO disabled:
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=9.550 sec; SentMessages=1793530; ReceivedMessages=1793530
sockperf: ====> avg-latency=2.653 (std-dev=0.062)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 2.653 usec
sockperf: Total 1793530 observations; each percentile contains 17935.30 observations
sockperf: ---> <MAX> observation = 5.960
sockperf: ---> percentile 99.999 = 5.435
sockperf: ---> percentile 99.990 = 4.146
sockperf: ---> percentile 99.900 = 3.633
sockperf: ---> percentile 99.000 = 2.815
sockperf: ---> percentile 90.000 = 2.687
sockperf: ---> percentile 75.000 = 2.658
sockperf: ---> percentile 50.000 = 2.641
sockperf: ---> percentile 25.000 = 2.629
sockperf: ---> <MIN> observation = 2.572
Notice the difference in latency and message throughput with DDIO enabled:
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=9.550 sec; SentMessages=1849114; ReceivedMessages=1849114
sockperf: ====> avg-latency=2.572 (std-dev=0.048)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 2.572 usec
sockperf: Total 1849114 observations; each percentile contains 18491.14 observations
sockperf: ---> <MAX> observation = 5.717
sockperf: ---> percentile 99.999 = 5.152
sockperf: ---> percentile 99.990 = 3.915
sockperf: ---> percentile 99.900 = 3.451
sockperf: ---> percentile 99.000 = 2.666
sockperf: ---> percentile 90.000 = 2.598
sockperf: ---> percentile 75.000 = 2.579
sockperf: ---> percentile 50.000 = 2.566
sockperf: ---> percentile 25.000 = 2.556
sockperf: ---> <MIN> observation = 2.512
The DDIO-disabled run has a 99% confidence interval of 2.653μs +/- 0.000119μs, while the DDIO-enabled run has a 99% confidence interval of 2.572μs +/- 0.000091μs. That’s roughly an 80ns improvement (~290 cycles on a 3.6GHz CPU), and it results in 55,584 (1,849,114 – 1,793,530) more messages processed within the same 10s duration.
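Those confidence-interval half-widths fall straight out of the avg/std-dev/message counts reported above, using the usual z * std-dev / sqrt(N) with z ≈ 2.576 for a two-sided 99% interval:
awk 'BEGIN { printf "+/- %.6f us\n", 2.576 * 0.062 / sqrt(1793530) }'   # DDIO off -> +/- 0.000119 us
awk 'BEGIN { printf "+/- %.6f us\n", 2.576 * 0.048 / sqrt(1849114) }'   # DDIO on  -> +/- 0.000091 us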
Active Benchmarking
Proper benchmarking incorporates low-impact observability tools to determine the “Why?” behind benchmark results. Since DDIO provides direct paths to and from the LLC, let’s look at LLC miss ratios during longer SockPerf runs.
Notice the LLC miss rate and IPC for SockPerf with DDIO disabled:
[root@eltoro]# perf stat -e cycles:u,instructions:u,mem_load_retired.l3_miss:u,mem_load_retired.l2_miss:u -p 28461 -- sleep 10
Performance counter stats for process id '28461':
42,901,050,092 cycles:u
111,549,903,148 instructions:u # 2.60 insn per cycle
5,213,372 mem_load_retired.l3_miss:u
9,775,507 mem_load_retired.l2_miss:u
10.000510912 seconds time elapsed
Compare that to the LLC miss rate and IPC with DDIO enabled:
[root@eltoro]# perf stat -e cycles:u,instructions:u,mem_load_retired.l3_miss:u,mem_load_retired.l2_miss:u -p 27711 -- sleep 10
Performance counter stats for process id '27711':
42,901,181,882 cycles:u
117,305,434,741 instructions:u # 2.73 insn per cycle
37,401 mem_load_retired.l3_miss:u
6,629,667 mem_load_retired.l2_miss:u
10.000545539 seconds time elapsed
Because DDIO bypasses RAM and deposits packets directly into the LLC, retired L3 load misses drop by two orders of magnitude (5,213,372 vs. 37,401), taking the overall LLC miss ratio (l3_miss / l2_miss) from 53% with DDIO disabled down to 0.6% with it enabled. That helps explain the bump in IPC from 2.60 to 2.73.
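For reference, the 53% and 0.6% figures are simply the l3_miss counts divided by the l2_miss counts from the two perf runs above:
awk 'BEGIN { printf "%.1f%%\n", 100 * 5213372 / 9775507 }'   # DDIO off: 53.3%
awk 'BEGIN { printf "%.1f%%\n", 100 * 37401 / 6629667 }'     # DDIO on:  0.6%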
Reference: https://www.jabperf.com/skip-the-line-with-intel-ddio/#h-intel-ddio-tradeoffs
Reference (DDIO trade-offs): Farshin et al., USENIX ATC ’20, https://www.usenix.org/system/files/atc20-farshin.pdf
Reference: https://pmem.io/rpma/documentation/basic-direct-write-to-pmem/