Memory Virtualization

Masum Z. Hasan All Rights Reserved


In a fully virtualized environment, the guest OS of a VM operates as usual, mapping its virtual or linear address space into what it believes is the physical address space (in the same way as in a non-virtualized environment, as described here). But the latter is not the real physical address space. Rather, it is a logical address space called the guest physical address space (GPAS). The hypervisor, which is in control of all the physical resources, has to map the GPAS to the real or host physical address space (HPAS), as shown conceptually in the following figure.
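To make the two mappings concrete, the following minimal C sketch simulates them: the guest's page table takes a guest virtual address (GVA) to a guest physical address (GPA), and the hypervisor's own map takes that GPA to a host physical address (HPA). The names (guest_pt, host_pmap), the single-level tables, and the tiny 16-page address spaces are illustrative assumptions, not any real hypervisor's data structures.

/* Sketch: two-stage address mapping, GVA -> GPA (guest) -> HPA (hypervisor). */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12              /* 4 KiB pages */
#define NPAGES     16              /* tiny address spaces for the example */

static uint64_t guest_pt[NPAGES];  /* guest page table: guest virtual page -> guest frame */
static uint64_t host_pmap[NPAGES]; /* hypervisor map:   guest frame -> host frame */

/* GVA -> GPA, as the guest OS sees it */
static uint64_t gva_to_gpa(uint64_t gva)
{
    uint64_t gvpn = gva >> PAGE_SHIFT;
    uint64_t off  = gva & ((1u << PAGE_SHIFT) - 1);
    return (guest_pt[gvpn] << PAGE_SHIFT) | off;
}

/* GPA -> HPA, which only the hypervisor can compute */
static uint64_t gpa_to_hpa(uint64_t gpa)
{
    uint64_t gpfn = gpa >> PAGE_SHIFT;
    uint64_t off  = gpa & ((1u << PAGE_SHIFT) - 1);
    return (host_pmap[gpfn] << PAGE_SHIFT) | off;
}

int main(void)
{
    /* Illustrative mappings: guest virtual page 2 -> guest frame 5,
     * and guest frame 5 -> host frame 9. */
    guest_pt[2]  = 5;
    host_pmap[5] = 9;

    uint64_t gva = (2ull << PAGE_SHIFT) | 0x123;
    uint64_t gpa = gva_to_gpa(gva);
    uint64_t hpa = gpa_to_hpa(gpa);

    printf("GVA 0x%llx -> GPA 0x%llx -> HPA 0x%llx\n",
           (unsigned long long)gva, (unsigned long long)gpa,
           (unsigned long long)hpa);
    return 0;
}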

The goal of memory or MMU virtualization is to map a guest virtual address (GVA) to a host physical address (HPA). The process can be complex, and multiple approaches are possible in software-based memory virtualization. In one approach, the hypervisor keeps a shadow page table (SPT) for each process in a VM. The SPT maps a GVA directly to an HPA, and this mapping is what is installed in the hardware TLB for quick resolution of the HPA. Following is a conceptual description of the SPT mechanism (a simplified code sketch follows the list):

1. The guest OS operates as usual, managing its guest page table (GPT) to map GVAs to GPAs, as discussed here.

2. The hypervisor creates an SPT per process during initialization.

3. The guest creates its page table in a region of memory that the hypervisor marks as write-protected. Hence, every time the guest attempts to write into that protected region, it traps into the hypervisor via a memory-protection fault. The hypervisor then emulates the write the guest attempted, handling the following cases:

3.1. The guest attempts to create a new PTE. In this case the hypervisor updates the SPT with the corresponding GVA-to-HPA mapping.

3.2. The guest attempts to install a new translation in the TLB, which causes an exit.

3.3. The guest attempts to invalidate a TLB entry, which causes an exit.

3.4. The guest attempts to set the dirty (D) bit to record a write to a page frame, which causes an exit.

4. A page fault may occur when the guest OS cannot find the page frame (in its guest physical address space). It then has to allocate the frame and update the PTE. But the page fault itself traps to the hypervisor, which injects it back into the guest. As a result, steps 3.1 and 3.2 are repeated, and possibly 3.3 as well.

5. When the guest OS context-switches to a new process, it attempts to load the page table directory base address (PTDBA) into the CR3 register. Being a privileged operation, this causes a switch to the hypervisor via a GP fault. The hypervisor emulates the loading of CR3. The PTDBA that the guest tries to load is not a host physical address but a GPA (an address that the guest thinks is a physical memory address). The hypervisor (emulator) loads into CR3 the host physical address of the PTDBA and resumes the guest.
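The following minimal C sketch illustrates the core of step 3.1 above under simplifying assumptions: on the write-protection trap, the hypervisor emulates the guest's PTE write and installs the composed GVA-to-HPA entry in the shadow page table, which is what the hardware actually walks. The names (guest_pt, host_pmap, shadow_pt, on_guest_pte_write) and the single-level, 16-entry tables are illustrative; a real hypervisor deals with multi-level tables, permission and dirty bits, and TLB coherence.

/* Sketch: refreshing a shadow page table entry on a guest PTE write trap. */
#include <stdint.h>
#include <stdio.h>

#define NPAGES 16

static uint64_t guest_pt[NPAGES];  /* guest PT: guest virtual page -> guest frame (guest-visible) */
static uint64_t host_pmap[NPAGES]; /* hypervisor: guest frame -> host frame */
static uint64_t shadow_pt[NPAGES]; /* SPT: guest virtual page -> host frame (walked by hardware) */

/* Invoked on the write-protection fault taken when the guest writes a PTE. */
static void on_guest_pte_write(uint64_t gvpn, uint64_t new_gpfn)
{
    guest_pt[gvpn]  = new_gpfn;             /* perform the write on the guest's behalf */
    shadow_pt[gvpn] = host_pmap[new_gpfn];  /* install the GVA -> HPA mapping in the SPT */
    printf("trap: guest PTE %llu -> GPFN %llu, shadow entry -> HPFN %llu\n",
           (unsigned long long)gvpn, (unsigned long long)new_gpfn,
           (unsigned long long)shadow_pt[gvpn]);
}

int main(void)
{
    host_pmap[5] = 9;          /* hypervisor backs guest frame 5 with host frame 9 */
    on_guest_pte_write(2, 5);  /* guest maps its virtual page 2 to guest frame 5 */
    return 0;
}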

It is obvious that a substantial number of hypervisor entries and exits may occur during page table manipulation. Optimized schemes are possible, but even with a certain level of optimization, a substantial number of switches between the VM and the hypervisor will happen, in addition to multiple levels of memory accesses for resolving a GVA to an HPA.

Hardware-assisted Memory or MMU Virtualization

It is obvious that there are substantial overheads in the software-based memory virtualization described above. These overheads can be alleviated via hardware-based acceleration of memory or MMU virtualization. Intel introduced hardware support for memory virtualization called EPT (extended page tables)[1]. Similarly, AMD introduced RVI (rapid virtualization indexing)[2]. As shown in the figure below, there are two levels of page tables, the second one similar to a shadow page table but supported in hardware. The second level is invisible to the guest and controlled by the hypervisor. The page walker of the physical CPU walks both levels of tables, resolving the guest virtual address to a host physical address, which is cached in the TLB. Performance of a workload can still suffer if the characteristics of the workload cause, for example, a higher number of TLB misses.
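To see why TLB misses matter more under nested paging, the following rough C sketch counts the memory accesses of a worst-case two-dimensional page walk, assuming a 4-level guest page table and a 4-level EPT and ignoring paging-structure caches. The formula and numbers are a simplification for illustration, not a measurement of any particular CPU.

/* Sketch: approximate cost of a TLB miss under nested (two-dimensional) paging. */
#include <stdio.h>

/* With g guest page-table levels and h EPT/NPT levels, each of the g guest
 * PTE fetches plus the final data access is a guest-physical reference that
 * must itself be walked through the EPT (h fetches each), and the g guest
 * PTE fetches are additional accesses of their own. */
static int nested_walk_accesses(int guest_levels, int ept_levels)
{
    return (guest_levels + 1) * ept_levels + guest_levels;
}

int main(void)
{
    int native = 4;                           /* 4-level walk, no virtualization */
    int nested = nested_walk_accesses(4, 4);  /* 4-level guest PT, 4-level EPT */
    printf("native TLB miss: %d memory accesses\n", native);
    printf("nested TLB miss: %d memory accesses\n", nested);
    return 0;
}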


[1] In multi-core Nehalem.

[2] In multi-core Opteron.