CPU Virtualization

Masum Z. Hasan All Rights Reserved

x86 Architecture Basics

In this article I will explain internal details of how hypervisor-based virtualization is realized or implemented, focusing on Linux, QEMU, KVM and x86_64 computers (the concepts are similar for x86_32). The description below is conceptual rather than tied to the exact implementation of any specific system. I will explain the implementation of virtualization with a specific use case involving snippets of a socket program that receives or reads packets. An outline of socket programming is as follows:

  1. Create or open an (INET) socket with the socket API call, with arguments AF_INET (for IP), SOCK_STREAM (for TCP) and the protocol (0, or IPPROTO_TCP, for TCP).

  2. Bind the socket with IP address and port with the bind call.

  3. On server side:

    1. Listen on socket for connection with the listen call.

    2. Accept a connection with the accept call, which returns client information.

    3. Send / receive or read / write on socket.

  4. On client side:

    1. Connect to server (on socket).

    2. Send / receive or read / write on socket.

  5. Close socket.

Some of the operations are repeated in a loop: the server repeats the accept and send/receive steps (3b and 3c) for each connection, and a client repeats steps 1, 2, 4 and 5 for each connection it makes.

We will be using the following snippet of socket code:

int skid = socket (AF_INET, SOCK_STREAM, 0);

struct sockaddr_in caddr;
socklen_t caddrlen = sizeof caddr;
int cskid = accept (skid, (struct sockaddr *) &caddr, &caddrlen);

int n = read (cskid, buffer, buflen);

A snippet of userspace x86 code for the above calls is shown below. Note that I assume readers are familiar with the x86 architecture and assembly language to a certain extent.

See the Intel Manual or a brief intro.

Let me explain the above assembly code.

The userspace calls are ultimately implemented by the kernel. The library implementation maps these calls to kernel function calls, such as sys_socket, sys_accept and sys_read (the mechanism may differ; for example, these calls may be dispatched through a parent function called sys_socketcall).

Each sys_* function has a syscall number (on x86_64: 0 for read, 41 for socket, 43 for accept) associated with it, which has to be loaded into a register (rax). Arguments are loaded into other registers and onto the stack.

See the following references:

  • Calling conventions.

  • Linux syscall table.

  • Mac OSX syscall table.

  • Linux socket implementation.

The sys_* functions are privileged functions executed in kernel space. Hence there is a need for a transition to kernel space. Userspace executes at a current privilege level (CPL) of 3, whereas the kernel executes at a CPL of 0. The low 2 bits of the Code Segment (CS) register indicate the CPL. The transition is performed by the x86 instruction syscall (sysenter, or int 0x80, in x86_32). After the sys_* function executes, execution is switched back to userspace. The latter is performed by the sysret (sysexit in x86_32) instruction executing in kernel space.

The syscall instruction is executed in hardware; a snippet of its operation is copied below from the Intel manual (64-ia-32-architectures-software-developer-manual-325462.pdf, page 4-668, Vol. 2B).

As shown below, the content of the IA32_LSTAR MSR (Model Specific Register) is copied into the instruction pointer register (RIP) and the CPL is set to 0. The instruction pointed to by RIP executes after the syscall instruction completes.

The MSR is set up with the syscall entry point (entry_SYSCALL_64) when the (guest) OS initializes. A snippet of the init call, if the OS is Linux, is as follows (from the Linux kernel file common.c and the MSR index):

void syscall_init(void)
{
	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
	....
}

A snippet of Linux syscall entry-point is as follows (from /source/arch/x86/entry/entry_64.S).

The following figures explain the above.

Virtualization of interrupt handling and IO, as shown in the above figure, will be explained in another article.

Full Virtualization

In a virtualized environment the hardware is controlled by the host OS/hypervisor (in Linux, the combination of the host OS, QEMU and KVM provides virtualization support; in what follows we refer to this combination simply as the hypervisor). The syscalls issued by guests cannot be executed as usual. We will show how syscall can be virtualized in software and with hardware-assisted virtualization (with Intel VT-x and Linux KVM). Note that some of the Linux examples below are for concept-explanation purposes only; they may not be implemented in hypervisors or emulators exactly as described. Note also that I do not touch on paravirtualization.

  • Binary translation and full software emulation:

    • WRMSR emulation: Replace the wrmsr instruction with its emulation. Hence when the syscall_init() mentioned above is executed, the emulated MSR_LSTAR (IA32_LSTAR) will contain the saved syscall entry point (entry_SYSCALL_64).

    • Syscall emulation: Replace the syscall instruction, which otherwise executes in hardware (as shown above), with its emulation. The emulation will pick up the IA32_LSTAR value from the emulated MSR described in the wrmsr emulation above. Following is a snippet from the instruction emulation in KVM's emulate.c:

static const struct opcode twobyte_table[256] = {
	/* 0x00 - 0x0F */
	G(0, group6), GD(0, &group7), N, N,
	N, I(ImplicitOps | EmulateOnUD, em_syscall),
	...
};

static int em_syscall(struct x86_emulate_ctxt *ctxt)
{
	...
}

    • Syscall Function Emulation: The calls to the syscall functions (sys_*) also have to be emulated, up until sysret, after which non-privileged (CPL > 0) instructions can execute as usual without emulation.

  • Trap-and-Emulate: Instead of scanning the binary and replacing instructions with emulation function calls, let the instructions in the guest context run as usual. An instruction that can only be executed in a privileged context (CPL = 0) will trap into the hypervisor. Note that this does not work for all privileged or sensitive instructions (which we discuss later).

    • WRMSR handling: As described in the wrmsr instruction's execution semantics, a GP (general protection) fault is generated, which is handled by the host hypervisor. The GP service routine will find that execution of the wrmsr instruction was attempted in a non-privileged context. Hence, as the guest OS attempts to initialize the MSR during init, the GP service routine will emulate the instruction in the same way as described above. The emulated MSR will then contain the syscall entry point for later use.

    • Syscall Handling: Before a VM executes, the hypervisor could set up the hardware (non-emulated) IA32_LSTAR MSR with the value learned during syscall_init() so that syscall could execute as usual in hardware. But syscall sets the CPL to 0, and only the hypervisor should execute in the privileged context. The sys_* functions of a guest OS would then execute at CPL=0; yet any userspace code (which is what the guest is) should be considered untrusted and should not execute at CPL=0. Hence a hypervisor has to force an exception when an attempt is made to execute syscall in a guest context.

As described above in the hardware execution semantics of syscall, it traps with a UD (invalid opcode) exception if IA32_EFER.SCE is set to 0. Hence, prior to every execution of a guest VM, a hypervisor can set IA32_EFER.SCE=0. Then, if a guest attempts to execute a syscall, a UD exception will be generated, which will trap into the hypervisor. The exception-handling routine in the hypervisor will then emulate the syscall and the sys_* functions in the same way as described above. Obviously, this method can cause a substantial number of guest (VM) exits and the associated context switches, expending many cycles (hundreds to thousands).

The following figures explain the above.

  • Hardware-assisted Virtualization: In a virtualized environment, hardware resources (registers, NIC, APIC, LAPIC, etc.) and their operations, certain instructions, and code that is supposed to execute in a privileged context have to be emulated on behalf of the guests. With hardware-assisted virtualization, more and more of the software-based emulation is moved into hardware-assisted emulation or virtualization.

In Intel VT-x (CPU flag vmx) technology many of the capabilities mentioned above are virtualized in hardware. The details are outside the scope of this discussion; we cover only the hardware-assisted virtualization capabilities required for the use case described above. Registers and their state are virtualized in an in-memory structure called the VMCS, for both the guests and the host. The VMCS also contains VM-execution control and other fields. The hypervisor runs in VMX root mode and the guests run in VMX non-root mode. In the latter mode a guest can execute at CPL=0. Hence, for syscall and the sys_* functions, there is no need to exit to the hypervisor.

The following figures explain the above.


[1] There are two main socket types in Linux: the general BSD socket (struct socket, defined in include/linux/socket.h) and the IP-specific INET socket (struct sock, defined in include/net/sock.h). They are related: a BSD socket structure has an INET socket as a member, and an INET socket has a BSD socket as its parent.