Masum Z. Hasan All Rights Reserved
We explain interrupts, faults and traps (events, for short) with respect to the above table. The x86 architecture defines 256 interrupt and exception vectors (0 to 255). Of these, vectors 0 to 31 are reserved by the ISA for (static) system events, as shown in the table. The table, which is owned and managed by the OS, is called the interrupt descriptor table or IDT. The number in the first column is known, in general, as a vector. Device interrupts are also identified by an IRQ (interrupt request) number, which an interrupt controller maps to a vector. When an event occurs, its vector is reported to the CPU, which then invokes the event handler that the OS or kernel has registered for that vector. These events transition the CPL to the most privileged level (CPL=0) so that the handler executes with the highest privilege. Let us now distinguish between the terms trap, fault and interrupt.
Trap: A trap occurs when an action of a userspace (CPL=3) program (see the table above) results in a “trap” to the highest privilege level. A trap is an exception that is reported immediately after the trapping instruction executes. Once the trap is handled, execution resumes at the instruction following the trapping instruction. The following are examples of process actions that can generate traps:
Userspace program generated traps, such as
INT 3 (breakpoint) for debugging.
INT 0x80 (in Linux), which can be inserted in userspace code to force a trap to CPL=0. When INT 0x80 is executed, where 0x80 (decimal 128) is the vector, the CPU transitions to CPL=0 and executes the handler registered for vector 0x80 (see below). Note that it is better to use the fast system call instructions SYSENTER/SYSEXIT or SYSCALL/SYSRET rather than INT 0x80 (see here).
Privileged instruction execution at CPL=3, which generates a general protection exception (#GP). (The architecture classifies #GP as a fault, although in this case it arguably should be a trap rather than a fault.)
Fault: An instruction faults during execution for some reason. After the fault is handled, the faulting instruction is re-executed, by which time the cause of the fault should have been resolved. Examples of faults are:
Exception, such as divide by 0.
Page fault.
Interrupt: Interrupts are typically generated by a device when an asynchronous event occurs, such as the arrival of a packet. The interrupt or IRQ is reported to a CPU, which then stops executing the current program (when the interrupt window is open) and transitions to CPL=0 to execute the interrupt handler or interrupt service routine (ISR). See below for details. The following are the types of interrupts:
Maskable interrupts (MI): IO device generated interrupts fall in this category. This type of interrupt can be masked by clearing the IF flag in EFLAGS with the CLI or POPF instruction. That is, a program that has cleared IF will not be interrupted when an MI occurs; the interrupt stays pending until IF is set again. An MI may also be referred to as an IRQ (interrupt request). Each IRQ is mapped to an interrupt vector by an interrupt controller, which the OS programs at bootup. This interrupt can be delivered to a processor via the INTR pin or via a LAPIC (see below) message.
Non-maskable interrupt (NMI): generated for events such as a watchdog timer expiry or a hardware error. The ISR for interrupt vector 2 is invoked to service this type of interrupt. An NMI is serviced immediately, and further NMIs are held off until its handler completes. This interrupt can be delivered to a processor via the NMI pin or via a LAPIC message.
Software (generated) interrupts: We can consider INT 0x80 as generating a software interrupt. From the execution perspective, however, it is not asynchronous and is rather a trap.
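The masking rule for the interrupt types above can be sketched as a small predicate. This is an illustrative user-space model, not processor or kernel code: the enum and the function name are hypothetical, and the only architectural fact it relies on is that IF is bit 9 of EFLAGS.

```c
#include <stdbool.h>
#include <stdint.h>

/* IF (interrupt enable flag) is bit 9 of EFLAGS/RFLAGS. */
#define EFLAGS_IF (UINT32_C(1) << 9)

/* Hypothetical classification, used only in this sketch. */
enum irq_kind { IRQ_MASKABLE, IRQ_NMI, IRQ_SOFTWARE };

/* Returns true if an event of the given kind is taken with this EFLAGS value. */
static bool event_is_taken(enum irq_kind kind, uint32_t eflags)
{
    switch (kind) {
    case IRQ_MASKABLE:
        /* Held pending while IF is clear (after CLI or POPF). */
        return (eflags & EFLAGS_IF) != 0;
    case IRQ_NMI:      /* not affected by IF */
    case IRQ_SOFTWARE: /* INT n is synchronous; IF does not mask it */
        return true;
    }
    return false;
}
```

For example, event_is_taken(IRQ_MASKABLE, 0x002) is false because IF is clear, while the same EFLAGS value still lets an NMI or an INT n through.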
The memory structure of an IDT is shown in the figure below. Each interrupt is assigned a vector, which is used as an index into the IDT. As shown in the figure, the selected IDT entry points to the address of the code that is executed to service the interrupt, called the interrupt service routine or ISR. For example, when a packet is received, an interrupt is generated, which causes the relevant ISR to be executed.
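To make the indexing concrete, here is a minimal sketch of an IDT entry and of how a handler address is split across it. It assumes the 64-bit (long mode) gate layout, which may differ from the layout drawn in the figure; the helper names idt_set_gate and idt_gate_handler are hypothetical, and the table here is an ordinary array rather than one loaded with LIDT.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit (long mode) IDT gate descriptor: 16 bytes per vector. */
struct idt_gate {
    uint16_t offset_low;   /* handler address bits 0..15 */
    uint16_t selector;     /* code segment selector */
    uint8_t  ist;          /* bits 0..2: interrupt stack table index */
    uint8_t  type_attr;    /* present bit, DPL and gate type */
    uint16_t offset_mid;   /* handler address bits 16..31 */
    uint32_t offset_high;  /* handler address bits 32..63 */
    uint32_t reserved;
};

#define IDT_ENTRIES         256
#define GATE_PRESENT        0x80
#define GATE_TYPE_INTERRUPT 0x0E  /* IF is cleared on entry */
#define GATE_TYPE_TRAP      0x0F  /* IF is left unchanged */

/* Fill the entry selected by 'vector' (hypothetical helper). */
static void idt_set_gate(struct idt_gate *idt, unsigned vector, uint64_t handler,
                         uint16_t selector, uint8_t dpl, uint8_t type)
{
    assert(vector < IDT_ENTRIES);
    struct idt_gate *g = &idt[vector];

    g->offset_low  = (uint16_t)(handler & 0xFFFF);
    g->selector    = selector;
    g->ist         = 0;
    g->type_attr   = (uint8_t)(GATE_PRESENT | ((dpl & 3u) << 5) | (type & 0x0Fu));
    g->offset_mid  = (uint16_t)((handler >> 16) & 0xFFFF);
    g->offset_high = (uint32_t)(handler >> 32);
    g->reserved    = 0;
}

/* Reassemble the ISR address that the gate points to. */
static uint64_t idt_gate_handler(const struct idt_gate *g)
{
    return (uint64_t)g->offset_low |
           ((uint64_t)g->offset_mid << 16) |
           ((uint64_t)g->offset_high << 32);
}
```

A gate that userspace may invoke with INT n, such as vector 0x80, needs DPL=3; gates for device interrupts are normally installed with DPL=0.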
In general, when an exception or interrupt occurs, the following operations may be performed:
The program or instruction (P1) at or after which the event was raised is stopped or interrupted. In the x86 architecture the presence of a pending interrupt is checked after the execution of each CPU instruction.
A context switch is initiated, which may include storing the current state of P1's CPU registers. This may be done by a combination of software and processor hardware.
The processor transitions to Ring 0.
The relevant ISR or code segment (IH1) is executed, which may first save those register contents of P1 that have not been saved by hardware.
After IH1 completes, one of the following steps is taken:
Restart P1 at the instruction that was interrupted, for faults such as a page fault.
Resume P1 after the instruction that was interrupted, for traps and interrupts.
Halt P1.
Abort P1.
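The restart rule in the steps above can be written down as a tiny function. This is a sketch under simplifying assumptions: the names are hypothetical, halting and aborting are left out, and a trap raised by a taken branch (which would resume at the branch target) is ignored.

```c
#include <stdint.h>

/* Hypothetical event classes, following the terminology of this section. */
enum event_class { EVENT_FAULT, EVENT_TRAP, EVENT_INTERRUPT };

/*
 * insn_addr: address of the instruction at (or after) which the event was raised
 * insn_len:  length of that instruction in bytes
 * Returns the address at which P1 continues once the handler has finished.
 */
static uint64_t resume_address(enum event_class cls, uint64_t insn_addr,
                               unsigned insn_len)
{
    switch (cls) {
    case EVENT_FAULT:
        return insn_addr;            /* re-execute the faulting instruction */
    case EVENT_TRAP:
    case EVENT_INTERRUPT:
        return insn_addr + insn_len; /* continue with the next instruction */
    }
    return insn_addr;
}
```

For instance, a page fault on a 3-byte instruction at 0x401000 resumes at 0x401000, whereas the 2-byte INT 0x80 at the same address resumes at 0x401002.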
Interrupt Hardware
In order to virtualize IO and interrupts, the relevant hardware is either emulated in software (full or trap-and-emulate) or the emulation is accelerated in hardware via hardware-assisted virtualization. Yes, hardware capabilities are emulated in hardware! Before we move on to explaining possible methods of interrupt and IO virtualization, we provide a brief overview of interrupt hardware. For virtualization, we will focus mainly on LAPIC virtualization.
Advanced Programmable Interrupt Controller
The legacy interrupt hardware in x86 systems consisted of a programmable interrupt controller (PIC), such as the Intel 8259 used in uniprocessor systems. The PIC evolved into the advanced PIC (APIC), and later into the xAPIC and x2APIC. The APIC has two components: the local APIC or LAPIC and the IO APIC [details in Intel document].
In the xAPIC architecture the LAPIC and IO APIC communicate via the system bus, whereas in the original APIC architecture they communicate via a dedicated APIC bus. The LAPIC ID is 8 bits wide in xAPIC mode and 32 bits wide in x2APIC mode. In xAPIC mode the LAPIC registers are accessed through MMIO (default base address FEE00000H, which can be relocated to another memory region), whereas in x2APIC mode they are accessed through MSRs (range 800H-BFFH). In an SMP/NUMA computer each core has its own LAPIC, and the core can be identified by its LAPIC ID. The x2APIC requires an IO memory management unit (IOMMU), such as Intel VT-d. [FIG] shows some of the LAPIC registers (see REF: Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1 Chapter 10 for details).
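The two register access methods are related by a simple rule: the x2APIC MSR index of a register is 800H plus its xAPIC MMIO offset divided by 16. The sketch below illustrates that mapping; the macro and function names are hypothetical, and it only computes addresses, it does not touch any hardware.

```c
#include <stdint.h>

#define XAPIC_DEFAULT_BASE UINT64_C(0xFEE00000) /* default MMIO base of the LAPIC */
#define X2APIC_MSR_BASE    UINT32_C(0x800)      /* first x2APIC MSR */

/* xAPIC register offsets from the MMIO base. */
#define APIC_REG_ID  UINT32_C(0x020)
#define APIC_REG_TPR UINT32_C(0x080)
#define APIC_REG_EOI UINT32_C(0x0B0)

/* Physical address of a LAPIC register in xAPIC (MMIO) mode. */
static uint64_t xapic_mmio_address(uint64_t base, uint32_t reg_offset)
{
    return base + reg_offset;
}

/* MSR index of the same register in x2APIC mode (registers are 16 bytes apart). */
static uint32_t x2apic_msr(uint32_t reg_offset)
{
    return X2APIC_MSR_BASE + (reg_offset >> 4);
}
```

With these definitions the EOI register is at 0xFEE000B0 in xAPIC mode and at MSR 0x80B in x2APIC mode. One known exception to the one-to-one mapping is the interrupt command register, whose two 32-bit MMIO halves are merged into a single 64-bit MSR.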
A few of the LAPIC registers, and the operations performed by hardware after the OS writes to the EOI register, are shown in the figure below. The post-EOI operations are shown only partially (for example, the Processor Priority Register (PPR) update is omitted).
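The effect of an EOI write can be approximated with a toy model of two of the LAPIC registers, the IRR (requests accepted but not yet dispatched) and the ISR (interrupts in service). This is a simplified sketch, not the hardware algorithm: the structure and function names are hypothetical, the TPR is treated as 0, and the PPR is reduced to the priority class (vector >> 4) of the highest in-service vector.

```c
#include <stdint.h>

struct lapic_model {
    uint32_t irr[8]; /* 256 bits: accepted, waiting to be dispatched */
    uint32_t isr[8]; /* 256 bits: dispatched, waiting for EOI */
};

static void set_bit256(uint32_t *reg, unsigned v)
{
    reg[v / 32] |= UINT32_C(1) << (v % 32);
}

static void clear_bit256(uint32_t *reg, unsigned v)
{
    reg[v / 32] &= ~(UINT32_C(1) << (v % 32));
}

/* Highest vector set in a 256-bit register, or -1 if the register is empty. */
static int highest_bit256(const uint32_t *reg)
{
    for (int v = 255; v >= 0; v--)
        if (reg[v / 32] & (UINT32_C(1) << (v % 32)))
            return v;
    return -1;
}

/* An interrupt message for 'vector' arrives at the LAPIC. */
static void lapic_accept(struct lapic_model *l, unsigned vector)
{
    set_bit256(l->irr, vector & 0xFF);
}

/* Move the highest pending vector to the ISR if its priority class is higher
 * than that of the highest in-service vector. Returns the vector or -1. */
static int lapic_dispatch(struct lapic_model *l)
{
    int pending = highest_bit256(l->irr);
    int in_service = highest_bit256(l->isr);

    if (pending < 0)
        return -1;
    if (in_service >= 0 && (pending >> 4) <= (in_service >> 4))
        return -1;
    clear_bit256(l->irr, (unsigned)pending);
    set_bit256(l->isr, (unsigned)pending);
    return pending;
}

/* EOI write: retire the highest in-service vector, then try to dispatch. */
static int lapic_eoi(struct lapic_model *l)
{
    int in_service = highest_bit256(l->isr);

    if (in_service >= 0)
        clear_bit256(l->isr, (unsigned)in_service);
    return lapic_dispatch(l);
}
```

In this model a pending vector of the same priority class as the one in service (for example 0x45 while 0x41 is in service) stays in the IRR until the EOI for the in-service vector has been written.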
A set of LAPICs is served by an interrupt delivery system called the IO APIC, which routes interrupts from external devices to the LAPICs. When an I/O device raises an interrupt on one of the interrupt lines connected to the IO APIC, the line is used to select a 64-bit entry in the redirection table, a table with 24 entries that is set up by an OS or driver. The information in the entry is used to format an interrupt request message. Each entry in the table can be individually programmed to indicate whether the interrupt signal is edge or level sensitive, the interrupt vector, its priority, the destination processor or core, and how the processor is selected (statically or dynamically). The default address of the IO APIC is 0xFEC00000, which can be relocated through chipset configuration. In a multi-processor system external interrupts can be distributed among the LAPICs (in cores) as follows:
Statically based on the redirection table.
Dynamically, where the interrupt is delivered to the CPU running at the lowest priority, as determined by the TPR and other registers. An OS may have to update these registers frequently so that interrupts are distributed in an optimized or load-balanced way (on top of whatever arbitration mechanism the APIC supports).
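A redirection table entry can be sketched as a pair of encode and decode helpers. The bit positions follow the classic IO APIC entry layout (vector in bits 0-7, delivery mode in bits 8-10, destination mode in bit 11, polarity in bit 13, trigger mode in bit 15, mask in bit 16, destination in bits 56-63); the structure and function names are hypothetical, and read-only status bits are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Delivery modes relevant to static and dynamic distribution. */
enum ioapic_delivery { IOAPIC_FIXED = 0, IOAPIC_LOWEST_PRIORITY = 1 };

/* Hypothetical unpacked view of one 64-bit redirection table entry. */
struct ioapic_route {
    uint8_t vector;                /* interrupt vector to raise */
    enum ioapic_delivery delivery; /* fixed (static) or lowest priority (dynamic) */
    bool logical_dest;             /* false: physical APIC ID, true: logical set */
    bool active_low;               /* pin polarity */
    bool level_triggered;          /* false: edge, true: level */
    bool masked;                   /* true: interrupt line disabled */
    uint8_t destination;           /* APIC ID or logical destination */
};

static uint64_t ioapic_rte_encode(const struct ioapic_route *r)
{
    uint64_t e = 0;

    e |= (uint64_t)r->vector;
    e |= (uint64_t)(r->delivery & 7) << 8;
    e |= (uint64_t)r->logical_dest << 11;
    e |= (uint64_t)r->active_low << 13;
    e |= (uint64_t)r->level_triggered << 15;
    e |= (uint64_t)r->masked << 16;
    e |= (uint64_t)r->destination << 56;
    return e;
}

static struct ioapic_route ioapic_rte_decode(uint64_t e)
{
    struct ioapic_route r;

    r.vector          = (uint8_t)(e & 0xFF);
    r.delivery        = (enum ioapic_delivery)((e >> 8) & 7);
    r.logical_dest    = ((e >> 11) & 1) != 0;
    r.active_low      = ((e >> 13) & 1) != 0;
    r.level_triggered = ((e >> 15) & 1) != 0;
    r.masked          = ((e >> 16) & 1) != 0;
    r.destination     = (uint8_t)(e >> 56);
    return r;
}
```

A statically routed interrupt would use IOAPIC_FIXED with a physical destination, while dynamic distribution would use IOAPIC_LOWEST_PRIORITY with a logical destination that covers several cores.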
Message Signaled Interrupts
The PCI (Peripheral Component Interconnect) Local Bus Specification, Rev 2.2 [REF www.pcisig.com] introduced the concept of message signaled interrupts or MSI (as opposed to pin-based interrupts), which allow a device to write directly to a core’s LAPIC, thus bypassing the IO APIC. The MSI model supports a higher number of external interrupts. PCIe devices do not have interrupt pins. In an (Intel) x86 environment, when a device is ready to raise an interrupt, it constructs a message and sends it on the bus. The message contains the target LAPIC and the interrupt vector number, among other information. Once the LAPIC receives the message, it delivers the interrupt to its core for handling. The MSI message content is initially configured by an OS in device registers called the message address register (MAR), the message data register (MDR) and other registers. These registers are configured through the PCI configuration space via IO or memory-mapped IO. MSI-X is an extension of MSI that supports more interrupt vectors. The MAR and MDR format, which is called the compatibility format, has been modified to support interrupt remapping in VT-d.
Interrupt Handling Use Case in Linux
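As an illustration of what the OS programs into the MAR and MDR, the sketch below composes a compatibility-format message on x86. It is deliberately reduced to the simplest case (fixed delivery, physical destination, edge trigger), in which the address carries the destination LAPIC ID in bits 12-19 on top of the FEE00000H base and the data carries the vector in bits 0-7. The structure and function names are hypothetical.

```c
#include <stdint.h>

#define MSI_ADDR_BASE UINT32_C(0xFEE00000) /* upper address bits select the LAPIC range */

/* Hypothetical container for the values written to the MAR and MDR. */
struct msi_msg {
    uint32_t address; /* message address register contents */
    uint32_t data;    /* message data register contents */
};

/* Compose a message for fixed delivery, physical destination, edge trigger. */
static struct msi_msg msi_compose(uint8_t dest_apic_id, uint8_t vector)
{
    struct msi_msg m;

    m.address = MSI_ADDR_BASE | ((uint32_t)dest_apic_id << 12);
    m.data    = vector; /* delivery mode and trigger bits left at 0 */
    return m;
}

/* Destination LAPIC ID encoded in the address. */
static uint8_t msi_dest_id(const struct msi_msg *m)
{
    return (uint8_t)((m->address >> 12) & 0xFF);
}

/* Vector encoded in the data. */
static uint8_t msi_vector(const struct msi_msg *m)
{
    return (uint8_t)(m->data & 0xFF);
}
```

For example, msi_compose(3, 0x41) yields the address 0xFEE03000 and the data 0x41: when the device writes that data to that address, the LAPIC with ID 3 receives vector 0x41.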
1. Packet arrives.
2. IRQ sent to CPU (it can be configured to be sent to a specific core: see here). Check /proc/interrupts for IRQ numbers and stats.
3. Kernel invokes IRQ handler (ISR) defined in and registered by the device driver.
4. ISR may invoke NAPI polling in Softirq context (see here for Softirq definition).
5. Once ISR returns, kernel sends an EOI (End of Interrupt) message to LAPIC (writes into the EOI MMIO register or EOI MSR).
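The per-IRQ statistics mentioned in step 2 can be read programmatically. The following sketch sums the per-CPU counters of one line of /proc/interrupts; it assumes the usual line layout (an IRQ number followed by a colon, one counter per CPU, then the controller and device names), and the function name is hypothetical.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Parses one line of /proc/interrupts, e.g.
 *   " 24:      12345      67890   PCI-MSI 524288-edge      eth0"
 * Stores the IRQ number in *irq_out and returns the sum of the per-CPU
 * counters. Returns -1 for lines without a numeric IRQ label (NMI:, LOC:, ...).
 */
static long long irq_line_total(const char *line, int *irq_out)
{
    char *end;
    long irq = strtol(line, &end, 10);

    if (end == line || *end != ':')
        return -1;

    const char *p = end + 1;
    unsigned long long total = 0;

    for (;;) {
        unsigned long long n = strtoull(p, &end, 10);

        /* Stop at the first token that is not a plain number. */
        if (end == p || (*end != '\0' && !isspace((unsigned char)*end)))
            break;
        total += n;
        p = end;
    }
    *irq_out = (int)irq;
    return (long long)total;
}
```

A caller would typically read /proc/interrupts line by line with fgets() and pass each line to this function, skipping the header line and the lines for which it returns -1.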
The above sequence (steps 3-5) is reflected in the following code snippets from the e1000e Linux driver and the Linux kernel.
Interrupt handler code snippet in the e1000e driver (see code here):
/**
 * e1000_intr - Interrupt Handler
 * @irq: interrupt number
 * @data: pointer to a network interface device structure
 **/
static irqreturn_t e1000_intr(int __always_unused irq, void *data)
{
    …
    /* Schedule NAPI polling unless it is already scheduled */
    if (napi_schedule_prep(&adapter->napi)) {
        adapter->total_tx_bytes = 0;
        adapter->total_tx_packets = 0;
        adapter->total_rx_bytes = 0;
        adapter->total_rx_packets = 0;
        __napi_schedule(&adapter->napi);
    }
    return IRQ_HANDLED;
}
In Linux the interrupt handler is kept very light. The “heavy lifting” is done by the softirq (see here). The handler only schedules NAPI (code snippet below), and the NAPI polling itself is invoked in softirq context.
void __napi_schedule(struct napi_struct *n)
{
    unsigned long flags;

    local_irq_save(flags);
    ____napi_schedule(this_cpu_ptr(&softnet_data), n);
    local_irq_restore(flags);
}
Code snippet that raises the softirq:
/* Called with irq disabled */
static inline void ____napi_schedule(struct softnet_data *sd, struct napi_struct *napi)
{
    list_add_tail(&napi->poll_list, &sd->poll_list);
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}
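The split between the light interrupt handler and the deferred softirq work can be imitated in user space. The toy model below is not kernel code: every name prefixed with toy_ is hypothetical, raising a softirq is reduced to setting a bit in a pending mask, and the deferred handlers run only when toy_do_softirq() is called.

```c
#include <stdint.h>

/* Softirq numbers; NET_RX is index 3 in the kernel's softirq enumeration. */
enum { TOY_NET_RX_SOFTIRQ = 3, TOY_NR_SOFTIRQS = 10 };

typedef void (*toy_softirq_fn)(void);

static uint32_t toy_pending;                    /* one bit per raised softirq */
static toy_softirq_fn toy_vec[TOY_NR_SOFTIRQS]; /* registered handlers */
static int toy_polls;                           /* how often the poll work ran */

/* Register the handler for a softirq number. */
static void toy_open_softirq(unsigned nr, toy_softirq_fn fn)
{
    toy_vec[nr] = fn;
}

/* Mark a softirq as pending; this does not run the handler. */
static void toy_raise_softirq(unsigned nr)
{
    toy_pending |= UINT32_C(1) << nr;
}

/* Bottom half: stands in for the NAPI poll loop. */
static void toy_net_rx_action(void)
{
    toy_polls++;
}

/* Top half: the "interrupt handler" only raises the softirq and returns. */
static void toy_hard_irq(void)
{
    toy_raise_softirq(TOY_NET_RX_SOFTIRQ);
}

/* Run all pending softirq handlers; returns how many were run. */
static int toy_do_softirq(void)
{
    int ran = 0;
    uint32_t pending = toy_pending;

    toy_pending = 0;
    for (unsigned nr = 0; nr < TOY_NR_SOFTIRQS; nr++) {
        if ((pending & (UINT32_C(1) << nr)) && toy_vec[nr]) {
            toy_vec[nr]();
            ran++;
        }
    }
    return ran;
}
```

After toy_open_softirq(TOY_NET_RX_SOFTIRQ, toy_net_rx_action) and a call to toy_hard_irq(), the poll work has not run yet; it runs once on the next toy_do_softirq() call, which mirrors how the packet processing is deferred out of the interrupt handler.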
Interrupt handler being registered: See code here.
/**
 * e1000_request_irq - initialize interrupts
 *
 * Attempts to configure interrupts using the best available
 * capabilities of the hardware and kernel.
 **/
static int e1000_request_irq(struct e1000_adapter *adapter)
{
    struct net_device *netdev = adapter->netdev;
    int err;

    if (adapter->msix_entries) {
        err = e1000_request_msix(adapter);
        if (!err)
            return err;
        /* fall back to MSI */
        e1000e_reset_interrupt_capability(adapter);
        adapter->int_mode = E1000E_INT_MODE_MSI;
        e1000e_set_interrupt_capability(adapter);
    }
    if (adapter->flags & FLAG_MSI_ENABLED) {
        err = request_irq(adapter->pdev->irq, e1000_intr_msi, 0,
                          netdev->name, netdev);
        if (!err)
            return err;
        /* fall back to legacy interrupt */
        e1000e_reset_interrupt_capability(adapter);
        adapter->int_mode = E1000E_INT_MODE_LEGACY;
    }
    err = request_irq(adapter->pdev->irq, e1000_intr, IRQF_SHARED,
                      netdev->name, netdev);
    if (err)
        e_err("Unable to allocate interrupt, Error: %d\n", err);
    return err;
}
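/* request_irq() is a thin wrapper around request_threaded_irq() without a threaded handler */
static inline int __must_check
request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags,
            const char *name, void *dev)
{
    return request_threaded_irq(irq, handler, NULL, flags, name, dev);
}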
The kernel invokes do_IRQ() once an interrupt occurs. See code here.
/* do_IRQ handles all normal device IRQ's (the special
 * SMP cross-CPU interrupts have their own specific
 * handlers).
 */
__visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
{
    …
    if (!handle_irq(desc, regs)) {
        ack_APIC_irq();
        …
    }
    …
}
Once an interrupt is handled, an EOI is sent to the LAPIC that originally received the IRQ (in the do_IRQ() snippet above, ack_APIC_irq() is called directly on the path where handle_irq() returns false). See code here.
static inline void ack_APIC_irq(void)
{
    /*
     * ack_APIC_irq() actually gets compiled as a single instruction
     * ... yummie.
     */
    apic_eoi();
}
The following is a trace showing IRQ and softirq handling (the full trace, including the EOI, is not shown).