Context switch

Process: an abstract virtual machine, as if it had its own CPU and memory, not accidentally affected by other processes.

Goals for solution:

  • Transparent to user processes
  • Pre-emptive for user processes
  • Pre-emptive for kernel, where convenient
  • Helps keep the system responsive

Process states:

The per-cpu scheduler thread

scheduler() at proc.c

  • An infinite loop: for (;;)
  • Enable interrupts -- the scheduler can be preempted; keeps system responsive
  • An alternative approach -- System idle process
  • Acquire the lock before accessing the ptable -- scheduler() can be called by multiple CPUs.
  • Iterate over the process table to find a RUNNABLE process; RUNNING means the process has already been selected by another CPU.
  • Set the per-cpu variable proc to the selected process
  • Call switchuvm() to switch to the user's page table and update the corresponding kernel-stack
  • Read syscall_entry in entry.S to see that this kernel-stack will be used when the process makes another syscall.
  • Call swtch() -- yes, it means "switch"
  • Ken Thompson was once asked what he would do differently if he were redesigning the UNIX system. His reply: "I'd spell creat with an e."
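
The selection step of the loop above can be sketched in user space. This is only an illustration, not the code in proc.c: the trimmed-down struct proc, the state list, and the helper name pick_next() are assumptions made for the example, and the locking is reduced to a comment.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for xv6's process states and table. */
enum procstate { UNUSED, SLEEPING, RUNNABLE, RUNNING };

struct proc {
  int pid;
  enum procstate state;
};

/* One pass over the table: return the first RUNNABLE process and mark it
 * RUNNING, or NULL if nothing is runnable.
 * In the kernel this scan runs with the ptable lock held. */
struct proc *pick_next(struct proc table[], size_t n) {
  for (size_t i = 0; i < n; i++) {
    if (table[i].state != RUNNABLE)
      continue;               /* RUNNING entries were taken by another CPU */
    table[i].state = RUNNING; /* claim the process before the lock is released */
    return &table[i];
  }
  return NULL;
}
```

Calling pick_next() on a table holding a SLEEPING, a RUNNING, and a RUNNABLE entry returns the third one; a second call returns NULL because nothing is RUNNABLE any more.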

The context model in xv6

Note that xv6's model is just one example that actually works. The models used by real-world OSes can be similar or quite different.

  • One scheduler thread for each CPU
  • One kernel thread and one user thread for each process

How to understand the term thread here?

  • A thread in this context is merely a continuous execution flow
  • A thread's only "personal belonging" is its stack.
  • What about address space, file descriptors, and many other things? -- those are things that can be shared by multiple threads.
  • All threads in a process share everything -- a thread can actually access other threads' stacks.
  • All kernel threads share everything in the kernel.

An example of context switch in xv6:

  • A user process makes a read() system call to get something from the disk.
  • When trapped into the kernel, the kernel makes some calls such as fileread() -> ... -> consoleread()
  • In consoleread(), the kernel thread expects the IO to take a long time to finish, so it gives up the CPU in the hope that another RUNNABLE process can make better use of it.
  • consoleread() -> sleep() -> sched() -> swtch()
  • When doing swtch(), in addition to the context of the current user process, the current kernel execution flow (the kernel stack) should also be saved
  • Later on, when this kernel context is picked up by the scheduler, the execution resumes at swtch()
  • Then after a few returns, the kernel execution is finished and finally returns to syscall_trapret in trapasm.S
  • After the last instruction in the kernel mode (sysretq in trapasm.S), the user process resumes its execution flow
  • Once the syscall has finished, the next instruction executed in user space is the "ret" shown in the macro in usys.S

From the CPU's perspective (what's really happening):

  • procX's u-thread -> syscall -> --||--> procX's kthread -> sched() -> swtch() --|--> inner loop of scheduler() -> swtch() --|--> procY's kthread -> iretq --||--> procY's u-thread.
  • --||--> corresponds to context switch across the privilege levels, by syscall (usys.S) or iretq (trapasm.S)
  • --|--> corresponds to stack switch in the kernel, by swtch() in swtch.S

From the process' perspective:

  • procX's u-thread -> syscall -> result in %eax -> continue as if it had always been in procX's u-thread

From the process' kthread's perspective:

  • procX's u-thread -> syscall -> --||--> procX's kthread -> (sched() == NOP) -> continue procX's kthread -> iretq --||--> procX's u-thread.

From the scheduler thread's perspective:

  • Just executing the loop, and every call to swtch() is a NOP

The seemingly continuous execution flows in this model:

  • Every user process thinks it's executing continuously, where syscalls are just like normal function calls.
  • Every process' kernel thread thinks it's executing continuously, where every call to yield() is a NOP.
  • Every kernel scheduler thread thinks it's in an infinite loop, where every swtch() is a NOP.
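
The "every switch looks like a NOP" view can be tried out in user space. The sketch below is not xv6 code: it assumes a Linux/glibc system and uses the POSIX ucontext calls (getcontext, makecontext, swapcontext) in place of swtch(). The names worker(), yield_to_scheduler(), run_demo(), and the trace array are made up for the example.

```c
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)
#define NWORKERS 2
#define TRACE_MAX 16

static ucontext_t sched_ctx;             /* plays the per-CPU scheduler thread */
static ucontext_t worker_ctx[NWORKERS];  /* play the processes' kernel threads */
static char stacks[NWORKERS][STACK_SIZE];
static int done[NWORKERS];
int trace[TRACE_MAX];                    /* order in which the workers ran */
int ntrace;

/* Give up the CPU. From the worker's view this call simply returns later. */
static void yield_to_scheduler(int id) {
  swapcontext(&worker_ctx[id], &sched_ctx);
}

static void worker(int id) {
  for (int step = 0; step < 2; step++) {
    if (ntrace < TRACE_MAX)
      trace[ntrace++] = id * 10 + step;  /* record "worker id, step" */
    yield_to_scheduler(id);
  }
  done[id] = 1;                          /* returning resumes uc_link */
}

/* The scheduler loop: from its view each swapcontext() just returns. */
int run_demo(void) {
  ntrace = 0;
  for (int i = 0; i < NWORKERS; i++) {
    done[i] = 0;
    getcontext(&worker_ctx[i]);
    worker_ctx[i].uc_stack.ss_sp = stacks[i];
    worker_ctx[i].uc_stack.ss_size = STACK_SIZE;
    worker_ctx[i].uc_link = &sched_ctx;  /* where to go when worker() returns */
    makecontext(&worker_ctx[i], (void (*)(void))worker, 1, i);
  }
  int remaining = NWORKERS;
  while (remaining > 0) {
    remaining = 0;
    for (int i = 0; i < NWORKERS; i++) {
      if (done[i])
        continue;
      swapcontext(&sched_ctx, &worker_ctx[i]);  /* "NOP" for the scheduler */
      if (!done[i])
        remaining++;
    }
  }
  return ntrace;
}
```

Each worker runs a plain for loop and the scheduler runs a plain nested loop, yet the recorded trace interleaves the two workers: the only things that change hands at each switch are the stack and the saved registers.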

Isolation

  • Isolation between user processes is strictly enforced (by the hardware).
  • Isolation between threads (in the same process, or in the kernel) is no more than a gentlemen's agreement.

What's in swtch()?

Who calls swtch()?

  • scheduler() in proc.c -- switch from the scheduler thread to a process' kernel thread; this is the first time swtch() is called on each core
  • sched() in proc.c -- switch from a process' kernel thread to the scheduler thread

swtch() switches between the stacks of two kernel threads (or simply put, two kernel execution flows)

Why does this context switch look much simpler than the context switch between kernel and user mode?

As the kernel knows that both contexts are currently in the swtch() function, only the callee-saved registers really need to be saved. (recall "gentlemen's agreement")

  • A good "side effect" of this simplified context switch is that there are (many) free registers immediately available for use (%rdi and %rsi)!
  • As a comparison, in the interrupt handler, all general-purpose registers need to be saved, since no assumption can be made that some registers are not actively used by the user.
  • On x86-64, the Linux kernel has to use some hidden space and special instructions (swapgs) to perform the switch safely (and efficiently).
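
To make "only the callee-saved registers" concrete, here is a sketch of what a saved kernel context could look like under the System V x86-64 calling convention. The struct name, the field order, and CONTEXT_SLOTS are assumptions for illustration; the actual layout must match the push/pop order in swtch.S.

```c
#include <stddef.h>
#include <stdint.h>

/* Callee-saved registers of the System V x86-64 ABI, plus the return
 * address. %rsp is not a field: the stack pointer value itself is what
 * identifies the saved context. Field order is hypothetical. */
struct context {
  uint64_t r15;
  uint64_t r14;
  uint64_t r13;
  uint64_t r12;
  uint64_t rbx;
  uint64_t rbp;
  uint64_t rip;  /* return address pushed by the call to swtch() */
};

/* Number of 8-byte stack slots one switch has to save. */
enum { CONTEXT_SLOTS = sizeof(struct context) / sizeof(uint64_t) };
```

Seven slots (56 bytes) is all a switch between two kernel threads needs here, far less than a full trap frame that has to preserve every general-purpose register.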

Red zone

swtch() assumes the space below %rsp is freely available so it can save the current context by pushing to the stack.

A C compiler may optimize leaf functions by directly using up to 128 bytes below %rsp. With this optimization, the leaf function doesn't need to change %rsp, thus saving a few instructions.

The original 32-bit xv6 turns off the red zone feature (which means data below %rsp can be freely clobbered by anyone) because the kernel wants to directly use the user stack.

The current 64-bit xv6 kernel (the master branch) does not borrow user stack during syscall/interrupt handling. The red zone could be enabled.
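
A small leaf function shows the kind of code the red zone affects. This is only a sketch: the function name leaf_sum() is made up, and whether the compiler really places tmp below %rsp depends on the compiler and its options. Comparing the output of gcc -O2 -S with and without -mno-red-zone is one way to check.

```c
#include <stdint.h>

/* A leaf function: it calls nothing, so nothing pushes below its frame
 * while it runs. With the red zone enabled, a compiler may keep tmp in
 * the 128 bytes below %rsp without adjusting %rsp at all. With
 * -mno-red-zone it has to reserve the space explicitly. */
uint64_t leaf_sum(const uint64_t *v) {
  volatile uint64_t tmp[4];  /* volatile keeps the locals in memory */
  for (int i = 0; i < 4; i++)
    tmp[i] = v[i] * 2;
  return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```

The result is the same either way; only the placement of the locals relative to %rsp differs, which is exactly what matters when something else (an interrupt, or a push in swtch()) writes below %rsp.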

A summary of execution flows in xv6-64

Per-CPU scheduler:

  • Infinite loop in scheduler(). Iterate over the process list.
  • The execution always runs on one CPU. -- see the first parameter for swtch().

User process:

  • Each user process has its own execution flow, including its user-stack and kernel-stack -- you can think that the syscall is just a normal function call.
  • A process' kernel thread can choose to go to sleep (sleep() or yield()). When it wakes up, it can be on any core.

The base address for %fs is set in seginit() -- wrmsr(0xC0000100, ((uint64) local) + (2048));

Documentation here.

per-CPU local storage:

  • gdt at local[0]
  • tss starts at local[1024]
  • %fs base at local[2048] -- __thread variables are located below %fs

cpu is at %fs:($-16) -- declared first

proc is at %fs:($-8)

As the reference to the __thread proc is hard-coded in entry.S with offset == (-8), we can try to crash the kernel by simply reordering the declarations of cpu and proc.
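
The layout above comes down to pointer arithmetic on the per-CPU buffer. The sketch below only reproduces the offsets listed in these notes; the helper names fs_base(), cpu_slot(), and proc_slot() are hypothetical and do not exist in the kernel.

```c
/* Offsets into the per-CPU buffer `local`, as listed in the notes. */
enum { GDT_OFF = 0, TSS_OFF = 1024, FS_BASE_OFF = 2048 };

/* Address that seginit() writes to the FS base MSR (0xC0000100). */
char *fs_base(char *local) { return local + FS_BASE_OFF; }

/* Slot of the per-CPU cpu pointer: %fs:-16 (declared first). */
char *cpu_slot(char *local) { return fs_base(local) - 16; }

/* Slot of the per-CPU proc pointer: %fs:-8, the offset hard-coded
 * in entry.S. */
char *proc_slot(char *local) { return fs_base(local) - 8; }
```

With these offsets, cpu sits at local[2032] and proc at local[2040]. Swapping the two declarations would swap the slots, so the hard-coded -8 in entry.S would then read cpu instead of proc.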

Extended reading: