Process: an abstract virtual machine, as if it had its own CPU and memory, not accidentally affected by other processes.
Goals for solution:
- Transparent to user processes
- Pre-emptive for user processes
- Pre-emptive for kernel, where convenient
- Helps keep the system responsive
- Sleep -- uninterruptible sleep and interruptible sleep
- Zombie -- not that interesting.
- Unused and embryo are not process states; they are internal flags for maintaining the array of
The per-cpu scheduler thread
scheduler() at proc.c
- An infinite loop:
- Enable interrupts -- the scheduler can be preempted; keeps system responsive
- An alternative approach -- System idle process
- Acquire the lock before accessing the
scheduler() can be called by multiple CPUs.
- Iterate over the process table to find a RUNNABLE process; RUNNING means the process has already been selected by another CPU.
- Change the per-CPU variable proc to point to the selected user process
- switchuvm() switches to the user's page table and updates the corresponding kernel stack. Check entry.S to see that this kernel stack will be used when the process makes another syscall.
- swtch() -- yes, it means "switch"
- Ken Thompson was once asked what he would do differently if he were redesigning the UNIX system. His reply: "I'd spell creat with an e."
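The scan over the process table described above can be sketched in plain user-space C. This is only an illustration: struct proc_entry, pick_runnable() and NPROC_DEMO are hypothetical names, and the lock, switchuvm() and swtch() steps are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Process states, using the names that appear in these notes. */
enum procstate { UNUSED, EMBRYO, SLEEPING, RUNNABLE, RUNNING, ZOMBIE };

#define NPROC_DEMO 8  /* hypothetical table size for this sketch */

struct proc_entry {
    int pid;
    enum procstate state;
};

/*
 * One pass of the scheduler's inner loop: scan the table for a RUNNABLE
 * entry, mark it RUNNING, and return it. Returns NULL when nothing is
 * runnable. In the kernel this scan runs with the process-table lock
 * held and is followed by switchuvm() and swtch(); both are omitted.
 */
struct proc_entry *pick_runnable(struct proc_entry *table, size_t n)
{
    assert(table != NULL);
    for (size_t i = 0; i < n; i++) {
        if (table[i].state != RUNNABLE)
            continue;  /* e.g. RUNNING: already selected by another CPU */
        table[i].state = RUNNING;
        return &table[i];
    }
    return NULL;
}
```

A caller would invoke pick_runnable() once per iteration of the infinite loop and simply retry when it returns NULL.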
The context model in xv6
Note that xv6's model is just one example that actually works. The models used by real-world OSes can be similar or quite different.
- One scheduler thread for each CPU
- One kernel thread and one user thread for each process
How to understand the term thread here?
- A thread in this context is merely a continuous execution flow
- A thread's only "personal belonging" is its stack.
- What about address space, file descriptors, and many other things? -- those are things that can be shared by multiple threads.
- All threads in a process share everything -- a thread can actually access other threads' stacks.
- All kernel threads share everything in the kernel.
An example of context switch in xv6:
- A user process makes a read() system call to get something from the disk.
- When trapped into the kernel, the kernel makes some calls such as fileread() -> ... -> consoleread()
- In consoleread(), the kernel thread decides that the I/O is likely to take a very long time to finish, so it gives up the CPU in the hope that another RUNNABLE process can make better use of it: consoleread() -> sleep() -> sched() -> swtch()
- When doing swtch(), in addition to the context of the current user process, the current kernel execution flow (the kernel stack) should also be saved
- Later on, when this kernel context is picked up by the scheduler, the execution resumes at
- Then after a few returns, the kernel execution is finished and finally returns to
- After the last instruction in kernel mode (trapasm.S), the user process resumes its execution flow
- Now that the syscall has finished, the instruction "ret" is to be executed in user space, as shown in the macro in
From the CPU's perspective (what's really happening):
- procX's u-thread -> syscall -> --||--> procX's kthread -> sched() -> swtch() --|--> inner loop of scheduler() -> swtch() --|--> procY's kthread -> iretq --||--> procY's u-thread.
- --||--> corresponds to context switch across the privilege levels, by
- --|--> corresponds to stack switch in the kernel, by
From the process' perspective:
- procX's u-thread -> syscall -> result in %eax -> continue as if it had always been in procX's u-thread
From the process' kthread's perspective:
- procX's u-thread -> syscall -> --||--> procX's kthread -> (sched() == NOP) -> continue procX's kthread -> iretq --||--> procX's u-thread.
From the scheduler thread's perspective:
- Just executing the loop, and every call to swtch() is a NOP
The seemingly continuous execution flows in this model:
- Every user process thinks it's executing continuously, where syscalls are just like normal function calls.
- Every process' kernel thread thinks it's executing continuously, where every call to yield() is a NOP.
- Every kernel scheduler thread thinks it's in an infinite loop, where every swtch() is a NOP.
- Isolation between user processes is strictly enforced (by the hardware).
- Isolation between threads (in the same process, or in the kernel) is no more than a gentlemen's agreement.
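The "every switch looks like a NOP" idea can be tried out in user space. The sketch below is an analogy, not xv6 code: it assumes Linux with glibc's ucontext API, where swapcontext() plays the role of swtch(), and the names run_demo(), yield_demo() and worker() are made up for illustration. Each worker owns nothing but its stack, and from inside worker() the call to yield_demo() simply returns later.

```c
#include <assert.h>
#include <string.h>
#include <ucontext.h>

#define NTHREADS 2
#define STACK_SIZE (64 * 1024)

static ucontext_t sched_ctx;               /* the "scheduler thread" */
static ucontext_t thread_ctx[NTHREADS];    /* two "kernel threads" */
static char stacks[NTHREADS][STACK_SIZE];  /* each thread owns only a stack */
static int done[NTHREADS];
static int current;                        /* index of the running thread */
static char trace[64];                     /* order in which steps ran */

static void record(const char *s) { strcat(trace, s); }

/* Give up the CPU: save this flow and resume the scheduler's flow. */
static void yield_demo(void)
{
    swapcontext(&thread_ctx[current], &sched_ctx);
}

static void worker(void)
{
    int id = current;
    record(id == 0 ? "a1 " : "b1 ");
    yield_demo();                 /* from in here this looks like a NOP */
    record(id == 0 ? "a2 " : "b2 ");
    done[id] = 1;
    /* returning resumes uc_link, which is the scheduler context */
}

/* Runs the scheduler loop until both workers finish; returns the trace. */
const char *run_demo(void)
{
    trace[0] = '\0';
    for (int i = 0; i < NTHREADS; i++) {
        done[i] = 0;
        getcontext(&thread_ctx[i]);
        thread_ctx[i].uc_stack.ss_sp = stacks[i];
        thread_ctx[i].uc_stack.ss_size = STACK_SIZE;
        thread_ctx[i].uc_link = &sched_ctx;
        makecontext(&thread_ctx[i], worker, 0);
    }
    for (;;) {
        int ran = 0;
        for (int i = 0; i < NTHREADS; i++) {
            if (done[i])
                continue;
            current = i;
            /* from the scheduler's view, this call also just returns */
            swapcontext(&sched_ctx, &thread_ctx[i]);
            ran = 1;
        }
        if (!ran)
            break;
    }
    return trace;
}
```

Calling run_demo() yields the trace "a1 b1 a2 b2 ": the two workers interleave, yet neither worker nor the scheduler loop contains any code that deals with the interleaving.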
What's in swtch()?
- scheduler() in proc.c -- switches from the scheduler thread to a process' kernel thread; this is the first time swtch() is called on each core
- sched() in proc.c -- switches from a process' kernel thread to the scheduler thread
swtch() switches between the stacks of two kernel threads (or, simply put, two kernel execution flows).
Why does this context switch look much simpler than the context switch between kernel and user mode?
- When swtch() is called, both the from and to kernel threads are in the swtch() function, from the perspective of "execution flow".
- Recall what's in x86_64's calling convention -- caller-save registers vs. callee-save registers.
- Quote: "The caller-saved registers are
r11, and any registers that parameters are put into." --
As the kernel knows that both contexts are currently in the swtch() function, only the callee-saved registers really need to be saved (recall the "gentlemen's agreement").
- A good "side effect" of this simplified context switch is that there are (many) free registers immediately available for use (%rdi and %rsi)!
- As a comparison, in the interrupt handler, all general-purpose registers need to be saved, since no assumption can be made that some registers are not actively used by the user.
- On x86-64, the Linux kernel has to use some hidden space and special instructions (swapgs) to perform the switch safely (and efficiently).
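To make the "only callee-saved registers" point concrete, here is a sketch of what a swtch()-style routine might leave on the old kernel stack. Under the System V x86-64 calling convention the callee-saved registers are %rbx, %rbp and %r12 to %r15 (plus %rsp itself). The struct name context_sketch, the helper context_bytes() and the field order are assumptions for illustration; the actual xv6-64 layout may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Saved state of a kernel thread that is parked inside swtch().
 * Only callee-saved registers appear. The saved %rsp is not a field:
 * the address of this struct on the stack is the saved %rsp.
 */
struct context_sketch {
    uint64_t r15;  /* assumed to be pushed last, so lowest address */
    uint64_t r14;
    uint64_t r13;
    uint64_t r12;
    uint64_t rbx;
    uint64_t rbp;  /* assumed to be pushed first */
    uint64_t rip;  /* return address pushed by the call instruction */
};

/* Bytes of kernel stack consumed by one saved context. */
size_t context_bytes(void)
{
    /* the return address sits just above the six pushed registers */
    assert(offsetof(struct context_sketch, rip) == 6 * sizeof(uint64_t));
    return sizeof(struct context_sketch);
}
```

Six pushes plus the return address come to 56 bytes, noticeably less than what an interrupt handler has to save when it cannot assume anything about register usage.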
swtch() assumes the space below %rsp is freely available so it can save the current context by pushing to the stack.
A C compiler may optimize leaf functions by directly using up to 128 bytes below %rsp (the red zone). With this optimization, %rsp doesn't need to be changed by the leaf function, thus saving a few instructions.
The original 32-bit xv6 turns off the red zone feature (so data below %rsp can be freely clobbered by anyone) because the kernel wants to use the user stack directly.
The current 64-bit xv6 kernel (the master branch) does not borrow the user stack during syscall/interrupt handling, so the red zone could be enabled.
A summary of execution flows in xv6-64
- Infinite loop in scheduler(). Iterate over the process list.
- The execution always runs on one CPU. -- see the first parameter for swtch().
- Each user process has its own execution flow, including its user stack and kernel stack -- you can think of the syscall as just a normal function call.
- A process' kernel thread can choose to go to sleep (yield()). When it wakes up, it can be on any core.
%fs' corresponding base-address is set in seginit() --
wrmsr(0xC0000100, ((uint64) local) + (2048));
per-CPU local storage:
- gdt at local
- tss starts at local
- %fs base at local + 2048 -- __thread variables are located below %fs
cpu is at %fs:($-16) -- declared first
proc is at %fs:($-8)
As the reference to the __thread proc is hard-coded in entry.S with offset == (-8), we can try to crash the kernel by simply reordering the declarations of cpu and proc.
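The address arithmetic behind these offsets can be written out as a small sketch. It only mirrors the numbers in the notes (the + 2048 from the wrmsr() call and the offsets -16 and -8); fs_base(), fs_slot() and FS_BASE_OFFSET are hypothetical helper names, not xv6 functions.

```c
#include <assert.h>
#include <stdint.h>

#define FS_BASE_OFFSET 2048  /* taken from the wrmsr() call in the notes */

/* Address that %fs:0 refers to, given the start of the per-CPU area. */
uintptr_t fs_base(uintptr_t local)
{
    return local + FS_BASE_OFFSET;
}

/* Address of a __thread slot at a negative offset from the %fs base. */
uintptr_t fs_slot(uintptr_t local, intptr_t off)
{
    assert(off < 0);  /* __thread variables are located below %fs */
    return fs_base(local) + (uintptr_t)off;
}
```

With these helpers, fs_slot(local, -16) is where cpu lives and fs_slot(local, -8) is where proc lives. If the two declarations were swapped, the hard-coded -8 in entry.S would read cpu where it expects proc.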