kvm-mmu

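The following is a translation by kanda.motohiro@gmail.com of the Linux kernel document Documentation/virtual/kvm/mmu.txt. Like the original, it is released under GPL v2. Lines beginning with # are the original text; the corresponding translation follows each of them.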

#The x86 kvm shadow mmu

#======================

The x86 kvm shadow mmu

======================

#The mmu (in arch/x86/kvm, files mmu.[ch] and paging_tmpl.h) is responsible

#for presenting a standard x86 mmu to the guest, while translating guest

#physical addresses to host physical addresses.

The mmu (in arch/x86/kvm, files mmu.[ch] and paging_tmpl.h) is responsible for presenting a standard x86 mmu to the guest and for translating guest physical addresses to host physical addresses.

#The mmu code attempts to satisfy the following requirements:

The mmu code attempts to satisfy the following requirements:

#- correctness: the guest should not be able to determine that it is running

# on an emulated mmu except for timing (we attempt to comply

# with the specification, not emulate the characteristics of

# a particular implementation such as tlb size)

#- security: the guest must not be able to touch host memory not assigned

# to it

#- performance: minimize the performance penalty imposed by the mmu

#- scaling: need to scale to large memory and large vcpu guests

#- hardware: support the full range of x86 virtualization hardware

#- integration: Linux memory management code must be in control of guest memory

# so that swapping, page migration, page merging, transparent

# hugepages, and similar features work without change

#- dirty tracking: report writes to guest memory to enable live migration

# and framebuffer-based displays

#- footprint: keep the amount of pinned kernel memory low (most memory

# should be shrinkable)

#- reliability: avoid multipage or GFP_ATOMIC allocations

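- correctness: apart from timing, the guest should not be able to determine that it is running on an emulated mmu (we aim to comply with the specification, not to emulate the characteristics of a particular implementation such as tlb size)

- security: the guest must not be able to touch host memory that is not assigned to it

- performance: minimize the performance penalty imposed by the mmu

- scaling: must scale to guests with large memory and many vcpus

- hardware: support the full range of x86 virtualization hardware

- integration: Linux memory management code must remain in control of guest memory, so that swapping, page migration, page merging, transparent hugepages, and similar features work without change

- dirty tracking: report writes to guest memory to enable live migration and framebuffer-based displays

- footprint: keep the amount of pinned kernel memory low (most memory should be shrinkable)

- reliability: avoid multipage or GFP_ATOMIC allocations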

#Acronyms

#========

Acronyms

========

#pfn host page frame number

#hpa host physical address

#hva host virtual address

#gfn guest frame number

#gpa guest physical address

#gva guest virtual address

#ngpa nested guest physical address

#ngva nested guest virtual address

#pte page table entry (used also to refer generically to paging structure

# entries)

#gpte guest pte (referring to gfns)

#spte shadow pte (referring to pfns)

#tdp two dimensional paging (vendor neutral term for NPT and EPT)

pfn    host page frame number

hpa    host physical address

hva    host virtual address

gfn    guest frame number

gpa    guest physical address

gva    guest virtual address

ngpa   nested guest physical address

ngva   nested guest virtual address

pte    page table entry (also used to refer generically to paging structure entries)

gpte   guest pte (referring to gfns)

spte   shadow pte (referring to pfns)

tdp    two dimensional paging (vendor neutral term for NPT and EPT)

#Virtual and real hardware supported

#===================================

Virtual and real hardware supported

===================================

#The mmu supports first-generation mmu hardware, which allows an atomic switch

#of the current paging mode and cr3 during guest entry, as well as

#two-dimensional paging (AMD's NPT and Intel's EPT). The emulated hardware

#it exposes is the traditional 2/3/4 level x86 mmu, with support for global

#pages, pae, pse, pse36, cr0.wp, and 1GB pages. Work is in progress to support

#exposing NPT capable hardware on NPT capable hosts.

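The mmu supports first-generation mmu hardware, which allows an atomic switch of the current paging mode and cr3 during guest entry, as well as two-dimensional paging (AMD's NPT and Intel's EPT). The emulated hardware it exposes is the traditional 2/3/4 level x86 mmu, with support for global pages, pae, pse, pse36, cr0.wp, and 1GB pages. Work is in progress to support exposing NPT-capable hardware on NPT-capable hosts.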

#Translation

#===========

Translation

===========

#The primary job of the mmu is to program the processor's mmu to translate

#addresses for the guest. Different translations are required at different

#times:

The primary job of the mmu is to program the processor's mmu to translate addresses for the guest. Different translations are required at different times:

#- when guest paging is disabled, we translate guest physical addresses to

# host physical addresses (gpa->hpa)

#- when guest paging is enabled, we translate guest virtual addresses, to

# guest physical addresses, to host physical addresses (gva->gpa->hpa)

#- when the guest launches a guest of its own, we translate nested guest

# virtual addresses, to nested guest physical addresses, to guest physical

# addresses, to host physical addresses (ngva->ngpa->gpa->hpa)

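- when guest paging is disabled, we translate guest physical addresses to host physical addresses (gpa->hpa)

- when guest paging is enabled, we translate guest virtual addresses to guest physical addresses, and then to host physical addresses (gva->gpa->hpa)

- when the guest launches a guest of its own, we translate nested guest virtual addresses to nested guest physical addresses, then to guest physical addresses, and then to host physical addresses (ngva->ngpa->gpa->hpa)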

#The primary challenge is to encode between 1 and 3 translations into hardware

#that support only 1 (traditional) and 2 (tdp) translations. When the

#number of required translations matches the hardware, the mmu operates in

#direct mode; otherwise it operates in shadow mode (see below).

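The primary challenge is to encode between 1 and 3 translations into hardware that supports only 1 (traditional) or 2 (tdp) translations. When the number of required translations matches the hardware, the mmu operates in direct mode; otherwise it operates in shadow mode (see below).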
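The choice between direct and shadow mode can be sketched as a small model. This is only an illustration: the constant and function names are invented here, and "matches the hardware" is read as "fits within the number of translations the hardware can walk", which is an interpretation rather than something the text states.

```python
# Illustrative model only; these names are not kernel identifiers.
HW_TRADITIONAL = 1  # hardware walks one set of page tables
HW_TDP = 2          # hardware walks two (NPT/EPT)


def required_translations(guest_paging: bool, nested: bool) -> int:
    """Number of translation stages the guest needs."""
    if nested:
        return 3  # ngva->ngpa->gpa->hpa
    if guest_paging:
        return 2  # gva->gpa->hpa
    return 1      # gpa->hpa


def mmu_mode(guest_paging: bool, nested: bool, hw_translations: int) -> str:
    """Assumption: direct mode whenever the hardware can walk every stage."""
    needed = required_translations(guest_paging, nested)
    return "direct" if needed <= hw_translations else "shadow"
```

Under this reading, a paging guest on traditional hardware needs shadow mode, while the same guest on tdp hardware runs in direct mode.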

#Memory

#======

Memory

======

#Guest memory (gpa) is part of the user address space of the process that is

#using kvm. Userspace defines the translation between guest addresses and user

#addresses (gpa->hva); note that two gpas may alias to the same hva, but not

#vice versa.

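Guest memory (gpa) is part of the user address space of the process that is using kvm. Userspace defines the translation between guest addresses and user addresses (gpa->hva); note that two gpas may alias to the same hva, but not vice versa.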
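A minimal sketch of a gpa->hva lookup, assuming userspace describes guest memory as a list of contiguous slots. The `MemSlot` class and `gpa_to_hva` function are hypothetical and exist only for this illustration; the addresses in the usage note are made up.

```python
from dataclasses import dataclass
from typing import List, Optional

PAGE_SHIFT = 12  # 4k pages


@dataclass
class MemSlot:
    base_gfn: int        # first guest frame number covered by the slot
    npages: int          # slot length in 4k pages
    userspace_addr: int  # hva that backs base_gfn


def gpa_to_hva(slots: List[MemSlot], gpa: int) -> Optional[int]:
    """Return the hva backing gpa, or None if no slot covers it."""
    gfn = gpa >> PAGE_SHIFT
    for slot in slots:
        if slot.base_gfn <= gfn < slot.base_gfn + slot.npages:
            offset = gpa - (slot.base_gfn << PAGE_SHIFT)
            return slot.userspace_addr + offset
    return None
```

Two slots with different `base_gfn` values but the same `userspace_addr` model the aliasing case: two gpas resolve to one hva, while a single gpa never resolves to two hvas.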

#These hvas may be backed using any method available to the host: anonymous

#memory, file backed memory, and device memory. Memory might be paged by the

#host at any time.

These hvas may be backed using any method available to the host: anonymous memory, file backed memory, and device memory. Memory might be paged by the host at any time.

#Events

#======

Events

======

#The mmu is driven by events, some from the guest, some from the host.

The mmu is driven by events, some from the guest, some from the host.

#Guest generated events:

#- writes to control registers (especially cr3)

#- invlpg/invlpga instruction execution

#- access to missing or protected translations

#Host generated events:

#- changes in the gpa->hpa translation (either through gpa->hva changes or

# through hva->hpa changes)

#- memory pressure (the shrinker)

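Guest generated events:

- writes to control registers (especially cr3)

- invlpg/invlpga instruction execution

- access to missing or protected translations

Host generated events:

- changes in the gpa->hpa translation (either through gpa->hva changes or through hva->hpa changes)

- memory pressure (the shrinker)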

#Shadow pages

#============

Shadow pages

============

#The principal data structure is the shadow page, 'struct kvm_mmu_page'. A

#shadow page contains 512 sptes, which can be either leaf or nonleaf sptes. A

#shadow page may contain a mix of leaf and nonleaf sptes.

The principal data structure is the shadow page, 'struct kvm_mmu_page'. A shadow page contains 512 sptes, which can be either leaf or nonleaf sptes. A shadow page may contain a mix of leaf and nonleaf sptes.

#A nonleaf spte allows the hardware mmu to reach the leaf pages and

#is not related to a translation directly. It points to other shadow pages.

A nonleaf spte allows the hardware mmu to reach the leaf pages and is not related to a translation directly. It points to other shadow pages.

#A leaf spte corresponds to either one or two translations encoded into

#one paging structure entry. These are always the lowest level of the

#translation stack, with optional higher level translations left to NPT/EPT.

#Leaf ptes point at guest pages.

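A leaf spte corresponds to either one or two translations encoded into one paging structure entry. These are always the lowest level of the translation stack, with optional higher level translations left to NPT/EPT. Leaf ptes point at guest pages.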

#The following table shows translations encoded by leaf ptes, with higher-level

#translations in parentheses:

The following table shows translations encoded by leaf ptes, with higher-level translations in parentheses:

# Non-nested guests:

# nonpaging: gpa->hpa

# paging: gva->gpa->hpa

# paging, tdp: (gva->)gpa->hpa

# Nested guests:

# non-tdp: ngva->gpa->hpa (*)

# tdp: (ngva->)ngpa->gpa->hpa

 Non-nested guests:
  nonpaging:     gpa->hpa
  paging:        gva->gpa->hpa
  paging, tdp:   (gva->)gpa->hpa
 Nested guests:
  non-tdp:       ngva->gpa->hpa  (*)
  tdp:           (ngva->)ngpa->gpa->hpa

#(*) the guest hypervisor will encode the ngva->gpa translation into its page

# tables if npt is not present

(*) the guest hypervisor will encode the ngva->gpa translation into its page tables if npt is not present

#Shadow pages contain the following information:

# role.level:

# The level in the shadow paging hierarchy that this shadow page belongs to.

# 1=4k sptes, 2=2M sptes, 3=1G sptes, etc.

# role.direct:

# If set, leaf sptes reachable from this page are for a linear range.

# Examples include real mode translation, large guest pages backed by small

# host pages, and gpa->hpa translations when NPT or EPT is active.

# The linear range starts at (gfn << PAGE_SHIFT) and its size is determined

# by role.level (2MB for first level, 1GB for second level, 0.5TB for third

# level, 256TB for fourth level)

# If clear, this page corresponds to a guest page table denoted by the gfn

# field.

# role.quadrant:

# When role.cr4_pae=0, the guest uses 32-bit gptes while the host uses 64-bit

# sptes. That means a guest page table contains more ptes than the host,

# so multiple shadow pages are needed to shadow one guest page.

# For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the

# first or second 512-gpte block in the guest page table. For second-level

# page tables, each 32-bit gpte is converted to two 64-bit sptes

# (since each first-level guest page is shadowed by two first-level

# shadow pages) so role.quadrant takes values in the range 0..3. Each

# quadrant maps 1GB virtual address space.

# role.access:

# Inherited guest access permissions in the form uwx. Note execute

# permission is positive, not negative.

# role.invalid:

# The page is invalid and should not be used. It is a root page that is

# currently pinned (by a cpu hardware register pointing to it); once it is

# unpinned it will be destroyed.

# role.cr4_pae:

# Contains the value of cr4.pae for which the page is valid (e.g. whether

# 32-bit or 64-bit gptes are in use).

# role.nxe:

# Contains the value of efer.nxe for which the page is valid.

# role.cr0_wp:

# Contains the value of cr0.wp for which the page is valid.

# role.smep_andnot_wp:

# Contains the value of cr4.smep && !cr0.wp for which the page is valid

# (pages for which this is true are different from other pages; see the

# treatment of cr0.wp=0 below).

# gfn:

# Either the guest page table containing the translations shadowed by this

# page, or the base page frame for linear translations. See role.direct.

# spt:

# A pageful of 64-bit sptes containing the translations for this page.

# Accessed by both kvm and hardware.

# The page pointed to by spt will have its page->private pointing back

# at the shadow page structure.

# sptes in spt point either at guest pages, or at lower-level shadow pages.

# Specifically, if sp1 and sp2 are shadow pages, then sp1->spt[n] may point

# at __pa(sp2->spt). sp2 will point back at sp1 through parent_pte.

# The spt array forms a DAG structure with the shadow page as a node, and

# guest pages as leaves.

# gfns:

# An array of 512 guest frame numbers, one for each present pte. Used to

# perform a reverse map from a pte to a gfn. When role.direct is set, any

# element of this array can be calculated from the gfn field when used, in

# this case, the array of gfns is not allocated. See role.direct and gfn.

# root_count:

# A counter keeping track of how many hardware registers (guest cr3 or

# pdptrs) are now pointing at the page. While this counter is nonzero, the

# page cannot be destroyed. See role.invalid.

# multimapped:

# Whether there exist multiple sptes pointing at this page.

# parent_pte/parent_ptes:

# If multimapped is zero, parent_pte points at the single spte that points at

# this page's spt. Otherwise, parent_ptes points at a data structure

# with a list of parent_ptes.

# unsync:

# If true, then the translations in this page may not match the guest's

# translation. This is equivalent to the state of the tlb when a pte is

# changed but before the tlb entry is flushed. Accordingly, unsync ptes

# are synchronized when the guest executes invlpg or flushes its tlb by

# other means. Valid for leaf pages.

# unsync_children:

# How many sptes in the page point at pages that are unsync (or have

# unsynchronized children).

# unsync_child_bitmap:

# A bitmap indicating which sptes in spt point (directly or indirectly) at

# pages that may be unsynchronized. Used to quickly locate all unsynchronized

# pages reachable from a given page.

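Shadow pages contain the following information:

  role.level:
    The level in the shadow paging hierarchy that this shadow page belongs to. 1=4k sptes, 2=2M sptes, 3=1G sptes, etc.

  role.direct:
    If set, leaf sptes reachable from this page are for a linear range. Examples include real mode translation, large guest pages backed by small host pages, and gpa->hpa translations when NPT or EPT is active. The linear range starts at (gfn << PAGE_SHIFT) and its size is determined by role.level (2MB for first level, 1GB for second level, 0.5TB for third level, 256TB for fourth level). If clear, this page corresponds to a guest page table denoted by the gfn field.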

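  role.quadrant:
    When role.cr4_pae=0, the guest uses 32-bit gptes while the host uses 64-bit sptes. That means a guest page table contains more ptes than the host, so multiple shadow pages are needed to shadow one guest page. For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the first or second 512-gpte block in the guest page table. For second-level page tables, each 32-bit gpte is converted to two 64-bit sptes (since each first-level guest page is shadowed by two first-level shadow pages), so role.quadrant takes values in the range 0..3. Each quadrant maps 1GB of virtual address space.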

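  role.access:
    Inherited guest access permissions in the form uwx. Note that execute permission is positive, not negative.

  role.invalid:
    The page is invalid and should not be used. It is a root page that is currently pinned (by a cpu hardware register pointing to it); once it is unpinned it will be destroyed.

  role.cr4_pae:
    Contains the value of cr4.pae for which the page is valid (e.g. whether 32-bit or 64-bit gptes are in use).

  role.nxe:
    Contains the value of efer.nxe for which the page is valid.

  role.cr0_wp:
    Contains the value of cr0.wp for which the page is valid.

  role.smep_andnot_wp:
    Contains the value of cr4.smep && !cr0.wp for which the page is valid (pages for which this is true are different from other pages; see the treatment of cr0.wp=0 below).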

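  gfn:
    Either the guest page table containing the translations shadowed by this page, or the base page frame for linear translations. See role.direct.

  spt:
    A pageful of 64-bit sptes containing the translations for this page. Accessed by both kvm and hardware. The page pointed to by spt will have its page->private pointing back at the shadow page structure. sptes in spt point either at guest pages, or at lower-level shadow pages. Specifically, if sp1 and sp2 are shadow pages, then sp1->spt[n] may point at __pa(sp2->spt). sp2 will point back at sp1 through parent_pte. The spt array forms a DAG structure with the shadow page as a node, and guest pages as leaves.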

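  gfns:
    An array of 512 guest frame numbers, one for each present pte. Used to perform a reverse map from a pte to a gfn. When role.direct is set, any element of this array can be calculated from the gfn field, so in that case the gfns array is not allocated. See role.direct and gfn.

  root_count:
    A counter keeping track of how many hardware registers (guest cr3 or pdptrs) are now pointing at the page. While this counter is nonzero, the page cannot be destroyed. See role.invalid.

  multimapped:
    Whether there exist multiple sptes pointing at this page.

  parent_pte/parent_ptes:
    If multimapped is zero, parent_pte points at the single spte that points at this page's spt. Otherwise, parent_ptes points at a data structure with a list of parent_ptes.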

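  unsync:
    If true, then the translations in this page may not match the guest's translation. This is equivalent to the state of the tlb when a pte is changed but before the tlb entry is flushed. Accordingly, unsync ptes are synchronized when the guest executes invlpg or flushes its tlb by other means. Valid for leaf pages.

  unsync_children:
    How many sptes in the page point at pages that are unsync (or have unsynchronized children).

  unsync_child_bitmap:
    A bitmap indicating which sptes in spt point (directly or indirectly) at pages that may be unsynchronized. Used to quickly locate all unsynchronized pages reachable from a given page.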
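The sizes given for role.direct and the value ranges given for role.quadrant both follow from simple index arithmetic. The sketch below works that arithmetic through under stated assumptions: 4k pages, 512 sptes per shadow page, and 1024 32-bit gptes per guest page table. The function names are invented for illustration and are not taken from the kernel source.

```python
PAGE_SHIFT = 12
SPTE_INDEX_BITS = 9     # 512 64-bit sptes per shadow page
GPTE32_INDEX_BITS = 10  # 1024 32-bit gptes per guest page table


def direct_range_bytes(level: int) -> int:
    """Size of the linear range covered by a direct shadow page."""
    return 1 << (PAGE_SHIFT + SPTE_INDEX_BITS * level)


def quadrant(gva: int, level: int) -> int:
    """Which part of a 32-bit guest page table a shadow page covers."""
    # Each level contributes one extra index bit that a 512-entry
    # shadow page cannot hold, so level 1 has 2 quadrants, level 2 has 4.
    extra_bits = (GPTE32_INDEX_BITS - SPTE_INDEX_BITS) * level
    shift = PAGE_SHIFT + SPTE_INDEX_BITS * level
    return (gva >> shift) & ((1 << extra_bits) - 1)
```

With these definitions, a level-1 direct page covers 2MB and a level-2 page covers 1GB, and a level-2 quadrant is selected by address bits 30..31, which is why each quadrant maps 1GB.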

#Reverse map

#===========

Reverse map

===========

#The mmu maintains a reverse mapping whereby all ptes mapping a page can be

#reached given its gfn. This is used, for example, when swapping out a page.

The mmu maintains a reverse mapping whereby all ptes mapping a page can be reached given its gfn. This is used, for example, when swapping out a page.
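As a rough illustration of the idea, a reverse map can be modelled as a dictionary from gfn to the set of sptes that map it. The `ReverseMap` class and the `(shadow page id, index)` reference format are assumptions made for this sketch; the kernel's actual data structure is not described in this text.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

SpteRef = Tuple[int, int]  # (shadow page id, index into its spt)


class ReverseMap:
    def __init__(self) -> None:
        self._by_gfn: Dict[int, Set[SpteRef]] = defaultdict(set)

    def add(self, gfn: int, ref: SpteRef) -> None:
        """Record that the spte at ref maps gfn."""
        self._by_gfn[gfn].add(ref)

    def remove(self, gfn: int, ref: SpteRef) -> None:
        """Forget one spte, e.g. when its translation is dropped."""
        self._by_gfn[gfn].discard(ref)

    def drop_all(self, gfn: int) -> Set[SpteRef]:
        """Return and forget every spte mapping gfn (e.g. on swap-out)."""
        return self._by_gfn.pop(gfn, set())
```

When the host swaps out a page, a caller would use `drop_all(gfn)` to find every spte that must be cleared.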

#Synchronized and unsynchronized pages

#=====================================

Synchronized and unsynchronized pages

=====================================

#The guest uses two events to synchronize its tlb and page tables: tlb flushes

#and page invalidations (invlpg).

The guest uses two events to synchronize its tlb and page tables: tlb flushes and page invalidations (invlpg).

#A tlb flush means that we need to synchronize all sptes reachable from the

#guest's cr3. This is expensive, so we keep all guest page tables write

#protected, and synchronize sptes to gptes when a gpte is written.

A tlb flush means that we need to synchronize all sptes reachable from the guest's cr3. This is expensive, so we keep all guest page tables write protected, and synchronize sptes to gptes when a gpte is written.

#A special case is when a guest page table is reachable from the current

#guest cr3. In this case, the guest is obliged to issue an invlpg instruction

#before using the translation. We take advantage of that by removing write

#protection from the guest page, and allowing the guest to modify it freely.

#We synchronize modified gptes when the guest invokes invlpg. This reduces

#the amount of emulation we have to do when the guest modifies multiple gptes,

#or when a guest page is no longer used as a page table and is used for

#random guest data.

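A special case is when a guest page table is reachable from the current guest cr3. In this case, the guest is obliged to issue an invlpg instruction before using the translation. We take advantage of that by removing write protection from the guest page, and allowing the guest to modify it freely. We synchronize modified gptes when the guest invokes invlpg. This reduces the amount of emulation we have to do when the guest modifies multiple gptes, or when a guest page is no longer used as a page table and is used for random guest data.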

#As a side effect we have to resynchronize all reachable unsynchronized shadow

#pages on a tlb flush.

As a side effect we have to resynchronize all reachable unsynchronized shadow pages on a tlb flush.
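The unsync scheme can be illustrated with a toy model of one shadowed guest page table. Everything here is a simplification made for illustration: the class and method names are invented, the "translation" is an identity copy of the gpte, and re-applying write protection when the page is resynchronized is an assumption of the sketch.

```python
from typing import List


class ShadowedTable:
    """One guest page table together with its shadow copy."""

    def __init__(self, gptes: List[int]) -> None:
        self.gptes = list(gptes)   # what the guest sees and writes
        self.sptes = list(gptes)   # what the hardware walks
        self.write_protected = True
        self.unsync = False

    def guest_write(self, index: int, value: int) -> None:
        if self.write_protected:
            # First write faults: drop protection and mark the page unsync
            # so that later writes proceed without emulation.
            self.write_protected = False
            self.unsync = True
        self.gptes[index] = value

    def invlpg(self, index: int) -> None:
        """Guest invalidates one entry: synchronize just that spte."""
        self.sptes[index] = self.gptes[index]

    def tlb_flush(self) -> None:
        """Guest flushes its tlb: resynchronize the whole page."""
        if self.unsync:
            self.sptes = list(self.gptes)
            self.unsync = False
            self.write_protected = True  # assumption of this sketch
```

In this model the guest can change several gptes at the cost of a single fault; the shadow copy catches up only when the guest issues invlpg or flushes its tlb.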

#Reaction to events

#==================

Reaction to events

==================

#- guest page fault (or npt page fault, or ept violation)

- guest page fault (or npt page fault, or ept violation)

#This is the most complicated event. The cause of a page fault can be:

This is the most complicated event. The cause of a page fault can be:

# - a true guest fault (the guest translation won't allow the access) (*)

# - access to a missing translation

# - access to a protected translation

# - when logging dirty pages, memory is write protected

# - synchronized shadow pages are write protected (*)

# - access to untranslatable memory (mmio)

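- a true guest fault (the guest translation won't allow the access) (*)

- access to a missing translation

- access to a protected translation

  - when logging dirty pages, memory is write protected

  - synchronized shadow pages are write protected (*)

- access to untranslatable memory (mmio)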

# (*) not applicable in direct mode

(*) not applicable in direct mode

#Handling a page fault is performed as follows:

Handling a page fault is performed as follows:

# - if needed, walk the guest page tables to determine the guest translation

# (gva->gpa or ngpa->gpa)

# - if permissions are insufficient, reflect the fault back to the guest

# - determine the host page

# - if this is an mmio request, there is no host page; call the emulator

# to emulate the instruction instead

# - walk the shadow page table to find the spte for the translation,

# instantiating missing intermediate page tables as necessary

# - try to unsynchronize the page

# - if successful, we can let the guest continue and modify the gpte

# - emulate the instruction

# - if failed, unshadow the page and let the guest continue

# - update any translations that were modified by the instruction

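- if needed, walk the guest page tables to determine the guest translation (gva->gpa or ngpa->gpa)

  - if permissions are insufficient, reflect the fault back to the guest

- determine the host page

  - if this is an mmio request, there is no host page; call the emulator to emulate the instruction instead

- walk the shadow page table to find the spte for the translation, instantiating missing intermediate page tables as necessary

- try to unsynchronize the page

  - if successful, we can let the guest continue and modify the gpte

- emulate the instruction

  - if failed, unshadow the page and let the guest continue

- update any translations that were modified by the instruction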

#invlpg handling:

invlpg handling:

# - walk the shadow page hierarchy and drop affected translations

# - try to reinstantiate the indicated translation in the hope that the

# guest will use it in the near future

- walk the shadow page hierarchy and drop affected translations

- try to reinstantiate the indicated translation in the hope that the guest will use it in the near future

#Guest control register updates:

Guest control register updates:

#- mov to cr3

# - look up new shadow roots

# - synchronize newly reachable shadow pages

- mov to cr3

  - look up new shadow roots

  - synchronize newly reachable shadow pages

#- mov to cr0/cr4/efer

# - set up mmu context for new paging mode

# - look up new shadow roots

# - synchronize newly reachable shadow pages

- mov to cr0/cr4/efer

  - set up mmu context for new paging mode

  - look up new shadow roots

  - synchronize newly reachable shadow pages

#Host translation updates:

Host translation updates:

# - mmu notifier called with updated hva

# - look up affected sptes through reverse map

# - drop (or update) translations

- mmu notifier called with updated hva

- look up affected sptes through reverse map

- drop (or update) translations

#Emulating cr0.wp

#================

Emulating cr0.wp

================

#If tdp is not enabled, the host must keep cr0.wp=1 so page write protection

#works for the guest kernel, not guest userspace. When the guest

#cr0.wp=1, this does not present a problem. However when the guest cr0.wp=0,

#we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the

#semantics require allowing any guest kernel access plus user read access).

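If tdp is not enabled, the host must keep cr0.wp=1 so that page write protection works for the guest kernel and not just for guest userspace. When the guest cr0.wp=1, this does not present a problem. However, when the guest cr0.wp=0, we cannot map the permissions for gpte.u=1, gpte.w=0 to any single spte (the semantics require allowing any guest kernel access plus user read access).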

#We handle this by mapping the permissions to two possible sptes, depending

#on fault type:

We handle this by mapping the permissions to two possible sptes, depending on fault type:

#- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access,

# disallows user access)

#- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel

# write access)

- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access, disallows user access)

- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel write access)

#(user write faults generate a #PF)

(user write faults generate a #PF)
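The mapping from fault type to spte permissions for the gpte.u=1, gpte.w=0, guest cr0.wp=0 case can be written out as a small decision function. This is a sketch of the rule stated above, not kernel code; the function name and the `(u, w)` tuple return format are invented for illustration.

```python
from typing import Optional, Tuple


def spte_perms_for_wp0(is_write: bool, is_user: bool) -> Optional[Tuple[int, int]]:
    """Return (spte.u, spte.w), or None when the guest must receive a #PF.

    Applies only to gpte.u=1, gpte.w=0 with guest cr0.wp=0.
    """
    if is_write and is_user:
        return None    # user write: the gpte forbids it, reflect a #PF
    if is_write:
        return (0, 1)  # kernel write: full kernel access, no user access
    return (1, 0)      # read: full read access, no kernel write access
```

A page in this state flips between the two spte forms as the fault type changes, since no single spte can express both.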

#In the first case there is an additional complication if CR4.SMEP is

#enabled: since we've turned the page into a kernel page, the kernel may now

#execute it. We handle this by also setting spte.nx. If we get a user

#fetch or read fault, we'll change spte.u=1 and spte.nx=gpte.nx back.

In the first case there is an additional complication if CR4.SMEP is enabled: since we've turned the page into a kernel page, the kernel may now execute it. We handle this by also setting spte.nx. If we get a user fetch or read fault, we'll change spte.u=1 and spte.nx=gpte.nx back.

#To prevent an spte that was converted into a kernel page with cr0.wp=0

#from being written by the kernel after cr0.wp has changed to 1, we make

#the value of cr0.wp part of the page role. This means that an spte created

#with one value of cr0.wp cannot be used when cr0.wp has a different value -

#it will simply be missed by the shadow page lookup code. A similar issue

#exists when an spte created with cr0.wp=0 and cr4.smep=0 is used after

#changing cr4.smep to 1. To avoid this, the value of !cr0.wp && cr4.smep

#is also made a part of the page role.

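To prevent an spte that was converted into a kernel page with cr0.wp=0 from being written by the kernel after cr0.wp has changed to 1, we make the value of cr0.wp part of the page role. This means that an spte created with one value of cr0.wp cannot be used when cr0.wp has a different value; it will simply be missed by the shadow page lookup code. A similar issue exists when an spte created with cr0.wp=0 and cr4.smep=0 is used after changing cr4.smep to 1. To avoid this, the value of !cr0.wp && cr4.smep is also made a part of the page role.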
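Why putting these bits in the role is enough can be seen by treating the role as part of a lookup key. The sketch below is a reduced model under stated assumptions: `Role` carries only three of the role fields, and the dictionary cache and helper names are hypothetical stand-ins for the shadow page lookup.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Role(NamedTuple):
    level: int
    cr0_wp: bool
    smep_andnot_wp: bool


def make_role(level: int, cr0_wp: bool, cr4_smep: bool) -> Role:
    return Role(level, cr0_wp, cr4_smep and not cr0_wp)


Cache = Dict[Tuple[int, Role], str]


def lookup(cache: Cache, gfn: int, role: Role) -> Optional[str]:
    """A shadow page is found only if both gfn and role match."""
    return cache.get((gfn, role))
```

A page created under cr0.wp=0 is keyed with that value, so after the guest switches to cr0.wp=1 (or enables cr4.smep) the lookup misses and a fresh page is built instead of the stale one being reused.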

#Large pages

#===========

Large pages

===========

#The mmu supports all combinations of large and small guest and host pages.

#Supported page sizes include 4k, 2M, 4M, and 1G. 4M pages are treated as

#two separate 2M pages, on both guest and host, since the mmu always uses PAE

#paging.

The mmu supports all combinations of large and small guest and host pages. Supported page sizes include 4k, 2M, 4M, and 1G. 4M pages are treated as two separate 2M pages, on both guest and host, since the mmu always uses PAE paging.

#To instantiate a large spte, four constraints must be satisfied:

To instantiate a large spte, four constraints must be satisfied:

#- the spte must point to a large host page

#- the guest pte must be a large pte of at least equivalent size (if tdp is

# enabled, there is no guest pte and this condition is satisfied)

#- if the spte will be writeable, the large page frame may not overlap any

# write-protected pages

#- the guest page must be wholly contained by a single memory slot

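- the spte must point to a large host page

- the guest pte must be a large pte of at least equivalent size (if tdp is enabled, there is no guest pte and this condition is satisfied)

- if the spte will be writeable, the large page frame may not overlap any write-protected pages

- the guest page must be wholly contained by a single memory slot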

#To check the last two conditions, the mmu maintains a ->write_count set of

#arrays for each memory slot and large page size. Every write protected page

#causes its write_count to be incremented, thus preventing instantiation of

#a large spte. The frames at the end of an unaligned memory slot have

#artificially inflated ->write_counts so they can never be instantiated.

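To check the last two conditions, the mmu maintains a ->write_count set of arrays for each memory slot and large page size. Every write protected page causes its write_count to be incremented, thus preventing instantiation of a large spte. The frames at the end of an unaligned memory slot have artificially inflated ->write_counts so they can never be instantiated.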
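The write_count bookkeeping can be sketched for a single large page size (2M). This is a simplified model: the `Slot` class and its methods are invented for illustration, the counter is consulted regardless of whether the spte will be writeable (a conservative simplification), and inflating the count at both partially covered edges of the slot is an assumption of the sketch.

```python
PAGES_PER_2M = 512  # 4k pages in one 2M frame


class Slot:
    def __init__(self, base_gfn: int, npages: int) -> None:
        self.first_frame = base_gfn // PAGES_PER_2M
        last_frame = (base_gfn + npages - 1) // PAGES_PER_2M
        self.write_count = [0] * (last_frame - self.first_frame + 1)
        # Frames only partly inside the slot can never be mapped large.
        if base_gfn % PAGES_PER_2M:
            self.write_count[0] += 1
        if (base_gfn + npages) % PAGES_PER_2M:
            self.write_count[-1] += 1

    def _frame(self, gfn: int) -> int:
        return gfn // PAGES_PER_2M - self.first_frame

    def write_protect(self, gfn: int) -> None:
        self.write_count[self._frame(gfn)] += 1

    def write_unprotect(self, gfn: int) -> None:
        self.write_count[self._frame(gfn)] -= 1


def may_use_large_spte(slot: Slot, gfn: int, host_page_is_large: bool,
                       guest_pte_large_or_tdp: bool) -> bool:
    """Check the four constraints for a 2M spte covering gfn."""
    return (host_page_is_large and guest_pte_large_or_tdp
            and slot.write_count[slot._frame(gfn)] == 0)
```

Because one counter answers both "is anything in this frame write protected" and "does this frame spill outside the slot", the check stays a single array lookup.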

#Further reading

#===============

Further reading

===============

#- NPT presentation from KVM Forum 2008

# http://www.linux-kvm.org/wiki/images/c/c8/KvmForum2008%24kdf2008_21.pdf

- NPT presentation from KVM Forum 2008

http://www.linux-kvm.org/wiki/images/c/c8/KvmForum2008%24kdf2008_21.pdf
