
Proportional Resident Set Size

The Resident Set Size (RSS) displayed by top keeps track of the number of resident pages in a certain process's address space. It is very useful for locating memory hogs, but it does not take page sharing into account. For example, if N processes map library L and L's resident pages amount to 1 GB, this 1 GB is added to the RSS of all N processes. A more useful number would be the Proportional (or Effective) RSS, for which each mapped shared page is divided by the number of processes sharing it. So in the previous example we would add 1 GB/N to each process that has L mapped.
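To make the definition concrete, here is a minimal userland sketch (illustration only, not kernel code): each resident page carries the number of processes mapping it, and the proportional RSS of a process is the sum of the page size divided by that share count over its resident pages.

#include <stdio.h>

#define PAGE_SIZE 4096UL

int
main(void)
{
    /* Share counts of the resident pages of a hypothetical process:
     * two private pages and four pages shared by four processes. */
    unsigned long share_count[] = { 1, 1, 4, 4, 4, 4 };
    unsigned long n = sizeof(share_count) / sizeof(share_count[0]);
    double rss = 0.0, prss = 0.0;
    unsigned long i;

    for (i = 0; i < n; i++) {
        rss  += PAGE_SIZE;                           /* plain RSS counts each page fully      */
        prss += (double)PAGE_SIZE / share_count[i];  /* proportional RSS splits shared pages  */
    }
    printf("RSS = %.0f bytes, proportional RSS = %.0f bytes\n", rss, prss);
    return 0;
}

For the six pages above this prints an RSS of 24576 bytes but a proportional RSS of only 12288 bytes, since the four shared pages count for a quarter each.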

The goal of this project was to hack the kernel to allow for effective calculation of the Proportional RSS and modify top to use it in addition to the RSS (i.e. it should display it by default and be able to sort based on it).

Proposal during my early DragonFly days (Design/Implementation)  

The project aims at modifying top to display the proportional RSS in addition to the RSS. One choice is to add a switch to top that enables the proportional RSS: if the user wants to see it we calculate it for him, otherwise we do not waste system time and resources. The other choice is to always display both the proportional RSS and the RSS.

Most of the work to be done is at the page level, which is fine grained, so the space-time trade-off has to be handled carefully. Assume there are 5 processes sharing a library but only three of them actually map a given page. Then there will be one vm_page structure for the page and three pv_entry structures, one per mapping.

First of all we should focus on the data structure that will store the number of processes mapping each page. One candidate is a hash table; the other is a splay tree or an AVL tree. With a hash table we can keep a mapping from each page to the number of processes currently mapping it, a number we can get by traversing the page's list of pv_entry structures. Here we need to take care of the load factor and of the hash function to be used, since both affect how the table performs while searching.
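A minimal sketch of the hash-table idea, again as a userland model with hypothetical names: the key is the page (in the kernel this would be the vm_page pointer), the value is the number of pv_entry structures seen for it, and collisions are resolved by chaining.

#include <stdlib.h>

#define NBUCKETS 1024                   /* chosen with the load factor in mind */

struct share_ent {
    void             *page;             /* key: the page (vm_page in the kernel) */
    unsigned long     nmaps;            /* value: processes currently mapping it */
    struct share_ent *next;             /* chaining resolves hash collisions     */
};

static struct share_ent *buckets[NBUCKETS];

static unsigned int
share_hash(const void *page)
{
    /* pages are page-aligned, so drop the low 12 bits before hashing */
    return (unsigned int)(((unsigned long)page >> 12) % NBUCKETS);
}

/* Record one more mapping of `page'; called once per pv_entry traversed. */
static void
share_count_inc(void *page)
{
    unsigned int h = share_hash(page);
    struct share_ent *e;

    for (e = buckets[h]; e != NULL; e = e->next) {
        if (e->page == page) {
            e->nmaps++;
            return;
        }
    }
    e = malloc(sizeof(*e));
    if (e == NULL)
        return;                         /* out of memory: skip this page */
    e->page = page;
    e->nmaps = 1;
    e->next = buckets[h];
    buckets[h] = e;
}

Once every pv_entry has been visited, nmaps holds the share count of each page, and the table can be freed again after the calculation as described below.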

What if we use a splay tree or an AVL tree? The only problem with the splay tree or AVL tree is the rotation of the tree, which can cost us time. The space consumed by both the table and the tree is proportional to the number of shared pages, but if we consider the time required for a search, a hash table can yield a better result. To make better use of space we can compress the dynamic data structures, choosing a compression scheme that suits the project, for example grouping pages shared by the same number of processes. We can also restrict these data structures to heavily shared pages to save space. Once the calculations are done we can free the data structures.

Even if we have 5 processes sharing the library, we only calculate the proportional RSS over the three processes that actually map the page, so the time complexity is clearly less than O(n). With proper use of the data structures above we can also get more interesting results. Another approach, instead of using a separate data structure, is to add an element to the vm_page structure that keeps the count of the processes mapping that particular page. If the value is zero we go round the list of pv_entry structures and increment the element. The problem with this solution is that the element has to be taken care of and cleared again after the calculations are done. It thus comes with side effects, and in my opinion the approach using a separate data structure is the more feasible one.
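For comparison, here is a sketch of that counter-in-vm_page alternative, on simplified stand-in structures (the real vm_page and pv_entry have many more fields): the counter is filled lazily by walking the page's pv_entry list and has to be cleared again once the calculation is done.

struct pv_entry {
    struct pv_entry *pv_plist;          /* next pv_entry sharing this vm_page */
};

struct page_model {
    struct pv_entry *pv_list;           /* head of the pv_entry list          */
    unsigned long    share_cnt;         /* scratch counter used for this walk */
};

static unsigned long
page_share_count(struct page_model *m)
{
    struct pv_entry *pv;

    if (m->share_cnt == 0) {
        for (pv = m->pv_list; pv != NULL; pv = pv->pv_plist)
            m->share_cnt++;
    }
    return m->share_cnt;                /* caller must reset share_cnt afterwards */
}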


Walk to the Proportional RSS  

The proposal I presented and the final design/implementation are quite different. Here you can have a pictorial walk through the code and a short summary of how it works. Below is a drawing showing a process's VM structures, copied from the PicoBSD kernel walk. The notation uses () to indicate a structure's logical definition, {} to indicate a specific variable, and {{}} to indicate a pool from which a variable is allocated. Arrays are indicated using [].


 {{proc_zone}}
 {allproc}   
                                                                
 (  proc     )
 | p_upages  |--->(vm_object)                                  
 | p_fd      |--->(filedesc)
 | ...       |
 | p_vmspace |--->(    vmspace      )             
 |-----------|    | vm_map:         |
                  |(  vm_map       )|       
                  || lock          ||
                  ||(vm_map_entry)---<---->--(    vm_map_entry   )---<-->--(vm_map_entry)-->...
                  || nentries      ||        | start/end VA      |
                  || size          ||        | max size can grow |
                  || pmap          ||        | object            |----->(     vmobject         )
                  ||---|-----------||        | offset in obj     |      |     handle          -|->--> (vnode)
                  ||   |           ||        | eflags/protection |      | resident_page_count |
                  ||   |           ||        |-------------------|      |  agg_pv_list_count  |  
                   ||   v           ||                                   |----------------------|
                   ||---------------||
                   | vm_pmap:        |
                   |(  pmap         )|
                   || pm_pdir  ->---||--->[ page directory ]
                   || pm_pteobj ->--||--->(vm_object - contains page table)
                   || pm_count      ||
                   ||(pm_stats)     ||
                   || pm_pvlist ->--||--->(pv_entry)
                   ||---------------||         |
                   |------------------|        v
                                        (    pv_entry    )
                                        | pv_list        |---> (pv_entry) - list of all pv_entrys in this pmap
                                        | pv_va          |
                                        | pv_pmap        |---> back to the owning pmap
                                        | pv_plist       |---> (pv_entry) - list of all pv_entrys sharing this vm_page
                                        | pv_ptem        |
                                        |-------|--------|
                                                |
                                                v
                                        (    vm_page     )
   {vm_page_queues[]}-->-->-->-->-->--->| pageq          |-->--> All pages on the FREE, INACTIVE, CACHE queues
   {vm_page_buckets[]}->-->-->-->-->--->| hnext          |-->--> Hash of pages by (vm_object, page index)
                                        | queue          |
                                        | flags          |
                                        ||(md_page)      ||
                                        || pv_list       ||-<--< the pv_entrys sharing this vm_page
                                        ||---------------||
                                        | object         |-->--> (  vm_object  )
                                        | pindex         |       | handle     -|->--> (vnode)
                                        | listq / memq   |-<--<  all vm_pages in memory belonging to the same object
                                        | ...            |
                                        | phys_addr      |-->--> { physical memory }
                                        |----------------|
   {phys_avail[10]}    - up to 10 blocks of contiguous physical memory in the machine
   {vm_page_array[]}   - every page of physical memory is described by a vm_page entry in
                         vm_page_array[]; {vm_page_array_size} gives the size of the array

The vmspace structure describes the address space of the process, including both the machine dependent and machine independent structures. vm_map is the data structure that describes the process's virtual address space. A vm_map_entry describes a virtually contiguous range of addresses; its start/end VA fields give that range. Each vm_map_entry points to a vm_object, which describes the data backing the range of addresses in the vm_map_entry. A vm_page represents a page of physical memory in use. The pmap structure holds the machine dependent side of the mappings. Note the two fields in the vm_object structure, resident_page_count and agg_pv_list_count, both shown in the vm_object box in the diagram. resident_page_count gives the total number of pages of that object currently in memory. agg_pv_list_count gives the aggregate of the pv_list_count of all the vm_pages belonging to the respective vm_object.
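The following is only my reading of how such an aggregate can be kept consistent, again on simplified stand-in structures: whenever a mapping (pv_entry) is added to or removed from a page, the page's pv_list_count and the owning object's agg_pv_list_count move together, so the object always carries the sum described above without any per-page walk.

struct vm_object_model {
    unsigned long resident_page_count;      /* pages of this object currently in memory */
    unsigned long agg_pv_list_count;        /* sum of pv_list_count over those pages    */
};

struct vm_page_model {
    struct vm_object_model *object;         /* back pointer to the owning object        */
    unsigned long           pv_list_count;  /* number of pv_entrys on this page         */
};

static void
page_mapping_added(struct vm_page_model *m)
{
    m->pv_list_count++;
    m->object->agg_pv_list_count++;
}

static void
page_mapping_removed(struct vm_page_model *m)
{
    m->pv_list_count--;
    m->object->agg_pv_list_count--;
}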

For the proportional RSS we simply walk down the address space of the process, through the list of vm_map_entry structures and the vm_object of each entry; this is the path from the vm_map_entry list to the vm_object in the diagram above. For each object the resident_page_count is divided by the average number of mappings per page (agg_pv_list_count / resident_page_count), and the result obtained is accumulated over all the vm_map_entry structures to give the proportional RSS of the process. Readers interested in the patch details and the changes submitted can click here.
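As a rough sketch of that walk, once more on simplified stand-in structures with hypothetical names (the submitted patch operates on the real vm_map_entry and vm_object structures inside the kernel): for each map entry's object the resident pages are scaled down by the average number of mappings per page, and the contributions are summed over the whole address space.

struct object_model {
    unsigned long resident_page_count;  /* resident pages of this object              */
    unsigned long agg_pv_list_count;    /* aggregate pv_list_count over those pages   */
};

struct map_entry_model {
    struct object_model    *object;     /* backing object of this address range       */
    struct map_entry_model *next;       /* next vm_map_entry in the address space     */
};

static unsigned long
proportional_resident_pages(struct map_entry_model *head)
{
    struct map_entry_model *entry;
    unsigned long pages = 0;

    for (entry = head; entry != NULL; entry = entry->next) {
        struct object_model *obj = entry->object;

        if (obj == NULL || obj->agg_pv_list_count == 0)
            continue;
        /* resident pages divided by the average share count per page */
        pages += (obj->resident_page_count * obj->resident_page_count) /
                 obj->agg_pv_list_count;
    }
    return pages;                       /* multiply by the page size to get bytes */
}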

References
1. The Design and Implementation of the FreeBSD Operating System (a book; concentrate on the memory management chapters)
2. Design Elements of the FreeBSD VM (Virtual Memory) System
3. FreeBSD Cross Reference (the structures in the diagram can be looked up there)




Copyright (c) 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010 The DragonFly Project. All rights reserved.