Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394...

diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt

index 706410dfb9e501015e06e7acd63af7ce5f031cba..6aaaeb38730cf0ff1d5389cd73a6a8d2fb6bd413 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -58,7 +58,7 @@ most general to most specific:
         the policy at the time they were allocated.
  
      VMA Policy:  A "VMA" or "Virtual Memory Area" refers to a range of a task's
-    virtual adddress space.  A task may define a specific policy for a range
+    virtual address space.  A task may define a specific policy for a range
      of its virtual address space.   See the MEMORY POLICIES APIS section,
      below, for an overview of the mbind() system call used to set a VMA
      policy.
@@ -145,41 +145,20 @@ Components of Memory Policies
     structure, struct mempolicy.  Details of this structure will be discussed
     in context, below, as required to explain the behavior.
  
-       Note:  in some functions AND in the struct mempolicy itself, the mode
-       is called "policy".  However, to avoid confusion with the policy tuple,
-       this document will continue to use the term "mode".
-
     Linux memory policy supports the following 4 behavioral modes:
  
-       Default Mode--MPOL_DEFAULT:  The behavior specified by this mode is
-       context or scope dependent.
-
-           As mentioned in the Policy Scope section above, during normal
-           system operation, the System Default Policy is hard coded to
-           contain the Default mode.
-
-           In this context, default mode means "local" allocation--that is
-           attempt to allocate the page from the node associated with the cpu
-           where the fault occurs.  If the "local" node has no memory, or the
-           node's memory can be exhausted [no free pages available], local
-           allocation will "fallback to"--attempt to allocate pages from--
-           "nearby" nodes, in order of increasing "distance".
-
-               Implementation detail -- subject to change:  "Fallback" uses
-               a per node list of sibling nodes--called zonelists--built at
-               boot time, or when nodes or memory are added or removed from
-               the system [memory hotplug].  These per node zonelist are
-               constructed with nodes in order of increasing distance based
-               on information provided by the platform firmware.
+       Default Mode--MPOL_DEFAULT:  This mode is only used in the memory
+       policy APIs.  Internally, MPOL_DEFAULT is converted to the NULL
+       memory policy in all policy scopes.  Any existing non-default policy
+       will simply be removed when MPOL_DEFAULT is specified.  As a result,
+       MPOL_DEFAULT means "fall back to the next most specific policy scope."
  
-           When a task/process policy or a shared policy contains the Default
-           mode, this also means "local allocation", as described above.
+           For example, a NULL or default task policy will fall back to the
+           system default policy.  A NULL or default vma policy will fall
+           back to the task policy.
  
-           In the context of a VMA, Default mode means "fall back to task
-           policy"--which may or may not specify Default mode.  Thus, Default
-           mode can not be counted on to mean local allocation when used
-           on a non-shared region of the address space.  However, see
-           MPOL_PREFERRED below.
+           When specified in one of the memory policy APIs, the Default mode
+           does not use the optional set of nodes.
  
             It is an error for the set of nodes specified for this policy to
             be non-empty.
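
The fallback order described in the new Default mode text can be sketched in
user-space C. This is only an illustrative model, not kernel code: the names
scope_mode, scope_policy, scope_install() and scope_effective() are
hypothetical, and only the vma, task and system default scopes mentioned in
the text are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the policy modes; values are illustrative. */
enum scope_mode { SCOPE_DEFAULT, SCOPE_PREFERRED, SCOPE_BIND, SCOPE_INTERLEAVE };

/* Hypothetical, heavily simplified policy object. */
struct scope_policy {
    enum scope_mode mode;
};

/*
 * Model of "MPOL_DEFAULT is converted to the NULL memory policy":
 * requesting the default mode yields NULL, which removes any
 * existing policy in that scope.
 */
static const struct scope_policy *
scope_install(const struct scope_policy *requested)
{
    if (requested == NULL || requested->mode == SCOPE_DEFAULT)
        return NULL;
    return requested;
}

/*
 * Model of the fallback: a NULL vma policy falls back to the task
 * policy, and a NULL task policy falls back to the system default.
 * The system default is assumed to be always present.
 */
static const struct scope_policy *
scope_effective(const struct scope_policy *vma_pol,
                const struct scope_policy *task_pol,
                const struct scope_policy *system_default)
{
    assert(system_default != NULL);
    if (vma_pol != NULL)
        return vma_pol;
    if (task_pol != NULL)
        return task_pol;
    return system_default;
}
```

In this model, installing a default-mode policy on a vma makes
scope_effective() return the task policy, or the system default policy if the
task has none.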
@@ -191,19 +170,23 @@ Components of Memory Policies
  
         MPOL_PREFERRED:  This mode specifies that the allocation should be
         attempted from the single node specified in the policy.  If that
-       allocation fails, the kernel will search other nodes, exactly as
-       it would for a local allocation that started at the preferred node
-       in increasing distance from the preferred node.  "Local" allocation
-       policy can be viewed as a Preferred policy that starts at the node
+       allocation fails, the kernel will search other nodes, in order of
+       increasing distance from the preferred node based on information
+       provided by the platform firmware.
         containing the cpu where the allocation takes place.
  
             Internally, the Preferred policy uses a single node--the
-           preferred_node member of struct mempolicy.  A "distinguished
-           value of this preferred_node, currently '-1', is interpreted
-           as "the node containing the cpu where the allocation takes
-           place"--local allocation.  This is the way to specify
-           local allocation for a specific range of addresses--i.e. for
-           VMA policies.
+           preferred_node member of struct mempolicy.  When the internal
+           mode flag MPOL_F_LOCAL is set, the preferred_node is ignored and
+           the policy is interpreted as local allocation.  "Local" allocation
+           policy can be viewed as a Preferred policy that starts at the node
+           containing the cpu where the allocation takes place.
+
+           It is possible for the user to specify that local allocation is
+           always preferred by passing an empty nodemask with this mode.
+           If an empty nodemask is passed, the policy cannot use the
+           MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described
+           below.
  
         MPOL_INTERLEAVED:  This mode specifies that page allocations be
         interleaved, on a page granularity, across the nodes specified in
@@ -254,7 +237,10 @@ Components of Memory Policies
             occurs over that node.  If no nodes from the user's nodemask are
             now allowed, the Default behavior is used.
  
-           MPOL_F_STATIC_NODES cannot be used with MPOL_F_RELATIVE_NODES.
+           MPOL_F_STATIC_NODES cannot be combined with the
+           MPOL_F_RELATIVE_NODES flag.  It also cannot be used for
+           MPOL_PREFERRED policies that were created with an empty nodemask
+           (local allocation).
  
         MPOL_F_RELATIVE_NODES:  This flag specifies that the nodemask passed
         by the user will be mapped relative to the set of the task or VMA's
@@ -301,7 +287,78 @@ Components of Memory Policies
             set of memory nodes allowed by the task's cpuset, as that may
             change over time.
  
-           MPOL_F_RELATIVE_NODES cannot be used with MPOL_F_STATIC_NODES.
+           MPOL_F_RELATIVE_NODES cannot be combined with the
+           MPOL_F_STATIC_NODES flag.  It also cannot be used for
+           MPOL_PREFERRED policies that were created with an empty nodemask
+           (local allocation).
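
The flag restrictions stated above can be summarised as a small validity
check. This is a user-space sketch under stated assumptions, not the kernel's
validation code: chk_mode, the CHK_F_* bit values and chk_flags_valid() are
hypothetical names chosen for illustration, and the real flag values are not
reproduced here.

```c
#include <stdbool.h>

/* Hypothetical flag bits; the kernel's actual values differ. */
#define CHK_F_STATIC_NODES   (1u << 0)
#define CHK_F_RELATIVE_NODES (1u << 1)

/* Hypothetical stand-ins for the policy modes. */
enum chk_mode { CHK_DEFAULT, CHK_PREFERRED, CHK_BIND, CHK_INTERLEAVE };

/*
 * Return true if the mode, flags and nodemask combination is allowed
 * by the rules described in the text:
 *  - the Default mode must be given an empty set of nodes;
 *  - the static and relative flags cannot be combined;
 *  - a Preferred policy with an empty nodemask (local allocation)
 *    cannot use either flag.
 */
static bool chk_flags_valid(enum chk_mode mode, unsigned int flags,
                            bool nodemask_empty)
{
    bool is_static = (flags & CHK_F_STATIC_NODES) != 0;
    bool is_relative = (flags & CHK_F_RELATIVE_NODES) != 0;

    if (mode == CHK_DEFAULT && !nodemask_empty)
        return false;
    if (is_static && is_relative)
        return false;
    if (mode == CHK_PREFERRED && nodemask_empty && (is_static || is_relative))
        return false;
    return true;
}
```

For example, chk_flags_valid(CHK_PREFERRED, CHK_F_STATIC_NODES, true) is
rejected, while the same call with a non-empty nodemask is accepted.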
+
+MEMORY POLICY REFERENCE COUNTING
+
+To resolve use/free races, struct mempolicy contains an atomic reference
+count field.  The internal interfaces mpol_get()/mpol_put() increment and
+decrement this reference count, respectively.  mpol_put() will only free
+the structure back to the mempolicy kmem cache when the reference count
+goes to zero.
+
+When a new memory policy is allocated, its reference count is initialized
+to '1', representing the reference held by the task that is installing the
+new policy.  When a pointer to a memory policy structure is stored in another
+structure, another reference is added, as the task's reference will be dropped
+on completion of the policy installation.
+
+During run-time "usage" of the policy, we attempt to minimize atomic operations
+on the reference count, as this can lead to cache lines bouncing between cpus
+and NUMA nodes.  "Usage" here means one of the following:
+
+1) querying of the policy, either by the task itself [using the get_mempolicy()
+   API discussed below] or by another task using the /proc/<pid>/numa_maps
+   interface.
+
+2) examination of the policy to determine the policy mode and associated node
+   or node lists, if any, for page allocation.  This is considered a "hot
+   path".  Note that for MPOL_BIND, the "usage" extends across the entire
+   allocation process, which may sleep during page reclamation, because the
+   BIND policy nodemask is used, by reference, to filter ineligible nodes.
+
+We can avoid taking an extra reference during the usages listed above as
+follows:
+
+1) we never need to get/free the system default policy as this is never
+   changed nor freed, once the system is up and running.
+
+2) for querying the policy, we do not need to take an extra reference on the
+   target task's task policy nor vma policies because we always acquire the
+   task's mm's mmap_sem for read during the query.  The set_mempolicy() and
+   mbind() APIs [see below] always acquire the mmap_sem for write when
+   installing or replacing task or vma policies.  Thus, there is no possibility
+   of a task or thread freeing a policy while another task or thread is
+   querying it.
+
+3) Page allocation usage of task or vma policy occurs in the fault path where
+   we hold the mmap_sem for read.  Again, because replacing the task or vma
+   policy requires that the mmap_sem be held for write, the policy can't be
+   freed out from under us while we're using it for page allocation.
+
+4) Shared policies require special consideration.  One task can replace a
+   shared memory policy while another task, with a distinct mmap_sem, is
+   querying or allocating a page based on the policy.  To resolve this
+   potential race, the shared policy infrastructure adds an extra reference
+   to the shared policy during lookup while holding a spin lock on the shared
+   policy management structure.  This requires that we drop this extra
+   reference when we're finished "using" the policy.  We must drop the
+   extra reference on shared policies in the same query/allocation paths
+   used for non-shared policies.  For this reason, shared policies are marked
+   as such, and the extra reference is dropped "conditionally"--i.e., only
+   for shared policies.
+
+   Because of this extra reference counting, and because we must look up
+   shared policies in a tree structure under spinlock, shared policies are
+   more expensive to use in the page allocation path.  This is especially
+   true for shared policies on shared memory regions shared by tasks running
+   on different NUMA nodes.  This extra overhead can be avoided by always
+   falling back to task or system default policy for shared memory regions,
+   or by prefaulting the entire shared memory region into memory and locking
+   it down.  However, this might not be appropriate for all applications.
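
The reference counting rules above, including the conditional drop for shared
policies, can be modelled in user space with C11 atomics. This is a minimal
sketch and not the kernel implementation: ref_policy, ref_new(), ref_get(),
ref_put(), ref_lookup_shared() and ref_cond_put() are hypothetical names,
malloc()/free() stand in for the mempolicy kmem cache, and the spin lock held
during shared policy lookup is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified policy object. */
struct ref_policy {
    atomic_int refcnt;  /* models the atomic reference count field */
    bool shared;        /* models the "marked as shared" property */
};

static int ref_frees;   /* demo-only counter of freed policies */

/* A new policy starts with one reference, held by the installing task. */
static struct ref_policy *ref_new(bool shared)
{
    struct ref_policy *p = malloc(sizeof(*p));
    if (p == NULL)
        return NULL;
    atomic_init(&p->refcnt, 1);
    p->shared = shared;
    return p;
}

static void ref_get(struct ref_policy *p)
{
    atomic_fetch_add(&p->refcnt, 1);
}

/* Free the object only when the last reference is dropped. */
static void ref_put(struct ref_policy *p)
{
    int old = atomic_fetch_sub(&p->refcnt, 1);
    assert(old > 0);
    if (old == 1) {
        ref_frees++;
        free(p);
    }
}

/* Shared policy lookup takes an extra reference for the caller. */
static struct ref_policy *ref_lookup_shared(struct ref_policy *p)
{
    ref_get(p);
    return p;
}

/* After use, drop the extra reference only for shared policies. */
static void ref_cond_put(struct ref_policy *p)
{
    if (p->shared)
        ref_put(p);
}
```

In this model, the query and allocation paths call ref_cond_put() on whatever
policy they used. For task and vma policies the call does nothing, which
mirrors the case where mmap_sem already protects the policy; for shared
policies it releases the reference taken during lookup.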
  
  MEMORY POLICY APIs