Add page migration support via swap to the NUMA policy layer
This patch adds page migration support to the NUMA policy layer. An
additional flag MPOL_MF_MOVE is introduced for mbind. If MPOL_MF_MOVE is
specified then pages that do not conform to the memory policy will be evicted
from memory. When they get pages back in new pages will be allocated
following the numa policy.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] Swap Migration V5: migrate_pages() function
This adds the basic page migration function with a minimal implementation that
only allows the eviction of pages to swap space.
Page eviction and migration may be useful to migrate pages, to suspend
programs or for remapping single pages (useful for faulty pages or pages with
soft ECC failures)
The process is as follows:
The function wanting to migrate pages must first build a list of pages to be
migrated or evicted and take them off the lru lists via isolate_lru_page().
isolate_lru_page determines that a page is freeable based on the LRU bit set.
Then the actual migration or swapout can happen by calling migrate_pages().
migrate_pages does its best to migrate or swapout the pages and does multiple
passes over the list. Some pages may only be swappable if they are not dirty.
migrate_pages may start writing out dirty pages in the initial passes over
the pages. However, migrate_pages may not be able to migrate or evict all
pages for a variety of reasons.
The remaining pages may be returned to the LRU lists using putback_lru_pages().
Changelog V4->V5:
- Use the lru caches to return pages to the LRU
Changelog V3->V4:
- Restructure code so that applying patches to support full migration does
require minimal changes. Rename swapout_pages() to migrate_pages().
Changelog V2->V3:
- Extract common code from shrink_list() and swapout_pages()
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: "Michael Kerrisk" <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the start of the `swap migration' patch series.
Swap migration allows the moving of the physical location of pages between
nodes in a numa system while the process is running. This means that the
virtual addresses that the process sees do not change. However, the system
rearranges the physical location of those pages.
The main intent of page migration patches here is to reduce the latency of
memory access by moving pages near to the processor where the process
accessing that memory is running.
The patchset allows a process to manually relocate the node on which its
pages are located through the MF_MOVE and MF_MOVE_ALL options while
setting a new memory policy.
The pages of process can also be relocated from another process using the
sys_migrate_pages() function call. Requires CAP_SYS_ADMIN. The migrate_pages
function call takes two sets of nodes and moves pages of a process that are
located on the from nodes to the destination nodes.
Manual migration is very useful if for example the scheduler has relocated a
process to a processor on a distant node. A batch scheduler or an
administrator can detect the situation and move the pages of the process
nearer to the new processor.
sys_migrate_pages() could be used on non-numa machines as well, to force all
of a particualr process's pages out to swap, if someone thinks that's useful.
Larger installations usually partition the system using cpusets into sections
of nodes. Paul has equipped cpusets with the ability to move pages when a
task is moved to another cpuset. This allows automatic control over locality
of a process. If a task is moved to a new cpuset then also all its pages are
moved with it so that the performance of the process does not sink
dramatically (as is the case today).
Swap migration works by simply evicting the page. The pages must be faulted
back in. The pages are then typically reallocated by the system near the node
where the process is executing.
For swap migration the destination of the move is controlled by the allocation
policy. Cpusets set the allocation policy before calling sys_migrate_pages()
in order to move the pages as intended.
No allocation policy changes are performed for sys_migrate_pages(). This
means that the pages may not faulted in to the specified nodes if no
allocation policy was set by other means. The pages will just end up near the
node where the fault occurred.
There's another patch series in the pipeline which implements "direct
migration".
The direct migration patchset extends the migration functionality to avoid
going through swap. The destination node of the relation is controllable
during the actual moving of pages. The crutch of using the allocation policy
to relocate is not necessary and the pages are moved directly to the target.
Its also faster since swap is not used.
And sys_migrate_pages() can then move pages directly to the specified node.
Implement functions to isolate pages from the LRU and put them back later.
This patch:
An earlier implementation was provided by Hirokazu Takahashi
<taka@valinux.co.jp> and IWAMOTO Toshihiro <iwamoto@valinux.co.jp> for the
memory hotplug project.
From: Magnus
This breaks out isolate_lru_page() and putpack_lru_page(). Needed for swap
migration.
Signed-off-by: Magnus Damm <magnus.damm@gmail.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
swap migration's isolate_lru_page() currently uses an IPI to notify other
processors that the lru caches need to be drained if the page cannot be
found on the LRU. The IPI interrupt may interrupt a processor that is just
processing lru requests and cause a race condition.
This patch introduces a new function run_on_each_cpu() that uses the
keventd() to run the LRU draining on each processor. Processors disable
preemption when dealing the LRU caches (these are per processor) and thus
executing LRU draining from another process is safe.
Thanks to Lee Schermerhorn <lee.schermerhorn@hp.com> for finding this race
condition.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Sun, 8 Jan 2006 09:00:42 +0000 (01:00 -0800)]
[PATCH] mm: free_pages opt
Try to streamline free_pages_bulk by ensuring callers don't pass in a
'count' that exceeds the list size.
Some cleanups:
Rename __free_pages_bulk to __free_one_page.
Put the page list manipulation from __free_pages_ok into free_one_page.
Make __free_pages_ok static.
Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rohit Seth [Sun, 8 Jan 2006 09:00:40 +0000 (01:00 -0800)]
[PATCH] Make high and batch sizes of per_cpu_pagelists configurable
As recently there has been lot of traffic on the right values for batch and
high water marks for per_cpu_pagelists. This patch makes these two
variables configurable through /proc interface.
A new tunable /proc/sys/vm/percpu_pagelist_fraction is added. This entry
controls the fraction of pages at most in each zone that are allocated for
each per cpu page list. The min value for this is 8. It means that we
don't allow more than 1/8th of pages in each zone to be allocated in any
single per_cpu_pagelist.
The batch value of each per cpu pagelist is also updated as a result. It
is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)
Andrew Morton [Sun, 8 Jan 2006 09:00:39 +0000 (01:00 -0800)]
[PATCH] drop-pagecache
Add /proc/sys/vm/drop_caches. When written to, this will cause the kernel to
discard as much pagecache and/or reclaimable slab objects as it can. THis
operation requires root permissions.
It won't drop dirty data, so the user should run `sync' first.
Caveats:
a) Holds inode_lock for exorbitant amounts of time.
b) Needs to be taught about NUMA nodes: propagate these all the way through
so the discarding can be controlled on a per-node basis.
This is a debugging feature: useful for getting consistent results between
filesystem benchmarks. We could possibly put it under a config option, but
it's less than 300 bytes.
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pekka Enberg [Sun, 8 Jan 2006 09:00:36 +0000 (01:00 -0800)]
[PATCH] slab: extract slab order calculation to separate function
This patch moves the ugly loop that determines the 'optimal' size (page order)
of cache slabs from kmem_cache_create() to a separate function and cleans it
up a bit.
Thanks to Matthew Wilcox for the help with this patch.
Signed-off-by: Matthew Dobson <colpatch@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pekka Enberg [Sun, 8 Jan 2006 09:00:33 +0000 (01:00 -0800)]
[PATCH] slab: remove unused align parameter from alloc_percpu
__alloc_percpu and alloc_percpu both take an 'align' argument which is
completely ignored. snmp6_mib_init() in net/ipv6/af_inet6.c attempts to use
it, but it will be ignored. Therefore, remove the 'align' argument and fixup
the lone caller.
Signed-off-by: Matthew Dobson <colpatch@us.ibm.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Olaf Hering [Sun, 8 Jan 2006 09:00:32 +0000 (01:00 -0800)]
[PATCH] Fix compilation with CONFIG_MEMORY_HOTPLUG=y and gcc41.
Fix compilation with CONFIG_MEMORY_HOTPLUG=y and gcc41.
Also remove unneeded declations, add a public function.
drivers/base/memory.c:53: error: static declaration of 'register_memory_notifier' follows non-static declaration
include/linux/memory.h:85: error: previous declaration of 'register_memory_notifier' was here
drivers/base/memory.c:58: error: static declaration of 'unregister_memory_notifier' follows non-static declaration
include/linux/memory.h:86: error: previous declaration of 'unregister_memory_notifier' was here
drivers/base/memory.c:68: error: static declaration of 'register_memory' follows non-static declaration
include/linux/memory.h:73: error: previous declaration of 'register_memory' was here
Signed-off-by: Olaf Hering <olh@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Woody Suwalski [Sun, 8 Jan 2006 09:00:31 +0000 (01:00 -0800)]
[PATCH] ARM Netwinder watchdog wdt977 update
Cleanup for the ARM-only watchdog driver wdt977.
This is probably the last update, since we want to merge with w83977f_wdt.
Jose Goncalves has ported this driver to i386, so probably we can iron out
configuration differences.
Signed-off-by: Woody Suwalski <woodys@xandros.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adrian Bunk [Sat, 7 Jan 2006 21:24:25 +0000 (13:24 -0800)]
[IPV6]: small cleanups
This patch contains the following cleanups:
- addrconf.c: make addrconf_dad_stop() static
- inet6_connection_sock.c should #include <net/inet6_connection_sock.h>
for getting the prototypes of it's global functions
Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Sat, 7 Jan 2006 07:06:30 +0000 (23:06 -0800)]
[NETFILTER]: Handle NAT in IPsec policy checks
Handle NAT of decapsulated IPsec packets by reconstructing the struct flowi
of the original packet from the conntrack information for IPsec policy
checks.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Sat, 7 Jan 2006 07:06:10 +0000 (23:06 -0800)]
[NETFILTER]: Keep conntrack reference until IPsec policy checks are done
Keep the conntrack reference until policy checks have been performed for
IPsec NAT support. The reference needs to be dropped before a packet is
queued to avoid having the conntrack module unloadable.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Sat, 7 Jan 2006 07:05:36 +0000 (23:05 -0800)]
[NETFILTER]: Redo policy lookups after NAT when neccessary
When NAT changes the key used for the xfrm lookup it needs to be done
again. If a new policy is returned in POST_ROUTING the packet needs
to be passed to xfrm4_output_one manually after all hooks were called
because POST_ROUTING is called with fixed okfn (ip_finish_output).
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Sat, 7 Jan 2006 07:05:17 +0000 (23:05 -0800)]
[NETFILTER]: Use conntrack information to determine if packet was NATed
Preparation for IPsec support for NAT:
Use conntrack information instead of saving the saving and comparing the
addresses to determine if a packet was NATed and needs to be rerouted to
make it easier to extend the key.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Sat, 7 Jan 2006 07:04:54 +0000 (23:04 -0800)]
[NETFILTER]: Fix xfrm lookup in ip_route_me_harder/ip6_route_me_harder
ip_route_me_harder doesn't use the port numbers of the xfrm lookup and
uses ip_route_input for non-local addresses which doesn't do a xfrm
lookup, ip6_route_me_harder doesn't do a xfrm lookup at all.
Use xfrm_decode_session and do the lookup manually, make sure both
only do the lookup if the packet hasn't been transformed already.
Makeing sure the lookup only happens once needs a new field in the
IP6CB, which exceeds the size of skb->cb. The size of skb->cb is
increased to 48b. Apparently the IPv6 mobile extensions need some
more room anyway.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Sat, 7 Jan 2006 07:04:01 +0000 (23:04 -0800)]
[IPV4]: reset IPCB flags when neccessary
Reset IPSKB_XFRM_TUNNEL_SIZE flags in ipip and ip_gre hard_start_xmit
function before the packet reenters IP. This is neccessary so the
encapsulated packets are checked not to be oversized in xfrm4_output.c
again. Reset all flags in sit when a packet changes its address family.
Also remove some obsolete IPSKB flags.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Sat, 7 Jan 2006 07:03:34 +0000 (23:03 -0800)]
[IPV4/6]: Netfilter IPsec input hooks
When the innermost transform uses transport mode the decapsulated packet
is not visible to netfilter. Pass the packet through the PRE_ROUTING and
LOCAL_IN hooks again before handing it to upper layer protocols to make
netfilter-visibility symetrical to the output path.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Sat, 7 Jan 2006 07:02:34 +0000 (23:02 -0800)]
[IPV6]: Move nextheader offset to the IP6CB
Move nextheader offset to the IP6CB to make it possible to pass a
packet to ip6_input_finish multiple times and have it skip already
parsed headers. As a nice side effect this gets rid of the manual
hopopts skipping in ip6_input_finish.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Sat, 7 Jan 2006 07:01:48 +0000 (23:01 -0800)]
[XFRM]: Netfilter IPsec output hooks
Call netfilter hooks before IPsec transforms. Packets visit the
FORWARD/LOCAL_OUT and POST_ROUTING hook before the first encapsulation
and the LOCAL_OUT and POST_ROUTING hook before each following tunnel mode
transform.
Patch from Herbert Xu <herbert@gondor.apana.org.au>:
Move the loop from dst_output into xfrm4_output/xfrm6_output since they're
the only ones who need to it. xfrm{4,6}_output_one() processes the first SA
all subsequent transport mode SAs and is called in a loop that calls the
netfilter hooks between each two calls.
In order to avoid the tail call issue, I've added the inline function
nf_hook which is nf_hook_slow plus the empty list check.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Luiz Capitulino [Sat, 7 Jan 2006 06:59:43 +0000 (22:59 -0800)]
[XFRM]: Fix sparse warning.
security/selinux/xfrm.c:155:10: warning: Using plain integer as NULL pointer
Signed-off-by: Luiz Capitulino <lcapitulino@mandriva.com.br> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Knut Petersen [Sat, 7 Jan 2006 09:22:04 +0000 (10:22 +0100)]
[PATCH] fbcon: don´t call set_par() in fbcon_init() if vc_mode == KD_GRAPHICS
Nothing prevents a user to modprobe a framebuffer driver from e.g. the
xterm prompt. As a result, the set_par() function of the driver will be
called from fbcon_init().
This is fatal as a lot of X / framebuffer combinations are unable to
recover from set_par() reprogramming the graphics controller in
KD_GRAPHICS mode.
It is also unnecessary as the set_par() function will be called during a
switch to KD_TEXT anyway. Because of this no side effects are possible.
Russell King [Sat, 7 Jan 2006 13:52:45 +0000 (13:52 +0000)]
[ARM] Move AMBA include files to include/linux/amba/
Since the ARM AMBA bus is used on MIPS as well as ARM, we need
to make the bus available for other architectures to use. Move
the AMBA include files from include/asm-arm/hardware/ to
include/linux/amba/
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Andre McCurdy [Sat, 7 Jan 2006 11:39:20 +0000 (11:39 +0000)]
[ARM] 3239/1: Add ARM optimised swab32
Patch from Andre McCurdy
Replaces generic swab32 routine with a more ARM friendly version.
Reduces kernel text size by approx 1200 bytes when compiled with
3.4.4 and approx 2400 bytes with 4.0.2
Probably some performance benefit as well.
Signed-off-by: Andre McCurdy <armccurdy@yahoo.co.uk> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Correction of the code broken by update
whole-tree platform devices update.
Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
I've spent the past 3 days digging into a glibc testsuite failure in
current CVS, specifically libc/rt/tst-cputimer1.c The thr1 and thr2
timers fire too early in the second pass of this test. The second
pass is noteworthy because it makes use of intervals, whereas the
first pass does not.
All throughout the posix-cpu-timers.c code, the calculation of the
process sched_time sum is implemented roughly as:
unsigned long long sum;
sum = tsk->signal->sched_time;
t = tsk;
do {
sum += t->sched_time;
t = next_thread(t);
} while (t != tsk);
In fact this is the exact scheme used by check_process_timers().
In the case of check_process_timers(), current->sched_time has just
been updated (via scheduler_tick(), which is invoked by
update_process_times(), which subsequently invokes
run_posix_cpu_timers()) So there is no special processing necessary
wrt. that.
In other contexts, we have to allot for the fact that tsk->sched_time
might be a bit out of date if we are current. And the
posix-cpu-timers.c code uses current_sched_time() to deal with that.
Unfortunately it does so in an erroneous and inconsistent manner in
one spot which is what results in the early timer firing.
In cpu_clock_sample_group_locked(), it does this:
cpu->sched = p->signal->sched_time;
/* Add in each other live thread. */
while ((t = next_thread(t)) != p) {
cpu->sched += t->sched_time;
}
if (p->tgid == current->tgid) {
/*
* We're sampling ourselves, so include the
* cycles not yet banked. We still omit
* other threads running on other CPUs,
* so the total can always be behind as
* much as max(nthreads-1,ncpus) * (NSEC_PER_SEC/HZ).
*/
cpu->sched += current_sched_time(current);
} else {
cpu->sched += p->sched_time;
}
The problem is the "p->tgid == current->tgid" test. If "p" is
not current, and the tgids are the same, we will add the process
t->sched_time twice into cpu->sched and omit "p"'s sched_time
which is very very very wrong.
posix-cpu-timers.c has a helper function, sched_ns(p) which takes care
of this, so my fix is to use that here instead of this special tgid
test.
The fact that current can be one of the sub-threads of "p" points out
that we could make things a little bit more accurate, perhaps by using
sched_ns() on every thread we process in these loops. It also points
out that we don't use the most accurate value for threads in the group
actively running other cpus (and this is mentioned in the comment).
But that is a future enhancement, and this fix here definitely makes
sense.
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It looks like the bridge netfilter code does not correctly update
the hardware checksum after popping off the VLAN header.
This is by inspection, I have *not* tested this.
To test you would need to set up a filtering bridge with vlans
and a device the does hardware receive checksum (skge, or sungem)
Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Shaun Pereira [Fri, 6 Jan 2006 21:11:35 +0000 (13:11 -0800)]
[X25]: Fix for broken x25 module.
When a user-space server application calls bind on a socket, then in kernel
space this bound socket is considered 'x25-linked' and the SOCK_ZAPPED flag
is unset.(As in x25_bind()/af_x25.c).
Now when a user-space client application attempts to connect to the server
on the listening socket, if the kernel accepts this in-coming call, then it
returns a new socket to userland and attempts to reply to the caller.
The reply/x25_sendmsg() will fail, because the new socket created on
call-accept has its SOCK_ZAPPED flag set by x25_make_new().
(sock_init_data() called by x25_alloc_socket() called by x25_make_new()
sets the flag to SOCK_ZAPPED)).
Fix: Using the sock_copy_flag() routine available in sock.h fixes this.
Tested on 32 and 64 bit kernels with x25 over tcp.
Signed-off-by: Shaun Pereira <pereira.shaun@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
J. Bruce Fields [Tue, 3 Jan 2006 08:56:01 +0000 (09:56 +0100)]
SUNRPC: Make krb5 report unsupported encryption types
Print messages when an unsupported encrytion algorthm is requested or
there is an error locating a supported algorthm.
Signed-off-by: Kevin Coffman <kwc@citi.umich.edu> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
J. Bruce Fields [Tue, 3 Jan 2006 08:56:01 +0000 (09:56 +0100)]
SUNRPC: Make spkm3 report unsupported encryption types
Print messages when an unsupported encrytion algorthm is requested or
there is an error locating a supported algorthm.
Signed-off-by: Kevin Coffman <kwc@citi.umich.edu> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
J. Bruce Fields [Tue, 3 Jan 2006 08:56:00 +0000 (09:56 +0100)]
SUNRPC: Update the spkm3 code to use the make_checksum interface
Also update the tokenlen calculations to accomodate g_token_size().
Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trond Myklebust [Tue, 3 Jan 2006 08:55:57 +0000 (09:55 +0100)]
NFSv4: Allow entries in the idmap cache to expire
If someone changes the uid/gid mapping in userland, then we do eventually
want those changes to be propagated to the kernel. Currently the kernel
assumes that it may cache entries forever.
Add an expiration time + garbage collector for idmap entries.
Trond Myklebust [Tue, 3 Jan 2006 08:55:55 +0000 (09:55 +0100)]
SUNRPC: Ensure client closes the socket when server initiates a close
If the server decides to close the RPC socket, we currently don't actually
respond until either another RPC call is scheduled, or until xprt_autoclose()
gets called by the socket expiry timer (which may be up to 5 minutes
later).
This patch ensures that xprt_autoclose() is called much sooner if the
server closes the socket.
Chuck Lever [Tue, 3 Jan 2006 08:55:52 +0000 (09:55 +0100)]
SUNRPC: get rid of cl_chatty
Clean up: Every ULP that uses the in-kernel RPC client, except the NLM
client, sets cl_chatty. There's no reason why NLM shouldn't set it, so
just get rid of cl_chatty and always be verbose.
Test-plan:
Compile with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <cel@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chuck Lever [Tue, 3 Jan 2006 08:55:51 +0000 (09:55 +0100)]
SUNRPC: transport switch API for setting port number
At some point, transport endpoint addresses will no longer be IPv4. To hide
the structure of the rpc_xprt's address field from ULPs and port mappers,
add an API for setting the port number during an RPC bind operation.
Test-plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked.
Probably need to rig a server where certain services aren't running, or
that returns an error for some typical operation.
Signed-off-by: Chuck Lever <cel@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chuck Lever [Tue, 3 Jan 2006 08:55:50 +0000 (09:55 +0100)]
SUNRPC: new interface to force an RPC rebind
We'd like to hide fields in rpc_xprt and rpc_clnt from upper layer protocols.
Start by creating an API to force RPC rebind, replacing logic that simply
sets cl_port to zero.
Test-plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked.
Probably need to rig a server where certain services aren't running, or
that returns an error for some typical operation.
Signed-off-by: Chuck Lever <cel@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chuck Lever [Tue, 3 Jan 2006 08:55:49 +0000 (09:55 +0100)]
SUNRPC: switchable buffer allocation
Add RPC client transport switch support for replacing buffer management
on a per-transport basis.
In the current IPv4 socket transport implementation, RPC buffers are
allocated as needed for each RPC message that is sent. Some transport
implementations may choose to use pre-allocated buffers for encoding,
sending, receiving, and unmarshalling RPC messages, however. For
transports capable of direct data placement, the buffers can be carved
out of a pre-registered area of memory rather than from a slab cache.
Test-plan:
Millions of fsx operations. Performance characterization with "sio" and
"iozone". Use oprofile and other tools to look for significant regression
in CPU utilization.
Signed-off-by: Chuck Lever <cel@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
J. Bruce Fields [Tue, 3 Jan 2006 08:55:48 +0000 (09:55 +0100)]
NFSv3: try get_root user-supplied security_flavor
Thanks to Ed Keizer for bug and root cause. He says: "... we could only mount
the top-level Solaris share. We could not mount deeper into the tree.
Investigation showed that Solaris allows UNIX authenticated FSINFO only on the
top level of the share. This is a problem because we share/export our home
directories one level higher than we mount them. I.e. we share the partition
and not the individual home directories. This prevented access to home
directories."
We still may need to try auth_sys for the case where the client doesn't have
appropriate credentials.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
J. Bruce Fields [Tue, 3 Jan 2006 08:55:46 +0000 (09:55 +0100)]
NLM: Further cancel fixes
If the server receives an NLM cancel call and finds no waiting lock to
cancel, then chances are the lock has already been applied, and the client
just hadn't yet processed the NLM granted callback before it sent the
cancel.
The Open Group text, for example, perimts a server to return either success
(LCK_GRANTED) or failure (LCK_DENIED) in this case. But returning an error
seems more helpful; the client may be able to use it to recognize that a
race has occurred and to recover from the race.
So, modify the relevant functions to return an error in this case.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
J. Bruce Fields [Tue, 3 Jan 2006 08:55:44 +0000 (09:55 +0100)]
NLM: don't unlock on cancel requests
Currently when lockd gets an NLM_CANCEL request, it also does an unlock for
the same range. This is incorrect.
The Open Group documentation says that "This procedure cancels an
*outstanding* blocked lock request." (Emphasis mine.)
Also, consider a client that holds a lock on the first byte of a file, and
requests a lock on the entire file. If the client cancels that request
(perhaps because the requesting process is signalled), the server shouldn't
apply perform an unlock on the entire file, since that will also remove the
previous lock that the client was already granted.
Or consider a lock request that actually *downgraded* an exclusive lock to
a shared lock.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
J. Bruce Fields [Tue, 3 Jan 2006 08:55:42 +0000 (09:55 +0100)]
NLM: Clean up nlmsvc_grant_reply locking
Slightly simpler logic here makes it more trivial to verify that the up's
and down's are balanced here. Break out an assignment from a conditional
while we're at it.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch removes ths unused function xdr_decode_string().
Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Neil Brown <neilb@suse.de> Acked-by: Charles Lever <Charles.Lever@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trond Myklebust [Tue, 3 Jan 2006 08:55:38 +0000 (09:55 +0100)]
NFSv4: Ensure DELEGRETURN returns attributes
Upon return of a write delegation, the server will almost always bump the
change attribute. Ensure that we pick up that change so that we don't
invalidate our data cache unnecessarily.
Trond Myklebust [Tue, 3 Jan 2006 08:55:37 +0000 (09:55 +0100)]
NFSv4: Ensure change attribute returned by GETATTR callback conforms to spec
According to RFC3530 we're supposed to cache the change attribute
at the time the client receives a write delegation.
If the inode is clean, a CB_GETATTR callback by the server to the
client is supposed to return the cached change attribute.
If, OTOH, the inode is dirty, the client should bump the cached
change attribute by 1.