Parsing addresses extracted from Open Firmware isn't a simple matter. We
have various bits of code that try to do it in various place, including
some heuristics in prom.c that pre-parse addresses at boot and fill
device-nodes "addrs", but those are dodgy at best and I want to
deprecate them. So this patch introduces a new set of routines that
should be capable of parsing most types of addresses and translating
them into CPU physical addresses. It currently works for things on PCI
busses and ISA busses and should work on "standard" busses like the root
bus or the MacIO bus that don't put funky flags in addresses. If you
have other bus types that do use funky flags, you'll have to add new bus
type translators, which is fairly easy.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
David Woodhouse [Fri, 18 Nov 2005 12:15:33 +0000 (12:15 +0000)]
[PATCH] ppc64 syscall_exit_work: call the save_nvgprs function, not its descriptor.
On Tue, 2005-11-15 at 18:52 +0000, David Woodhouse wrote:
> This cleanup patch speeds up the null syscall path on ppc64 by about 3%,
> and brings the ppc32 and ppc64 code slightly closer together.
Needs this unless your binutils, like mine, are clever enough to notice
my stupidity and fix it up automatically...
Spotted by Paul.
Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
Arnd Bergmann [Tue, 15 Nov 2005 20:53:52 +0000 (15:53 -0500)]
[PATCH] spufs: cooperative scheduler support
This adds a scheduler for SPUs to make it possible to use
more logical SPUs than physical ones are present in the
system.
Currently, there is no support for preempting a running
SPU thread, they have to leave the SPU by either triggering
an event on the SPU that causes it to return to the
owning thread or by sending a signal to it.
This patch also adds operations that enable accessing an SPU
in either runnable or saved state. We use an RW semaphore
to protect the state of the SPU from changing underneath
us, while we are holding it readable. In order to change
the state, it is acquired writeable and a context save
or restore is executed before downgrading the semaphore
to read-only.
From: Mark Nutter <mnutter@us.ibm.com>,
Uli Weigand <Ulrich.Weigand@de.ibm.com> Signed-off-by: Arnd Bergmann <arndb@de.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
Mark Nutter [Tue, 15 Nov 2005 20:53:51 +0000 (15:53 -0500)]
[PATCH] spufs: add spu-side context switch code
Add the source code that is used to generate spu_save_dump.h and
spu_restore_dump.h. Since a full spu tool chain is needed to
generate these files, the default remains to use the shipped
versions in order to keep the number of tools for building the
kernel down.
Signed-off-by: Arnd Bergmann <arndb@de.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
Mark Nutter [Tue, 15 Nov 2005 20:53:49 +0000 (15:53 -0500)]
[PATCH] spufs: switchable spu contexts
Add some infrastructure for saving and restoring the context of an
SPE. This patch creates a new structure that can hold the whole
state of a physical SPE in memory. It also contains code that
avoids races during the context switch and the binary code that
is loaded to the SPU in order to access its registers.
The actual PPE- and SPE-side context switch code are two separate
patches.
Signed-off-by: Arnd Bergmann <arndb@de.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
Arnd Bergmann [Tue, 15 Nov 2005 20:53:48 +0000 (15:53 -0500)]
[PATCH] spufs: The SPU file system, base
This is the current version of the spu file system, used
for driving SPEs on the Cell Broadband Engine.
This release is almost identical to the version for the
2.6.14 kernel posted earlier, which is available as part
of the Cell BE Linux distribution from
http://www.bsc.es/projects/deepcomputing/linuxoncell/.
The first patch provides all the interfaces for running
spu application, but does not have any support for
debugging SPU tasks or for scheduling. Both these
functionalities are added in the subsequent patches.
See Documentation/filesystems/spufs.txt on how to use
spufs.
Signed-off-by: Arnd Bergmann <arndb@de.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch adds the necessary core bus support used by device drivers
that sit on the IBM GX bus on modern pSeries machines like the Galaxy
infiniband for example. It provide transparent DMA ops (the low level
driver works with virtual addresses directly) along with a simple bus
layer using the Open Firmware matching routines.
Signed-off-by: Heiko J Schick <schickhj@de.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
David Woodhouse [Tue, 15 Nov 2005 18:52:18 +0000 (18:52 +0000)]
[PATCH] syscall entry/exit revamp
This cleanup patch speeds up the null syscall path on ppc64 by about 3%,
and brings the ppc32 and ppc64 code slightly closer together.
The ppc64 code was checking current_thread_info()->flags twice in the
syscall exit path; once for TIF_SYSCALL_T_OR_A before disabling
interrupts, and then again for TIF_SIGPENDING|TIF_NEED_RESCHED etc after
disabling interrupts. Now we do the same as ppc32 -- check the flags
only once in the fast path, and re-enable interrupts if necessary in the
ptrace case.
The patch abolishes the 'syscall_noerror' member of struct thread_info
and replaces it with a TIF_NOERROR bit in the flags, which is handled in
the slow path. This shortens the syscall entry code, which no longer
needs to clear syscall_noerror.
The patch adds a TIF_SAVE_NVGPRS flag which causes the syscall exit slow
path to save the non-volatile GPRs into a signal frame. This removes the
need for the assembly wrappers around sys_sigsuspend(),
sys_rt_sigsuspend(), et al which existed solely to save those registers
in advance. It also means I don't have to add new wrappers for ppoll()
and pselect(), which is what I was supposed to be doing when I got
distracted into this...
Finally, it unifies the ppc64 and ppc32 methods of handling syscall exit
directly into a signal handler (as required by sigsuspend et al) by
introducing a TIF_RESTOREALL flag which causes _all_ the registers to be
reloaded from the pt_regs by taking the ret_from_exception path, instead
of the normal syscall exit path which stomps on the callee-saved GPRs.
It appears to pass an LTP test run on ppc64, and passes basic testing on
ppc32 too. Brief tests of ptrace functionality with strace and gdb also
appear OK. I wouldn't send it to Linus for 2.6.15 just yet though :)
Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
Michael Ellerman [Mon, 14 Nov 2005 12:35:00 +0000 (23:35 +1100)]
[PATCH] powerpc: Merge kexec
This patch merges, to some extent, the PPC32 and PPC64 kexec implementations.
We adopt the PPC32 approach of having ppc_md callbacks for the kexec functions.
The current PPC64 implementation becomes the "default" implementation for PPC64
which platforms can select if they need no special treatment.
I've added these default callbacks to pseries/maple/cell/powermac, this means
iSeries no longer supports kexec - but it never worked anyway.
I've renamed PPC32's machine_kexec_simple to default_machine_kexec, inline with
PPC64. Judging by the comments it might be better named machine_kexec_non_of,
or something, but at the moment it's the only implementation for PPC32 so it's
the "default".
Kexec requires machine_shutdown(), which is in machine_kexec.c on PPC32, but we
already have in setup-common.c on powerpc. All this does is call
ppc_md.nvram_sync, which only powermac implements, so instead make
machine_shutdown a ppc_md member and have it call core99_nvram_sync directly
on powermac.
I've also stuck relocate_kernel.S into misc_32.S for powerpc.
Built for ARCH=ppc, and 32 & 64 bit ARCH=powerpc, with KEXEC=y/n. Booted on
P5 LPAR and successfully kexec'ed.
Knut Petersen [Sat, 7 Jan 2006 09:22:04 +0000 (10:22 +0100)]
[PATCH] fbcon: don´t call set_par() in fbcon_init() if vc_mode == KD_GRAPHICS
Nothing prevents a user to modprobe a framebuffer driver from e.g. the
xterm prompt. As a result, the set_par() function of the driver will be
called from fbcon_init().
This is fatal as a lot of X / framebuffer combinations are unable to
recover from set_par() reprogramming the graphics controller in
KD_GRAPHICS mode.
It is also unnecessary as the set_par() function will be called during a
switch to KD_TEXT anyway. Because of this no side effects are possible.
Russell King [Sat, 7 Jan 2006 13:52:45 +0000 (13:52 +0000)]
[ARM] Move AMBA include files to include/linux/amba/
Since the ARM AMBA bus is used on MIPS as well as ARM, we need
to make the bus available for other architectures to use. Move
the AMBA include files from include/asm-arm/hardware/ to
include/linux/amba/
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Andre McCurdy [Sat, 7 Jan 2006 11:39:20 +0000 (11:39 +0000)]
[ARM] 3239/1: Add ARM optimised swab32
Patch from Andre McCurdy
Replaces generic swab32 routine with a more ARM friendly version.
Reduces kernel text size by approx 1200 bytes when compiled with
3.4.4 and approx 2400 bytes with 4.0.2
Probably some performance benefit as well.
Signed-off-by: Andre McCurdy <armccurdy@yahoo.co.uk> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Correction of the code broken by update
whole-tree platform devices update.
Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
I've spent the past 3 days digging into a glibc testsuite failure in
current CVS, specifically libc/rt/tst-cputimer1.c The thr1 and thr2
timers fire too early in the second pass of this test. The second
pass is noteworthy because it makes use of intervals, whereas the
first pass does not.
All throughout the posix-cpu-timers.c code, the calculation of the
process sched_time sum is implemented roughly as:
unsigned long long sum;
sum = tsk->signal->sched_time;
t = tsk;
do {
sum += t->sched_time;
t = next_thread(t);
} while (t != tsk);
In fact this is the exact scheme used by check_process_timers().
In the case of check_process_timers(), current->sched_time has just
been updated (via scheduler_tick(), which is invoked by
update_process_times(), which subsequently invokes
run_posix_cpu_timers()) So there is no special processing necessary
wrt. that.
In other contexts, we have to allot for the fact that tsk->sched_time
might be a bit out of date if we are current. And the
posix-cpu-timers.c code uses current_sched_time() to deal with that.
Unfortunately it does so in an erroneous and inconsistent manner in
one spot which is what results in the early timer firing.
In cpu_clock_sample_group_locked(), it does this:
cpu->sched = p->signal->sched_time;
/* Add in each other live thread. */
while ((t = next_thread(t)) != p) {
cpu->sched += t->sched_time;
}
if (p->tgid == current->tgid) {
/*
* We're sampling ourselves, so include the
* cycles not yet banked. We still omit
* other threads running on other CPUs,
* so the total can always be behind as
* much as max(nthreads-1,ncpus) * (NSEC_PER_SEC/HZ).
*/
cpu->sched += current_sched_time(current);
} else {
cpu->sched += p->sched_time;
}
The problem is the "p->tgid == current->tgid" test. If "p" is
not current, and the tgids are the same, we will add the process
t->sched_time twice into cpu->sched and omit "p"'s sched_time
which is very very very wrong.
posix-cpu-timers.c has a helper function, sched_ns(p) which takes care
of this, so my fix is to use that here instead of this special tgid
test.
The fact that current can be one of the sub-threads of "p" points out
that we could make things a little bit more accurate, perhaps by using
sched_ns() on every thread we process in these loops. It also points
out that we don't use the most accurate value for threads in the group
actively running other cpus (and this is mentioned in the comment).
But that is a future enhancement, and this fix here definitely makes
sense.
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It looks like the bridge netfilter code does not correctly update
the hardware checksum after popping off the VLAN header.
This is by inspection, I have *not* tested this.
To test you would need to set up a filtering bridge with vlans
and a device the does hardware receive checksum (skge, or sungem)
Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Shaun Pereira [Fri, 6 Jan 2006 21:11:35 +0000 (13:11 -0800)]
[X25]: Fix for broken x25 module.
When a user-space server application calls bind on a socket, then in kernel
space this bound socket is considered 'x25-linked' and the SOCK_ZAPPED flag
is unset.(As in x25_bind()/af_x25.c).
Now when a user-space client application attempts to connect to the server
on the listening socket, if the kernel accepts this in-coming call, then it
returns a new socket to userland and attempts to reply to the caller.
The reply/x25_sendmsg() will fail, because the new socket created on
call-accept has its SOCK_ZAPPED flag set by x25_make_new().
(sock_init_data() called by x25_alloc_socket() called by x25_make_new()
sets the flag to SOCK_ZAPPED)).
Fix: Using the sock_copy_flag() routine available in sock.h fixes this.
Tested on 32 and 64 bit kernels with x25 over tcp.
Signed-off-by: Shaun Pereira <pereira.shaun@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
J. Bruce Fields [Tue, 3 Jan 2006 08:56:01 +0000 (09:56 +0100)]
SUNRPC: Make krb5 report unsupported encryption types
Print messages when an unsupported encrytion algorthm is requested or
there is an error locating a supported algorthm.
Signed-off-by: Kevin Coffman <kwc@citi.umich.edu> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
J. Bruce Fields [Tue, 3 Jan 2006 08:56:01 +0000 (09:56 +0100)]
SUNRPC: Make spkm3 report unsupported encryption types
Print messages when an unsupported encrytion algorthm is requested or
there is an error locating a supported algorthm.
Signed-off-by: Kevin Coffman <kwc@citi.umich.edu> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
J. Bruce Fields [Tue, 3 Jan 2006 08:56:00 +0000 (09:56 +0100)]
SUNRPC: Update the spkm3 code to use the make_checksum interface
Also update the tokenlen calculations to accomodate g_token_size().
Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trond Myklebust [Tue, 3 Jan 2006 08:55:57 +0000 (09:55 +0100)]
NFSv4: Allow entries in the idmap cache to expire
If someone changes the uid/gid mapping in userland, then we do eventually
want those changes to be propagated to the kernel. Currently the kernel
assumes that it may cache entries forever.
Add an expiration time + garbage collector for idmap entries.
Trond Myklebust [Tue, 3 Jan 2006 08:55:55 +0000 (09:55 +0100)]
SUNRPC: Ensure client closes the socket when server initiates a close
If the server decides to close the RPC socket, we currently don't actually
respond until either another RPC call is scheduled, or until xprt_autoclose()
gets called by the socket expiry timer (which may be up to 5 minutes
later).
This patch ensures that xprt_autoclose() is called much sooner if the
server closes the socket.
Chuck Lever [Tue, 3 Jan 2006 08:55:52 +0000 (09:55 +0100)]
SUNRPC: get rid of cl_chatty
Clean up: Every ULP that uses the in-kernel RPC client, except the NLM
client, sets cl_chatty. There's no reason why NLM shouldn't set it, so
just get rid of cl_chatty and always be verbose.
Test-plan:
Compile with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <cel@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chuck Lever [Tue, 3 Jan 2006 08:55:51 +0000 (09:55 +0100)]
SUNRPC: transport switch API for setting port number
At some point, transport endpoint addresses will no longer be IPv4. To hide
the structure of the rpc_xprt's address field from ULPs and port mappers,
add an API for setting the port number during an RPC bind operation.
Test-plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked.
Probably need to rig a server where certain services aren't running, or
that returns an error for some typical operation.
Signed-off-by: Chuck Lever <cel@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chuck Lever [Tue, 3 Jan 2006 08:55:50 +0000 (09:55 +0100)]
SUNRPC: new interface to force an RPC rebind
We'd like to hide fields in rpc_xprt and rpc_clnt from upper layer protocols.
Start by creating an API to force RPC rebind, replacing logic that simply
sets cl_port to zero.
Test-plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked.
Probably need to rig a server where certain services aren't running, or
that returns an error for some typical operation.
Signed-off-by: Chuck Lever <cel@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chuck Lever [Tue, 3 Jan 2006 08:55:49 +0000 (09:55 +0100)]
SUNRPC: switchable buffer allocation
Add RPC client transport switch support for replacing buffer management
on a per-transport basis.
In the current IPv4 socket transport implementation, RPC buffers are
allocated as needed for each RPC message that is sent. Some transport
implementations may choose to use pre-allocated buffers for encoding,
sending, receiving, and unmarshalling RPC messages, however. For
transports capable of direct data placement, the buffers can be carved
out of a pre-registered area of memory rather than from a slab cache.
Test-plan:
Millions of fsx operations. Performance characterization with "sio" and
"iozone". Use oprofile and other tools to look for significant regression
in CPU utilization.
Signed-off-by: Chuck Lever <cel@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
J. Bruce Fields [Tue, 3 Jan 2006 08:55:48 +0000 (09:55 +0100)]
NFSv3: try get_root user-supplied security_flavor
Thanks to Ed Keizer for bug and root cause. He says: "... we could only mount
the top-level Solaris share. We could not mount deeper into the tree.
Investigation showed that Solaris allows UNIX authenticated FSINFO only on the
top level of the share. This is a problem because we share/export our home
directories one level higher than we mount them. I.e. we share the partition
and not the individual home directories. This prevented access to home
directories."
We still may need to try auth_sys for the case where the client doesn't have
appropriate credentials.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
J. Bruce Fields [Tue, 3 Jan 2006 08:55:46 +0000 (09:55 +0100)]
NLM: Further cancel fixes
If the server receives an NLM cancel call and finds no waiting lock to
cancel, then chances are the lock has already been applied, and the client
just hadn't yet processed the NLM granted callback before it sent the
cancel.
The Open Group text, for example, perimts a server to return either success
(LCK_GRANTED) or failure (LCK_DENIED) in this case. But returning an error
seems more helpful; the client may be able to use it to recognize that a
race has occurred and to recover from the race.
So, modify the relevant functions to return an error in this case.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
J. Bruce Fields [Tue, 3 Jan 2006 08:55:44 +0000 (09:55 +0100)]
NLM: don't unlock on cancel requests
Currently when lockd gets an NLM_CANCEL request, it also does an unlock for
the same range. This is incorrect.
The Open Group documentation says that "This procedure cancels an
*outstanding* blocked lock request." (Emphasis mine.)
Also, consider a client that holds a lock on the first byte of a file, and
requests a lock on the entire file. If the client cancels that request
(perhaps because the requesting process is signalled), the server shouldn't
apply perform an unlock on the entire file, since that will also remove the
previous lock that the client was already granted.
Or consider a lock request that actually *downgraded* an exclusive lock to
a shared lock.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
J. Bruce Fields [Tue, 3 Jan 2006 08:55:42 +0000 (09:55 +0100)]
NLM: Clean up nlmsvc_grant_reply locking
Slightly simpler logic here makes it more trivial to verify that the up's
and down's are balanced here. Break out an assignment from a conditional
while we're at it.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch removes ths unused function xdr_decode_string().
Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Neil Brown <neilb@suse.de> Acked-by: Charles Lever <Charles.Lever@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trond Myklebust [Tue, 3 Jan 2006 08:55:38 +0000 (09:55 +0100)]
NFSv4: Ensure DELEGRETURN returns attributes
Upon return of a write delegation, the server will almost always bump the
change attribute. Ensure that we pick up that change so that we don't
invalidate our data cache unnecessarily.
Trond Myklebust [Tue, 3 Jan 2006 08:55:37 +0000 (09:55 +0100)]
NFSv4: Ensure change attribute returned by GETATTR callback conforms to spec
According to RFC3530 we're supposed to cache the change attribute
at the time the client receives a write delegation.
If the inode is clean, a CB_GETATTR callback by the server to the
client is supposed to return the cached change attribute.
If, OTOH, the inode is dirty, the client should bump the cached
change attribute by 1.
Trond Myklebust [Tue, 3 Jan 2006 08:55:34 +0000 (09:55 +0100)]
NFS: Make stat() return updated mtimes after a write()
The SuS states that a call to write() will cause mtime to be updated on
the file. In order to satisfy that requirement, we need to flush out
any cached writes in nfs_getattr().
Speed things up slightly by not committing the writes.
Chuck Lever [Wed, 30 Nov 2005 23:09:02 +0000 (18:09 -0500)]
NFS: support large reads and writes on the wire
Most NFS server implementations allow up to 64KB reads and writes on the
wire. The Solaris NFS server allows up to a megabyte, for instance.
Now the Linux NFS client supports transfer sizes up to 1MB, too. This will
help reduce protocol and context switch overhead on read/write intensive NFS
workloads, and support larger atomic read and write operations on servers
that support them.
Test-plan:
Connectathon and iozone on mount point with wsize=rsize>32768 over TCP.
Tests with NFS over UDP to verify the maximum RPC payload size cap.
Signed-off-by: Chuck Lever <cel@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chuck Lever [Wed, 30 Nov 2005 23:08:17 +0000 (18:08 -0500)]
NFS: use generic_write_checks() to sanity check direct writes
Replace ad hoc write parameter sanity checking in nfs_file_direct_write()
with a call to generic_write_checks(). This should make the proper checks
modulo the O_LARGEFILE flag, and should catch NFSv2-specific limitations by
virtue of i_sb->s_maxbytes.
Test plan:
Posix compliance testing with both NFSv2 and NFSv3.
Signed-off-by: Chuck Lever <cel@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trond Myklebust [Tue, 3 Jan 2006 08:55:13 +0000 (09:55 +0100)]
NFSv4: Make nfs4_state track O_RDWR, O_RDONLY and O_WRONLY separately
A closer reading of RFC3530 reveals that OPEN_DOWNGRADE must always
specify a access modes that have been the argument of a previous OPEN
operation.
IOW: doing OPEN(O_RDWR) and then OPEN_DOWNGRADE(O_WRONLY) is forbidden
unless the user called OPEN(O_WRONLY)
In order to fix that, we really need to track the three possible open
states separately.