Chuck Lever [Wed, 25 Jun 2008 21:24:39 +0000 (17:24 -0400)]
SUNRPC: Use rpcbind version 2 GETPORT
Clean up: Change the version 2 procedure name to GETPORT. It's the same
procedure number as GETADDR, but version 2 implementations usually refer
to it as GETPORT.
This also now matches the procedure name used in the version 2 procedure
entry in the rpcb_next_version[] array, making it slightly less confusing.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Jeff Layton [Wed, 11 Jun 2008 14:03:11 +0000 (10:03 -0400)]
nfs4: fix potential race with rapid nfs_callback_up/down cycle
If the nfsv4 callback thread is rapidly brought up and down, it's
possible that nfs_callback_svc might never get a chance to run. If
this happens, the cleanup at thread exit might never occur, throwing
the refcounting off and nfs_callback_info in an incorrect state.
Move the clean functions into nfs_callback_down. Also change the
nfs_callback_info struct to track the svc_rqst rather than svc_serv
since we need to know that to call svc_exit_thread.
Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Benny Halevy [Tue, 24 Jun 2008 17:25:57 +0000 (20:25 +0300)]
nfs: initialize timeout variable in nfs4_proc_setclientid_confirm
gcc (4.3.0) rightfully warns about this:
/usr0/export/dev/bhalevy/git/linux-pnfs-bh-nfs41/fs/nfs/nfs4proc.c: In function \91nfs4_proc_setclientid_confirm\92:
/usr0/export/dev/bhalevy/git/linux-pnfs-bh-nfs41/fs/nfs/nfs4proc.c:2936: warning: \91timeout\92 may be used uninitialized in this function
nfs4_delay that's passed a pointer to 'timeout' is looking at its value
and sets it up to some value in the range: NFS4_POLL_RETRY_MIN..NFS4_POLL_RETRY_MAX
if (*timeout <= 0)
*timeout = NFS4_POLL_RETRY_MIN;
if (*timeout > NFS4_POLL_RETRY_MAX)
*timeout = NFS4_POLL_RETRY_MAX;
Therefore it will end up set to some sane, though rather indeterministic, value.
Chuck Lever [Mon, 23 Jun 2008 16:37:01 +0000 (12:37 -0400)]
NFS: handle interface identifiers in incoming IPv6 addresses
Add support in the kernel NFS client's address parser for interface
identifiers.
IPv6 link-local addresses require an additional "interface identifier",
which is a network device name or an integer that indexes the array of
local network interfaces. They are suffixed to the address with a '%'.
For example:
fe80::215:c5ff:fe3b:e1b2%2
indicates an interface index of 2. Or
fe80::215:c5ff:fe3b:e1b2%eth0
indicates that requests should be routed through the eth0 device.
Without the interface ID, link-local addresses are not usable for NFS.
Both the kernel NFS client mount option parser and the mount.nfs command
can take either form. The mount.nfs command always passes the address
through getnameinfo(3), which usually re-writes interface indices as
device names.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chuck Lever [Mon, 23 Jun 2008 16:36:45 +0000 (12:36 -0400)]
NFS: Support raw IPv6 address hostnames during NFS mount operation
Traditionally the mount command has looked for a ":" to separate the
server's hostname from the export path in the mounted on device name,
like this:
mount server:/export /mounted/on/dir
The server's hostname is "server" and the export path is "/export".
You can also substitute a specific IPv4 network address for the server
hostname, like this:
mount 192.168.0.55:/export /mounted/on/dir
Raw IPv6 addresses present a problem, however, because they look
something like this:
fe80::200:5aff:fe00:30b
Note the use of colons.
To get around the presence of colons, copy the Solaris convention used for
mounting IPv6 servers by address: wrap a raw IPv6 address with square
brackets.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chuck Lever [Mon, 23 Jun 2008 16:36:37 +0000 (12:36 -0400)]
NFS: Use common device name parsing logic for NFSv4 and NFSv2/v3
To support passing a raw IPv6 address as a server hostname, we need to
expand the logic that handles splitting the passed-in device name into
a server hostname and export path
Start by pulling device name parsing out of the mount option validation
functions and into separate helper functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Trond Myklebust [Fri, 13 Jun 2008 17:25:22 +0000 (13:25 -0400)]
NFS: Allow redirtying of a completed unstable write.
Currently, if an unstable write completes, we cannot redirty the page in
order to reflect a new change in the page data until after we've sent a
COMMIT request.
This patch allows a page rewrite to proceed without the unnecessary COMMIT
step, putting it immediately back onto the dirty page list, undoing the
VM unstable write accounting, and removing the NFS_PAGE_TAG_COMMIT tag from
the NFS radix tree.
Chuck Lever [Thu, 12 Jun 2008 16:37:33 +0000 (12:37 -0400)]
NFS: Allow any value for the "retry" option
The kernel NFS mount option parser should ignore the retry= mount option
since it is meaningful only in user space. Today it expects a number
rather than arbitrary text, so it ignores the option if the value is
numeric, but chokes if there are other characters in the value.
Change it to allow any text (except ",") as its value.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chuck Lever [Thu, 12 Jun 2008 16:32:25 +0000 (12:32 -0400)]
NFS: Move fs/nfs/iostat.h to include/linux
The fs/nfs/iostat.h header has definitions that were designed to be exposed
to user space. Move these definitions under include/linux so user space can
use the definitions in applications that read /proc/self/mountstats.
Also address a handful of coding style issues called out by checkpatch.pl in
fs/nfs/iostat.h.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chuck Lever [Fri, 6 Jun 2008 17:22:25 +0000 (13:22 -0400)]
SUNRPC: Ensure all transports set rq_xtime consistently
The RPC client uses the rq_xtime field in each RPC request to determine the
round-trip time of the request. Currently, the rq_xtime field is
initialized by each transport just before it starts enqueing a request to
be sent. However, transports do not handle initializing this value
consistently; sometimes they don't initialize it at all.
To make the measurement of request round-trip time consistent for all
RPC client transport capabilities, pull rq_xtime initialization into the
RPC client's generic transport logic. Now all transports will get a
standardized RTT measure automatically, from:
xprt_transmit()
to
xprt_complete_rqst()
This makes round-trip time calculation more accurate for the TCP transport.
The socket ->sendmsg() method can return "-EAGAIN" if the socket's output
buffer is full, so the TCP transport's ->send_request() method may call
the ->sendmsg() method repeatedly until it gets all of the request's bytes
queued in the socket's buffer.
Currently, the TCP transport sets the rq_xtime field every time through
that loop so the final value is the timestamp just before the *last* call
to the underlying socket's ->sendmsg() method. After this patch, the
rq_xtime field contains a timestamp that reflects the time just before the
*first* call to ->sendmsg().
This is consequential under heavy workloads because large requests often
take multiple ->sendmsg() calls to get all the bytes of a request queued.
The TCP transport causes the request to sleep until the remote end of the
socket has received enough bytes to clear space in the socket's local
output buffer. This delay can be quite significant.
The method introduced by this patch is a more accurate measure of RTT
for stream transports, since the server can cause enough back pressure
to delay (ie increase the latency of) requests from the client.
Additionally, this patch corrects the behavior of the RDMA transport, which
entirely neglected to initialize the rq_xtime field. RPC performance
metrics for RDMA transports now display correct RPC request round trip
times.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Tom Talpey <thomas.talpey@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The cl_chatty flag alows us to control whether a given rpc client leaves
"server X not responding, timed out"
messages in the syslog. Such messages make sense for ordinary nfs
clients (where an unresponsive server means applications on the
mountpoint are probably hanging), but not for the callback client (which
can fail more commonly, with the only result just of disabling some
optimizations).
Previously cl_chatty was removed, do to lack of users; reinstate it, and
use it for the nfsd's callback client.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Jeff Layton [Tue, 10 Jun 2008 19:38:39 +0000 (15:38 -0400)]
NFS: implement option checking when remounting NFS filesystems (resend)
When remounting an NFS or NFS4 filesystem, the new NFS options are not
respected, yet the remount will still return success. This patch adds
a remount_fs sb op for NFS that checks any new nfs mount options against
the existing ones and fails the mount if any have changed.
This is only implemented for string-based mount options since doing
this with binary options isn't really feasible.
This is essentially the same as the original patch I sent out, but
adds a check to see if the addr= option has changed.
Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chuck Lever [Wed, 11 Jun 2008 21:56:05 +0000 (17:56 -0400)]
NFS: Fix trace debugging nits in write.c
Clean up: fix a few dprintk messages that still need to show the RPC task ID
correctly, and be sure we use the preferred %lld or %llu instead of %Ld or
%Lu.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chuck Lever [Wed, 11 Jun 2008 21:55:58 +0000 (17:55 -0400)]
NFS: Use NFSDBG_FILE for all fops
Clean up: some fops use NFSDBG_FILE, some use NFSDBG_VFS. Let's use
NFSDBG_FILE for all fops, and consistently report file names instead
of inode numbers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chuck Lever [Wed, 21 May 2008 21:09:41 +0000 (17:09 -0400)]
SUNRPC: Display some debugging information as text rather than numbers
In rpc_show_tasks(), display the program name, version number, procedure
name and tk_action as human-readable variable-length text fields rather
than columnar numbers.
Doing the symbol lookup here helps in cases where we have actual
debugging output from a kernel log, but don't have access to the kernel
image or RPC module that generated the output.
Chuck Lever [Wed, 21 May 2008 21:09:26 +0000 (17:09 -0400)]
SUNRPC: Don't display the rpc_show_tasks header if there are no tasks
Clean up: don't display the rpc_show_tasks column header unless there is at
least one task to display. As far as I can tell, it is safe to let the
list_for_each_entry macro decide that each list is empty.
scripts/checkpatch.pl also wants a KERN_FOO at the start of any newly added
printk() calls, so this and subsequent patches will also add KERN_INFO.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chuck Lever [Wed, 21 May 2008 21:09:19 +0000 (17:09 -0400)]
SUNRPC: Rename "call_" functions that are no longer FSM states
The RPC client uses a finite state machine to move RPC tasks through each
step of an RPC request. Each state is contained in a function in
net/sunrpc/clnt.c, and named call_foo.
Some of the functions named call_foo have changed over the past few years and
are no longer states in the FSM. These include: call_encode, call_header,
and call_verify. As a clean up, rename the functions that have changed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Revert commit 44dd151d "NFS: Don't mark a written page as uptodate until it
is on disk". While it is true that the write may fail, that is always the
case. There is no reason why we should treat data on pages that are not
already marked as PG_uptodate as being special. The only thing we gain is a
noticeable slowdown when re-reading these pages.
Trond Myklebust [Tue, 10 Jun 2008 22:31:00 +0000 (18:31 -0400)]
NFS: Optimise append writes with holes
If a file is being extended, and we're creating a hole, we might as well
declare the entire page to be up to date.
This patch significantly improves the write performance for sparse files
in the case where lseek(SEEK_END) is used to append several non-contiguous
writes at intervals of < PAGE_SIZE.
Trond Myklebust [Tue, 10 Jun 2008 22:30:11 +0000 (18:30 -0400)]
SUNRPC: An ENOMEM error from call_encode is always fatal
The special 'ENOMEM' case that was previously flagged as non-fatal is
bogus: auth_gss always returns EAGAIN for non-fatal errors, and may in fact
return ENOMEM in the special case where xdr_buf_read_netobj runs out of
preallocated buffer space (invariably a _fatal_ error, since there is no
provision for preallocating larger buffers).
Trond Myklebust [Tue, 20 May 2008 23:34:39 +0000 (19:34 -0400)]
NFS: Add correct bounds checking to NFSv2 locks
NFSv2 file locking currently fails the Connectathon tests, because the
calls to the VFS locking code do not return an EINVAL error if the
struct file_lock overflows the 32-bit boundaries.
The problem is due to the fact that we occasionally call helpers from
fs/locks.c in order to avoid RPC calls to the server when we know that a
local process holds the lock. These helpers are, of course, always
64-bit enabled, so EINVAL is not returned in cases when it would if
the call had gone to the NLM code.
For consistency, we therefore add support for a bounds-checking helper.
Trond Myklebust [Thu, 5 Jun 2008 19:17:39 +0000 (15:17 -0400)]
NFS: Fix a preemption count leak in nfs_update_request
The commit 2785259631697ebb0749a3782cca206e2e542939 (nfs: use GFP_NOFS
preloads for radix-tree insertion) appears to have introduced a bug:
We only want to call radix_tree_preload() once after creating a request.
Calling it every time we loop after we created the request, will cause
preemption count leaks.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Nick Piggin <npiggin@suse.de>
Merge branch 'hotfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'hotfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
SUNRPC: Fix an rpcbind breakage for the case of IPv6 lookups
SUNRPC: Fix a double-free in rpcbind
NFS: Fix readdir cache invalidation
Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus
* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus:
[MIPS] Fix 32bit kernels on R4k with 128 byte cache line size
[MIPS] Atlas, decstation: Fix section mismatches triggered by defconfigs
Jeff Mahoney [Tue, 8 Jul 2008 18:37:06 +0000 (14:37 -0400)]
reiserfs: discard prealloc in reiserfs_delete_inode
With the removal of struct file from the xattr code,
reiserfs_file_release() isn't used anymore, so the prealloc isn't
discarded. This causes hangs later down the line.
This patch adds it to reiserfs_delete_inode. In most cases it will be a
no-op due to it already having been called, but will avoid hangs with
xattrs.
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SUNRPC: Fix an rpcbind breakage for the case of IPv6 lookups
Now that rpcb_next_version has been split into an IPv4 version and an IPv6
version, we Oops when rpcb_call_async attempts to look up the IPv6-specific
RPC procedure in rpcb_next_version.
Fix the Oops simply by having rpcb_getport_async pass the correct RPC
procedure as an argument.
It is wrong to be freeing up the rpcbind arguments if the call to
rpcb_call_async() fails, since they should already have been freed up by
rpcb_map_release().
invalidate_inode_pages2_range() takes page offset arguments, not byte
ranges.
Another thought is that individual pages might perhaps get evicted by VM
pressure, in which case we might perhaps want to re-read not only the
evicted page, but all subsequent pages too (in case the server returns
more/less data per page so that the alignment of the next entry
changes). We should therefore remove the condition that we only do this on
page->index==0.
[MIPS] Fix 32bit kernels on R4k with 128 byte cache line size
The generated copy_page for R4k CPU with a 128 byte cache line size used
Create Dirty Exclusive cache line operations even if only part of the
cache line was filled. This change avoids generating cache operations,
if only part of the cache line size is copied in one loop. It also
increases the maxmimum loop size, because the generated code even fits
into the available space for r4k CPUs with 128 byte cache line size.
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Sergei Shtylyov [Tue, 8 Jul 2008 17:27:22 +0000 (19:27 +0200)]
palm_bk3710: fix IDECLK period calculation
The driver uses completely bogus rounding formula for calculating period from
the IDECLK frequency which gives one-off period values (e.g. 11 ns with 100 MHz
IDECLK) which in turn can lead to overclocked IDE transfer timings. Actually,
rounding is just wrong in this case, so use a mere division for a safe result.
While at it, also:
- give 'ide_palm_clk' variable a more suitable name;
- get rid of the useless 'ideclkp' variable;
- drop the LISP stype 'p' postfix from the 'clkp' variable's name. :-)
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Cc: mcherkashin@ru.mvista.com Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Add __ide_default_irq() inline helper and use it instead of
ide_default_irq() in ide-probe.c and ns87415.c (all host drivers
except IDE PCI ones always setup hwif->irq so it is enough to
check only for I/O bases 0x1f0 and 0x170).
This fixes post-2.6.25 regression since ide_default_irq()
define could shadow ide_default_irq() inline.
David Gibson [Tue, 8 Jul 2008 05:58:16 +0000 (15:58 +1000)]
Correct hash flushing from huge_ptep_set_wrprotect()
As Andy Whitcroft recently pointed out, the current powerpc version of
huge_ptep_set_wrprotect() has a bug. It just calls ptep_set_wrprotect()
which in turn calls pte_update() then hpte_need_flush() with the 'huge'
argument set to 0. This will cause hpte_need_flush() to flush the wrong
hash entries (of any). Andy's fix for this is already in the powerpc
tree as commit 016b33c4958681c24056abed8ec95844a0da80a3.
I have confirmed this is a real bug, not masked by some other
synchronization, with a new testcase for libhugetlbfs. A process write
a (MAP_PRIVATE) hugepage mapping, fork(), then alter the mapping and
have the child incorrectly see the second write.
Therefore, this should be fixed for 2.6.26, and for the stable tree.
Here is a suitable patch for 2.6.26, which I think will also be suitable
for the stable tree (neither of the headers in question has been changed
much recently).
It is cut down slighlty from Andy's original version, in that it does
not include a 32-bit version of huge_ptep_set_wrprotect(). Currently,
hugepages are not supported on any 32-bit powerpc platform. When they
are, a suitable 32-bit version can be added - the only 32-bit hardware
which supports hugepages does not use the conventional hashtable MMU and
so will have different needs anyway.
Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Benny Halevy is seeing extern inlines not resolved with gcc 4.3 with
no-unit-at-a-time
This patch reintroduces unit-at-a-time for gcc >= 4.0, bringing back the
possibility of Uli's crash. If that happens, we'll debug it.
I started seeing both the internal compiler errors and unresolved
inlines on Fedora 9. This patch fixes both problems, without so far
reintroducing the crash reported by Uli.
Signed-off-by: Jeff Dike <jdike@linux.intel.com> Cc: Benny Halevy <bhalevy@panasas.com> Cc: Adrian Bunk <bunk@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
powerpc: Fix unterminated of_device_id array in legacy_serial.c
A recent patch to legacy_serial.c factored out some code by
using the of_match_node() facility to match a node against
an array of possible matches. However, the patch didn't properly
terminate the array causing potential crashes in cases where no
match is found. In addition, the name of the array was poorly
chosen for a static symbol making debugging harder.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vsprintf: add support for '%pS' and '%pF' pointer formats
They print out a pointer in symbolic format, if possible (ie using
symbolic KALLSYMS information). The '%pS' format is for regular direct
pointers (which can point to data or code and that you find on the stack
during backtraces etc), while '%pF' is for C function pointer types.
On most architectures, the two mean exactly the same thing, but some
architectures use an indirect pointer for C function pointers, where the
function pointer points to a function descriptor (which in turn contains
the actual pointer to the code). The '%pF' code automatically does the
appropriate function descriptor dereference on such architectures.
vsprintf: add infrastructure support for extended '%p' specifiers
This expands the kernel '%p' handling with an arbitrary alphanumberic
specifier extension string immediately following the '%p'. Right now
it's just being ignored, but the next commit will start adding some
specific pointer type extensions.
NOTE! The reason the extension is appended to the '%p' is to allow
minimal gcc type checking: gcc will still see the '%p' and will check
that the argument passed in is indeed a pointer, and yet will not
complain about the extended information that gcc doesn't understand
about (on the other hand, it also won't actually check that the pointer
type and the extension are compatible).
Alphanumeric characters were chosen because there is no sane existing
use for a string format with a hex pointer representation immediately
followed by alphanumerics (which is what such a format string would have
traditionally resulted in).
The actual code is the same, just split out into a helper function.
This makes it easier to read, and allows for simple future extension
of %p handling.
Philipp Zabel [Sat, 5 Jul 2008 23:15:34 +0000 (01:15 +0200)]
pxamci: fix byte aligned DMA transfers
The pxa27x DMA controller defaults to 64-bit alignment. This caused
the SCR reads to fail (and, depending on card type, error out) when
card->raw_scr was not aligned on a 8-byte boundary.
For performance reasons all scatter-gather addresses passed to
pxamci_request should be aligned on 8-byte boundaries, but if
this can't be guaranteed, byte aligned DMA transfers in the
have to be enabled in the controller to get correct behaviour.
Signed-off-by: Philipp Zabel <philipp.zabel@gmail.com> Signed-off-by: Pierre Ossman <drzeus@drzeus.cx> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Borzenkov reports that it resulted in a totally hung machine for
him when loading the OHCI driver. Extensive netconsole capture with
SysRq output shows that modprobe gets stuck in ohci_hub_status_data()
when probing and enabling the OHCI controller, see for example
http://lkml.org/lkml/2008/7/5/236
for an analysis.
The problem appears to be an interrupt flood triggered by the commit
that gets reverted, and Andrey confirmed that the revert makes things
work for him again.
Reported-and-tested-by: Andrey Borzenkov <arvidjaar@mail.ru> Acked-by: Alan Stern <stern@rowland.harvard.edu> Acked-by: David Brownell <david-b@pacbell.net> Cc: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark McLoughlin [Fri, 4 Jul 2008 17:23:15 +0000 (18:23 +0100)]
KVM: IOAPIC: Fix level-triggered irq injection hang
The "remote_irr" variable is used to indicate an interrupt
which has been received by the LAPIC, but not acked.
In our EOI handler, we unset remote_irr and re-inject the
interrupt if the interrupt line is still asserted.
However, we do not set remote_irr here, leading to a
situation where if kvm_ioapic_set_irq() is called, then we go
ahead and call ioapic_service(). This means that IRR is
re-asserted even though the interrupt is currently in service
(i.e. LAPIC IRR is cleared and ISR/TMR set)
The issue with this is that when the currently executing
interrupt handler finishes and writes LAPIC EOI, then TMR is
unset and EOI sent to the IOAPIC. Since IRR is now asserted,
but TMR is not, then when the second interrupt is handled,
no EOI is sent and if there is any pending interrupt, it is
not re-injected.
This fixes a hang only seen while running mke2fs -j on an
8Gb virtio disk backed by a fully sparse raw file, with
aliguori "avoid fragmented virtio-blk transfers by copying"
changes.
Signed-off-by: Mark McLoughlin <markmc@redhat.com> Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
Anthony Liguori [Thu, 3 Jul 2008 16:02:36 +0000 (19:02 +0300)]
x86: KVM guest: Add memory clobber to hypercalls
Hypercalls can modify arbitrary regions of memory. Make sure to indicate this
in the clobber list. This fixes a hang when using KVM_GUEST kernel built with
GCC 4.3.0.
This was originally spotted and analyzed by Marcelo.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
Oliver Hartkopp [Sun, 6 Jul 2008 06:38:43 +0000 (23:38 -0700)]
can: add sanity checks
Even though the CAN netlayer only deals with CAN netdevices, the
netlayer interface to the userspace and to the device layer should
perform some sanity checks.
This patch adds several sanity checks that mainly prevent userspace apps
to send broken content into the system that may be misinterpreted by
some other userspace application.
Signed-off-by: Oliver Hartkopp <oliver.hartkopp@volkswagen.de> Signed-off-by: Urs Thuermann <urs.thuermann@volkswagen.de> Acked-by: Andre Naujoks <nautsch@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Morton [Sat, 5 Jul 2008 08:02:01 +0000 (01:02 -0700)]
Fix pagemap_read() use of struct mm_walk
Fix some issues in pagemap_read noted by Alexey:
- initialize pagemap_walk.mm to "mm" , so the code starts working as
advertised
- initialize ->private to "&pm" so it wouldn't immediately oops in
pagemap_pte_hole()
- unstatic struct pagemap_walk, so two threads won't fsckup each other
(including those started by root, including flipping ->mm when you don't
have permissions)
- pagemap_read() contains two calls to ptrace_may_attach(), second one
looks unneeded.
- avoid possible kmalloc(0) and integer wraparound.
Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Personally, I'd just remove the functionality entirely - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move _RET_IP_ and _THIS_IP_ to include/linux/kernel.h
These two macros are useful beyond lock debugging. Moved definitions from
include/linux/debug_locks.h to include/linux/kernel.h, so code that needs
them does not have to include the former, which would have been a less
intuitive choice of a header.
Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86 ACPI: fix resume from suspend to RAM on uniprocessor x86-64
x86 ACPI: normalize segment descriptor register on resume
ahci: give another shot at clearing all bits in irq_stat
Commit ea0c62f7cf70f13a67830471b613337bd0c9a62e tried to clear all
bits in irq_stat but it didn't actually achieve that as irq_stat was
anded with port_map right after read. This patch makes ahci driver
always use the unmasked value to clear irq_status.
While at it, add explanation on the peculiarities of ahci IRQ
clearing.
class->dev_release is called by device_release() iff dev->release
is not present so ide_port_class_release() is never called and the
last hwif->gendev reference is not dropped.
Fix it by removing ide_port_class_release() and get_device() call
from ide_register_port() (device_create_drvdata() takes a hwif->gendev
reference anyway).
This patch fixes hang on wait_for_completion(&hwif->gendev_rel_comp)
in ide_unregister() reported by Pavel Machek.
Cc: Pavel Machek <pavel@suse.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Arjan van de Ven [Mon, 16 Jun 2008 22:51:08 +0000 (15:51 -0700)]
softlockup: print a module list on being stuck
Most places in the kernel that go BUG: print a module list
(which is very useful for doing statistics and finding patterns),
however the softlockup detector does not do this yet.
This patch adds the one line change to fix this gap.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86 ACPI: fix resume from suspend to RAM on uniprocessor x86-64
Since the trampoline code is now used for ACPI resume from suspend to RAM,
the trampoline page tables have to be fixed up during boot not only on SMP
systems, but also on UP systems that use the trampoline.
H. Peter Anvin [Tue, 24 Jun 2008 21:03:48 +0000 (23:03 +0200)]
x86 ACPI: normalize segment descriptor register on resume
Some Dell laptops enter resume with apparent garbage in the segment
descriptor registers (almost certainly the result of a botched
transition from protected to real mode.) The only way to clean that
up is to enter protected mode ourselves and clean out the descriptor
registers.
This fixes resume on Dell XPS M1210 and Dell D620.
Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: pm list <linux-pm@lists.linux-foundation.org> Cc: Len Brown <lenb@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Tested-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
David Rientjes [Fri, 4 Jul 2008 19:24:13 +0000 (12:24 -0700)]
mempolicy: mask off internal flags for userspace API
Flags considered internal to the mempolicy kernel code are stored as part
of the "flags" member of struct mempolicy.
Before exposing a policy type to userspace via get_mempolicy(), these
internal flags must be masked. Flags exposed to userspace, however,
should still be returned to the user.
Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pierre Ossman [Fri, 4 Jul 2008 10:51:20 +0000 (12:51 +0200)]
mmc: don't use DMA on newer ENE controllers
Even the newer ENE controllers have bugs in their DMA engine that make
it too dangerous to use. Disable it until someone has figured out under
which conditions it corrupts data.
This has caused problems at least once, and can be found as bug report
10925 in the kernel bugzilla.
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Jackson [Fri, 4 Jul 2008 17:00:09 +0000 (10:00 -0700)]
doc: document the relax_domain_level kernel boot argument
Document the kernel boot parameter: relax_domain_level=.
Signed-off-by: Paul Jackson <pj@sgi.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>