Mike Travis [Tue, 8 Jul 2008 21:35:21 +0000 (14:35 -0700)]
x86: change _node_to_cpumask_ptr to return const ptr
* Strengthen the return type for the _node_to_cpumask_ptr to be
a const pointer. This adds compiler checking to insure that
node_to_cpumask_map[] is not changed inadvertently.
Signed-off-by: Mike Travis <travis@sgi.com> Cc: "akpm@linux-foundation.org" <akpm@linux-foundation.org> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Acked-by: Vegard Nossum <vegard.nossum@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that IRQ2 is never made available to the I/O APIC, there is no need
to special-case it and mask as a workaround for broken systems. Actually,
because of the former, mask_IO_APIC_irq(2) is a no-op already.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit f18f982ab ("sched: CPU hotplug events must not destroy scheduler
domains created by the cpusets") introduced a hotplug-related problem as
described below:
/*
* Force a reinitialization of the sched domains hierarchy. The domains
* and groups cannot be updated in place without racing with the balancing
* code, so we temporarily attach all running cpus to the NULL domain
* which will prevent rebalancing while the sched domains are recalculated.
*/
The sched-domains should be rebuilt when a CPU_DOWN ops. has been
completed, effectively either upon CPU_DEAD{_FROZEN} (upon success) or
CPU_DOWN_FAILED{_FROZEN} (upon failure -- restore the things to their
initial state). That's what update_sched_domains() also does but only
for !CPUSETS case.
With f18f982ab, sched-domains' reinitialization is delegated to
CPUSETS code:
Being called for CPU_UP_PREPARE and if its callback is called after
update_sched_domains()), it just negates all the work done by
update_sched_domains() -- i.e. a soon-to-be-offline cpu is included in
the sched-domains and that makes it visible for the load-balancer
while the CPU_DOWN ops. is in progress.
__migrate_live_tasks() moves the tasks off a 'dead' cpu (it's already
"offline" when this function is called).
try_to_wake_up() is called for one of these tasks from another CPU ->
the load-balancer (wake_idle()) picks up a "dead" CPU and places the
task on it. Then e.g. BUG_ON(rq->nr_running) detects this a bit later
-> oops.
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Tested-by: Vegard Nossum <vegard.nossum@gmail.com> Cc: Paul Menage <menage@google.com> Cc: Max Krasnyansky <maxk@qualcomm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: miaox@cn.fujitsu.com Cc: rostedt@goodmis.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Yinghai Lu [Sun, 13 Jul 2008 05:57:07 +0000 (22:57 -0700)]
x86, e820: remove end_user_pfn
end_user_pfn used to modify the meaning of the e820 maps.
Now that all e820 operations are cleaned up, unified, tightened up,
the e820 map always get updated to reality, we don't need to keep
this secondary mechanism anymore.
If you hit this commit in bisection it means something slipped through.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6:
[SCSI] bsg: fix oops on remove
[SCSI] fusion: default MSI to disabled for SPI and FC controllers
[SCSI] ipr: Fix HDIO_GET_IDENTITY oops for SATA devices
[SCSI] mptspi: fix oops in mptspi_dv_renegotiate_work()
[SCSI] erase invalid data returned by device
Jeff Layton [Sat, 12 Jul 2008 20:48:00 +0000 (13:48 -0700)]
cifs: fix wksidarr declaration to be big-endian friendly
The current definition of wksidarr works fine on little endian arches
(since cpu_to_le32 is a no-op there), but on big-endian arches, it fails
to compile with this error:
error: braced-group within expression allowed only inside a function
The problem is that this static declaration has cpu_to_le32 embedded
within it, and that expands into a function macro. We need to use
__constant_cpu_to_le32() instead.
Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Steven French <sfrench@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jeff Layton [Sat, 12 Jul 2008 20:47:59 +0000 (13:47 -0700)]
cifs: fix inode leak in cifs_get_inode_info_unix
Try this:
mount a share with unix extensions
create a file on it
umount the share
You'll get the following message in the ring buffer:
VFS: Busy inodes after unmount of cifs. Self-destruct in 5 seconds. Have a
nice day...
...the problem is that cifs_get_inode_info_unix is creating and hashing
a new inode even when it's going to return error anyway. The first
lookup when creating a file returns an error so we end up leaking this
inode before we do the actual create. This appears to be a regression
caused by commit 0e4bbde94fdc33f5b3d793166b21bf768ca3e098.
The following patch seems to fix it for me, and fixes a minor
formatting nit as well.
Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Robert Richter [Sat, 12 Jul 2008 20:47:57 +0000 (13:47 -0700)]
OProfile kernel maintainership changes
Cc: Philippe Elie <phil.el@wanadoo.fr> Cc: John Levon <levon@movementarian.org> Cc: Maynard Johnson <maynardj@us.ibm.com> Cc: Richard Purdie <rpurdie@openedhand.com> Cc: Daniel Hansel <daniel.hansel@linux.vnet.ibm.com> Cc: Jason Yeh <jason.yeh@amd.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andres Salomon [Sat, 12 Jul 2008 20:47:54 +0000 (13:47 -0700)]
ov7670: clean up ov7670_read semantics
Cortland Setlow pointed out a bug in ov7670.c where the result from
ov7670_read() was just being checked for !0, rather than <0. This made me
realize that ov7670_read's semantics were rather confusing; it both fills
in 'value' with the result, and returns it. This is goes against general
kernel convention; so rather than fixing callers, let's fix the function.
This makes ov7670_read return <0 in the case of an error, and 0 upon
success. Thus, code like:
res = ov7670_read(...);
if (!res)
goto error;
..will work properly.
Signed-off-by: Cortland Setlow <csetlow@tower-research.com> Signed-off-by: Andres Salomon <dilinger@debian.org> Acked-by: Jonathan Corbet <corbet@lwn.net> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I had 8250.nr_uarts=16 in the boot line of a test kernel and I had a weird
mysterious crash in sysfs. After taking an in-depth look I realized that
CONFIG_SERIAL_8250_NR_UARTS was set to 4 and I was walking off the end of
the serial8250_ports array.
Ouch!!!
Don't let this happen to someone else.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Alan Cox <alan@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a bugfix for how defio handles multiple processes manipulating
the same framebuffer.
Thanks to Bernard Blackham for identifying this bug.
It occurs when two applications mmap the same framebuffer and concurrently
write to the same page. Normally, this doesn't occur since only a single
process mmaps the framebuffer. The symptom of the bug is that the mapping
applications will hang. The cause is that defio incorrectly tries to add the
same page twice to the pagelist. The solution I have is to walk the pagelist
and check for a duplicate before adding. Since I needed to walk the pagelist,
I now also keep the pagelist in sorted order.
Signed-off-by: Jaya Kumar <jayakumar.lkml@gmail.com> Cc: Bernard Blackham <bernard@largestprime.net> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Darren Jenkins [Sat, 12 Jul 2008 20:47:50 +0000 (13:47 -0700)]
drivers/isdn/i4l/isdn_common.c fix small resource leak
Coverity CID: 1356 RESOURCE_LEAK
I found a very old patch for this that was Acked but did not get applied
https://lists.linux-foundation.org/pipermail/kernel-janitors/2006-September/016362.html
There looks to be a small leak in isdn_writebuf_stub() in isdn_common.c, when
copy_from_user() returns an un-copied data length (length != 0). The below
patch should be a minimally invasive fix.
Signed-off-by: Darren Jenkins <darrenrjenkins@gmailcom> Acked-by: Karsten Keil <kkeil@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
James Bottomley [Mon, 7 Jul 2008 20:50:01 +0000 (15:50 -0500)]
[SCSI] bsg: fix oops on remove
If you do a modremove of any sas driver, you run into an oops on
shutdown when the host is removed (coming from the host bsg device).
The root cause seems to be that there's a use after free of the
bsg_class_device: In bsg_kref_release_function, this is used (to do a
put_device(bcg->parent) after bcg->release has been called. In sas (and
possibly many other things) bcd->release frees the queue which contains
the bsg_class_device, so we get a put_device on unreferenced memory.
Fix this by taking a copy of the pointer to the parent before releasing
bsg.
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
James Bottomley [Fri, 11 Jul 2008 03:10:55 +0000 (22:10 -0500)]
[SCSI] fusion: default MSI to disabled for SPI and FC controllers
There's a fault on the FC controllers that makes them not respond
correctly to MSI. The SPI controllers are fine, but are likely to be
onboard on older motherboards which don't handle MSI correctly, so
default both these cases to disabled. Enable by setting the module
parameter mpt_msi_enable=1.
For the SAS case, enable MSI by default, but it can be disabled by
setting the module parameter mpt_msi_enable=0.
Cc: "Prakash, Sathya" <sathya.prakash@lsi.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Roland McGrath [Thu, 10 Jul 2008 21:50:39 +0000 (14:50 -0700)]
x86_64: fix delayed signals
On three of the several paths in entry_64.S that call
do_notify_resume() on the way back to user mode, we fail to properly
check again for newly-arrived work that requires another call to
do_notify_resume() before going to user mode. These paths set the
mask to check only _TIF_NEED_RESCHED, but this is wrong. The other
paths that lead to do_notify_resume() do this correctly already, and
entry_32.S does it correctly in all cases.
All paths back to user mode have to check all the _TIF_WORK_MASK
flags at the last possible stage, with interrupts disabled.
Otherwise, we miss any flags (TIF_SIGPENDING for example) that were
set any time after we entered do_notify_resume(). More work flags
can be set (or left set) synchronously inside do_notify_resume(), as
TIF_SIGPENDING can be, or asynchronously by interrupts or other CPUs
(which then send an asynchronous interrupt).
There are many different scenarios that could hit this bug, most of
them races. The simplest one to demonstrate does not require any
race: when one signal has done handler setup at the check before
returning from a syscall, and there is another signal pending that
should be handled. The second signal's handler should interrupt the
first signal handler before it actually starts (so the interrupted PC
is still at the handler's entry point). Instead, it runs away until
the next kernel entry (next syscall, tick, etc).
This test behaves correctly on 32-bit kernels, and fails on 64-bit
(either 32-bit or 64-bit test binary). With this fix, it works.
Mark Rustad [Thu, 10 Jul 2008 19:27:11 +0000 (14:27 -0500)]
[PATCH] IPMI: return correct value from ipmi_write
This patch corrects the handling of write operations to the IPMI watchdog
to work as intended by returning the number of characters actually
processed. Without this patch, an "echo V >/dev/watchdog" enables the
watchdog if IPMI is providing the watchdog function.
Signed-off-by: Mark Rustad <MRustad@gmail.com> Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
x86: Recover timer_ack lost in the merge of the NMI watchdog
In the course of the recent unification of the NMI watchdog an assignment
to timer_ack to switch off unnecesary POLL commands to the 8259A in the
case of a watchdog failure has been accidentally removed. The statement
used to be limited to the 32-bit variation as since the rewrite of the
timer code it has been relevant for the 82489DX only. This change brings
it back.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is no such entity as ISA IRQ2. The ACPI spec does not make it
explicitly clear, but does not preclude it either -- all it says is ISA
legacy interrupts are identity mapped by default (subject to overrides),
but it does not state whether IRQ2 exists or not. As a result if there is
no IRQ0 override, then IRQ2 is normally initialised as an ISA interrupt,
which implies an edge-triggered line, which is unmasked by default as this
is what we do for edge-triggered I/O APIC interrupts so as not to miss an
edge.
To the best of my knowledge it is useless, as IRQ2 has not been in use
since the PC/AT as back then it was taken by the 8259A cascade interrupt
to the slave, with the line position in the slot rerouted to newly-created
IRQ9. No device could thus make use of this line with the pair of 8259A
chips. Now in theory INTIN2 of the I/O APIC may be usable, but the
interrupt of the device wired to it would not be available in the PIC mode
at all, so I seriously doubt if anybody decided to reuse it for a regular
device.
However there are two common uses of INTIN2. One is for IRQ0, with an
ACPI interrupt override (or its equivalent in the MP table). But in this
case IRQ2 is gone entirely with INTIN0 left vacant. The other one is for
an 8959A ExtINTA cascade. In this case IRQ0 goes to INTIN0 and if ACPI is
used INTIN2 is assumed to be IRQ2 (there is no override and ACPI has no
way to report ExtINTA interrupts). This is where a problem happens.
The problem is INTIN2 is configured as a native APIC interrupt, with a
vector assigned and the mask cleared. And the line may indeed get active
and inject interrupts if the master 8959A has its timer interrupt enabled
(it might happen for other interrupts too, but they are normally masked in
the process of rerouting them to the I/O APIC). There are two cases where
it will happen:
* When the I/O APIC NMI watchdog is enabled. This is actually a misnomer
as the watchdog pulses are delivered through the 8259A to the LINT0
inputs of all the local APICs in the system. The implication is the
output of the master 8259A goes high and low repeatedly, signalling
interrupts to INTIN2 which is enabled too!
[The origin of the name is I think for a brief period during the
development we had a capability in our code to configure the watchdog to
use an I/O APIC input; that would be INTIN2 in this scenario.]
* When the native route of IRQ0 via INTIN0 fails for whatever reason -- as
it happens with the system considered here. In this scenario the timer
pulse is delivered through the 8259A to LINT0 input of the local APIC of
the bootstrap processor, quite similarly to how is done for the watchdog
described above. The result is, again, INTIN2 receives these pulses
too. Rafael's system used to escape this scenario, because an incorrect
IRQ0 override would occupy INTIN2 and prevent it from being unmasked.
My conclusion is IRQ2 should be excluded from configuration in all the
cases and the current exception for ACPI systems should be lifted. The
reason being the exception not only being useless, but harmful as well.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Unlike the 32-bit one, the 64-bit variation of the LVT0 setup code for
the "8259A Virtual Wire" through the local APIC timer configuration does
not fully configure the relevant irq_chip structure. Instead it relies on
the preceding I/O APIC code to have set it up, which does not happen if
the I/O APIC variants have not been tried.
The patch includes corresponding changes to the 32-bit variation too
which make them both the same, barring a small syntactic difference
involving sequence of functions in the source. That should work as an aid
with the upcoming merge.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
IRQ0 is edge-triggered, but the "8259A Virtual Wire" through the local
APIC configuration in the 32-bit version uses the "fasteoi" handler
suitable for level-triggered APIC interrupt. Rewrite code so that the
"edge" handler is used. The 64-bit version uses different code and is
unaffected.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Glauber Costa [Thu, 10 Jul 2008 18:17:53 +0000 (15:17 -0300)]
x86: use AS_CFI instead of UNWIND_INFO
In dwarf2_32.h, test for CONFIG_AS_CFI instead of
CONFIG_UNWIND_INFO. Turns out that searching for UNWIND_INFO
returns no match in any Kconfig or Makefile, so we're really
just throwing everything away regarding dwarf frames for i386.
The test that generates CONFIG_AS_CFI does not have anything
x86_64-specific, and right now, checking V=1 builds shows me
that the flags is there anyway, although unused.
Signed-off-by: Glauber Costa <gcosta@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Brian King [Fri, 11 Jul 2008 18:37:50 +0000 (13:37 -0500)]
[SCSI] ipr: Fix HDIO_GET_IDENTITY oops for SATA devices
Currently, ipr does not support HDIO_GET_IDENTITY to SATA devices.
An oops occurs if userspace attempts to send the command. Since hald
issues the command, ensure we fail the ioctl in ipr. This is a
temporary solution to the oops. Once the ipr libata EH conversion
is upstream, ipr will fully support HDIO_GET_IDENTITY.
Tested-by: Milton Miller <miltonm@bga.com> Signed-off-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
libata-acpi: don't call sleeping function from invalid context
Added Targa Visionary 1000 IDE adapter to pata_sis.c
libata-acpi: filter out DIPM enable
Dave Chinner [Fri, 11 Jul 2008 07:43:55 +0000 (17:43 +1000)]
Fix reference counting race on log buffers
When we release the iclog, we do an atomic_dec_and_lock to determine if
we are the last reference and need to trigger update of log headers and
writeout. However, in xlog_state_get_iclog_space() we also need to
check if we have the last reference count there. If we do, we release
the log buffer, otherwise we decrement the reference count.
But the compare and decrement in xlog_state_get_iclog_space() is not
atomic, so both places can see a reference count of 2 and neither will
release the iclog. That leads to a filesystem hang.
Close the race by replacing the atomic_read() and atomic_dec() pair with
atomic_add_unless() to ensure that they are executed atomically.
Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Tim Shimmin <tes@sgi.com> Tested-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
eventually causing init to die and panic the system:
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Not tainted 2.6.26-rc9-tip #13878
after a maratonic bisection session, the bad commit turned out to be:
| b7675791859075418199c7af86a116ea34eaf5bd is first bad commit
| commit b7675791859075418199c7af86a116ea34eaf5bd
| Author: Jeremy Fitzhardinge <jeremy@goop.org>
| Date: Wed Jun 25 00:19:00 2008 -0400
|
| x86: remove open-coded save/load segment operations
|
| This removes a pile of buggy open-coded implementations of savesegment
| and loadsegment.
after some more bisection of this patch itself, it turns out that what
makes the difference are the savesegment() changes to __switch_to().
Taking a look at this portion of arch/x86/kernel/process_64.o revealed
this crutial difference:
note the "m" modifier - it allows GCC to generate the segment move
into a memory operand as well.
But regarding segment operands there's a subtle detail in the x86
instruction set: the above 16-bit moves are zero-extend, but only
if it goes to a register.
If it goes to a memory operand, -0x34(%rbp) in the above case, there's
no zero-extend to 32-bit and the instruction will only save 16 bits
instead of the intended 32-bit.
The other 16 bits is random data - which can cause problems when that
value is used later on.
The solution is to only allow segment operands to go to registers.
This fix allows my test-system to boot up without crashing.
Steven Rostedt [Wed, 9 Jul 2008 04:15:33 +0000 (00:15 -0400)]
sched_clock: and multiplier for TSC to gtod drift
The sched_clock code currently tries to keep all CPU clocks of all CPUS
somewhat in sync. At every clock tick it records the gtod clock and
uses that and jiffies and the TSC to calculate a CPU clock that tries to
stay in sync with all the other CPUs.
ftrace depends heavily on this timer and it detects when this timer
"jumps". One problem is that the TSC and the gtod also drift.
When the TSC is 0.1% faster or slower than the gtod it is very noticeable
in ftrace. To help compensate for this, I've added a multiplier that
tries to keep the CPU clock updating at the same rate as the gtod.
I've tried various ways to get it to be in sync and this ended up being
the most reliable. At every scheduler tick we calculate the new multiplier:
multi = delta_gtod / delta_TSC
This means we perform a 64 bit divide at the tick (once a HZ). A shift
is used to handle the accuracy.
Other methods that failed due to dynamic HZ are:
(not used) multi += (gtod - tsc) / delta_gtod
(not used) multi += (gtod - (last_tsc + delta_tsc)) / delta_gtod
as well as other variants.
This code still allows for a slight drift between TSC and gtod, but
it keeps the damage down to a minimum.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: Steven Rostedt <srostedt@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Wed, 9 Jul 2008 04:15:32 +0000 (00:15 -0400)]
sched_clock: record TSC after gtod
To read the gtod we need to grab the xtime lock for read. Reading the gtod
before the TSC can cause a bigger gab if the xtime lock is contended.
This patch simply reverses the order to read the TSC after the gtod.
The locking in the reading of the gtod handles any barriers one might
think is needed.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: Steven Rostedt <srostedt@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Wed, 9 Jul 2008 04:15:31 +0000 (00:15 -0400)]
sched_clock: only update deltas with local reads.
Reading the CPU clock should try to stay accurate within the CPU.
By reading the CPU clock from another CPU and updating the deltas can
cause unneeded jumps when reading from the local CPU.
This patch changes the code to update the last read TSC only when read
from the local CPU.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: Steven Rostedt <srostedt@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Mon, 7 Jul 2008 23:49:41 +0000 (19:49 -0400)]
sched_clock: fix calculation of other CPU
The algorithm to calculate the 'now' of another CPU is not correct.
At each scheduler tick, each CPU records the last sched_clock and
gtod (tick_raw and tick_gtod respectively). If the TSC is somewhat the
same in speed between two clocks the algorithm would be:
This solves most of the rest of the issues I've had with timestamps in
ftace.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Mon, 7 Jul 2008 18:16:52 +0000 (14:16 -0400)]
sched_clock: stop maximum check on NO HZ
Working with ftrace I would get large jumps of 11 millisecs or more with
the clock tracer. This killed the latencing timings of ftrace and also
caused the irqoff self tests to fail.
What was happening is with NO_HZ the idle would stop the jiffy counter and
before the jiffy counter was updated the sched_clock would have a bad
delta jiffies to compare with the gtod with the maximum.
The jiffies would stop and the last sched_tick would record the last gtod.
On wakeup, the sched clock update would compare the gtod + delta jiffies
(which would be zero) and compare it to the TSC. The TSC would have
correctly (with a stable TSC) moved forward several jiffies. But because the
jiffies has not been updated yet the clock would be prevented from moving
forward because it would appear that the TSC jumped too far ahead.
The clock would then virtually stop, until the jiffies are updated. Then
the next sched clock update would see that the clock was very much behind
since the delta jiffies is now correct. This would then jump the clock
forward by several jiffies.
This caused ftrace to report several milliseconds of interrupts off
latency at every resume from NO_HZ idle.
This patch adds hooks into the nohz code to disable the checking of the
maximum clock update when nohz is in effect. It resumes the max check
when nohz has updated the jiffies again.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: Steven Rostedt <srostedt@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Mon, 7 Jul 2008 18:16:51 +0000 (14:16 -0400)]
sched_clock: widen the max and min time
With keeping the max and min sched time within one jiffy of the gtod clock
was too tight. Just before a schedule tick the max could easily be hit, as
well as just after a schedule_tick the min could be hit. This caused the
clock to jump around by a jiffy.
This patch widens the minimum to
last gtod + (delta_jiffies ? delta_jiffies - 1 : 0) * TICK_NSECS
and the maximum to
last gtod + (2 + delta_jiffies) * TICK_NSECS
This keeps the minum to gtod or if one jiffy less than delta jiffies
and the maxim 2 jiffies ahead of gtod. This may cause unstable TSCs to be
a bit more sporadic, but it helps keep a clock with a stable TSC working well.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: Steven Rostedt <srostedt@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Mon, 7 Jul 2008 18:16:50 +0000 (14:16 -0400)]
sched_clock: record from last tick
The sched_clock code tries to keep within the gtod time by one tick (jiffy).
The current code mistakenly keeps track of the delta jiffies between
updates of the clock, where the the delta is used to compare with the
number of jiffies that have past since an update of the gtod. The gtod is
updated at each schedule tick not each sched_clock update. After one
jiffy passes the clock is updated fine. But the delta is taken from the
last update so if the next update happens before the next tick the delta
jiffies used will be incorrect.
This patch changes the code to check the delta of jiffies between ticks
and not updates to match the comparison of the updates with the gtod.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: Steven Rostedt <srostedt@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86_64: add pseudo-features for 32-bit compat syscall
Add pseudo-feature bits to describe whether the CPU supports sysenter
and/or syscall from ia32-compat userspace. This removes a hardcoded
test in vdso32-setup.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some BIOSen enable DIPM via _GTF which causes command timeouts under
certain configuration. This didn't occur on 2.6.25 because 2.6.25
defaulted to SRST, so _GTF wasn't executed during boot probe, so ahci
host reset disabled DIPM and as _GTF wasn't executed after SRST, DIPM
wasn't enabled. On 2.6.26, hardreset is used during probe and after
probe _GTF is executed enabling DIPM and thus the failures.
This patch could theoretically disable DIPM on machines which used to
have it enabled on 2.6.25 but AFAIK ahci is currently the only driver
which uses SATA ACPI hierarchy (_SDD) and as the host reset would have
always disabled DIPM, this shouldn't happen.
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Paul Gortmaker [Fri, 11 Jul 2008 00:30:48 +0000 (17:30 -0700)]
rtc: fix reported IRQ rate for when HPET is enabled
The IRQ rate reported back by the RTC is incorrect when HPET is enabled.
Newer hardware that has HPET to emulate the legacy RTC device gets this value
wrong since after it sets the rate, it returns before setting the variable
used to report the IRQ rate back to users of the device -- so the set rate and
the reported rate get out of sync.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Brownell <david-b@pacbell.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
tun: Persistent devices can get stuck in xoff state
xfrm: Add a XFRM_STATE_AF_UNSPEC flag to xfrm_usersa_info
ipv6: missed namespace context in ipv6_rthdr_rcv
netlabel: netlink_unicast calls kfree_skb on error path by itself
ipv4: fib_trie: Fix lookup error return
tcp: correct kcalloc usage
ip: sysctl documentation cleanup
Documentation: clarify tcp_{r,w}mem sysctl docs
netfilter: nf_nat_snmp_basic: fix a range check in NAT for SNMP
netfilter: nf_conntrack_tcp: fix endless loop
libertas: fix memory alignment problems on the blackfin
zd1211rw: stop beacons on remove_interface
rt2x00: Disable synchronization during initialization
rc80211_pid: Fix fast_start parameter handling
sctp: Add documentation for sctp sysctl variable
ipv6: fix race between ipv6_del_addr and DAD timer
irda: Fix netlink error path return value
irda: New device ID for nsc-ircc
irda: via-ircc proper dma freeing
sctp: Mark the tsn as received after all allocations finish
...
Max Krasnyansky [Thu, 10 Jul 2008 23:59:11 +0000 (16:59 -0700)]
tun: Persistent devices can get stuck in xoff state
The scenario goes like this. App stops reading from tun/tap.
TX queue gets full and driver does netif_stop_queue().
App closes fd and TX queue gets flushed as part of the cleanup.
Next time the app opens tun/tap and starts reading from it but
the xoff state is not cleared. We're stuck.
Normally xoff state is cleared when netdev is brought up. But
in the case of persistent devices this happens only during
initial setup.
The fix is trivial. If device is already up when an app opens
it we clear xoff state and that gets things moving again.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
xfrm: Add a XFRM_STATE_AF_UNSPEC flag to xfrm_usersa_info
Add a XFRM_STATE_AF_UNSPEC flag to handle the AF_UNSPEC behavior for
the selector family. Userspace applications can set this flag to leave
the selector family of the xfrm_state unspecified. This can be used
to to handle inter family tunnels if the selector is not set from
userspace.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Ben Hutchings [Thu, 10 Jul 2008 23:52:52 +0000 (16:52 -0700)]
ipv4: fib_trie: Fix lookup error return
In commit a07f5f508a4d9728c8e57d7f66294bf5b254ff7f "[IPV4] fib_trie: style
cleanup", the changes to check_leaf() and fn_trie_lookup() were wrong - where
fn_trie_lookup() would previously return a negative error value from
check_leaf(), it now returns 0.
Now fn_trie_lookup() doesn't appear to care about plen, so we can revert
check_leaf() to returning the error value.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Tested-by: William Boughton <bill@boughton.de> Acked-by: Stephen Heminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Take out the confusing language in tcp_frto, and organize the
undocumented values.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: David S. Miller <davem@davemloft.net>
"The register %ecx looks innocent but is very important here. The disassembly:
mov %edx,%ecx
shr $0x2,%ecx
rep stos %eax,%es:(%edi) <-- the fault
So %ecx has been loaded from %edx... which is 0x6b6b6b6b/POISON_FREE.
(0x6b6b6b6b >> 2 == 0x1adadada.)
%ecx is the counter for the memset, from here:
memset(object, 0, c->objsize);
i.e. %ecx was loaded from c->objsize, so "c" must have been freed.
Where did "c" come from? Uh-oh...
c = get_cpu_slab(s, smp_processor_id());
This looks like it has very much to do with CPU hotplug/unplug. Is
there a race between SLUB/hotplug since the CPU slab is used after it
has been freed?"
Good analysis.
Yeah, it's possible that a caller of kmem_cache_alloc() -> slab_alloc()
can be migrated on another CPU right after local_irq_restore() and
before memset(). The inital cpu can become offline in the mean time (or
a migration is a consequence of the CPU going offline) so its
'kmem_cache_cpu' structure gets freed ( slab_cpuup_callback).
At some point of time the caller continues on another CPU having an
obsolete pointer...
Kernel Bugzilla #11063 points out that on some architectures (e.g. x86_32)
exec'ing an ELF without a PT_GNU_STACK program header should default to an
executable stack; but this got broken by the unlimited argv feature because
stack vma is now created before the right personality has been established:
so breaking old binaries using nested function trampolines.
Therefore re-evaluate VM_STACK_FLAGS in setup_arg_pages, where stack
vm_flags used to be set, before the mprotect_fixup. Checking through
our existing VM_flags, none would have changed since insert_vm_struct:
so this seems safer than finding a way through the personality labyrinth.
Gas 2.15 complains about 32-bit registers being used in lea.
AS arch/x86/lib/copy_user_64.o
/local/scratch-2/jeremy/hg/xen/paravirt/linux/arch/x86/lib/copy_user_64.S: Assembler messages:
/local/scratch-2/jeremy/hg/xen/paravirt/linux/arch/x86/lib/copy_user_64.S:188: Error: `(%edx,%ecx,8)' is not a valid 64 bit base/index expression
/local/scratch-2/jeremy/hg/xen/paravirt/linux/arch/x86/lib/copy_user_64.S:257: Error: `(%edx,%ecx,8)' is not a valid 64 bit base/index expression
AS arch/x86/lib/copy_user_nocache_64.o
/local/scratch-2/jeremy/hg/xen/paravirt/linux/arch/x86/lib/copy_user_nocache_64.S: Assembler messages:
/local/scratch-2/jeremy/hg/xen/paravirt/linux/arch/x86/lib/copy_user_nocache_64.S:107: Error: `(%edx,%ecx,8)' is not a valid 64 bit base/index expression
Nick Piggin [Thu, 10 Jul 2008 07:25:35 +0000 (17:25 +1000)]
Fix PREEMPT_RCU without HOTPLUG_CPU
PREEMPT_RCU without HOTPLUG_CPU is broken. The rcu_online_cpu is called
to initially populate rcu_cpu_online_map with all online CPUs when the
hotplug event handler is installed, and also to populate the map with
CPUs as they come online. The former case is meant to happen with and
without HOTPLUG_CPU, but without HOTPLUG_CPU, the rcu_offline_cpu
function is no-oped -- while it still gets called, it does not set the
rcu CPU map.
With a blank RCU CPU map, grace periods get to tick by completely
oblivious to active RCU read side critical sections. This results in
free-before-grace bugs.
Fix is obvious once the problem is known. (Also, change __devinit to
__cpuinit so the function gets thrown away on !HOTPLUG_CPU kernels).
Signed-off-by: Nick Piggin <npiggin@suse.de> Reported-and-tested-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ Nick is my personal hero of the day - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
arch/x86/kernel/built-in.o: In function `visws_early_detect':
: undefined reference to `mach_get_smp_config_quirk'
arch/x86/kernel/built-in.o: In function `visws_early_detect':
: undefined reference to `mach_find_smp_config_quirk'
arch/x86/kernel/visws_quirks.c: In function ‘visws_early_detect’:
arch/x86/kernel/visws_quirks.c:293: error: ‘no_broadcast’ undeclared (first use in this function)
arch/x86/kernel/visws_quirks.c:293: error: (Each undeclared identifier is reported only once
arch/x86/kernel/visws_quirks.c:293: error: for each function it appears in.)
make[1]: *** [arch/x86/kernel/visws_quirks.o] Error 1
make: *** [arch/x86/kernel/visws_quirks.o] Error 2
Robert Richter [Thu, 10 Jul 2008 16:58:24 +0000 (18:58 +0200)]
x86/pci merge: fixing numaq initialization
Patch d49c4288 (tip/x86/mpparse) introduced some changes in calling
subsys_init calls if CONFIG_X86_NUMAQ option is set. This patch
updates subsystem initalization according to this changes.
Signed-off-by: Robert Richter <robert.richter@amd.com> Cc: Robert Richter <robert.richter@amd.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>