Linus Torvalds [Fri, 20 Jun 2008 19:37:55 +0000 (12:37 -0700)]
Merge branch 'agp-patches' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6
* 'agp-patches' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6:
[agp]: fixup chipset flush for new Intel G4x.
agp: brown paper bag patch - put back the two lines it took out.
Linus Torvalds [Fri, 20 Jun 2008 19:37:13 +0000 (12:37 -0700)]
Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression
rcupreempt: remove export of rcu_batches_completed_bh
cpuset: limit the input of cpuset.sched_relax_domain_level
Linus Torvalds [Fri, 20 Jun 2008 19:36:55 +0000 (12:36 -0700)]
Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched, delay accounting: fix incorrect delay time when constantly waiting on runqueue
sched: CPU hotplug events must not destroy scheduler domains created by the cpusets
sched: rt-group: fix RR buglet
sched: rt-group: heirarchy aware throttle
sched: rt-group: fix hierarchy
sched: NULL pointer dereference while setting sched_rt_period_us
sched: fix defined-but-unused warning
Linus Torvalds [Fri, 20 Jun 2008 19:36:38 +0000 (12:36 -0700)]
Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, geode: add a VSA2 ID for General Software
x86: use BOOTMEM_EXCLUSIVE on 32-bit
x86, 32-bit: fix boot failure on TSC-less processors
x86: fix NULL pointer deref in __switch_to
x86: set PAE PHYSICAL_MASK_SHIFT to 44 bits.
Linus Torvalds [Fri, 20 Jun 2008 19:34:43 +0000 (12:34 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/blackfin-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/blackfin-2.6:
Blackfin Serial Driver: Use timer to poll CTS PIN instead of workqueue.
Blackfin arch: fix typo error in bf548 serial header file
Linus Torvalds [Fri, 20 Jun 2008 19:31:03 +0000 (12:31 -0700)]
Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
ahci: sis can't do PMP
ata_piix: add TECRA M4 to broken suspend list
LIBATA: Add HAVE_PATA_PLATFORM to select PATA_PLATFORM driver
sata_mv: warn on PIO with multiple DRQs
sata_mv: enable async_notify for 60x1 Rev.C0 and higher
libata: don't check whether to use DMA or not for no data commands
ahci: jmb361 has only one port
Linus Torvalds [Fri, 20 Jun 2008 19:19:28 +0000 (12:19 -0700)]
[watchdog] hpwdt: fix use of inline assembly
The inline assembly in drivers/watchdog/hpwdt.c was incredibly broken,
and included all the function prologue and epilogue stuff, even though
it was itself then inside a C function where the compiler would add its
own prologue and epilogue on top of it all.
This then just _happened_ to work if you had exactly the right compiler
version and exactly the right compiler flags, so that gcc just happened
to not create any prologue at all (the gcc-generated epilogue wouldn't
matter, since it would never be reached).
But the more proper way to fix it is to simply not do this. Move the
inline asm to the top level, with no surrounding function at all (the
better alternative would be to remove the prologue and make it actually
use proper description of the arguments to the inline asm, but that's a
bigger change than the one I'm willing to make right now).
Linus Torvalds [Fri, 20 Jun 2008 18:18:25 +0000 (11:18 -0700)]
Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIP
KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit 557ed1fa2620dc119adb86b34c614e152a629a80 ("remove ZERO_PAGE") removed
the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
generally now populate the VM with real empty pages needlessly.
We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
since fault handling no longer uses ZERO_PAGE for new anonymous pages,
we now need to handle that special case in follow_page() instead.
In particular, the removal of ZERO_PAGE effectively removed the core
file writing optimization where we would skip writing pages that had not
been populated at all, and increased memory pressure a lot by allocating
all those useless newly zeroed pages.
This reinstates the optimization by making the unmapped PTE case the
same as for a non-existent page table, which already did this correctly.
While at it, this also fixes the XIP case for follow_page(), where the
caller could not differentiate between the case of a page that simply
could not be used (because it had no "struct page" associated with it)
and a page that just wasn't mapped.
We do that by simply returning an error pointer for pages that could not
be turned into a "struct page *". The error is arbitrarily picked to be
EFAULT, since that was what get_user_pages() already used for the
equivalent IO-mapped page case.
[ Also removed an impossible test for pte_offset_map_lock() failing:
that's not how that function works ]
Jordan Crouse [Wed, 18 Jun 2008 17:34:38 +0000 (11:34 -0600)]
x86, geode: add a VSA2 ID for General Software
General Software writes their own VSA2 module for their version
of the Geode BIOS, which returns a different ID then the standard
VSA2. This was causing the framebuffer driver to break for most
GSW boards.
Bharath Ravi [Mon, 16 Jun 2008 09:41:01 +0000 (15:11 +0530)]
sched, delay accounting: fix incorrect delay time when constantly waiting on runqueue
This patch corrects the incorrect value of per process run-queue wait
time reported by delay statistics. The anomaly was due to the following
reason. When a process leaves the CPU and immediately starts waiting for
CPU on the runqueue (which means it remains in the TASK_RUNNABLE state),
the time of re-entry into the run-queue is never recorded. Due to this,
the waiting time on the runqueue from this point of re-entry upto the
next time it hits the CPU is not accounted for. This is solved by
recording the time of re-entry of a process leaving the CPU in the
sched_info_depart() function IF the process will go back to waiting on
the run-queue. This IF condition is verified by checking whether the
process is still in the TASK_RUNNABLE state.
The patch was tested on 2.6.26-rc6 using two simple CPU hog programs.
The values noted prior to the fix did not account for the time spent on
the runqueue waiting. After the fix, the correct values were reported
back to user space.
Signed-off-by: Bharath Ravi <bharathravi1@gmail.com> Signed-off-by: Madhava K R <madhavakr@gmail.com> Cc: dhaval@linux.vnet.ibm.com Cc: vatsa@in.ibm.com Cc: balbir@in.ibm.com Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86, 32-bit: fix boot failure on TSC-less processors
Booting 2.6.26-rc6 on my 486 DX/4 fails with a "BUG: Int 6"
(invalid opcode) and a kernel halt immediately after the
kernel has been uncompressed. The BUG shows EIP pointing
to an rdtsc instruction in native_read_tsc(), invoked from
native_sched_clock().
(This error occurs so early that not even the serial console
can capture it.)
>x86: distangle user disabled TSC from unstable
>
>tsc_enabled is set to 0 from the command line switch "notsc" and from
>the mark_tsc_unstable code. Seperate those functionalities and replace
>tsc_enable with tsc_disable. This makes also the native_sched_clock()
>decision when to use TSC understandable.
>
>Preparatory patch to solve the sched_clock() issue on 32 bit.
>
>Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The core reason for this bug is that native_sched_clock() gets
called before tsc_init().
Before the commit above, tsc_32.c used a "tsc_enabled" variable
which defaulted to 0 == disabled, and which only got enabled late
in tsc_init(). Thus early calls to native_sched_clock() would skip
the TSC and use jiffies instead.
After the commit above, tsc_32.c uses a "tsc_disabled" variable
which defaults to 0, meaning that the TSC is Ok to use. Early calls
to native_sched_clock() now erroneously try to use the TSC on
!cpu_has_tsc processors, leading to invalid opcode exceptions.
My proposed fix is to initialise tsc_disabled to a "soft disabled"
state distinct from the hard disabled state set up by the "notsc"
kernel option. This fixes the native_sched_clock() problem. It also
allows tsc_init() to be simplified: instead of setting tsc_disabled = 1
on every error return, we just set tsc_disabled = 0 once when all
checks have succeeded.
I've verified that this lets my 486 boot again. I've also verified
that a Core2 machine still uses the TSC as clocksource after the patch.
Signed-off-by: Mikael Pettersson <mikpe@it.uu.se> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Suresh Siddha [Fri, 13 Jun 2008 22:47:12 +0000 (15:47 -0700)]
x86: fix NULL pointer deref in __switch_to
Patrick McHardy reported a crash:
> > I get this oops once a day, its apparently triggered by something
> > run by cron, but the process is a different one each time.
> >
> > Kernel is -git from yesterday shortly before the -rc6 release
> > (last commit is the usb-2.6 merge, the x86 patches are missing),
> > .config is attached.
> >
> > I'll retry with current -git, but the patches that have gone in
> > since I last updated don't look related.
> >
> > [62060.043009] BUG: unable to handle kernel NULL pointer dereference at
> > 000001ff
> > [62060.043009] IP: [<c0102a9b>] __switch_to+0x2f/0x118
> > [62060.043009] *pde = 00000000
> > [62060.043009] Oops: 0002 [#1] PREEMPT
Vegard Nossum analyzed it:
> This decodes to
>
> 0: 0f ae 00 fxsave (%eax)
>
> so it's related to the floating-point context. This is the exact
> location of the crash:
>
> $ addr2line -e arch/x86/kernel/process_32.o -i ab0
> include/asm/i387.h:232
> include/asm/i387.h:262
> arch/x86/kernel/process_32.c:595
>
> ...so it looks like prev_task->thread.xstate->fxsave has become NULL.
> Or maybe it never had any other value.
Somehow (as described below) TS_USEDFPU is set but the fpu is not
allocated or freed.
Another possible FPU pre-emption issue with the sleazy FPU optimization
which was benign before but not so anymore, with the dynamic FPU allocation
patch.
New task is getting exec'd and it is prempted at the below point.
Now when it context switches in again, as the used_math() is still set
and fpu_counter can be > 5, we will do a math_state_restore() which sets
the task's TS_USEDFPU. After it continues from the above preemption point
it does clear_used_math() and much later free_thread_xstate().
Now, at the next context switch, it is quite possible that xstate is
null, used_math() is not set and TS_USEDFPU is still set. This will
trigger unlazy_fpu() causing kernel oops.
Fix this by clearing tsk's fpu_counter before clearing task's fpu.
When a 64-bit x86 processor runs in 32-bit PAE mode, a pte can
potentially have the same number of physical address bits as the
64-bit host ("Enhanced Legacy PAE Paging"). This means, in theory,
we could have up to 52 bits of physical address in a pte.
The 32-bit kernel uses a 32-bit unsigned long to represent a pfn.
This means that it can only represent physical addresses up to 32+12=44
bits wide. Rather than widening pfns everywhere, just set 2^44 as the
Linux x86_32-PAE architectural limit for physical address size.
This is a bugfix for two cases:
1. running a 32-bit PAE kernel on a machine with
more than 64GB RAM.
2. running a 32-bit PAE Xen guest on a host machine with
more than 64GB RAM
In both cases, a pte could need to have more than 36 bits of physical,
and masking it to 36-bits will cause fairly severe havoc.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Jan Beulich <jbeulich@novell.com> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jason Wessel [Tue, 27 May 2008 17:23:29 +0000 (12:23 -0500)]
softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression
The touch_nmi_watchdog() routine on x86 ultimately calls
touch_softlockup_watchdog(). The problem is that to touch the
softlockup watchdog, the cpu_clock code has to be called which could
involve multiple cpu locks and can lead to a hard hang if one of the
locks is held by a processor that is not going to return anytime soon
(such as could be the case with kgdb or perhaps even with some other
kind of exception).
This patch causes the public version of the
touch_softlockup_watchdog() to defer the cpu clock access to a later
point.
The test case for this problem is to use the following kernel config
options:
It should be noted that kgdb test suite and these options were not
available until 2.6.26-rc2, so it was necessary to patch the kgdb
test suite during the bisection.
I would consider this patch a regression fix because the problem first
appeared in commit 27ec4407790d075c325e1f4da0a19c56953cce23 when some
logic was added to try to periodically sync the clocks. It was
possible to work around this particular problem by simply not
performing the sync anytime the system was in a critical context.
This was ok until commit 3e51f33fcc7f55e6df25d15b55ed10c8b4da84cd,
which added config option CONFIG_HAVE_UNSTABLE_SCHED_CLOCK and some
multi-cpu locks to sync the clocks. It became clear that accessing
this code from an nmi was the source of the lockups. Avoiding the
access to the low level clock code from an code inside the NMI
processing also fixed the problem with the 27ec44... commit.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Thu, 22 May 2008 18:18:17 +0000 (14:18 -0400)]
rcupreempt: remove export of rcu_batches_completed_bh
In rcupreempt, rcu_batches_completed_bh is defined as a static inline in
the header file. This does not need to be exported, and not only that,
this breaks my PPC build.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: paulus@samba.org Cc: linuxppc-dev@ozlabs.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Li Zefan [Tue, 13 May 2008 02:27:17 +0000 (10:27 +0800)]
cpuset: limit the input of cpuset.sched_relax_domain_level
We allow the inputs to be [-1 ... SD_LV_MAX), and return -EINVAL
for inputs outside this range.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Acked-by: Paul Jackson <pj@sgi.com> Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Max Krasnyansky [Thu, 29 May 2008 18:17:01 +0000 (11:17 -0700)]
sched: CPU hotplug events must not destroy scheduler domains created by the cpusets
First issue is not related to the cpusets. We're simply leaking doms_cur.
It's allocated in arch_init_sched_domains() which is called for every
hotplug event. So we just keep reallocation doms_cur without freeing it.
I introduced free_sched_domains() function that cleans things up.
Second issue is that sched domains created by the cpusets are
completely destroyed by the CPU hotplug events. For all CPU hotplug
events scheduler attaches all CPUs to the NULL domain and then puts
them all into the single domain thereby destroying domains created
by the cpusets (partition_sched_domains).
The solution is simple, when cpusets are enabled scheduler should not
create default domain and instead let cpusets do that. Which is
exactly what the patch does.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com> Cc: pj@sgi.com Cc: menage@google.com Cc: rostedt@goodmis.org Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Peter Zijlstra [Thu, 19 Jun 2008 07:06:59 +0000 (09:06 +0200)]
sched: rt-group: fix RR buglet
In tick_task_rt() we first call update_curr_rt() which can dequeue a runqueue
due to it running out of runtime, and then we try to requeue it, of it also
having exhausted its RR quota. Obviously requeueing something that is no longer
on the runqueue will not have the expected result.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Daniel K. <dk@uw.no> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Thu, 19 Jun 2008 07:06:57 +0000 (09:06 +0200)]
sched: rt-group: heirarchy aware throttle
The bandwidth throttle code dequeues a group when it runs out of quota, and
re-queues it once the period rolls over and the quota gets refreshed.
Sadly it failed to take the hierarchy into consideration. Share more of the
enqueue/dequeue code with regular task opterations.
Also, some operations like sched_setscheduler() can dequeue/enqueue tasks that
are in throttled runqueues, we should not inadvertly re-enqueue empty runqueues
so check for that.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Daniel K. <dk@uw.no> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Linus Torvalds [Thu, 19 Jun 2008 04:52:35 +0000 (21:52 -0700)]
Merge branch 'agp-patches' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6
* 'agp-patches' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6:
agp/intel: cleanup some serious whitespace badness
[AGP] intel_agp: Add support for Intel 4 series chipsets
[AGP] intel_agp: extra stolen mem size available for IGD_GM chipset
agp: more boolean conversions.
drivers/char/agp - use bool
agp: two-stage page destruction issue
agp/via: fixup pci ids
Ben Dooks [Mon, 16 Jun 2008 11:16:26 +0000 (12:16 +0100)]
LIBATA: Add HAVE_PATA_PLATFORM to select PATA_PLATFORM driver
Add HAVE_PATA_PLATFORM to select the pata platform driver
to ensure that we do not end up with a long 'depends on' list
when other users of this driver turn up.
Signed-off-by: Ben Dooks <ben-linux@fluff.org> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Mark Lord [Wed, 18 Jun 2008 16:13:02 +0000 (12:13 -0400)]
sata_mv: warn on PIO with multiple DRQs
Chip errata sometimes prevents reliable use of PIO commands which involve
more than a single DRQ (data request). In normal operation, libata should
not generate such PIO commands (uses DMA instead), but they could be sent
in via SG_IO from userspace.
A full workaround might be to break up such commands into sequences
of single DRQ ones, but that's just way too complex for something
that doesn't normally happen in real life.
So, allow the attempt (it often works, despite the errata),
but log the event for reference when somebody screams.
Signed-off-by: Mark Lord <mlord@pobox.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Tejun Heo [Tue, 17 Jun 2008 03:36:26 +0000 (12:36 +0900)]
libata: don't check whether to use DMA or not for no data commands
There's no reason to check whether to use DMA or not for no data
commands. Don't do it. While at it, make local variable using_pio in
atapi_xlat() set iff ATAPI_PROT_PIO is going to be used and rename
ata_check_atapi_dma() to atapi_check_dma() for consistency.
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Jan Beulich [Wed, 18 Jun 2008 08:28:00 +0000 (09:28 +0100)]
agp: two-stage page destruction issue
besides it apparently being useful only in 2.6.24 (the changes in 2.6.25
really mean that it could be converted back to a single-stage mechanism),
I'm seeing an issue in Xen Dom0 kernels, which is caused by the calling
of gart_to_virt() in the second stage invocations of the destroy function.
I think that besides this being a real issue with Xen (where
unmap_page_from_agp() is not just a page table attribute change), this
also is invalid from a theoretical perspective: One should not assume that
gart_to_virt() is still valid after unmapping a page. So minimally (keeping
the 2-stage mechanism) a patch like the one below would be needed.
Linus Torvalds [Wed, 18 Jun 2008 23:08:59 +0000 (16:08 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/uverbs: Fix check of is_closed flag check in ib_uverbs_async_handler()
RDMA/nes: Fix off-by-one in nes_reg_user_mr() error path
Jack Morgenstein [Wed, 18 Jun 2008 22:36:38 +0000 (15:36 -0700)]
IB/uverbs: Fix check of is_closed flag check in ib_uverbs_async_handler()
Commit 1ae5c187 ("IB/uverbs: Don't store struct file * for event
files") changed the way that closed files are handled in the uverbs
code. However, after the conversion, is_closed flag is checked
incorrectly in ib_uverbs_async_handler(). As a result, no async
events are ever passed to applications.
Found by: Ronni Zimmerman <ronniz@mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6:
[SCSI] dpt_i2o: Add PROC_IA64 define
[SCSI] scsi_host regression: fix scsi host leak
[SCSI] sr: fix corrupt CD data after media change and delay
Paul Mackerras [Wed, 18 Jun 2008 06:40:35 +0000 (16:40 +1000)]
[POWERPC] Clear sub-page HPTE present bits when demoting page size
When we demote a slice from 64k to 4k, and we are about to insert an
HPTE for a 4k subpage and we notice that there is an existing 64k
HPTE, we first invalidate that HPTE before inserting the new 4k
subpage HPTE. Since the bits that encode which hash bucket the old
HPTE was in overlap with the bits that encode which of the 16 subpages
have HPTEs, we need to clear out the subpage HPTE-present bits before
starting to insert HPTEs for the 4k subpages. If we don't do that, we
can erroneously think that a subpage already has an HPTE when it
doesn't.
That in itself wouldn't be such a problem except that when we go to
update the HPTE that we think is present on machines with a
hypervisor, the hypervisor can tell us that the HPTE we think is there
is actually there even though it isn't, which can lead to a process
getting stuck in a loop, continually faulting. The reason for the
confusion is that the AVPN (abbreviated virtual page number) we are
looking for in the HPTE for a 4k subpage can actually match the AVPN
in a stale HPTE for another 64k page. For example, the HPTE for
the 4k subpage at 0x84000f000 will be in the same hash bucket and have
the same AVPN as the HPTE for the 64k page at 0x8400f0000.
This fixes the code to clear out the subpage HPTE-present bits.
Josh Boyer [Tue, 17 Jun 2008 22:34:39 +0000 (08:34 +1000)]
[POWERPC] 4xx: Clear new TLB cache attribute bits in Data Storage vector
A recent commit added support for the new 440x6 and 464 cores that have the
added WL1, IL1I, IL1D, IL2I, and ILD2 bits for the caching attributes in the
TLBs. The new bits were cleared in the finish_tlb_load function, however a
similar bit of code was missed in the DataStorage interrupt vector.
Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
Patrick McHardy [Wed, 18 Jun 2008 09:07:07 +0000 (02:07 -0700)]
netlink: genl: fix circular locking
genetlink has a circular locking dependency when dumping the registered
families:
- dump start:
genl_rcv() : take genl_mutex
genl_rcv_msg() : call netlink_dump_start() while holding genl_mutex
netlink_dump_start(),
netlink_dump() : take nlk->cb_mutex
ctrl_dumpfamily() : try to detect this case and not take genl_mutex a
second time
- dump continuance:
netlink_rcv() : call netlink_dump
netlink_dump : take nlk->cb_mutex
ctrl_dumpfamily() : take genl_mutex
Register genl_lock as callback mutex with netlink to fix this. This slightly
widens an already existing module unload race, the genl ops used during the
dump might go away when the module is unloaded. Thomas Graf is working on a
seperate fix for this.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
The problem is that the mac80211 stack not only needs to be able to
muck with the link-level headers, it also might need to mangle all of
the packet data if doing sw wireless encryption.
This fixes kernel bugzilla #10903. Thanks to Didier Raboud (for the
bugzilla report), Andrew Prince (for bisecting), Johannes Berg (for
bringing this bisection analysis to my attention), and Ilpo (for
trying to analyze this purely from the TCP side).
In 2.6.27 we can take another stab at this, by using something like
skb_cow_data() when the TX path of mac80211 ends up with a non-NULL
tx->key. The ESP protocol code in the IPSEC stack can be used as a
model for implementation.
Signed-off-by: David S. Miller <davem@davemloft.net>
Rainer Weikusat [Wed, 18 Jun 2008 05:28:05 +0000 (22:28 -0700)]
af_unix: fix 'poll for write'/ connected DGRAM sockets
The unix_dgram_sendmsg routine implements a (somewhat crude)
form of receiver-imposed flow control by comparing the length of the
receive queue of the 'peer socket' with the max_ack_backlog value
stored in the corresponding sock structure, either blocking
the thread which caused the send-routine to be called or returning
EAGAIN. This routine is used by both SOCK_DGRAM and SOCK_SEQPACKET
sockets. The poll-implementation for these socket types is
datagram_poll from core/datagram.c. A socket is deemed to be writeable
by this routine when the memory presently consumed by datagrams
owned by it is less than the configured socket send buffer size. This
is always wrong for connected PF_UNIX non-stream sockets when the
abovementioned receive queue is currently considered to be full.
'poll' will then return, indicating that the socket is writeable, but
a subsequent write result in EAGAIN, effectively causing an
(usual) application to 'poll for writeability by repeated send request
with O_NONBLOCK set' until it has consumed its time quantum.
The change below uses a suitably modified variant of the datagram_poll
routines for both type of PF_UNIX sockets, which tests if the
recv-queue of the peer a socket is connected to is presently
considered to be 'full' as part of the 'is this socket
writeable'-checking code. The socket being polled is additionally
put onto the peer_wait wait queue associated with its peer, because the
unix_dgram_sendmsg routine does a wake up on this queue after a
datagram was received and the 'other wakeup call' is done implicitly
as part of skb destruction, meaning, a process blocked in poll
because of a full peer receive queue could otherwise sleep forever
if no datagram owned by its socket was already sitting on this queue.
Among this change is a small (inline) helper routine named
'unix_recvq_full', which consolidates the actual testing code (in three
different places) into a single location.
Signed-off-by: Rainer Weikusat <rweikusat@mssgmbh.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ang Way Chuang [Wed, 18 Jun 2008 04:10:33 +0000 (21:10 -0700)]
tun: Proper handling of IPv6 header in tun driver when TUN_NO_PI is set
By default, tun.c running in TUN_TUN_DEV mode will set the protocol of
packet to IPv4 if TUN_NO_PI is set. My program failed to work when I
assumed that the driver will check the first nibble of packet,
determine IP version and set the appropriate protocol.
Signed-off-by: Ang Way Chuang <wcang@nav6.org> Acked-by: Max Krasnyansky <maxk@qualcomm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Radu Cristescu [Thu, 12 Jun 2008 22:04:54 +0000 (17:04 -0500)]
atl1: relax eeprom mac address error check
The atl1 driver tries to determine the MAC address thusly:
- If an EEPROM exists, read the MAC address from EEPROM and
validate it.
- If an EEPROM doesn't exist, try to read a MAC address from
SPI flash.
- If that fails, try to read a MAC address directly from the
MAC Station Address register.
- If that fails, assign a random MAC address provided by the
kernel.
We now have a report of a system fitted with an EEPROM containing all
zeros where we expect the MAC address to be, and we currently handle
this as an error condition. Turns out, on this system the BIOS writes
a valid MAC address to the NIC's MAC Station Address register, but we
never try to read it because we return an error when we find the all-
zeros address in EEPROM.
This patch relaxes the error check and continues looking for a MAC
address even if it finds an illegal one in EEPROM.
Signed-off-by: Radu Cristescu <advantis@gmx.net> Signed-off-by: Jay Cliburn <jacliburn@bellsouth.net> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
David Brownell [Fri, 13 Jun 2008 04:38:06 +0000 (21:38 -0700)]
net/enc28j60: low power mode
Keep enc28j60 chips in low-power mode when they're not in use.
At typically 120 mA, these chips run hot even when idle; this
low power mode cuts that power usage by a factor of around 100.
This version provides a generic routine to poll a register until
its masked value equals some value ... e.g. bit set or cleared.
It's basically what the previous wait_phy_ready() did, but this
version is generalized to support the handshaking needed to
enter and exit low power mode.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Claudio Lanconelli <lanconelli.claudio@eptar.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Josh Boyer [Tue, 17 Jun 2008 23:27:55 +0000 (19:27 -0400)]
ibm_newemac: select CRC32 in Kconfig
The ibm_newemac driver requires ether_crc to be defined. Apparently it is
possible to generate a .config without CONFIG_CRC32 set which causes the
following link errors if IBM_NEW_EMAC is selected:
LD .tmp_vmlinux1
drivers/built-in.o: In function `emac_hash_mc':
core.c:(.text+0x2f524): undefined reference to `crc32_le'
core.c:(.text+0x2f528): undefined reference to `bitrev32'
make: *** [.tmp_vmlinux1] Error 1
This patch has IBM_NEW_EMAC select CRC32 so we don't hit this error.
Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Linus Torvalds [Wed, 18 Jun 2008 00:47:50 +0000 (17:47 -0700)]
x86-64: Fix "bytes left to copy" return value for copy_from_user()
Most users by far do not care about the exact return value (they only
really care about whether the copy succeeded in its entirety or not),
but a few special core routines actually care deeply about exactly how
many bytes were copied from user space.
And the unrolled versions of the x86-64 user copy routines would
sometimes report that it had copied more bytes than it actually had.
Very few uses actually have partial copies to begin with, but to make
this bug even harder to trigger, most x86 CPU's use the "rep string"
instructions for normal user copies, and that version didn't have this
issue.
To make it even harder to hit, the one user of this that really cared
about the return value (and used the uncached version of the copy that
doesn't use the "rep string" instructions) was the generic write
routine, which pre-populated its source, once more hiding the problem by
avoiding the exception case that triggers the bug.
In other words, very special thanks to Bron Gondwana who not only
triggered this, but created a test-program to show it, and bisected the
behavior down to commit 08291429cfa6258c4cd95d8833beb40f828b194e ("mm:
fix pagecache write deadlocks") which changed the access pattern just
enough that you can now trigger it with 'writev()' with multiple
iovec's.
That commit itself was not the cause of the bug, it just allowed all the
stars to align just right that you could trigger the problem.
[ Side note: this is just the minimal fix to make the copy routines
(with __copy_from_user_inatomic_nocache as the particular version that
was involved in showing this) have the right return values.
We really should improve on the exceptional case further - to make the
copy do a byte-accurate copy up to the exact page limit that causes it
to fail. As it is, the callers have to do extra work to handle the
limit case gracefully. ]
Reported-by: Bron Gondwana <brong@fastmail.fm> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(which didn't have this problem), and since
most users that do the carethis was very hard to trigger, but
Steffen Klassert [Tue, 17 Jun 2008 23:37:13 +0000 (16:37 -0700)]
xfrm: fix fragmentation for ipv4 xfrm tunnel
When generating the ip header for the transformed packet we just copy
the frag_off field of the ip header from the original packet to the ip
header of the new generated packet. If we receive a packet as a chain
of fragments, all but the last of the new generated packets have the
IP_MF flag set. We have to mask the frag_off field to only keep the
IP_DF flag from the original packet. This got lost with git commit 36cf9acf93e8561d9faec24849e57688a81eb9c5 ("[IPSEC]: Separate
inner/outer mode processing on output")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
The H.245 helper is not registered/unregistered, but assigned to
connections manually from the Q.931 helper. This means on unload
existing expectations and connections using the helper are not
cleaned up, leading to the following oops on module unload:
One way to fix this would be to split helper cleanup from the unregistration
function and invoke it for the H.245 helper, but since ctnetlink needs to be
able to find the helper for synchonization purposes, a better fix is to
register it normally and make sure its not assigned to connections during
helper lookup. The missing l3num initialization is enough for this, this
patch changes it to use AF_UNSPEC to make it more explicit though.
Reported-by: liannan <liannan@twsz.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Tue, 17 Jun 2008 22:51:47 +0000 (15:51 -0700)]
netfilter: nf_nat: fix RCU races
Fix three ct_extend/NAT extension related races:
- When cleaning up the extension area and removing it from the bysource hash,
the nat->ct pointer must not be set to NULL since it may still be used in
a RCU read side
- When replacing a NAT extension area in the bysource hash, the nat->ct
pointer must be assigned before performing the replacement
- When reallocating extension storage in ct_extend, the old memory must
not be freed immediately since it may still be used by a RCU read side
atm: [iphase] doesn't call phy->start due to a bogus #ifndef
This causes the suni driver to oops if you try to use sonetdiag to get
the statistics. Also add the corresponding phy->stop call to fix another
oops if you try to remove the module.
Signed-off-by: Jorge Boncompte [DTI2] <jorge@dti2.net> Signed-off-by: Chas Williams <chas@cmf.nrl.navy.mil> Signed-off-by: David S. Miller <davem@davemloft.net>
atm: [iphase] set drvdata before enabling interrupts
Signed-off-by: Jorge Boncompte [DTI2] <jorge@dti2.net> Signed-off-by: Chas Williams <chas@cmf.nrl.navy.mil> Signed-off-by: David S. Miller <davem@davemloft.net>
It happens that if a packet arrives in a VC between the call to open it on
the hardware and the call to change the backend to br2684, br2684_regvcc
processes the packet and oopses dereferencing skb->dev because it is
NULL before the call to br2684_push().
Signed-off-by: Jorge Boncompte [DTI2] <jorge@dti2.net> Signed-off-by: Chas Williams <chas@cmf.nrl.navy.mil>
Ben Hutchings [Tue, 17 Jun 2008 00:02:28 +0000 (17:02 -0700)]
net: Fix test for VLAN TX checksum offload capability
Selected device feature bits can be propagated to VLAN devices, so we
can make use of TX checksum offload and TSO on VLAN-tagged packets.
However, if the physical device does not do VLAN tag insertion or
generic checksum offload then the test for TX checksum offload in
dev_queue_xmit() will see a protocol of htons(ETH_P_8021Q) and yield
false.
This splits the checksum offload test into two functions:
- can_checksum_protocol() tests a given protocol against a feature bitmask
- dev_can_checksum() first tests the skb protocol against the device
features; if that fails and the protocol is htons(ETH_P_8021Q) then
it tests the encapsulated protocol against the effective device
features for VLANs
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Vlad Yasevich [Tue, 17 Jun 2008 00:00:29 +0000 (17:00 -0700)]
sctp: Correclty set changeover_active for SFR-CACC
Right now, any time we set a primary transport we set
the changeover_active flag. As a result, we invoke SFR-CACC
even when there has been no changeover events.
Only set changeover_active, when there is a true changeover
event, i.e. we had a primary path and we are changing to
another transport.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Mon, 16 Jun 2008 23:59:55 +0000 (16:59 -0700)]
sctp: Correctly cleanup procfs entries upon failure.
This patch remove the proc fs entry which has been created if fail to
set up proc fs entry for the SCTP protocol.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6 sit: Avoid extra need for compat layer in PRL management.
We've introduced extra need of compat layer for ip_tunnel_prl{}
for PRL (Potential Router List) management. Though compat_ioctl
is still missing in ipv4/ipv6, let's make the interface more
straight-forward and eliminate extra need for nasty compat layer
anyway since the interface is new for 2.6.26.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
pkt_sched: HTB scheduler, change default hysteresis mode to off.
The HTB hysteresis mode reduce the CPU load, but at the
cost of scheduling accuracy.
On ADSL links (512 kbit/s upstream), this inaccuracy introduce
significant jitter, enought to disturbe VoIP. For details see my
masters thesis (http://www.adsl-optimizer.dk/thesis/), chapter 7,
section 7.3.1, pp 69-70.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk> Acked-by: Martin Devera <devik@cdi.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Mon, 16 Jun 2008 20:17:33 +0000 (13:17 -0700)]
Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
ocfs2: Remove ->hangup() from stack glue operations.
ocfs2: Move the call of ocfs2_hb_ctl into the stack glue.
ocfs2: Move the hb_ctl_path sysctl into the stack glue.
Joel Becker [Fri, 30 May 2008 22:58:26 +0000 (15:58 -0700)]
ocfs2: Remove ->hangup() from stack glue operations.
The ->hangup() call was only used to execute ocfs2_hb_ctl. Now that
the generic stack glue code handles this, the underlying stack drivers
don't need to know about it.
Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Joel Becker [Fri, 30 May 2008 22:43:58 +0000 (15:43 -0700)]
ocfs2: Move the call of ocfs2_hb_ctl into the stack glue.
Take o2hb_stop() out of the o2cb code and make it part of the generic
stack glue as ocfs2_leave_group(). This also allows us to remove the
ocfs2_get_hb_ctl_path() function - everything to do with hb_ctl is now
part of stackglue.c. o2cb no longer needs a ->hangup() function.
Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>