err.no Git - linux-2.6/log

]> err.no Git - linux-2.6/log

Gerd Hoffmann [Tue, 3 Jun 2008 14:17:32 +0000 (16:17 +0200)]

x86: KVM guest: Use the paravirt clocksource structs and functions

This patch updates the kvm host code to use the pvclock structs
and functions, thereby making it compatible with Xen.

The patch also fixes an initialization bug: on SMP systems the
per-cpu has two different locations early at boot and after CPU
bringup. kvmclock must take that in account when registering the
physical address within the host.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Gerd Hoffmann [Tue, 3 Jun 2008 14:17:31 +0000 (16:17 +0200)]

KVM: Make kvm host use the paravirt clocksource structs

This patch updates the kvm host code to use the pvclock structs.
It also makes the paravirt clock compatible with Xen.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Gerd Hoffmann [Tue, 3 Jun 2008 14:17:30 +0000 (16:17 +0200)]

x86: Make xen use the paravirt clocksource structs and functions

This patch updates the xen guest to use the pvclock structs
and helper functions.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Gerd Hoffmann [Tue, 3 Jun 2008 14:17:29 +0000 (16:17 +0200)]

x86: Add structs and functions for paravirt clocksource

This patch adds structs for the paravirt clocksource ABI
used by both xen and kvm (pvclock-abi.h).

It also adds some helper functions to read system time and
wall clock time from a paravirtual clocksource (pvclock.[ch]).
They are based on the xen code. They are enabled using
CONFIG_PARAVIRT_CLOCK.

Subsequent patches of this series will put the code in use.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Avi Kivity [Tue, 24 Jun 2008 08:48:49 +0000 (11:48 +0300)]

KVM: VMX: Fix host msr corruption with preemption enabled

Switching msrs can occur either synchronously as a result of calls to
the msr management functions (usually in response to the guest touching
virtualized msrs), or asynchronously when preempting a kvm thread that has
guest state loaded. If we're unlucky enough to have the two at the same
time, host msrs are corrupted and the machine goes kaput on the next syscall.

Most easily triggered by Windows Server 2008, as it does a lot of msr
switching during bootup.

Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Avi Kivity [Tue, 17 Jun 2008 22:36:36 +0000 (15:36 -0700)]

KVM: ioapic: fix lost interrupt when changing a device's irq

The ioapic acknowledge path translates interrupt vectors to irqs.  It
currently uses a first match algorithm, stopping when it finds the first
redirection table entry containing the vector.  That fails however if the
guest changes the irq to a different line, leaving the old redirection table
entry in place (though masked).  Result is interrupts not making it to the
guest.

Fix by always scanning the entire redirection table.

Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Avi Kivity [Thu, 12 Jun 2008 13:54:41 +0000 (16:54 +0300)]

KVM: MMU: Fix oops on guest userspace access to guest pagetable

KVM has a heuristic to unshadow guest pagetables when userspace accesses
them, on the assumption that most guests do not allow userspace to access
pagetables directly. Unfortunately, in addition to unshadowing the pagetables,
it also oopses.

This never triggers on ordinary guests since sane OSes will clear the
pagetables before assigning them to userspace, which will trigger the flood
heuristic, unshadowing the pagetables before the first userspace access. One
particular guest, though (Xenner) will run the kernel in userspace, triggering
the oops. Since the heuristic is incorrect in this case, we can simply
remove it.

Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Marcelo Tosatti [Wed, 11 Jun 2008 23:32:40 +0000 (20:32 -0300)]

KVM: MMU: large page update_pte issue with non-PAE 32-bit guests (resend)

kvm_mmu_pte_write() does not handle 32-bit non-PAE large page backed
guests properly. It will instantiate two 2MB sptes pointing to the same
physical 2MB page when a guest large pte update is trapped.

Instead of duplicating code to handle this, disallow directory level
updates to happen through kvm_mmu_pte_write(), so the two 2MB sptes
emulating one guest 4MB pte can be correctly created by the page fault
handling path.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Marcelo Tosatti [Sun, 8 Jun 2008 04:48:53 +0000 (01:48 -0300)]

KVM: MMU: Fix rmap_write_protect() hugepage iteration bug

rmap_next() does not work correctly after rmap_remove(), as it expects
the rmap chains not to change during iteration. Fix (for now) by restarting
iteration from the beginning.

Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Marcelo Tosatti [Fri, 6 Jun 2008 19:37:36 +0000 (16:37 -0300)]

KVM: close timer injection race window in __vcpu_run

If a timer fires after kvm_inject_pending_timer_irqs() but before
local_irq_disable() the code will enter guest mode and only inject such
timer interrupt the next time an unrelated event causes an exit.

It would be simpler if the timer->pending irq conversion could be done
with IRQ's disabled, so that the above problem cannot happen.

For now introduce a new vcpu requests bit to cancel guest entry.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Marcelo Tosatti [Fri, 6 Jun 2008 19:37:35 +0000 (16:37 -0300)]

KVM: Fix race between timer migration and vcpu migration

A guest vcpu instance can be scheduled to a different physical CPU
between the test for KVM_REQ_MIGRATE_TIMER and local_irq_disable().

If that happens, the timer will only be migrated to the current pCPU on
the next exit, meaning that guest LAPIC timer event can be delayed until
a host interrupt is triggered.

Fix it by cancelling guest entry if any vcpu request is pending. This
has the side effect of nicely consolidating vcpu->requests checks.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

commit | commitdiff | tree

Linus Torvalds [Mon, 23 Jun 2008 23:25:11 +0000 (16:25 -0700)]

Merge branch 'hotfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6

* 'hotfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: nfs_updatepage(): don't mark page as dirty if an error occurred
  NFS: Fix filehandle size comparisons in the mount code
  NFS: Reduce the NFS mount code stack usage.

commit | commitdiff | tree

Trond Myklebust [Thu, 5 Jun 2008 20:02:35 +0000 (16:02 -0400)]

NFS: nfs_updatepage(): don't mark page as dirty if an error occurred

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

commit | commitdiff | tree

Trond Myklebust [Thu, 19 Jun 2008 19:21:11 +0000 (15:21 -0400)]

NFS: Fix filehandle size comparisons in the mount code

Fix a sign issue in xdr_decode_fhstatus3()
Fix incorrect comparison in nfs_validate_mount_data()

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

commit | commitdiff | tree

Trond Myklebust [Thu, 19 Jun 2008 18:20:11 +0000 (14:20 -0400)]

NFS: Reduce the NFS mount code stack usage.

This appears to fix the Oops reported in
http://bugzilla.kernel.org/show_bug.cgi?id=10826

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

commit | commitdiff | tree

Linus Torvalds [Mon, 23 Jun 2008 19:49:22 +0000 (12:49 -0700)]

commit | commitdiff | tree

Linus Torvalds [Mon, 23 Jun 2008 19:48:50 +0000 (12:48 -0700)]

Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: refactor wait_for_completion_timeout()
  sched: fix wait_for_completion_timeout() spurious failure under heavy load
  sched: rt: dont stop the period timer when there are tasks wanting to run

commit | commitdiff | tree

Linus Torvalds [Mon, 23 Jun 2008 19:48:17 +0000 (12:48 -0700)]

Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  xen: don't drop NX bit
  xen: mask unwanted pte bits in __supported_pte_mask
  xen: Use wmb instead of rmb in xen_evtchn_do_upcall().
  x86: fix NULL pointer deref in __switch_to

commit | commitdiff | tree

Linus Torvalds [Mon, 23 Jun 2008 19:45:49 +0000 (12:45 -0700)]

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/mthca: Clear ICM pages before handing to FW

commit | commitdiff | tree

Linus Torvalds [Mon, 23 Jun 2008 19:18:06 +0000 (12:18 -0700)]

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
ALSA: sb - Fix wrong assertions
ALSA: aw2 - Fix Oops at initialization

commit | commitdiff | tree

Nick Piggin [Mon, 23 Jun 2008 12:30:30 +0000 (14:30 +0200)]

mm: fix race in COW logic

There is a race in the COW logic.  It contains a shortcut to avoid the
COW and reuse the page if we have the sole reference on the page,
however it is possible to have two racing do_wp_page()ers with one
causing the other to mistakenly believe it is safe to take the shortcut
when it is not.  This could lead to data corruption.

Process 1 and process2 each have a wp pte of the same anon page (ie.
one forked the other).  The page's mapcount is 2.  Then they both
attempt to write to it around the same time...

  proc1 proc2 thr1 proc2 thr2
  CPU0 CPU1 CPU3
  do_wp_page() do_wp_page()
trylock_page()
  can_share_swap_page()
   load page mapcount (==2)
  reuse = 0
pte unlock
copy page to new_page
pte lock
page_remove_rmap(page);
   trylock_page()
    can_share_swap_page()
     load page mapcount (==1)
    reuse = 1
   ptep_set_access_flags (allow W)

  write private key into page
read from page
ptep_clear_flush()
set_pte_at(pte of new_page)

Fix this by moving the page_remove_rmap of the old page after the pte
clear and flush.  Potentially the entire branch could be moved down
here, but in order to stay consistent, I won't (should probably move all
the *_mm_counter stuff with one patch).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Linus Torvalds [Mon, 23 Jun 2008 18:21:37 +0000 (11:21 -0700)]

Fix ZERO_PAGE breakage with vmware

Commit 89f5b7da2a6bad2e84670422ab8192382a5aeb9f ("Reinstate ZERO_PAGE
optimization in 'get_user_pages()' and fix XIP") broke vmware, as
reported by Jeff Chua:

  "This broke vmware 6.0.4.
   Jun 22 14:53:03.845: vmx| NOT_IMPLEMENTED
   /build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774"

and the reason seems to be that there's an old bug in how we handle do
FOLL_ANON on VM_SHARED areas in get_user_pages(), but since it only
triggered if the whole page table was missing, nobody had apparently hit
it before.

The recent changes to 'follow_page()' made the FOLL_ANON logic trigger
not just for whole missing page tables, but for individual pages as
well, and exposed this problem.

This fixes it by making the test for when FOLL_ANON is used more
careful, and also makes the code easier to read and understand by moving
the logic to a separate inline function.

Reported-and-tested-by: Jeff Chua <jeff.chua.linux@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Gustavo Fernando Padovan [Mon, 23 Jun 2008 11:07:06 +0000 (12:07 +0100)]

removed unused var real_tty on n_tty_ioctl()

I noted that the 'struct tty_struct *real_tty' is not used in this
function, so I removed the code about 'real_tty'.

Signed-off-by: Gustavo Fernando Padovan <gustavo@las.ic.unicamp.br>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Alan Cox [Mon, 23 Jun 2008 11:06:52 +0000 (12:06 +0100)]

tty_driver: Update required method documentation

Some of the requirement rules are now more relaxed. Also correct a
contradiction in the previous update

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Eli Cohen [Mon, 23 Jun 2008 16:29:58 +0000 (09:29 -0700)]

IB/mthca: Clear ICM pages before handing to FW

Current memfree FW has a bug which in some cases, assumes that ICM
pages passed to it are cleared.  This patch uses __GFP_ZERO to
allocate all ICM pages passed to the FW.  Once firmware with a fix is
released, we can make the workaround conditional on firmware version.

This fixes the bug reported by Arthur Kepner <akepner@sgi.com> here:
http://lists.openfabrics.org/pipermail/general/2008-May/050026.html

Cc: <stable@kernel.org>
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
[ Rewritten to be a one-liner using __GFP_ZERO instead of vmap()ing
  ICM memory and memset()ing it to 0. - Roland ]

Signed-off-by: Roland Dreier <rolandd@cisco.com>

commit | commitdiff | tree

Thomas Gleixner [Mon, 23 Jun 2008 09:21:58 +0000 (11:21 +0200)]

futexes: fix fault handling in futex_lock_pi

This patch addresses a very sporadic pi-futex related failure in
highly threaded java apps on large SMP systems.

David Holmes reported that the pi_state consistency check in
lookup_pi_state triggered with his test application. This means that
the kernel internal pi_state and the user space futex variable are out
of sync. First we assumed that this is a user space data corruption,
but deeper investigation revieled that the problem happend because the
pi-futex code is not handling a fault in the futex_lock_pi path when
the user space variable needs to be fixed up.

The fault happens when a fork mapped the anon memory which contains
the futex readonly for COW or the page got swapped out exactly between
the unlock of the futex and the return of either the new futex owner
or the task which was the expected owner but failed to acquire the
kernel internal rtmutex. The current futex_lock_pi() code drops out
with an inconsistent in case it faults and returns -EFAULT to user
space. User space has no way to fixup that state.

When we wrote this code we thought that we could not drop the hash
bucket lock at this point to handle the fault.

After analysing the code again it turned out to be wrong because there
are only two tasks involved which might modify the pi_state and the
user space variable:

- the task which acquired the rtmutex
- the pending owner of the pi_state which did not get the rtmutex

Both tasks drop into the fixup_pi_state() function before returning to
user space. The first task which acquired the hash bucket lock faults
in the fixup of the user space variable, drops the spinlock and calls
futex_handle_fault() to fault in the page. Now the second task could
acquire the hash bucket lock and tries to fixup the user space
variable as well. It either faults as well or it succeeds because the
first task already faulted the page in.

One caveat is to avoid a double fixup. After returning from the fault
handling we reacquire the hash bucket lock and check whether the
pi_state owner has been modified already.

Reported-by: David Holmes <david.holmes@sun.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Holmes <david.holmes@sun.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kernel/futex.c | 93 ++++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 73 insertions(+), 20 deletions(-)

commit | commitdiff | tree

Takashi Iwai [Mon, 23 Jun 2008 09:58:06 +0000 (11:58 +0200)]

ALSA: sb - Fix wrong assertions

snd_assert() in save_mixer() and restore_mixer() in sb_mixer.c is
just wrong. The debug code wasn't tested at all, obviously...

Signed-off-by: Takashi Iwai <tiwai@suse.de>

commit | commitdiff | tree

Takashi Iwai [Mon, 23 Jun 2008 09:54:05 +0000 (11:54 +0200)]

ALSA: aw2 - Fix Oops at initialization

The irq handler may be called before the proper initialization of hardware.
Call snd_aw2_saa7146_setup() before the irq handler registration.

Signed-off-by: Takashi Iwai <tiwai@suse.de>

commit | commitdiff | tree

Ingo Molnar [Mon, 23 Jun 2008 09:00:26 +0000 (11:00 +0200)]

Merge branch 'linus' into sched/urgent

commit | commitdiff | tree

Linus Torvalds [Sun, 22 Jun 2008 19:23:15 +0000 (12:23 -0700)]

Fix performance regression on lmbench select benchmark

Christian Borntraeger reported that reinstating cond_resched() with
CONFIG_PREEMPT caused a performance regression on lmbench:

For example select file 500:
23 microseconds
32 microseconds

and that's really because we totally unnecessarily do the cond_resched()
in the innermost loop of select(), which is just silly.

This moves it out from the innermost loop (which only ever loops ove the
bits in a single "unsigned long" anyway), which makes the performance
regression go away.

Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Christoph Lameter [Sat, 21 Jun 2008 23:46:35 +0000 (16:46 -0700)]

Slab: Fix memory leak in fallback_alloc()

The zonelist patches caused the loop that checks for available
objects in permitted zones to not terminate immediately. One object
per zone per allocation may be allocated and then abandoned.

Break the loop when we have successfully allocated one object.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Linus Torvalds [Sat, 21 Jun 2008 23:43:56 +0000 (16:43 -0700)]

Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
Ext4: Fix online resize block group descriptor corruption

commit | commitdiff | tree

Linus Torvalds [Sat, 21 Jun 2008 19:31:02 +0000 (12:31 -0700)]

Merge branch 'release' of git://lm-sensors.org/kernel/mhoffman/hwmon-2.6

* 'release' of git://lm-sensors.org/kernel/mhoffman/hwmon-2.6:
  hwmon: (lm75) sensor reading bugfix
  hwmon: (abituguru3) update driver detection
  hwmon: (w83791d) new maintainer
  hwmon: (abituguru3) Identify Abit AW8D board as such
  hwmon: Update the sysfs interface documentation
  hwmon: (adt7473) Initialize max_duty_at_overheat before use
  hwmon: (lm85) Fix function RANGE_TO_REG()

commit | commitdiff | tree

Bernhard Walle [Sat, 21 Jun 2008 17:01:02 +0000 (19:01 +0200)]

Add return value to reserve_bootmem_node()

This patch changes the function reserve_bootmem_node() from void to int,
returning -ENOMEM if the allocation fails.

This fixes a build problem on x86 with CONFIG_KEXEC=y and
CONFIG_NEED_MULTIPLE_NODES=y

Signed-off-by: Bernhard Walle <bwalle@suse.de>
Reported-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Linus Torvalds [Sat, 21 Jun 2008 15:44:08 +0000 (08:44 -0700)]

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  netns: Don't receive new packets in a dead network namespace.
  sctp: Make sure N * sizeof(union sctp_addr) does not overflow.
  pppoe: warning fix
  ipv6: Drop packets for loopback address from outside of the box.
  ipv6: Remove options header when setsockopt's optlen is 0
  mac80211: detect driver tx bugs

commit | commitdiff | tree

Eric W. Biederman [Sat, 21 Jun 2008 05:16:51 +0000 (22:16 -0700)]

netns: Don't receive new packets in a dead network namespace.

Alexey Dobriyan <adobriyan@gmail.com> writes:
> Subject: ICMP sockets destruction vs ICMP packets oops

> After icmp_sk_exit() nuked ICMP sockets, we get an interrupt.
> icmp_reply() wants ICMP socket.
>
> Steps to reproduce:
>
> launch shell in new netns
> move real NIC to netns
> setup routing
> ping -i 0
> exit from shell
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<ffffffff803fce17>] icmp_sk+0x17/0x30
> PGD 17f3cd067 PUD 17f3ce067 PMD 0
> Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU 0
> Modules linked in: usblp usbcore
> Pid: 0, comm: swapper Not tainted 2.6.26-rc6-netns-ct #4
> RIP: 0010:[<ffffffff803fce17>]  [<ffffffff803fce17>] icmp_sk+0x17/0x30
> RSP: 0018:ffffffff8057fc30  EFLAGS: 00010286
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff81017c7db900
> RDX: 0000000000000034 RSI: ffff81017c7db900 RDI: ffff81017dc41800
> RBP: ffffffff8057fc40 R08: 0000000000000001 R09: 000000000000a815
> R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff8057fd28
> R13: ffffffff8057fd00 R14: ffff81017c7db938 R15: ffff81017dc41800
> FS:  0000000000000000(0000) GS:ffffffff80525000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 000000017fcda000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffffffff8053a000, task ffffffff804fa4a0)
> Stack:  0000000000000000 ffff81017c7db900 ffffffff8057fcf0 ffffffff803fcfe4
>  ffffffff804faa38 0000000000000246 0000000000005a40 0000000000000246
>  000000000001ffff ffff81017dd68dc0 0000000000005a40 0000000055342436
> Call Trace:
>  <IRQ>  [<ffffffff803fcfe4>] icmp_reply+0x44/0x1e0
>  [<ffffffff803d3a0a>] ? ip_route_input+0x23a/0x1360
>  [<ffffffff803fd645>] icmp_echo+0x65/0x70
>  [<ffffffff803fd300>] icmp_rcv+0x180/0x1b0
>  [<ffffffff803d6d84>] ip_local_deliver+0xf4/0x1f0
>  [<ffffffff803d71bb>] ip_rcv+0x33b/0x650
>  [<ffffffff803bb16a>] netif_receive_skb+0x27a/0x340
>  [<ffffffff803be57d>] process_backlog+0x9d/0x100
>  [<ffffffff803bdd4d>] net_rx_action+0x18d/0x250
>  [<ffffffff80237be5>] __do_softirq+0x75/0x100
>  [<ffffffff8020c97c>] call_softirq+0x1c/0x30
>  [<ffffffff8020f085>] do_softirq+0x65/0xa0
>  [<ffffffff80237af7>] irq_exit+0x97/0xa0
>  [<ffffffff8020f198>] do_IRQ+0xa8/0x130
>  [<ffffffff80212ee0>] ? mwait_idle+0x0/0x60
>  [<ffffffff8020bc46>] ret_from_intr+0x0/0xf
>  <EOI>  [<ffffffff80212f2c>] ? mwait_idle+0x4c/0x60
>  [<ffffffff80212f23>] ? mwait_idle+0x43/0x60
>  [<ffffffff8020a217>] ? cpu_idle+0x57/0xa0
>  [<ffffffff8040f380>] ? rest_init+0x70/0x80
> Code: 10 5b 41 5c 41 5d 41 5e c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53
> 48 83 ec 08 48 8b 9f 78 01 00 00 e8 2b c7 f1 ff 89 c0 <48> 8b 04 c3 48 83 c4 08
> 5b c9 c3 66 66 66 66 66 2e 0f 1f 84 00
> RIP  [<ffffffff803fce17>] icmp_sk+0x17/0x30
>  RSP <ffffffff8057fc30>
> CR2: 0000000000000000
> ---[ end trace ea161157b76b33e8 ]---
> Kernel panic - not syncing: Aiee, killing interrupt handler!

Receiving packets while we are cleaning up a network namespace is a
racy proposition. It is possible when the packet arrives that we have
removed some but not all of the state we need to fully process it.  We
have the choice of either playing wack-a-mole with the cleanup routines
or simply dropping packets when we don't have a network namespace to
handle them.

Since the check looks inexpensive in netif_receive_skb let's just
drop the incoming packets.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Sat, 21 Jun 2008 05:04:34 +0000 (22:04 -0700)]

sctp: Make sure N * sizeof(union sctp_addr) does not overflow.

As noticed by Gabriel Campana, the kmalloc() length arg
passed in by sctp_getsockopt_local_addrs_old() can overflow
if ->addr_num is large enough.

Therefore, enforce an appropriate limit.

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Stephen Hemminger [Sat, 21 Jun 2008 04:58:02 +0000 (21:58 -0700)]

pppoe: warning fix

Fix warning:
drivers/net/pppoe.c: In function 'pppoe_recvmsg':
drivers/net/pppoe.c:945: warning: comparison of distinct pointer types lacks a cast
because skb->len is unsigned int and total_len is size_t

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Linus Torvalds [Sat, 21 Jun 2008 00:10:04 +0000 (17:10 -0700)]

Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6

* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
[IA64] SN2: security hole in sn2_ptc_proc_write

commit | commitdiff | tree

Ivan Kokshaysky [Fri, 20 Jun 2008 23:28:54 +0000 (03:28 +0400)]

alpha: resurrect Cypress IDE quirk

Which was removed in the hope that generic legacy IDE quirk in
drivers/pci/probe.c is sufficient for Cypress IDE.
It isn't, as this controller has non-standard BAR layout:
secondary channel registers are in the BAR0-1 of the second
PCI function - not in the BAR2-3 of the same function, as the
generic quirk routine assumes.

Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Ivan Kokshaysky [Fri, 20 Jun 2008 23:28:31 +0000 (03:28 +0400)]

alpha: fix compile failures with gcc-4.3 (bug #10438)

Vast majority of these build failures are gcc-4.3 warnings
about static functions and objects being referenced from
non-static (read: "extern inline") functions, in conjunction
with our -Werror.

We cannot just convert "extern inline" to "static inline",
as people keep suggesting all the time, because "extern inline"
logic is crucial for generic kernel build.
So
- just make sure that all callees of critical "extern inline"
functions are also "extern inline";
- use "static inline", wherever it's possible.

traps.c: work around gcc-4.3 being too smart about array
bounds-checking.

TODO: add "gnu_inline" attribute to all our "extern inline"
functions to ensure desired behaviour with future compilers.

Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Ivan Kokshaysky [Fri, 20 Jun 2008 23:26:21 +0000 (03:26 +0400)]

alpha: link failure fix

With built-in scsi disk driver, the final link fails with a following
error:
`.exit.text' referenced in section `.rodata' of drivers/built-in.o:
defined in discarded section `.exit.text' of drivers/built-in.o

This happens with -Os (CONFIG_CC_OPTIMIZE_FOR_SIZE=y) with all gcc-4
versions, and also with -O2 and gcc-4.3.

The problem is in sd.c:sd_major() being inlined into __exit function
exit_sd(), and the compiler generating a jump table in .rodata section
for the 'switch' statement in sd_major(). So we have references to
discarded section.

Fixed with a big hammer in the form of -fno-jump-tables.

Note that jump tables vs. discarded sections is a generic problem,
other architectures are just lucky not to suffer from it. But with
a slightly more complex switch/case statement it can be reproduced
on x86 as well. So maybe at some point we should consider
-fno-jump-tables as a generic compile option...

Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Ivan Kokshaysky [Fri, 20 Jun 2008 23:25:39 +0000 (03:25 +0400)]

alpha: fix module load failures on smp (bug #10926)

To calculate addresses of locally defined variables, GCC uses 32-bit
displacement from the GP. Which doesn't work for per cpu variables in
modules, as an offset to the kernel per cpu area is way above 4G.

The workaround is to force allocation of a GOT entry for per cpu variable
using ldq instruction with a 'literal' relocation.
I had to use custom asm/percpu.h, as a required argument magic doesn't
work with asm-generic/percpu.h macros.

Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Linus Torvalds [Fri, 20 Jun 2008 23:19:44 +0000 (16:19 -0700)]

Linux 2.6.26-rc7

commit | commitdiff | tree

Linus Torvalds [Fri, 20 Jun 2008 19:46:47 +0000 (12:46 -0700)]

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6:
  BAST: Remove old IDE driver
  pcmcia ide kingston compactflash's have a new manufacturer id
  pcmcia: add another pata/ide ID
  pcmcia: add an pata/ide ID
  ide: increase timeout in wait_drive_not_busy()
  palm_bk3710: fix resource management

commit | commitdiff | tree

Linus Torvalds [Fri, 20 Jun 2008 19:41:10 +0000 (12:41 -0700)]

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6:
  ieee1394: Kconfig menu touch-up
  firewire: Kconfig menu touch-up
  firewire: deadline for PHY config transmission
  firewire: fw-ohci: unify printk prefixes
  firewire: fill_bus_reset_event needs lock protection
  firewire: fw-ohci: write selfIDBufferPtr before LinkControl.rcvSelfID
  firewire: fw-ohci: disable PHY packet reception into AR context
  firewire: fw-ohci: use of uninitialized data in AR handler
  firewire: don't panic on invalid AR request buffer

commit | commitdiff | tree

Linus Torvalds [Fri, 20 Jun 2008 19:39:12 +0000 (12:39 -0700)]

Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6

* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6:
ACPI: no AC status notification
ACPI Exception (video-1721): UNKNOWN_STATUS_CODE, Cant attach device

commit | commitdiff | tree

Linus Torvalds [Fri, 20 Jun 2008 19:38:18 +0000 (12:38 -0700)]

Merge branch 'drm-patches' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6

* 'drm-patches' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6: (21 commits)
  drm: only trust core drm ioctls - driver ioctls are a mess.
  drm/i915: add support for Intel series 4 chipsets.
  drm/radeon: add hier-z registers for r300 and r500 chipsets
  drm/radeon: use DSTCACHE_CTLSTAT rather than RB2D_DSTCACHE_CTLSTAT
  drm/radeon: switch IGP gart to use radeon_write_agp_base()
  drm/radeon: Restore sw interrupt on resume
  drm/r500: add support for AGP based cards.
  drm/radeon: fix texture uploads with large 3d textures (bug 13980)
  drm/radeon: add initial r500 support.
  drm/radeon: init pipe setup in kernel code.
  drm/radeon: fixup radeon_do_engine_reset
  drm/radeon: fix pixcache and purge/cache flushing registers
  drm/radeon: write AGP_BASE_2 on chips that support it.
  drm/radeon: merge IGP chip setup and fixup RS400 vs RS480 support
  drm/radeon: IGP clean up register and magic numbers.
  drm/rs690: set base 2 to 0.
  drm/rs690: set all of gart base address.
  radeon: add production microcode from AMD
  drm: pcigart use proper pci map interfaces.
  drm: the sg alloc ioctl should write back the handle to userspace
  ...

commit | commitdiff | tree

Linus Torvalds [Fri, 20 Jun 2008 19:37:55 +0000 (12:37 -0700)]

Merge branch 'agp-patches' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6

* 'agp-patches' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6:
[agp]: fixup chipset flush for new Intel G4x.
agp: brown paper bag patch - put back the two lines it took out.

commit | commitdiff | tree

Linus Torvalds [Fri, 20 Jun 2008 19:37:13 +0000 (12:37 -0700)]

Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression
  rcupreempt: remove export of rcu_batches_completed_bh
  cpuset: limit the input of cpuset.sched_relax_domain_level

commit | commitdiff | tree

Linus Torvalds [Fri, 20 Jun 2008 19:36:55 +0000 (12:36 -0700)]

Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched, delay accounting: fix incorrect delay time when constantly waiting on runqueue
  sched: CPU hotplug events must not destroy scheduler domains created by the cpusets
  sched: rt-group: fix RR buglet
  sched: rt-group: heirarchy aware throttle
  sched: rt-group: fix hierarchy
  sched: NULL pointer dereference while setting sched_rt_period_us
  sched: fix defined-but-unused warning

commit | commitdiff | tree

Linus Torvalds [Fri, 20 Jun 2008 19:36:38 +0000 (12:36 -0700)]

Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, geode: add a VSA2 ID for General Software
  x86: use BOOTMEM_EXCLUSIVE on 32-bit
  x86, 32-bit: fix boot failure on TSC-less processors
  x86: fix NULL pointer deref in __switch_to
  x86: set PAE PHYSICAL_MASK_SHIFT to 44 bits.

commit | commitdiff | tree

Linus Torvalds [Fri, 20 Jun 2008 19:34:43 +0000 (12:34 -0700)]

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/blackfin-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/blackfin-2.6:
Blackfin Serial Driver: Use timer to poll CTS PIN instead of workqueue.
Blackfin arch: fix typo error in bf548 serial header file

commit | commitdiff | tree

Linus Torvalds [Fri, 20 Jun 2008 19:31:03 +0000 (12:31 -0700)]

Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev

* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
  ahci: sis can't do PMP
  ata_piix: add TECRA M4 to broken suspend list
  LIBATA: Add HAVE_PATA_PLATFORM to select PATA_PLATFORM driver
  sata_mv: warn on PIO with multiple DRQs
  sata_mv: enable async_notify for 60x1 Rev.C0 and higher
  libata: don't check whether to use DMA or not for no data commands
  ahci: jmb361 has only one port

commit | commitdiff | tree

Linus Torvalds [Fri, 20 Jun 2008 19:19:28 +0000 (12:19 -0700)]

[watchdog] hpwdt: fix use of inline assembly

The inline assembly in drivers/watchdog/hpwdt.c was incredibly broken,
and included all the function prologue and epilogue stuff, even though
it was itself then inside a C function where the compiler would add its
own prologue and epilogue on top of it all.

This then just _happened_ to work if you had exactly the right compiler
version and exactly the right compiler flags, so that gcc just happened
to not create any prologue at all (the gcc-generated epilogue wouldn't
matter, since it would never be reached).

But the more proper way to fix it is to simply not do this. Move the
inline asm to the top level, with no surrounding function at all (the
better alternative would be to remove the prologue and make it actually
use proper description of the arguments to the inline asm, but that's a
bigger change than the one I'm willing to make right now).

Tested-by: S.Çağlar Onur <caglar@pardus.org.tr>
Acked-by: Thomas Mingarelli <Thomas.Mingarelli@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Cliff Wickman [Fri, 20 Jun 2008 19:02:00 +0000 (12:02 -0700)]

[IA64] SN2: security hole in sn2_ptc_proc_write

Security hole in sn2_ptc_proc_write

It is possible to overrun a buffer with a write to this /proc file.

Signed-off-by: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>

commit | commitdiff | tree

Ben Dooks [Fri, 20 Jun 2008 18:53:35 +0000 (20:53 +0200)]

BAST: Remove old IDE driver

Remove the old BAST IDE driver, as we are now using the platform-pata
support.

Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Cc: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

commit | commitdiff | tree

Christophe Niclaes [Fri, 20 Jun 2008 18:53:34 +0000 (20:53 +0200)]

pcmcia ide kingston compactflash's have a new manufacturer id

Up to now, Kingston compactflash cards (ab)used the Toshiba Manufacturer's ID,
In their new CF cards, they use a new one. Let's the ide subsystem
recognize CF cards with the new id.

Signed-off-by: Christophe Niclaes <cniclaes@develtech.com>
Acked-by: Philippe De Muyter <phdm@macqel.be>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

commit | commitdiff | tree

Kristoffer Ericson [Fri, 20 Jun 2008 18:53:34 +0000 (20:53 +0200)]

pcmcia: add another pata/ide ID

Addition of Transcend 1GB 45x id so that it is properly detected.

[bart: fix typo in ide-cs's ID spotted by Alan Cox]

Signed-off-by: William Peters <w1ll14@gmail.com>
Signed-off-by: Kristoffer Ericson <Kristoffer_e1@hotmail.com>
CC: Alan Cox <alan@lxorguk.ukuu.org.uk>
CC: linux-ide@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

commit | commitdiff | tree

Matt Reimer [Fri, 20 Jun 2008 18:53:34 +0000 (20:53 +0200)]

pcmcia: add an pata/ide ID

Add an id for:

product info: "M-Systems", "CF300", ""
manfid: 0x000a, 0x0000
function: 4 (fixed disk)

Signed-off-by: Matt Reimer <mreimer@vpop.net>
CC: Alan Cox <alan@lxorguk.ukuu.org.uk>
CC: linux-ide@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

commit | commitdiff | tree

Bartlomiej Zolnierkiewicz [Fri, 20 Jun 2008 18:53:33 +0000 (20:53 +0200)]

ide: increase timeout in wait_drive_not_busy()

Some ATAPI devices take longer than the current max timeout value to
become ready (i.e. TEAC DV-W28ECW takes 6 ms) so increase the timeout
value to 10 ms.

This fixes kernel.org bugzilla bug #10887:
http://bugzilla.kernel.org/show_bug.cgi?id=10887

Reported-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

commit | commitdiff | tree

Sergei Shtylyov [Fri, 20 Jun 2008 18:53:32 +0000 (20:53 +0200)]

palm_bk3710: fix resource management

The driver expected a *virtual* address in the IDE platform device's memory
resource and didn't request the memory region for the register block. Fix this
taking into account the fact that DaVinci SoC devices are fixed-mapped to the
virtual memory early and we can get their virtual addresses using IO_ADDRESS()
macro, not having to call ioremap()...

While at it, also do some cosmetic changes...

Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

commit | commitdiff | tree

Linus Torvalds [Fri, 20 Jun 2008 18:18:25 +0000 (11:18 -0700)]

Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIP

KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
557ed1fa2620dc119adb86b34c614e152a629a80 ("remove ZERO_PAGE") removed
the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
generally now populate the VM with real empty pages needlessly.

We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
since fault handling no longer uses ZERO_PAGE for new anonymous pages,
we now need to handle that special case in follow_page() instead.

In particular, the removal of ZERO_PAGE effectively removed the core
file writing optimization where we would skip writing pages that had not
been populated at all, and increased memory pressure a lot by allocating
all those useless newly zeroed pages.

This reinstates the optimization by making the unmapped PTE case the
same as for a non-existent page table, which already did this correctly.

While at it, this also fixes the XIP case for follow_page(), where the
caller could not differentiate between the case of a page that simply
could not be used (because it had no "struct page" associated with it)
and a page that just wasn't mapped.

We do that by simply returning an error pointer for pages that could not
be turned into a "struct page *". The error is arbitrarily picked to be
EFAULT, since that was what get_user_pages() already used for the
equivalent IO-mapped page case.

[ Also removed an impossible test for pte_offset_map_lock() failing:
that's not how that function works ]

Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Frederic Bohe [Fri, 20 Jun 2008 15:48:48 +0000 (11:48 -0400)]

Ext4: Fix online resize block group descriptor corruption

This is the patch for the group descriptor table corruption during
online resize pointed out by Theodore Tso. The problem was caused by
the fact that the ext4 group descriptor can be either 32 or 64 bytes
long. Only the 64 bytes structure was taken into account.

Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Oleg Nesterov [Fri, 20 Jun 2008 14:32:20 +0000 (18:32 +0400)]

sched: refactor wait_for_completion_timeout()

Simplify the code and fix the boundary condition of
wait_for_completion_timeout(,0).

We can kill the first __remove_wait_queue() as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

commit | commitdiff | tree

Jeremy Fitzhardinge [Mon, 16 Jun 2008 22:01:56 +0000 (15:01 -0700)]

xen: don't drop NX bit

Because NX is now enforced properly, we must put the hypercall page
into the .text segment so that it is executable.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stable Kernel <stable@kernel.org>
Cc: the arch/x86 maintainers <x86@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Jeremy Fitzhardinge [Mon, 16 Jun 2008 22:01:53 +0000 (15:01 -0700)]

xen: mask unwanted pte bits in __supported_pte_mask

[ Stable: this isn't a bugfix in itself, but it's a pre-requiste
for "xen: don't drop NX bit" ]

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stable Kernel <stable@kernel.org>
Cc: the arch/x86 maintainers <x86@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Isaku Yamahata [Mon, 16 Jun 2008 21:58:13 +0000 (14:58 -0700)]

xen: Use wmb instead of rmb in xen_evtchn_do_upcall().

This patch is ported one from 534:77db69c38249 of linux-2.6.18-xen.hg.
Use wmb instead of rmb to enforce ordering between
evtchn_upcall_pending and evtchn_pending_sel stores
in xen_evtchn_do_upcall().

Cc: Samuel Thibault <samuel.thibault@eu.citrix.com>
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: the arch/x86 maintainers <x86@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Suresh Siddha [Thu, 19 Jun 2008 16:41:22 +0000 (09:41 -0700)]

x86: fix NULL pointer deref in __switch_to

I am able to reproduce the oops reported by Simon in __switch_to() with
lguest.

My debug showed that there is at least one lguest specific
issue (which should be present in 2.6.25 and before aswell) and it got
exposed with a kernel oops with the recent fpu dynamic allocation patches.

In addition to the previous possible scenario (with fpu_counter), in the
presence of lguest, it is possible that the cpu's TS bit it still set and the
lguest launcher task's thread_info has TS_USEDFPU still set.

This is because of the way the lguest launcher handling the guest's TS bit.
(look at lguest_set_ts() in lguest_arch_run_guest()). This can result
in a DNA fault while doing unlazy_fpu() in __switch_to(). This will
end up causing a DNA fault in the context of new process thats
getting context switched in (as opossed to handling DNA fault in the context
of lguest launcher/helper process).

This is wrong in both pre and post 2.6.25 kernels. In the recent
2.6.26-rc series, this is showing up as NULL pointer dereferences or
sleeping function called from atomic context(__switch_to()), as
we free and dynamically allocate the FPU context for the newly
created threads. Older kernels might show some FPU corruption for processes
running inside of lguest.

With the appended patch, my test system is running for more than 50 mins
now. So atleast some of your oops (hopefully all!) should get fixed.
Please give it a try. I will spend more time with this fix tomorrow.

Reported-by: Simon Holm Thøgersen <odie@cs.aau.dk>
Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Roland Dreier [Thu, 19 Jun 2008 22:04:07 +0000 (15:04 -0700)]

sched: fix wait_for_completion_timeout() spurious failure under heavy load

It seems that the current implementaton of wait_for_completion_timeout()
has a small problem under very high load for the common pattern:

if (!wait_for_completion_timeout(&done, timeout))
/* handle failure */

because the implementation very roughly does (lots of code deleted to
show the basic flow):

static inline long __sched
do_wait_for_common(struct completion *x, long timeout, int state)
{
if (x->done)
return timeout;

do {
timeout = schedule_timeout(timeout);

if (!timeout)
return timeout;

} while (!x->done);

return timeout;
}

so if the system is very busy and x->done is not set when
do_wait_for_common() is entered, it is possible that the first call to
schedule_timeout() returns 0 because the task doing wait_for_completion
doesn't get rescheduled for a long time, even if it is woken up early
enough.

In this case, wait_for_completion_timeout() returns 0 without even
checking x->done again, and the code above falls into its failure case
purely for scheduler reasons, even if the hardware event or whatever was
being waited for happened early enough.

It would make sense to add an extra test to do_wait_for() in the timeout
case and return 1 if x->done is actually set.

A quick audit (not exhaustive) of wait_for_completion_timeout() callers
seems to indicate that no one actually cares about the return value in
the success case -- they just test for 0 (timed out) versus non-zero
(wait succeeded).

Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Thu, 19 Jun 2008 12:22:28 +0000 (14:22 +0200)]

sched: rt: dont stop the period timer when there are tasks wanting to run

So if the group ever gets throttled, it will never wake up again.

Reported-by: "Daniel K." <dk@uw.no>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Daniel K. <dk@uw.no>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Len Brown [Fri, 20 Jun 2008 06:47:16 +0000 (02:47 -0400)]

Merge branch 'bugzilla-9761' into release

commit | commitdiff | tree

Len Brown [Fri, 20 Jun 2008 06:45:05 +0000 (02:45 -0400)]

Merge branch 'bugzilla-10695' into release

commit | commitdiff | tree

Dave Airlie [Fri, 20 Jun 2008 05:42:38 +0000 (15:42 +1000)]

drm: only trust core drm ioctls - driver ioctls are a mess.

So driver ioctls need a full auditing before we can make this change.

Signed-off-by: Dave Airlie <airlied@redhat.com>

commit | commitdiff | tree

Zhenyu Wang [Fri, 20 Jun 2008 02:12:56 +0000 (12:12 +1000)]

drm/i915: add support for Intel series 4 chipsets.

Signed-off-by: Zhenyu Wang <zhenyu.z.wang@intel.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>

commit | commitdiff | tree

Zhenyu Wang [Fri, 20 Jun 2008 01:48:06 +0000 (11:48 +1000)]

[agp]: fixup chipset flush for new Intel G4x.

Signed-off-by: Zhenyu Wang <zhenyu.z.wang@intel.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>

commit | commitdiff | tree

YOSHIFUJI Hideaki [Thu, 19 Jun 2008 23:33:57 +0000 (16:33 -0700)]

ipv6: Drop packets for loopback address from outside of the box.

[ Based upon original report and patch by Karsten Keil.  Karsten
  has verified that this fixes the TAHI test case "ICMPv6 test
  v6LC.5.1.2 Part F". -DaveM ]

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Shan Wei [Thu, 19 Jun 2008 23:29:39 +0000 (16:29 -0700)]

ipv6: Remove options header when setsockopt's optlen is 0

Remove the sticky Hop-by-Hop options header by calling setsockopt()
for IPV6_HOPOPTS with a zero option length, per RFC3542.

Routing header and Destination options header does the same as
Hop-by-Hop options header.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Jordan Crouse [Wed, 18 Jun 2008 17:34:38 +0000 (11:34 -0600)]

x86, geode: add a VSA2 ID for General Software

General Software writes their own VSA2 module for their version
of the Geode BIOS, which returns a different ID then the standard
VSA2. This was causing the framebuffer driver to break for most
GSW boards.

Signed-off-by: Jordan Crouse <jordan.crouse@amd.com>
Cc: tglx@linutronix.de
Cc: linux-geode@lists.infradead.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Bharath Ravi [Mon, 16 Jun 2008 09:41:01 +0000 (15:11 +0530)]

sched, delay accounting: fix incorrect delay time when constantly waiting on runqueue

This patch corrects the incorrect value of per process run-queue wait
time reported by delay statistics. The anomaly was due to the following
reason. When a process leaves the CPU and immediately starts waiting for
CPU on the runqueue (which means it remains in the TASK_RUNNABLE state),
the time of re-entry into the run-queue is never recorded. Due to this,
the waiting time on the runqueue from this point of re-entry upto the
next time it hits the CPU is not accounted for. This is solved by
recording the time of re-entry of a process leaving the CPU in the
sched_info_depart() function IF the process will go back to waiting on
the run-queue. This IF condition is verified by checking whether the
process is still in the TASK_RUNNABLE state.

The patch was tested on 2.6.26-rc6 using two simple CPU hog programs.
The values noted prior to the fix did not account for the time spent on
the runqueue waiting. After the fix, the correct values were reported
back to user space.

Signed-off-by: Bharath Ravi <bharathravi1@gmail.com>
Signed-off-by: Madhava K R <madhavakr@gmail.com>
Cc: dhaval@linux.vnet.ibm.com
Cc: vatsa@in.ibm.com
Cc: balbir@in.ibm.com
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

David Brownell [Sun, 4 May 2008 02:19:16 +0000 (19:19 -0700)]

hwmon: (lm75) sensor reading bugfix

LM75 sensor reading bugfix: never save error status as valid
sensor output. This could be improved, but at least this
prevents certain rude failure modes.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Acked-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Mark M. Hoffman <mhoffman@lightlink.com>

commit | commitdiff | tree

Hans de Goede [Fri, 23 May 2008 14:10:41 +0000 (16:10 +0200)]

hwmon: (abituguru3) update driver detection

It has been reported that the abituguru3 driver fails to load after a BIOS
update. This patch fixes this by loosening the detection routine so that it
will work after the BIOS update too. To compensate for the now very loose
detection an additional check is added on the DMI Base Board vendor string to
make sure we only load on Abit motherboards, this is the same as the check in
the abituguru (1 / 2) driver.

Signed-of-by: Hans de Goede <j.w.r.degoede@hhs.nl>
Signed-off-by: Alistair John Strachan <alistair@devzero.co.uk>
Signed-off-by: Mark M. Hoffman <mhoffman@lightlink.com>

commit | commitdiff | tree

Marc Hulsman [Sun, 8 Jun 2008 14:59:41 +0000 (10:59 -0400)]

hwmon: (w83791d) new maintainer

Signed-off-by: Charles Spirakis <bezaur@gmail.com>
Acked-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Mark M. Hoffman <mhoffman@lightlink.com>

commit | commitdiff | tree

Hans de Goede [Tue, 26 Feb 2008 18:34:48 +0000 (19:34 +0100)]

hwmon: (abituguru3) Identify Abit AW8D board as such

This patch identifies the Abit AW8D board as such, and adds support for its
aux5 fan connector

Signed-off-by: Hans de Goede <j.w.r.degoede@hhs.nl>
Acked-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Mark M. Hoffman <mhoffman@lightlink.com>

commit | commitdiff | tree

Jean Delvare [Sat, 23 Feb 2008 09:57:53 +0000 (10:57 +0100)]

hwmon: Update the sysfs interface documentation

* Document the characteristics of libsensors 3.0.0 and 3.0.1.
* The sysfs interface is no longer subject to changes.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Acked-by: Juerg Haefliger <juergh at gmail.com>
Signed-off-by: Mark M. Hoffman <mhoffman@lightlink.com>

commit | commitdiff | tree

Jean Delvare [Sat, 26 Apr 2008 14:34:26 +0000 (16:34 +0200)]

hwmon: (adt7473) Initialize max_duty_at_overheat before use

data->max_duty_at_overheat is not updated in adt7473_update_device,
so it might be used before it is initialized (if the user reads from
sysfs file max_duty_at_crit before writing to it.)

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Acked-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: Mark M. Hoffman <mhoffman@lightlink.com>

commit | commitdiff | tree

Jean Delvare [Thu, 3 Apr 2008 08:40:39 +0000 (10:40 +0200)]

hwmon: (lm85) Fix function RANGE_TO_REG()

Function RANGE_TO_REG() is broken. For a requested range of 2000 (2
degrees C), it will return an index value of 15, i.e. 80.0 degrees C,
instead of the expected index value of 0. All other values are handled
properly, just 2000 isn't.

The bug was introduced back in November 2004 by this patch:
http://git.kernel.org/?p=linux/kernel/git/tglx/history.git;a=commit;h=1c28d80f1992240373099d863e4996cdd5d646d0

While this can be fixed easily with the current code, I'd rather
rewrite the whole function in a way which is more obviously correct.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Justin Thiessen <jthiessen@penguincomputing.com>
Signed-off-by: Mark M. Hoffman <mhoffman@lightlink.com>

commit | commitdiff | tree

Sonic Zhang [Thu, 19 Jun 2008 09:46:39 +0000 (17:46 +0800)]

Blackfin Serial Driver: Use timer to poll CTS PIN instead of workqueue.

This allows other threads to run when the serial driver polls the CTS
PIN in a loop.

Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
Signed-off-by: Bryan Wu <cooloney@kernel.org>

commit | commitdiff | tree

Sonic Zhang [Thu, 19 Jun 2008 09:07:15 +0000 (17:07 +0800)]

Blackfin arch: fix typo error in bf548 serial header file

Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
Signed-off-by: Bryan Wu <cooloney@kernel.org>

commit | commitdiff | tree

Bernhard Walle [Sun, 8 Jun 2008 14:16:07 +0000 (16:16 +0200)]

x86: use BOOTMEM_EXCLUSIVE on 32-bit

This patch uses the BOOTMEM_EXCLUSIVE for crashkernel reservation also for
i386 and prints a error message on failure.

The patch is still for 2.6.26 since it is only bug fixing. The unification
of reserve_crashkernel() between i386 and x86_64 should be done for 2.6.27.

Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>

commit | commitdiff | tree

Mikael Pettersson [Sun, 15 Jun 2008 00:19:56 +0000 (02:19 +0200)]

x86, 32-bit: fix boot failure on TSC-less processors

Booting 2.6.26-rc6 on my 486 DX/4 fails with a "BUG: Int 6"
(invalid opcode) and a kernel halt immediately after the
kernel has been uncompressed. The BUG shows EIP pointing
to an rdtsc instruction in native_read_tsc(), invoked from
native_sched_clock().

(This error occurs so early that not even the serial console
can capture it.)

A bisection showed that this bug first occurs in 2.6.26-rc3-git7,
via commit 9ccc906c97e34fd91dc6aaf5b69b52d824386910:

>x86: distangle user disabled TSC from unstable
>
>tsc_enabled is set to 0 from the command line switch "notsc" and from
>the mark_tsc_unstable code. Seperate those functionalities and replace
>tsc_enable with tsc_disable. This makes also the native_sched_clock()
>decision when to use TSC understandable.
>
>Preparatory patch to solve the sched_clock() issue on 32 bit.
>
>Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

The core reason for this bug is that native_sched_clock() gets
called before tsc_init().

Before the commit above, tsc_32.c used a "tsc_enabled" variable
which defaulted to 0 == disabled, and which only got enabled late
in tsc_init(). Thus early calls to native_sched_clock() would skip
the TSC and use jiffies instead.

After the commit above, tsc_32.c uses a "tsc_disabled" variable
which defaults to 0, meaning that the TSC is Ok to use. Early calls
to native_sched_clock() now erroneously try to use the TSC on
!cpu_has_tsc processors, leading to invalid opcode exceptions.

My proposed fix is to initialise tsc_disabled to a "soft disabled"
state distinct from the hard disabled state set up by the "notsc"
kernel option. This fixes the native_sched_clock() problem. It also
allows tsc_init() to be simplified: instead of setting tsc_disabled = 1
on every error return, we just set tsc_disabled = 0 once when all
checks have succeeded.

I've verified that this lets my 486 boot again. I've also verified
that a Core2 machine still uses the TSC as clocksource after the patch.

Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Suresh Siddha [Fri, 13 Jun 2008 22:47:12 +0000 (15:47 -0700)]

x86: fix NULL pointer deref in __switch_to

Patrick McHardy reported a crash:

> > I get this oops once a day, its apparently triggered by something
> > run by cron, but the process is a different one each time.
> >
> > Kernel is -git from yesterday shortly before the -rc6 release
> > (last commit is the usb-2.6 merge, the x86 patches are missing),
> > .config is attached.
> >
> > I'll retry with current -git, but the patches that have gone in
> > since I last updated don't look related.
> >
> > [62060.043009] BUG: unable to handle kernel NULL pointer dereference at
> > 000001ff
> > [62060.043009] IP: [<c0102a9b>] __switch_to+0x2f/0x118
> > [62060.043009] *pde = 00000000
> > [62060.043009] Oops: 0002 [#1] PREEMPT

Vegard Nossum analyzed it:

> This decodes to
>
> 0: 0f ae 00 fxsave (%eax)
>
> so it's related to the floating-point context. This is the exact
> location of the crash:
>
> $ addr2line -e arch/x86/kernel/process_32.o -i ab0
> include/asm/i387.h:232
> include/asm/i387.h:262
> arch/x86/kernel/process_32.c:595
>
> ...so it looks like prev_task->thread.xstate->fxsave has become NULL.
> Or maybe it never had any other value.

Somehow (as described below) TS_USEDFPU is set but the fpu is not
allocated or freed.

Another possible FPU pre-emption issue with the sleazy FPU optimization
which was benign before but not so anymore, with the dynamic FPU allocation
patch.

New task is getting exec'd and it is prempted at the below point.

flush_thread() {
...
/*
* Forget coprocessor state..
*/
clear_fpu(tsk);
<----- Preemption point
clear_used_math();
...
}

Now when it context switches in again, as the used_math() is still set
and fpu_counter can be > 5, we will do a math_state_restore() which sets
the task's TS_USEDFPU. After it continues from the above preemption point
it does clear_used_math() and much later free_thread_xstate().

Now, at the next context switch, it is quite possible that xstate is
null, used_math() is not set and TS_USEDFPU is still set. This will
trigger unlazy_fpu() causing kernel oops.

Fix this by clearing tsk's fpu_counter before clearing task's fpu.

Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Jeremy Fitzhardinge [Fri, 6 Jun 2008 09:21:39 +0000 (10:21 +0100)]

x86: set PAE PHYSICAL_MASK_SHIFT to 44 bits.

When a 64-bit x86 processor runs in 32-bit PAE mode, a pte can
potentially have the same number of physical address bits as the
64-bit host ("Enhanced Legacy PAE Paging").  This means, in theory,
we could have up to 52 bits of physical address in a pte.

The 32-bit kernel uses a 32-bit unsigned long to represent a pfn.
This means that it can only represent physical addresses up to 32+12=44
bits wide.  Rather than widening pfns everywhere, just set 2^44 as the
Linux x86_32-PAE architectural limit for physical address size.

This is a bugfix for two cases:
1. running a 32-bit PAE kernel on a machine with
  more than 64GB RAM.
2. running a 32-bit PAE Xen guest on a host machine with
  more than 64GB RAM

In both cases, a pte could need to have more than 36 bits of physical,
and masking it to 36-bits will cause fairly severe havoc.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Jason Wessel [Tue, 27 May 2008 17:23:29 +0000 (12:23 -0500)]

softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression

The touch_nmi_watchdog() routine on x86 ultimately calls
touch_softlockup_watchdog().  The problem is that to touch the
softlockup watchdog, the cpu_clock code has to be called which could
involve multiple cpu locks and can lead to a hard hang if one of the
locks is held by a processor that is not going to return anytime soon
(such as could be the case with kgdb or perhaps even with some other
kind of exception).

This patch causes the public version of the
touch_softlockup_watchdog() to defer the cpu clock access to a later
point.

The test case for this problem is to use the following kernel config
options:

CONFIG_KGDB_TESTS=y
CONFIG_KGDB_TESTS_ON_BOOT=y
CONFIG_KGDB_TESTS_BOOT_STRING="V1F100I100000"

It should be noted that kgdb test suite and these options were not
available until 2.6.26-rc2, so it was necessary to patch the kgdb
test suite during the bisection.

I would consider this patch a regression fix because the problem first
appeared in commit 27ec4407790d075c325e1f4da0a19c56953cce23 when some
logic was added to try to periodically sync the clocks.  It was
possible to work around this particular problem by simply not
performing the sync anytime the system was in a critical context.
This was ok until commit 3e51f33fcc7f55e6df25d15b55ed10c8b4da84cd,
which added config option CONFIG_HAVE_UNSTABLE_SCHED_CLOCK and some
multi-cpu locks to sync the clocks.  It became clear that accessing
this code from an nmi was the source of the lockups.  Avoiding the
access to the low level clock code from an code inside the NMI
processing also fixed the problem with the 27ec44... commit.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Steven Rostedt [Thu, 22 May 2008 18:18:17 +0000 (14:18 -0400)]

rcupreempt: remove export of rcu_batches_completed_bh

In rcupreempt, rcu_batches_completed_bh is defined as a static inline in
the header file. This does not need to be exported, and not only that,
this breaks my PPC build.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: paulus@samba.org
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

commit | commitdiff | tree

Li Zefan [Tue, 13 May 2008 02:27:17 +0000 (10:27 +0800)]

cpuset: limit the input of cpuset.sched_relax_domain_level

We allow the inputs to be [-1 ... SD_LV_MAX), and return -EINVAL
for inputs outside this range.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

commit | commitdiff | tree

Ingo Molnar [Thu, 19 Jun 2008 07:37:31 +0000 (09:37 +0200)]

Merge branch 'linus' into sched/urgent

commit | commitdiff | tree

Max Krasnyansky [Thu, 29 May 2008 18:17:01 +0000 (11:17 -0700)]

sched: CPU hotplug events must not destroy scheduler domains created by the cpusets

First issue is not related to the cpusets. We're simply leaking doms_cur.
It's allocated in arch_init_sched_domains() which is called for every
hotplug event. So we just keep reallocation doms_cur without freeing it.
I introduced free_sched_domains() function that cleans things up.

Second issue is that sched domains created by the cpusets are
completely destroyed by the CPU hotplug events. For all CPU hotplug
events scheduler attaches all CPUs to the NULL domain and then puts
them all into the single domain thereby destroying domains created
by the cpusets (partition_sched_domains).
The solution is simple, when cpusets are enabled scheduler should not
create default domain and instead let cpusets do that. Which is
exactly what the patch does.

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Cc: pj@sgi.com
Cc: menage@google.com
Cc: rostedt@goodmis.org
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

commit | commitdiff | tree

Peter Zijlstra [Thu, 19 Jun 2008 07:06:59 +0000 (09:06 +0200)]

sched: rt-group: fix RR buglet

In tick_task_rt() we first call update_curr_rt() which can dequeue a runqueue
due to it running out of runtime, and then we try to requeue it, of it also
having exhausted its RR quota. Obviously requeueing something that is no longer
on the runqueue will not have the expected result.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Daniel K. <dk@uw.no>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Peter Zijlstra [Thu, 19 Jun 2008 07:06:57 +0000 (09:06 +0200)]

sched: rt-group: heirarchy aware throttle

The bandwidth throttle code dequeues a group when it runs out of quota, and
re-queues it once the period rolls over and the quota gets refreshed.

Sadly it failed to take the hierarchy into consideration. Share more of the
enqueue/dequeue code with regular task opterations.

Also, some operations like sched_setscheduler() can dequeue/enqueue tasks that
are in throttled runqueues, we should not inadvertly re-enqueue empty runqueues
so check for that.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Daniel K. <dk@uw.no>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Linux 2.6 source tree

RSS Atom