Dmitry Adamushko [Mon, 15 Oct 2007 15:00:13 +0000 (17:00 +0200)]
sched: tidy up SCHED_RR
- make timeslices of SCHED_RR tasks constant and not
dependent on task's static_prio [1] ;
- remove obsolete code (timeslice related bits);
- make sched_rr_get_interval() return something more
meaningful [2] for SCHED_OTHER tasks.
[1] according to the following link, it's not compliant with SUSv3
(not sure though, what is the reference for us :-)
http://lkml.org/lkml/2007/3/7/656
[2] the interval is dynamic and can be depicted as follows "should a
task be one of the runnable tasks at this particular moment, it would
expect to run for this interval of time before being re-scheduled by the
scheduler tick".
(i.e. it's more precise if a task is runnable at the moment)
yeah, this seems to require task_rq_lock/unlock() but this is not a hot
path.
Dmitry Adamushko [Mon, 15 Oct 2007 15:00:13 +0000 (17:00 +0200)]
sched: fix __pick_next_entity()
The thing is that __pick_next_entity() must never be called when
first_fair(cfs_rq) == NULL. It wouldn't be a problem, should 'run_node'
be the very first field of 'struct sched_entity' (and it's the second).
The 'nr_running != 0' check is _not_ enough, due to the fact that
'current' is not within the tree. Generic paths are ok (e.g. schedule()
as put_prev_task() is called previously)... I'm more worried about e.g.
migration_call() -> CPU_DEAD_FROZEN -> migrate_dead_tasks()... if
'current' == rq->idle, no problems.. if it's one of the SCHED_NORMAL
tasks (or imagine, some other use-cases in the future -- i.e. we should
not make outer world dependent on internal details of sched_fair class)
-- it may be "Houston, we've got a problem" case.
it's +16 bytes to the ".text". Another variant is to make 'run_node' the
first data member of 'struct sched_entity' but an additional check (se !
= NULL) is still needed in pick_next_entity().
There is a possibility that because of task of a group moving from one
cpu to another, it may gain more cpu time that desired. See
http://marc.info/?l=linux-kernel&m=119073197730334 for details.
This is an attempt to fix that problem. Basically it simulates dequeue
of higher level entities as if they are going to sleep. Similarly it
simulate wakeup of higher level entities as if they are waking up from
sleep.
The adjusting sched_class is a missing part of the already existing "do
not leak PI boosting priority to the child" at the sched_fork(). This
patch moves the adjusting sched_class from wake_up_new_task() to
sched_fork().
this also shrinks the code a bit:
text data bss dec hex filename
40111 4018 292 44421 ad85 sched.o.before
40102 4018 292 44412 ad7c sched.o.after
Ingo Molnar [Mon, 15 Oct 2007 15:00:11 +0000 (17:00 +0200)]
sched: fix sched_fork()
fix sched_fork(): large latencies at new task creation time because
the ->vruntime was not fixed up cross-CPU, if the parent got migrated
after the child's CPU got set up.
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Peter Zijlstra [Mon, 15 Oct 2007 15:00:10 +0000 (17:00 +0200)]
sched debug: check spread
debug feature: check how well we schedule within a reasonable
vruntime 'spread' range. (note that CPU overload can increase
the spread, so this is not a hard condition, but normal loads
should be within the spread.)
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Enable user-id based fair group scheduling. This is useful for anyone
who wants to test the group scheduler w/o having to enable
CONFIG_CGROUPS.
A separate scheduling group (i.e struct task_grp) is automatically created for
every new user added to the system. Upon uid change for a task, it is made to
move to the corresponding scheduling group.
A /proc tunable (/proc/root_user_share) is also provided to tune root
user's quota of cpu bandwidth.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
sched: clean up code under CONFIG_FAIR_GROUP_SCHED
With the view of supporting user-id based fair scheduling (and not just
container-based fair scheduling), this patch renames several functions
and makes them independent of whether they are being used for container
or user-id based fair scheduling.
Also fix a problem reported by KAMEZAWA Hiroyuki (wrt allocating
less-sized array for tg->cfs_rq[] and tf->se[]).
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Dmitry Adamushko [Mon, 15 Oct 2007 15:00:08 +0000 (17:00 +0200)]
sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()
rework enqueue/dequeue_entity() to get rid of
sched_class::set_curr_task(). This simplifies sched_setscheduler(),
rt_mutex_setprio() and sched_move_tasks().
text data bss dec hex filename
24330 2734 20 27084 69cc sched.o.before
24233 2730 20 26983 6967 sched.o.after
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Dmitry Adamushko [Mon, 15 Oct 2007 15:00:08 +0000 (17:00 +0200)]
sched: optimize task_new_fair()
due to the fact that we no longer keep the 'current' within the tree,
dequeue/enqueue_entity() is useless for the 'current' in
task_new_fair(). We are about to reschedule and
sched_class->put_prev_task() will put the 'current' back into the tree,
based on its new key.
text data bss dec hex filename
24388 2734 20 27142 6a06 sched.o.before
24341 2734 20 27095 69d7 sched.o.after
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Dmitry Adamushko [Mon, 15 Oct 2007 15:00:07 +0000 (17:00 +0200)]
sched: do not keep current in the tree and get rid of sched_entity::fair_key
Get rid of 'sched_entity::fair_key'.
As a side effect, 'current' is not kept withing the tree for
SCHED_NORMAL/BATCH tasks anymore. This simplifies some parts of code
(e.g. entity_tick() and yield_task_fair()) and also somewhat optimizes
them (e.g. a single update_curr() now vs. dequeue/enqueue() before in
entity_tick()).
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Dmitry Adamushko [Mon, 15 Oct 2007 15:00:07 +0000 (17:00 +0200)]
sched: add set_curr_task() calls
p->sched_class->set_curr_task() has to be called before
activate_task()/enqueue_task() in rt_mutex_setprio(),
sched_setschedule() and sched_move_task() in order to set up
'cfs_rq->curr'. The logic of enqueueing depends on whether a task to be
inserted is 'current' or not.
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Dmitry Adamushko [Mon, 15 Oct 2007 15:00:07 +0000 (17:00 +0200)]
sched: sched_setscheduler() fix
Fix a problem in the 'sched-group' patch for !CONFIG_FAIR_GROUP_SCHED.
description:
sched_setscheduler()
{
...
if (task_running()) p->sched_class->put_prev_entity();
[ this one sets up cfs_rq->curr to NULL ]
...
if (task_running) p->sched_class->set_curr_task();
[ and this one is a _NOP_ (empty) for !CONFIG_FAIR_GROUP_SCHED ]
As a result, the task continues to run with cfs_rq->curr == NULL... no
crashes (due to checks for !NULL in place) but e.g. update_curr()
effectively becomes a NOP... i.e. runtime statistics for this task is
not accounted untill it's rescheduled anew.
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Mike Galbraith [Mon, 15 Oct 2007 15:00:07 +0000 (17:00 +0200)]
sched: fix SMP migration latencies
fix SMP migration latencies: the vruntimes of different CPUs are
at incompatible offsets so they have to be fixed up when migrating
a task across CPUs.
Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Ingo Molnar [Mon, 15 Oct 2007 15:00:07 +0000 (17:00 +0200)]
sched: x86: allow single-depth wchan output
sched.o gets smaller and faster if we compile it with -fomit-frame-pointers,
so make this a config option. The cost is the loss of multi-depth wchan
lookups - but SysRq-T is a sufficient replacement for them anyway, so their
utility is much lower these days.
the size difference is significant:
text data bss dec hex filename
34005 3462 24 37491 9273 sched.o.before
33470 3462 24 36956 905c sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Peter Zijlstra [Mon, 15 Oct 2007 15:00:04 +0000 (17:00 +0200)]
sched: new task placement for vruntime
add proper new task placement for the vruntime based math too.
( note: introduces a swap() macro, but the swap token is too
widely used in the kernel namespace for a generic version
to be added without changing non-scheduler code - so this
cleanup will be done separately. )
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Ingo Molnar [Mon, 15 Oct 2007 15:00:04 +0000 (17:00 +0200)]
sched: introduce se->vruntime
introduce se->vruntime as a sum of weighted delta-exec's, and use that
as the key into the tree.
the idea to use absolute virtual time as the basic metric of scheduling
has been first raised by William Lee Irwin, advanced by Tong Li and first
prototyped by Roman Zippel in the "Really Fair Scheduler" (RFS) patchset.
also see:
http://lkml.org/lkml/2007/9/2/76
for a simpler variant of this patch.
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Ingo Molnar [Mon, 15 Oct 2007 15:00:03 +0000 (17:00 +0200)]
sched: remove precise CPU load
CPU load calculations are statistical anyway, and there's little benefit
from having it calculated on every scheduling event. So remove this code,
it gets rid of a divide from the scheduler wakeup and context-switch
fastpath.
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Ingo Molnar [Mon, 15 Oct 2007 15:00:01 +0000 (17:00 +0200)]
sched: fix sysctl_sched_child_runs_first flag
fix the sched_child_runs_first flag: always call into ->task_new()
if we are on the same CPU, as SCHED_OTHER tasks depend on it for
correct initial setup.
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
David Brownell [Sun, 14 Oct 2007 21:50:25 +0000 (14:50 -0700)]
Fix compile while compiling drivers/mmc/host/mmc_spi.o with !BLOCK
Make sure the mmc_spi driver can build without CONFIG_BLOCK.
Issue noted by "Avuton Olrich" <avuton@gmail.com> and randconfig.
While that won't be a common configuration, sometimes embedded
boards use SDIO to interface WLAN or Bluetooth chips (vs some
parallel interface), and don't provide an MMC/SD socket for use
with flash memory cards.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-x86:
x86: force timer broadcast on late AMD C1E detection
x86: move local APIC timer init to the end of start_secondary()
clockevents: introduce force broadcast notifier
x86: fix missing include for vsyscall
The call to napi_disable() in the PCI shutdown handler is problematic,
and is aggravated by the new NAPI.
Also, make sure watchdog timer doesn't go off.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Gleixner [Sun, 14 Oct 2007 20:57:45 +0000 (22:57 +0200)]
x86: force timer broadcast on late AMD C1E detection
The 64bit SMP bootup is slightly different to the 32bit one. It enables
the boot CPU local APIC timer before all CPUs are brought up. Some AMD C1E
systems have the C1E feature flag only set in the secondary CPU. Due to
the early enable of the boot CPU local APIC timer the APIC timer is
registered as a fully functional device. When we detect the wreckage during
the bringup of the secondary CPU, we need to force the boot CPU into
broadcast mode.
Check the C1E caused APIC timer disable, when the secondary APIC timer is
initialized. If the boot CPU APIC timer was registered as a functional
clock event device, then fix this up and utilize the
CLOCK_EVT_NOTIFY_BROADCAST_FORCE mechanism to force the already
registered boot CPU APIC timer into broadcast mode.
Tested by force injecting the failure mode.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Sun, 14 Oct 2007 20:57:45 +0000 (22:57 +0200)]
clockevents: introduce force broadcast notifier
The 64bit SMP bootup is slightly different to the 32bit one. It enables
the boot CPU local APIC timer before all CPUs are brought up. Some AMD C1E
systems have the C1E feature flag only set in the secondary CPU. Due to
the early enable of the boot CPU local APIC timer the APIC timer is
registered as a fully functional device. When we detect the wreckage during
the bringup of the secondary CPU, we need to force the boot CPU into
broadcast mode.
Add a new notifier reason and implement the force broadcast in the clock
events layer.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Dave Jones [Sun, 14 Oct 2007 20:57:45 +0000 (22:57 +0200)]
x86: fix missing include for vsyscall
> Maybe I just picked a bad time to try, but...
>
> arch/x86/kernel/alternative.c: In function 'apply_alternatives':
> arch/x86/kernel/alternative.c:191: error: 'VSYSCALL_START' undeclared (first use in this function)
> arch/x86/kernel/alternative.c:191: error: (Each undeclared identifier is reported only once
> arch/x86/kernel/alternative.c:191: error: for each function it appears in.)
> arch/x86/kernel/alternative.c:191: error: 'VSYSCALL_END' undeclared (first use in this function)
> make[1]: *** [arch/x86/kernel/alternative.o] Error 1
> make: *** [arch/x86/kernel] Error 2
Try this.
Include missing header for vsyscall.
Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Linus Torvalds [Sun, 14 Oct 2007 19:50:19 +0000 (12:50 -0700)]
Merge branch 'release' of git://lm-sensors.org/kernel/mhoffman/hwmon-2.6
* 'release' of git://lm-sensors.org/kernel/mhoffman/hwmon-2.6: (53 commits)
hwmon: (vt8231) fix sparse warning
hwmon: (sis5595) fix sparse warning
hwmon: (w83627hf) don't assume bank 0
hwmon: (w83627hf) Fix setting fan min right after driver load
hwmon: (w83627hf) De-macro sysfs callback functions
hwmon: Add new combined driver for FSC chips
hwmon: (ibmpex) Release IPMI user if hwmon registration fails
hwmon: (dme1737) Add sch311x support
hwmon: (dme1737) group functions logically
hwmon: (dme1737) cleanups
hwmon: IBM power meter driver
hwmon: (coretemp) Add support for Celeron 4xx
hwmon: (lm87) Disable VID when it should be
hwmon: (w83781d) Add individual alarm and beep files
hwmon: VRM is not read from registers
MAINTAINERS: update hwmon subsystem git trees
hwmon: Fix the code examples in documentation
hwmon: update sysfs interface document - error handling
hwmon: (thmc50) Fix a debug message
hwmon: (thmc50) Don't create temp3 if not enabled
...