Vivek Goyal [Sat, 25 Jun 2005 21:57:51 +0000 (14:57 -0700)]
[PATCH] kexec: reserve Bootmem fix for booting nondefault location kernel
This patch fixes a problem with reserving memory during boot up of a kernel
built for non-default location. Currently boot memory allocator reserves
the memory required by kernel image, boot allocaotor bitmap etc. It
assumes that kernel is loaded at 1MB (HIGH_MEMORY hard coded to 1024*1024).
But kernel can be built for non-default locatoin, hence existing
hardcoding will lead to reserving unnecessary memory. This patch fixes it.
For one kernel to report a crash another kernel has created we need
to have 2 kernels loaded simultaneously in memory. To accomplish this
the two kernels need to built to run at different physical addresses.
This patch adds the CONFIG_PHYSICAL_START option to the x86 kernel
so we can do just that. You need to know what you are doing and
the ramifications are before changing this value, and most users
won't care so I have made it depend on CONFIG_EMBEDDED
bzImage kernels will work and run at a different address when compiled
with this option but they will still load at 1MB. If you need a kernel
loaded at a different address as well you need to boot a vmlinux.
Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The vmlinux on x86_64 does not report the correct physical address of
the kernel. Instead in the physical address field it currently
reports the virtual address of the kernel.
This is patch is a bug fix that corrects vmlinux to report the
proper physical addresses.
This is potentially a help for crash dump analysis tools.
This definitiely allows bootloaders that load vmlinux as a standard
ELF executable. Bootloaders directly loading vmlinux become of
practical importance when we consider the kexec on panic case.
Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The vmlinux on i386 does not report the correct physical address of
the kernel. Instead in the physical address field it currently
reports the virtual address of the kernel.
This is patch is a bug fix that corrects vmlinux to report the
proper physical addresses.
This is potentially a help for crash dump analysis tools.
This definitiely allows bootloaders that load vmlinux as a standard
ELF executable. Bootloaders directly loading vmlinux become of
practical importance when we consider the kexec on panic case.
Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In vmlinux.lds.h the code is carefull to define every section so vmlinux
properly reports the correct physical load address of code, as well as
it's virtual address.
The new SECURITY_INIT definition fails to follow that convention and
and causes incorrect physical address to appear in the vmlinux if
there are any security initcalls.
This patch updates the SECURITY_INIT to follow the convention in the rest of
the file.
Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] kexec: x86_64: restore apic virtual wire mode on shutdown
When coming out of apic mode attempt to set the appropriate
apic back into virtual wire mode. This improves on previous versions
of this patch by by never setting bot the local apic and the ioapic
into veritual wire mode.
This code looks at data from the mptable to see if an ioapic has
an ExtInt input to make this decision. A future improvement
is to figure out which apic or ioapic was in virtual wire mode
at boot time and to remember it. That is potentially a more accurate
method, of selecting which apic to place in virutal wire mode.
Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] kexec: x86: resture apic virtual wire mode on shutdown
When coming out of apic mode attempt to set the appropriate
apic back into virtual wire mode. This improves on previous versions
of this patch by by never setting bot the local apic and the ioapic
into veritual wire mode.
This code looks at data from the mptable to see if an ioapic has
an ExtInt input to make this decision. A future improvement
is to figure out which apic or ioapic was in virtual wire mode
at boot time and to remember it. That is potentially a more accurate
method, of selecting which apic to place in virutal wire mode.
Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch disables interrupt generation from the legacy pic on reboot. Now
that there is a sys_device class it should not be called while drivers are
still using interrupts.
There is a report about this breaking ACPI power off on some systems.
http://bugme.osdl.org/show_bug.cgi?id=4041
However the final comment seems to exonerate this code. So until
I get more information I believe that was a false positive.
Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix a kexec problem whcih causes local APIC detection failure.
The problem is detect_init_APIC() is called early, before the command line
have been processed. Therefore "lapic" (and "nolapic") have not been seen,
yet.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rename APIC_MODE_EXINT to APIC_MODE_EXTINT - I think it should be named
after what the mode is called in documentation.
From: "Eric W. Biederman" <ebiederm@lnxi.com>
I have reduced this patch to just the name change in the header. And
integrated the changes into the patches that add those
lines. Otherwise I ran into some ugly dependencies.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org Signed-off-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
we still default to the stock (Server) preemption model.
Voluntary preemption works by adding a cond_resched()
(reschedule-if-needed) call to every might_sleep() check. It is lighter
than CONFIG_PREEMPT - at the cost of not having as tight latencies. It
represents a different latency/complexity/overhead tradeoff.
It has no runtime impact at all if disabled. Here are size stats that show
how the various preemption models impact the kernel's size:
text data bss dec hex filename 3618774 547184 179896 4345854 424ffe vmlinux.stock 3626406 547184 179896 4353486 426dce vmlinux.voluntary +0.2% 3748414 548640 179896 4476950 445016 vmlinux.preempt +3.5%
voluntary-preempt is +0.2% of .text, preempt is +3.5%.
This feature has been tested for many months by lots of people (and it's
also included in the RHEL4 distribution and earlier variants were in Fedora
as well), and it's intended for users and distributions who dont want to
use full-blown CONFIG_PREEMPT for one reason or another.
Ingo Molnar [Sat, 25 Jun 2005 21:57:38 +0000 (14:57 -0700)]
[PATCH] enable PREEMPT_BKL on !PREEMPT+SMP too
The only sane way to clean up the current 3 lock_kernel() variants seems to
be to remove the spinlock-based BKL implementations altogether, and to keep
the semaphore-based one only. If we dont want to do that for whatever
reason then i'm afraid we have to live with the current complexity. (but
i'm open for other cleanup suggestions as well.)
To explore this possibility we'll (at a minimum) have to know whether the
semaphore-based BKL works fine on plain SMP too. The patch below enables
this.
The patch may make sense in isolation as well, as it might bring
performance benefits: code that would formerly spin on the BKL spinlock
will now schedule away and give up the CPU. It might introduce performance
regressions as well, if any performance-critical code uses the BKL heavily
and gets overscheduled due to the semaphore. I very much hope there is no
such performance-critical codepath left though.
Ingo Molnar [Sat, 25 Jun 2005 21:57:36 +0000 (14:57 -0700)]
[PATCH] consolidate PREEMPT options into kernel/Kconfig.preempt
This patch consolidates the CONFIG_PREEMPT and CONFIG_PREEMPT_BKL
preemption options into kernel/Kconfig.preempt. This, besides reducing
source-code, also enables more centralized tweaking of preemption related
options.
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Acked-by: Paul Jackson <pj@sgi.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adds the core update_cpu_domains code and updated cpusets documentation
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Acked-by: Paul Jackson <pj@sgi.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patches add dynamic sched domains functionality that was
extensively discussed on lkml and lse-tech. I would like to see this added to
-mm
o The main advantage with this feature is that it ensures that the scheduler
load balacing code only balances against the cpus that are in the sched
domain as defined by an exclusive cpuset and not all of the cpus in the
system. This removes any overhead due to load balancing code trying to
pull tasks outside of the cpu exclusive cpuset only to be prevented by
the tasks' cpus_allowed mask.
o cpu exclusive cpusets are useful for servers running orthogonal
workloads such as RT applications requiring low latency and HPC
applications that are throughput sensitive
o It provides a new API partition_sched_domains in sched.c
that makes dynamic sched domains possible.
o cpu_exclusive cpusets sets are now associated with a sched domain.
Which means that the users can dynamically modify the sched domains
through the cpuset file system interface
o ia64 sched domain code has been updated to support this feature as well
o Currently, this does not support hotplug. (However some of my tests
indicate hotplug+preempt is currently broken)
o I have tested it extensively on x86.
o This should have very minimal impact on performance as none of
the fast paths are affected
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Acked-by: Paul Jackson <pj@sgi.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Matthew Dobson <colpatch@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Presently, a process without the capability CAP_SYS_NICE can not change
its own policy, which is OK.
But it can also not decrease its RT priority (if scheduled with policy
SCHED_RR or SCHED_FIFO), which is what this patch changes.
The rationale is the same as for the nice value: a process should be
able to require less priority for itself. Increasing the priority is
still not allowed.
This is for example useful if you give a multithreaded user process a RT
priority, and the process would like to organize its internal threads
using priorities also. Then you can give the process the highest
priority needed N, and the process starts its threads with lower
priorities: N-1, N-2...
The POSIX norm says that the permissions are implementation specific, so
I think we can do that.
In a sense, it makes the permissions consistent whatever the policy is:
with this patch, process scheduled by SCHED_FIFO, SCHED_RR and
SCHED_OTHER can all decrease their priority.
Nick Piggin [Sat, 25 Jun 2005 21:57:30 +0000 (14:57 -0700)]
[PATCH] sched: relax pinned balancing
The maximum rebalance interval allowed by the multiprocessor balancing
backoff is often not large enough to handle corner cases where there are
lots of tasks pinned on a CPU. Suresh reported:
I see system livelock's if for example I have 7000 processes
pinned onto one cpu (this is on the fastest 8-way system I
have access to).
After this patch, the machine is reported to go well above this number.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Sat, 25 Jun 2005 21:57:27 +0000 (14:57 -0700)]
[PATCH] sched: RCU domains
One of the problems with the multilevel balance-on-fork/exec is that it needs
to jump through hoops to satisfy sched-domain's locking semantics (that is,
you may traverse your own domain when not preemptable, and you may traverse
others' domains when holding their runqueue lock).
balance-on-exec had to potentially migrate between more than one CPU before
finding a final CPU to migrate to, and balance-on-fork needed to potentially
take multiple runqueue locks.
So bite the bullet and make sched-domains go completely RCU. This actually
simplifies the code quite a bit.
From: Ingo Molnar <mingo@elte.hu>
schedstats RCU fix, and a nice comment on for_each_domain, from Ingo.
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Sat, 25 Jun 2005 21:57:26 +0000 (14:57 -0700)]
[PATCH] sched: multilevel sbe sbf
The fundamental problem that Suresh has with balance on exec and fork is that
it only tries to balance the top level domain with the flag set.
This was worked around by removing degenerate domains, but is still a problem
if people want to start using more complex sched-domains, especially
multilevel NUMA that ia64 is already using.
This patch makes balance on fork and exec try balancing over not just the top
most domain with the flag set, but all the way down the domain tree.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Suresh Siddha [Sat, 25 Jun 2005 21:57:25 +0000 (14:57 -0700)]
[PATCH] sched: remove degenerate domains
Remove degenerate scheduler domains during the sched-domain init.
For example on x86_64, we always have NUMA configured in. On Intel EM64T
systems, top most sched domain will be of NUMA and with only one sched_group
in it.
With fork/exec balances(recent Nick's fixes in -mm tree), we always endup
taking wrong decisions because of this topmost domain (as it contains only one
group and find_idlest_group always returns NULL). We will endup loading HT
package completely first, letting active load balance kickin and correct it.
In general, this patch also makes sense with out recent Nick's fixes in -mm.
From: Nick Piggin <nickpiggin@yahoo.com.au>
Modified to account for more than just sched_groups when scanning for
degenerate domains by Nick Piggin. And allow a runqueue's sd to go NULL
rather than keep a single degenerate domain around (this happens when you run
with maxcpus=1).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Sat, 25 Jun 2005 21:57:23 +0000 (14:57 -0700)]
[PATCH] sched: cleanup context switch locking
Instead of requiring architecture code to interact with the scheduler's
locking implementation, provide a couple of defines that can be used by the
architecture to request runqueue unlocked context switches, and ask for
interrupts to be enabled over the context switch.
Also replaces the "switch_lock" used by these architectures with an oncpu
flag (note, not a potentially slow bitflag). This eliminates one bus
locked memory operation when context switching, and simplifies the
task_running function.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Sat, 25 Jun 2005 21:57:19 +0000 (14:57 -0700)]
[PATCH] sched: balance on fork
Reimplement the balance on exec balancing to be sched-domains aware. Use this
to also do balance on fork balancing. Make x86_64 do balance on fork over the
NUMA domain.
The problem that the non sched domains aware blancing became apparent on dual
core, multi socket opterons. What we want is for the new tasks to be sent to
a different socket, but more often than not, we would first load up our
sibling core, or fill two cores of a single remote socket before selecting a
new one.
This gives large improvements to STREAM on such systems.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Sat, 25 Jun 2005 21:57:17 +0000 (14:57 -0700)]
[PATCH] sched: no aggressive idle balancing
Remove the very aggressive idle stuff that has recently gone into 2.6 - it is
going against the direction we are trying to go. Hopefully we can regain
performance through other methods.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Sat, 25 Jun 2005 21:57:15 +0000 (14:57 -0700)]
[PATCH] sched: tweak affine wakeups
Do less affine wakeups. We're trying to reduce dbt2-pgsql idle time
regressions here... make sure we don't don't move tasks the wrong way in an
imbalance condition. Also, remove the cache coldness requirement from the
calculation - this seems to induce sharp cutoff points where behaviour will
suddenly change on some workloads if the load creeps slightly over or under
some point. It is good for periodic balancing because in that case have
otherwise have no other context to determine what task to move.
But also make a minor tweak to "wake balancing" - the imbalance tolerance is
now set at half the domain's imbalance, so we get the opportunity to do wake
balancing before the more random periodic rebalancing gets preformed.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Sat, 25 Jun 2005 21:57:13 +0000 (14:57 -0700)]
[PATCH] sched: balance timers
Do CPU load averaging over a number of different intervals. Allow each
interval to be chosen by sending a parameter to source_load and target_load.
0 is instantaneous, idx > 0 returns a decaying average with the most recent
sample weighted at 2^(idx-1). To a maximum of 3 (could be easily increased).
So generally a higher number will result in more conservative balancing.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Sat, 25 Jun 2005 21:57:12 +0000 (14:57 -0700)]
[PATCH] sched: less aggressive idle balancing
Remove the special casing for idle CPU balancing. Things like this are
hurting for example on SMT, where are single sibling being idle doesn't really
warrant a really aggressive pull over the NUMA domain, for example.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Sat, 25 Jun 2005 21:57:09 +0000 (14:57 -0700)]
[PATCH] sched: fix SMT scheduling problems
SMT balancing has a couple of problems. Firstly, active_load_balance is too
complex - basically it should be a dumb helper for when the periodic balancer
has determined there is an imbalance, but gets stuck because the task is
running.
So rip out all its "smarts", and just make it move one task to the target CPU.
Second, the busy CPU's sched-domain tree was being used for active balancing.
This means that it may not see that nr_balance_failed has reached a critical
level. So use the target CPU's sched-domain tree for this. We can do this
because we hold its runqueue lock.
Lastly, reset nr_balance_failed to a point where we allow cache hot migration.
This will help ensure active load balancing is successful.
Thanks to Suresh Siddha for pointing out these issues.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Sat, 25 Jun 2005 21:57:08 +0000 (14:57 -0700)]
[PATCH] sched: reduce active load balancing
Fix up active load balancing a bit so it doesn't get called when it shouldn't.
Reset the nr_balance_failed counter at more points where we have found
conditions to be balanced. This reduces too aggressive active balancing seen
on some workloads.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A large number of processes that are pinned to a single CPU results
in every other CPU's load_balance() seeing this overloaded CPU as
"busiest", yet move_tasks() never finds a task to pull-migrate. This
condition occurs during module unload, but can also occur as a
denial-of-service using sys_sched_setaffinity(). Several hundred
CPUs performing this fruitless load_balance() will livelock on the
busiest CPU's runqueue lock. A smaller number of CPUs will livelock
if the pinned task count gets high.
Expanding slightly on John's patch, this one attempts to work out whether the
balancing failure has been due to too many tasks pinned on the runqueue. This
allows it to be basically invisible to the regular blancing paths (ie. when
there are no pinned tasks). We can use this extra knowledge to shut down the
balancing faster, and ensure the migration threads don't start running which
is another problem observed in the wild.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is failing on my cross-compilation environment (From a solaris system)
using gcc-3.4.1 (as the compiler can't find a prototype for the setlocale()
function).
Badari Pulavarty [Sat, 25 Jun 2005 21:55:42 +0000 (14:55 -0700)]
[PATCH] fix for generic_file_write iov problem
Here is the fix for the problem described in
http://bugzilla.kernel.org/show_bug.cgi?id=4721
Basically, problem is generic_file_buffered_write() is accessing beyond end
of the iov[] vector after handling the last vector. If we happen to cross
page boundary, we get a fault.
I think this simple patch is good enough. If we really don't want to
depend on the "count", then we need pass nr_segs to
filemap_set_next_iovec() and decrement it and check it.
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Kylene Jo Hall [Sat, 25 Jun 2005 21:55:39 +0000 (14:55 -0700)]
[PATCH] tpm: Support new National TPMs
This patch is work to support new National TPMs that problems were reported
with on Thinkpad T43 and Thinkcentre S51. Thanks to Jens and Gang for
their debugging work on these issues.
Signed-off-by: Kylene Hall <kjhall@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Paul E. McKenney [Sat, 25 Jun 2005 21:55:38 +0000 (14:55 -0700)]
[PATCH] RCU: clean up a few remaining synchronize_kernel() calls
2.6.12-rc6-mm1 has a few remaining synchronize_kernel()s, some (but not
all) in comments. This patch changes these synchronize_kernel() calls (and
comments) to synchronize_rcu() or synchronize_sched() as follows:
- arch/x86_64/kernel/mce.c mce_read(): change to synchronize_sched() to
handle races with machine-check exceptions (synchronize_rcu() would not cut
it given RCU implementations intended for hardcore realtime use.
- drivers/input/serio/i8042.c i8042_stop(): change to synchronize_sched() to
handle races with i8042_interrupt() interrupt handler. Again,
synchronize_rcu() would not cut it given RCU implementations intended for
hardcore realtime use.
- include/*/kdebug.h comments: change to synchronize_sched() to handle races
with NMIs. As before, synchronize_rcu() would not cut it...
- include/linux/list.h comment: change to synchronize_rcu(), since this
comment is for list_del_rcu().
- security/keys/key.c unregister_key_type(): change to synchronize_rcu(),
since this is interacting with RCU read side.
- security/keys/process_keys.c install_session_keyring(): change to
synchronize_rcu(), since this is interacting with RCU read side.
Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Michael Holzheu [Sat, 25 Jun 2005 21:55:33 +0000 (14:55 -0700)]
[PATCH] s390: debug feature changes
This patch changes the memory allocation method for the s390 debug feature.
Trace buffers had been allocated using the get_free_pages() function before.
Therefore it was not possible to get big memory areas in a running system due
to memory fragmentation. Now the trace buffers are subdivided into several
subbuffers with pagesize. Therefore it is now possible to allocate more
memory for the trace buffers and more trace records can be written.
In addition to that, dynamic specification of the size of the trace buffers is
implemented. It is now possible to change the size of a trace buffer using a
new debugfs file instance. When writing a number into this file, the trace
buffer size is changed to 'number * pagesize'.
In the past all the traces could be obtained from userspace by accessing files
in the "proc" filesystem. Now with debugfs we have a new filesystem which
should be used for debugging purposes. This patch moves the debug feature
from procfs to debugfs.
Since the interface of debug_register() changed, all device drivers, which use
the debug feature had to be adjusted.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Heiko Carstens [Sat, 25 Jun 2005 21:55:30 +0000 (14:55 -0700)]
[PATCH] s390: improved machine check handling
Improved machine check handling. Kernel is now able to receive machine checks
while in kernel mode (system call, interrupt and program check handling).
Also register validation is now performed.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Cope with a conditional i386 definition, which is wrong for UML. Before we
just used that one, but it wasn't defined for CONFIG_SMP, so in that case
we got link errors.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jeff Dike [Sat, 25 Jun 2005 21:55:25 +0000 (14:55 -0700)]
[PATCH] uml: hot-unplug code cleanup
Clean up the hot-unplugging code. There is now an id procedure which is
called to figure out what device we're talking to. The error messages from
that are now done from mconsole_remove instead of the driver. remove is now
called with the device number, after it has been checked, so doesn't need to
do sanity checking on it.
Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jeff Dike [Sat, 25 Jun 2005 21:55:24 +0000 (14:55 -0700)]
[PATCH] uml: time initialization tidying
user_time_init_skas and user_time_init_tt were essentially the same. So, this
merges them, deleting the mode-specific functions and declarations.
Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jeff Dike [Sat, 25 Jun 2005 21:55:23 +0000 (14:55 -0700)]
[PATCH] uml: always disable kmalloc during shutdown
kmalloc wasn't being disabled during panic. This patch ensures that, no
matter how UML is exiting, it is disabled. This matters because part of the
cleanup is to remove the umid file, which involves readdir, which calls
malloc. This must map to libc malloc, rather than kmalloc or vmalloc.
Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jeff Dike [Sat, 25 Jun 2005 21:55:22 +0000 (14:55 -0700)]
[PATCH] uml: fix timer initialization
In skas mode, the call to uml_idle_timer permanently shut off the virtual
timer, resulting in no timer ticks to anything but the idle thread. This is
likely the cause of the soft lockups that are seen sporadically in recent
UMLs.
Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jeff Dike [Sat, 25 Jun 2005 21:55:21 +0000 (14:55 -0700)]
[PATCH] uml: fork cleanup
Fix the do_fork calling convention: normal arch pass the regs and the new sp
value to do_fork instead of NULL.
Currently the arch-independent code ignores these values, while the UML code
(actually it's copy_thread) gets the right values by itself.
With this patch, things are fixed up.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jesper Juhl [Sat, 25 Jun 2005 21:55:20 +0000 (14:55 -0700)]
[PATCH] uml: kfree cleanup
Here's a small patch to remove a few unnessesary NULL pointer checks before
kfree() in arch/um/drivers/daemon_user.c
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andrew Morton [Sat, 25 Jun 2005 21:55:19 +0000 (14:55 -0700)]
[PATCH] uml: fix sizeof usage
Size of pointer doesn't seem right, but maybe my solution isn't either
(sig_size maybe?).
Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: Jeff Dike <jdike@addtoit.com> Cc: Paolo Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Shaohua Li [Sat, 25 Jun 2005 21:55:15 +0000 (14:55 -0700)]
[PATCH] CPU hotplug printk fix
In the cpu hotplug case, per-cpu data possibly isn't initialized even the
system state is 'running'. As the comments say in the original code, some
console drivers assume per-cpu resources have been allocated. radeon fb is
one such driver, which uses kmalloc. After a CPU is down, the per-cpu data
of slab is freed, so the system crashed when printing some info.
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pavel Machek [Sat, 25 Jun 2005 21:55:14 +0000 (14:55 -0700)]
[PATCH] swsusp: clean assembly parts
This patch fixes register saving so that each register is only saved once,
and adds missing saving of %cr8 on x86-64. Some reordering so that
save/restore is more logical/safer (segment registers should be restored
after gdt).
Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pavel Machek [Sat, 25 Jun 2005 21:55:14 +0000 (14:55 -0700)]
[PATCH] swsusp: fix nr_copy_pages
The following patch moves the recalculation of nr_copy_pages so that the right
number is used in the calculation of the size of memory and swap needed.
It prevents swsusp from attempting to suspend if there is not enough memory
and/or swap (which is unlikely anyway).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pavel Machek [Sat, 25 Jun 2005 21:55:12 +0000 (14:55 -0700)]
[PATCH] swsusp: cleanup whitespace
The following patch cleans up whitespace in swsusp.c (a bit):
- removes any trailing whitespace
- adds spaces after if, for, for_each_pbe, for_each_zone etc., wherever
necessary.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch removes the unnecessary function does_collide_order().
This function is no longer necessary, as currently there are only 0-order
allocations in swsusp, and the use of it is confusing.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Li Shaohua [Sat, 25 Jun 2005 21:55:06 +0000 (14:55 -0700)]
[PATCH] suspend/resume SMP support
Using CPU hotplug to support suspend/resume SMP. Both S3 and S4 use
disable/enable_nonboot_cpus API. The S4 part is based on Pavel's original S4
SMP patch.
Signed-off-by: Li Shaohua<shaohua.li@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nathan Lynch [Sat, 25 Jun 2005 21:55:05 +0000 (14:55 -0700)]
[PATCH] generate hotplug events for cpu online
We already do kobject_hotplug for cpu offline; this adds a kobject_hotplug
call for the online case. This is being requested by developers of an
application which wants to be notified about both kinds of events.
Signed-off-by: Nathan Lynch <ntl@pobox.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Shaohua Li [Sat, 25 Jun 2005 21:55:05 +0000 (14:55 -0700)]
[PATCH] set cpu_state for CPU hotplug (ia64)
Dead CPU notifies online CPU that it's dead using cpu_state variable.
After switching to physical cpu hotplug, we forgot setting the variable.
This patch fixes it. Currently only __cpu_die uses it. We changed other
locations for consistency in case others use it.
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Ashok Raj <ashok.raj@intel.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ashok Raj [Sat, 25 Jun 2005 21:55:03 +0000 (14:55 -0700)]
[PATCH] x86_64: Provide ability to choose using shortcuts for IPI in flat mode.
This patch provides an option to switch broadcast or use mask version for
sending IPI's. If CONFIG_HOTPLUG_CPU is defined, we choose not to use
broadcast shortcuts by default, otherwise we choose broadcast mode as default.
both cases, one can change this via startup cmd line option, to choose
no-broadcast mode.
no_ipi_broadcast=1
This is provided on request from Andi Kleen, since he doesnt agree with
replacing IPI shortcuts as a solution for CPU hotplug. Without removing
broadcast IPI's, it would mean lots of new code for __cpu_up() path, which
would acheive the same results.
Signed-off-by: Ashok Raj <ashok.raj@intel.com> Acked-by: Andi Kleen <ak@muc.de> Acked-by: Zwane Mwaikambo <zwane@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ashok Raj [Sat, 25 Jun 2005 21:55:02 +0000 (14:55 -0700)]
[PATCH] x86_64: Dont use broadcast shortcut to make it cpu hotplug safe.
Broadcast IPI's provide un-expected behaviour for cpu hotplug. CPU's in
offline state also end up receiving the IPI. Once the cpus become online they
receive these stale IPI's which are bad and introduce unexpected behaviour.
This is easily avoided by not sending a broadcast and addressing just the
CPU's in online map. Doing prelim cycle counts it appears there is no big
overhead and numbers seem around 0x3000-0x3900 on an average on x86 and x86_64
systems with CPUS running 3G, both for broadcast and mask version of the
API's.
The shortcuts are useful only for flat mode (where the perf shows no
degradation), and in cluster mode, its unicast anyway. Its simpler to just
not use broadcast anymore.
Signed-off-by: Ashok Raj <ashok.raj@intel.com> Acked-by: Andi Kleen <ak@muc.de> Acked-by: Zwane Mwaikambo <zwane@arm.linux.org.uk> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ashok Raj [Sat, 25 Jun 2005 21:55:00 +0000 (14:55 -0700)]
[PATCH] x86_64: CPU hotplug support
Experimental CPU hotplug patch for x86_64
-----------------------------------------
This supports logical CPU online and offline.
- Test with maxcpus=1, and then kick other cpu's off to test if init code
is all cleaned up. CONFIG_SCHED_SMT works as well.
- idle threads are forked on demand from keventd threads for clean startup
TBD:
1. Not tested on a real NUMA machine (tested with numa=fake=2)
2. Handle ACPI pieces for physical hotplug support.
Ashok Raj [Sat, 25 Jun 2005 21:54:58 +0000 (14:54 -0700)]
[PATCH] x86_64: Change init sections for CPU hotplug support
This patch adds __cpuinit and __cpuinitdata sections that need to exist past
boot to support cpu hotplug.
Caveat: This is done *only* for EM64T CPU Hotplug support, on request from
Andi Kleen. Much of the generic hotplug code in kernel, and none of the other
archs that support CPU hotplug today, i386, ia64, ppc64, s390 and parisc dont
mark sections with __cpuinit, but only mark them as __devinit, and
__devinitdata.
If someone is motivated to change generic code, we need to make sure all
existing hotplug code does not break, on other arch's that dont use __cpuinit,
and __cpudevinit.
Signed-off-by: Ashok Raj <ashok.raj@intel.com> Acked-by: Andi Kleen <ak@muc.de> Acked-by: Zwane Mwaikambo <zwane@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ashok Raj [Sat, 25 Jun 2005 21:54:57 +0000 (14:54 -0700)]
[PATCH] make smp_prepare_cpu to a weak function
I really wish smp_prepare_cpu() would disappear eventually. In the interim
this is ideally a weak function, so we dont end up changing several places
to define this dummy in headers.
Today since the dummy declaration is done only in drivers/base/cpu.c but
the function is called in kernel/power/smp.c i get undefined reference in
my cpu hotplug code for x86_64 under development.
Ashok Raj [Sat, 25 Jun 2005 21:54:52 +0000 (14:54 -0700)]
[PATCH] i386: Dont use IPI broadcast when using cpu hotplug.
This patch introduces a startup parameter no_broadcast. When we enable
CONFIG_HOTPLUG_CPU, we dont want to use broadcast shortcut as it has ill
effects on a offline cpu. If we issue broadcast, the IPI is also delivered
to offline cpus, or partially up cpu causing stale IPI's to be handled,
which is a problem and can cause undesirable effects.
Introduces a new startup cmdline option no_ipi_broadcast, that can be
switched at cmdline if necessary.
Signed-off-by: Ashok Raj <ashok.raj@intel.com> Acked-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Zwane Mwaikambo [Sat, 25 Jun 2005 21:54:50 +0000 (14:54 -0700)]
[PATCH] i386 CPU hotplug
(The i386 CPU hotplug patch provides infrastructure for some work which Pavel
is doing as well as for ACPI S3 (suspend-to-RAM) work which Li Shaohua
<shaohua.li@intel.com> is doing)
The following provides i386 architecture support for safely unregistering and
registering processors during runtime, updated for the current -mm tree. In
order to avoid dumping cpu hotplug code into kernel/irq/* i dropped the
cpu_online check in do_IRQ() by modifying fixup_irqs(). The difference being
that on cpu offline, fixup_irqs() is called before we clear the cpu from
cpu_online_map and a long delay in order to ensure that we never have any
queued external interrupts on the APICs. There are additional changes to s390
and ppc64 to account for this change.
1) Add CONFIG_HOTPLUG_CPU
2) disable local APIC timer on dead cpus.
3) Disable preempt around irq balancing to prevent CPUs going down.
4) Print irq stats for all possible cpus.
5) Debugging check for interrupts on offline cpus.
6) Hacky fixup_irqs() to redirect irqs when cpus go off/online.
7) play_dead() for offline cpus to spin inside.
8) Handle offline cpus set in flush_tlb_others().
9) Grab lock earlier in smp_call_function() to prevent CPUs going down.
10) Implement __cpu_disable() and __cpu_die().
11) Enable local interrupts in cpu_enable() after fixup_irqs()
12) Don't fiddle with NMI on dead cpu, but leave intact on other cpus.
13) Program IRQ affinity whilst cpu is still in cpu_online_map on offline.
Signed-off-by: Zwane Mwaikambo <zwane@linuxpower.ca> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>