[PATCH] i386: use -mcpu, not -mtune, for GCCs older than 3.4
I just noted that -mtune is used, which is only supported on recent GCCs; by
reading http://gcc.gnu.org/gcc-3.4/changes.html, you see "-mcpu has been
renamed to -mtune.", so for GCC < 3.4 we're not using any specific tuning in
the appropriate cases. However -mcpu is deprecated, so use -mtune when
possible.
This was introduced by commit e9d4dce954a60dc23dd1d967766ca2347b780e54 of the
old tree (between 2.6.10-rc3 and 2.6.10) by Linus Torvalds, to remove the use
of -march, since that could trigger gcc using SSE on its own. But no
attention was paid to the choice between -mcpu and -mtune.
And btw, the old 2.6.4 code (for instance) was:
cflags-$(CONFIG_MPENTIUMII) += $(call check_gcc,-march=pentium2,-march=i686)
cflags-$(CONFIG_MPENTIUMIII) += $(call check_gcc,-march=pentium3,-march=i686)
cflags-$(CONFIG_MPENTIUMM) += $(call check_gcc,-march=pentium3,-march=i686)
cflags-$(CONFIG_MPENTIUM4) += $(call check_gcc,-march=pentium4,-march=i686)
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove RWSEM_GENERIC_SPINLOCK, it's now defined (only if needed) by the
underlying arch/i386/Kconfig.cpu. Leave it only for x86_64. Even there, it's
totally wrong, as they even have the code to support XCHG_ADD.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make UML share the underlying cpu-specific tuning done on i386.
Actually, for now many config options aren't used a lot - but that can be done
later. Also, UML relies on GCC optimization for things like memcpy and such
more than i386, so specifying the correct -march and -mtune should be enough.
Later, we may want to correct some other stuff.
For instance, since FPU context switching, for us, is done (at least
partially, i.e. between our kernelspace and userspace) by the host, we may
allow usage of FPU operations by GCC. This doesn't hold for kernelspace vs.
kernelspace, but we don't support preemption.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hirokazu Takata [Sun, 30 Oct 2005 23:00:06 +0000 (15:00 -0800)]
[PATCH] m32r: SMC91x driver update
Update SMC91x driver for m32r.
- Remove needless NONCACHE_OFFSET adjustment.
> [PATCH 2.6.14-rc4] m32r: NONCACHE_OFFSET in _port2addr
> Change _port2addr() not to add NONCACHE_OFFSET.
> Adding NONCACHE_OFFSET requires needless address adjusting by a driver
> using ioremap() like a SMC91x driver.
- Fix lots of warnings like the following:
/usr/src/ctest/git/kernel/drivers/net/smc91x.c: In function `smc_reset':
/usr/src/ctest/git/kernel/drivers/net/smc91x.c:324: warning: passing arg 2 of `_outw' makes integer from pointer without a cast
/usr/src/ctest/git/kernel/drivers/net/smc91x.c:325: warning: passing arg 2 of `_outw' makes integer from pointer without a cast
/usr/src/ctest/git/kernel/drivers/net/smc91x.c:341: warning: passing arg 2 of `_outw' makes integer from pointer without a cast
/usr/src/ctest/git/kernel/drivers/net/smc91x.c:342: warning: passing arg 2 of `_outw' makes integer from pointer without a cast
:
/usr/src/ctest/git/kernel/drivers/net/smc91x.c:1915: warning: passing arg 1 of `_inw' makes integer from pointer without a cast
/usr/src/ctest/git/kernel/drivers/net/smc91x.c:1915: warning: passing arg 1 of `_inw' makes integer from pointer without a cast
Hirokazu Takata [Sun, 30 Oct 2005 23:00:04 +0000 (15:00 -0800)]
[PATCH] m32r: NONCACHE_OFFSET in _port2addr
Change _port2addr() not to add NONCACHE_OFFSET. Adding NONCACHE_OFFSET
requires needless address adjusting by a driver using ioremap() like a
SMC91x driver.
Shaohua Li [Sun, 30 Oct 2005 23:00:01 +0000 (15:00 -0800)]
[PATCH] introduce .valid callback for pm_ops
Add a pm_ops.valid callback, so only the available pm states show in
/sys/power/state. This also lets enter_state report an invalid state early,
before we do the actual suspend/resume.
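As a rough illustration of the idea (not the kernel's actual code), the sketch below filters a list of states through a `valid` callback, both when listing them and when entering one. All `sketch_*` names, the three-state enum and the "no callback means every state is valid" rule are assumptions made for this example.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's suspend states. */
enum sketch_state { SKETCH_STANDBY, SKETCH_MEM, SKETCH_DISK, SKETCH_MAX };

static const char *const sketch_names[SKETCH_MAX] = {
	"standby", "mem", "disk"
};

/* Simplified ops table: valid() says whether the platform supports a state. */
struct sketch_pm_ops {
	int (*valid)(enum sketch_state state);
};

/* Example platform that only supports suspend-to-RAM. */
static int sketch_only_mem_valid(enum sketch_state state)
{
	return state == SKETCH_MEM;
}

static int sketch_state_ok(const struct sketch_pm_ops *ops, int state)
{
	if (state < 0 || state >= SKETCH_MAX)
		return 0;
	/* Assumption: no callback means every state is treated as valid. */
	if (ops == NULL || ops->valid == NULL)
		return 1;
	return ops->valid((enum sketch_state)state) != 0;
}

/* List only the supported states, space separated; returns bytes written. */
static int sketch_show_states(const struct sketch_pm_ops *ops,
			      char *buf, size_t len)
{
	size_t used = 0;
	int i;

	if (len == 0)
		return 0;
	buf[0] = '\0';
	for (i = 0; i < SKETCH_MAX; i++) {
		int n;

		if (!sketch_state_ok(ops, i))
			continue;
		n = snprintf(buf + used, len - used, "%s ", sketch_names[i]);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop a truncated name */
			break;
		}
		used += (size_t)n;
	}
	return (int)used;
}

/* Reject an unsupported state up front, before any suspend work starts. */
static int sketch_enter_state(const struct sketch_pm_ops *ops, int state)
{
	if (!sketch_state_ok(ops, state))
		return -1;
	/* ... the actual suspend/resume sequence would follow here ... */
	return 0;
}
```

With `sketch_only_mem_valid` installed, only "mem" is listed and a request for "disk" is refused before any device is touched.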
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch simplifies the progress meter in disk.c:free_some_memory()
and makes disk.c:pm_suspend_disk() call device_resume() explicitly in the
suspend path.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] swsusp: get rid of unnecessary wrapper function
The following patch merges two functions in a trivial way.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] swsusp: move snapshot functionality to separate file
The following patch moves the functionality of swsusp related to creating and
handling the snapshot of memory to a separate file, snapshot.c
This should enable us to untangle the code in the future and eventually to
implement some parts of swsusp.c in the user space.
The patch does not change the code.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch makes swsusp use PG_nosave and PG_nosave_free flags to
mark pages that should be freed after the state of the system has been
restored from the image (or in case of an error during suspend).
This allows us to avoid storing metadata in swap twice and to reduce the
amount of memory needed by swsusp. Additionally, it allows us to simplify
the code by removing a couple of functions that are no longer necessary.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ashok Raj [Sun, 30 Oct 2005 22:59:54 +0000 (14:59 -0800)]
[PATCH] create and destroy cpufreq sysfs entries based on cpu notifiers
cpufreq entries in sysfs should only be populated when the CPU is online.
When we boot with maxcpus=x and then bring up the other cpus by echoing to
the sysfs online file, these entries should be created at that point, and
destroyed when CPU_DEAD is notified. Same treatment as cache entries under
sysfs.
We place the processor in the lowest frequency, so hw managed P-State
transitions can still work on the other threads to save power.
Primary goal was to just make these directories appear/disappear dynamically.
There is one thing in this patch I had to do which I really don't like
myself; it is probably best if someone handling the cpufreq infrastructure
gives this code the right treatment if this is not acceptable. I guess it's
probably good for the first cut.
- Converting lock_cpu_hotplug()/unlock_cpu_hotplug() to disable/enable preempt.
The locking was smack in the middle of the notification path, when the
hotplug is already holding the lock. I tried another solution to avoid this:
avoid taking locks if we know we are called from the notification path. The
solution was getting very ugly and I decided this was probably good for this
iteration until someone who understands cpufreq could do a better job than me.
(akpm: export cpucontrol to GPL modules: drivers/cpufreq/cpufreq_stats.c now
does lock_cpu_hotplug())
Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Zwane Mwaikambo <zwane@holomorphy.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ashok Raj [Sun, 30 Oct 2005 22:59:50 +0000 (14:59 -0800)]
[PATCH] create and destroy cache sysfs entries based on cpu notifiers
cpu cache entries should be populated only when cpu is online and removed
when they are logically offlined.
Without this, entries are not removed when a cpu is offlined, or don't appear
when we boot with maxcpus=1 and then kick the rest of the cpus via echo 1
to the sysfs online file.
- Changed __devinit to __cpuinit for consistency.
- Changed sysfs_driver_register to register_cpu_notifier.
Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Zwane Mwaikambo <zwane@holomorphy.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ashok Raj [Sun, 30 Oct 2005 22:59:49 +0000 (14:59 -0800)]
[PATCH] introduce get_cpu_sysdev() to retrieve a sysfs entry for a cpu.
Some modules creating sysfs entries under /sys/devices/system/cpu/cpuX/
need to know the parent sysfs entry to make devices under them. This will
just return the sysfs entry for a given cpu.
sysfs entries showing under each cpu sysfs can be easily created if such
entries can be created by registering a sysfs driver for cpuclass. The
issue is when the entry is created the CPU may not be online, hence we
would need to defer the creation until the online notification comes.
Current users: cache entries for Intel CPU's and cpufreq subsystem.
Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Zwane Mwaikambo <zwane@holomorphy.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Magnus Damm [Sun, 30 Oct 2005 22:59:48 +0000 (14:59 -0800)]
[PATCH] i386: srat on non-acpi hw fix
This patch adds a check for the return value of acpi_find_root_pointer().
Without this patch, systems without ACPI support, such as QEMU, crash when
booting a NUMA kernel with CONFIG_ACPI_SRAT=y.
Signed-off-by: Magnus Damm <magnus@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] i386 mpparse: Only ignore lapic information we can't store
After staring at mpparse.c for a little longer I noticed that when we hit
our limit of num_processors we are filtering out information about other
processors that we can still store.
This patch just reorders the code so we store everything we can.
This should avoid the incorrect warning about our boot CPU not being listed
by the BIOS that we are now getting in the kexec on panic case, and it
should allow us to detect all apicid conflicts even when our physical
number of cpus exceeds maxcpus.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Vivek Goyal [Sun, 30 Oct 2005 22:59:46 +0000 (14:59 -0800)]
[PATCH] kdump/i386: apic verification failure fix
o Removes the unnecessary call to local_irq_disable().
o Kdump was failing while the second kernel was coming up. The check for the
presence of the boot cpu apic id was failing in apic_id_registered(), hence
hitting BUG().
o This should not have failed, because before calling setup_local_APIC() the
code ensures that, even if the BIOS has not reported the boot cpu, its
presence is hard set. The problem is caused by the use of
hard_smp_processor_id(), which is hardcoded to zero on non-SMP kernels. In the
kdump case the second kernel can boot on a cpu whose boot cpu id is not zero.
o Use boot_cpu_physical_apicid instead to hard set the presence of the boot cpu.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] i386 kexec-on-panic: Don't shutdown the apics.
It is dangerous to shutdown the apics in machine_crash_shutdown.
With my previous patch to initialize apics in init_IRQ we should be able to
boot a kernel without this. As long as we reinitialize the APICs we don't
care what state they were in during bootup.
This should make machine_crash_shutdown noticeably more reliable.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
All kinds of ugliness exists because we don't initialize
the apics during init_IRQs.
- We calibrate jiffies in non apic mode even when we are using apics.
- We have to have special code to initialize the apics when non-smp.
- The legacy i8259 must exist and be setup correctly, even
when we won't use it past initialization.
- The kexec on panic code must restore the state of the io_apics.
- init/main.c needs a special case for !smp smp_init on x86
In addition to pure code movement I needed a couple
of non-obvious changes:
- Move setup_boot_APIC_clock into APIC_late_time_init for
simplicity.
- Use cpu_khz to generate a better approximation of loops_per_jiffies
so I can verify the timer interrupt is working.
- Call setup_apic_nmi_watchdog again after cpu_khz is initialized on
the boot cpu.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] i386 nmi_watchdog: Merge check_nmi_watchdog fixes from x86_64
The per cpu nmi watchdog timer is based on an event counter. Idle cpus
don't generate events, so the NMI watchdog doesn't fire and the test to see
if the watchdog is working fails.
- Add nmi_cpu_busy so idle cpus don't mess up the test.
- kmalloc prev_nmi_count to keep kernel stack usage bounded.
- Improve the error message on failure so there is enough
information to debug problems.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] i386 io_apic.c: Memorize at bootup where the i8259 is connected
Currently we attempt to restore virtual wire mode on reboot, which only
works if we can figure out where the i8259 is connected. This is very
useful when we kexec another kernel and likely helpful when dealing with a
BIOS that makes assumptions about how the system is set up.
Since the acpi MADT table does not provide the location where the i8259 is
connected we have to look at the hardware to figure it out.
Most systems have the i8259 connected to the local apic of the cpu and so
won't be affected, but people running Opteron and some serverworks chipsets
should be able to use kexec now.
In addition this patch removes the hard coded assumption that the io_apic
that delivers isa interrupts is always known to the kernel as io_apic 0,
as there does not appear to be anything to guarantee that assumption is true.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is a platform code update for ES7000: it disables IRQ overrides for the
recent ES7000 (Rascal/Zorro) and cleans up a compile warning. The patch
only affects the ES7000 subarch.
Signed-off-by: <Natalie.Protasevich@unisys.com> Acked-by: Zwane Mwaikambo <zwane@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Dave Hansen [Sun, 30 Oct 2005 22:59:37 +0000 (14:59 -0800)]
[PATCH] fixup bogus e820 entry with mem=
This was reported because someone was getting oopses reading /proc/iomem.
It was tracked down to a zero-sized 'struct resource' entry which was
located right at 4GB.
You need two conditions to hit this bug: a BIOS E820_RAM area starting at
exactly the boundary where you specify mem= (to get a zero-sized entry),
and for the legacy_init_iomem_resources() loop to skip that resource (which
only happens at exactly 4G).
I think killing the zero-sized e820 entry is the easiest way to fix this.
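A minimal sketch of the idea, assuming a simplified map layout: when RAM entries are clipped at the mem= limit, an entry that starts exactly at the limit is dropped instead of being kept with size zero. The `sketch_*` names and the struct are hypothetical and much simpler than the kernel's real e820 code.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_E820_RAM 1

/* Hypothetical, simplified memory map entry. */
struct sketch_e820_entry {
	uint64_t addr;
	uint64_t size;
	int type;
};

/*
 * Clip RAM entries at `limit` bytes (the mem= value). Entries that would
 * end up with size zero are dropped rather than kept as empty ranges.
 * Returns the new number of entries.
 */
static size_t sketch_limit_regions(struct sketch_e820_entry *map, size_t n,
				   uint64_t limit)
{
	size_t in, out = 0;

	for (in = 0; in < n; in++) {
		struct sketch_e820_entry e = map[in];

		if (e.type == SKETCH_E820_RAM) {
			if (e.addr >= limit)
				continue;	/* includes addr == limit */
			if (e.size > limit - e.addr)
				e.size = limit - e.addr;
		}
		if (e.size == 0)
			continue;		/* never keep empty entries */
		map[out++] = e;
	}
	return out;
}
```

With mem=4G and a BIOS RAM area starting at exactly 4 GiB, that area simply disappears from the map, so nothing zero-sized is left for later resource setup to trip over.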
Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A similar problem has been reported before here:
http://groups.google.com/group/linux.kernel/browse_thread/thread/def4ca19dbc3cd4/5cffbf349f2c87a4?tvc=2&q=Aleksey+Gorelov&hl=en#5cffbf349f2c87a4
and was related to a bug in the BIOS reporting the 82C686 router as compatible
with the 586. I suspect the BIOS on this board has a similar issue: it reports
the VT8235 router as compatible with the 586 one, which is obviously not true.
The patch from the link above has already been incorporated in both the 2.6
and 2.4 series, but might not work in this particular case.
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] x86: bug fix in P6 Machine check initialization
Make the P6 MCA initialization code compliant with the guidelines in IA-32 SDM
Vol3. The bank 0 control register should not be set by the OS, and the status
registers on all banks should be cleared on reset.
This will prevent false MCE alarms on systems that have some non-MCE
information left over in MC0_STATUS on reboot.
Zachary Amsden [Sun, 30 Oct 2005 22:59:33 +0000 (14:59 -0800)]
[PATCH] x86: bogus tls from gdt
The per-CPU initialization code is copying in bogus data into
thread->tls_array. Note that it copies &per_cpu(cpu_gdt_table, cpu), not
&per_cpu(cpu_gdt_table, cpu)[GDT_ENTRY_TLS_MIN]. That is totally broken
and unnecessary. Make the initialization explicitly NULL.
[PATCH] x86: hot plug CPU to support physical add of new processors
The patch allows physical bring-up of new processors (not initially present
in the configuration) from facilities such as driver/utility implemented on
a platform. The actual method of making processors available is up to the
platform implementation.
Signed-off-by: Natalie Protasevich <Natalie.Protasevich@unisys.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Ashok Raj <ashok.raj@intel.com> Cc: Zwane Mwaikambo <zwane@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The initial internal version of Venki's cpuid(4) deterministic cache parameter
identification patch used static arrays of size MAX_CACHE_LEAVES. The final
patch, which made it to the base, used dynamic array allocation, with this
MAX_CACHE_LEAVES limit hunk still in place.
cpuid(4) already has a mechanism to find out the number of cache levels
implemented and there is no need for this hardcoded MAX_CACHE_LEAVES limit.
So remove the MAX_CACHE_LEAVES limit from the routine which calculates the
number of cache levels using cpuid(4).
Shaohua Li [Sun, 30 Oct 2005 22:59:28 +0000 (14:59 -0800)]
[PATCH] FPU context corrupted after resume
mxcsr_feature_mask_init isn't needed at suspend/resume time (we can use the
boot time mask). And actually it's harmful, as it clears the task's saved
fxsave state on resume. This bug is widely seen by users using zsh.
(akpm: my eyes. Fixed some surrounding whitespace mess)
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jan Beulich [Sun, 30 Oct 2005 22:59:27 +0000 (14:59 -0800)]
[PATCH] x86: cmpxchg improvements
This adjusts i386's cmpxchg patterns so that
- for word and long cmpxchg-es the compiler can utilize all possible
registers
- cmpxchg8b gets disabled when the minimum specified hardware architecture
doesn't support it (as was already happening for the byte, word, and
long ones).
Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] i386 and x86_64 TSC set_cyc2ns_scale imprecision
I just found out that some precision is unnecessarily lost in the
arch/i386/kernel/timers/timer_tsc.c:set_cyc2ns_scale function. It uses a
cpu_mhz parameter when it could use a cpu_khz. In the specific case of an
Intel P4 running at 3001.171 MHz, the truncation to 3001 MHz leads to an
imprecision of 19 microseconds per second: this is very sad for a timer with
nearly nanosecond accuracy.
Fix the x86_64 architecture too.
Cc: george anzinger <george@mvista.com> Cc: john stultz <johnstul@us.ibm.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
James Morris [Sun, 30 Oct 2005 22:59:22 +0000 (14:59 -0800)]
[PATCH] SELinux: canonicalize getxattr()
This patch allows SELinux to canonicalize the value returned from
getxattr() via the security_inode_getsecurity() hook, which is called after
the fs level getxattr() function.
The purpose of this is to allow the in-core security context for an inode
to override the on-disk value. This could happen in cases such as
upgrading a system to a different labeling form (e.g. standard SELinux to
MLS) without needing to do a full relabel of the filesystem.
In such cases, we want getxattr() to return the canonical security context
that the kernel is using rather than what is stored on disk.
The implementation hooks into the inode_getsecurity(), adding another
parameter to indicate the result of the preceding fs-level getxattr() call,
so that SELinux knows whether to compare a value obtained from disk with
the kernel value.
We also now allow getxattr() to work for mountpoint labeled filesystems
(i.e. mount with option context=foo_t), as we are able to return the
kernel value to the user.
Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jeff Garzik [Sun, 30 Oct 2005 11:41:29 +0000 (06:41 -0500)]
[libata] fix legacy IDE probing
ata_pci_init_one() receives an array of struct ata_port_info. Recent
updates to the code had always obtained port information from
array element 0, rather than array element N.
Change to avoid hardcoding port_info[0], thereby restoring proper
hardware information to secondary legacy ports.
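The gist of the fix can be sketched as an indexed lookup instead of a fixed `[0]`. This is an illustration only: `sketch_port_info` and `sketch_port_info_for` are hypothetical names, and the real struct ata_port_info carries more fields.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for struct ata_port_info. */
struct sketch_port_info {
	unsigned int pio_mask;
	unsigned int udma_mask;
};

/*
 * Return the info for port `idx` (element N), not always element 0.
 * Returns NULL when the index is out of range.
 */
static const struct sketch_port_info *
sketch_port_info_for(const struct sketch_port_info *const *port_info,
		     size_t n_info, size_t idx)
{
	if (port_info == NULL || idx >= n_info)
		return NULL;
	return port_info[idx];
}
```

If the primary and secondary ports advertise different transfer modes, only the indexed lookup hands the secondary port its own masks.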
Jeff Garzik [Sun, 30 Oct 2005 09:44:42 +0000 (04:44 -0500)]
[libata] change ata_qc_complete() to take error mask as second arg
The second argument to ata_qc_complete() was being used for two
purposes: communicate the ATA Status register to the completion
function, and indicate an error. On legacy PCI IDE hardware, the latter
is often implicit in the former. On more modern hardware, the driver
often completely emulated a Status register value, passing ATA_ERR as an
indication that something went wrong.
Now that previous code changes have eliminated the need to use drv_stat
arg to communicate the ATA Status register value, we can convert it to a
mask of possible error classes.
This will lead to more flexible error handling in the future.
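To make the change of meaning concrete, here is a small sketch in which a completion helper takes a mask of error classes, and a separate helper derives such a mask from a raw Status value for drivers that still have one. The Status bit values are the standard ATA ones; the error class names and the `sketch_*` functions are hypothetical and not libata's actual definitions.

```c
#include <stdint.h>

/* ATA Status register bits (standard values). */
#define SKETCH_ATA_ERR	0x01u	/* error */
#define SKETCH_ATA_DF	0x20u	/* device fault */
#define SKETCH_ATA_BUSY	0x80u	/* busy */

/* Hypothetical error classes; several may be set at once. */
#define SKETCH_ERR_DEV	0x1u	/* device reported an error */
#define SKETCH_ERR_BUS	0x2u	/* bus or controller problem */

struct sketch_qc {
	unsigned int err_mask;
	int done;
};

/* Translate a raw Status value into a mask of error classes. */
static unsigned int sketch_err_mask_from_status(uint8_t status)
{
	if (status & SKETCH_ATA_BUSY)
		return SKETCH_ERR_BUS;
	if (status & (SKETCH_ATA_ERR | SKETCH_ATA_DF))
		return SKETCH_ERR_DEV;
	return 0;
}

/* Completion takes an error mask; zero means success. */
static void sketch_qc_complete(struct sketch_qc *qc, unsigned int err_mask)
{
	qc->err_mask = err_mask;
	qc->done = 1;
}
```

A driver for modern hardware can pass an error class directly, without first inventing a Status register value to carry ATA_ERR.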
John Hawkes [Sun, 30 Oct 2005 01:17:01 +0000 (18:17 -0700)]
[PATCH] mm: wider use of for_each_*cpu()
In 'mm' change the explicit use of a for-loop using NR_CPUS into the
general for_each_cpu() constructs. This widens the scope of potential
future optimizations of the general constructs, as well as takes advantage
of the existing optimizations of first_cpu() and next_cpu(), which is
advantageous when the true CPU count is much smaller than NR_CPUS.
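A userspace sketch of the pattern, assuming a 64-bit mask in place of the kernel's cpumask_t: the iterator only visits CPUs whose bit is set, instead of walking all NR_CPUS slots. The `sketch_*` names are hypothetical, and the real first_cpu()/next_cpu() are implemented differently.

```c
#include <stdint.h>

#define SKETCH_NR_CPUS 64

/* Next set bit at or after `from`, or SKETCH_NR_CPUS if there is none. */
static int sketch_next_cpu(int from, uint64_t mask)
{
	int cpu;

	for (cpu = from; cpu < SKETCH_NR_CPUS; cpu++)
		if (mask & ((uint64_t)1 << cpu))
			return cpu;
	return SKETCH_NR_CPUS;
}

/* Visit only the CPUs present in `mask`. */
#define sketch_for_each_cpu(cpu, mask)				\
	for ((cpu) = sketch_next_cpu(0, (mask));		\
	     (cpu) < SKETCH_NR_CPUS;				\
	     (cpu) = sketch_next_cpu((cpu) + 1, (mask)))

/* Sum per-cpu counters, skipping slots of CPUs that are not in the mask. */
static unsigned long sketch_sum_counters(const unsigned long *counters,
					 uint64_t mask)
{
	unsigned long sum = 0;
	int cpu;

	sketch_for_each_cpu(cpu, mask)
		sum += counters[cpu];
	return sum;
}
```

Callers no longer open-code the loop bounds, so a faster bit search can be swapped in behind the macro without touching them.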
Signed-off-by: John Hawkes <hawkes@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] Remove policy contextualization from mbind
Policy contextualization is only useful for task based policies and not for
vma based policies. It may be useful to define allowed nodes that are not
accessible from this thread because other threads may have access to these
nodes. Without this patch strange memory policy situations may cause an
application to fail with out of memory.
Example:
Let's say we have two threads A and B that share the same address space and
a huge computational array X.
Thread A is restricted by its cpuset to nodes 0 and 1 and thread B is
restricted by its cpuset to nodes 2 and 3.
Thread A now wants to restrict allocations to the first node and thus
applies a BIND policy on X to node 0 and 2. The cpuset limits this to node
0. Thus pages for X must be allocated on node 0 now.
Thread B now touches a page that has never been used in X and faults in a
page. According to the BIND policy of the vma for X the page must be
allocated on node 0. However, the cpuset of B does not allow allocation on
nodes 0 and 1. Now the application fails in alloc_pages with out of memory.
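The example above can be replayed with plain bit masks. This is only an illustration of the node arithmetic; `sketch_nodemask_t` and the helper names are hypothetical, with one bit per node.

```c
#include <stdint.h>

typedef uint32_t sketch_nodemask_t;

#define SKETCH_NODE(n) ((sketch_nodemask_t)1 << (n))

/* Contextualization: clip the requested policy to the caller's cpuset. */
static sketch_nodemask_t sketch_contextualize(sketch_nodemask_t requested,
					      sketch_nodemask_t cpuset)
{
	return requested & cpuset;
}

/* Nodes a faulting thread can really use under a BIND policy. */
static sketch_nodemask_t sketch_usable_nodes(sketch_nodemask_t vma_policy,
					     sketch_nodemask_t cpuset)
{
	return vma_policy & cpuset;
}
```

With thread A's cpuset {0,1}, thread B's cpuset {2,3} and a request for {0,2}, contextualizing at mbind time stores {0}, which leaves B with an empty set. Storing {0,2} unchanged leaves B with node 2.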
Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] Implement sys_* do_* layering in the memory policy layer.
- Do a separation between do_xxx and sys_xxx functions. sys_xxx functions
take variable sized bitmaps from user space as arguments. do_xxx functions
take fixed sized nodemask_t as arguments and may be used from inside the
kernel. Doing so simplifies the initialization code. There is no
fs = kernel_ds assumption anymore.
- Split up get_nodes into get_nodes (which gets the node list) and
contextualize_policy which restricts the nodes to those accessible
to the task and updates cpusets.
- Add comments explaining limitations of bind policy
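The layering described in the list above might look roughly like this in miniature. Everything here is a hypothetical stand-in: the `sketch_*` names, the 64-node limit, and the plain byte array that plays the role of the user-space bitmap (the kernel would copy it in from user space first).

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_NODES 64

typedef uint64_t sketch_nodemask_t;	/* fixed size, one bit per node */

static sketch_nodemask_t sketch_current_policy;

/* Convert a variable-sized bitmap of `maxnode` bits into a fixed mask. */
static int sketch_get_nodes(sketch_nodemask_t *nodes,
			    const unsigned char *bitmap, size_t maxnode)
{
	size_t bit;

	*nodes = 0;
	for (bit = 0; bit < maxnode; bit++) {
		if (!(bitmap[bit / 8] & (1u << (bit % 8))))
			continue;
		if (bit >= SKETCH_MAX_NODES)
			return -1;	/* node we cannot represent */
		*nodes |= (sketch_nodemask_t)1 << bit;
	}
	return 0;
}

/* do_ layer: fixed-size argument, callable from inside the kernel. */
static int sketch_do_set_policy(const sketch_nodemask_t *nodes)
{
	if (*nodes == 0)
		return -1;
	sketch_current_policy = *nodes;
	return 0;
}

/* sys_ layer: decodes the caller's bitmap, then defers to the do_ layer. */
static int sketch_sys_set_policy(const unsigned char *bitmap, size_t maxnode)
{
	sketch_nodemask_t nodes;

	if (sketch_get_nodes(&nodes, bitmap, maxnode) != 0)
		return -1;
	return sketch_do_set_policy(&nodes);
}
```

In-kernel callers such as initialization code can call the do_ function with a ready-made mask and never touch the bitmap decoding.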
Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Dave Hansen [Sun, 30 Oct 2005 01:16:56 +0000 (18:16 -0700)]
[PATCH] memory hotplug: call setup_per_zone_pages_min after hotplug
From: IWAMOTO Toshihiro <iwamoto@valinux.co.jp>
> I found the tests does not work well with Dave's patchset.
> I've found the followings:
>
> - setup_per_zone_pages_min() calls should be added in
> capture_page_range() and online_pages()
> - lru_add_drain() should be called before try_to_migrate_pages()
The following patch deals with the first item.
Signed-off-by: IWAMOTO Toshihiro <iwamoto@valinux.co.jp> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Dave Hansen [Sun, 30 Oct 2005 01:16:55 +0000 (18:16 -0700)]
[PATCH] memory hotplug: move section_mem_map alloc to sparse.c
This basically keeps us from having to extern __kmalloc_section_memmap().
The vaddr_in_vmalloc_area() helper could go in a vmalloc header, but that
header gets hard to work with, because it needs some arch-specific macros.
Just stick it in here for now, instead of creating another header.
Dave Hansen [Sun, 30 Oct 2005 01:16:53 +0000 (18:16 -0700)]
[PATCH] memory hotplug locking: zone span seqlock
See the "fixup bad_range()" patch for more information, but this actually
creates the lock to protect code that assumes a zone's size stays constant
at runtime.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Dave Hansen [Sun, 30 Oct 2005 01:16:52 +0000 (18:16 -0700)]
[PATCH] memory hotplug locking: node_size_lock
pgdat->node_size_lock is basically only needed in one place in the normal
code: show_mem(), which is the arch-specific sysrq-m printing function.
Strictly speaking, the architectures not doing memory hotplug do not need this
locking in show_mem(). However, they are all included for completeness. This
should also make any future consolidation of all of the implementations a
little more straightforward.
This lock is also held in the sparsemem code during a memory removal, as
sections are invalidated. This is the place where pfn_valid() is made false
for a memory area that's being removed. The lock is only required when doing
pfn_valid() operations on memory for which the caller does not already hold a
reference on the page, such as in show_mem().
Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Dave Hansen [Sun, 30 Oct 2005 01:16:52 +0000 (18:16 -0700)]
[PATCH] memory hotplug prep: fixup bad_range()
When doing memory hotplug operations, the size of existing zones can obviously
change. This means that zone->zone_{start_pfn,spanned_pages} can change.
There are currently no locks that protect these structure members. However,
they are rarely accessed at runtime. Outside of swsusp, the only place that I
can find is bad_range().
So, split bad_range() up into two pieces: one that needs to be locked and
another that doesn't.
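One way to picture the split is sketched below, under loud assumptions: the span check re-reads the zone bounds under a sequence counter (in the spirit of the zone span seqlock patch listed above), while the consistency check never looks at the bounds and so needs no locking. The struct, the `sketch_*` names and the use of C11 atomics are all choices made for this example, not the kernel's implementation.

```c
#include <stdatomic.h>

/* Hypothetical, trimmed-down zone. */
struct sketch_zone {
	atomic_uint span_seq;		/* even: stable, odd: resize running */
	atomic_ulong zone_start_pfn;
	atomic_ulong spanned_pages;
	int id;
};

static void sketch_zone_init(struct sketch_zone *z, int id,
			     unsigned long start, unsigned long span)
{
	atomic_init(&z->span_seq, 0u);
	atomic_init(&z->zone_start_pfn, start);
	atomic_init(&z->spanned_pages, span);
	z->id = id;
}

/* Writer side; assumes resizes are serialized by some outer lock. */
static void sketch_zone_resize(struct sketch_zone *z,
			       unsigned long start, unsigned long span)
{
	atomic_fetch_add(&z->span_seq, 1u);
	atomic_store(&z->zone_start_pfn, start);
	atomic_store(&z->spanned_pages, span);
	atomic_fetch_add(&z->span_seq, 1u);
}

/* Piece 1: reads the span, so it retries if a resize raced with it. */
static int sketch_page_outside_zone_boundaries(struct sketch_zone *z,
					       unsigned long pfn)
{
	unsigned int seq;
	unsigned long start, span;
	int ret;

	do {
		seq = atomic_load(&z->span_seq);
		start = atomic_load(&z->zone_start_pfn);
		span = atomic_load(&z->spanned_pages);
		ret = pfn < start || pfn >= start + span;
	} while ((seq & 1u) || seq != atomic_load(&z->span_seq));
	return ret;
}

/* Piece 2: does not depend on the span, so it takes no lock. */
static int sketch_page_is_consistent(const struct sketch_zone *z,
				     int page_zone_id)
{
	return page_zone_id == z->id;
}

static int sketch_bad_range(struct sketch_zone *z, unsigned long pfn,
			    int page_zone_id)
{
	if (sketch_page_outside_zone_boundaries(z, pfn))
		return 1;
	if (!sketch_page_is_consistent(z, page_zone_id))
		return 1;
	return 0;
}
```

Keeping the locked piece this small means a hot-add only ever delays the one comparison that depends on the zone's size.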
Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Dave Hansen [Sun, 30 Oct 2005 01:16:50 +0000 (18:16 -0700)]
[PATCH] memory hotplug prep: break out zone initialization
If a zone is empty at boot-time and then hot-added to later, it needs to run
the same init code that would have been run on it at boot.
This patch breaks out zone table and per-cpu-pages functions for use by the
hotplug code. You can almost see all of the free_area_init_core() function on
one page now. :)
Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Dave Hansen [Sun, 30 Oct 2005 01:16:49 +0000 (18:16 -0700)]
[PATCH] memory hotplug prep: kill local_mapnr
The following series implements memory hot-add for ppc64 and i386. There are
x86_64 and ia64 implementations that will be submitted shortly as well,
through the normal maintainers.
This patch:
local_mapnr is unused, except for in an alpha header. Keep the alpha one,
kill the rest.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We had a problem on ppc64 where, with more than 4 threads, a large system
wouldn't scale well while faulting in the .text (most of the time was spent
in the kernel even though it was a userland compute-intensive app). The
reason is the useless overwrite of the same pte from all cpus.
I fixed it this way (verified on an older kernel but the forward port is
almost identical). This will benefit all archs not just ppc64.
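The principle can be shown in a few lines: compare before storing, so that CPUs faulting on an already up-to-date entry do not keep dirtying the same cache line. This is a sketch, not the actual patch; `sketch_pte_t`, the flag value and the function name are hypothetical.

```c
/* Hypothetical page table entry and "accessed" flag for illustration. */
typedef unsigned long sketch_pte_t;

#define SKETCH_PTE_YOUNG 0x20ul

/*
 * Mark the entry as accessed. Returns 1 if memory was written and 0 if
 * the entry already had the wanted value, in which case the store is
 * skipped.
 */
static int sketch_pte_mkyoung_if_needed(sketch_pte_t *ptep)
{
	sketch_pte_t old = *ptep;
	sketch_pte_t entry = old | SKETCH_PTE_YOUNG;

	if (entry == old)
		return 0;	/* same pte: nothing to write */
	*ptep = entry;
	return 1;
}
```

Only the first faulting CPU performs the store; later ones see an identical entry and leave it alone.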
Signed-off-by: Andrea Arcangeli <andrea@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adam Litke [Sun, 30 Oct 2005 01:16:47 +0000 (18:16 -0700)]
[PATCH] hugetlb: overcommit accounting check
Basic overcommit checking for hugetlb_file_map() based on an implementation
used with demand faulting in SLES9.
Since demand faulting can't guarantee the availability of pages at mmap
time, this patch implements a basic sanity check to ensure that the number
of huge pages required to satisfy the mmap are currently available.
Despite the obvious race, I think it is a good start on doing proper
accounting. I'd like to work towards an accounting system that mimics the
semantics of normal pages (especially for the MAP_PRIVATE/COW case). That
work is underway and builds on what this patch starts.
Huge page shared memory segments are simpler and still maintain their
commit on shmget semantics.
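The sanity check amounts to comparing the number of huge pages a mapping needs with the number currently free. The sketch below assumes 2 MiB huge pages and hypothetical `sketch_*` names, and it ignores pages already present in the file, which a real check would also have to consider.

```c
#include <stddef.h>

#define SKETCH_HPAGE_SHIFT 21	/* assumption: 2 MiB huge pages */
#define SKETCH_HPAGE_SIZE ((size_t)1 << SKETCH_HPAGE_SHIFT)

/* Huge pages needed to back `len` bytes, rounding up. */
static size_t sketch_huge_pages_needed(size_t len)
{
	return len / SKETCH_HPAGE_SIZE + (len % SKETCH_HPAGE_SIZE != 0);
}

/*
 * Returns 0 if the mapping could be satisfied right now and -1 otherwise.
 * Nothing is reserved, so the answer can be stale by fault time.
 */
static int sketch_hugetlb_acct_check(size_t len, size_t free_huge_pages)
{
	return sketch_huge_pages_needed(len) <= free_huge_pages ? 0 : -1;
}
```

Because the check reserves nothing, two mappings can both pass it and still run out of pages later, which is the race the commit message acknowledges.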
Signed-off-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adam Litke [Sun, 30 Oct 2005 01:16:46 +0000 (18:16 -0700)]
[PATCH] hugetlb: demand fault handler
Below is a patch to implement demand faulting for huge pages. The main
motivation for changing from prefaulting to demand faulting is so that huge
page memory areas can be allocated according to NUMA policy.
Thanks to consolidated hugetlb code, switching the behavior requires changing
only one fault handler. The bulk of the patch just moves the logic from
hugetlb_prefault() to hugetlb_pte_fault() and find_get_huge_page().
Signed-off-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reformat hugetlbfs_forget_inode and add the missing but harmless
write_inode_now call. It looks the same as generic_forget_inode now except
for the call to truncate_hugepages instead of truncate_inode_pages.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] hugetlbfs: clean up hugetlbfs_delete_inode
Make hugetlbfs_delete_inode look the same as generic_delete_inode, fixing a bunch of
missing updates to it at the same time. Rename it to
hugetlbfs_do_delete_inode and add a real hugetlbfs_delete_inode that
implements ->delete_inode.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move hugetlbfs accounting into ->alloc_inode / ->destroy_inode. This keeps
the code simpler, fixes a leak where a failing inode allocation wouldn't
decrement the counter and moves hugetlbfs_delete_inode and
hugetlbfs_forget_inode closer to their generic counterparts.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hugh Dickins [Sun, 30 Oct 2005 01:16:41 +0000 (18:16 -0700)]
[PATCH] mm: fix rss and mmlist locking
A couple of oddities were guarded by page_table_lock, no longer properly
guarded when that is split.
The mm_counters of file_rss and anon_rss: make those an atomic_t, or an
atomic64_t if the architecture supports it, in such a case. Definitions by
courtesy of Christoph Lameter: who spent considerable effort on more scalable
ways of counting, but found insufficient benefit in practice.
And adding an mm with swap to the mmlist for swapoff: the list is
well-guarded by its own lock, but the list_empty check now has to be repeated
inside it.
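The "repeat the list_empty check inside the lock" pattern can be sketched in userspace C as below. This is only a sketch under stated assumptions: the list, the toy spinlock built on a C11 atomic_flag, and every name (sketch_list, add_once and so on) are hypothetical stand-ins, not the kernel's mmlist code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal circular doubly linked list node; unlinked means self-linked. */
struct node { struct node *prev, *next; };

static void node_init(struct node *n) { n->prev = n->next = n; }
static bool node_unlinked(const struct node *n) { return n->next == n; }

/* Hypothetical global list and the lock that guards it. */
static struct node sketch_list = { &sketch_list, &sketch_list };
static atomic_flag sketch_list_lock = ATOMIC_FLAG_INIT;

static void sketch_spin_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&sketch_list_lock,
                                             memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static void sketch_spin_unlock(void)
{
    atomic_flag_clear_explicit(&sketch_list_lock, memory_order_release);
}

/* Link n at the head of the list unless it is already linked.
 * Returns true if this call did the insertion. */
static bool add_once(struct node *n)
{
    bool added = false;

    /* Unlocked peek: only a hint, it may be stale (strict C11 would
     * want an atomic load here; kept plain to mirror the idea). */
    if (node_unlinked(n)) {
        sketch_spin_lock();
        /* Authoritative check, repeated now that the list lock is held. */
        if (node_unlinked(n)) {
            n->next = sketch_list.next;
            n->prev = &sketch_list;
            sketch_list.next->prev = n;
            sketch_list.next = n;
            added = true;
        }
        sketch_spin_unlock();
    }
    return added;
}
```

The outer test keeps the common case (already on the list) free of locking; the inner test is what makes the insertion safe once no wider lock serializes the callers.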
Hugh Dickins [Sun, 30 Oct 2005 01:16:40 +0000 (18:16 -0700)]
[PATCH] mm: split page table lock
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
a many-threaded application which concurrently initializes different parts of
a large anonymous area.
This patch corrects that, by using a separate spinlock per page table page, to
guard the page table entries in that page, instead of using the mm's single
page_table_lock. (But even then, page_table_lock is still used to guard page
table allocation, and anon_vma allocation.)
In this implementation, the spinlock is tucked inside the struct page of the
page table page: with a BUILD_BUG_ON in case it overflows - which it would in
the case of 32-bit PA-RISC with spinlock debugging enabled.
Splitting the lock is not quite for free: another cacheline access. Ideally,
I suppose we would use split ptlock only for multi-threaded processes on
multi-cpu machines; but deciding that dynamically would have its own costs.
So for now enable it by config, at some number of cpus - since the Kconfig
language doesn't support inequalities, let preprocessor compare that with
NR_CPUS. But I don't think it's worth being user-configurable: for good
testing of both split and unsplit configs, split now at 4 cpus, and perhaps
change that to 8 later.
There is a benefit even for singly threaded processes: kswapd can be attacking
one part of the mm while another part is busy faulting.
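A rough sketch of the configuration trick described above, where the preprocessor rather than Kconfig compares the threshold with NR_CPUS, and of picking either the per-page-table-page lock or the mm-wide lock. All names here (SKETCH_SPLIT_PTLOCK_CPUS, pte_lockptr_sketch, the sketch_* structs) and the default NR_CPUS value are assumptions for the example, not the identifiers used by the patch.

```c
#ifndef NR_CPUS
#define NR_CPUS 8                     /* assumption: normally set by the build */
#endif

#define SKETCH_SPLIT_PTLOCK_CPUS 4    /* threshold mentioned in the text */

/* The inequality is evaluated by the preprocessor, not by Kconfig. */
#if NR_CPUS >= SKETCH_SPLIT_PTLOCK_CPUS
#define SKETCH_USE_SPLIT_PTLOCK 1
#else
#define SKETCH_USE_SPLIT_PTLOCK 0
#endif

struct sketch_lock { int locked; };   /* toy stand-in for a spinlock */

/* One lock per page table page (kept in its struct page in the text). */
struct sketch_pt_page { struct sketch_lock ptl; };

/* The single mm-wide lock used in the unsplit configuration. */
struct sketch_mm { struct sketch_lock page_table_lock; };

/* Return the lock guarding the ptes in the given page table page. */
static struct sketch_lock *pte_lockptr_sketch(struct sketch_mm *mm,
                                              struct sketch_pt_page *pt)
{
#if SKETCH_USE_SPLIT_PTLOCK
    (void)mm;
    return &pt->ptl;                  /* split: lock lives with the page */
#else
    (void)pt;
    return &mm->page_table_lock;      /* unsplit: one lock for the whole mm */
#endif
}
```

In the split configuration two different page table pages yield two different locks, so faults in separate parts of the address space no longer contend on one spinlock.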
Hugh Dickins [Sun, 30 Oct 2005 01:16:39 +0000 (18:16 -0700)]
[PATCH] mm: uml kill unused
In worrying over the various pte operations in different architectures, I came
across some unused functions in UML: remove mprotect_kernel_vm,
protect_vm_page and addr_pte.
Hugh Dickins [Sun, 30 Oct 2005 01:16:37 +0000 (18:16 -0700)]
[PATCH] mm: cris v32 mmu_context_lock
The cris v32 switch_mm guards get_mmu_context with next->page_table_lock: good
it's not really SMP yet, since get_mmu_context messes with global variables
affecting other mms. Replace by global mmu_context_lock.
Hugh Dickins [Sun, 30 Oct 2005 01:16:36 +0000 (18:16 -0700)]
[PATCH] mm: parisc pte atomicity
There's a worrying function translation_exists in parisc cacheflush.h,
unaffected by split ptlock since flush_dcache_page is using it on some other
mm, without any relevant lock. Oh well, make it slightly more robust by
factoring the pfn check within it. And it looked liable to confuse a
camouflaged swap or file entry with a good pte: fix that too.
Hugh Dickins [Sun, 30 Oct 2005 01:16:36 +0000 (18:16 -0700)]
[PATCH] mm: arm ready for split ptlock
Prepare arm for the split page_table_lock: three issues.
Signal handling's preserve and restore of iwmmxt context currently involves
reading and writing that context to and from user space, while holding
page_table_lock to secure the user page(s) against kswapd. If we split the
lock, then the structure might span two pages, secured by different locks.
That would be manageable; but it's simpler just to read into and write from a
kernel stack buffer, copying that out and in without locking (the
structure is 160 bytes in size, and here we're near the top of the kernel
stack). Or would the overhead be noticeable?
arm_syscall's cmpxchg emulation uses pte_offset_map_lock, instead of
pte_offset_map and mm-wide page_table_lock; and strictly, it should now also
take mmap_sem before descending to pmd, to guard against another thread
munmapping, and the page table pulled out beneath this thread.
Updated two comments in fault-armv.c. adjust_pte is interesting, since its
modification of a pte in one part of the mm depends on the lock held when
calling update_mmu_cache for a pte in some other part of that mm. This can't
be done with a split page_table_lock (and we've already taken the lowest lock
in the hierarchy here): so we'll have to disable split on arm, unless
CONFIG_CPU_CACHE_VIPT ensures adjust_pte is never used.
Hugh Dickins [Sun, 30 Oct 2005 01:16:34 +0000 (18:16 -0700)]
[PATCH] mm: i386 sh sh64 ready for split ptlock
Use pte_offset_map_lock, instead of pte_offset_map (or inappropriate
pte_offset_kernel) and mm-wide page_table_lock, in sundry arch places.
The i386 vm86 mark_screen_rdonly: yes, there was and is an assumption that the
screen fits inside the one page table, as indeed it does.
The sh __do_page_fault: which handles both kernel faults (without lock) and
user mm faults (locked - though it set_pte without locking before).
The sh64 flush_cache_range and helpers: which wrongly thought callers held
page_table_lock before (only its tlb_start_vma did, and no longer does so);
moved the flush loop down, and adjusted the large versus small range decision
to consider a range which spans page tables as large.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hugh Dickins [Sun, 30 Oct 2005 01:16:33 +0000 (18:16 -0700)]
[PATCH] mm: follow_page with inner ptlock
Final step in pushing down common core's page_table_lock. follow_page no
longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself;
and so no page_table_lock is taken in get_user_pages itself.
But get_user_pages (and get_futex_key) do then need follow_page to pin the
page for them: take Daniel's suggestion of bitflags to follow_page.
Need one for WRITE, another for TOUCH (it was the accessed flag before:
vanished along with check_user_page_readable, but surely get_numa_maps is
wrong to mark every page it finds as accessed), another for GET.
And another, ANON to dispose of untouched_anonymous_page: it seems silly for
that to descend a second time, let follow_page observe if there was no page
table and return ZERO_PAGE if so. Fix minor bug in that: check VM_LOCKED -
make_pages_present ought to make readonly anonymous present.
Give get_numa_maps a cond_resched while we're there.
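To make the bitflag idea concrete, here is a toy userspace model of a lookup taking WRITE, TOUCH, GET and ANON flags. It is a sketch, not follow_page itself: the SK_ flag names and values, struct sk_page, sk_zero_page and the simplified semantics (a "mapped" pointer and a "writable" boolean standing in for the page table walk) are all assumptions for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values; one bit per behaviour named in the text. */
enum {
    SK_FOLL_WRITE = 0x01,   /* caller needs the mapping to be writable */
    SK_FOLL_TOUCH = 0x02,   /* mark the page accessed */
    SK_FOLL_GET   = 0x04,   /* take a reference (pin) on the page */
    SK_FOLL_ANON  = 0x08    /* hand back the zero page if nothing is mapped */
};

struct sk_page { int refcount; bool accessed; };

static struct sk_page sk_zero_page;   /* stand-in for ZERO_PAGE */

/*
 * Toy lookup: "mapped" is the page found by the (omitted) page table
 * walk, or NULL if there was no page table; "writable" says whether
 * the entry allows writes.
 */
static struct sk_page *sk_follow_page(struct sk_page *mapped, bool writable,
                                      unsigned int flags)
{
    struct sk_page *page = mapped;

    if (!page) {
        if (!(flags & SK_FOLL_ANON))
            return NULL;              /* nothing there, caller must fault */
        page = &sk_zero_page;         /* untouched anonymous memory */
    } else if ((flags & SK_FOLL_WRITE) && !writable) {
        return NULL;                  /* write wanted, entry is read-only */
    }

    if (flags & SK_FOLL_GET)
        page->refcount++;             /* pin on behalf of the caller */
    if (flags & SK_FOLL_TOUCH)
        page->accessed = true;

    return page;
}
```

A caller that only inspects pages would pass no TOUCH bit, which is the behaviour the text argues get_numa_maps should have had.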
Hugh Dickins [Sun, 30 Oct 2005 01:16:32 +0000 (18:16 -0700)]
[PATCH] mm: kill check_user_page_readable
check_user_page_readable is a problematic variant of follow_page. It's used
only by oprofile's i386 and arm backtrace code, at interrupt time, to
establish whether a userspace stackframe is currently readable.
This is problematic, because we want to push the page_table_lock down inside
follow_page, and later split it; whereas oprofile is doing a spin_trylock on
it (in the i386 case, forgotten in the arm case), and needs that to pin
perhaps two pages spanned by the stackframe (which might be covered by
different locks when we split).
I think oprofile is going about this in the wrong way: it doesn't need to know
the area is readable (neither i386 nor arm uses read protection of user
pages), it doesn't need to pin the memory, it should simply
__copy_from_user_inatomic, and see if that succeeds or not. Sorry, but I've
not got around to devising the sparse __user annotations for this.
Then we can eliminate check_user_page_readable, and return to a single
follow_page without the __follow_page variants.
Hugh Dickins [Sun, 30 Oct 2005 01:16:31 +0000 (18:16 -0700)]
[PATCH] mm: rmap with inner ptlock
rmap's page_check_address descends without page_table_lock. First just
pte_offset_map in case there's no pte present worth locking for, then take
page_table_lock for the full check, and pass ptl back to caller in the same
style as pte_offset_map_lock. __xip_unmap, page_referenced_one and
try_to_unmap_one use pte_unmap_unlock. try_to_unmap_cluster also.
page_check_address reformatted to avoid progressive indentation. No use is
made of its one error code, return NULL when it fails.
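The calling convention described above (peek without the lock, then lock and recheck, return the entry with the lock still held and hand the lock back through an out-parameter, or return NULL on failure) might be sketched as follows. This is a single-threaded toy model: the rm_* types, the one-entry "page table" and the boolean lock are hypothetical, chosen only to show the shape of the interface.

```c
#include <stdbool.h>
#include <stddef.h>

struct rm_lock { bool held; };        /* toy lock, not a real spinlock */

static void rm_lock_acquire(struct rm_lock *l) { l->held = true; }
static void rm_unlock(struct rm_lock *l) { l->held = false; }

struct rm_pte { bool present; unsigned long pfn; };

/* Toy mm with a single pte and the lock that guards it. */
struct rm_mm {
    struct rm_lock page_table_lock;
    struct rm_pte pte;
};

/*
 * Return the pte mapping pfn with the lock held and *ptlp set to that
 * lock, or NULL (lock not held, *ptlp untouched) if pfn is not mapped.
 */
static struct rm_pte *rm_check_address(struct rm_mm *mm, unsigned long pfn,
                                       struct rm_lock **ptlp)
{
    struct rm_pte *pte = &mm->pte;

    /* Cheap unlocked peek: nothing present, nothing worth locking for. */
    if (!pte->present)
        return NULL;

    rm_lock_acquire(&mm->page_table_lock);
    /* Full check, now that the entry cannot change under us. */
    if (pte->present && pte->pfn == pfn) {
        *ptlp = &mm->page_table_lock;
        return pte;                   /* caller unlocks via *ptlp */
    }
    rm_unlock(&mm->page_table_lock);
    return NULL;                      /* single failure value, no error code */
}
```

Because the caller only ever sees the lock through the returned pointer, the helper is free to choose which lock that is, which is what makes a later switch to a per-page-table lock possible without touching the callers.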