David Rientjes [Sat, 21 Jul 2007 15:11:29 +0000 (17:11 +0200)]
x86_64: fix e820_hole_size based on address ranges
e820_hole_size() now uses the newly extracted helper function,
e820_find_active_region(), to determine the size of usable RAM in a range of
PFN's.
This was previously broken for two reasons:
- The start and end PFN's of each e820 entry were not properly rounded
prior to excluding those entries in the range, and
- Entries smaller than a page were not properly excluded from being
accumulated.
This resulted in emulated nodes being incorrectly mapped to ranges that
were completely reserved and not candidates for being registered as
active ranges.
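The two fixes can be illustrated with a small user-space sketch. This is not the kernel's e820_find_active_region(); the function name, its parameters, and the 4 KiB page size are assumptions made for illustration. The start of an entry is rounded up to a page boundary, the end is rounded down, and an entry that does not contain a whole page contributes nothing:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12                      /* assumption: 4 KiB pages */
#define DEMO_PAGE_SIZE  (1ULL << DEMO_PAGE_SHIFT)

/*
 * Hypothetical helper: number of whole pages of the byte range
 * [addr, addr + size) that fall inside the PFN range [start_pfn, end_pfn).
 */
uint64_t demo_usable_pages(uint64_t addr, uint64_t size,
                           uint64_t start_pfn, uint64_t end_pfn)
{
    assert(start_pfn <= end_pfn);

    /* Round the entry's start up and its end down to page boundaries. */
    uint64_t first = (addr + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
    uint64_t last  = (addr + size) >> DEMO_PAGE_SHIFT;

    /* Clip to the PFN range being examined. */
    if (first < start_pfn)
        first = start_pfn;
    if (last > end_pfn)
        last = end_pfn;

    /* Entries smaller than a page, or outside the range, add nothing. */
    return last > first ? last - first : 0;
}
```

Under these assumptions, the hole size of a PFN range is end_pfn - start_pfn minus the sum of this value over all usable entries.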
Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yinghai Lu [Sat, 21 Jul 2007 15:11:28 +0000 (17:11 +0200)]
x86_64: disable the GART in shutdown
On K8 systems with 4G of RAM and memory hole remapping enabled, or with more
than 4G of RAM installed, using kexec to load a second kernel causes a
restart: when the second kernel allocates memory for the GART, it clears it
with memset while some device is still using it for DMA. Possible solutions:
in the second kernel: disable the GART before we try to allocate memory for
it, or in the first kernel: disable the GART before shutdown.
Andi/Eric/Alan prefer the second one, for a clean shutdown in the first kernel.
Andi also pointed out the need to consider the case where AGP is enabled but
memory is less than 4G.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Muli Ben-Yehuda <muli@il.ibm.com> Cc: Vivek Goyal <vgoyal@in.ibm.com> Cc: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Robert P. J. Day [Sat, 21 Jul 2007 15:11:26 +0000 (17:11 +0200)]
i386: replace hard-coded constant with appropriate macro from kernel.h
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andreas Mohr [Sat, 21 Jul 2007 15:11:25 +0000 (17:11 +0200)]
i386: add cpu_relax() to cmos_lock()
Add cpu_relax() to cmos_lock() inline function for faster operation on SMT
CPUs and less power consumption on others in case of lock contention (which
probably doesn't happen too often, so admittedly this patch is not too
exciting).
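As a rough user-space analogue (not the kernel's cmos_lock() code), the sketch below spins on a C11 atomic flag and issues a pause hint between attempts. demo_cpu_relax(), demo_lock_acquire() and demo_lock_release() are hypothetical names; on x86 the hint is the PAUSE instruction, elsewhere the sketch falls back to a plain compiler barrier:

```c
#include <stdatomic.h>

/* User-space stand-in for the kernel's cpu_relax() (an assumption). */
static inline void demo_cpu_relax(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __builtin_ia32_pause();                     /* emits PAUSE */
#else
    atomic_signal_fence(memory_order_seq_cst);  /* compiler barrier only */
#endif
}

static atomic_flag demo_lock = ATOMIC_FLAG_INIT;

void demo_lock_acquire(void)
{
    /* Spin until the flag was previously clear, relaxing between tries. */
    while (atomic_flag_test_and_set_explicit(&demo_lock,
                                             memory_order_acquire))
        demo_cpu_relax();
}

void demo_lock_release(void)
{
    atomic_flag_clear_explicit(&demo_lock, memory_order_release);
}
```

The hint only matters while the lock is contended; the uncontended path is unchanged.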
[akpm@linux-foundation.org: Include the header file for cpu_relax()] Signed-off-by: Andreas Mohr <andi@lisas.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Will Schmidt [Sat, 21 Jul 2007 15:11:17 +0000 (17:11 +0200)]
x86_64: During VM oom condition, kill all threads in process group
During a VM oom condition, kill all threads in the process group.
We have had complaints where a threaded application is left in a bad state
after one of its threads is killed when we hit a VM: out_of_memory condition.
Killing just one of the process threads can leave the application in a bad
state, whereas killing the entire process group would allow for the
application to restart, or be otherwise handled, and makes it very obvious that
something has gone wrong.
This change allows the entire process group to be taken down, rather than just
the one thread.
Signed-off-by: Will <will_schmidt@vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
x86_64: Move functions declarations to header file
Some interrupt entry points are currently declared in i8259.c. They probably
belong in a header. Right now, their only user is init_IRQ, justifying
their declaration in-file. But when virtualization comes in, we may be
interested in using those functions in late initializations.
Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andy Whitcroft [Sat, 21 Jul 2007 15:11:15 +0000 (17:11 +0200)]
i386: move the kernel to 16MB for NUMA-Q
We are seeing corruption of the decompressed kernel. It is suspected that
this is platform specific as it has yet to be seen on any other x86. Move
the kernel to the 16MB boundary.
Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
i386: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G
PAE is useful for more than supporting more than 4GB RAM. It supports
expanded swapspace and NX executable protections. Some users may want NX
or expanded swapspace support without the overhead or instability of
highmem. For these reasons, the following patch divorces CONFIG_X86_PAE
from CONFIG_HIGHMEM64G.
Cc: Mark Lord <lkml@rtr.ca> Signed-off-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
James Jarvis [Sat, 21 Jul 2007 15:11:11 +0000 (17:11 +0200)]
i386: DMI_MATCH patch in reboot.c for SFF Dell OptiPlex 745 - fixes hang on reboot
The following patch enables reboot through BIOS on the Dell Optiplex 745
Small Form Factor base, on which reboot hangs. The larger form factor does
not require this, hence the match on DMI_BOARD_NAME.
i386: do not restore reserved memory after hibernation
On some systems the ACPI NVS area is located in the first 1 MB of RAM and
it is overwritten by the i386 code during the restore after hibernation.
This confuses the ACPI platform firmware, which then fails to update the AC
adapter status appropriately
(http://bugzilla.kernel.org/show_bug.cgi?id=7995).
The solution is to register the reserved memory in the first 1 MB as
'nosave', so that swsusp doesn't touch it during the restore. Also, this
has been done on x86_64 for a long time now, so this patch makes the i386
restore code behave like the x86_64 one.
[akpm@linux-foundation.org: build fix] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sam Ravnborg [Sat, 21 Jul 2007 15:11:08 +0000 (17:11 +0200)]
i386: fix section mismatch warning in intel_cacheinfo
Fix following warning:
WARNING: arch/i386/kernel/built-in.o(.init.text+0x3818): Section mismatch: reference to .exit.text:cache_remove_dev (between 'cacheinfo_cpu_callback' and 'cache_sysfs_init')
It points out that a function marked __cpuinit is calling a function marked
__cpuexit => oops.
The call happens only in an error-condition which may explain why we have
not seen it before.
The offending function was not used anywhere else - so marked it __cpuinit.
Note: This warning triggers only with a local copy of modpost
but that version will soon be pushed out.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After the bitmap changes we can get rid of the unlocked versions of
calgary_unmap_sg and iommu_free. Fold __calgary_unmap_sg and
__iommu_free into calgary_unmap_sg and iommu_free, respectively.
Currently the IOMMU table's lock protects both the bitmap and access
to the hardware's TCE table. Access to the TCE table is synchronized
through the bitmap; therefore, only hold the lock while modifying the
bitmap. This gives a yummy 10-15% reduction in CPU utilization for
netperf on a large SMP machine.
x86_64: reserve TCEs with the same address as MEM regions
This works around a bug where DMAs that have the same addresses as
some MEM regions do not go through. Not clear yet if this is due to a
mis-configuration or something deeper.
CalIOC2 is a PCI-e implementation of the Calgary logic. Most of the
programming details are the same, but some differ, e.g., TCE cache
flush. This patch introduces CalIOC2 support - detection and various
support routines. It's not expected to work yet (but will with
follow-on patches).
Adrian Bunk [Sat, 21 Jul 2007 15:10:46 +0000 (17:10 +0200)]
x86: remove support for the Rise CPU
The Rise CPUs were only very short-lived, and there are no reports of
anyone both owning one and running Linux on it.
Googling for the printk string "CPU: Rise iDragon" didn't find any dmesg
available online.
If it turns out that, against all expectations, there are actually users,
reverting this patch would be easy.
This patch will make the kernel images smaller by a few bytes for all
i386 users.
Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
x86_64: check remote IRR bit before migrating level triggered irq
On x86_64 kernels, level-triggered irq migration is initiated in the
context of that interrupt (after executing the irq handler), and the following
steps are taken to do the irq migration.
1. mask IOAPIC RTE entry; // write to IOAPIC RTE
2. EOI; // processor EOI write
3. reprogram IOAPIC RTE entry; // write to IOAPIC RTE with new destination
// and interrupt vector due to per cpu vector
// allocation.
4. unmask IOAPIC RTE entry; // write to IOAPIC RTE
Because of the per cpu vector allocation in x86_64 kernels, when the irq
migrates to a different cpu, a new vector (corresponding to the new cpu) will
get allocated.
An EOI write to the local APIC has the side effect of generating an EOI write
for level-triggered interrupts (normally this is a broadcast to all IOAPICs).
The EOI broadcast generated as a side effect of the EOI write to the processor
may be delayed while the other IOAPIC writes (steps 3 and 4) go through.
Normally, the EOI generated by the local APIC for a level-triggered interrupt
contains the vector number. The IOAPIC will take this vector number, search
the IOAPIC RTE entries for an entry with a matching vector number, and clear
the remote IRR bit (indicating EOI). However, if the vector number is
changed (as in step 3), the IOAPIC will not find the RTE entry when the EOI
is received later. This will cause the remote IRR bit to get stuck, causing
the interrupt to hang (no more interrupts from this RTE).
The current x86_64 kernel assumes that the remote IRR bit is cleared by the
time the IOAPIC RTE is reprogrammed. Fix this assumption by checking the
remote IRR bit and, if it is still set, delaying the irq migration to the next
interrupt arrival event (hopefully, next time the remote IRR bit will get
cleared before the IOAPIC RTE is reprogrammed).
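The check can be modelled on the low 32 bits of a redirection table entry. The sketch below is not the code from this patch: demo_try_migrate() and the DEMO_ macros are hypothetical, the bit positions follow the 82093AA I/O APIC layout (vector in bits 0-7, Remote IRR in bit 14, mask in bit 16), and the destination field in the high half of the entry is left out:

```c
#include <stdbool.h>
#include <stdint.h>

/* Low 32 bits of an IOAPIC redirection table entry (82093AA layout). */
#define DEMO_RTE_VECTOR_MASK 0x000000ffu
#define DEMO_RTE_REMOTE_IRR  (1u << 14)
#define DEMO_RTE_MASKED      (1u << 16)

/*
 * Hypothetical model of one migration attempt. On entry the RTE is masked
 * and the local APIC EOI has been issued. If Remote IRR is still set, the
 * vector is left alone and the move is deferred to the next interrupt.
 * Returns true if the vector was changed.
 */
bool demo_try_migrate(uint32_t *rte_low, uint8_t new_vector)
{
    bool moved = false;

    if (!(*rte_low & DEMO_RTE_REMOTE_IRR)) {
        *rte_low = (*rte_low & ~DEMO_RTE_VECTOR_MASK) | new_vector;
        moved = true;
    }
    *rte_low &= ~DEMO_RTE_MASKED;       /* step 4: unmask in either case */
    return moved;
}
```

When the function returns false, the old vector is still in the entry, so a late EOI broadcast can still find it and clear Remote IRR.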
Initial analysis and patch from Nanhai.
Clean up patch from Suresh.
Rewritten to be less intrusive, and to contain a big fat comment by Eric.
[akpm@linux-foundation.org: fix comments] Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Nanhai Zou <nanhai.zou@intel.com> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Keith Packard <keith.packard@intel.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sam Ravnborg [Sat, 21 Jul 2007 15:10:39 +0000 (17:10 +0200)]
i386: fix section mismatch warnings in mtrr
Following section mismatch warnings were reported by Andrey Borzenkov:
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text:amd_init_mtrr from .text between 'mtrr_bp_init' (at offset 0x967a) and 'mtrr_attrib_to_str'
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text:cyrix_init_mtrr from .text between 'mtrr_bp_init' (at offset 0x967f) and 'mtrr_attrib_to_str'
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text:centaur_init_mtrr from .text between 'mtrr_bp_init' (at offset 0x9684) and 'mtrr_attrib_to_str'
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text: from .text between 'get_mtrr_state' (at offset 0xa735) and 'generic_get_mtrr'
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text: from .text between 'get_mtrr_state' (at offset 0xa749) and 'generic_get_mtrr'
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text: from .text between 'get_mtrr_state' (at offset 0xa770) and 'generic_get_mtrr'
It was tracked down to a few functions missing __init tag.
Compile tested only.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tim Hockin [Sat, 21 Jul 2007 15:10:37 +0000 (17:10 +0200)]
x86_64: mcelog tolerant level cleanup
Background:
The MCE handler has several paths that it can take, depending on various
conditions of the MCE status and the value of the 'tolerant' knob. The
exact semantics are not well defined and the code is a bit twisty.
Description:
This patch makes the MCE handler's behavior more clear by documenting the
behavior for various 'tolerant' levels. It also fixes or enhances
several small things in the handler. Specifically:
* If RIPV is not set it is not safe to restart, so set the 'no way out'
flag rather than the 'kill it' flag.
* Don't panic() on correctable MCEs.
* If the _OVER bit is set *and* the _UC bit is set (meaning possibly
dropped uncorrected errors), set the 'no way out' flag.
* Use EIPV for testing whether an app can be killed (SIGBUS) rather
than RIPV. According to docs, EIPV indicates that the error is
related to the IP, while RIPV simply means the IP is valid to
restart from.
* Don't clear the MCi_STATUS registers until after the panic() path.
This leaves the status bits set after the panic() so clever BIOSes
can find them (and dumb BIOSes can do nothing).
This patch also calls nonseekable_open() in mce_open (as suggested by akpm).
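As a reading aid, the status-bit rules above can be written as a small decision function. This is a simplified sketch, not the handler from this patch: it only derives the two flags, and it ignores the 'tolerant' level, PCC, and the user/kernel mode distinction that decide what is finally done with them. The DEMO_ names are hypothetical; the bit positions are the architectural ones (RIPV and EIPV in MCG_STATUS bits 0 and 1, UC and OVER in MCi_STATUS bits 61 and 62):

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MCG_RIPV (1ULL << 0)       /* restart IP valid */
#define DEMO_MCG_EIPV (1ULL << 1)       /* error IP valid */
#define DEMO_MCI_UC   (1ULL << 61)      /* uncorrected error */
#define DEMO_MCI_OVER (1ULL << 62)      /* error overflow */

struct demo_mce_flags {
    bool no_way_out;        /* restarting is not safe */
    bool kill_candidate;    /* error is tied to the interrupted IP */
};

struct demo_mce_flags demo_mce_classify(uint64_t mcgstatus, uint64_t status)
{
    struct demo_mce_flags f = { false, false };

    /* Without a valid restart IP there is nothing safe to return to. */
    if (!(mcgstatus & DEMO_MCG_RIPV))
        f.no_way_out = true;

    /* Overflow plus uncorrected: errors may have been dropped. */
    if ((status & DEMO_MCI_OVER) && (status & DEMO_MCI_UC))
        f.no_way_out = true;

    /* EIPV, not RIPV, says the error relates to the interrupted IP. */
    if ((status & DEMO_MCI_UC) && (mcgstatus & DEMO_MCG_EIPV))
        f.kill_candidate = true;

    return f;
}
```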
Result:
Tolerant levels behave almost identically to how they always have, but
now the behavior is well defined. There's a slightly higher chance of panic()ing
when multiple errors happen (a good thing, IMHO). If you take an MBE and
panic(), the error status bits are not cleared.
Alternatives:
None.
Testing:
I used software to inject correctable and uncorrectable errors. With
tolerant = 3, the system usually survives. With tolerant = 2, the system
usually panic()s (PCC) but not always. With tolerant = 1, the system
always panic()s. When the system panic()s, the BIOS is able to detect
that the cause of death was an MC4. I was not able to reproduce the
case of a non-PCC error in userspace, with EIPV, with (tolerant < 3).
That will be rare at best.
Signed-off-by: Tim Hockin <thockin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tim Hockin [Sat, 21 Jul 2007 15:10:36 +0000 (17:10 +0200)]
x86_64: support poll() on /dev/mcelog
Background:
/dev/mcelog is typically polled manually. This is less than optimal for
situations where accurate accounting of MCEs is important. Calling
poll() on /dev/mcelog does not work.
Description:
This patch adds support for poll() to /dev/mcelog. This results in
immediate wakeup of user apps whenever the poller finds MCEs. Because
the exception handler cannot take any locks, it cannot call the wakeup
itself. Instead, it uses a thread_info flag (TIF_MCE_NOTIFY) which is
caught at the next return from interrupt or exit from idle, calling the
mce_user_notify() routine. This patch also disables the "fake panic"
path of the mce_panic(), because it results in printk()s in the exception
handler and crashy systems.
This patch also does some small cleanup for essentially unused variables,
and moves the user notification into the body of the poller, so it is
only called once per poll, rather than once per CPU.
Result:
Applications can now poll() on /dev/mcelog. When an error is logged
(whether through the poller or through an exception) the applications are
woken up promptly. This should not affect any previous behaviors. If no
MCEs are being logged, there is no overhead.
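From the application side, waiting for records could look like the sketch below. demo_read_when_ready() is a hypothetical helper and works on any pollable file descriptor; only poll() and read() are real interfaces here:

```c
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become readable,
 * then read from it. Returns the number of bytes read, 0 on timeout (or
 * end of file), and -1 on error with errno set by poll() or read().
 */
ssize_t demo_read_when_ready(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret < 0)
        return -1;
    if (ret == 0 || !(pfd.revents & POLLIN))
        return 0;
    return read(fd, buf, len);
}
```

An application would open /dev/mcelog once and call this in a loop with a timeout of -1 to block until records are available.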
Alternatives:
I considered simply supporting poll() through the poller and not using
TIF_MCE_NOTIFY at all. However, the time between an uncorrectable error
happening and the user application being notified is *the*most* critical
window for us. Many uncorrectable errors can be logged to the network if
given a chance.
I also considered doing the MCE poll directly from the idle notifier, but
decided that was overkill.
Testing:
I used an error-injecting DIMM to create lots of correctable DRAM errors
and verified that my user app is woken up in sync with the polling interval.
I also used the northbridge to inject uncorrectable ECC errors, and
verified (printk() to the rescue) that the notify routine is called and the
user app does wake up. I built with PREEMPT on and off, and verified
that my machine survives MCEs.
[wli@holomorphy.com: build fix] Signed-off-by: Tim Hockin <thockin@google.com> Signed-off-by: William Irwin <bill.irwin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tim Hockin [Sat, 21 Jul 2007 15:10:35 +0000 (17:10 +0200)]
x86_64: O_EXCL on /dev/mcelog
Background:
/dev/mcelog is a clear-on-read interface. It is currently possible for
multiple users to open and read() the device. Users are protected from
each other during any one read, but not across reads.
Description:
This patch adds support for O_EXCL to /dev/mcelog. If a user opens the
device with O_EXCL, no other user may open the device (EBUSY). Likewise,
any user that tries to open the device with O_EXCL while another user has
the device will fail (EBUSY).
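The open rules just described can be modelled with two pieces of state. This is an illustrative model, not the driver code: struct demo_dev, demo_open() and demo_release() are hypothetical, and the locking a real driver needs around this state is omitted:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the device's open bookkeeping. */
struct demo_dev {
    int  open_count;    /* current number of openers */
    bool exclusive;     /* true while an O_EXCL opener holds the device */
};

/* Returns 0 on success or -EBUSY, following the rules described above. */
int demo_open(struct demo_dev *dev, bool want_excl)
{
    /*
     * Refuse if someone already holds the device exclusively, or if an
     * exclusive open is requested while anyone else has it open.
     */
    if (dev->exclusive || (want_excl && dev->open_count > 0))
        return -EBUSY;

    dev->exclusive = want_excl;
    dev->open_count++;
    return 0;
}

void demo_release(struct demo_dev *dev)
{
    /* An exclusive holder is always the only opener, so this is safe. */
    dev->open_count--;
    dev->exclusive = false;
}
```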
Result:
Applications can get exclusive access to /dev/mcelog. Applications that
do not care will be unchanged.
Alternatives:
A simpler choice would be to only allow one open() at all, regardless of
O_EXCL.
Testing:
I wrote an application that opens /dev/mcelog with O_EXCL and observed
that any other app that tried to open /dev/mcelog would fail until the
exclusive app had closed the device.
Caveats:
None.
Signed-off-by: Tim Hockin <thockin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Insert the unclaimed MMCONFIG resources into the resource tree without the
IORESOURCE_BUSY flag during late initialization. This allows the MMCONFIG
regions to be visible in the iomem resource tree without interfering with
other system resources that were discovered during PCI initialization.
David Rientjes [Sat, 21 Jul 2007 15:10:33 +0000 (17:10 +0200)]
x86_64: fake apicid_to_node mapping for fake numa
When we are in the emulated NUMA case, we need to make sure that all existing
apicid_to_node mappings that point to real node ID's now point to the
equivalent fake node ID's.
If we simply iterate over all apicid_to_node[] members for each node, we risk
remapping an entry if it shares a node ID with a real node. Since apicid's
may not be consecutive, we're forced to create an automatic array of
apicid_to_node mappings and then copy it over once we have finished remapping
fake to real nodes.
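A small sketch of why the temporary array is needed. The table size, the DEMO_NO_NODE marker and the fake_to_real[] input are assumptions made for illustration, not the kernel's data structures:

```c
#include <string.h>

#define DEMO_MAX_APICS 8        /* assumption: small table for illustration */
#define DEMO_NO_NODE   (-1)

/*
 * Hypothetical sketch: fake_to_real[f] is the real node that fake node f
 * lives on. Rewrites apicid_to_node[] from real node IDs to fake node IDs.
 */
void demo_remap_apicids(int apicid_to_node[DEMO_MAX_APICS],
                        const int fake_to_real[], int nr_fake)
{
    int remapped[DEMO_MAX_APICS];

    for (int i = 0; i < DEMO_MAX_APICS; i++)
        remapped[i] = DEMO_NO_NODE;

    /* Always compare against the untouched original table. */
    for (int f = 0; f < nr_fake; f++)
        for (int i = 0; i < DEMO_MAX_APICS; i++)
            if (apicid_to_node[i] == fake_to_real[f])
                remapped[i] = f;

    memcpy(apicid_to_node, remapped, sizeof(remapped));
}
```

With fake node 0 on real node 1 and fake node 1 on real node 0, rewriting the table in place would first turn every 1 into 0 and then turn every 0, including the ones just written, into 1. Writing into a copy avoids that.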
Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Sat, 21 Jul 2007 15:10:32 +0000 (17:10 +0200)]
x86_64: fake pxm-to-node mapping for fake numa
For NUMA emulation, our SLIT should represent the true NUMA topology of the
system but our proximity domain to node ID mapping needs to reflect the
emulated state.
When NUMA emulation has successfully set up fake nodes on the system, a new
function, acpi_fake_nodes() is called. This function determines the proximity
domain (_PXM) for each true node found on the system. It then finds which
emulated nodes have been allocated on this true node as determined by its
starting address. The node ID to PXM mapping is changed so that each fake
node ID points to the PXM of the true node that it is located on.
If the machine failed to register a SLIT, then we assume there is no special
requirement for emulated node affinity so we use the default LOCAL_DISTANCE,
which is newly exported to this code, as our measurement if the emulated nodes
appear in the same PXM. Otherwise, we use REMOTE_DISTANCE.
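The fallback rule for a machine without a SLIT amounts to a comparison of proximity domains. In this sketch the function name and the fake_node_to_pxm[] table are hypothetical, and the values 10 and 20 are assumed to match the kernel's LOCAL_DISTANCE and REMOTE_DISTANCE:

```c
#define DEMO_LOCAL_DISTANCE  10 /* assumed value of LOCAL_DISTANCE */
#define DEMO_REMOTE_DISTANCE 20 /* assumed value of REMOTE_DISTANCE */

/* Distance between two fake nodes when no SLIT was registered. */
int demo_fake_node_distance(const int fake_node_to_pxm[], int a, int b)
{
    return fake_node_to_pxm[a] == fake_node_to_pxm[b]
        ? DEMO_LOCAL_DISTANCE
        : DEMO_REMOTE_DISTANCE;
}
```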
PXM_INVAL and NID_INVAL are also exported to the ACPI header file so that we
can compare node_to_pxm() results in generic code (in this case, the SRAT
code).
Cc: Len Brown <lenb@kernel.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Sat, 21 Jul 2007 15:10:31 +0000 (17:10 +0200)]
x86_64: extract helper function from e820_register_active_regions
The logic in e820_register_active_regions() for determining the true active
regions for an e820 entry given a range of PFN's is needed for
e820_hole_size() as well.
e820_hole_size() is called from the NUMA emulation code to determine the
reserved area within an address range on a per-node basis. Its logic should
duplicate that of finding active regions in an e820 entry because these are
the only true ranges we may register anyway.
x86_64: Avoid too many remote cpu references due to /proc/stat
Too many remote cpu references due to /proc/stat.
On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem.
On every call to kstat_irqs, the process brings in per-cpu data from all
online cpus. Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS
results in (256+32*63) * 63 remote cpu references on a 64 cpu config.
/proc/stat is parsed by common commands like top, who, etc., causing lots
of cacheline transfers.
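To make the figure concrete, the expression quoted above can be evaluated directly. demo_remote_refs() is a hypothetical helper that only reproduces that arithmetic, taking the CPU count used in the NR_IRQS term and the number of remote CPUs as separate inputs:

```c
/* Hypothetical helper: (256 + 32 * cpus_in_nr_irqs) * remote_cpus. */
unsigned long demo_remote_refs(unsigned long cpus_in_nr_irqs,
                               unsigned long remote_cpus)
{
    unsigned long nr_irqs = 256 + 32 * cpus_in_nr_irqs;

    /* Each IRQ's counter is fetched once from every remote CPU. */
    return nr_irqs * remote_cpus;
}
```

With the numbers from the text, demo_remote_refs(63, 63) gives 2272 * 63 = 143136 remote references for a single read of /proc/stat.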
This statistic seems useless. Other 'big iron' arches disable this.
AK: changed to remove for all SMP setups
AK: add comment
Chris Wright [Sat, 21 Jul 2007 15:10:09 +0000 (17:10 +0200)]
x86_64: Untangle asm/hpet.h from asm/timex.h
When making changes to x86_64 timers, I noticed that touching hpet.h triggered
an unreasonably large rebuild. Untangling it from timex.h quiets the extra
rebuild quite a bit.
Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current SMI detection logic in read_hpet_tsc() ensures that when an SMI
happens between the read of the HPET counter and the read of the TSC, the
resulting wrong value is used for TSC calibration.
This is not the intention of the function. The comparison must ensure
that we do _NOT_ use such a value.
Fix the check to use calibration values only where the delta of the two TSC
reads is smaller than a reasonable threshold.
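The intended comparison can be shown on pre-recorded samples. This is a sketch of the idea rather than read_hpet_tsc() itself: struct demo_sample, demo_pick_sample() and the threshold parameter are hypothetical:

```c
#include <stdint.h>

struct demo_sample {
    uint64_t tsc1;  /* TSC read before the HPET read */
    uint64_t hpet;  /* HPET counter value */
    uint64_t tsc2;  /* TSC read after the HPET read */
};

/*
 * Return the index of the first sample whose two TSC reads are closer
 * together than `threshold` cycles, or -1 if every sample was disturbed
 * (for example by an SMI between the reads).
 */
int demo_pick_sample(const struct demo_sample *s, int n, uint64_t threshold)
{
    for (int i = 0; i < n; i++)
        if (s[i].tsc2 - s[i].tsc1 < threshold)
            return i;
    return -1;
}
```

An inverted comparison would do the opposite of what is wanted and select exactly the disturbed samples.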
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
x86_64: Add vDSO for x86-64 with gettimeofday/clock_gettime/getcpu
This implements new vDSO for x86-64. The concept is similar
to the existing vDSOs on i386 and PPC. x86-64 has had static
vsyscalls before, but these are not flexible enough anymore.
A vDSO is an ELF shared library supplied by the kernel that is mapped into
user address space. The vDSO mapping is randomized for each process
for security reasons.
Doing this was needed for clock_gettime, because clock_gettime
always needs a syscall fallback and having one at a fixed
address would have made buffer overflow exploits too easy to write.
The vdso can be disabled with vdso=0
It currently includes a new gettimeofday implementation and optimized
clock_gettime(). The gettimeofday implementation is slightly faster
than the one in the old vsyscall. clock_gettime is significantly faster
than the syscall for CLOCK_MONOTONIC and CLOCK_REALTIME.
The new calls are generally faster than the old vsyscall.
Advantages over the old x86-64 vsyscalls:
- Extensible
- Randomized
- Cleaner
- Easier to virtualize (the old static address range causes
overhead e.g. for Xen because it has to create special page tables for it)
Weak points:
- glibc support still to be written
The VM interface is partly based on Ingo Molnar's i386 version.
gcc 4.3 supports a new __attribute__((__cold__)) to mark functions cold. Any
path directly leading to a call of this function will be unlikely. And gcc
will try to generate smaller code for the function itself.
Please use with care. The code generation advantage isn't large and in most
cases it is not worth uglifying code with this.
This patch marks some common error functions like panic(), printk()
as cold. This will longer term make many unlikely()s unnecessary, although
we can keep them for now for older compilers.
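Outside the kernel, the attribute can be tried as in the sketch below. DEMO_COLD and the two functions are hypothetical. The guard uses __has_attribute, which newer compilers provide but which was not available when this was written, and falls back to an empty macro elsewhere:

```c
#include <stdio.h>

/* Use the attribute only where the compiler reports support for it. */
#if defined(__has_attribute)
# if __has_attribute(__cold__)
#  define DEMO_COLD __attribute__((__cold__))
# endif
#endif
#ifndef DEMO_COLD
# define DEMO_COLD
#endif

/* Rarely called error path: paths leading here are treated as unlikely. */
DEMO_COLD int demo_report_error(const char *what)
{
    fprintf(stderr, "error: %s\n", what);
    return -1;
}

int demo_parse_digit(char c)
{
    if (c < '0' || c > '9')
        return demo_report_error("not a digit");
    return c - '0';
}
```

No unlikely() is needed around the range check in demo_parse_digit(); the call to a cold function already marks that branch as the unlikely one.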
BUG is not marked cold because there is currently no way to tell
gcc to mark an inline function cold.
Also all __init and __exit functions are marked cold. With a non -Os
build this will tell the compiler to generate slightly smaller code
for them. I think it currently only uses less alignment for labels,
but that might change in the future.
One disadvantage over *likely() is that they cannot be easily instrumented
to verify them.
Another drawback is that only the latest gcc 4.3 snapshots support this.
Unfortunately we cannot detect this using the preprocessor. This means older
snapshots will fail now. I don't think that's a problem because they are
unreleased compilers that nobody should be using.
gcc also has a __hot__ attribute, but I don't see any sense in using
this in the kernel right now. But someday I hope gcc will be able
to use more aggressive optimizing for hot functions even in -Os,
if that happens it should be added.
i386: Move all simple string operations out of line
The compiler generally generates reasonable inline code for the simple
cases and for the rest it's better for code size for them to be out of line.
Also there they can be potentially optimized more in the future.
In fact they probably should be in a .S file because they're all pure
assembly, but that's for another day.
Also some code style cleanup on them while I was at it (this seems
to be the last untouched really early Linux code).
This saves ~12k text for a defconfig kernel with gcc 4.1.
David Rientjes [Sat, 21 Jul 2007 15:09:56 +0000 (17:09 +0200)]
x86_64: various cleanups in NUMA scan node
In acpi_scan_nodes(), we immediately return -1 if acpi_numa <= 0, meaning
we haven't detected any underlying ACPI topology or we have explicitly
disabled its use from the command-line with numa=noacpi.
acpi_table_print_srat_entry() and acpi_table_parse_srat() are only
referenced within drivers/acpi/numa.c, so we can mark them as static and
remove their prototypes from the header file.
Likewise, pxm_to_node_map[] and node_to_pxm_map[] are only used within
drivers/acpi/numa.c, so we mark them as static and remove their externs
from the header file.
The automatic 'result' variable is unused in acpi_numa_init(), so it's
removed.
Linux 64bit only uses the IO-APIC ID as an internal cookie. In the future
there could be some cases where the IO-APIC IDs are not unique because
they share an 8 bit space with CPUs and if there are enough CPUs
it is difficult to keep them unique. But Linux needs the io apic ID
internally for its data structures. Assign unique IO APIC ids on
table parsing.
Nicolas Ferre [Sat, 21 Jul 2007 11:37:59 +0000 (04:37 -0700)]
atmel_lcdfb: Fix STN LCD support
Fixes STN LCD support for the atmel_lcdfb framebuffer driver.
This patch is the result of a work from Jan Altenberg and has
been tested on a Hitachi SP06Q002 on at91sam9261ek.
It adds a Kconfig switch that enables the proper LCD in the
board configuration file (STN or TFT). The switch is used
in arch/arm/mach-at91/at91sam9261_devices.c & board-sam9261ek.c
as an example.
This patch includes the "Fix wrong line_length calculation"
little one from Jan and Haavard (submitted earlier).
AT91 platform information is submitted directly through
the at91 maintainer, here:
http://article.gmane.org/gmane.linux.kernel/543158
Signed-off-by: Nicolas Ferre <nicolas.ferre@rfo.atmel.com> Cc: "Antonino A. Daplas" <adaplas@gmail.com> Cc: Jan Altenberg <jan.altenberg@linutronix.de> Cc: Patrice Vilchez <patrice.vilchez@rfo.atmel.com> Cc: Andrew Victor <andrew@sanpeople.com> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Hommel [Sat, 21 Jul 2007 11:37:58 +0000 (04:37 -0700)]
rtc: add support for STK17TA8 chip
This patch adds support for the Simtek STK17TA8 timekeeping chip.
The STK17TA8 is quite similar to the DS1553, but differs in register layout
and in various control bits in the registers. I chose to make this a new
driver to avoid confusion in the code and to not get lost in #ifdefs.
Signed-off-by: Thomas Hommel <thomas.hommel@gefanuc.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: David Brownell <david-b@pacbell.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We now read and write the century byte in the max6900 chip. We probably
don't need to do so on a Linux-only system, but it's necessary when the chip
is shared by another OS that uses the century byte.
Signed-off-by: Dale Farnsworth <dale@farnsworth.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: David Brownell <david-b@pacbell.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Brownell [Sat, 21 Jul 2007 11:37:56 +0000 (04:37 -0700)]
rtc kconfig: point out need for static linkage
Various people have expressed surprise that their modular RTC drivers don't
seem to work for initializing the system time at boot. To help avoid such
unpleasantness, make the Kconfig text point out that the driver probably
needs to be statically linked.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Acked-by: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>