err.no Git - linux-2.6/log

[PATCH] M68KNOMMU: user ARRAY_SIZE macro when appropriate

Use ARRAY_SIZE macro already defined in linux/kernel.h

Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Signed-off-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] kernel/time/clocksource.c needs struct task_struct on m68k

kernel/time/clocksource.c needs struct task_struct on m68k.

Because it uses spin_unlock_irq(), which, on m68k, uses hardirq_count(), which
uses preempt_count(), which needs to dereference struct task_struct, we
have to include sched.h. Because it would cause a loop inclusion, we
cannot include sched.h in any other of asm-m68k/system.h,
linux/thread_info.h, linux/hardirq.h, which leaves this ugly include in
a C file as the only simple solution.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] m68k: work around binutils tokenizer change

Recent as(1) doesn't think that .  terminates a macro name, so getuser.l is
_not_ treated as invoking getuser with .l as the first argument.
arch/m68k/math-emu relies on old behaviour, so it gets a lot of undefined
macros with more or less current binutils.

Note that this behaviour remains in all recent versions and is unrelated to
another binutils problems we used to have for a while (having (%a0)+ parsed
as two arguments).  This one is there to stay; it's an intentional and
documented change.

.irp <identifier> <words>
[text]
.endr
expands to a copy of text per each word, with \<identifier> replaced with
corresponding word.  Again, what happens depends on whether gas_ident.x
is treated as one or as two tokens; in the former case we'll get old_gas
incremented once, in the latter - twice.  The rest is obvious.

Unlike .macro argument list _anything_ is explicitly allowed after
.irp <identifier>; here we are on very safe ground.  And yes, it does
work with all gas variants I've got here (including vanilla 2.15, 2.16,
2.16.1 and 2.17, plus debian and FC binutils).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] m32r: cosmetic updates and trivial fixes

Cosmetic updates and trivial fixes of m32r arch-dependent files.
- Remove RCS ID strings and trailing white lines
- Other misc. cosmetic updates

Signed-off-by: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] m32r: fix kernel entry address of vmlinux

This patch fixes the kernel entry point address of vmlinux.

The m32r kernel entry address is 0x08002000 (physical).
But, so far, the ENTRY point written in vmlinux.lds.S was not point
the correct kernel entry address.

(before fix)
    $ objdump -x vmlinux
    vmlinux:     file format elf32-m32r-linux
    vmlinux
    architecture: m32r2, flags 0x00000112:
    EXEC_P, HAS_SYMS, D_PAGED
    start address 0x88002090 /* NG */
        :
    Sections:
    Idx Name          Size      VMA       LMA       File off  Algn
      0 .empty_zero_page 00001000  88001000  88001000  00001000  2**12
                      CONTENTS, ALLOC, LOAD, DATA
      1 .boot         0000008c  88002000  88002000  00002000  2**2
                      CONTENTS, ALLOC, LOAD, READONLY, CODE
      2 .text         001ab694  88002090  88002090  00002090  2**4
                      CONTENTS, ALLOC, LOAD, READONLY, CODE
        :

(after fix)
    $ objdump -x vmlinux
    vmlinux:     file format elf32-m32r-linux
    vmlinux
    architecture: m32r2, flags 0x00000112:
    EXEC_P, HAS_SYMS, D_PAGED
    start address 0x08002000 /* OK */
        :

This fix also remedies the following GDB error message (of gdb-6.4 or after)
at the first operation of kernel debugging:
"Previous frame identical to this frame (corrupt stack?)".

Signed-off-by: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] m32r: update defconfig files for v2.6.19

This patch upgrades defconfig files for all m32r platforms.

Signed-off-by: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] m32r: fix do_page_fault and update_mmu_cache

Fix do_page_fault and update_mmu_cache.

  * Fix do_page_fault (vmalloc_fault:) to pass error_code correctly
    to update_mmu_cache by using a thread-fault code for all m32r chips.

  * Fix update_mmu_cache for OPSP chip
    - #ifdef CONFIG_CHIP_OPSP portion is a workaround of OPSP;
      Add a notfound-case operation to update_mmu_cache for OPSP
      like other m32r chip.
    - Fix pte_data that was not initialized if no entry found.

Signed-off-by: Kazuhiro Inaoka <inaoka@linux-m32r.org>
Signed-off-by: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] m32r: build fix for processors without ISA_DSP_LEVEL2

Additional fixes for processors without ISA_DSP_LEVEL2. sigcontext_t does not
have dummy_acc1h, dummy_acc1l members any longer.

Signed-off-by: Hirokazu Takata <takata@linux-m32r.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] swsusp: Change pm_ops handling by userland interface

Make the userland interface of swsusp call pm_ops->finish() after
enable_nonboot_cpus() and before resume_device(), as indicated by the recent
discussion on Linux-PM (cf.
http://lists.osdl.org/pipermail/linux-pm/2006-November/004164.html).

This patch changes the SNAPSHOT_PMOPS ioctl so that its first function,
PMOPS_PREPARE, only sets a switch turning the platform suspend mode on, and
its last function, PMOPS_FINISH, only checks if the platform mode is enabled.
This should allow the older userland tools to work with new kernels without
any modifications.

The changes here only affect the userland interface of swsusp.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Greg KH <greg@kroah.com>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: Patrick Mochel <mochel@digitalimplant.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] swsusp-change-code-ordering-in-userc-sanity

The compiler will do that. And if it doesn't, we don't want to either ;)

Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Greg KH <greg@kroah.com>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: Patrick Mochel <mochel@digitalimplant.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] swsusp: Change code ordering in user.c

Change the ordering of code in kernel/power/user.c so that device_suspend() is
called before disable_nonboot_cpus() and device_resume() is called after
enable_nonboot_cpus(). This is needed to make the userland suspend call
pm_ops->finish() after enable_nonboot_cpus() and before device_resume(), as
indicated by the recent discussion on Linux-PM (cf.
http://lists.osdl.org/pipermail/linux-pm/2006-November/004164.html).

The changes here only affect the userland interface of swsusp.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Greg KH <greg@kroah.com>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: Patrick Mochel <mochel@digitalimplant.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] swsusp: Change code ordering in disk.c

Change the ordering of code in kernel/power/disk.c so that device_suspend() is
called before disable_nonboot_cpus() and platform_finish() is called after
enable_nonboot_cpus() and before device_resume(), as indicated by the recent
discussion on Linux-PM (cf.
http://lists.osdl.org/pipermail/linux-pm/2006-November/004164.html).

The changes here only affect the built-in swsusp.

[alexey.y.starikovskiy@linux.intel.com: fix LED blinking during image load]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Greg KH <greg@kroah.com>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: Patrick Mochel <mochel@digitalimplant.org>
Cc: Alexey Starikovskiy <alexey.y.starikovskiy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] PM: Change code ordering in main.c

As indicated in a recent thread on Linux-PM, it's necessary to call
pm_ops->finish() before devce_resume(), but enable_nonboot_cpus() has to be
called before pm_ops->finish() (cf.
http://lists.osdl.org/pipermail/linux-pm/2006-November/004164.html). For
consistency, it seems reasonable to call disable_nonboot_cpus() after
device_suspend().

This way the suspend code will remain symmetrical with respect to the resume
code and it may allow us to speed up things in the future by suspending and
resuming devices and/or saving the suspend image in many threads.

The following series of patches reorders the suspend and resume code so that
nonboot CPUs are disabled after devices have been suspended and enabled before
the devices are resumed. It also causes pm_ops->finish() to be called after
enable_nonboot_cpus() wherever necessary.

This patch:

Change the ordering of code in kernel/power/main.c so that device_suspend()
is called before disable_nonboot_cpus() and pm_ops->finish() is called after
enable_nonboot_cpus() and before device_resume(), as indicated by recent
discussion on Linux-PM
(cf. http://lists.osdl.org/pipermail/linux-pm/2006-November/004164.html).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Greg KH <greg@kroah.com>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: Patrick Mochel <mochel@digitalimplant.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] ARM26: Use ARRAY_SIZE macro when appropriate

Use ARRAY_SIZE macro already defined in linux/kernel.h

Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: Ian Molton <spyro@f2s.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] Alpha: increase PERCPU_ENOUGH_ROOM

Module loading on Alpha was failing with error "Could not allocate 8 bytes
percpu data".

Looking at dmesg we have the below error "No per-cpu room for modules."

Increase the PERCPU_ENOUGH_ROOM in a similar way as x86_64

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@gmail.com>
Cc: <Jay.Estabrook@hp.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] make reading /proc/sys/kernel/cap-bould not require CAP_SYS_MODULE

Reading /proc/sys/kernel/cap-bound requires CAP_SYS_MODULE.  (see
proc_dointvec_bset in kernel/sysctl.c)

sysctl appears to drive all over proc reading everything it can get it's
hands on and is complaining when it is being denied access to read
cap-bound.  Clearly writing to cap-bound should be a sensitive operation
but requiring CAP_SYS_MODULE to read cap-bound seems a bit to strong.  I
believe the information could with reasonable certainty be obtained by
looking at a bunch of the output of /proc/pid/status which has very low
security protection, so at best we are just getting a little obfuscation of
information.

Currently SELinux policy has to 'dontaudit' capability checks for
CAP_SYS_MODULE for things like sysctl which just want to read cap-bound.
In doing so we also as a byproduct have to hide warnings of potential
exploits such as if at some time that sysctl actually tried to load a
module.  I wondered if anyone would have a problem opening cap-bound up to
read from anyone?

Acked-by: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] do not disturb page referenced state when unmapping memory range

When kernel unmaps an address range, it needs to transfer PTE state into
page struct.  Currently, kernel transfer access bit via
mark_page_accessed().  The call to mark_page_accessed in the unmap path
doesn't look logically correct.

At unmap time, calling mark_page_accessed will causes page LRU state to be
bumped up one step closer to more recently used state.  It is causing quite
a bit headache in a scenario when a process creates a shmem segment, touch
a whole bunch of pages, then unmaps it.  The unmapping takes a long time
because mark_page_accessed() will start moving pages from inactive to
active list.

I'm not too much concerned with moving the page from one list to another in
LRU.  Sooner or later it might be moved because of multiple mappings from
various processes.  But it just doesn't look logical that when user asks a
range to be unmapped, it's his intention that the process is no longer
interested in these pages.  Moving those pages to active list (or bumping
up a state towards more active) seems to be an over reaction.  It also
prolongs unmapping latency which is the core issue I'm trying to solve.

As suggested by Peter, we should still preserve the info on pte young
pages, but not more.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ken Chen <kenchen@google.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] convert ramfs to use __set_page_dirty_no_writeback

As pointed out by Hugh, ramfs would also benefit from using the new
set_page_dirty aop method for memory backed file systems.

Signed-off-by: Ken Chen <kenchen@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] simplify shmem_aops.set_page_dirty() method

shmem backed file does not have page writeback, nor it participates in
backing device's dirty or writeback accounting.  So using generic
__set_page_dirty_nobuffers() for its .set_page_dirty aops method is a bit
overkill.  It unnecessarily prolongs shm unmap latency.

For example, on a densely populated large shm segment (sevearl GBs), the
unmapping operation becomes painfully long.  Because at unmap, kernel
transfers dirty bit in PTE into page struct and to the radix tree tag.  The
operation of tagging the radix tree is particularly expensive because it
has to traverse the tree from the root to the leaf node on every dirty
page.  What's bothering is that radix tree tag is used for page write back.
However, shmem is memory backed and there is no page write back for such
file system.  And in the end, we spend all that time tagging radix tree and
none of that fancy tagging will be used.  So let's simplify it by introduce
a new aops __set_page_dirty_no_writeback and this will speed up shm unmap.

Signed-off-by: Ken Chen <kenchen@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] zoneid: fix up calculations for ZONEID_PGSHIFT

Currently if we have a non-zero ZONES_SHIFT we assume we are able to rely
on that as the bottom edge of the ZONEID, if not then we use the
NODES_PGOFF as the right end of either NODES _or_ SECTION. This latter is
more luck than judgement and would be incorrect if we reordered the
SECTION,NODE,ZONE options in the fields space.

Really what we want is the lower of the right hand end of the two fields we
are using (either NODE,ZONE or SECTION,ZONE). Codify that explicitly. As
always allow for there being no bits in either of the fields, such as might
be valid in a non-numa machine with only a zone NORMAL.

I have checked that the compiler is still able to constant fold all of this
away correctly.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] Set CONFIG_ZONE_DMA for arches with GENERIC_ISA_DMA

As Andi pointed out: CONFIG_GENERIC_ISA_DMA only disables the ISA DMA
channel management.  Other functionality may still expect GFP_DMA to
provide memory below 16M.  So we need to make sure that CONFIG_ZONE_DMA is
set independent of CONFIG_GENERIC_ISA_DMA.  Undo the modifications to
mm/Kconfig where we made ZONE_DMA dependent on GENERIC_ISA_DMA and set
theses explicitly in each arches Kconfig.

Reviews must occur for each arch in order to determine if ZONE_DMA can be
switched off.  It can only be switched off if we know that all devices
supported by a platform are capable of performing DMA transfers to all of
memory (Some arches already support this: uml, avr32, sh sh64, parisc and
IA64/Altix).

In order to switch ZONE_DMA off conditionally, one would have to establish
a scheme by which one can assure that no drivers are enabled that are only
capable of doing I/O to a part of memory, or one needs to provide an
alternate means of performing an allocation from a specific range of memory
(like provided by alloc_pages_range()) and insure that all drivers use that
call.  In that case the arches alloc_dma_coherent() may need to be modified
to call alloc_pages_range() instead of relying on GFP_DMA.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] optional ZONE_DMA: remove ZONE_DMA remains from sh/sh64

sh / sh64: Remove ZONE_DMA remains.

Both arches do not need ZONE_DMA

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] optional ZONE_DMA: remove ZONE_DMA remains from parisc

Remove ZONE_DMA remains from parisc so that kernels are build without
ZONE_DMA.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] optional ZONE_DMA: optional ZONE_DMA for ia64

ZONE_DMA less operation for IA64 SGI platform

Disable ZONE_DMA for SGI SN2. All memory is addressable by all devices and we
do not need any special memory pool.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] optional ZONE_DMA: optional ZONE_DMA in the VM

Make ZONE_DMA optional in core code.

- ifdef all code for ZONE_DMA and related definitions following the example
for ZONE_DMA32 and ZONE_HIGHMEM.

- Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
0.

- Modify the VM statistics to work correctly without a DMA zone.

- Modify slab to not create DMA slabs if there is no ZONE_DMA.

[akpm@osdl.org: cleanup]
[jdike@addtoit.com: build fix]
[apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] optional ZONE_DMA: introduce CONFIG_ZONE_DMA

This patch simply defines CONFIG_ZONE_DMA for all arches.  We later do special
things with CONFIG_ZONE_DMA after the VM and an arch are prepared to work
without ZONE_DMA.

CONFIG_ZONE_DMA can be defined in two ways depending on how an architecture
handles ISA DMA.

First if CONFIG_GENERIC_ISA_DMA is set by the arch then we know that the arch
needs ZONE_DMA because ISA DMA devices are supported.  We can catch this in
mm/Kconfig and do not need to modify arch code.

Second, arches may use ZONE_DMA in an unknown way.  We set CONFIG_ZONE_DMA for
all arches that do not set CONFIG_GENERIC_ISA_DMA in order to insure backwards
compatibility.  The arches may later undefine ZONE_DMA if their arch code has
been verified to not depend on ZONE_DMA.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] optional ZONE_DMA: deal with cases of ZONE_DMA meaning the first zone

This patchset follows up on the earlier work in Andrew's tree to reduce the
number of zones.  The patches allow to go to a minimum of 2 zones.  This one
allows also to make ZONE_DMA optional and therefore the number of zones can be
reduced to one.

ZONE_DMA is usually used for ISA DMA devices.  There are a number of reasons
why we would not want to have ZONE_DMA

1. Some arches do not need ZONE_DMA at all.

2. With the advent of IOMMUs DMA zones are no longer needed.
   The necessity of DMA zones may drastically be reduced
   in the future. This patchset allows a compilation of
   a kernel without that overhead.

3. Devices that require ISA DMA get rare these days. All
   my systems do not have any need for ISA DMA.

4. The presence of an additional zone unecessarily complicates
   VM operations because it must be scanned and balancing
   logic must operate on its.

5. With only ZONE_NORMAL one can reach the situation where
   we have only one zone. This will allow the unrolling of many
   loops in the VM and allows the optimization of varous
   code paths in the VM.

6. Having only a single zone in a NUMA system results in a
   1-1 correspondence between nodes and zones. Various additional
   optimizations to critical VM paths become possible.

Many systems today can operate just fine with a single zone.  If you look at
what is in ZONE_DMA then one usually sees that nothing uses it.  The DMA slabs
are empty (Some arches use ZONE_DMA instead of ZONE_NORMAL, then ZONE_NORMAL
will be empty instead).

On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty.  Why
constantly look at an empty zone in /proc/zoneinfo and empty slab in
/proc/slabinfo?  Non i386 also frequently have no need for ZONE_DMA and zones
stay empty.

The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA).

The RFC posted earlier (see
http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had lots
of #ifdefs in them.  An effort has been made to minize the number of #ifdefs
and make this as compact as possible.  The job was made much easier by the
ongoing efforts of others to extract common arch specific functionality.

I have been running this for awhile now on my desktop and finally Linux is
using all my available RAM instead of leaving the 16MB in ZONE_DMA untouched:

christoph@pentium940:~$ cat /proc/zoneinfo
Node 0, zone   Normal
  pages free     4435
        min      1448
        low      1810
        high     2172
        active   241786
        inactive 210170
        scanned  0 (a: 0 i: 0)
        spanned  524224
        present  524224
    nr_anon_pages 61680
    nr_mapped    14271
    nr_file_pages 390264
    nr_slab_reclaimable 27564
    nr_slab_unreclaimable 1793
    nr_page_table_pages 449
    nr_dirty     39
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
    cpu: 0 pcp: 0
              count: 156
              high:  186
              batch: 31
    cpu: 0 pcp: 1
              count: 9
              high:  62
              batch: 15
  vm stats threshold: 20
    cpu: 1 pcp: 0
              count: 177
              high:  186
              batch: 31
    cpu: 1 pcp: 1
              count: 12
              high:  62
              batch: 15
  vm stats threshold: 20
  all_unreclaimable: 0
  prev_priority:     12
  temp_priority:     12
  start_pfn:         0

This patch:

In two places in the VM we use ZONE_DMA to refer to the first zone.  If
ZONE_DMA is optional then other zones may be first.  So simply replace
ZONE_DMA with zone 0.

This also fixes ZONETABLE_PGSHIFT.  If we have only a single zone then
ZONES_PGSHIFT may become 0 because there is no need anymore to encode the zone
number related to a pgdat.  However, we still need a zonetable to index all
the zones for each node if this is a NUMA system.  Therefore define
ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in page flags.

[apw@shadowen.org: fix mismerge]
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] Drop get_zone_counts()

Values are available via ZVC sums.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] Drop __get_zone_counts()

Values are readily available via ZVC per node and global sums.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] Drop nr_free_pages_pgdat()

Function is unnecessary now. We can use the summing features of the ZVCs to
get the values we need.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] Drop free_pages()

nr_free_pages is now a simple access to a global variable.  Make it a macro
instead of a function.

The nr_free_pages now requires vmstat.h to be included.  There is one
occurrence in power management where we need to add the include.  Directly
refrer to global_page_state() there to clarify why the #include was added.

[akpm@osdl.org: arm build fix]
[akpm@osdl.org: sparc64 build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] Reorder ZVCs according to cacheline

The global and per zone counter sums are in arrays of longs.  Reorder the ZVCs
so that the most frequently used ZVCs are put into the same cacheline.  That
way calculations of the global, node and per zone vm state touches only a
single cacheline.  This is mostly important for 64 bit systems were one 128
byte cacheline takes only 8 longs.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] Use ZVC for free_pages

This is again simplifies some of the VM counter calculations through the use
of the ZVC consolidated counters.

[michal.k.k.piotrowski@gmail.com: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] Use ZVC for inactive and active counts

The determination of the dirty ratio to determine writeback behavior is
currently based on the number of total pages on the system.

However, not all pages in the system may be dirtied.  Thus the ratio is always
too low and can never reach 100%.  The ratio may be particularly skewed if
large hugepage allocations, slab allocations or device driver buffers make
large sections of memory not available anymore.  In that case we may get into
a situation in which f.e.  the background writeback ratio of 40% cannot be
reached anymore which leads to undesired writeback behavior.

This patchset fixes that issue by determining the ratio based on the actual
pages that may potentially be dirty.  These are the pages on the active and
the inactive list plus free pages.

The problem with those counts has so far been that it is expensive to
calculate these because counts from multiple nodes and multiple zones will
have to be summed up.  This patchset makes these counters ZVC counters.  This
means that a current sum per zone, per node and for the whole system is always
available via global variables and not expensive anymore to calculate.

The patchset results in some other good side effects:

- Removal of the various functions that sum up free, active and inactive
  page counts

- Cleanup of the functions that display information via the proc filesystem.

This patch:

The use of a ZVC for nr_inactive and nr_active allows a simplification of some
counter operations.  More ZVC functionality is used for sums etc in the
following patches.

[akpm@osdl.org: UP build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] page_mkwrite caller race fix

After do_wp_page has tested page_mkwrite, it must release old_page after
acquiring page table lock, not before: at some stage that ordering got
reversed, leaving a (very unlikely) window in which old_page might be
truncated, freed, and reused in the same position.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] typeof __page_to_pfn with SPARSEMEM=y

With CONFIG_SPARSEMEM=y:

mm/rmap.c:579: warning: format '%lx' expects type 'long unsigned int', but argument 2 has type 'int'

Make __page_to_pfn() return unsigned long.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] /proc/zoneinfo: fix vm stats display

This early break prevents us from displaying info for the vm stats thresholds
if the zone doesn't have any pages in its per-cpu pagesets.

So my 800MB i386 box says:

Node 0, zone      DMA
  pages free     2365
        min      16
        low      20
        high     24
        active   0
        inactive 0
        scanned  0 (a: 0 i: 0)
        spanned  4096
        present  4044
    nr_anon_pages 0
    nr_mapped    1
    nr_file_pages 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
        protection: (0, 868, 868)
  pagesets
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         0
Node 0, zone   Normal
  pages free     199713
        min      934
        low      1167
        high     1401
        active   10215
        inactive 4507
        scanned  0 (a: 0 i: 0)
        spanned  225280
        present  222420
    nr_anon_pages 2685
    nr_mapped    1110
    nr_file_pages 12055
    nr_slab_reclaimable 2216
    nr_slab_unreclaimable 1527
    nr_page_table_pages 213
    nr_dirty     0
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
        protection: (0, 0, 0)
  pagesets
    cpu: 0 pcp: 0
              count: 152
              high:  186
              batch: 31
    cpu: 0 pcp: 1
              count: 13
              high:  62
              batch: 15
  vm stats threshold: 16
    cpu: 1 pcp: 0
              count: 34
              high:  186
              batch: 31
    cpu: 1 pcp: 1
              count: 10
              high:  62
              batch: 15
  vm stats threshold: 16
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         4096

Just nuke all that search-for-the-first-non-empty-pageset code.  Dunno why it
was there in the first place..

Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] Avoid excessive sorting of early_node_map[]

find_min_pfn_for_node() and find_min_pfn_with_active_regions() sort
early_node_map[] on every call.  This is an excessive amount of sorting and
that can be avoided.  This patch always searches the whole early_node_map[]
in find_min_pfn_for_node() instead of returning the first value found.  The
map is then only sorted once when required.  Successfully boot tested on a
number of machines.

[akpm@osdl.org: cleanup]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] Remove final references to deprecated "MAP_ANON" page protection flag

Remove the last vestiges of the long-deprecated "MAP_ANON" page protection
flag: use "MAP_ANONYMOUS" instead.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] slab: use parameter passed to cache_reap to determine pointer to work structure

Use the pointer passed to cache_reap to determine the work pointer and
consolidate exit paths.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] slab: cache alloc cleanups

Clean up __cache_alloc and __cache_alloc_node functions a bit. We no
longer need to do NUMA_BUILD tricks and the UMA allocation path is much
simpler. No functional changes in this patch.

Note: saves few kernel text bytes on x86 NUMA build due to using gotos in
__cache_alloc_node() and moving __GFP_THISNODE check in to
fallback_alloc().

Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Christoph Lameter <christoph@lameter.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[PATCH] slab: remove broken PageSlab check from kfree_debugcheck

The PageSlab debug check in kfree_debugcheck() is broken for compound
pages. It is also redundant as we already do BUG_ON for non-slab pages in
page_get_cache() and page_get_slab() which are always called before we free
any actual objects.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

libata: kill ATA_ENABLE_PATA

The ATA_ENABLE_PATA define was never meant to be permanent, and in
recent kernels, it's already been unconditionally enabled. Remove.

Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: build fix after dmesg probe output changes

Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: Early CFA adapters are not required to support mode setting

If we are doing a PIO setup for a CFA card and it blows up with a device
error then assume it is an older CFA card which doesn't support this
rather than failing the device out of existance.

Stands seperate to the quieting patch but that is obviously useful with
this change.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

sata_nv: propagate ata_pci_device_do_resume return value

ata_pci_device_do_resume can fail if the PCI device couldn't be re-enabled.
Update sata_nv to propagate the return value from this call and to not try
to do any other resume activities if it fails. Fixes a compile warning.

Signed-off-by: Robert Hancock <hancockr@shaw.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

sata_nv: wait for response on entering/leaving ADMA mode

Update sata_nv to wait for the controller to indicate via the status
register that it has entered the requested state when switching between
ADMA mode and register mode. This issue came up recently when debugging
some problems with cache flush command timeouts and while it didn't appear
to fix that problem, this is something we should likely be doing in any
case.

Signed-off-by: Robert Hancock <hancockr@shaw.ca>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

sata_nv: use ADMA for NODATA commands

Some problems showed up recently with cache flush commands timing out on
sata_nv.  Previously these commands were always handled by transitioning to
legacy mode from ADMA mode first.  The timeout problem was worked around
already by a change to the interrupt handling code for legacy mode, but for
non-data commands like these it appears we can handle them in ADMA mode, so
the switch to legacy mode is not needed.

This patch changes the behavior so that we use ADMA mode to submit
interrupt-driven commands with ATA_PROT_NODATA protocol.  In addition to
avoiding the problem mentioned above entirely, this avoids the overhead of
switching to legacy mode and back to ADMA mode for handling cache flushes.
When handling non-DMA-mapped commands, we leave the APRD blank and clear
the NV_CPB_CTL_APRD_VALID field in the CPB so the controller does not
attempt to read it.

Signed-off-by: Robert Hancock <hancockr@shaw.ca>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Tejun Heo <htejun@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

sata_nv: cleanup ADMA error handling

This cleans up a few issues with the error handling in sata_nv in ADMA mode
to make it more consistent with other NCQ-capable drivers like ahci and
sata_sil24:

- When a command failed, we would effectively set AC_ERR_DEV on the
  queued command always.  In the case of NCQ commands this prevents libata
  from doing a log page query to determine the details of the failed
  command, since it thinks we've already analyzed.  Just set flags in the
  port ehi->err_mask, then freeze or abort and let libata figure out what
  went wrong.

- The code handled NV_ADMA_STAT_CPBERR as a "really bad error" which
  caused it to set error flags on every queued command.  I don't know
  exactly what this flag means (no docs, grr!) but from what I can guess
  from the standard ADMA spec, it just means that one or more of the CPBs
  had an error, so we just need to go through and do our normal checks in
  this case.

- In the error_handler function the code would always dump the state of
  all the CPBs.  This output seems redundant at this point since libata
  already dumps the state of all active commands on errors (and it also
  triggers at times when it shouldn't, like when suspending).  Take this
  out.

[akpm@osdl.org: many coding-style fixes]
Signed-off-by: Robert Hancock <hancockr@shaw.ca>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Allen Martin <AMartin@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

(2.6.20) pata_mpiix: probing cleanup (resend)

MPIIX has only single channel IDE which can be configured for either primary or
secondary legacy I/O ports and IRQ. So, get rid of the unneeded second probe
entry in mpiix_init_one() and of the invalid (but unused anyway) enable bits in
mpiix_pre_reset().

Warning: this cleanup has only been compile-tested...

Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

(2.6.20) pata_mpiix: fix PIO setup issues

Fix clearing/setting the wrong TIME/IE/PPE bits for a slave drive caused by a
wrong shift count.
Fix the PIO mode 1 being overclocked by wrongly selecting the fast timing bank.
Also, fix/rephrase some comments while at it.

Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

(2.6.20) pata_oldpiix: fix PIO2 underclocking

Fix the PIO mode 2 using mode 0 timings -- this driver should enable the
fast timing bank starting with PIO2, just like the ata_piix driver does.
Also, fix/rephrase some comments while at it.

Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

ata: Add defines for the iordy bits

IORDY and IORDY enable/disable flags.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

Kconfig: clarify ATA_PIIX description

People are getting confused about which drivers to enable for PATA PIIX
type devices. Change the ATA_PIIX line and help to make it clearer.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

sata_inic162x: fix a few glitches in hardreset

* Hardreset must not exit without actually performing reset regardless
of link status. We're resetting the link after all.

* Minor message update.

* 150ms delay is meaningful iff link is online after reset is
complete.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: add 150ms between completion of hardreset and status checking

Follow the old SRST rule and delay 150ms between completion of
hardreset and status checking. Debouncing delay should usually cover
this but debounce duration could be shorter than 150ms under certain
circumstances.

Usefulness depends on host controller implementation but it can't hurt
and serves as a reminder that 2s delay for GoVault should also be
added here.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: rearrange dmesg info to add full ATA revision

Per Jeff's suggestion, this patch rearranges the info printed for ATA
drives into dmesg to add the full ATA firmware revision and model
information, while keeping the output to 2 lines.

Signed-off-by: Eric D. Mudama <edmudama@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

pata_sl82c105: wrong assumptions about compatible PIO modes

Fix the wrong "compatible" PIO mode choices: MWDMA0 has 480 ns cycle while PIO1
only has 383 ns cycle, and MWDMA2 timings matchs those of PIO4 exactly.

Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: add another IRQ calls (libata drivers)

This patch is against each libata driver.

Two IRQ calls are added in ata_port_operations.
- irq_on() is used to enable interrupts.
- irq_ack() is used to acknowledge a device interrupt.

In most drivers, ata_irq_on() and ata_irq_ack() are used for
irq_on and irq_ack respectively.

In some drivers (ex: ahci, sata_sil24) which cannot use them
as is, ata_dummy_irq_on() and ata_dummy_irq_ack() are used.

Signed-off-by: Kou Ishizaki <kou.ishizaki@toshiba.co.jp>
Signed-off-by: Akira Iguchi <akira2.iguchi@toshiba.co.jp>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: add another IRQ calls (core and headers)

This patch is against the libata core and headers.

Two IRQ calls are added in ata_port_operations.
- irq_on() is used to enable interrupts.
- irq_ack() is used to acknowledge a device interrupt.

In most drivers, ata_irq_on() and ata_irq_ack() are used for
irq_on and irq_ack respectively.

In some drivers (ex: ahci, sata_sil24) which cannot use them
as is, ata_dummy_irq_on() and ata_dummy_irq_ack() are used.

Signed-off-by: Kou Ishizaki <kou.ishizaki@toshiba.co.jp>
Signed-off-by: Akira Iguchi <akira2.iguchi@toshiba.co.jp>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

git-libata-all: forward declare struct device

In file included from drivers/infiniband/hw/ipath/ipath_diag.c:44:
include/linux/io.h:35: warning: 'struct device' declared inside parameter list
include/linux/io.h:35: warning: its scope is only this definition or declaration

Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: convert to iomap

Convert libata core layer and LLDs to use iomap.

* managed iomap is used.  Pointer to pcim_iomap_table() is cached at
  host->iomap and used through out LLDs.  This basically replaces
  host->mmio_base.

* if possible, pcim_iomap_regions() is used

Most iomap operation conversions are taken from Jeff Garzik
<jgarzik@pobox.com>'s iomap branch.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

pata_platform: fix devres conversion

devres updates for pata_platform were dropped while merging devres
patches due to merge conflict. This is the updated version.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

iomap: iomap should be in obj-y not in lib-y

devres change moved iomap.o from obj-$(CONFIG_GENERIC_IOMAP) to lib-y
making it not linked if no in-kernel driver uses it. Fix it.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[libata] Shuffle DRV_xxx in core and SiS drivers, to kill warnings

Signed-off-by: Jeff Garzik <jeff@garzik.org>

devres: implement pcim_iomap_regions()

Implement pcim_iomap_regions(). This function takes mask of BARs to
request and iomap. No BAR should have length of zero. BARs are
iomapped using pcim_iomap_table().

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: remove unused functions

Now that all LLDs are converted to use devres, default stop callbacks
are unused. Remove them.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: update libata LLDs to use devres

Update libata LLDs to use devres.  Core layer is already converted to
support managed LLDs.  This patch simplifies initialization and fixes
many resource related bugs in init failure and detach path.  For
example, all converted drivers now handle ata_device_add() failure
gracefully without excessive resource rollback code.

As most resources are released automatically on driver detach, many
drivers don't need or can do with much simpler ->{port|host}_stop().
In general, stop callbacks are need iff port or host needs to be given
commands to shut it down.  Note that freezing is enough in many cases
and ports are automatically frozen before being detached.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: update libata core layer to use devres

Update libata core layer to use devres.

* ata_device_add() acquires all resources in managed mode.

* ata_host is allocated as devres associated with ata_host_release.

* Port attached status is handled as devres associated with
  ata_host_attach_release().

* Initialization failure and host removal is handedl by releasing
  devres group.

* Except for ata_scsi_release() removal, LLD interface remains the
  same.  Some functions use hacky is_managed test to support both
  managed and unmanaged devices.  These will go away once all LLDs are
  updated to use devres.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: implement ata_host_detach()

Implement ata_host_detach() which calls ata_port_detach() for each
port in the host and export it. ata_port_detach() is now internal and
thus un-exported. ata_host_detach() will be used as the 'deregister
from libata layer' function after devres conversion.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

devres: device resource management

Implement device resource management, in short, devres.  A device
driver can allocate arbirary size of devres data which is associated
with a release function.  On driver detach, release function is
invoked on the devres data, then, devres data is freed.

devreses are typed by associated release functions.  Some devreses are
better represented by single instance of the type while others need
multiple instances sharing the same release function.  Both usages are
supported.

devreses can be grouped using devres group such that a device driver
can easily release acquired resources halfway through initialization
or selectively release resources (e.g. resources for port 1 out of 4
ports).

This patch adds devres core including documentation and the following
managed interfaces.

* alloc/free : devm_kzalloc(), devm_kzfree()
* IO region : devm_request_region(), devm_release_region()
* IRQ : devm_request_irq(), devm_free_irq()
* DMA : dmam_alloc_coherent(), dmam_free_coherent(),
  dmam_declare_coherent_memory(), dmam_pool_create(),
  dmam_pool_destroy()
* PCI : pcim_enable_device(), pcim_pin_device(), pci_is_managed()
* iomap : devm_ioport_map(), devm_ioport_unmap(), devm_ioremap(),
  devm_ioremap_nocache(), devm_iounmap(), pcim_iomap_table(),
  pcim_iomap(), pcim_iounmap()

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

fix CONFIG_SATA_SIS=y compile error

Static code shouldn't be used from other modules.

drivers/built-in.o: In function `sis_init_one':
sata_sis.c:(.text+0x7634cd): undefined reference to `sis_info133'
sata_sis.c:(.text+0x7634d6): undefined reference to `sis_info133'

While I was at it, I also moved the prototype of this struct to a header
file.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Tejun Heo <htejun@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

sata_sis: Support for PATA supports

This is quick rework of the patch Uwe proposed but using Kconfig not
ifdefs and user selection to sort out PATA support. Instead of ifdefs and
requiring the user to select both drivers the SATA driver selects the
PATA one.

For neatness I've also moved the extern into the function that uses it.

Signed-off-by: Alan Cox
Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: implement HDIO_GET_IDENTITY

'hdparm -I' doesn't work with ATAPI devices and sg_sat is not widely
spread yet leaving no easy way to access ATAPI IDENTIFY data.
Implement HDIO_GET_IDENTITY such that at least 'hdparm -i' works.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: trivial stuff

Readability/typos etc

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

sata_promise: kill qc->nsect

Merge order left qc->nsect usage in sata_promise dangling. Kill it.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata piix3 support warning fix

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: PIIX3 support

This I believe completes the PIIX range of support for libata

This adds the table entries needed for the PIIX3, both a new PCI
identifier and a new mode list. It also fixes an erroneous access to PCI
configuration 0x48 on non UDMA capable chips.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

sata_promise: handle ATAPI_NODATA ourselves

This patch extends sata_promise to handle ATAPI_NODATA
commands internally. However, commands destined to
ATA_DFLAG_CDB_INTR devices are excluded from this and
continue to be returned to libata.

Concrete changes:
- pdc_atapi_dma_pkt() is renamed to pdc_atapi_pkt(), and is
extended to set up correct headers for NODATA packets
- pdc_qc_prep() calls pdc_atapi_pkt() for ATAPI_NODATA
- pdc_host_intr() handles ATAPI_NODATA
- pdc_qc_issue_prot() sends ATAPI_NODATA packets via the
chip's packet mechanism, except for CDB_INTR devices

Tested on first- and second-generation chips, SATAPI and PATAPI,
with no observable regressions.

Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

sata_promise: issue ATAPI commands as normal packets

This patch (against libata #upstream + the ATAPI cleanup patch)
reimplements sata_promise's ATAPI support to format ATAPI DMA
commands as normal packets, and to issue them via the hardware's
normal packet machinery.

It turns out that the only reason for issuing ATAPI DMA
commands via the pdc_issue_atapi_pkt_cmd() procedure was to
perform two interrupt-fiddling steps for ATA_DFLAG_CDB_INTR
devices. But these steps aren't needed because sata_promise
sets ATA_FLAG_PIO_POLLING, which disables DMA for those devices.
The remaining steps can easily be done in ATA taskfile packets.

Concrete changes:
- pdc_atapi_dma_pkt() is extended to program all packet setup
  steps, and not just contain the CDB; the sequence of steps
  exactly mirrors what pdc_issue_atapi_pkt_cmd() did
- pdc_atapi_dma_pkt() needed more parameters: simplify it by
  just passing 'qc' and having it extract the data it needs
- pdc_issue_atai_pkt_cmd() and its two helper procedures
  pdc_wait_for_drq() and pdc_wait_on_busy() are removed

Tested on first- and second-generation chips, SATAPI and PATAPI,
with no observable regressions.

Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

sata_promise: ATAPI cleanup

Here's a cleanup for yesterday's sata_promise ATAPI patch:
- add and use a symbolic constant for the altstatus register
- check return status from ata_busy_wait()
- add missing newline in a warning printk()
- update comment in pdc_issue_atapi_pkt_cmd() to clarify
  that the maybe-wait-for-INT issue cannot occur in the
  current driver, but may occur if the driver starts issuing
  ATAPI non-DMA commands as PDC packets

Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

sata_inic162x: finally, driver for initio 162x SATA controllers, take #2

Driver for Initio 162x SATA controllers. ATA r/w, ATAPI r, hotplug
and suspend/resume work. ATAPI w (recording, that is) broken. Feel
free to fix it, but be warned, this controller is weird.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: kill qc->nsect and cursect

libata used two separate sets of variables to record request size and
current offset for ATA and ATAPI.  This is confusing and fragile.
This patch replaces qc->nsect/cursect with qc->nbytes/curbytes and
kills them.  Also, ata_pio_sector() is updated to use bytes for
qc->cursg_ofs instead of sectors.  The field used to be used in bytes
for ATAPI and in sectors for ATA.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[libata] sata_vsc: build fix after PCI MSI feature addition

Signed-off-by: Jeff Garzik <jeff@garzik.org>

[libata] sata_vsc: support PCI MSI

Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: handle pci_enable_device() failure while resuming

Handle pci_enable_device() failure while resuming. This patch kills
the "ignoring return value of 'pci_enable_device'" warning message and
propagates __must_check through ata_pci_device_do_resume().

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: use ata_id_c_string()

There were several places where ATA ID strings are manually terminated
and in some places possibly unterminated strings were passed to string
functions which don't limit length like strstr(). This patch converts
all of them over to ata_id_c_string().

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata: straighten out ATA_ID_* constants

* Kill _OFS suffixes in ATA_ID_{SERNO|FW_REV|PROD}_OFS for consistency
  with other ATA_ID_* constants.

* Kill ATA_SERNO_LEN

* Add and use ATA_ID_SERNO_LEN, ATA_ID_FW_REV_LEN and ATA_ID_PROD_LEN.
  This change also makes ata_device_blacklisted() use proper length
  for fwrev.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

sata_nv: add suspend/resume support v3 (Resubmit)

Thoughts from Jeff & company on merging the patch below into libata-dev?
This has been in the -mm tree for over a month now, I haven't heard any
complaints about regressions..

Signed-off-by: Jeff Garzik <jeff@garzik.org>

drivers/ata/: make 4 functions static

This patch makes the following needlessly global functions static:
- libata-core.c: ata_qc_complete_internal()
- libata-scsi.c: ata_scsi_qc_new()
- libata-scsi.c: ata_dump_status()
- libata-scsi.c: ata_to_sense_error()

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

ahci: Remove jmicron fixup

The AHCI set up is handled properly along with the other bits in the
JMICRON quirk. Remove the code whacking it in ahci.c as its un-needed and
also blindly fiddles with bits it doesn't own.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

libata-sff: Don't try and activate channels which are not in use

An ATA controller in native mode may have one or more channels disabled
and not assigned resources. In that case the existing code crashes trying
to access I/O ports 0-7.

Add the neccessary check.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

sata_via: PATA support

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

pata_sis: implement laptop list and add ASUS A6K/A6U

In ASUS A6K/A6U hdd is connected to SiS 96x via 40c cable, however it
is short cable and is UDMA66 capable.

tj: fixed if () conditionals
ah: fixed infinite loop

Signed-off-by: Jakub W. Jozwicki <jakub007@go2.pl>
Cc: Andreas Henriksson <andreas@fatal.se>
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

ata_piix: add ICH7 on Acer 3682WLMi to laptop list

In Acer Aspire hdd is connected to ICH7 via 40c cable, however it is
short cable and it is UDMA66 capable.

Signed-off-by: J J <jakub007@go2.pl>
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

Add pci class code for SATA & AHCI, and replace some magic numbers.

Signed-off-by: Conke Hu <conke.hu@amd.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

sata_promise: ATAPI support

This patch adds ATAPI support to the sata_promise driver.
This has been tested on both first- and second-generation
chips (20378 and 20575), and with both SATAPI and PATAPI
devices. CD-writing works.

SATAPI DMA works on second-generation chips, but on
first-generation chips SATAPI is limited to PIO due
to what appears to be HW limitations.
PATAPI DMA works on both first- and second-generation
chips, but requires the separate PATA support patch
before it can be used on TX2plus chips.

The functional changes to the driver are:
- remove ATA_FLAG_NO_ATAPI from PDC_COMMON_FLAGS
- add ->check_atapi_dma() operation to enable DMA for bulk data
  transfers but force PIO for other ATAPI commands; this filter
  is from Promise's driver and largely matches pata_pdc207x.c
- use a more restrictive ->check_atapi_dma() on first-generation
  chips to force SATAPI to always use PIO
- add handling of ATAPI protocols to pdc_qc_prep(), pdc_host_intr(),
  and pdc_qc_issue_prot(): ATAPI_DMA is handled by the driver
  while non-DMA protocols are handed over to libata generic code
- add pdc_issue_atapi_pkt_cmd() to handle the initial steps in
  issuing ATAPI DMA commands before sending the actual CDB;
  this procedure was ported from Promise's driver

Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

sata_promise: TX2plus PATA support

This patch implements a simple way of setting up per-port
flags on the SATA+PATA Promise TX2plus chips, which is a
prerequisite for supporting the PATA port on those chips.

It is based on the observation that ap->flags isn't really
used until after ->port_start() has been invoked. So it
places the "exceptional" per-port flags array in the driver's
private host structure, and uses it in ->port_start() to
finalise the port's flags.

This patch obsoletes the #promise-sata-pata branch included
in the #all branch.

Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[PATCH] user of the jiffies rounding patch: ATA subsystem

This patch introduces users of the round_jiffies() function: ATA subsystem

This delayed work is of the "about once a second" variety and can be rounded
to coincide with other wakers.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

[PATCH] pci: Move PCI_VDEVICE from libata to core

Updated diff which doesn't move the comment as per Jeff's request and
corrects the docs as per report on l/k

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>