err.no Git - linux-2.6/log

SUNRPC: get rid of cl_chatty

Clean up: Every ULP that uses the in-kernel RPC client, except the NLM
client, sets cl_chatty. There's no reason why NLM shouldn't set it, so
just get rid of cl_chatty and always be verbose.

Test-plan:
Compile with CONFIG_NFS enabled.

Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

SUNRPC: transport switch API for setting port number

At some point, transport endpoint addresses will no longer be IPv4.  To hide
the structure of the rpc_xprt's address field from ULPs and port mappers,
add an API for setting the port number during an RPC bind operation.

Test-plan:
Destructive testing (unplugging the network temporarily).  Connectathon
with UDP and TCP.  NFSv2/3 and NFSv4 mounting should be carefully checked.
Probably need to rig a server where certain services aren't running, or
that returns an error for some typical operation.

Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

SUNRPC: new interface to force an RPC rebind

We'd like to hide fields in rpc_xprt and rpc_clnt from upper layer protocols.
Start by creating an API to force RPC rebind, replacing logic that simply
sets cl_port to zero.

Test-plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked.
Probably need to rig a server where certain services aren't running, or
that returns an error for some typical operation.

Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

SUNRPC: switchable buffer allocation

Add RPC client transport switch support for replacing buffer management
on a per-transport basis.

In the current IPv4 socket transport implementation, RPC buffers are
allocated as needed for each RPC message that is sent.  Some transport
implementations may choose to use pre-allocated buffers for encoding,
sending, receiving, and unmarshalling RPC messages, however.  For
transports capable of direct data placement, the buffers can be carved
out of a pre-registered area of memory rather than from a slab cache.

Test-plan:
Millions of fsx operations.  Performance characterization with "sio" and
"iozone".  Use oprofile and other tools to look for significant regression
in CPU utilization.

Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv3: try get_root user-supplied security_flavor

Thanks to Ed Keizer for bug and root cause. He says: "... we could only mount
the top-level Solaris share. We could not mount deeper into the tree.
Investigation showed that Solaris allows UNIX authenticated FSINFO only on the
top level of the share. This is a problem because we share/export our home
directories one level higher than we mount them. I.e. we share the partition
and not the individual home directories. This prevented access to home
directories."

We still may need to try auth_sys for the case where the client doesn't have
appropriate credentials.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NLM: fix parsing of sm notify procedure

The procedure that decodes statd sm_notify call seems to be skipping a
few arguments. How did this ever work?

>From folks at Polyserve.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NLM: Further cancel fixes

If the server receives an NLM cancel call and finds no waiting lock to
cancel, then chances are the lock has already been applied, and the client
just hadn't yet processed the NLM granted callback before it sent the
cancel.

The Open Group text, for example, perimts a server to return either success
(LCK_GRANTED) or failure (LCK_DENIED) in this case. But returning an error
seems more helpful; the client may be able to use it to recognize that a
race has occurred and to recover from the race.

So, modify the relevant functions to return an error in this case.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NLM: clean up nlmsvc_delete_block

The fl_next check here is superfluous (and possibly a layering violation).

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NLM: don't unlock on cancel requests

Currently when lockd gets an NLM_CANCEL request, it also does an unlock for
the same range.  This is incorrect.

The Open Group documentation says that "This procedure cancels an
*outstanding* blocked lock request."  (Emphasis mine.)

Also, consider a client that holds a lock on the first byte of a file, and
requests a lock on the entire file.  If the client cancels that request
(perhaps because the requesting process is signalled), the server shouldn't
apply perform an unlock on the entire file, since that will also remove the
previous lock that the client was already granted.

Or consider a lock request that actually *downgraded* an exclusive lock to
a shared lock.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NLM: Clean up nlmsvc_grant_reply locking

Slightly simpler logic here makes it more trivial to verify that the up's
and down's are balanced here. Break out an assignment from a conditional
while we're at it.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

SUNRPC: net/sunrpc/xdr.c: remove xdr_decode_string()

This patch removes ths unused function xdr_decode_string().

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Charles Lever <Charles.Lever@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Allow user to set the port used by the NFSv4 callback channel

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFS: Clean up weak cache consistency code

...and ensure that nfs_update_inode() respects wcc

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Ensure DELEGRETURN returns attributes

Upon return of a write delegation, the server will almost always bump the
change attribute. Ensure that we pick up that change so that we don't
invalidate our data cache unnecessarily.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Ensure change attribute returned by GETATTR callback conforms to spec

According to RFC3530 we're supposed to cache the change attribute
at the time the client receives a write delegation.
If the inode is clean, a CB_GETATTR callback by the server to the
client is supposed to return the cached change attribute.
If, OTOH, the inode is dirty, the client should bump the cached
change attribute by 1.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

SUNRPC: Fix a potential race in rpc_pipefs.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFS: Make directIO aware of compound pages...

...and avoid calling set_page_dirty on them

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFS: Make stat() return updated mtimes after a write()

The SuS states that a call to write() will cause mtime to be updated on
the file. In order to satisfy that requirement, we need to flush out
any cached writes in nfs_getattr().
Speed things up slightly by not committing the writes.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Ensure that we return the delegation on the target of a rename too.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFS: support large reads and writes on the wire

Most NFS server implementations allow up to 64KB reads and writes on the
wire. The Solaris NFS server allows up to a megabyte, for instance.

Now the Linux NFS client supports transfer sizes up to 1MB, too. This will
help reduce protocol and context switch overhead on read/write intensive NFS
workloads, and support larger atomic read and write operations on servers
that support them.

Test-plan:
Connectathon and iozone on mount point with wsize=rsize>32768 over TCP.
Tests with NFS over UDP to verify the maximum RPC payload size cap.

Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFS: make "inode number mismatch" message more useful

To help NFS users and server developers, make the "inode number mismatch"
message display more useful information.

Test-plan:
None.

Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFS: get rid of useless kernel log message

nfs_statfs() generates a log message when GETATTR returns an error. This
is usually a useless message. Make it a dprintk.

Test plan:
None

Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFS: simplify inlined bit ops in nfs_page.h

Minor cleanup: inlined bit ops in nfs_page.h can be simpler.

Test plan:
Write-intensive workload against a server that requires COMMITs.

Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFS: Fix error recovery code in fs/nfs/inode.c:__init_nfs()

Red Hat found a problem in the error recovery logic in __init_nfs.

Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFS: use generic_write_checks() to sanity check direct writes

Replace ad hoc write parameter sanity checking in nfs_file_direct_write()
with a call to generic_write_checks(). This should make the proper checks
modulo the O_LARGEFILE flag, and should catch NFSv2-specific limitations by
virtue of i_sb->s_maxbytes.

Test plan:
Posix compliance testing with both NFSv2 and NFSv3.

Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Remove requirement for machine creds for the "setclientid" operation

Use a cred from the nfs4_client->cl_state_owners list.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Remove requirement for machine creds for the "renew" operation

In RFC3530, the RENEW operation is allowed to use either

the same principal, RPC security flavour and (if RPCSEC_GSS), the same
mechanism and service that was used for SETCLIENTID_CONFIRM

OR

Any principal, RPC security flavour and service combination that
currently has an OPEN file on the server.

Choose the latter since that doesn't require us to keep credentials for
the same principal for the entire duration of the mount.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Send RENEW requests to the server only when we're holding state

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFS: Convert instances of kernel_thread() to kthread()

Convert private implementations in NFSv4 state recovery and delegation
code to use kthreads.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: State recovery cleanup

Use wait_on_bit() when waiting for state recovery to complete.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: OPEN/LOCK/LOCKU/CLOSE will automatically renew the NFSv4 lease

Cut down on the number of unnecessary RENEW requests on the wire.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

SUNRPC: Ensure that SIGKILL will always terminate a synchronous RPC call.

...and make sure that the "intr" flag also enables SIGHUP and SIGTERM to
interrupt RPC calls too (as per the Solaris implementation).

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Make DELEGRETURN an interruptible operation.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Convert LOCK rpc call into an asynchronous RPC call

In order to allow users to interrupt/cancel it.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: locking XDR cleanup

Get rid of some unnecessary intermediate structures

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Make open recovery track O_RDWR, O_RDONLY and O_WRONLY correctly

When recovering from a delegation recall or a network partition, we need
to replay open(O_RDWR), open(O_RDONLY) and open(O_WRONLY) separately.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Make nfs4_state track O_RDWR, O_RDONLY and O_WRONLY separately

A closer reading of RFC3530 reveals that OPEN_DOWNGRADE must always
specify a access modes that have been the argument of a previous OPEN
operation.
IOW: doing OPEN(O_RDWR) and then OPEN_DOWNGRADE(O_WRONLY) is forbidden
unless the user called OPEN(O_WRONLY)

In order to fix that, we really need to track the three possible open
states separately.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Make open_confirm() asynchronous too

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Convert open() into an asynchronous RPC call

OPEN is a stateful operation, so we must ensure that it always
completes. In order to allow users to interrupt the operation,
we need to make the RPC call asynchronous, and then wait on
completion (or cancel).

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

SUNRPC: rpc_execute should not return task->tk_status;

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

SUNRPC: Get rid of some unused exports

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Allocate OPEN call RPC arguments using kmalloc()

Cleanup in preparation for making OPEN calls interruptible by the user.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: Make locku use the new RPC "wait on completion" interface.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFSv4: stateful NFSv4 RPC call interface

The NFSv4 model requires us to complete all RPC calls that might
establish state on the server whether or not the user wants to
interrupt it. We may also need to schedule new work (including
new RPC calls) in order to cancel the new state.

The asynchronous RPC model will allow us to ensure that RPC calls
always complete, but in order to allow for "synchronous" RPC, we
want to add the ability to wait for completion.
The waits are, of course, interruptible.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

SUNRPC: Further cleanups

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

RPC: Clean up RPC task structure

Shrink the RPC task structure. Instead of storing separate pointers
for task->tk_exit and task->tk_release, put them in a structure.

Also pass the user data pointer as a parameter instead of passing it via
task->tk_calldata. This enables us to nest callbacks.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

SUNRPC: Yet more RPC cleanups

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

NFS: Work correctly with single-page ->writepage() calls

Ensure that we always initiate flushing of data before we exit
a single-page ->writepage() call.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

identify multipage ->writepages() calls

NFS needs to be able to distinguish between single-page ->writepage() calls and
multipage ->writepages() calls.

For the single-page writepage calls NFS can kick off the I/O within the
context of ->writepage().

For multipage ->writepages calls, nfs_writepage() will leave the I/O pending
and nfs_writepages() will kick off the I/O when it all has been queued up
within NFS.

Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

Merge branch 'post-2.6.15' of git://brick.kernel.dk/data/git/linux-2.6-block

Manual fixup for merge with Jens' "Suspend support for libata", commit
ID 9b847548663ef1039dd49f0eb4463d001e596bc3.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>

x86: remove bogus 'pci=usepirqmask' suggestion when no irq is defined

This was harmless, but for the case of a device that had no irq
pre-defined we would incorrectly suggest that "usepirqmask" might make a
difference. It never would, and the message was just confusing people.

Reported in the dmesg of Etienne Lorrain.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] Suspend support for libata

This patch adds suspend patch to libata, and ata_piix in particular. For
most low level drivers, they should just need to add the 4 hooks to
work. As I can only test ata_piix, I didn't enable it for more
though.

Suspend support is the single most important feature on a notebook, and
most new notebooks have sata drives. It's quite embarrassing that we
_still_ do not support this. Right now, it's perfectly possible to
suspend the drive in mid-transfer.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: allow sync-speed to be controlled per-device

Also export current (average) speed and status in sysfs.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: support adding new devices to md arrays via sysfs

Writing major:minor to md/new_dev will bind that device to the array.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: allow available size of component devices to be set via sysfs

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md-export-rdev-data_offset-via-sysfs-fix

drivers/md/md.c: In function `offset_show':
drivers/md/md.c:1670: warning: long long unsigned int format, different type arg (arg 3)

Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: export rdev->data_offset via sysfs

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: expose device slot information via sysfs

This the role that a device has in an array can be viewed and set.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: keep better track of dev/array size when assembling md arrays

Move the checks - that dev size is never less than array size - into
bind_rdev_to_array to make sure it always happens properly (there is one place
where currently it doesn't).

Also reject any superblock which claims an array size smaller than the device
in question can hold.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: allow md/raid_disks to be settable

If array is active, try to reshape, else just set the value.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: count corrected read errors per drive

Store this total in superblock (As appropriate), and make it available to
userspace via sysfs.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: allow array level to be set textually via sysfs

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: expose md metadata format in sysfs

Allow it to be set to a particular version, or 'none'.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: allow md array component size to be accessed and set via sysfs

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: allow chunk_size to be settable through sysfs

... only before array is started of course.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: fix rdev->pending counts in raid1

When we do a user-requested check/repair, we lose count of the outstanding
requests...

Also make sure that when anything is written to md/sync_action, the
RECOVERY_NEEDED flag is set and the thread is woken up so any changes take
effect.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: make sure bitmap updates are visible through filesystem

When we update a page_cache page in the kernel, we need to flush_dache_page or
userspace might not see the change.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] drivers/md/md.c: make md_new_event() static

Make the needlessly global function md_new_event() static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: make a couple of names in md.c static

.. because they aren't used outside md.c

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: fix typo in comment

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: helper function to match commands written to sysfs files

Commands written to sysfs files may, or my not, be \n terminated. We want to
accept with case. For this we use cmd_match.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: define and use safe_put_page for md

md sometimes call put_page on NULL pointers (treating it like kfree). This is
not safe, so define and use a 'safe_put_page' which checks for NULL.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: remove inappropriate limits in md/bitmap configuration.

The kernel should not be imposing these policy limits: The time between
bitmap updates should certainly be allowed to be more than 15 seconds, and
if someone wants a bitmap chunk size in excess of 4MB, the kernel isn't the
place to stop them.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: fix possible problem in raid1/raid10 error overwriting

The code to overwrite/reread for addressing read errors in raid1/raid10
currently assumes that the read will not alter the buffer which could be used
to write to the next device. This is not a safe assumption to make.

So we split the loops into a overwrite loop and a separate re-read loop, so
that the writing is complete before reading is attempted.

Cc: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: remove personality numbering from md

md supports multiple different RAID level, each being implemented by a
'personality' (which is often in a separate module).

These personalities have fairly artificial 'numbers'.  The numbers
are use to:
1- provide an index into an array where the various personalities
    are recorded
2- identify the module (via an alias) which implements are particular
    personality.

Neither of these uses really justify the existence of personality numbers.
The array can be replaced by a linked list which is searched (array lookup
only happens very rarely).  Module identification can be done using an alias
based on level rather than 'personality' number.

The current 'raid5' modules support two level (4 and 5) but only one
personality.  This slight awkwardness (which was handled in the mapping from
level to personality) can be better handled by allowing raid5 to register 2
personalities.

With this change in place, the core md module does not need to have an
exhaustive list of all possible personalities, so other personalities can be
added independently.

This patch also moves the check for chunksize being non-zero into the ->run
routines for the personalities that need it, rather than having it in core-md.
This has a side effect of allowing 'faulty' and 'linear' not to have a
chunk-size set.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: break out of a loop that doesn't need to run to completion

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: convert recently exported symbol to GPL

...because that seems to be the preferred practice these days.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: convert various kmap calls to kmap_atomic

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: tidy up raid5/6 hash table code

- replace open-coded hash chain with hlist macros

- Fix hash-table size at one page - it is already quite generous, so there
will never be a need to use multiple pages, so no need for __get_free_pages

No functional change.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: convert md to use kzalloc throughout

Replace multiple kmalloc/memset pairs with kzalloc calls.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: clean up 'page' related names in md

Substitute:

  page_cache_get -> get_page
  page_cache_release -> put_page
  PAGE_CACHE_SHIFT -> PAGE_SHIFT
  PAGE_CACHE_SIZE -> PAGE_SIZE
  PAGE_CACHE_MASK -> PAGE_MASK
  __free_page -> put_page

because we aren't using the page cache, we are just using pages.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: make /proc/mdstat pollable

With this patch it is possible to poll /proc/mdstat to detect arrays appearing
or disappearing, to detect failures, recovery starting, recovery completing,
and devices being added and removed.

It is similar to the poll-ability of /proc/mounts, though different in that:

We always report that the file is readable (because face it, it is, even if
only for EOF).

We report POLLPRI when there is a change so that select() can detect
it as an exceptional event. Not only are these exceptional events, but
that is the mechanism that the current 'mdadm' uses to watch for events
(It also polls after a timeout).
(We also report POLLERR like /proc/mounts).

Finally, we only reset the per-file event counter when the start of the file
is read, rather than when poll() returns an event. This is more robust as it
means that an fd will continue to report activity to poll/select until the
program clearly responds to that activity.

md_new_event takes an 'mddev' which isn't currently used, but it will be soon.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: raid10 read-error handling - resync and read-only

Add in correct read-error handling for resync and read-only situations.

When read-only, we don't over-write, so we need to mark the failed drive in
the r10_bio so we don't re-try it.  During resync, we always read all blocks,
so if there is a read error, we simply over-write it with the good block that
we found (assuming we found one).

Note that the recovery case still isn't handled in an interesting way.  There
is nothing useful to do for the 2-copies case.  If there are 3 or more copies,
then we could try reading from one of the non-missing copies, but this is a
bit complicated and very rarely would be used, so I'm leaving it for now.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: auto-correct correctable read errors in raid10

Largely just a cross-port from raid1.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: make sure read error on last working drive of raid1 actually returns failure

We are inadvertently setting the R1BIO_Uptodate bit on read errors when we
decide not to try correcting (because there are no other working devices).
This means that the read error is reported to the client as success.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: allow raid1 to check consistency

Where performing a user-requested 'check' or 'repair', we read all readable
devices, and compare the contents. We only write to blocks which had read
errors, or blocks with content that differs from the first good device found.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: support check-without-repair of raid10 arrays

Also keep count on the number of errors found.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: fix up some rdev rcu locking in raid5/6

There is this "FIXME" comment with a typo in it!!  that been annoying me for
days, so I just had to remove it.

conf->disks[i].rdev should only be accessed if
  - we know we hold a reference or
  - the mddev->reconfig_sem is down or
  - we have a rcu_readlock

handle_stripe was referencing rdev in three places without any of these.  For
the first two, get an rcu_readlock.  For the last, the same access
(md_sync_acct call) is made a little later after the rdev has been claimed
under and rcu_readlock, if R5_Syncio is set.  So just use that access...
However R5_Syncio isn't really needed as the 'syncing' variable contains the
same information.  So use that instead.

Issues, comment, and fix are identical in raid5 and raid6.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: handle errors when read-only

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: better handling for read error in raid1 during resync

Handling of read errors during resync is separate from handling of read errors
during normal IO in raid1. A previous patch added support for read errors
during normal IO. This one adds support for read errors during resync or
recovery.

The key differences are that we don't need to freeze the array, because the
normal handling of resync means that this part of the array will be idle
except for resync, and the read/overwrite/re-read is needed in a separate
piece of code.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: tidyup some issues with raid1 resync and prepare for catching read errors

We are dereferencing ->rdev without an rcu lock!

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: attempt to auto-correct read errors in raid1

On a read-error we suspend the array, then synchronously read the block from
other arrays until we find one where we can read it. Then we try writing the
good data back everywhere and make sure it works. If any write or subsequent
read fails, only then do we fail the device out of the array.

To be able to suspend the array, we need to also keep track of how many
requests are queued for handling by raid1d.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: improve handing of read errors with raid6

This is a simple port of match functionality across from raid5. If we get a
read error, we don't kick the drive straight away, but try to over-write with
good data first.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: fix raid6 resync check/repair code

raid6 currently does not check the P/Q syndromes when doing a resync, it just
calculates the correct value and writes it. Doing the check can reduce writes
(often to 0) for a resync, and it is needed to properly implement the

echo check > sync_action

operation.

This patch implements the appropriate checks and tidies up some related code.

It also allows raid6 user-requested resync to bypass the intent bitmap.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: write intent bitmap support for raid10

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: move bitmap_create to after md array has been initialised

This is important because bitmap_create uses
mddev->resync_max_sectors
and that doesn't have a valid value until after the array
has been initialised (with pers->run()).
[It doesn't make a difference for current personalities that
support bitmaps, but will make a difference for raid10]

This has the added advantage of meaning with can move the thread->timeout
manipulation inside the bitmap.c code instead of sprinkling identical code
throughout all personalities.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: allow dirty raid[456] arrays to be started at boot

See patch to md.txt for more details

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: small cleanups for raid5

Resync code:
  A test that isn't needed,
  a 'compute_block' that makes more sense
    elsewhere (And then doesn't need a test),
  a couple of BUG_ONs to confirm the change makes sense.

Printks:
  A few were missing KERN_*

Also fix a typo in a comment..

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: improve raid10 "IO Barrier" concept

raid10 needs to put up a barrier to new requests while it does resync or other
background recovery. The code for this is currently open-coded, slighty
obscure by its use of two waitqueues, and not documented.

This patch gathers all the related code into 4 functions, and includes a
comment which (hopefully) explains what is happening.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

[PATCH] md: improve raid1 "IO Barrier" concept

raid1 needs to put up a barrier to new requests while it does resync or other
background recovery. The code for this is currently open-coded, slighty
obscure by its use of two waitqueues, and not documented.

This patch gathers all the related code into 4 functions, and includes a
comment which (hopefully) explains what is happening.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>