Angelo Castello [Thu, 6 Mar 2008 03:50:53 +0000 (12:50 +0900)]
rtc: rtc-sh: Add support for periodic IRQs.
This adds support for periodic IRQs to the rtc-sh driver.
RTC_IRQP_READ/RTC_IRQP_SET are added, with a number of other fixes and
reordering across the rest of the code.
Signed-off-by: Angelo Castello <angelo.castello@st.com> Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Magnus Damm [Tue, 4 Mar 2008 23:23:45 +0000 (15:23 -0800)]
sh: SuperH KEYSC platform driver
Add a platform driver for the SuperH KEYSC block. The driver expects to get
mode, timing information and keypad layout from the board code as platform
data. The board code is responsible for pin configuration.
Both sh7343 and sh7722 should be supported, but only the sh7722 processor has
been tested so far. SH_KEYSC_MODE_3 is yet to be tested.
Signed-off-by: Magnus Damm <damm@igel.co.jp> Cc: Dmitry Torokhov <dtor@mail.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Merge branch 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6
* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6: (87 commits)
[XFS] Fix merge failure
[XFS] The forward declarations for the xfs_ioctl() helpers and the
[XFS] Update XFS documentation for noikeep/ikeep.
[XFS] Update XFS Documentation for ikeep and ihashsize
[XFS] Remove unused HAVE_SPLICE macro.
[XFS] Remove CONFIG_XFS_SECURITY.
[XFS] xfs_bmap_compute_maxlevels should be based on di_forkoff
[XFS] Always use di_forkoff when checking for attr space.
[XFS] Ensure the inode is joined in xfs_itruncate_finish
[XFS] Remove periodic logging of in-core superblock counters.
[XFS] fix logic error in xfs_alloc_ag_vextent_near()
[XFS] Don't error out on good I/Os.
[XFS] Catch log unmount failures.
[XFS] Sanitise xfs_log_force error checking.
[XFS] Check for errors when changing buffer pointers.
[XFS] Don't allow silent errors in xfs_inactive().
[XFS] Catch errors from xfs_imap().
[XFS] xfs_bulkstat_one_dinode() never returns an error.
[XFS] xfs_iflush_fork() never returns an error.
[XFS] Catch unwritten extent conversion errors.
...
Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx:
dmaengine: ack to flags: make use of the unused bits in the 'ack' field
iop-adma: remove the workaround for missed interrupts on iop3xx
async_tx: kill ->device_dependency_added
async_tx: fix multiple dependency submission
fsldma: Split the MPC83xx event from MPC85xx and refine irq codes.
fsldma: Remove CONFIG_FSL_DMA_SELFTEST, keep fsl_dma_self_test() running always.
Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev: (79 commits)
ata-acpi: don't call _GTF for disabled drive
sata_mv add temporary 3 second init delay for SiliconImage PMs
sata_mv remove redundant edma init code
sata_mv add basic port multiplier support
sata_mv fix SOC flags, enable NCQ on SOC
sata_mv disable hotplug for now
sata_mv cosmetics
sata_mv hardreset rework
[libata] improve Kconfig help text for new PMP, SFF options
libata: make EH fail gracefully if no reset method is available
libata: Be a bit more slack about early devices
libata: cable logic
libata: move link onlineness check out of softreset methods
libata: kill dead code paths in reset path
pata_scc: fix build breakage
libata: make PMP support optional
libata: implement PMP helpers
libata: separate PMP support code from core code
libata: make SFF support optional
libata: don't use ap->ioaddr in non-SFF drivers
...
* git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt:
clocksource: make clocksource watchdog cycle through online CPUs
Documentation: move timer related documentation to a single place
clockevents: optimise tick_nohz_stop_sched_tick() a bit
locking: remove unused double_spin_lock()
hrtimers: simplify lockdep handling
timers: simplify lockdep handling
posix-timers: fix shadowed variables
timer_list: add annotations to workqueue.c
hrtimer: use nanosleep specific restart_block fields
hrtimer: add nanosleep specific restart_block member
Merge branch 'semaphore' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc
* 'semaphore' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc:
Remove DEBUG_SEMAPHORE from Kconfig
Improve semaphore documentation
Simplify semaphore implementation
Add down_timeout and change ACPI to use it
Introduce down_killable()
Generic semaphore implementation
Add semaphore.h to kernel_lock.c
Fix quota.h includes
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (104 commits)
IB/iser: Don't change itt endianness
IB/mlx4: Update module version and release date
IPoIB: Handle case when P_Key is deleted and re-added at same index
IB/iser: Release connection resources on RDMA_CM_EVENT_DEVICE_REMOVAL event
IB/mlx4: Fix incorrect comment
IB/mlx4: Fix race when detaching a QP from a multicast group
IB/ehca: Support all ibv_devinfo values in query_device() and query_port()
RDMA/nes: Free IRQ before killing tasklet
IB/mthca: Update module version and release date
IB/mlx4: Update QP state if query QP succeeds
IB/mthca: Update QP state if query QP succeeds
RDMA/amso1100: Add check for NULL reply_msg in c2_intr()
IB/mlx4: Add support for resizing CQs
IB/mlx4: Add support for modifying CQ moderation parameters
IPoIB: Support modifying IPoIB CQ event moderation
IB/core: Add support for modify CQ
IPoIB: Add basic ethtool support
mlx4_core: Increase max number of QPs to 128K
RDMA/amso1100: Add support for "send with invalidate" work requests
IB/core: Add support for "send with invalidate" work requests
...
Merge branch 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: (36 commits)
[S390] Remove code duplication from monreader / dcssblk.
[S390] kernel: show last breaking-event-address on oops
[S390] lowcore: Change type of lowcores softirq_pending to __u32.
[S390] zcrypt: Comments and kernel-doc cleanup
[S390] uaccess: Always access the correct address space.
[S390] Fix a lot of sparse warnings.
[S390] Convert s390 to GENERIC_CLOCKEVENTS.
[S390] genirq/clockevents: move irq affinity prototypes/inlines to interrupt.h
[S390] Convert monitor calls to function calls.
[S390] qdio (new feature): enhancing info-retrieval from QDIO-adapters
[S390] replace remaining __FUNCTION__ occurrences
[S390] remove redundant display of free swap space in show_mem()
[S390] qdio: remove outdated developerworks link.
[S390] Add debug_register_mode() function to debug feature API
[S390] crypto: use more descriptive function names for init/exit routines.
[S390] switch sched_clock to store-clock-extended.
[S390] zcrypt: add support for large random numbers
[S390] hw_random: allow rng_dev_read() to return hardware errors.
[S390] Vertical cpu management.
[S390] cpu topology support for s390.
...
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slub: No need for per node slab counters if !SLUB_DEBUG
slub: Move map/flag clearing to __free_slab
slub: Fixes to per cpu stat output in sysfs
slub: Deal with config variable dependencies
slub: Reduce #ifdef ZONE_DMA by moving kmalloc_caches_dma near dma logic
slub: Initialize per-cpu stats
Roland McGrath [Fri, 18 Apr 2008 01:44:38 +0000 (18:44 -0700)]
ptrace_signal subroutine
This breaks out the ptrace handling from get_signal_to_deliver into a
new subroutine. The actual code there doesn't change, and it gets
inlined into nearly identical compiled code. This makes the function
substantially shorter and thus easier to read, and it nicely isolates
the ptrace magic.
Signed-off-by: Roland McGrath <roland@redhat.com> Acked-by: Kyle McMartin <kyle@mcmartin.ca> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Got burned by setting the proposed default of 65536
across all Debian archs.
Thus proposing to be more specific on which archs you may
set this. Also propose a value for arm and friends that
doesn't break sshd.
Reword to mention working archs ia64 and ppc64 too.
Signed-off-by: maximilian attems <max@stro.at> Cc: Martin Michlmayr <tbm@cyrius.com> Cc: Gordon Farquharson <gordonfarquharson@gmail.com> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
Paul Moore [Thu, 10 Apr 2008 14:48:14 +0000 (10:48 -0400)]
SELinux: Add network port SID cache
Much like we added a network node cache, this patch adds a network port
cache. The design is taken almost completely from the network node cache
which in turn was taken from the network interface cache. The basic idea is
to cache entries in a hash table based on protocol/port information. The
hash function only takes the port number into account since the number of
different protocols in use at any one time is expected to be relatively
small.
Signed-off-by: Paul Moore <paul.moore@hp.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org>
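The caching idea described in the commit above can be sketched in plain C. This is a minimal userspace illustration, not the SELinux code: the names port_cache_hash(), port_cache_add(), port_cache_find(), the entry layout and the table size are all hypothetical. It shows a hash that uses only the port number, with the protocol compared during the bucket walk.

```c
#include <stdint.h>
#include <stdlib.h>

#define PORT_CACHE_BUCKETS 256u  /* hypothetical table size (power of two) */

struct port_entry {
    uint8_t protocol;            /* e.g. 6 for TCP, 17 for UDP */
    uint16_t port;
    uint32_t sid;
    struct port_entry *next;     /* chain of entries sharing a bucket */
};

static struct port_entry *port_cache[PORT_CACHE_BUCKETS];

/* Hash on the port number only; the protocol is compared during lookup. */
static unsigned int port_cache_hash(uint16_t port)
{
    return port & (PORT_CACHE_BUCKETS - 1u);
}

/* Insert a (protocol, port) -> sid mapping; returns 0 on success, -1 on OOM. */
static int port_cache_add(uint8_t protocol, uint16_t port, uint32_t sid)
{
    struct port_entry *e = malloc(sizeof(*e));
    unsigned int idx;

    if (!e)
        return -1;
    e->protocol = protocol;
    e->port = port;
    e->sid = sid;
    idx = port_cache_hash(port);
    e->next = port_cache[idx];
    port_cache[idx] = e;
    return 0;
}

/* Look up the SID; returns 1 and fills *sid on a hit, 0 on a miss. */
static int port_cache_find(uint8_t protocol, uint16_t port, uint32_t *sid)
{
    struct port_entry *e;

    for (e = port_cache[port_cache_hash(port)]; e; e = e->next) {
        if (e->port == port && e->protocol == protocol) {
            *sid = e->sid;
            return 1;
        }
    }
    return 0;
}
```

Two protocols on the same port land in the same bucket and are told apart only by the comparison in port_cache_find(), which is cheap as long as few protocols are in use at once.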
Eric Paris [Mon, 31 Mar 2008 01:17:33 +0000 (12:17 +1100)]
selinux: introduce permissive types
Introduce the concept of a permissive type. A new ebitmap is introduced to
the policy database which indicates if a given type has the permissive bit
set or not. This bit is tested for the scontext of any denial. The bit is
meaningless on types which only appear as the target of a decision and never
the source. A domain running with a permissive type will be allowed to
perform any action similarly to when the system is globally set permissive.
Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org>
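A rough sketch of the permissive-type check described above, in userspace C. The kernel uses an ebitmap in the policy database; here a single 64-bit word stands in for it, and set_permissive(), type_is_permissive() and access_permitted() are hypothetical names chosen for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_TYPES 64u   /* hypothetical upper bound on type values */

/* One bit per type; a set bit marks the type as permissive. */
static uint64_t permissive_map;

static void set_permissive(unsigned int type)
{
    if (type < MAX_TYPES)
        permissive_map |= (uint64_t)1 << type;
}

static bool type_is_permissive(unsigned int type)
{
    return type < MAX_TYPES && ((permissive_map >> type) & 1u);
}

/*
 * Decide the outcome of an access check. 'policy_allows' is the policy
 * decision; on a denial only the source type's permissive bit is tested.
 */
static bool access_permitted(unsigned int source_type, unsigned int target_type,
                             bool policy_allows)
{
    (void)target_type;  /* the bit is meaningless for target-only types */
    if (policy_allows)
        return true;
    return type_is_permissive(source_type);
}
```

Marking type 5 permissive lets a denied access with source type 5 proceed, while the same denial with type 5 only as the target still fails.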
Roland McGrath [Wed, 26 Mar 2008 22:46:39 +0000 (15:46 -0700)]
selinux: remove ptrace_sid
This changes checks related to ptrace to get rid of the ptrace_sid tracking.
It's good to disentangle the security model from the ptrace implementation
internals. It's sufficient to check against the SID of the ptracer at the
time a tracee attempts a transition.
Signed-off-by: Roland McGrath <roland@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org>
Eric Paris [Tue, 11 Mar 2008 18:19:34 +0000 (14:19 -0400)]
SELinux: requesting no permissions in avc_has_perm_noaudit is a BUG()
This patch turns the case where we have a call into avc_has_perm with no
requested permissions into a BUG_ON. All callers to this should be in
the kernel and thus should be a function we need to fix if we ever hit
this. The /selinux/access permission checking is done directly in the
security server and not through the avc, so those requests which we
cannot control from userspace should not be able to trigger this BUG_ON.
Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Stephen D. Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org>
Andrew Morton [Wed, 5 Mar 2008 23:05:08 +0000 (10:05 +1100)]
security: code cleanup
ERROR: "(foo*)" should be "(foo *)"
#168: FILE: security/selinux/hooks.c:2656:
+ "%s, rc=%d\n", __func__, (char*)value, -rc);
total: 1 errors, 0 warnings, 195 lines checked
./patches/security-replace-remaining-__function__-occurences.patch has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Harvey Harrison <harvey.harrison@gmail.com> Cc: James Morris <jmorris@namei.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: James Morris <jmorris@namei.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: James Morris <jmorris@namei.org>
Eric Paris [Thu, 28 Feb 2008 17:58:40 +0000 (12:58 -0500)]
SELinux: create new open permission
Adds a new open permission inside SELinux when 'opening' a file. The idea
is that opening a file and reading/writing to that file are not the same
thing. It's different if a program had its stdout redirected to /tmp/output
than if the program tried to directly open /tmp/output. This should allow
policy writers to more liberally give read/write permissions across the
policy while still blocking many design and programming flaws SELinux is so
good at catching today.
Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Reviewed-by: Paul Moore <paul.moore@hp.com> Signed-off-by: James Morris <jmorris@namei.org>
James Morris [Tue, 26 Feb 2008 09:42:02 +0000 (20:42 +1100)]
SELinux: unify printk messages
Replace "security:" prefixes in printk messages with "SELinux"
to help users identify the source of the messages. Also fix a
couple of minor formatting issues.
Paul Moore [Mon, 25 Feb 2008 16:40:33 +0000 (11:40 -0500)]
SELinux: Correct the NetLabel locking for the sk_security_struct
The RCU/spinlock locking approach for the nlbl_state in the sk_security_struct
was almost certainly overkill. This patch removes both the RCU and spinlock
locking, relying on the existing socket locks to handle the case of multiple
writers. This change also makes several code reductions possible.
Less locking, less code - it's a Good Thing.
Signed-off-by: Paul Moore <paul.moore@hp.com> Signed-off-by: James Morris <jmorris@namei.org>
[XFS] The forward declarations for the xfs_ioctl() helpers and the
associated comment about gcc behavior really aren't needed; all of these
functions are marked STATIC which includes noinline, and the stack usage
won't be a problem.
This effectively just removes the forward declarations and moves
xfs_ioctl() back to the end of the file.
Josef Sipek [Fri, 11 Apr 2008 07:11:02 +0000 (17:11 +1000)]
[XFS] Update XFS documentation for noikeep/ikeep.
Mention how DMAPI affects default for noikeep.
Slightly modified since Josef's patch was based on
an old xfs.txt prior to Dave's (dgc) checkin which
missed going to oss.
Signed-off-by: Josef Sipek <jeffpc@josefsipek.net> Signed-off-by: Tim Shimmin <tes@sgi.com>
Eric Sandeen [Thu, 17 Apr 2008 06:50:22 +0000 (16:50 +1000)]
[XFS] Remove CONFIG_XFS_SECURITY.
There is no point to the CONFIG_XFS_SECURITY option; it disables the
ability to set security attributes at runtime, but it does not actually
slim down or remove any code for runtime. Just remove it and always allow
security attributes to be set.
Tim Shimmin [Thu, 17 Apr 2008 06:50:16 +0000 (16:50 +1000)]
[XFS] xfs_bmap_compute_maxlevels should be based on di_forkoff
Fix up xfs_bmap_compute_maxlevels() to account for the case when we go
from using attr2 to using attr1. In that case attr1 will no longer
necessarily be at m_attr_offset>>3, but could be at a different value for
di_forkoff. Therefore, we return the worst case scenario using MINDBTPTRS
and MINABTPTRS, as this function is used for determining the maximum log
space.
Eric Sandeen [Thu, 17 Apr 2008 06:50:09 +0000 (16:50 +1000)]
[XFS] Always use di_forkoff when checking for attr space.
In the case where we mount a filesystem which was previously using the
attr2 format as attr1, returning the default mp->m_attroffset instead of
the per-inode di_forkoff for inline attribute fit calculations, may result
in corruption, if for example, the data fork is already taking more space
than the default fork offset and we try to add an extended attribute. Fix
tested by xfstests/186.
David Chinner [Thu, 17 Apr 2008 06:49:55 +0000 (16:49 +1000)]
[XFS] Remove periodic logging of in-core superblock counters.
xfssyncd triggers the logging of superblock counters every 30s if the
filesystem is made with lazy-count=1. This will prevent disks from idling
and spinning down as there will be a log write every 30s. With the way
counter recovery works for lazy-count=1, this code is unnecessary and
provides no real benefit, so just remove it.
David Chinner [Thu, 10 Apr 2008 02:24:30 +0000 (12:24 +1000)]
[XFS] Sanitise xfs_log_force error checking.
xfs_log_force() is declared to return an error, but we almost never check
it. We don't need to check it in most cases; if there's a log I/O error
then we'll be shutting down the filesystem anyway and that means we'll
catch the error somewhere else.
However, on certain calls we should be returning an error - sync
transactions, fsync, sync writes, etc. so this isn't a pure black and
white distinction. Hence make xfs_log_force() a void function that issues
a warning to the syslog on error, and call _xfs_log_force() in all the
places where we actually care about the error status returned.
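The split between a void wrapper and an error-returning variant can be sketched as follows. This is a userspace illustration under stated assumptions: log_force_checked() plays the role of _xfs_log_force(), log_force() plays the role of xfs_log_force(), and the error source and warning counter are hypothetical stand-ins.

```c
#include <stdio.h>

static int warnings_logged;      /* counts warnings, for illustration only */
static int simulated_log_error;  /* hypothetical stand-in for a log I/O failure */

/* Error-returning variant: used where the caller must act on failure. */
static int log_force_checked(void)
{
    return simulated_log_error;
}

/* Void variant: callers that ignore the result still leave a warning. */
static void log_force(void)
{
    int error = log_force_checked();

    if (error) {
        warnings_logged++;
        fprintf(stderr, "log_force: error %d returned\n", error);
    }
}

/* A sync-style caller that cares about the status uses the checked form. */
static int do_sync(void)
{
    return log_force_checked();
}
```

The design choice is that ignoring the status becomes explicit in the function's type, while paths such as fsync keep access to the error.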
David Chinner [Thu, 10 Apr 2008 02:24:17 +0000 (12:24 +1000)]
[XFS] Don't allow silent errors in xfs_inactive().
xfs_inactive() fails to report errors when committing the inactive
transaction. Hence we can get silent failures either finishing off the
truncation or committing the transaction. Even if we get errors, we need
to continue, so simply warn loudly to the system if we get errors here.
David Chinner [Thu, 10 Apr 2008 02:23:52 +0000 (12:23 +1000)]
[XFS] Catch unwritten extent conversion errors.
On unwritten I/O completion, we fail to propagate an error when converting
the extent to a written extent. This means that the I/O silently fails.
Propagate the error onto the ioend so that the inode is marked with an
error appropriately.
David Chinner [Thu, 10 Apr 2008 02:23:46 +0000 (12:23 +1000)]
[XFS] xfs_bdwrite() does not return errors.
xfs_bdwrite() cannot return an error; it only queues buffers to the
delayed write list and as such never encounters anything that can fail.
Mark it void.
David Chinner [Thu, 10 Apr 2008 02:22:24 +0000 (12:22 +1000)]
[XFS] Ensure xfs_bawrite() errors are checked.
xfs_bawrite() can return immediate error status on async writes. Unlike
xfsbdstrat() we don't ever check the error on the buffer after the call,
so we currently do not catch errors at all here. Ensure we catch and
propagate or warn to the syslog about up-front async write errors.
David Chinner [Thu, 10 Apr 2008 02:22:17 +0000 (12:22 +1000)]
[XFS] Ensure errors from xfs_bdstrat() are correctly checked.
xfsbdstrat() is declared to return an error. That is never checked because
the error is propagated by the xfs_buf_t that is passed through the
function.
Mark xfsbdstrat() as returning void and comment the prototype on the
methods needed for error checking.
David Chinner [Thu, 10 Apr 2008 02:21:53 +0000 (12:21 +1000)]
[XFS] Check for xfs_free_extent() failing.
xfs_free_extent() can fail, but log recovery never bothers to check if it
successfully freed the extent it was supposed to. This could lead to silent
corruption during log recovery. Abort log recovery if we fail to free an
extent.
David Chinner [Thu, 10 Apr 2008 02:21:46 +0000 (12:21 +1000)]
[XFS] Warn if errors come from block_truncate_page().
block_truncate_page() can return errors that we currently ignore and
silently discard. We should not ever get errors reported here - an error
indicates a bug somewhere else. Hence catch the error and issue a stack
dump to the syslog because we cannot propagate the error any further up
the call chain.
David Chinner [Thu, 10 Apr 2008 02:21:32 +0000 (12:21 +1000)]
[XFS] Make xfs_alloc_compute_aligned() void.
xfs_alloc_compute_aligned() returns a value based on a comparison of the
computed extent length and the minimum length allowed. This is only used
by some callers - the other four return parameters are used more often.
Hence move the comparison to the code that actually needs to do it and
make xfs_alloc_compute_aligned() a void function.
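The refactoring described above, where the length comparison moves out of the helper and into the callers that need it, might look like the sketch below. It is not the XFS code: compute_aligned() and extent_usable() are hypothetical names, and the rounding logic is an assumption made for the sake of a runnable example.

```c
#include <stdint.h>

/* Compute the aligned start and remaining length; no verdict is returned. */
static void compute_aligned(uint64_t bno, uint64_t len, uint64_t alignment,
                            uint64_t *resbno, uint64_t *reslen)
{
    if (alignment > 1 && len >= alignment) {
        uint64_t aligned = ((bno + alignment - 1) / alignment) * alignment;
        uint64_t diff = aligned - bno;

        *resbno = aligned;
        *reslen = diff >= len ? 0 : len - diff;
    } else {
        *resbno = bno;
        *reslen = len;
    }
}

/* A caller that needs the minimum-length check performs it itself. */
static int extent_usable(uint64_t bno, uint64_t len, uint64_t alignment,
                         uint64_t minlen)
{
    uint64_t resbno, reslen;

    compute_aligned(bno, len, alignment, &resbno, &reslen);
    return reslen >= minlen;
}
```

For an extent starting at block 5 with length 10 and an alignment of 4, the aligned start is 8 and 7 blocks remain, so the extent is usable for a minimum length of 7 but not 8.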
David Chinner [Thu, 10 Apr 2008 02:21:25 +0000 (12:21 +1000)]
[XFS] Clean up xfs_alloc_search_busy() return values.
xfs_alloc_search_busy() returns an index into the busy array if the extent
was found in the array. This is never checked, and the
xfs_alloc_search_busy() does a log force to prevent reuse of the extent
before the free transaction hits the disk. Hence the return value is
useless. Declare the function void and remove the slot number from the
tracing as well.
David Chinner [Thu, 10 Apr 2008 02:21:18 +0000 (12:21 +1000)]
[XFS] Propagate errors from xfs_trans_commit().
xfs_trans_commit() can return errors when there are problems in the
transaction subsystem. They are indicative that the entire transaction may
be incomplete, and hence the error should be propagated as there is a good
possibility that there is something fatally wrong in the filesystem. Catch
and propagate or warn about commit errors in the places where they are
currently ignored.
David Chinner [Thu, 10 Apr 2008 02:21:11 +0000 (12:21 +1000)]
[XFS] Propagate xfs_trans_reserve() errors.
xfs_trans_reserve() reports errors that should not be ignored. For
example, a shutdown filesystem will report errors through
xfs_trans_reserve() to prevent further changes from being attempted on a
damaged filesystem. Catch and propagate all error conditions from
xfs_trans_reserve().
David Chinner [Thu, 10 Apr 2008 02:20:45 +0000 (12:20 +1000)]
[XFS] Catch errors when turning off quotas.
When turning off quota, we need to write various transactions to the log
to ensure that they are cleanly removed in the case of a crash. We need to
check that the transactions hit the disk correctly. If we fail to write
the final quota off transaction, we are corrupt in memory and so the only
option is to shut the filesystem down at this point.
David Chinner [Thu, 10 Apr 2008 02:20:31 +0000 (12:20 +1000)]
[XFS] Clean up quotamount error handling.
xfs_qm_mount_quotas() returns an error status that is ignored. If we fail
to mount quotas, we continue with quotas turned off, which is all handled
inside xfs_qm_mount_quotas(). Mark it as void to indicate that errors need
not be returned to the callers.
David Chinner [Thu, 10 Apr 2008 02:20:24 +0000 (12:20 +1000)]
[XFS] Check for dquot flush errors
xfs_qm_dqflush() can fail, but the return is not checked anywhere. Hence
we never know if we've failed to flush a dquot to disk. Propagate the
error and warn to the syslog if a flush ever fails.
David Chinner [Thu, 10 Apr 2008 02:20:03 +0000 (12:20 +1000)]
[XFS] Report errors from xfs_reserve_blocks().
xfs_reserve_blocks() can fail in interesting ways. In neither case is it a
fatal error, but the result can lead to sub-optimal behaviour. Warn to the
syslog if the call fails but otherwise continue.
David Chinner [Thu, 10 Apr 2008 02:19:02 +0000 (12:19 +1000)]
[XFS] Fix lock inversion in forced shutdown.
Recent changes to xlog_state_release_iclog() placed the grant_lock inside
the icloglock. Forced unmount of the log does this the opposite way
around, but does not depend on the order for correct working. Fix the
inversion by changing the order locks are gained in
xfs_log_force_umount().
David Chinner [Thu, 10 Apr 2008 02:18:54 +0000 (12:18 +1000)]
[XFS] Reorganise xlog_t for better cacheline isolation of contention
To reduce contention on the log on large CPU count systems, separate out different
parts of the xlog_t structure onto different cachelines. Move each lock
onto a different cacheline along with all the members that are
accessed/modified while that lock is held.
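The layout idea can be illustrated with C11 alignment specifiers. This is only a sketch: the 64-byte cacheline size, struct log_state, struct lock and the data members are assumptions; only the lock names icloglock and grant_lock come from the commit messages in this log.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE 64  /* assumed cacheline size */

struct lock { int locked; };  /* hypothetical stand-in for a spinlock */

struct log_state {
    /* Fields touched under icloglock share one cacheline... */
    alignas(CACHELINE) struct lock icloglock;
    int iclog_count;

    /* ...and fields touched under grant_lock start on another. */
    alignas(CACHELINE) struct lock grant_lock;
    long grant_reserve_bytes;
};
```

With this layout a CPU spinning on grant_lock does not bounce the cacheline that holds icloglock and its data, at the cost of some padding in the structure.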
David Chinner [Thu, 10 Apr 2008 02:18:46 +0000 (12:18 +1000)]
[XFS] Remove the xlog_ticket allocator
The ticket allocator is just a simple slab implementation internal to the
log. It requires the icloglock to be held when manipulating it and this
contributes to contention on that lock.
Just kill the entire allocator and use a memory zone instead. While there,
allow us to gracefully fail allocation with ENOMEM.
David Chinner [Thu, 10 Apr 2008 02:18:39 +0000 (12:18 +1000)]
[XFS] Per iclog callback chain lock
Rather than use the icloglock for protecting the iclog completion callback
chain, use a new per-iclog lock so that walking the callback chain doesn't
require holding a global lock.
This reduces contention on the icloglock during transaction commit and log
I/O completion by reducing the number of times we need to hold the global
icloglock during these operations.
While investigating the extent corruption bug I ran into this bug in debug
only code. xfs_bmap_check_leaf_extents() loops through the leaf blocks of
the extent btree checking that every extent is entirely before the next
extent. It also compares the last extent in the previous block to the
first extent in the current block when the previous block has been
released and potentially unmapped. So take a copy of the last extent
instead of a pointer. Also move the last extent check out of the loop
because we only need to do it once.
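The fix described above, keeping a copy of the last extent rather than a pointer into a block that may already be released, can be sketched like this. The types and the function leaf_extents_ordered() are hypothetical and the sketch runs in userspace; it is not the XFS debug code.

```c
#include <stdbool.h>
#include <stddef.h>

struct extent { unsigned long start, len; };

struct leaf_block {
    const struct extent *recs;
    size_t nrecs;
};

/* True if every extent ends at or before the start of the next one. */
static bool leaf_extents_ordered(const struct leaf_block *blocks, size_t nblocks)
{
    struct extent last = {0, 0};  /* a copy, valid after the block is released */
    bool have_last = false;
    size_t b, i;

    for (b = 0; b < nblocks; b++) {
        const struct extent *recs = blocks[b].recs;
        size_t n = blocks[b].nrecs;

        if (n == 0)
            continue;

        /* Cross-block check, done once per block against the saved copy. */
        if (have_last && last.start + last.len > recs[0].start)
            return false;

        /* Adjacent extents within the current block. */
        for (i = 0; i + 1 < n; i++)
            if (recs[i].start + recs[i].len > recs[i + 1].start)
                return false;

        last = recs[n - 1];       /* copy by value before "releasing" the block */
        have_last = true;
    }
    return true;
}
```

Because last is a value and not a pointer, nothing in the next iteration reads memory belonging to the previous block.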
Most VN_RELE calls either directly contain a XFS_ITOV or have the
corresponding xfs_inode already in scope. Use the IRELE helper instead of
VN_RELE to clarify the code. With a little more work we can kill VN_RELE
altogether and define IRELE in terms of iput directly.
[XFS] cleanup root inode handling in xfs_fs_fill_super
- rename rootvp to root for clarity
- remove useless vn_to_inode call
- check is_bad_inode before calling d_alloc_root
- use iput instead of VN_RELE in the error case
David Chinner [Thu, 27 Mar 2008 07:00:45 +0000 (18:00 +1100)]
[XFS] Ensure a btree insert returns a valid cursor.
When writing into preallocated regions there is a case where XFS can oops
or hang doing the unwritten extent conversion on I/O completion. It turns
out that the problem is related to the btree cursor being invalid.
When we do an insert into the tree, we may need to split blocks in the
tree. When we only split at the leaf level (i.e. level 0), everything
works just fine. However, if we have a multi-level split in the btree,
the cursor passed to the insert function is no longer valid once the
insert is complete.
The leaf level split is handled correctly because all the operations at
level 0 are done using the original cursor, hence it is updated correctly.
However, when we need to update the next level up the tree, we don't use
that cursor - we use a cloned cursor that points to the index in the next
level up where we need to do the insert.
Hence if we need to split a second level, the changes to the tree are
reflected in the cloned cursor and not the original cursor. This
clone-and-move-up-a-level-on-split behaviour recurses all the way to the
top of the tree.
The complexity here is that these cloned cursors do not point to the
original index that was inserted - they point to the newly allocated block
(the right block) and the original cursor pointer to that level may still
point to the left block. Hence, without deep examination of the cloned
cursor and buffers, we cannot update the original cursor with the new path
from the cloned cursor.
In these cases the original cursor could be pointing to the wrong block(s)
and hence a subsequent modification to the tree using that cursor will
lead to corruption of the tree.
The crash case occurs when the tree changes height - we insert a new level
in the tree, and the cursor does not have a buffer in its path for that
level. Hence any attempt to walk back up the cursor to the root block will
result in a null pointer dereference.
To make matters even more complex, the BMAP BT is rooted in an inode, so
we can have a change of height in the btree *without a root split*. That
is, if the root block in the inode is full when we split a leaf node, we
cannot fit the pointer to the new block in the root, so we allocate a new
block, migrate all the ptrs out of the inode into the new block and point
the inode root block at the newly allocated block. This changes the height
of the tree without a root split having occurred and hence invalidates the
path in the original cursor.
The patch below prevents xfs_bmbt_insert() from returning with an invalid
cursor by detecting the cases that invalidate the original cursor and
refreshing it by doing a lookup into the btree for the original index we were
inserting at.
Note that the INOBT, AGFBNO and AGFCNT btree implementations also have
this bug, but the cursor is currently always destroyed or revalidated
after an insert for those trees. Hence this patch only addresses the problem
in the BMBT code.
David Chinner [Thu, 27 Mar 2008 07:00:38 +0000 (18:00 +1100)]
[XFS] Account for inode cluster alignment in all allocations
At ENOSPC, we can get a filesystem shutdown due to cancelling a dirty
transaction in xfs_mkdir or xfs_create. This is due to the initial
allocation attempt not taking into account inode alignment and hence we
can prepare the AGF freelist for allocation when it's not actually
possible to do an allocation. This results in inode allocation returning
ENOSPC with a dirty transaction, and hence we shut down the filesystem.
Because the first allocation is an exact allocation attempt, we must tell
the allocator that the alignment does not affect the allocation attempt.
i.e. we will accept any extent alignment as long as the extent starts at
the block we want. Unfortunately, this means that if the longest free
extent is less than the length + alignment necessary for fallback
allocation attempts but is long enough to attempt a non-aligned
allocation, we will modify the free list.
If we then have the exact allocation fail, all other allocation attempts
will also fail due to the alignment constraint being taken into account.
Hence the initial attempt needs to set the "alignment slop" field so that
alignment, while not required, must be taken into account when determining
if there is enough space left in the AG to do the allocation.
That means if the exact allocation fails, we will not dirty the freelist
if there is not enough space available for a subsequent allocation to
succeed. Hence we get an ENOSPC error back to userspace without shutting
down the filesystem.
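The effect of the "alignment slop" field on the space check can be sketched as below. The structure, the field names and the exact formula are assumptions made for illustration and are not taken from the patch; the point is only that an exact allocation with alignment 1 still reserves room for a later aligned attempt.

```c
#include <stdbool.h>

struct alloc_args {
    unsigned int minlen;        /* minimum extent length wanted */
    unsigned int alignment;     /* alignment required by this attempt */
    unsigned int minalignslop;  /* extra room for a later aligned attempt */
};

/*
 * Assumed space check: the longest free extent must cover the length plus
 * the worst-case alignment loss plus the slop, before the freelist is touched.
 */
static bool space_available(const struct alloc_args *args,
                            unsigned int longest_free)
{
    return args->minlen + args->alignment + args->minalignslop - 1
           <= longest_free;
}
```

With minlen 16 and alignment 1, a longest free extent of 17 blocks passes the check when the slop is 0 but fails when the slop is 3, so the freelist is left untouched and ENOSPC can be returned with a clean transaction.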