* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (56 commits)
netns: Fix crash by making igmp per namespace
bnx2x: Version update
bnx2x: Checkpatch compliance
bnx2x: Spelling mistakes
bnx2x: Minor code improvements
bnx2x: Driver info
bnx2x: 1G LED does not turn off
bnx2x: 8073 PHY changes
bnx2x: Change GPIO for any port
bnx2x: Pause settings
bnx2x: Link order with external PHY
bnx2x: No LRO without Rx checksum
bnx2x: Wrong structure size
bnx2x: WoL capability
bnx2x: Clearing MAC addresses filters
bnx2x: Delay in while loops
bnx2x: PBA Table Page Alignment Workaround
bnx2x: Self-test false positive
bnx2x: Memory allocation
bnx2x: HW attention lock
...
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6:
sparc64: Handle stack trace attempts before irqstacks are setup.
sparc64: Implement IRQ stacks.
sparc: remove include of linux/of_device.h from asm/of_device.h
sparc64: Fix recursion in stack overflow detection handling.
sparc/drivers: use linux/of_device.h instead of asm/of_device.h
sparc64: Don't MAGIC_SYSRQ ifdef smp_fetch_global_regs and support code.
David Howells [Wed, 13 Aug 2008 15:20:04 +0000 (16:20 +0100)]
CRED: Introduce credential access wrappers
The patches that are intended to introduce copy-on-write credentials for 2.6.28
require abstraction of access to some fields of the task structure,
particularly for the case of one task accessing another's credentials where RCU
will have to be observed.
Introduced here are trivial no-op versions of the desired accessors for current
and other tasks so that other subsystems can start to be converted over more
easily.
Wrappers are introduced into a new header (linux/cred.h) for UID/GID,
EUID/EGID, SUID/SGID, FSUID/FSGID, cap_effective and current's subscribed
user_struct. These wrappers are macros because the ordering between header
files mitigates against making them inline functions.
linux/cred.h is #included from linux/sched.h.
Further, XFS is modified such that it no longer defines and uses parameterised
versions of current_fs[ug]id(), thus getting rid of the namespace collision
otherwise incurred.
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
Daniel Lezcano [Wed, 13 Aug 2008 23:15:57 +0000 (16:15 -0700)]
netns: Fix crash by making igmp per namespace
This patch makes the multicast socket to be per namespace.
When a network namespace is created, other than the init_net and a
multicast packet is received, the kernel goes to a hang or a kernel panic.
How to reproduce ?
* create a child network namespace
* create a pair virtual device veth
* ip link add type veth
* move one side to the pair network device to the child namespace
* ip link set netns <childpid> dev veth1
* ping -I veth0 224.0.0.1
The bug appears because the function ip_mc_init_dev does not initialize
the different multicast fields as it exits because it is not the init_net.
BUG: soft lockup - CPU#0 stuck for 61s! [avahi-daemon:2695]
Modules linked in:
irq event stamp: 50350
hardirqs last enabled at (50349): [<c03ee949>] _spin_unlock_irqrestore+0x34/0x39
hardirqs last disabled at (50350): [<c03ec639>] schedule+0x9f/0x5ff
softirqs last enabled at (45712): [<c0374d4b>] ip_setsockopt+0x8e7/0x909
softirqs last disabled at (45710): [<c03ee682>] _spin_lock_bh+0x8/0x27
Eilon Greenstein [Wed, 13 Aug 2008 22:58:49 +0000 (15:58 -0700)]
bnx2x: Minor code improvements
Minor code improvements
Small changes to make the code a little bit more efficient and mostly
more readable:
- Using unified macros for EMAC_RD/WR which looks like normal REG_RD/WR
- Removing the NIG_WR since it did nothing and was only confusing
- On bnx2x_panic_dump, print only the used parts of the rings
- define parameters only on the branch they are needed and not at the
beginning of the function
- using NETIF_MSG_INTR and not private BNX2X_MSG_SP for debug prints
Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eilon Greenstein [Wed, 13 Aug 2008 22:58:30 +0000 (15:58 -0700)]
bnx2x: Driver info
Driver info
The internal FW which is downloaded by the driver should not be
displayed - it is only causing confusion and it is redundant since it
can be concluded from the driver version. Display only FW which is
burned on the board nvram
Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yaniv Rosner [Wed, 13 Aug 2008 22:57:28 +0000 (15:57 -0700)]
bnx2x: 8073 PHY changes
8073 PHY changes
The initial support we had for this PHY needs some serious changing. The
major change is that this PHY should be initialized only when the first
function is loaded and not for each function. The official SPI-ROM of
this PHY was released and it requires some changes in the initialization
code as well
Signed-off-by: Yaniv Rosner <yanivr@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eilon Greenstein [Wed, 13 Aug 2008 22:56:59 +0000 (15:56 -0700)]
bnx2x: Change GPIO for any port
Change GPIO for any port
The set GPIO function should receive the port index to allow changing
the GPIO of another port. This is needed for the common init phase (one
the first driver is loaded for the chip)
Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yaniv Rosner [Wed, 13 Aug 2008 22:56:17 +0000 (15:56 -0700)]
bnx2x: Pause settings
Pause settings
- 1G pause was not working due to missing write to the emac block
(TX_MODE_FLOW_EN)
- The flow control should use the negotiated result (after autoneg) so
we should save both the requested autoneg and the result
- The HW credits with flow control at 1G speed were not optimized and
caused low throughput
- It is recommended to turn off flow control if the MTU is bigger than
5000B due to internal buffers size
Signed-off-by: Yaniv Rosner <yanivr@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yaniv Rosner [Wed, 13 Aug 2008 22:55:28 +0000 (15:55 -0700)]
bnx2x: Link order with external PHY
Link order with external PHY
When external PHY exists (second chip with the PHY to translate to
another physical medium) the link with the eternal PHY and the network
should be established before setting the link between the 5771x and the
PHY. This is the right order and it is important when using autoneg -
the link to the network should use the autoneg and the link between the
two chips should be forced to the network result.
Signed-off-by: Yaniv Rosner <yanivr@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
No LRO without Rx checksum
Disabling LRO when Rx checksum is disabled
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yitchak Gertner [Wed, 13 Aug 2008 22:53:12 +0000 (15:53 -0700)]
bnx2x: Wrong structure size
Wrong structure size
The wrong structure was used in the sizeof to clear (luckily both
structures have the same size in this version...)
Signed-off-by: Yitchak Gertner <gertner@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yitchak Gertner [Wed, 13 Aug 2008 22:52:28 +0000 (15:52 -0700)]
bnx2x: Clearing MAC addresses filters
Clearing MAC addresses filters
When the driver unloads, it should clear the MAC addresses filters in
the HW - this prevents packets from entering the chip when the driver is
re-loaded before initializing the right filters
Signed-off-by: Yitchak Gertner <gertner@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yitchak Gertner [Wed, 13 Aug 2008 22:52:08 +0000 (15:52 -0700)]
bnx2x: Delay in while loops
Delay in while loops
The delay in the loop should be after the change. This has very little
effect (can save one delay) but it is the right thing to do
Signed-off-by: Yitchak Gertner <gertner@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eilon Greenstein [Wed, 13 Aug 2008 22:51:48 +0000 (15:51 -0700)]
bnx2x: PBA Table Page Alignment Workaround
PBA Table Page Alignment Workaround
The PBA table starts on the middle of the page and that's causing very
low performance with virtualization. The solution is not to update via
the BAR directly but via chip access to the same memory
Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yitchak Gertner [Wed, 13 Aug 2008 22:51:28 +0000 (15:51 -0700)]
bnx2x: Self-test false positive
Self-test false positive
- The memory test should use a mask according to the chip type
- In the register test, check the port only once and not inside the for
loop (not causing a failure - just ugly)
Signed-off-by: Yitchak Gertner <gertner@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eilon Greenstein [Wed, 13 Aug 2008 22:51:07 +0000 (15:51 -0700)]
bnx2x: Memory allocation
Memory allocation
- The CQE ring was allocated to the max size even for a chip that does
not support it. Fixed to allocate according to the chip type to save
memory
- The rx_page_ring was not freed on driver unload
Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yitchak Gertner [Wed, 13 Aug 2008 22:50:23 +0000 (15:50 -0700)]
bnx2x: HW lock mechanism
HW lock mechanism
Enhancing the HW lock to work per function and not only per port - this
is needed for the next patch that protects races over HW attention
detection between the different functions. At this chance, changing the
functions names to be more inline with the current naming convention
Signed-off-by: Yitchak Gertner <gertner@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Load/Unload under traffic
Few issues were found when loading and unloading under traffic:
- When receiving Tx interrupt call netif_wake_queue if the queue is
stopped but the state is open
- Check that interrupts are enabled before doing anything else on the
msix_fp_int function
- In nic_load, enable the interrupts only when needed and ready for it
- Function stop_leading returns status since it can fail
- Add 1ms delay when unloading the driver to validate that there are no
open transactions that already started by the FW
- Splitting the "has work" function into Tx and Rx so the same function
will be used on unload and interrupts
- Do not request for WoL if only resetting the device (save the time
that it takes the FW to set the link after reset)
- Fixing the device reset after iSCSI boot and before driver load - all
internal buffers must be cleared before the driver is loaded
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eilon Greenstein [Wed, 13 Aug 2008 22:49:35 +0000 (15:49 -0700)]
bnx2x: FW Internal Memory structure
FW Internal Memory structure
The FW uses data structures on the chip internal memory to aggregate the
connections when TPA is enabled. The driver was clearing the wrong offsets
and therefore one function could cause another function to loose packets.
Changing the initialization of the chip internal memory to clear only the
relevant memory for each function which is being loaded
Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yitchak Gertner [Wed, 13 Aug 2008 22:49:05 +0000 (15:49 -0700)]
bnx2x: Statistics
Statistics
- Making sure that each drop is accounted for in the driver statistics
- Clearing the FW statistics when driver is loaded to prevent
inconsistency with HW statistics
- Once error is detected (bnx2x_panic_dump), stop the statistics
before other actions (currently it is stopped last and can corrupt
the data) - Adding HW checksum error counter to the statistics
- Removing unused variable stats_ticks
- Using macros instead of magic numbers to indicate which statistics are
shared per port and which are per function
Signed-off-by: Yitchak Gertner <gertner@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eilon Greenstein [Wed, 13 Aug 2008 22:47:33 +0000 (15:47 -0700)]
bnx2x: FW (bootcode) interface fixes
FW (bootcode) interface fixes
- Making sure that the device will not cause kernel panic of the
bootcode is corrupted or missing
- Removing module debug parameter "nomcp" since no one should work
without the bootcode (this is a left over from the chip bring up days)
- Instead of waiting fix amount of time for bootcode response, sample it
every 10ms (usually the answer is ready after less than 10ms)
Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 13 Aug 2008 22:18:38 +0000 (15:18 -0700)]
pkt_sched: Fix queue quiescence testing in dev_deactivate().
Based upon discussions with Jarek P. and Herbert Xu.
First, we're testing the wrong qdisc. We just reset the device
queue qdiscs to &noop_qdisc and checking it's state is completely
pointless here.
We want to wait until the previous qdisc that was sitting at
the ->qdisc pointer is not busy any more. And that would be
->qdisc_sleeping.
Because of how we propagate the samples qdisc pointer down into
qdisc_run and friends via per-cpu ->output_queue and netif_schedule,
we have to wait also for the __QDISC_STATE_SCHED bit to clear as
well.
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 13 Aug 2008 22:17:49 +0000 (15:17 -0700)]
Merge git://oss.sgi.com:8090/xfs/linux-2.6
* git://oss.sgi.com:8090/xfs/linux-2.6: (45 commits)
[XFS] Fix use after free in xfs_log_done().
[XFS] Make xfs_bmap_*_count_leaves void.
[XFS] Use KM_NOFS for debug trace buffers
[XFS] use KM_MAYFAIL in xfs_mountfs
[XFS] refactor xfs_mount_free
[XFS] don't call xfs_freesb from xfs_unmountfs
[XFS] xfs_unmountfs should return void
[XFS] cleanup xfs_mountfs
[XFS] move root inode IRELE into xfs_unmountfs
[XFS] stop using file_update_time
[XFS] optimize xfs_ichgtime
[XFS] update timestamp in xfs_ialloc manually
[XFS] remove the sema_t from XFS.
[XFS] replace dquot flush semaphore with a completion
[XFS] replace inode flush semaphore with a completion
[XFS] extend completions to provide XFS object flush requirements
[XFS] replace the XFS buf iodone semaphore with a completion
[XFS] clean up stale references to semaphores
[XFS] use get_unaligned_* helpers
[XFS] Fix compile failure in xfs_buf_trace()
...
Jarek Poplawski [Wed, 13 Aug 2008 22:16:43 +0000 (15:16 -0700)]
pkt_sched: Fix oops in htb_delete.
Recent changes introduced a bug in htb_delete(): cl->parent->children
counter update misses checking cl->parent for NULL, which is used for
root classes, so deleting them causes an oops.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Gallatin [Wed, 13 Aug 2008 22:16:00 +0000 (15:16 -0700)]
pktgen: prevent pktgen from using bad tx queue
With the new multi-queue transmit code, it is possible to accidentally
make pktgen pick a non-existing tx queue simply by using a stale
script to drive pktgen. Access to this non-existing tx queue will
then trigger a bad memory access and kill the machine.
For example, setting "queue_map_max 2" will cause my machine to die
when accessing a garbage spinlock in the non-existing tx queue:
BUG: spinlock bad magic on CPU#0, kpktgend_0/564
lock: ffff88001ddf6718, .magic: ffffffff, .owner: /-1, .owner_cpu: 0
Pid: 564, comm: kpktgend_0 Not tainted 2.6.27-rc3 #35
dccp: change L/R must have at least one byte in the dccpsf_val field
Thanks to Eugene Teo for reporting this problem.
Signed-off-by: Eugene Teo <eugenete@kernel.sg> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
David Teigland [Thu, 31 Jul 2008 14:31:53 +0000 (09:31 -0500)]
dlm: rename structs
Add a dlm_ prefix to the struct names in config.c. This resolves a
conflict with struct node in particular, when include/linux/node.h
happens to be included.
Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Teigland <teigland@redhat.com>
Wolfgang also found out that adding kernel_fpu_begin() and kernel_fpu_end()
around the padlock instructions fix the oops.
Suresh wrote:
These padlock instructions though don't use/touch SSE registers, but it behaves
similar to other SSE instructions. For example, it might cause DNA faults
when cr0.ts is set. While this is a spurious DNA trap, it might cause
oops with the recent fpu code changes.
This is the code sequence that is probably causing this problem:
a) new app is getting exec'd and it is somewhere in between
start_thread() and flush_old_exec() in the load_xyz_binary()
b) At pont "a", task's fpu state (like TS_USEDFPU, used_math() etc) is
cleared.
c) Now we get an interrupt/softirq which starts using these encrypt/decrypt
routines in the network stack. This generates a math fault (as
cr0.ts is '1') which sets TS_USEDFPU and restores the math that is
in the task's xstate.
d) Return to exec code path, which does start_thread() which does
free_thread_xstate() and sets xstate pointer to NULL while
the TS_USEDFPU is still set.
e) At the next context switch from the new exec'd task to another task,
we have a scenarios where TS_USEDFPU is set but xstate pointer is null.
This can cause an oops during unlazy_fpu() in __switch_to()
Now:
1) This should happen with or with out pre-emption. Viro also encountered
similar problem with out CONFIG_PREEMPT.
2) kernel_fpu_begin() and kernel_fpu_end() will fix this problem, because
kernel_fpu_begin() will manually do a clts() and won't run in to the
situation of setting TS_USEDFPU in step "c" above.
3) This was working before the fpu changes, because its a spurious
math fault which doesn't corrupt any fpu/sse registers and the task's
math state was always in an allocated state.
With out the recent lazy fpu allocation changes, while we don't see oops,
there is a possible race still present in older kernels(for example,
while kernel is using kernel_fpu_begin() in some optimized clear/copy
page and an interrupt/softirq happens which uses these padlock
instructions generating DNA fault).
This is the failing scenario that existed even before the lazy fpu allocation
changes:
0. CPU's TS flag is set
1. kernel using FPU in some optimized copy routine and while doing
kernel_fpu_begin() takes an interrupt just before doing clts()
2. Takes an interrupt and ipsec uses padlock instruction. And we
take a DNA fault as TS flag is still set.
3. We handle the DNA fault and set TS_USEDFPU and clear cr0.ts
4. We complete the padlock routine
5. Go back to step-1, which resumes clts() in kernel_fpu_begin(), finishes
the optimized copy routine and does kernel_fpu_end(). At this point,
we have cr0.ts again set to '1' but the task's TS_USEFPU is stilll
set and not cleared.
6. Now kernel resumes its user operation. And at the next context
switch, kernel sees it has do a FP save as TS_USEDFPU is still set
and then will do a unlazy_fpu() in __switch_to(). unlazy_fpu()
will take a DNA fault, as cr0.ts is '1' and now, because we are
in __switch_to(), math_state_restore() will get confused and will
restore the next task's FP state and will save it in prev tasks's FP state.
Remember, in __switch_to() we are already on the stack of the next task
but take a DNA fault for the prev task.
This causes the fpu leakage.
Fix the padlock instruction usage by calling them inside the
context of new routines irq_ts_save/restore(), which clear/restore cr0.ts
manually in the interrupt context. This will not generate spurious DNA
in the context of the interrupt which will fix the oops encountered and
the possible FPU leakage issue.
Reported-and-bisected-by: Wolfgang Walter <wolfgang.walter@stwm.de> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
introduced a typo that broke AEAD chunk testing. In particular,
axbuf should really be xbuf.
There is also an issue with testing the last segment when encrypting.
The additional part produced by AEAD wasn't tested. Similarly, on
decryption the additional part of the AEAD input is mistaken for
corruption.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Lee Nipper [Wed, 30 Jul 2008 08:26:57 +0000 (16:26 +0800)]
crypto: talitos - Add handling for SEC 3.x treatment of link table
Later SEC revision requires the link table (used for scatter/gather)
to have an extra entry to account for the total length in descriptor [4],
which contains cipher Input and ICV.
This only applies to decrypt, not encrypt.
Without this change, on 837x, a gather return/length error results
when a decryption uses a link table to gather the fragments.
This is observed by doing a ping with size of 1447 or larger with AES,
or a ping with size 1455 or larger with 3des.
So, add check for SEC compatible "fsl,3.0" for using extra link table entry.
Signed-off-by: Lee Nipper <lee.nipper@freescale.com> Signed-off-by: Kim Phillips <kim.phillips@freescale.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Jamal Hadi Salim [Wed, 13 Aug 2008 09:41:22 +0000 (02:41 -0700)]
net-sched: Fix actions flushing
Flushing of actions has been broken since we changed
the semantics of netlink parsed tb[X] to mean X is an attribute type.
This makes the flushing work.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
Julien Brunel [Wed, 13 Aug 2008 09:40:48 +0000 (02:40 -0700)]
net/rxrpc: Use an IS_ERR test rather than a NULL test
In case of error, the function rxrpc_get_transport returns an ERR
pointer, but never returns a NULL pointer. So after a call to this
function, a NULL test should be replaced by an IS_ERR test.
A simplified version of the semantic patch that makes this change is
as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@correct_null_test@
expression x,E;
statement S1, S2;
@@
x = rxrpc_get_transport(...)
<... when != x = E
if (
(
- x@p2 != NULL
+ ! IS_ERR ( x )
|
- x@p2 == NULL
+ IS_ERR( x )
)
)
S1
else S2
...>
? x = E;
// </smpl>
Signed-off-by: Julien Brunel <brunel@diku.dk> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 13 Aug 2008 09:13:34 +0000 (02:13 -0700)]
pkt_sched: Add queue stopped test back to qdisc_run().
Based upon a bug report by Andrew Gallatin on netdev
with subject "CPU utilization increased in 2.6.27rc"
In commit 37437bb2e1ae8af470dfcd5b4ff454110894ccaf
("pkt_sched: Schedule qdiscs instead of netdev_queue.")
the test of the queue being stopped was erroneously
removed from qdisc_run().
When the TX queue of the device fills up, this omission
causes lots of extraneous useless work to be queued up
to softirq context, where we'll just return immediately
because the device is still stuffed up.
Signed-off-by: David S. Miller <davem@davemloft.net>
Lachlan McIlroy [Wed, 13 Aug 2008 06:51:57 +0000 (16:51 +1000)]
[XFS] Use KM_NOFS for debug trace buffers
Use KM_NOFS to prevent recursion back into the filesystem which can cause
deadlocks.
In the case of xfs_iread() we hold the lock on the inode cluster buffer
while allocating memory for the trace buffers. If we recurse back into XFS
to flush data that may require a transaction to allocate extents which
needs log space. This can deadlock with the xfsaild thread which can't
push the tail of the log because it is trying to get the inode cluster
buffer lock.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31838a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <david@fromorbit.com>
xfs_mount_free mostly frees the perag data, which is something that is
duplicated in the mount error path.
Move the XFS_QM_DONE call to the caller and remove the useless
mutex_destroy/spinlock_destroy calls so that we can re-use it for the
mount error path. Also rename it to xfs_free_perag to reflect what it
does.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31836a
Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_readsb is called before xfs_mount so xfs_freesb should be called after
xfs_unmountfs, too. This means it now happens after a few things during
the of xfs_unmount which all have nothing to do with the superblock.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31835a
Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The root inode is allocated in xfs_mountfs so it should be release in
xfs_unmountfs. For the unmount case that means we do it after the the
xfs_sync(mp, SYNC_WAIT | SYNC_CLOSE) in the forced shutdown case and the
dmapi unmount event. Note that both reference the rip variable which might
be freed by that time in case inode flushing has kicked in, so strictly
speaking this might count as a bug fix
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31830a
Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_ichtime updates the xfs_inode and Linux inode timestamps just fine, no
need to call file_update_time and then copy the values over to the XFS
inode. The only additional thing in file_update_time are checks not
applicable to the write path.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31829a
Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: David Chinner <david@fromorbit.com>
Port a little optmization from file_update_time to xfs_ichgtime, and only
update the timestamp and mark the inode dirty if the timestamp actually
changes in the timer tick resultion supported by the running kernel.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31827a
Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
In xfs_ialloc we just want to set all timestamps to the current time. We
don't need to mark the inode dirty like xfs_ichgtime does, and we don't
need nor want the opimizations in xfs_ichgtime that I will introduce in
the next patch.
So just opencode the timestamp update in xfs_ialloc, and remove the new
unused XFS_ICHGTIME_ACC case in xfs_ichgtime.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31825a
Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
David Chinner [Wed, 13 Aug 2008 06:40:43 +0000 (16:40 +1000)]
[XFS] extend completions to provide XFS object flush requirements
XFS object flushing doesn't quite match existing completion semantics. It
mixed exclusive access with completion. That is, we need to mark an object as
being flushed before flushing it to disk, and then block any other attempt to
flush it until the completion occurs. We do this but adding an extra count to
the completion before we start using them. However, we still need to
determine if there is a completion in progress, and allow no-blocking attempts
fo completions to decrement the count.
To do this we introduce:
int try_wait_for_completion(struct completion *x)
returns a failure status if done == 0, otherwise decrements done
to zero and returns a "started" status. This is provided
to allow counted completions to begin safely while holding
object locks in inverted order.
int completion_done(struct completion *x)
returns 1 if there is no waiter, 0 if there is a waiter
(i.e. a completion in progress).
This replaces the use of semaphores for providing this exclusion
and completion mechanism.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31816a
Signed-off-by: David Chinner <david@fromorbit.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
David Chinner [Wed, 13 Aug 2008 06:36:11 +0000 (16:36 +1000)]
[XFS] replace the XFS buf iodone semaphore with a completion
The xfs_buf_t b_iodonesema is really just a semaphore that wants to be a
completion. Change it to a completion and remove the last user of the
sema_t from XFS.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31815a
Signed-off-by: David Chinner <david@fromorbit.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
David Chinner [Wed, 13 Aug 2008 06:34:31 +0000 (16:34 +1000)]
[XFS] clean up stale references to semaphores
A lot of code has been converted away from semaphores, but there are still
comments that reference semaphore behaviour. The log code is the worst
offender. Update the comments to reflect what the code really does now.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31814a
Signed-off-by: David Chinner <david@fromorbit.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
[XFS] Use the same btree_cur union member for alloc and inobt trees.
The alloc and inobt btree use the same agbp/agno pair in the btree_cur
union. Make them use the same bc_private.a union member so that code for
these two short form btree implementations can be shared.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31788a
Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: David Chinner <david@fromorbit.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Setting up the xfs_inode <-> inode link is opencoded in xfs_iget_core now
because that's the only place it needs to be done, xfs_initialize_vnode is
renamed to xfs_setup_inode and loses all superflous paramaters. The check
for I_NEW is removed because it always is true and the di_mode check moves
into xfs_iget_core because it's only needed there.
xfs_set_inodeops and xfs_revalidate_inode are merged into xfs_setup_inode
and the whole things is moved into xfs_iops.c where it belongs.
All remaining bhv_vnode_t instance are in code that's more or less Linux
specific. (Well, for xfs_acl.c that could be argued, but that code is on
the removal list, too). So just do an s/bhv_vnode_t/struct inode/ over the
whole tree. We can clean up variable naming and some useless helpers
later.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31781a
Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When multiple inodes are locked in XFS it happens in order of the inode
number, with the everything but the first inode trylocked if any of the
previous inodes is in the AIL.
Except for the sorting of the inodes this logic is implemented in
xfs_lock_inodes, but also partially duplicated in xfs_lock_dir_and_entry
in a particularly stupid way adds a lock roundtrip if the inode ordering
is not optimal.
This patch adds a new helper xfs_lock_two_inodes that takes two inodes and
locks them in the most optimal way according to the above locking protocol
and uses it for all places that want to lock two inodes.
The only caller of xfs_lock_inodes is xfs_rename which might lock up to
four inodes.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31772a
Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Donald Douwsma <donaldd@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
All the error injection is already enabled through ifdef DEBUG, so kill
the never set second cpp symbol to activate it without the rest of the
debugging infrastructure.
Now that all direct calls to VN_HOLD/VN_RELE are gone we can implement
IHOLD/IRELE directly.
For the IHOLD case also replace igrab with a direct increment of i_count
because we are guaranteed to already have a live and referenced inode by
the VFS. Also remove the vn_hold statistic because it's been rather
meaningless for some time with most references done by other callers.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31764a
Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
[XFS] remove spurious VN_HOLD/VN_RELE calls from xfs_acl.c
All the ACL routines are called from inode operations which are guaranteed
to have a referenced inode by the VFS, so there's no need for the ACL code
to grab another temporary one.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31763a
Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Add a helper to free the m_fsname/m_rtname/m_logname allocations and use
it properly for all mount failure cases. Also switch the allocations for
these to kstrdup while we're at it.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31728a
Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: David Chinner <david@fromorbit.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Niv Sardi [Wed, 13 Aug 2008 06:03:35 +0000 (16:03 +1000)]
[XFS] Move attr log alloc size calculator to another function.
We will need that to be able to calculate the size of log we need for a
specific attr (for Create+EA). The local flag is needed so that we can
fail if we run into ENOSPC when trying to alloc blocks.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31727a
Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
David Chinner [Wed, 13 Aug 2008 06:02:51 +0000 (16:02 +1000)]
[XFS] Use KM_NOFS for incore inode extent tree allocation V2
If we allow incore extent tree allocations to recurse into the
filesystem under memory pressure, new delayed allocations through
xfs_iomap_write_delay() can deadlock on themselves if memory
reclaim tries to write back dirty pages from that inode.
It will deadlock in xfs_iomap_write_allocate() trying to take the
ilock we already hold. This can also show up as complex ABBA deadlocks
when multiple threads are triggering memory reclaim when trying to
allocate extents.
The main cause of this is the fact that delayed allocation is not done in
a transaction, so KM_NOFS is not automatically added to the allocations to
prevent this recursion.
Mark all allocations done for the incore inode extent tree as KM_NOFS to
ensure they never recurse back into the filesystem.
Version 2: o KM_NOFS implies KM_SLEEP, so just use KM_NOFS
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31726a
Signed-off-by: David Chinner <david@fromorbit.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
David Chinner [Wed, 13 Aug 2008 05:45:15 +0000 (15:45 +1000)]
[XFS] Avoid directly referencing the VFS inode.
In several places we directly convert from the XFS inode
to the linux (VFS) inode by a simple deference of ip->i_vnode.
We should not do this - a helper function should be used to
extract the VFS inode from the XFS inode.
Introduce the function VFS_I() to extract the VFS inode
from the XFS inode. The name was chosen to match XFS_I() which
is used to extract the XFS inode from the VFS inode.
SGI-PV: 981498
SGI-Modid: xfs-linux-melb:xfs-kern:31720a
Signed-off-by: David Chinner <david@fromorbit.com> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>