ext4: Cache the correct extent length for uninit extents
When we convert an uninitialized extent to an initialized extent
we need to make sure we return the number of blocks in the
extent from the file system block corresponding to logical
file block. Otherwise we cache wrong extent details and this
results in file system corruption.
ext4: Return unwritten buffer head when trying to read from prealloc space.
ext4_ext_get_blocks() returns the number of blocks allocated with buffer
head unmapped for a read from prealloc space. This is needed so that
delayed allocation doesn't do block reservation for prealloc space
since the blocks are already reserved on disk. Mark the buffer head
unwritten. Some code paths try to read the block if the buffer_head is
not new and no uptodate. Marking the buffer head unwritten avoids this
reading.
ext4: make ext4_ext_get_blocks always return <= max_blocks
ext4_ext_get_blocks() returns number of blocks allocated with buffer
heads unmapped for a read from prealloc space. This is needed so that
delayed allocation doesn't do block reservation for prealloc space since
the blocks are already resevred on disk. Fix ext4_ext_get_blocks to not
return greater than max_blocks, since some of the code paths cannot
handle such a return value.
ext4: Fix fallocate to update the file size in each transaction
ext4_fallocate needs to update file size in each transaction. Otherwise
if we crash the file size won't be seen. We were also not marking
the inode dirty after updating file size before. Also when we try to
retry allocation due to ENOSPC, make sure we reset the variable ret so
that we actually do a retry.
Fail migrate if we allocated new blocks via mmap write.
If we write to holes in the file via mmap, we end up allocating
new blocks. This block allocation happens without taking inode->i_mutex.
Since migrate is protected by i_mutex and migrate expects that no
new blocks get allocated during migrate, fail migrate if new blocks
get allocated.
We can't take inode->i_mutex in the mmap write path because that
would result in a locking order violation between i_mutex and mmap_sem.
Also adding a separate rw_sempahore for protection is really high overhead
for a rare operation such as migrate.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4: zero out small extents when writing to prealloc area.
If the preallocated area is small zero out the full extent
instead of splitting them. This should avoid the "write
every alternate block" problem that could grow the number
of extents dramatically.
ext4: ENOSPC error handling for writing to an uninitialized extent
This patch handles possible ENOSPC errors when writing to an
uninitialized extent in case the filesystem is full.
A write to a prealloc area causes the split of an unititalized extent
into initialized and uninitialized extents. If we don't have
space to add new extent information, instead of returning error,
convert the existing uninitialized extent to initialized one. We
need to zero out the blocks corresponding to the entire extent to
prevent uninitialized data reaching userspace.
sparc: Export symbols for ZERO_PAGE usage in modules.
ext4 uses ZERO_PAGE(0) to zero out blocks. We need to export
different symbols in different arches for the usage of ZERO_PAGE
in modules.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch enables extent-formatted normal symlinks. Using extents
format allows a symlink to refer to a block number larger than 2^32
on large filesystems. We still don't enable extent format for fast
symlinks, which are contained in the inode itself.
Josef Bacik [Wed, 30 Apr 2008 02:05:28 +0000 (22:05 -0400)]
ext4: fix mount option parsing
The "resize" option won't be noticed as it comes after the NULL option,
so if you try to mount (or in this case remount) with that option it
won't be recognized.
Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4: fdatasync should skip metadata writeout when overwriting
Currently fdatasync is identical to fsync in ext3.
I think fdatasync should skip journal flush in data=ordered and
data=writeback mode when it overwrites to already-instantiated blocks on
HDD. When I_DIRTY_DATASYNC flag is not set, fdatasync should skip journal
writeout because this indicates only atime or/and mtime updates.
Following patch is the same approach of ext2's fsync code(ext2_sync_file).
I did a performance test using the sysbench.
#sysbench --num-threads=128 --max-requests=50000 --test=fileio --file-total-size=128G
--file-test-mode=rndwr --file-fsync-mode=fdatasync run
The result on ext3 was:
-2.6.24
Operations performed: 0 Read, 50080 Write, 59600 Other = 109680 Total
Read 0b Written 782.5Mb Total transferred 782.5Mb (12.116Mb/sec)
775.45 Requests/sec executed
Test execution summary:
total time: 64.5814s
total number of events: 50080
total time taken by event execution: 3713.9836
per-request statistics:
min: 0.0000s
avg: 0.0742s
max: 0.9375s
approx. 95 percentile: 0.2901s
Threads fairness:
events (avg/stddev): 391.2500/23.26
execution time (avg/stddev): 29.0155/1.99
-2.6.24-patched
Operations performed: 0 Read, 50009 Write, 61596 Other = 111605 Total
Read 0b Written 781.39Mb Total transferred 781.39Mb (16.419Mb/sec)
1050.83 Requests/sec executed
Test execution summary:
total time: 47.5900s
total number of events: 50009
total time taken by event execution: 2934.5768
per-request statistics:
min: 0.0000s
avg: 0.0587s
max: 0.8938s
approx. 95 percentile: 0.1993s
Threads fairness:
events (avg/stddev): 390.6953/22.64
execution time (avg/stddev): 22.9264/1.17
Filesystem I/O throughput was improved.
Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Acked-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Josef Bacik [Thu, 17 Apr 2008 14:38:59 +0000 (10:38 -0400)]
jbd2: fix possible journal overflow issues
There are several cases where the running transaction can get buffers
added to its BJ_Metadata list which it never dirtied, which makes its
t_nr_buffers counter end up larger than its t_outstanding_credits
counter.
This will cause issues when starting new transactions as while we are
logging buffers we decrement t_outstanding_buffers, so when
t_outstanding_buffers goes negative, we will report that we need less
space in the journal than we actually need, so transactions will be
started even though there may not be enough room for them. In the worst
case scenario (which admittedly is almost impossible to reproduce) this
will result in the journal running out of space.
The fix is to only refile buffers from the committing transaction to the
running transactions BJ_Modified list when b_modified is set on that
journal, which is the only way to be sure if the running transaction has
modified that buffer.
This patch also fixes an accounting error in journal_forget, it is
possible that we can call journal_forget on a buffer without having
modified it, only gotten write access to it, so instead of freeing a
credit, we only do so if the buffer was modified. The assert will help
catch if this problem occurs. Without these two patches I could hit
this assert within minutes of running postmark, with them this issue no
longer arises.
Cc: <linux-ext4@vger.kernel.org> Cc: Jan Kara <jack@ucw.cz> Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Josef Bacik [Thu, 17 Apr 2008 14:38:59 +0000 (10:38 -0400)]
jbd2: fix the way the b_modified flag is cleared
Currently at the start of a journal commit we loop through all of the buffers
on the committing transaction and clear the b_modified flag (the flag that is
set when a transaction modifies the buffer) under the j_list_lock.
The problem is that everywhere else this flag is modified only under the jbd2
lock buffer flag, so it will race with a running transaction who could
potentially set it, and have it unset by the committing transaction.
This is also a big waste, you can have several thousands of buffers that you
are clearing the modified flag on when you may not need to. This patch
removes this code and instead clears the b_modified flag upon entering
do_get_write_access/journal_get_create_access, so if that transaction does
indeed use the buffer then it will be accounted for properly, and if it does
not then we know we didn't use it.
That will be important for the next patch in this series. Tested thoroughly
by myself using postmark/iozone/bonnie++.
Cc: <linux-ext4@vger.kernel.org> Cc: Jan Kara <jack@ucw.cz> Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
Smack: Integrate Smack with Audit
Security: Typecast CAP_*_SET macros
Security: Make secctx_to_secid() take const secdata
Ahmed S. Darwish [Tue, 29 Apr 2008 22:34:10 +0000 (08:34 +1000)]
Smack: Integrate Smack with Audit
Setup the new Audit hooks for Smack. SELinux Audit rule fields are recycled
to avoid `auditd' userspace modifications. Currently only equality testing
is supported on labels acting as a subject (AUDIT_SUBJ_USER) or as an object
(AUDIT_OBJ_USER).
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
[libata] linux/libata.h: reorganize ata_device struct members a bit
ahci: SB600 ahci can't do MSI, blacklist that capability
libata: More TSSTcorp pain, keep in sync with legacy IDE
pata_via: Fix 6410 misdetect
[libata] pata_atiixp: fix PIO timing data misprogramming
Alex Chiang [Tue, 29 Apr 2008 22:05:29 +0000 (15:05 -0700)]
[IA64] Provide ACPI fixup for /proc/cpuinfo/physical_id
Legacy HP ia64 platforms currently cannot provide
/proc/cpuinfo/physical_id due to legacy SAL/PAL implementations.
However, that physical topology information can be obtained
via ACPI.
Provide an interface that gives ACPI one last chance to provide
physical_id for these legacy platforms. This logic only comes
into play iff:
- ACPI actually provides slot information for the CPU
- we lack a valid socket_id
Otherwise, we don't do anything.
Since x86 uses the ACPI processor driver as well, we provide a nop
stub function for arch_fix_phys_package_id() in asm-x86/topology.h
Signed-off-by: Alex Chiang <achiang@hp.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/v4l-dvb: (28 commits)
V4L-DVB(7789a): cx18: fix symbol conflict with ivtv driver
V4L/DVB (7789): tuner: remove static dependencies on analog tuner sub-modules
V4L/DVB (7785): [2.6 patch] make mt9{m001,v022}_controls[] static
V4L/DVB (7786): cx18: new driver for the Conexant CX23418 MPEG encoder chip
V4L/DVB (7783): drivers/media/dvb/frontends/s5h1420.c: printk fix
V4L/DVB (7782): pvrusb2: Driver is no longer experimental
V4L/DVB (7781): pvrusb2-dvb: include dvb support by default and update Kconfig help text
V4L/DVB (7780): pvrusb2: always enable support for OnAir Creator / HDTV USB2
V4L/DVB (7779): pvrusb2-dvb: quiet down noise in kernel log for feed debug
Rename common tuner Kconfig names to use the same
Fix V4L/DVB core help messages
V4L/DVB (7769): Move other terrestrial tuners to common/tuners
V4L/DVB (7768): reorganize some DVB-S Kconfig items
V4L/DVB(7767): Move tuners to common/tuners
V4L/DVB (7766): saa7134: add another PCI ID for Beholder M6
V4L/DVB (7765): Add support for Beholder BeholdTV H6
V4L/DVB (7763): ivtv: add tuner support for the AverMedia M116
V4L/DVB (7762): ivtv: fix tuner detection for PAL-N/Nc
V4L/DVB (7761): ivtv: increase the DMA timeout from 100 to 300 ms
V4L/DVB (7759): ivtv: increase version number to 1.2.1
...
Merge branch 'i2c-for-linus' of git://jdelvare.pck.nerim.net/jdelvare-2.6
* 'i2c-for-linus' of git://jdelvare.pck.nerim.net/jdelvare-2.6:
i2c: Convert most new-style drivers to use module aliasing
i2c: Add support for device alias names
i2c-amd756-s4882: Fix an error path
i2c: Drop unused RTC driver IDs
i2c/tps65010: Add missing intialization of client data
i2c-sis5595: Minor cleanups in sis5595_access
i2c-piix4: Minor cleanups
i2c: Spelling fix (successful)
i2c-stub: No newline in parameter description
V4L-DVB(7789a): cx18: fix symbol conflict with ivtv driver
LD drivers/media/video/built-in.o
drivers/media/video/cx18/built-in.o: In function `get_service_set':
/home/v4l/tokernel/git/drivers/media/video/cx18/cx18-ioctl.c:118: multiple definition of `get_service_set'
drivers/media/video/ivtv/built-in.o:/home/v4l/tokernel/git/drivers/media/video/ivtv/ivtv-ioctl.c:119: first defined here
drivers/media/video/cx18/built-in.o: In function `expand_service_set':
/home/v4l/tokernel/git/drivers/media/video/cx18/cx18-ioctl.c:92: multiple definition of `expand_service_set'
drivers/media/video/ivtv/built-in.o:/home/v4l/tokernel/git/drivers/media/video/ivtv/ivtv-ioctl.c:92: first defined here
drivers/media/video/cx18/built-in.o: In function `service2vbi':
/home/v4l/tokernel/git/drivers/media/video/cx18/cx18-ioctl.c:44: multiple definition of `service2vbi'
drivers/media/video/ivtv/built-in.o:/home/v4l/tokernel/git/drivers/media/video/ivtv/ivtv-ioctl.c:42: first defined here
make[2]: ** [drivers/media/video/built-in.o] Erro 1
make[1]: ** [drivers/media/video] Erro 2
make: ** [drivers/media/] Erro 2
Hans Verkuil [Mon, 28 Apr 2008 23:24:33 +0000 (20:24 -0300)]
V4L/DVB (7786): cx18: new driver for the Conexant CX23418 MPEG encoder chip
Many thanks to Steve Toth from Hauppauge and Nattu Dakshinamurthy from
Conexant for their support. I am in particular thankful to Hauppauge
since without their help this driver would not exist. It should also
be noted that Steve did the work to get the DVB part up and running.
Thank you!
Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl> Signed-off-by: Steven Toth <stoth@hauppauge.com> Signed-off-by: Michael Krufky <mkrufky@linuxtv.org> Signed-off-by: G. Andrew Walls <awalls@radix.net> Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
drivers/media/dvb/frontends/s5h1420.c: In function `s5h1420_setsymbolrate':
drivers/media/dvb/frontends/s5h1420.c:484: warning: long long unsigned int format, u64 arg (arg 2)
We do not know what type the architecture uses for u64.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Mike Isely [Mon, 28 Apr 2008 00:37:33 +0000 (21:37 -0300)]
V4L/DVB (7782): pvrusb2: Driver is no longer experimental
This driver has been in-kernel and reasonably stable for well over a
year. It is in a stable form and is known to work well. Remove its
experimental status.
Signed-off-by: Mike Isely <isely@pobox.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Michael Krufky [Sun, 27 Apr 2008 22:12:29 +0000 (19:12 -0300)]
V4L/DVB (7780): pvrusb2: always enable support for OnAir Creator / HDTV USB2
This was a build option in the past, to avoid conflicts with the cxusb module
for digital televsion support. Now that dtv mode support has been merged into
pvrusb2, the OnAir devices are fully supported by this single module. This no
longer should be a build option.
Signed-off-by: Michael Krufky <mkrufky@linuxtv.org> Signed-off-by: Mike Isely <isely@pobox.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
V4L/DVB (7769): Move other terrestrial tuners to common/tuners
Those tuners are currently used only under media/dvb. However,
they can support also analog TV. Better to move them to the same place
as the other hybrid tuners. This would make easier to use those tuners also
by analog drivers.
There were several issues in the past, caused by the hybrid tuner design, since
now, the same tuner can be used by drivers/media/dvb and drivers/media/video.
Kconfig items were rearranged, to split V4L/DVB core from their drivers.
Tuner setup were happening during i2c attach callback. This means that it would
happen on two conditions:
1) if tuner module weren't load, it will happen at request_module("tuner");
2) if tuner is not compiled as a module, or it is already loaded
(for example, on setups with more than one tuner), it will happen
when saa7134 registers I2C bus.
Due to that, if tuner were loaded, tuner setup will happen _before_ reading
the proper values at tuner eeprom. Since set_addr refuses to change for a tuner
that were previously defined (except if the tuner_addr is set), this were
making eeprom tuner detection useless.
This patch removes tuner type setup from saa7134-i2c, moving it to the proper
place, after taking eeprom into account.
Reviewed-by: Hermann Pitton <hermann-pitton@arcor.de> Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Tuner setup were happening during i2c attach callback. This means that it would
happen on two conditions:
1) if tuner module weren't load, it will happen at request_module("tuner");
2) if tuner is not compiled as a module, or it is already loaded
(for example, on setups with more than one tuner), it will happen
when cx88 registers I2C bus.
Due to that, if tuner were loaded, tuner setup will happen _before_ reading
the proper values at tuner eeprom. Since set_addr refuses to change for a tuner
that were previously defined (except if the tuner_addr is set), this were making
eeprom tuner detection useless.
This patch removes tuner type setup from cx88-i2c, moving it to the proper
place, after taking eeprom into account.
Reviewed-by: Gert Vervoort <gert.vervoort@hccnet.nl> Reviewed-by: Ian Pickworth <ian@pickworth.me.uk> Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Alan Cox [Tue, 29 Apr 2008 13:10:57 +0000 (14:10 +0100)]
pata_via: Fix 6410 misdetect
The discrete VIA ATA chips don't have 0x40 enable bits. We check that
properly in one location but not another. This causes some users 6410
RAID cards to be incorrectly skipped.
Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Jean Delvare [Tue, 29 Apr 2008 21:11:40 +0000 (23:11 +0200)]
i2c: Convert most new-style drivers to use module aliasing
Based on earlier work by Jon Smirl and Jochen Friedrich.
Update most new-style i2c drivers to use standard module aliasing
instead of the old driver_name/type driver matching scheme. I've
left the video drivers apart (except for SoC camera drivers) as
they're a bit more diffcult to deal with, they'll have their own
patch later.
Signed-off-by: Jean Delvare <khali@linux-fr.org> Cc: Jon Smirl <jonsmirl@gmail.com> Cc: Jochen Friedrich <jochen@scram.de>
Jean Delvare [Tue, 29 Apr 2008 21:11:39 +0000 (23:11 +0200)]
i2c: Add support for device alias names
Based on earlier work by Jon Smirl and Jochen Friedrich.
This patch allows new-style i2c chip drivers to have alias names using
the official kernel aliasing system and MODULE_DEVICE_TABLE(). At this
point, the old i2c driver binding scheme (driver_name/type) is still
supported.
Signed-off-by: Jean Delvare <khali@linux-fr.org> Cc: Jochen Friedrich <jochen@scram.de> Cc: Jon Smirl <jonsmirl@gmail.com> Cc: Kay Sievers <kay.sievers@vrfy.org>
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
RDMA/nes: Formatting cleanup
RDMA/nes: Add support for SFP+ PHY
RDMA/nes: Use LRO
IPoIB: Copy child MTU from parent
IB/mthca: Avoid changing userspace ABI to handle DMA write barrier attribute
IB/mthca: Avoid recycling old FMR R_Keys too soon
mlx4_core: Avoid recycling old FMR R_Keys too soon
IB/ehca: Allocate event queue size depending on max number of CQs and QPs
IPoIB: Use separate CQ for UD send completions
IB/iser: Count FMR alignment violations per session
IB/iser: Move high-volume debug output to higher debug level
IB/ehca: handle negative return value from ibmebus_request_irq() properly
RDMA/cxgb3: Support peer-2-peer connection setup
RDMA/cxgb3: Set the max_mr_size device attribute correctly
RDMA/cxgb3: Correctly serialize peer abort path
mlx4_core: Add a way to set the "collapsed" CQ flag
* git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6:
alim15x3: disable init_hwif_ali15x3 for PowerPC
ide: fix crash at boot with siimage driver
Anton Vorontsov [Tue, 29 Apr 2008 20:57:38 +0000 (22:57 +0200)]
alim15x3: disable init_hwif_ali15x3 for PowerPC
We don't need init_hwif_ali15x3() on the PowerPC systems either.
Before:
ALI15X3: IDE controller (0x10b9:0x5229 rev 0xc8) at PCI slot 0001:03:1f.0
ALI15X3: 100% native mode on irq 19
ide0: BM-DMA at 0x1120-0x1127
ide1: BM-DMA at 0x1128-0x112f
hda: SONY DVD RW AW-Q170A, ATAPI CD/DVD-ROM drive
hda: UDMA/66 mode selected
ide0: Disabled unable to get IRQ 14.
ide0: failed to initialize IDE interface
ide1: Disabled unable to get IRQ 15.
ide1: failed to initialize IDE interface
After:
ALI15X3: IDE controller (0x10b9:0x5229 rev 0xc8) at PCI slot 0001:03:1f.0
ALI15X3: 100% native mode on irq 19
ide0: BM-DMA at 0x1120-0x1127
ide1: BM-DMA at 0x1128-0x112f
hda: SONY DVD RW AW-Q170A, ATAPI CD/DVD-ROM drive
hda: UDMA/66 mode selected
ide0 at 0x1100-0x1107,0x110a on irq 19
ide1 at 0x1110-0x1117,0x111a on irq 19
hda: ATAPI 48X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache
ide0 works well, though I can't test ide1, it isn't traced out on
the board.
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Some change to the IDE layer are causing the siimage driver to crash
at boot with a NULL dereference. This is due to the sil_dma_ops not
containing all the necessary pointers. I suppose it used to just
"override" the defaults while now, it needs to contain everything.
[bart: while at it: sil_dma_ops should be const now (pointed out by Sergei)]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>, Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Alex Chiang [Thu, 24 Apr 2008 18:57:08 +0000 (12:57 -0600)]
[IA64] Remove printk noise on unimplemented SAL_PHYSICAL_ID_INFO
Commit 113134fcbca83619be4c68d0ca66db6093777b5d changed the flow of
control when calling PAL_LOGICAL_TO_PHYSICAL and SAL_PHYSICAL_ID_INFO.
With the change, if a platform did not implement the latter, a useless
printk would appear in the boot log:
ia64_sal_pltid failed with -1
So let's check the return code and only printk on a true error, and do
not print anything in the unimplemented case. While we're in there,
clean up some stylistic issues too.
Signed-off-by: Alex Chiang <achiang@hp.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
Various cleanups:
- Change // to /* .. */
- Place whitespace around binary operators.
- Trim down a few long lines.
- Some minor alignment formatting for better readability.
- Remove some silly tabs.
Signed-off-by: Glenn Streiff <gstreiff@neteffect.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>
Eli Cohen [Tue, 29 Apr 2008 20:46:53 +0000 (13:46 -0700)]
IPoIB: Copy child MTU from parent
When creating a child interface, copy the MTU information from the
parent. Otherwise when the child's multicast join completes, the MTU
will not be updated since the code does
dev->mtu = min(priv->mcast_mtu, priv->admin_mtu);
and priv->admin_mtu will be set to 0.
Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
Roland Dreier [Tue, 29 Apr 2008 20:46:53 +0000 (13:46 -0700)]
IB/mthca: Avoid changing userspace ABI to handle DMA write barrier attribute
Commit cb9fbc5c ("IB: expand ib_umem_get() prototype") changed the
mthca userspace ABI to provide a way for userspace to indicate which
memory regions need the DMA write barrier attribute. However, it is
possible to handle this without breaking existing userspace, by having
the mthca kernel driver recognize whether it is talking to old or new
userspace, depending on the size of the register MR structure passed in.
The only potential drawback of this is that is allows old userspace
(which has a bug with DMA ordering on large SGI Altix systems) to
continue to run on new kernels, but the advantage of allowing old
userspace to continue to work on unaffected systems seems to outweigh
this, and we can print a warning to push people to upgrade their
userspace.
Olaf Kirch [Tue, 29 Apr 2008 20:46:53 +0000 (13:46 -0700)]
IB/mthca: Avoid recycling old FMR R_Keys too soon
When a FMR is unmapped, mthca resets the map count to 0, and clears
the upper part of the R_Key which is used as the sequence counter.
This poses a problem for RDS, which uses ib_fmr_unmap as a fence
operation. RDS assumes that after issuing an unmap, the old R_Keys
will be invalid for a "reasonable" period of time. For instance,
Oracle processes uses shared memory buffers allocated from a pool of
buffers. When a process dies, we want to reclaim these buffers -- but
we must make sure there are no pending RDMA operations to/from those
buffers. The only way to achieve that is by using unmap and sync the
TPT.
However, when the sequence count is reset on unmap, there is a high
likelihood that a new mapping will be given the same R_Key that was
issued a few milliseconds ago.
To prevent this, don't reset the sequence count when unmapping a FMR.
Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>
Olaf Kirch [Tue, 29 Apr 2008 20:46:53 +0000 (13:46 -0700)]
mlx4_core: Avoid recycling old FMR R_Keys too soon
When a FMR is unmapped, mlx4 resets the map count to 0, and clears the
upper part of the R_Key which is used as the sequence counter.
This poses a problem for RDS, which uses ib_fmr_unmap as a fence
operation. RDS assumes that after issuing an unmap, the old R_Keys
will be invalid for a "reasonable" period of time. For instance,
Oracle processes uses shared memory buffers allocated from a pool of
buffers. When a process dies, we want to reclaim these buffers -- but
we must make sure there are no pending RDMA operations to/from those
buffers. The only way to achieve that is by using unmap and sync the
TPT.
However, when the sequence count is reset on unmap, there is a high
likelihood that a new mapping will be given the same R_Key that was
issued a few milliseconds ago.
To prevent this, don't reset the sequence count when unmapping a FMR.
Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>
Stefan Roscher [Tue, 29 Apr 2008 20:46:53 +0000 (13:46 -0700)]
IB/ehca: Allocate event queue size depending on max number of CQs and QPs
If a lot of QPs fall into Error state at once and the EQ of the
respective HCA is too small, it might overrun, causing the eHCA driver
to stop processing completion events and calling the application's
completion handlers, effectively causing traffic to stop.
Fix this by limiting available QPs and CQs to a customizable max
count, and determining EQ size based on these counts and a worst-case
assumption.
Signed-off-by: Stefan Roscher <stefan.roscher@de.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>
Eli Cohen [Tue, 29 Apr 2008 20:46:53 +0000 (13:46 -0700)]
IPoIB: Use separate CQ for UD send completions
Use a dedicated CQ for UD send completions. Also, do not arm the UD
send CQ, which reduces the number of interrupts generated. This patch
farther reduces overhead by not calling poll CQ for every posted send
WR -- it does polls only when there 16 or more outstanding work requests.
Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
IB/ehca: handle negative return value from ibmebus_request_irq() properly
ehca_create_eq() was assigning a signed return value to an unsiged
local variable and then checking if the variable was < 0, which meant
that errors were always ignored. Fix this by using one variable for
signed integer return values and another for u64 hcall return values.
Bug originally found by Roel Kluin <12o3l@tiscali.nl>.
Signed-off-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>
Steve Wise [Tue, 29 Apr 2008 20:46:52 +0000 (13:46 -0700)]
RDMA/cxgb3: Support peer-2-peer connection setup
Open MPI, Intel MPI and other applications don't respect the iWARP
requirement that the client (active) side of the connection send the
first RDMA message. This class of application connection setup is
called peer-to-peer. Typically once the connection is setup, _both_
sides want to send data.
This patch enables supporting peer-to-peer over the chelsio RNIC by
enforcing this iWARP requirement in the driver itself as part of RDMA
connection setup.
Connection setup is extended, when the peer2peer module option is 1,
such that the MPA initiator will send a 0B Read (the RTR) just after
connection setup. The MPA responder will suspend SQ processing until
the RTR message is received and reply-to.
In the longer term, this will be handled in a standardized way by
enhancing the MPA negotiation so peers can indicate whether they
want/need the RTR and what type of RTR (0B read, 0B write, or 0B send)
should be sent. This will be done by standardizing a few bits of the
private data in order to negotiate all this. However this patch
enables peer-to-peer applications now and allows most of the required
firmware and driver changes to be done and tested now.
Design:
- Add a module option, peer2peer, to enable this mode.
- New firmware support for peer-to-peer mode:
- a new bit in the rdma_init WR to tell it to do peer-2-peer
and what form of RTR message to send or expect.
- process _all_ preposted recvs before moving the connection
into rdma mode.
- passive side: defer completing the rdma_init WR until all
pre-posted recvs are processed. Suspend SQ processing until
the RTR is received.
- active side: expect and process the 0B read WR on offload TX
queue. Defer completing the rdma_init WR until all
pre-posted recvs are processed. Suspend SQ processing until
the 0B read WR is processed from the offload TX queue.
- If peer2peer is set, driver posts 0B read request on offload TX
queue just after posting the rdma_init WR to the offload TX queue.
- Add CQ poll logic to ignore unsolicitied read responses.
Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>