- No need to perform data_len = 0 in the switch command, since data_len
is initialized to 0 in the beginning of the ipq_build_packet_message()
method.
- {ip,ip6}_queue: We can reach nlmsg_failure only from one place; skb is
sure to be NULL when getting there; since skb is NULL, there is no need
to check this fact and call kfree_skb().
Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Rami Rosen [Mon, 9 Jun 2008 23:00:22 +0000 (16:00 -0700)]
netfilter: nf_conntrack: remove unnecessary function declaration
This patch removes nf_ct_ipv4_ct_gather_frags() method declaration from
include/net/netfilter/ipv4/nf_conntrack_ipv4.h, since it is unused in
the Linux kernel.
Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
netfilter: ctnetlink: include conntrack status in destroy event message
When a conntrack is destroyed, the connection status does not get
exported to netlink. I don't see a reason for not doing so. This patch
exports the status on all conntrack events.
Signed-off-by: Fabian Hugelshofer <hugelshofer2006@gmx.ch> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the last packet of a connection isn't accounted when its causing
abnormal termination.
Introduces nf_ct_kill_acct() which increments the accounting counters on
conntrack kill. The new function was necessary, because there are calls
to nf_ct_kill() which don't need accounting:
nf_conntrack_proto_tcp.c line ~847:
Kills ct and returns NF_REPEAT. We don't want to count twice.
nf_conntrack_proto_tcp.c line ~880:
Kills ct and returns NF_DROP. I think we don't want to count dropped
packets.
nf_conntrack_netlink.c line ~824:
As far as I can see ctnetlink_del_conntrack() is used to destroy a
conntrack on behalf of the user. There is an sk_buff, but I don't think
this is an actual packet. Incrementing counters here is therefore not
desired.
Signed-off-by: Fabian Hugelshofer <hugelshofer2006@gmx.ch> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Pekka Enberg [Mon, 9 Jun 2008 22:58:39 +0000 (15:58 -0700)]
netfilter: nf_conntrack_extend: use krealloc() in nf_conntrack_extend.c V2
The ksize() API is going away because it is being abused and it doesn't even
work consistenly across different allocators. Therefore, convert
net/netfilter/nf_conntrack_extend.c to use krealloc().
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
James Morris [Mon, 9 Jun 2008 22:57:24 +0000 (15:57 -0700)]
netfilter: ip_tables: add iptables security table for mandatory access control rules
The following patch implements a new "security" table for iptables, so
that MAC (SELinux etc.) networking rules can be managed separately to
standard DAC rules.
This is to help with distro integration of the new secmark-based
network controls, per various previous discussions.
The need for a separate table arises from the fact that existing tools
and usage of iptables will likely clash with centralized MAC policy
management.
The SECMARK and CONNSECMARK targets will still be valid in the mangle
table to prevent breakage of existing users.
Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
netfilter: ctnetlink: add full support for SCTP to ctnetlink
This patch adds full support for SCTP to ctnetlink. This includes three
new attributes: state, original vtag and reply vtag.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
netfilter: ctnetlink: group errors into logical errno sets
This patch groups ctnetlink errors into three logical sets:
* Malformed messages: if ctnetlink receives a message without some mandatory
attribute, then it returns EINVAL.
* Unsupported operations: if userspace tries to perform an unsupported
operation, then it returns EOPNOTSUPP.
* Unchangeable: if userspace tries to change some attribute of the
conntrack object that can only be set once, then it returns EBUSY.
This patch reduces the number of -EINVAL from 23 to 14 and it results in
5 -EBUSY and 6 -EOPNOTSUPP.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Kuo-lang Tseng [Mon, 9 Jun 2008 22:55:45 +0000 (15:55 -0700)]
netfilter: ebtables: add IPv6 support
It implements matching functions for IPv6 address & traffic class
(merged from the patch sent by Jan Engelhardt [jengelh@computergmbh.de]
http://marc.info/?l=netfilter-devel&m=120182168424052&w=2), protocol,
and layer-4 port id. Corresponding watcher logging function is also
added for IPv6.
Signed-off-by: Kuo-lang Tseng <kuo-lang.tseng@intel.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Mon, 9 Jun 2008 22:51:03 +0000 (15:51 -0700)]
af_iucv: exploit target message class support of IUCV
The first 4 bytes of data to be sent are stored additionally into
the message class field of the send request. A receiving target
program (not an af_iucv socket program) can make use of this
information to pre-screen incoming messages.
Signed-off-by: Ursula Braun <braunu@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Heiko Carstens [Mon, 9 Jun 2008 22:50:30 +0000 (15:50 -0700)]
iucv: prevent cpu hotplug when walking cpu_online_map.
The code used preempt_disable() to prevent cpu hotplug, however that
doesn't protect for cpus being added. So use get_online_cpus() instead.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Ursula Braun <braunu@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Heiko Carstens [Mon, 9 Jun 2008 22:49:57 +0000 (15:49 -0700)]
iucv: fix section mismatch warning.
WARNING: net/iucv/built-in.o(.exit.text+0x9c): Section mismatch in
reference from the function iucv_exit() to the variable
.cpuinit.data:iucv_cpu_notifier
This warning is caused by a reference from unregister_hotcpu_notifier()
from an exit function to a cpuinitdata annotated data structurre.
This is a false positive warning since for the non CPU_HOTPLUG case
unregister_hotcpu_notifier() is a nop.
Use __refdata instead of __cpuinitdata to get rid of the warning.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Ursula Braun <braunu@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch defines a few new message header manipulation routines,
and generalizes the usefulness of another, in preparation for upcoming
rework of TIPC's message rejection code.
Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Allan Stephens [Thu, 5 Jun 2008 00:47:30 +0000 (17:47 -0700)]
tipc: Expand link sequence gap field to 13 bits
This patch increases the "sequence gap" field of the LINK_PROTOCOL
message header from 8 bits to 13 bits (utilizing 5 previously
unused 0 bits). This ensures that the field is big enough to
indicate the loss of up to 8191 consecutive messages on the link,
thereby accommodating the current worst-case scenario of 4000
lost messages.
Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Allan Stephens [Thu, 5 Jun 2008 00:38:22 +0000 (17:38 -0700)]
tipc: Add missing spinlock in name table display code
This patch ensures that the display code that traverses the
publication lists belonging to a name table entry take its
associated spinlock, to protect against a possible change to
one of its "head of list" pointers caused by a simultaneous
name table lookup operation by another thread of control.
Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Allan Stephens [Thu, 5 Jun 2008 00:37:59 +0000 (17:37 -0700)]
tipc: Prevent display of name table types with no publications
This patch adds a check to prevent TIPC's name table display code
from listing a name type entry if it exists only to hold subscription
info, rather than published names.
Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Allan Stephens [Thu, 5 Jun 2008 00:37:34 +0000 (17:37 -0700)]
tipc: Optimize message initialization routine
This patch eliminates the rarely-used "error code" argument
when initializing a TIPC message header, since the default
value of zero is the desired result in most cases; the few
exceptional cases now set the error code explicitly.
Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Allan Stephens [Thu, 5 Jun 2008 00:36:58 +0000 (17:36 -0700)]
tipc: Prevent access of non-existent field in short message header
This patch eliminates a case where TIPC's link code could try reading
a field that is not present in a short message header. (The random
value obtained was not being used, but the read operation could result
in an invalid memory access exception in extremely rare circumstances.)
Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Allan Stephens [Thu, 5 Jun 2008 00:32:35 +0000 (17:32 -0700)]
tipc: Minor optimizations to received message processing
This patch enhances TIPC's handler for incoming messages in two
ways:
- the trivial, single-use routine for processing non-sequenced
messages has been merged into the main handler
- the interface that received a message is now identified without
having to access and/or modify the associated sk_buff
Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Allan Stephens [Thu, 5 Jun 2008 00:29:39 +0000 (17:29 -0700)]
tipc: Fix minor bugs in link session number handling
This patch introduces a new, out-of-range value to indicate that
a link endpoint does not have an existing session established
with its peer, eliminating the risk that the previously used
"invalid session number" value (i.e. zero) might eventually be
assigned as a valid session number and cause incorrect link
behavior.
The patch also introduces explicit bit masking when assigning a
new link session number to ensure it does not exceed 16 bits.
Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Allan Stephens [Thu, 5 Jun 2008 00:29:09 +0000 (17:29 -0700)]
tipc: Fix bugs in message error code display when debugging
This patch corrects two problems in the display of error code
information in TIPC messages when debugging:
- no longer tries to display error code in NAME_DISTRIBUTOR
messages, which don't have the error field
- now displays error code in 24 byte data messages, which do
have the error field
Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Allan Stephens [Thu, 5 Jun 2008 00:28:45 +0000 (17:28 -0700)]
tipc: Standardize error checking on incoming messages via native API
This patch re-orders & re-groups the error checks performed on
messages being delivered to native API ports, in order to clarify the
similarities and differences required for the various message types.
Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Allan Stephens [Thu, 5 Jun 2008 00:28:21 +0000 (17:28 -0700)]
tipc: Fix bug in connection setup via native API
This patch fixes a bug that prevented TIPC from receiving a
connection setup request message on a native TIPC port.
The revised connection setup logic ensures that validation
of the source of a connection-based message is skipped if
the port is not yet connected to a peer.
Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Turn on special bits to save more power when device is shutdown.
Tested on a limited range of hardware, some of the bits are for hardware
that probably isn't even in production (like Yukon Supreme) and was ported
from the vendor driver.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Tobias Diedrich [Sun, 18 May 2008 13:04:29 +0000 (15:04 +0200)]
[netdrvr] forcedeth: reorder suspend/resume code
Match the suspend/resume code ordering in e100/e1000e more closely.
For example the configuration space should be saved on suspend even for
devices that are not up.
Signed-off-by: Tobias Diedrich <ranma+kernel@tdiedrich.de> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Tobias Diedrich [Sun, 18 May 2008 13:02:37 +0000 (15:02 +0200)]
[netdrvr] forcedeth: setup wake-on-lan before shutting down
When hibernating in 'shutdown' mode, after saving the image the suspend hook
is not called again.
However, if the device is in promiscous mode, wake-on-lan will not work.
This adds a shutdown hook to setup wake-on-lan before the final shutdown.
Signed-off-by: Tobias Diedrich <ranma+kernel@tdiedrich.de> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Randy Dunlap [Fri, 30 May 2008 17:29:19 +0000 (10:29 -0700)]
cxgb3: fix build error when INET=n
cxgb3 uses lro_* functions and selects INET_LRO, but this doesn't help unless
INET is already enabled, so make the driver depend on INET also.
sge.c:(.text+0x9f09a): undefined reference to `lro_flush_all'
sge.c:(.text+0x9f62f): undefined reference to `lro_receive_skb'
sge.c:(.text+0x9f8a3): undefined reference to `lro_receive_frags'
sge.c:(.text+0x9fbe0): undefined reference to `lro_vlan_hwaccel_receive_skb'
sge.c:(.text+0x9ffcd): undefined reference to `lro_vlan_hwaccel_receive_frags'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Handle shared IRQ correctly. If IRQ is shared, it typically will show up
as an IRQ with an empty status field. So check in driver and handle it
without crapping out with invalid interrupt message.
Compile tested only.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Rx allocation failure at runtime is non-fatal. For normal Rx frame, it
just reuses the buffer, and during setup it just continues with a smaller
receive buffer pool.
Compile tested only.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Brice Goglin [Fri, 9 May 2008 00:21:49 +0000 (02:21 +0200)]
myri10ge: add multislices support
Add multi-slice/MSI-X support. By default, a single slice
(and the normal firmware) are used. To enable msi-x, multi-slice
mode, one must load the driver with myri10ge_max_slices set to
either -1, or something larger than 1.
Signed-off-by: Brice Goglin <brice@myri.com> Signed-off-by: Andrew Gallatin <gallatin@myri.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Ilpo Järvinen [Thu, 29 May 2008 10:25:23 +0000 (03:25 -0700)]
tcp: Reorganize tcp_sock to fill 64-bit holes & improve locality
I tried to group recovery related fields nearby (non-CA_Open related
variables, to be more accurate) so that one to three cachelines would
not be necessary in CA_Open. These are now contiguously deployed:
...and they're then followed by URG slowpath & keepalive related
variables.
Head of the out_of_order_queue always needed for empty checks, if
that's empty (and TCP is in CA_Open), following ~200 bytes (in 64-bit)
shouldn't be necessary for anything. If only OFO queue exists but TCP
is in CA_Open, selective_acks (and possibly duplicate_sack) are
necessary besides the out_of_order_queue but the rest of the block
again shouldn't be (ie., the other direction had losses).
As the cacheline boundaries depend on many factors in the preceeding
stuff, trying to align considering them doesn't make too much sense.
Commented one ordering hazard.
There are number of low utilized u8/16s that could be combined get 2
bytes less in total so that the hole could be made to vanish (includes
at least ecn_flags, urg_data, urg_mode, frto_counter, nonagle).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Carlson [Mon, 26 May 2008 06:51:01 +0000 (23:51 -0700)]
tg3: Update version to 3.93
This patch increments the version to 3.93.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: Benjamin Li <benli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Carlson [Mon, 26 May 2008 06:49:44 +0000 (23:49 -0700)]
tg3: Add shmem options.
This patch adds some options obtained through shared memory.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: Benjamin Li <benli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Carlson [Mon, 26 May 2008 06:48:31 +0000 (23:48 -0700)]
tg3: Add 5785 ASIC revision
This patch added the 5785 device ID and ASIC revision to the code.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: Benjamin Li <benli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Carlson [Mon, 26 May 2008 06:47:41 +0000 (23:47 -0700)]
tg3: Add libphy support.
This patch introduces the libphy support.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: Benjamin Li <benli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Carlson [Thu, 29 May 2008 08:37:54 +0000 (01:37 -0700)]
tg3: Add mdio bus registration
This patch introduces code to register and unregister the tg3 mdio bus
with the system.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: Benjamin Li <benli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Carlson [Mon, 26 May 2008 06:45:58 +0000 (23:45 -0700)]
tg3: Add TG3_FLG3_USE_PHYLIB
This patch introduces the TG3_FLG3_USE_PHYLIB flag and applies it to
some select places. This work makes later patches a little easier to
read.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: Benjamin Li <benli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Carlson [Mon, 26 May 2008 06:45:08 +0000 (23:45 -0700)]
tg3: Code cleanup.
This patch applies cleanups that would otherwise clutter later
patches.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: Benjamin Li <benli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Carlson [Mon, 26 May 2008 06:44:14 +0000 (23:44 -0700)]
tg3: Pure code movement.
This patch moves some functions towards the top of the file to avoid
unnecessary function prototypes.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: Benjamin Li <benli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alan Cox [Mon, 26 May 2008 06:40:58 +0000 (23:40 -0700)]
ppp: push BKL down into the driver
I've pushed it down as far as I dare at this point. Someone familiar with
the internal PPP semantics can probably push it further. Another step to
eliminating the old BKL ioctl usage.
Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Fri, 23 May 2008 07:22:04 +0000 (00:22 -0700)]
vlan: Use bitmask of feature flags instead of seperate feature bits
Herbert Xu points out that the use of seperate feature bits for features
to be propagated to VLAN devices is going to get messy real soon.
Replace the VLAN feature bits by a bitmask of feature flags to be
propagated and restore the old GSO_SHIFT/MASK values.
Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Komuro [Mon, 5 May 2008 01:51:12 +0000 (10:51 +0900)]
fmvj18x_cs: add NextCom NC5310 rev B support
fmvj18x_cs: The manfid of "NextCom NC5310 rev B" is MANF_ID_FUJITSU.
but this card is MBH10302 based card.
use ConfigBase to detect the cardtype for this card.
Signed-off-by: Komuro <komurojun-mbn@nifty.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Paul Gortmaker [Thu, 22 May 2008 16:43:50 +0000 (12:43 -0400)]
phylib: do EXPORT_SYMBOL on get_phy_id
Commit cac1f3c8 factored out the code for get_phy_id so that it
could be reused in multiple places. Turns out that some of the
users can be modular, so we need to export this symbol as well.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Thomas Graf [Thu, 22 May 2008 17:48:59 +0000 (10:48 -0700)]
netlink: Fix nla_parse_nested_compat() to call nla_parse() directly
The purpose of nla_parse_nested_compat() is to parse attributes which
contain a struct followed by a stream of nested attributes. So far,
it called nla_parse_nested() to parse the stream of nested attributes
which was wrong, as nla_parse_nested() expects a container attribute
as data which holds the attribute stream. It needs to call
nla_parse() directly while pointing at the next possible alignment
point after the struct in the beginning of the attribute.
With this patch, I can no longer reproduce the reported leftover
warnings.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Nate Case [Sat, 17 May 2008 05:40:39 +0000 (06:40 +0100)]
PHYLIB: Add 1000Base-X support for Broadcom bcm5482
Configure the BCM5482S secondary SerDes for 1000Base-X mode when the
appropriate dev_flags are passed in to phy_connect(). This is
needed when the PHY is used for fiber and backplane connections.
Signed-off-by: Nate Case <ncase@xes-inc.com> Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Jay Vosburgh [Sun, 18 May 2008 04:10:14 +0000 (21:10 -0700)]
bonding: Add "follow" option to fail_over_mac
Add a "follow" selection for fail_over_mac. This option
causes the MAC address to move from slave to slave as the active
slave changes. This is in addition to the existing fail_over_mac option
that causes the bond's MAC address to change during failover.
This new option is useful for devices that cannot tolerate
multiple ports using the same MAC address simultaneously, either
because it confuses them or incurs a performance penalty (as is the
case with some LPAR-aware multiport devices). Because the MAC of the
bond itself does not change, the "follow" option is slightly more
reliable during failover and doesn't change the MAC of the bond during
operation.
This patch requires a previous ARP monitor change to properly
handle RTNL during failovers.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Jay Vosburgh [Sun, 18 May 2008 04:10:13 +0000 (21:10 -0700)]
bonding: refactor ARP active-backup monitor
Refactor ARP monitor for active-backup mode. The motivation for
this is to take care of locking issues in a clear manner (particularly to
correctly handle RTNL vs. the bonding locks). Currently, the a-b ARP
monitor does not hold RTNL at all, but future changes will require RTNL
during ARP monitor failovers.
Rather than using conditional locking, this patch instead breaks
up the ARP monitor into three discrete steps: inspection, commit changes,
and probe. The inspection phase marks slaves that require link state
changes. The commit phase is only called if inspection detects that
changes are needed, and is called with RTNL. Lastly, the probe phase
issues the ARP probes that the inspection phase uses to determine link
state.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Moni Shoua [Sun, 18 May 2008 04:10:12 +0000 (21:10 -0700)]
bonding: Send more than one gratuitous ARP when slave takes over
With IPoIB, reception of gratuitous ARP by neighboring hosts
is essential for a successful change of slaves in case of failure.
Otherwise, they won't learn about the HW address change and need
to wait a long time until the neighboring system gives up and sends
an ARP request to learn the new HW address. This patch decreases
the chance for a lost of a gratuitous ARP packet by sending it more
than once. The number retries is configurable and can be set with a
module param.
Signed-off-by: Moni Shoua <monis@voltaire.com> Acked-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Pavel Emelyanov [Sun, 18 May 2008 04:10:11 +0000 (21:10 -0700)]
bonding: Remove unneeded list_empty checks.
Some places iterate over the checked list right after the check
itself, so even if the list is empty, the list_for_each_xxx
iterator will make everything right by himself.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Jay Vosburgh [Sun, 18 May 2008 04:10:08 +0000 (21:10 -0700)]
bonding: remove test for IP in ARP monitor
Remove bond_has_ip and all references to it. With this change,
the ARP monitor will always send ARP probes if the master is up and has
at least one slave. If the bond has an IP address, it is used in the
ARP probe; if not, the probes are sent with all zeros in the sender's
IP address (which is consistent with an RFC 2131 4.4.1 duplicate address
probe).
This is useful for cases when bonding itself is hidden underneath
a layer of virtual devices, e.g., with Xen.
Change suggested by Tsutomu Fujii <t-fujii@nb.jp.nec.com>, who
included a one-line patch that only affected active-backup mode.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Jay Vosburgh [Sun, 18 May 2008 04:10:07 +0000 (21:10 -0700)]
bonding: Use msecs_to_jiffies, eliminate panic
Convert bonding to use msecs_to_jiffies instead of doing the
math. For the ARP monitor, there was an underflow problem that could
result in an infinite loop. The miimon already had that worked around,
but this is cleaner.
Originally by Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Jay Vosburgh corrected a math error in the original; Nicolas' original
commit message is:
When setting arp_interval parameter to a very low value, delta_in_ticks
for next arp might become 0, causing an infinite loop.
See http://bugzilla.kernel.org/show_bug.cgi?id=10680
Same problem for miimon parameter already fixed, but fix might be
enhanced, by using msecs_to_jiffies() function.
Signed-off-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>