err.no Git - linux-2.6/log

]> err.no Git - linux-2.6/log

projects / linux-2.6 / log

summary | shortlog | log | commit | commitdiff | tree
first ⋅ prev ⋅ next

commit | commitdiff | tree

Herbert Xu [Mon, 26 Mar 2007 03:10:56 +0000 (20:10 -0700)]

[UDP]: Clean up UDP-Lite receive checksum

This patch eliminates some duplicate code for the verification of
receive checksums between UDP-Lite and UDP. It does this by
introducing __skb_checksum_complete_head which is identical to
__skb_checksum_complete_head apart from the fact that it takes
a length parameter rather than computing the first skb->len bytes.

As a result UDP-Lite will be able to use hardware checksum offload
for packets which do not use partial coverage checksums. It also
means that UDP-Lite loopback no longer does unnecessary checksum
verification.

If any NICs start support UDP-Lite this would also start working
automatically.

This patch removes the assumption that msg_flags has MSG_TRUNC clear
upon entry in recvmsg.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Herbert Xu [Wed, 7 Mar 2007 04:29:58 +0000 (20:29 -0800)]

[UDP6]: Restore sk_filter optimisation

This reverts the changeset

    [IPV6]: UDPv6 checksum.

    We always need to check UDPv6 checksum because it is mandatory.

The sk_filter optimisation has nothing to do whether we verify the
checksum.  It simply postpones it to the point when the user calls
recv or poll.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Dumazet [Wed, 7 Mar 2007 04:23:10 +0000 (20:23 -0800)]

[IPV4]: Optimize inet_getpeer()

1) Some sysctl vars are declared __read_mostly

2) We can avoid updating stack[] when doing an AVL lookup only.

lookup() macro is extended to receive a second parameter, that may be NULL
in case of a pure lookup (no need to save the AVL path). This removes
unnecessary instructions, because compiler knows if this _stack parameter is
NULL or not.

text size of net/ipv4/inetpeer.o is 2063 bytes instead of 2107 on x86_64

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Stephen Hemminger [Wed, 7 Mar 2007 04:21:20 +0000 (20:21 -0800)]

[TCP] TCP Yeah: cleanup

Eliminate need for full 6/4/64 divide to compute queue.
Variable maxqueue was really a constant.
Fix indentation.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Stephen Hemminger [Mon, 26 Mar 2007 03:21:15 +0000 (20:21 -0700)]

[TCP] tcp_cubic: faster cube root

The Newton-Raphson method is quadratically convergent so
only a small fixed number of steps are necessary.
Therefore it is faster to unroll the loop. Since div64_64 is no longer
inline it won't cause code explosion.

Also fixes a bug that can occur if x^2 was bigger than 32 bits.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

YOSHIFUJI Hideaki [Wed, 7 Mar 2007 04:19:26 +0000 (20:19 -0800)]

[ATM] ENI: Convert to struct timeval to ktime_t.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Mon, 26 Mar 2007 03:27:59 +0000 (20:27 -0700)]

[NETLINK]: Limit NLMSG_GOODSIZE to 8K.

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

YOSHIFUJI Hideaki [Wed, 28 Feb 2007 14:13:20 +0000 (23:13 +0900)]

[IPV6] ADDRCONF: Fix possible inet6_ifaddr leakage with CONFIG_OPTIMISTIC_DAD.

The inet6_ifaddr for source address of RS is leaked if the address
is not an optimistic address.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Neil Horman [Thu, 26 Apr 2007 00:08:10 +0000 (17:08 -0700)]

[IPV6] ADDRCONF: Optimistic Duplicate Address Detection (RFC 4429) Support.

Nominally an autoconfigured IPv6 address is added to an interface in the
Tentative state (as per RFC 2462).  Addresses in this state remain in this
state while the Duplicate Address Detection process operates on them to
determine their uniqueness on the network.  During this period, these
tentative addresses may not be used for communication, increasing the time
before a node may be able to communicate on a network.  Using Optimistic
Duplicate Address Detection, autoconfigured addresses may be used
immediately for communication on the network, as long as certain rules are
followed to avoid conflicts with other nodes during the Duplicate Address
Detection process.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Yasuyuki Kozakai [Thu, 30 Nov 2006 05:43:28 +0000 (14:43 +0900)]

[IPV6] IP6TUNNEL: Enable to control the handled inner protocol.

ip6_tunnel before supporting IPv4/IPv6 tunnel allows only IPPROTO_IPV6
in configurations from userland. This allows userland to set IPPROTO_IPIP
and 0(wildcard). ip6_tunnel only handles allowed inner protocols.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Yasuyuki Kozakai [Fri, 9 Feb 2007 15:30:33 +0000 (00:30 +0900)]

[IPV6] IP6TUNNEL: Rename functions ip6ip6_* to ip6_tnl_*.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Yasuyuki Kozakai [Wed, 14 Feb 2007 15:43:16 +0000 (00:43 +0900)]

[IPV6] IP6TUNNEL: Add support to IPv4 over IPv6 tunnel.

Some notes
- Protocol number IPPROTO_IPIP is used for IPv4 over IPv6 packets.
- If IP6_TNL_F_USE_ORIG_TCLASS is set, TOS in IPv4 header is copied to
  Traffic Class in outer IPv6 header on xmit.
- IP6_TNL_F_USE_ORIG_FLOWLABEL is ignored on xmit of IPv4 packets, because
  IPv4 header does not have flow label.
- Kernel sends ICMP error if IPv4 packet is too big on xmit, even if
  DF flag is not set.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Yasuyuki Kozakai [Sun, 5 Nov 2006 13:56:45 +0000 (22:56 +0900)]

[IPV6] IP6TUNNEL: Split out generic routine in ip6ip6_xmit().

This enables to add IPv4/IPv6 specific handling later,

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Yasuyuki Kozakai [Fri, 3 Nov 2006 00:39:14 +0000 (09:39 +0900)]

[IPV6] IP6TUNNEL: Split out generic routine in ip6ip6_rcv().

This enables to add IPv4/IPv6 specific handling later,

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Yasuyuki Kozakai [Tue, 31 Oct 2006 14:11:25 +0000 (23:11 +0900)]

[IPV6] IP6TUNNEL: Split out generic routine in ip6ip6_err().

This enables to add IPv4/IPv6 specific error handling later,

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

YOSHIFUJI Hideaki [Thu, 22 Feb 2007 13:05:40 +0000 (22:05 +0900)]

[IPV6]: Decentralize EXPORT_SYMBOLs.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

commit | commitdiff | tree

David S. Miller [Wed, 7 Mar 2007 01:02:35 +0000 (17:02 -0800)]

[NETLINK]: Mirror UDP MSG_TRUNC semantics.

If the user passes MSG_TRUNC in via msg_flags, return
the full packet size not the truncated size.

Idea from Herbert Xu and Thomas Graf.

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Dumazet [Thu, 19 Apr 2007 23:16:32 +0000 (16:16 -0700)]

[NET]: convert network timestamps to ktime_t

We currently use a special structure (struct skb_timeval) and plain
'struct timeval' to store packet timestamps in sk_buffs and struct
sock.

This has some drawbacks :
- Fixed resolution of micro second.
- Waste of space on 64bit platforms where sizeof(struct timeval)=16

I suggest using ktime_t that is a nice abstraction of high resolution
time services, currently capable of nanosecond resolution.

As sizeof(ktime_t) is 8 bytes, using ktime_t in 'struct sock' permits
a 8 byte shrink of this structure on 64bit architectures. Some other
structures also benefit from this size reduction (struct ipq in
ipv4/ip_fragment.c, struct frag_queue in ipv6/reassembly.c, ...)

Once this ktime infrastructure adopted, we can more easily provide
nanosecond resolution on top of it. (ioctl SIOCGSTAMPNS and/or
SO_TIMESTAMPNS/SCM_TIMESTAMPNS)

Note : this patch includes a bug correction in
compat_sock_get_timestamp() where a "err = 0;" was missing (so this
syscall returned -ENOENT instead of 0)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
CC: Stephen Hemminger <shemminger@linux-foundation.org>
CC: John find <linux.kernel@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Stephen Hemminger [Mon, 26 Mar 2007 02:54:23 +0000 (19:54 -0700)]

[NET]: div64_64 consolidate (rev3)

Here is the current version of the 64 bit divide common code.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

James Morris [Mon, 5 Mar 2007 00:12:44 +0000 (16:12 -0800)]

[NET]: Convert xtime.tv_sec to get_seconds()

Where appropriate, convert references to xtime.tv_sec to the
get_seconds() helper function.

Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Stephen Hemminger [Mon, 5 Mar 2007 00:11:51 +0000 (16:11 -0800)]

[PKTGEN]: fix device name handling

Since devices can change name and other wierdness, don't hold onto
a copy of device name, instead use pointer to output device.

Fix a couple of leaks in error handling path as well.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Stephen Hemminger [Mon, 5 Mar 2007 00:08:08 +0000 (16:08 -0800)]

[PKTGEN]: don't use __constant_htonl()

The existing htonl() macro is smart enough to do the same code as
using __constant_htonl() and it looks cleaner.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Stephen Hemminger [Mon, 5 Mar 2007 00:07:28 +0000 (16:07 -0800)]

[PKTGEN]: use random32

Can use random32() now.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Stephen Hemminger [Mon, 5 Mar 2007 00:06:47 +0000 (16:06 -0800)]

[PKTGEN]: use pr_debug

Remove private debug macro and replace with standard version

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Dumazet [Mon, 5 Mar 2007 00:05:44 +0000 (16:05 -0800)]

[NET]: Keep sk_backlog near sk_lock

sk_backlog is a critical field of struct sock. (known famous words)

It is (ab)used in hot paths, in particular in release_sock(), tcp_recvmsg(),
tcp_v4_rcv(), sk_receive_skb().

It really makes sense to place it next to sk_lock, because sk_backlog is only
used after sk_lock locked (and thus memory cache line in L1 cache). This
should reduce cache misses and sk_lock acquisition time.

(In theory, we could only move the head pointer near sk_lock, and leaving tail
far away, because 'tail' is normally not so hot, but keep it simple :) )

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Fri, 2 Mar 2007 21:34:19 +0000 (13:34 -0800)]

[TCP]: FRTO undo response falls back to ratehalving one if ECEd

Undoing ssthresh is disabled in fastretrans_alert whenever
FLAG_ECE is set by clearing prior_ssthresh. The clearing does
not protect FRTO because FRTO operates before fastretrans_alert.
Moving the clearing of prior_ssthresh earlier seems to be a
suboptimal solution to the FRTO case because then FLAG_ECE will
cause a second ssthresh reduction in try_to_open (the first
occurred when FRTO was entered). So instead, FRTO falls back
immediately to the rate halving response, which switches TCP to
CA_CWR state preventing the latter reduction of ssthresh.

If the first ECE arrived before the ACK after which FRTO is able
to decide RTO as spurious, prior_ssthresh is already cleared.
Thus no undoing for ssthresh occurs. Besides, FLAG_ECE should be
set also in the following ACKs resulting in rate halving response
that sees TCP is already in CA_CWR, which again prevents an extra
ssthresh reduction on that round-trip.

If the first ECE arrived before RTO, ssthresh has already been
adapted and prior_ssthresh remains cleared on entry because TCP
is in CA_CWR (the same applies also to a case where FRTO is
entered more than once and ECE comes in the middle).

High_seq must not be touched after tcp_enter_cwr because CWR
round-trip calculation depends on it.

I believe that after this patch, FRTO should be ECN-safe and
even able to take advantage of synergy benefits.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Fri, 2 Mar 2007 21:27:25 +0000 (13:27 -0800)]

[TCP]: Complete icsk-to-local-variable change (in tcp_enter_cwr)

A local variable for icsk was created but this change was
missing. Spotted by Jarek Poplawski.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Tue, 27 Feb 2007 18:10:55 +0000 (10:10 -0800)]

[TCP] Sysctl documentation: tcp_frto_response

In addition, fixed minor things in tcp_frto sysctl.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Tue, 27 Feb 2007 18:09:49 +0000 (10:09 -0800)]

[TCP]: Add two new spurious RTO responses to FRTO

New sysctl tcp_frto_response is added to select amongst these
responses:
- Rate halving based; reuses CA_CWR state (default)
- Very conservative; used to be the only one available (=1)
- Undo cwr; undoes ssthresh and cwnd reductions (=2)

The response with rate halving requires a new parameter to
tcp_enter_cwr because FRTO has already reduced ssthresh and
doing a second reduction there has to be prevented. In addition,
to keep things nice on 80 cols screen, a local variable was
added.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Sat, 24 Feb 2007 00:22:06 +0000 (16:22 -0800)]

[TCP]: Correct reordering detection change (no FRTO case)

The reordering detection must work also when FRTO has not been
used at all which was the original intention of mine, just the
expression of the idea was flawed.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Fri, 23 Feb 2007 06:52:59 +0000 (22:52 -0800)]

[TCP]: Make snd_cwnd_clamp a u32.

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Dumazet [Thu, 22 Feb 2007 11:20:44 +0000 (03:20 -0800)]

[TCP]: Keep copied_seq, rcv_wup and rcv_next together.

I noticed in oprofile study a cache miss in tcp_rcv_established() to read
copied_seq.

ffffffff80400a80 <tcp_rcv_established>: /* tcp_rcv_established total: 4034293
2.0400 */

55493  0.0281 :ffffffff80400bc9:   mov    0x4c8(%r12),%eax copied_seq
543103  0.2746 :ffffffff80400bd1:   cmp    0x3e0(%r12),%eax   rcv_nxt

if (tp->copied_seq == tp->rcv_nxt &&
        len - tcp_header_len <= tp->ucopy.len) {

In this function, the cache line 0x4c0 -> 0x500 is used only for this
reading 'copied_seq' field.

rcv_wup and copied_seq should be next to rcv_nxt field, to lower number of
active cache lines in hot paths. (tcp_rcv_established(), tcp_poll(), ...)

As you suggested, I changed tcp_create_openreq_child() so that these fields
are changed together, to avoid adding a new store buffer stall.

Patch is 64bit friendly (no new hole because of alignment constraints)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 09:13:58 +0000 (01:13 -0800)]

[TCP]: struct *sock argument renamed: sp -> sk

In general, TCP code uses "sk" for struct sock pointer.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

John Heffner [Mon, 26 Mar 2007 02:21:45 +0000 (19:21 -0700)]

[TCP]: Add RFC3742 Limited Slow-Start, controlled by variable sysctl_tcp_max_ssthresh.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Angelo P. Castellani [Thu, 22 Feb 2007 08:23:05 +0000 (00:23 -0800)]

[TCP] YeAH-TCP: algorithm implementation

YeAH-TCP is a sender-side high-speed enabled TCP congestion control
algorithm, which uses a mixed loss/delay approach to compute the
congestion window. It's design goals target high efficiency, internal,
RTT and Reno fairness, resilience to link loss while keeping network
elements load as low as possible.

For further details look here:
http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf

Signed-off-by: Angelo P. Castellani <angelo.castellani@gmail.con>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 07:16:38 +0000 (23:16 -0800)]

[TCP] FRTO: Sysctl documentation for SACK enhanced version

The description is overly verbose to avoid ambiguity between
"SACK enabled" and "SACK enhanced FRTO"

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 07:16:11 +0000 (23:16 -0800)]

[TCP]: SACK enhanced FRTO

Implements the SACK-enhanced FRTO given in RFC4138 using the
variant given in Appendix B.

RFC4138, Appendix B:
  "This means that in order to declare timeout spurious, the TCP
   sender must receive an acknowledgment for non-retransmitted
   segment between SND.UNA and RecoveryPoint in algorithm step 3.
   RecoveryPoint is defined in conservative SACK-recovery
   algorithm [RFC3517]"

The basic version of the FRTO algorithm can still be used also
when SACK is enabled. To enabled SACK-enhanced version, tcp_frto
sysctl is set to 2.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 07:14:42 +0000 (23:14 -0800)]

[TCP]: Prevent reordering adjustments during FRTO

To be honest, I'm not too sure how the reord stuff works in the
first place but this seems necessary.

When FRTO has been active, the one and only retransmission could
be unnecessary but the state and sending order might not be what
the sacktag code expects it to be (to work correctly).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 07:13:47 +0000 (23:13 -0800)]

[TCP] FRTO: Fake cwnd for ssthresh callback

TCP without FRTO would be in Loss state with small cwnd. FRTO,
however, leaves cwnd (typically) to a larger value which causes
ssthresh to become too large in case RTO is triggered again
compared to what conventional recovery would do. Because
consecutive RTOs result in only a single ssthresh reduction,
RTO+cumulative ACK+RTO pattern is required to trigger this
event.

A large comment is included for congestion control module writers
trying to figure out what CA_EVENT_FRTO handler should do because
there exists a remote possibility of incompatibility between
FRTO and module defined ssthresh functions.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 07:11:57 +0000 (23:11 -0800)]

[TCP] FRTO: Reverse RETRANS bit clearing logic

Previously RETRANS bits were cleared on the entry to FRTO. We
postpone that into tcp_enter_frto_loss, which is really the
place were the clearing should be done anyway. This allows
simplification of the logic from a clearing loop to the head skb
clearing only.

Besides, the other changes made in the previous patches to
tcp_use_frto made it impossible for the non-SACKed FRTO to be
entered if other than the head has been rexmitted.

With SACK-enhanced FRTO (and Appendix B), however, there can be
a number retransmissions in flight when RTO expires (same thing
could happen before this patchset also with non-SACK FRTO). To
not introduce any jumpiness into the packet counting during FRTO,
instead of clearing RETRANS bits from skbs during entry, do it
later on.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 07:10:39 +0000 (23:10 -0800)]

[TCP] FRTO: Entry is allowed only during (New)Reno like recovery

This interpretation comes from RFC4138:
    "If the sender implements some loss recovery algorithm other
     than Reno or NewReno [FHG04], the F-RTO algorithm SHOULD
     NOT be entered when earlier fast recovery is underway."

I think the RFC means to say (especially in the light of
Appendix B) that ...recovery is underway (not just fast recovery)
or was underway when it was interrupted by an earlier (F-)RTO
that hasn't yet been resolved (snd_una has not advanced enough).
Thus, my interpretation is that whenever TCP has ever
retransmitted other than head, basic version cannot be used
because then the order assumptions which are used as FRTO basis
do not hold.

NewReno has only the head segment retransmitted at a time.
Therefore, walk up to the segment that has not been SACKed, if
that segment is not retransmitted nor anything before it, we know
for sure, that nothing after the non-SACKed segment should be
either. This assumption is valid because TCPCB_EVER_RETRANS does
not leave holes but each non-SACKed segment is rexmitted
in-order.

Check for retrans_out > 1 avoids more expensive walk through the
skb list, as we can know the result beforehand: F-RTO will not be
allowed.

SACKed skb can turn into non-SACked only in the extremely rare
case of SACK reneging, in this case we might fail to detect
retransmissions if there were them for any other than head. To
get rid of that feature, whole rexmit queue would have to be
walked (always) or FRTO should be prevented when SACK reneging
happens. Of course RTO should still trigger after reneging which
makes this issue even less likely to show up. And as long as the
response is as conservative as it's now, nothing bad happens even
then.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 07:08:34 +0000 (23:08 -0800)]

[TCP]: Prevent unrelated cwnd adjustment while using FRTO

FRTO controls cwnd when it still processes the ACK input or it
has just reverted back to conventional RTO recovery; the normal
rules apply when FRTO has reverted to standard congestion
control.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 07:07:27 +0000 (23:07 -0800)]

[TCP] FRTO: frto_counter modulo-op converted to two assignments

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 07:06:52 +0000 (23:06 -0800)]

[TCP]: Don't enter to fast recovery while using FRTO

Because TCP is not in Loss state during FRTO recovery, fast
recovery could be triggered by accident. Non-SACK FRTO is more
robust than not yet included SACK-enhanced version (that can
receiver high number of duplicate ACKs with SACK blocks during
FRTO), at least with unidirectional transfers, but under
extraordinary patterns fast recovery can be incorrectly
triggered, e.g., Data loss+ACK losses => cumulative ACK with
enough SACK blocks to meet sacked_out >= dupthresh condition).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 07:06:03 +0000 (23:06 -0800)]

[TCP] FRTO: Response should reset also snd_cwnd_cnt

Since purpose is to reduce CWND, we prevent immediate growth. This
is not a major issue nor is "the correct way" specified anywhere.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 07:05:18 +0000 (23:05 -0800)]

[TCP] FRTO: fixes fallback to conventional recovery

The FRTO detection did not care how ACK pattern affects to cwnd
calculation of the conventional recovery. This caused incorrect
setting of cwnd when the fallback becames necessary. The
knowledge tcp_process_frto() has about the incoming ACK is now
passed on to tcp_enter_frto_loss() in allowed_segments parameter
that gives the number of segments that must be added to
packets-in-flight while calculating the new cwnd.

Instead of snd_una we use FLAG_DATA_ACKED in duplicate ACK
detection because RFC4138 states (in Section 2.2):
  If the first acknowledgment after the RTO retransmission
  does not acknowledge all of the data that was retransmitted
  in step 1, the TCP sender reverts to the conventional RTO
  recovery.  Otherwise, a malicious receiver acknowledging
  partial segments could cause the sender to declare the
  timeout spurious in a case where data was lost.

If the next ACK after RTO is duplicate, we do not retransmit
anything, which is equal to what conservative conventional
recovery does in such case.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 07:04:11 +0000 (23:04 -0800)]

[TCP] FRTO: Ignore some uninteresting ACKs

Handles RFC4138 shortcoming (in step 2); it should also have case
c) which ignores ACKs that are not duplicates nor advance window
(opposite dir data, winupdate).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 07:03:35 +0000 (23:03 -0800)]

[TCP] FRTO: Use Disorder state during operation instead of Open

Retransmission counter assumptions are to be changed. Forcing
reason to do this exist: Using sysctl in check would be racy
as soon as FRTO starts to ignore some ACKs (doing that in the
following patches). Userspace may disable it at any moment
giving nice oops if timing is right. frto_counter would be
inaccessible from userspace, but with SACK enhanced FRTO
retrans_out can include other than head, and possibly leaving
it non-zero after spurious RTO, boom again.

Luckily, solution seems rather simple: never go directly to Open
state but use Disorder instead. This does not really change much,
since TCP could anyway change its state to Disorder during FRTO
using path tcp_fastretrans_alert -> tcp_try_to_open (e.g., when
a SACK block makes ACK dubious). Besides, Disorder seems to be
the state where TCP should be if not recovering (in Recovery or
Loss state) while having some retransmissions in-flight (see
tcp_try_to_open), which is exactly what happens with FRTO.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 07:02:30 +0000 (23:02 -0800)]

[TCP] FRTO: Consecutive RTOs keep prior_ssthresh and ssthresh

In case a latency spike causes more than one RTO, the later should not
cause the already reduced ssthresh to propagate into the prior_ssthresh
since FRTO declares all such RTOs spurious at once or none of them. In
treating of ssthresh, we mimic what tcp_enter_loss() does.

The previous state (in frto_counter) must be available until we have
checked it in tcp_enter_frto(), and also ACK information flag in
process_frto().

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 07:01:36 +0000 (23:01 -0800)]

[TCP] FRTO: Comment cleanup & improvement

Moved comments out from the body of process_frto() to the head
(preferred way; see Documentation/CodingStyle). Bonus: it's much
easier to read in this compacted form.

FRTO algorithm and implementation is described in greater detail.
For interested reader, more information is available in RFC4138.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 06:59:58 +0000 (22:59 -0800)]

[TCP] FRTO: Moved tcp_use_frto from tcp.h to tcp_input.c

In addition, removed inline.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 06:56:19 +0000 (22:56 -0800)]

[TCP] FRTO: Separated response from FRTO detection algorithm

FRTO spurious RTO detection algorithm (RFC4138) does not include response
to a detected spurious RTO but can use different response algorithms.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ilpo Järvinen [Thu, 22 Feb 2007 06:54:52 +0000 (22:54 -0800)]

[TCP] FRTO: Incorrectly clears TCPCB_EVER_RETRANS bit

FRTO was slightly too brave... Should only clear
TCPCB_SACKED_RETRANS bit.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Linus Torvalds [Thu, 26 Apr 2007 03:08:32 +0000 (20:08 -0700)]

Linux 2.6.21

.. ok, enough waffling about it already. "Just do it!"

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Linus Torvalds [Wed, 25 Apr 2007 20:51:45 +0000 (13:51 -0700)]

Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6

* master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
[PARPORT] SUNBPP: Fix OOPS when debugging is enabled.
[SPARC] openprom: Switch to ref counting PCI API

commit | commitdiff | tree

Linus Torvalds [Wed, 25 Apr 2007 20:51:21 +0000 (13:51 -0700)]

Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6

* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
[NETLINK]: Infinite recursion in netlink.

commit | commitdiff | tree

Andrew Morton [Wed, 25 Apr 2007 20:01:21 +0000 (13:01 -0700)]

packet: fix error handling

The packet driver is assuming (reasonably) that the (undocumented)
request.errors is an errno. But it is in fact some mysterious bitfield. When
things go wrong we return weird positive numbers to the VFS as pointers and it
goes oops.

Thanks to William Heimbigner for reporting and diagnosis.

(It doesn't oops, but this driver still doesn't work for William)

Cc: William Heimbigner <icxcnika@mar.tar.cc>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Alexey Kuznetsov [Wed, 25 Apr 2007 20:07:28 +0000 (13:07 -0700)]

[NETLINK]: Infinite recursion in netlink.

Reply to NETLINK_FIB_LOOKUP messages were misrouted back to kernel,
which resulted in infinite recursion and stack overflow.

The bug is present in all kernel versions since the feature appeared.

The patch also makes some minimal cleanup:

1. Return something consistent (-ENOENT) when fib table is missing
2. Do not crash when queue is empty (does not happen, but yet)
3. Put result of lookup

Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Jens Axboe [Wed, 25 Apr 2007 09:53:48 +0000 (11:53 +0200)]

cfq-iosched: fix alias + front merge bug

There's a really rare and obscure bug in CFQ, that causes a crash in
cfq_dispatch_insert() due to rq == NULL. One example of the resulting
oops is seen here:

http://lkml.org/lkml/2007/4/15/41

Neil correctly diagnosed the situation for how this can happen: if two
concurrent requests with the exact same sector number (due to direct IO
or aliasing between MD and the raw device access), the alias handling
will add the request to the sortlist, but next_rq remains NULL.

Read the more complete analysis at:

http://lkml.org/lkml/2007/4/25/57

This looks like it requires md to trigger, even though it should
potentially be possible to due with O_DIRECT (at least if you edit the
kernel and doctor some of the unplug calls).

The fix is to move the ->next_rq update to when we add a request to the
rbtree. Then we remove the possibility for a request to exist in the
rbtree code, but not have ->next_rq correctly updated.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

YOSHIFUJI Hideaki [Wed, 25 Apr 2007 02:13:49 +0000 (11:13 +0900)]

IPv6: fix Routing Header Type 0 handling thinko

Oops, thinko. The test for accempting a RH0 was exatly the wrong way
around.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Linus Torvalds [Wed, 25 Apr 2007 01:20:32 +0000 (18:20 -0700)]

Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6

* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [BNX2]: Fix occasional NETDEV WATCHDOG on 5709.
  [IPV6]: Disallow RH0 by default.
  [XFRM]: beet: fix pseudo header length value
  [TCP]: Congestion control initialization.

commit | commitdiff | tree

Michael Chan [Tue, 24 Apr 2007 22:35:53 +0000 (15:35 -0700)]

[BNX2]: Fix occasional NETDEV WATCHDOG on 5709.

Tweak a register setting to prevent the tx mailbox from halting.

Update version to 1.5.8.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

YOSHIFUJI Hideaki [Tue, 24 Apr 2007 21:58:30 +0000 (14:58 -0700)]

[IPV6]: Disallow RH0 by default.

A security issue is emerging. Disallow Routing Header Type 0 by default
as we have been doing for IPv4.
Note: We allow RH2 by default because it is harmless.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Ralf Baechle [Tue, 24 Apr 2007 20:42:20 +0000 (21:42 +0100)]

[MIPS] Fix oprofile logic to physical counter remapping

This did cause oprofile to fail on non-multithreaded systems with more
than 2 processors such as the BCM1480.

Reported by Manish Lachwani (mlachwani@mvista.com).

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

commit | commitdiff | tree

Linus Torvalds [Tue, 24 Apr 2007 18:05:20 +0000 (11:05 -0700)]

Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6

* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6:
  drivers/net/hamradio/baycom_ser_fdx build fix
  usb-net/pegasus: fix pegasus carrier detection
  sis900: Allocate rx replacement buffer before rx operation
  [netdrvr] depca: handle platform_device_add() failure

commit | commitdiff | tree

Andrew Morton [Tue, 24 Apr 2007 16:51:03 +0000 (12:51 -0400)]

drivers/net/hamradio/baycom_ser_fdx build fix

sparc64:

drivers/net/hamradio/baycom_ser_fdx.c: In function `ser12_open':
drivers/net/hamradio/baycom_ser_fdx.c:417: error: `NR_IRQS' undeclared (first us
e in this function)
drivers/net/hamradio/baycom_ser_fdx.c:417: error: (Each undeclared identifier is
reported only once
drivers/net/hamradio/baycom_ser_fdx.c:417: error: for each function it appears i
n.)

Cc: Folkert van Heusden <folkert@vanheusden.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

commit | commitdiff | tree

Dan Williams [Tue, 24 Apr 2007 14:20:06 +0000 (10:20 -0400)]

usb-net/pegasus: fix pegasus carrier detection

Broken by 4a1728a28a193aa388900714bbb1f375e08a6d8e which switched the
return semantics of read_mii_word() but didn't fix usage of
read_mii_word() to conform to the new semantics.

Setting carrier to off based on the NO_CARRIER flag is also incorrect as
that flag only triggers on TX failure and therefore isn't correct when
no frames are being transmitted. Since there is already a 2*HZ MII
carrier check going on, defer to that.

Add a TRUST_LINK_STATUS feature flag for adapters where the LINK_STATUS
flag is actually correct, and use that rather than the NO_CARRIER flag.

Signed-off-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

commit | commitdiff | tree

Neil Horman [Fri, 20 Apr 2007 13:54:58 +0000 (09:54 -0400)]

sis900: Allocate rx replacement buffer before rx operation

The sis900 driver appears to have a bug in which the receive routine
passes the skbuff holding the received frame to the network stack before
refilling the buffer in the rx ring.  If a new skbuff cannot be allocated, the
driver simply leaves a hole in the rx ring, which causes the driver to stop
receiving frames and become non-recoverable without an rmmod/insmod according to
reporters.  This patch reverses that order, attempting to allocate a replacement
buffer first, and receiving the new frame only if one can be allocated.  If no
skbuff can be allocated, the current skbuf in the rx ring is recycled, dropping
the current frame, but keeping the NIC operational.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>

commit | commitdiff | tree

Andrea Righi [Tue, 24 Apr 2007 16:40:57 +0000 (12:40 -0400)]

[netdrvr] depca: handle platform_device_add() failure

The following patch fixes a kernel bug in depca_platform_probe().

We don't use a dynamic pointer for pldev->dev.platform_data, so it seems
that the correct way to proceed if platform_device_add(pldev) fails is
to explicitly set the pldev->dev.platform_data pointer to NULL, before
calling the platform_device_put(pldev), or it will be kfree'ed by
platform_device_release().

Signed-off-by: Jeff Garzik <jeff@garzik.org>

commit | commitdiff | tree

Linus Torvalds [Tue, 24 Apr 2007 16:36:53 +0000 (09:36 -0700)]

Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6

* 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6:
  [PATCH] i386: Fix some warnings added by earlier patch
  [PATCH] x86-64: Always flush all pages in change_page_attr
  [PATCH] x86: Remove noreplacement option
  [PATCH] x86-64: make GART PTEs uncacheable

commit | commitdiff | tree

Linus Torvalds [Tue, 24 Apr 2007 16:32:07 +0000 (09:32 -0700)]

Merge master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6

* master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6:
Revert "adjust legacy IDE resource setting (v2)"

commit | commitdiff | tree

Jiri Kosina [Mon, 23 Apr 2007 21:41:21 +0000 (14:41 -0700)]

8250: fix possible deadlock between serial8250_handle_port() and serial8250_interrupt()

Commit 40b36daa introduced possibility that serial8250_backup_timeout() ->
serial8250_handle_port() locks port.lock without disabling irqs, thus
allowing deadlock against interrupt handler (port.lock is acquired in
serial8250_interrupt()).

Spotted by lockdep.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Akinobu Mita [Mon, 23 Apr 2007 21:41:20 +0000 (14:41 -0700)]

fault injection: add entry to MAINTAINERS

Add maintainer for fault injection support.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Jiri Slaby [Mon, 23 Apr 2007 21:41:20 +0000 (14:41 -0700)]

Char: icom, mark __init as __devinit

Two functions are called from __devinit context, but they are marked as
__init. Fix this.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Jeff Mahoney [Mon, 23 Apr 2007 21:41:17 +0000 (14:41 -0700)]

reiserfs: fix xattr root locking/refcount bug

The listxattr() and getxattr() operations are only protected by a read
lock. As a result, if either of these operations run in parallel, a race
condition exists where the xattr_root will end up being cached twice, which
results in the leaking of a reference and a BUG() on umount.

This patch refactors get_xa_root(), __get_xa_root(), and create_xa_root(),
into one get_xa_root() function that takes the appropriate locking around
the entire critical section.

Reported, diagnosed and tested by Andrea Righi <a.righi@cineca.it>

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: Andrea Righi <a.righi@cineca.it>
Cc: "Vladimir V. Saveliev" <vs@namesys.com>
Cc: Edward Shishkin <edward@namesys.com>
Cc: Alex Zarochentsev <zam@namesys.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Jean Delvare [Mon, 23 Apr 2007 21:41:16 +0000 (14:41 -0700)]

hwmon/w83627ehf: Don't redefine REGION_OFFSET

On ia64, kernel headers define REGION_OFFSET so we can't use that.
Reported by Andrew Morton.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Acked-by: David Hubbard <david.c.hubbard@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Olaf Hering [Mon, 23 Apr 2007 21:41:15 +0000 (14:41 -0700)]

do not truncate irq number for icom adapter

irq values are u32, not u8. Large irq numbers will be truncated,
free_irq may free a different irq.

Remove incorrectly sized struct member and use the one from pci_dev.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Bastian Blank [Mon, 23 Apr 2007 21:41:14 +0000 (14:41 -0700)]

Allow reading tainted flag as user

The commit 34f5a39899f3f3e815da64f48ddb72942d86c366 restricted reading
of the tainted value. The attached patch changes this back to a
write-only check and restores the read behaviour of older versions.

Signed-off-by: Bastian Blank <bastian@waldi.eu.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Andrew Morton [Mon, 23 Apr 2007 21:41:13 +0000 (14:41 -0700)]

acpi-thermal: fix mod_timer() interval

Use relative time, not absolute. Discovered by Jung-Ik (John) Lee
<jilee@google.com>.

Cc: Jung-Ik (John) Lee <jilee@google.com>
Acked-by: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Latchesar Ionkov [Mon, 23 Apr 2007 21:41:11 +0000 (14:41 -0700)]

v9fs: don't use primary fid when removing file

v9fs_insert uses v9fs_fid_lookup (which also locks the fid) to get the
primary fid associated with the dentry and destroys the v9fs_fid struct
after removing the file. If another process called v9fs_fid_lookup on the
same dentry, it may wait undefinitely for the fid's lock (as the struct is
freed).

This patch changes v9fs_remove to use a cloned fid, so the primary fid is
not locked and freed.

Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
Cc: Eric Van Hensbergen <ericvh@hera.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Stefan Richter [Mon, 23 Apr 2007 21:41:10 +0000 (14:41 -0700)]

ieee1394: update MAINTAINERS database

  - update Ben's address
  - replace Ben's contact by mine as raw1394's 2nd contact
  - eth1394's and pcilynx's maintenance doesn't really differ from that
    of other parts of the stack like video1394

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Acked-by: Ben Collins <ben.collins@ubuntu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Christoph Lameter [Mon, 23 Apr 2007 21:41:09 +0000 (14:41 -0700)]

page migration: fix NR_FILE_PAGES accounting

NR_FILE_PAGES must be accounted for depending on the zone that the page
belongs to. If we replace the page in the radix tree then we may have to
shift the count to another zone.

Suggested-by: Ethan Solomita <solo@google.com>
Eventually-typed-in-by: Christoph Lameter <clameter@sgi.com>
Cc: Martin Bligh <mbligh@mbligh.org>
Cc: <stable@kernel.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Miguel Ojeda [Mon, 23 Apr 2007 21:41:09 +0000 (14:41 -0700)]

Fix spelling in drivers/video/Kconfig

Signed-off-by: Miguel Ojeda Sandonis <maxextreme@gmail.com>
Cc: "Antonino A. Daplas" <adaplas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Michael Buesch [Mon, 23 Apr 2007 21:41:08 +0000 (14:41 -0700)]

Add mbuesch to .mailmap

Signed-off-by: Michael Buesch <mb@bu3sch.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Alexey Dobriyan [Mon, 23 Apr 2007 21:41:07 +0000 (14:41 -0700)]

paride drivers: initialize spinlocks

pcd_lock and pf_spin_lock are passed to blk_init_queue() which, seeing them
as valid lock pointer, sets it as ->queue_lock.

The problem is that pcd_lock and pf_spin_lock aren't initialized anywhere.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

David Brownell [Mon, 23 Apr 2007 21:41:06 +0000 (14:41 -0700)]

MAINTAINERS: use lists.linux-foundation.org

Update various mailing list addresses to use "lists.linux-foundation.org"
instead of "lists.osdl.org", to help phase out the old addresses.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Balbir Singh [Mon, 23 Apr 2007 21:41:05 +0000 (14:41 -0700)]

Taskstats fix the structure members alignment issue

We broke the the alignment of members of taskstats to the 8 byte boundary
with the CSA patches.  In the current kernel, the taskstats structure is
not suitable for use by 32 bit applications in a 64 bit kernel.

On x86_64

Offsets of taskstats' members (64 bit kernel, 64 bit application)

@taskstats'offsetof[@taskstats'indices] = (
        0,      # version
        4,      # ac_exitcode
        8,      # ac_flag
        9,      # ac_nice
        16,     # cpu_count
        24,     # cpu_delay_total
        32,     # blkio_count
        40,     # blkio_delay_total
        48,     # swapin_count
        56,     # swapin_delay_total
        64,     # cpu_run_real_total
        72,     # cpu_run_virtual_total
        80,     # ac_comm
        112,    # ac_sched
        113,    # ac_pad
        116,    # ac_uid
        120,    # ac_gid
        124,    # ac_pid
        128,    # ac_ppid
        132,    # ac_btime
        136,    # ac_etime
        144,    # ac_utime
        152,    # ac_stime
        160,    # ac_minflt
        168,    # ac_majflt
        176,    # coremem
        184,    # virtmem
        192,    # hiwater_rss
        200,    # hiwater_vm
        208,    # read_char
        216,    # write_char
        224,    # read_syscalls
        232,    # write_syscalls
        240,    # read_bytes
        248,    # write_bytes
        256,    # cancelled_write_bytes
    );

Offsets of taskstats' members (64 bit kernel, 32 bit application)

@taskstats'offsetof[@taskstats'indices] = (
        0,      # version
        4,      # ac_exitcode
        8,      # ac_flag
        9,      # ac_nice
        12,     # cpu_count
        20,     # cpu_delay_total
        28,     # blkio_count
        36,     # blkio_delay_total
        44,     # swapin_count
        52,     # swapin_delay_total
        60,     # cpu_run_real_total
        68,     # cpu_run_virtual_total
        76,     # ac_comm
        108,    # ac_sched
        109,    # ac_pad
        112,    # ac_uid
        116,    # ac_gid
        120,    # ac_pid
        124,    # ac_ppid
        128,    # ac_btime
        132,    # ac_etime
        140,    # ac_utime
        148,    # ac_stime
        156,    # ac_minflt
        164,    # ac_majflt
        172,    # coremem
        180,    # virtmem
        188,    # hiwater_rss
        196,    # hiwater_vm
        204,    # read_char
        212,    # write_char
        220,    # read_syscalls
        228,    # write_syscalls
        236,    # read_bytes
        244,    # write_bytes
        252,    # cancelled_write_bytes
    );

This is one way to solve the problem without re-arranging structure members
is to pack the structure.  The patch adds an __attribute__((aligned(8))) to
the taskstats structure members so that 32 bit applications using taskstats
can work with a 64 bit kernel.

Using __attribute__((packed)) would break the 64 bit alignment of members.

The fix was tested on x86_64. After the fix, we got

Offsets of taskstats' members (64 bit kernel, 64 bit application)

@taskstats'offsetof[@taskstats'indices] = (
        0,      # version
        4,      # ac_exitcode
        8,      # ac_flag
        9,      # ac_nice
        16,     # cpu_count
        24,     # cpu_delay_total
        32,     # blkio_count
        40,     # blkio_delay_total
        48,     # swapin_count
        56,     # swapin_delay_total
        64,     # cpu_run_real_total
        72,     # cpu_run_virtual_total
        80,     # ac_comm
        112,    # ac_sched
        113,    # ac_pad
        120,    # ac_uid
        124,    # ac_gid
        128,    # ac_pid
        132,    # ac_ppid
        136,    # ac_btime
        144,    # ac_etime
        152,    # ac_utime
        160,    # ac_stime
        168,    # ac_minflt
        176,    # ac_majflt
        184,    # coremem
        192,    # virtmem
        200,    # hiwater_rss
        208,    # hiwater_vm
        216,    # read_char
        224,    # write_char
        232,    # read_syscalls
        240,    # write_syscalls
        248,    # read_bytes
        256,    # write_bytes
        264,    # cancelled_write_bytes
    );

Offsets of taskstats' members (64 bit kernel, 32 bit application)

@taskstats'offsetof[@taskstats'indices] = (
        0,      # version
        4,      # ac_exitcode
        8,      # ac_flag
        9,      # ac_nice
        16,     # cpu_count
        24,     # cpu_delay_total
        32,     # blkio_count
        40,     # blkio_delay_total
        48,     # swapin_count
        56,     # swapin_delay_total
        64,     # cpu_run_real_total
        72,     # cpu_run_virtual_total
        80,     # ac_comm
        112,    # ac_sched
        113,    # ac_pad
        120,    # ac_uid
        124,    # ac_gid
        128,    # ac_pid
        132,    # ac_ppid
        136,    # ac_btime
        144,    # ac_etime
        152,    # ac_utime
        160,    # ac_stime
        168,    # ac_minflt
        176,    # ac_majflt
        184,    # coremem
        192,    # virtmem
        200,    # hiwater_rss
        208,    # hiwater_vm
        216,    # read_char
        224,    # write_char
        232,    # read_syscalls
        240,    # write_syscalls
        248,    # read_bytes
        256,    # write_bytes
        264,    # cancelled_write_bytes
    );

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Jiri Slaby [Mon, 23 Apr 2007 21:41:04 +0000 (14:41 -0700)]

Char: mxser, fix TIOCMIWAIT

There was schedule() missing in the TIOCMIWAIT ioctl. Solve it by moving
the code to the wait_event_interruptible.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Jan Yenya Kasprzak <kas@fi.muni.cz>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Jiri Slaby [Mon, 23 Apr 2007 21:41:03 +0000 (14:41 -0700)]

Char: mxser_new, fix TIOCMIWAIT

There was schedule() missing in the TIOCMIWAIT ioctl. Solve it by moving
the code to the wait_event_interruptible.

Cc: Jan "Yenya" Kasprzak <kas@fi.muni.cz>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Jan Yenya Kasprzak [Mon, 23 Apr 2007 21:41:02 +0000 (14:41 -0700)]

Char: mxser_new, fix recursive locking

Signed-off-by: Jan "Yenya" Kasprzak <kas@fi.muni.cz>
Acked-by: Jiri Slaby <jirislaby@gmail.com>
Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Hugh Dickins [Mon, 23 Apr 2007 21:41:02 +0000 (14:41 -0700)]

fix OOM killing processes wrongly thought MPOL_BIND

I only have CONFIG_NUMA=y for build testing: surprised when trying a memhog
to see lots of other processes killed with "No available memory
(MPOL_BIND)". memhog is killed correctly once we initialize nodemask in
constrained_alloc().

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: William Irwin <bill.irwin@oracle.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Taku Izumi [Mon, 23 Apr 2007 21:41:00 +0000 (14:41 -0700)]

Fix possible NULL pointer access in 8250 serial driver

I encountered the following kernel panic.  The cause of this problem was
NULL pointer access in check_modem_status() in 8250.c.  I confirmed this
problem is fixed by the attached patch, but I don't know this is the
correct fix.

sadc[4378]: NaT consumption 2216203124768 [1]
Modules linked in: binfmt_misc dm_mirror dm_mod thermal processor fan
container button sg e100 eepro100 mii ehci_hcd ohci_hcd

    Pid: 4378, CPU 0, comm: sadc
    psr : 00001210085a2010 ifs : 8000000000000289 ip : [<a000000100482071>]
    Not tainted
    ip is at check_modem_status+0xf1/0x360

    Call Trace:
    [<a000000100013940>] show_stack+0x40/0xa0
    [<a0000001000145a0>] show_regs+0x840/0x880
    [<a0000001000368e0>] die+0x1c0/0x2c0
    [<a000000100036a30>] die_if_kernel+0x50/0x80
    [<a000000100037c40>] ia64_fault+0x11e0/0x1300
    [<a00000010000bdc0>] ia64_leave_kernel+0x0/0x280
    [<a000000100482070>] check_modem_status+0xf0/0x360
    [<a000000100482300>] serial8250_get_mctrl+0x20/0xa0
    [<a000000100478170>] uart_read_proc+0x250/0x860
    [<a0000001001c16d0>] proc_file_read+0x1d0/0x4c0
    [<a0000001001394b0>] vfs_read+0x1b0/0x300
    [<a000000100139cd0>] sys_read+0x70/0xe0
    [<a00000010000bc20>] ia64_ret_from_syscall+0x0/0x20
    [<a000000000010620>] __kernel_syscall_via_break+0x0/0x20

Fix the possible NULL pointer access in check_modem_status() in 8250.c.  The
check_modem_status() would access 'info' member of uart_port structure, but it
is not initialized before uart_open() is called.  The check_modem_status() can
be called through /proc/tty/driver/serial before uart_open() is called.

Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Taku Izumi <izumi2005@soft.fujitsu.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

David Rientjes [Tue, 24 Apr 2007 04:36:13 +0000 (21:36 -0700)]

oom: kill all threads that share mm with killed task

oom_kill_task() calls __oom_kill_task() to OOM kill a selected task.
When finding other threads that share an mm with that task, we need to
kill those individual threads and not the same one.

(Bug introduced by f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584)

Acked-by: William Irwin <bill.irwin@oracle.com>
Acked-by: Christoph Lameter <clameter@engr.sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Andi Kleen [Tue, 24 Apr 2007 11:05:37 +0000 (13:05 +0200)]

[PATCH] i386: Fix some warnings added by earlier patch

Signed-off-by: Andi Kleen <ak@suse.de>

commit | commitdiff | tree

Andi Kleen [Tue, 24 Apr 2007 11:05:37 +0000 (13:05 +0200)]

[PATCH] x86-64: Always flush all pages in change_page_attr

change_page_attr on x86-64 only flushed the TLB for pages that got
reverted. That's not correct: it has to be flushed in all cases.

This bug was added in some earlier changes.

Just flush all pages for now.

This could be done more efficiently, but for this late in the release
this seem to be the best fix.

Pointed out by Jan Beulich

Signed-off-by: Andi Kleen <ak@suse.de>

commit | commitdiff | tree

Andi Kleen [Tue, 24 Apr 2007 11:05:37 +0000 (13:05 +0200)]

[PATCH] x86: Remove noreplacement option

noreplacement is dangerous on modern systems because it will not replace the
context switch FNSAVE with SSE aware FXSAVE. But other places in the kernel still assume
SSE and do FXSAVE and the CPU will then access FXSAVE information with
FNSAVE and cause corruption.

Easiest way to avoid this is to remove the option. It was mostly for paranoia
reasons anyways and alternative()s have been stable for some time.

Thanks to Jeremy F. for reporting and helping debug it.

Signed-off-by: Andi Kleen <ak@suse.de>

commit | commitdiff | tree

Joachim Deguara [Tue, 24 Apr 2007 11:05:36 +0000 (13:05 +0200)]

[PATCH] x86-64: make GART PTEs uncacheable

This patches fixes the silent data corruption problems being seen using the
GART iommu where 4kB of data where incorrect (seen mostly on Nvidia CK804
systems). This fix, to mark the memory regin the GART PTEs reside on as
uncacheable, also brings the code in line with the AGP specification.

Signed-off-by: Joachim Deguara <joachim.deguara@amd.com>
Signed-off-by: Andi Kleen <ak@suse.de>

commit | commitdiff | tree

David S. Miller [Tue, 24 Apr 2007 06:33:17 +0000 (23:33 -0700)]

[PARPORT] SUNBPP: Fix OOPS when debugging is enabled.

The debugging code would dereference __iomem pointers instead
of going through sbus_{read,write}{b,w,l}().

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Alan Cox [Tue, 24 Apr 2007 05:50:53 +0000 (22:50 -0700)]

[SPARC] openprom: Switch to ref counting PCI API

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Patrick McHardy [Tue, 24 Apr 2007 05:39:02 +0000 (22:39 -0700)]

[XFRM]: beet: fix pseudo header length value

draft-nikander-esp-beet-mode-07.txt is not entirely clear on how the length
value of the pseudo header should be calculated, it states "The Header Length
field contains the length of the pseudo header, IPv4 options, and padding in
8 octets units.", but also states "Length in octets (Header Len + 1) * 8".
draft-nikander-esp-beet-mode-08-pre1.txt [1] clarifies this, the header length
should not include the first 8 byte.

This change affects backwards compatibility, but option encapsulation didn't
work until very recently anyway.

[1] http://users.piuha.net/jmelen/BEET/draft-nikander-esp-beet-mode-08-pre1.txt

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

Linux 2.6 source tree