This patch removes some code which ran at every boot, but does not seem to do
anything anymore. Please test. It works for me but mistakes can happen.
Signed-off-by: Karol Swietlicki <magotari@gmail.com> Signed-off-by: Jeff Dike <jdike@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jeff Dike [Tue, 5 Feb 2008 06:30:34 +0000 (22:30 -0800)]
uml: arch/um/include/init.h needs a definition of __used
init.h started breaking now for some reason. It turns out that there wasn't a
definition of __used. Fixed this by copying the relevant stuff from
compiler.h in the userspace case, and including compiler.h in the kernel case.
[xiyou.wangcong@gmail.com: added definition of __section] Signed-off-by: Jeff Dike <jdike@linux.intel.com> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Julia Lawall [Tue, 5 Feb 2008 06:30:32 +0000 (22:30 -0800)]
arch/cris: add a missing iounmap
An extra error handling label is needed for the case where the ioremap has
succeeded.
The problem was detected using the following semantic match
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@@
type T,T1,T2;
identifier E;
statement S;
expression x1,x2;
constant C;
int ret;
@@
T E;
...
* E = ioremap(...);
if (E == NULL) S
... when != iounmap(E)
when != if (E != NULL) { ... iounmap(E); ...}
when != x1 = (T1)E
if (...) {
... when != iounmap(E)
when != if (E != NULL) { ... iounmap(E); ...}
when != x2 = (T2)E
(
* return;
|
* return C;
|
* return ret;
)
}
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Mikael Starvik <starvik@axis.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The cc-cross-prefix is new and developed on request from Geert Uytterhoeven.
With cc-cross-prefix it is now much easier to have a few default cross compile
prefixes and defaulting to none - if none of them were present. ARCH
maintainers are expected to pick up this feature soon.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andreas Schwab <schwab@suse.de> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
b43: avoid unregistering device objects during suspend
Modify the b43 driver to avoid deadlocking suspend and resume, which happens
as a result of attempting to unregister device objects locked by the PM core
during suspend/resume cycles. Also, make it use a suspend-safe method of
unregistering device object in the resume error path.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Michael Buesch <mb@bu3sch.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <greg@kroah.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
leds: add possibility to remove leds classdevs during suspend/resume
Make it possible to unregister a led classdev object in a safe way during a
suspend/resume cycle.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Michael Buesch <mb@bu3sch.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <greg@kroah.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
HWRNG: add possibility to remove hwrng devices during suspend/resume
Make it possible to unregister a Hardware Random Number Generator
device object in a safe way during a suspend/resume cycle.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Michael Buesch <mb@bu3sch.de> Cc: Michael Buesch <mb@bu3sch.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <greg@kroah.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Misc: Add possibility to remove misc devices during suspend/resume
Make it possible to unregister a misc device object in a safe way during a
suspend/resume cycle.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Michael Buesch <mb@bu3sch.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <greg@kroah.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark Gross [Tue, 5 Feb 2008 06:30:09 +0000 (22:30 -0800)]
latency.c: use QoS infrastructure
Replace latency.c use with pm_qos_params use.
Signed-off-by: mark gross <mgross@linux.intel.com> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: Len Brown <lenb@kernel.org> Cc: Jaroslav Kysela <perex@suse.cz> Cc: Takashi Iwai <tiwai@suse.de> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark Gross [Tue, 5 Feb 2008 06:30:08 +0000 (22:30 -0800)]
pm qos infrastructure and interface
The following patch is a generalization of the latency.c implementation done
by Arjan last year. It provides infrastructure for more than one parameter,
and exposes a user mode interface for processes to register pm_qos
expectations of processes.
This interface provides a kernel and user mode interface for registering
performance expectations by drivers, subsystems and user space applications on
one of the parameters.
Currently we have {cpu_dma_latency, network_latency, network_throughput} as
the initial set of pm_qos parameters.
The infrastructure exposes multiple misc device nodes one per implemented
parameter. The set of parameters implement is defined by pm_qos_power_init()
and pm_qos_params.h. This is done because having the available parameters
being runtime configurable or changeable from a driver was seen as too easy to
abuse.
For each parameter a list of performance requirements is maintained along with
an aggregated target value. The aggregated target value is updated with
changes to the requirement list or elements of the list. Typically the
aggregated target value is simply the max or min of the requirement values
held in the parameter list elements.
>From kernel mode the use of this interface is simple:
Will insert a named element in the list for that identified PM_QOS
parameter with the target value. Upon change to this list the new target is
recomputed and any registered notifiers are called only if the target value
is now different.
Will search the list identified by the param_id for the named list element
and then update its target value, calling the notification tree if the
aggregated target is changed. with that name is already registered.
pm_qos_remove_requirement(param_id, name):
Will search the identified list for the named element and remove it, after
removal it will update the aggregate target and call the notification tree
if the target was changed as a result of removing the named requirement.
>From user mode:
Only processes can register a pm_qos requirement. To provide for
automatic cleanup for process the interface requires the process to register
its parameter requirements in the following way:
To register the default pm_qos target for the specific parameter, the
process must open one of /dev/[cpu_dma_latency, network_latency,
network_throughput]
As long as the device node is held open that process has a registered
requirement on the parameter. The name of the requirement is
"process_<PID>" derived from the current->pid from within the open system
call.
To change the requested target value the process needs to write a s32
value to the open device node. This translates to a
pm_qos_update_requirement call.
To remove the user mode request for a target value simply close the device
node.
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: fix build again]
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: mark gross <mgross@linux.intel.com> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: Len Brown <lenb@kernel.org> Cc: Jaroslav Kysela <perex@suse.cz> Cc: Takashi Iwai <tiwai@suse.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Venki Pallipadi <venkatesh.pallipadi@intel.com> Cc: Adam Belay <abelay@novell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Tue, 5 Feb 2008 06:30:04 +0000 (22:30 -0800)]
agp: alpha nopage
Convert AGP alpha driver from nopage to fault.
NULL is NOPAGE_SIGBUS, so we aren't changing behaviour there.
Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Samuel Thibault [Tue, 5 Feb 2008 06:30:03 +0000 (22:30 -0800)]
Alpha doesn't use socketcall
Alpha doesn't use socketcall and doesn't provide __NR_socketcall.
Signed-off-by: Samuel Thibault <samuel.thibault@citrix.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Tue, 5 Feb 2008 06:30:02 +0000 (22:30 -0800)]
alpha: atomic_add_return() should return int
Prevents stuff like
drivers/crypto/hifn_795x.c:2443: warning: format '%d' expects type 'int', but argument 4 has type 'long int'
drivers/crypto/hifn_795x.c:2443: warning: format '%d' expects type 'int', but argument 4 has type 'long int'
(at least).
Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lucas Woods [Tue, 5 Feb 2008 06:30:01 +0000 (22:30 -0800)]
arch/alpha: remove duplicate includes
Signed-off-by: Lucas Woods <woodzy@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Mundt [Tue, 5 Feb 2008 06:29:59 +0000 (22:29 -0800)]
nommu: add new vmalloc_user() and remap_vmalloc_range() interfaces.
This builds on top of the earlier vmalloc_32_user() work introduced by b50731732f926d6c49fd0724616a7344c31cd5cf, as we now have places in the nommu
allmodconfig that hit up against these missing APIs.
As vmalloc_32_user() is already implemented, this is moved over to
vmalloc_user() and simply made a wrapper. As all current nommu platforms are
32-bit addressable, there's no special casing we have to do for ZONE_DMA and
things of that nature as per GFP_VMALLOC32.
remap_vmalloc_range() needs to check VM_USERMAP in order to figure out whether
we permit the remap or not, which means that we also have to rework the
vmalloc_user() code to grovel for the VMA and set the flag.
Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: David McCullough <david_mccullough@securecomputing.com> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
FRV: move DMA macros to scatterlist.h for consistency.
To be consistent with other architectures, these two DMA macros should
be defined in scatterlist.h as opposed to dma-mapping.h
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Tue, 5 Feb 2008 06:29:53 +0000 (22:29 -0800)]
FRV: permit the memory to be located elsewhere in NOMMU mode
Permit the memory to be located somewhere other than address 0xC0000000 in
NOMMU mode. The configuration options are already present, it just
requires wiring up in the linker script.
Note that only a limited set of locations of runtime addresses are available
because of the way the CPU protection registers work.
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Casey Schaufler [Tue, 5 Feb 2008 06:29:50 +0000 (22:29 -0800)]
Smack: Simplified Mandatory Access Control Kernel
Smack is the Simplified Mandatory Access Control Kernel.
Smack implements mandatory access control (MAC) using labels
attached to tasks and data containers, including files, SVIPC,
and other tasks. Smack is a kernel based scheme that requires
an absolute minimum of application support and a very small
amount of configuration data.
Smack uses extended attributes and
provides a set of general mount options, borrowing technics used
elsewhere. Smack uses netlabel for CIPSO labeling. Smack provides
a pseudo-filesystem smackfs that is used for manipulation of
system Smack attributes.
The patch, patches for ls and sshd, a README, a startup script,
and x86 binaries for ls and sshd are also available on
http://www.schaufler-ca.com
Development has been done using Fedora Core 7 in a virtual machine
environment and on an old Sony laptop.
Smack provides mandatory access controls based on the label attached
to a task and the label attached to the object it is attempting to
access. Smack labels are deliberately short (1-23 characters) text
strings. Single character labels using special characters are reserved
for system use. The only operation applied to Smack labels is equality
comparison. No wildcards or expressions, regular or otherwise, are
used. Smack labels are composed of printable characters and may not
include "/".
A file always gets the Smack label of the task that created it.
1. Any access requested by a task labeled "*" is denied.
2. A read or execute access requested by a task labeled "^"
is permitted.
3. A read or execute access requested on an object labeled "_"
is permitted.
4. Any access requested on an object labeled "*" is permitted.
5. Any access requested by a task on an object with the same
label is permitted.
6. Any access requested that is explicitly defined in the loaded
rule set is permitted.
7. Any other access is denied.
Rules may be explicitly defined by writing subject,object,access
triples to /smack/load.
Smack rule sets can be easily defined that describe Bell&LaPadula
sensitivity, Biba integrity, and a variety of interesting
configurations. Smack rule sets can be modified on the fly to
accommodate changes in the operating environment or even the time
of day.
Some practical use cases:
Hierarchical levels. The less common of the two usual uses
for MLS systems is to define hierarchical levels, often
unclassified, confidential, secret, and so on. To set up smack
to support this, these rules could be defined:
C Unclass rx
S C rx
S Unclass rx
TS S rx
TS C rx
TS Unclass rx
A TS process can read S, C, and Unclass data, but cannot write it.
An S process can read C and Unclass. Note that specifying that
TS can read S and S can read C does not imply TS can read C, it
has to be explicitly stated.
Non-hierarchical categories. This is the more common of the
usual uses for an MLS system. Since the default rule is that a
subject cannot access an object with a different label no
access rules are required to implement compartmentalization.
A case that the Bell & LaPadula policy does not allow is demonstrated
with this Smack access rule:
A case that Bell&LaPadula does not allow that Smack does:
ESPN ABC r
ABC ESPN r
On my portable video device I have two applications, one that
shows ABC programming and the other ESPN programming. ESPN wants
to show me sport stories that show up as news, and ABC will
only provide minimal information about a sports story if ESPN
is covering it. Each side can look at the other's info, neither
can change the other. Neither can see what FOX is up to, which
is just as well all things considered.
Another case that I especially like:
SatData Guard w
Guard Publish w
A program running with the Guard label opens a UDP socket and
accepts messages sent by a program running with a SatData label.
The Guard program inspects the message to ensure it is wholesome
and if it is sends it to a program running with the Publish label.
This program then puts the information passed in an appropriate
place. Note that the Guard program cannot write to a Publish
file system object because file system semanitic require read as
well as write.
The four cases (categories, levels, mutual read, guardbox) here
are all quite real, and problems I've been asked to solve over
the years. The first two are easy to do with traditonal MLS systems
while the last two you can't without invoking privilege, at least
for a while.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Cc: Joshua Brindle <method@manicmethod.com> Cc: Paul Moore <paul.moore@hp.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Cc: "Ahmed S. Darwish" <darwish.07@gmail.com> Cc: Andrew G. Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Moore [Tue, 5 Feb 2008 06:29:47 +0000 (22:29 -0800)]
NetLabel: introduce a new kernel configuration API for NetLabel
Add a new set of configuration functions to the NetLabel/LSM API so that
LSMs can perform their own configuration of the NetLabel subsystem without
relying on assistance from userspace.
Signed-off-by: Paul Moore <paul.moore@hp.com> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Reviewed-by: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Serge E. Hallyn [Tue, 5 Feb 2008 06:29:47 +0000 (22:29 -0800)]
oom_kill: remove uid==0 checks
Root processes are considered more important when out of memory and killing
proceses. The check for CAP_SYS_ADMIN was augmented with a check for
uid==0 or euid==0.
There are several possible ways to look at this:
1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN
alone. However CAP_SYS_RESOURCE is the one that really
means "give me extra resources" so allow for that as
well.
2. Any privileged code should be protected, but uid is not
an indication of privilege. So we should check whether
any capabilities are raised.
3. uid==0 makes processes on the host as well as in containers
more important, so we should keep the existing checks.
4. uid==0 makes processes only on the host more important,
even without any capabilities. So we should be keeping
the (uid==0||euid==0) check but only when
userns==&init_user_ns.
I'm following number 1 here.
Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Serge E. Hallyn [Tue, 5 Feb 2008 06:29:45 +0000 (22:29 -0800)]
capabilities: introduce per-process capability bounding set
The capability bounding set is a set beyond which capabilities cannot grow.
Currently cap_bset is per-system. It can be manipulated through sysctl,
but only init can add capabilities. Root can remove capabilities. By
default it includes all caps except CAP_SETPCAP.
This patch makes the bounding set per-process when file capabilities are
enabled. It is inherited at fork from parent. Noone can add elements,
CAP_SETPCAP is required to remove them.
One example use of this is to start a safer container. For instance, until
device namespaces or per-container device whitelists are introduced, it is
best to take CAP_MKNOD away from a container.
The bounding set will not affect pP and pE immediately. It will only
affect pP' and pE' after subsequent exec()s. It also does not affect pI,
and exec() does not constrain pI'. So to really start a shell with no way
of regain CAP_MKNOD, you would do
The following test program will get and set the bounding
set (but not pI). For instance
./bset get
(lists capabilities in bset)
./bset drop cap_net_raw
(starts shell with new bset)
(use capset, setuid binary, or binary with
file capabilities to try to increase caps)
Andrew Morgan [Tue, 5 Feb 2008 06:29:43 +0000 (22:29 -0800)]
Remove unnecessary include from include/linux/capability.h
KaiGai Kohei observed that this line in the linux header is not needed.
Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: KaiGai Kohei <kaigai@kaigai.gr.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morgan [Tue, 5 Feb 2008 06:29:42 +0000 (22:29 -0800)]
Add 64-bit capability support to the kernel
The patch supports legacy (32-bit) capability userspace, and where possible
translates 32-bit capabilities to/from userspace and the VFS to 64-bit
kernel space capabilities. If a capability set cannot be compressed into
32-bits for consumption by user space, the system call fails, with -ERANGE.
FWIW libcap-2.00 supports this change (and earlier capability formats)
We want to keep the vfs_cap_data.data[] structure, using two 'data's for
64-bit caps (and later three for 96-bit caps), whereas b68680e4731abbd78863063aaa0dca2a6d8cc723 had gotten rid of the 'data' struct
made its members inline.
The 64-bit caps patch keeps the stack abuse fix at get_file_caps(), which was
the more important part of that patch.
[akpm@linux-foundation.org: coding-style fixes] Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
VFS: Reorder vfs_getxattr to avoid unnecessary calls to the LSM
Originally vfs_getxattr would pull the security xattr variable using
the inode getxattr handle and then proceed to clobber it with a subsequent call
to the LSM.
This patch reorders the two operations such that when the xattr requested is
in the security namespace it first attempts to grab the value from the LSM
directly.
If it fails to obtain the value because there is no module present or the
module does not support the operation it will fall back to using the inode
getxattr operation.
In the event that both are inaccessible it returns EOPNOTSUPP.
Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Chris Wright <chrisw@sous-sol.org> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
VFS/Security: Rework inode_getsecurity and callers to return resulting buffer
This patch modifies the interface to inode_getsecurity to have the function
return a buffer containing the security blob and its length via parameters
instead of relying on the calling function to give it an appropriately sized
buffer.
Security blobs obtained with this function should be freed using the
release_secctx LSM hook. This alleviates the problem of the caller having to
guess a length and preallocate a buffer for this function allowing it to be
used elsewhere for Labeled NFS.
The patch also removed the unused err parameter. The conversion is similar to
the one performed by Al Viro for the security_getprocattr hook.
Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Chris Wright <chrisw@sous-sol.org> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Mackall [Tue, 5 Feb 2008 06:29:37 +0000 (22:29 -0800)]
slob: reduce external fragmentation by using three free lists
By putting smaller objects on their own list, we greatly reduce overall
external fragmentation and increase repeatability. This reduces total SLOB
overhead from > 50% to ~6% on a simple boot test.
Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fengguang Wu [Tue, 5 Feb 2008 06:29:36 +0000 (22:29 -0800)]
writeback: speed up writeback of big dirty files
After making dirty a 100M file, the normal behavior is to start the
writeback for all data after 30s delays. But sometimes the following
happens instead:
- after 30s: ~4M
- after 5s: ~4M
- after 5s: all remaining 92M
Some analyze shows that the internal io dispatch queues goes like this:
s_io s_more_io
-------------------------
1) 100M,1K 0
2) 1K 96M
3) 0 96M
1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write <= 0, so write more
3) 1K written, nr_to_write > 0, no more writes(BUG)
nr_to_write > 0 in (3) fools the upper layer to think that data have all
been written out. The big dirty file is actually still sitting in
s_more_io. We cannot simply splice s_more_io back to s_io as soon as s_io
becomes empty, and let the loop in generic_sync_sb_inodes() continue: this
may starve newly expired inodes in s_dirty. It is also not an option to
draw inodes from both s_more_io and s_dirty, an let the loop go on: this
might lead to live locks, and might also starve other superblocks in sync
time(well kupdate may still starve some superblocks, that's another bug).
We have to return when a full scan of s_io completes. So nr_to_write > 0
does not necessarily mean that "all data are written". This patch
introduces a flag writeback_control.more_io to indicate that more io should
be done. With it the big dirty file no longer has to wait for the next
kupdate invokation 5s later.
In sync_sb_inodes() we only set more_io on super_blocks we actually
visited. This avoids the interaction between two pdflush deamons.
Also in __sync_single_inode() we don't blindly keep requeuing the io if the
filesystem cannot progress. Failing to do so may lead to 100% iowait.
Tested-by: Mike Snitzer <snitzer@gmail.com> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Michael Rubin <mrubin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sam Ravnborg [Tue, 5 Feb 2008 06:29:35 +0000 (22:29 -0800)]
mm: fix section mismatch warning in sparse.c
Fix following warning:
WARNING: mm/built-in.o(.text+0x22069): Section mismatch in reference from the function sparse_early_usemap_alloc() to the function .init.text:__alloc_bootmem_node()
static sparse_early_usemap_alloc() were used only by sparse_init()
and with sparse_init() annotated _init it is safe to
annotate sparse_early_usemap_alloc with __init too.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Tue, 5 Feb 2008 06:29:34 +0000 (22:29 -0800)]
mm: fix PageUptodate data race
After running SetPageUptodate, preceeding stores to the page contents to
actually bring it uptodate may not be ordered with the store to set the
page uptodate.
Therefore, another CPU which checks PageUptodate is true, then reads the
page contents can get stale data.
Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
PageUptodate.
Many places that test PageUptodate, do so with the page locked, and this
would be enough to ensure memory ordering in those places if
SetPageUptodate were only called while the page is locked. Unfortunately
that is not always the case for some filesystems, but it could be an idea
for the future.
Also bring the handling of anonymous page uptodateness in line with that of
file backed page management, by marking anon pages as uptodate when they
_are_ uptodate, rather than when our implementation requires that they be
marked as such. Doing allows us to get rid of the smp_wmb's in the page
copying functions, which were especially added for anonymous pages for an
analogous memory ordering problem. Both file and anonymous pages are
handled with the same barriers.
FAQ:
Q. Why not do this in flush_dcache_page?
A. Firstly, flush_dcache_page handles only one side (the smb side) of the
ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
memory barriers in a completely unrelated function is nasty; at least in the
PageUptodate macros, they are located together with (half) the operations
involved in the ordering. Thirdly, the smp_wmb is only required when first
bringing the page uptodate, wheras flush_dcache_page should be called each time
it is written to through the kernel mapping. It is logically the wrong place to
put it.
Q. Why does this increase my text size / reduce my performance / etc.
A. Because it is adding the necessary instructions to eliminate the data-race.
Q. Can it be improved?
A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
run under the page lock, we could avoid the smp_rmb places where PageUptodate
is queried under the page lock. Requires audit of all filesystems and at least
some would need reworking. That's great you're interested, I'm eagerly awaiting
your patches.
Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shaohua Li [Tue, 5 Feb 2008 06:29:33 +0000 (22:29 -0800)]
page migraton: handle orphaned pages
Orphaned page might have fs-private metadata, the page is truncated. As
the page hasn't mapping, page migration refuse to migrate the page. It
appears the page is only freed in page reclaim and if zone watermark is
low, the page is never freed, as a result migration always fail. I thought
we could free the metadata so such page can be freed in migration and make
migration more reliable.
[akpm@linux-foundation.org: go direct to try_to_free_buffers()] Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yasunori Goto [Tue, 5 Feb 2008 06:29:32 +0000 (22:29 -0800)]
Document lowmem_reserve_ratio
Though the lower_zone_protection was changed to lowmem_reserve_ratio, the
document has been not changed. The lowmem_reserve_ratio seems quite hard
to estimate, but there is no guidance. This patch is to change document
for it.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Andrea Arcangeli <andrea@cpushare.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Masatake YAMATO [Tue, 5 Feb 2008 06:29:31 +0000 (22:29 -0800)]
check ADVICE of fadvise64_64 even if get_xip_page is given
I've written some test programs in ltp project. During writing I met an
problem which I cannot solve in user land. So I wrote a patch for linux
kernel. Please, include this patch if acceptable.
The test program tests the 4th parameter of fadvise64_64:
long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
My test case calls fadvise64_64 with invalid advice value and checks errno is
set to EINVAL. About the advice parameter man page says:
...
Permissible values for advice include:
POSIX_FADV_NORMAL
...
POSIX_FADV_SEQUENTIAL
...
POSIX_FADV_RANDOM
...
POSIX_FADV_NOREUSE
...
POSIX_FADV_WILLNEED
...
POSIX_FADV_DONTNEED
...
ERRORS
...
EINVAL An invalid value was specified for advice.
However, I got a bug report that the system call invocations
in my test case returned 0 unexpectedly.
I've inspected the kernel code:
asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
{
struct file *file = fget(fd);
struct address_space *mapping;
struct backing_dev_info *bdi;
loff_t endbyte; /* inclusive */
pgoff_t start_index;
pgoff_t end_index;
unsigned long nrpages;
int ret = 0;
if (!file)
return -EBADF;
if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) {
ret = -ESPIPE;
goto out;
}
mapping = file->f_mapping;
if (!mapping || len < 0) {
ret = -EINVAL;
goto out;
}
if (mapping->a_ops->get_xip_page)
/* no bad return value, but ignore advice */
goto out;
...
out:
fput(file);
return ret;
}
I found the advice parameter is just ignored in the case
mapping->a_ops->get_xip_page is given. This behavior is different from
what is written on the man page. Is this o.k.?
get_xip_page is given if CONFIG_EXT2_FS_XIP is true.
Anyway I cannot find the easy way to detect get_xip_page
field is given or CONFIG_EXT2_FS_XIP is true from the
user space.
I propose the following patch which checks the advice parameter
even if get_xip_page is given.
Larry Woodman [Tue, 5 Feb 2008 06:29:30 +0000 (22:29 -0800)]
Include count of pagecache pages in show_mem() output
The show_mem() output does not include the total number of pagecache
pages. This would be helpful when analyzing the debug information in
the /var/log/messages file after OOM kills occur.
This patch includes the total pagecache pages in that output.
Signed-off-by: Larry Woodman <lwoodman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix dirty page accounting leak with ext3 data=journal
In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 ("Clean up and make
try_to_free_buffers() not race with dirty pages"), try_to_free_buffers
was changed to bail out if the page was dirty.
That in turn caused truncate_complete_page to leak massive amounts of
memory, because the dirty bit was only cleared after the call to
try_to_free_buffers.
So the call to cancel_dirty_page was moved up to have the dirty bit
cleared early in 3e67c0987d7567ad666641164a153dca9a43b11d ("truncate:
clear page dirtiness before running try_to_free_buffers()").
The problem with that fix is, that the page can be redirtied after
cancel_dirty_page was called, eg. like this:
And then we end up with dirty pages being wrongly accounted.
As a result, in ecdfc9787fe527491baefc22dce8b2dbd5b2908d ("Resurrect
'try_to_free_buffers()' VM hackery") the changes to try_to_free_buffers
were reverted, so the original reason for the massive memory leak is
gone, and we can also revert the move of the call to cancel_dirty_page
from truncate_complete_page and get the accounting right again.
I'm not sure if it matters, but opposed to the final check in
__remove_from_page_cache, this one also cares about the task io
accounting, so maybe we want to use this instead, although it's not
quite the clean fix either.
Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de> Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl> Cc: Jan Kara <jack@ucw.cz> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Osterried <osterried@jesse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qi Yong [Tue, 5 Feb 2008 06:29:27 +0000 (22:29 -0800)]
set_page_refcounted() VM_BUG_ON fix
The current PageTail semantic is that a PageTail page is first a
PageCompound page. So remove the redundant PageCompound test in
set_page_refcounted().
Signed-off-by: Qi Yong <qiyong@fc-cn.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qi Yong [Tue, 5 Feb 2008 06:29:23 +0000 (22:29 -0800)]
skip writing data pages when inode is under I_SYNC
Since I_SYNC was split out from I_LOCK, the concern in commit 4b89eed93e0fa40a63e3d7b1796ec1337ea7a3aa ("Write back inode data pages
even when the inode itself is locked") is not longer valid.
We should revert to the original behavior: in __writeback_single_inode(),
when we find an I_SYNC-ed inode and we're not doing a data-integrity sync,
skip writing entirely. Otherwise, we are double calling do_writepages()
Signed-off-by: Qi Yong <qiyong@fc-cn.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Cc: Joern Engel <joern@wohnheim.fh-wedel.de> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Cc: Michael Rubin <mrubin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 5 Feb 2008 06:29:23 +0000 (22:29 -0800)]
mm: don't waste swap on locked pages
try_to_unmap always fails on a page found in a VM_LOCKED vma (unless
migrating), and recycles it back to the active list. But if it's an
anonymous page, we've already allocated swap to it: just wasting swap.
Spot locked pages in page_referenced_one and treat them as referenced.
This patch fixes a sles9 system hang in start_this_handle from a customer
with some heavy workload where all tasks are waiting on kjournald to commit
the transaction, but kjournald waits on t_updates to go down to zero (it
never does).
This was reported as a lowmem shortage deadlock but when checking the debug
data I noticed the VM wasn't under pressure at all (well it was really
under vm pressure, because lots of tasks hanged in the VM prune_dcache
methods trying to flush dirty inodes, but no task was hanging in GFP_NOFS
mode, the holder of the journal handle should have if this was a vm issue
in the first place).
No task was apparently holding the leftover handle in the committing
transaction, so I deduced t_updates was stuck to 1 because a journal_stop
was never run by some path (this turned out to be correct). With a debug
patch adding proper reverse links and stack trace logging in ext3 deployed
in production, I found journal_stop is never run because
mark_inode_dirty_sync is called inside release_task called by do_exit.
(that was quite fun because I would have never thought about this
subtleness, I thought a regular path in ext3 had a bug and it forgot to
call journal_stop)
do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
come back to run journal_stop)
The reason is that shrink_dcache_parent is racy by design (feature not
a bug) and it can do blocking I/O in some case, but the point is that
calling shrink_dcache_parent at the last stage of do_exit isn't safe
for self-reaping tasks.
I guess the memory pressure of the unbalanced highmem system allowed
to trigger this more easily.
Now mainline doesn't have this line in iput (like sles9 has):
if (inode->i_state & I_DIRTY_DELAYED)
mark_inode_dirty_sync(inode);
so it will probably not crash with ext3, but for example ext2 implements an
I/O-blocking ext2_put_inode that will lead to similar screwups with
ext2_free_blocks never coming back and it's definitely wrong to call
blocking-IO paths inside do_exit. So this should fix a subtle bug in
mainline too (not verified in practice though). The equivalent fix for
ext3 is also not verified yet to fix the problem in sles9 but I don't have
doubt it will (it usually takes days to crash, so it'll take weeks to be
sure).
An alternate fix would be to offload that work to a kernel thread, but I
don't think a reschedule for this is worth it, the vm should be able to
collect those entries for the synchronous release_task.
Signed-off-by: Andrea Arcangeli <andrea@suse.de> Cc: Jan Kara <jack@ucw.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bron Gondwana [Tue, 5 Feb 2008 06:29:20 +0000 (22:29 -0800)]
mm/page-writeback: highmem_is_dirtyable option
Add vm.highmem_is_dirtyable toggle
A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of
approximately 2Gb size which contains a hash format that is written
randomly by the dbclean process. On 2.6.16 this process took a few
minutes. With lowmem only accounting of dirty ratios, this takes about 12
hours of 100% disk IO, all random writes.
Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
add the highmem back to the total available memory count.
[akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build] Signed-off-by: Bron Gondwana <brong@fastmail.fm> Cc: Ethan Solomita <solo@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have repeatedly discussed if the cold pages still have a point. There is
one way to join the two lists: Use a single list and put the cold pages at the
end and the hot pages at the beginning. That way a single list can serve for
both types of allocations.
The discussion of the RFC for this and Mel's measurements indicate that
there may not be too much of a point left to having separate lists for
hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2).
Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Martin Bligh <mbligh@mbligh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Robert Bragg [Tue, 5 Feb 2008 06:29:18 +0000 (22:29 -0800)]
mm: don't allow ioremapping of ranges larger than vmalloc space
When running with a 16M IOREMAP_MAX_ORDER (on armv7) we found that the
vmlist search routine in __get_vm_area_node can mistakenly allow a driver
to ioremap a range larger than vmalloc space.
If at the time of the ioremap all existing vmlist areas sit below the
determined alignment then the search routine continues past all entries and
exits the for loop - straight into the found: label - without ever testing
for integer wrapping or that the requested size fits.
We were seeing a driver successfully ioremap 128M of flash even though
there was only 120M of vmalloc space. From that point the system was left
with the remainder of the first 16M of space to vmalloc/ioremap within.
Signed-off-by: Robert Bragg <robert@sixbynine.org> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. Add comments explaining how the function can be called.
2. Collect global diffs in a local array and only spill
them once into the global counters when the zone scan
is finished. This means that we only touch each global
counter once instead of each time we fold cpu counters
into zone counters.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to change the layout of the page tables after an mmap has crossed the
adress space limit of the current page table layout a architecture hook in
get_unmapped_area is needed. The arguments are the address of the new mapping
and the length of it.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(with Martin Schwidefsky <schwidefsky@de.ibm.com>)
The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
first argument. The free functions do not get the mm_struct argument. This
is 1) asymmetrical and 2) to do mm related page table allocations the mm
argument is needed on the free function as well.
[kamalesh@linux.vnet.ibm.com: i386 fix]
[akpm@linux-foundation.org: coding-syle fixes] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Rename drain_all_local_pages to drain_all_pages(). It does drain
all pages not only those of the local processor.
- Eliminate useless interrupt off / on sequences. drain_pages()
disables interrupts on its own. The execution thread is
pinned to processor by the caller. So there is no need to
disable interrupts.
- Put drain_all_pages() declaration in gfp.h and remove the
declarations from suspend.h and from mm/memory_hotplug.c
- Make software suspend call drain_all_pages(). The draining
of processor local pages is may not the right approach if
software suspend wants to support SMP. If they call drain_all_pages
then we can make drain_pages() static.
[akpm@linux-foundation.org: fix build] Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Daniel Walker <dwalker@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Tue, 5 Feb 2008 06:29:10 +0000 (22:29 -0800)]
radix-tree: avoid atomic allocations for preloaded insertions
Most pagecache (and some other) radix tree insertions have the great
opportunity to preallocate a few nodes with relaxed gfp flags. But the
preallocation is squandered when it comes time to allocate a node, we
default to first attempting a GFP_ATOMIC allocation -- that doesn't
normally fail, but it can eat into atomic memory reserves that we don't
need to be using.
Another upshot of this is that it removes the sometimes highly contended
zone->lock from underneath tree_lock. Pagecache insertions are always
performed with a radix tree preload, and after this change, such a
situation will never fall back to kmem_cache_alloc within
radix_tree_node_alloc.
David Miller reports seeing this allocation fail on a highly threaded
sparc64 system:
The calltrace is significant: __do_page_cache_readahead allocates a number
of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
memory to satisfy GFP_ATOMIC allocations. However after the list of pages
goes to mpage_readpages, there can be significant intervals (including disk
IO) before all the pages are inserted into the radix-tree. So the reserves
can easily be depleted at that point. The patch is confirmed to fix the
problem.
Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Mackall [Tue, 5 Feb 2008 06:29:06 +0000 (22:29 -0800)]
maps4: add /proc/kpageflags interface
This makes a subset of physical page flags available to userspace. Together
with /proc/pid/kpagemap, this allows tracking of a wide variety of VM behaviors.
Exported flags are decoupled from the kernel's internal flags. This
allows us to reorder flag bits, and synthesize any bits that get
redefined in terms of other bits.
[akpm@linux-foundation.org: remove unneeded access_ok()]
[akpm@linux-foundation.org: s/0/NULL/] Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Mackall [Tue, 5 Feb 2008 06:29:05 +0000 (22:29 -0800)]
maps4: add /proc/kpagecount interface
This makes physical page map counts available to userspace. Together
with /proc/pid/pagemap and /proc/pid/clear_refs, this can be used to
monitor memory usage on a per-page basis.
[akpm@linux-foundation.org: remove unneeded access_ok()]
[bunk@stusta.de: make struct proc_kpagemap static] Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Mackall [Tue, 5 Feb 2008 06:29:04 +0000 (22:29 -0800)]
maps4: add /proc/pid/pagemap interface
This interface provides a mapping for each page in an address space to its
physical page frame number, allowing precise determination of what pages are
mapped and what pages are shared between processes.
New in this version:
- headers gone again (as recommended by Dave Hansen and Alan Cox)
- 64-bit entries (as per discussion with Andi Kleen)
- swap pte information exported (from Dave Hansen)
- page walker callback for holes (from Dave Hansen)
- direct put_user I/O (as suggested by Rusty Russell)
This patch folds in cleanups and swap PTE support from Dave Hansen
<haveblue@us.ibm.com>.
Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Mackall [Tue, 5 Feb 2008 06:29:03 +0000 (22:29 -0800)]
maps4: regroup task_mmu by interface
Reorder source so that all the code and data for each interface is together.
Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Mackall [Tue, 5 Feb 2008 06:29:03 +0000 (22:29 -0800)]
maps4: move clear_refs code to task_mmu.c
This puts all the clear_refs code where it belongs and probably lets things
compile on MMU-less systems as well.
Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matt Mackall [Tue, 5 Feb 2008 06:29:02 +0000 (22:29 -0800)]
maps4: simplify interdependence of maps and smaps
This pulls the shared map display code out of show_map and puts it in
show_smap where it belongs.
Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Tue, 5 Feb 2008 06:28:59 +0000 (22:28 -0800)]
maps4: rework TASK_SIZE macros
The following replaces the earlier patches sent. It should address
David Rientjes's comments, and has been compile tested on all the
architectures that it touches, save for parisc.
For the /proc/<pid>/pagemap code[1], we need to able to query how
much virtual address space a particular task has. The trick is
that we do it through /proc and can't use TASK_SIZE since it
references "current" on some arches. The process opening the
/proc file might be a 32-bit process opening a 64-bit process's
pagemap file.
I'd like to have that for other architectures. So, add it
for all the architectures that actually use "current" in
their TASK_SIZE. For the others, just add a quick #define
in sched.h to use plain old TASK_SIZE.
Fengguang Wu [Tue, 5 Feb 2008 06:28:56 +0000 (22:28 -0800)]
maps4: add proportional set size accounting in smaps
The "proportional set size" (PSS) of a process is the count of pages it has
in memory, where each page is divided by the number of processes sharing
it. So if a process has 1000 pages all to itself, and 1000 shared with one
other process, its PSS will be 1500.
- lwn.net: "ELC: How much memory are applications really using?"
The PSS proposed by Matt Mackall is a very nice metic for measuring an
process's memory footprint. So collect and export it via
/proc/<pid>/smaps.
Matt Mackall's pagemap/kpagemap and John Berthels's exmap can also do the
job. They are comprehensive tools. But for PSS, let's do it in the simple
way.
vmtruncate is a twisted maze of gotos, this patch cleans it up to have a
proper if else for the two major cases of extending and truncating truncate
and thus makes it a lot more readable while keeping exactly the same
functinality.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 5 Feb 2008 06:28:55 +0000 (22:28 -0800)]
tmpfs: fix shmem_swaplist races
Intensive swapoff testing shows shmem_unuse spinning on an entry in
shmem_swaplist pointing to itself: how does that come about? Days pass...
First guess is this: shmem_delete_inode tests list_empty without taking the
global mutex (so the swapping case doesn't slow down the common case); but
there's an instant in shmem_unuse_inode's list_move_tail when the list entry
may appear empty (a rare case, because it's actually moving the head not the
the list member). So there's a danger of leaving the inode on the swaplist
when it's freed, then reinitialized to point to itself when reused. Fix that
by skipping the list_move_tail when it's a no-op, which happens to plug this.
But this same spinning then surfaces on another machine. Ah, I'd never
suspected it, but shmem_writepage's swaplist manipulation is unsafe: though we
still hold page lock, which would hold off inode deletion if the page were in
pagecache, it doesn't hold off once it's in swapcache (free_swap_and_cache
doesn't wait on locked pages). Hmm: we could put the the inode on swaplist
earlier, but then shmem_unuse_inode could never prune unswapped inodes.
Fix this with an igrab before dropping info->lock, as in shmem_unuse_inode;
though I am a little uneasy about the iput which has to follow - it works, and
I see nothing wrong with it, but it is surprising that shmem inode deletion
may now occur below shmem_writepage. Revisit this fix later?
And while we're looking at these races: the way shmem_unuse tests swapped
without holding info->lock looks unsafe, if we've more than one swap area: a
racing shmem_writepage on another page of the same inode could be putting it
in swapcache, just as we're deciding to remove the inode from swaplist -
there's a danger of going on swap without being listed, so a later swapoff
would hang, being unable to locate the entry. Move that test and removal down
into shmem_unuse_inode, once info->lock is held.
Hugh Dickins [Tue, 5 Feb 2008 06:28:54 +0000 (22:28 -0800)]
tmpfs: radix_tree_preloading
Nick has observed that shmem.c still uses GFP_ATOMIC when adding to page cache
or swap cache, without any radix tree preload: so tending to deplete emergency
reserves of memory.
GFP_ATOMIC remains appropriate in shmem_writepage's add_to_swap_cache: it's
being called under memory pressure, so must not wait for more memory to become
available. But shmem_unuse_inode now has a window in which it can and should
preload with GFP_KERNEL, and say GFP_NOWAIT instead of GFP_ATOMIC in its
add_to_page_cache.
shmem_getpage is not so straightforward: its filepage/swappage integrity
relies upon exchanging between caches under spinlock, and it would need a lot
of restructuring to place the preloads correctly. Instead, follow its pattern
of retrying on races: use GFP_NOWAIT instead of GFP_ATOMIC in
add_to_page_cache, and begin each circuit of the repeat loop with a sleeping
radix_tree_preload, followed immediately by radix_tree_preload_end - that
won't guarantee success in the next add_to_page_cache, but doesn't need to.
And we can then remove that bothersome congestion_wait: when needed, it'll
automatically get done in the course of the radix_tree_preload.
Signed-off-by: Hugh Dickins <hugh@veritas.com> Looks-good-to: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 5 Feb 2008 06:28:53 +0000 (22:28 -0800)]
tmpfs: open a window in shmem_unuse_inode
There are a couple of reasons (patches follow) why it would be good to open a
window for sleep in shmem_unuse_inode, between its search for a matching swap
entry, and its handling of the entry found.
shmem_unuse_inode must then use igrab to hold the inode against deletion in
that window, and its corresponding iput might result in deletion: so it had
better unlock_page before the iput, and might as well release the page too.
Nor is there any need to hold on to shmem_swaplist_mutex once we know we'll
leave the loop. So this unwinding moves from try_to_unuse and shmem_unuse
into shmem_unuse_inode, in the case when it finds a match.
Let try_to_unuse break on error in the shmem_unuse case, as it does in the
unuse_mm case: though at this point in the series, no error to break on.
Hugh Dickins [Tue, 5 Feb 2008 06:28:52 +0000 (22:28 -0800)]
tmpfs: make shmem_unuse more preemptible
shmem_unuse is at present an unbroken search through every swap vector page of
every tmpfs file which might be swapped, all under shmem_swaplist_lock. This
dates from long ago, when the caller held mmlist_lock over it all too: long
gone, but there's never been much pressure for preemptible swapoff.
Make it a little more preemptible, replacing shmem_swaplist_lock by
shmem_swaplist_mutex, inserting a cond_resched in the main loop, and a
cond_resched_lock (on info->lock) at one convenient point in the
shmem_unuse_inode loop, where it has no outstanding kmap_atomic.
If we're serious about preemptible swapoff, there's much further to go e.g.
I'm stupid to let the kmap_atomics of the decreasingly significant HIGHMEM
case dictate preemptiblility for other configs. But as in the earlier patch
to make swapoff scan ptes preemptibly, my hidden agenda is really towards
making memcgroups work, hardly about preemptibility at all.
Hugh Dickins [Tue, 5 Feb 2008 06:28:51 +0000 (22:28 -0800)]
tmpfs: allocate on read when stacked
tmpfs is expected to limit the memory used (unless mounted with nr_blocks=0 or
size=0). But if a stacked filesystem such as unionfs gets pages from a sparse
tmpfs file by reading holes, and then writes to them, it can easily exceed any
such limit at present.
So suppress the SGP_READ "don't allocate page" ZERO_PAGE optimization when
reading for the kernel (a KERNEL_DS check, ugh, sorry about that). Indeed,
pessimistically mark such pages as dirty, so they cannot get reclaimed and
unaccounted by mistake. The venerable shmem_recalc_inode code (originally to
account for the reclaim of clean pages) suffices to get the accounting right
when swappages are dropped in favour of more uptodate filepages.
This also fixes the NULL shmem_swp_entry BUG or oops in shmem_writepage,
caused by unionfs writing to a very sparse tmpfs file: to minimize memory
allocation in swapout, tmpfs requires the swap vector be allocated upfront,
which wasn't always happening in this stacked case.
Hugh Dickins [Tue, 5 Feb 2008 06:28:51 +0000 (22:28 -0800)]
tmpfs: allow filepage alongside swappage
tmpfs has long allowed for a fresh filepage to be created in pagecache, just
before shmem_getpage gets the chance to match it up with the swappage which
already belongs to that offset. But unionfs_writepage now does a
find_or_create_page, divorced from shmem_getpage, which leaves conflicting
filepage and swappage outstanding indefinitely, when unionfs is over tmpfs.
Therefore shmem_writepage (where a page is swizzled from file to swap) must
now be on the lookout for existing swap, ready to free it in favour of the
more uptodate filepage, instead of BUGging on that clash. And when the
add_to_page_cache fails in shmem_unuse_inode, it must defer to an uptodate
filepage, otherwise swapoff would hang. Whereas when add_to_page_cache fails
in shmem_getpage, it should retry in the same way it already does.
Hugh Dickins [Tue, 5 Feb 2008 06:28:50 +0000 (22:28 -0800)]
tmpfs: move swap swizzling into shmem
move_to_swap_cache and move_from_swap_cache functions (which swizzle a page
between tmpfs page cache and swap cache, to avoid page copying) are only used
by shmem.c; and our subsequent fix for unionfs needs different treatments in
the two instances of move_from_swap_cache. Move them from swap_state.c into
their callsites shmem_writepage, shmem_unuse_inode and shmem_getpage, making
add_to_swap_cache externally visible.
shmem.c likes to say set_page_dirty where swap_state.c liked to say
SetPageDirty: respect that diversity, which __set_page_dirty_no_writeback
makes moot (and implies we should lose that "shift page from clean_pages to
dirty_pages list" comment: it's on neither).