Use the proper helper to open a blockdevice by name for filesystem use,
this makes sure it's properly claimed (also added for open-by-number) and
gets rid of the struct file abuse.
Tested by mounting a reiserfs filesystem with external journal.
Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jeff Mahoney <jeffm@suse.com> Acked-by: Edward Shishkin <edward.shishkin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Wed, 30 Apr 2008 07:54:42 +0000 (00:54 -0700)]
fuse: implement perform_write
Introduce fuse_perform_write. With fusexmp (a passthrough filesystem), large
(1MB) writes into a backing tmpfs filesystem are sped up by almost 4 times
(256MB/s vs 71MB/s).
[mszeredi@suse.cz]:
- split into smaller functions
- testing
- duplicate generic_file_aio_write(), so that there's no need to add a
new ->perform_write() a_op. Comment from hch.
Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Quoting Linus (3 years ago, FUSE inclusion discussions):
"User-space filesystems are hard to get right. I'd claim that they
are almost impossible, unless you limit them somehow (shared
writable mappings are the nastiest part - if you don't have those,
you can reasonably limit your problems by limiting the number of
dirty pages you accept through normal "write()" calls)."
Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others). This nicely solved the biggest problem: limiting the number of pages
used for write caching.
Some small details remained, however, which this largish patch attempts to
address. It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks. Performance may not be very
good for certain usage patterns, but generally it should be acceptable.
It has been tested extensively with fsx-linux and bash-shared-mapping.
fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.
The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.
For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented. The per-bdi writeback count is not decremented until the actual
write completes.
On dirtying the page, fuse waits for a previous write to finish before
proceeding. This makes sure, there can only be one temporary page used at a
time for one cached page.
This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?
The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write. It may be buggy or even
malicious, and fail to complete WRITE requests. We don't want unrelated parts
of the system to grind to a halt in such cases.
Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request. There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.
Currently there are several cases where the kernel can block on page
writeback:
- allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
- page migration
- throttle_vm_writeout (through NR_WRITEBACK)
- sync(2)
Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.
As an extra safetly measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default. This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.
With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fuse will use temporary buffers to write back dirty data from memory mappings
(normal writes are done synchronously). This is needed, because there cannot
be any guarantee about the time in which a write will complete.
By using temporary buffers, from the MM's point if view the page is written
back immediately. If the writeout was due to memory pressure, this
effectively migrates data from a full zone to a less full zone.
This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used
as temporary buffers.
[Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: bdi: add separate writeback accounting capability
Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
set, then don't update the per-bdi writeback stats from
test_set_page_writeback() and test_clear_page_writeback().
Misc cleanups:
- convert bdi_cap_writeback_dirty() and friends to static inline functions
- create a flag that includes all three dirty/writeback related flags,
since almst all users will want to have them toghether
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Zijlstra [Wed, 30 Apr 2008 07:54:35 +0000 (00:54 -0700)]
mm: bdi: allow setting a minimum for the bdi dirty limit
Under normal circumstances each device is given a part of the total write-back
cache that relates to its current avg writeout speed in relation to the other
devices.
min_ratio - allows one to assign a minimum portion of the write-back cache to
a particular device. This is useful in situations where you might want to
provide a minimum QoS. (One request for this feature came from flash based
storage people who wanted to avoid writing out at all costs - they of course
needed some pdflush hacks as well)
max_ratio - allows one to assign a maximum portion of the dirty limit to a
particular device. This is useful in situations where you want to avoid one
device taking all or most of the write-back cache. Eg. an NFS mount that is
prone to get stuck, or a FUSE mount which you don't trust to play fair.
Add "min_ratio" to /sys/class/bdi. This indicates the minimum percentage of
the global dirty threshold allocated to this bdi.
[mszeredi@suse.cz]
- fix parsing in min_ratio_store()
- document new sysfs attribute
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Register FUSE's backing_dev_info under sysfs with the name "fuse-MAJOR:MINOR"
Make the fuse control filesystem use s_dev instead of a fuse specific ID.
This makes it easier to match directories under /sys/fs/fuse/connections/ with
directories under /sys/class/bdi, and with actual mounts.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Zijlstra [Wed, 30 Apr 2008 07:54:32 +0000 (00:54 -0700)]
mm: bdi: export BDI attributes in sysfs
Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object.
This allows us to see and set the various BDI specific variables.
In particular this properly exposes the read-ahead window for all relevant
users and /sys/block/<block>/queue/read_ahead_kb should be deprecated.
With patient help from Kay Sievers and Greg KH
[mszeredi@suse.cz]
- split off NFS and FUSE changes into separate patches
- document new sysfs attributes under Documentation/ABI
- do bdi_class_init as a core_initcall, otherwise the "default" BDI
won't be initialized
- remove bdi_init_fmt macro, it's not used very much
Pavel Emelyanov [Wed, 30 Apr 2008 07:54:31 +0000 (00:54 -0700)]
pidns: make pid->level and pid_ns->level unsigned
These values represent the nesting level of a namespace and pids living in it,
and it's always non-negative.
Turning this from int to unsigned int saves some space in pid.c (11 bytes on
x86 and 64 on ia64) by letting the compiler optimize the pid_nr_ns a bit.
E.g. on ia64 this removes the sign extension calls, which compiler adds to
optimize access to pid->nubers[ns->level].
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. sys_getpgid() needs rcu_read_lock() to derive the pgrp _nr, even if
the task is current, otherwise we can race with another thread which
does sys_setpgid().
2. Use rcu_read_lock() instead of tasklist_lock when pid != 0, make sure
that we don't use the NULL pid if the task exits right after successful
find_task_by_vpid().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pids: sys_getsid: fix unsafe *pid usage, fix possible 0 instead of -ESRCH
1. sys_getsid() needs rcu_read_lock() to derive the session _nr, even if
the task is current, otherwise we can race with another thread which
does sys_setsid().
2. The task can exit between find_task_by_vpid() and task_session_vnr(),
in that unlikely case sys_getsid() returns 0 instead of -ESRCH.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without tasklist_lock held task_session()/task_pgrp() can return NULL if the
caller races with setprgp()/setsid() which does detach_pid() + attach_pid().
This can happen even if task == current.
Intoduce the new helper, change_pid(), which should be used instead. This way
the caller always sees the special pid != NULL, either old or new.
Also change the prototype of attach_pid(), it always returns 0 and nobody
check the returned value.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pids: de_thread: don't clear session/pgrp pids for the old leader
Based on Eric W. Biederman's idea.
Unless task == current, without tasklist_lock held task_session()/task_pgrp()
can return NULL if the caller races with de_thread() which switches the group
leader.
Change transfer_pid() to not clear old->pids[type].pid for the old leader.
This means that its .pid can point to "nowhere", but this is already true for
sub-threads, and the old leader is not group_leader() any longer. IOW, with
or without this change we can't trust task's special pids unless it is the
group leader.
Pavel Emelyanov [Wed, 30 Apr 2008 07:54:23 +0000 (00:54 -0700)]
Use find_task_by_vpid in taskstats
The pid to lookup a task by is passed inside taskstats code via genetlink
message.
Since netlink packets are now processed in the context of the sending task,
this is correct to lookup the task with find_task_by_vpid() here.
Besides, I fix the call to fill_pid() from taskstats_exit(), since the
tsk->pid is not required in fill_pid() in this case, and the pid field on
task_struct is going to be deprecated as well.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Jay Lan <jlan@engr.sgi.com> Cc: Jonathan Lim <jlim@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Factor out the code used to allocate/free a pts index into new interfaces,
devpts_new_index() and devpts_kill_index(). This localizes the external data
structures used in managing the pts indices.
[akpm@linux-foundation.org: undo accidental mutex2sem conversion] Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alan Cox [Wed, 30 Apr 2008 07:54:19 +0000 (00:54 -0700)]
char serial: switch drivers to ioremap_nocache
Simple search/replace except for synclink.c where I noticed a real bug and
fixed it too. It was doing NULL + offset, then checking for NULL if the remap
failed.
Signed-off-by: Alan Cox <alan@redhat.com> Cc: Paul Fulghum <paulkf@microgate.com> Acked-by: Jiri Slaby <jirislaby@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alan Cox [Wed, 30 Apr 2008 07:54:18 +0000 (00:54 -0700)]
tty: add throttle/unthrottle helpers
Something Arjan suggested which allows us to clean up the code nicely
Signed-off-by: Alan Cox <alan@redhat.com> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alan Cox [Wed, 30 Apr 2008 07:54:14 +0000 (00:54 -0700)]
strip: Fix up strip for the new order
- Use the tty baud functions
- Call driver termios methods directly holding the right locking
- Check for a write method
Signed-off-by: Alan Cox <alan@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Jeff Garzik <jeff@garzik.org> Cc: "John W. Linville" <linville@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alan Cox [Wed, 30 Apr 2008 07:54:11 +0000 (00:54 -0700)]
serial: switch the serial core to int put_char methods
Signed-off-by: Alan Cox <alan@redhat.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alan Cox [Wed, 30 Apr 2008 07:54:10 +0000 (00:54 -0700)]
pty: prepare for tty->ops changes
We are about to change the tty layer to avoid keeping private copies of all
the methods in each tty. We have to update the pty layer first as it
currently patches the ioctl method according to the tty type. Use multiple
tty operations sets instead.
Signed-off-by: Alan Cox <alan@redhat.com> Acked-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename defines to be in RIO* namespace to not to collide with other defines in
tree. This broke (as akpm correctly pointed out) some allmodconfig builds,
e.g. on ppc:
In file included from drivers/char/rio/rio_linux.c:81:
drivers/char/rio/cirrus.h:202:1: warning: "COMPLETE" redefined
In file included from include/net/netns/ipv4.h:8,
from include/net/net_namespace.h:13,
from include/linux/seq_file.h:7,
from include/asm/machdep.h:12,
from include/asm/pci.h:17,
from include/linux/pci.h:951,
from drivers/char/rio/rio_linux.c:50:
include/net/inet_frag.h:28:1: warning: this is the location of the previous definition
- remove i2os.h -- there was only macro to macro renaming or useless
stuff
- remove another uselless stuf (NULLFUNC, NULLPTR, YES, NO)
- use outb/inb directly
- use locking functions directly
- don't define another ROUNDUP, use roundup(x, 2) instead
- some comments and whitespace cleanup
- remove some commented crap
- prepend the rest by I2 prefix to not collide with rest of the world
like in following output (pointed out by akpm)
In file included from drivers/char/ip2/ip2main.c:128:
drivers/char/ip2/i2ellis.h:608:1: warning: "COMPLETE" redefined
In file included from include/net/netns/ipv4.h:8,
from include/net/net_namespace.h:13,
from include/linux/seq_file.h:7,
from include/asm/machdep.h:12,
from include/asm/pci.h:17,
from include/linux/pci.h:951,
from drivers/char/ip2/ip2main.c:95:
include/net/inet_frag.h:28:1: warning: this is the location of the previous definition
Harvey Harrison [Wed, 30 Apr 2008 07:53:52 +0000 (00:53 -0700)]
epca.c: static functions and integer as NULL pointer fixes
drivers/char/epca.c:926:28: warning: Using plain integer as NULL pointer
drivers/char/epca.c:1841:2: warning: Using plain integer as NULL pointer
Forward declarations were already marked static, mark the definitions too.
drivers/char/epca.c:2493:6: warning: symbol 'digi_send_break' was not declared. Should it be static?
drivers/char/epca.c:2881:12: warning: symbol 'init_PCI' was not declared. Should it be static?
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Harvey Harrison [Wed, 30 Apr 2008 07:53:52 +0000 (00:53 -0700)]
cyclades.c: fix sparse shadowed variable warnings
Nested min() macros.
drivers/char/cyclades.c:2750:7: warning: symbol '_x' shadows an earlier one
drivers/char/cyclades.c:2750:7: originally declared here
drivers/char/cyclades.c:2750:7: warning: symbol '_x' shadows an earlier one
drivers/char/cyclades.c:2750:7: originally declared here
drivers/char/cyclades.c:2750:7: warning: symbol '_y' shadows an earlier one
drivers/char/cyclades.c:2750:7: originally declared here
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Harvey Harrison [Wed, 30 Apr 2008 07:53:51 +0000 (00:53 -0700)]
char: rocket.c: fix sparse variable shadowing and int as NULL pointer
Nested min() macros shadow _x, separate into two lines.
drivers/char/rocket.c:451:7: warning: symbol '_x' shadows an earlier one
drivers/char/rocket.c:451:7: originally declared here
drivers/char/rocket.c:451:7: warning: symbol '_x' shadows an earlier one
drivers/char/rocket.c:451:7: originally declared here
drivers/char/rocket.c:451:7: warning: symbol '_y' shadows an earlier one
drivers/char/rocket.c:451:7: originally declared here
drivers/char/rocket.c:1754:7: warning: symbol '_x' shadows an earlier one
drivers/char/rocket.c:1754:7: originally declared here
drivers/char/rocket.c:1754:7: warning: symbol '_x' shadows an earlier one
drivers/char/rocket.c:1754:7: originally declared here
drivers/char/rocket.c:1754:7: warning: symbol '_y' shadows an earlier one
drivers/char/rocket.c:1754:7: originally declared here
drivers/char/rocket.c:1751:20: warning: Using plain integer as NULL pointer
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Harvey Harrison [Wed, 30 Apr 2008 07:53:50 +0000 (00:53 -0700)]
char: fix sparse shadowed variable warnings in esp.c
flags only use was in spin_lock_irqsave/spin_lock_irgrestore pairs, no
need to redeclare for each one.
drivers/char/esp.c:1599:17: warning: symbol 'flags' shadows an earlier one
drivers/char/esp.c:1517:16: originally declared here
drivers/char/esp.c:1615:17: warning: symbol 'flags' shadows an earlier one
drivers/char/esp.c:1517:16: originally declared here
drivers/char/esp.c:1631:17: warning: symbol 'flags' shadows an earlier one
drivers/char/esp.c:1517:16: originally declared here
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- moxa_flush_chars -- no code; ldics handle this well
- moxa_put_char -- only wrapper to moxa_write (same code), tty does this
the same way if tty->driver->put_char is NULL
- add locking to open/close/hangup and ioctl (tiocm)
- add pci hot-un-plug support (hangup on board remove, wait for openers)
- cleanup block_till_ready
- move close code common to close/hangup into separate function to be
able to call it from open when hangup occurs while block_till_ready
- let ldisc flush on tty layer, it will do it after we return
- merge 2 timers into one -- one can handle the emptywait as good as the other
- merge 2 separated poll functions into one, this allows handle the actions
directly and simplifies the code
- allow stats only for sys_admin
- move TCSBRK* processing to .break_ctl tty op
- let TIOCGSOFTCAR and TIOCSSOFTCAR be processed by ldisc
- remove MOXA_GET_MAJOR, MOXA_GET_CUMAJOR
- fix jiffies subtraction by time_after
- move moxa ioctl numbers into the header; still not exported to userspace,
needs _IOC and 32/64 compat cleanup anyways
The only relevant sign of port being ready is its board->ready since now.
Remove all other flags for this purpose which are set almost on the same
place. Move ports inside the board to be sure that nobody will grab reference
to the port without being sure that it exists.
- request region before remapping pci io space
- use ioremap, iounmap istead of iomap interface, because we use
readX/writeX for accessing this space because of isa support
Static ISA field is empty and probably will never be filled in, remove it.
The driver still supports ISA cards passed through module parameter. This
actually fixes one bug inside the initialization of module-param passed cards
initialization.
Alan Cox [Wed, 30 Apr 2008 07:53:34 +0000 (00:53 -0700)]
tty_ioctl: soft carrier handling
First cut at moving the soft carrier handling knowledge entirely into the core
code. One or two drivers still needed to snoop these functions to track
CLOCAL internally. Instead make TIOCSSOFTCAR generate the same driver calls
as other termios ioctls changing the clocal flag. This allows us to remove
any driver knowledge and special casing. Also while we are at it we can fix
the error handling.
Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Peterson [Wed, 30 Apr 2008 07:53:30 +0000 (00:53 -0700)]
Resume TTY on SUSP and fix CRNL order in N_TTY line discipline
Refine these behaviors in the N_TTY line discipline:
1) Handle the signal characters consistently when received in a stopped TTY
so that SUSP (typically ctrl-Z) behaves like INTR and QUIT in resuming a
stopped TTY.
2) Adjust the order in which the IGNCR/ICRNL/INLCR processing is applied to
be more logical and consistent with the behavior of other Unix systems.
Signed-off-by: Joe Peterson <joe@skyrush.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>