a "struct super_operations" which describes the next level of the
filesystem implementation.
-Usually, a filesystem uses generic one of the generic get_sb()
-implementations and provides a fill_super() method instead. The
-generic methods are:
+Usually, a filesystem uses one of the generic get_sb() implementations
+and provides a fill_super() method instead. The generic methods are:
get_sb_bdev: mount a filesystem residing on a block device
or bottom half).
alloc_inode: this method is called by inode_alloc() to allocate memory
- for struct inode and initialize it.
+ for struct inode and initialize it. If this function is not
+ defined, a simple 'struct inode' is allocated. Normally
+ alloc_inode will be used to allocate a larger structure which
+ contains a 'struct inode' embedded within it.
destroy_inode: this method is called by destroy_inode() to release
- resources allocated for struct inode.
+ resources allocated for struct inode. It is only required if
+ ->alloc_inode was defined and simply undoes anything done by
+ ->alloc_inode.
read_inode: this method is called to read a specific inode from the
mounted filesystem. The i_ino member in the struct inode is
The Address Space Object
========================
-The address space object is used to identify pages in the page cache.
-
+The address space object is used to group and manage pages in the page
+cache. It can be used to keep track of the pages in a file (or
+anything else) and also track the mapping of sections of the file into
+process address spaces.
+
+There are a number of distinct yet related services that an
+address-space can provide. These include communicating memory
+pressure, page lookup by address, and keeping track of pages tagged as
+Dirty or Writeback.
+
+The first can be used independently to the others. The VM can try to
+either write dirty pages in order to clean them, or release clean
+pages in order to reuse them. To do this it can call the ->writepage
+method on dirty pages, and ->releasepage on clean pages with
+PagePrivate set. Clean pages without PagePrivate and with no external
+references will be released without notice being given to the
+address_space.
+
+To achieve this functionality, pages need to be placed on an LRU with
+lru_cache_add and mark_page_active needs to be called whenever the
+page is used.
+
+Pages are normally kept in a radix tree index by ->index. This tree
+maintains information about the PG_Dirty and PG_Writeback status of
+each page, so that pages with either of these flags can be found
+quickly.
+
+The Dirty tag is primarily used by mpage_writepages - the default
+->writepages method. It uses the tag to find dirty pages to call
+->writepage on. If mpage_writepages is not used (i.e. the address
+provides its own ->writepages) , the PAGECACHE_TAG_DIRTY tag is
+almost unused. write_inode_now and sync_inode do use it (through
+__sync_single_inode) to check if ->writepages has been successful in
+writing out the whole address_space.
+
+The Writeback tag is used by filemap*wait* and sync_page* functions,
+via wait_on_page_writeback_range, to wait for all writeback to
+complete. While waiting ->sync_page (if defined) will be called on
+each page that is found to require writeback.
+
+An address_space handler may attach extra information to a page,
+typically using the 'private' field in the 'struct page'. If such
+information is attached, the PG_Private flag should be set. This will
+cause various VM routines to make extra calls into the address_space
+handler to deal with that data.
+
+An address space acts as an intermediate between storage and
+application. Data is read into the address space a whole page at a
+time, and provided to the application either by copying of the page,
+or by memory-mapping the page.
+Data is written into the address space by the application, and then
+written-back to storage typically in whole pages, however the
+address_space has finer control of write sizes.
+
+The read process essentially only requires 'readpage'. The write
+process is more complicated and uses prepare_write/commit_write or
+set_page_dirty to write data into the address_space, and writepage,
+sync_page, and writepages to writeback data to storage.
+
+Adding and removing pages to/from an address_space is protected by the
+inode's i_mutex.
+
+When data is written to a page, the PG_Dirty flag should be set. It
+typically remains set until writepage asks for it to be written. This
+should clear PG_Dirty and set PG_Writeback. It can be actually
+written at any point after PG_Dirty is clear. Once it is known to be
+safe, PG_Writeback is cleared.
+
+Writeback makes use of a writeback_control structure...
struct address_space_operations
-------------------------------
This describes how the VFS can manipulate mapping of a file to page cache in
-your filesystem. As of kernel 2.6.13, the following members are defined:
+your filesystem. As of kernel 2.6.16, the following members are defined:
struct address_space_operations {
int (*writepage)(struct page *page, struct writeback_control *wbc);
loff_t offset, unsigned long nr_segs);
struct page* (*get_xip_page)(struct address_space *, sector_t,
int);
+ /* migrate the contents of a page to the specified target */
+ int (*migratepage) (struct page *, struct page *);
};
- writepage: called by the VM write a dirty page to backing store.
+ writepage: called by the VM to write a dirty page to backing store.
+ This may happen for data integrity reasons (i.e. 'sync'), or
+ to free up memory (flush). The difference can be seen in
+ wbc->sync_mode.
+ The PG_Dirty flag has been cleared and PageLocked is true.
+ writepage should start writeout, should set PG_Writeback,
+ and should make sure the page is unlocked, either synchronously
+ or asynchronously when the write operation completes.
+
+ If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
+ try too hard if there are problems, and may choose to write out
+ other pages from the mapping if that is easier (e.g. due to
+ internal dependencies). If it chooses not to start writeout, it
+ should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
+ calling ->writepage on that page.
+
+ See the file "Locking" for more details.
readpage: called by the VM to read a page from backing store.
+ The page will be Locked when readpage is called, and should be
+ unlocked and marked uptodate once the read completes.
+ If ->readpage discovers that it needs to unlock the page for
+ some reason, it can do so, and then return AOP_TRUNCATED_PAGE.
+ In this case, the page will be relocated, relocked and if
+ that all succeeds, ->readpage will be called again.
sync_page: called by the VM to notify the backing store to perform all
queued I/O operations for a page. I/O operations for other pages
associated with this address_space object may also be performed.
+ This function is optional and is called only for pages with
+ PG_Writeback set while waiting for the writeback to complete.
+
writepages: called by the VM to write out pages associated with the
- address_space object.
+ address_space object. If wbc->sync_mode is WBC_SYNC_ALL, then
+ the writeback_control will specify a range of pages that must be
+ written out. If it is WBC_SYNC_NONE, then a nr_to_write is given
+ and that many pages should be written if possible.
+ If no ->writepages is given, then mpage_writepages is used
+ instead. This will choose pages from the address space that are
+ tagged as DIRTY and will pass them to ->writepage.
set_page_dirty: called by the VM to set a page dirty.
+ This is particularly needed if an address space attaches
+ private data to a page, and that data needs to be updated when
+ a page is dirtied. This is called, for example, when a memory
+ mapped page gets modified.
+ If defined, it should set the PageDirty flag, and the
+ PAGECACHE_TAG_DIRTY tag in the radix tree.
readpages: called by the VM to read pages associated with the address_space
- object.
+ object. This is essentially just a vector version of
+ readpage. Instead of just one page, several pages are
+ requested.
+ readpages is only used for read-ahead, so read errors are
+ ignored. If anything goes wrong, feel free to give up.
prepare_write: called by the generic write path in VM to set up a write
- request for a page.
-
- commit_write: called by the generic write path in VM to write page to
- its backing store.
+ request for a page. This indicates to the address space that
+ the given range of bytes is about to be written. The
+ address_space should check that the write will be able to
+ complete, by allocating space if necessary and doing any other
+ internal housekeeping. If the write will update parts of
+ any basic-blocks on storage, then those blocks should be
+ pre-read (if they haven't been read already) so that the
+ updated blocks can be written out properly.
+ The page will be locked. If prepare_write wants to unlock the
+ page it, like readpage, may do so and return
+ AOP_TRUNCATED_PAGE.
+ In this case the prepare_write will be retried one the lock is
+ regained.
+
+ commit_write: If prepare_write succeeds, new data will be copied
+ into the page and then commit_write will be called. It will
+ typically update the size of the file (if appropriate) and
+ mark the inode as dirty, and do any other related housekeeping
+ operations. It should avoid returning an error if possible -
+ errors should have been handled by prepare_write.
bmap: called by the VFS to map a logical block offset within object to
- physical block number. This method is use by for the legacy FIBMAP
- ioctl. Other uses are discouraged.
-
- invalidatepage: called by the VM on truncate to disassociate a page from its
- address_space mapping.
-
- releasepage: called by the VFS to release filesystem specific metadata from
- a page.
-
- direct_IO: called by the VM for direct I/O writes and reads.
+ physical block number. This method is used by the FIBMAP
+ ioctl and for working with swap-files. To be able to swap to
+ a file, the file must have a stable mapping to a block
+ device. The swap system does not go through the filesystem
+ but instead uses bmap to find out where the blocks in the file
+ are and uses those addresses directly.
+
+
+ invalidatepage: If a page has PagePrivate set, then invalidatepage
+ will be called when part or all of the page is to be removed
+ from the address space. This generally corresponds to either a
+ truncation or a complete invalidation of the address space
+ (in the latter case 'offset' will always be 0).
+ Any private data associated with the page should be updated
+ to reflect this truncation. If offset is 0, then
+ the private data should be released, because the page
+ must be able to be completely discarded. This may be done by
+ calling the ->releasepage function, but in this case the
+ release MUST succeed.
+
+ releasepage: releasepage is called on PagePrivate pages to indicate
+ that the page should be freed if possible. ->releasepage
+ should remove any private data from the page and clear the
+ PagePrivate flag. It may also remove the page from the
+ address_space. If this fails for some reason, it may indicate
+ failure with a 0 return value.
+ This is used in two distinct though related cases. The first
+ is when the VM finds a clean page with no active users and
+ wants to make it a free page. If ->releasepage succeeds, the
+ page will be removed from the address_space and become free.
+
+ The second case if when a request has been made to invalidate
+ some or all pages in an address_space. This can happen
+ through the fadvice(POSIX_FADV_DONTNEED) system call or by the
+ filesystem explicitly requesting it as nfs and 9fs do (when
+ they believe the cache may be out of date with storage) by
+ calling invalidate_inode_pages2().
+ If the filesystem makes such a call, and needs to be certain
+ that all pages are invalidated, then its releasepage will
+ need to ensure this. Possibly it can clear the PageUptodate
+ bit if it cannot free private data yet.
+
+ direct_IO: called by the generic read/write routines to perform
+ direct_IO - that is IO requests which bypass the page cache
+ and transfer data directly between the storage and the
+ application's address space.
get_xip_page: called by the VM to translate a block number to a page.
The page is valid until the corresponding filesystem is unmounted.
Filesystems that want to use execute-in-place (XIP) need to implement
it. An example implementation can be found in fs/ext2/xip.c.
+ migrate_page: This is used to compact the physical memory usage.
+ If the VM wants to relocate a page (maybe off a memory card
+ that is signalling imminent failure) it will pass a new page
+ and an old page to this function. migrate_page should
+ transfer any private data across and update any references
+ that it has to the page.
The File Object
===============
and the dentry is returned. The caller must use d_put()
to free the dentry when it finishes using it.
-
-RCU-based dcache locking model
-------------------------------
-
-On many workloads, the most common operation on dcache is
-to look up a dentry, given a parent dentry and the name
-of the child. Typically, for every open(), stat() etc.,
-the dentry corresponding to the pathname will be looked
-up by walking the tree starting with the first component
-of the pathname and using that dentry along with the next
-component to look up the next level and so on. Since it
-is a frequent operation for workloads like multiuser
-environments and web servers, it is important to optimize
-this path.
-
-Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus
-in every component during path look-up. Since 2.5.10 onwards,
-fast-walk algorithm changed this by holding the dcache_lock
-at the beginning and walking as many cached path component
-dentries as possible. This significantly decreases the number
-of acquisition of dcache_lock. However it also increases the
-lock hold time significantly and affects performance in large
-SMP machines. Since 2.5.62 kernel, dcache has been using
-a new locking model that uses RCU to make dcache look-up
-lock-free.
-
-The current dcache locking model is not very different from the existing
-dcache locking model. Prior to 2.5.62 kernel, dcache_lock
-protected the hash chain, d_child, d_alias, d_lru lists as well
-as d_inode and several other things like mount look-up. RCU-based
-changes affect only the way the hash chain is protected. For everything
-else the dcache_lock must be taken for both traversing as well as
-updating. The hash chain updates too take the dcache_lock.
-The significant change is the way d_lookup traverses the hash chain,
-it doesn't acquire the dcache_lock for this and rely on RCU to
-ensure that the dentry has not been *freed*.
-
-
-Dcache locking details
-----------------------
-
-For many multi-user workloads, open() and stat() on files are
-very frequently occurring operations. Both involve walking
-of path names to find the dentry corresponding to the
-concerned file. In 2.4 kernel, dcache_lock was held
-during look-up of each path component. Contention and
-cache-line bouncing of this global lock caused significant
-scalability problems. With the introduction of RCU
-in Linux kernel, this was worked around by making
-the look-up of path components during path walking lock-free.
-
-
-Safe lock-free look-up of dcache hash table
-===========================================
-
-Dcache is a complex data structure with the hash table entries
-also linked together in other lists. In 2.4 kernel, dcache_lock
-protected all the lists. We applied RCU only on hash chain
-walking. The rest of the lists are still protected by dcache_lock.
-Some of the important changes are :
-
-1. The deletion from hash chain is done using hlist_del_rcu() macro which
- doesn't initialize next pointer of the deleted dentry and this
- allows us to walk safely lock-free while a deletion is happening.
-
-2. Insertion of a dentry into the hash table is done using
- hlist_add_head_rcu() which take care of ordering the writes -
- the writes to the dentry must be visible before the dentry
- is inserted. This works in conjunction with hlist_for_each_rcu()
- while walking the hash chain. The only requirement is that
- all initialization to the dentry must be done before hlist_add_head_rcu()
- since we don't have dcache_lock protection while traversing
- the hash chain. This isn't different from the existing code.
-
-3. The dentry looked up without holding dcache_lock by cannot be
- returned for walking if it is unhashed. It then may have a NULL
- d_inode or other bogosity since RCU doesn't protect the other
- fields in the dentry. We therefore use a flag DCACHE_UNHASHED to
- indicate unhashed dentries and use this in conjunction with a
- per-dentry lock (d_lock). Once looked up without the dcache_lock,
- we acquire the per-dentry lock (d_lock) and check if the
- dentry is unhashed. If so, the look-up is failed. If not, the
- reference count of the dentry is increased and the dentry is returned.
-
-4. Once a dentry is looked up, it must be ensured during the path
- walk for that component it doesn't go away. In pre-2.5.10 code,
- this was done holding a reference to the dentry. dcache_rcu does
- the same. In some sense, dcache_rcu path walking looks like
- the pre-2.5.10 version.
-
-5. All dentry hash chain updates must take the dcache_lock as well as
- the per-dentry lock in that order. dput() does this to ensure
- that a dentry that has just been looked up in another CPU
- doesn't get deleted before dget() can be done on it.
-
-6. There are several ways to do reference counting of RCU protected
- objects. One such example is in ipv4 route cache where
- deferred freeing (using call_rcu()) is done as soon as
- the reference count goes to zero. This cannot be done in
- the case of dentries because tearing down of dentries
- require blocking (dentry_iput()) which isn't supported from
- RCU callbacks. Instead, tearing down of dentries happen
- synchronously in dput(), but actual freeing happens later
- when RCU grace period is over. This allows safe lock-free
- walking of the hash chains, but a matched dentry may have
- been partially torn down. The checking of DCACHE_UNHASHED
- flag with d_lock held detects such dentries and prevents
- them from being returned from look-up.
-
-
-Maintaining POSIX rename semantics
-==================================
-
-Since look-up of dentries is lock-free, it can race against
-a concurrent rename operation. For example, during rename
-of file A to B, look-up of either A or B must succeed.
-So, if look-up of B happens after A has been removed from the
-hash chain but not added to the new hash chain, it may fail.
-Also, a comparison while the name is being written concurrently
-by a rename may result in false positive matches violating
-rename semantics. Issues related to race with rename are
-handled as described below :
-
-1. Look-up can be done in two ways - d_lookup() which is safe
- from simultaneous renames and __d_lookup() which is not.
- If __d_lookup() fails, it must be followed up by a d_lookup()
- to correctly determine whether a dentry is in the hash table
- or not. d_lookup() protects look-ups using a sequence
- lock (rename_lock).
-
-2. The name associated with a dentry (d_name) may be changed if
- a rename is allowed to happen simultaneously. To avoid memcmp()
- in __d_lookup() go out of bounds due to a rename and false
- positive comparison, the name comparison is done while holding the
- per-dentry lock. This prevents concurrent renames during this
- operation.
-
-3. Hash table walking during look-up may move to a different bucket as
- the current dentry is moved to a different bucket due to rename.
- But we use hlists in dcache hash table and they are null-terminated.
- So, even if a dentry moves to a different bucket, hash chain
- walk will terminate. [with a list_head list, it may not since
- termination is when the list_head in the original bucket is reached].
- Since we redo the d_parent check and compare name while holding
- d_lock, lock-free look-up will not race against d_move().
-
-4. There can be a theoretical race when a dentry keeps coming back
- to original bucket due to double moves. Due to this look-up may
- consider that it has never moved and can end up in a infinite loop.
- But this is not any worse that theoretical livelocks we already
- have in the kernel.
-
-
-Important guidelines for filesystem developers related to dcache_rcu
-====================================================================
-
-1. Existing dcache interfaces (pre-2.5.62) exported to filesystem
- don't change. Only dcache internal implementation changes. However
- filesystems *must not* delete from the dentry hash chains directly
- using the list macros like allowed earlier. They must use dcache
- APIs like d_drop() or __d_drop() depending on the situation.
-
-2. d_flags is now protected by a per-dentry lock (d_lock). All
- access to d_flags must be protected by it.
-
-3. For a hashed dentry, checking of d_count needs to be protected
- by d_lock.
-
-
-Papers and other documentation on dcache locking
-================================================
-
-1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
-
-2. http://lse.sourceforge.net/locking/dcache/dcache.html
+For further information on dentry locking, please refer to the document
+Documentation/filesystems/dentry-locking.txt.
Resources