Hugh Dickins [Sun, 30 Oct 2005 01:16:02 +0000 (18:16 -0700)]
[PATCH] mm: tlb_is_full_mm was obscure
tlb_is_full_mm? What does that mean? The TLB is full? No, it means that the
mm's last user has gone and the whole mm is being torn down. And it's an
inline function because sparc64 uses a different (slightly better)
"tlb_frozen" name for the flag others call "fullmm".
And now the ptep_get_and_clear_full macro used in zap_pte_range refers
directly to tlb->fullmm, which would be wrong for sparc64. Rather than
correct that, I'd prefer to scrap tlb_is_full_mm altogether, and change
sparc64 to just use the same poor name as everyone else - is that okay?
Hugh Dickins [Sun, 30 Oct 2005 01:16:01 +0000 (18:16 -0700)]
[PATCH] mm: tlb_gather_mmu get_cpu_var
tlb_gather_mmu dates from before kernel preemption was allowed, and uses
smp_processor_id or __get_cpu_var to find its per-cpu mmu_gather. That works
because it's currently only called after getting page_table_lock, which is not
dropped until after the matching tlb_finish_mmu. But don't rely on that, it
will soon change: now disable preemption internally by proper get_cpu_var in
tlb_gather_mmu, put_cpu_var in tlb_finish_mmu.
Hugh Dickins [Sun, 30 Oct 2005 01:16:00 +0000 (18:16 -0700)]
[PATCH] mm: move_page_tables by extents
Speeding up mremap's moving of ptes has never been a priority, but the locking
will get more complicated shortly, and is already too baroque.
Scrap the current one-by-one moving, do an extent at a time: curtailed by end
of src and dst pmds (have to use PMD_SIZE: the way pmd_addr_end gets elided
doesn't match this usage), and by latency considerations.
One nice property of the old method is lost: it never allocated a page table
unless absolutely necessary, so you could free empty page tables by mremapping
to and fro. Whereas this way, it allocates a dst table wherever there was a
src table. I keep diving in to reinstate the old behaviour, then come out
preferring not to clutter how it now is.
Hugh Dickins [Sun, 30 Oct 2005 01:15:59 +0000 (18:15 -0700)]
[PATCH] mm: page fault handlers tidyup
Impose a little more consistency on the page fault handlers do_wp_page,
do_swap_page, do_anonymous_page, do_no_page, do_file_page: why not pass their
arguments in the same order, called the same names?
break_cow is all very well, but what it did was inlined elsewhere: easier to
compare if it's brought back into do_wp_page.
do_file_page's fallback to do_no_page dates from a time when we were testing
pte_file by using it wherever possible: currently it's peculiar to nonlinear
vmas, so just check that. BUG_ON if not? Better not, it's probably page
table corruption, so just show the pte: hmm, there's a pte_ERROR macro, let's
use that for do_wp_page's invalid pfn too.
Hah! Someone in the ppc64 world noticed pte_ERROR was unused so removed it:
restored (and say "pud" not "pmd" in its pud_ERROR).
Hugh Dickins [Sun, 30 Oct 2005 01:15:57 +0000 (18:15 -0700)]
[PATCH] mm: unlink_file_vma, remove_vma
Divide remove_vm_struct into two parts: first anon_vma_unlink plus
unlink_file_vma, to unlink the vma from the list and tree by which rmap or
vmtruncate might find it; then remove_vma to close, fput and free.
The intention here is to do the anon_vma_unlink and unlink_file_vma earlier,
in free_pgtables before freeing any page tables: so we can be sure that any
page tables traversed by rmap and vmtruncate are stable (and other, ordinary
cases are stabilized by holding mmap_sem).
This will be crucial to traversing pgd,pud,pmd without page_table_lock. But
testing the split-out patch showed that lifting the page_table_lock is
symbiotically necessary to make this change - the lock ordering is wrong to
move those unlinks into free_pgtables while it's under ptlock.
Hugh Dickins [Sun, 30 Oct 2005 01:15:56 +0000 (18:15 -0700)]
[PATCH] mm: remove_vma_list consolidation
unmap_vma doesn't amount to much, let's put it inside unmap_vma_list. Except
it doesn't unmap anything, unmap_region just did the unmapping: rename it to
remove_vma_list.
Hugh Dickins [Sun, 30 Oct 2005 01:15:56 +0000 (18:15 -0700)]
[PATCH] mm: vm_stat_account unshackled
The original vm_stat_account has fallen into disuse, with only one user, and
only one user of vm_stat_unaccount. It's easier to keep track if we convert
them all to __vm_stat_account, then free it from its __shackles.
Hugh Dickins [Sun, 30 Oct 2005 01:15:55 +0000 (18:15 -0700)]
[PATCH] mm: anon is already wrprotected
do_anonymous_page's pte_wrprotect causes some confusion: in such a case,
vm_page_prot must already be forcing COW, so must omit write permission, and
so the pte_wrprotect is redundant. Replace it by a comment to that effect,
and reword the comment on unuse_pte which also caused confusion.
Hugh Dickins [Sun, 30 Oct 2005 01:15:54 +0000 (18:15 -0700)]
[PATCH] mm: zap_pte_range dont dirty anon
zap_pte_range already avoids wasting time to mark_page_accessed on anon pages:
it can also skip anon set_page_dirty - the page only needs to be marked dirty
if shared with another mm, but that will say pte_dirty too.
Hugh Dickins [Sun, 30 Oct 2005 01:15:53 +0000 (18:15 -0700)]
[PATCH] mm: copy_pte_range progress fix
My latency breaking in copy_pte_range didn't work as intended: instead of
checking at regularish intervals, after the first interval it checked every
time around the loop, too impatient to be preempted. Fix that.
[PATCH] slab: add additional debugging to detect slabs from the wrong node
This patch adds some stack dumps if the slab logic is processing slab
blocks from the wrong node. This is necessary in order to detect
situations as encountered by Petr.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Lee Schermerhorn [Sun, 30 Oct 2005 01:15:51 +0000 (18:15 -0700)]
[PATCH] shrink_list(): skip anon pages if not may_swap
Martin Hicks' page cache reclaim patch added the 'may_swap' flag to the
scan_control struct; and modified shrink_list() not to add anon pages to
the swap cache if may_swap is not asserted.
However, further down, if the page is mapped, shrink_list() calls
try_to_unmap() which will call try_to_unmap_one() via try_to_unmap_anon ().
try_to_unmap_one() will BUG_ON() an anon page that is NOT in the swap
cache. Martin says he never encountered this path in his testing, but
agrees that it might happen.
This patch modifies shrink_list() to skip anon pages that are not already
in the swap cache when !may_swap, rather than just not adding them to the
cache.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andi Kleen [Sun, 30 Oct 2005 01:15:48 +0000 (18:15 -0700)]
[PATCH] Convert mempolicies to nodemask_t
The NUMA policy code predated nodemask_t so it used open coded bitmaps.
Convert everything to nodemask_t. Big patch, but shouldn't have any actual
behaviour changes (except I removed one unnecessary check against
node_online_map and one unnecessary BUG_ON)
Seth, Rohit [Sun, 30 Oct 2005 01:15:48 +0000 (18:15 -0700)]
[PATCH] mm: set per-cpu-pages lower threshold to zero
Set the low water mark for hot pages in pcp to zero.
(akpm: for the life of me I cannot remember why we created pcp->low. Neither
can Martin and the changelog is silent. Maybe it was just a brainfart, but I
have this feeling that there was a reason. If not, we should remove the
fields completely. We'll see.)
Seth, Rohit [Sun, 30 Oct 2005 01:15:47 +0000 (18:15 -0700)]
[PATCH] mm: page_alloc: increase size of per-cpu-pages
Increase the page allocator's per-cpu magazines from 1/4MB to 1/2MB.
Over 100+ runs for a workload, the difference in mean is about 2%. The best
results for both are almost same. Though the max variation in results with
1/2MB is only 2.2%, whereas with 1/4MB it is 12%.
Rik Van Riel [Sun, 30 Oct 2005 01:15:46 +0000 (18:15 -0700)]
[PATCH] swaptoken tuning
It turns out that the original swap token implementation, by Song Jiang, only
enforced the swap token while the task holding the token is handling a page
fault. This patch approximates that, without adding an additional flag to the
mm_struct, by checking whether the mm->mmap_sem is held for reading, like the
page fault code does.
This patch has the effect of automatically, and gradually, disabling the
enforcement of the swap token when there is little or no paging going on, and
"turning up" the intensity of the swap token code the more the task holding
the token is thrashing.
Thanks to Song Jiang for pointing out this aspect of the token based thrashing
control concept.
The new code shows a slight degradation over the old swap token code, but
still a big win over running without the swap token.
Rik Van Riel [Sun, 30 Oct 2005 01:15:44 +0000 (18:15 -0700)]
[PATCH] add sem_is_read/write_locked()
Add sem_is_read/write_locked functions to the read/write semaphores, along the
same lines of the *_is_locked spinlock functions. The swap token tuning patch
uses sem_is_read_locked; sem_is_write_locked is added for completeness.
Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nicolas Pitre [Sat, 29 Oct 2005 20:44:56 +0000 (21:44 +0100)]
[ARM] 3061/1: cleanup the XIP link address mess
Patch from Nicolas Pitre
Since vmlinux.lds.S is preprocessed, we can use the defines already
present in asm/memory.h (allowed by patch #3060) for the XIP kernel link
address instead of relying on a duplicated Makefile hardcoded value, and
also get rid of its dependency on awk to handle it at the same time.
While at it let's clean XIP stuff even further and make things clearer
in head.S with a nice code reduction.
Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Nicolas Pitre [Sat, 29 Oct 2005 20:44:55 +0000 (21:44 +0100)]
[ARM] 3060/1: allow constants found in asm/memory.h to be used in asm code
Patch from Nicolas Pitre
This patch allows for assorted type of cleanups by letting assembly code
use the same set of defines for constant values and avoid duplicated
definitions that might not always be in sync, or that might simply be
confusing due to the different names for the same thing.
Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Ralf Baechle [Thu, 20 Oct 2005 21:55:26 +0000 (22:55 +0100)]
Hack to resolve longstanding prefetch issue
Prefetching may be fatal on some systems if we're prefetching beyond the
end of memory on some systems. It's also a seriously bad idea on non
dma-coherent systems.
Andrew Isaacson [Thu, 20 Oct 2005 06:59:46 +0000 (23:59 -0700)]
pci-expmem-hack
CFE 1.2.5 and earlier fails to turn on the ExpMemEn bit in the
PCIFeatureControl register, which means that DMA does not work
beyond physical address 01_0000_0000, ergo to DRAM beyond 1GB.
With ExpMemEn turned on, 01_0000_0000-0f_ffff_ffff is mapped,
so DMA works for up to 61 GB of DRAM.
Will be fixed in CFE 1.2.6 (yet to be released).
Signed-Off-By: Andy Isaacson <adi@broadcom.com> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Atsushi Nemoto [Wed, 19 Oct 2005 10:57:14 +0000 (19:57 +0900)]
Fix zero length sys_cacheflush
Cacheflush(0, 0, 0) was crashing the system. This is because
flush_icache_range(start, end) tries to flushing whole address space
(0 - ~0UL) if both start and end are zero.
Ralf Baechle [Sat, 1 Oct 2005 19:22:39 +0000 (20:22 +0100)]
Don't copy SB1 cache error handler to uncached memory.
This may have made sense on a paranoid day with pass 1 BCM1250 processors
that were throwing cache error exception left and right for no good
reason. On modern silicion that hardly makes sense and the code had
gotten just an obscurity ...
tx39_flush_cache_range() does nothing if !cpu_has_dc_aliases. It should
flush d-cache and invalidate i-cache since the TX39(H2) has separate I/D
cache.
The type of sum in csum_tcpudp_nofold is "unsigned int", so when we assign
to it in an asm() block, and we're running on a system with 64-bit
registers, it is vitally important that we sign extend it correctly before
returning to C. Otherwise the stray high bits will be preserved into
csum_fold, and on the SB-1 processor, 32-bit arithmetic on a non
sign-extended register will yield surprising results.
This caused incorrect checksums in some UDP packets for NFS root. The
problem was mild when using a 10.0.1.x IP address, but severe when
using 192.168.1.x.
Signed-off-by: Daniel Jacobowitz <dan@codesourcery.com> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>