Speeding up mremap's moving of ptes has never been a priority, but the locking
will get more complicated shortly, and is already too baroque.
Scrap the current one-by-one moving, do an extent at a time: curtailed by end
of src and dst pmds (have to use PMD_SIZE: the way pmd_addr_end gets elided
doesn't match this usage), and by latency considerations.
One nice property of the old method is lost: it never allocated a page table
unless absolutely necessary, so you could free empty page tables by mremapping
to and fro. Whereas this way, it allocates a dst table wherever there was a
src table. I keep diving in to reinstate the old behaviour, then come out
preferring not to clutter how it now is.