Patrick Caulfeld [Thu, 17 Jan 2008 10:25:28 +0000 (10:25 +0000)]
dlm: Sanity check namelen before copying it
The 32/64 compatibility code in the DLM does not check the validity of
the lock name length passed into it, so it can easily overwrite memory
if the value is rubbish (as early versions of libdlm can cause with
unlock calls, it doesn't zero the field).
This patch restricts the length of the name to the amount of data
actually passed into the call.
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
David Teigland [Wed, 16 Jan 2008 19:02:31 +0000 (13:02 -0600)]
dlm: keep cached master rsbs during recovery
To prevent the master of an rsb from changing rapidly, an unused rsb is kept
on the "toss list" for a period of time to be reused. The toss list was
being cleared completely for each recovery, which is unnecessary. Much of
the benefit of the toss list can be maintained if nodes keep rsb's in their
toss list that they are the master of. These rsb's need to be included
when the resource directory is rebuilt during recovery.
Signed-off-by: David Teigland <teigland@redhat.com>
David Teigland [Wed, 9 Jan 2008 16:37:39 +0000 (10:37 -0600)]
dlm: limit dir lookup loop
In a rare case we may need to repeat a local resource directory lookup
due to a race with removing the rsb and removing the resdir record.
We'll never need to do more than a single additional lookup, though,
so the infinite loop around the lookup can be removed. In addition
to being unnecessary, the infinite loop is dangerous since some other
unknown condition may appear causing the loop to never break.
Signed-off-by: David Teigland <teigland@redhat.com>
David Teigland [Wed, 9 Jan 2008 15:59:41 +0000 (09:59 -0600)]
dlm: validate messages before processing
There was some hit and miss validation of messages that has now been
cleaned up and unified. Before processing a message, the new
validate_message() function checks that the lkb is the appropriate type,
process-copy or master-copy, and that the message is from the correct
nodeid for the the given lkb. Other checks and assertions on the
lkb type and nodeid have been removed. The assertions were particularly
bad since they would panic the machine instead of just ignoring the bad
message.
Although other recent patches have made processing old message unlikely,
it still may be possible for an old message to be processed and caught
by these checks.
Signed-off-by: David Teigland <teigland@redhat.com>
David Teigland [Tue, 8 Jan 2008 22:24:00 +0000 (16:24 -0600)]
dlm: reject messages from non-members
Messages from nodes that are no longer members of the lockspace should be
ignored. When nodes are removed from the lockspace, recovery can
sometimes complete quickly enough that messages arrive from a removed node
after recovery has completed. When processed, these messages would often
cause an error message, and could in some cases change some state, causing
problems.
Signed-off-by: David Teigland <teigland@redhat.com>
David Teigland [Tue, 8 Jan 2008 21:37:47 +0000 (15:37 -0600)]
dlm: another call to confirm_master in receive_request_reply
When a failed request (EBADR or ENOTBLK) is unlocked/canceled instead of
retried, there may be other lkb's waiting on the rsb_lookup list for it
to complete. A call to confirm_master() is needed to move on to the next
waiting lkb since the current one won't be retried.
Signed-off-by: David Teigland <teigland@redhat.com>
David Teigland [Mon, 7 Jan 2008 22:15:05 +0000 (16:15 -0600)]
dlm: recover locks waiting for overlap replies
When recovery looks at locks waiting for replies, it fails to consider
locks that have already received a reply for their first remote operation,
but not received a reply for secondary, overlapping unlock/cancel. The
appropriate stub reply needs to be called for these waiters.
Appears when we start doing recovery in the presence of a many overlapping
unlock/cancel ops.
Signed-off-by: David Teigland <teigland@redhat.com>
David Teigland [Mon, 7 Jan 2008 21:55:18 +0000 (15:55 -0600)]
dlm: clear ast_type when removing from astqueue
The lkb_ast_type field indicates whether the lkb is on the astqueue list.
When clearing locks for a process, lkb's were being removed from the astqueue
list without clearing the field. If release_lockspace then happened
immediately afterward, it could try to remove the lkb from the list a second
time.
Appears when process calls libdlm dlm_release_lockspace() which first
closes the ls dev triggering clear_proc_locks, and then removes the ls
(a write to control dev) causing release_lockspace().
Signed-off-by: David Teigland <teigland@redhat.com>
David Teigland [Tue, 15 Jan 2008 21:43:24 +0000 (15:43 -0600)]
dlm: use fixed errno values in messages
Some errno values differ across platforms. So if we return things like
-EINPROGRESS from one node it can get misinterpreted or rejected on
another one.
This patch fixes up the errno values passed on the wire so that they
match the x86 ones (so as not to break the protocol), and re-instates
the platform-specific ones at the other end.
Many thanks to Fabio for testing this patch.
Initial patch from Patrick.
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com> Signed-off-by: Fabio M. Di Nitto <fabbione@ubuntu.com> Signed-off-by: David Teigland <teigland@redhat.com>
Avi Kivity [Wed, 16 Jan 2008 10:49:30 +0000 (12:49 +0200)]
KVM: Move apic timer migration away from critical section
Migrating the apic timer in the critical section is not very nice, and is
absolutely horrible with the real-time port. Move migration to the regular
vcpu execution path, triggered by a new bitflag.
Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Avi Kivity <avi@qumranet.com>
kvm_para.h potentially contains definitions that are to be used by userspace,
so it should not be included inside the __KERNEL__ block. To protect its own
data structures, kvm_para.h already includes its own __KERNEL__ block.
Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com> Acked-by: Amit Shah <amit.shah@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
Avi Kivity [Tue, 15 Jan 2008 16:27:32 +0000 (18:27 +0200)]
KVM: Fix unbounded preemption latency
When preparing to enter the guest, if an interrupt comes in while
preemption is disabled but interrupts are still enabled, we miss a
preemption point. Fix by explicitly checking whether we need to
reschedule.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Avi Kivity <avi@qumranet.com>
Izik Eidus [Sat, 12 Jan 2008 21:49:09 +0000 (23:49 +0200)]
KVM: MMU: Fix dirty page setting for pages removed from rmap
Right now rmap_remove won't set the page as dirty if the shadow pte
pointed to this page had write access and then it became readonly.
This patches fixes that, by setting the page as dirty for spte changes from
write to readonly access.
Signed-off-by: Izik Eidus <izike@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
Sheng Yang [Wed, 2 Jan 2008 06:49:22 +0000 (14:49 +0800)]
KVM: x86 emulator: Only allow VMCALL/VMMCALL trapped by #UD
When executing a test program called "crashme", we found the KVM guest cannot
survive more than ten seconds, then encounterd kernel panic. The basic concept
of "crashme" is generating random assembly code and trying to execute it.
After some fixes on emulator insn validity judgment, we found it's hard to
get the current emulator handle the invalid instructions correctly, for the
#UD trap for hypercall patching caused troubles. The problem is, if the opcode
itself was OK, but combination of opcode and modrm_reg was invalid, and one
operand of the opcode was memory (SrcMem or DstMem), the emulator will fetch
the memory operand first rather than checking the validity, and may encounter
an error there. For example, ".byte 0xfe, 0x34, 0xcd" has this problem.
In the patch, we simply check that if the invalid opcode wasn't vmcall/vmmcall,
then return from emulate_instruction() and inject a #UD to guest. With the
patch, the guest had been running for more than 12 hours.
Signed-off-by: Feng (Eric) Liu <eric.e.liu@intel.com> Signed-off-by: Sheng Yang <sheng.yang@intel.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
Avi Kivity [Mon, 31 Dec 2007 13:27:49 +0000 (15:27 +0200)]
KVM: MMU: Move kvm_free_some_pages() into critical section
If some other cpu steals mmu pages between our check and an attempt to
allocate, we can run out of mmu pages. Fix by moving the check into the
same critical section as the allocation.
Avi Kivity [Sun, 30 Dec 2007 10:29:05 +0000 (12:29 +0200)]
KVM: MMU: Avoid calling gfn_to_page() in mmu_set_spte()
Since gfn_to_page() is a sleeping function, and we want to make the core mmu
spinlocked, we need to pass the page from the walker context (which can sleep)
to the shadow context (which cannot).
Marcelo Tosatti [Fri, 21 Dec 2007 00:18:22 +0000 (19:18 -0500)]
KVM: MMU: Concurrent guest walkers
Do not hold kvm->lock mutex across the entire pagefault code,
only acquire it in places where it is necessary, such as mmu
hash list, active list, rmap and parent pte handling.
Allow concurrent guest walkers by switching walk_addr() to use
mmap_sem in read-mode.
And get rid of the lockless __gfn_to_page.
[avi: move kvm_mmu_pte_write() locking inside the function]
[avi: add locking for real mode]
[avi: fix cmpxchg locking]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
Avi Kivity [Thu, 25 Oct 2007 14:52:32 +0000 (16:52 +0200)]
KVM: Accelerated apic support
This adds a mechanism for exposing the virtual apic tpr to the guest, and a
protocol for letting the guest update the tpr without causing a vmexit if
conditions allow (e.g. there is no interrupt pending with a higher priority
than the new tpr).
Avi Kivity [Mon, 22 Oct 2007 14:50:39 +0000 (16:50 +0200)]
KVM: local APIC TPR access reporting facility
Add a facility to report on accesses to the local apic tpr even if the
local apic is emulated in the kernel. This is basically a hack that
allows userspace to patch Windows which tends to bang on the tpr a lot.
Ryan Harper [Thu, 13 Dec 2007 16:21:10 +0000 (10:21 -0600)]
KVM: VMX: Add printk_ratelimit in vmx_intr_assist
Add printk_ratelimit check in front of printk. This prevents spamming
of the message during 32-bit ubuntu 6.06server install. Previously, it
would hang during the partition formatting stage.
Signed-off-by: Ryan Harper <ryanh@us.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
Joerg Roedel [Tue, 11 Dec 2007 14:36:57 +0000 (15:36 +0100)]
KVM: SVM: support writing 0 to K8 performance counter control registers
This lets SVM ignore writes of the value 0 to the performance counter control
registers. Thus enabling them will still fail in the guest, but a write of 0
which keeps them disabled is accepted. This is required to boot Windows
Vista 64bit.
[avi: avoid fall-thru in switch statement]
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Markus Rechberger <markus.rechberger@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
Avi Kivity [Sun, 9 Dec 2007 16:43:00 +0000 (18:43 +0200)]
KVM: MMU: Use mmu_set_spte() for real-mode shadows
In addition to removing some duplicated code, this also handles the unlikely
case of real-mode code updating a guest page table. This can happen when
one vcpu (in real mode) touches a second vcpu's (in protected mode) page
tables, or if a vcpu switches to real mode, touches page tables, and switches
back.
Avi Kivity [Sun, 9 Dec 2007 15:40:31 +0000 (17:40 +0200)]
KVM: MMU: Move set_pte() into guest paging mode independent code
As set_pte() no longer references either a gpte or the guest walker, we can
move it out of paging mode dependent code (which compiles twice and is
generally nasty).
Avi Kivity [Sun, 9 Dec 2007 15:00:02 +0000 (17:00 +0200)]
KVM: MMU: Fix inherited permissions for emulated guest pte updates
When we emulate a guest pte write, we fail to apply the correct inherited
permissions from the parent ptes. Now that we store inherited permissions
in the shadow page, we can use that to update the pte permissions correctly.
Avi Kivity [Sun, 9 Dec 2007 14:37:36 +0000 (16:37 +0200)]
KVM: MMU: Set nx bit correctly on shadow ptes
While the page table walker correctly generates a guest page fault
if a guest tries to execute a non-executable page, the shadow code does
not mark it non-executable. This means that if a guest accesses an nx
page first with a read access, then subsequent code fetch accesses will
succeed.
Avi Kivity [Sun, 9 Dec 2007 14:15:46 +0000 (16:15 +0200)]
KVM: MMU: Simplify calculation of pte access
The nx bit is awkwardly placed in the 63rd bit position; furthermore it
has a reversed meaning compared to the other bits, which means we can't use
a bitwise and to calculate compounded access masks.
So, we simplify things by creating a new 3-bit exec/write/user access word,
and doing all calculations in that.
Marcelo Tosatti [Fri, 7 Dec 2007 12:56:58 +0000 (07:56 -0500)]
KVM: MMU: Use cmpxchg for pte updates on walk_addr()
In preparation for multi-threaded guest pte walking, use cmpxchg()
when updating guest pte's. This guarantees that the assignment of the
dirty bit can't be lost if two CPU's are faulting the same address
simultaneously.
[avi: fix kunmap_atomic() parameters]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
Avi Kivity [Thu, 6 Dec 2007 16:14:14 +0000 (18:14 +0200)]
KVM: x86 emulator: Fix stack instructions on 64-bit mode
Stack instructions are always 64-bit on 64-bit mode; many of the
emulated stack instructions did not take that into account. Fix by
adding a 'Stack' bitflag and setting the operand size appropriately
during the decode stage (except for 'push r/m', which is in a group
with a few other instructions, so it gets its own treatment).
Joerg Roedel [Thu, 6 Dec 2007 14:46:52 +0000 (15:46 +0100)]
KVM: SVM: Emulate read/write access to cr8
This patch adds code to emulate the access to the cr8 register to the x86
instruction emulator in kvm. This is needed on svm, where there is no
hardware decode for control register access.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Markus Rechberger <markus.rechberger@amd.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
Avi Kivity [Thu, 6 Dec 2007 14:32:45 +0000 (16:32 +0200)]
KVM: VMX: Avoid exit when setting cr8 if the local apic is in the kernel
With apic in userspace, we must exit to userspace after a cr8 write in order
to update the tpr. But if the apic is in the kernel, the exit is unnecessary.
Avi Kivity [Sun, 25 Nov 2007 11:41:11 +0000 (13:41 +0200)]
KVM: Generalize exception injection mechanism
Instead of each subarch doing its own thing, add an API for queuing an
injection, and manage failed exception injection centerally (i.e., if
an inject failed due to a shadow page fault, we need to requeue it).
KVM: SVM: Remove KVM specific defines for MSR_EFER
This patch removes the KVM specific defines for MSR_EFER that were being used
in the svm support file and migrates all references to use instead the ones
from the kernel headers that are used everywhere else and that have the same
values.
Signed-off-by: Carlo Marcelo Arenas Belon <carenas@sajinet.com.pe> Signed-off-by: Avi Kivity <avi@qumranet.com>
Avi Kivity [Sun, 2 Dec 2007 08:50:06 +0000 (10:50 +0200)]
KVM: Export include/linux/kvm.h only if $ARCH actually supports KVM
Currently, make headers_check barfs due to <asm/kvm.h>, which <linux/kvm.h>
includes, not existing. Rather than add a zillion <asm/kvm.h>s, export kvm.h
only if the arch actually supports it.
Avi Kivity [Tue, 27 Nov 2007 17:30:56 +0000 (19:30 +0200)]
KVM: x86 emulator: unify four switch statements into two
Unify the special instruction switch with the regular instruction switch,
and the two byte special instruction switch with the regular two byte
instruction switch. That makes it much easier to find an instruction or
the place an instruction needs to be added in.
Jerone Young [Mon, 26 Nov 2007 14:33:53 +0000 (08:33 -0600)]
KVM: Add ifdef in irqchip struct for x86 only structures
This patch fixes a small issue where sturctures:
kvm_pic_state
kvm_ioapic_state
are defined inside x86 specific code and may or may not
be defined in anyway for other architectures. The problem
caused is one cannot compile userspace apps (ex. libkvm)
for other archs since a size cannot be determined for these
structures.
Signed-off-by: Jerone Young <jyoung5@us.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
KVM: x86 emulator: Make a distinction between repeat prefixes F3 and F2
cmps and scas instructions accept repeat prefixes F3 and F2. So in
order to emulate those prefixed instructions we need to be able to know
if prefixes are REP/REPE/REPZ or REPNE/REPNZ. Currently kvm doesn't make
this distinction. This patch introduces this distinction.
Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net> Signed-off-by: Avi Kivity <avi@qumranet.com>