
Windows 7 x64 VMs crashing randomly during process termination

We have a cluster of five Proxmox nodes hosting a mix of Linux and Windows 7 x64 VMs, which we use as our automated software build and test environment. When we start build and test jobs from our Jenkins server, the jobs on the Windows 7 VMs fail regularly because the VMs crash at random points during the job. This is quite annoying.

The Windows VMs use virtio drivers for both disk and network. The disks are stored on the local drive, with writeback caching. I've tried two versions of the virtio drivers (0.1-74 and 0.1-59) but haven't seen any difference. I've disabled memory ballooning on all VMs.

pveversion -v produces:
Code:

proxmox-ve-2.6.32: 3.2-121 (running kernel: 2.6.32-27-pve)
pve-manager: 3.2-1 (running version: 3.2-1/1933730b)
pve-kernel-2.6.32-27-pve: 2.6.32-121
pve-kernel-2.6.32-26-pve: 2.6.32-114
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-15
pve-firmware: 1.1-2
libpve-common-perl: 3.0-14
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve4
vzprocps: not correctly installed
vzquota: 3.1-2
pve-qemu-kvm: 1.7-4
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1

I've used WinDbg to examine a number of the Windows memory.dmp files produced during the VM crashes. Running '!analyze -v' on them gives output similar to the example below in every case: the general pattern seems to be that a PAGE_FAULT_IN_NONPAGED_AREA bugcheck occurs while a process is being terminated. Across the dump files, I've seen this happen to various executables that are used in our build and test jobs.
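To check how consistent that pattern is across many dumps, the reports can be tallied by failure bucket and process name. The following is only a rough sketch: it assumes each '!analyze -v' report has been saved as plain text, and the function names `summarize` and `tally` are made up for illustration. The field names are the ones WinDbg prints in its report.

```python
import re
from collections import Counter

# Fields that '!analyze -v' prints as "NAME:  value" on a single line.
FIELD_RE = re.compile(
    r"^(PROCESS_NAME|FAILURE_BUCKET_ID|BUGCHECK_STR):[ \t]+(.+?)[ \t]*$",
    re.MULTILINE,
)

def summarize(analysis_text):
    """Return the selected fields of one saved report as a dict."""
    return {name: value for name, value in FIELD_RE.findall(analysis_text)}

def tally(reports):
    """Count (failure bucket, process name) pairs across several reports."""
    counts = Counter()
    for text in reports:
        fields = summarize(text)
        key = (fields.get("FAILURE_BUCKET_ID"), fields.get("PROCESS_NAME"))
        counts[key] += 1
    return counts
```

If the same bucket shows up with many different process names, that would support the idea that the fault is tied to process teardown rather than to one particular executable.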

If some kind of race condition is involved, that would explain why our build and test jobs are good candidates for triggering this bugcheck: each job starts and terminates a huge number of processes.
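If process churn really is the trigger, it might be possible to reproduce the crash without running a full build job. Below is a minimal sketch of such a stress test, to be run inside a Windows guest. It assumes Python is installed in the guest; the function name `churn` and the batch size are arbitrary choices for illustration, and there is no guarantee that this load resembles the real jobs closely enough to hit the same fault.

```python
import subprocess
import sys

def churn(total, batch=16, cmd=None):
    """Start `total` short-lived processes, `batch` at a time, and wait for each.

    Returns the list of exit codes in the order the processes were started.
    """
    # Default child: a Python interpreter that starts and exits immediately.
    cmd = cmd or [sys.executable, "-c", "pass"]
    exit_codes = []
    for start in range(0, total, batch):
        size = min(batch, total - start)
        procs = [subprocess.Popen(cmd) for _ in range(size)]
        exit_codes.extend(p.wait() for p in procs)
    return exit_codes
```

Calling something like `churn(10000)` in the guest, possibly with `cmd` set to one of the tools the jobs actually use, would show whether the VM can be made to crash on demand.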

Code:

*******************************************************************************
*                                                                            *
*                        Bugcheck Analysis                                    *
*                                                                            *
*******************************************************************************


PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced.  This cannot be protected by try-except,
it must be protected by a Probe.  Typically the address is just plain bad or it
is pointing at freed memory.
Arguments:
Arg1: fffff680003f7db8, memory referenced.
Arg2: 0000000000000000, value 0 = read operation, 1 = write operation.
Arg3: fffff800026fcdbc, If non-zero, the instruction address which referenced the bad memory
    address.
Arg4: 0000000000000002, (reserved)


Debugging Details:
------------------




READ_ADDRESS:  fffff680003f7db8


FAULTING_IP:
nt!MiDeletePageTableHierarchy+9c
fffff800`026fcdbc 498b06          mov    rax,qword ptr [r14]


MM_INTERNAL_CODE:  2


DEFAULT_BUCKET_ID:  WIN7_DRIVER_FAULT


BUGCHECK_STR:  0x50


PROCESS_NAME:  grep.exe


CURRENT_IRQL:  0


ANALYSIS_VERSION: 6.3.9600.17029 (debuggers(dbg).140219-1702) amd64fre


TRAP_FRAME:  fffff88005378f00 -- (.trap 0xfffff88005378f00)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=000000fdf6e00000 rbx=0000000000000000 rcx=0000000fffffffff
rdx=0000058000000000 rsi=0000000000000000 rdi=0000000000000000
rip=fffff800026fcdbc rsp=fffff88005379090 rbp=fffffa80058b1200
r8=0000007ffffffff8  r9=0000098000000000 r10=fffffa8003601b90
r11=fffff88005379170 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0        nv up ei ng nz na po cy
nt!MiDeletePageTableHierarchy+0x9c:
fffff800`026fcdbc 498b06          mov    rax,qword ptr [r14] ds:00000000`00000000=????????????????
Resetting default scope


LAST_CONTROL_TRANSFER:  from fffff800027465e4 to fffff800026c9bc0


STACK_TEXT: 
fffff880`05378d98 fffff800`027465e4 : 00000000`00000050 fffff680`003f7db8 00000000`00000000 fffff880`05378f00 : nt!KeBugCheckEx
fffff880`05378da0 fffff800`026c7cee : 00000000`00000000 fffff680`003f7db8 00000000`0008ed00 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x43836
fffff880`05378f00 fffff800`026fcdbc : fffffa80`0299e6b0 00000000`00000001 fffffa80`0302aa80 fffff6fb`40001000 : nt!KiPageFault+0x16e
fffff880`05379090 fffff800`026998b6 : fffff700`01080510 fffffa80`058b1598 fffff700`01080000 fffff8a0`004028e8 : nt!MiDeletePageTableHierarchy+0x9c
fffff880`053791a0 fffff800`0269a892 : fffffa80`058b1200 fffffa80`00000000 fffff8a0`00000025 00000000`00000000 : nt!MiDeleteAddressesInWorkingSet+0x3fb
fffff880`05379a50 fffff800`0299e15a : fffff8a0`0b6cea90 00000000`00000001 00000000`00000000 fffffa80`05621a00 : nt!MmCleanProcessAddressSpace+0x96
fffff880`05379aa0 fffff800`029826b8 : 00000000`c0000005 00000000`00000001 00000000`7efdb000 00000000`00000000 : nt!PspExitThread+0x56a
fffff880`05379ba0 fffff800`026c8e53 : fffffa80`058b1200 00000000`c0000005 fffffa80`05621a00 00000000`7efdf000 : nt!NtTerminateProcess+0x138
fffff880`05379c20 00000000`76ee157a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
00000000`0008f758 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x76ee157a




STACK_COMMAND:  kb


FOLLOWUP_IP:
nt!MiDeletePageTableHierarchy+9c
fffff800`026fcdbc 498b06          mov    rax,qword ptr [r14]


SYMBOL_STACK_INDEX:  3


SYMBOL_NAME:  nt!MiDeletePageTableHierarchy+9c


FOLLOWUP_NAME:  MachineOwner


MODULE_NAME: nt


DEBUG_FLR_IMAGE_TIMESTAMP:  521ea035


IMAGE_VERSION:  6.1.7601.18247


IMAGE_NAME:  memory_corruption


FAILURE_BUCKET_ID:  X64_0x50_nt!MiDeletePageTableHierarchy+9c


BUCKET_ID:  X64_0x50_nt!MiDeletePageTableHierarchy+9c


ANALYSIS_SOURCE:  KM


FAILURE_ID_HASH_STRING:  km:x64_0x50_nt!mideletepagetablehierarchy+9c


FAILURE_ID_HASH:  {a5101511-63a3-65ce-1b12-16e97aca479e}


Followup: MachineOwner
---------
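As a side note on Arg1 in the output above: the referenced address lies in the range that x64 Windows 7 uses for its page table entries, so it can be translated back to the virtual address it describes. The sketch below assumes the fixed PTE base 0xFFFFF68000000000, 8-byte entries and 4 KiB pages; the constant and function names are mine, not WinDbg's.

```python
PTE_BASE = 0xFFFFF68000000000  # assumed: fixed PTE region base on x64 Windows 7
PTE_SIZE = 8                   # bytes per page table entry
PAGE_SHIFT = 12                # 4 KiB pages

def pte_to_va(pte_address):
    """Return the virtual address mapped by the PTE stored at `pte_address`."""
    index = (pte_address - PTE_BASE) // PTE_SIZE
    return index << PAGE_SHIFT

print(hex(pte_to_va(0xFFFFF680003F7DB8)))  # 0x7efb7000
```

Under those assumptions the faulting read targets the PTE for user address 0x7efb7000, which would be consistent with the fault happening while the address space of the exiting process is being torn down.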

I would be most grateful if anyone could shed some light on these annoying crashes, or suggest a configuration change that helps prevent them.

Cheers,
Marcel Roelofs
