Some nights, while our backups are running, one of the HA KVM guests dies, and to make things worse, it isn't brought up on the other node.
Here's an example:
109: Jun 23 05:20:42 INFO: Starting Backup of VM 109 (qemu)
109: Jun 23 05:20:42 INFO: status = running
109: Jun 23 05:20:44 INFO: backup mode: snapshot
109: Jun 23 05:20:44 INFO: ionice priority: 7
109: Jun 23 05:20:44 INFO: creating archive '/mnt/pve/backups/dump/vzdump-qemu-109-2013_06_23-05_20_42.vma.lzo'
109: Jun 23 05:20:44 INFO: started backup task '078b2fae-37f0-489b-9910-7ecb9d2474d1'
109: Jun 23 05:20:47 INFO: status: 0% (387710976/42949672960), sparse 0% (74457088), duration 3, 129/104 MB/s
109: Jun 23 05:20:50 INFO: status: 1% (640548864/42949672960), sparse 0% (114208768), duration 6, 84/71 MB/s
snip
109: Jun 23 05:24:29 INFO: status: 16% (6906576896/42949672960), sparse 0% (282591232), duration 225, 45/45 MB/s
109: Jun 23 05:24:33 INFO: status: 17% (7336099840/42949672960), sparse 0% (318234624), duration 229, 107/98 MB/s
109: Jun 23 05:26:56 ERROR: VM 109 not running
109: Jun 23 05:26:56 INFO: aborting backup job
109: Jun 23 05:26:56 ERROR: unable to find configuration file for VM 109 - no such machine
109: Jun 23 05:26:56 INFO: no such VM ('109')
109: Jun 23 05:26:57 ERROR: Backup of VM 109 failed - VM 109 not running
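To pin down when the guest died relative to the backup's progress, a small helper can scan a vzdump log like the one above. It reports the last progress line before the first ERROR, plus the time between them. This is a minimal sketch that assumes the log format shown in this post (`<vmid>: <Mon DD HH:MM:SS> <LEVEL>: <message>`). The names `find_failure` and `Failure` are hypothetical and are not part of Proxmox.

```python
import re
from datetime import datetime
from typing import Iterable, NamedTuple, Optional

# Hypothetical parser for the log format shown in this post:
# "<vmid>: <Mon DD HH:MM:SS> <LEVEL>: <message>"
LINE_RE = re.compile(r"^(\d+): (\w{3} +\d+ \d\d:\d\d:\d\d) ([A-Z]+): (.*)$")
# Progress message, e.g. "status: 17% (7336099840/42949672960), sparse ..."
STATUS_RE = re.compile(r"^status: (\d+)% \((\d+)/(\d+)\)")


class Failure(NamedTuple):
    last_progress_time: str
    percent: int
    bytes_done: int
    bytes_total: int
    error_time: str
    error_message: str
    silent_seconds: int  # time between the last progress line and the error


def _parse_time(stamp: str) -> datetime:
    # vzdump omits the year; a fixed leap year is enough to compute
    # differences within a single backup run.
    return datetime.strptime("2000 " + stamp, "%Y %b %d %H:%M:%S")


def find_failure(lines: Iterable[str]) -> Optional[Failure]:
    """Return the last progress line before the first ERROR, or None."""
    last_status = None  # (time, percent, bytes_done, bytes_total)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip "snip" and anything else that doesn't parse
        _vmid, stamp, level, msg = m.groups()
        if level == "INFO":
            s = STATUS_RE.match(msg)
            if s:
                last_status = (stamp, int(s.group(1)), int(s.group(2)), int(s.group(3)))
        elif level == "ERROR" and last_status is not None:
            gap = _parse_time(stamp) - _parse_time(last_status[0])
            return Failure(*last_status, stamp, msg, int(gap.total_seconds()))
    return None


if __name__ == "__main__":
    sample = [
        "109: Jun 23 05:24:29 INFO: status: 16% (6906576896/42949672960), sparse 0% (282591232), duration 225, 45/45 MB/s",
        "109: Jun 23 05:24:33 INFO: status: 17% (7336099840/42949672960), sparse 0% (318234624), duration 229, 107/98 MB/s",
        "109: Jun 23 05:26:56 ERROR: VM 109 not running",
    ]
    f = find_failure(sample)
    print(f.percent, f.silent_seconds, f.error_message)
```

Run against the excerpt above, it reports the failure after the 17% mark, with 143 seconds between the last progress update (05:24:33) and the "VM 109 not running" error (05:26:56). In other words, the backup logged no progress for over two minutes before the guest was reported as gone.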
When I check, it isn't running on the other node either. I have to log into the backup node and run "qm unlock 109", and only then can I start the VM and migrate it back to the primary.
We back up to an NFS share, which happens to be served by the same FreeNAS box that serves the quorum disk (via iSCSI). I'm wondering whether that could be part of the problem.
We haven't upgraded to 3.0 yet, but we are running the latest 2.3:
pve-manager: 2.3-13 (pve-manager/2.3/7946f1f1)
running kernel: 2.6.32-19-pve
proxmox-ve-2.6.32: 2.3-96
pve-kernel-2.6.32-17-pve: 2.6.32-83
pve-kernel-2.6.32-16-pve: 2.6.32-82
pve-kernel-2.6.32-19-pve: 2.6.32-96
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-4
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-36
qemu-server: 2.3-20
pve-firmware: 1.0-21
libpve-common-perl: 1.0-49
libpve-access-control: 1.0-26
libpve-storage-perl: 2.3-7
vncterm: 1.0-4
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.4-10
ksm-control-daemon: 1.1-1
Suggestions?