Hello all,
Please bear with me, I'm a bit desperate here; let me know if you need any additional information.
I've been working with Proxmox VE for almost two years now and decided to go a little further with HA, so we acquired two Dell PowerEdge servers and prepared them as a two-node cluster.
Everything went great until I applied the fencing rules. I followed the how-to at http://pve.proxmox.com/wiki/Two-Node_High_Availability_Cluster even though it seems to be written for the beta version. Fencing is configured against the iDRAC6 network cards with the 'reboot' option, and the problem comes when I try to test the environment:
1. Both nodes are running. I power off the node that has no machines on it; so far so good, the second node detects that the node is down.
2. When the failed node boots back up, it automatically reboots the node that is still working, leaving all running machines off... good thing it is not in production yet.
3. The freshly booted node then keeps rebooting the node with the machines forever, so it is impossible to reach it.
I would like the surviving node to wait until the other node comes back up. Am I supposed to delete and re-add the node after a failure? I'm sure I am missing something here.
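From what I have read, fenced has post_join_delay and post_fail_delay settings that control how long it waits before fencing after a membership change, and I suspect that is the kind of knob I am missing. Just as a sketch of what I mean (not applied yet, values are placeholders), it would be an extra line in cluster.conf:
Code:
<!-- hypothetical: give the peer 60 s to join the fence domain before it gets fenced -->
<fence_daemon post_join_delay="60" post_fail_delay="0"/>
Is that the right direction, or is the reboot loop caused by something else entirely?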
This is my configuration:
Storage
RAID 1, 80 GB hard disk on which the OS is installed.
RAID 5, 1024 GB hard disk configured with DRBD (http://pve.proxmox.com/wiki/DRBD).
I have no problems with DRBD synchronization, and every split-brain so far has recovered cleanly. I've set the sync rate parameter to 110M so it synchronizes faster (the nodes sync over a dedicated, directly connected GbE NIC).
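To be precise, by "sync parameter" I mean the DRBD syncer rate, i.e. something along these lines in the resource definition (DRBD 8.3 syntax as on the wiki page; r0 is just the example resource name, my real file may differ in the details):
Code:
resource r0 {
    syncer {
        rate 110M;   # cap resync traffic at roughly 110 MB/s over the dedicated GbE link
    }
    # protocol, disk and net sections as described on the DRBD wiki page
}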
(All RAID is hardware RAID; all disks are local.)
NFS: 5 TB shared storage for backups.
NFS: 1024 GB shared storage for ISOs.
I have also set up a quorum disk, but it is not included in cluster.conf yet; I mention it just in case it has something to do with the problem.
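For reference, this is roughly how I understand the quorum disk would eventually be declared in cluster.conf (the label and timings are placeholders, and I assume expected_votes/two_node would need adjusting at the same time):
Code:
<!-- hypothetical qdiskd entry, not in my config yet -->
<quorumd interval="1" tko="10" votes="1" label="dellHA_qdisk"/>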
Network
Each server has six GbE NICs plus the dedicated iDRAC NIC.
The static routes below are there so I can SSH into the nodes through a VPN.
Code:
root@hypvdell02:~# cat /etc/network/interfaces
# network interface settings
auto lo
iface lo inet loopback

iface eth0 inet manual

iface eth1 inet manual

iface eth2 inet manual

iface eth3 inet manual

iface eth4 inet static
        address 192.168.0.27
        netmask 255.255.255.0

auto eth5
iface eth5 inet static
        address 10.0.0.23
        netmask 255.255.255.0

auto bond0
iface bond0 inet static
        address 192.168.0.25
        netmask 255.255.255.0
        slaves eth0 eth1 eth2
        bond_miimon 100
        bond_mode 802.3ad

auto vmbr0
iface vmbr0 inet static
        address 192.168.0.23
        netmask 255.255.255.0
        gateway 192.168.0.1
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0
        up route add -net 10.12.0.0 netmask 255.255.255.0 gw 192.168.0.111
        down route del -net 10.12.0.0 netmask 255.255.255.0 gw 192.168.0.111
Cluster config
Code:
root@hypvdell02:~# cat /etc/pve/cluster.conf
<?xml version="1.0"?>
<cluster config_version="24" name="dellHA">
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_drac5" cmd_prompt="admin1->" ipaddr="192.168.0.20" login="root" name="fencenode1" passwd="5SVbXsVi58S0w7YEbWOJ" secure="1"/>
    <fencedevice agent="fence_drac5" cmd_prompt="admin1->" ipaddr="192.168.0.21" login="root" name="fencenode2" passwd="BPITqVvrZLmK8c1=-gT8" secure="1"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="hypvdell1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device action="reboot" name="fencenode1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="hypvdell02" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device action="reboot" name="fencenode2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="100"/>
    <pvevm autostart="1" vmid="101"/>
    <pvevm autostart="1" vmid="501"/>
  </rm>
</cluster>
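One thing I wondered about while pasting this is whether I should delay the fencing of one node so the two nodes cannot kill each other at the same moment. Assuming fence_drac5 accepts the standard delay option (15 is an arbitrary placeholder), node 1's device line would become something like:
Code:
<!-- hypothetical: fencing of hypvdell1 is delayed 15 s, so it survives a simultaneous fence race -->
<device action="reboot" name="fencenode1" delay="15"/>
Would that help here, or is it unrelated to the boot-time loop I am seeing?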
Additional Info
Code:
root@hypvdell02:~# tail /var/log/cluster/fenced.log
Mar 20 14:40:00 fenced fenced 1352871249 started
Mar 20 14:40:52 fenced fencing node hypvdell1
Mar 20 14:41:02 fenced fence hypvdell1 dev 0.0 agent fence_drac5 result: error from agent
Mar 20 14:41:02 fenced fence hypvdell1 failed
Mar 20 14:41:05 fenced fencing node hypvdell1
Mar 20 14:41:15 fenced fence hypvdell1 dev 0.0 agent fence_drac5 result: error from agent
Mar 20 14:41:15 fenced fence hypvdell1 failed
Mar 20 14:41:18 fenced fencing node hypvdell1
Mar 20 14:41:26 fenced fence hypvdell1 dev 0.0 agent fence_drac5 result: error from agent
Mar 20 14:41:26 fenced fence hypvdell1 failed
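Given the "error from agent" lines, I guess the next step is to run the fence agent by hand against node 1's iDRAC with the same values as in the fencedevice entry. I am assuming the standard fence_drac5 options here (-x for SSH since secure="1", -c for the command prompt, -o for the action):
Code:
# hypothetical manual check of the agent, password as in the fencedevice entry
fence_drac5 -a 192.168.0.20 -l root -p '5SVbXsVi58S0w7YEbWOJ' -x -c 'admin1->' -o status
Is that the right way to verify the agent outside of fenced?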
Code:
root@hypvdell02:~# tail /var/log/cluster/corosync.log
Mar 20 14:39:56 corosync [CLM ] Members Left:
Mar 20 14:39:56 corosync [CLM ] Members Joined:
Mar 20 14:39:56 corosync [CLM ] r(0) ip(192.168.0.23)
Mar 20 14:39:56 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 20 14:39:56 corosync [CMAN ] quorum regained, resuming activity
Mar 20 14:39:56 corosync [QUORUM] This node is within the primary component and will provide service.
Mar 20 14:39:56 corosync [QUORUM] Members[1]: 2
Mar 20 14:39:56 corosync [QUORUM] Members[1]: 2
Mar 20 14:39:56 corosync [CPG ] chosen downlist: sender r(0) ip(192.168.0.23) ; members(old:0 left:0)
Mar 20 14:39:56 corosync [MAIN ] Completed service synchronization, ready to provide service.
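If any more state would help, I can post the output of these from the surviving node (just listing the commands for now):
Code:
pvecm status       # Proxmox view of the cluster and quorum
cman_tool nodes    # cman membership and votes
fence_tool ls      # fence domain state
clustat            # rgmanager view of nodes and HA services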