Channel: Proxmox Support Forum

HUGE Fencing problem with IPMI

Hi all,

I have three HP DL 165 G7 servers that I'm trying to use with Proxmox. If we can get this working, the plan is to buy at least a community license, but for this proof of concept we are still running the free open-source edition.

The Problem:

We are trying to set up fencing and it is somewhat working. If I reboot a node, the VMs fail over to the other nodes. The problem comes when I pull the power cord directly from a server. This does NOT result in the VMs migrating to another host. They stay on the failed host, and after a while the web interface shows them with the powered-off icon. If I restore power to the node at any point, the VMs start migrating while the server is still in POST.

Another issue:

These servers have an ILO100 BMC, which you access by setting an IPMI IP in the BIOS. You also have to choose shared NICs, which means the IPMI web interface is reachable on all NICs in the server. Currently my IPs are as follows:
proxmox00 10.10.99.20 - BMC IP 10.10.99.30
proxmox01 10.10.99.21 - BMC IP 10.10.99.31
proxmox02 10.10.99.22 - BMC IP 10.10.99.32

We were able to both ping the BMC IPs and reach the web interface until a few days ago. All of a sudden we could neither ping them NOR reach the web interface. Further troubleshooting showed that we ARE ABLE to ping the BMC IP during POST. The ping and HTTP access stop as soon as Proxmox starts the CMAN process, i.e. when the console prints "Starting CMAN.... OK".
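To pin down the exact moment the BMC stops answering, a small watcher along these lines could be left running on another machine while a node boots. This is only a rough sketch: the function names are made up for illustration, and it assumes the Linux iputils `ping` with its `-c` and `-W` options.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo to host is answered (Linux iputils ping assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def transitions(samples):
    """Keep only the (timestamp, reachable) samples where the state changes.

    The first sample is always kept, so the initial state is recorded too.
    """
    changes = []
    previous = None
    for stamp, reachable in samples:
        if reachable != previous:
            changes.append((stamp, reachable))
            previous = reachable
    return changes


def watch(host, count, interval_s=1.0, probe=ping_once, sleep=time.sleep):
    """Probe host count times and return the state changes that were seen."""
    samples = []
    for _ in range(count):
        stamp = datetime.now().isoformat(timespec="seconds")
        samples.append((stamp, probe(host)))
        sleep(interval_s)
    return transitions(samples)
```

For example, `watch("10.10.99.31", 300)` run from a workstation while proxmox01 boots would return the timestamps at which its BMC address went up or down, which can then be compared with the time CMAN starts.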

Here is my cluster conf:

root@proxmox00:~# cat /etc/pve/cluster.conf
Code:

<?xml version="1.0"?>
<cluster config_version="22" name="DingITCluster">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey">
  </cman>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="10.10.99.20" lanplus="1" auth="password" login="admin" name="ipmi1" passwd="XXXXXXXXXX" power_wait="5" method="cycle" />
    <fencedevice agent="fence_ipmilan" ipaddr="10.10.99.21" lanplus="1" auth="password" login="admin" name="ipmi2" passwd="XXXXXXXXXX" power_wait="5" method="cycle"/>
    <fencedevice agent="fence_ipmilan" ipaddr="10.10.99.22" lanplus="1" auth="password" login="admin" name="ipmi3" passwd="XXXXXXXXXX" power_wait="5" method="cycle"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="proxmox00" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="ipmi1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="proxmox01" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="ipmi2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="proxmox02" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="ipmi3"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="101"/>
    <pvevm autostart="1" vmid="102"/>
    <pvevm autostart="1" vmid="103"/>
    <pvevm autostart="1" vmid="104"/>
    <pvevm autostart="1" vmid="105"/>
    <pvevm autostart="1" vmid="107"/>
    <pvevm autostart="1" vmid="108"/>
    <pvevm autostart="1" vmid="109"/>
    <pvevm autostart="1" vmid="110"/>
    <pvevm autostart="1" vmid="111"/>
  </rm>
</cluster>
root@proxmox00:~#
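One thing worth checking against the address list above: in the config as posted, the `ipaddr` values on the fence devices are 10.10.99.20-22, which are the node addresses, while the BMCs were listed as 10.10.99.30-32. A minimal sketch of a check that compares the two is shown here; the helper names and the `EXPECTED_BMC` table are assumptions made for illustration, with the addresses copied from this post.

```python
import xml.etree.ElementTree as ET

# BMC addresses as listed earlier in this post.
EXPECTED_BMC = {
    "proxmox00": "10.10.99.30",
    "proxmox01": "10.10.99.31",
    "proxmox02": "10.10.99.32",
}


def fence_addresses(cluster_conf_xml):
    """Map each cluster node name to the ipaddr values of the fence devices it uses."""
    root = ET.fromstring(cluster_conf_xml)
    device_ip = {
        dev.get("name"): dev.get("ipaddr")
        for dev in root.iter("fencedevice")
    }
    result = {}
    for node in root.iter("clusternode"):
        names = [d.get("name") for d in node.iter("device")]
        result[node.get("name")] = [device_ip.get(n) for n in names]
    return result


def mismatches(cluster_conf_xml, expected=EXPECTED_BMC):
    """Return {node: (configured, expected)} for nodes whose fence device is not the BMC."""
    found = fence_addresses(cluster_conf_xml)
    return {
        node: (found.get(node, []), bmc)
        for node, bmc in expected.items()
        if found.get(node, []) != [bmc]
    }
```

Calling `mismatches(open("/etc/pve/cluster.conf", "rb").read())` on a node would list every entry where the fence device points somewhere other than the expected BMC address; an empty dict means they all agree.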

When we try

Code:

fence_node proxmox00 -vv
it fails with:

Code:

root@proxmox00:~# fence_node proxmox01 -vv
fence proxmox01 dev 0.0 agent fence_ipmilan result: error from agent
agent args: nodename=proxmox01 agent=fence_ipmilan ipaddr=10.10.99.21 lanplus=1 auth=password login=admin passwd=XXXXX power_wait=5 method=cycle
fence proxmox01 failed

using

Code:

fence_ipmilan
fails with

Code:

root@proxmox00:~# fence_ipmilan -l admin -p XXXXXXXX -P -a 10.10.99.32 -T 4 -o off -v
Powering off machine @ IPMI:10.10.99.32...Spawning: '/usr/bin/ipmitool -I lanplus -H '10.10.99.32' -U 'admin' -P '[set]' -v chassis power status'...
ipmilan: Failed to connect after 20 seconds
Failed
root@proxmox00:~#
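The verbose output shows that fence_ipmilan simply spawns `ipmitool ... chassis power status`, so the same call can be made directly against each BMC to separate "BMC unreachable" from "fence agent misconfigured". A hedged sketch follows; the function names and the timeout value are assumptions, and only the ipmitool arguments are taken from the output above.

```python
import subprocess

# BMC addresses as listed earlier in this post.
BMC_HOSTS = ["10.10.99.30", "10.10.99.31", "10.10.99.32"]


def ipmitool_status_cmd(host, user, password):
    """Build the same 'chassis power status' call that fence_ipmilan spawned above."""
    return [
        "/usr/bin/ipmitool", "-I", "lanplus",
        "-H", host, "-U", user, "-P", password,
        "chassis", "power", "status",
    ]


def bmc_answers(host, user, password, timeout_s=25, run=subprocess.run):
    """Return (ok, output) for one BMC; ok is False on a non-zero exit or a timeout."""
    try:
        proc = run(
            ipmitool_status_cmd(host, user, password),
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    except FileNotFoundError:
        return False, "ipmitool not found"
    return proc.returncode == 0, (proc.stdout + proc.stderr).strip()
```

Looping over `BMC_HOSTS` and printing `bmc_answers(host, "admin", password)` from each node, once before and once after CMAN has started, would show whether the failure follows the BMC connectivity or the fence configuration.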



However, the fence commands are being issued while the BMC IP isn't responding, so it COULD be that they fail simply because there is no connection to the BMC at all. Still, I'm confused and not sure this is right, because my VMs do migrate when I reboot a Proxmox node, which should indicate that fencing is working...

How did we set this up...

Well, you have the cluster conf already.

All nodes are fully upgraded. Of course we don't get the stable updates, since the servers are not licensed yet.

ipmitool is installed on all servers

redhat-cluster-pve has FENCE_JOIN set to yes

fence_tool join has been run on all servers

fence_tool ls shows:

Code:

root@proxmox00:~# fence_tool ls
fence domain
member count  3
victim count  0
victim now    0
master nodeid 1
wait state    none
members      1 2 3

ANY input would be GREATLY appreciated; we have been working on this for about a week now...

THANKS

Casper

EDIT:

The servers also have fully updated firmware.

If you want to look the servers up, one of them has the serial number CZJ2120JSW.
