So I went to join an 11th node to my cluster, got an "unable to copy ssh-id" error, and then all of the other nodes in the cluster went red / offline. The corosync log shows a burst of retransmits, then an error about "A processor failed, forming new configuration", and then I can see all the other nodes leaving the cluster. Any ideas? I've tried rebooting a few nodes and restarting cman, pvestatd, etc., with no luck.
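For reference, the join that failed was roughly this (run on the new, 11th node; the IP is a placeholder for one of my existing cluster nodes):

```shell
# On the new node: join the existing cluster (PVE 3.x).
# pvecm copies SSH keys to the cluster as part of the join;
# this is the step that failed with "unable to copy ssh-id".
pvecm add 10.18.200.2
```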
Below are the relevant corosync.log excerpt, the syslog error that repeats constantly, and the cluster status from the node that is still 'online' (proxmox1) and from one of the offline nodes (proxmox2).

PVE version is: 3.1-21/93bf03d4
corosync.log
Code:
Feb 23 11:42:00 corosync [TOTEM ] Retransmit List: 110e 110f 1110 1111 1112 1113 1114 1115 1103 1104 1105 1106 1107 1108 1109 110a 110b 111b 10fb 10fc 10fd 110c 110d 1116 1117 1118 1119 111a 10fe 10ff
[... same Retransmit List line repeated many more times ...]
Feb 23 11:42:10 corosync [TOTEM ] A processor failed, forming new configuration.
Feb 23 11:42:12 corosync [CLM ] CLM CONFIGURATION CHANGE
Feb 23 11:42:12 corosync [CLM ] New Configuration:
Feb 23 11:42:12 corosync [CLM ] r(0) ip(10.18.200.2)
Feb 23 11:42:12 corosync [CLM ] Members Left:
Feb 23 11:42:12 corosync [CLM ] r(0) ip(10.18.200.3)
Feb 23 11:42:12 corosync [CLM ] r(0) ip(10.18.200.4)
Feb 23 11:42:12 corosync [CLM ] r(0) ip(10.18.200.6)
Feb 23 11:42:12 corosync [CLM ] r(0) ip(10.18.200.7)
Feb 23 11:42:12 corosync [CLM ] r(0) ip(10.18.200.9)
Feb 23 11:42:12 corosync [CLM ] r(0) ip(10.18.200.10)
Feb 23 11:42:12 corosync [CLM ] r(0) ip(10.18.200.11)
Feb 23 11:42:12 corosync [CLM ] r(0) ip(10.18.200.12)
Feb 23 11:42:12 corosync [CLM ] r(0) ip(10.18.200.13)
Feb 23 11:42:12 corosync [CLM ] Members Joined:
Feb 23 11:42:12 corosync [QUORUM] Members[9]: 1 3 4 5 6 7 8 9 10
Feb 23 11:42:12 corosync [QUORUM] Members[8]: 1 4 5 6 7 8 9 10
Feb 23 11:42:12 corosync [QUORUM] Members[7]: 1 5 6 7 8 9 10
Feb 23 11:42:12 corosync [QUORUM] Members[6]: 1 6 7 8 9 10
Feb 23 11:42:12 corosync [CMAN ] quorum lost, blocking activity
Feb 23 11:42:12 corosync [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 23 11:42:12 corosync [QUORUM] Members[5]: 1 6 7 8 9
Feb 23 11:42:12 corosync [QUORUM] Members[4]: 1 6 7 8
Feb 23 11:42:12 corosync [QUORUM] Members[3]: 1 7 8
Feb 23 11:42:12 corosync [QUORUM] Members[2]: 1 8
Feb 23 11:42:12 corosync [QUORUM] Members[1]: 1
Feb 23 11:42:12 corosync [CLM ] CLM CONFIGURATION CHANGE
Feb 23 11:42:12 corosync [CLM ] New Configuration:
Feb 23 11:42:12 corosync [CLM ] r(0) ip(10.18.200.2)
Feb 23 11:42:12 corosync [CLM ] Members Left:
Feb 23 11:42:12 corosync [CLM ] Members Joined:
Feb 23 11:42:12 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 23 11:42:12 corosync [CPG ] chosen downlist: sender r(0) ip(10.18.200.2) ; members(old:10 left:9)
Feb 23 11:42:12 corosync [MAIN ] Completed service synchronization, ready to provide service.
Feb 23 11:47:24 corosync [SERV ] Unloading all Corosync service engines.
Feb 23 11:47:24 corosync [SERV ] Service engine unloaded: corosync extended virtual synchrony service
Feb 23 11:47:24 corosync [SERV ] Service engine unloaded: corosync configuration service
Feb 23 11:47:24 corosync [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Feb 23 11:47:24 corosync [SERV ] Service engine unloaded: corosync cluster config database access v1.01
Feb 23 11:47:24 corosync [SERV ] Service engine unloaded: corosync profile loading service
Feb 23 11:47:24 corosync [SERV ] Service engine unloaded: openais cluster membership service B.01.01
Feb 23 11:47:24 corosync [SERV ] Service engine unloaded: openais checkpoint service B.01.01
Feb 23 11:47:24 corosync [SERV ] Service engine unloaded: openais event service B.01.01
Feb 23 11:47:24 corosync [SERV ] Service engine unloaded: openais distributed locking service B.03.01
Feb 23 11:47:24 corosync [SERV ] Service engine unloaded: openais message service B.03.01
Feb 23 11:47:24 corosync [SERV ] Service engine unloaded: corosync CMAN membership service 2.90
Feb 23 11:47:24 corosync [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Feb 23 11:47:24 corosync [SERV ] Service engine unloaded: openais timer service A.01.01
Feb 23 11:47:24 corosync [MAIN ] Corosync Cluster Engine exiting with status 0 at main.c:1893.
Feb 23 11:49:33 corosync [MAIN ] Corosync Cluster Engine ('1.4.5'): started and ready to provide service.
Feb 23 11:49:33 corosync [MAIN ] Corosync built-in features: nss
Feb 23 11:49:33 corosync [MAIN ] Successfully read config from /etc/cluster/cluster.conf
Feb 23 11:49:33 corosync [MAIN ] Successfully parsed cman config
Feb 23 11:49:33 corosync [MAIN ] Successfully configured openais services to load
Feb 23 11:49:33 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
Feb 23 11:49:33 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
My syslog is filled with tons of these:
Code:
Feb 24 04:46:25 proxmox1 pmxcfs[2971]: [status] crit: cpg_send_message failed: 9
Status from the node that is still 'online' (proxmox1):
Code:
Version: 6.2.0
Config Version: 11
Cluster Name: test-cluster
Cluster Id: 20404
Cluster Member: Yes
Cluster Generation: 88
Membership state: Cluster-Member
Nodes: 1
Expected votes: 11
Total votes: 1
Node votes: 1
Quorum: 6 Activity blocked
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: proxmox1
Node ID: 1
Multicast addresses: 239.192.79.4
Node addresses: 10.18.200.2
Status from one of the offline nodes (proxmox2):
Code:
Version: 6.2.0
Config Version: 10
Cluster Name: test-cluster
Cluster Id: 20404
Cluster Member: Yes
Cluster Generation: 88
Membership state: Cluster-Member
Nodes: 1
Expected votes: 10
Total votes: 1
Node votes: 1
Quorum: 6 Activity blocked
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: proxmox2
Node ID: 2
Multicast addresses: 239.192.79.4
Node addresses: 10.18.200.3
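If I understand the quorum math, both status outputs agree that quorum needs 6 votes, so any single node with only 1 vote stays blocked. A quick sketch of that, assuming cman's usual strict-majority formula (my assumption, not taken from the logs):

```python
# Sketch: quorum as a strict majority of expected votes,
# which matches the "Quorum: 6" shown by both nodes above.
def quorum_needed(expected_votes: int) -> int:
    return expected_votes // 2 + 1

print(quorum_needed(11))  # proxmox1 expects 11 votes -> quorum 6
print(quorum_needed(10))  # proxmox2 expects 10 votes -> quorum 6
```

One thing I also notice in the outputs above: the two nodes report different Config Versions (11 on proxmox1 vs 10 on proxmox2) — not sure if that's related.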
Let me know if there is any other information I can provide.