
DRBD Diskless after 48 hours

Hi all,

I have a problem with my Proxmox cluster and the DRBD replication between the nodes. I set everything up and it worked fine: the cluster ran perfectly, including migration and backups of all VMs. The storage is LVM on top of DRBD, and each DRBD volume group is 1 TB. But after some time DRBD gets into this state:

Code:

 
0:r0  Connected Primary/Primary UpToDate/Diskless C r----- lvm-pv: drbdvg0 931.29g 861.00g
1:r1  Connected Primary/Primary UpToDate/UpToDate C r----- lvm-pv: drbdvg1 931.29g 0g
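
So resource r0 has lost its backing disk on one node (Diskless) while the other side is still UpToDate; r1 looks healthy. For completeness, these are the commands I use to check the state on each node (a minimal sketch; the resource names match my config below):

Code:

# Overall DRBD status on this node
cat /proc/drbd

# Per-resource states: local/peer disk state and connection state
drbdadm dstate r0    # here it reports Diskless on one side
drbdadm cstate r0    # reports Connected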

In dmesg I see:

Code:

block drbd0: Starting worker thread (from cqueue [2626])
block drbd0: open("/dev/sdb1") failed with -16
block drbd0: drbd_bm_resize called with capacity == 0
block drbd0: worker terminated
block drbd0: Terminating worker thread
block drbd1: Starting worker thread (from cqueue [2626])
block drbd1: disk( Diskless -> Attaching )
block drbd1: Found 4 transactions (70 active extents) in activity log.
block drbd1: Method to ensure write ordering: barrier
block drbd1: max BIO size = 131072
block drbd1: drbd_bm_resize called with capacity == 1953064672
block drbd1: resync bitmap: bits=244133084 words=3814580 pages=7451
block drbd1: size = 931 GB (976532336 KB)
block drbd1: bitmap READ of 7451 pages took 37 jiffies
block drbd1: recounting of set bits took additional 36 jiffies
block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
block drbd1: disk( Attaching -> UpToDate )
block drbd1: attached to UUIDs 70A8363B4F73C19E:0000000000000000:43AC9F762F8AF4F7:43AB9F762F8AF4F7
block drbd0: Starting worker thread (from cqueue [2626])
block drbd0: conn( StandAlone -> Unconnected )
block drbd0: Starting receiver thread (from drbd0_worker [2661])
block drbd0: receiver (re)started
block drbd0: conn( Unconnected -> WFConnection )
block drbd1: conn( StandAlone -> Unconnected )
block drbd1: Starting receiver thread (from drbd1_worker [2649])
block drbd1: receiver (re)started
block drbd1: conn( Unconnected -> WFConnection )
block drbd0: Handshake successful: Agreed network protocol version 96
block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
block drbd0: conn( WFConnection -> WFReportParams )
block drbd0: Starting asender thread (from drbd0_receiver [2670])
block drbd0: data-integrity-alg: <not-used>
block drbd0: max BIO size = 4096
block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
block drbd1: Handshake successful: Agreed network protocol version 96
block drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
block drbd1: conn( WFConnection -> WFReportParams )
block drbd1: Starting asender thread (from drbd1_receiver [2674])
block drbd1: data-integrity-alg: <not-used>
block drbd1: drbd_sync_handshake:
block drbd1: self 70A8363B4F73C19E:0000000000000000:43AC9F762F8AF4F7:43AB9F762F8AF4F7 bits:0 flags:0
block drbd1: peer 7D727C5A8840067D:70A8363B4F73C19F:43AC9F762F8AF4F7:43AB9F762F8AF4F7 bits:0 flags:0
block drbd1: uuid_compare()=-1 by rule 50
block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate )
block drbd0: role( Secondary -> Primary )
block drbd1: role( Secondary -> Primary )
DLM (built Oct 14 2013 08:10:28) installed
block drbd1: conn( WFBitMapT -> WFSyncUUID )
block drbd1: updated sync uuid 70A9363B4F73C19F:0000000000000000:43AC9F762F8AF4F7:43AB9F762F8AF4F7
block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1
block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
block drbd1: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
block drbd1: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
block drbd1: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
block drbd1: updated UUIDs 7D727C5A8840067D:0000000000000000:70A9363B4F73C19F:70A8363B4F73C19F
block drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1
block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1 exit code 0 (0x0)
block drbd1: bitmap WRITE of 7451 pages took 20 jiffies
block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
ip_tables: (C) 2000-2006 Netfilter Core Team
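
The line that stands out to me is open("/dev/sdb1") failed with -16: errno 16 is EBUSY, so something else already had /dev/sdb1 open when DRBD tried to attach it at boot. That would also explain the drbd_bm_resize called with capacity == 0 message and drbd0 coming up Diskless. These are the checks for what is holding the device (a sketch; assuming lsof and the standard LVM tools are installed):

Code:

# Which processes have the backing partition open?
lsof /dev/sdb1

# Did LVM scan the raw partition instead of the DRBD device?
# /dev/sdb1 should NOT be listed as a PV here; only /dev/drbd0 should.
pvs -o pv_name,vg_name

If LVM did grab /dev/sdb1 directly, I understand the usual safeguard is a filter in /etc/lvm/lvm.conf that rejects the raw backing partitions (something like filter = [ "r|/dev/sdb1|", "a|.*|" ]) so they cannot be claimed at boot, but I am not sure that is what is happening here.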

My DRBD configuration looks like this:

- global_common.conf
Code:

global {
  usage-count yes;
  # minor-count dialog-refresh disable-ip-verification
}

common {
  protocol C;

  handlers {
    # The following 3 handlers were disabled due to #576511.
    # Please check the DRBD manual and enable them, if they make sense in your setup.
    # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";

    # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
    # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
    # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
  }

  startup {
    # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
    wfc-timeout 15;
    degr-wfc-timeout 15;
    become-primary-on both;
  }

  disk {
    # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
    # no-disk-drain no-md-flushes max-bio-bvecs
  }

  net {
    # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
    # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
    # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
    cram-hmac-alg sha1;
    shared-secret "my-secret";
    allow-two-primaries;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }

  syncer {
    # rate after al-extents use-rle cpu-mask verify-alg csums-alg
    rate 1000M;
  }
}
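
To rule out a configuration problem, the parsed config can be checked without touching the running resources (a quick sanity check; -d is drbdadm's dry-run flag):

Code:

# Print the configuration exactly as drbdadm parses it;
# syntax errors would show up here
drbdadm dump all

# Dry-run: show what drbdadm would change on the running resource
drbdadm -d adjust r0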

And the resource looks like this: r0.res
Code:

# This is the resource used for the shared LVM storage.
resource r0 {
  # This is the block device path.
  device    /dev/drbd0;

  # We'll use the normal internal metadisk (takes about 32MB/TB)
  meta-disk internal;

  # This is the `uname -n` of the first node
  on node1 {
    # The 'address' has to be the IP, not a hostname. This is the
    # node's SN (bond1) IP. The port number must be unique among
    # resources.
    address  10.0.0.12:7788;

    # This is the block device backing this resource on this node.
    disk    /dev/sdb1;
  }
  # Now the same information again for the second node.
  on node2 {
    address  10.0.0.13:7788;
    disk    /dev/sdb1;
  }
}
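
Among the things I tried was re-attaching the backing device on the node where drbd0 is Diskless (a sketch of the relevant commands; careful with the down/up variant, since it pulls /dev/drbd0 out from under LVM and any running VMs):

Code:

# On the Diskless node: try to re-attach the local backing disk.
# This fails with the same EBUSY if something still holds /dev/sdb1.
drbdadm attach r0

# Last resort: tear the resource down and bring it back up
# (only with the VMs on this storage migrated away or stopped)
drbdadm down r0
drbdadm up r0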

I have tried a lot of things, but I am out of ideas now. What happened, and why? Do you have any suggestions? Could the disks in the servers be failing?
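
To test the broken-disk theory, checking SMART data on the backing disks seems like the obvious first step (assuming smartmontools is installed; /dev/sdb matches my setup):

Code:

# Overall SMART health verdict for the backing disk
smartctl -H /dev/sdb

# SMART error log; recent read/write errors would show up here
smartctl -l error /dev/sdb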

I would be very grateful for any help or answers.

Best,
Rafal
