QDisk(5)		      Cluster Quorum Disk		      QDisk(5)

NAME
       QDisk 1.2.1 - a disk-based quorum daemon for CMAN / Linux-Cluster

1. Overview
1.1. Problem
       In  some	 situations,  it  may  be  necessary or desirable to sustain a
       majority node failure of a cluster without  introducing	the  need  for
       asymmetric  cluster  configurations  (e.g.  client-server,  or heavily-
       weighted voting nodes).

1.2. Design Requirements
       * Ability to sustain 1..(n-1)/n simultaneous node failures, without the
       danger  of  a simple network partition causing a split brain.  That is,
       we need to be able to ensure that the  majority	failure	 case  is  not
       merely the result of a network partition.

       * Ability to use external reasons for deciding which partition is  the
       quorate partition in a partitioned cluster.  For  example,  a  user
       may  have  a  service running on one node, and that node must always be
       the master in the event of a network partition.	Or, a node might  lose
       all  network  connectivity  except  the cluster communication path - in
       which case, a user may wish that node to be evicted from the cluster.

       * Integration with CMAN.	 We must not require CMAN to run with  us  (or
       without us).  Linux-Cluster does not require a quorum disk normally -
       imposing new requirements on the basic operation of Linux-Cluster is
       not allowed.

       * Data integrity.  In order to recover from a majority failure, fencing
       is required.  The fencing subsystem is already provided by  Linux-Clus‐
       ter.

       * Non-reliance on hardware- or protocol-specific methods (e.g.  SCSI
       reservations).  This ensures the quorum disk algorithm can be  used  on
       the widest range of hardware configurations possible.

       *  Little  or  no  memory allocation after initialization.  In critical
       paths during failover, we do not want to	 have  to  worry  about	 being
       killed during a memory pressure situation because we trigger a page
       fault, and the Linux OOM killer responds...

1.3. Hardware Considerations and Requirements
1.3.1. Concurrent, Synchronous, Read/Write Access
       This quorum daemon requires  a  shared  block  device  with  concurrent
       read/write  access  from	 all  nodes  in the cluster.  The shared block
       device can be a multi-port SCSI RAID array, a Fiber-Channel RAID SAN, a
       RAIDed  iSCSI target, or even GNBD.  The quorum daemon uses O_DIRECT to
       write to the device.

1.3.2. Bargain-basement JBODs need not apply
       There is a minimum performance requirement inherent  when  using	 disk-
       based  cluster  quorum  algorithms, so design your cluster accordingly.
       Using a cheap JBOD with old SCSI2 disks on a multi-initiator  bus  will
       cause problems at the first load spike.	Plan your loads accordingly; a
       node's inability to write to the quorum disk in a  timely  manner  will
       cause  the cluster to evict the node.  Using host-RAID or multi-initia‐
       tor parallel SCSI configurations with the qdisk daemon is  unlikely  to
       work,  and  will	 probably  cause  administrators a lot of frustration.
       That having been said, because  the  timeouts  are  configurable,  most
       hardware should work if the timeouts are set high enough.
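
       As an illustration of that tuning (the values below are purely
       illustrative and not a recommendation for any particular hardware),
       the interval and tko attributes described in section 3.1 can be
       relaxed so that slower storage has more time to complete each
       read/write cycle:

	    <quorumd interval="2" tko="20" votes="1" label="mylabel">
		<!-- With these values a node is evicted only after
		     interval * tko = 40 seconds of missed updates, so
		     CMAN's own eviction timeout must be raised to at
		     least twice that (see sections 1.4 and 3.1). -->
		...
	    </quorumd>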

1.3.3. Fencing is Required
       In order to maintain data integrity under all failure scenarios, use of
       this quorum daemon requires adequate fencing,  preferably  power-based
       fencing.	  Watchdog  timers  and software-based solutions to reboot the
       node internally, while possibly sufficient, are not  considered	'fenc‐
       ing' for the purposes of using the quorum disk.

1.4. Limitations
       *  At  this  time, this daemon supports a maximum of 16 nodes.  This is
       primarily a scalability issue:  As  we  increase	 the  node  count,  we
       increase	 the amount of synchronous I/O contention on the shared quorum
       disk.

       * Cluster node IDs must be statically configured	 in  cluster.conf  and
       must be numbered from 1..16 (there can be gaps, of course).

       * Cluster nodes must have one vote each.  The effects of nodes having
       any number of votes other than 1 have not been explored.

       * CMAN must be running before the qdisk program	can  operate  in  full
       capacity.  If CMAN is not running, qdisk will wait for it.

       * CMAN's eviction timeout should be at least 2x the quorum daemon's
       (that is, at least 2 * interval * tko seconds; see the sketch at the
       end of this section) to give the quorum daemon adequate time to
       converge on a master during a combined failure and load spike
       situation.

       * For 'all-but-one' failure operation, the total number of votes
       assigned to the quorum device should be equal to the number of nodes
       in the cluster minus 1.  For example, if you have 3 nodes in the
       cluster, each node should get one vote and qdisk should get 2 votes
       (see the sketch at the end of this section).  This also lets the
       cluster operate without qdisk when all nodes are online, for testing
       and other purposes.

       * For 'tiebreaker' operation in a two-node cluster, unset CMAN's
       two_node flag (or set it to 0), set CMAN's expected votes to '3', set
       each node's vote to '1', and set qdisk's vote count to '1' as well.
       This allows the cluster to operate if either both nodes are online, or
       a single node whose heuristics pass (see example 3.3.2 below).

       * Currently, the quorum disk daemon is difficult to use	with  CLVM  if
       the quorum disk resides on a CLVM logical volume.  CLVM requires a quo‐
       rate cluster to correctly operate, which introduces  a  chicken-and-egg
       problem	for  starting  the  cluster: CLVM needs quorum, but the quorum
       daemon needs CLVM (if and only if the quorum device lies	 on  CLVM-man‐
       aged  storage).	 One way to work around this is to *not* set the clus‐
       ter's expected votes to include the quorum daemon's votes.   Bring  all
       nodes  online, and start the quorum daemon *after* the whole cluster is
       running.	 This will allow the expected votes to increase naturally.
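
       The following cluster.conf fragment sketches the 'all-but-one' vote
       arithmetic and the eviction-timeout relationship described above.  The
       values are illustrative only, and the <totem> token timeout shown is
       an assumption about how the CMAN membership timeout is configured on a
       given release; consult cman(5) for the parameter that applies to your
       version.

	<!-- Illustrative 3-node 'all-but-one' layout: each node has 1 vote,
	     qdisk has n-1 = 2 votes, and expected_votes is 5, so quorum is
	     3 votes.  One node plus the quorum disk (1 + 2 = 3) keeps the
	     cluster quorate, and so do all three nodes without the quorum
	     disk. -->
	<cman expected_votes="5" .../>
	<clusternodes>
	    <clusternode name="node1" nodeid="1" votes="1" ... />
	    <clusternode name="node2" nodeid="2" votes="1" ... />
	    <clusternode name="node3" nodeid="3" votes="1" ... />
	</clusternodes>
	<!-- qdisk's eviction timeout is interval * tko = 10 seconds... -->
	<quorumd interval="1" tko="10" votes="2" label="mylabel">
	    <heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/>
	</quorumd>
	<!-- ...so CMAN's should be at least 20 seconds; here that is assumed
	     to be the totem token timeout, given in milliseconds. -->
	<totem token="21000"/>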

2. Algorithms
2.1. Heartbeating & Liveliness Determination
       Nodes update individual status blocks on the quorum  disk  at  a	 user-
       defined rate.  Each write of a status block alters the timestamp, which
       is what other nodes use to decide whether a node has hung or not.
       After a user-defined number of 'misses' (that is, failures to update
       the timestamp), a node is declared offline.  After a certain number of
       'hits' (changed timestamp + "I am alive" state), the node is declared
       online.

       The status block contains additional information, such as a bitmask  of
       the  nodes  that node believes are online.  Some of this information is
       used by the master - while some is just for performance recording, and
       may  be used at a later time.  The most important pieces of information
       a node writes to its status block are:

	    - Timestamp
	    - Internal state (available / not available)
	    - Score
	    - Known max score (may be used in the  future  to  detect  invalid
	    configurations)
	    - Vote/bid messages
	    - Other nodes it thinks are online

2.2. Scoring & Heuristics
       The  administrator  can configure up to 10 purely arbitrary heuristics,
       and must exercise caution in doing so.	At  least  one	administrator-
       defined heuristic is required for operation, but it is generally a good
       idea to have more than one heuristic.  By default, only	nodes  scoring
       over  1/2  of the total maximum score will claim they are available via
       the quorum disk, and a node (master or otherwise) whose score drops too
       low will remove itself (usually, by rebooting).

       The  heuristics	themselves  can	 be any command executable by 'sh -c'.
       For example, in early testing the following was used:

	    <heuristic program="[ -f /quorum ]" score="10" interval="2"/>

       This is a literal sh-ism which tests for the existence of a file called
       "/quorum".  Without that file, the node would claim it was unavailable.
       This is an awful example, and should never, ever be used in production,
       but is provided as an illustration of what one could do...

       Typically,  the heuristics should be snippets of shell code or commands
       which help determine a node's usefulness to  the	 cluster  or  clients.
       Ideally,	 you  want  to	add traces for all of your network paths (e.g.
       check links, or ping routers), and methods to  detect  availability  of
       shared storage.
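
       As a sketch of that idea (the router placeholders follow the examples
       in section 3.3, while the interface name and the ethtool-based link
       check are hypothetical and must be adapted to the local environment),
       a node that routes client traffic through two gateways over eth0
       might use:

	    <!-- Hypothetical heuristics: replace A and B with your router
		 addresses and eth0 with your client-facing interface. -->
	    <heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/>
	    <heuristic program="ping B -c1 -t1" score="1" interval="2" tko="3"/>
	    <heuristic program="ethtool eth0 | grep -q 'Link detected: yes'"
		score="1" interval="2" tko="3"/>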

2.3. Master Election
       Only  one  master is present at any one time in the cluster, regardless
       of how many partitions exist within the cluster itself.  The master is
       elected by a simple voting scheme in which the node with the lowest
       node ID that believes it is capable of running (i.e. scores high
       enough) bids for master status.  If the other nodes agree, it becomes
       the master.  This
       algorithm is run whenever no master is present.

       If another node comes online with a lower node ID while a node is still
       bidding	for  master  status,  it will rescind its bid and vote for the
       lower node ID.  If a master dies or a bidding  node  dies,  the	voting
       algorithm  is  started  over.  The voting algorithm typically takes two
       passes to complete.

       Master deaths take marginally longer to recover	from  than  non-master
       deaths,	because a new master must be elected before the old master can
       be evicted & fenced.

2.4. Master Duties
       The master node decides who is or is not in the master partition, and
       handles eviction of dead nodes (both via the quorum disk and
       via the linux-cluster fencing  system  by  using	 the  cman_kill_node()
       API).

2.5. How it All Ties Together
       When  a	master	is  present,  and  if the master believes a node to be
       online, that node will advertise to CMAN that the quorum disk is avail‐
       able.  The master will only grant a node membership if:

	    (a) CMAN believes the node to be online, and
	    (b) that node has made enough consecutive, timely writes
		to the quorum disk, and
	    (c) the node has a high enough score to consider itself online.

3. Configuration
3.1. The <quorumd> tag
       This tag is a child of the top-level <cluster> tag.

	<quorumd
	 interval="1"
	    This is the frequency of read/write cycles, in seconds.

	 tko="10"
	    This  is  the  number  of  cycles  a node must miss in order to be
	    declared dead.

	 tko_up="X"
	    This is the number of cycles a node must be seen in	 order	to  be
	    declared online.  Default is floor(tko/3).

	 upgrade_wait="2"
	    This  is the number of cycles a node must wait before initiating a
	    bid for master status after heuristic scoring becomes  sufficient.
	    The default is 2.  This can not be set to 0, and should not exceed
	    tko.

	 master_wait="X"
	    This is the number of cycles a node must  wait  for	 votes	before
	    declaring	itself	 master	  after	 making	 a  bid.   Default  is
	    floor(tko/2).  This can not be less than 2, must be	 greater  than
	    tko_up, and should not exceed tko.

	 votes="3"
	    This  is  the number of votes the quorum daemon advertises to CMAN
	    when it has a high enough score.

	 log_level="4"
	    This controls the verbosity of the quorum  daemon  in  the	system
	    logs.  0 = emergencies; 7 = debug.

	 log_facility="daemon"
	    This  controls  the syslog facility used by the quorum daemon when
	    logging.  For a complete list of available	facilities,  see  sys‐
	    log.conf(5).  The default value for this is 'daemon'.

	 status_file="/foo"
	    Write  internal  states  out  to this file periodically ("-" = use
	    stdout).  This is primarily used for debugging.  The default value
	    for this attribute is undefined.

	 min_score="3"
	    Absolute minimum score for a node to consider itself "alive".  If
	    omitted, or set to 0, the default function "floor((n+1)/2)" is
	    used, where n is the sum of all defined heuristics' score
	    attributes (for example, three heuristics scored 1, 2, and 3 give
	    n=6 and a default min_score of 3).  This must never exceed the
	    sum of the heuristic scores, or else the quorum disk will never
	    be available.

	 reboot="1"
	    If set to 0 (off), qdiskd will *not* reboot after a negative
	    transition as a result of a change in score (see section 2.2).  The
	    default for this value is 1 (on).

	 allow_kill="1"
	    If set to 0 (off), qdiskd will *not* instruct CMAN to kill nodes
	    it thinks are dead (as a result of not writing to the quorum disk).
	    The default for this value is 1 (on).

	 paranoid="0"
	    If set to 1 (on), qdiskd will watch internal timers and reboot the
	    node if it takes more than (interval * tko) seconds to complete  a
	    quorum disk pass.  The default for this value is 0 (off).

	 scheduler="rr"
	    Valid  values are 'rr', 'fifo', and 'other'.  Selects the schedul‐
	    ing queue in the Linux kernel for operation of the	main  &	 score
	    threads  (does  not	 affect the heuristics; they are always run in
	    the 'other' queue).	 Default is 'rr'.   See	 sched_setscheduler(2)
	    for more details.

	 priority="1"
	    Valid values for 'rr' and 'fifo' are 1..100 inclusive.  Valid val‐
	    ues for 'other' are -20..20 inclusive.  Sets the priority  of  the
	    main  & score threads.  The default value is 1 (in the RR and FIFO
	    queues, higher numbers denote higher  priority;  in	 OTHER,	 lower
	    values denote higher priority).

	 stop_cman="0"
	    Ordinarily,	 cluster membership is left up to CMAN, not qdisk.  If
	    this parameter is set to 1 (on), qdiskd will tell  CMAN  to	 leave
	    the	 cluster  if it is unable to initialize the quorum disk during
	    startup.  This can be used to prevent cluster participation	 by  a
	    node  which	 has  been disconnected from the SAN.  The default for
	    this value is 0 (off).

	 use_uptime="1"
	    If this parameter is set to 1 (on), qdiskd will  use  values  from
	    /proc/uptime  for  internal	 timings.   This is a bit less precise
	    than gettimeofday(2), but the benefit is that changing the	system
	    clock  will	 not  affect  qdiskd's	behavior - even if paranoid is
	    enabled.  If set to 0, qdiskd will use gettimeofday(2),  which  is
	    more precise.  The default for this value is 1 (on / use uptime).

	 device="/dev/sda1"
	    This  is  the device the quorum daemon will use.  This device must
	    be the same on all nodes.

	 label="mylabel"
	    This overrides the device field if	present.   If  specified,  the
	    quorum  daemon will read /proc/partitions and check for qdisk sig‐
	    natures on every block device found, comparing the	label  against
	    the	 specified  label.  This is useful in configurations where the
	    block device name differs on a per-node basis.

	 cman_label="mylabel"
	    This overrides the label advertised to CMAN if present.  If speci‐
	    fied,  the	quorum	daemon will register with this name instead of
	    the actual device name.

	 max_error_cycles="0"
	    If we receive an I/O error during a cycle, we do not poll CMAN and
	    tell  it we are alive.  If specified, this value will cause qdiskd
	    to exit after the specified number of  consecutive	cycles	during
	    which I/O errors occur.  The default is 0 (no maximum).

	...>
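
       As a non-authoritative sketch of how several of the attributes above
       combine (the status file path is hypothetical, and at least one
       <heuristic> child, described in section 3.2, is still required):

	<quorumd interval="1" tko="10" votes="1" label="mylabel"
	    log_level="7" status_file="/tmp/qdisk_status" allow_kill="1"
	    paranoid="0" stop_cman="1" max_error_cycles="20">
	    <!-- Debug-level logging, periodic state dumps to a hypothetical
		 status file, leave the cluster if the quorum disk cannot be
		 initialized at startup, and exit after 20 consecutive cycles
		 with I/O errors. -->
	    ...
	</quorumd>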

3.2. The <heuristic> tag
       This tag is a child of the <quorumd> tag.

	<heuristic
	 program="/test.sh"
	    This  is the program used to determine if this heuristic is alive.
	    This can be anything which may  be	executed  by  /bin/sh  -c.   A
	    return  value  of  zero indicates success; anything else indicates
	    failure.  This is required.

	 score="1"
	    This is the weight of this heuristic.  Be careful when determining
	    scores for heuristics.  The default score for each heuristic is 1.

	 interval="2"
	    This is the frequency (in seconds) at which we poll the heuristic.
	    The default interval for every heuristic is 2 seconds.

	 tko="1"
	    After this many failed attempts to run the heuristic, it  is  con‐
	    sidered  DOWN, and its score is removed.  The default tko for each
	    heuristic is 1, which may be inadequate for things such as 'ping'.
	/>

3.3. Examples
3.3.1. 3 cluster nodes & 3 routers
	<cman expected_votes="6" .../>
	<clusternodes>
	    <clusternode name="node1" votes="1" ... />
	    <clusternode name="node2" votes="1" ... />
	    <clusternode name="node3" votes="1" ... />
	</clusternodes>
	<quorumd interval="1" tko="10" votes="3" label="testing">
	    <heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/>
	    <heuristic program="ping B -c1 -t1" score="1" interval="2" tko="3"/>
	    <heuristic program="ping C -c1 -t1" score="1" interval="2" tko="3"/>
	</quorumd>

3.3.2. 2 cluster nodes & 1 IP tiebreaker
	<cman two_node="0" expected_votes="3" .../>
	<clusternodes>
	    <clusternode name="node1" votes="1" ... />
	    <clusternode name="node2" votes="1" ... />
	</clusternodes>
	<quorumd interval="1" tko="10" votes="1" label="testing">
	    <heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/>
	</quorumd>

3.4. Heuristic score considerations
       * Heuristic timeouts should be set high enough to allow the previous
       run of a given heuristic to complete (see the sketch following this
       list).

       *  Heuristic  scripts  returning anything except 0 as their return code
       are considered failed.

       * The worst-case for improperly configured quorum heuristics is a  race
       to fence where two partitions simultaneously try to kill each other.
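
       For instance (a minimal sketch: the -w flag is assumed to be the
       iputils ping deadline option, and A is a placeholder address as in the
       examples above), a heuristic can be sized so that each run finishes
       well inside its polling interval:

	    <!-- The 1-second ping deadline completes inside the 2-second
		 polling interval, and tko="3" tolerates a transient miss
		 before the heuristic's score is removed. -->
	    <heuristic program="ping -c1 -w1 A" score="1" interval="2" tko="3"/>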

3.5. Creating a quorum disk partition
       The  mkqdisk  utility  can  create and list currently configured quorum
       disks visible to the local node; see mkqdisk(8) for more details.

SEE ALSO
       mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5), gettimeofday(2)

				  06 Sep 2007			      QDisk(5)