OLAR_intro(5)

NAME
OLAR_intro, olar_intro - Introduction to Online Addition and Removal
(OLAR) Management
DESCRIPTION
Introduction to Online Addition and Removal (OLAR) Management
Online addition and removal management is used to expand capacity,
upgrade components, and replace failed components while the operating
system, its services, and applications continue to run. This
functionality, sometimes referred to as "hot-swap", provides the
benefits of increased system uptime and availability during both
scheduled and unscheduled maintenance. Starting with Tru64 UNIX
Version 5.1B, CPU OLAR is supported. Additional OLAR capabilities are
planned for subsequent releases of the operating system.
OLAR management is integrated with the SysMan suite of system
management applications, which provides the ability to manage all
aspects of the system from a centralized location.
You must be a privileged user to perform OLAR management operations.
Alternatively, you can configure privileges for selective authorized
user or group access using Division of Privileges (DOP), as described
below. Note that only one administrator at a time can initiate OLAR
operations; other administrators are prevented from initiating OLAR
operations until the operation in progress completes.
CPU OLAR Overview
Tru64 UNIX supports the ability to add, replace, and/or remove
individual CPU modules on supported AlphaServer systems while the
operating system and applications continue to run. Newly inserted
CPUs are automatically recognized by the operating system, but will
not start scheduling and executing processes until the CPU module is
powered on and placed online through any of the supported management
applications described below. Conversely, before a CPU can be
physically removed from the system, it must be placed offline and then
powered off. Processes queued for execution on a CPU that is to be
placed offline are simply migrated to the run queues of other running
(online) processors.
By default, the offline state of a CPU will persist across reboot and
system initialization, until the CPU is explicitly placed online.
This behavior differs from the default behavior of previous versions
of Tru64 UNIX, where a CPU that was placed offline would return to
service automatically after reboot or system restart. Note that for
backward compatibility, the psradm(8) and offline(8) commands still
provide the non-persistent offline behavior. While the psradm(8) and
offline(8) commands are still provided, they are not recommended for
performing OLAR operations.
On platforms supporting this functionality, any CPU can participate
in an OLAR operation, including the primary CPU and/or I/O interrupt
handling CPUs. These roles will be delegated to other running CPUs in
the event that a currently running primary or I/O interrupt handler
needs to be placed offline or removed.
Currently, the platforms that support CPU OLAR are the AlphaServer
GS160 and GS320 series systems. The GS80 does not support the
physical removal of CPU modules, due to cabinet packaging design.
Why Perform OLAR on CPUs
OLAR of CPUs may be performed for the following reasons:

A system manager wants to provide additional computational capacity
to the system without having to bring the system down. As an example,
an AlphaServer GS320 with available CPU slots can have its CPU
capacity expanded by adding additional CPU modules to the system
while the operating system and applications continue to run.

A system manager wants to upgrade specific system components to the
latest model without having to bring the system down. As an example,
a GS160 with earlier model Alpha CPU modules can be upgraded to later
model CPUs with higher clock rates, while the operating system
continues to run.

A system component is indicating a high incidence of correctable
errors and the system manager wants to perform a proactive
replacement of the failing component before it results in a hard
failure. As an example, the Component Indictment facility (described
below) has indicated excessive correctable errors in a CPU module and
has therefore recommended its replacement. Once the CPU module has
been placed offline and powered off, either through the Automatic
Deallocation Facility (also described below) or through manual
intervention, the CPU module can be replaced while the operating
system continues to run.
Cautions Before Performing OLAR on CPUs
Before performing an OLAR operation, be aware of the following
cautions:

When offlining or removing one or more CPUs, processes scheduled to
run on the affected CPUs will be scheduled to execute on other
running CPUs, thus redistributing the processing capacity among the
remaining CPUs. In general, this will result in a system performance
degradation, proportional to the number of CPUs taken out of service
and the current system load, for the period of the OLAR operation.
Multi-threaded applications that are written to take advantage of
known CPU concurrencies can expect to encounter significant
performance degradation during the period of the OLAR operation.

The OLAR management utilities do not presently operate with processor
sets. Processor sets are groups of processors that are dedicated for
use by selected processes (see processor_sets(4)). If a process has
been specifically bound to run on a processor set (see runon(1),
assign_pid_to_pset(3)), and an OLAR operation is attempted on the
last running CPU in the processor set, you will not be notified by
the OLAR utilities that you are effectively shutting down the entire
processor set. Offlining the last CPU in a processor set will cause
all processes bound to that processor set to suspend until the
processor set has at least one running CPU. Therefore, use caution
when performing CPU OLAR operations on systems that have been
configured with processor sets.

If a process has been specifically bound to execute on a CPU (see
runon(1), bind_to_cpu(3), and bind_to_cpu_id(3) for more
information), and an OLAR operation is attempted on that CPU, you
will be notified by the OLAR utilities that processes have been bound
to the CPU before any operation is performed. You may choose to
continue or cancel the OLAR operation. If you choose to continue,
processes bound to the CPU will suspend their execution until the
process is unbound or the CPU is placed back online. Note that
choosing to offline a CPU that has processes bound may have
detrimental consequences for the application, depending upon the
characteristics of the application.

If a process has been specifically bound to execute on a Resource
Affinity Domain (RAD) (see runon(1) and rad_bind_pid(3) for more
information), and an OLAR operation is attempted on the last running
CPU in the RAD, you will be notified by the OLAR utilities that
processes have been bound to the RAD and that the last CPU in the RAD
has been requested to be placed offline. If you choose to continue,
processes bound to the RAD will suspend their execution until the
process is unbound or at least one CPU in the RAD is placed online.
Note that choosing to offline the last CPU in a RAD with processes
bound may have detrimental consequences for the application,
depending upon the characteristics of the application.

If you are using program profiling utilities such as dcpi, kprofile,
or uprofile, which are aware of the system's CPU configuration,
unpredictable results may occur when performing OLAR operations. It
is therefore recommended that these profiling utilities be disabled
before performing an OLAR operation. Ensure that all processes,
including any associated daemons, that are related to these utilities
have been stopped before performing OLAR operations on system CPUs.
The device drivers used by these profiling utilities are usually
configured into the kernel dynamically, so the tools can be disabled
before each OLAR operation with the following commands:

# sysconfig -u pfm
# sysconfig -u pcount

The appropriate driver can be re-enabled with one of the following:

# sysconfig -c pfm
# sysconfig -c pcount

The automatic deallocation of CPUs, enabled through the Automatic
Deallocation Facility, should be disabled whenever the pfm or pcount
device drivers are configured into the kernel, or vice versa. Refer
to the documentation and reference pages for these utilities for
additional information.
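As a sketch only, the unload/reload steps above can be wrapped in a
small script. This dry-run version merely prints the sysconfig
commands recommended above; remove the echo to run them for real on a
Tru64 UNIX system:

```shell
#!/bin/sh
# Dry-run sketch: print the sysconfig commands for disabling the
# profiling drivers before an OLAR operation and re-enabling them
# afterwards.  Remove "echo" to execute the commands for real.

for drv in pfm pcount; do
    echo /sbin/sysconfig -u $drv    # unload the profiling driver
done

# ... the OLAR operation itself (offline, power off, etc.) goes here ...

for drv in pfm pcount; do
    echo /sbin/sysconfig -c $drv    # reconfigure the driver afterwards
done
```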
General Procedures for Online Addition and Removal of CPUs
Caution
Pay attention to the system safety notes as outlined in the
GS80/160/320 Service Manual.
Removing a CPU Module
To perform an online removal of a CPU module, follow these steps
using your preferred management application, described in the section
"Tools for Managing OLAR":

1. Off-line the CPU. The operating system will stop scheduling and
   executing tasks on this CPU.

2. Using your preferred OLAR management application, make note of the
   quad building block (QBB) number where this CPU is inserted. This
   is the "hard" (or physical) QBB number, and does not change if the
   system is partitioned.

3. Power the CPU module off. The LED on the CPU module will
   illuminate yellow, indicating that the CPU module is un-powered
   and safe to remove.

4. Physically remove the CPU module.

Note that the operating system automatically recognizes that the CPU
module has been physically removed. There is no need to perform a
scan operation to update the hardware configuration.

Adding a CPU Module
To perform an online addition of a CPU module, follow these steps
using your preferred management application, described in the section
"Tools for Managing OLAR":

1. Select an available CPU slot in one of the configured quad
   building blocks (QBBs). If there are available slots in several
   QBBs, it is typically best to distribute the CPUs equally among
   the configured QBBs.

2. Insert the CPU module into the CPU slot. Ensure that you align the
   color-coded decal on the CPU module with the color-coded decal on
   the CPU slot. The LED on the CPU module will illuminate yellow,
   indicating that the CPU module is un-powered. Note that the CPU
   will be automatically recognized by the operating system, even
   though it is un-powered; there is no need to perform a scan
   operation for the operating system to identify the CPU module.

3. Power the CPU module on. The CPU module will undergo a short
   self-test (7-10 seconds), after which the LED will illuminate
   green, indicating that the module is powered on and has passed its
   self-test.

4. On-line the CPU. Once the CPU is on-line, the operating system
   will automatically begin to schedule and execute tasks on this
   CPU.
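Using the hwmgr interface described later in this page, the
offline/power portion of a removal, and the reverse for an addition,
can be sketched as follows. Hardware ID 58 is the example HWID used
in the hwmgr examples below; substitute the HWID of your CPU:

```shell
# Removal: take the CPU out of service, then cut power.
hwmgr -offline -id 58      # OS stops scheduling tasks on the CPU
hwmgr -power off -id 58    # LED turns yellow: module safe to remove

# Addition (after inserting and seating the new module):
hwmgr -power on -id 58     # module runs its short self-test
hwmgr -online -id 58       # OS begins scheduling tasks on the CPU
```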
Tools for Managing OLAR
When it is necessary to perform an OLAR operation, use the following
tools, which are provided as part of the SysMan suite of system
management utilities.
Manage CPUs
"Manage CPUs" is a task-oriented application that provides the
following functions:

   Change the state of a CPU to online or offline

   Power on or power off a CPU

   Determine the status of each inserted CPU
The "Manage CPUs" application can be run equivalently from an X
Windows display, a terminal with curses capability, or locally on a
PC (as described below), thus providing a great deal of flexibility
when performing OLAR operations.
Note
You must be a privileged user to run the "Manage CPUs" application.
Non-root users may also run the "Manage CPUs" application if they are
assigned the "HardwareManagement" privilege. To assign a user the
"HardwareManagement" privilege, issue the following command to launch
the "Configure DOP" application:

# sysman dopconfig [-display <hostname>]

Please refer to the dop(8) reference page and the online help in the
dopconfig application for further information. Additionally, the
Manage CPUs application provides online help that describes the
operation of the application.
The "Manage CPUs" application can be invoked using one of the following
methods:
SysMan Menu
At the command prompt in a terminal window, enter the following
command:

[Note that the "DISPLAY" shell environment variable must be set, or
the "-display" command line option must be used, in order to launch
the X Windows version of SysMan Menu. If there is no indication of
which graphics display to use, or if invoking from a character cell
terminal, the curses version of SysMan Menu will be launched.]

# sysman [-display <hostname>]

Highlight the "Hardware" entry and press "Select".

Highlight the "Manage CPUs" entry and press "Select".

SysMan command line accelerator
To launch the Manage CPUs application directly from the command
prompt in a terminal window, enter the following command:

# sysman hw_manage_cpus [-display hostname]

[Note that the "DISPLAY" shell environment variable must be set, or
the "-display" command line option must be used, in order to launch
the X Windows version of Manage CPUs. If there is no indication of
which graphics display to use, or if invoking from a character cell
terminal, the curses version of Manage CPUs will be launched.]

System Management Station
To launch the Manage CPUs application from the System Management
Station, do the following:

At the command prompt in a terminal window on a system that supports
graphical display, enter the following command:

# sysman -station [-display hostname]

When the System Management Station launches, two separate windows
will appear. One window is the Status Monitor view; the other is the
Hardware view, providing a graphical depiction of the hardware
connected to your system.

Select the Hardware view window.

Select the CPU for an OLAR operation by left-clicking once with the
mouse.

Select Tools from the menu bar, or right-click once with the mouse. A
list of menu options will appear.

Select Daily Administration from the list.

Select the Manage CPUs application.

Manage CPUs from a PC or Web Browser
You can also perform OLAR management from your PC desktop or from
within a web browser. Specifically, you can run Manage CPUs via the
System Management Station client installed on your desktop, or by
launching the System Management Station client from within a browser
pointed at the Tru64 UNIX System Management home page. For a detailed
description of options and requirements, visit the Tru64 UNIX System
Management home page, available from any Tru64 UNIX system running
V5.1A (or higher), at the following URL:

http://hostname:2301/SysMan_Home_Page

where "hostname" is the name of a Tru64 UNIX Version 5.1B (or higher)
system.
hwmgr Command Line Interface (CLI)
In addition to its set of generic hardware management capabilities, the
hwmgr(8) command line interface incorporates the same level of OLAR
management functionality as the Manage CPUs application. You must be
root to run the hwmgr command; this command does not currently operate
with DOP.
The following describes the OLAR-specific commands supported by hwmgr.
To obtain general help on the use of hwmgr, issue the command:
# hwmgr -help
To obtain help on a specific option, issue the command:
# hwmgr -help "option"
where option is the name of the option you want help on.

To obtain the status and state information of all hardware components
the operating system is aware of, issue the following command:

# hwmgr -status comp
                   STATUS    ACCESS    HEALTH     INDICT
 HWID:  HOSTNAME   SUMMARY   STATE     STATE      LEVEL   NAME
 -------------------------------------------------------------
     3: wild-one             online    available          dmapi
    49: wild-one             online    available          dsk2
    50: wild-one             online    available          dsk3
    51: wild-one             online    available          dsk4
    52: wild-one             online    available          dsk5
    56: wild-one             online    available          Compaq AlphaServer GS160 6/731
    57: wild-one             online    available          CPU0
    58: wild-one             online    available          CPU2
    59: wild-one             online    available          CPU4
    60: wild-one             online    available          CPU6
or, to obtain status on an individual component, use the hardware id
(HWID) of the component and issue the command:

# hwmgr -status comp -id 58

                   STATUS    ACCESS    HEALTH     INDICT
 HWID:  HOSTNAME   SUMMARY   STATE     STATE      LEVEL   NAME
 -------------------------------------------------------------
    58: wild-one             online    available          CPU2
To see the complete list of options for "-status", issue the command:

# hwmgr -help status

To view a hierarchical listing of all hardware components the
operating system is aware of, issue the command:

# hwmgr -view hier
HWID: hardware hierarchy (!)warning (X)critical (-)inactive
(see -status)
-------------------------------------------------------------------------
1: platform Compaq AlphaServer GS160 6/731
9: bus wfqbb0
10: connection wfqbb0slot0
11: bus wfiop0
12: connection wfiop0slot0
13: bus pci0
14: connection pci0slot1
o
o
o
57: cpu qbb-0 CPU0
58: cpu qbb-0 CPU2
This example shows that CPU0 and CPU2 are children of bus name
"wfqbb0", and that their physical location is (hard) qbb-0. Note that
hard QBB numbers do not change as the system partitioning changes.
To quickly identify which QBB a CPU is associated with, issue the
command:

# hwmgr -view hier -id 58

HWID: hardware hierarchy
-----------------------------------------------------
  58: cpu qbb-0 CPU2

To offline a CPU that is currently in the online state, issue the
command:

# hwmgr -offline -id 58

or

# hwmgr -offline -name CPU2

Note that device names are case sensitive. In this example, CPU2 must
be upper case. To verify the new status of CPU2, issue the command:
# hwmgr -status comp -id 58
                   STATUS    ACCESS    HEALTH     INDICT
 HWID:  HOSTNAME   SUMMARY   STATE     STATE      LEVEL   NAME
 --------------------------------------------------------------
    58: wild-one   critical  offline   available          CPU2
Note that the offline state will be saved across future reboots
of the operating system, including power cycling the system. If
you want the component to return to the online state the next
time the operating system is booted, use the "-nosave" switch.
# hwmgr -offline -nosave -id 58
or
# hwmgr -offline -nosave -name CPU2
Once again, to verify the status of CPU2, issue the command:
# hwmgr -status comp -id 58
                   STATUS    ACCESS            HEALTH     INDICT
 HWID:  HOSTNAME   SUMMARY   STATE             STATE      LEVEL   NAME
 ----------------------------------------------------------------------
    58: wild-one   critical  offline(nosave)   available          CPU2
To power off a CPU that is currently in the offline state, issue
the command:
# hwmgr -power off -id 58
or
# hwmgr -power off -name CPU2
Note that a component must be in the offline state before power can
be removed using hwmgr. Once power has been removed from a component,
it is safe to remove that component from the system.
To power on a CPU that is currently powered off, issue the com‐
mand:
# hwmgr -power on -id 58
or
# hwmgr -power on -name CPU2

To place a CPU online so that the operating system can start
scheduling processes to run on that CPU, issue the command:
# hwmgr -online -id 58
or
# hwmgr -online -name CPU2
Refer to the hwmgr(8) reference page for additional information on the
use of hwmgr.
Component Indictment Overview
Component indictment is a proactive notification from a fault
analysis utility, indicating that a component is experiencing a high
incidence of correctable errors and therefore should be serviced
and/or replaced. Component indictment involves analyzing specific
failure patterns from error log entries, either immediately or over a
given time interval, and recommending a component's removal. The
fault analysis utility signals the running operating system that a
given component is suspect. The operating system then distributes
this information via an EVM indictment event so that interested
applications, including the System Management Station, Insight
Manager, and the Automatic Deallocation Facility, can update their
state information and take appropriate action if so configured (see
the discussion of the Automatic Deallocation Facility below).
It is possible for more than one component to be indicted
simultaneously if the exact source of the error cannot be pinpointed.
In these cases, the most likely suspect will be indicted with a
`high` probability, the next likely suspect with a `medium`
probability, and the least likely suspect with a `low` probability.
When this situation arises, the indictment events can be tied
together by examining the "report_handle" variable within the
indictment events. Indictment events for the same error will contain
the same "report_handle" value.
The indicted state of a component will persist across reboot and
system initialization if no action is taken to remedy the suspect
component, such as an online repair operation. Once an indictment has
occurred for a given component, another indictment event will not be
generated for that component unless the utility has determined,
through additional analysis, that the original indictment probability
should be updated. In this case, the component will be re-indicted
with the new probability.

Once the indicted component has been serviced, it is necessary to
manually clear the indicted component state with the following hwmgr
command:

# hwmgr -unindict -id <hwid>

where <hwid> is the hardware id (HWID) of the component.

Allowing the operator to manually clear the indicted problem state
ensures positive identification of when a replaced component is
operating properly.
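Pulling the commands in this page together, a serviced CPU (HWID 58
in the examples above) can be returned to service and its indictment
cleared with a sequence like the following:

```shell
# Return a serviced CPU to use and clear its persistent indictment
# (HWID 58 is the example hardware ID used throughout this page).
hwmgr -power on -id 58     # power the replaced module back on
hwmgr -online -id 58       # resume scheduling on the CPU
hwmgr -unindict -id 58     # manually clear the indicted state
```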
All component indictment EVM events have an event prefix of
sys.unix.hw.state_change.indicted. You may view the complete list of
all possible component indictment events that may be posted,
including a description of each event, by issuing the command:

# evmwatch -i -f '[name sys.unix.hw.state_change.indicted]' | \
      evmshow -t "@name" -x | more

You may view the list of indictment events that have occurred by
issuing the command:

# evmget -f '[name sys.unix.hw.state_change.indicted]' | \
      evmshow -t "@name"
CPU modules and memory pages are currently supported for component
indictment.
Compaq Analyze, included as part of the Web-Based Enterprise Services
(WEBES) 4.0 product (or higher), is the fault analysis utility that
supports component indictment on a Tru64 UNIX (V5.1A or higher)
system. The WEBES product is included as part of the Tru64 UNIX
operating system distribution, and must be installed after
installation of the base operating system. Please refer to the Compaq
Analyze documentation, distributed with the WEBES product, for a list
of AlphaServer platforms that support the component indictment
feature.
Automatic Deallocation Facility Overview
The Automatic Deallocation Facility provides the ability to
automatically take an indicted component out of service, allowing the
system to heal itself and furthering the reliability and availability
of the system. The Automatic Deallocation Facility currently supports
the ability to stop using CPUs and memory pages that have been
indicted.
The behavior of the Automatic Deallocation Facility can be tailored
on both single and clustered systems through the use of text-based
OLAR Policy Configuration files. When operating in a clustered
environment, automatic deallocation policy applies to all members of
a cluster by default. This is specified through the cluster-wide file
/etc/olar.config.common. However, individual cluster-wide policy
variables can be overridden using the member-specific configuration
file /etc/olar.config.

The OLAR Policy Configuration files contain configuration variables
that control specific behaviors of the Automatic Deallocation
Facility, such as whether automatic deallocation is enabled and at
what times of day it may occur. Additionally, you can specify a
user-supplied script or executable that determines whether an
automatic deallocation operation should proceed.
Automatic deallocation is supported for those platforms that support
the component indictment feature, as described in the Component
Indictment Overview section above.
Refer to the olar.config(4) reference page for additional information
about the OLAR Policy Configuration files.
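As an illustration only, a policy file might contain entries of the
following form. The variable names below are hypothetical, invented
for this sketch; consult olar.config(4) for the actual variable names
and syntax supported by your release:

```shell
# Hypothetical /etc/olar.config fragment -- variable names are
# illustrative only; see olar.config(4) for the documented syntax.
AUTO_DEALLOC=ENABLED                # permit automatic deallocation
AUTO_DEALLOC_START_TIME=01:00       # start of the allowed window
AUTO_DEALLOC_END_TIME=05:00         # end of the allowed window
AUTO_DEALLOC_GATE=/usr/local/sbin/olar_gate.sh   # site gating script
```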
SEE ALSO
Commands: sysman(8), sysman_menu(8), sysman_station(8), hwmgr(8),
codconfig(8), dop(8)
Files: olar.config.common(4)
System Administration
Configuring and Managing Systems for Increased Availability Guide