OLAR_intro(5)

NAME
OLAR_intro, olar_intro - Introduction to Online Addition and Removal
(OLAR) Management
DESCRIPTION
Introduction to Online Addition and Removal (OLAR) Management
Online addition and removal management is used to expand capacity,
upgrade components, and replace failed components while the operating
system, its services, and applications continue to run. This
functionality, sometimes referred to as "hot-swap", provides the
benefits of increased system uptime and availability during both
scheduled and unscheduled maintenance. Starting with Tru64 UNIX
Version 5.1B, CPU OLAR is supported. Additional OLAR capabilities are
planned for subsequent releases of the operating system.
OLAR management is integrated with the SysMan suite of system
management applications, which provides the ability to manage all
aspects of the system from a centralized location.
You must be a privileged user to perform OLAR management operations.
Alternatively, you can configure privileges for selective authorized
user or group access using Division of Privileges (DOP), as described
below. Note that only one administrator at a time can initiate OLAR
operations; other administrators are prevented from initiating OLAR
operations until the operation in progress completes.
CPU OLAR Overview
Tru64 UNIX supports the ability to add, replace, and/or remove
individual CPU modules on supported AlphaServer systems while the
operating system and applications continue to run. Newly inserted
CPUs are automatically recognized by the operating system, but will
not start scheduling and executing processes until the CPU module is
powered on and placed online through any of the supported management
applications described below. Conversely, before a CPU can be
physically removed from the system, it must be placed offline and then
powered off. Processes queued for execution on a CPU that is to be
placed offline are simply migrated to the run queues of other running
(online) processors.
By default, the offline state of a CPU will persist across reboot and
system initialization, until the CPU is explicitly placed online.
This behavior differs from the default behavior of previous versions
of Tru64 UNIX, where a CPU that was placed offline would return to
service automatically after reboot or system restart. Note that for
backward compatibility, the psradm(8) and offline(8) commands still
provide the non-persistent offline behavior. While the psradm(8) and
offline(8) commands are still provided, they are not recommended for
performing OLAR operations.
On platforms supporting this functionality, any CPU can participate
in an OLAR operation, including the primary CPU and/or I/O interrupt
handling CPUs. These roles will be delegated to other running CPUs in
the event that a currently running primary or I/O interrupt handler
needs to be placed offline or removed.
Currently, the platforms that support CPU OLAR are the AlphaServer
GS160 and GS320 series systems. The GS80 does not support the
physical removal of CPU modules, due to cabinet packaging design.
Why Perform OLAR on CPUs
OLAR of CPUs may be performed for the following reasons:

A system manager wants to provide additional computational capacity
to the system without having to bring the system down. As an example,
an AlphaServer GS320 with available CPU slots can have its CPU
capacity expanded by adding additional CPU modules to the system
while the operating system and applications continue to run.

A system manager wants to upgrade specific system components to the
latest model without having to bring the system down. As an example,
a GS160 with earlier model Alpha CPU modules can be upgraded to later
model CPUs with higher clock rates, while the operating system
continues to run.

A system component is indicating a high incidence of correctable
errors and the system manager wants to perform a proactive
replacement of the failing component before it results in a hard
failure. As an example, the Component Indictment facility (described
below) has indicated excessive correctable errors in a CPU module and
has therefore recommended its replacement. Once the CPU module has
been placed offline and powered off, either through the Automatic
Deallocation Facility (also described below) or through manual
intervention, the CPU module can be replaced while the operating
system continues to run.
Cautions Before Performing OLAR on CPUs
Before performing an OLAR operation, be aware of the following
cautions:

When offlining or removing one or more CPUs, processes scheduled to
run on the affected CPUs will be scheduled to execute on other
running CPUs, thus redistributing the processing capacity among the
remaining CPUs. In general, this will result in a system performance
degradation, proportional to the number of CPUs taken out of service
and the current system load, for the period of the OLAR operation.
Multi-threaded applications that are written to take advantage of
known CPU concurrencies can expect to encounter significant
performance degradation during the period of the OLAR operation.

The OLAR management utilities do not presently operate with processor
sets. Processor sets are groups of processors that are dedicated for
use by selected processes (see processor_sets(4)). If a process has
been specifically bound to run on a processor set (see runon(1),
assign_pid_to_pset(3)), and an OLAR operation is attempted on the
last running CPU in the processor set, you will not be notified by
the OLAR utilities that you are effectively shutting down the entire
processor set. Offlining the last CPU in a processor set will cause
all processes bound to that processor set to suspend until the
processor set has at least one running CPU. Therefore, use caution
when performing CPU OLAR operations on systems that have been
configured with processor sets.

If a process has been specifically bound to execute on a CPU (see
runon(1), bind_to_cpu(3), and bind_to_cpu_id(3) for more
information), and an OLAR operation is attempted on that CPU, you
will be notified by the OLAR utilities that processes have been bound
to the CPU before any operation is performed. You may choose to
continue or cancel the OLAR operation. If you choose to continue,
processes bound to the CPU will suspend their execution until the
process is unbound or the CPU is placed back online. Note that
choosing to offline a CPU that has processes bound may have
detrimental consequences for the application, depending upon the
characteristics of the application.

If a process has been specifically bound to execute on a Resource
Affinity Domain (RAD) (see runon(1) and rad_bind_pid(3) for more
information), and an OLAR operation is attempted on the last running
CPU in the RAD, you will be notified by the OLAR utilities that
processes have been bound to the RAD and that the last CPU in the RAD
has been requested to be placed offline. If you choose to continue,
processes bound to the RAD will suspend their execution until the
process is unbound or at least one CPU in the RAD is placed online.
Note that choosing to offline the last CPU in a RAD with processes
bound may have detrimental consequences for the application,
depending upon the characteristics of the application.

If you are using program profiling utilities such as dcpi, kprofile,
or uprofile, which are aware of the system's CPU configuration,
unpredictable results may occur when performing OLAR operations. It
is therefore recommended that these profiling utilities be disabled
before performing an OLAR operation. Ensure that all processes,
including any associated daemons, that are related to these utilities
have been stopped before performing OLAR operations on system CPUs.
The device drivers used by these profiling utilities are usually
configured into the kernel dynamically, so the tools can be disabled
before each OLAR operation with the following commands:

# sysconfig -u pfm
# sysconfig -u pcount

The appropriate driver can be re-enabled with one of the following:

# sysconfig -c pfm
# sysconfig -c pcount

The automatic deallocation of CPUs, enabled through the Automatic
Deallocation Facility, should be disabled whenever the pfm or pcount
device drivers are configured into the kernel, or vice versa. Refer
to the documentation and reference pages for these utilities for
additional information.
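As a sketch only, the unload/reload steps above can be wrapped in a
small script. This dry-run version merely prints the sysconfig
commands recommended above; remove the echo to run them for real on a
Tru64 UNIX system:

```shell
#!/bin/sh
# Dry-run sketch: print the sysconfig commands for disabling the
# profiling drivers before an OLAR operation and re-enabling them
# afterwards.  Remove "echo" to execute the commands for real.

for drv in pfm pcount; do
    echo /sbin/sysconfig -u $drv    # unload the profiling driver
done

# ... the OLAR operation itself (offline, power off, etc.) goes here ...

for drv in pfm pcount; do
    echo /sbin/sysconfig -c $drv    # reconfigure the driver afterwards
done
```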
General Procedures for Online Addition and Removal of CPUs
Caution
Pay attention to the system safety notes as outlined in the
GS80/160/320 Service Manual.
Removing a CPU Module
To perform an online removal of a CPU module, follow these steps
using your preferred management application, described in the section
"Tools for Managing OLAR":

1. Off-line the CPU. The operating system will stop scheduling and
   executing tasks on this CPU.

2. Using your preferred OLAR management application, make note of the
   quad building block (QBB) number where this CPU is inserted. This
   is the "hard" (or physical) QBB number, and does not change if the
   system is partitioned.

3. Power the CPU module off. The LED on the CPU module will
   illuminate yellow, indicating that the CPU module is un-powered
   and safe to remove.

4. Physically remove the CPU module.

Note that the operating system automatically recognizes that the CPU
module has been physically removed. There is no need to perform a
scan operation to update the hardware configuration.

Adding a CPU Module
To perform an online addition of a CPU module, follow these steps
using your preferred management application, described in the section
"Tools for Managing OLAR":

1. Select an available CPU slot in one of the configured quad
   building blocks (QBBs). If there are available slots in several
   QBBs, it is typically best to distribute the CPUs equally among
   the configured QBBs.

2. Insert the CPU module into the CPU slot. Ensure that you align the
   color-coded decal on the CPU module with the color-coded decal on
   the CPU slot. The LED on the CPU module will illuminate yellow,
   indicating that the CPU module is un-powered. Note that the CPU
   will be automatically recognized by the operating system, even
   though it is un-powered; there is no need to perform a scan
   operation for the operating system to identify the CPU module.

3. Power the CPU module on. The CPU module will undergo a short
   self-test (7-10 seconds), after which the LED will illuminate
   green, indicating that the module is powered on and has passed its
   self-test.

4. On-line the CPU. Once the CPU is on-line, the operating system
   will automatically begin to schedule and execute tasks on this
   CPU.
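Using the hwmgr interface described later in this page, the
offline/power portion of a removal, and the reverse for an addition,
can be sketched as follows. Hardware ID 58 is the example HWID used
in the hwmgr examples below; substitute the HWID of your CPU:

```shell
# Removal: take the CPU out of service, then cut power.
hwmgr -offline -id 58      # OS stops scheduling tasks on the CPU
hwmgr -power off -id 58    # LED turns yellow: module safe to remove

# Addition (after inserting and seating the new module):
hwmgr -power on -id 58     # module runs its short self-test
hwmgr -online -id 58       # OS begins scheduling tasks on the CPU
```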
Tools for Managing OLAR
When it is necessary to perform an OLAR operation, use the following
tools, which are provided as part of the SysMan suite of system
management utilities.
Manage CPUs
"Manage CPUs" is a task-oriented application that provides the
following functions:

   Change the state of a CPU to online or offline

   Power on or power off a CPU

   Determine the status of each inserted CPU
The "Manage CPUs" application can be run equivalently from an X
Windows display, a terminal with curses capability, or locally on a
PC (as described below), thus providing a great deal of flexibility
when performing OLAR operations.
Note
You must be a privileged user to run the "Manage CPUs" application.
Non-root users may also run the "Manage CPUs" application if they are
assigned the "HardwareManagement" privilege. To assign a user the
"HardwareManagement" privilege, issue the following command to launch
the "Configure DOP" application:

# sysman dopconfig [-display <hostname>]

Please refer to the dop(8) reference page and the online help in the
dopconfig application for further information. Additionally, the
Manage CPUs application provides online help that describes the
operation of the application.
The "Manage CPUs" application can be invoked using one of the following
methods:
SysMan Menu
At the command prompt in a terminal window, enter the following
command:

[Note that the "DISPLAY" shell environment variable must be set, or
the "-display" command line option must be used, in order to launch
the X Windows version of SysMan Menu. If there is no indication of
which graphics display to use, or if invoking from a character cell
terminal, the curses version of SysMan Menu will be launched.]

# sysman [-display <hostname>]

Highlight the "Hardware" entry and press "Select".

Highlight the "Manage CPUs" entry and press "Select".

SysMan command line accelerator
To launch the Manage CPUs application directly from the command
prompt in a terminal window, enter the following command:

# sysman hw_manage_cpus [-display hostname]

[Note that the "DISPLAY" shell environment variable must be set, or
the "-display" command line option must be used, in order to launch
the X Windows version of Manage CPUs. If there is no indication of
which graphics display to use, or if invoking from a character cell
terminal, the curses version of Manage CPUs will be launched.]

System Management Station
To launch the Manage CPUs application from the System Management
Station, do the following:

At the command prompt in a terminal window on a system that supports
graphical display, enter the following command:

# sysman -station [-display hostname]

When the System Management Station launches, two separate windows
will appear. One window is the Status Monitor view; the other is the
Hardware view, providing a graphical depiction of the hardware
connected to your system.

Select the Hardware view window.

Select the CPU for an OLAR operation by left-clicking once with the
mouse.

Select Tools from the menu bar, or right-click once with the mouse. A
list of menu options will appear.

Select Daily Administration from the list.

Select the Manage CPUs application.

Manage CPUs from a PC or Web Browser
You can also perform OLAR management from your PC desktop or from
within a web browser. Specifically, you can run Manage CPUs via the
System Management Station client installed on your desktop, or by
launching the System Management Station client from within a browser
pointed at the Tru64 UNIX System Management home page. For a detailed
description of options and requirements, visit the Tru64 UNIX System
Management home page, available from any Tru64 UNIX system running
V5.1A (or higher), at the following URL:

http://hostname:2301/SysMan_Home_Page

where "hostname" is the name of a Tru64 UNIX Version 5.1B (or higher)
system.
hwmgr Command Line Interface (CLI)
In addition to its set of generic hardware management capabilities, the
hwmgr(8) command line interface incorporates the same level of OLAR
management functionality as the Manage CPUs application. You must be
root to run the hwmgr command; this command does not currently operate
with DOP.
The following describes the OLAR-specific commands supported by hwmgr.
To obtain general help on the use of hwmgr, issue the command:
# hwmgr -help
To obtain help on a specific option, issue the command:
# hwmgr -help "option"
where option is the name of the option you want help on.

To obtain the status and state information of all hardware components
the operating system is aware of, issue the following command:

# hwmgr -status comp
                   STATUS    ACCESS    HEALTH     INDICT
 HWID:  HOSTNAME   SUMMARY   STATE     STATE      LEVEL   NAME
 -------------------------------------------------------------
     3: wild-one             online    available          dmapi
    49: wild-one             online    available          dsk2
    50: wild-one             online    available          dsk3
    51: wild-one             online    available          dsk4
    52: wild-one             online    available          dsk5
    56: wild-one             online    available          Compaq AlphaServer GS160 6/731
    57: wild-one             online    available          CPU0
    58: wild-one             online    available          CPU2
    59: wild-one             online    available          CPU4
    60: wild-one             online    available          CPU6
or, to obtain status on an individual component, use the hardware id
(HWID) of the component and issue the command:

# hwmgr -status comp -id 58

                   STATUS    ACCESS    HEALTH     INDICT
 HWID:  HOSTNAME   SUMMARY   STATE     STATE      LEVEL   NAME
 -------------------------------------------------------------
    58: wild-one             online    available          CPU2
To see the complete list of options for "-status", issue the command:

# hwmgr -help status

To view a hierarchical listing of all hardware components the
operating system is aware of, issue the command:

# hwmgr -view hier
HWID: hardware hierarchy (!)warning (X)critical (-)inactive
(see -status)
-------------------------------------------------------------------------
1: platform Compaq AlphaServer GS160 6/731
9: bus wfqbb0
10: connection wfqbb0slot0
11: bus wfiop0
12: connection wfiop0slot0
13: bus pci0
14: connection pci0slot1
o
o
o
57: cpu qbb-0 CPU0
58: cpu qbb-0 CPU2
This example shows that CPU0 and CPU2 are children of bus name
"wfqbb0", and that their physical location is (hard) qbb-0. Note that
hard QBB numbers do not change as the system partitioning changes.
To quickly identify which QBB a CPU is associated with, issue the
command:

# hwmgr -view hier -id 58

HWID: hardware hierarchy
-----------------------------------------------------
  58: cpu qbb-0 CPU2

To offline a CPU that is currently in the online state, issue the
command:

# hwmgr -offline -id 58

or

# hwmgr -offline -name CPU2

Note that device names are case sensitive. In this example, CPU2 must
be upper case. To verify the new status of CPU2, issue the command:
# hwmgr -status comp -id 58
                   STATUS    ACCESS    HEALTH     INDICT
 HWID:  HOSTNAME   SUMMARY   STATE     STATE      LEVEL   NAME
 --------------------------------------------------------------
    58: wild-one   critical  offline   available          CPU2
Note that the offline state will be saved across future reboots
of the operating system, including power cycling the system. If
you want the component to return to the online state the next
time the operating system is booted, use the "-nosave" switch.
# hwmgr -offline -nosave -id 58
or
# hwmgr -offline -nosave -name CPU2
Once again, to verify the status of CPU2, issue the command:
# hwmgr -status comp -id 58
                   STATUS    ACCESS            HEALTH     INDICT
 HWID:  HOSTNAME   SUMMARY   STATE             STATE      LEVEL   NAME
 ----------------------------------------------------------------------
    58: wild-one   critical  offline(nosave)   available          CPU2
To power off a CPU that is currently in the offline state, issue
the command:
# hwmgr -power off -id 58
or
# hwmgr -power off -name CPU2
Note that a component must be in the offline state before power can
be removed using hwmgr. Once power has been removed from a component,
it is safe to remove that component from the system.
To power on a CPU that is currently powered off, issue the com‐
mand:
# hwmgr -power on -id 58
or
# hwmgr -power on -name CPU2

To place a CPU online so that the operating system can start
scheduling processes to run on that CPU, issue the command:
# hwmgr -online -id 58
or
# hwmgr -online -name CPU2
Refer to the hwmgr(8) reference page for additional information on the
use of hwmgr.
Component Indictment Overview
Component indictment is a proactive notification from a fault
analysis utility, indicating that a component is experiencing a high
incidence of correctable errors and therefore should be serviced
and/or replaced. Component indictment involves analyzing specific
failure patterns from error log entries, either immediately or over a
given time interval, and recommending a component's removal. The
fault analysis utility signals the running operating system that a
given component is suspect. The operating system then distributes
this information via an EVM indictment event so that interested
applications, including the System Management Station, Insight
Manager, and the Automatic Deallocation Facility, can update their
state information and take appropriate action if so configured (see
the discussion of the Automatic Deallocation Facility below).
It is possible for more than one component to be indicted
simultaneously if the exact source of the error cannot be pinpointed.
In these cases, the most likely suspect will be indicted with a
`high` probability, the next likely suspect with a `medium`
probability, and the least likely suspect with a `low` probability.
When this situation arises, the indictment events can be tied
together by examining the "report_handle" variable within the
indictment events. Indictment events for the same error will contain
the same "report_handle" value.
The indicted state of a component will persist across reboot and
system initialization if no action is taken to remedy the suspect
component, such as an online repair operation. Once an indictment has
occurred for a given component, another indictment event will not be
generated for that component unless the utility has determined,
through additional analysis, that the original indictment probability
should be updated. In this case, the component will be re-indicted
with the new probability.

Once the indicted component has been serviced, it is necessary to
manually clear the indicted component state with the following hwmgr
command:

# hwmgr -unindict -id <hwid>

where <hwid> is the hardware id (HWID) of the component.

Allowing the operator to manually clear the indicted problem state
ensures positive identification of when a replaced component is
operating properly.
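Pulling the commands in this page together, a serviced CPU (HWID 58
in the examples above) can be returned to service and its indictment
cleared with a sequence like the following:

```shell
# Return a serviced CPU to use and clear its persistent indictment
# (HWID 58 is the example hardware ID used throughout this page).
hwmgr -power on -id 58     # power the replaced module back on
hwmgr -online -id 58       # resume scheduling on the CPU
hwmgr -unindict -id 58     # manually clear the indicted state
```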
All component indictment EVM events have an event prefix of
sys.unix.hw.state_change.indicted. You may view the complete list of
all possible component indictment events that may be posted,
including a description of each event, by issuing the command:

# evmwatch -i -f '[name sys.unix.hw.state_change.indicted]' | \
      evmshow -t "@name" -x | more

You may view the list of indictment events that have occurred by
issuing the command:

# evmget -f '[name sys.unix.hw.state_change.indicted]' | \
      evmshow -t "@name"
CPU modules and memory pages are currently supported for component
indictment.
Compaq Analyze, included as part of the Web-Based Enterprise Services
(WEBES) 4.0 product (or higher), is the fault analysis utility that
supports component indictment on a Tru64 UNIX (V5.1A or higher)
system. The WEBES product is included as part of the Tru64 UNIX
operating system distribution, and must be installed after
installation of the base operating system. Please refer to the Compaq
Analyze documentation, distributed with the WEBES product, for a list
of AlphaServer platforms that support the component indictment
feature.
Automatic Deallocation Facility Overview
The Automatic Deallocation Facility provides the ability to
automatically take an indicted component out of service, allowing the
system to heal itself and furthering the reliability and availability
of the system. The Automatic Deallocation Facility currently supports
the ability to stop using CPUs and memory pages that have been
indicted.
The behavior of the Automatic Deallocation Facility can be tailored
on both single and clustered systems through the use of text-based
OLAR Policy Configuration files. When operating in a clustered
environment, automatic deallocation policy applies to all members of
a cluster by default. This is specified through the cluster-wide file
/etc/olar.config.common. However, individual cluster-wide policy
variables can be overridden using the member-specific configuration
file /etc/olar.config.

The OLAR Policy Configuration files contain configuration variables
that control specific behaviors of the Automatic Deallocation
Facility, such as whether automatic deallocation is enabled and at
what times of day it may occur. Additionally, you can specify a
user-supplied script or executable that determines whether an
automatic deallocation operation should proceed.
Automatic deallocation is supported for those platforms that support
the component indictment feature, as described in the Component
Indictment Overview section above.
Refer to the olar.config(4) reference page for additional information
about the OLAR Policy Configuration files.
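As an illustration only, a policy file might contain entries of the
following form. The variable names below are hypothetical, invented
for this sketch; consult olar.config(4) for the actual variable names
and syntax supported by your release:

```shell
# Hypothetical /etc/olar.config fragment -- variable names are
# illustrative only; see olar.config(4) for the documented syntax.
AUTO_DEALLOC=ENABLED                # permit automatic deallocation
AUTO_DEALLOC_START_TIME=01:00       # start of the allowed window
AUTO_DEALLOC_END_TIME=05:00         # end of the allowed window
AUTO_DEALLOC_GATE=/usr/local/sbin/olar_gate.sh   # site gating script
```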
SEE ALSO
Commands: sysman(8), sysman_menu(8), sysman_station(8), hwmgr(8),
codconfig(8), dop(8)
Files: olar.config.common(4)
System Administration
Configuring and Managing Systems for Increased Availability Guide