NAME
MPI - Introduction to the Message Passing Interface (MPI)
DESCRIPTION
The Message Passing Interface (MPI) is a component of the Message Passing
Toolkit (MPT), which is a software package that supports parallel
programming across a network of computer systems through a technique
known as message passing. The goal of MPI, simply stated, is to develop
a widely used standard for writing message-passing programs. As such, the
interface establishes a practical, portable, efficient, and flexible
standard for message passing.
This MPI implementation supports the MPI 1.2 standard, as documented by
the MPI Forum in the spring 1997 release of MPI: A Message Passing
Interface Standard. In addition, certain MPI-2 features are also
supported. In designing MPI, the MPI Forum sought to make use of the
most attractive features of a number of existing message passing systems,
rather than selecting one of them and adopting it as the standard. Thus,
MPI has been strongly influenced by work at the IBM T. J. Watson Research
Center, Intel's NX/2, Express, nCUBE's Vertex, p4, and PARMACS. Other
important contributions have come from Zipcode, Chimp, PVM, Chameleon,
and PICL.
MPI requires the presence of an Array Services daemon (arrayd) on each
host that is to run MPI processes. In a single-host environment, no
system administration effort should be required beyond installing and
activating arrayd. However, users wishing to run MPI applications across
multiple hosts will need to ensure that those hosts are properly
configured into an array. For more information about Array Services, see
the arrayd(1M), arrayd.conf(4), and array_services(5) man pages.
When running across multiple hosts, users must set up their .rhosts files
to enable remote logins. Note that MPI does not use rsh, so it is not
necessary that rshd be running on security-sensitive systems; the .rhosts
file was simply chosen to eliminate the need to learn yet another
mechanism for enabling remote logins.
Other sources of MPI information are as follows:
* Man pages for MPI library functions
* A copy of the MPI standard as PostScript or hypertext on the World
Wide Web at the following URL:
http://www.mpi-forum.org/
* Other MPI resources on the World Wide Web, such as the following:
http://www.mcs.anl.gov/mpi/index.html
http://www.erc.msstate.edu/mpi/index.html
http://www.mpi.nd.edu/lam/
Getting Started
For IRIX systems, the Modules software package is available to support
one or more installations of MPT. To use the MPT software, load the
desired mpt module.
After you have initialized modules, enter the following command:
module load mpt
To unload the mpt module, enter the following command:
module unload mpt
MPT software can be installed in an alternate location for use with the
modules software package. If MPT software has been installed on your
system for use with modules, you can access the software with the module
command shown in the previous example. If MPT has not been installed for
use with modules, the software resides in default locations on your
system (/usr/include, /usr/lib, /usr/array/PVM, and so on), as in
previous releases. For further information, see Installing MPT for Use
with Modules, in the Modules relnotes.
Using MPI
Compile and link your MPI program as shown in the following examples.
IRIX systems:
To use the 64-bit MPI library, choose one of the following commands:
cc -64 compute.c -lmpi
f77 -64 -LANG:recursive=on compute.f -lmpi
f90 -64 -LANG:recursive=on compute.f -lmpi
CC -64 compute.C -lmpi++ -lmpi
To use the 32-bit MPI library, choose one of the following commands:
cc -n32 compute.c -lmpi
f77 -n32 -LANG:recursive=on compute.f -lmpi
f90 -n32 -LANG:recursive=on compute.f -lmpi
CC -n32 compute.C -lmpi++ -lmpi
Linux systems:
To use the 64-bit MPI library on Linux IA64 systems, choose one of the
following commands:
g++ -o myprog myproc.C -lmpi++ -lmpi
gcc -o myprog myprog.c -lmpi
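For reference, a minimal MPI program that any of the above commands
could compile might look like the following (a sketch in C; the file
and program names in the examples above are placeholders):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);   /* this implementation ignores the arguments */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();           /* all MPI processes must call this */
        return 0;
    }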
For Altix the libmpi++.so library is not binary compatible with code
generated by g++ 3.0 compilers. For this reason an additional library is
supported for g++ 3.0 users as well as Intel C++ 8.0 users. The library
is libg++3mpi++.so and can be linked in by using -lg++3mpi++ instead of
-lmpi++.
For IRIX systems, if Fortran 90 compiler 7.2.1 or higher is installed,
you can add the -auto_use option as follows to get compile-time checking
of MPI subroutine calls:
f90 -auto_use mpi_interface -64 compute.f -lmpi
f90 -auto_use mpi_interface -n32 compute.f -lmpi
For IRIX with MPT version 1.4 or higher, and Altix with MPT 1.9 or
higher, the Fortran 90 USE MPI feature is supported. You can replace the
include 'mpif.h' statement in your Fortran 90 source code with USE MPI.
This facility includes MPI type and parameter definitions, and performs
compile-time checking of MPI function and subroutine calls.
For Altix users, if you USE MPI, you must supply a -I option on the
efc command line to specify the directory in which the MPI.mod file
resides.
efc will fail to find MPI.mod unless you supply a -I option; there is no
default search path for Fortran module files. For default-location
installations, -I/usr/include is correct; replace /usr/include with the
corresponding directory in your non-default-location installation if
necessary.
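For example, for a default-location installation (compute.f is a
placeholder name for a source file containing USE MPI):

    efc -I/usr/include compute.f -lmpi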
The Intel efc compiler does not support "allow any type" formal
arguments, so definitions for routines such as MPI_Send and MPI_Recv,
whose buffer or other arguments may be of any type, are omitted from
USE MPI on Altix. Compile-time checking of these functions is
therefore not available on Altix.
NOTE: Do not use the IRIX Fortran 90 -auto_use mpi_interface option to
compile IRIX Fortran 90 source code that contains the USE MPI statement.
They are incompatible with each other.
For IRIX systems, applications compiled under a previous release of MPI
should not require recompilation to run under this new (3.3) release.
However, it is not possible for executable files running under the 3.2
release to interoperate with others running under the 3.3 release.
The C version of the MPI_Init(3) routine ignores the arguments that are
passed to it and does not modify them.
Stdin is enabled only for those MPI processes with rank 0 in the first
MPI_COMM_WORLD (which does not need to be located on the same host as
mpirun). Stdout and stderr are enabled for all MPI processes in
the job, whether launched via mpirun or via one of the MPI-2 spawn
functions.
This version of the IRIX MPI implementation is compatible with the sproc
system call and can therefore coexist with doacross loops. SGI MPI can
likewise coexist with OpenMP on Linux systems. By default, MPI is not
thread-safe; calls to MPI routines in a multithreaded application
therefore require some form of mutual exclusion. The
MPI_Init_thread call can be used to request thread safety. In this case,
MPI calls can be made within parallel regions. MPI_Init_thread is
available on IRIX only.
For IRIX and Linux systems, this implementation of MPI requires that all
MPI processes call MPI_Finalize eventually.
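The following sketch shows one way to request thread safety with
MPI_Init_thread (the MPI-2 interface; MPI_THREAD_MULTIPLE is the
highest of the standard thread-support levels, and the library reports
the level it actually grants):

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int provided;

        /* Ask for full thread safety; the library returns what it grants. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (provided < MPI_THREAD_MULTIPLE) {
            /* Not fully thread-safe: guard MPI calls with your own
               mutual exclusion, as described above. */
        }

        MPI_Finalize();
        return 0;
    }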
Buffering
The current implementation buffers messages unless the MPI_BUFFER_MAX
environment variable is set, or unless the message is large enough and
certain safe MPI functions are used.
Buffered messages are grouped into two classes based on length: short
(messages with lengths of 64 bytes or less) and long (messages with
lengths greater than 64 bytes).
When MPI_BUFFER_MAX is set, messages greater than this value are
candidates for single-copy transfers. For IRIX systems, the data from
the sending process must reside in the symmetric data, symmetric heap, or
global heap segment and be a contiguous type. For Linux systems, the
data from the sending process can reside in the static region, stack, or
private heap and must be a contiguous type.
For more information on single-copy transfers, see the MPI_BUFFER_MAX and
MPI_DEFAULT_SINGLE_COPY_OFF environment variables.
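For example, to make messages larger than 16 KB candidates for
single-copy transfer (the threshold shown is illustrative; tune it for
your application):

    setenv MPI_BUFFER_MAX 16384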
Myrinet (GM) Support
This release provides support for use of the GM protocol over Myrinet
interconnects on IRIX systems. Support is currently limited to 64-bit
applications.
Using MPI with cpusets
You can use cpusets to run MPI applications (see cpuset(4)). However, it
is highly recommended that the cpuset have the MEMORY_LOCAL attribute.
On Origin systems, if this attribute is not used, you should disable NUMA
optimizations (see the MPI_DSM_OFF environment variable description in
the following section).
Default Interconnect Selection
Beginning with the MPT 1.6 release, the search algorithm for selecting a
multi-host interconnect has been significantly modified. By default, if
MPI is being run across multiple hosts, or if multiple binaries are
specified on the mpirun command line, the software now searches for
interconnects in the following order (for IRIX systems):
1) XPMEM (NUMAlink - only available on partitioned systems)
2) GSN
3) MYRINET
4) TCP/IP
The only supported interconnects on Linux systems are XPMEM and TCP/IP.
MPI uses the first interconnect it can detect and configure correctly.
There will only be one interconnect configured for the entire MPI job,
with the exception of XPMEM. If XPMEM is found on some hosts, but not on
others, one additional interconnect is selected.
The user can specify a mandatory interconnect to use by setting one of
the following new environment variables. These variables will be
assessed in the following order:
1) MPI_USE_XPMEM
2) MPI_USE_GSN
3) MPI_USE_GM
4) MPI_USE_TCP
For a mandatory interconnect to be used, all of the hosts on the mpirun
command line must be connected via the device, and the interconnect must
be configured properly. If this is not the case, an error message is
printed to stdout and the job is terminated. XPMEM is an exception to
this rule, however.
If MPI_USE_XPMEM is set, one additional interconnect can be selected via
the MPI_USE variables. Messaging between the partitioned hosts will use
the XPMEM driver while messaging between non-partitioned hosts will use
the second interconnect. If a second interconnect is required but not
selected by the user, MPI will choose the interconnect to use, based on
the default hierarchy.
If the global -v verbose option is used on the mpirun command line, a
message is printed to stdout, indicating which multi-host interconnect is
being used for the job.
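For example, to require the GSN interconnect and report the selection
at launch (host names are placeholders):

    setenv MPI_USE_GSN
    mpirun -v host1,host2 8 a.out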
The following interconnect selection environment variables have been
deprecated in the MPT 1.6 release: MPI_GSN_ON, MPI_GM_ON, and
MPI_BYPASS_OFF. If any of these variables are set, MPI prints a warning
message to stdout and ignores them.
Using MPI-2 Process Creation and Management Routines
This release provides support for MPI_Comm_spawn and
MPI_Comm_spawn_multiple. However, options must be specified as an
argument on the mpirun command line or as an environment variable to
enable this feature. On IRIX, this feature is only supported for MPI
jobs running within a single host running IRIX 6.5.2 or later. Support
on Linux is restricted to Altix numalinked systems. Consult the mpirun
man page for details on how to enable spawn support.
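A minimal sketch of spawning additional processes follows (worker is a
placeholder for a separately built MPI executable; spawn support must
be enabled as described in the mpirun man page):

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm children;
        int errcodes[4];

        MPI_Init(&argc, &argv);

        /* Launch 4 copies of "worker"; rank 0 of MPI_COMM_WORLD acts
           as the root of the spawn operation. */
        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &children, errcodes);

        /* Communication with the children uses the intercommunicator
           returned in "children". */

        MPI_Finalize();
        return 0;
    }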
ENVIRONMENT VARIABLES
This section describes the variables that specify the environment under
which your MPI programs will run. Unless otherwise specified, these
variables are available for both Linux and IRIX systems. Environment
variables have predefined values. You can change some variables to
achieve particular performance objectives; others are required values for
standard-compliant programs.
MPI_ARRAY
Sets an alternative array name to be used for communicating with
Array Services when a job is being launched.
Default: The default name set in the arrayd.conf file
MPI_BAR_COUNTER (IRIX systems only)
Specifies the use of a simple counter barrier algorithm within the
MPI_Barrier(3) and MPI_Win_fence(3) functions.
Default: Enabled for jobs using fewer than 64 MPI processes.
MPI_BAR_DISSEM
Specifies the use of a dissemination/butterfly algorithm within
the MPI_Barrier(3) and MPI_Win_fence(3) functions. This algorithm
has generally been found to provide the best performance. By
default on IRIX systems this algorithm is used for MPI_COMM_WORLD
and congruent communicators. Explicitly specifying this environment
variable also enables the use of this algorithm for other
communicators on both IRIX and Linux systems.
Default: On IRIX systems enabled for MPI_COMM_WORLD for jobs using
more than 64 processes. On Altix systems enabled by default for all
MPI communicators for all process counts.
MPI_BAR_TREE
Specifies the use of a tree barrier within the MPI_Barrier(3) and
MPI_Win_fence(3) functions. This variable can also be used to
change the default arity (fan-in) of the tree barrier algorithm.
Typically this barrier is slower than the butterfly/dissemination
barrier.
Default: Not enabled. Default arity is 8 when enabled.
MPI_BUFFER_MAX
Specifies a minimum message size, in bytes, for which the message
will be considered a candidate for single-copy transfer.
On IRIX, this mechanism is available only for communication between
MPI processes on the same host. The sender data must reside in
either the symmetric data, symmetric heap, or global heap. The MPI
data type on the send side must also be a contiguous type.
On IRIX, if the XPMEM driver is enabled (for single host jobs, see
MPI_XPMEM_ON and for multihost jobs, see MPI_USE_XPMEM), MPI allows
single-copy transfers for basic predefined MPI data types from any
sender data location, including the stack and private heap. The
XPMEM driver also allows single-copy transfers across partitions.
On IRIX, if cross mapping of data segments is enabled at job
startup, data in common blocks will reside in the symmetric data
segment. On systems running IRIX 6.5.2 or higher, this feature is
enabled by default. You can employ the symmetric heap by using the
shmalloc (shpalloc) functions available in LIBSMA.
On Linux, this feature is supported for both single host MPI jobs
and MPI jobs running across partitions. MPI uses the xpmem module to
map memory from one MPI process onto another during job startup.
The mapped areas include the static region, private heap, and stack
region. Single-copy is supported for contiguous data types from any
of the mapped regions.
Memory mapping is enabled by default on Linux. To disable it, set
the MPI_MEMMAP_OFF environment variable. In addition, the xpmem
kernel module must be installed on your system for single-copy
transfers. The xpmem module is released with the OS.
Testing of this feature has indicated that most MPI applications
benefit more from buffering of medium-sized messages than from
buffering of large size messages, even though buffering of medium-
sized messages requires an extra copy of data. However, highly
synchronized applications that perform large message transfers can
benefit from the single-copy pathway.
Single-copy can occur by default for certain MPI functions that
transfer large size messages. See MPI_DEFAULT_SINGLE_COPY_OFF for
more information and how to disable it.
Default: Not enabled
MPI_BUFS_PER_HOST
Determines the number of shared message buffers (16 KB each) that
MPI is to allocate for each host. These buffers are used to send
long messages and interhost messages.
Default: 32 pages (1 page = 16KB)
MPI_BUFS_PER_PROC
Determines the number of private message buffers (16 KB each) that
MPI is to allocate for each process. These buffers are used to send
long messages and intrahost messages.
Default: 32 pages (1 page = 16KB)
MPI_CHECK_ARGS
Enables checking of MPI function arguments. Segmentation faults
might occur if bad arguments are passed to MPI, so this is useful
for debugging purposes. Using argument checking adds several
microseconds to latency.
Default: Not enabled
MPI_COMM_MAX
Sets the maximum number of communicators that can be used in an MPI
program. Use this variable to increase internal default limits.
(Might be required by standard-compliant programs.) MPI generates
an error message if this limit (or the default, if not set) is
exceeded.
Default: 256
MPI_COREDUMP
Controls which ranks of an MPI job can dump core on receipt of a
core-dumping signal. Valid values are NONE, FIRST, ALL, or INHIBIT.
NONE means that no rank should dump core. FIRST means that the
first rank on each host to receive a core-dumping signal should dump
core. ALL means that all ranks should dump core if they receive a
core-dumping signal. INHIBIT disables MPI signal-handler
registration for core-dumping signals.
When MPI_Init() is called, the MPI library attempts to register a
signal handler for each signal for which reception causes a core
dump. If a signal handler was previously registered, MPI removes the
MPI registration and restores the other signal handler for that
signal. If no previously-registered handler is present, the MPI
handler is invoked if and when the rank receives a core-dumping
signal.
When the MPI signal handler is invoked, it displays a stack
traceback for the first rank entering the handler on each host, and
then consults MPI_COREDUMP to determine if a core dump should be
produced.
Note that process limits on core dump size interact with this
setting. First a process decides to dump core or is inhibited from
dumping core based on the MPI_COREDUMP setting. Then "limit
coredump" applies to the resulting core dump file(s), if any.
Default: FIRST
MPI_COREDUMP_DEBUGGER (Linux only)
This variable lets you optionally specify which debugger should be
used by MPT to display the stack traceback when your program
receives a core-dumping signal. Set MPI_VERBOSE to have MPT display
the debugger command just before it executes it. If the environment
variable is not defined, MPT uses the idb debugger.
You can specify this variable in any of the following formats:
Format Meaning
Basename of a debugger If you specify idb or gdb, MPT uses
that debugger, customizing the command
line argument and debugger commands
sent to the debugger, as appropriate.
Note that the program you specify must
be located in one of the directories
specified by the PATH environment
variable in the MPT job. This might be
different from the PATH variable in
your interactive sessions. If you
receive a message similar to sh: idb:
command not found in the stack
traceback, you can use the pathname to
the debugger (described in the
following format) to supply a full
pathname instead.
Pathname to a debugger If you specify a value that contains a
/, but no spaces, MPT takes the value
as the pathname to the debugger you
wish to use. The final four characters
of the value must be /idb or /gdb.
Command-line arguments are not
supplied to the debugger, but debugger
commands are customized according to
the debugger specified. If you need
to specify command-line arguments to
the debugger, use a complete command
line (described in the following
format).
Complete command line If the value contains a space, it is
taken as the complete command line to
be passed to system(1). Up to four
occurrences of %d in the command line
are replaced by the process ID of the
process upon which the debugger should
be run. You will need to arrange for
debugger commands to be sent to the
debugger. The third and fourth
examples below show samples of this.
Examples: (There are four examples here, each of which must be
typed all on one line)
setenv MPI_COREDUMP_DEBUGGER gdb
setenv MPI_COREDUMP_DEBUGGER /my/test/version/of/idb
setenv MPI_COREDUMP_DEBUGGER "(echo print my_favorite_variable; echo where; echo quit) | gdb -p %d"
setenv MPI_COREDUMP_DEBUGGER '(echo set \$stoponattach = 1; echo attach %d /proc/%d/exe; echo where; echo quit) | /sw/com/intel-compilers/7.1.013/compiler70/ia64/bin/idb | sed -e "s/^/coredump: /"'
Default: idb
MPI_COREDUMP_VERBOSE
Instructs mpirun(1) to print information about coredump control and
traceback handling. Notably, a message will be printed if a user-
or library-registered signal handler overrides a signal handler
which the MPT library would otherwise have installed. Output is
sent to stderr.
Default: Not enabled
MPI_DEFAULT_SINGLE_COPY_OFF
Disables the default single-copy optimization. By default, transfers
of more than 2000 bytes that use MPI_Isend, MPI_Sendrecv,
MPI_Alltoall, MPI_Bcast, MPI_Allreduce, or MPI_Reduce use single-copy
mode. Users of MPI_Send should continue to use the MPI_BUFFER_MAX
environment variable to enable single-copy.
Default: Not enabled
MPI_DIR
Sets the working directory on a host. When an mpirun(1) command is
issued, the Array Services daemon on the local or distributed node
responds by creating a user session and starting the required MPI
processes. The user ID for the session is that of the user who
invokes mpirun, so this user must be listed in the .rhosts file on
the corresponding nodes. By default, the working directory for the
session is the user's $HOME directory on each node. You can direct
all nodes to a different directory (an NFS directory that is
available to all nodes, for example) by setting the MPI_DIR variable
to a different directory.
Default: $HOME on the node. If using the -np option of mpirun(1),
the default is the current directory.
MPI_DPLACE_INTEROP_OFF (IRIX systems only)
Disables an MPI/dplace interoperability feature available beginning
with IRIX 6.5.13. By setting this variable, you can obtain the
behavior of MPI with dplace on older releases of IRIX.
Default: Not enabled
MPI_DSM_CPULIST
Specifies a list of CPUs on which to run an MPI application. To
ensure that processes are linked to CPUs, this variable should be
used in conjunction with the MPI_DSM_MUSTRUN variable.
For an explanation of the syntax for this environment variable, see
the section titled "Using a CPU List."
MPI_DSM_CPULIST_TYPE
Specifies the way in which MPI should interpret the CPU values given
by the MPI_DSM_CPULIST variable. This variable can be set to the
following values:
Value Action
hwgraph This tells MPI to interpret the CPU numbers
designated by the MPI_DSM_CPULIST variable as
cpunum values as defined in the hardware
graph (see hwgraph(4)). This is the default
interpretation when running MPI outside of a
cpuset (see cpuset(4)).
cpuset This tells MPI to interpret the CPU numbers
designated by the MPI_DSM_CPULIST variable as
relative processors within a cpuset. This is
the default interpretation of this list when MPI
is running within a cpuset. Setting
MPI_DSM_CPULIST_TYPE to this value when not
running within a cpuset has no effect.
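For example, to force cpunum interpretation of the list even when
running within a cpuset (the CPU numbers are illustrative):

    setenv MPI_DSM_CPULIST 8-11
    setenv MPI_DSM_CPULIST_TYPE hwgraph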
MPI_DSM_DISTRIBUTE (Linux systems only)
Ensures that each MPI process gets a unique CPU and physical memory
on the node with which that CPU is associated. Currently, the CPUs
are chosen by simply starting at relative CPU 0 and incrementing
until all MPI processes have been forked. To choose specific CPUs,
use the MPI_DSM_CPULIST environment variable. This feature is most
useful if running on a dedicated system or running within a cpuset.
Some batch schedulers, including LSF 5.1, will cause
MPI_DSM_DISTRIBUTE to be set automatically when using dynamic
cpusets.
Default: Not enabled
MPI_DSM_MUSTRUN
Enforces memory locality for MPI processes. Use of this feature
ensures that each MPI process will get a CPU and physical memory on
the node to which it was originally assigned. This variable has
been observed to improve program performance on IRIX systems running
release 6.5.7 and earlier, when running a program on a quiet system.
With later IRIX releases, under certain circumstances, setting this
variable is not necessary. Internally, this feature directs the
library to use the process_cpulink(3) function instead of
process_mldlink(3) to control memory placement.
MPI_DSM_MUSTRUN should not be used when the job is submitted to
miser (see miser_submit(1)) because program hangs may result.
The process_cpulink(3) function is inherited across process fork(2)
or sproc(2). For this reason, when using mixed MPI/OpenMP
applications, it is recommended either that this variable not be
set, or that _DSM_MUSTRUN also be set (see pe_environ(5)).
On Linux systems, this environment variable has been deprecated and
will be removed in a future release. Use the MPI_DSM_DISTRIBUTE
environment variable instead.
Default: Not enabled
MPI_DSM_OFF
Turns off nonuniform memory access (NUMA) optimization in the MPI
library.
Default: Not enabled
MPI_DSM_PLACEMENT (IRIX systems only)
Specifies the default placement policy to be used for the stack and
data segments of an MPI process. Set this variable to one of the
following values:
Value Action
firsttouch With this policy, IRIX attempts to satisfy
requests for new memory pages for stack, data,
and heap memory on the node where the requesting
process is currently scheduled.
fixed With this policy, IRIX attempts to satisfy
requests for new memory pages for stack, data,
and heap memory on the node associated with the
memory locality domain (mld) with which an MPI
process was linked at job startup. This is the
default policy for MPI processes.
roundrobin With this policy, IRIX attempts to satisfy
requests for new memory pages in a round robin
fashion across all of the nodes associated with
the MPI job. It is generally not recommended to
use this setting.
threadroundrobin
This policy is intended for use with hybrid
MPI/OpenMP applications only. With this policy,
IRIX attempts to satisfy requests for new memory
pages for the MPI process stack, data, and heap
memory in a roundrobin fashion across the nodes
allocated to its OpenMP threads. This placement
option might be helpful for large OpenMP/MPI
process ratios. For non-OpenMP applications,
this value is ignored.
Default: fixed
MPI_DSM_PPM
Sets the number of MPI processes per memory locality domain (mld).
For Origin 2000 systems, values of 1 or 2 are allowed. For Origin
3000 and Origin 300 systems, values of 1, 2, or 4 are allowed. On
Altix systems, values of 1 or 2 are allowed.
Default: Origin 2000 systems, 2; Origin 3000 and Origin 300
systems, 4; Altix systems, 2.
MPI_DSM_TOPOLOGY (IRIX systems only)
Specifies the shape of the set of hardware nodes on which the PE
memories are allocated. Set this variable to one of the following
values:
Value Action
cube A group of memory nodes that form a perfect
hypercube. The number of processes per host
must be a power of 2. If a perfect hypercube is
unavailable, a less restrictive placement will
be used.
cube_fixed A group of memory nodes that form a perfect
hypercube. The number of processes per host
must be a power of 2. If a perfect hypercube is
unavailable, the placement will fail, disabling
NUMA placement.
cpucluster Any group of memory nodes. The operating system
attempts to place the group numbers close to one
another, taking into account nodes with disabled
processors. (Default for IRIX 6.5.11 and
higher.)
free Any group of memory nodes. The operating system
attempts to place the group numbers close to one
another. (Default for IRIX 6.5.10 and earlier
releases.)
MPI_DSM_VERBOSE
Instructs mpirun(1) to print information about process placement for
jobs running on nonuniform memory access (NUMA) machines (unless
MPI_DSM_OFF is also set). Output is sent to stderr.
Default: Not enabled
MPI_DSM_VERIFY (IRIX systems only)
Instructs mpirun(1) to run some diagnostic checks on proper memory
placement of MPI data structures at job startup. If errors are
found, a diagnostic message is printed to stderr.
Default: Not enabled
MPI_GM_DEVS (IRIX systems only)
Sets the order for opening GM (Myrinet) adapters. The list of devices
does not need to be space-delimited (0321 is valid). In this
release, a maximum of 8 adapters is supported on a single host.
Default: MPI will use all available GM (Myrinet) devices.
MPI_GM_VERBOSE
Setting this variable allows some diagnostic information concerning
messaging between processes using GM (Myrinet) to be displayed on
stderr.
Default: Not enabled
MPI_GROUP_MAX
Determines the maximum number of groups that can simultaneously
exist for any single MPI process. Use this variable to increase
internal default limits. (This variable might be required by
standard-compliant programs.) MPI generates an error message if
this limit (or the default, if not set) is exceeded.
Default: 32
MPI_GSN_DEVS (IRIX 6.5.12 systems or later)
Sets the order for opening GSN adapters. The list of devices does
not need to be quoted or space-delimited (0123 is valid).
Default: MPI will use all available GSN devices
MPI_GSN_VERBOSE (IRIX 6.5.12 systems or later)
Allows additional MPI initialization information to be printed in
the standard output stream. This information contains details about
the GSN (ST protocol) OS bypass connections and the GSN adapters
that are detected on each of the hosts.
Default: Not enabled
MPI_MAPPED_HEAP_SIZE (Linux systems only)
Sets the new size (in bytes) for the amount of heap that is memory
mapped per MPI process. The default size of the mapped heap is the
physical memory available per CPU less the static region size. For
more information regarding memory mapping, see MPI_MEMMAP_OFF.
Default: The physical memory available per CPU less the static
region size
MPI_MAPPED_STACK_SIZE (Linux systems only)
Sets the new size (in bytes) for the amount of stack that is memory
mapped per MPI process. The default size of the mapped stack is the
stack limit size. If the stack is unlimited, the mapped region is
set to the physical memory available per CPU. For more information
regarding memory mapping, see MPI_MEMMAP_OFF.
Default: The stack limit size
MPI_MEMMAP_OFF (Linux systems only)
Turns off the memory mapping feature.
The memory mapping feature provides support for single-copy
transfers and MPI-2 one-sided communication on Linux. These
features are supported for single host MPI jobs and MPI jobs that
span partitions. At job startup, MPI uses the xpmem module to map
memory from one MPI process onto another. The mapped areas include
the static region, private heap, and stack.
Memory mapping is enabled by default on Linux. To disable it, set
the MPI_MEMMAP_OFF environment variable.
For memory mapping, the xpmem kernel module must be installed on
your system. The xpmem module is released with the OS.
Default: Not enabled
MPI_MEMMAP_VERBOSE (Linux systems only)
Allows MPI to display additional information regarding the memory
mapping initialization sequence. Output is sent to stderr.
Default: Not enabled
MPI_MSG_RETRIES
Specifies the number of times the MPI library will try to get a
message header if none is available. Each MPI message that is
sent requires an initial message header. If one is not available
after MPI_MSG_RETRIES, the job will abort.
Note that this variable no longer applies to processes on the same
host, or when using the GM (Myrinet) protocol. In these cases,
message headers are allocated dynamically on an as-needed basis.
Default: 500
MPI_MSGS_MAX
This variable can be set to control the total number of message
headers that can be allocated. This allocation applies to messages
exchanged between processes on a single host, or between processes
on different hosts when using the GM (Myrinet) OS bypass protocol.
Note that the initial allocation of memory for message headers is
128 Kbytes.
Default: Allow up to 64 Mbytes to be allocated for message headers.
If you set this variable, specify the maximum number of message
headers.
MPI_MSGS_PER_HOST
Sets the number of message headers to allocate for MPI messages on
each MPI host. Space for messages that are destined for a process on
a different host is allocated as shared memory on the host on which
the sending processes are located. MPI locks these pages in memory.
Use the MPI_MSGS_PER_HOST variable to allocate buffer space for
interhost messages.
Caution: If you set the memory pool for interhost packets to a
large value, you can cause allocation of so much locked memory that
total system performance is degraded.
The previous description does not apply to processes that use the
GM (Myrinet) OS bypass protocol. In this case, message headers are
allocated dynamically as needed. See the MPI_MSGS_MAX variable
description.
Default: 1024 messages
MPI_MSGS_PER_PROC
This variable is effectively obsolete. Message headers are now
allocated on an as needed basis for messaging either between
processes on the same host, or between processes on different hosts
when using the GM (Myrinet) OS bypass protocol. The new
MPI_MSGS_MAX variable can be used to control the total number of
message headers that can be allocated.
Default: 1024
MPI_NAP
This variable affects the way in which ranks wait for events to
occur. For example, when a receive is issued for which there are as
yet no matching sends, the receiving rank awaits the matching send
issued event.
When MPI_NAP is not defined (that is, unsetenv MPI_NAP), the library
spins in a tight loop when awaiting events. While this provides the
best possible response time when the event occurs, each waiting rank
uses CPU time at wall-clock rates until then. Leaving MPI_NAP
undefined is best if sends and matching receives occur nearly
simultaneously.
If defined with no value (that is, setenv MPI_NAP), the library
makes a system call while waiting, which might yield the CPU to
another eligible process that can use it. If no such process
exists, the rank receives control back nearly immediately, and CPU
time accrues at near wall-clock rates. If another process does
exist, it is given some CPU time, after which the MPI rank is again
given the CPU to test for the event. This is best if the system is
oversubscribed (there are more processes ready to run than there are
CPUs). This option was previously available in MPT, but was not
documented.
If defined with a positive integer value (for example, setenv
MPI_NAP 10), the rank sleeps for that many milliseconds before again
testing to determine if an event has occurred. This dramatically
reduces the CPU time that is charged against the rank, and might
increase the system's "idle" time. This setting is best if there is
usually a significant time difference between the times that sends
and matching receives are posted.
Default: Not applicable - one of the cases above always applies.
MPI_OPENMP_INTEROP
Setting this variable modifies the placement of MPI processes to
better accommodate the OpenMP threads associated with each process.
For more information, see the section titled Using MPI with OpenMP.
NOTE: This option is available only on Origin 300 and Origin 3000
servers and Altix systems.
Default: Not enabled
MPI_REQUEST_MAX
Determines the maximum number of nonblocking sends and receives that
can simultaneously exist for any single MPI process. Use this
variable to increase internal default limits. (This variable might
be required by standard-compliant programs.) MPI generates an error
message if this limit (or the default, if not set) is exceeded.
Default: 16384
MPI_SHARED_VERBOSE
Setting this variable allows for some diagnostic information
concerning messaging within a host to be displayed on stderr.
Default: Not enabled
MPI_SIGTRAP (Linux systems only)
Specifies if MPT's signal handler should override any existing
signal handlers for signals SIGSEGV, SIGQUIT, SIGILL, SIGABRT,
SIGBUS, and SIGFPE. If set to ON, the MPT signal handler will
override any pre-existing signal handler for these signals. If
OFF, then the existing signal handlers will remain in effect.
These signals are sometimes handled by compiler-language-specific
runtime libraries. In some cases, the signal handler in the runtime
library makes inappropriate references to memory-mapped fetchop
areas, which may result in a system panic. This has been observed
with Intel's efc 7.x compilers.
Default: ON (This may change in future releases.)
MPI_SIGTRAP_VERBOSE (Linux systems only)
If set, MPT will display the value of the MPI_SIGTRAP environment
variable, and messages about the actions taken if MPT overrides a
pre-existing signal handler. See also MPI_COREDUMP_VERBOSE.
Default: Not enabled
MPI_SLAVE_DEBUG_ATTACH
Specifies the MPI process to be debugged. If you set
MPI_SLAVE_DEBUG_ATTACH to N, the MPI process with rank N prints a
message during program startup, describing how to attach to it from
another window using the dbx debugger on IRIX or the gdb or idb
debugger on Linux. The message includes the number of seconds you
have to attach the debugger to process N. If you fail to attach
before the time expires, the process continues.
MPI_STATIC_NO_MAP (IRIX systems only)
Disables cross mapping of static memory between MPI processes. This
variable can be set to reduce the significant MPI job startup and
shutdown time that can be observed for jobs involving more than 512
processors on a single IRIX host. Note that setting this shell
variable disables certain internal MPI optimizations and also
restricts the usage of MPI-2 one-sided functions. For more
information, see the MPI_Win man page.
Default: Not enabled
MPI_STATS
Enables printing of MPI internal statistics. Each MPI process
prints statistics about the amount of data sent with MPI calls
during the MPI_Finalize process. Data is sent to stderr. To prefix
the statistics messages with the MPI rank, use the -p option on the
mpirun command. For additional information, see the MPI_SGI_stats
man page.
NOTE: Because the statistics-collection code is not thread-safe,
this variable should not be set if the program uses threads.
Default: Not enabled
MPI_TYPE_DEPTH
Sets the maximum number of nesting levels for derived data types.
(Might be required by standard-compliant programs.) The
MPI_TYPE_DEPTH variable limits the maximum depth of derived data
types that an application can create. MPI generates an error
message if this limit (or the default, if not set) is exceeded.
Default: 8 levels
MPI_TYPE_MAX
Determines the maximum number of data types that can simultaneously
exist for any single MPI process. Use this variable to increase
internal default limits. (This variable might be required by
standard-compliant programs.) MPI generates an error message if
this limit (or the default, if not set) is exceeded.
Default: 1024
MPI_UNBUFFERED_STDIO
Normally, mpirun line-buffers output received from the MPI processes
on both the stdout and stderr standard IO streams. This prevents
lines of text from different processes from possibly being merged
into one line, and allows use of the mpirun -prefix option.
Of course, there is a limit to the amount of buffer space that
mpirun has available (currently, about 8,100 characters can appear
between new line characters per stream per process). If more
characters are emitted before a new line character, the MPI program
will abort with an error message.
Setting the MPI_UNBUFFERED_STDIO environment variable disables this
buffering. This is useful, for example, when a program's rank 0
emits a series of periods over time to indicate progress of the
program. With buffering, the entire line of periods will be output
only when the new line character is seen. Without buffering, each
period will be immediately displayed as soon as mpirun receives it
from the MPI program. (Note that the MPI program still needs to
call fflush(3) or FLUSH(101) to flush the stdout buffer from the
application code.)
Additionally, setting MPI_UNBUFFERED_STDIO allows an MPI program
that emits very long output lines to execute correctly.
NOTE: If MPI_UNBUFFERED_STDIO is set, the mpirun -prefix option is
ignored.
Default: Not set
MPI_UNIVERSE (Linux systems only)
When running MPI applications on partitioned Altix systems which use
the MPI_Comm_spawn and MPI_Comm_spawn_multiple functions, it may be
necessary to explicitly specify the partitions on which additional
MPI processes may be launched. The MPI_UNIVERSE environment
variable may be used for this purpose.
For more information, see the section titled "Launching Spawn
Capable Jobs on Altix Partitioned Systems" from the mpirun man page.
Default: Not set
MPI_UNIVERSE_SIZE (Linux systems only)
When running MPI applications on partitioned Altix systems which use
the MPI_Comm_spawn and MPI_Comm_spawn_multiple functions, users can
specify MPI_UNIVERSE_SIZE instead of using the -up option on the
mpirun command line.
For more information, see the section titled "Launching Spawn
Capable Jobs on Altix Partitioned Systems" from the mpirun man page.
Default: Not set
MPI_USE_GM (IRIX systems only)
Requires the MPI library to use the Myrinet (GM protocol) OS bypass
driver as the interconnect when running across multiple hosts or
running with multiple binaries. If a GM connection cannot be
established among all hosts in the MPI job, the job is terminated.
For more information, see the section titled "Default Interconnect
Selection."
Default: Not set
MPI_USE_GSN (IRIX 6.5.12 systems or later)
Requires the MPI library to use the GSN (ST protocol) OS bypass
driver as the interconnect when running across multiple hosts or
running with multiple binaries. If a GSN connection cannot be
established among all hosts in the MPI job, the job is terminated.
GSN imposes a limit of one MPI process using GSN per CPU on a
system. For example, on a 128-CPU system, you can run multiple MPI
jobs, as long as the total number of MPI processes using the GSN
bypass does not exceed 128.
Once the maximum allowed MPI processes using GSN is reached,
subsequent MPI jobs return an error to the user output, as in the
following example:
MPI: Could not connect all processes to GSN adapters. The maximum
number of GSN adapter connections per system is normally equal
to the number of CPUs on the system.
If there are a few CPUs still available, but not enough to satisfy
the entire MPI job, the error will still be issued and the MPI job
terminated.
For more information, see the section titled "Default Interconnect
Selection."
Default: Not set
MPI_USE_TCP
Requires the MPI library to use the TCP/IP driver as the
interconnect when running across multiple hosts or running with
multiple binaries.
For more information, see the section titled "Default Interconnect
Selection."
Default: Not set
MPI_USE_XPMEM (IRIX 6.5.13 systems or later and Linux systems)
Requires the MPI library to use the XPMEM driver as the interconnect
when running across multiple hosts or running with multiple
binaries. This driver allows MPI processes running on one partition
to communicate with MPI processes on a different partition via the
NUMAlink network. The NUMAlink network is powered by block transfer
engines (BTEs). BTE data transfers do not require processor
resources.
For IRIX, the XPMEM (cross partition) device driver is available
only on Origin 3000 and Origin 300 systems running IRIX 6.5.13 or
greater.
NOTE: Due to possible MPI program hangs, you should not run MPI
across partitions using the XPMEM driver on IRIX versions 6.5.13,
6.5.14, or 6.5.15. This problem has been resolved in IRIX version
6.5.16.
For Linux, the XPMEM device driver requires the xpmem kernel module
to be installed. The xpmem module is released with the OS.
If all of the hosts specified on the mpirun command do not reside in
the same partitioned system, you can select one additional
interconnect via the MPI_USE variables. MPI communication between
partitions will go through the XPMEM driver, and communication
between non-partitioned hosts will go through the second
interconnect.
For more information, see the section titled "Default Interconnect
Selection."
Default: Not set
MPI_XPMEM_ON (IRIX 6.5.15 systems or later)
Enables the XPMEM single-copy enhancements for processes residing on
the same host.
The XPMEM enhancements allow single-copy transfers for basic
predefined MPI data types from any sender data location, including
the stack and private heap. Without enabling XPMEM, single-copy is
allowed only from data residing in the symmetric data, symmetric
heap, or global heap.
Both the MPI_XPMEM_ON and MPI_BUFFER_MAX variables must be set to
enable these enhancements. Both are disabled by default.
If the following additional conditions are met, the block transfer
engine (BTE) is invoked instead of bcopy, to provide increased
bandwidth:
* Send and receive buffers are cache-aligned.
* Amount of data to transfer is greater than or equal to the
MPI_XPMEM_THRESHOLD value.
NOTE: The XPMEM driver does not support checkpoint/restart at this
time. If you enable these XPMEM enhancements, you will not be able
to checkpoint and restart your MPI job.
The XPMEM single-copy enhancements require an Origin 3000 or Origin
300 server running IRIX release 6.5.15 or greater.
Default: Not set
MPI_XPMEM_THRESHOLD (IRIX 6.5.15 systems or later)
Specifies a minimum message size, in bytes, for which single-copy
messages between processes residing on the same host will be
transferred via the BTE, instead of bcopy. The following conditions
must exist before the BTE transfer is invoked:
* Single-copy mode is enabled (MPI_BUFFER_MAX).
* XPMEM single-copy enhancements are enabled (MPI_XPMEM_ON).
* Send and receive buffers are cache-aligned.
* Amount of data to transfer is greater than or equal to the
MPI_XPMEM_THRESHOLD value.
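For example, to enable the XPMEM enhancements and route cache-aligned
single-copy transfers of 64 KB or more through the BTE (both
thresholds are illustrative):

    setenv MPI_BUFFER_MAX 16384
    setenv MPI_XPMEM_ON
    setenv MPI_XPMEM_THRESHOLD 65536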
Default: 8192
MPI_XPMEM_VERBOSE
Setting this variable allows additional MPI diagnostic information
to be printed in the standard output stream. This information
contains details about the XPMEM connections.
Default: Not enabled
PAGESIZE_DATA (IRIX systems only)
Specifies the desired page size in kilobytes for program data areas.
On Origin series systems, supported values include 16, 64, 256,
1024, and 4096. Specified values must be integers.
NOTE: Setting MPI_DSM_OFF disables the ability to set the data
pagesize via this shell variable.
Default: Not enabled
PAGESIZE_STACK (IRIX systems only)
Specifies the desired page size in kilobytes for program stack
areas. On Origin series systems, supported values include 16, 64,
256, 1024, and 4096. Specified values must be integers.
NOTE: Setting MPI_DSM_OFF disables the ability to set the stack
page size via this shell variable.
Default: Not enabled
SMA_GLOBAL_ALLOC (IRIX systems only)
Activates the LIBSMA-based global heap facility. This variable is
used by 64-bit MPI applications for certain internal optimizations,
as well as support for the MPI_Alloc_mem function. For additional
details, see the intro_shmem(3) man page.
Default: Not enabled
SMA_GLOBAL_HEAP_SIZE (IRIX systems only)
For 64-bit applications, specifies the per process size of the
LIBSMA global heap in bytes.
Default: 33554432 bytes
Using a CPU List
You can manually select CPUs to use for an MPI application by setting the
MPI_DSM_CPULIST shell variable. This setting is treated as a comma
and/or hyphen delineated ordered list, specifying a mapping of MPI
processes to CPUs. If running across multiple hosts or when using
multiple executables, the per host and per executable components of the
CPU list are delineated by colons. The shepherd process(es) and mpirun
are not included in this list. This feature is not compatible with job
migration features available in IRIX.
Examples when launching an MPI job with the following syntax:
mpirun -np 3 a.out
Value CPU Assignment
8,16,32 Place three MPI processes on CPUs 8, 16, and 32.
32,16,8 Place MPI process rank 0 on CPU 32, rank 1 on
CPU 16, and rank 2 on CPU 8.
Examples when launching an MPI job with the following syntax:
mpirun -np 16 a.out
Value CPU Assignment
8-15,32-39 Place the MPI processes 0 through 7 on CPUs 8 to
15. Place the MPI processes 8 through 15 on
CPUs 32 to 39.
39-32,8-15 Place the MPI processes 0 through 7 on CPUs 39
to 32. Place the MPI processes 8 through 15 on
CPUs 8 to 15.
Example when launching an MPI job with the following syntax:
mpirun host1,host2 8 a.out
Value CPU Assignment
8-15:16-23 Place the MPI processes 0 through 7 on the first
host on CPUs 8 through 15. Place MPI processes
8 through 15 on CPUs 16 to 23 on the second
host.
Example when launching an MPI job with the following syntax:
mpirun host1,host2 8 a.out : host2 8 b.out
Value CPU Assignment
8-15:16-23:28-35
Place the MPI processes 0 through 7 running
application a.out on the first host on CPUs 8
through 15. Place MPI processes 8 through 15
running a.out on CPUs 16 to 23 on the second
host. Place MPI processes 16 to 23 running
b.out on CPUs 28 to 35 on the second host.
Note that the process rank is the MPI_COMM_WORLD rank. The
interpretation of the CPU values specified in the MPI_DSM_CPULIST depends
on whether the MPI job is being run within a cpuset. If the job is run
outside of a cpuset, the CPUs specify cpunum values given in the hardware
graph (hwgraph(4)). When running within a cpuset, the default behavior
is to interpret the CPU values as relative processor numbers within the
cpuset. To specify cpunum values instead, you can use the
MPI_DSM_CPULIST_TYPE shell variable.
On Linux systems, the CPU values are always treated as relative processor
numbers within the cpuset. It is assumed that the system will always
have a default (unnamed) cpuset consisting of the entire system of
available processors and nodes.
The number of processors specified should equal the number of MPI
processes (excluding the shepherd process) that will be used. The number
of colon delineated parts of the list must equal the number of hosts or
executables used for the MPI job. If an error occurs in processing the
CPU list, the default placement policy is used. If the number of
specified processors is smaller than the total number of MPI processes,
only a subset of the MPI processes will be placed on the specified
processors. For example, if four processors are specified using the
MPI_DSM_CPULIST variable, but five MPI processes are started, the last
MPI process will not be attached to a processor.
This feature should not be used with MPI jobs running in spawn capable
mode.
Using MPI with OpenMP
Hybrid MPI/OpenMP applications might require special memory placement
features to operate efficiently on ccNUMA Origin and Altix servers. A
method for realizing this memory placement is available. The basic idea
is to space out the MPI processes to accommodate the OpenMP threads
associated with each MPI process. In addition, assuming a particular
ordering of library init code (see the DSO(5) man page), procedures are
employed to ensure that the OpenMP threads remain close to the parent MPI
process. This type of placement has been found to improve the performance
of some hybrid applications significantly when more than four OpenMP
threads are used by each MPI process.
To take partial advantage of this placement option, the following
requirements must be met:
* The user must set the MPI_OPENMP_INTEROP shell variable
when running the application.
* On IRIX systems, the user must use a MIPSpro compiler and
the -mp option to compile the application. This placement
option is not available with other compilers.
* The user must run the application on an Origin 300, Origin
3000, or Altix series server.
To take full advantage of this placement option on IRIX systems, the user
must be able to link the application such that the libmpi.so init code is
run before the libmp.so init code. This is done by linking the
MPI/OpenMP application as follows:
cc -64 -mp compute_mp.c -lmp -lmpi
f77 -64 -mp compute_mp.f -lmp -lmpi
f90 -64 -mp compute_mp.f -lmp -lmpi
CC -64 -mp compute_mp.C -lmp -lmpi++ -lmpi
This linkage order ensures that the libmpi.so init runs procedures for
restricting the placement of OpenMP threads before the libmp.so init is
run. Note that this is not the default linkage if only the -mp option is
specified on the link line.
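A minimal hybrid MPI/OpenMP sketch suitable for the link lines above
might look like this (illustrative only; each MPI process runs its own
team of OpenMP threads):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* OMP_NUM_THREADS controls the team size for each process. */
        #pragma omp parallel
        printf("MPI rank %d, OpenMP thread %d\n",
               rank, omp_get_thread_num());

        MPI_Finalize();
        return 0;
    }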
On IRIX systems, you can use an additional memory placement feature for
hybrid MPI/OpenMP applications by using the MPI_DSM_PLACEMENT shell
variable. Specification of a threadroundrobin policy results in the
parent MPI process stack, data, and heap memory segments being spread
across the nodes on which the child OpenMP threads are running. For more
information, see the ENVIRONMENT VARIABLES section of this man page.
MPI reserves nodes for this hybrid placement model based on the number of
MPI processes and the number of OpenMP threads per process, rounded up to
the nearest multiple of 4 on IRIX systems and 2 on Altix systems. For
instance, on IRIX systems, if 6 OpenMP threads per MPI process are going
to be used for a 4 MPI process job, MPI will request a placement for 32
(4 X 8) CPUs on the host machine. You should take this into account when
requesting resources in a batch environment or when using cpusets. In
this implementation, it is assumed that all MPI processes start with the
same number of OpenMP threads, as specified by the OMP_NUM_THREADS or
equivalent shell variable at job startup.
NOTE: This placement is not recommended when setting _DSM_PPM to a
non-default value (for more information, see pe_environ(5)). This
placement is also not recommended when running on a host with partially
populated nodes. Also, on IRIX systems, if you are using
MPI_DSM_MUSTRUN, it is important to also set _DSM_MUSTRUN to properly
schedule the OpenMP threads.
On Linux systems, the OpenMP threads are not actually pinned to specific
CPUs but are limited to the set of CPUs near the MPI rank. Actual
pinning of the threads will be supported in a future release.
SEE ALSO
mpirun(1), shmem_intro(1), arrayd(1M)
MPI_Buffer_attach(3), MPI_Buffer_detach(3), MPI_Init(3), MPI_IO(3)
arrayd.conf(4), array_services(5)
For more information about using MPI, including optimization, see the
Message Passing Toolkit: MPI Programmer's Manual. You can access this
manual online at http://techpubs.sgi.com.
Man pages exist for every MPI subroutine and function, as well as for the
mpirun(1) command. Additional online information is available at
http://www.mcs.anl.gov/mpi, including a hypertext version of the
standard, information on other libraries that use MPI, and pointers to
other MPI resources.