OCFS2(7) OCFS2 Manual Pages OCFS2(7)
NAME
OCFS2 - A Shared-Disk Cluster File System for Linux
INTRODUCTION
OCFS2 is a file system. It allows users to store and retrieve data. The
data is stored in files that are organized in a hierarchical directory
tree. It is a POSIX compliant file system that supports the standard
interfaces and the behavioral semantics as spelled out by that specifi‐
cation.
It is also a shared disk cluster file system, one that allows multiple
nodes to access the same disk at the same time. This is where the fun
begins as allowing a file system to be accessible on multiple nodes
opens a can of worms. What if the nodes are of different architectures?
What if a node dies while writing to the file system? What data consis‐
tency can one expect if processes on two nodes are reading and writing
concurrently? What if one node removes a file while it is still being
used on another node?
Unlike most shared file systems where the answer is fuzzy, the answer
in OCFS2 is very well defined. It behaves on all nodes exactly like a
local file system. If a file is removed, the directory entry is removed
but the inode is kept as long as it is in use across the cluster. When
the last user closes the descriptor, the inode is marked for deletion.
The data consistency model follows the same principle. It works as if
the two processes that are running on two different nodes are running
on the same node. A read on a node gets the last write irrespective of
the IO mode used. The modes can be buffered, direct, asynchronous,
splice or memory mapped IOs. It is fully cache coherent.
Take for example the REFLINK feature that allows a user to create mul‐
tiple write-able snapshots of a file. This feature, like all others, is
fully cluster-aware. A file being written to on multiple nodes can be
safely reflinked on another node. The snapshot created is a point-in-time
image of the file that includes both the file data and all its
attributes (including extended attributes).
It is a journaling file system. When a node dies, a surviving node
transparently replays the journal of the dead node. This ensures that
the file system metadata is always consistent. It also defaults to
ordered data journaling to ensure the file data is flushed to disk
before the journal commit, to remove the small possibility of stale
data appearing in files after a crash.
It is architecture and endian neutral. It allows concurrent mounts on
nodes with different processors like x86, x86_64, IA64 and PPC64. It
handles little and big endian, 32-bit and 64-bit architectures.
It is feature rich. It supports indexed directories, metadata check‐
sums, extended attributes, POSIX ACLs, quotas, REFLINKs, sparse files,
unwritten extents and inline-data.
It is fully integrated with the mainline Linux kernel. The file system
was merged into Linux kernel 2.6.16 in early 2006.
It is quickly installed. It is available with almost all Linux distri‐
butions. The file system is on-disk compatible across all of them.
It is modular. The file system can be configured to operate with other
cluster stacks like Pacemaker and CMAN along with its own stack, O2CB.
It is easily configured. The O2CB cluster stack configuration involves
editing two files, one for cluster layout and the other for cluster
timeouts.
It is very efficient. The file system consumes minimal resources.
It is used to store virtual machine images in limited memory environ‐
ments like Xen and KVM.
In summary, OCFS2 is an efficient, easily configured, modular, quickly
installed, fully integrated and compatible, feature-rich, architecture
and endian neutral, cache coherent, ordered data journaling, POSIX-com‐
pliant, shared disk cluster file system.
OVERVIEW
OCFS2 is a general-purpose shared-disk cluster file system for Linux
capable of providing both high performance and high availability.
As it provides local file system semantics, it can be used with almost
all applications. Cluster-aware applications can make use of cache-
coherent parallel I/Os from multiple nodes to scale out applications
easily. Other applications can make use of the clustering facilities to
fail over a running application in the event of a node failure.
The notable features of the file system are:
Tunable Block size
The file system supports block sizes of 512, 1K, 2K and 4K
bytes. 4KB is almost always recommended. This feature is avail‐
able in all releases of the file system.
Tunable Cluster size
A cluster size is also referred to as an allocation unit. The
file system supports cluster sizes of 4K, 8K, 16K, 32K, 64K,
128K, 256K, 512K and 1M bytes. For most use cases, 4KB is recom‐
mended. However, a larger value is recommended for volumes host‐
ing mostly very large files like database files, virtual machine
images, etc. A large cluster size allows the file system to
store large files more efficiently. This feature is available in
all releases of the file system.
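For example, a volume intended to host mostly large virtual machine
images could be formatted with a 4KB block size and a 1MB cluster
size using the -b and -C options of mkfs.ocfs2(8). A minimal
sketch; the device path and label below are hypothetical:
# mkfs.ocfs2 -b 4K -C 1M -L "vmstore" /dev/sdb1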
Endian and Architecture neutral
The file system can be mounted concurrently on nodes having dif‐
ferent architectures, such as 32-bit, 64-bit, little-endian (x86,
x86_64, ia64) and big-endian (ppc64, s390x). This feature is
available in all releases of the file system.
Buffered, Direct, Asynchronous, Splice and Memory Mapped I/O modes
The file system supports all modes of I/O for maximum flexibil‐
ity and performance. It also supports cluster-wide shared
writeable mmap(2). The support for buffered, direct and asyn‐
chronous I/O is available in all releases. The support for
splice I/O was added in Linux kernel 2.6.20 and for shared
writeable mmap(2) in 2.6.23.
Multiple Cluster Stacks
The file system includes a flexible framework to allow it to
function with userspace cluster stacks like Pacemaker (pcmk) and
CMAN (cman), its own in-kernel cluster stack o2cb and no cluster
stack.
The support for o2cb cluster stack is available in all releases.
The support for no cluster stack, or local mount, was added in
Linux kernel 2.6.20.
The support for userspace cluster stack was added in Linux ker‐
nel 2.6.26.
Journaling
The file system supports both ordered (default) and writeback
data journaling modes to provide file system consistency in the
event of power failure or system crash. It uses JBD2 in Linux
kernel 2.6.28 and later. It used JBD in earlier kernels.
Extent-based Allocations
The file system allocates and tracks space in ranges of clus‐
ters. This is unlike block based file systems that have to track
each and every block. This feature allows the file system to be
very efficient when dealing with both large volumes and large
files. This feature is available in all releases of the file
system.
Sparse files
Sparse files are files with holes. With this feature, the file
system delays allocating space until a write is issued to a
cluster. This feature was added in Linux kernel 2.6.22 and
requires enabling on-disk feature sparse.
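On a volume with sparse enabled, a hole can be created with
standard tools by seeking past the end of the file before
writing; only the written cluster is allocated. A minimal
illustration (the file name is hypothetical):
$ dd if=/dev/zero of=sparsefile bs=4K count=1 seek=1048576
$ ls -lsh sparsefile
The allocated size in the first column will be a few kilobytes,
while the file length will be a little over 4GB.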
Unwritten Extents
An unwritten extent is also referred to as user pre-allocation.
It allows an application to request a range of clusters to be
allocated, but not initialized, within a file. Pre-allocation
allows the file system to optimize the data layout with fewer,
larger extents. It also provides a performance boost, delaying
initialization until the user writes to the clusters. This fea‐
ture was added in Linux kernel 2.6.23 and requires enabling on-
disk feature unwritten.
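From the shell, pre-allocation can be requested with the
fallocate(1) utility, which calls fallocate(2). A sketch; the
file name and size are hypothetical:
$ fallocate -l 1G prealloc.img
The file immediately reports a 1GB size, but the extents remain
marked unwritten until data is written to them.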
Hole Punching
Hole punching allows an application to remove arbitrary allo‐
cated regions within a file. Creating holes, essentially. This
is more efficient than zeroing the same extents. This feature
is especially useful in virtualized environments as it allows a
block discard in a guest file system to be converted to a hole
punch in the host file system thus allowing users to reduce disk
space usage. This feature was added in Linux kernel 2.6.23 and
requires enabling on-disk features sparse and unwritten.
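From the shell, a hole can be punched with fallocate(1) in
punch-hole mode, which leaves the file size unchanged. A sketch;
the offset, length and file name are hypothetical:
$ fallocate -p -o 100M -l 50M somefile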
Inline-data
Inline data is also referred to as data-in-inode as it allows
storing small files and directories in the inode block. This not
only saves space but also has a positive impact on cold-cache
directory and file operations. The data is transparently moved
out to an extent when it no longer fits inside the inode block.
This feature was added in Linux kernel 2.6.24 and requires
enabling on-disk feature inline-data.
REFLINK
REFLINK is also referred to as fast copy. It allows users to
atomically (and instantly) copy regular files. In other words,
create multiple writeable snapshots of regular files. It is
called REFLINK because it looks and feels more like a (hard)
link(2) than a traditional snapshot. Like a link, it is a regu‐
lar user operation, subject to the security attributes of the
inode being reflinked and not to the super user privileges typi‐
cally required to create a snapshot. Like a link, it operates
within a file system. But unlike a link, it links the inodes at
the data extent level allowing each reflinked inode to grow
independently as and when written to. Up to four billion inodes
can share a data extent. This feature was added in Linux kernel
2.6.32 and requires enabling on-disk feature refcount.
Allocation Reservation
File contiguity plays an important role in file system perfor‐
mance. When a file is fragmented on disk, reading and writing to
the file involves many seeks, leading to lower throughput. Con‐
tiguous files, on the other hand, minimize seeks, allowing the
disks to perform IO at the maximum rate.
With allocation reservation, the file system reserves a window
in the bitmap for all extending files allowing each to grow as
contiguously as possible. As this extra space is not actually
allocated, it is available for use by other files if the need
arises. This feature was added in Linux kernel 2.6.35 and can
be tuned using the mount option resv_level.
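A sketch of tuning the reservation window at mount time,
assuming the resv_level option documented in mount.ocfs2(8),
where higher levels reserve larger windows (device and mount
point hypothetical):
# mount -o resv_level=4 /dev/sda1 /ocfs2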
Indexed Directories
An indexed directory allows users to perform quick lookups of a
file in very large directories. It also results in faster cre‐
ates and unlinks and thus provides better overall performance.
This feature was added in Linux kernel 2.6.30 and requires
enabling on-disk feature indexed-dirs.
File Attributes
This refers to EXT2-style file attributes, such as immutable,
modified using chattr(1) and queried using lsattr(1). This fea‐
ture was added in Linux kernel 2.6.19.
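For example, to mark a file immutable, verify the flag, and
clear it again (the path is hypothetical):
# chattr +i /ocfs2/fstab.snapshot
# lsattr /ocfs2/fstab.snapshot
# chattr -i /ocfs2/fstab.snapshot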
Extended Attributes
An extended attribute refers to a name:value pair that can be
associated with file system objects like regular files, directo‐
ries, symbolic links, etc. OCFS2 allows associating an unlimited
number of attributes per object. The attribute names can be up
to 255 bytes in length, terminated by the first NUL character.
While it is not required, printable names (ASCII) are recom‐
mended. The attribute values can be up to 64 KB of arbitrary
binary data. These attributes can be modified and listed using
standard Linux utilities setfattr(1) and getfattr(1). This fea‐
ture was added in Linux kernel 2.6.29 and requires enabling on-
disk feature xattr.
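For example, to attach an attribute in the user namespace and
read it back using the standard utilities mentioned above (the
attribute name and file are hypothetical):
$ setfattr -n user.origin -v "scanned-2011-07" report.pdf
$ getfattr -n user.origin report.pdf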
Metadata Checksums
This feature allows the file system to detect silent corruptions
in all metadata blocks like inodes and directories. This feature
was added in Linux kernel 2.6.29 and requires enabling on-disk
feature metaecc.
POSIX ACLs and Security Attributes
POSIX ACLs allow assigning fine-grained discretionary access
rights for files and directories. This security scheme is a lot
more flexible than the traditional file access permissions,
which impose a strict user-group-other model.
Security attributes allow the file system to support other secu‐
rity regimes like SELinux, SMACK, AppArmor, etc.
Both these security extensions were added in Linux kernel 2.6.29
and require enabling on-disk feature xattr.
User and Group Quotas
This feature allows setting up usage quotas on a user and group
basis by using the standard utilities like quota(1),
setquota(8), quotacheck(8), and quotaon(8). This feature was
added in Linux kernel 2.6.29 and requires enabling on-disk fea‐
tures usrquota and grpquota.
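For example, to set soft and hard block limits (in 1KB units)
for a user with setquota(8) (the user name, limits and mount
point are hypothetical):
# setquota -u jeff 5242880 6291456 0 0 /ocfs2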
Unix File Locking
The Unix operating system has historically provided two system
calls to lock files. flock(2) or BSD locking and fcntl(2) or
POSIX locking. OCFS2 extends both file locks to the cluster.
File locks taken on one node interact with those taken on other
nodes.
The support for clustered flock(2) was added in Linux kernel
2.6.26. All flock(2) options are supported, including the ker‐
nel's ability to cancel a lock request when an appropriate kill
signal is received by the user. This feature is supported with
all cluster-stacks including o2cb.
The support for clustered fcntl(2) was added in Linux kernel
2.6.28. But because it requires group communication to make the
locks coherent, it is only supported with userspace cluster
stacks, pcmk and cman, and not with the default cluster stack
o2cb.
Comprehensive Tools Support
The file system has a comprehensive EXT3-style toolset that
tries to use similar parameters for ease-of-use. It includes
mkfs.ocfs2(8) (format), tunefs.ocfs2(8) (tune), fsck.ocfs2(8)
(check), debugfs.ocfs2(8) (debug), etc.
Online Resize
The file system can be dynamically grown using tunefs.ocfs2(8).
This feature was added in Linux kernel 2.6.25.
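A sketch of growing the file system to fill a device that has
been enlarged, assuming the -S (volume size) option of
tunefs.ocfs2(8); without a size argument it is expected to grow
to the size of the device:
# tunefs.ocfs2 -S /dev/sda1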
RECENT CHANGES
The O2CB cluster stack has a global heartbeat mode. It allows users to
specify heartbeat regions that are consistent across all nodes. The
cluster stack also allows online addition and removal of both nodes and
heartbeat regions.
o2cb(8) is the new cluster configuration utility. It is an easy-to-use
utility that allows users to create the cluster configuration on a node
that is not part of the cluster. It replaces the older utility
o2cb_ctl(8), which has been deprecated.
ocfs2console(8) has been obsoleted.
o2info(8) is a new utility that can be used to provide file system
information. It allows non-privileged users to see the enabled file
system features, block and cluster sizes, extended file stat, free
space fragmentation, etc.
o2hbmonitor(8) is an o2hb heartbeat monitor. It is an extremely light‐
weight utility that logs messages to the system logger once the heart‐
beat delay exceeds the warn threshold. This utility is useful in iden‐
tifying volumes encountering I/O delays.
debugfs.ocfs2(8) has some new commands. net_stats shows the o2net mes‐
sage times between various nodes. This is useful in identifying nodes
that are slowing down the cluster operations. stat_sysdir allows the
user to dump the entire system directory that can be used to debug
issues. grpextents dumps the complete free space fragmentation in the
cluster group allocator.
mkfs.ocfs2(8) now enables xattr, indexed-dirs, discontig-bg, refcount,
extended-slotmap and clusterinfo feature flags by default, in addition
to the older defaults, sparse, unwritten and inline-data.
mount.ocfs2(8) allows users to specify the level of cache coherency
between nodes. By default the file system operates in full coherency
mode that also serializes the direct I/Os. While this mode is techni‐
cally correct, it limits the I/O throughput in a clustered database. This
mount option allows the user to limit the cache coherency to only the
buffered I/Os to allow multiple nodes to do concurrent direct writes to
the same file. This feature works with Linux kernel 2.6.37 and later.
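A sketch of such a mount, assuming the coherency option
documented in mount.ocfs2(8) (device and mount point
hypothetical):
# mount -o coherency=buffered /dev/sdd1 /data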
COMPATIBILITY
The OCFS2 development team goes to great lengths to maintain compati‐
bility. It attempts to maintain both on-disk and network protocol com‐
patibility across all releases of the file system. It does so even
while adding new features that entail on-disk format and network proto‐
col changes. To do this successfully, it follows a few rules:
1. The on-disk format changes are managed by a set of feature flags
that can be turned on and off. The file system in kernel detects
these features during mount and continues only if it understands
all the features. Users encountering this have the option of either
disabling that feature or upgrading the file system to a newer
release.
2. The latest release of ocfs2-tools is compatible with all ver‐
sions of the file system. All utilities detect the features enabled
on disk and continue only if they understand all the features. Users
encountering this have to upgrade the tools to a newer release.
3. The network protocol version is negotiated by the nodes to
ensure all nodes understand the active protocol version.
FEATURE FLAGS
The feature flags are split into three categories, namely, Com‐
pat, Incompat and RO Compat.
Compat, or compatible, is a feature that the file system does
not need to fully understand to safely read/write to the volume.
An example of this is the backup-super feature that added the
capability to back up the super block in multiple locations in
the file system. As the backup super blocks are typically not
read nor written to by the file system, an older file system can
safely mount a volume with this feature enabled.
Incompat, or incompatible, is a feature that the file system
needs to fully understand to read/write to the volume. Most fea‐
tures fall under this category.
RO Compat, or read-only compatible, is a feature that the file
system needs to fully understand to write to the volume. Older
software can safely read a volume with this feature enabled. An
example of this would be user and group quotas. As quotas are
manipulated only when the file system is written to, older soft‐
ware can safely mount such volumes in read-only mode.
The list of feature flags, the version of the kernel each was
added in, the earliest version of the tools that understands it,
etc., is as follows:
┌─────────────────────┬────────────────┬─────────────────┬───────────┬───────────┐
│Feature Flags │ Kernel Version │ Tools Version │ Category │ Hex Value │
├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
│backup-super │ All │ ocfs2-tools 1.2 │ Compat │ 1 │
├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
│strict-journal-super │ All │ All │ Compat │ 2 │
├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
│local │ Linux 2.6.20 │ ocfs2-tools 1.2 │ Incompat │ 8 │
├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
│sparse │ Linux 2.6.22 │ ocfs2-tools 1.4 │ Incompat │ 10 │
├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
│inline-data │ Linux 2.6.24 │ ocfs2-tools 1.4 │ Incompat │ 40 │
├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
│extended-slotmap │ Linux 2.6.27 │ ocfs2-tools 1.6 │ Incompat │ 100 │
├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
│xattr │ Linux 2.6.29 │ ocfs2-tools 1.6 │ Incompat │ 200 │
├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
│indexed-dirs │ Linux 2.6.30 │ ocfs2-tools 1.6 │ Incompat │ 400 │
├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
│metaecc │ Linux 2.6.29 │ ocfs2-tools 1.6 │ Incompat │ 800 │
├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
│refcount │ Linux 2.6.32 │ ocfs2-tools 1.6 │ Incompat │ 1000 │
├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
│discontig-bg │ Linux 2.6.35 │ ocfs2-tools 1.6 │ Incompat │ 2000 │
├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
│clusterinfo │ Linux 2.6.37 │ ocfs2-tools 1.8 │ Incompat │ 4000 │
├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
│unwritten │ Linux 2.6.23 │ ocfs2-tools 1.4 │ RO Compat │ 1 │
├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
│grpquota │ Linux 2.6.29 │ ocfs2-tools 1.6 │ RO Compat │ 2 │
├─────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
│usrquota │ Linux 2.6.29 │ ocfs2-tools 1.6 │ RO Compat │ 4 │
└─────────────────────┴────────────────┴─────────────────┴───────────┴───────────┘
To query the features enabled on a volume, do:
$ o2info --fs-features /dev/sdf1
backup-super strict-journal-super sparse extended-slotmap inline-data xattr
indexed-dirs refcount discontig-bg clusterinfo unwritten
ENABLING AND DISABLING FEATURES
The format utility, mkfs.ocfs2(8), allows a user to enable and
disable specific features using the --fs-features option. The fea‐
tures are provided as a comma separated list. The enabled fea‐
tures are listed as is. The disabled features are prefixed with
no. The example below shows the file system being formatted
with sparse disabled and inline-data enabled.
# mkfs.ocfs2 --fs-features=nosparse,inline-data /dev/sda1
After formatting, users can toggle features using the tune
utility, tunefs.ocfs2(8). This is an offline operation. The
volume needs to be unmounted across the cluster. The example
below shows the sparse feature being enabled and inline-data
disabled.
# tunefs.ocfs2 --fs-features=sparse,noinline-data /dev/sda1
Care should be taken before enabling and disabling features.
Users planning to use a volume with an older version of the file
system will be better off not enabling newer features, as dis‐
abling them later may not succeed.
An example would be disabling the sparse feature; this requires
filling every hole. The operation can only succeed if the file
system has enough free space.
DETECTING FEATURE INCOMPATIBILITY
Say one tries to mount a volume with an incompatible feature.
What happens then? How does one detect the problem? How does one
know the name of that incompatible feature?
To begin with, one should look for error messages in dmesg(8).
Mount failures that are due to an incompatible feature will
always result in an error message like the following:
ERROR: couldn't mount because of unsupported optional features (200).
Here the file system is unable to mount the volume due to an
unsupported optional feature. That means the feature is an
Incompat feature. By referring to the table above, one can then
deduce that the user failed to mount a volume with the xattr
feature enabled. (The value in the error message is in hexadeci‐
mal.)
Another example of an error message due to incompatibility is as
follows:
ERROR: couldn't mount RDWR because of unsupported optional features (1).
Here the file system is unable to mount the volume in read-write
mode. That means the feature is a RO Compat feature.
Another look at the table and it becomes apparent that the vol‐
ume had the unwritten feature enabled.
In both cases, the user has the option of disabling the feature.
In the second case, the user also has the choice of mounting the
volume in read-only mode.
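For example, the volume in the second case could be mounted
read-only as follows (the mount point is hypothetical):
# mount -o ro /dev/sdf1 /ocfs2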
GETTING STARTED
The OCFS2 software is split into two components, namely, kernel and
tools. The kernel component includes the core file system and the clus‐
ter stack, and is packaged along with the kernel. The tools component
is packaged as ocfs2-tools and needs to be specifically installed. It
provides utilities to format, tune, mount, debug and check the file
system.
To install ocfs2-tools, refer to the package management utility of
your distribution.
The next step is selecting a cluster stack. The options include:
A. No cluster stack, or local mount.
B. In-kernel o2cb cluster stack with local or global heartbeat.
C. Userspace cluster stacks pcmk or cman.
The file system allows changing cluster stacks easily using
tunefs.ocfs2(8). To list the cluster stacks stamped on the OCFS2 vol‐
umes, do:
# mounted.ocfs2 -d
Device Stack Cluster F UUID Label
/dev/sdb1 o2cb webcluster G DCDA2845177F4D59A0F2DCD8DE507CC3 hbvol1
/dev/sdc1 None 23878C320CF3478095D1318CB5C99EED localmount
/dev/sdd1 o2cb webcluster G 8AB016CD59FC4327A2CDAB69F08518E3 webvol
/dev/sdg1 o2cb webcluster G 77D95EF51C0149D2823674FCC162CF8B logsvol
/dev/sdh1 o2cb webcluster G BBA1DBD0F73F449384CE75197D9B7098 scratch
NON-CLUSTERED OR LOCAL MOUNT
To format an OCFS2 volume as a non-clustered (local) volume, do:
# mkfs.ocfs2 -L "mylabel" --fs-features=local /dev/sda1
To convert an existing clustered volume to a non-clustered vol‐
ume, do:
# tunefs.ocfs2 --fs-features=local /dev/sda1
Non-clustered volumes do not interact with the cluster stack.
One can have both clustered and non-clustered volumes mounted at
the same time.
While formatting a non-clustered volume, users should consider
the possibility of later converting that volume to a clustered
one. If there is a possibility of that, then the user should add
enough node-slots using the -N option. Adding node-slots during
format creates journals with large extents. If created later,
then the journals will be fragmented which is not good for per‐
formance.
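For example, to format a non-clustered volume while provisioning
four node-slots for a possible later conversion to a clustered
volume (the label is hypothetical):
# mkfs.ocfs2 -L "mylabel" -N 4 --fs-features=local /dev/sda1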
CLUSTERED MOUNT WITH O2CB CLUSTER STACK
Only one of the two heartbeat modes can be active at any one
time. Changing heartbeat modes is an offline operation.
Both heartbeat modes require /etc/ocfs2/cluster.conf and
/etc/sysconfig/o2cb to be populated as described in ocfs2.clus‐
ter.conf(5) and o2cb.sysconfig(5) respectively. The only differ‐
ence in setup between the two modes is that global requires
heartbeat devices to be configured whereas local does not.
Refer to o2cb(7) for more information.
LOCAL HEARTBEAT
This is the default heartbeat mode. The user needs to
populate the configuration files as described in
ocfs2.cluster.conf(5) and o2cb.sysconfig(5). In this
mode, the cluster stack heartbeats on all mounted vol‐
umes. Thus, one does not have to specify heartbeat
devices in cluster.conf.
Once configured, the o2cb cluster stack can be onlined
and offlined as follows:
# service o2cb online
Setting cluster stack "o2cb": OK
Registering O2CB cluster "webcluster": OK
Setting O2CB cluster timeouts : OK
# service o2cb offline
Clean userdlm domains: OK
Stopping O2CB cluster webcluster: OK
Unregistering O2CB cluster "webcluster": OK
GLOBAL HEARTBEAT
The configuration is similar to local heartbeat. The one
additional step in this mode is that it also requires
heartbeat devices to be configured.
These heartbeat devices are OCFS2 formatted volumes with
global heartbeat enabled on disk. These volumes can later
be mounted and used as clustered file systems.
The steps to format a volume with global heartbeat
enabled are listed in o2cb(7), as are the steps to list
all volumes with the cluster stack stamped on disk.
In this mode, the heartbeat is started when the cluster
is onlined and stopped when the cluster is offlined.
# service o2cb online
Setting cluster stack "o2cb": OK
Registering O2CB cluster "webcluster": OK
Setting O2CB cluster timeouts : OK
Starting global heartbeat for cluster "webcluster": OK
# service o2cb offline
Clean userdlm domains: OK
Stopping global heartbeat on cluster "webcluster": OK
Stopping O2CB cluster webcluster: OK
Unregistering O2CB cluster "webcluster": OK
# service o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster "webcluster": Online
Heartbeat dead threshold: 31
Network idle timeout: 30000
Network keepalive delay: 2000
Network reconnect delay: 2000
Heartbeat mode: Global
Checking O2CB heartbeat: Active
77D95EF51C0149D2823674FCC162CF8B /dev/sdg1
Nodes in O2CB cluster: 92 96
CLUSTERED MOUNT WITH USERSPACE CLUSTER STACK
Configure and online the userspace stack pcmk or cman before
using tunefs.ocfs2(8) to update the cluster stack on disk.
# tunefs.ocfs2 --update-cluster-stack /dev/sdd1
Updating on-disk cluster information to match the running cluster.
DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS
FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
Update the on-disk cluster information? y
Refer to the cluster stack documentation for information on
starting and stopping the cluster stack.
FILE SYSTEM UTILITIES
This section lists the utilities that are used to manage OCFS2
file systems. This includes tools to format, tune, check, mount,
and debug the file system. Each utility has a man page that lists
its capabilities in detail.
mkfs.ocfs2(8)
This is the file system format utility. All volumes have to be
formatted prior to use. As this utility overwrites the vol‐
ume, use it with care. Double check to ensure the volume is not
in use on any node in the cluster.
As a precaution, the utility will abort if the volume is locally
mounted. It also detects whether the volume is in use across the
cluster by OCFS2. But these checks are not comprehensive and can
be overridden. So use it with care.
While it is not always required, the cluster should be online.
tunefs.ocfs2(8)
This is the file system tune utility. It allows users to change
certain on-disk parameters like label, uuid, number of node-
slots, volume size and the size of the journals. It also allows
turning on and off the file system features as listed above.
This utility requires the cluster to be online.
fsck.ocfs2(8)
This is the file system check utility. It detects and fixes on-
disk errors. All the check codes and their fixes are listed in
fsck.ocfs2.checks(8).
This utility requires the cluster to be online to ensure the
volume is not in use on another node and to prevent the volume
from being mounted for the duration of the check.
mount.ocfs2(8)
This is the file system mount utility. It is invoked indirectly
by the mount(8) utility.
This utility detects the cluster status and aborts if the clus‐
ter is offline or does not match the cluster stamped on disk.
o2cluster(8)
This is the file system cluster stack update utility. It allows
the users to update the on-disk cluster stack to the one pro‐
vided.
This utility only updates the disk if the utility is reasonably
assured that the file system is not in use on any node.
o2info(1)
This is the file system information utility. It provides infor‐
mation like the features enabled on disk, block size, cluster
size, free space fragmentation, etc.
It can be used by both privileged and non-privileged users.
Users having read permission on the device can provide the path
to the device. Other users can provide the path to a file on a
mounted file system.
debugfs.ocfs2(8)
This is the file system debug utility. It allows users to exam‐
ine all file system structures including walking directory
structures, displaying inodes, backing up files, etc., without
mounting the file system.
This utility requires the user to have read permission on the
device.
o2image(8)
This is the file system image utility. It allows users to copy
the file system metadata skeleton, including the inodes, direc‐
tories, bitmaps, etc. As it excludes data, it shrinks the size
of the file system tremendously.
The image file created can be used in debugging on-disk corrup‐
tions.
mounted.ocfs2(8)
This is the file system detect utility. It detects all OCFS2
volumes in the system and lists their label, UUID and cluster
stack.
O2CB CLUSTER STACK UTILITIES
This section lists the utilities that are used to manage the O2CB cluster
stack. Each utility has a man page that lists its capabilities in
detail.
o2cb(8)
This is the cluster configuration utility. It allows users to
update the cluster configuration by adding and removing nodes
and heartbeat regions. This utility is used by the o2cb init
script to online and offline the cluster.
This is a new utility and replaces o2cb_ctl(8) which has been
deprecated.
ocfs2_hb_ctl(8)
This is the cluster heartbeat utility. It allows users to start
and stop local heartbeat. This utility is invoked by
mount.ocfs2(8) and should not be invoked directly by the user.
o2hbmonitor(8)
This is the disk heartbeat monitor. It tracks the elapsed time
since the last heartbeat and logs warnings once that time
exceeds the warn threshold.
FILE SYSTEM NOTES
This section includes some useful notes that may prove helpful to the
user.
BALANCED CLUSTER
A cluster is a computer. This is a fact and not a slogan. What
this means is that an errant node in the cluster can affect the
behavior of other nodes. If one node is slow, the cluster opera‐
tions will slow down on all nodes. To prevent that, it is best
to have a balanced cluster. This is a cluster that has equally
powered and loaded nodes.
The standard recommendation for such clusters is to have identi‐
cal hardware and software across all the nodes. However, that is
not a hard and fast rule. After all, we have taken the effort to
ensure that OCFS2 works in a mixed architecture environment.
If one uses OCFS2 in a mixed architecture environment, try to
ensure that the nodes are equally powered and loaded. The use of
a load balancer can assist with the latter. Power refers to the
number of processors, speed, amount of memory, I/O throughput,
network bandwidth, etc. In reality, having equally powered het‐
erogeneous nodes is not always practical. In that case, make the
lower node numbers more powerful than the higher node numbers.
The O2CB cluster stack favors lower node numbers in all of its
tiebreaking logic.
This is not to suggest you should add a single core node in a
cluster of quad cores. No amount of node number juggling will
help you there.
FILE DELETION
In Linux, rm(1) removes the directory entry. It does not neces‐
sarily delete the corresponding inode. But by removing the
directory entry, it gives the illusion that the inode has been
deleted. This puzzles users when they do not see a correspond‐
ing up-tick in the reported free space. The reason is that
inode deletion has a few more hurdles to cross.
First is the hard link count, which indicates the number of
directory entries pointing to that inode. As long as an inode
has one or more directory entries pointing to it, it cannot be
deleted. The file system has to wait for the removal of all
those directory entries. In other words, wait for that count to
drop to zero.
The second hurdle is the POSIX semantics allowing files to be
unlinked even while they are in-use. In OCFS2, that translates
to in-use across the cluster. The file system has to wait for
all processes across the cluster to stop using the inode.
Once these conditions are met, the inode is deleted and the
freed space is visible after the next sync.
Now the amount of space freed depends on the allocation. Only
space that is actually allocated to that inode is freed. The
example below shows a sparsely allocated file of size 51TB of
which only 2.4GB is actually allocated.
$ ls -lsh largefile
2.4G -rw-r--r-- 1 mark mark 51T Sep 29 15:04 largefile
Furthermore, for reflinked files, only private extents are
freed. Shared extents are freed when the last inode accessing
them is deleted. The example below shows a 4GB file that shares
3GB with other reflinked files. Deleting it will increase the
free space by 1GB. However, if it is the only remaining file
accessing the shared extents, the full 4GB will be freed. (More
information on the shared-du(1) utility is provided below.)
$ shared-du -m -c --shared-size reflinkedfile
4000 (3000) reflinkedfile
The deletion itself is a multi-step process. Once the hard link
count falls to zero, the inode is moved to the orphan_dir system
directory where it remains until the last process, across the
cluster, stops using the inode. Then the file system frees the
extents and adds the freed space count to the truncate_log sys‐
tem file where it remains until the next sync. The freed space
is made visible to the user only after that sync.
DIRECTORY LISTING
ls(1) may be a simple command, but it is not cheap. What is
expensive is not the part where it reads the directory listing,
but the second part where it reads all the inodes, also referred
to as an inode stat(2). If the inodes are not in cache, this can
entail disk I/O. Now, while a cold cache inode stat(2) is
expensive in all file systems, it is especially so in a clus‐
tered file system as it needs to take a cluster lock on each
inode.
A hot cache stat(2), on the other hand, has been shown to
perform on OCFS2 like it does on EXT3.
In other words, the second ls(1) will be quicker than the first.
However, it is not guaranteed. Say you have a million files in a
file system and not enough kernel memory to cache all the
inodes. In that case, each ls(1) will involve some cold cache
stat(2)s.
ALLOCATION RESERVATION
Allocation reservation allows multiple concurrently extending
files to grow as contiguously as possible. One way to demon‐
strate its functioning is to run a script that extends multiple
files in a circular order. The script below does that by writing
one hundred 4KB chunks to four files, one after another.
$ for i in $(seq 0 99);
> do
> for j in $(seq 4);
> do
> dd if=/dev/zero of=file$j bs=4K count=1 seek=$i;
> done;
> done;
When run on a system running Linux kernel 2.6.34 or earlier, we
end up with files with 100 extents each. That is full fragmenta‐
tion. As the files are being extended one after another, the on-
disk allocations are fully interleaved.
$ filefrag file1 file2 file3 file4
file1: 100 extents found
file2: 100 extents found
file3: 100 extents found
file4: 100 extents found
When run on a system running Linux kernel 2.6.35 or later, we
see files with 7 extents each. That is a lot fewer than before.
Fewer extents mean more on-disk contiguity and that always leads
to better overall performance.
$ filefrag file1 file2 file3 file4
file1: 7 extents found
file2: 7 extents found
file3: 7 extents found
file4: 7 extents found
REFLINK OPERATION
This feature allows a user to create a writeable snapshot of a
regular file. In this operation, the file system creates a new
inode with the same extent pointers as the original inode. Mul‐
tiple inodes are thus able to share data extents. This adds a
twist in file system administration because none of the existing
file system utilities in Linux expect this behavior. du(1), a
utility used to compute file space usage, simply adds the
blocks allocated to each inode. As it does not know about shared
extents, it overestimates the space used. Say we have a 5GB
file in a volume having 42GB free.
$ ls -l
total 5120000
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:15 myfile
$ du -m myfile*
5000 myfile
$ df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sdd1 50G 8.2G 42G 17% /ocfs2
If we were to reflink it 4 times, we would expect the directory
listing to report five 5GB files, but the df(1) to report no
loss of available space. du(1), on the other hand, would report
the disk usage to climb to 25GB.
$ reflink myfile myfile-ref1
$ reflink myfile myfile-ref2
$ reflink myfile myfile-ref3
$ reflink myfile myfile-ref4
$ ls -l
total 25600000
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:15 myfile
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref1
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref2
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref3
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref4
$ df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sdd1 50G 8.2G 42G 17% /ocfs2
$ du -m myfile*
5000 myfile
5000 myfile-ref1
5000 myfile-ref2
5000 myfile-ref3
5000 myfile-ref4
25000 total
Enter shared-du(1), a shared extent-aware du. This utility
reports the shared extents per file in parentheses and the
overall footprint. As expected, it lists the overall footprint
at 5GB. One can view the details of the extents using
shared-filefrag(1). Both these utilities are available at
http://oss.oracle.com/~smushran/reflink-tools/. We are currently
in the process of pushing the changes to the upstream
maintainers of these utilities.
$ shared-du -m -c --shared-size myfile*
5000 (5000) myfile
5000 (5000) myfile-ref1
5000 (5000) myfile-ref2
5000 (5000) myfile-ref3
5000 (5000) myfile-ref4
25000 total
5000 footprint
# shared-filefrag -v myfile
Filesystem type is: 7461636f
File size of myfile is 5242880000 (1280000 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 2247937 8448
1 8448 2257921 2256384 30720
2 39168 2290177 2288640 30720
3 69888 2322433 2320896 30720
4 100608 2354689 2353152 30720
7 192768 2451457 2449920 30720
. . .
37 1073408 2032129 2030592 30720 shared
38 1104128 2064385 2062848 30720 shared
39 1134848 2096641 2095104 30720 shared
40 1165568 2128897 2127360 30720 shared
41 1196288 2161153 2159616 30720 shared
42 1227008 2193409 2191872 30720 shared
43 1257728 2225665 2224128 22272 shared,eof
myfile: 44 extents found
DATA COHERENCY
One of the challenges in a shared file system is data coherency
when multiple nodes are writing to the same set of files. NFS,
for example, provides close-to-open data coherency that results
in the data being flushed to the server when the file is closed
on the client. This leaves open a wide window for stale data
being read on another node.
A simple test to check the data coherency of a shared file sys‐
tem involves concurrently appending the same file. Like running
"uname -a >>/dir/file" using a parallel distributed shell like
dsh or pconsole. If coherent, the file will contain the results
from all nodes.
# dsh -R ssh -w node32,node33,node34,node35 "uname -a >> /ocfs2/test"
# cat /ocfs2/test
Linux node32 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Linux node35 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Linux node33 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Linux node34 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
OCFS2 is a fully cache coherent cluster file system.
DISCONTIGUOUS BLOCK GROUP
Most file systems pre-allocate space for inodes during format.
OCFS2 dynamically allocates this space when required.
However, this dynamic allocation has been problematic when the
free space is very fragmented, because the file system required
the inode and extent allocators to grow in contiguous fixed-size
chunks.
The discontiguous block group feature takes care of this problem
by allowing the allocators to grow in smaller, variable-sized
chunks.
This feature was added in Linux kernel 2.6.35 and requires
enabling on-disk feature discontig-bg.
BACKUP SUPER BLOCKS
A file system super block stores critical information that is
hard to recreate. In OCFS2, it stores the block size, cluster
size, and the locations of the root and system directories,
among other things. As this block is close to the start of the
disk, it is very susceptible to being overwritten by an errant
write. Say, dd if=file of=/dev/sda1.
Backup super blocks are copies of the super block. These blocks
are dispersed in the volume to minimize the chances of being
overwritten. On the small chance that the original gets cor‐
rupted, the backups are available to scan and fix the corrup‐
tion.
mkfs.ocfs2(8) enables this feature by default. Users can disable
this by specifying --fs-features=nobackup-super during format.
o2info(1) can be used to view whether the feature has been
enabled on a device.
# o2info --fs-features /dev/sdb1
backup-super strict-journal-super sparse extended-slotmap inline-data xattr
indexed-dirs refcount discontig-bg clusterinfo unwritten
In OCFS2, the super block is on the third block. The backups are
located at the 1G, 4G, 16G, 64G, 256G and 1T byte offsets. The
actual number of backup blocks depends on the size of the
device. The super block is not backed up on devices smaller than
1GB.
fsck.ocfs2(8) refers to these six offsets by numbers, 1 to 6.
Users can specify any backup with the -r option to recover the
volume. The example below uses the second backup. If successful,
fsck.ocfs2(8) overwrites the corrupted super block with the
backup.
# fsck.ocfs2 -f -r 2 /dev/sdb1
fsck.ocfs2 1.8.0
[RECOVER_BACKUP_SUPERBLOCK] Recover superblock information from backup block#1048576? <n> y
Checking OCFS2 filesystem in /dev/sdb1:
Label: webhome
UUID: B3E021A2A12B4D0EB08E9E986CDC7947
Number of blocks: 13107196
Block size: 4096
Number of clusters: 13107196
Cluster size: 4096
Number of slots: 8
/dev/sdb1 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
Pass 2: Checking directory entries.
Pass 3: Checking directory connectivity.
Pass 4a: checking for orphaned inodes
Pass 4b: Checking inodes link counts.
All passes succeeded.
SYNTHETIC FILE SYSTEMS
The OCFS2 development effort included two synthetic file sys‐
tems, configfs and dlmfs. It also makes use of a third, debugfs.
configfs
configfs has since been accepted as a generic kernel com‐
ponent and is also used by netconsole and fs/dlm. OCFS2
tools use it to communicate the list of nodes in the
cluster, details of the heartbeat device, cluster time‐
outs, and so on to the in-kernel node manager. The o2cb
init script mounts this file system at /sys/kernel/con‐
fig.
dlmfs dlmfs exposes the in-kernel o2dlm to the user-space.
While it was developed primarily for OCFS2 tools, it has
seen usage by others looking to add a cluster locking
dimension in their applications. Users interested in
doing the same should look at the libo2dlm library pro‐
vided by ocfs2-tools. The o2cb init script mounts this
file system at /dlm.
debugfs
OCFS2 uses debugfs to expose its in-kernel information to
user space. For example, listing the file system cluster
locks, dlm locks, dlm state, o2net state, etc. Users can
access the information by mounting the file system at
/sys/kernel/debug. To automount, add the following to
/etc/fstab: debugfs /sys/kernel/debug debugfs defaults 0
0
DISTRIBUTED LOCK MANAGER
One of the key technologies in a cluster is the lock manager,
which maintains the locking state of all resources across the
cluster. An easy implementation of a lock manager involves des‐
ignating one node to handle everything. In this model, if a node
wanted to acquire a lock, it would send the request to the lock
manager. However, this model has a weakness: the lock manager's
death causes the cluster to seize up.
A better model is one where all nodes manage a subset of the
lock resources. Each node maintains enough information for all
the lock resources it is interested in. In the event of a node
death, the remaining nodes pool the information to reconstruct
the lock state maintained by the dead node. In this
scheme, the locking overhead is distributed amongst all the
nodes. Hence, the term distributed lock manager.
O2DLM is a distributed lock manager. It is based on the specifi‐
cation titled "Programming Locking Applications" written by
Kristin Thomas and is available at the following link.
http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf
DLM DEBUGGING
O2DLM has a rich debugging infrastructure that allows it to show
the state of the lock manager, all the lock resources, among
other things. The figure below shows the dlm state of a nine-
node cluster that has just lost three nodes: 12, 32, and 35. It
can be ascertained that node 7, the recovery master, is cur‐
rently recovering node 12 and has received the lock states of
the dead node from all other live nodes.
# cat /sys/kernel/debug/o2dlm/45F81E3B6F2B48CCAAD1AE7945AB2001/dlm_state
Domain: 45F81E3B6F2B48CCAAD1AE7945AB2001 Key: 0x10748e61
Thread Pid: 24542 Node: 7 State: JOINED
Number of Joins: 1 Joining Node: 255
Domain Map: 7 31 33 34 40 50
Live Map: 7 31 33 34 40 50
Lock Resources: 48850 (439879)
MLEs: 0 (1428625)
Blocking: 0 (1066000)
Mastery: 0 (362625)
Migration: 0 (0)
Lists: Dirty=Empty Purge=Empty PendingASTs=Empty PendingBASTs=Empty
Purge Count: 0 Refs: 1
Dead Node: 12
Recovery Pid: 24543 Master: 7 State: ACTIVE
Recovery Map: 12 32 35
Recovery Node State:
7 - DONE
31 - DONE
33 - DONE
34 - DONE
40 - DONE
50 - DONE
The figure below shows the state of a dlm lock resource that is
mastered (owned) by node 25, with 6 locks in the granted queue
and node 26 holding the EX (writelock) lock on that resource.
# debugfs.ocfs2 -R "dlm_locks M000000000000000022d63c00000000" /dev/sda1
Lockres: M000000000000000022d63c00000000 Owner: 25 State: 0x0
Last Used: 0 ASTs Reserved: 0 Inflight: 0 Migration Pending: No
Refs: 8 Locks: 6 On Lists: None
Reference Map: 26 27 28 94 95
Lock-Queue Node Level Conv Cookie Refs AST BAST Pending-Action
Granted 94 NL -1 94:3169409 2 No No None
Granted 28 NL -1 28:3213591 2 No No None
Granted 27 NL -1 27:3216832 2 No No None
Granted 95 NL -1 95:3178429 2 No No None
Granted 25 NL -1 25:3513994 2 No No None
Granted 26 EX -1 26:3512906 2 No No None
The figure below shows a lock from the file system perspective.
Specifically, it shows a lock that is in the process of being
upconverted from a NL to EX. Locks in this state are
referred to in the file system as busy locks and can be listed
using the debugfs.ocfs2 command, "fs_locks -B".
# debugfs.ocfs2 -R "fs_locks -B" /dev/sda1
Lockres: M000000000000000000000b9aba12ec Mode: No Lock
Flags: Initialized Attached Busy
RO Holders: 0 EX Holders: 0
Pending Action: Convert Pending Unlock Action: None
Requested Mode: Exclusive Blocking Mode: No Lock
PR > Gets: 0 Fails: 0 Waits Total: 0us Max: 0us Avg: 0ns
EX > Gets: 1 Fails: 0 Waits Total: 544us Max: 544us Avg: 544185ns
Disk Refreshes: 1
With this debugging infrastructure in place, users can debug
hang issues as follows:
* Dump the busy fs locks for all the OCFS2 volumes on the
node with hanging processes. If no locks are found, then the
problem is not related to O2DLM.
* Dump the corresponding dlm lock for all the busy fs locks.
Note down the owner (master) of all the locks.
* Dump the dlm locks on the master node for each lock.
At this stage, one should note that the hanging node is waiting
to get an AST from the master. The master, on the other hand,
cannot send the AST until the current holder has down converted
that lock, which it will do upon receiving a Blocking AST. How‐
ever, a node can only down convert if all the lock holders have
stopped using that lock. After dumping the dlm lock on the mas‐
ter node, identify the current lock holder and dump both the dlm
and fs locks on that node.
The trick here is to see whether the Blocking AST message has
been relayed to the file system. If not, the problem is in the
dlm layer. If it has, then the most common reason would be a
lock holder, the count for which is maintained in the fs lock.
At this stage, printing the list of processes helps.
$ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN
Make a note of all D state processes. At least one of them is
responsible for the hang on the first node.
The challenge then is to figure out why those processes are
hanging. Failing that, at least get enough information (like
alt-sysrq t output) for the kernel developers to review. What
to do next depends on where the process is hanging. If it is
waiting for the I/O to complete, the problem could be anywhere
in the I/O subsystem, from the block device layer through the
drivers to the disk array. If the hang concerns a user lock
(flock(2)), the problem could be in the user’s application. A
possible solution could be to kill the holder. If the hang is
due to tight or fragmented memory, free up some memory by
killing non-essential processes.
The thing to note is that the symptom for the problem was on one
node but the cause is on another. The issue can only be resolved
on the node holding the lock. Sometimes, the best solution will
be to reset that node. Once killed, the O2DLM recovery process
will clear all locks owned by the dead node and let the cluster
continue to operate. As harsh as that sounds, at times it is the
only solution. The good news is that, by following the trail,
you now have enough information to file a bug and get the real
issue resolved.
NFS EXPORTING
OCFS2 volumes can be exported as NFS volumes. This support is
limited to NFS version 3, which translates to Linux kernel ver‐
sion 2.4 or later.
If the version of the Linux kernel on the system exporting the
volume is older than 2.6.30, then the NFS clients must mount the
volumes using the nordirplus mount option. This disables the
READDIRPLUS RPC call to work around a bug in NFSD, detailed in
the following link:
http://oss.oracle.com/pipermail/ocfs2-announce/2008-June/000025.html
Users running NFS version 2 can export the volume after having
disabled subtree checking (mount option no_subtree_check). Be
warned, disabling the check has security implications (docu‐
mented in the exports(5) man page) that users must evaluate on
their own.
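A minimal sketch of a server-side /etc/exports entry, followed
by the client mount command needed when the server kernel is
older than 2.6.30 (host name and paths hypothetical; see
exports(5) and nfs(5)):
/ocfs2 *(rw,no_subtree_check)
# mount -t nfs -o nordirplus server:/ocfs2 /mnt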
FILE SYSTEM LIMITS
OCFS2 has no intrinsic limit on the total number of files and
directories in the file system. In general, it is only limited
by the size of the device. But there is one limit imposed by the
current implementation: it can address at most 2^32 (four bil‐
lion) clusters. The maximum volume size is thus the cluster size
times 2^32. A file system with a 1MB cluster size can grow to
4PB, while a file system with a 4KB cluster size can address up
to 16TB.
SYSTEM OBJECTS
The OCFS2 file system stores its internal metadata, including
bitmaps, journals, etc., as system files. These are grouped in a
system directory. These files and directories are not accessible
via the file system interface but can be viewed using the
debugfs.ocfs2(8) tool.
To list the system directory (referred to as double-slash), do:
# debugfs.ocfs2 -R "ls -l //" /dev/sde1
66 drwxr-xr-x 10 0 0 3896 19-Jul-2011 13:36 .
66 drwxr-xr-x 10 0 0 3896 19-Jul-2011 13:36 ..
67 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 bad_blocks
68 -rw-r--r-- 1 0 0 1179648 19-Jul-2011 13:36 global_inode_alloc
69 -rw-r--r-- 1 0 0 4096 19-Jul-2011 14:35 slot_map
70 -rw-r--r-- 1 0 0 1048576 19-Jul-2011 13:36 heartbeat
71 -rw-r--r-- 1 0 0 53686960128 19-Jul-2011 13:36 global_bitmap
72 drwxr-xr-x 2 0 0 3896 25-Jul-2011 15:05 orphan_dir:0000
73 drwxr-xr-x 2 0 0 3896 19-Jul-2011 13:36 orphan_dir:0001
74 -rw-r--r-- 1 0 0 8388608 19-Jul-2011 13:36 extent_alloc:0000
75 -rw-r--r-- 1 0 0 8388608 19-Jul-2011 13:36 extent_alloc:0001
76 -rw-r--r-- 1 0 0 121634816 19-Jul-2011 13:36 inode_alloc:0000
77 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 inode_alloc:0001
78 -rw-r--r-- 1 0 0 268435456 19-Jul-2011 13:36 journal:0000
79 -rw-r--r-- 1 0 0 268435456 19-Jul-2011 13:37 journal:0001
80 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 local_alloc:0000
81 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 local_alloc:0001
82 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 truncate_log:0000
83 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 truncate_log:0001
The file names that end with numbers are slot specific and are
referred to as node-local system files. The set of node-local
files used by a node can be determined from the slot map. To
list the slot map, do:
# debugfs.ocfs2 -R "slotmap" /dev/sde1
Slot# Node#
0 32
1 35
2 40
3 31
4 34
5 33
For more information, refer to the OCFS2 support guides avail‐
able in the Documentation section at
http://oss.oracle.com/projects/ocfs2.
HEARTBEAT, QUORUM, AND FENCING
Heartbeat is an essential component in any cluster. It is
charged with accurately designating nodes as dead or alive. A
mistake here could lead to a cluster hang or a corruption.
o2hb is the disk heartbeat component of o2cb. It periodically
updates a timestamp on disk, indicating to others that this node
is alive. It also reads all the timestamps to identify other
live nodes. Other cluster components, like o2dlm and o2net, use
the o2hb service to get node up and down events.
The quorum is the group of nodes in a cluster that is allowed to
operate on the shared storage. When there is a failure in the
cluster, nodes may be split into groups that can communicate in
their groups and with the shared storage but not between groups.
o2quo determines which group is allowed to continue and initi‐
ates fencing of the other group(s).
Fencing is the act of forcefully removing a node from a cluster.
A node with OCFS2 mounted will fence itself when it realizes
that it does not have quorum in a degraded cluster. It does this
so that other nodes won’t be stuck trying to access its
resources.
o2cb uses a machine reset to fence. This is the quickest route
for the node to rejoin the cluster.
PROCESSES
[o2net]
One per node. It is a work-queue thread started when the
cluster is brought on-line and stopped when it is off-
lined. It handles network communication for all mounts.
It gets the list of active nodes from O2HB and sets up a
TCP/IP communication channel with each live node. It
sends regular keep-alive packets to detect any interrup‐
tion on the channels.
[user_dlm]
One per node. It is a work-queue thread started when
dlmfs is loaded and stopped when it is unloaded (dlmfs is
a synthetic file system that allows user space processes
to access the in-kernel dlm).
[ocfs2_wq]
One per node. It is a work-queue thread started when the
OCFS2 module is loaded and stopped when it is unloaded.
It is assigned background file system tasks that may take
cluster locks like flushing the truncate log, orphan
directory recovery and local alloc recovery. For example,
orphan directory recovery runs in the background so that
it does not affect recovery time.
[o2hb-14C29A7392]
One per heartbeat device. It is a kernel thread started
when the heartbeat region is populated in configfs and
stopped when it is removed. It writes every two seconds
to a block in the heartbeat region, indicating that this
node is alive. It also reads the region to maintain a map
of live nodes. It notifies subscribers like o2net and
o2dlm of any changes in the live node map.
[ocfs2dc]
One per mount. It is a kernel thread started when a vol‐
ume is mounted and stopped when it is unmounted. It down‐
grades locks in response to blocking ASTs (BASTs)
requested by other nodes.
[jbd2/sdf1-97]
One per mount. It is part of JBD2, which OCFS2 uses for
journaling.
[ocfs2cmt]
One per mount. It is a kernel thread started when a vol‐
ume is mounted and stopped when it is unmounted. It works
with kjournald2.
[ocfs2rec]
It is started whenever a node has to be recovered. This
thread performs file system recovery by replaying the
journal of the dead node. It is scheduled to run after
dlm recovery has completed.
[dlm_thread]
One per dlm domain. It is a kernel thread started when a
dlm domain is created and stopped when it is destroyed.
This thread sends ASTs and blocking ASTs in response to
lock level convert requests. It also frees unused lock
resources.
[dlm_reco_thread]
One per dlm domain. It is a kernel thread that handles
dlm recovery when another node dies. If this node is the
dlm recovery master, it re-masters every lock resource
owned by the dead node.
[dlm_wq]
One per dlm domain. It is a work-queue thread that o2dlm
uses to queue blocking tasks.
FUTURE WORK
File system development is a never-ending cycle. Faster and
larger disks, more and faster processors, larger caches, etc.
keep changing the sweet spot for performance, forcing developers
to rethink long-held beliefs. Add to that new use cases, which
force developers to be innovative in providing solutions that
meld seamlessly with existing semantics.
We are currently looking to add features like transparent com‐
pression, transparent encryption, delayed allocation, multi-
device support, etc. as well as work on improving performance on
newer generation machines.
If you are interested in contributing, email the development
team at ocfs2-devel@oss.oracle.com.
ACKNOWLEDGEMENTS
The principal developers of the OCFS2 file system, its tools and the
O2CB cluster stack, are Joel Becker, Zach Brown, Mark Fasheh, Jan Kara,
Kurt Hackel, Tao Ma, Sunil Mushran, Tiger Yang and Tristan Ye.
Other developers who have contributed to the file system via bug fixes,
testing, etc. are Wim Coekaerts, Srinivas Eeda, Coly Li, Jeff Mahoney,
Marcos Matsunaga, Goldwyn Rodrigues, Manish Singh and Wengang Wang.
The members of the Linux Cluster community including Andrew Beekhof,
Lars Marowsky-Bree, Fabio Massimo Di Nitto and David Teigland.
The members of the Linux File system community including Christoph
Hellwig and Chris Mason.
The corporations that have contributed resources for this project
including Oracle, SUSE Labs, EMC, Emulex, HP, IBM, Intel and Network
Appliance.
SEE ALSO
debugfs.ocfs2(8) fsck.ocfs2(8) fsck.ocfs2.checks(8) mkfs.ocfs2(8)
mount.ocfs2(8) mounted.ocfs2(8) o2cluster(8) o2image(8) o2info(1)
o2cb(7) o2cb(8) o2cb.sysconfig(5) o2hbmonitor(8) ocfs2.cluster.conf(5)
tunefs.ocfs2(8)
AUTHOR
Oracle Corporation
COPYRIGHT
Copyright © 2004, 2012 Oracle. All rights reserved.
Version 1.8.2 January 2012 OCFS2(7)