GLIMPSEINDEX(l)
NAME
glimpseindex 4.1 - index whole file systems to be searched by glimpse
OVERVIEW
Glimpse (which stands for GLobal IMPlicit SEarch) is a popular UNIX
indexing and query system that allows you to search through a large set
of files very quickly. Glimpseindex is the indexing program for
glimpse. Glimpse supports most of agrep's options (agrep is our power‐
ful version of grep) including approximate matching (e.g., finding mis‐
spelled words), Boolean queries, and even some limited forms of regular
expressions. It is used in the same way, except that you don't have to
specify file names. So, if you are looking for a needle anywhere in
your file system, all you have to do is say glimpse needle and all
lines containing needle will appear preceded by the file name. See man
glimpse for details on how to use glimpse.
Glimpseindex provides three indexing options: a tiny index (2-3% of the
total size of all files), a small index (7-8%) and a medium-size index
(20-30%). Search times are normally better with larger indexes
(although unless files are quite large, the small index is just about
as good as the medium one). To index all your files, you say
glimpseindex ~ for a tiny index (where ~ stands for the home directory),
glimpseindex -o ~ for a small index, and glimpseindex -b ~ for medium.
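To get a feel for these percentages, the following sketch turns a total file size into approximate index sizes. It is only an illustration: the function name estimate_index_mb is hypothetical, and the percentages are the rough figures quoted above, not guarantees.

```shell
#!/bin/sh
# Hypothetical helper: given the total size of the files in MB ($1), print the
# approximate index size range for each option, using the percentages quoted
# above (integer arithmetic, so results are rounded down).
estimate_index_mb() {
    total=$1
    echo "tiny (default): $(( total * 2 / 100 ))-$(( total * 3 / 100 )) MB"
    echo "small (-o): $(( total * 7 / 100 ))-$(( total * 8 / 100 )) MB"
    echo "medium (-b): $(( total * 20 / 100 ))-$(( total * 30 / 100 )) MB"
}
```

For example, estimate_index_mb 200 reports roughly 4-6 MB for the tiny index, 14-16 MB for the small index, and 40-60 MB for the medium index.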
Mail glimpse-request@cs.arizona.edu to be added to the glimpse mailing
list. Mail glimpse@cs.arizona.edu to report bugs, ask questions, dis‐
cuss tricks for using glimpse, etc. (this is a moderated mailing list
with very little traffic, mostly announcements). An HTML version of
these manual pages can be found at
http://glimpse.cs.arizona.edu/glimpseindexhelp.html and the glimpse
home pages are at http://glimpse.cs.arizona.edu/
SYNOPSIS
glimpseindex [ -abEfFiInostT -w number -dD filename(s) -H directory -M
number -S number ] directory_name[s]
INTRODUCTION
Glimpseindex builds an index of all text files in all the directories
specified and all their subdirectories (recursively). It is also pos‐
sible to build several separate indexes (possibly even overlapping).
The simplest way to index your files is to say
glimpseindex -o ~
The index consists of several files (described in detail below), all
with the prefix .glimpse_ stored in the user's home directory (unless
otherwise specified with the -H option). Files with one of the follow‐
ing suffixes are not indexed: ".o", ".gz", ".Z", ".z", ".hqx", ".zip",
".tar". (Unless the -z option is used, see below.) In addition,
glimpseindex attempts to determine whether a file is a text file and
does not index files that it thinks are not text files. Numbers are
not indexed unless the -n option is used. It is possible to prevent
specified files from being indexed by adding their names to the
.glimpse_exclude file (described below). The -o option builds a larger
index than without it (typically about 7-8% vs. 2-3% without -o),
allowing for a faster search (1-5 times faster). The -b option builds an
even larger index and allows an even faster search some of the time (-b
is helpful mostly when large files are present). There is an incremental
indexing option -f, which updates an existing index by determining
which files have been created or modified since the index was built and
adding them to the index (see -f). Glimpseindex is reasonably fast,
taking about 20 minutes to index 15,000 files of about 200MB (on a DEC
Alpha 233) and 2-4 minutes to update an existing index. (Your mileage
may vary.) It is also possible to extend the index by adding a
specific file (the -a option).
Once an index is built, searching for a pattern is as easy as saying
glimpse pattern
(See man glimpse for all glimpse's options and features.)
A DETAILED DESCRIPTION OF GLIMPSEINDEX
Glimpse does not automatically index files. You have to tell it to do
it. This can be done manually, but a better way is to set it to run
every night. It is probably a good idea to run glimpseindex manually
for the first time to be sure it works properly. The following is a
simple script to run glimpseindex every night. We assume that this
script is stored in a file called glimpse.script:
glimpseindex -o -t -w 5000 ~ >& .glimpse_out
at -m 0300 glimpse.script
(It might be interesting to collect all the outputs of glimpseindex by
changing >& to >>& so that the file .glimpse_out maintains a history.
In this case the file must be created before the first time >>& is
used. The >& and >>& forms are csh syntax; if you use sh or ksh, write
'> .glimpse_out 2>&1' or '>> .glimpse_out 2>&1' instead.)
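For Bourne-style shells, the nightly run described above might be sketched as follows. This is a minimal sketch, not part of the glimpse distribution: the function name run_nightly_index, the GLIMPSEINDEX variable, and the log layout are assumptions made for illustration; only the glimpseindex options come from the script above.

```shell
#!/bin/sh
# Hypothetical wrapper (not shipped with glimpse): run glimpseindex with the
# options from the script above and append all output to a log file.
# GLIMPSEINDEX can be overridden, e.g. with "echo" for a dry run.
GLIMPSEINDEX=${GLIMPSEINDEX:-glimpseindex}

run_nightly_index() {
    dir=$1    # directory to index
    log=$2    # log file; ">>" creates it if it does not exist yet
    {
        echo "=== index run: $(date) ==="
        "$GLIMPSEINDEX" -o -t -w 5000 "$dir"
        echo "=== exit status: $? ==="
    } >> "$log" 2>&1
}
```

A call such as run_nightly_index "$HOME" "$HOME/.glimpse_out" can then be scheduled with at or cron. Appending with >> keeps a history of runs without requiring the log file to exist beforehand.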
Glimpseindex stores the names of all the files that it indexed in the
file .glimpse_filenames. Each file is listed by its full path name as
obtained at the time the files were indexed. For example,
/usr1/udi/file1. Glimpse uses this full name when it performs the
search, so the name must match the current name. This may become a
problem when the indexing and the search are done from different
machines (e.g., through NFS), which may cause the path names to be dif‐
ferent. For example, /tmp_mnt/R/xxx/xxx/usr1/udi/file1. (The same is
true for several other .glimpse files. See below.)
Glimpseindex does not follow symbolic links unless they are explicitly
included in the .glimpse_include file (described below).
Glimpseindex makes an effort to identify non-text files such as binary
files, compressed files, uuencoded files, postscript files, binhex
files, etc. These files are automatically not indexed. In addition,
all files whose names end with `.o', `.gz', `.Z', `.z', `.hqx', `.zip',
or `.tar' will not be indexed (unless they are specifically included in
.glimpse_include - see below).
The options for glimpseindex are as follows:
-a adds the given file[s] and/or directories to an existing index.
Any given directory will be traversed recursively and all files
will be indexed (unless they appear in .glimpse_exclude; see
below). Using this option is generally much faster than index‐
ing everything from scratch, although in rare cases the index
may not be as good. If for some reason the index is full (which
can happen unless -o or -b is used), glimpseindex -a will produce
an error message and will exit without changing the original
index.
-b builds a medium-size index (20-30% of the size of all files),
allowing faster search. This option forces glimpseindex to
store an exact (byte level) pointer to each occurrence of each
word (except for some very common words belonging to the stop
list).
-B uses a hash table that is 4 times bigger (256K entries instead
of 64K) to speed up indexing. The memory usage will increase
typically by about 2 MB. This option is only for indexing
speed; it does not affect the final index.
-d filename(s)
deletes the given file(s) from the index.
-D filename(s)
deletes the given file(s) from the list of file names, but not
from the index. This is much faster than -d, and the file(s)
will not be found by glimpse. However, the index itself will
not become smaller.
-E does not run a check on file types. Glimpse normally attempts
to exclude non-text files, but this attempt is not always per‐
fect. With -E, glimpseindex indexes all files, except those
that are specifically excluded in .glimpse_exclude and those
whose file names end with one of the excluded suffixes.
-f incremental indexing. glimpseindex scans all files and adds to
the index only those files that were created or modified after
the current index was built. If there is no current index or if
this procedure fails, glimpseindex automatically reverts to the
default mode (which is to index everything from scratch). This
option may create an inefficient index for several reasons, one
of which is that deleted files are not really deleted from the
index. Unless changes are small, mostly additions, and -o is
used, we suggest using the default mode as much as possible.
-F Glimpseindex receives the list of files to index from standard
input.
-H directory
Put or update the index and all other .glimpse files (listed
below) in "directory". The default is the home directory. When
glimpse is run, the -H option must be used to direct glimpse to
this directory, because glimpse assumes that the index is in the
home directory (see also the -H option in glimpse).
-i Make .glimpse_include (see GLIMPSEINDEX FILES) take precedence
over .glimpse_exclude, so that, for example, one can exclude
everything (by putting *) and then explicitly include files.
-I Instead of indexing, only show (print to standard output) the list
of files that would be indexed. It is useful for filtering
purposes. ("glimpseindex -I dir | glimpseindex -F" is the same as
"glimpseindex dir".)
-M x Tells glimpseindex to use x MB of memory for temporary tables.
The more memory you allow the faster glimpseindex will run. The
default is x=2. The value of x must be a positive integer.
Glimpseindex will need more memory than x for other things, and
glimpseindex may perform some 'forks', so you'll have to experi‐
ment if you want to use this option. WARNING: If x is too large
you may run out of swap space.
-n Index numbers as well as text. The default is not to index num‐
bers. This is useful when searching for dates or other identi‐
fying numbers, but it may make the index very large if there are
lots of numbers. In general, glimpseindex strips away any non-
alphabetic character. For example, the string abc123 will be
indexed as abc if the -n option is not used and as abc123 if it
is used. Glimpse provides warnings (in .glimpse_messages) for
all files in which more than half the words that were added to
the index from that file had digits in them (this is an attempt
to identify data files that should probably not be indexed).
One can use the .glimpse_exclude file to exclude data files or
any other files. (See GLIMPSEINDEX FILES.)
-o Build a small index rather than tiny (meaning 7-9% of the sizes
of all files - your mileage may vary) allowing faster search.
This option forces glimpseindex to allocate one block per file
(a block usually contains many files). A detailed explanation
of how blocks affect glimpse can be found in the glimpse arti‐
cle. (See also LIMITATIONS.)
-R Recompute .glimpse_filenames_index from .glimpse_filenames. The
file .glimpse_filenames_index speeds up processing. Glimpseindex
usually computes it automatically. However, if for some
reason one wants to change the path names of the files listed in
.glimpse_filenames, then running glimpseindex -R recomputes
.glimpse_filenames_index. This is useful if the index is computed
on one machine, but is used on another (with the same
hierarchy). The names of the files listed in .glimpse_filenames
are used at run time, so changing them can be done at any time in
any way (as long as only the names, not the content, are changed).
This is not really an option in the regular sense; rather, it
is a program by itself, and it is meant as a post-processing
step. (Available only from version 3.6.)
-s supports structured queries. This option was added to support
the Harvest project and it is applicable mostly in that context.
See STRUCTURED QUERIES below for more information and also
http://harvest.transarc.com for more information about the Har‐
vest project.
-S k The number k determines the size of the stop-list. The stop-
list consists of words that are too common and are not indexed
(e.g., 'the' or 'and'). Instead of having a fixed stop-list,
glimpseindex figures out the words that are too common for every
index separately. The rules are different for the different
indexing options. The tiny index contains all words (the savings
from a stop-list are too small to bother). For the small index
(-o), the number k is a percentage threshold. A word will be in
the stop list if it appears in at least k% of all files. The
default value is 80%. (If there are fewer than 256 files, then
the stop-list is not maintained.) The medium index (-b) counts
all occurrences of all words, and a word is added to the stop-
list if it appears at least k times per MByte. The default
value is 500. A query that includes a stop list word is of
course less efficient. (See also LIMITATIONS below.)
-t (A new option in version 3.5.) The order in which files are
indexed is determined by scanning the directories, which is
mostly arbitrary. With the -t option, combined with either -o
or -b, the indexed files are stored in order of modification
age (youngest files first). Results of queries are then
automatically returned in this order. Furthermore, glimpse can
filter results by age; for example, asking to look at only files
that are at most 5 days old.
-T builds the turbo file. Starting at version 3.0, this is the
default, so using this option has no effect.
-w k Glimpseindex does a reasonable, but not a perfect, job of deter‐
mining which files should not be indexed. Sometimes a large
text file should not be indexed; for example, a dictionary may
match most queries. The -w option stores in a file called
.glimpse_messages (in the same directory as the index) the list
of all files that contribute at least k new words to the index.
The user can look at this list of files and decide which should
or should not be indexed. The file .glimpse_exclude contains
files that will not be indexed (see more below). We recommend
setting k to about 1000. This is not an exact measure. For
example, if the same file appears twice, then the second copy
will not contribute any new words to the dictionary (but if you
exclude the first copy and index again, the second copy will
contribute).
-X (starting at version 4.0B1) Extract titles from HTML pages and
add the titles to the index (in .glimpse_filenames). (This fea‐
ture was added to improve the performance of WebGlimpse.) Works
only on files whose names end with .html, .htm, .shtml, and
.shtm. (see glimpse.h/EXTRACT_INFO_SUFFIX to add to these suf‐
fixes.) The routine to extract titles is called extract_info,
in index/filetype.c. This feature can be modified in various
ways to extract info from many filetypes. The titles are
appended to the corresponding filenames with a space separator.
Glimpseindex assumes that filenames don't have spaces in them.
-z Allow customizable filtering, using the file .glimpse_filters to
run the programs listed there on each matching file. The best
example is compress/decompress. If .glimpse_filters includes the
line
*.Z uncompress <
(separated by tabs) then before indexing any file that matches
the pattern "*.Z" (same syntax as the one for .glimpse_exclude)
the command listed is executed first (assuming input is from
stdin, which is why uncompress needs <) and its output (assuming
it goes to stdout) is indexed. The file itself is not changed
(i.e., it stays compressed). Then if glimpse -z is used, the
same program is used on these files on the fly. Any program can
be used (we run 'exec'). For example, one can filter out parts
of files that should not be indexed. Glimpseindex tries to
apply all filters in .glimpse_filters in the order they are
given. For example, if you want to uncompress a file and then
extract some part of it, put the uncompress command (the example
above) first and then another line that specifies the
extraction. Note that this can slow down the search because the
filters need to be run before files are searched.
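Because the fields in .glimpse_filters are separated by tabs, which are easy to lose when editing by hand, one way to create the file is with printf. The following is a minimal sketch; the function name write_filters and its directory argument are hypothetical, and only the *.Z line comes from the -z description above.

```shell
#!/bin/sh
# Hypothetical helper: write a .glimpse_filters file into the index directory
# given as $1, with a literal tab (\t) between the pattern and the command.
write_filters() {
    index_dir=$1
    # pattern <TAB> command; the command reads the matching file on stdin
    printf '%s\t%s\n' '*.Z' 'uncompress <' > "$index_dir/.glimpse_filters"
}
```

After write_filters "$HOME", running glimpseindex with -z would apply the filter during indexing, and glimpse -z would apply it again at search time, as described for the -z option.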
GLIMPSEINDEX FILES
All files used by glimpse are located at the directory(ies) where the
index(es) is (are) stored and have .glimpse_ as a prefix. The first
two files (.glimpse_exclude and .glimpse_include) are optionally sup‐
plied by the user. The other files are built and read by glimpse.
.glimpse_exclude
contains a list of files that glimpseindex is explicitly told to
ignore. In general, the syntax of .glimpse_exclude/include is
the same as that of agrep (or any other grep). The lines in the
.glimpse_exclude file are matched to the file names, and if they
match, the files are excluded. Notice that agrep matches
parts of the string; e.g., agrep /ftp/pub will match
/home/ftp/pub and /ftp/pub/whatever. So, if you want to exclude
/ftp/pub/core, you just list it, as is, in the .glimpse_exclude
file. If you put "/home/ftp/pub/cdrom" in .glimpse_exclude,
every file name that matches that string will be excluded, mean‐
ing all files below it. You can use ^ to indicate the beginning
of a file name, and $ to indicate the end of one, and you can
use * and ? in the usual way. For example /ftp/*html will
exclude /ftp/pub/foo.html, but will also exclude
/home/ftp/pub/html/whatever; if you want to exclude files that
start with /ftp and end with html use ^/ftp*html$. Notice that
putting a * at the beginning or at the end is redundant (in
fact, in this case glimpseindex will remove the * when it does
the indexing). No other meta characters are allowed in
.glimpse_exclude (e.g., don't use .* or # or |). Lines with *
or ? must have no more than 30 characters. Notice that,
although the index itself will not be indexed, the list of file
names (.glimpse_filenames) will be indexed unless it is explic‐
itly listed in .glimpse_exclude.
.glimpse_filters
See the description above for the -z option.
.glimpse_include
contains a list of files that glimpseindex is explicitly told to
include in the index even though they may look like non-text
files. Symbolic links are followed by glimpseindex only if they
are specifically included here. The syntax is the same as the
one for .glimpse_exclude (see there). If a file is in both
.glimpse_exclude and .glimpse_include it will be excluded unless
-i is used.
.glimpse_filenames
contains the list of all indexed file names, one per line. This
is an ASCII file that can also be used with agrep to search for
a file name leading to a fast find command. For example,
glimpse 'count#\.c$' ~/.glimpse_filenames
will output the names of all (indexed) .c files that have
'count' in their name (including anywhere on the path from the
index). Setting the following alias in the .login file may be
useful:
alias findfile 'glimpse -h :1 ~/.glimpse_filenames'
.glimpse_index
contains the index. The index consists of lines, each starting
with a word followed by a list of block numbers (unless the -o
or -b options are used, in which case each word is followed by
an offset into the file .glimpse_partitions where all pointers
are kept). The block/file numbers are stored in binary form, so
this is not an ASCII file.
.glimpse_messages
contains the output of the -w option (see above).
.glimpse_partitions
contains the partition of the indexed space into blocks and,
when the index is built with the -o or -b options, some part of
the index. This file is used internally by glimpse and it is a
non-ASCII file.
.glimpse_statistics
contains some statistics about the makeup of the index. Useful
for some advanced applications and customization of glimpse.
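The 30-character limit on lines containing * or ? in .glimpse_exclude and .glimpse_include is easy to overlook. The following is a small sketch of a check that could be run before indexing; the function name check_wildcard_lines is hypothetical, and the check is not something glimpseindex provides itself.

```shell
#!/bin/sh
# Hypothetical check: print every line of the file $1 that contains * or ?
# and is longer than 30 characters, prefixed by its line number.
# Returns non-zero if any such line exists.
check_wildcard_lines() {
    awk '
        (index($0, "*") || index($0, "?")) && length($0) > 30 {
            printf "%d: %s\n", NR, $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Running check_wildcard_lines ~/.glimpse_exclude lists the offending lines, if any. Lines without wildcards are ignored because the limit applies only to lines with * or ?.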
STRUCTURED QUERIES
Glimpse can search for Boolean combinations of "attribute=value" terms
by using the Harvest SOIF parser library (in glimpse/libtemplate). To
search this way, the index must be made by using the -s option of
glimpseindex (this can be used in conjunction with other glimpseindex
options). For glimpse and glimpseindex to recognize "structured" files,
they must be in SOIF format. In this format, each value is prefixed by
an attribute-name with the size of the value (in bytes) present in "{}"
after the name of the attribute. For example, the following lines are
part of an SOIF file:
type{17}: Directory-Listing
md5{32}: 3858c73d68616df0ed58a44d306b12ba
Any string can serve as an attribute name. The query glimpse
"pattern;type=Directory-Listing" will search for "pattern" only in files
whose type is "Directory-Listing". The file itself is considered to be
one "object" and its name/url appears as the first attribute with an
"@" prefix; e.g., @FILE { http://xxx... }. The scope of Boolean
operations changes from records (lines) to whole files when structured
queries are used in glimpse (since individual query terms can look at
different attributes and they may not be "covered" by the record/line).
Note that glimpse can only search for patterns in the value parts of
the SOIF file: there are some attributes (like the TTL, MD5, etc.) that
are interpreted by Harvest's internal routines. See
http://harvest.cs.colorado.edu/harvest/user-manual/ for more detailed
information on the SOIF format.
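The byte count in braces must match the length of the value. As an illustration, the following sketch formats one attribute line in the style of the example above. The function name soif_attr is hypothetical, and the single space after the colon simply mirrors the example as printed here; consult the Harvest manual for the exact separator.

```shell
#!/bin/sh
# Hypothetical helper: print "name{size}: value", where size is the length of
# the value in bytes (wc -c counts bytes, not characters).
soif_attr() {
    name=$1
    value=$2
    # $(( ... )) strips the padding some wc implementations add
    size=$(( $(printf '%s' "$value" | wc -c) ))
    printf '%s{%d}: %s\n' "$name" "$size" "$value"
}
```

With this sketch, soif_attr type Directory-Listing prints type{17}: Directory-Listing, matching the first line of the example.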
REFERENCES
1. U. Manber and S. Wu, "GLIMPSE: A Tool to Search Through Entire
File Systems," Usenix Winter 1994 Technical Conference (best
paper award), San Francisco (January 1994), pp. 23-32. Also,
Technical Report #TR 93-34, Dept. of Computer Science, University
of Arizona, October 1993 (a postscript file is available by
anonymous ftp at ftp://ftp.cs.arizona.edu/reports/1993/TR93-34.ps).
2. S. Wu and U. Manber, "Fast Text Searching Allowing Errors," Com‐
munications of the ACM 35 (October 1992), pp. 83-91.
SEE ALSO
agrep(1), ed(1), ex(1), glimpse(1), glimpseserver(1), grep(1V), sh(1),
csh(1).
LIMITATIONS
The index of glimpse is word based. A pattern that contains more than
one word cannot be found in the index. The way glimpse overcomes this
weakness is by splitting any multi-word pattern into its set of words
and looking for all of them in the index. For example, glimpse 'linear
programming' will first consult the index to find all files containing
both linear and programming, and then apply agrep to find the combined
pattern. This is usually an effective solution, but it can be slow for
cases where both words are very common, but their combination is not.
The index of glimpse stores all patterns in lower case. When glimpse
searches the index it first converts all patterns to lower case, finds
the appropriate files, and then searches the actual files using the
original patterns. So, for example, glimpse ABCXYZ will first find all
files containing abcxyz in any combination of lower and upper cases,
and then search these files directly, so only the right cases will be
found. One problem with this approach is discovering misspellings that
are caused by wrong cases. For example, glimpse -B abcXYZ will first
search the index for the best match to abcxyz (because the pattern is
converted to lower case); it will find that there are matches with no
errors, and will go to those files to search them directly, this time
with the original upper cases. If the closest match is, say AbcXYZ,
glimpse may miss it, because it doesn't expect an error. Another prob‐
lem is speed. If you search for "ATT", it will look at the index for
"att". Unless you use -w to match the whole word, glimpse may have to
search all files containing, for example, "Seattle" which has "att" in
it.
There is no size limit for simple patterns and simple patterns with
Boolean AND or OR. More complicated patterns are currently limited to
approximately 30 characters. Lines are limited to 1024 characters.
Records are limited to 48K, and may be truncated if they are larger
than that. The limit of record length can be changed by modifying the
parameter Max_record in agrep.h.
Each line in .glimpse_exclude or .glimpse_include that contains a * or
a ? must not exceed 30 characters in length.
Glimpseindex does not index words longer than 64 characters.
A medium-size index (-b) may actually lead to slower query times if the
files are all very small.
Under -b, it may be impossible to make the stop list empty.
Glimpseindex uses the "sort" routine, and all occurrences of a word
appear at some point on one line. Sort limits the size of lines it can
handle (the value depends on the platform; ours is 16KB). If the lines
are too big, the word is added to the stop list.
BUGS
Please send bug reports or comments to glimpse@cs.arizona.edu.
DIAGNOSTICS
(Only in version 3.6 and above.)
exit status 0: terminated normally;
exit status 1: glimpseindex errors (e.g., bad option combos, no files
were indexed, etc.)
exit status 2: system errors (e.g., write failed, sort failed, malloc
failed).
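A calling script can branch on these exit codes. A minimal sketch follows; the function name describe_index_status and its messages are made up for illustration, while the three status values are the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper: map a glimpseindex exit status (version 3.6 and above)
# to a short message, using the three values documented above.
describe_index_status() {
    case $1 in
        0) echo "terminated normally" ;;
        1) echo "glimpseindex error (e.g., bad options, no files indexed)" ;;
        2) echo "system error (e.g., write, sort, or malloc failed)" ;;
        *) echo "undocumented exit status: $1" ;;
    esac
}
```

A typical use is to run glimpseindex and then call describe_index_status $? so that a nightly log records why a run stopped.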
AUTHORS
Udi Manber and Burra Gopal, Department of Computer Science, University
of Arizona, and Sun Wu, the National Chung-Cheng University, Taiwan.
(Email: glimpse@cs.arizona.edu)
November 10, 1997 GLIMPSEINDEX(l)