rwsplit(1)			SiLK Tool Suite			    rwsplit(1)

NAME
       rwsplit - Divide a SiLK file into a (sampled) collection of subfiles

SYNOPSIS
	 rwsplit --basename=BASENAME
	       { --ip-limit=LIMIT | --flow-limit=LIMIT
		 | --packet-limit=LIMIT | --byte-limit=LIMIT }
	       [--seed=NUMBER] [--sample-ratio=SAMPLE_RATIO]
	       [--file-ratio=FILE_RATIO] [--max-outputs=MAX_OUTPUTS]
	       [--note-add=TEXT] [--note-file-add=FILE]
	       [--compression-method=COMP_METHOD]
	       [--print-filenames] [--site-config-file=FILENAME]
	       [--xargs[=FILE] | FILE [FILES...]]

	 rwsplit --help

	 rwsplit --version

DESCRIPTION
       rwsplit reads SiLK Flow records from the standard input or from files
       named on the command line and writes the flows into a set of subfiles
       based on the splitting criterion.  In its simplest form, rwsplit
       partitions the file, meaning that each input flow will appear in one
       (and only one) of the subfiles.

       In addition to splitting the file, rwsplit can generate files
       containing sample flows.	 Sampling is specified by using the
       --sample-ratio and --file-ratio switches.

       rwsplit reads SiLK Flow records from the files named on the command
       line or from the standard input when no file names are specified and
       --xargs is not present.	To read the standard input in addition to the
       named files, use "-" or "stdin" as a file name.	If an input file name
       ends in ".gz", the file will be uncompressed as it is read.  When the
       --xargs switch is provided, rwsplit will read the names of the files to
       process from the named text file, or from the standard input if no file
       name argument is provided to the switch.	 The input to --xargs must
       contain one file name per line.

       If you wish to use the size of the output files as the splitting
       criterion, use the --flow-limit switch.  The parameter to this switch
       should be the size of the desired output files divided by the record
       size.  The record size can be determined by rwfileinfo(1).  When the
       output files are compressed (see the description of
       --compression-method below), you should assume about a 50% compression
       ratio.
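
       The arithmetic above can be sketched in shell.  This is only an
       illustration: flow_limit_for_size is a hypothetical helper, not part
       of SiLK, the record size of 50 bytes is a made-up value (take the real
       one from rwfileinfo(1)), the file header is ignored, and "a 50%
       compression ratio" is read as "the compressed file is about half the
       size of the uncompressed one".

```shell
# Hypothetical helper (not part of SiLK): estimate a --flow-limit value.
#   $1  desired size of each output file on disk, in bytes
#   $2  record size in bytes, as reported by rwfileinfo(1)
#   $3  "yes" if the output is compressed; assumes the file shrinks to
#       about half its uncompressed size, so twice as many records fit
flow_limit_for_size() {
    target_bytes=$1
    record_size=$2
    compressed=${3:-no}
    if [ "$compressed" = "yes" ]; then
        target_bytes=$((target_bytes * 2))
    fi
    echo $((target_bytes / record_size))
}

flow_limit_for_size 1000000 50        # uncompressed output
flow_limit_for_size 1000000 50 yes    # compressed output
```

       The printed number is what would be passed as --flow-limit=LIMIT.
       Since real compression varies with the data, treat the compressed
       figure as a starting point rather than a guarantee.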

OPTIONS
       Option names may be abbreviated if the abbreviation is unique or is an
       exact match for an option.  A parameter to an option may be specified
       as --arg=param or --arg param, though the first form is required for
       options that take optional parameters.

       The splitting criterion is defined using one of the limit specifiers;
       one and only one must be specified.  They are:

       --ip-limit=LIMIT
	   Close the current subfile and begin a new subfile when the count of
	   unique source and destination IPs in the current subfile meets or
	   exceeds LIMIT.  The next-hop-IP does not count toward LIMIT.

       --flow-limit=LIMIT
	   Close the current subfile and begin a new subfile when the number
	   of SiLK Flow records in the current subfile meets LIMIT.

       --packet-limit=LIMIT
	   Close the current subfile and begin a new subfile when the sum of
	   the packet counts across all SiLK Flow records in the current
	   subfile meets or exceeds LIMIT.

       --byte-limit=LIMIT
	   Close the current subfile and begin a new subfile when the sum of
	   the byte counts across all SiLK Flow records in the current subfile
	   meets or exceeds LIMIT.  This switch does not specify the size of
	   the subfiles.

       The other switches are:

       --basename=BASENAME
	   Specifies the basename of the output files; this switch is
	   required.  The flows are written sequentially to a set of subfiles
           whose names follow the format BASENAME-ORDER.rwf, where ORDER is an
           8-digit zero-formatted sequence number (i.e., 00000000, 00000001,
           and so on).  The sequence number will begin at zero and increase by
           one for every file name generated.  When --file-ratio is specified,
           only some of the generated names are written to disk, so the
           sequence numbers of the files on disk will have gaps.

       --seed=NUMBER
	   Use NUMBER to seed the pseudo-random number generator for the
	   --sample-ratio or --file-ratio switch.  This can be used to put the
	   random number generator into a known state, which is useful for
	   testing.

       --sample-ratio=SAMPLE_RATIO
	   Writes one flow record, chosen at random, from every SAMPLE_RATIO
	   flows that are read.

       --file-ratio=FILE_RATIO
           Picks one subfile, chosen at random, out of every FILE_RATIO
	   names generated, for writing to disk.

       --max-outputs=MAX_OUTPUTS
           Limits the number of files that are written to disk to MAX_OUTPUTS.

       --note-add=TEXT
	   Add the specified TEXT to the header of the output file as an
	   annotation.	This switch may be repeated to add multiple
	   annotations to a file.  To view the annotations, use the
	   rwfileinfo(1) tool.

       --note-file-add=FILENAME
	   Open FILENAME and add the contents of that file to the header of
	   the output file as an annotation.	This switch may be repeated to
	   add multiple annotations.  Currently the application makes no
	   effort to ensure that FILENAME contains text; be careful that you
	   do not attempt to add a SiLK data file as an annotation.

       --compression-method=COMP_METHOD
	   Specify how to compress the output.	When this switch is not given,
	   the output files are compressed using the default chosen when SiLK
	   was compiled.  The valid values for COMP_METHOD are determined by
	   which external libraries were found when SiLK was compiled.	To see
	   the available compression methods and the default method, use the
	   --help or --version switch.	SiLK can support the following
	   COMP_METHOD values when the required libraries are available.

	   none
	       Do not compress the output using an external library.

	   zlib
	       Use the zlib(3) library for compressing the output.  Using zlib
	       produces the smallest output files at the cost of speed.

	   lzo1x
               Use the lzo1x algorithm from the LZO real time compression
               library for compression.  This method provides good
               compression with less memory and CPU overhead.

	   best
	       Use lzo1x if available, otherwise use zlib.

       --print-filenames
	   Print to the standard error the names of input files as they are
	   opened.

       --site-config-file=FILENAME
	   Read the SiLK site configuration from the named file FILENAME.
	   When this switch is not provided, rwsplit searches for the site
	   configuration file in the locations specified in the "FILES"
	   section.

       --xargs
       --xargs=FILENAME
	   Causes rwsplit to read file names from FILENAME or from the
	   standard input if FILENAME is not provided.	The input should have
	   one file name per line.  rwsplit will open each file in turn and
	   read records from it, as if the files had been listed on the
	   command line.

       --help
	   Print the available options and exit.

       --version
	   Print the version number and information about how SiLK was
	   configured, then exit the application.

EXAMPLES
       In the following examples, the dollar sign ("$") represents the shell
       prompt.	The text after the dollar sign represents the command line.
       Lines have been wrapped for improved readability, and the backslash
       ("\") is used to indicate a wrapped line.

       Assume a source file source.rwf; to split that file into files that
       each contain about 100 unique IP addresses:

	$ rwsplit --basename=result --ip-limit=100 source.rwf

       To split source.rwf into files that each contain 100 flows:

	$ rwsplit --basename=result --flow-limit=100 source.rwf

       The following causes rwsplit to sample 1 out of every 10 records from
       source.rwf; i.e., rwsplit will read 1000 flow records to produce each
       subfile:

	$ rwsplit --basename=result --flow-limit=100 --sample-ratio=10 source.rwf
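
       The relationship between the two numbers can be checked with a small
       model that does not need rwsplit at all.  The sketch below keeps one
       randomly chosen line from every block of RATIO input lines, with a
       seed standing in for --seed.  The name sample_one_in and the
       block-by-block selection are assumptions made for illustration;
       rwsplit's actual selection method may differ.

```shell
# Illustrative model only: keep one randomly chosen line out of every
# RATIO input lines.  SEED plays the role of --seed by fixing the
# pseudo-random sequence.
sample_one_in() {
    ratio=$1
    seed=$2
    awk -v ratio="$ratio" -v seed="$seed" '
        BEGIN { srand(seed) }
        {
            # Position of this line within its block of `ratio` lines.
            pos = (NR - 1) % ratio
            # At the start of each block, choose which position to keep.
            if (pos == 0) keep = int(rand() * ratio)
            if (pos == keep) print
        }'
}

# 1000 stand-in "records" sampled at 1 in 10 leave 100 records, which is
# one subfile's worth for --flow-limit=100.
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }' |
    sample_one_in 10 42 | wc -l
```

       Running the pipeline twice with the same seed selects the same lines,
       which is the property that makes a fixed seed useful for testing.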

       When --file-ratio is specified, the file names are generated as usual
       (e.g., base-00000000, base-00000001, ...); however, one of these names
       will be chosen randomly from each set of --file-ratio candidates, and
       only that file will be written to disk.

	$ rwsplit --basename=result --flow-limit=100 --file-ratio=5 source.rwf
	$ ls
	result-00000002.rwf
	result-00000008.rwf
	result-00000013.rwf
	result-00000016.rwf
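
       The listing above has one surviving name in each run of five
       consecutive sequence numbers.  A minimal sketch of that selection,
       assuming names of the form BASENAME-ORDER.rwf as shown in the listing,
       follows.  pick_subfiles is a hypothetical helper, and how rwsplit
       treats a trailing group with fewer than FILE_RATIO names is not
       modelled.

```shell
# Illustrative model only: consider TOTAL candidate names and keep one
# randomly chosen name from each complete group of FILE_RATIO names.
#   $1 basename   $2 total candidate names   $3 file ratio   $4 seed
pick_subfiles() {
    base=$1; total=$2; file_ratio=$3; seed=$4
    awk -v base="$base" -v total="$total" \
        -v ratio="$file_ratio" -v seed="$seed" '
        BEGIN {
            srand(seed)
            # `first` is the lowest sequence number in the current group.
            for (first = 0; first + ratio <= total; first += ratio) {
                order = first + int(rand() * ratio)
                # ORDER is printed as an 8-digit zero-padded number.
                printf "%s-%08d.rwf\n", base, order
            }
        }'
}

pick_subfiles result 20 5 1
```

       With 20 candidate names and a ratio of 5 the sketch prints four
       names, one from each of the ranges 0-4, 5-9, 10-14, and 15-19.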

LIMITATIONS
       rwsplit can take exactly 1 partitioning switch per invocation.

       Partitioning is not exact: rwsplit keeps appending flow records to a
       file until it meets or exceeds the specified LIMIT.  For example, if you
       specify --ip-limit=100, then rwsplit will fill up the file until it has
       100 IP addresses in it; if the file has 99 addresses and a new record
       with 2 previously unseen addresses is received, rwsplit will put this
       in the current file, resulting in a 101-address file.  Similarly, if
       you specify --byte-limit=2000, and rwsplit receives a 10kb flow record,
       that flow record will be placed in the current subfile.
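
       The 101-address case can be reproduced with a small model of the
       "meets or exceeds" rule.  The sketch reads one "SIP DIP" pair per line
       in place of real flow records and prints, for each subfile, its
       sequence number, unique address count, and record count.  The name
       split_by_ip_limit and the text input format are assumptions made for
       illustration; this is not how rwsplit is implemented.

```shell
# Illustrative model only: close the current "subfile" once the number of
# unique addresses in it meets or exceeds LIMIT.  The check happens after
# a record has been added, so a subfile can overshoot the limit.
split_by_ip_limit() {
    awk -v limit="$1" '
        function flush() {
            printf "%08d %d %d\n", idx++, uniq, recs
            split("", seen)          # portable way to empty the array
            uniq = 0; recs = 0
        }
        {
            if (!($1 in seen)) { seen[$1] = 1; uniq++ }
            if (!($2 in seen)) { seen[$2] = 1; uniq++ }
            recs++
            if (uniq >= limit) flush()
        }
        END { if (recs > 0) flush() }'
}

# 49 records with 2 new addresses each (98), one record that adds a
# single new address (99), then one record with 2 new addresses.
{
    i=0
    while [ "$i" -lt 49 ]; do
        echo "10.0.0.$i 10.0.1.$i"
        i=$((i + 1))
    done
    echo "10.0.0.0 10.0.2.0"
    echo "10.0.3.0 10.0.3.1"
} | split_by_ip_limit 100
# prints: 00000000 101 51
```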

       The switches --sample-ratio, --file-ratio, and --max-outputs are
       processed in that order.	 So, when you specify

	$ rwsplit --sample-ratio=10 --ip-limit=100    \
	       --file-ratio=10 --max-outputs=20

       rwsplit will pick 1 out of every 10 flow records, write the picked
       records to a subfile until it has 100 IPs, pick 1 out of every 10
       subfiles to write, and write up to 20 files.  If there are 1000
       records, each with 2 unique IPs, then rwsplit will write at most 1
       file: the 100 sampled records hold 200 unique IP addresses, enough for
       2 candidate subfiles, but rwsplit may not pick either of them to
       write.
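
       The numbers in this example can be worked through step by step.  The
       sketch below only computes an upper bound on the files written,
       assuming every record carries addresses that appear nowhere else in
       the input; max_files_written is a hypothetical helper, and the random
       choices rwsplit makes can produce fewer files than the bound.

```shell
# Illustrative arithmetic only.  Arguments, in order: record count,
# unique IPs per record, --sample-ratio, --ip-limit, --file-ratio,
# --max-outputs.  Prints: sampled records, unique IPs, candidate
# subfiles, upper bound on files written.
max_files_written() {
    records=$1; ips_per_record=$2
    sample_ratio=$3; ip_limit=$4; file_ratio=$5; max_outputs=$6

    # Step 1: sampling keeps 1 record in every sample_ratio.
    sampled=$((records / sample_ratio))
    unique_ips=$((sampled * ips_per_record))
    # Step 2: the IP limit groups the sampled records into candidates.
    candidates=$(( (unique_ips + ip_limit - 1) / ip_limit ))
    # Step 3: at most one candidate per group of file_ratio is written.
    groups=$(( (candidates + file_ratio - 1) / file_ratio ))
    # Step 4: the output cap applies last.
    if [ "$groups" -gt "$max_outputs" ]; then
        groups=$max_outputs
    fi
    echo "$sampled $unique_ips $candidates $groups"
}

max_files_written 1000 2 10 100 10 20
# prints: 100 200 2 1
```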

ENVIRONMENT
       SILK_CLOBBER
	   The SiLK tools normally refuse to overwrite existing files.
	   Setting SILK_CLOBBER to a non-empty value removes this restriction.

       SILK_CONFIG_FILE
	   This environment variable is used as the value for the
           --site-config-file switch when that switch is not provided.

       SILK_DATA_ROOTDIR
           This environment variable specifies the root directory of the
           data repository.  As described in the "FILES" section, rwsplit may
           use this environment variable when searching for the SiLK site
           configuration file.

       SILK_PATH
	   This environment variable gives the root of the install tree.  When
	   searching for configuration files, rwsplit may use this environment
	   variable.  See the "FILES" section for details.

FILES
       ${SILK_CONFIG_FILE}
       ${SILK_DATA_ROOTDIR}/silk.conf
       /data/silk.conf
       ${SILK_PATH}/share/silk/silk.conf
       ${SILK_PATH}/share/silk.conf
       /usr/local/share/silk/silk.conf
       /usr/local/share/silk.conf
	   Possible locations for the SiLK site configuration file which are
	   checked when the --site-config-file switch is not provided.
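
       To see which of these locations exist on a given system, the list can
       be expanded in shell.  The sketch below is not part of SiLK:
       silk_conf_candidates is a hypothetical helper, and it assumes that the
       locations are tried in the order listed above and that entries whose
       environment variable is unset or empty are skipped.  The page itself
       states neither.

```shell
# Hypothetical helper (not part of SiLK): print the candidate locations
# in the order they are listed above, skipping entries whose environment
# variable is unset or empty.
silk_conf_candidates() {
    if [ -n "${SILK_CONFIG_FILE:-}" ]; then
        echo "$SILK_CONFIG_FILE"
    fi
    if [ -n "${SILK_DATA_ROOTDIR:-}" ]; then
        echo "$SILK_DATA_ROOTDIR/silk.conf"
    fi
    echo "/data/silk.conf"
    if [ -n "${SILK_PATH:-}" ]; then
        echo "$SILK_PATH/share/silk/silk.conf"
        echo "$SILK_PATH/share/silk.conf"
    fi
    echo "/usr/local/share/silk/silk.conf"
    echo "/usr/local/share/silk.conf"
}

# Report the first candidate that exists as a regular file, if any.
silk_conf_candidates | while IFS= read -r candidate; do
    if [ -f "$candidate" ]; then
        echo "$candidate"
        break
    fi
done
```

       If the loop prints nothing, none of the candidate files exists and the
       location must be supplied with --site-config-file.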

SEE ALSO
       rwfileinfo(1), silk(7), zlib(3)

SiLK 3.11.0.1			  2016-02-19			    rwsplit(1)