MAILTOE(1)							    MAILTOE(1)

NAME
       mailtoe - a train-on-error simulator for use with dbacl.

SYNOPSIS
       mailtoe command [ command_arguments ]

DESCRIPTION
       mailtoe	automates  the task of testing email filtering and classifica‐
       tion programs such as dbacl(1).	Given a set of categorized  documents,
       mailtoe	initiates  test runs to estimate the classification errors and
       thereby permit fine tuning of the parameters of the classifier.

       Train-on-error (TOE) is a learning method that is sometimes advocated
       for email classifiers. Given an incoming email stream, the method
       consists of reusing a fixed set of category databases until the first
       misclassification occurs. At that point, the offending email is used
       to relearn the relevant category, and the updated databases are reused
       until the next misclassification. In this way, categories are only
       updated when errors occur. This directly models the way that some
       email classifiers are used in practice.
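
       The TOE loop described above can be sketched in a few lines of
       Python. This is only an illustration of the idea, not mailtoe's
       own code: ToyClassifier, train_on_error and the word-overlap
       scoring are hypothetical names and choices made for this example.

```python
from typing import Callable, Iterable, List, Tuple


class ToyClassifier:
    """Hypothetical word-overlap classifier, for illustration only."""

    def __init__(self, categories: List[str]) -> None:
        self.words = {name: set() for name in categories}

    def classify(self, email: str) -> str:
        tokens = set(email.split())
        # Highest word overlap wins; ties go to the first category listed.
        return max(self.words, key=lambda name: len(tokens & self.words[name]))

    def learn(self, category: str, email: str) -> None:
        self.words[category].update(email.split())


def train_on_error(
    stream: Iterable[Tuple[str, str]],
    classify: Callable[[str], str],
    learn: Callable[[str, str], None],
) -> List[bool]:
    """Run one TOE pass and return one error flag per email."""
    errors = []
    for true_category, email in stream:
        wrong = classify(email) != true_category
        if wrong:
            # Categories are relearned only when a misclassification occurs.
            learn(true_category, email)
        errors.append(wrong)
    return errors


toy = ToyClassifier(["spam", "work"])
stream = [("spam", "buy pills"), ("work", "meeting today"),
          ("spam", "buy now"), ("work", "meeting notes")]
errors = train_on_error(stream, toy.classify, toy.learn)
print(sum(errors), "misclassifications out of", len(errors))
```

       Note that the emails classified correctly never reach learn(), so
       the outcome depends on the order in which the stream is presented.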

       TOE's error rates depend directly on the order in which emails are
       seen. A small change in ordering, as might happen due to networking
       delays, can have a large impact on the number of misclassifications.
       Consequently, mailtoe does not give meaningful results unless the
       sample emails are chosen carefully. However, as this method is
       commonly used by spam filters, the error rate is still worth computing
       to allow comparisons. Other methods (see mailcross(1), mailfoot(1))
       attempt to capture the behaviour of classification errors in other
       ways.

       To  improve  and stabilize the error rate calculation, mailtoe performs
       the TOE simulations several times on slightly reordered email  streams,
       and  averages  the  results.  The reorderings occur by multiplexing the
       emails from each category mailbox in random order. Thus	if  there  are
       three  categories,  the	first email classified is chosen randomly from
       the front of the sample email streams of each type. The second email
       is also chosen randomly among the three types, from the front of the
       streams after the first email was removed. Simulation stops when all
       sample streams are exhausted.
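
       A minimal Python sketch of this multiplexing step follows. It
       assumes that each draw picks uniformly among the categories which
       still have emails left; the man page does not say how the random
       choice is weighted, so that detail is a guess. The function names
       multiplex and average_error_rate are hypothetical.

```python
import random
from collections import deque
from typing import Dict, List, Optional, Tuple


def multiplex(
    streams: Dict[str, List[str]], rng: Optional[random.Random] = None
) -> List[Tuple[str, str]]:
    """Merge per-category streams, always taking from the front of one."""
    if rng is None:
        rng = random.Random()
    queues = {name: deque(emails) for name, emails in streams.items() if emails}
    merged = []
    while queues:
        # Assumption: uniform choice among the non-empty categories.
        name = rng.choice(list(queues))
        merged.append((name, queues[name].popleft()))
        if not queues[name]:
            del queues[name]
    return merged


def average_error_rate(error_flags_per_run: List[List[bool]]) -> float:
    """Average the per-run error rates over all non-empty runs."""
    rates = [sum(flags) / len(flags) for flags in error_flags_per_run if flags]
    return sum(rates) / len(rates) if rates else 0.0


streams = {"spam": ["s1", "s2"], "work": ["w1"], "play": ["p1", "p2", "p3"]}
merged = multiplex(streams, random.Random(0))
```

       Whatever the random draws, the relative order of the emails within
       each category is preserved in the merged stream.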

       mailtoe uses the environment variable  MAILTOE_FILTER  when  executing,
       which  permits the simulation of arbitrary filters, provided these sat‐
       isfy the compatibility conditions stated	 in  the  ENVIRONMENT  section
       below.

       For  convenience,  mailtoe implements a testsuite framework with prede‐
       fined wrappers for several open source classifiers.  This  permits  the
       direct  comparison  of  dbacl(1) with competing classifiers on the same
       set of email samples. See the USAGE section below.

       During preparation, mailtoe builds a subdirectory  named	 mailtoe.d  in
       the  current  working directory.	 All needed calculations are performed
       inside this subdirectory.

EXIT STATUS
       mailtoe returns 0 on success, 1 if a problem occurred.

COMMANDS
       prepare size
	      Prepares a subdirectory named mailtoe.d in the  current  working
	      directory,  and  populates  it  with  empty  subdirectories  for
	      exactly size subsets.

       add category [ FILE ]...
	      Takes a set of emails from either FILE if specified,  or	STDIN,
	      and  associates  them  with  category.   The  ordering of emails
	      within FILE is preserved, and subsequent FILEs are  appended  to
	      the  first  in each category.  This command can be repeated sev‐
	      eral times, but should be executed at least once.

       clean  Deletes the directory mailtoe.d and all its contents.

       run    Multiplexes randomly from the email streams added	 earlier,  and
	      relearns	categories  only  when a misclassification occurs. The
	      simulation is repeated size times.

       summarize
	      Prints average error rates for the simulations.

       plot [ ps | logscale ]...
              Plots the number of errors over simulation time. The "ps"
              option, if present, writes the plot to a postscript file in the
              directory mailtoe.d/plots, instead of showing it on-screen. The
              "logscale" option, if present, causes the plot to use a log
              scale on both axes.

       review truecat predcat
	      Scans the last run statistics  and  extracts  all	 the  messages
	      which  belong  to category truecat but have been classified into
	      category predcat.	 The extracted	messages  are  copied  to  the
	      directory mailtoe.d/review for perusal.

       testsuite list
	      Shows  a	list of available filters/wrapper scripts which can be
	      selected.

       testsuite select [ FILTER ]...
	      Prepares the filter(s) named FILTER to be used  for  simulation.
	      The  filter  name is the name of a wrapper script located in the
	      directory /usr/local/share/dbacl/testsuite.  Each filter	has  a
	      rigid  interface	documented  below, and the act of selecting it
	      copies it	 to  the  mailtoe.d/filters  directory.	 Only  filters
	      located there are used in the simulations.

       testsuite deselect [ FILTER ]...
	      Removes the named filter(s) from the directory mailtoe.d/filters
	      so that they are not used in the simulation.

       testsuite run [ plots ]
	      Invokes every selected filter on the datasets added  previously,
	      and calculates misclassification rates. If the "plots" option is
	      present, each filter simulation is plotted as a postscript  file
	      in the directory mailtoe.d/plots.

       testsuite status
	      Describes the scheduled simulations.

       testsuite summarize
              Shows the simulation results for all selected filters. Only
              makes sense after the testsuite run command.

USAGE
       The normal usage pattern is the following: first, you  should  separate
       your  email collection into several categories (manually or otherwise).
       Each category should be associated with one or more folders,  but  each
       folder  should  not  contain  more  than one category. Next, you should
       decide how many runs to use, say 10.  The more runs you use, the better
       the  predicted error rates. However, more runs take more time.  Now you
       can type

       % mailtoe prepare 10

       Next, for every category, you must add  every  folder  associated  with
       this  category. Suppose you have three categories named spam, work, and
       play, which are associated with the mbox	 files	spam.mbox,  work.mbox,
       and play.mbox respectively. You would type

       % mailtoe add spam spam.mbox
       % mailtoe add work work.mbox
       % mailtoe add play play.mbox

       You  should aim for a similar number of emails in each category, as the
       random multiplexing will be unbalanced otherwise. The ordering  of  the
       email  messages in each *.mbox file is important, and is preserved dur‐
       ing each simulation. If you repeatedly add to the  same	category,  the
       later  mailboxes	 will be appended to the first, preserving the implied
       ordering.

       You can now perform as many TOE simulations as desired. The multiplexed
       emails  are classified and learned one at a time, by executing the com‐
       mand given in the environment variable MAILTOE_FILTER. If  not  set,  a
       default value is used.

       % mailtoe run
       % mailtoe summarize
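
       The manual sequence above can also be driven from a script. The
       Python sketch below only assembles the prepare, add, run and
       summarize invocations shown in this section; the helper names
       mailtoe_commands and run_all are hypothetical, and run_all assumes
       that mailtoe is installed and on the PATH.

```python
import subprocess
from typing import Dict, List


def mailtoe_commands(runs: int, mailboxes: Dict[str, List[str]]) -> List[List[str]]:
    """Build the command lines for one complete TOE simulation."""
    commands = [["mailtoe", "prepare", str(runs)]]
    for category, files in mailboxes.items():
        # One add per category; the files are appended in the order given.
        commands.append(["mailtoe", "add", category, *files])
    commands.append(["mailtoe", "run"])
    commands.append(["mailtoe", "summarize"])
    return commands


def run_all(commands: List[List[str]]) -> None:
    """Execute each command, stopping at the first non-zero exit status."""
    for command in commands:
        subprocess.run(command, check=True)


plan = mailtoe_commands(10, {"spam": ["spam.mbox"], "work": ["work.mbox"]})
```

       Calling run_all(plan) would then execute the steps in order.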

       The  testsuite  commands	 are  designed to simplify the above steps and
       allow comparison of a wide range of email  classifiers,	including  but
       not  limited  to	 dbacl.	  Classifiers  are  supported  through wrapper
       scripts, which  are  located  in	 the  /usr/local/share/dbacl/testsuite
       directory.

       The  first stage when using the testsuite is deciding which classifiers
       to compare.  You can view a list of available wrappers by typing:

       % mailtoe testsuite list

       Note that the wrapper scripts are NOT  the  actual  email  classifiers,
       which must be installed separately by your system administrator or oth‐
       erwise.	Once this is done, you can select one or more wrappers for the
       simulation by typing, for example:

       % mailtoe testsuite select dbaclA ifile

       If some of the selected classifiers cannot be found on the system, they
       are not selected. Note also that some wrappers can have hard-coded cat‐
       egory  names,  e.g.  if the classifier only supports binary classifica‐
       tion. Heed the warning messages.

       It remains only to run the simulation. Beware, this  can	 take  a  long
       time (several hours depending on the classifier).

       % mailtoe testsuite run
       % mailtoe testsuite summarize

       Once you are all done, you can delete the working files, log files etc.
       by typing

       % mailtoe clean

SCRIPT INTERFACE
       mailtoe testsuite takes care of learning and classifying your  prepared
       email  corpora  for  each  selected  classifier. Since classifiers have
       widely varying interfaces, this is  only	 possible  by  wrapping	 those
       interfaces individually into a standard form which can be used by mail‐
       toe testsuite.

       Each wrapper script is a command line tool which accepts a single  com‐
       mand followed by zero or more optional arguments, in the standard form:

       wrapper command [argument]...

       Each  wrapper  script  also  makes  use	of  STDIN and STDOUT in a well
       defined way. If no behaviour is described,  then	 no  output  or	 input
       should be used.	The possible commands are described below:

       filter In this case, a single email is expected on STDIN, and a list of
	      category filenames is expected in $2, $3, etc. The script writes
	      the category name corresponding to the input email on STDOUT. No
	      trailing newline is required or expected.

       learn  In this case, a standard mbox stream is expected on STDIN, while
	      a	 suitable  category  file name is expected in $2. No output is
	      written to STDOUT.

       clean  In this case, a directory is expected in $2, which  is  examined
	      for  old	database  information. If any old databases are found,
	      they are purged or reset. No output is written to STDOUT.

       describe
              In this case, a single line of text is written to STDOUT,
	      describing  the  filter's functionality. The line should be kept
	      short to prevent line wrapping on a terminal.

       bootstrap
	      In this case, a directory is expected in $2. The wrapper	script
	      first checks for the existence of its associated classifier, and
	      other prerequisites. If the check is successful, then the	 wrap‐
	      per is cloned into the supplied directory.  A courtesy notifica‐
	      tion should be given on STDOUT to express	 success  or  failure.
              It is also permissible to give longer descriptions and caveats.

       toe    In  this	case, a list of categories is expected in $3, $4, etc.
	      Every possible category must be listed. Preceding this list, the
	      true category is given in $2.

       foot   Used by mailfoot(1).
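
       As an illustration of the interface above, here is a skeleton
       wrapper written in Python. It is a sketch under several
       assumptions: the classify, learn and purge helpers are hypothetical
       placeholders for calls to a real classifier, and only the filter,
       learn, clean and describe commands are handled. The bootstrap, toe
       and foot commands are left out. Note that $2 in the text
       corresponds to argv[2] here.

```python
import io
import sys
from typing import List

DESCRIPTION = "toy wrapper: always picks the first category (illustration only)"


def classify(email: str, category_files: List[str]) -> str:
    # Hypothetical stand-in for invoking the wrapped classifier.
    return category_files[0]


def learn(mbox: str, category_file: str) -> None:
    # Hypothetical stand-in: a real wrapper would train the classifier here.
    pass


def purge(directory: str) -> None:
    # Hypothetical stand-in: a real wrapper would reset old databases here.
    pass


def main(
    argv: List[str],
    stdin: io.TextIOBase = sys.stdin,
    stdout: io.TextIOBase = sys.stdout,
) -> int:
    if len(argv) < 2:
        return 1
    command, args = argv[1], argv[2:]
    if command == "filter" and args:
        # One email on STDIN; the category is written without a newline.
        stdout.write(classify(stdin.read(), args))
    elif command == "learn" and args:
        # An mbox stream on STDIN, the category file in $2; no output.
        learn(stdin.read(), args[0])
    elif command == "clean" and args:
        # A directory in $2; no output.
        purge(args[0])
    elif command == "describe":
        stdout.write(DESCRIPTION + "\n")
    else:
        return 1
    return 0

# A real script would finish with: sys.exit(main(sys.argv))
```

       Passing the streams as parameters keeps the dispatch logic easy to
       exercise without a terminal.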

ENVIRONMENT
       Right  after  loading,  mailtoe reads the hidden file .mailtoerc in the
       $HOME directory, if it exists, so this would be a good place to	define
       custom values for environment variables.

       MAILTOE_FILTER
	      This variable contains a shell command to be executed repeatedly
	      during the running stage.	 The command should  accept  an	 email
	      message  on  STDIN  and output a resulting category name. On the
	      command line, it should also  accept  first  the	true  category
	      name,  then  a list of all possible category file names.	If the
	      output category does not match the true category, then the rele‐
	      vant    categories   are	 assumed   to	have   been   silently
	      updated/relearned.  If MAILTOE_FILTER is undefined, mailtoe uses
	      a default value.

       TEMPDIR
	      This  directory  is exported for the benefit of wrapper scripts.
              Scripts which need to create temporary files should place them
              at the location given in TEMPDIR.
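
       To make the MAILTOE_FILTER contract concrete, the following Python
       sketch implements a toy command of that shape. Everything about
       the classifier is an assumption made for illustration: category
       files are treated as plain text files of known words, the category
       name is taken to be the base name of its file, and scoring is a
       simple word overlap. The names toe_filter and load_words are
       hypothetical.

```python
import io
import os
from typing import List, Set


def load_words(path: str) -> Set[str]:
    """Read the toy category file: whitespace-separated known words."""
    if not os.path.exists(path):
        return set()
    with open(path) as handle:
        return set(handle.read().split())


def toe_filter(argv: List[str], stdin: io.TextIOBase, stdout: io.TextIOBase) -> int:
    """argv[1] is the true category, argv[2:] are all category files."""
    if len(argv) < 3:
        return 1
    true_category, category_files = argv[1], argv[2:]
    tokens = set(stdin.read().split())
    # Toy scoring: count shared words; ties go to the first file listed.
    best = max(category_files, key=lambda path: len(tokens & load_words(path)))
    predicted = os.path.basename(best)
    if predicted != true_category:
        # Misclassification: silently relearn the true category.
        for path in category_files:
            if os.path.basename(path) == true_category:
                with open(path, "a") as handle:
                    handle.write(" ".join(sorted(tokens)) + "\n")
    stdout.write(predicted)
    return 0

# A real script would finish with:
#   import sys; sys.exit(toe_filter(sys.argv, sys.stdin, sys.stdout))
```

       If such a script were saved as an executable file, MAILTOE_FILTER
       could be set to a command which invokes it.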

NOTES
       The subdirectory mailtoe.d can grow quite large. It contains a full
       copy of the training corpora, a set of learning files for every added
       category in each of the size runs, and various log files.

       While TOE simulations for dbacl(1) can be used for comparison with
       other classifiers, TOE should not be used with dbacl(1) for real world
       classification. This is because, unlike many other filters, dbacl(1)
       learns evidence weights in a nonlinear way, and does not preserve
       relative weights between tokens, even if those tokens aren't seen in
       new emails.

WARNING
       Because	the ordering of emails within the added mailboxes matters, the
       estimated error rates are not well defined or  even  meaningful	in  an
       objective  sense.   However,  if	 the sample emails represent an actual
       snapshot of a user's incoming email, then the error rates are  somewhat
       meaningful. The simulations can then be interpreted as alternate reali‐
       ties where a given classifier would have intercepted the incoming mail.

SOURCE
       The source code for the latest version of this program is available  at
       the following locations:

       http://www.lbreyer.com/gpl.html
       http://dbacl.sourceforge.net

AUTHOR
       Laird A. Breyer <laird@lbreyer.com>

SEE ALSO
       bayesol(1), dbacl(1), mailinspect(1), mailcross(1), mailfoot(1),
       regex(7)

Version 1.14.1	      Bayesian Text Classification Tools	    MAILTOE(1)