webcheck man page on DragonFly

webcheck(1)			 User Commands			   webcheck(1)

NAME
       webcheck - website link checker

SYNOPSIS
       webcheck [OPTION]...  URL

DESCRIPTION
       webcheck	 will  check  the  document  at the specified URL for links to
       other documents, follow these links recursively and  generate  an  HTML
       report.

       -i,  --internal=PATTERN
	      Mark URLs matching the PATTERN (perl-type regular expression) as
	      an internal link.	 Can be used multiple times.   Note  that  the
	      PATTERN  is  matched  against  the full URL.  URLs matching this
	      PATTERN will be considered internal, even if they match  one  of
	      the --external PATTERNs.

       -x,  --external=PATTERN
	      Mark URLs matching the PATTERN (perl-type regular expression) as
	      an external link.	 Can be used multiple times.   Note  that  the
	      PATTERN is matched against the full URL.

       -y, --yank=PATTERN
	      Do not check URLs matching the PATTERN (perl-type regular
	      expression).  This is similar to the -x flag, except that a
	      yanked link is not checked at all, whereas -x checks the link
	      itself but not its children.  Can be used multiple times.
	      Note that the PATTERN is matched against the full URL.

       -b, --base-only
	      Consider	any URL not starting with the base URL to be external.
	      For example, if you run
		  webcheck -b http://www.example.com/foo
	      then http://www.example.com/foo/bar will be considered  internal
	      whereas http://www.example.com/ will be considered external.  By
	      default all the pages on the site will be considered internal.

       -a, --avoid-external
	      Avoid external links.  Normally if webcheck is examining an HTML
	      page and it finds a link that points to an external document, it
	      will check to see if that external document exists.   This  flag
	      disables that action.

       --ignore-robots
	      Do   not	retrieve  and  parse  robots.txt  files.   By  default
	      robots.txt files are retrieved and honored.  If you are sure you
	      want to ignore and override the webmaster's decision this option
	      can be used.
	      For more	information  on	 robots.txt  handling  see  the	 NOTES
	      section below.

       -q, --quiet, --silent
	      Do not print out progress as webcheck traverses a site.

       -d, --debug
	      Print  debugging	information  while  crawling  the  site.  This
	      option is mainly useful for developers.

       -o, --output=DIRECTORY
	      Directory in which webcheck writes its reports.  The default is
	      the current directory, or the directory specified in config.py.
	      If the directory does not exist, webcheck will create it (if
	      possible).

       -c, --continue
	      Try  to  continue	 from  a  previous run. When using this option
	      webcheck will look for a webcheck.dat in the  output  directory.
	      This  file  is  read to restore the state from the previous run.
	      This allows webcheck to continue a previously  interrupted  run.
	      When  this option is used, the --internal, --external and --yank
	      options will be ignored as  well	as  any	 URL  arguments.   The
	      --base-only  and	--avoid-external options should be the same as
	      the previous run.
	      Note that this option is experimental and its semantics may
	      change with coming releases (especially in relation to other
	      options).  Also note that the stored files are not guaranteed to
	      be compatible between releases.

       -f, --force
	      Overwrite	 files	without	 asking.   This option is required for
	      running webcheck non-interactively.

       -r, --redirects=N
	      Redirect depth: the number of redirects webcheck should follow
	      when following a link.  A value of 0 means follow all redirects.

       -u, --userpass=URL
	      Specify  a URL with username and password information to use for
	      basic authentication when visiting the site.
	      e.g. http://test:secret@example.com/
	      This option may be specified multiple times.

       -w, --wait=SECONDS
	      Wait SECONDS between document retrievals.  Usually webcheck will
	      process a URL and immediately move on to the next.  However, on
	      heavily loaded systems it may be desirable to have webcheck
	      pause between requests.  This option can be set to any
	      non-negative number.

       -v, --version
	      Show version of program.

       -h, --help
	      Show short summary of options.

URL CLASSES
       URLs are divided into two classes:

       Internal URLs are retrieved and	the  retrieved	item  is  checked  for
       syntax.	 Also, the retrieved item is searched for links to other items
       (of any class) and these links are followed.

       External URLs are only retrieved to test whether they are valid and  to
       gather  some  basic  information	 from them (title, size, content-type,
       etc).  The retrieved items are not inspected for links to other items.

       Apart from their class, URLs can also be considered yanked (as
       specified with the --yank or --avoid-external options).  Yanked URLs
       can be either internal or external and are not retrieved or checked
       at all.  URLs of unsupported schemes are also considered yanked.
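
       The classification rules described above can be sketched in Python.
       This is a minimal model, not webcheck's actual code: the function
       names are hypothetical, Python's re module stands in for perl-type
       regular expressions, and both the order in which the rules are tried
       and the reading of "on the site" as "same scheme and host as the base
       URL" are assumptions.

```python
import re
from urllib.parse import urlsplit


def matches_any(patterns, url):
    # Patterns are searched anywhere in the full URL, not anchored at the start.
    return any(re.search(pattern, url) for pattern in patterns)


def classify(url, base, internal=(), external=(), yank=(),
             base_only=False, avoid_external=False):
    """Return (url_class, yanked); url_class is "internal" or "external"."""
    if matches_any(internal, url):
        # --internal wins even if an --external pattern also matches.
        url_class = "internal"
    elif matches_any(external, url):
        url_class = "external"
    elif base_only:
        # --base-only: anything not starting with the base URL is external.
        url_class = "internal" if url.startswith(base) else "external"
    else:
        # Assumption: "on the site" means same scheme and host as the base URL.
        b, u = urlsplit(base), urlsplit(url)
        same_site = (b.scheme, b.netloc) == (u.scheme, u.netloc)
        url_class = "internal" if same_site else "external"
    # Yanked URLs keep their class but are never retrieved or checked.
    yanked = matches_any(yank, url) or (avoid_external and url_class == "external")
    return url_class, yanked
```

       With base http://www.example.com/foo and base_only=True, this model
       reports http://www.example.com/foo/bar as internal and
       http://www.example.com/ as external, matching the --base-only
       description.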

EXAMPLES
       Check  the  site www.example.com but consider any path with "/webcheck"
       in it to be external.
	   webcheck http://www.example.com/ -x /webcheck
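
       For scripted, non-interactive runs, a command line can be assembled
       from the options documented above.  The sketch below only builds and
       prints the command; the helper name, the output directory "report"
       and the one-second wait are illustrative assumptions.

```python
import shlex


def webcheck_argv(url, output_dir, wait_seconds=1):
    # --force is required for non-interactive runs; --quiet hides progress.
    return [
        "webcheck",
        "--force",
        "--quiet",
        f"--output={output_dir}",
        f"--wait={wait_seconds}",
        url,
    ]


if __name__ == "__main__":
    # Print the command rather than run it, so the sketch has no side effects.
    print(shlex.join(webcheck_argv("http://www.example.com/", "report")))
```

       The resulting list can be passed to subprocess.run() on a system
       where webcheck is installed.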

NOTES
       When checking internal URLs, webcheck honors the robots.txt file,
       identifying itself as user-agent webcheck.  Disallowed links are not
       checked at all, as if the -y option were specified for that URL.  To
       allow webcheck to crawl parts of a site from which other robots are
       excluded, use something like:
	   User-agent: *
	   Disallow: /foo

	   User-agent: webcheck
	   Allow: /foo
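
       The effect of these rules can be checked with Python's standard
       urllib.robotparser module.  This is only a stand-in for illustration:
       whether webcheck itself parses robots.txt this way is an assumption,
       and the user-agent name "otherbot" and the helper function are
       hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from the NOTES section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /foo

User-agent: webcheck
Allow: /foo
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    # Parse the rules from memory instead of fetching them over the network.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.example.com/foo/bar"
    print("webcheck:", allowed("webcheck", url))
    print("otherbot:", allowed("otherbot", url))
```

       Under these rules the webcheck user-agent may fetch URLs below /foo,
       while a robot matching only the "*" entry may not.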

ENVIRONMENT
       <scheme>_proxy
	      Proxy URL for <scheme>.
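
       As an illustration of the naming scheme, the variable consulted for a
       given URL is derived from the URL's scheme (http_proxy for http://
       URLs, and so on).  The lookup below is a minimal sketch; the function
       name and the proxy address are hypothetical, and it does not claim to
       mirror how webcheck reads its environment.

```python
import os
from urllib.parse import urlsplit


def proxy_for(url, environ=os.environ):
    # Build the variable name "<scheme>_proxy" from the URL's scheme.
    scheme = urlsplit(url).scheme
    # An unset or empty variable means no proxy for that scheme.
    return environ.get(f"{scheme}_proxy") or None


if __name__ == "__main__":
    env = {"http_proxy": "http://proxy.example.com:3128"}
    print(proxy_for("http://www.example.com/", env))
    print(proxy_for("ftp://ftp.example.com/", env))
```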

REPORTING BUGS
       Bug reports should be sent to the current maintainer
       <arthur@ch.tudelft.nl>.  More information on reporting bugs can be
       found on the webcheck homepage:
       http://ch.tudelft.nl/~arthur/webcheck/

COPYRIGHT
       Copyright © 1998, 1999 Albert Hopkins (marduk)
       Copyright © 2002 Mike W. Meyer
       Copyright © 2005, 2006, 2007, 2008 Arthur de Jong
       webcheck is free software;  see	the  source  for  copying  conditions.
       There  is  NO  warranty;	 not even for MERCHANTABILITY or FITNESS FOR A
       PARTICULAR PURPOSE.
       The files produced as output from the  software	do  not	 automatically
       fall  under  the	 copyright  of	the software, unless explicitly stated
       otherwise.

Version 1.10.3			   Jul 2008			   webcheck(1)