WEBCRAWL(1)							   WEBCRAWL(1)

NAME
	 WebCrawl - download web sites, following links

SYNOPSIS
       webcrawl [ options ] host[:port]/filename directory

DESCRIPTION
       WebCrawl	 is  a	program designed to download an entire website without
       user interaction (although an interactive mode is available).

       WebCrawl will download the page given by host[:port]/filename into the
       named directory under the compiled-in server root directory (which can
       be changed with the -o option, see below).  The web address should not
       contain a leading http://.
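
       For example, a basic invocation along these lines (the host name and
       destination directory are purely illustrative) would mirror a site
       into the directory mirror under the server root:

              webcrawl www.example.com/index.html mirror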

       It works simply by starting with a single web page and following all
       links from that page, attempting to recreate the directory structure
       of the remote server.

       As well as downloading the pages, it also rewrites them so that URLs
       which would otherwise not work on the local system (eg URLs that begin
       with http:// or with a /) point to the local copies instead.

       It stores the downloaded files in a directory  structure	 that  mirrors
       the  original  site's, under a directory called server.domain.com:port.
       This way, multiple sites can all be  loaded  into  the  same  directory
       structure,  and	if  they  link to each other, they can be rewritten to
       link to the local, rather than remote, versions.
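
       As an illustration (the exact names depend on the site and the port
       used), fetching www.example.com/index.html into the directory mirror
       might produce files, relative to the server root directory, such as:

              mirror/www.example.com:80/index.html
              mirror/www.example.com:80/images/logo.gif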

       Comprehensive URL selection facilities allow you to describe what docu‐
       ments  you  want to download, so that you don't end up downloading much
       more than you need.

       WebCrawl is written in ANSI C, and should work  on  any	POSIX  system.
       With  minor modifications, it should be possible to make it work on any
       operating system that supports TCP/IP sockets. It has been tested  only
       on Linux.

OPTIONS
       URL selection:

       -a      This causes the program to ask the user whether to download a
               page that it has not otherwise been instructed to download (by
               default, this means off-site pages).

       -f string
               This causes the program to always follow links to URLs that
               contain the string.  You can use this, for example, to prevent
               a crawl from going outside a single directory on a site (in
               conjunction with the -x option below); say you wanted to get
               http://www.web-sites.co.uk/jules but nothing else located on
               the same server.  You could use the command line:

	       webcrawl -x -f /jules www.web-sites.co.uk/jules/ mirror

               Another use would be if a site contained links to (eg)
               pictures, videos or sound clips on a remote server; you could
               use the following command line to get them:

               webcrawl -f .jpg -f .gif -f .mpg -f .wav -f .au www.site.com/ mirror

	       Note that webcrawl always downloads inline images.

       -d string
	       The  opposite  of -f, this option tells webcrawl never to get a
	       URL containing the string.  -d takes priority  over  all	 other
	       URL  selection options (except that it won't stop it from down‐
	       loading inline images, which are always downloaded).
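
               For example, to mirror a site while never fetching URLs that
               contain .pdf (the host name and pattern are illustrative):

               webcrawl -d .pdf www.site.com/ mirror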

       -u filename
	       Causes webcrawl to log unfollowed links to the file filename.

       -x      Causes webcrawl not to automatically follow links to pages on
               the same server.  This is useful in conjunction with the -f
               option to specify a subsection of an entire site to download.

       -X      Causes webcrawl not to  automatically  download	inline	images
	       (which  it  would  otherwise do even when other options did not
	       indicate that the image should be loaded).  This is  useful  in
	       conjunction  with  the  -f option to specify a subsection of an
	       entire site to download, when even the  images  concerned  need
	       careful selection.
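
               For instance, extending the /jules example above, a command
               along these lines would restrict both pages and inline images
               to URLs containing /jules:

               webcrawl -X -x -f /jules www.web-sites.co.uk/jules/ mirror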

       Page re-writing:

       -n      Turns off page rewriting completely.

       -rx     Select which URLs to rewrite.  Only URLs that begin with / or
               http: are considered for rewriting; all others are always left
               unchanged.  This option selects which of these URLs are
               rewritten to point to local files, depending on the value of x.

               a   All absolute URLs are rewritten.

	       l   Only URLs that point to pages on the same site are  rewrit‐
		   ten.

	       f (default)
                   Only URLs for which the file that the rewritten URL would
                   point to exists are rewritten.  Note that rewriting occurs
                   after all links in a page have been followed (if required),
                   so this is probably the most sensible option, and is
                   therefore the default.
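
               For example, a command such as the following (host name
               illustrative) would rewrite all absolute URLs, whether or not
               a local copy exists:

               webcrawl -ra www.site.com/ mirror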

       -k      Keep original filenames - disables the renaming of files to
               remove metacharacters that may confuse a web server, and to
               ensure that the extension at the end of the filename is a
               correct .html or .htm whenever the page has a text/html content
               type.  (See Configuration Files below for a discussion of how
               to achieve this with other file types.)

       -q      Disable process ID insertion into query filenames.  Without
               this flag, and whenever -k is not in use, webcrawl rewrites the
               filenames of queries (defined as any fetch from a web server
               that includes a '?' character in the filename) to include the
               process ID of the webcrawl process fetching the query, in
               hexadecimal, after the (escaped) '?' in the filename; this may
               be desirable when performing the same query multiple times to
               get different results.  This flag disables that behaviour.

       Recursion limiting:

       -l[x] number
               This option is used to limit the depth to which webcrawl will
               search the tree (forest) of interlinked pages.  There are two
               limits that may be set: with x as l, the initial limit is set;
               with x as r, the limit used after jumping to a remote site is
               set.  If x is omitted, both limits are set.
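
               For example, a command along these lines (the limit values are
               illustrative) would follow links to a depth of 5 from the
               starting page, but only 1 level deep after jumping to a remote
               site:

               webcrawl -ll 5 -lr 1 www.site.com/ mirror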

       -v      Increases the program's verbosity.  Without this option, no
               status reports are made unless errors occur.  Used once,
               webcrawl will report which URLs it is trying to download, and
               also which links it has decided not to follow.  -v may be used
               more than once, but this is probably only useful for debugging
               purposes.

       -o dir  Change the server root directory. This is  the  directory  that
	       the  path  specified at the end of the command line is relative
	       to.

       -p dir  Change the URL rewriting prefix. This is prepended to rewritten
	       URLs, and should be a (relative) URL that points to the current
	       server root directory. An example of the use of the -o  and  -p
	       options is given below:

               webcrawl -o /home/jules/public_html -p /~jules www.site.com/page.html mirrors

       HTTP-related options:

       -A string
	       Causes webcrawl to send the specified string as the HTTP 'User-
               Agent' value, rather than the compiled-in default (normally
	       `Mozilla/4.05 [en] (X11; I; Linux 2.0.27 i586; Nav)',  although
	       this can be changed in the file web.h at compile time).
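
               For example (the agent string shown is purely illustrative):

               webcrawl -A 'MyMirror/1.0' www.site.com/ mirror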

       -t n    Specifies  a  timeout, in seconds. Default behaviour is to give
	       up after this  length  of  time	from  the  initial  connection
	       attempt.

       -T      Changes	the  timeout  behaviour.   With this flag, the timeout
	       occurs only if no data is received  from	 the  server  for  the
	       specified length of time.
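
               For example, a command such as the following (timeout value
               illustrative) would give up only if no data arrives from the
               server for 60 seconds:

               webcrawl -T -t 60 www.site.com/ mirror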

CONFIGURATION FILES
       webcrawl	 uses  configuration files at present to specify rules for the
       rewriting of filenames.	It searches for files  in  /etc/webcrawl.conf,
       /usr/local/etc/webcrawl.conf,  and  $HOME/.webcrawl  and	 processes all
       files it finds in that order.  Parameters set in one file may be	 over‐
       riden  by  subsequent files.  Note that it is perfectly possible to use
       webcrawl without a configuration file - it is only  for	advanced  fea‐
       tures  that are too complex to configure on the command line that it is
       required.

       The overall syntax of a webcrawl configuration file is a set of
       sections, each headed by a line of the form [section-name].

       At  present, only the [rename] section is defined. This may contain the
       following commands:

       meta string
	       Sets metacharacter list. Any character in  the  list  specified
	       will be quoted in filenames produced (unless filename rewriting
	       is disabled with the  -k	 option).   Quoting  is	 performed  by
	       prepending the quoting character (default @) to the hexadecimal
	       ASCII  value  of	 the  character	 being	quoted.	 The   default
	       metacharacter list is: ?&*%=#

       quote char
	       Sets the quoting character, as described above. The default is:
	       @

       type content/type preferred [extra extra ...]
               Sets the list of acceptable extensions for the specified MIME
	       content	type.	The  first  item  in the list is the preferred
	       extension; if renaming is not disabled (with the -k option) and
	       the  extension  of a file of this type is not on the list, then
	       the first extension on the list will be appended to its name.

	       An implicit line is defined internally, which reads:

	       type text/html html htm

               This can be overridden; if, say, you preferred the 'htm'
               extension over 'html', you could use:

	       type text/html htm html

               in a configuration file to cause the .htm extension to be used
               whenever a new extension needs to be added.
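
               As a sketch, a complete configuration file combining these
               commands might look like the following (the image/jpeg line
               and its extensions are illustrative, not built-in defaults):

               [rename]
               meta ?&*%=#
               quote @
               type text/html htm html
               type image/jpeg jpg jpeg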

AUTHOR
       WebCrawl was written by Julian R. Hall <jules@acris.co.uk> with sugges‐
       tions and prompting by Andy Smith.

       Bugs  should  be	 submitted to Julian Hall at the address above. Please
       include information about what  architecture,  version,	etc,  you  are
       using.

webcrawl			   webcrawl			   WEBCRAWL(1)