Linbot 0.6

Installing and using Linbot varies depending on which setup you have on your
system.  If you have Python on your system, the recommended way of using it 
is to run the modules through your Python interpreter.  Python 1.5 is required
and may be downloaded freely at http://www.python.org/

If you do not have Python and are running a Linux 2.x ELF (Intel)
system you may use the included 'frozen' executable.  This executable includes 
a built-in Python interpreter and all the modules needed to run Linbot.

Note that, due to the nature of Python and frozen executables, it is not
possible to run the frozen version of Linbot on a system that has a version
of Python earlier than 1.5 installed. This will be fixed in a later version of
Python.

-----------------
INSTALLING LINBOT
-----------------
Installation is relatively easy.  

1) Unpack the gzipped tar archive into a directory.  Recommended directories are
/usr/local/lib/linbot or ~/linbot.  Be sure to add this directory to
your PYTHONPATH environment variable:

$ tar zxvf linbot-0.6.tar.gz -C /usr/local/lib
$ PYTHONPATH="/usr/local/lib/linbot:$PYTHONPATH" ; export PYTHONPATH

2) Add a symbolic link to <main> somewhere in your PATH, where <main>
is:

"linbot.py" if you have Python on your system
"linbot" if you are using the frozen Linux executable

$ ln -s /usr/local/lib/linbot/linbot.py /usr/local/bin/linbot
				  or
$ ln -s /usr/local/lib/linbot/linbot /usr/local/bin/linbot


3) Edit the config.py file to your liking.  Most of the defaults are
safe, and the important items can be overridden with command-line flags.
You may want to keep a copy of the original config.py just in case.
The config.py options are documented within the file.

--------------
RUNNING LINBOT
--------------
It is simple to run Linbot.

Executing Linbot without any command-line arguments will cause it to
give a simple synopsis of its usage and then quit:

$ linbot
linbot [-x regex]... [-y regex]... [-b][-a][-o dir][-w sec] url [location]...

Before running Linbot on a site, you need to do a little
preparation.

One thing that Linbot needs is a directory in which to publish its
reports.  It is recommended that you choose a directory that is
empty.  Note that this directory must exist and be writable by Linbot.

$ mkdir /usr/local/httpd/htdocs/linbot
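
The two requirements above (the directory exists and is writable) can be
checked before a long run.  The following is only a sketch in current
Python, not part of Linbot; report_dir_ok is a hypothetical helper name.

```python
import os


def report_dir_ok(path):
    """Return True if path is an existing directory that we can write into."""
    # W_OK is needed to create report files; X_OK is needed to enter
    # the directory in the first place.
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)
```

For example, report_dir_ok("/usr/local/httpd/htdocs/linbot") should return
True once the mkdir above has succeeded and the directory is writable by the
user running Linbot.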

The report can be viewed using most Web browsers.  Browsers using frames
technology should initially open the "index.html" file.  Browsers not using
frames or with frames disabled can initially open the "navbar.html" file.  Note
these are the default filenames for Linbot and may be changed via the config file.

Second, you should decide beforehand which structures on your site
should be considered "internal" and which should be considered
"external".  Linbot defines internal and external links as such:

An INTERNAL link is a part of your site that you have control of and
should be checked, as well as the links that it points to.  Basically
an internal link is one that, if broken, you have the power to fix.

An EXTERNAL link is one that your site points to, but you have no
jurisdiction over.  It can also be a link that you may have power to
change, but need not be checked for broken links, such as CGI scripts
or pages that were generated by an automated tool (such as Linbot or
any program that converts a document of one format to HTML).

Your BASE url is the url that is the top level of your web site.
Commonly referred to as the "home page", it is the url that points to
all other pages either directly or indirectly.  A base url can be on
one server but may point to pages that are on another server but
should still be considered internal.  An example would be a main
server www.someplaceonthenet.com in which there may be links to an
alternate or load balancing server called www2.someplaceonthenet.com.
In this example www2.someplaceonthenet.com would host internal links
even though your "home page" may be http://www.someplaceonthenet.com

That said, you should have a basic idea of what you do and do not want 
Linbot to check.  Don't be surprised if you don't get it exactly right 
the first time.  Also, consider using the robots.txt file/protocol:
Linbot honors this protocol, as do other web robots that may run
across your site.  This protocol is useful to indicate to robots that
some parts of your site, such as CGI scripts, internal documents, or
server stats, should not be explored.  The robots.txt protocol is
explained at
http://info.webcrawler.com/mak/projects/robots/exclusion-admin.html
Currently Linbot identifies itself as User-Agent: Linbot.

For example, you can allow Linbot to search a directory but restrict other
bots like this:

   User-agent: *
   Disallow: /

   User-agent: Linbot
   Allow: /
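
To see how a robot would interpret these rules, they can be fed to the
robots.txt parser in current Python's standard library.  This is only a
sketch: it is not the code Linbot itself uses, and the helper name "allowed"
and the agent name "SomeOtherBot" are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above.
RULES = """\
User-agent: *
Disallow: /

User-agent: Linbot
Allow: /
"""


def allowed(agent, url, rules=RULES):
    """Return True if the given rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)
```

With these rules, allowed("Linbot", "http://www.someplaceonthenet.com/")
is true, while the same call with "SomeOtherBot" is false.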

Linbot attempts to find the author of a web page by taking advantage of the <META>
HTML element.  META is used to pass meta information to the software reading the
HTML and can be used for a variety of purposes.  META tags are contained in the
<HEAD> element.  Here is an example of using META to describe the author of a
page:

   <META NAME="Author" CONTENT="John Doe">

In the above example Linbot will flag "John Doe" as the author of that page.
The "Problems", or "Work Groups", page identifies problem pages by author.  This
can be useful for sites where pages are authored by many individuals. 
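
The lookup described above can be sketched with the HTML parser from current
Python's standard library.  This illustrates the idea only and is not
Linbot's own implementation; AuthorFinder and find_author are hypothetical
names.

```python
from html.parser import HTMLParser


class AuthorFinder(HTMLParser):
    """Remember the CONTENT of the first <META NAME="Author"> tag seen."""

    def __init__(self):
        super().__init__()
        self.author = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta" or self.author is not None:
            return
        fields = dict(attrs)
        # Compare the NAME value case-insensitively ("Author", "author", ...).
        if (fields.get("name") or "").lower() == "author":
            self.author = fields.get("content")


def find_author(html):
    """Return the author named in the page's META tag, or None."""
    finder = AuthorFinder()
    finder.feed(html)
    finder.close()
    return finder.author
```

Feeding it a page whose HEAD contains the META line shown above yields
"John Doe"; a page without such a tag yields None.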

Okay, you've heard enough and you want to run the darn thing.  The
simplest way to run Linbot is:

$ linbot http://www.someplaceonthenet.com/

This will first read the robots.txt file at www.someplaceonthenet.com
and then proceed to examine every link pointed to on that site, except 
links denied by robots.txt, if that file exists.

The exact usage for Linbot is explained below:

SYNOPSIS
linbot [-x regex]... [-y regex]... [-b][-a][-o dir][-w sec] url [location]...

-x regex
	Use this option to tell Linbot to consider any url matching
	with <regex> to be external.  This option can be used multiple
	times.

-y regex
	Like the -x switch, though this option will cause linbot to not
	check the link at all, whereas -x will check the link, but not
	its children.

-b
	Base URLs only.
	Tells Linbot to consider any url that does not start with the
	base url to be external.  For example, if you run 
	'linbot -b http://www.someplaceonthenet.com/~someuser/' then 
	http://www.someplaceonthenet.com/~someuser/misc/index.html
	will be considered internal whereas
	http://www.someplaceonthenet.com/ would be considered
	external.

-a
	Avoid external links.  Normally, if Linbot is examining an
	HTML page and finds a link that points to an external
	document, Linbot will not examine the external document.
	However, it will check to see if that document exists, since
	you may not want to point to broken links whether internal or
	external.  However, sometimes this default behavior may not
	be desirable.  If the -a option is chosen, Linbot will not
	check for the existence of external links.

-o dir
	Output Directory.  Used to specify the directory where Linbot
	will dump its report files.  The default is the current
	directory or as specified in config.py.

-w sec
	Wait sec seconds.  Usually, Linbot will process a URL and immediately
	move on to the next one.  However, on some loaded systems, it may be more
	desirable to have Linbot wait a while between requests.  This option
	should be set to any non-negative number (in seconds).

url
	The base url.  Linbot checks this link first, then all the
	links it points to on down the "tree".  When checking urls,
	Linbot checks the http:, ftp:, and file: schemes.  All other schemes
	are not checked and are treated as "external".

location
	This specifies that urls pointed to at <location> are to be
	considered internal.  This can be useful, for example, if the
	base url is on one server but points to "internal" documents
	on another server.  location is the name of that server, for
	example www2.someplaceonthenet.com.  This can also be used,
	for example, if you have an intranet where some urls may point 
	to http://www.someplaceonthenet.com whereas some urls may
	point to just 'www'.  This option may be used more than once,
	but must follow the base url.

The switches (and other options) can be changed in the config.py file.  It is
recommended that you look at (and edit) this file.
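
How the -x, -y and -b switches and the location arguments combine can be
pictured with the rough sketch below.  It is not Linbot's actual code.  It
assumes that regexes are applied as a search anywhere in the url, that -y is
tested before -x, that locations are tested before -b, and that without -b
any url on the base url's host counts as internal; none of these details is
stated in this document.  The name classify is hypothetical.

```python
import re
from urllib.parse import urlsplit


def classify(url, base, locations=(), external=(), skip=(), base_only=False):
    """Return "skip", "external", or "internal" for url."""
    # -y regex: matching urls are not checked at all.
    if any(re.search(pattern, url) for pattern in skip):
        return "skip"
    # -x regex: matching urls are checked, but their children are not.
    if any(re.search(pattern, url) for pattern in external):
        return "external"
    # location: urls on any of these servers are internal (assumed order).
    host = urlsplit(url).hostname or ""
    if host in {location.lower() for location in locations}:
        return "internal"
    # -b: only urls that start with the base url are internal.
    if base_only:
        return "internal" if url.startswith(base) else "external"
    # Assumption: without -b, the base url's own server is internal.
    return "internal" if host == urlsplit(base).hostname else "external"
```

With base_only=True and the base url
http://www.someplaceonthenet.com/~someuser/, the url
http://www.someplaceonthenet.com/~someuser/misc/index.html comes out as
"internal" and http://www.someplaceonthenet.com/ as "external", matching the
-b example above.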

--------
EXAMPLES
--------

Here are some examples of running Linbot.

$ linbot http://manson.ddns.org/ -x /linbot starship.skyport.net

$ linbot -o /stats/altavista/ http://altavista.digital.com/

$ linbot -o ~/Lang/Python/linbot -b http://manson.ddns.org/~marduk/ manson
	
--------------------
RUNNING PERIODICALLY
--------------------

Linbot may be safely run unattended, either periodically or during
off-peak hours, using cron or at.  You may want to redirect
Linbot's output to a null device or log file, or have it emailed to an
account.  Consult your operating system manuals on how this can be
done on your system.
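
One way to capture the output in a log file is a small wrapper script that
cron or at can call.  The sketch below uses current Python and only the -o
and -w switches documented above; build_command and run_logged are
hypothetical helper names, and it assumes "linbot" is on the PATH of the
account that runs the job.

```python
import subprocess


def build_command(base_url, output_dir=None, wait=None, locations=()):
    """Build a linbot command line from the documented switches."""
    cmd = ["linbot"]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    if wait is not None:
        cmd += ["-w", str(wait)]
    # The base url comes after the switches, followed by any locations.
    cmd.append(base_url)
    cmd.extend(locations)
    return cmd


def run_logged(cmd, log_path):
    """Run cmd, appending stdout and stderr to log_path; return exit status."""
    with open(log_path, "a") as log:
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

A scheduled job could then call run_logged(build_command(...), log_path)
with a log file of your choosing, so each run appends to the same log.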


---------------------
QUESTIONS/BUG REPORTS
---------------------
If you have any questions about Linbot or would like to report a bug,
send electronic mail to marduk@starship.skyport.net.  Please mention Linbot in the
subject of the message.  In order to assist in tracking down bugs,
please include either a URL where the problem can be found, an HTML
file where the error occurs or a (small) tar file of a site where the
error occurs.  Suggestions for improvements are also welcome.

