<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
        "http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML LANG="en-US"><HEAD>
<TITLE>Websnob: access_log For End-Users</TITLE>
<BASE HREF="http://www.websnob.net/logs">

<META NAME=description CONTENT="Explains how a typical UNIX shell account user
can extract his homepage's access records from an NCSA/Apache web server
access_log and archive them for use by a log analyzer.">
<!--#exec cgi="./cgi/head.pl"-->
</HEAD><BODY>


<H1>Websnob: access_log For End-Users</H1>

<P>This is a hands-on guide for the user who maintains his web pages on a
UNIX shell account on a host using the popular <A
HREF="http://hoohoo.ncsa.uiuc.edu/docs/Overview.html">NCSA httpd</A> or <A
HREF="http://www.apache.org/">Apache web servers</A>.  It explains how to
extract personal records (that is, the accesses for one person's pages)
from a server-wide access_log, and how to do long-term archiving of that
information for analysis by any popular log analyzer.  Such extraction and
archiving creates smaller, focused logs that can be retained for longer
periods than a server-wide log.</P>

<H2>Step Zero: Make sure it's an NCSA/Apache server</h2>

<P>Since you've got a UNIX shell account, you've probably got access to the
<a href="http://lynx.browser.org/">Lynx web browser</a>. Visit your
homepage using Lynx and hit the <kbd>#</kbd> key to see the HTTP headers,
which will include a header identifying the server software. (If you don't
have access to Lynx, visit the <a
href="http://www.netcraft.com/whats/">Netcraft Server Survey</a> and enter
your server's name.) If the server response doesn't include the words
&quot;NCSA&quot; or &quot;Apache&quot;, the rest of this tutorial is
useless to you. Sorry.</p>
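
<P>If you'd rather check from the shell prompt, here's a minimal sketch that
pulls the Server header out of a set of HTTP response headers.  The function
name <CODE>server_software</CODE> and the sample headers are made up for
illustration; only the header name itself comes from HTTP.</P>

<PRE><CODE>
```shell
#!/bin/sh

# Print the value of the Server header from HTTP response headers on stdin.
# Header names are not case-sensitive, so match any mix of case.
server_software() {
    tr -d '\r' | sed -n 's/^[Ss][Ee][Rr][Vv][Ee][Rr]: *//p'
}

# Demo with canned headers (an assumed sample, not captured from a real server).
printf 'HTTP/1.0 200 OK\r\nDate: Mon, 01 Jan 1996 00:00:00 GMT\r\nServer: Apache/1.0.3\r\n\r\n' | server_software
# prints: Apache/1.0.3
```
</CODE></PRE>

<P>If your host happens to have a command-line HTTP client such as curl
installed (an assumption; many shell accounts won't), something like
<KBD>curl -sI http://www.example.com/~userid/ | server_software</KBD> would
feed it live headers.</P>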

<H2>Step One: Find your server's access_log</H2>

<P>NCSA and Apache servers record file requests in a file named access_log,
usually in a subdirectory of the server daemon's home directory.  You can
find your server's access_log using the <A HREF=
"http://www.solarisguide.com/cgi-bin/rtfm?cmd=find&amp;sec=1&amp;search=EXACT" >find(1)</A>
command, for example <KBD>find / -name 'access_log*' -print</KBD>.  (Quote
the pattern, or the shell may expand the <KBD>*</KBD> before find(1) sees it.)</P>

<P>find(1) may actually locate the server access_log more than once. 
That's OK.  Either your server has symbolic links to the access_log (pick
whichever one is easiest to type) or it's saving old logs for a short time
after closing them.  Neither case causes any problems, and might even make
your work easier.</P>


<H2>Step Two: Find out when the log is restarted</H2>
<P>No ISP can afford the disk space to save old access_log files forever, so
you've got to learn to get your information out of the file before it's
deleted and restarted.  The restart interval varies widely from server to
server, with busy servers having to reset more often than light-load
machines.  Some servers do, however, retain recent logs for a few days
after closing them, in case the webmaster has to track down an old
error.</P>

<P>If your ISP retains old access_logs, the modification times on saved
files will probably tip you off right away -- they're the times those logs
were closed.  Otherwise, you'll have to monitor the access_log directory
until you learn the interval.  The timestamp on the first line of the
access_log shows roughly when the log was opened, since it belongs to the
first request logged after the restart.  Set up a simple shell script (like the one
below) and run it from <A HREF=
"http://www.solarisguide.com/cgi-bin/rtfm?cmd=crontab&amp;sec=1&amp;search=EXACT" 
>crontab(1)</A>
fairly often (once an hour, at least) to build a log of restart times. 
Hopefully, you'll see a pattern form.</P>

<PRE><CODE>
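#!/bin/sh
# Record the first line of the server log; its timestamp shows when the
# log was last restarted.

LOG=/var/logs/www/access_log    # substitute your server's access_log

touch $HOME/restart_log         # make sure the file exists on the first run
mv $HOME/restart_log $HOME/restart_log.tmp
head -n 1 $LOG &gt;&gt; $HOME/restart_log.tmp
# uniq drops the repeated line when the log hasn't restarted since the last run
uniq $HOME/restart_log.tmp &gt; $HOME/restart_log
rm $HOME/restart_log.tmp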
</CODE></PRE>
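
<P>The restart log collects whole request lines, but only the timestamps
matter for spotting the pattern.  Here's a rough sketch for pulling them out,
assuming the server writes Common Log Format (the NCSA and Apache default),
where the timestamp sits inside square brackets.  The function name
<CODE>log_timestamps</CODE> and the sample line are hypothetical.</P>

<PRE><CODE>
```shell
#!/bin/sh

# Print the bracketed timestamp from each Common Log Format line on stdin.
# Lines without a [timestamp] field are skipped.
log_timestamps() {
    sed -n 's/^[^[]*\[\([^]]*\)].*$/\1/p'
}

# Demo with one made-up log line.
printf '%s\n' 'host.example.com - - [01/Jan/1996:00:00:12 -0500] "GET /~islander/ HTTP/1.0" 200 1234' | log_timestamps
# prints: 01/Jan/1996:00:00:12 -0500
```
</CODE></PRE>

<P>Running <KBD>cat $HOME/restart_log | log_timestamps</KBD> after a few
weeks should give you one line per restart.</P>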

<H3>Step Two And A Half: When to extract your information</H3>
<P>Once you've figured out when and how often your server's access_log is
reset, you have to decide on the best time to extract your
information.  The &quot;best time&quot; will depend on whether or not the
server is retaining closed logs after a restart.</P>

<P>If your server is restarting logs on a regular basis, but isn't saving
the logs afterwards, you'll have to run your extraction program just before
every reset.  I run mine about 10 minutes before reset, because grepping
large logs can take a while.  Of course, I potentially lose 10 minutes of
stats every month, but that's life.</P>

<P>If your server is saving access_logs for a reasonable length of time,
you can get complete access logging by waiting until just <EM>after</EM>
the log restart, and extracting your information from the just-closed
log.</P>
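
<P>As an illustration only, here is what the two schedules might look like
as crontab(1) entries.  Both the restart times and the script path
<CODE>$HOME/bin/extract_log.sh</CODE> are assumptions; substitute the
interval you found in Step Two and wherever you keep your own script.</P>

<PRE><CODE>
```text
# minute hour day-of-month month day-of-week command
# Use only ONE of these, whichever matches your server.

# Server keeps closed logs and restarts at midnight on the 1st of the month:
# run ten minutes after the restart and read the just-closed log.
10 0 1 * * $HOME/bin/extract_log.sh

# Server restarts nightly at midnight and keeps nothing:
# run ten minutes before the restart.
50 23 * * * $HOME/bin/extract_log.sh
```
</CODE></PRE>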


<H2>Step Three: Extract your personal access_log</H2>

<P>All you need to extract your personal information from an access_log is
the <A HREF=
 "http://www.solarisguide.com/cgi-bin/rtfm?cmd=grep&amp;sec=1&amp;search=EXACT"
>grep(1)</A> command.  If the access_log is uncompressed, just use this
command to extract your information to a new access_log in your home
directory, substituting your account name for mine, and the location of
your server access_log where appropriate.</P>

<KBD>grep <VAR>islander</VAR> access_log &gt;&gt; ~/access_log</KBD>

<P>Compressed access_log files require slightly more work.  Uncompress the
log to stdout and pipe it directly to grep(1).  For a gzip'ed log, the
command (which should be on one line, but your browser may wrap it) should
resemble:</P>

<KBD>gzip -dc access_log.gz | grep <VAR>islander</VAR> &gt;&gt; ~/access_log</KBD>

<P>Most of the time, grepping for your userid is sufficient, although you
may pick up some bogus entries if your userid is a common word.  A more
complex regexp may be used, but be careful -- you can't just grep for
<KBD>~userid</KBD>, because some browsers will escape <KBD>~</KBD> as
<KBD>%7E</KBD> (or <KBD>%7e</KBD>), and you'll miss those requests in the
access_log.  For directories using ~, try
<KBD>egrep &quot;/(~|%7[Ee])userid/&quot;</KBD>.</P>
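
<P>To see the difference the stricter pattern makes, here's a small sketch
that wraps it in a function and runs it over three made-up log lines.  The
function name <CODE>my_requests</CODE> is hypothetical, and it uses
<KBD>grep -E</KBD>, the POSIX spelling of egrep.  It also accepts a
lowercase <KBD>%7e</KBD>, on the assumption that some clients send the
escape that way.</P>

<PRE><CODE>
```shell
#!/bin/sh

# Keep only requests for one user's pages, whether the client sent "~"
# literally or escaped it as %7E / %7e.
# Usage: my_requests USERID   (reads log lines on stdin)
# Note: USERID is used inside a regexp, so it should be plain letters/digits.
my_requests() {
    grep -E "/(~|%7[Ee])$1/"
}

# Demo: the first two lines are kept, the third (a bogus match for a
# plain "grep islander") is dropped.
printf '%s\n' \
    'a.example.com - - [01/Jan/1996:00:00:12 -0500] "GET /~islander/ HTTP/1.0" 200 1234' \
    'b.example.com - - [01/Jan/1996:00:01:40 -0500] "GET /%7Eislander/pic.gif HTTP/1.0" 200 5678' \
    'c.example.com - - [01/Jan/1996:00:02:03 -0500] "GET /islander-resort.html HTTP/1.0" 200 910' \
    | my_requests islander
```
</CODE></PRE>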


<H3>Step Three And A Half: Compressing your personal access_log</H3>

<P>Now that you've learned how to save your information, you need to
archive it. An access_log can grow large quickly, but it compresses very
well. I recommend using <A HREF=
"http://www.solarisguide.com/cgi-bin/rtfm?cmd=gzip&amp;sec=1&amp;search=EXACT"
>gzip(1)</A>, because it can work in a command pipe and append to
already-archived files. That reduces the number of large files kept on disk
at one time, avoiding &quot;disk quota exceeded&quot; errors that can lose
your log.</P>

<P>Here are the extraction commands used above, altered to compress the
personal access_log:</P>

<KBD>grep <VAR>islander</VAR> /var/logs/www/access_log | gzip -9 &gt;&gt; ~/access_log.gz</KBD>

<P>and</P>

<KBD>gzip -dc access_log.gz | grep <VAR>islander</VAR> | gzip -9 &gt;&gt; ~/access_log.gz</KBD>

<P>gzip(1) can provide 80-90% compression on a log file.  In my case, 22
months of access_log entries compress to less than 900,000 kilobytes. 
Not bad, eh?</P>
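
<P>If appending to a compressed file looks suspicious, here's a small sketch
you can run in a scratch directory to convince yourself.  Each append adds a
new gzip member to the file, and <KBD>gzip -dc</KBD> reads the members back
as one continuous stream.  The function name <CODE>append_gz</CODE> is
hypothetical, and the demo assumes your system has mktemp(1).</P>

<PRE><CODE>
```shell
#!/bin/sh

# Compress stdin and append it to a gzip archive (created if missing).
# Usage: some_command | append_gz ARCHIVE
append_gz() {
    gzip -9 >> "$1"
}

# Demo: two separate appends, then one decompression of the whole file.
demo_dir=$(mktemp -d)
echo "first batch"  | append_gz "$demo_dir/access_log.gz"
echo "second batch" | append_gz "$demo_dir/access_log.gz"
gzip -dc "$demo_dir/access_log.gz"
# prints:
#   first batch
#   second batch
rm -r "$demo_dir"
```
</CODE></PRE>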


<H2>Step Four: Using your log analyzer</H2>

<P>Now that you've got a personal access_log, reconfigure your log analyzer
to use that log instead of the site-wide log.  If the analyzer can't
decompress logs on its own, it can probably read logs from standard input,
allowing you to &quot;feed&quot; the ~/access_log.gz to the analyzer in a
pipe.  For example:</P>

<KBD>gzip -dc ~/access_log.gz | <VAR>analyzer</VAR></KBD>

<P>(At various times, I've used personal access_logs (created using the
techniques on this page) with <A HREF=
"http://www.statslab.cam.ac.uk/~sret1/analog/">Analog</A>, Getstats, <A HREF
= "http://www.informatik.uni-frankfurt.de/~fp/Tools/Olista.html">W3Olista</A>, and <A 
HREF = "http://www.boutell.com/wusage/">wusage 3.2</A>.)</P>

<H2>Step Five: Putting it all together</H2>
<P>You now know where your server's access_log is, how often it's reset,
how to save the information for long-term use, and how to feed the personal
log to your log analyzer.  Now you have to put it all together in a shell
script, use crontab(1) to run it at the time you chose in Step 2.5, and
your logging will be automated.  An example shell script, using Analog to process the logs:</P>

<PRE><CODE>
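#!/bin/sh

# Substitute your server's log location and your own userid.
SERVER_LOG=/var/logs/www/access_log
MY_LOG=$HOME/logs/access_log.gz    # $HOME/logs must already exist

# Append this period's records for my pages to the compressed archive.
grep islander $SERVER_LOG | gzip -9 &gt;&gt; $MY_LOG
# Feed the whole archive to the analyzer on standard input.
gzip -dc $MY_LOG | analog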
</CODE></PRE>

<!--#exec cgi="./cgi/menu.pl"-->
<!--#exec cgi="./cgi/1995"-->
</BODY></HTML>

