Google Answers: How do log file analyzers identify visitors versus hits ?

Question

Subject: How do log file analyzers identify visitors versus hits ?
Category: Computers > Internet
Asked by: misterfine-ga
List Price: $15.00

Posted: 20 Jul 2003 11:04 PDT
Expires: 19 Aug 2003 11:04 PDT
Question ID: 233050

I am thinking of developing a data warehouse to analyze the traffic
for a number of web sites that I operate.  While I understand most of
the data in web log files, I don't know how the main packages
differentiate between visitors and hits.   What I need to know is :

Assuming that the site does not use cookies or append session IDs to
the URLs:

1. How do web log analyzers aggregate a series of hits into a visitor?

2. How can this be done when users are coming from an IP that
aggregates many visitors (like AOL does) ?

3. Is there any existing open source software that I can use for this
purpose to pre-process my data before it goes into the warehouse ?

Request for Question Clarification by sycophant-ga on 21 Jul 2003 04:14 PDT

Hi,

I'm still looking into your first two questions, but the third one is
a little unclear. You want software that does what exactly? And from
what log format?

I have a setup on my own Linux server that creates daily web-stats
automatically for each virtual host on the server. Is that the sort of
thing you are looking for? My implementation uses Apache, Webalizer,
and a custom Perl script.

Also, I did find this interesting white paper, which looks at the
details of log analysis:
http://www.teced.com/PDFs/whitepap.pdf
(or as HTML http://216.239.33.104/search?q=cache:8Z6xfJxM8qUJ:www.teced.com/PDFs/whitepap.pdf+server+log+visit+definition&hl=en&ie=UTF-8)

Regards,
Sycophant-ga

Clarification of Question by misterfine-ga on 21 Jul 2003 07:41 PDT

Thanks for the link, it was interesting.  In Question 3, I'm wondering
if there is somewhere that I can find some code that would add
information to a log file (or to a related file) identifying
which entries in the file are grouped together as a visitor.  I would
presume that the software would require me to enter some parameters,
such as whether there is a session ID, or the timeout after which the
same IP address is treated as a new visitor.

Obviously all normal web log analyzers, including those that are open
source, have to perform this function somehow.

Answer

Subject: Re: How do log file analyzers identify visitors versus hits ?
Answered By: sycophant-ga on 21 Jul 2003 14:52 PDT
Rated: 5 out of 5 stars

Hi, 

The answer to your first question is that most do roughly the same thing:
they group all requests from a 'user' within a timeout period into a
single visit.

I only managed to find a few packages that actually went far in
describing their methods; below are some of those descriptions.

Wusage:
“A "visit" consists of one or more accesses made by the same visitor,
with no more than a certain time interval between accesses. The
maximum time interval is determined by the Max. Minutes Between Accesses
(trailtimeout) option.

The identity of the visitor is determined by combining the authorized
user name (when available), HTTP user-identifying "cookies", site (IP
address), operating system, and web browser identifying information in
order to produce the most unique "key" possible. Any such fields that
are not actually available are not used. In the simplest case, where
the log file does not contain any other user-identifying information,
only the site (IP address) of the visitor is used. When cookies are
present, they override all other factors.

When the maximum time interval has elapsed, the visit is considered to
be over, and the next access by that visitor begins a new visit.”
http://www.boutell.com/wusage/8.0/definitions.html
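
As a rough illustration of the key-building rule quoted above, here is a
minimal Python sketch. It is not Wusage's code: the function name
visitor_key and its parameters are hypothetical, and it assumes the
operating system and browser are both taken from the user agent string.

```python
from typing import Optional


def visitor_key(ip: str,
                user: Optional[str] = None,
                cookie: Optional[str] = None,
                user_agent: Optional[str] = None) -> tuple:
    """Build a visitor key from whichever identifying fields are present."""
    # Per the quoted description, a cookie overrides all other factors.
    if cookie:
        return ("cookie", cookie)
    # Otherwise combine the fields that are available; missing ones are skipped.
    parts = [("ip", ip)]
    if user:
        parts.append(("user", user))
    if user_agent:
        # Assumption: OS and browser details both come from the user agent.
        parts.append(("agent", user_agent))
    return tuple(parts)
```

With only an IP address in the log, the key collapses to the IP alone, which
is the "simplest case" the description mentions. Two requests from the same
IP with different user agents get different keys.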


Webalizer:
“Visits occur when some remote site makes a request for a page on
your server for the first time. As long as the same site keeps making
requests within a given timeout period, they will all be considered
part of the same Visit. If the site makes a request to your server,
and the length of time since the last request is greater than the
specified timeout period (default is 30 minutes), a new Visit is
started and counted, and the sequence repeats. Since only pages will
trigger a visit, remote sites that link to graphic and other
non-page URLs will not be counted in the visit totals, reducing the
number of false visits.”
http://www.webalizer.org/webalizer_help.html
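
The timeout rule quoted above can be sketched in a few lines of Python. This
is only my reading of the description, not Webalizer's source: the name
count_visits, the PAGE_SUFFIXES list, and the choice to ignore non-page
requests entirely are all assumptions.

```python
from datetime import datetime, timedelta

# Assumption: which URL endings count as a "page".
PAGE_SUFFIXES = (".html", ".htm", "/")


def count_visits(requests, timeout=timedelta(minutes=30)):
    """Count visits in (timestamp, host, url) tuples sorted by timestamp."""
    last_seen = {}  # host -> time of its most recent page request
    visits = 0
    for when, host, url in requests:
        path = url.split("?", 1)[0]
        if not path.endswith(PAGE_SUFFIXES):
            continue  # images and other non-page URLs never start a visit
        previous = last_seen.get(host)
        # A first request, or a gap longer than the timeout, starts a new visit.
        if previous is None or when - previous > timeout:
            visits += 1
        last_seen[host] = when
    return visits
```

Because the host is the only key here, everyone behind one proxy address is
counted as a single visitor for as long as requests keep arriving within the
timeout.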


NetTracker:
“A visitor is a person viewing a Web site. If your Web site does not
use cookies or if the visitor does not have a cookie, a visitor is
defined as a unique combination of a user agent and a host name or IP
address. If your site uses cookies sent by the Sane Web Server
Plug-in, a visitor would be defined by the cookie transmitted by the
visitor's browser. NetTracker can also be configured to define
visitors based on their HTTP authenticated user name or a parsed
parameter.”
http://www.sane.com/support/NetTracker/faq.html#visitview


Two of the three seem to make a unique user key from the available log
information (typically IP address and user agent). I suspect that
others do the same, but do not really discuss it in their
documentation.

That also addresses question two: a number of applications generate a
user key from the IP address and other header information. This is a means
of individualising users who are visiting through a proxy server,
although it is obviously still not going to be perfect.

I suspect that this approach is probably used in the majority of log
analysis software; however, other programs don't detail their methods
well.

Webalizer is a little unclear about its methods, but its use of the
term 'site' in the above passage indicates that its visitor counting
is based solely upon IP address.

As for your third question, I have not really found any analysis
software that claims to alter the actual log files in any way.
However, I did find this quite basic Awk script, which generates a
basic breakdown of visitors in a simple log form from IIS logs. It
seems quite straightforward, and may provide a good starting point for
a simple script to do what you want:
http://alan-ng.net/scripts/visits.htm
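
If you end up writing your own pre-processor, the labelling step might look
something like the Python sketch below. It is unrelated to the Awk script
above; the name assign_visit_ids and the (timestamp, key, line) input shape
are hypothetical, and it assumes the entries are already in time order and
that you have already chosen a visitor key (IP alone, or IP plus user agent).

```python
from datetime import datetime, timedelta


def assign_visit_ids(entries, timeout=timedelta(minutes=30)):
    """Yield (visit_id, line) for (timestamp, key, line) tuples in time order."""
    current = {}  # visitor key -> (visit_id, time of that visitor's last request)
    next_id = 1
    for when, key, line in entries:
        state = current.get(key)
        if state is None or when - state[1] > timeout:
            # New visitor, or the timeout has elapsed: start a new visit.
            visit_id = next_id
            next_id += 1
        else:
            visit_id = state[0]
        current[key] = (visit_id, when)
        yield visit_id, line
```

The pairs it yields could be written to a related file, or loaded into the
warehouse as an extra column, leaving the original log untouched. The timeout
is the parameter you mentioned wanting to set.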

I hope this helps.

Regards,
Sycophant-ga

misterfine-ga rated this answer: 5 out of 5 stars

That's what I wanted to know -- thanks !

Comments

Subject: Re: How do log file analyzers identify visitors versus hits ?
From: tubs-ga on 20 Jul 2003 14:10 PDT

From my experience with Apache, there's no way to log a unique
identifier that completely distinguishes one visitor from another.
Details of the fields you can log in Apache can be found
at this URL:

http://httpd.apache.org/docs/logs.html

I imagine other webservers offer similar logging features.

Without a cookie or a session ID, the information the web server logs
that would help you identify a unique visitor is the IP address, the
referer and the user agent.  I believe web analyzer software will use
one or more of these fields to make a best guess at the number of
unique visitors.  This may cause some false hits or misses, but it
should give a fairly reasonable estimate of the number of unique
visitors you are getting.
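
For what it's worth, here is a small Python sketch that pulls those three
fields out of a line in Apache's "combined" log format. It assumes your logs
actually use that format; the names COMBINED and parse_line are my own, and
the pattern does not handle escaped quotes inside a field.

```python
import re

# Apache "combined" format: host ident user [time] "request" status size
# "referer" "user agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None
```

The host, referer and agent entries of the returned dict are the fields an
analyzer would have to work with when there is no cookie or session ID.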

Subject: Re: How do log file analyzers identify visitors versus hits ?
From: robertskelton-ga on 20 Jul 2003 15:56 PDT

Referer page plus IP plus time sequence. You would need a huge number
of visitors, or only one extremely prominent link that is the only way
to access your site, for there to be any hiccups in the data.

Subject: Re: How do log file analyzers identify visitors versus hits ?
From: wengland-ga on 21 Jul 2003 11:20 PDT

A major telecommunications company uses two ways to log visitors -
first, they have 'web bugs', zero pixel images that load a cookie to
your browser when they are rendered.  These web bugs come from a third
party company, and the third party generates statistics on unique
visitors, click paths, time on site, etc.

Secondly, they use a unique session ID and log each page you hit. 
They access this information via either an Oracle database, or from
simple printouts in the log, when the session is lost.

Subject: Re: How do log file analyzers identify visitors versus hits ?
From: rkm100-ga on 31 Jul 2003 06:28 PDT

http://www.urlanalyzer.com try it out..
