Q: Simple Statistics question : split-run testing ( Answered 5 out of 5 stars,   0 Comments )
Question  
Subject: Simple Statistics question : split-run testing
Category: Science > Math
Asked by: iiservices-ga
List Price: $5.00
Posted: 12 Jun 2004 22:10 PDT
Expires: 12 Jul 2004 22:10 PDT
Question ID: 360242
I want to be able to accurately interpret the results of split-run
testing on my website (that's where I have two versions of my website
running simultaneously, and each page is shown in rotation to
alternate visitors).

For each of the two groups, there's
(1) the total number of visitors in the group
(2) the number of visitors that performed a desired action

How does one calculate the probability of the observed difference in
behavior of the test groups being a result of a random occurrence?

There's a calculator that does just this at
http://calc.in-the-name-of-profit.com/108929/ab_test.jsp

I'd just like to know how the figures are derived -- thanks!
Answer  
Subject: Re: Simple Statistics question : split-run testing
Answered By: mathtalk-ga on 13 Jun 2004 18:30 PDT
Rated:5 out of 5 stars
 
Hi, iiservices-ga:

This appears to be a straightforward application of a chi-square test
to a 2x2 tabulation of data:

[Chi Square Tutorial]
http://www.georgetown.edu/faculty/ballc/webtools/web_chi_tut.html

for which see this Web-based calculator as well:

[Web Chi Square Calculator]
http://www.georgetown.edu/faculty/ballc/webtools/web_chi.html

As with many methods of "statistical inference" the idea is to
consider how likely the observed outcome would be under an assumed
"null hypothesis", meaning in this case that there is no effective
difference in the outcomes (desired action taken) due to the
treatments (selection of Web page shown to visitor).  More precisely
the idea is to compute how often an outcome will deviate as much or
more from the strict "average" behavior as is observed in the given
sample.  The more unlikely the amount of observed deviation, the more
credence is given to rejection of the null hypothesis, i.e. to
asserting that the treatments _do_ have an effect on the outcomes.

Begin by totaling the number of desired actions taken by all visitors,
regardless of which Web page they were shown.  The fraction of all
visitors who took the desired action then gives a combined estimate,
under the null hypothesis, of the probability of the desired action
regardless of "treatment" by one or the other Web pages:

p = (# desired actions) / (# all visitors)

Strictly speaking the term "chi square" refers to an expression that
tells how much a particular tabulation "deviates" from the average. 
Letting:

A = # visitors who see first Web page

B = # visitors who see second Web page

then, since for each visitor the desired action is either taken or
not, we can tabulate the outcomes by treatments in a 2x2 form:

         \ desired
treatment \ action   taken   not_taken | TOTALS
           \                           |
------------ --------------------------|
            |                          |
1st Web pg  |          x       A-x     |    A
            |                          |
2nd Web pg  |          y       B-y     |    B
            |                          |
---------------------------------------
TOTALS                x+y    A-x+B-y      A+B

Now completely "average" behavior means that x would be pA and y would
be pB.  But of course it would be unusual to have exactly this average
behavior.  What we need to do is establish the probability
distribution for our chosen measure (statistic) of deviation from the
average:

chi square = SUM (observed - average)^2 / average

where the values are summed over each cell in the table.  That is, for
each cell in the table, square the difference between the actual
"observed" value in the cell and the expected "average" value (based
on assuming the probability p for the desired action independent of
Web page shown, i.e. the null hypothesis):

             (x - pA)^2     (A - x - (1-p)A)^2
chi square = ----------  +  ------------------
                 pA              (1-p)A

             (y - pB)^2     (B - y - (1-p)B)^2
           + ----------  +  ------------------
                 pB              (1-p)B

The short explanation is that the bigger chi square, the greater the
deviation from average, and the less likely that such a deviation
occurs "randomly".
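
As a minimal sketch of the formula above (the function name and the
sample counts are made up for illustration, not taken from the
calculator mentioned in the Question), the statistic can be computed
directly from x, A, y and B:

```python
def chi_square_2x2(x, A, y, B):
    """Chi-square statistic for the 2x2 table (no continuity correction).

    x, y = desired actions on the 1st and 2nd Web page
    A, B = visitors shown the 1st and 2nd Web page
    Assumes 0 < x + y < A + B, otherwise an expected count is zero.
    """
    p = (x + y) / (A + B)  # pooled rate under the null hypothesis
    observed = [x, A - x, y, B - y]
    expected = [p * A, (1 - p) * A, p * B, (1 - p) * B]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Illustrative numbers only: 30 of 1000 visitors vs. 50 of 1000 visitors.
print(round(chi_square_2x2(30, 1000, 50, 1000), 3))  # 5.208
```

Here p = 80/2000 = 0.04, so the expected counts are 40 and 960 in each
row, and every cell deviates from its expected value by 10.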

If the number of observations is fairly small, it may be attractive to
use the power of the computer to crank through all the possible
outcomes, calculating their exact probabilities as determined by the
binomial distribution.  That is, for a given sample size A, the chance
that a particular number x of desired actions are taken is given by
the binomial probability:

C(A,x) * p^x * (1-p)^(A-x)
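
A small sketch of that exact probability, i.e. the binomial
coefficient C(A,x) multiplied by p^x and by (1-p)^(A-x).  The function
name is hypothetical, and math.comb needs Python 3.8 or later:

```python
from math import comb


def binomial_pmf(x, A, p):
    """Chance of exactly x desired actions among A visitors,
    when each visitor acts independently with probability p."""
    return comb(A, x) * p ** x * (1 - p) ** (A - x)


# Sanity check: the probabilities over x = 0..A add up to 1.
total = sum(binomial_pmf(x, 20, 0.04) for x in range(21))
print(abs(total - 1.0) < 1e-12)  # True
```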

For a large number of observations the exact computation becomes
unwieldy, even at computer speeds, but fortunately the historic
approach of using a normal approximation to the binomial distribution
(for sufficiently big samples) is plenty accurate for our purposes.  A
rule of thumb says the normal approximation is okay to use provided
each of the four cells in the 2x2 table has at least 10 observations.
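
One way to sketch the normal approximation is as a z statistic for
the difference of the two observed rates, using the pooled p from
above.  For a 2x2 table the square of this z equals the chi-square
statistic.  The function names and the sample counts below are
illustrative assumptions:

```python
from math import erfc, sqrt


def two_proportion_z(x, A, y, B):
    """z statistic for x/A versus y/B under the pooled null rate p."""
    p = (x + y) / (A + B)
    standard_error = sqrt(p * (1 - p) * (1 / A + 1 / B))
    return (x / A - y / B) / standard_error


def cells_large_enough(x, A, y, B, minimum=10):
    """Rule of thumb from the text: every cell has at least 10 observations."""
    return min(x, A - x, y, B - y) >= minimum


z = two_proportion_z(30, 1000, 50, 1000)   # made-up counts
p_value = erfc(abs(z) / sqrt(2))           # chance of |Z| >= |z| for a standard normal
print(cells_large_enough(30, 1000, 50, 1000), round(z * z, 3))  # True 5.208
```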

Without looking into the code behind the calculator on the page you
cited, it would be a matter of trial and error to work out whether and
when the exact computations are done versus using the normal
approximation.

For additional discussion and a worked example, please see my earlier Answer here:

[Q: Statistic test]
http://answers.google.com/answers/threadview?id=317634

In particular note that the calculator page linked there allows for
the comparison of the exact binomial computations and the normal
approximation, together with an intermediate sort of method, Yates
correction to the normal approximation.
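
For reference, Yates correction shrinks each |observed - average|
difference by 0.5 before squaring.  A minimal sketch, with a
hypothetical function name and made-up counts:

```python
def chi_square_2x2_yates(x, A, y, B):
    """Chi-square statistic for the 2x2 table with Yates continuity correction."""
    p = (x + y) / (A + B)  # pooled rate under the null hypothesis
    observed = [x, A - x, y, B - y]
    expected = [p * A, (1 - p) * A, p * B, (1 - p) * B]
    # Reduce each absolute deviation by 0.5, but never below zero.
    return sum(max(0.0, abs(o - e) - 0.5) ** 2 / e
               for o, e in zip(observed, expected))


print(round(chi_square_2x2_yates(30, 1000, 50, 1000), 3))  # 4.701
```

The corrected value is always a little smaller than the uncorrected
statistic, so the resulting P value is a little larger.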

regards, mathtalk-ga

Request for Answer Clarification by iiservices-ga on 14 Oct 2004 03:51 PDT
Thanks for the great answer Mathtalk.

Just a couple of small clarifications:

1) Is this a "two tailed" chi square test?

2) Is there an easy way of calculating the P value once we get the chi
squared value? (The script I referred to in my initial question does it
somehow).

Most tables I've seen only offer a handful P values for each DF; I'd
like to be able to calculate the P value more exactly (or at least,
have a table with many more P values, e.g. one for each percentile >
5%).

Thanks!

Request for Answer Clarification by iiservices-ga on 14 Oct 2004 04:01 PDT
Sorry, one more thing:

In one of the resources you gave, it says "Fisher's test is the best
choice as it always gives the exact P value, while the chi-square test
only calculates an approximate P value. Only choose chi-square if
someone requires you to. ".

Would I be better off using Fisher's test, as this suggests?

Thanks.

Clarification of Answer by mathtalk-ga on 14 Oct 2004 04:29 PDT
Yes, that's right: Fisher's test is the better choice if you have a
computer program to do it for you.

As I mentioned earlier in the Answer, the exact binomial computation
becomes unwieldy, even for a computer, if the sample sizes are large
enough.  But a good implementation at that point will take care of
substituting the chi-square/normal approximation for you "behind the
scenes".  When the numbers get this big, there's little difference
between the exact results and the traditional approximation.
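
For modest sample sizes, Fisher's exact test can be sketched in a few
lines.  This is an illustration only: the function name is
hypothetical, and the two-sided P value here follows one common
convention (add up every table with the same totals that is no more
probable than the observed one).  Other conventions exist.

```python
from math import comb


def fisher_exact_two_sided(x, A, y, B):
    """Two-sided Fisher exact P value for the 2x2 table (x, A-x; y, B-y)."""
    k = x + y                       # total desired actions, held fixed
    denominator = comb(A + B, k)

    def prob(i):
        # Chance that i of the k desired actions fall on the 1st Web page.
        return comb(A, i) * comb(B, k - i) / denominator

    p_observed = prob(x)
    lowest, highest = max(0, k - B), min(A, k)
    # The small tolerance guards against floating point ties.
    return sum(prob(i) for i in range(lowest, highest + 1)
               if prob(i) <= p_observed * (1 + 1e-9))


# Tiny example: 3 of 4 versus 1 of 4.
print(round(fisher_exact_two_sided(3, 4, 1, 4), 4))  # 0.4857
```

The loop visits every possible table, which is why the exact approach
slows down as the counts grow.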

regards, mathtalk-ga

Clarification of Answer by mathtalk-ga on 14 Oct 2004 05:09 PDT
I didn't realize at first that there was a two-part request above the last one.

1.  I would not describe the chi-square test as "two tailed".  The
chi-square statistic itself is never negative, because it's computed
as a sum of squares.  The test is "one sided" in the sense that the
"null hypothesis" is rejected only when the chi-square statistic is
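too big, never when it is too small.  However there is a "non-central"
chi-square distribution that comes into play when one has a more
complicated hypothesis to test.  Note also that because the deviations
are squared, a difference between the two Web pages in either
direction makes the statistic large, so for a 2x2 table the test
plays the role of a two-sided comparison of the two proportions.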

2.  If you are interested in computing within a script the P value (as
a function of the chi-square value), this raises a number of issues. 
The chi-square distribution depends on the "degrees of freedom"
parameter.  If the observations fit into a 2x2 classification, then
there is one degree of freedom and implementing the computation only
for this case is certainly easier than doing it in general.

The exact definition of the P value is an improper integral (from the
given chi-square value out to infinity).  Given enough restrictions on
the degrees of freedom and the range of chi-square values, a good
approximation can be implemented in almost any programming language
that allows floating point arithmetic.  If you'd be interested in an
approximate formula, I suggest you post a new Question because it goes
beyond the scope of your original post.  While evaluating such an
approximation might reasonably be called "easy", deriving it and
discussing the tradeoffs in limiting the range will not be.
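
For the one-degree-of-freedom case only (the 2x2 table), there is a
shortcut when the language supplies a complementary error function: a
chi-square variable with one degree of freedom is the square of a
standard normal variable, so the P value is erfc(sqrt(chi_square/2)).
A sketch with a hypothetical function name; it does not cover other
degrees of freedom:

```python
from math import erfc, sqrt


def p_value_df1(chi2):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    if chi2 < 0:
        raise ValueError("chi-square statistic cannot be negative")
    return erfc(sqrt(chi2 / 2.0))


print(p_value_df1(3.841459))  # close to 0.05, the familiar table entry
```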

regards, mathtalk-ga
iiservices-ga rated this answer:5 out of 5 stars

Comments  
There are no comments at this time.
