This appears to be a straightforward application of a chi-square test
to a 2x2 tabulation of data:
[Chi Square Tutorial]
for which see this Web-based calculator as well:
[Web Chi Square Calculator]
As with many methods of "statistical inference", the idea is to
consider how likely the observed outcome would be under an assumed
"null hypothesis", meaning in this case that there is no effective
difference in the outcomes (desired action taken) due to the
treatments (selection of Web page shown to visitor). More precisely
the idea is to compute how often an outcome will deviate as much or
more from the strict "average" behavior as is observed in the given
sample. The more unlikely the amount of observed deviation, the more
credence is given to rejection of the null hypothesis, i.e. to
asserting that the treatments _do_ have an effect on the outcomes.
Begin by totaling the number of desired actions taken by all visitors,
regardless of which Web page they were shown. The fraction of all
visitors who took the desired action then gives a combined estimate,
under the null hypothesis, of the probability of the desired action
regardless of "treatment" by one or the other Web pages:
p = (# desired actions) / (# all visitors)
Strictly speaking the term "chi square" refers to an expression that
tells how much a particular tabulation "deviates" from the average.
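Let
  A = # visitors who see first Web page
  B = # visitors who see second Web page
  x = # visitors who see first Web page and take the desired action
  y = # visitors who see second Web page and take the desired action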
then, since for each visitor the desired action is either taken or
not, we can tabulate the outcomes by treatments in a 2x2 form:
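  treatment \ action |  taken    not_taken  | TOTALS
  -------------------+----------------------+-------
  1st Web pg         |    x         A-x     |   A
  2nd Web pg         |    y         B-y     |   B
  -------------------+----------------------+-------
  TOTALS             |   x+y      A-x+B-y   |  A+B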
Now completely "average" behavior means that x would be pA and y would
be pB. But of course it would be unusual to have exactly this average
behavior. What we need to do is establish the probability
distribution for our chosen measure (statistic) of deviation from the
average:
  chi square = SUM (observed - average)^2 / average
where the values are summed over each cell in the table. That is, for
each cell in the table, square the difference between the actual
"observed" value in the cell and the expected "average" value (based
on assuming the probability p for the desired action independent of
Web page shown, i.e. the null hypothesis):
                (x - pA)^2     (A - x - (1-p)A)^2
  chi square =  ----------  +  ------------------
                    pA               (1-p)A

                (y - pB)^2     (B - y - (1-p)B)^2
             +  ----------  +  ------------------
                    pB               (1-p)B
The short explanation is that the bigger chi square, the greater the
deviation from average, and the less likely that such a deviation
would arise by chance alone if the null hypothesis were true.
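To make the four-term sum concrete, here is a minimal Python sketch of the computation as described above. The function name chi_square_2x2 and the sample counts are my own illustration, not anything taken from the calculator pages.

```python
def chi_square_2x2(x, A, y, B):
    """Chi-square statistic for the 2x2 table of actions by Web page.

    x, y: desired actions taken by visitors to the 1st and 2nd page.
    A, B: total visitors shown the 1st and 2nd page.
    """
    p = (x + y) / (A + B)  # pooled probability under the null hypothesis
    if p == 0 or p == 1:
        raise ValueError("every visitor behaved the same; chi square is undefined")
    observed = [x, A - x, y, B - y]
    expected = [p * A, (1 - p) * A, p * B, (1 - p) * B]
    # Sum (observed - average)^2 / average over the four cells.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up counts: 30 of 200 visitors act on page 1, 50 of 200 on page 2.
print(chi_square_2x2(30, 200, 50, 200))  # approximately 6.25
```

With these made-up counts the pooled p is 0.2, so the "average" cells are 40, 160, 40 and 160, and each observed cell is off by 10.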
If the number of observations is fairly small, it may be attractive to
use the power of the computer to crank through all the possible
outcomes, calculating their exact probabilities as determined by the
binomial distribution. That is, for a given sample size A, the chance
that a particular number x of desired actions are taken is given by
the binomial probability:
  C(A,x) * p^x * (1-p)^(A-x)
where C(A,x) is the binomial coefficient "A choose x".
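For small samples the "crank through all the possible outcomes" idea might look like the sketch below. It is only one way to do it, under the assumption that the pooled estimate p is treated as the true rate for both pages; the function names are hypothetical. Other exact methods (Fisher's exact test, for example) instead condition on the table's margins and will give somewhat different numbers.

```python
from math import comb


def binomial_pmf(x, A, p):
    """Probability of exactly x desired actions among A visitors."""
    return comb(A, x) * p**x * (1 - p) ** (A - x)


def chi_square(x, A, y, B):
    """Chi-square statistic of a 2x2 table, using its own pooled rate."""
    p = (x + y) / (A + B)
    if p in (0.0, 1.0):
        return 0.0  # no variation at all, so no deviation from average
    cells = [(x, p * A), (A - x, (1 - p) * A), (y, p * B), (B - y, (1 - p) * B)]
    return sum((o - e) ** 2 / e for o, e in cells)


def exact_p_value(x, A, y, B):
    """Total probability of all outcomes deviating at least as much as (x, y)."""
    p = (x + y) / (A + B)  # assumption: pooled estimate stands in for the true rate
    threshold = chi_square(x, A, y, B) * (1 - 1e-9)  # tolerate rounding error
    total = 0.0
    for i in range(A + 1):          # every possible count for the 1st page
        for j in range(B + 1):      # every possible count for the 2nd page
            if chi_square(i, A, j, B) >= threshold:
                total += binomial_pmf(i, A, p) * binomial_pmf(j, B, p)
    return total


# Made-up small sample: 1 of 10 act on page 1, 8 of 10 on page 2.
print(exact_p_value(1, 10, 8, 10))
```

The double loop visits (A+1)*(B+1) tables, which is why this approach gets unwieldy as the sample sizes grow.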
For a large number of observations the exact computation becomes
unwieldy, even at computer speeds, but fortunately the historic
approach of using a normal approximation to the binomial distribution
(for sufficiently big samples) is plenty accurate for our purposes. A
rule of thumb says the normal approximation is okay to use provided
each of the four cells in the 2x2 table has at least 10 observations.
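A script could apply that rule of thumb before deciding which method to use. The sketch below is just an illustration; the function name and the min_count parameter are my own, and the threshold of 10 is the rule of thumb quoted above rather than a hard law.

```python
def normal_approximation_ok(x, A, y, B, min_count=10):
    """Return True if every cell of the 2x2 table has at least min_count observations."""
    cells = [x, A - x, y, B - y]  # taken / not taken for each Web page
    return all(count >= min_count for count in cells)


# Made-up counts: the second call has only 3 desired actions on page 1.
print(normal_approximation_ok(30, 200, 50, 200))
print(normal_approximation_ok(3, 200, 50, 200))
```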
Without looking into the code behind the calculator on the page you
cited, it would be a matter of trial and error to work out whether and
when the exact computations are done versus using the normal
approximation.
For additional discussion and a worked example, please see my earlier Answer here:
[Q: Statistic test]
In particular note that the calculator page linked there allows for
the comparison of the exact binomial computations and the normal
approximation, together with an intermediate sort of method, Yates'
continuity correction to the normal approximation.
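For reference, the usual textbook form of the Yates correction shrinks each |observed - average| difference by 0.5 before squaring. The sketch below follows that form; the function name is hypothetical, the clamp at zero is a safeguard I have added, and I have not checked whether the linked calculator does exactly this.

```python
def chi_square_yates(x, A, y, B):
    """Chi-square statistic for a 2x2 table with Yates' continuity correction."""
    p = (x + y) / (A + B)  # pooled probability under the null hypothesis
    if p == 0 or p == 1:
        raise ValueError("every visitor behaved the same; chi square is undefined")
    cells = [(x, p * A), (A - x, (1 - p) * A), (y, p * B), (B - y, (1 - p) * B)]
    # Reduce each absolute deviation by 0.5, but never below zero (added safeguard).
    return sum(max(abs(o - e) - 0.5, 0.0) ** 2 / e for o, e in cells)


# Made-up counts: 30 of 200 visitors act on page 1, 50 of 200 on page 2.
print(chi_square_yates(30, 200, 50, 200))  # approximately 5.64
```

The corrected statistic is always a little smaller than the uncorrected one, so the test becomes slightly more conservative.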
Clarification of Answer by
14 Oct 2004 04:29 PDT
Yes, that's right if you have a computer program to do it for you.
As I mentioned earlier in the Answer, the exact binomial computation
becomes unwieldy, even for a computer, if the sample sizes are large
enough. But a good implementation at that point will take care of
substituting the chi-square/normal approximation for you "behind the
scenes". When the numbers get this big, there's little difference
between the exact results and the traditional approximation.
Clarification of Answer by
14 Oct 2004 05:09 PDT
I didn't realize at first that there was a two-part request above the last one.
1. I would not describe the chi-square test as "two tailed". The
chi-square statistic itself is never negative, because it's computed
as a sum of squares. The test is "one sided" in the sense that the
"null hypothesis" is rejected only when the chi-square statistic is
too big, never when it is too small. However there is a "non-central"
chi-square distribution that comes into play when one has a more
complicated hypothesis to test.
2. If you are interested in computing within a script the P value (as
a function of the chi-square value), this raises a number of issues.
The chi-square distribution depends on the "degrees of freedom"
parameter. If the observations fit into a 2x2 classification, then
there is one degree of freedom and implementing the computation only
for this case is certainly easier than doing it in general.
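For the one-degree-of-freedom case specifically, a chi-square variable is the square of a standard normal variable, so the P value can be written with the complementary error function. Here is a minimal sketch, assuming Python and its standard math module; the function name is my own, and the formula does not carry over to other degrees of freedom.

```python
from math import erfc, sqrt


def p_value_one_df(chi_square):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    if chi_square < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    # P(Z^2 > c) = 2 * P(Z > sqrt(c)) = erfc(sqrt(c / 2)) for standard normal Z.
    return erfc(sqrt(chi_square / 2.0))


# 3.841459 is the familiar 5% critical value for 1 degree of freedom.
print(p_value_one_df(3.841459))  # approximately 0.05
```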
The exact definition of the P value is an improper integral (from the
given chi-square value out to infinity). Given enough restrictions on
the degrees of freedom and the range of chi-square values, a good
approximation can be implemented in almost any programming language
that allows floating point arithmetic. If you'd be interested in an
approximate formula, I suggest you post a new Question because it goes
beyond the scope of your original post. While evaluating such an
approximation might reasonably be called "easy", deriving it and
discussing the tradeoffs in limiting the range will not be.