Q: Simple Statistics question : split-run testing (Answered, 0 Comments)
 Question
 Subject: Simple Statistics question : split-run testing
 Category: Science > Math
 Asked by: iiservices-ga
 List Price: \$5.00
 Posted: 12 Jun 2004 22:10 PDT
 Expires: 12 Jul 2004 22:10 PDT
 Question ID: 360242
 ```I want to be able to accurately interpret the results of split-run testing on my website (that's where I have two versions of my website running simultaneously, and each page is shown in rotation to alternate visitors).

For each of the two groups, there's
(1) the total number of visitors in the group
(2) the number of visitors that performed a desired action

How does one calculate the probability of the observed difference in behavior of the test groups being a result of a random occurrence?

There's a calculator that does just this at
http://calc.in-the-name-of-profit.com/108929/ab_test.jsp

I'd just like to know how the figures are derived -- thanks!```
 ```Hi, iiservices-ga:

This appears to be a straightforward application of a chi-square test to a 2x2 tabulation of data:

[Chi Square Tutorial]
http://www.georgetown.edu/faculty/ballc/webtools/web_chi_tut.html

for which see this Web-based calculator as well:

[Web Chi Square Calculator]
http://www.georgetown.edu/faculty/ballc/webtools/web_chi.html

As with many methods of "statistical inference", the idea is to consider how likely the observed outcome would be under an assumed "null hypothesis", meaning in this case that there is no effective difference in the outcomes (desired action taken) due to the treatments (selection of Web page shown to visitor).

More precisely, the idea is to compute how often an outcome will deviate as much or more from the strict "average" behavior as is observed in the given sample. The more unlikely the amount of observed deviation, the more credence is given to rejection of the null hypothesis, i.e. to asserting that the treatments _do_ have an effect on the outcomes.

Begin by totaling the number of desired actions taken by all visitors, regardless of which Web page they were shown. The fraction of all visitors who took the desired action then gives a combined estimate, under the null hypothesis, of the probability of the desired action regardless of "treatment" by one or the other Web page:

  p = (# desired actions) / (# all visitors)

Strictly speaking the term "chi square" refers to an expression that tells how much a particular tabulation "deviates" from the average.
Letting:

  A = # visitors who see first Web page
  B = # visitors who see second Web page

then, since for each visitor the desired action is either taken or not, we can tabulate the outcomes by treatments in a 2x2 form:

               |   desired action    |
   treatment   |  taken   not_taken  | TOTALS
  -------------+---------------------+--------
   1st Web pg  |    x        A-x     |   A
   2nd Web pg  |    y        B-y     |   B
  -------------+---------------------+--------
   TOTALS      |   x+y    A-x+B-y    |  A+B

Now completely "average" behavior means that x would be pA and y would be pB. But of course it would be unusual to have exactly this average behavior. What we need to do is establish the probability distribution for our chosen measure (statistic) of deviation from the average:

  chi square = SUM (observed - average)^2 / average

where the values are summed over each cell in the table. That is, for each cell in the table, square the difference between the actual "observed" value in the cell and the expected "average" value (based on assuming the probability p for the desired action independent of Web page shown, i.e. the null hypothesis), and divide by the expected value:

               (x - pA)^2     (A - x - (1-p)A)^2
  chi square = ----------  +  ------------------
                   pA               (1-p)A

               (y - pB)^2     (B - y - (1-p)B)^2
             + ----------  +  ------------------
                   pB               (1-p)B

The short explanation is that the bigger chi square, the greater the deviation from average, and the less likely that such a deviation occurs "randomly".

If the number of observations is fairly small, it may be attractive to use the power of the computer to crank through all the possible outcomes, calculating their exact probabilities as determined by the binomial distribution.
That is, for a given sample size A, the chance that a particular number x of desired actions are taken is given by the binomial probability:

  C(A,x) * p^x * (1-p)^(A-x)

where C(A,x) is the binomial coefficient.

For a large number of observations the exact computation becomes unwieldy, even at computer speeds, but fortunately the historic approach of using a normal approximation to the binomial distribution (for sufficiently big samples) is plenty accurate for our purposes. A rule of thumb says the normal approximation is okay to use provided each of the four cells in the 2x2 table has at least 10 observations.

Without looking into the code behind the calculator on the page you cited, it would be a matter of trial and error to work out whether and when the exact computations are done versus using the normal approximation.

For additional discussion and a worked example, please see my earlier Answer here:

[Q: Statistic test]
http://answers.google.com/answers/threadview?id=317634

In particular note that the calculator page linked there allows for the comparison of the exact binomial computations and the normal approximation, together with an intermediate sort of method, Yates' correction to the normal approximation.

regards, mathtalk-ga```
Request for Answer Clarification by iiservices-ga on 14 Oct 2004 03:51 PDT
```Thanks for the great answer Mathtalk. Just a couple of small clarifications:

1) Is this a "two tailed" chi square test?

2) Is there an easy way of calculating the P value once we get the chi squared value? (The script I referred to in my initial question does it somehow). Most tables I've seen only offer a handful of P values for each DF; I'd like to be able to calculate the P value more exactly (or at least, have a table with many more P values, e.g. one for each percentile > 5%).
Thanks!```
Request for Answer Clarification by iiservices-ga on 14 Oct 2004 04:01 PDT
```Sorry, one more thing:

In one of the resources you gave, it says "Fisher's test is the best choice as it always gives the exact P value, while the chi-square test only calculates an approximate P value. Only choose chi-square if someone requires you to."

Would I be better off using Fisher's test, as this suggests?

Thanks.```
Clarification of Answer by mathtalk-ga on 14 Oct 2004 04:29 PDT
```Yes, that's right if you have a computer program to do it for you.

As I mentioned earlier in the Answer, the exact binomial computation becomes unwieldy, even for a computer, if the sample sizes are large enough. But a good implementation at that point will take care of substituting the chi-square/normal approximation for you "behind the scenes". When the numbers get this big, there's little difference between the exact results and the traditional approximation.

regards, mathtalk-ga```
Clarification of Answer by mathtalk-ga on 14 Oct 2004 05:09 PDT
```I didn't realize at first that there was a two-part request above the last one.

1. I would not describe the chi-square test as "two tailed". The chi-square statistic itself is never negative, because it's computed as a sum of squares. The test is "one sided" in the sense that the "null hypothesis" is rejected only when the chi-square statistic is too big, never when it is too small. However there is a "non-central" chi-square distribution that comes into play when one has a more complicated hypothesis to test.

2. If you are interested in computing within a script the P value (as a function of the chi-square value), this raises a number of issues. The chi-square distribution depends on the "degrees of freedom" parameter. If the observations fit into a 2x2 classification, then there is one degree of freedom and implementing the computation only for this case is certainly easier than doing it in general.
The exact definition of the P value is an improper integral (from the given chi-square value out to infinity). Given enough restrictions on the degrees of freedom and the range of chi-square values, a good approximation can be implemented in almost any programming language that allows floating point arithmetic. If you'd be interested in an approximate formula, I suggest you post a new Question because it goes beyond the scope of your original post. While evaluating such an approximation might reasonably be called "easy", deriving it and discussing the tradeoffs in limiting the range will not be. regards, mathtalk-ga```
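The calculation walked through in the answer can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code behind the calculator cited in the question: the function names and the visitor counts in the example are made up for illustration. For the 2x2 case (one degree of freedom) it relies on the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable, so the upper-tail P value can be written with the complementary error function that Python's standard `math` module provides.

```python
import math


def binomial_probability(A, x, p):
    """Chance of exactly x desired actions among A visitors,
    when each visitor acts independently with probability p."""
    return math.comb(A, x) * p ** x * (1 - p) ** (A - x)


def chi_square_2x2(x, A, y, B):
    """Chi-square statistic for the 2x2 table in the answer.

    x of the A visitors to the first page took the desired action,
    y of the B visitors to the second page did.
    """
    p = (x + y) / (A + B)  # pooled rate under the null hypothesis
    cells = [
        (x, p * A),            # 1st page, action taken
        (A - x, (1 - p) * A),  # 1st page, action not taken
        (y, p * B),            # 2nd page, action taken
        (B - y, (1 - p) * B),  # 2nd page, action not taken
    ]
    return sum((observed - expected) ** 2 / expected
               for observed, expected in cells)


def p_value_one_df(chi2):
    """Upper-tail probability for chi-square with ONE degree of freedom.

    Valid only for one degree of freedom (the 2x2 case):
    P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    """
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical split-run result: 30 of 1000 vs. 50 of 1000 visitors acted.
chi2 = chi_square_2x2(30, 1000, 50, 1000)
print("chi square:", round(chi2, 3))                   # about 5.21
print("P value:   ", round(p_value_one_df(chi2), 3))   # about 0.022
```

Because the one-degree-of-freedom statistic is a squared normal deviate, this P value agrees with a two-sided normal (z) test on the difference between the two rates. The sketch applies no Yates correction and does not switch to an exact method for small counts, so the rule of thumb about at least 10 observations per cell still applies.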