 ```In a sample of 500, I see 5 red balls. It is expected that one sees 15 red balls. Is my observation of 5 significantly different from the expected 15, assuming pval of 5%? Can someone walk me through the logic to solve this? Also, how would I determine what sample size would be sufficient (for pval 5%) if I had observed 5?```
 ```Hi, lochness-ga:

We have to put some assumptions around the bare bones of what you've stated in order to make mathematical sense of the question.

Suppose you have a sample of 500 balls, drawn independently from an inexhaustible supply of balls both red and non-red. The supply has 3% red balls, so that the "expected value" of the number of red balls in such a sample is a priori 15. However, in some particular sample you see only 5 red balls.

With these assumptions we have defined a binomial distribution. Each ball sampled is red with probability 0.03 and non-red with probability 0.97. Using the familiar binomial expansion of:

( 0.03 + 0.97 )^500 = 1

we can identify the probability of: 0 red balls, 1 red ball, 2 red balls, etc.

The terminology of "significance values" is confusing to most people, in large part because it is a well-intentioned attempt to put the cart before the horse. Instead of saying, gee, I wonder what the chance is of getting 5 or fewer red balls in this situation, where 15 would be the average, the professional statistician will seize on that same (presumably small) likelihood and proclaim that it is the "p value" (observed significance level) under the null hypothesis.

The reason for this circumlocution is that the statistician has been asked to apply what is known about the single sample drawn to the larger question of whether or not we should accept the claim that the balls were drawn in the manner described, aka the null hypothesis.

The logical answer is that we have no idea. It's not like a case where we were promised that jelly beans would be drawn from a supply with no licorice ones, only to find that there are 5 out of 500 of them in the sample. Instead it's a "shades of grey" situation: the sample is possible, but perhaps so unlikely as to shake our confidence in the original assumptions.
In this case the chance of getting 5 or fewer red balls in a random sample of 500 is calculated like this:

Pr( 0 red out of 500 ) = C(500,0) * (0.97)^500            = 0.000000243146...
Pr( 1 red out of 500 ) = C(500,1) * (0.03)   * (0.97)^499 = 0.000003759990...
Pr( 2 red out of 500 ) = C(500,2) * (0.03)^2 * (0.97)^498 = 0.000029013941...
Pr( 3 red out of 500 ) = C(500,3) * (0.03)^3 * (0.97)^497 = 0.000148958173...
Pr( 4 red out of 500 ) = C(500,4) * (0.03)^4 * (0.97)^496 = 0.000572414010...
Pr( 5 red out of 500 ) = C(500,5) * (0.03)^5 * (0.97)^495 = 0.001756189787...

The probability of 5 or fewer red balls is then:

Pr( 5 or fewer red out of 500 ) [total of six terms above] = 0.002510579047...

or roughly 0.25%. In particular, the difference between what was observed (5 red balls) and what was expected (15 red balls) is significant at the 5% level, since 0.25% is well below 5%.

Common sense tells us our assumption that the sample was independently drawn from a "population" containing 3% red balls is shaken. However we cannot simply say "the chance the population contains 3% red balls is about one in four hundred", because logically this is a quite different statement from what we know (that _if_ the population contained 3% red balls, then the chance of randomly sampling 500 and getting at most 5 red ones is about one in four hundred).

It is for this reason that the statistician adopts the terminology of significance values in describing the relationship between a sample and a hypothesis about the underlying population, which reasonably puts a burden on us to stop and think about what is meant.

* * * * * * * * * * * * * * * * * *

To determine what sample size would be sufficient for pval 5% "if [you] had observed 5" we must also specify what is to be assumed about the expected number of red balls.
That is, if you keep as constant the expected number of 15 red balls, regardless of sample size, this calls for different calculations than if we were keeping a 3% fraction of red balls, independent of sample size.

Generally experimental design must deal with the second sort of question, i.e. assuming that the expected fraction of observations, rather than the absolute expected number of observations, is independent of sample size, and then choosing a sample size based on that.

But allow me the liberty of taking first the interpretation to be one of changing only the sample size, keeping the observed 5 and expected 15. In this case we see that decreasing the sample size only makes the observation increasingly significant! Imagine the limiting case of a sample of 15, in which all the balls were expected to be red, but only one-third turned out to be so!

To confirm this intuition, let's run through the calculation with a sample size of 100 (in place of 500 before). With 15 red balls expected out of 100, each ball is red with probability 0.15 and non-red with probability 0.85:

Pr( 0 red out of 100 ) = 0.85^100                   = 0.0000000874767...
Pr( 1 red out of 100 ) = 100*0.15*0.85^99           = 0.0000015437071...
Pr( 2 red out of 100 ) = C(100,2)*0.15^2*0.85^98    = 0.0000134847356...
Pr( 3 red out of 100 ) = C(100,3)*0.15^3*0.85^97    = 0.0000777355349...
Pr( 4 red out of 100 ) = C(100,4)*0.15^4*0.85^96    = 0.0003326623626...
Pr( 5 red out of 100 ) = C(100,5)*0.15^5*0.85^95    = 0.0011271383581...
                                                      ------------------
Pr( 5 or fewer red out of 100 )                     = 0.0015526521750...

or roughly 0.16%, i.e. more significant (less probable) than the result previously considered of observing 5 out of 500 when 15 were expected.

On the other hand a different sort of complication ensues if we keep 5 observed red balls and lower the sample size while maintaining a population assumption of 3% red balls, namely that pretty quickly 5 red balls will meet or exceed the expected number. For example, a sample size of 200 means that 6 red balls are expected, while a sample size of 100 gives 3 red balls expected!
In other words the sample size N at which 5 observed red balls becomes insignificant (at the 5% level) will not be terribly far below 500, because expectations now drop in proportion to the sample size.

A good way of gauging where the 5% significance "break" occurs is to look at the single largest term, namely Pr( 5 red out of N ). Clearly for the total to be less than 5% requires in particular that this term by itself be less than that. Assuming 3% of the underlying population is red (and independently sampled):

    N      Pr( 5 red out of N )
 -------- -----------------------
   200      0.16224965026...
   300      0.05958590002...
   400      0.01204097665...

Now we see that 5 observed red balls would not be significant at the 5% level for a sample size of 300 (relative to a population assumption of 3% red balls), but that it might be significant for a sample size of 400 (we need to add in the other terms, the chances of observing fewer than 5 red balls, to be certain).

We can proceed in this way to narrow down the smallest N for which observing 5 red balls would be significant at the 5% level relative to a "null hypothesis" that the sample is independently drawn from a population with 3% red balls.

I think however that the "right" question to ask about a breakpoint in the sample sizes is this:

At what sample size N is an observation of 1% or fewer red balls significant at the 5% level, relative to an assumed population of 3% red balls?

I would choose a design question of this form, rather than the two we considered earlier, because it sets both the observation and expectation values as percents (fractions) relative to the sample size.

You may have good reasons, however, for asking the question as you did, about what smallest N will make an observation of 5 red balls significant (at the 5% level), but one should be leery of "shopping" data looking for significance.
At one level an exploratory study will of course involve an exercise of this kind, but an experimental study should have a clear hypothesis and sample size formulated before the statistical analysis is performed. Otherwise a "reporting" bias can be created in which only the unlikeliest aspects of one's experimental observations are publicized.

regards, mathtalk-ga```
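
The tail sums in the answer, and the search for the smallest N at which 5 observed red balls is significant, can be checked with a short script. What follows is a minimal sketch, assuming Python 3.8+ (for `math.comb`), independent draws, and a one-sided "5 or fewer" tail as in the answer. The function names `binom_pmf`, `binom_cdf` and `smallest_significant_n`, and the search cap `n_max`, are illustrative choices and not anything from the thread.

```python
from math import comb


def binom_pmf(k, n, p):
    """Pr(exactly k red out of n) when each ball is red with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """Pr(k or fewer red out of n): the sum of the terms for 0, 1, ..., k."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


def smallest_significant_n(observed, p, alpha, n_max=2000):
    """Smallest n with Pr(observed or fewer red out of n) < alpha.

    For a fixed observed count the lower tail shrinks as n grows, so the
    first n that passes the test is the break point. n_max is an arbitrary
    cap chosen for this sketch; None is returned if it is reached.
    """
    for n in range(observed, n_max + 1):
        if binom_cdf(observed, n, p) < alpha:
            return n
    return None


if __name__ == "__main__":
    # 5 or fewer red out of 500, population assumed 3% red
    print(binom_cdf(5, 500, 0.03))
    # 5 or fewer red out of 100, with 15 expected (p = 0.15)
    print(binom_cdf(5, 100, 0.15))
    # Break point for 5 observed red balls at the 5% level, p = 0.03
    print(smallest_significant_n(5, 0.03, 0.05))
```

The first two printed values should agree with the totals worked out by hand in the answer, and the third should land between the N = 300 and N = 400 rows of the table, since the single term at N = 300 already exceeds 5%.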