Q: Significance of discrete random variable with outcomes on a nominal scale ( No Answer,   3 Comments )
 Question
 Subject: Significance of discrete random variable with outcomes on a nominal scale Category: Science > Math Asked by: azeric64-ga List Price: $20.00 Posted: 16 Dec 2004 10:59 PST Expires: 15 Jan 2005 10:59 PST Question ID: 443517
 ```I am sampling a single discrete random variable with k bins. Each bin represents an outcome on a nominal scale, so none of the usual distributions apply. Given k and n (n being the sample size), I need to compute the probability that the sample represents the distribution of outcomes in the population. I suppose some sort of confidence interval comes into play here as well. Is there a direct formula or set of formulas that determine this probability? If so, what is it, and what are the caveats when applying it?``` Clarification of Question by azeric64-ga on 20 Dec 2004 09:08 PST ```Thanks for your comments, and allow me to clarify. There is a one-to-one correspondence between outcomes and bins: a sample with a particular outcome always goes into the bin for that outcome. So all samples with the outcome "A" go into the "A" bin, all samples with outcome "B" go into the "B" bin, etc. There is typically a small number of bins, with k ranging from 2 up to maybe 25 as an upper bound. Wouldn't the probability of 1/k only apply if I could assume that the outcomes all have uniform probability? I do know that the outcomes definitely do not have uniform probability; the idea is that we have a good guess as to the distribution of the outcomes, but want to use statistical feedback to correct the distribution, and want to know when we have collected enough results to replace our initial guess. Mathtalk, you make an excellent point about using all of the bins in the sample. One requirement of this measurement is that we want to collect a "complete" sample in the sense that every bin must contain at least one outcome. However, it may be that certain outcomes that we think will occur may never occur, especially if we know that they will be particularly rare, and we want to know if that's the case too, while still being able to know how prevalent the other outcomes are. 
The way that we are applying this involves using our initial guess for the prevalence of the outcomes while collecting the actual number of outcomes until we are confident that the data gives us a better picture of the prevalence. Ideally, we'd like to make incremental corrections to the prevalences to gradually bring them into line with the actual data. Finally, I should have mentioned at the beginning that the population is sampled without replacement. Not sure if this makes a difference or not.``` Request for Question Clarification by mathtalk-ga on 20 Dec 2004 13:21 PST ```Hi, azeric64-ga: In general outline, one needs to define a measure of how close (or far away) a sample population is from the hypothesized distribution. A typical approach does this by a weighted sum-of-squares "error" statistic, i.e. add, for all bins, the squares of differences between observed and expected content size, each divided by the expected content size. Such a statistic has a discrete distribution for any fixed sample size, but if the sample size is large enough one can apply a continuous approximation. That the "error" statistic is large enough to have small probability of arising by "random" chance is then considered "significant" in the context of rejecting the hypothesized distribution. Would this sort of information respond to your request for a "direct" formula or formulas and caveats for applying them? regards, mathtalk-ga``` Clarification of Question by azeric64-ga on 20 Dec 2004 16:18 PST ```Hi again, Mathtalk-ga: I'm not trying to test a hypothesis as such, just trying to adjust what I initially believe to be the distribution of outcomes. (I'm building a "self-learning" troubleshooting application.) I don't think that a weighted sum-of-squares error statistic is quite what I'm after. In fact, if I can possibly avoid it, I don't want to compare the hypothesized distribution to the collected sample in order to determine the significance. 
Instead, I'd like to look at the sample by itself and have a function using only the sample size (n) and the number of bins (k) to give me a numerical rating, say between 0 and 1 inclusive, of how statistically significant the sample is. I'm assuming that the more bins you have, the larger the sample size you need to ensure that it is statistically significant. Is this assumption correct? I think that what I want is more related to how, in designing a survey, you would determine the minimum sample size given a discrete random variable such as I have described, except that I want to apply it to a sample that I already have to see whether or not it meets the minimum criteria for being statistically significant. If the sample is statistically significant, I want to just use the sample as my measure of the prevalence of each outcome; but if the sample is not significant, I want to use a measure of its "fractional significance", if that term makes sense, as a weighting factor against the hypothesized distribution when I display the prevalence of the outcomes to the user. I'm guessing that, for a given value of k, the function of n that yields the "fractional significance" of a not-yet-significant sample resembles an elongated "S", much like the cumulative normal distribution. The inflection point would be where the fractional significance equals 0.5, and then the fractional significance would asymptotically approach 1 as n continues to increase. The effect I'm looking for is that as actual samples accumulate, one-by-one, the user of the software would observe over time the prevalence transitioning from the "hypothesized" distribution to the actual distribution (I will actually be using the entire population as a sample, with the population starting at zero and growing by increments of one). So another way of asking this is to determine when the population is large enough to allow you to accurately determine the distribution of outcomes as it continues to increase. 
But I still need an initial guess to work with until it reaches that point, while also being able to make gradual adjustments in the interim to give the best possible guess as to the distribution at every increment of the population. I'm looking for something relatively simple because this measure has to be made by a JavaScript program running in a Web browser, so it can have a quick-and-dirty aspect that won't necessarily be valid for boundary conditions, such as when the sample size is below a certain threshold, which I can test for in my JavaScript program. Granted, this is a somewhat perverse use of a statistical measurement, but it seems that it should still yield sensible results. Thanks for your perseverance! Regards, AZeric64-ga``` Clarification of Question by azeric64-ga on 20 Dec 2004 16:25 PST ```One last point: because the function only equals 1 as a limit as n approaches infinity, once the fractional significance reaches, say, 0.95 or 0.99, I'd consider the sample to be "fully" significant. This would not have to be as rigorous as other significance tests that yield a "p" value, just something that works in a practical application. It could even use an extra fudge factor to err on the side of caution so that I would have to have a larger sample size than is strictly necessary to attain that level of significance. Thanks again! AZeric64-ga``` Request for Question Clarification by mathtalk-ga on 20 Dec 2004 17:31 PST ```Sorry, I'm not grasping what it is you are looking for. Of course you can constantly adjust your estimated distribution based on what samples you've observed, and you can base the adjustment on very trivial calculations if you like. However, you've said above that you want to assign a number between 0 and 1 that describes "fractional significance" of a sample _without_ comparing "the hypothesized distribution to the collected sample in order to determine the significance." 
Now the notion of "statistical significance" is already perverse enough, so I'm not concerned that we're tampering with sanctified truth! But I just don't get what makes a sample, in and of itself, more or less significant. In one respect I suppose you could base the figure simply on the size of the sample, regardless of what bins the observations fall into. Bigger is better, and we could just pick a mathematical function of n that tends to 1 as n goes to infinity. So, I think I'm missing something central about what you'd like to do with the "significance" values. regards, mathtalk-ga``` Clarification of Question by azeric64-ga on 21 Dec 2004 15:50 PST ```Mathtalk, Thanks for bearing with me through this. Actually, we're already on the same page, the same paragraph in fact; I think at this point it's just a matter of particulars. Talking this through with you has been a great help so far! I think at this point it's best if I outline a solution that I have just come up with, and then have you tell me where the flaws are and whether they really matter in a practical sense. Applying the problem requires that we divide it into two parts: The first part determines the parameters we use for the fractional significance formula, and the second part determines how we apply it. First part: Recall a statement I made in an earlier clarification: "the more bins you have, the larger the sample size you need to ensure that it is statistically significant." A Web page dealing with determining sample size bears this out: http://www.isixsigma.com/library/content/c000709.asp This page also gives me some insights that I think will help. The formula for sample size, if you look 2/3 of the way down the page (before you get to all the banner ads and just after the illustration of the normal distribution curve), directly supports this idea, if you liken E (the margin of error) to 1/k (the inverse of the number of outcomes). 
By increasing k, you increase the required sample size, with the required sample size increasing proportionally to the square of k. This happens because when we substitute E with 1/k, the denominator disappears and we are left with the formula s = [z(sub alpha/2) * sd * k]^2, where s is the required sample size, z(sub alpha/2) is the critical value, and sd is the standard deviation. The critical value and standard deviation ought to be constant for all applications of the formula. The trick is determining what their values need to be: the critical value can depend on the degree of confidence we want, typically 95% but I'd like 99% if that is still practical, which you can determine with a table of standard Z-scores (although I haven't taken a statistics class in about 12 years so the technique escapes me). Here's the approach to determining the standard deviation that I am considering: even if we have a nominal scale, can we still assume that all outcomes would fall within three standard deviations of the mean? (This covers 99.7% of cases, a close enough approximation to 100%.) That would make the standard deviation equal to one-sixth of the range within which all the outcomes would fall. Does this make sense? Remember, it's a nominal scale, so I feel that I'm stretching it here. So what we have so far, assuming that I have not completely left the realm of statistical sanity, is a technique for determining the minimum population size (which, recall, is equal to the sample size) necessary to be able to replace the initial guess of the outcome distribution with the distribution indicated by the collected results. This will be an integer constant (call it "s", as indicated above) that we can determine at the time that we build our application, and will factor into the fractional significance formula that we will apply to the collected results each time we run it. 
So what I would need from you at this point, assuming that I have explained all this clearly enough, is a determination of what the critical value and the standard deviation should be. Again, the critical value should come directly from the standard Z-score for the confidences I indicated, and if my assumptions about what I can do with the sd are correct, sd might simply be 0.166667. I need your opinion on both counts. Second part: Now comes the point where, each time we run the program, we apply the elongated S-curve function I previously described, adjusted according to k, so that it indicates that the population has statistically significant size when it equals s. For values less than s, the result of the function yields the fractional significance. The quandary now is that the cumulative normal distribution is difficult to calculate, and I need to determine my fractional significance on-the-fly. I could implement the Taylor series for it if I had to, but instead I am considering a different approximation. And here is where I really want to commit a statistical heresy for the sake of simplifying the calculations and being happy with an approximation (your testimony that significance measures are already perverse enough gives me some confidence here). The cumulative normal distribution, for the range of values that we are considering (three standard deviations on either side of the mean), strongly resembles the sin(x) function in the range -pi/2 to pi/2, along with a little translation and scaling to get the outcomes between zero and one. (sin() is a function included with JavaScript.) So if you ignore the requirement that the distribution have asymptotic tails, and run it through the sin() function with the appropriate adjustments to the parameter and the result, it should yield a value between 0 and 1 that we can use to weight the sample outcomes against our guesses for them. 
Low values of n will yield values near zero, and values of n near s will yield values near one. sin() also satisfies the requirement that the inflection point occur when the result equals 0.5. The actual formula would be [1 + sin(pi*n/s - pi/2)] / 2. Now to get the outcome distribution that we want to display, we take the fractional significance value yielded from our formula, and call it the sample significance. Then we subtract it from one to get the initial guess significance. Then, for each outcome k, the estimated prevalence is the sample significance multiplied by the number of actual outcomes k, plus the initial guess significance multiplied by the initial guess for outcome k. In other words, it's just a weighted sum of the initial guess with the collected results. The initial guesses, of course, will first have to be normalized according to s. So there you have it. Again, I think all I need from you at this point is what the critical value and the standard deviation should be, and an opinion as to whether I can use the sin() function as I have described. Thanks Again, AZeric64``` Request for Question Clarification by mathtalk-ga on 21 Dec 2004 21:40 PST ```Hi, azeric64-ga: I've posted some initial remarks in response to your ideas below, as a Comment. regards, mathtalk-ga``` Request for Question Clarification by mathtalk-ga on 22 Dec 2004 05:07 PST ```Maybe a few questions can help me make useful suggestions: 1) When you sample the distribution (I believe you mentioned at one point that this is "without replacement"), are the outcomes chosen one at a time? If not one at a time, are the sizes of samples themselves a random outcome? (For example, in birding, the number of observations in a day will fluctuate.) 2) Are the samples chosen "independently"? This is a key issue from the point of view of statistical estimation. 
For example, there was a requirement in the construction of the Alaska pipeline that X-rays be taken of a certain percentage of the welds, as a QA check that the work was done properly. However, it turned out that a contractor had taken one X-ray and duplicated it hundreds of times! Not exactly independent observations! But observations can be "correlated" and fail to be independent due to other more subtle factors. "Unscientific" polling in which samples are submitted by enthusiasts will often be "biased" by interested parties providing "duplicate copies" of favored samples. In trying to estimate frequencies in the unbiased population, the duplication in sampling creates a significant challenge. 3) What costs or risks are associated with updating (or failing to update) your estimated distribution of outcomes? To illustrate, suppose you were a jelly bean vendor, and you initially estimate the demand for various colors of jelly beans to be equal. As time goes by, your inventory of black jelly beans surges, but you continue to forecast equal sales of each kind. Now as long as your restocking of inventory is smart enough to only purchase those colors of jelly beans that are depleted, the economic harm of maintaining the unrealistic forecast of equal demand for all colors is not too serious. Your reordering over time would probably provide data for a better forecast, but a case could be made for not bothering to analyze the situation too closely. regards, mathtalk-ga``` Clarification of Question by azeric64-ga on 28 Dec 2004 14:23 PST ```Hello again Mathtalk-ga, Sorry about the lag in my response, I was away for the Christmas holiday and out of Internet range (harder to do every year, but achievable if you put your mind to it). To answer your questions: 1) The outcomes are chosen one at a time. 2) Samples are independent. 3) The costs for failing to update the outcome distribution are not catastrophic. 
It is more a case of trying to optimize the prevalence of each cause of a problem in order to reduce the overall cost of troubleshooting. I also appreciate your latest comments. True, I'm not specifically trying to measure something like process improvement, but I am searching for techniques that may have been developed in that venue that I can still apply here. Looking forward to your reply, AZeric64-ga```
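[Editor's note] The two-part scheme azeric64-ga outlines in the clarifications above (the sample-size formula with E = 1/k, the sin()-based S-curve, and the weighted blend of the initial guess with observed counts) could be sketched in JavaScript roughly as follows. All names are hypothetical, and the constants follow the thread's stated assumptions: z = 2.576 for 99% confidence, and sd = 1/6 on the premise that all outcomes fall within three standard deviations of the mean.

```javascript
// Minimum "fully significant" sample size for k bins, per the proposed
// substitution E = 1/k in the survey formula: s = (z * sd * k)^2.
// Defaults (assumptions from the thread): z = 2.576 (99%), sd = 1/6.
function requiredSampleSize(k, z = 2.576, sd = 1 / 6) {
  return Math.ceil(Math.pow(z * sd * k, 2));
}

// Fractional significance of a sample of size n, using the sin()-based
// S-curve proposed above: 0 at n = 0, 0.5 at n = s/2, 1 for n >= s.
function fractionalSignificance(n, s) {
  if (n >= s) return 1;
  return (1 + Math.sin((Math.PI * n) / s - Math.PI / 2)) / 2;
}

// Blend the observed bin counts with the (normalized) initial guess,
// weighted by the fractional significance of the sample collected so far.
function blendedPrevalence(counts, guess, s) {
  const n = counts.reduce((a, b) => a + b, 0);
  const g = guess.reduce((a, b) => a + b, 0); // normalize the guess
  const w = fractionalSignificance(n, s);
  return counts.map((c, i) => {
    const observed = n > 0 ? c / n : 0;
    const prior = guess[i] / g;
    return w * observed + (1 - w) * prior;
  });
}
```

For k = 10 bins these defaults give s = 19. Whether sd = 1/6 is a defensible stand-in on a nominal scale is exactly the open question in the thread; the code only mechanizes the proposal.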
 There is no answer at this time.

 ```This will surely depend on the distribution of your random variable - more precisely on the probability of the outcome lying within each bin. At one extreme, if the outcomes are always within a single bin then the sample will always represent the distribution of outcomes exactly. At the other extreme, if each of the k bins has probability 1/k of being selected by a given outcome, and supposing for convenience that n is a multiple of k (say n = mk) then the probability of getting an exactly correct representation is the probability that each bin gets m outcomes, i.e. (n, m) (1/k)^m . (n-m, m) (1/k)^m . (n-2m, m) (1/k)^m . ... . (m, m) (1/k)^m where I am using (n, m) to represent n choose m, i.e. n! / (m! (n-m)!). This simplifies to n! / [(m!)^k . k^n]. Of course, you're more likely to be interested in the probability that the distribution among the k bins is "close enough" to the correct distribution. The answer to that will depend on how you define "close enough", and as before on the distribution of the variable - though the uniform random version will at least give you a lower bound.```
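[Editor's note] manuka-ga's closed form n! / [(m!)^k * k^n] overflows quickly if evaluated with raw factorials. A sketch (function names hypothetical) that evaluates it for moderate n via log-factorials:

```javascript
// Probability that each of k equally likely bins receives exactly m of
// n = m*k outcomes: n! / ((m!)^k * k^n), per the comment above.
// Log-factorials keep the intermediate values from overflowing.
function logFactorial(n) {
  let s = 0;
  for (let i = 2; i <= n; i++) s += Math.log(i);
  return s;
}

function probExactUniformSplit(k, m) {
  const n = m * k;
  const logP = logFactorial(n) - k * logFactorial(m) - n * Math.log(k);
  return Math.exp(logP);
}
```

For example, with k = 2 bins and m = 1, the probability of an exactly even split of n = 2 outcomes is 2!/(1!^2 * 2^2) = 0.5; this probability falls rapidly as n grows, which is why "close enough" rather than "exact" is the practical criterion.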
 ```As manuka-ga's Comment outlines, it is implied that a multinomial distribution exists among the bins. Developing a "confidence interval" for each individual bin might be conservatively treated by reducing to several binomial distribution cases, e.g. either in Bin X or not in Bin X. Significance testing, on the other hand, would be more powerful if all bins are used in a sampling statistic. regards, mathtalk-ga```
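[Editor's note] The per-bin binomial reduction mathtalk-ga mentions ("either in Bin X or not in Bin X") might look like the sketch below, using the standard normal approximation to the binomial proportion. The function name and the choice of a Wald-style interval are illustrative assumptions, not something prescribed in the thread.

```javascript
// Normal-approximation ("Wald") confidence interval for the proportion of
// outcomes landing in one bin, treating "in Bin X or not" as binomial.
// z = 1.96 gives roughly a 95% interval; the approximation is only
// reasonable when n*p and n*(1-p) are both fairly large.
function binProportionCI(countInBin, n, z = 1.96) {
  const p = countInBin / n;
  const halfWidth = z * Math.sqrt((p * (1 - p)) / n);
  return [Math.max(0, p - halfWidth), Math.min(1, p + halfWidth)];
}
```

As the comment notes, this per-bin treatment is conservative; a test statistic that uses all bins at once (as in the goodness-of-fit approach) is more powerful.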
 ```It's important to bear in mind the context of the discussion of sample size and standard deviation at the site linked above by azeric64-ga. The premise there is settling a question of whether a change to a process has resulted in improvement, where the measurable improvement is a quantity that has a random fluctuation according to a normal distribution quite independent of whether any changes are made to the process. One should therefore be skeptical of whether an observed improvement in the measured outcome is due to the designed changes rather than to chance. In a situation of this kind we ask for repeated independent "trials" both with and without the designed changes. The empirical testing "samples" the outcomes, and a mean of the samples for both treatments is derived by averaging the respective measurements. For a normally distributed population, the means of samples of a fixed size m will themselves be normally distributed, with standard deviations that are a factor of SQRT(m) less than the standard deviation of the "host" population. Assuming that the process change shifts the mean of the population (but not its variance or standard deviation), one can estimate in advance how large a sample size will be needed to detect a shift in the mean of a certain size relative to the underlying standard deviation with a certain level of "significance". The notion of "significance" here is tricky, or even, as we previously alluded, perverse. It does not mean what we would wish it to, namely how likely it is that an observed impact is real. Instead it is a "converse" notion. It tells us how "unlikely" an observed difference would be under a hypothesis that the design changes were completely ineffective (as if nothing whatever had actually been done to change the setup). A small "level of significance" is important in this context. 
The smaller the level of significance, the more unlikely an observed change _would_ occur by random chance (given the assumed normal distribution of outcomes). Now the situation of outcomes being distributed into bins differs from this "process improvement" case study in a number of ways. First and most obviously, the distribution of outcomes is inherently discrete; only a finite number of outcomes are possible, which is to be compared with the continuous distribution for which a normal distribution is the elementary paradigm. Second, the motive for planning a sufficient sample size in a process improvement study is clear enough that a rational basis exists for balancing the costs of increasing sample size versus the potential benefit of an improvement. For example, a sample size may be chosen so that it will be likely to produce a significant "result" _if_ the magnitude of the improvement is big enough to be profitable. There is no absolute notion of a sample being significant or not significant, even on a sliding scale from 0 to 1. The part of statistics that is most closely tied to testing the "level of significance" of experimental results is called statistical inference. Basically it incorporates all the methods which might be invented to estimate parameters of distributions _or_ to make comparisons between distributions that are known only or partly through observations. I have the feeling that some study (a good night's rest!) will be needed before I can bring into sharp focus the new direction in which you've pointed. regards, mathtalk-ga```
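[Editor's note] The weighted sum-of-squares "error" statistic mathtalk-ga described earlier in the thread (Pearson's chi-square goodness-of-fit statistic) could be computed as below. Comparing the result against a chi-square critical value with k - 1 degrees of freedom is omitted here, and the function name is hypothetical.

```javascript
// Pearson's chi-square goodness-of-fit statistic: for each bin, the
// squared difference between observed and expected counts, divided by
// the expected count, summed over all bins. Expected counts come from
// the hypothesized proportions scaled to the sample size.
function chiSquareStatistic(observed, expectedProportions) {
  const n = observed.reduce((a, b) => a + b, 0);
  return observed.reduce((sum, obs, i) => {
    const exp = n * expectedProportions[i];
    return sum + Math.pow(obs - exp, 2) / exp;
  }, 0);
}
```

A sample matching the hypothesized proportions exactly yields a statistic of 0; larger values indicate a sample that would be increasingly unlikely if the hypothesized distribution were the true one.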