Subject: Quantile calculation
Category: Science > Math
Asked by: multifactor-ga
List Price: $30.00
Posted: 18 May 2005 17:41 PDT
Expires: 17 Jun 2005 17:41 PDT
Question ID: 523143
```
I am looking for the typical algorithm used to divide a cross section of
datapoints into quantiles (quintiles and deciles especially). I have not been
able to find a 'consensus' technique that can map a distribution (potentially
unequal) of observations into quintile or decile groups.
```

Request for Question Clarification by mathtalk-ga on 23 May 2005 18:42 PDT

```
Hi, multifactor-ga:

I've been tempted to post an Answer several times, but I'm not sure what
balance to strike between "typical algorithm" and "'consensus' technique".
Maybe I'm splitting hairs here.

There are two broad categories of approaches. One "preserves" the discrete
nature of the empirical sample, giving you back an observed value for every
quantile requested. While that has some attractions, it can lead to
awkwardness when the number of observations is comparatively low (relative to
the granularity demanded by the quantiles).

So a more common approach would be some kind of linear interpolation of the
empirical cumulative distribution function, allowing an "exact" result at the
risk of making up values that in principle may never occur.

The case of the median is illustrative. If there is an odd number of values
in the sample, pretty much everyone will agree the middle one (of sorted
observations) is the median. If there is an even number of values in the
sample, a frequent approach is to average the two "middle" observations.

Would an approach often used in practice which generalizes this sort of
computation be helpful to you? Any specific approach will have circumstances
that give rise to shortcomings -- perhaps you are more interested in a
technique appropriate to a specific situation?

regards, mathtalk-ga
```

Clarification of Question by multifactor-ga on 23 May 2005 19:46 PDT

```
Thanks for the response. In the specific case I am trying to apply this
calculation to, I would prefer to avoid interpolation, as I am trying to bin
a specific list of companies into quantiles -- in essence, preserving the
nature of the sample set.

I am trying to apply this routine to the quantization of financial data,
which are observations per company at a particular point in time. A typical
example would be to assign a decile number to a cross section of 1000
companies' earnings/price ratio. Thanks.
```
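[Editor's note: the median distinction discussed above can be made concrete. This is a short Python sketch of my own, not from either participant; the lower-middle choice in the discrete version is just one of several possible "discrete" conventions.]

```python
def discrete_median(values):
    """A 'discrete' median: always returns an observed value.  For an
    even count this takes the lower of the two middle observations
    (one of several possible conventions)."""
    s = sorted(values)
    return s[(len(s) - 1) // 2]

def interpolated_median(values):
    """The common interpolating convention: average the two middle
    observations when the count is even, which may produce a value
    that never occurs in the sample."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

print(discrete_median([3, 1, 4, 2]))      # 2   (an observed value)
print(interpolated_median([3, 1, 4, 2]))  # 2.5 (not in the sample)
```

For an odd count both functions agree on the middle observation; they differ only in how they resolve an even count, which is exactly the ambiguity the clarification exchange is about.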
```
Hi, multifactor-ga:

Thanks for clarifying that you want to avoid interpolation and have the
quantile levels match up with observed values.

For reference I will point out two widely used statistical packages that
implement the sample quantile method which I describe below. The interesting
thing is that both packages implement more than one method, some "discrete"
and some "continuous":

SAS is a standard commercial statistical package implemented on many
platforms:

[SAS Elementary Statistics Procedures -- Keywords and Formulas]
http://support.sas.com/onlinedoc/913/getDoc/en/proc.hlp/a002473330.htm

See about 3/4ths down the page, under "Quantile and Related Statistics", and
especially the table "Methods for Computing Quantile Statistics", in which
the parameter values for QNTLDEF are described.

R is an open source implementation of another statistical language, also
available on a variety of platforms:

[R Documentation -- Sample Quantiles]
http://jsekhon.fas.harvard.edu/stats/html/quantile.html

Note that three of the nine quantile methods are "discrete" and the other six
"continuous" (meaning that interpolation is required).

The method I describe below always produces an observed value for each
quantile. In the SAS framework it corresponds to QNTLDEF = 2, and in the R
documentation it appears to be quantile method type 3.

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

[Closest Numbered Observation Method]

Sort the observations into ascending order:

    x_1, x_2, . . . , x_n

With as many observations as you contemplate here, on the order of a
thousand, a good sorting routine is well worth implementing, e.g. a heapsort
or a quicksort method. Let me know what your programming environment is if
you'd like pointers to coding this part.

Given that numbering of these observations, then for any fraction p in [0,1],
define the sample quantile level corresponding to p to be the observation
whose number is closest to np. If np is exactly halfway between two positive
integers, then pick the even-numbered value.

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Examples -- Observation Numbered Closest to np:

Suppose n = 1000 and p = 0.5. Then np = 500, and the median (quantile level
for p = 1/2) by this method is x_500, the 500th observation in ascending
order.

Suppose n = 867 and p = 0.2. Then np = 173.4, and the first quintile
(quantile level for p = 1/5) is x_173, the 173rd observation in order.

Suppose n = 995 and p = 0.3. Then np = 298.5, and the third decile (quantile
level for p = 3/10) is x_298, because we round to an even number when np is
exactly halfway between 298 and 299.

Note that by a strict application of this rule, if p = 0, then np = 0
whatever n is, so x_1 is the observation numbered nearest to np. At the other
extreme, if p = 1, then np = n and naturally x_n is chosen.

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Some final notes:

The sorted observations are called "order statistics". It may have no special
importance here, but in some cases it may be of interest to know, for a given
underlying distribution, what the expected values, etc. are for x_1, x_2, ...
(which depend on n as well as the distribution).

The algorithm described here assumes any repeated values are included with
repetition in the sorted list x_1, x_2, ..., x_n accordingly.

Further information about the R statistical program, also known as GNU S, may
be found here:

[The R Project for Statistical Computing]
http://www.r-project.org/

regards, mathtalk-ga
```
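[Editor's note: the Closest Numbered Observation Method lends itself to a short implementation. The Python sketch below is my own illustration of the rule as stated, not code from the answer; the function names are invented. Conveniently, Python 3's built-in round() resolves exact halves to the even integer, which is precisely the tie-break described.]

```python
def quantile_index(n, p):
    """1-based index of the sample quantile under the closest numbered
    observation rule: the observation numbered nearest to n*p, with an
    exact half resolved to the even index (Python's round() rounds
    halves to even, matching the tie-break in the answer)."""
    k = round(n * p)          # note: n*p is computed in floating point
    # Clamp so that p = 0 yields x_1 and p = 1 yields x_n.
    return min(max(k, 1), n)

def sample_quantile(observations, p):
    """Observed value at quantile level p, for 0 <= p <= 1."""
    s = sorted(observations)  # the order statistics x_1 <= ... <= x_n
    return s[quantile_index(len(s), p) - 1]

# The worked examples from the answer:
assert quantile_index(1000, 0.5) == 500  # median is x_500
assert quantile_index(867, 0.2) == 173   # np = 173.4, quintile is x_173
assert quantile_index(995, 0.3) == 298   # np = 298.5, round to even: x_298
assert quantile_index(1000, 0) == 1      # p = 0 gives x_1
assert quantile_index(1000, 1) == 1000   # p = 1 gives x_n
```

For the 1000-company earnings/price use case, one simple way to proceed would be to compute the nine decile boundaries sample_quantile(ratios, p) for p = 0.1, 0.2, ..., 0.9 and assign each company to the first decile whose boundary is at least its ratio.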