Hi, multifactor-ga:
Thanks for clarifying that you want to avoid interpolation and have
the quantile levels match up with observed values. For reference I
will point out two widely used statistical packages that implement
the sample quantile method which I describe below. The interesting
thing is that both packages implement more than one method, some
"discrete" and some "continuous":
SAS is a standard commercial statistical package implemented on many
platforms:
[SAS Elementary Statistics Procedures -- Keywords and Formulas]
http://support.sas.com/onlinedoc/913/getDoc/en/proc.hlp/a002473330.htm
See about three-quarters of the way down the page, under "Quantile and
Related Statistics", and especially the table "Methods for Computing
Quantile Statistics", in which the parameter values for QNTLDEF are
described.
R is an open source implementation of the S statistical language,
also available on a variety of platforms:
[R Documentation -- Sample Quantiles]
http://jsekhon.fas.harvard.edu/stats/html/quantile.html
Note that in R three of the nine quantile methods are "discrete" and
the other six are "continuous" (meaning that they interpolate between
observed values).
The method I describe below always produces an observed value for each
quantile. In the SAS framework it corresponds to QNTLDEF = 2, and in
the R documentation it appears to be quantile method type 3.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
[Closest Numbered Observation Method]
Sort the observations into ascending order:
x_1, x_2, . . . , x_n
With as many observations as you contemplate here, on the order of a
thousand, a good sorting routine is well worth implementing, e.g. a
heapsort or a quicksort method. Let me know what your programming
environment is if you'd like pointers to coding this part.
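As a rough illustration only, and assuming Python (your environment
may well differ), a heapsort can be sketched with the standard heapq
module. The function name heapsort is hypothetical; in practice the
built-in sorted() would do the same job:

```python
import heapq


def heapsort(values):
    """Return the values in ascending order using a binary heap."""
    heap = list(values)      # copy, so the caller's data is left untouched
    heapq.heapify(heap)      # arrange the copy as a min-heap in O(n)
    # Popping the smallest remaining item n times yields ascending order.
    return [heapq.heappop(heap) for _ in range(len(heap))]
```

Repeated values are kept, so the result has as many entries as the
input.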
Given this numbering of the observations, for any fraction p in [0,1]
define the sample quantile level corresponding to p to be the
observation whose number is closest to np. If np is exactly halfway
between two consecutive integers, pick the even numbered observation.
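The rule above can be sketched in a few lines, again assuming Python.
The function name closest_observation_quantile is hypothetical, and
this is my reading of the rule rather than the SAS or R source code.
It uses exact fractions so that a product such as 995 * 0.3 is not
disturbed by floating point error, and it clamps the observation
number to the range 1..n so that p = 0 gives x_1:

```python
from fractions import Fraction


def closest_observation_quantile(observations, p):
    """Return the observation whose number (1-based) is closest to n*p.

    Ties (n*p exactly halfway between two integers) go to the even
    numbered observation. Pass p as a string such as "0.3" or as a
    Fraction so that n*p is computed exactly.
    """
    xs = sorted(observations)          # x_1 <= x_2 <= ... <= x_n
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one observation")
    p = Fraction(p)
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")
    k = round(n * p)                   # Fraction rounds halves to even
    k = min(max(k, 1), n)              # clamp: p = 0 selects x_1
    return xs[k - 1]                   # convert 1-based number to index


data = [7, 1, 5, 3, 9, 2]              # sorted: 1, 2, 3, 5, 7, 9
print(closest_observation_quantile(data, "0.5"))    # np = 3   -> x_3 = 3
print(closest_observation_quantile(data, "0.25"))   # np = 1.5 -> x_2 = 2
```

Passing p as a float (e.g. 0.3) also works, but the tie-breaking then
depends on the binary value of the float rather than on the decimal
you typed, which is why the sketch prefers strings or Fractions.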
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Examples -- Observation Numbered Closest to np:
Suppose n = 1000 and p = 0.5. Then np = 500, and the median (quantile
level for p = 1/2) by this method is x_500, the 500th observation in
ascending order.
Suppose n = 867 and p = 0.2. Then np = 173.4, and the first quintile
(quantile level for p = 1/5) is x_173, the 173rd observation in order.
Suppose n = 995 and p = 0.3. Then np = 298.5, and the third decile
(quantile level for p = 3/10) is x_298, because we round to an even
number when np is exactly halfway between 298 and 299.
Note that by a strict application of this rule, if p = 0, then np = 0
whatever n is, so x_1 is the observation numbered nearest to np. At
the other extreme, if p = 1, then np = n and naturally x_n is chosen.
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Some final notes:
The sorted observations are called "order statistics". It may have no
special importance here, but in some cases it may be of interest to
know for a given underlying distribution what the expected values, etc.
are for x_1, x_2, ... (which depends on n as well as the distribution).
The algorithm described here assumes that any repeated values appear
in the sorted list x_1, x_2, ... , x_n as many times as they occur.
Further information about the R statistical program, also known as
GNU S, may be found here:
[The R Project for Statistical Computing]
http://www.r-project.org/
regards, mathtalk-ga