Google Answers: Quantile calculation

Q: Quantile calculation (Answered, 5 out of 5 stars, 0 Comments)

Question

Subject: Quantile calculation
Category: Science > Math
Asked by: multifactor-ga
List Price: $30.00

Posted: 18 May 2005 17:41 PDT
Expires: 17 Jun 2005 17:41 PDT
Question ID: 523143

I am looking for the typical algorithm used to divide a cross section
of datapoints into quantiles (quintiles and deciles especially). I
have not been able to find a 'consensus' technique that can map a
distribution (potentially unequal) of observations into quintile or
decile groups.

Request for Question Clarification by mathtalk-ga on 23 May 2005 18:42 PDT

Hi, multifactor-ga:

I've been tempted to post an Answer several times, but I'm not sure
what balance to strike between "typical algorithm" and "'consensus'
technique".

Maybe I'm splitting hairs here.  There are two broad categories of
approaches.  One "preserves" the discrete nature of the empirical
sample, giving you back an observed value for every quantile
requested.

While that has some attractions, it can lead to awkwardness when the
number of observations is comparatively low (relative to the
granularity demanded by the quantiles).  So a more common approach
would be some kind of linear interpolation of the empirical cumulative
distribution function, allowing an "exact" result at the risk of
making up values that in principle may never occur.

The case of the median is illustrative.  If there is an odd number of
values in the sample, pretty much everyone will agree that the middle
one (of the sorted observations) is the median.  If there is an even
number of values in the sample, a frequent approach is to average the
two "middle" observations.

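As a small illustration of the median convention just described, here
is a minimal Python sketch.  The function name median_of_sample is made
up for this example, and the averaging step assumes the observations
are numeric.

```python
def median_of_sample(values):
    """Median by the common convention: the middle sorted value,
    or the mean of the two middle values when the count is even."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle observation exists.
        return xs[mid]
    # Even count: average the two "middle" observations.
    return (xs[mid - 1] + xs[mid]) / 2
```

Note that the even case returns a value that may not occur in the
sample, which is exactly the kind of interpolation a "discrete" method
avoids.
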
Would an approach often used in practice which generalizes this sort
of computation be helpful to you?  Any specific approach will have
circumstances that give rise to shortcomings -- perhaps you are more
interested in a technique appropriate to a specific situation?

regards, mathtalk-ga

Clarification of Question by multifactor-ga on 23 May 2005 19:46 PDT

Thanks for the response.

In the specific case I am applying this calculation to, I would
prefer to avoid interpolation, as I am trying to bin a specific list of
companies into quantiles, in essence preserving the nature of the
sample set.

I am trying to apply this routine to the quantization of financial
data, which are observations per company at a particular point in time.
A typical example would be to assign a decile number to each company in
a cross section of 1000 companies, based on its earnings/price ratio.

Thanks.

Answer

Subject: Re: Quantile calculation
Answered By: mathtalk-ga on 25 May 2005 04:11 PDT
Rated: 5 out of 5 stars

Hi, multifactor-ga:

Thanks for clarifying that you want to avoid interpolation and have
the quantile levels match up with observed values.  For reference I
will point out two widely used statistical packages that implement
the sample quantile method which I describe below.  The interesting
thing is that both packages implement more than one method, some
"discrete" and some "continuous":

SAS is a standard commercial statistical package implemented on many
platforms:

[SAS Elementary Statistics Procedures -- Keywords and Formulas]
http://support.sas.com/onlinedoc/913/getDoc/en/proc.hlp/a002473330.htm

See about three-quarters of the way down the page, under "Quantile and
Related Statistics", and especially the table "Methods for Computing
Quantile Statistics", in which the parameter values for QNTLDEF are
described.

R is an open source implementation of another statistical language,
also available on a variety of platforms:

[R Documentation -- Sample Quantiles]
http://jsekhon.fas.harvard.edu/stats/html/quantile.html

Note that three of the nine quantile methods are "discrete" and the
other six "continuous" (meaning that interpolation is required).

The method I describe below always produces an observed value for each
quantile.  In the SAS framework it corresponds to QNTLDEF = 2, and in
the R documentation it appears to be quantile method type 3.

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

[Closest Numbered Observation Method]

Sort the observations into ascending order:

  x_1, x_2, . . . , x_n

With as many observations as you contemplate here, on the order of a 
thousand, a good sorting routine is well worth implementing, e.g. a
heapsort or a quicksort method.  Let me know what your programming
environment is if you'd like pointers to coding this part.

Given that numbering of these observations, then for any fraction p in
[0,1], define the sample quantile level corresponding to p to be the
observation whose numbering is closest to np.  If np is exactly halfway
between two positive integers, then pick the even numbered value.
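
The rule above might be sketched in Python as follows.  This is only an
illustration under stated assumptions, not the SAS or R code: the
function names are made up, the fraction p is passed as two integers
k and q (p = k/q, e.g. k = 3, q = 10 for the third decile) so that the
"exactly halfway" test can be done in exact integer arithmetic rather
than floating point, and an index of 0 is clamped to 1 so that p = 0
selects x_1.

```python
def closest_observation_index(n, k, q):
    """Return the 1-based index of the observation numbered closest
    to n*p, where p = k/q.  Exact ties go to the even index."""
    if n < 1 or q < 1 or not 0 <= k <= q:
        raise ValueError("need n >= 1, q >= 1 and 0 <= k <= q")
    # Exact integer arithmetic: n*p = j + r/q with 0 <= r < q.
    j, r = divmod(n * k, q)
    if 2 * r > q:
        # Fractional part above 1/2: the next index is closer.
        j += 1
    elif 2 * r == q and j % 2 == 1:
        # Exactly halfway and j is odd: move up to the even index.
        j += 1
    # Assumption: clamp into 1..n so that p = 0 selects x_1.
    return min(max(j, 1), n)


def sample_quantile(observations, k, q):
    """Return the observed value at quantile level p = k/q."""
    xs = sorted(observations)  # repeated values stay in the list
    return xs[closest_observation_index(len(xs), k, q) - 1]
```

For instance, closest_observation_index(995, 3, 10) picks index 298 and
closest_observation_index(867, 1, 5) picks index 173.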

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Examples -- Observation Numbered Closest to np:  

  Suppose n = 1000 and p = 0.5.  Then np = 500, and the median (quantile
  level for p = 1/2) by this method is x_500, the 500th observation in
  ascending order.

  Suppose n = 867 and p = 0.2.  Then np = 173.4, and the first quintile
  (quantile level for p = 1/5) is x_173, the 173rd observation in order.

  Suppose n = 995 and p = 0.3.  Then np = 298.5, and the third decile
  (quantile level for p = 3/10) is x_298, because we round to an even
  number when np is exactly halfway between 298 and 299.

Note that by a strict application of this rule, if p = 0, then np = 0
whatever n is, so x_1 is the observation numbered nearest to np.  At
the other extreme, if p = 1, then np = n and naturally x_n is chosen.

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Some final notes:

The sorted observations are called "order statistics".  It may have no
special importance here, but in some cases it may be of interest to
know for a given underlying distribution what the expected values, etc.
are for x_1, x_2, ... (which depends on n as well as the distribution).

The algorithm described here assumes that repeated values are kept in
the sorted list x_1, x_2, ... , x_n, each appearing as many times as it
was observed.
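
To go from quantile levels to a group number per observation (the
decile or quintile number asked about in the question), one possible
approach is sketched below in Python.  The grouping convention is an
assumption of this sketch and is not spelled out in the method above:
each value is placed in the lowest group whose upper quantile level is
greater than or equal to it, so equal values always share a group and
group sizes may be uneven when there are many ties.  The function names
are hypothetical.

```python
from bisect import bisect_left


def quantile_levels(values, q):
    """Return the q-1 observed cut points at p = 1/q, ..., (q-1)/q,
    using the closest numbered observation, ties to the even index."""
    xs = sorted(values)
    n = len(xs)
    if n == 0 or q < 1:
        raise ValueError("need a non-empty sample and q >= 1")
    levels = []
    for k in range(1, q):
        # n*k/q = j + r/q in exact integer arithmetic.
        j, r = divmod(n * k, q)
        if 2 * r > q or (2 * r == q and j % 2 == 1):
            j += 1
        # Clamp into 1..n, then convert to a 0-based list position.
        levels.append(xs[min(max(j, 1), n) - 1])
    return levels


def assign_groups(values, q):
    """Return a group number in 1..q for each value, in input order."""
    levels = quantile_levels(values, q)
    # bisect_left counts the cut points strictly below the value.
    return [bisect_left(levels, v) + 1 for v in values]
```

Calling assign_groups(ratios, 10) on a list of earnings/price ratios
would then give one decile number per company, in the same order as the
input list.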

Further information about the R statistical program, also known as
GNU S, may be found here:

[The R Project for Statistical Computing]
http://www.r-project.org/



regards, mathtalk-ga

multifactor-ga rated this answer: 5 out of 5 stars
