Q: statistics, measure of significance ( No Answer,   6 Comments )
Question  
Subject: statistics, measure of significance
Category: Science > Math
Asked by: lusus-ga
List Price: $2.50
Posted: 30 Oct 2002 17:15 PST
Expires: 29 Nov 2002 17:15 PST
Question ID: 93738
I have two lists of numbers that represent a diverse list of statistics
taken at two different times (let's say network performance).  the list
is large and I want to highlight differences which are most likely to
be significant or interesting.  I do not have a large historical sample
to base this on.. it can only be a function of the two data points.

a straight difference is not good because large value stats have larger
differences than smaller value stats. (a change from 1000 to 1100 appears 
more significant than 1 to 50.)

a percent difference is not good because small value stats have erratic
changes which are big in relative (percentage) terms.  e.g. a change
from 2 to 4 looks more significant than 170,000-180,000.
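
To make the two failure modes concrete, here is a minimal Python sketch that computes both measures for the example pairs mentioned above. The function names are made up for illustration, and the percent change is taken relative to the first value, which is an assumption.

```python
def abs_diff(old, new):
    """Straight (absolute) difference between two readings."""
    return abs(new - old)

def pct_diff(old, new):
    """Percent change relative to the old reading (old must be nonzero)."""
    return 100.0 * abs(new - old) / abs(old)

# Example pairs taken from the question text.
pairs = [(1000, 1100), (1, 50), (2, 4), (170000, 180000)]
for old, new in pairs:
    print(old, new, abs_diff(old, new), round(pct_diff(old, new), 1))
```

The absolute difference ranks 170,000-180,000 first and 2-4 last, while the percent difference ranks 1-50 first and 170,000-180,000 last.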

I probably don't know how to state this properly, but large magnitude
values have a tendency to hover around a typical value (like the size
of raindrops) while small values can go from 0 to other small values 
such as 5, fairly easily.

I was thinking of something like using the larger of the two values as
the assumed magnitude, by which the significance of a difference is
scaled down.
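
One possible reading of this idea, sketched in Python: divide the difference by the larger of the two magnitudes and scale to 0..100. The function name is hypothetical, and treating a pair of zeros as "no change" is an assumption.

```python
def scaled_by_larger(a, b):
    """Difference scaled down by the larger magnitude, as a 0..100 score
    for non-negative inputs."""
    larger = max(abs(a), abs(b))
    if larger == 0:
        return 0.0  # both readings are zero: treat as no change
    return 100.0 * abs(a - b) / larger

print(scaled_by_larger(2, 4))            # small values
print(scaled_by_larger(170000, 180000))  # large values
```

This is still a relative measure, so small values such as 2 and 4 continue to score higher than 170,000 and 180,000.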

am I making any sense?  if so, there must be a very standard
statistical way of saying this properly.

I prefer a single value that could be scaled to a fixed range, e.g.
from 0 to 100 so that I can adjust the threshold of "interesting".

I just want to help highlight the values in these long lists, which are
most worthy of inspection.

Clarification of Question by lusus-ga on 31 Oct 2002 09:15 PST
mathtalk, your point is well taken about this not being truly
'statistical', and about needing a model to have truly meaningful
answers.  I realize I'm asking for something a little bogus and
magical by insisting on comparing only two points and by insisting on
no knowledge of the meaning of the data.  but having said that: an
absolute difference does do that magical job, albeit poorly.  and a
relative difference does an even better job, bordering on good enough,
but I'd like to know if there's a next logical improvement that has
better properties than either of these by themselves, without putting
more demands on the front end of this whole process.

this simple expression does a so-so job: (larger+1)/(smaller+1)
(+1 because there are 0s in the data)
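
A minimal Python sketch of that expression, assuming non-negative values. The function names are illustrative, and the rescaling to 0..100 in `bounded_score` is an added assumption, not part of the expression itself.

```python
def ratio_score(a, b):
    """(larger + 1) / (smaller + 1); the +1 avoids division by zero."""
    larger, smaller = max(a, b), min(a, b)
    return (larger + 1) / (smaller + 1)

def bounded_score(a, b):
    """Map the ratio (1 or more, unbounded) into the range 0 to 100."""
    return 100.0 * (1.0 - 1.0 / ratio_score(a, b))

for a, b in [(0, 5), (1, 50), (2, 4), (170000, 180000)]:
    print(a, b, round(ratio_score(a, b), 3), round(bounded_score(a, b), 2))
```

The raw ratio has no upper limit, so a fixed threshold "out of 100" needs some rescaling like the one shown.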

all this is really, is a prioritization for which of these differences
I may select as "interesting" and go on to model, as the next step.

Request for Question Clarification by mathtalk-ga on 05 Nov 2002 08:13 PST
Hi, lusus:

I've had to address a similar issue, with "regression" testing of a
computer software application.  Running the software against a large
number of inputs (benchmark test cases) before and after a change to
the software would normally produce some expected and some unexpected
changes in the output.

Since the output was much too extensive for a human being to reliably
compare, an automated comparison was made to identify "big" changes.

My suggestion is that you go through the various categories of
measurements in your problem and assign to them some "modelling"
labels:

- absolute versus relative:  Should the threshold of "big" change be
defined for this category in absolute or relative terms, i.e. X(1) -
X(2) or the percentage difference of X(2) relative to X(1)?

- primary versus secondary:  Is the category a primary indicator,
something central to the business purpose to be monitored (e.g.
"downtime" or perhaps "dropped connections"), or is it secondary,
either in the sense of being of peripheral importance or being a kind
of intermediate/explanatory value that might be ignored unless related
primary indicators demonstrate a "big" change?
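
The two labels above could be sketched in Python as a per-category rule table. The category names, thresholds, and readings below are invented for illustration; real values would have to come from the domain expert.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    mode: str         # "absolute" or "relative"
    threshold: float  # absolute units, or percent when mode is "relative"
    primary: bool     # primary indicators are listed first

def is_big_change(rule, before, after):
    delta = abs(after - before)
    if rule.mode == "absolute":
        return delta > rule.threshold
    if before == 0:
        return delta > 0  # assumption: any change from zero is "big"
    return 100.0 * delta / abs(before) > rule.threshold

# Hypothetical categories and readings.
rules = {
    "bytes_transferred": Rule("relative", 20.0, primary=False),
    "dropped_connections": Rule("absolute", 5, primary=True),
}
before = {"bytes_transferred": 170000, "dropped_connections": 1}
after = {"bytes_transferred": 180000, "dropped_connections": 9}

flagged = [n for n in rules if is_big_change(rules[n], before[n], after[n])]
flagged.sort(key=lambda n: not rules[n].primary)  # primary categories first
print(flagged)
```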

This is the sort of thing I meant by saying it is a "modelling"
question, rather than a "statistical" one.  As the "domain expert" you
would need to be the lead in assigning these categories.  I could
certainly provide guidance on how to implement an automated review of
the two data sets, using your guidelines.

Is this the sort of help you are interested in?

regards, mathtalk-ga
Answer  
There is no answer at this time.

Comments  
Subject: Re: statistics, measure of significance
From: mathtalk-ga on 30 Oct 2002 19:25 PST
 
From a true statistical point of view, no, it does not make sense.

Let me make sure I understand the setup.  At two different points in
time, you take a large set of "measurements" on a complex system
(network).  For example, there might be a count of users logged in,
the number of files open on a file server, the number of memory pages
swapped out on a database server, etc.  Almost all of the
corresponding numbers at the two points in time differ.  You ask for a
way to know which differences are most likely to be worth noticing.

It is not much of a statistical problem, insofar as statistics deals
with repeated measurements, because each distinct measurement is only
taken twice.

What you have is a modelling problem.  You need a model or
"hypothesis" relating all these varied measurements to help formulate
a notion of whether a difference in measurements is significant or
not.  Distinguishing variations that are relatively large or small
versus ones that are absolutely large or small is probably a good
first step, but it is far from the whole story.

Let's turn the question around and ask this.  Suppose someone with a
Crystal Ball could tell you unequivocally that this pair of measurements
exhibits the most significant difference.  What would you do with that
information?  How would you proceed?  What "investigation" concerning
the difference between those two numbers would be possible?  Or
worthwhile?

Those are the sorts of issues that a "model" addresses.  A model of
human physiology, for example, tells us that a 10 percent variation in
blood temperature is more significant than a 10 percent variation in
blood sugar, and that a one unit change in blood pH is more
significant than a one unit change in blood volume.  So we would need
to know more about the "context" of your measurements to decide
whether a variation has significance or not, or more to the point,
whether a pattern or "constellation" of changes in measurement
indicates an underlying event of importance (e.g. a "viral" attack
either in the human patient or on a network).

regards, mathtalk-ga
Subject: Re: statistics, measure of significance
From: starrebekah-ga on 30 Oct 2002 22:14 PST
 
Use a statistical computer program (such as SPSS) to convert the raw
scores into z scores.  You can then compare means, standard
deviations, and do other statistical analysis.
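
The z-score conversion itself does not require SPSS. Here is a minimal Python sketch using the standard library; the comment above does not say which numbers to standardize, so applying it to a list of per-stat changes is an assumption, and the sample data is made up.

```python
import statistics

def z_scores(values):
    """Convert raw scores to z scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all values are identical; z scores are undefined")
    return [(v - mean) / sd for v in values]

changes = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical per-stat changes
for change, z in zip(changes, z_scores(changes)):
    print(change, round(z, 2))
```

Entries with a large absolute z score are the ones that stand out from the rest of the list.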

  You can get a trial version of SPSS at www.spss.com

Good Luck!

-Rebekah

PS - Here are instructions on exactly how to do this using SPSS:
    http://www.uoguelph.ca/~psystats/raw_to_z-score_conversions.htm
Subject: Re: statistics, measure of significance
From: starrebekah-ga on 30 Oct 2002 22:16 PST
 
PPS - This program will also let you make graphs, which will help you
see those 'outliers' (or values you think might be significant to look
at for further investigation).  Makes it a lot easier.

-Rebekah
Subject: Re: statistics, measure of significance
From: lusus-ga on 31 Oct 2002 09:20 PST
 
oh, and I am processing this with a program, but it's probably not
going to be worth the effort in my situation if the expression is more
than a single line.  I'd like to be able to set an arbitrary threshold
and say "show me the differences that rank > 80 out of a possible
score of 100."
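
One way to get a "rank out of 100" from any raw score is a percentile rank over the whole list. This Python sketch is only an illustration; the stat names and raw scores are hypothetical.

```python
from bisect import bisect_right

def percentile_ranks(scores):
    """Map each raw score to the percentage of scores that are <= it."""
    ordered = sorted(scores.values())
    n = len(ordered)
    return {name: 100.0 * bisect_right(ordered, s) / n
            for name, s in scores.items()}

# Hypothetical raw scores, one per stat.
scores = {"users": 1.06, "open_files": 25.5, "pages_swapped": 1.67,
          "errors": 6.0, "uptime": 1.0}
ranks = percentile_ranks(scores)
interesting = [name for name, r in ranks.items() if r > 80]
print(interesting)
```

Because the rank depends only on ordering, the threshold keeps its meaning whichever raw score is plugged in.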
Subject: Re: statistics, measure of significance
From: rsquared-ga on 31 Oct 2002 18:36 PST
 
I'm not sure I fully understand the question and I may be repeating
what has already been said, but...
Are you looking for what is known as a standard deviation?  This is
basically a measure of how much a set of data vary around the mean. 
It's fairly simple to compute; I'm sure there are computer programs
that do it.  You might even be able to use Excel - I don't know.

If you want more info on standard deviation, do a Google search.  You
are bound to find more than you'd ever care to know!  Good luck.
Subject: Re: statistics, measure of significance
From: douglas256-ga on 01 Nov 2002 01:01 PST
 
As stated, since you are only wanting to compare two numbers, this is
not a question of statistics but of a discrete derivative.

Your first attempt, a_{i+1} - a_i, was not sufficient.  Your second
attempt, 200*|a_{i+1} - a_i|/(|a_{i+1}| + |a_i| + 1) worked fairly
well, but gave too much weight to small values of a_i and not enough
weight to large values of a_i.

If you want the difference to be between [0,100] and be dependent on
the relative size of a_i, a global maximum is needed.  Then, you could
use: 20000 * (2*max - |a_{i+1}| - |a_i|)/max * (|a_{i+1} -
a_i|/(|a_{i+1}| + |a_i| + 1)).

You could of course vary the size weighting (2*max - |a_{i+1}| -
|a_i|)/max by raising it to either a fractional power (e.g. 1/2) or a
positive integer power (e.g. 2).  A higher power applies more weight to
size and a lower power applies less.  It should be noted that the
constant 20000 would have to be varied if the power is changed.
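
A hedged Python sketch of the two expressions above. With power 1 the size factor lies in [0, 2] and the relative term in [0, 1), so their product stays below 2; for that reason the sketch takes the scale as a parameter and, as an assumption of this example, defaults it to 100 / 2**power to keep the result in [0, 100) instead of hard-coding the constant quoted above. The function and parameter names are made up.

```python
def symmetric_score(a, b):
    """The 'second attempt': 200*|b - a| / (|b| + |a| + 1)."""
    return 200.0 * abs(b - a) / (abs(b) + abs(a) + 1)

def size_weighted_score(a, b, global_max, power=1.0, scale=None):
    """Relative difference times the size factor (2*max - |a| - |b|)/max.

    global_max must be positive and at least as large as |a| and |b|.
    """
    if scale is None:
        scale = 100.0 / 2 ** power  # assumption: keeps the score below 100
    size_factor = ((2 * global_max - abs(a) - abs(b)) / global_max) ** power
    relative = abs(b - a) / (abs(b) + abs(a) + 1)
    return scale * size_factor * relative

print(round(symmetric_score(2, 4), 2))
print(round(size_weighted_score(2, 4, 180000), 2))
print(round(size_weighted_score(170000, 180000, 180000), 4))
```

As written, the size factor is close to 2 for values far below max and falls to 0 when both values equal max.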
