Google Answers: regression analysis/statistics

View Question

Q: regression analysis/statistics ( Answered, 4 Comments )

Question

Subject: regression analysis/statistics
Category: Business and Money > Finance
Asked by: whosher-ga
List Price: $30.00

Posted: 30 Jul 2002 07:08 PDT
Expires: 29 Aug 2002 07:08 PDT
Question ID: 46840

The British Banker's Association wanted to look a the relationship between the amount of deposits made (in Billions of [f]) and the number of customers that a bank had. Analysts collected dataon six different large banks and found the following information: Bank Name Deposits (F billion) Customers (million) Abbey National 101.7 13.6 Barclays 108.2 10.0 Lloyds 96.9 15.0 National Westminster 113.8 7.5 Woolrich 27.5 4.0 Halifax 77.1 7.6 a. Which variable is the independent variable? Which is the Dependent variable? b. what is a scatter plot of the data? Does it appear that the amount of deposits is related to the number of customers? c. Find the equation of the regression line for the data d. Plot the regression line on the same plot as the data. e. Calculate the standard error of the estimate, [sy/x,]for the regression line f. At the 0.05 level, is the model significant.
Request for Question Clarification by netcrazy-ga on 30 Jul 2002 07:13 PDT Hello, Is this a homework question?? Thanks netcrazy
Clarification of Question by whosher-ga on 30 Jul 2002 07:32 PDT no this is not home for the 4th time i am not in school is learning to do statistic by a book that was brought out of a bookstore and is not sure how they came up with the answer so i figure you can show me the steps you are taking to arrive at the answer you are getting ...I want to compare the tools being use
Request for Question Clarification by calebu2-ga on 30 Jul 2002 12:13 PDT See my comments/clarification on question 46845. If you cut and paste the data you have into excel, follow the steps to do regression analysis, you can get back the results. Admittedly there are a lot of different statistics produced, but if they are unclear, you would probably be better suited asking a generic question about interpreting the output of an excel regression analysis rather than posting it in the current form. I'd be happy to give you a non-specific guide to using excel to do regression analysis - without rewording your questions (so that they are not straight from a text) I can't really do much more to help you. A good question for you to ask the researchers would be "Could you come up with a question similar to the one listed, that isn't taken from a textbook (so that we don't infringe copyright) and a detailed solution so that I can get a feel for how questions of this type are done". That would require extra work on the part of the researcher, but it would be far more readily answered.
Clarification of Question by whosher-ga on 31 Jul 2002 07:38 PDT Hello In relation to comments how will I know which formula to use? I guess if I can separate the formula on when and how to use the problem will make more since, but the book does not go into details in reference to regression analysis. So what ever info you can provide will be greatly appreciated thanks.
Request for Question Clarification by calebu2-ga on 31 Jul 2002 18:02 PDT I;ve created the following question, which I have no problems answering if you think it will help you understand the banker's question : Consider the following hypothetical standings of a completely imaginary baseball league - it is believed that there might be a relationship with runs scored and games won : Team Wins Runs Georgia Bravehearts 68 464 Steinbrenner's Moneymachine 67 628 Tacoma Seamen 65 533 Phoenix Snakes 64 536 Twin City… Twins 64 559 Anaheim Angles 63 548 Buckner's Recurring Nightmare 62 571 Big Arch Birdies 59 489 Oakland Sportspeople 61 510 Bay Area Big guys 59 509 Holywood Avoiders 59 449 Cincinatti Rouges 55 471 Enron Astros 54 501 Piazza's Pals 54 459 Canadian Contractions 53 466 Florida Fishies 51 463 Windy city plain socks 51 565 Phillie Philadelphias 49 473 Maryland Birdies 48 452 Pittsburgh Napsters 48 408 Boulder Boulders 48 496 Ohio Natives 47 460 Canadian Birdies 45 508 SoCal Padres 45 420 Windy City Young Bears 44 448 W's Winners 44 518 Royal Blues 44 471 Missing Pistons 40 408 Wisconsin Beermen 38 442 Tampa Bay nohopers 35 430 Answer the following questions: a. Which story makes most sense - "More runs give you more wins" or "More wins give you more runs"? Hence, what are the dependent and independent variables? b. Use the graph function on excel to plot these data points. Just by looking at them do you see a relationship? c. Add a trendline to the graph d. By using the Tools-Data Analysis run a regression on the data (make sure to get the dependent and independent variables the right way round). Explain the numbers you get as a result. e. What is the p-value for the x-coefficient. Is there a statistically significant relationship between runs and wins? question copyright calebu2-ga and google answers (did not come from a textbook) As I said, if you are ok with it, I am more than willing to work through that question as a detailed example, explaining each step (with clarification if needed) to provide an understanding of the steps that would be needed to answer the Bankers question. This way I am able to help you with your study of business statistics without either infringing the copyright of the question authors or adding to the ethical dilemmas which have prevented your question being answered to this point.
Request for Question Clarification by tehuti-ga on 03 Aug 2002 18:42 PDT Re rbnn-ga's comment, I would say that the deposits are the dependent variable, because if there are zero customers, there will definitely be zero deposits. However, this then raises the question to what extent does the amount of deposits depend on the number of customers. This is what the regression analysis will tell you. A scatter plot involves plotting the number of customers on the horizontal axis, and the respective deposit amount on the vertical axis. If the points follow more or less a straight line, then you can say that the amount of deposits is directly related to the number of customers in a linear manner. You can already see just by looking at the figures that this will not be the case, eg Halifax and Nat West have almost the same number of customers, but Nat West has 50% more deposits; Nat West has the highest level of deposits, but Lloyds has the highest number of customers, etc, etc. The regression coefficient is a way of attempting to put a straight line over the more scattered data points that you have in your plot. The line should reflect an underlying relationship. I used an online regression equation calculator at http://people.hofstra.edu/faculty/Stefan_Waner/RealWorld/newgraph/regressionframes.html First, I entered customer numbers in the x column and deposits in the y column. I selected the equation for linear regression, y = mx + b, because I want to know if there is a simple linear relationship between number of customers and amount of deposits. This gave a regression coefficient of r = 0.632499. If there was a direct positive relationship between the two variables, the coefficient would be 1, while if there was a direct negative relationship it would be -1. Zero correlation would give a coefficient of 0. You can press "graph" to obtain an illustration of the calculated regression line versus the data. It is clear that the regression line only has a couple of data points within a reasonable distance from it. You can also see that figures now appear in the third column of data. These are the values of the deposits as predicted by the regression equation. You can see straight off that in only two cases do they bear any relationship to the real figures. This web site does not have the facility to calculate the standard error and the significance. However, given the poor fit between the regression line and the data, I would hazard a guess that the model is not statistically significant. The significance is tested either using a t test or ANOVA. Basically, what it does is to see how useful the regression coefficient would be to predict the amount of deposits for any specific number of customers. In real life, the hypothesis you are testing is assuming that if a bank has more customers, it will have more money placed on deposit than a bank with fewer customers. Of course, that need not be the case, because such an assumption leaves out demographics. If bank A attracts many millionaires and other rich people, while bank B attracts students and other poorer people, it is reasonable to assume that even with fewer clients, bank A will have more money on deposit than bank B. Something of this sort is operating here. Lloyds customers are putting away less than half that put away by Nat West customers. This is a fairly simplistic explanation, and it does not deal with all the points in your question, so I am posting it as a clarification rather than as an answer. However, sometimes it helps to think in a simplistic way about a situation before applying full statistical analysis.

Answer

Subject: Re: regression analysis/statistics
Answered By: calebu2-ga on 04 Aug 2002 20:22 PDT

Whosher,

OK. Let's see if I can give you your $30's worth on this question. My
goal in answering the question is going to be twofold - to give you
the techniques needed to answer the question (using Microsoft Excel as
the weapon of choice) and to understand the statistical theory behind
it. When I'm not being a google researcher, I do quite a bit of
college level business statistics tutoring, so consider this answer as
an online tutorial. Feel free to come back for clarification if
anything doesn't make sense - a good tutorial is a dynamic one!

I should also say that I will try not to be too technical in my
answers - I know that this might ruffle the feathers of some of the
stats experts who are reading the answer, especially if I fudge the
explanation - my aim is to get the general concepts across in one
shot.

Step 1 - Data Entry
-------------------
I won't spend too long on this, but I want to kind of set the question
up so that we are on the same page when talking about cell references.
I took your text and pasted it into excel.

I then used the Data, Text to Columns menu option to covert the data
into columns. The default (fixed width) settings work fine. You might
need to cut and paste the text to get it in the same cells as mine.
The following link shows how the data looks :

http://www2.bc.edu/~lawrenst/whosher/stats1.gif

In column A, we have the bank names (I shortened some). Column B has
the Deposits (numeric values in B2:B7). Column C has the Number of
Customers (Numeric values in C2:C7).

Question A - Independent and Dependent Variables
------------------------------------------------
a. Which variable is the independent variable? Which is the Dependent
variable?

Before I get to the answer, a brief explanation of the two different
types of variable in a regression and how to identify them.

Any regression has to have a dependent variable and at least one
independent variable. The simplest way of thinking about the variables
is to say that the dependent variable *depends* on the independent
variables. Said another way, the dependent variable can be partially
explained or described by the independent variables.

When answering this question, you want to think to yourself about
cause and effect - does a larger number of deposits lead to a larger
number of customers? or does a larger number of customers lead to a
larger number of deposits? Without getting too philosophical, I'm
going to argue that it is the number of customers that causes or
describes the deposits. You need customers to have deposits.

*** Hence the answer to this question is that Customers is the
INDEPENDENT variable and deposits is the DEPENDENT variable. ***

Microsoft Excel and the theory of statistics do not tell you which
variable is the dependent variable - you get results (statistical
output) no matter which way round you treat variables - it is your
common sense that guides you as to which way round to write a
regression. If you get it round the wrong way, the interpretation of
your regression is potentially meaningless or incorrect.

Question B - Scatter Plots and Interpretation
---------------------------------------------
b. what is a scatter plot of the data? Does it appear that the amount
of deposits is related to the number of customers?

There are two parts to this question - creating a scatter plot and
interpreting it. Without interpretation (as is always the case in
statistics), all you have is a nice pattern of dots.

To plot a scatter plot in Excel, follow these steps :
1) Select Insert, Chart from the Menu or click on the Chart Wizard
button
2) Select XY Scatter in the dialog box
(http://www2.bc.edu/~lawrenst/whosher/stats2.gif)
3) Click on "Next" and select the Series Tab. Click "Add".
4) Enter your independent variable (Customers) as the X variable. ie.
enter C2:C7 in the X values box.
5) Enter your dependent variable (Deposits) as the Y variable. ie.
enter B2:B7 in the Y values box.
(http://www2.bc.edu/~lawrenst/whosher/stats3.gif)
6) Click "Next". You can add a title, axis titles and change the
formatting if you want to, but it isn't really important for learning
statistics, so I'll leave it as it is.
7) Click finish

You should have a sheet that looks as follows:
http://www2.bc.edu/~lawrenst/whosher/stats4.gif

This is our scatter plot. Now we have to interpret it. Our six points
are arranged in a weird T shape, but there is generally an upward
slope to the data.

There are numerous possible scatterplots you could get. I'm going to
classify a few of them and explain what they mean before putting an
interpretation to our bank scatterplot.

Perfect Positive Correlation
http://www2.bc.edu/~lawrenst/whosher/stats5.gif
In this case, the data points fall in a straight upwards sloping line.
An increase in the x variable is reflected by a proportional increase
in the y variable. There is no deviation from this line. Perfectly
correlated variables mean that one variable completely describes the
other. If you know the value of x, you can estimate exactly the value
of y.

Perfect Negative Correlation
http://www2.bc.edu/~lawrenst/whosher/stats6.gif
In this case, the data points fall in a straight downwards sloping
line. An increase in x is reflected by a proportional decrease in the
y variable.

Positive Correlation
http://www2.bc.edu/~lawrenst/whosher/stats7.gif
In this case, the data points fall about an upwards sloping line, but
do not fall exactly on it. An increase in x is generally reflected by
an increase in y, but not exactly. If you know the value of x, you can
make a rough guess of what y should be but there is a margin of error.

Negative Correlation
http://www2.bc.edu/~lawrenst/whosher/stats8.gif
In this case, the data points fall about an downwards sloping line,
but do not fall exactly on it. An increase in x is related to a
decrease in y, but the relationship is not exact.

No Noticeable Correlation
http://www2.bc.edu/~lawrenst/whosher/stats9.gif
In this case, the data points fall randomly and show no particular
trend. An increase in x is not related to any particular change in y.

Given these 5 possibilities, it is easy to see that our question falls
into the situation of positive (not perfect) correlation. Much of the
variation in the y variable (deposits) can be explained by the
variation in the x variable (number of customers). The more customers,
the more deposits (makes sense, right?)

Question C - Regression Equations
---------------------------------
c. Find the equation of the regression line for the data

Two parts to this question - doing it in excel, and understanding what
you are doing.

When we run a regression on data, we are finding the line that fits
best through the data. Technically, by "best" we mean the line which
minimizes the sum of the squared vertical distances between each point
and the line.

This line tries to use as much of the variation in x to explain the
variation in y. If the variables are perfectly correlated, the
regression equation describes the line that joins the points.

To calculate the regression equation in excel we can do it in two
ways. One way is to plot a trendline and get the regresion equation,
but I will leave that for the next question.

Here's how we run the regression analysis:
1) Select Tools, Data Analysis from the menu. If it is not available,
you need to install the analysis toolpak (see Tools, Add-ins on the
menu).
2) Select Regression.
(http://www2.bc.edu/~lawrenst/whosher/stats10.gif)
3) Enter the y values (B2:B7) and the x values (C2:C7)
(http://www2.bc.edu/~lawrenst/whosher/stats11.gif)
4) Click "OK"

You should get the following output:
http://www2.bc.edu/~lawrenst/whosher/stats12.gif

A brief explanation of the important parts of the output (If you want
to know the definitions/explanations of the other statistics, i'd be
happy to answer that in a separate question) :

* First Panel - Overall regression statistics
R-Squared : How much of the variance in Y is described by X? In this
example 40% of the variation in the deposits can be explained by the
number of customers. The other 60% we presume are bank specific issues
(perhaps people who bank with Natwest are richer in general)

Observations : How many observations are there in your data set. In
this case we have data for 6 banks

* Second Panel - Anova Section :
This part tells us how much significance can be drawn from the
regression as a whole - ie. how much sense does it make to talk about
the best fit line in general?
- Significance F : This number tells us the overall significance of
the regression. It can be between 0 and 1 - the closer to zero, the
more significant it is. In this case, it is 0.17 which is not that
significant. But then again we only have 6 observations and it is
tough to say much about that few observations.

* Third Panel - Coefficient statistics
Intercept : These statistics tell us information about the point at
which our best fit line crosses the Y axis (x = 0).

X variable 1 : These statistics tell us information about the slope of
the best fit line. The steeper the slope, the more Y (deposits)
changes when X (customers) changes.

Coefficients : This tells us the actual value of the intercept (value
of Y when X = 0 on the best fit line) and the slope (how much Y
increases by when a X is increased by 1 on the best fit line). The
larger the slope coefficient, the steeper the slope.

Standard Error : How much uncertainty there is about the coefficient.
Unless the points are in a perfect straight line, any best fit line is
only an approximation. The more variation there is from the straight
line, the larger the standard error of the slope coefficient. (More on
this in part E)

P-value : How significantly different is the coefficient from zero?
The smaller the number, the more significant it is. (More on this in
part F) We normally say that a p-value less than 0.05 is significant.
If it is less than 0.01, it is highly significant.

Question D - Plotting the Best Fit Line (Trendline)
---------------------------------------------------
d. Plot the regression line on the same plot as the data.

To do this:
1) click on the chart. Click on one of the data points so that all of
the points are selected
(http://www2.bc.edu/~lawrenst/whosher/stats13.gif)
2) right click and select "Add Trendline" (or go to Chart, Add
Trendline)
3) Choose a linear line and go to the options tab
4) Check the "Display Equation on Chart box"
(http://www2.bc.edu/~lawrenst/whosher/stats14.gif)
5) Click OK.

You should see the following :
http://www2.bc.edu/~lawrenst/whosher/stats15.gif

This trendline reflects everything we have described so far. It is
represented by the equation shown (notice how the coefficients appear
in the equation).

Questions E and F - Using and interpreting the regression output
----------------------------------------------------------------
e. Calculate the standard error of the estimate, [sy/x,]for the
regression line
f. At the 0.05 level, is the model significant.

The standard error of the estimate, sy/x is our uncertainty about the
slope of the regression line. *** From part C, we see that it is
3.003688135 *** (approximately 3). This means that we are about 68%
sure that the slope of the line is 4.90 +/- 3.00. The larger this
number, the less certain we are about whether the slope is what we
estimated it to be.

*** The p-value of this coefficient is 0.177768364. This is not
significant at the 0.05 level *** (the number is bigger than 0.05).
This means that we are not sure whether the slope is significant - we
see that it is positive in our diagram, but this could be an accident.
The p-value tells us how likely it is that this happened 'by
accident'.

That's all that the question asks for, so I'll leave my answer at that
for now - if you want me to go into more detail on any section, feel
free to ask for additional clarification. I will try to answer those
questions where appropriate. If you want to go into more detail or
look at different questions, you can reopen some of your other
questions or ask new ones.

Glad I could help you and hopefully I've given you a clear and
understandable guide to statistics and the use of Microsoft Excel.

Enjoy

calebu2-ga

Comments

Subject: Re: regression analysis/statistics
From: rbnn-ga on 30 Jul 2002 12:16 PDT

I'm not sure how to tell which is the independent and which the
dependent variable actually. It seems to me technically one could want
to predict number of customers from deposits, (in which case
"#customers" would be the dependent variable") or one could want to
predict deposits from #customers, in which case the dependent variable
would be deposits. (Recall, the "dependent variable" is the quantity
we are trying to predict, and the "independent variable" is the
quantity that we know already.

Mind if I make another comment? 

Generally in statistics there are two ways to do things, from "first
principles" or using an existing software toolkit. In the olden days,
to "know regression" really meant being able, with pencil and paper 
to solve small regressions, or at least know the algorithms.

Nowadays, most of the time people just use some existing software
package toolkit to solve regressions. For instance, Excel, or Matlab,
or S, or SPSS and so on (there are lots of them out there).

Anyway, if you want, you might want to specify a bit more just what
kind of explanation you are looking for. Do you want to understand the
actual math? Or how to use a toolkit (which one?) Or exactly what the
meaning of the output of such a toolkit is?

Subject: Re: regression analysis/statistics
From: rbnn-ga on 30 Jul 2002 13:59 PDT

Oh, a more fundamental problem here: I'm not sure how to send a plot! Oh well.

Subject: Re: regression analysis/statistics
From: jeremymiles-ga on 03 Aug 2002 17:56 PDT

Hi Whosher;
This isn't a hard question, if you know what you are doing.  However,
I suspect that you want to know more than the answer - you want some
understanding of where the answer came from.  I don't want to seem
over-blunt, but your anser suggests to me that you are trying to learn
too much about statistics too fast.  The answer might tell you the SE
of the regression line, and the probability value associated with it,
but interpreting these is another matter.   I would humbly suggest
that you went for a simpler book or web page - I would be happy to
make some recommendations, although my area of interest is not in the
area of business statistics, and others could perhaps provide better
recommendations than me.
 
If you have Excel, and really just want the answers, then have a look
at:

http://www.jeremymiles.co.uk/regressionbook/extras/appendix2/excel/index.html

Subject: Re: regression analysis/statistics
From: jeremymiles-ga on 03 Aug 2002 18:04 PDT

One more thing.   I have just looked at the data more closely - there
is quite an interesting relationship going on - this might be
developed later in the book. Without giving anything away, you might
have to have an understanding of the banking system in the UK to
really understand these data.

I realise that this diversion isn't necessarily what you are after,
but it is an interesting question, that can be answered on a number of
levels.  I think that the researchers are interested in answering your
question in a way that satisfies the ethical problems, but also
provides you with a satisfactory answer.

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy